Spark: count and saveAsObjectFile without computing it twice
Using Spark, I filter and transform a collection. I want to count the size of the result collection and also save the result collection to a file. So, if the result collection does not fit in memory, does that mean the output is computed twice? Is there a way to count and saveAsObjectFile at the same time, so the output is not computed twice?
val input: RDD[Page] = ...
val output: RDD[Result] = input.filter(...).map(...) // expensive computation
output.cache()
val count = output.count
output.saveAsObjectFile("file.out")
Solution #1: cache to memory and disk
You can cache to both memory and disk. You'll avoid computing the output twice, although you may have to read some of the data back from disk instead of RAM.
Use persist() with the MEMORY_AND_DISK storage level to keep the computed data in memory, spilling to disk whatever doesn't fit.
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
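A minimal sketch of this option, reusing the output RDD from the question (the StorageLevel import and the final unpersist() call are additions for illustration, not part of the original answer):

import org.apache.spark.storage.StorageLevel

// Keep partitions in memory and spill the ones that don't fit to disk,
// so neither action recomputes the expensive filter/map.
output.persist(StorageLevel.MEMORY_AND_DISK)
val count = output.count()          // first action materializes the RDD
output.saveAsObjectFile("file.out") // second action reads the persisted data
output.unpersist()                  // optionally release the storage afterwards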
Solution #2: perform the count using an accumulator
A similar question was asked and answered here: http://thread.gmane.org/gmane.comp.lang.scala.spark.user/7920
The suggestion there is to use an accumulator and update it in a map applied just before saveAsObjectFile():
val counts_accum = sc.longAccumulator("count accumulator")
output.map { x =>
  counts_accum.add(1)
  x
}.saveAsObjectFile("file.out")
After saveAsObjectFile has completed, the accumulator holds the total count, and you can print it (you'll have to use ".value" to read the accumulator's value):
println(counts_accum.value)
If accumulators are created with a name, they are displayed in Spark's UI, which can be useful for understanding the progress of running stages.
More info can be found here: http://spark.apache.org/docs/latest/programming-guide.html#accumulators