Spark: count and saveAsObjectFile without computing it twice
Using Spark, I filter and transform a collection. I want to count the size of the result collection and also save the result collection to a file. So, if the result collection does not fit in memory, does that mean the output is computed twice? Is there a way to count and saveAsObjectFile at the same time, so the output is not computed twice?
val input: RDD[Page] = ...
val output: RDD[Result] = input.filter(...).map(...) // expensive computation
output.cache()
val count = output.count
output.saveAsObjectFile("file.out")
Solution #1: cache to memory and disk
You can cache to both memory and disk. You'll avoid computing the output twice, although you may have to read some of the data back from disk instead of RAM.
Use persist() with the MEMORY_AND_DISK storage level to keep the computed data in memory, spilling to disk whatever doesn't fit.
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
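A minimal sketch of this option, reusing the output RDD from the question (the StorageLevel import and the final unpersist() call are additions for illustration, not part of the original answer):

import org.apache.spark.storage.StorageLevel

// Keep partitions in memory and spill the ones that don't fit to disk,
// so neither action recomputes the expensive filter/map.
output.persist(StorageLevel.MEMORY_AND_DISK)
val count = output.count()          // first action materializes the RDD
output.saveAsObjectFile("file.out") // second action reads the persisted data
output.unpersist()                  // optionally release the storage afterwards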
Solution #2: perform the count using an accumulator
A similar question was asked and answered here: http://thread.gmane.org/gmane.comp.lang.scala.spark.user/7920
The suggestion there is to use an accumulator and update it in a map applied just before saveAsObjectFile():
val counts_accum = sc.longAccumulator("count accumulator")
output.map { x =>
  counts_accum.add(1)
  x
}.saveAsObjectFile("file.out")
After saveAsObjectFile has completed, the accumulator holds the total count, and you can print it (you'll have to use ".value" to read the accumulator's value):
println(counts_accum.value)
If accumulators are created with a name, they are displayed in Spark's UI, which can be useful for understanding the progress of running stages.
More info can be found here: http://spark.apache.org/docs/latest/programming-guide.html#accumulators