Simple Application

Let us start with a simple application to understand details related to architecture using pyspark.

  • As we have multiple versions of Spark on our lab and we are exploring Spark 2 we need to export SPARK_MAIOR_VERSION with 2.
  • For this demo, we will disable dynamic allocation by setting spark. dynamicAllocation.enabled to false.
  • Launch pyspark using YARN and disabling dynamic allocation ( also, use spark.ui.port as well to specify a unique port).
  • Develop a simple word count program by reading data from /public/randomtextwriter/part-m-00000
  • Save output to /user/training.
Data = sc.textFile(' /public/randomtextwriter/part-m-00000')
wc = data.\
flatMap(lambda line: line.split( ' ')). \
    map( lambda word: (word, 1)). \
    reduceByKey(lambda x, y: x + y)
wc. \
    map(lambda rec: rec[0] + ',' + str(rec[1])). \
    saveAsTextFile('/user/training/core/wordcount')
from. pyspark.sql.functions import split. explode
Data = spark.read.text('/public/randomtextwriter/part-m-00000')
Wc = data.select(explode(split(data.value, ' ')).alias('words')). \
groupBy('words'). \
agg(count('words').alisa('wc'))
wc.write.csv('/user/training/df/wordcount')

Share this post