Transforming data using mapPartitions

Private: AWS EMR and Spark 2 using Scala RDD Operations – Transformations and Actions Transforming data using mapPartitions

Read NYSE data in spark-shell
- val nyse = sc.textFile(“/public/nyse”)
To extract stock tickers,date and volume, use map-
- val nyseMap = nyse.map(e=>{ val a = e.split(“,”) ((a(1).toInt,a(0)),a(6).toLong)})
- This creates an RDD with tuples.Using count on it will give same number of records.
Let us see how to implement by MapPartitions.MapPartitions take iterator of strings.

val nyseMapPartitions = nyse.mapPartitions( e=>(

e.map(nyseRec => { val a = nyseRec.split(",")

((a(1).toInt,a(0)),a(6).toLong)}) })

You can create a variable to see the functions available with the iterator
- val l =(1 to 10000).toIterator
Difference between map and mapPartitions is that mapPartitions are specialised maps that are called once per partition.For example-in case of initialisation of database, we need to initialise it as many number of times as number of elements in RDD in case of map ,but in case of mapPartitions it is equal to number of partitions