Transforming data using mapPartitions

  • Read NYSE data in spark-shell
    • val nyse = sc.textFile("/public/nyse")
  • To extract the stock ticker, date, and volume, use map:
    • val nyseMap = nyse.map(e => { val a = e.split(","); ((a(1).toInt, a(0)), a(6).toLong) })
    • This creates an RDD of tuples. Running count on it returns the same number of records as the original RDD.
  • Let us see how to implement the same logic using mapPartitions. mapPartitions takes an iterator of strings:

val nyseMapPartitions = nyse.mapPartitions(e =>
  e.map { nyseRec =>
    val a = nyseRec.split(",")
    ((a(1).toInt, a(0)), a(6).toLong)
  }
)
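To see what the function inside mapPartitions does without a cluster, here is a minimal sketch that applies the same parsing logic to a plain Iterator[String], which is exactly what mapPartitions hands the function for each partition. The sample records are made up for illustration, following the NYSE layout assumed above (ticker, date, open, high, low, close, volume).

```scala
// One simulated partition of raw NYSE-style lines (sample data, not real quotes)
val partition: Iterator[String] = Iterator(
  "AAPL,20130101,10.0,11.0,9.5,10.5,1000",
  "MSFT,20130101,20.0,21.0,19.5,20.5,2000"
)

// Same per-record logic as the mapPartitions body above
val parsed = partition.map { nyseRec =>
  val a = nyseRec.split(",")
  ((a(1).toInt, a(0)), a(6).toLong)
}.toList
// parsed: List(((20130101,AAPL),1000), ((20130101,MSFT),2000))
```

Note that the function returns an iterator, not a single value: mapPartitions requires the result for each partition to itself be an Iterator.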

  • You can create an iterator variable to explore the functions available on iterators
    • val l = (1 to 10000).toIterator
  • The difference between map and mapPartitions is that mapPartitions is a specialised map whose function is invoked once per partition rather than once per element. For example, if processing each record requires a database connection, map would initialise the connection as many times as there are elements in the RDD, whereas mapPartitions initialises it only once per partition.
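The once-per-element versus once-per-partition difference can be demonstrated without Spark or a real database. The sketch below models an RDD with two partitions of three elements each and counts how many times a costly setup step (standing in for opening a database connection) would run under each approach; the counters and data are illustrative assumptions.

```scala
// Model an RDD as two partitions of three elements each
val partitions = List(List(1, 2, 3), List(4, 5, 6))

// map-style: the setup runs once per element
var perElementInits = 0
val mapped = partitions.flatMap(_.map { e =>
  perElementInits += 1 // stands in for opening a DB connection
  e * 2
})

// mapPartitions-style: the setup runs once per partition,
// then an iterator processes all records in that partition
var perPartitionInits = 0
val mappedByPartition = partitions.flatMap { part =>
  perPartitionInits += 1 // connection opened once for the whole partition
  part.iterator.map(_ * 2).toList
}
// Both produce the same output, but perElementInits == 6
// while perPartitionInits == 2
```

Both variants yield identical results; only the number of initialisations differs, which is why expensive per-record setup is the classic use case for mapPartitions.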