Using flatMap to process data read by wholeTextFiles

  • First, go to the NYSE data directory in the lab and check the total file size using du -sh .
  • Go to spark-shell and read this data:
    • val nyse = sc.wholeTextFiles("/public/nyse", 4)
    • This creates an RDD of tuples, where the first field is the name of the file and the second field is the content of the file.
    • nyse.first._2 gives the content of the first file; nyse.first._2.length gives its size in characters.
  • The actual data should typically be represented as one record per String in the RDD, so we need to use flatMap in this case:
      • val nyseRDD = nyse.flatMap(e => e._2.split("\\r?\\n"))
      • flatMap flattens the array of lines, so the resulting RDD contains one element per line.
      • We can also apply transformation logic inside flatMap using the Scala collection APIs:
        • val nyseRDD = nyse.flatMap(e => {
            val a = e._2.split("\\r?\\n")
            val t = a.map(r => {
              val rec = r.split(",")
              ((rec(1).toInt, rec(0)), rec(6).toLong)
            })
            t
          })
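Since flatMap on plain Scala collections has the same semantics as on RDDs, the parsing logic above can be tried outside Spark with a Seq of (fileName, content) pairs. This is only a sketch: the file name, the sample records, and the field layout (symbol at index 0, date at index 1, volume at index 6) are made-up assumptions for illustration.

```scala
// Plain-Scala sketch of the flatMap parsing above; Scala collections
// share flatMap/map semantics with RDDs, so the same logic applies.
// The file name and records below are made-up sample data.
val files = Seq(
  ("file1.txt",
   "AAPL,20090721,151.51,153.2,151.25,151.92,10000600\n" +
   "MSFT,20090721,24.7,25.21,24.59,25.0,9000100")
)

// Split each file's content into lines, then parse each line
// into ((date, symbol), volume)
val parsed = files.flatMap { e =>
  val a = e._2.split("\\r?\\n")
  a.map { r =>
    val rec = r.split(",")
    ((rec(1).toInt, rec(0)), rec(6).toLong)
  }
}

parsed.foreach(println)
// ((20090721,AAPL),10000600)
// ((20090721,MSFT),9000100)
```

Because flatMap merges the per-file arrays into a single flat sequence, the two files' lines end up as sibling records rather than nested arrays, which is exactly what the RDD version achieves at scale.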