Flattening arrays using flatMap

  • flatMap also takes a function as its argument. The function must return a collection for each input element, so every element maps to zero or more output items, and all of those items are flattened into a single result.
  • Example of using flatMap to convert a List of lines into separate word records-
    • First create a List collection on spark-shell-
      • val lines = List("Hello World", "how are you", "are you interested ", "in writing", "simple wordcount program", "using spark", "to demonstrate", "flatMap")
      • lines.size returns 8, because the list has 8 elements.
      • lines.foreach(println) can be used to preview the data.
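    • Convert this collection to an RDD-
      • val linesRDD = sc.parallelize(lines)
    • Split each line into words with flatMap, so that every word becomes its own record-
      • val words = linesRDD.flatMap(e => e.split(" "))
    • Print the words to check the result-
      • words.collect.foreach(println)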
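
The steps above can also be tried without a running spark-shell. The following is only a sketch under a stated assumption: a plain Scala List stands in for the RDD (so there is no sc), and the object name FlatMapWords is hypothetical. For this data, List.flatMap flattens the per-line arrays in the same way.

```scala
// Sketch only: a local Scala List stands in for the RDD, so no SparkContext (sc) is needed.
// The object name FlatMapWords is hypothetical and not part of the original walkthrough.
object FlatMapWords {
  val lines: List[String] = List(
    "Hello World", "how are you", "are you interested ", "in writing",
    "simple wordcount program", "using spark", "to demonstrate", "flatMap")

  // split(" ") returns an Array[String] for each line; flatMap concatenates those arrays.
  // String.split drops trailing empty strings, so the trailing space in
  // "are you interested " does not produce an extra empty word.
  val words: List[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    println(s"${lines.size} lines, ${words.size} words") // prints "8 lines, 18 words"
    words.foreach(println)
  }
}
```

Each of the 18 words is printed on its own line, which matches what words.collect.foreach(println) is expected to show for the RDD version.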

 

  • Difference between map and flatMap-
    • val linesMap = linesRDD.map(e => e.split(" "))
    • With map, an RDD is still created, but each element is an Array of words (one array per line). Our requirement is that each word becomes a separate record, so we need flatMap instead. flatMap is used in cases like this, where the function returns an Array but the result has to be flattened.
    • So with flatMap, even when there are multiple lines and multiple words per line, we get a single flat collection of words as output.
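
The difference can be seen side by side in a small sketch. As an assumption for illustration, plain Scala collections again stand in for RDDs, and the names MapVsFlatMap, mapped, flat and repeated are hypothetical. The last value also illustrates the "zero or more items" rule: a function that returns an empty collection contributes nothing to the result.

```scala
// Sketch only: plain Scala collections stand in for RDDs; all names here are hypothetical.
object MapVsFlatMap {
  val lines: List[String] = List("Hello World", "how are you")

  // map keeps one output element per input line: each element is an Array of words.
  val mapped: List[Array[String]] = lines.map(_.split(" "))

  // flatMap applies the same function, then concatenates the arrays into one list of words.
  val flat: List[String] = lines.flatMap(_.split(" "))

  // The function may return an empty collection, which contributes zero items:
  // 1 yields nothing, 2 yields one item, 3 yields two items.
  val repeated: List[Int] = List(1, 2, 3).flatMap(n => List.fill(n - 1)(n))

  def main(args: Array[String]): Unit = {
    println(mapped.size) // 2 (one array per line)
    println(flat)        // List(Hello, World, how, are, you)
    println(repeated)    // List(2, 3, 3)
  }
}
```

With map the result has as many elements as there are lines, while with flatMap it has as many elements as there are words.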