Let us see the map transformation on an RDD.
Typically, map is used for row-level transformations. map takes a function as a parameter and applies it to each record, returning a new RDD with one output record per input record.
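As a quick illustration, here is a minimal sketch on a throwaway RDD built with sc.parallelize (the words are hypothetical sample input, not part of the dataset used below):

```scala
// Build a small in-memory RDD and apply a row-level transformation with map.
val words = sc.parallelize(Seq("orders", "order_items", "customers"))
val lengths = words.map(w => (w, w.length)) // one output record per input record
lengths.collect.foreach(println)            // (orders,6), (order_items,11), (customers,9)
```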
- Start spark-shell
- Read the text file to create an RDD:
- val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
- Preview the data: orders.take(10).foreach(println)
- orders.count gives the total number of records
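Before extracting fields, it helps to look at one raw record. This assumes the usual retail_db layout, where each order line has four comma-separated fields (order_id, order_date, order_customer_id, order_status):

```scala
// Inspect a single raw record (assumed layout:
// order_id,order_date,order_customer_id,order_status).
orders.first
// Expected shape of a record: 1,2013-07-25 00:00:00.0,11599,CLOSED
```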
- Extract the dates:
- val orderDates = orders.map(e => e.split(",")(1))
- Preview the dates: orderDates.take(10).foreach(println)
- After extracting the date, discard the timestamp, display it in YYYYMMDD format, and also typecast it to an Integer (a string-level walkthrough of this chain follows below):
- val orderDates = orders.map(e => e.split(",")(1).substring(0, 10).replace("-", "").toInt)
- Preview the dates: orderDates.take(10).foreach(println)
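To see what that chain does to a single value, here is the same logic applied to one hypothetical order_date string (the date shown is illustrative, assuming the usual yyyy-MM-dd HH:mm:ss.S format):

```scala
val d = "2013-07-25 00:00:00.0"           // hypothetical order_date value
d.substring(0, 10)                        // "2013-07-25" -> timestamp discarded
d.substring(0, 10).replace("-", "")       // "20130725"   -> YYYYMMDD format
d.substring(0, 10).replace("-", "").toInt // 20130725 as an Int
```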
- APIs which require a shuffle operation expect key-value pairs as input, so we need to create a paired RDD.
- For this we need to create a tuple as part of the map function. The tuple will have a key and a value. For example, to get the count per date, reduceByKey requires key-value pairs; 1 can be used as the value and later added up to get the count:
- val orderDates = orders.map(e => (e.split(",")(1).substring(0, 10).replace("-", "").toInt, 1))
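Completing that example, the counts can then be computed with reduceByKey, which adds up the 1s per key (a short sketch continuing from the paired RDD above):

```scala
// Sum the 1s per date key; reduceByKey shuffles records so that all
// values for the same key are combined together.
val countPerDate = orderDates.reduceByKey(_ + _)
countPerDate.take(10).foreach(println) // (date, count) pairs
```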
- The GetRevenuePerOrder class calculates the revenue per order from orderItems, passing order_id as the key and order_item_subtotal as the value (a fuller sketch of such an application follows below):
- val orderItems = sc.textFile(args(1))
- val revenuePerOrder = orderItems.map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat)).reduceByKey(_ + _).map(oi => oi._1 + " " + oi._2)
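For context, here is a minimal self-contained sketch of what such an application might look like as a standalone Spark program. The object skeleton, master handling, and output step are assumptions for illustration; the source only shows the two lines above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of a GetRevenuePerOrder-style application.
// Assumed convention: args(0) = master URL, args(1) = order_items input path.
object GetRevenuePerOrder {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GetRevenuePerOrder").setMaster(args(0))
    val sc = new SparkContext(conf)

    val orderItems = sc.textFile(args(1))

    // Key each record by order_id (field 1) with order_item_subtotal (field 4)
    // as the value, then sum the subtotals per order.
    val revenuePerOrder = orderItems.
      map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat)).
      reduceByKey(_ + _).
      map(oi => oi._1 + " " + oi._2)

    revenuePerOrder.take(10).foreach(println)

    sc.stop()
  }
}
```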