Transforming data using map

Let us see the map transformation on an RDD-

Typically, map is used for row-level transformations. It takes a function as its parameter.

  • Start spark-shell
  • Read the text file to create an RDD-
    • val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
    • Preview the data- orders.take(10).foreach(println)
    • orders.count gives the total number of records
  • Extract the dates-
    • val orderDates = orders.map(e => e.split(",")(1))
    • Preview the dates- orderDates.take(10).foreach(println)
  • After extracting the date, discard the timestamp, display the date in YYYYMMDD format, and typecast it to an Integer-
    • val orderDates = orders.map(e => e.split(",")(1).substring(0,10).replace("-","").toInt)
    • Preview the dates- orderDates.take(10).foreach(println)
  • APIs which require a shuffle operation expect key-value pairs as input, so we need to create a paired RDD-
    • For this, we need to create a tuple as part of the map function. The tuple will have a key and a value. For example, to get the count per date, reduceByKey requires key-value pairs; 1 can be used as the value and later added up to get the count (see the sketch after this list)-
      • val orderDates = orders.map(e => (e.split(",")(1).substring(0,10).replace("-","").toInt, 1))
  • The GetRevenuePerOrder class calculates the revenue per order using orderItems, passing order_id as the key and order_item_subtotal as the value (a standalone sketch follows this list)-
    • val orderItems = sc.textFile(args(1))
    • val revenuePerOrder = orderItems.map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat)).reduceByKey(_ + _).map(oi => oi._1 + " " + oi._2)
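
As referenced above, here is a minimal end-to-end sketch of the count-per-date logic in spark-shell, built from the same orders data set used in the earlier steps:

  val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
  // Paired RDD: date in YYYYMMDD format as the key, 1 as the value
  val orderDates = orders.map(e => (e.split(",")(1).substring(0,10).replace("-","").toInt, 1))
  // reduceByKey shuffles by key and adds up the 1s, giving the count per date
  val countPerDate = orderDates.reduceByKey(_ + _)
  countPerDate.take(10).foreach(println)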
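
For completeness, here is one way GetRevenuePerOrder could look as a standalone application. Only the two orderItems lines come from the walkthrough above; the object scaffolding, the argument layout (args(0) for the master, args(2) for the output path), and the saveAsTextFile call are assumptions for illustration:

  import org.apache.spark.{SparkConf, SparkContext}

  object GetRevenuePerOrder {
    def main(args: Array[String]): Unit = {
      // Assumed argument layout: args(0) = master, args(1) = input path, args(2) = output path
      val conf = new SparkConf().setAppName("GetRevenuePerOrder").setMaster(args(0))
      val sc = new SparkContext(conf)

      // order_item_order_id is field 1, order_item_subtotal is field 4
      val orderItems = sc.textFile(args(1))
      val revenuePerOrder = orderItems.
        map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat)).
        reduceByKey(_ + _).
        map(oi => oi._1 + " " + oi._2)

      revenuePerOrder.saveAsTextFile(args(2))
      sc.stop()
    }
  }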