Creating a Dataset from CSV

How do you create a Dataset from a CSV file?

ls -ltr Research/data/retail_db/orders

a) Read the CSV file and create a Dataset from it:

  • Log in to spark-shell and read the data.
    • val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
    • orders.take(10).foreach(println)  // preview the data
  • Now give the data a structure by defining a case class.
    • case class Order(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)
  • Create an RDD of type Order by splitting each line on commas:
    • val ordersMap = orders.map(s => {
        val a = s.split(",")
        Order(a(0).toInt, a(1), a(2).toInt, a(3))
      })

  • Now create a Dataset.
    • val ordersDS = ordersMap.toDS
    • ordersDS.printSchema and ordersDS.show can be used to see the schema and preview the data of the Dataset.
  • An advantage of using a Dataset is that you can perform DataFrame-like operations as well as RDD-style operations. For example, to find only "COMPLETE" orders:
    • ordersDS.filter(o => o.order_status == "COMPLETE").show  [RDD style]
    • ordersDS.filter($"order_status" === "COMPLETE").show  [DataFrame style]
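
The steps above can be put together as a small standalone program. This is a minimal sketch, not the only way to do it. It makes the following assumptions:

  • It runs outside spark-shell, so it builds its own SparkSession and imports spark.implicits._ itself; inside spark-shell, spark and sc are already available.
  • The object name OrdersDatasetSketch and the helper parseOrder are hypothetical names introduced for illustration.
  • The three sample rows are made up to match the four-field layout used above. In practice they would be replaced by sc.textFile pointing at the orders directory.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level so Spark can derive an Encoder for it.
case class Order(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)

object OrdersDatasetSketch {

  // Hypothetical helper: turn one comma-separated line into an Order.
  // Assumes every line has exactly four well-formed fields.
  def parseOrder(line: String): Order = {
    val a = line.split(",")
    Order(a(0).toInt, a(1), a(2).toInt, a(3))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OrdersDatasetSketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._ // brings in toDS and the $"col" syntax

    // Made-up sample rows standing in for sc.textFile(<path to orders>).
    val orders = spark.sparkContext.parallelize(Seq(
      "1,2013-07-25 00:00:00.0,11599,CLOSED",
      "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
      "3,2013-07-25 00:00:00.0,12111,COMPLETE"
    ))

    // RDD[String] -> RDD[Order] -> Dataset[Order]
    val ordersDS: Dataset[Order] = orders.map(parseOrder).toDS()
    ordersDS.printSchema()

    // Typed (RDD-style) filter: the lambda receives an Order object.
    ordersDS.filter(o => o.order_status == "COMPLETE").show()

    // Untyped (DataFrame-style) filter: the condition is a Column expression.
    ordersDS.filter($"order_status" === "COMPLETE").show()

    spark.stop()
  }
}
```

The typed filter is checked by the compiler against the fields of Order, while the column-based filter refers to the field by name as a string and is only checked when the query is analyzed at run time.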