How to create a Dataset from a CSV file?
ls -ltr Research/data/retail_db/orders
a) Read the CSV file and create a Dataset from it:
- Log in to spark-shell and read the data.
- val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
- orders.take(10).foreach(println) // preview the data
- Now create a structure around the data by creating a case class.
- case class Order(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)
- Create an RDD of type Order by splitting each line on commas:
- val ordersMap = orders.map(s => { val a = s.split(",")
Order(a(0).toInt, a(1), a(2).toInt, a(3)) })
- Now create a Dataset (toDS needs spark.implicits._, which spark-shell imports automatically):
- val ordersDS = ordersMap.toDS
- ordersDS.printSchema and ordersDS.show can be used to see the schema and preview the data of the Dataset.
- An advantage of using a Dataset is that you can perform both DataFrame-like operations and RDD-style operations. For example, to find only "COMPLETE" orders:
- ordersDS.filter(o => o.order_status == "COMPLETE").show [RDD style]
- ordersDS.filter($"order_status" === "COMPLETE").show [DataFrame style]
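
The steps above can also be put together as a small standalone program. This is a minimal sketch, not a definitive implementation. The object name OrdersDatasetSketch, the helper parseOrder, the local[*] master, and passing the orders directory as the first command-line argument are all assumptions made for illustration. Unlike spark-shell, a standalone program has to create its own SparkSession and import spark.implicits._ explicitly.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same fields as the Order case class in the steps above.
// Declared at top level so Spark can derive an encoder for it.
case class Order(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)

object OrdersDatasetSketch {

  // Hypothetical helper: turns one comma-separated line into an Order.
  // It assumes every line has four well-formed fields and will throw otherwise.
  def parseOrder(line: String): Order = {
    val a = line.split(",")
    Order(a(0).toInt, a(1), a(2).toInt, a(3))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the orders directory is passed as the first argument.
    val path = args(0)

    // Assumption: local mode. In spark-shell, `spark` and `sc` already exist.
    val spark = SparkSession.builder()
      .appName("OrdersDatasetSketch")
      .master("local[*]")
      .getOrCreate()

    // Needed for toDS and the $"column" syntax outside spark-shell.
    import spark.implicits._

    // Read the raw lines and map each one to an Order.
    val orders = spark.sparkContext.textFile(path)
    val ordersMap = orders.map(parseOrder)

    // Convert the RDD[Order] into a Dataset[Order].
    val ordersDS: Dataset[Order] = ordersMap.toDS()

    ordersDS.printSchema()
    ordersDS.show()

    // RDD-style filter: a plain Scala function over Order objects.
    ordersDS.filter((o: Order) => o.order_status == "COMPLETE").show()

    // DataFrame-style filter: a column expression.
    ordersDS.filter($"order_status" === "COMPLETE").show()

    spark.stop()
  }
}
```

Keeping the parsing in its own function makes it possible to check it on a single line without starting Spark, and it is the natural place to add handling for malformed rows later.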