Selection and Projection of Data
First, launch spark-shell.
- Create a base directory variable:
- val inputBaseDir = "/Users/itversity/research/data/retail_db_json"
- Create a DataFrame, print its schema, and preview the data:
- val ordersDF = spark.read.json(inputBaseDir + "/orders")
- ordersDF.printSchema
- ordersDF.show(100) (shows the first 100 records)
- ordersDF.count (counts the number of records)
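The same steps can also run outside spark-shell as a standalone application. The sketch below is a minimal version under that assumption; the object name, application name, and local master are illustrative, and the input path is the same as above.

```scala
import org.apache.spark.sql.SparkSession

object OrdersPreview {
  def main(args: Array[String]): Unit = {
    // spark-shell already provides a SparkSession as `spark`;
    // a standalone application has to build it explicitly.
    val spark = SparkSession.builder()
      .appName("OrdersPreview")   // hypothetical app name
      .master("local[*]")         // assumption: running locally
      .getOrCreate()

    val inputBaseDir = "/Users/itversity/research/data/retail_db_json"

    // Each line of the orders files is a JSON document
    val ordersDF = spark.read.json(inputBaseDir + "/orders")

    ordersDF.printSchema()
    ordersDF.show(100)            // first 100 records
    println(s"Total orders: ${ordersDF.count()}")

    spark.stop()
  }
}
```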
- Use select to project order_id and order_date:
- The select function accepts column names either as strings or as Column objects (a single select call takes all strings or all Columns, not a mix)
- ordersDF.select("order_id", "order_date").show // string type
- ordersDF.select($"order_id", $"order_date").show // column type
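For reference, a hedged sketch of the equivalent ways to refer to columns in a projection, assuming the same spark-shell session and ordersDF as above (the import is shown for completeness; spark-shell already provides the $ syntax via spark.implicits._):

```scala
import org.apache.spark.sql.functions.col

// All four project the same two columns; a single select call takes
// either all strings or all Columns, not a mix.
ordersDF.select("order_id", "order_date").show()                      // column names as strings
ordersDF.select($"order_id", $"order_date").show()                    // $-interpolated Columns
ordersDF.select(col("order_id"), col("order_date")).show()            // col() helper
ordersDF.select(ordersDF("order_id"), ordersDF("order_date")).show()  // DataFrame.apply
```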
- To apply the length function (from org.apache.spark.sql.functions):
- ordersDF.select($"order_date", length($"order_date")).show
- To give the derived column an alias:
- ordersDF.select($"order_date", length($"order_date").alias("order_date_length")).show
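As an alternative to aliasing inside select, withColumn appends the derived column to an existing projection (.as is a synonym for .alias). A small sketch, assuming the same ordersDF:

```scala
import org.apache.spark.sql.functions.length

// Same output columns as the select + alias above: order_date and its length
ordersDF
  .select("order_date")
  .withColumn("order_date_length", length($"order_date"))
  .show()
```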
- To get the distinct values of a column:
- ordersDF.select($"order_status").distinct.show
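distinct is often combined with count to see how many unique values a column has; dropDuplicates gives the same result for a single column and also accepts several column names. A sketch, assuming the same ordersDF:

```scala
// Distinct order statuses and how many there are
val statusDF = ordersDF.select($"order_status").distinct()
statusDF.show()
println(s"Distinct statuses: ${statusDF.count()}")

// dropDuplicates keeps one row per order_status value
ordersDF.dropDuplicates("order_status").select("order_status").show()
```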
Note: All available functions can be explored by typing import org.apache.spark.sql.functions. in spark-shell and hitting Tab for auto-completion.
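As an illustration of a few functions from that package, the hedged sketch below lower-cases the status, pulls the date part out of order_date, and concatenates columns; it assumes order_date is a timestamp-like string (e.g. 2013-07-25 00:00:00.0), so its first 10 characters are the date:

```scala
import org.apache.spark.sql.functions.{lower, substring, concat, lit}

ordersDF.select(
  lower($"order_status").alias("order_status_lower"),            // lower-case the status
  substring($"order_date", 1, 10).alias("order_day"),            // first 10 characters
  concat($"order_id".cast("string"), lit("-"), $"order_status")  // join id and status
    .alias("id_status")
).show()
```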