DataFrame Operations – Selection or Projection of Data

Selection and Projection of Data

First, launch spark-shell:

  • Create a base directory variable:
    • val inputBaseDir = "/Users/itversity/research/data/retail_db_json"
  • Create a DataFrame, preview the data, and print the schema:
    • val ordersDF = spark.read.json(inputBaseDir + "/orders")
    • ordersDF.printSchema
    • ordersDF.show(100)   // shows the first 100 records
    • ordersDF.count   // counts the number of records
  • Use select to project order_id and order_date:
    • select takes either column names as strings or Column objects (but not a mix of both in one call):
      • ordersDF.select("order_id", "order_date").show   // string type
      • ordersDF.select($"order_id", $"order_date").show   // column type
  • To apply the length function to a column:
    • ordersDF.select($"order_date", length($"order_date")).show
  • To give the computed column an alias:
    • ordersDF.select($"order_date", length($"order_date").alias("order_date_length")).show
  • To get the unique values of a column:
    • ordersDF.select($"order_status").distinct.show
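
The steps above can also be tried without the JSON files. The following is a minimal sketch, assuming a small in-memory DataFrame in place of the orders data: the name sampleOrdersDF and the row values are hypothetical, and only the three column names come from the steps above. It runs the same string-based and Column-based select, the length alias, and distinct.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.length

// In spark-shell, getOrCreate() returns the session the shell already started.
val spark = SparkSession.builder()
  .appName("selection-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables toDF and the $"column" syntax

// Hypothetical sample rows; only the column names match the orders data above.
val sampleOrdersDF = Seq(
  (1, "2013-07-25 00:00:00.0", "CLOSED"),
  (2, "2013-07-25 00:00:00.0", "PENDING_PAYMENT"),
  (3, "2013-07-25 00:00:00.0", "COMPLETE"),
  (4, "2013-07-26 00:00:00.0", "CLOSED")
).toDF("order_id", "order_date", "order_status")

// Projection by column name (String) and by Column object give the same result.
val byName = sampleOrdersDF.select("order_id", "order_date")
val byColumn = sampleOrdersDF.select($"order_id", $"order_date")

// Apply length to order_date and name the derived column.
val withLength = sampleOrdersDF.select(
  $"order_date",
  length($"order_date").alias("order_date_length")
)

// Unique values of order_status.
val statuses = sampleOrdersDF.select($"order_status").distinct()

byName.show()
withLength.show(false) // false: do not truncate long strings
statuses.show()
```

Both select variants return a new DataFrame and leave sampleOrdersDF unchanged, so the projections can be chained or reused independently.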

 

Note – To browse all available functions, type import org.apache.spark.sql.functions. in spark-shell (including the trailing dot) and press Tab.
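
In a compiled application the functions have to be imported explicitly before they can be used. A minimal sketch, assuming a hypothetical one-column DataFrame named statusDF with made-up values, and using col, upper and length from org.apache.spark.sql.functions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, upper}

val spark = SparkSession.builder()
  .appName("functions-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // needed for toDF on a local Seq

// Hypothetical sample values, for illustration only.
val statusDF = Seq("closed", "complete").toDF("order_status")

// col("name") builds a Column, just like $"name".
val result = statusDF.select(
  col("order_status"),
  upper(col("order_status")).alias("status_upper"),
  length(col("order_status")).alias("status_length")
)

result.show()
```

Importing only the functions that are needed, rather than the whole functions._ namespace, keeps it clear where each name comes from.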