As part of this getting started session on Spark Data Frames, let us understand how we can create Data Frames from different file formats and how to write them back to different file formats.
Overview of Data Frames
Creating Data Frames from RDD
Creating Data Frames – File Formats
Creating Data Sets from RDD
Creating Data Sets from Data Frame
Overview of Data Frames
Data Frames are nothing but RDDs with named columns (a schema).
https://youtu.be/DSrdUdPnR2E
With RDDs we do not have names to refer to the fields
As there are no names, we can only access elements by position unless we use object-oriented concepts such as case classes
Once Data Frames are created, we can process the data using multiple approaches:
Data Frame Operations
Spark SQL
We can create Data Frames from files using APIs, from Hive tables as well as over JDBC.
Data Frames can be written into different file formats, Hive tables as well as remote databases over JDBC.
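To make this concrete, here is a minimal sketch of reading and writing Data Frames from files, Hive tables and over JDBC. The paths, table names and connection details are illustrative assumptions, not part of the original session.

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession; Hive support is needed only when reading Hive tables
val spark = SparkSession.builder()
  .appName("DataFrameIOSketch")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Create a Data Frame from files (JSON in this example)
val ordersDF = spark.read.json("/path/to/orders_json")

// Create a Data Frame from a Hive table
val customersDF = spark.table("retail_db.customers")

// Create a Data Frame over JDBC
val productsDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/retail_db")
  .option("dbtable", "products")
  .option("user", "retail_user")
  .option("password", "********")
  .load()

// Write a Data Frame back out, e.g. as parquet
ordersDF.write.mode("overwrite").parquet("/path/to/orders_parquet")
```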
Creating Data Frames from RDD
Now let us see how we can create Data Frames from RDD.
We do not use this that often as we can use spark.read APIs to convert data in flat files into Data Frames directly.
There are some file formats that can be read via sc but are not available on spark.read. In those scenarios we might have to load the data from files into an RDD, extract the information using map, and then create the Data Frame.
Let us see an example where we read comma separated data from text files and create Data Frame out of it.
Read text data using sc.textFile
Apply map to convert the data into tuples with the right data types
Use toDF to create the Data Frame. We can pass column names to toDF; the data types are inferred from the tuples.
[gist]a3f73cc5e534b76f847cefb408ff8456[/gist]
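For reference, the steps above could look roughly like the following sketch. The file path, delimiter, field layout and column names are assumptions for illustration and may differ from the gist.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RDDtoDF").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._   // required for toDF

// Read comma separated text data into an RDD of lines
val ordersRDD = sc.textFile("/path/to/orders")

// Apply map to convert each line into a tuple with the right data types
val ordersTuples = ordersRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1), fields(2).toInt, fields(3))
}

// Use toDF to create the Data Frame with column names; types come from the tuple
val ordersDF = ordersTuples.toDF("order_id", "order_date", "order_customer_id", "order_status")
ordersDF.printSchema()
ordersDF.show()
```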
Creating Data Frames – File Formats
Spark supports multiple file formats out of the box. We will just go through an overview for now and get into the details at a later point in time.
https://youtu.be/eksKBKwoz4c
spark.read (including its generic load method) provides APIs to read data from files of different file formats.
Supported file formats
Text File Format – csv and text
parquet
orc
json
avro (requires a plugin)
All the file formats except text files typically store metadata (including the schema) along with the data. Hence, when we create Data Frames out of these file formats, they typically inherit the schema.
Let us create Data Frame out of JSON data and process it using both Data Frame Operations as well as Spark SQL.
[gist]b80f6ad34b78c77798843cc31014a0e2[/gist]
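As an illustrative sketch of the above (the path and field names such as order_id and order_status are assumptions), the following reads JSON into a Data Frame and aggregates it using both Data Frame Operations and Spark SQL.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("JsonDF").master("local[*]").getOrCreate()

// JSON carries field names, so the Data Frame inherits the schema
val ordersDF = spark.read.json("/path/to/orders_json")
ordersDF.printSchema()

// Data Frame Operations: count of orders per status
ordersDF.groupBy("order_status")
  .agg(count("order_id").alias("order_count"))
  .show()

// Spark SQL: same result using a temporary view
ordersDF.createOrReplaceTempView("orders")
spark.sql("""
  SELECT order_status, count(order_id) AS order_count
  FROM orders
  GROUP BY order_status
""").show()
```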
Creating Data Sets from RDD
Let us see how we can create Data Sets from RDD. We need to use case classes, which come as part of Scala, to create Data Sets.
Create RDD by reading data from the file
Recap of Case Classes
Let us quickly review some of the concepts related to Case Classes as we need to use them for creating Data Sets.
Create case class – we will review some of the important concepts of Case Class
We get getters for all the fields with case classes (and setters only for fields declared as var).
It implements Serializable and Product
While Serializable gives us the functionality to convert the object into a byte stream, Product gives us functionality such as productArity, productIterator etc.
productArity gives us the number of fields, while productIterator converts the case class attribute values into an iterable collection.
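Here is a quick illustration of these Case Class features; the Order class and its fields are made up for this example.

```scala
// Case class generates getters, Serializable and Product functionality
case class Order(orderId: Int, orderDate: String, orderCustomerId: Int, orderStatus: String)

val order = Order(1, "2013-07-25 00:00:00.0", 11599, "CLOSED")

// Getters generated for each field
println(order.orderId)       // 1
println(order.orderStatus)   // CLOSED

// Product functionality
println(order.productArity)              // 4 - number of fields
order.productIterator.foreach(println)   // iterates over the field values
```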
Creating Data Sets
Now that we have reviewed the details of Case Classes, let us go ahead and create Data Sets from RDD.
Data Frames can be processed either by using Data Frame Operations or Spark SQL, whereas Data Sets can be processed using core APIs as well.
The APIs to process data using Data Frame Operations or Spark SQL are the same for both Data Frames and Data Sets.
[gist]408db3fbd615d0f86398f705efa3b1d4[/gist]
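A minimal sketch of this flow follows; the file path, case class and field names are assumptions for illustration and may differ from the gist.

```scala
import org.apache.spark.sql.SparkSession

// Case class defined at top level so Spark can derive an encoder for it
case class Order(orderId: Int, orderDate: String, orderCustomerId: Int, orderStatus: String)

val spark = SparkSession.builder().appName("RDDtoDS").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._   // provides toDS and encoders for case classes

// Create RDD by reading data from the file and mapping each line into the case class
val ordersRDD = sc.textFile("/path/to/orders").map { line =>
  val fields = line.split(",")
  Order(fields(0).toInt, fields(1), fields(2).toInt, fields(3))
}

// Convert the RDD of case class objects into a Data Set
val ordersDS = ordersRDD.toDS()

// Core (typed) API as well as Data Frame Operations are available on Data Sets
ordersDS.filter(order => order.orderStatus == "COMPLETE").show()
ordersDS.groupBy("orderStatus").count().show()
```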
Creating Data Sets from Data Frame
We can create a Data Set from a Data Frame using the as function along with a case class. But you need to make sure the data types are in sync.
Create Data Frame from JSON
Create case class with the required fields and appropriate data types
Use as and convert Data Frame to Data Set.
Make sure the data types between Data Frame and case class are compatible.
Also, as part of map we can only use data types such as Int, Long etc. Data types such as BigInt are not serializable.
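As an illustrative sketch (the JSON path and field names are assumptions), converting a Data Frame to a Data Set with as might look like this. Note that JSON numeric fields are inferred as bigint, so the case class uses Long for those fields to keep the types compatible.

```scala
import org.apache.spark.sql.SparkSession

// Field names and types must be compatible with the Data Frame schema
case class Order(order_id: Long, order_date: String, order_customer_id: Long, order_status: String)

val spark = SparkSession.builder().appName("DFtoDS").master("local[*]").getOrCreate()
import spark.implicits._

// Create Data Frame from JSON; integers in JSON come in as Long (bigint)
val ordersDF = spark.read.json("/path/to/orders_json")

// Use as to convert the Data Frame to a Data Set
val ordersDS = ordersDF.as[Order]

// Typed map using simple data types such as Long and String
ordersDS.map(order => (order.order_status, 1)).show()
```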