Let us see how we can read text data from files into a data frame. spark.read also has APIs for other file formats, but we will get into those details later.
- We can use spark.read.csv or spark.read.text to read text data.
- spark.read.csv can be used for delimited data such as comma-separated files. Default field names will be in the form _c0, _c1, etc. We can pass a custom delimiter to spark.read.csv using the keyword argument sep.
- We can also use spark.read.format with the file type. We can use schema to define the schema, option (for example, sep) to pass the delimiter, and load to load data from a given location into a data frame.
- spark.read.text can be used to read fixed-length data where there is no delimiter. The default field name is value.
- We can also define attribute names using the toDF function.
- In either case, the data will be represented as strings.
- We can convert data types by using the cast function.
- We will see all the other functions soon, but for now let us perform the task of reading the data into a data frame and representing each field in its original data type.