Let us see how we can read text data from files into a data frame. spark.read also has APIs for other file formats, but we will get into those details later.
- We can use spark.read.csv or spark.read.text to read text data.
- spark.read.csv can be used for delimited data such as comma-separated files. Default field names will be in the form _c0, _c1, etc. We can pass a custom delimiter to spark.read.csv using the keyword argument sep.
- We can also use spark.read.format with the file type. We can use schema to define the schema, option (for example, sep) to pass the delimiter, and load to load data from a given location into a data frame.
- spark.read.text can be used to read fixed-length data where there is no delimiter. The default field name is value.
- We can also define attribute names using the toDF function.
- In either case, the data will be represented as strings.
- We can convert data types by using the cast function.
- We will see all the other functions soon, but for now let us perform the task of reading the data into a data frame and representing each field in its original data type.