Durga Gadiraju

Reading Text Data From Files

Let us see how we can read text data from files into a data frame. spark.read also has APIs for other file formats, but we will get into those details later.

We can use spark.read.csv or spark.read.text to read text data.
spark.read.csv can be used for delimited data such as comma-separated files. Default field names are of the form _c0, _c1, and so on. We can pass a different delimiter to spark.read.csv using the keyword argument sep.
We can also use spark.read.format with the file type. We can use schema to define the schema, option (for example, sep) to pass the delimiter, and load to load data from a given location into a data frame.
spark.read.text can be used to read fixed-length data where there is no delimiter. The default field name is value.
We can also define attribute names using the toDF function.
In either case, unless a schema is specified, every field is read as a string.
We can convert data types using the cast function:

from pyspark.sql.types import IntegerType
df.select(df.field.cast(IntegerType()))

We will see other functions soon, but for now let us read the data into a data frame and represent each field in its original data type.
