As part of this topic, let us quickly review the basic concepts of Python before jumping into the Spark APIs. Python is a programming language, and the Spark APIs are compatible with it (along with Scala, Java, and others). Mastering at least one of these programming languages is essential for building applications with Spark.
Let us revise the following concepts before jumping into pyspark (Spark with Python).
We can use a Jupyter Notebook in the lab to revise these Python concepts.
We need to revise the following details related to functions.
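As a quick refresher, here is a minimal sketch of defining and invoking a Python function; the function name `get_commission` and its parameters are hypothetical, chosen only for illustration.

```python
# Illustrative example: a function with a default argument
def get_commission(sales_amount, commission_pct=10):
    """Compute the commission for a given sales amount."""
    return sales_amount * commission_pct / 100

print(get_commission(1000))      # uses the default commission_pct of 10
print(get_commission(1000, 20))  # overrides commission_pct with 20
```

Default arguments such as `commission_pct=10` let callers omit parameters that usually take a standard value.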
Let us revise the details related to lambda functions.
Here are some examples related to lambda functions.
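For example, a lambda function is an anonymous, single-expression function; the sketch below contrasts it with an equivalent named function (the names `double` and `double_lambda` are illustrative).

```python
# A named function and an equivalent lambda (anonymous) function
def double(n):
    return n * 2

double_lambda = lambda n: n * 2  # same logic as a one-line lambda

print(double(5))         # 10
print(double_lambda(5))  # 10

# Lambdas are typically passed inline to higher-order functions
numbers = [1, 2, 3, 4]
print(sorted(numbers, key=lambda n: -n))  # [4, 3, 2, 1]
```

Lambdas shine when the logic is short enough that naming a separate function would add noise, as in the `key` argument to `sorted` above.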
Now let us recollect the details about collections and the basic map reduce APIs.
Let us get into the details of the map reduce APIs used to manipulate collections.
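As an illustration, here is a minimal sketch of `map`, `filter`, and `reduce` over a Python list; the sample `orders` data (comma-separated id and status strings) is hypothetical.

```python
from functools import reduce

orders = ["1,closed", "2,open", "3,closed"]  # hypothetical sample records

# map: transform each element (extract the order id as an int)
order_ids = list(map(lambda order: int(order.split(",")[0]), orders))

# filter: keep only the elements matching a condition
closed = list(filter(lambda order: order.split(",")[1] == "closed", orders))

# reduce: aggregate the elements into a single value
total = reduce(lambda x, y: x + y, order_ids)

print(order_ids)  # [1, 2, 3]
print(closed)     # ['1,closed', '3,closed']
print(total)      # 6
```

These three building blocks mirror the core transformations and actions we will later apply to Spark collections (RDDs), which is why revising them here pays off.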
While collections are typically groups of objects, tuples, or simple strings, we need to parse them to process the data further. With Data Frames we can define a structure and reference the values in each record by column name. Data Frames also provide rich yet simple APIs to convert CSV files into Data Frames and process them with a developer-friendly API.
Here are some examples of the usage of Pandas Data Frames.
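For instance, the sketch below reads hypothetical CSV data into a Pandas Data Frame, references a column by name, and runs a simple filter-and-aggregate; the column names and values are made up for illustration, and `pd.read_csv` would normally be given a file path instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV data; pd.read_csv also accepts a file path
csv_data = io.StringIO(
    "order_id,order_status,order_amount\n"
    "1,CLOSED,150.0\n"
    "2,OPEN,200.0\n"
    "3,CLOSED,50.0\n"
)
df = pd.read_csv(csv_data)

# Reference values in each record by column name
print(df["order_status"])

# Filter and aggregate using the Data Frame API
closed_total = df[df["order_status"] == "CLOSED"]["order_amount"].sum()
print(closed_total)  # 200.0
```

Compare this with the earlier map reduce style: the same filter-and-sum logic is expressed declaratively against named columns rather than by parsing each string by hand.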