Quick revision of Python 3

As part of this topic, let us quickly review the basic concepts of Python before jumping into Spark APIs. Python is a general-purpose programming language, and Spark exposes its APIs in Python (along with Scala, Java, etc.). It is important to master at least one of these languages to build applications using Spark.

Let us revise the concepts below before jumping into PySpark (Spark with Python).

  • Basics of programming (help, type, indentation etc)
  • Functions – pre-defined, user-defined and lambda functions
  • Basic file I/O
  • Collections and Map Reduce APIs
  • Overview of Pandas Data Frames

We can use Jupyter Notebook in the lab to revise Python concepts.

Functions

We need to revise the following topics related to functions.

  • Pre-defined functions (especially string manipulation functions)
  • How to develop user-defined functions
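
The two bullets above can be sketched as follows. This is a minimal illustration, not the original course example: the sample record, its field layout and the function name get_order_status are assumptions made for the sketch.

```python
# Pre-defined string functions applied to a hypothetical comma-separated record
# (assumed layout: order_id, order_date, customer_id, order_status)
order = "1,2013-07-25 00:00:00.0,11599,CLOSED"

fields = order.split(",")        # split the record into a list of strings
order_id = int(fields[0])        # convert the first field to an integer
order_year = fields[1][:4]       # slice the first four characters of the date


# A user-defined function (hypothetical name) with a default argument
def get_order_status(record, delimiter=","):
    """Return the last field of a delimited record, stripped and upper-cased."""
    return record.split(delimiter)[-1].strip().upper()


print(order_id, order_year, get_order_status(order))  # prints: 1 2013 CLOSED
```

Built-in help and type can be used in the notebook to explore any of these, for example help(str.split) or type(fields).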

Lambda Functions

Let us revise the details related to lambda functions.

Here are the examples related to lambda functions.
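
A minimal sketch of lambda functions, assuming a hypothetical helper named apply_to_range that accepts a function as an argument. The helper is purely illustrative and does not come from any library.

```python
# A lambda is an anonymous function limited to a single expression
add = lambda a, b: a + b
print(add(2, 3))  # prints 5


# Lambdas are most useful as short arguments passed to other functions
def apply_to_range(func, lower, upper):
    """Sum func(i) for every i from lower to upper, both inclusive."""
    total = 0
    for i in range(lower, upper + 1):
        total += func(i)
    return total


# 1 + 4 + 9 + 16 = 30
sum_of_squares = apply_to_range(lambda i: i * i, 1, 4)

# Only even numbers contribute: 2 + 4 = 6
sum_of_evens = apply_to_range(lambda i: i if i % 2 == 0 else 0, 1, 4)

print(sum_of_squares, sum_of_evens)  # prints: 30 6
```

Passing small functions in this way is the same pattern used later with map, filter and the Spark APIs.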

Collections and Map Reduce APIs

Now let us recollect details about collections and basic map reduce APIs.

Supported collections: list, set, dict

  • Tuples are records without field names; the value of each attribute is retrieved by position (for example, t[0])
  • Quite often we will create a list or set of tuples
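
A short sketch of these collections working together. The order item values and the (order_id, product_id, subtotal) layout are assumptions chosen for illustration.

```python
# A tuple: values are read by position, not by name
# (assumed layout: order_id, product_id, subtotal)
order_item = (1, 957, 299.98)
order_id = order_item[0]
subtotal = order_item[2]

# A list of tuples keeps insertion order and allows duplicates
order_items = [(1, 957, 299.98), (2, 1073, 199.99), (2, 502, 250.0)]

# A set keeps only distinct values
order_ids = {item[0] for item in order_items}

# A dict maps a key to a value; here order_id -> total revenue
revenue_by_order = {}
for oid, _, item_subtotal in order_items:
    revenue_by_order[oid] = revenue_by_order.get(oid, 0.0) + item_subtotal

print(order_id, subtotal)        # prints: 1 299.98
print(sorted(order_ids))         # prints: [1, 2]
print(len(revenue_by_order))     # prints: 2
```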

Map Reduce APIs

Let us get into the details related to Map Reduce APIs to manipulate collections.

  • If we have to sort a collection, the result is a list: the built-in sorted accepts any collection and returns a new sorted list
  • If we have to eliminate duplicates, we can convert the collection to a set
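
The text does not name the functions involved, so the sketch below assumes the Map Reduce APIs meant here are Python's built-in map and filter together with functools.reduce. The sample records and their layout are also assumptions.

```python
from functools import reduce

# Hypothetical records (assumed layout: item_id, order_id, product_id, subtotal)
order_items = [
    "1,1,957,299.98",
    "2,2,1073,199.99",
    "3,2,502,250.0",
    "4,2,403,129.99",
]

# filter: keep only the records that belong to order 2
items_for_order_2 = filter(lambda rec: int(rec.split(",")[1]) == 2, order_items)

# map: extract the subtotal of each remaining record as a float
subtotals = map(lambda rec: float(rec.split(",")[3]), items_for_order_2)

# reduce: fold the subtotals into a single total, starting from 0.0
revenue = reduce(lambda total, value: total + value, subtotals, 0.0)
print(round(revenue, 2))  # prints 579.98

# Eliminate duplicates by converting to a set
unique_order_ids = set(map(lambda rec: int(rec.split(",")[1]), order_items))

# Sort by subtotal in descending order; sorted always returns a list
sorted_by_subtotal = sorted(
    order_items,
    key=lambda rec: float(rec.split(",")[3]),
    reverse=True,
)
print(sorted_by_subtotal[0])  # prints 1,1,957,299.98
```

Note that map and filter return lazy iterators in Python 3, so each result can be consumed only once.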

Pandas Data Frames and Data Frame Operations

Collections are typically groups of objects, tuples or simple strings, so we need to parse each element before we can process the data further. With Data Frames we can define the structure up front and reference values in each record by column name. Data Frames also provide rich yet simple APIs to load CSV files into Data Frames and process them with a developer-friendly API.

  • Using read_csv with names we can create a Data Frame out of comma-separated data and assign the field names
  • We can fetch data from specific columns using their names
  • We can filter data using query
  • We can perform by-key aggregations using groupby followed by aggregate functions such as sum or count
  • We can also join data using merge (or join)

Here are some examples of the usage of Pandas Data Frames.
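
A minimal sketch of the operations listed above, assuming pandas is installed. The data is inlined through io.StringIO instead of being read from files, and the column names and sample rows are assumptions. The join is shown with DataFrame.merge, the standard relational join in pandas.

```python
import io

import pandas as pd

# Inline stand-ins for CSV files without a header row (hypothetical data)
orders_csv = io.StringIO(
    "1,2013-07-25,11599,CLOSED\n"
    "2,2013-07-25,256,PENDING_PAYMENT\n"
    "3,2013-07-26,12111,COMPLETE\n"
)
order_items_csv = io.StringIO(
    "1,1,957,299.98\n"
    "2,2,1073,199.99\n"
    "3,2,502,250.0\n"
    "4,3,403,129.99\n"
)

# read_csv with names assigns the column names to header-less data
orders = pd.read_csv(
    orders_csv,
    names=["order_id", "order_date", "customer_id", "order_status"],
)
order_items = pd.read_csv(
    order_items_csv,
    names=["item_id", "order_id", "product_id", "subtotal"],
)

# Fetch specific columns by name
id_and_status = orders[["order_id", "order_status"]]

# Filter rows using query
complete = orders.query("order_status == 'COMPLETE'")

# By-key aggregation: total subtotal per order_id
revenue_per_order = order_items.groupby("order_id")["subtotal"].sum()

# Inner join of the two Data Frames on the shared order_id column
joined = orders.merge(order_items, on="order_id")

print(complete)
print(revenue_per_order)
print(joined[["order_id", "order_status", "subtotal"]])
```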