Continuous Applications is a new buzzword: enterprises want real-time reports with the lowest latency possible. Sarath Varma, Data Engineer at GrubHub, will share his experience using Spark Structured Streaming to achieve Continuous Applications.
- A quick overview of Apache Spark on Amazon Elastic MapReduce (EMR)
- Overview of Spark Structured Streaming
- Demo – Continuous Application using Spark Structured Streaming
  - Read data from S3
  - Process using Spark Structured Streaming
  - Write data back to S3
- Q&A between me and Sarath. We will share the questions up front so that you have an idea of what will be asked.
- Q&A with the audience
- An announcement about the live course on AWS Analytics (beginning of March). The curriculum might change a bit, but the planned topics are:
- Elastic MapReduce (EMR)
- Apache Spark on EMR
- Continuous Applications using Spark Structured Streaming
- Integration between Kinesis and Spark Structured Streaming
- Case Study using boto
- Setup of Azkaban
- Create ETL workflow using Azkaban and EMR
Apache Spark
If you are new to Spark, you can sign up for one of our courses to explore Spark in detail before diving deep into Continuous Applications.
What is Apache Spark?
- Apache Spark is a general-purpose distributed data processing engine.
- The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
- Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming (a minimal sketch follows this list).
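The sketch below (not from the talk; the data is made up) shows a single SparkSession handling a simple batch aggregation. The same entry point also drives SQL, machine learning, and streaming workloads.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("spark-overview-sketch") \
    .getOrCreate()

# Distribute a tiny in-memory dataset and aggregate it (purely illustrative data).
df = spark.createDataFrame(
    [("orders", 10), ("orders", 5), ("returns", 2)],
    ["event_type", "qty"]
)
df.groupBy("event_type").sum("qty").show()

spark.stop()
```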
Why Apache Spark?
- Speed: its in-memory processing engine is known for its performance.
- Unification: Apache Spark provides sophisticated analytics libraries and supports batch processing, streaming, interactive querying, machine learning, and more.
- Programming languages supported by Spark include: Java, Python, Scala, and R.
What is a DataFrame?
- A DataFrame is a distributed collection of data organized into named columns.
- DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
- DataFrames are also evaluated lazily (see the sketch below).
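As a hedged illustration of the bullets above (the S3 path and the status column are placeholders, not from the talk), here is how a DataFrame can be built from a structured file or from an existing RDD, and how nothing executes until an action is called:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# From a structured data file (the S3 path is a placeholder).
events = spark.read.json("s3://your-bucket/events/")

# From an existing RDD of tuples.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
pairs = rdd.toDF(["key", "value"])

# Lazy evaluation: this only builds a logical plan; nothing runs yet.
ok_events = events.filter(events["status"] == "ok")

# The action below triggers the actual computation.
print(ok_events.count())
```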
What is Spark Streaming?
- Spark Streaming is an extension of the core Spark API.
- Spark Streaming divides continuously flowing input data into discrete units for processing.
- Spark Streaming provides a high-level abstraction called a discretized stream, or DStream (a minimal example follows).
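The minimal DStream word count below (the socket source and the 10-second batch interval are assumptions for illustration) shows how Spark Streaming turns the continuous input into a sequence of small RDDs and applies regular RDD operations to each batch:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Each micro-batch of lines arrives as an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```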
Structured Streaming and Continuous Applications
Why are Streaming Applications Difficult?
- Complex data
- Complex workloads
- Complex systems
- DStream-based streaming jobs are nothing but micro-batches, which means there is a pre-defined minimum latency.
What is Structured Streaming?
- Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
- Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
- Since Spark 2.3, a new low-latency processing mode called Continuous Processing has been available (illustrated in the sketch below).
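Here is a hedged sketch (the Kafka broker, topic, and the spark-sql-kafka package on the classpath are assumptions, not part of the demo): the streaming query is written like a batch DataFrame query, and switching the trigger to continuous asks Spark 2.3+ to run it in the low-latency Continuous Processing mode instead of micro-batches.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Kafka source (requires the spark-sql-kafka package; broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS payload")
         .writeStream
         .format("console")
         .trigger(continuous="1 second")  # Continuous Processing mode (Spark 2.3+)
         .start())

query.awaitTermination()
```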
What are Continuous Applications?
- Continuous applications are end-to-end applications that react to data in real time, unifying streaming with batch and interactive queries instead of treating streaming as a separate system.
Demo
Here are the details about the demo.
- We are using a combination of Kinesis and Spark Structured Streaming for the demo.
- Data is pushed by a web application simulator into S3 at regular intervals using Kinesis.
- The Kinesis data stream saves files in text format into an intermediate S3 bucket.
- Data is read and processed using the Spark Structured Streaming APIs.
- Processed data is written back to files in S3, using the Parquet file format with Snappy compression (a simplified sketch follows this list).
- Hive tables will be created, and we will demonstrate how reports can be generated by Data Analysts or Data Scientists in the organization.
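Here is a simplified sketch of that pipeline (bucket names, the comma-delimited record layout, and the checkpoint location are placeholders, not the actual demo code): read the text files that Kinesis lands in the intermediate S3 bucket, apply a simple transformation, and write Snappy-compressed Parquet files back to S3.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("continuous-application-demo-sketch").getOrCreate()

# Stream the raw text files landed by Kinesis (placeholder bucket/prefix).
raw = spark.readStream.text("s3://intermediate-bucket/raw/")

# Assume comma-delimited records: user_id,url,ts (the layout is an assumption).
parts = split(col("value"), ",")
events = raw.select(
    parts.getItem(0).alias("user_id"),
    parts.getItem(1).alias("url"),
    parts.getItem(2).alias("ts"),
).where(col("url").isNotNull())

# Write the processed stream back to S3 as Snappy-compressed Parquet files.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://processed-bucket/pageviews/")
         .option("checkpointLocation", "s3://processed-bucket/_checkpoints/pageviews/")
         .option("compression", "snappy")  # Snappy is also Spark's default for Parquet
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```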
In the actual project everything is automated: using spark.sql, we take care of creating partitions before copying data. Since we are using AWS EMR for both Spark Streaming and Hive, the integration between S3 and AWS services is seamless. However, the demo is given on ITVersity labs, which is built on the Hortonworks distribution, so we copy data from S3 to HDFS and create the table as well as add the partitions manually.
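For the automated setup mentioned above, a hedged illustration of spark.sql-based table and partition management might look like the following (the table name, locations, and partition value are placeholders, not the project's code):

```python
from pyspark.sql import SparkSession

# Hive support is needed so that spark.sql can manage Hive tables and partitions.
spark = SparkSession.builder \
    .appName("partition-management-sketch") \
    .enableHiveSupport() \
    .getOrCreate()

# Register the processed Parquet output as an external, partitioned Hive table.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS pageviews (
        user_id STRING,
        url STRING,
        ts STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3://processed-bucket/pageviews/'
""")

# Add the partition for a given day before analysts query it (the date is a placeholder).
spark.sql("""
    ALTER TABLE pageviews
    ADD IF NOT EXISTS PARTITION (dt = '2019-03-01')
    LOCATION 's3://processed-bucket/pageviews/dt=2019-03-01/'
""")
```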
- Can you give us an overview of the infrastructure for building a streaming pipeline platform?
- Why did you choose AWS EMR over an on-premises cluster?
- How are you managing the life cycle of the cluster and the business logic?
- Why did you use Spark Streaming instead of Storm or Flink?
- What are the source and target for the data sets?
- Can you elaborate on the challenges related to S3 and how you overcame them with EMRFS?
- What is your opinion on using Python for building Spark Streaming applications?