AWS EMR and Spark 2 using Scala

Getting Started with AWS EMR and Spark 2
12 Topics
Sign up for AWS account
Setup Cygwin on Windows and Quick Preview
Understand EC2 Pricing, Create and Connect to an EC2 Instance
EC2 Dashboard, States and Description
Using Elastic IPs to connect to EC2 Instance
Using security groups to provide security to EC2 Instance
Understanding the concept of bastion server
Terminating EC2 Instance and releasing all the resources
Create security credentials for AWS account (access and secret key)
Setting up AWS CLI in Windows
Creating S3 bucket
Getting Started Conclusion
User Management and Security using IAM
6 Topics
Introduction to IAM and Securing AWS account
Creating First IAM user, create group and IAM password policy
IAM best practices and Creating Custom policy
Assign policy to entities (user and/or group)
Creating role for EC2 trusted entity with permissions on S3 and assign the role
Conclusion and takeaways for User Management and Security using IAM
Creating EMR cluster with Spark using quick options
7 Topics
Introduction to EMR and concepts
Prerequisites before setting up EMR cluster and setup datasets
Setup EMR with Spark cluster using quick options
EMR cluster startup cycle and connecting to master node
Enabling web connection to access web interfaces
Running spark-sql queries
Understanding EMR pricing and terminating the cluster
Setup Development environment for Scala and Spark
15 Topics
Setup Development Environment – Scala and Spark
Setup Java and Install Scala with IntelliJ IDE
Develop Hello World Program using Scala and run the application
Setting up winutils.exe and Data sets
Develop first spark application and building Jar file using sbt
Download and install Spark using 7z on Windows
Configure environment variables for Spark on Windows
Running spark job using spark-shell
Validating spark job from jar file using spark-submit
Setup Ubuntu using Windows Subsystem for Linux – Windows 10
Access C drive using Ubuntu (built using Windows Subsystem for Linux)
Setup Java and JDK in Ubuntu
Download and setup Spark 2 in Linux or Mac
Configure environment variables for Spark 2 – Linux or Mac
Run jar file on Linux or Mac
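The sbt build used in the topics above (building a jar and validating it with spark-submit) can be sketched roughly as follows; the project name, class name, and Scala/Spark version numbers are illustrative placeholders, not the course's exact values.

```scala
// build.sbt -- minimal sketch for packaging a Spark 2 application with sbt.
// Versions here are placeholders; match them to what the cluster runs.
name := "spark2-demo"
version := "0.1"
scalaVersion := "2.11.12"

// Spark is marked "provided" so spark-submit supplies it at runtime
// instead of bundling it into the application jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided"
```

With this in place, `sbt package` produces the jar under `target/scala-2.11/`, and it can be validated with something like `spark-submit --class <MainClass> target/scala-2.11/spark2-demo_2.11-0.1.jar` (class name is a placeholder).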
Spark 2 – Building Blocks
13 Topics
Introduction to Spark 2 Building Blocks
Official page and documentation for Spark 2
Spark High level overview
Quick Start section of Spark official documentation
Linking with Spark – Updating dependencies in IDE for development
Initializing Spark – Programmatically using Scala
Initializing Spark – Spark Shell
Introduction to Big Data Developer labs of ITVersity
Run Spark job in YARN mode
Execution life cycle of Spark Job
Overview of YARN
YARN deployment modes – Client vs. Cluster
Cluster Modes – Glossary
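Initializing Spark 2 programmatically, as covered above, typically goes through `SparkSession`; a minimal sketch (requires the Spark dependency on the classpath, and the app name and master shown here are placeholders — on EMR/YARN the master comes from spark-submit instead):

```scala
import org.apache.spark.sql.SparkSession

object SparkInit {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) the session; "local[*]" is only for local runs.
    val spark = SparkSession.builder()
      .appName("Spark2BuildingBlocks")
      .master("local[*]")
      .getOrCreate()

    // The classic SparkContext is still available underneath.
    val sc = spark.sparkContext
    println(s"Spark version: ${spark.version}")

    spark.stop()
  }
}
```

In spark-shell the same session is pre-created as `spark` (with `sc` for the context), so no builder call is needed there.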
Spark 2 – Resilient Distributed Data Sets (RDD)
19 Topics
Introduction to RDDs
Revision of Scala Collections
Scala Collection APIs – Filtering Data
Scala Collection APIs – map and reduce
Parallelize – Create RDD from collections
Parallelize – Develop application to get revenue for given order
Parallelize – Run application on cluster
External Datasets – Create RDD using text data
External Datasets – wholeTextFiles
External Datasets – sequenceFile
External Datasets – hadoopRDD, newAPIHadoopRDD and objectFile
RDD Operations – Introduction
RDD Operations – Printing Elements from RDD
RDD Operations – Working with Key Value pairs
Data processing life cycle using Spark
String Processing – Extracting fields – substring, indexOf and split
String Processing – Type conversions using functions such as toInt
String Processing – Comparing data
String Processing – Extract date and change the format
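The string-processing topics above (substring, indexOf, split, toInt, comparing, and date reformatting) can be previewed on a single sample record; the record layout mirrors a typical retail orders dataset, but the exact values are illustrative.

```scala
// A sample line shaped like an orders record:
// order_id,order_date,order_customer_id,order_status
val order = "1,2013-07-25 00:00:00.0,11599,CLOSED"

// Extracting fields with split.
val fields = order.split(",")

// Type conversion using toInt.
val orderId = fields(0).toInt

// Extract just the date part via indexOf and substring.
val orderDate = fields(1)
val datePart = orderDate.substring(0, orderDate.indexOf(" ")) // "2013-07-25"

// Change the format from yyyy-MM-dd to yyyyMMdd by dropping the dashes.
val formatted = datePart.replace("-", "").toInt // 20130725

// Comparing data: status checks are plain string comparisons.
val isClosed = fields(3) == "CLOSED"
```

On an RDD these same operations run inside `map`, one record at a time.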
RDD Operations – Transformations and Actions
30 Topics
Problem Statement
Transforming data using map
Filtering data using filter
Flattening arrays using flatMap
Using flatMap to process data read by wholeTextFiles
Transforming data using mapPartitions
Transforming data using mapPartitionsWithIndex
Get top n products per day – Reading Data
Get top n products per day – Filter and extract data from orders
Get top n products per day – Get data from order_items
Get top n products per day – Get data from products split by regex
Get sample – using sample
Set operations – union
Set operations – intersection
Get unique values using distinct
groupByKey – Getting Started
groupByKey – process values per key using map
groupByKey – process values using flatMap
Concept of shuffling
Aggregate data using reduceByKey
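Because the course revises Scala collections before RDDs, the core transformations above (filter, map, flatMap, and a reduceByKey-style aggregation) can be previewed without a cluster — on an RDD the method names are largely the same, only the receiver changes, and `reduceByKey` corresponds to the group-then-reduce shown here. The order data is made up for illustration.

```scala
// (order_id, product, revenue) records -- illustrative values only.
val orderItems = List(
  (1, "Shoes", 59.99), (1, "Socks", 5.00),
  (2, "Shoes", 59.99), (2, "Gloves", 19.99)
)

// Filtering data using filter: keep items worth at least 10.00.
val bigItems = orderItems.filter { case (_, _, revenue) => revenue >= 10.0 }

// Transforming data using map: project to (order_id, revenue) key-value pairs.
val pairs = bigItems.map { case (orderId, _, revenue) => (orderId, revenue) }

// reduceByKey equivalent on collections: group by key, then reduce the values.
// On an RDD this is a single pairs.reduceByKey(_ + _) and involves a shuffle.
val revenuePerOrder = pairs
  .groupBy { case (orderId, _) => orderId }
  .map { case (orderId, vs) => (orderId, vs.map(_._2).sum) }

// Flattening arrays using flatMap: one record in, many records out.
val words = List("Top rated Shoes", "Warm Gloves").flatMap(_.split(" "))
```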
Data Frame and Data Set Operations
12 Topics
Introduction to Data Frame and Data Set Operations
Reviewing Data Sets
Creating Data Set from CSV
Using Spark SQL and Data Frames programmatically
Creating Data Set – using Data Frame
Writing Data Frame to File System
Running application on cluster using Spark 2
Create Data Frame – Infer Schema using Reflection
Create Data Frame – Applying Schema programmatically using types
Data Frame Operations – Getting Started
Get Top N Products Per Day – Create application and read data
Data Frames Operations – selection or projection of data
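Creating a Data Set from CSV and projecting columns, as covered above, might look roughly like this in Spark 2; the file paths, the `Order` case class, and its field names are illustrative assumptions, not the course's exact schema.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record shape; field names must match the CSV header.
case class Order(order_id: Int, order_date: String, order_status: String)

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameDemo").getOrCreate()
    import spark.implicits._

    // Read a CSV into a Data Frame; inferSchema samples the data for types.
    val ordersDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/orders.csv") // placeholder path

    // Creating a typed Data Set from the Data Frame.
    val ordersDS = ordersDF.as[Order]

    // Selection or projection of data.
    ordersDS.select("order_id", "order_status").show(5)

    // Writing the Data Frame back to the file system.
    ordersDF.write.mode("overwrite").json("/path/to/output") // placeholder

    spark.stop()
  }
}
```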