CCA 175 Spark and Hadoop Developer using Scala – Introduction

This is the reference material for CCA 175 Spark and Hadoop Developer using Scala.

Agenda

  • Introduction
  • Curriculum
  • Required Skills
  • Setup Environment
  • HDFS and YARN
  • Data Sets
  • Windows Environment (labs)

Introduction

  • CCA Spark and Hadoop Developer is a well-recognized certification in the industry
  • Conducted by Cloudera, a major Big Data vendor
  • ITVersity has played a key role in helping at least 300 people get certified
  • Scenario-based exam
  • 8 to 12 questions

Curriculum

  • Data Ingest
    • Sqoop
    • HDFS
    • Flume
    • Kafka
    • Spark Streaming
  • Transform, Stage and Store
    • Load data from HDFS
    • Write data back to HDFS
    • File Formats
    • Standard ETL
  • Data Analysis
    • Use Hive tables
    • Fundamentals of Querying
    • Analytical or Windowing Functions
  • Configuration

Required Skills

  • Use Sqoop to import/export data between HDFS and MySQL
  • Programming using Scala or Python
  • Develop Spark-based applications using Scala or Python (see the sketch after this list)
  • HDFS commands
  • Spark SQL or Hive
  • Flume
  • Kafka
  • Spark Streaming
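
As an illustration of the Spark development skill, below is a minimal Scala sketch that counts retail_db orders by status. It is an assumption-laden example, not exam material: the input path /user/training/retail_db/orders and the comma-delimited record layout are assumptions based on the data sets described later.

    // Minimal sketch: count orders by status using the RDD API
    // (input path and record layout are assumptions)
    import org.apache.spark.{SparkConf, SparkContext}

    object OrderCountByStatus {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Order Count By Status")
        val sc = new SparkContext(conf)

        // each record: order_id,order_date,order_customer_id,order_status
        val orders = sc.textFile("/user/training/retail_db/orders")

        val countByStatus = orders.
          map(line => (line.split(",")(3), 1)).
          reduceByKey(_ + _)

        countByStatus.collect().foreach(println)
        sc.stop()
      }
    }

Such an application can be packaged with a build tool like sbt and run with spark-submit.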

Setup Environment – Options

  • You need to have an environment ready using one of these approaches
  • Setting up locally
  • Setting up the Cloudera Quickstart VM
  • Using existing clusters
  • Using https://labs.itversity.com

Setting up – Locally

  • Set up Scala or Python
  • Set up Hadoop
  • Set up Spark
  • Set up Sqoop and more
  • Integrate them
  • Not recommended and not supported
  • It is tedious and time-consuming

Setup Environment – Cloudera Quickstart VM

You can set up the Cloudera Quickstart VM, but it requires:

  • A laptop with 16 GB RAM and an i7 quad-core processor
  • Need to assign 8 GB to 10 GB of RAM to the Cloudera Quickstart VM
  • Requires VirtualBox, VMware, or Docker to set up the virtual machine
  • Can be counterproductive due to resource contention
  • Setup Process
    • Install VirtualBox, VMware, or Docker
    • Download the Cloudera Quickstart virtual machine image
    • Open the image using VirtualBox and make sure to allocate 8 GB RAM and 4 cores
  • Most of the required tools are available out of the box with the Cloudera Quickstart VM (see the quick check after this list)
    • MySQL database
    • Sqoop
    • Hive
    • Spark and many others
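
Once the VM is up, a quick way to confirm these tools are present is to check their versions from a terminal; this is only a sanity check, assuming the standard Quickstart VM layout.

    # verify the pre-installed tools (standard Quickstart VM assumed)
    mysql --version
    sqoop version
    hive --version
    spark-submit --version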

Big Data Developer labs

  • Here is the URL – https://labs.itversity.com
  • Plans – $14.95 for 31 days, $34.95 for 93 days, and $54.95 for 185 days
  • It comes with all the tools well integrated, and you can get started within 2 minutes
  • Quick preview

Windows Environment (labs)

  • PuTTY and WinSCP (to copy data)
  • Cygwin
    • Set up Cygwin
    • Set up SSH
    • Set up passwordless login (see the commands after this list)
  • Make sure Chrome is installed
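
For the passwordless login step, a typical OpenSSH-based sequence is sketched below; the username and gateway host are placeholders for your own lab credentials, and ssh-copy-id assumes the Cygwin openssh package is installed.

    # generate a key pair and push the public key to the remote host
    # (username and host below are placeholders)
    ssh-keygen -t rsa
    ssh-copy-id username@your-lab-gateway-host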

HDFS Preview

  • Properties files
    • /etc/hadoop/conf/core-site.xml
    • /etc/hadoop/conf/hdfs-site.xml
  • Important Properties
    • fs.defaultFS
    • dfs.blocksize
    • dfs.replication
  • HDFS commands (illustrated after this list)
    • Copying files
      • From local file system (hadoop fs -copyFromLocal or -put)
      • To local file system (hadoop fs -copyToLocal or -get)
      • From one HDFS location to another (hadoop fs -cp)
    • Listing files (hadoop fs -ls)
    • Previewing data from files (hadoop fs -tail or -cat)
    • Checking sizes of the files (hadoop fs -du)
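
A quick illustration of these commands and properties is below; the local and HDFS paths are placeholders, so substitute your own directories.

    # check an important property directly from the configuration
    hdfs getconf -confKey dfs.blocksize

    # copy a local file into HDFS and back to the local file system
    hadoop fs -put /data/retail_db/orders/part-00000 /user/training/orders.csv
    hadoop fs -get /user/training/orders.csv /tmp/orders.csv

    # list files, preview data, and check sizes
    hadoop fs -ls /user/training
    hadoop fs -tail /user/training/orders.csv
    hadoop fs -du -h /user/training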

YARN Preview

  • In the certification environment, Spark typically runs in YARN mode
  • We should be able to check the memory configuration to understand the cluster capacity
    • /etc/hadoop/conf/yarn-site.xml
    • /etc/spark/conf/spark-env.sh
  • Spark default settings
    • Number of executors – 2
    • Memory per executor – 1 GB
  • Quite often we underutilize resources. By understanding the memory settings thoroughly and mapping them to the size of the data we are trying to process, we can accelerate the execution of our jobs, as shown below
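
For example, Spark can be launched on YARN with explicit resources; the numbers below are illustrative and should be derived from the capacity you find in yarn-site.xml.

    # launch Spark on YARN with explicit executor settings (values are illustrative)
    spark-shell --master yarn \
      --num-executors 4 \
      --executor-memory 2g \
      --executor-cores 2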

Data Sets

  • Go to https://github.com/dgadiraju/data
  • Clone or download it onto virtual machines created using the Cloudera Quickstart VM or Hortonworks Sandbox (see the commands after this list)
  • You can set the data up locally to practice Spark, but it is highly recommended to use HDFS, which comes out of the box with the Cloudera Quickstart VM, Hortonworks Sandbox, or our labs
  • On the labs, the data sets are already available
  • retail_db
    • Master tables
      • customers
      • products
      • categories
      • departments
    • Transaction tables
      • orders
      • order_items
