This is the reference material for CCA 175 Spark and Hadoop Developer using Scala.
Agenda
- Introduction
- Curriculum
- Required Skills
- Setup Environment
- HDFS and YARN
- Data Sets
- Windows Environment (labs)
Introduction
- CCA Spark and Hadoop Developer is a well-recognized certification in the industry
- Conducted by Cloudera – a major Big Data vendor
- ITVersity has played a key role in at least 300 people getting certified
- Scenario-based exam
- 8 to 12 questions
Curriculum
- Data Ingest
- Sqoop
- HDFS
- Flume
- Kafka
- Spark Streaming
- Transform, Stage and Store
- Loading Data from HDFS
- Write data back to HDFS
- File Formats
- Standard ETL
- Data Analysis
- Use Hive tables
- Fundamentals of Querying
- Analytical or Windowing Functions
- Configuration
Required Skills
- Use Sqoop to import/export data between HDFS and MySQL
- Programming using Scala or Python
- Develop Spark based applications using Scala or Python
- HDFS commands
- Spark SQL or Hive
- Flume
- Kafka
- Spark Streaming
Setup Environment – Options
- You need to have an environment ready using one of these approaches
- Setting up locally
- Setting up Cloudera Quickstart VM
- Use existing clusters
- Using https://labs.itversity.com
Setting up – Locally
- Setup Scala or Python
- Setup Hadoop
- Setup Spark
- Setup Sqoop and more
- Integrate them
- Not recommended and not supported
- It is tedious and time-consuming
Setup Environment – Cloudera Quickstart VM
You can set up the Cloudera Quickstart VM, but it requires
- A laptop with 16 GB RAM and an i7 quad-core processor
- 8 GB to 10 GB of RAM assigned to the Cloudera Quickstart VM
- VirtualBox, VMware, or Docker to set up the virtual machine
- Can be counterproductive due to resource contention
- Setup Process
- Install VirtualBox, VMware, or Docker
- Download the Cloudera Quickstart virtual machine image
- Open it using VirtualBox and make sure to allocate 8 GB RAM and 4 cores
- Most of the tools are available out of the box with the Cloudera Quickstart VM
- MySQL database
- Sqoop
- Hive
- Spark and many others
Big Data Developer labs
- Here is the URL – https://labs.itversity.com
- Plans – $14.95 for 31 days, $34.95 for 93 days, and $54.95 for 185 days
- It comes with all the tools well integrated, and you can get started within 2 minutes
- Quick preview
Windows Environment (labs)
- PuTTY and WinSCP (to copy the data)
- Cygwin
- Setup Cygwin
- Setup SSH
- Setup password less login
- Make sure Chrome is installed
HDFS Preview
- Properties files
- /etc/hadoop/conf/core-site.xml
- /etc/hadoop/conf/hdfs-site.xml
- Important Properties (a quick spark-shell check is sketched after this list)
- fs.defaultFS
- dfs.blocksize
- dfs.replication
- HDFS commands
- Copying files
- From local file system (hadoop fs -copyFromLocal or -put)
- To local file system (hadoop fs -copyToLocal or -get)
- From one HDFS location to other (hadoop fs -cp)
- Listing files (hadoop fs -ls)
- Previewing data from files (hadoop fs -tail or -cat)
- Checking sizes of the files (hadoop fs -du)
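The important properties listed above (fs.defaultFS, dfs.blocksize, dfs.replication) can also be checked from spark-shell. This is a minimal sketch, assuming spark-shell picks up the client configuration from /etc/hadoop/conf; the values in the comments are only examples.

    // sc is the SparkContext pre-created by spark-shell
    sc.hadoopConfiguration.get("fs.defaultFS")    // e.g. hdfs://quickstart.cloudera:8020
    sc.hadoopConfiguration.get("dfs.blocksize")   // e.g. 134217728 (128 MB)
    sc.hadoopConfiguration.get("dfs.replication") // e.g. 3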
YARN Preview
- In the certification exam, Spark typically runs in YARN mode
- We should be able to check the memory configuration to understand the cluster capacity
- /etc/hadoop/conf/yarn-site.xml
- /etc/spark/conf/spark-env.sh
- Spark default settings
- Number of executors – 2
- Memory per executor – 1 GB
- Quite often we under-utilize resources. By understanding the memory settings thoroughly and mapping them to the size of the data we are trying to process, we can accelerate the execution of our jobs (a sketch of overriding the defaults follows)
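On the exam the defaults are usually overridden through spark-shell or spark-submit flags such as --num-executors and --executor-memory. The Scala equivalent for a standalone application, setting the same standard Spark-on-YARN properties through SparkConf, looks roughly like the sketch below; the application name, master string, and values are placeholders to adjust for your cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().
      setAppName("Capacity check").             // placeholder name
      setMaster("yarn-client").                 // assumes a YARN client-mode launch
      set("spark.executor.instances", "4").     // default is 2
      set("spark.executor.memory", "2g")        // default is 1 GB
    val sc = new SparkContext(conf)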
Data Sets
- Go to https://github.com/dgadiraju/data
- Clone or download it onto the virtual machine created using the Cloudera Quickstart VM or Hortonworks Sandbox
- You can set it up locally for practicing Spark, but it is highly recommended to use HDFS, which comes out of the box with the Cloudera Quickstart VM, Hortonworks Sandbox, or our labs
- On the labs, the data sets are already available (a quick Scala read of the orders table is sketched after this list)
- retail_db
- Master tables
- customers
- products
- categories
- departments
- Transaction tables
- orders
- order_items
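As a quick sanity check after copying the data, the orders table can be read from spark-shell. This is a minimal sketch, assuming retail_db was copied to HDFS under /user/your_user/retail_db (adjust the path to wherever you placed it, or locate it first with hadoop fs -ls) and that orders is a comma-delimited file with order_status as the fourth field.

    // Count orders by status from the retail_db orders table
    val orders = sc.textFile("/user/your_user/retail_db/orders")
    val ordersByStatus = orders.
      map(order => (order.split(",")(3), 1)).
      reduceByKey(_ + _)
    ordersByStatus.collect().foreach(println)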