
Data Engineering Bootcamp


Data Engineering, by definition, is the practice of processing data for an enterprise. Over the course of this bootcamp, you will learn this essential skill and be equipped to process both streaming data and data in offline batches.

  • About Instructor
  • Job Roles
  • Data Engineering
  • Big Data ecosystem
  • Data Engineering vs. Data Science
  • Curriculum

About Instructor

Job Roles

Here are the different job roles that deal with data.

  • Data Engineer
  • BI Developer
  • Application Developer
  • DevOps Engineer

Data Engineering

Let us understand what Data Engineering is all about.

Responsibilities

Here are the high-level responsibilities of a Data Engineer.

  • Get data from different sources
  • Design Data Marts for reporting
  • Process data by applying transformation rules (see the sketch after this list)
    • Row-level transformations
    • Aggregations
    • Sorting
    • Ranking
    • And more
  • Port data back to Data Marts for reporting
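
To make these transformation types concrete, here is a minimal PySpark sketch covering a row-level transformation, an aggregation, sorting, and ranking. The orders dataset and column names are hypothetical, and the course may implement these steps with different tools.

```python
# Minimal PySpark sketch of the transformation types listed above.
# The "orders" data and column names are hypothetical examples.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 20.0), (2, "books", 35.0), (3, "toys", 15.0)],
    ["order_id", "category", "amount"],
)

# Row-level transformation: derive a tax-inclusive amount per row
with_tax = orders.withColumn("amount_with_tax", F.col("amount") * 1.08)

# Aggregation: total revenue per category
revenue = with_tax.groupBy("category").agg(
    F.sum("amount_with_tax").alias("revenue")
)

# Sorting: highest revenue first
sorted_revenue = revenue.orderBy(F.col("revenue").desc())

# Ranking: rank categories by revenue
ranked = sorted_revenue.withColumn(
    "rank", F.rank().over(Window.orderBy(F.col("revenue").desc()))
)
ranked.show()
```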

Limitations of Conventional Approach

Traditionally, Data Engineering has been termed ETL and delivered using tools like Informatica. Here are the limitations of the conventional approach.

  • Scalability is a major challenge
  • Hardware Cost
  • Licensing

The Big Data ecosystem of tools and technologies solves the problem of scalability.

Technology Stack

The technology stack of Data Engineering can sound overwhelming at first. But at a fundamental level, if you are good at Linux, SQL, and basic programming, you will excel in the field.

Big Data ecosystem

Big Data ecosystem tools and technologies solve the problem of scalability.

High-level Categories

All the technologies in this ecosystem can be categorized as follows:

  • File system
  • Data ingestion
  • Data processing
    • Batch
    • Real-time or Streaming
  • Visualization
  • Support

File System

File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming quite popular, as it can cut down operational costs significantly with a pay-as-you-go model. A minimal cloud-storage example follows the list below.

  • HDFS – Hadoop Distributed File System
  • AWS S3 – Amazon’s cloud-based storage
  • Azure Blob – Microsoft Azure’s cloud-based storage
  • NoSQL file systems
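
As a concrete illustration of cloud-based storage, here is a minimal sketch of writing an object to AWS S3 using the boto3 library. The bucket name and object key are hypothetical, and credentials are assumed to be configured in the environment.

```python
# Minimal sketch of writing to cloud storage with boto3 (AWS SDK for Python).
# Bucket and key names are hypothetical; credentials come from the environment.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-data-lake",               # hypothetical bucket
    Key="raw/events/2024-01-01.json",    # hypothetical object key
    Body=b'{"event": "click"}',
)
```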

Data Ingestion

Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web server logs.

  • Sqoop – a MapReduce-based tool to pull data in batches from relational databases into Big Data file systems
  • Flume – an agent-based technology that can poll web server logs and push the data to any sink, including Big Data technologies
  • Kafka – a queue-based technology from which data can be consumed by any downstream technology, including Big Data tools (see the sketch below)
  • There are many other tools, and at times we might have to customize them to our requirements
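
To illustrate the queue-based ingestion pattern, here is a minimal sketch using the kafka-python client library (one of several Kafka clients; the course may use different tooling). The broker address, topic name, and log line are hypothetical.

```python
# Minimal Kafka ingestion sketch using the kafka-python library.
# Broker address, topic name, and sample log line are hypothetical.
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "web_logs"

# Producer side: push raw log lines into a Kafka topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(TOPIC, b"127.0.0.1 - GET /index.html 200")
producer.flush()

# Consumer side: any downstream technology can read from the topic
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value.decode("utf-8"))
```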

Data Processing

Data processing is categorized into batch, real-time, and streaming; a short streaming example follows the list below.

  • Batch
    • MapReduce – disk I/O driven
    • Spark – memory driven
  • Real-time (real-time operations)
    • NoSQL – HBase/MongoDB/Cassandra, primarily used for operational systems
    • Ad hoc querying – Impala/Presto/Spark SQL
  • Streaming (near real-time data processing)
    • Spark Streaming
    • Flink
    • Storm
  • Examples
    • Amazon Recommendation Engine
    • LinkedIn Endorsements
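
As an illustration of the streaming category, here is a minimal word-count sketch using Spark Structured Streaming, the newer API that succeeds the DStream-based Spark Streaming listed above. The socket source and port are hypothetical; you could feed it with `nc -lk 9999`.

```python
# Minimal Spark Structured Streaming sketch (near real-time processing).
# The socket source and port are hypothetical, for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of lines from a socket (e.g., `nc -lk 9999`)
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Classic word count, but over a stream instead of a static file
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```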

Visualization

Once the data is processed, we need to visualize it using standard reporting tools or custom applications.

  • Datameer
  • d3js
  • Tableau
  • QlikView
  • and many more

Support

There are a bunch of tools that are used to support the clusters.

  • Ambari/Cloudera Manager/Ganglia – used to set up and maintain the tools
  • ZooKeeper – coordination, load balancing, and failover
  • Kerberos – security (authentication)
  • Knox/Ranger – perimeter security and fine-grained authorization

Categories and Skill Mapping

Each job role maps to a subset of the skills and technologies above.

Data Engineering vs. Data Science

Data Science and Data Engineering are two different fields.
  • Data Science can be implemented even using Excel on smaller volumes of data
  • When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
    • Ingest data from different sources
    • Process data – data cleansing, standardization, aggregations, etc.
    • Port the data to data science algorithms after processing
    • Apply data science algorithms using Big Data modules such as Mahout, Spark MLlib, etc. (see the sketch below)
  • Data Scientists should be cognizant of Data Engineering, but need not be hands-on. Data Engineers are the ones who work on the Big Data ecosystem. But in smaller organizations, the Data Scientist/Data Engineer has to master both.
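
To show what applying a data science algorithm with a Big Data module can look like, here is a minimal Spark MLlib sketch that clusters a tiny, hypothetical feature set with KMeans. In practice, the features would come out of the ingestion and processing steps above.

```python
# Minimal sketch of applying a data science algorithm with Spark MLlib.
# The feature data is a hypothetical example.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["x", "y"]
)

# Assemble the raw columns into the feature vector MLlib expects
features = VectorAssembler(
    inputCols=["x", "y"], outputCol="features"
).transform(df)

# Cluster the rows into two groups and show the assignments
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).show()
```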

Curriculum

The curriculum is designed based on the roles, responsibilities, and technologies that are relevant today.

Roles and Responsibilities

  • Environment – Linux
  • Ad hoc querying and reporting – SQL
  • Data ingestion – Sqoop, Flume, or Kafka
  • Performing ETL
    • Conventional tools such as Informatica
    • Programming languages such as Python or Scala
    • Spark – for heavy volumes of data
  • Validations – SQL or Shell Scripting (a sketch follows this list)
  • Big Data on Cloud – AWS EMR
  • Visualization – Tableau
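
As a small illustration of SQL-based validation, here is a sketch that compares row counts between a source and a target table. It uses Python's standard-library sqlite3 module as a stand-in for a real database; the table names and data are hypothetical.

```python
# Minimal sketch of a row-count validation between source and target tables.
# sqlite3 stands in for a real warehouse; tables and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER);
    CREATE TABLE target_orders (id INTEGER);
    INSERT INTO source_orders VALUES (1), (2), (3);
    INSERT INTO target_orders VALUES (1), (2), (3);
""")

# Compare row counts after a load; a mismatch signals lost or duplicated rows
src = conn.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
tgt = conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]
assert src == tgt, f"Row count mismatch: source={src}, target={tgt}"
print(f"Validation passed: {src} rows in both tables")
```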

Required Skills

We are going to cover the majority of these technologies as part of the curriculum.

  • Linux Fundamentals
  • Database Essentials
  • Basics of Programming (Python and Scala)
  • Big Data eco system tools and technologies
  • Building applications at scale
  • Data Ingestion
  • Streaming Data Pipelines
  • Visualization
  • Big Data on Cloud

Course Content

