Data Engineering is the practice of processing data for an enterprise. Over the course of this bootcamp, you will learn this essential skill and be equipped to process both streaming data and data in offline batches.
- About the Instructor
- Job Roles
- Data Engineering
- Big Data ecosystem
- Data Engineering vs. Data Science
- Curriculum
About the Instructor
- 13+ years of rich industry experience in building large-scale data-driven applications
- ITVersity, Inc. is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data, Cloud, etc.
- We provide training using the following platforms:
  - https://labs.itversity.com – low-cost Big Data lab to learn technologies
  - http://discuss.itversity.com – support while learning
  - http://www.itversity.com – website for content
  - https://youtube.com/itversityin – YouTube channel
  - https://github.com/dgadiraju – GitHub repositories
Job Roles
Here are the different job roles that deal with data.
- Data Engineer
- BI Developer
- Application Developer
- DevOps Engineer
Data Engineering
Let us understand what Data Engineering is all about.
Responsibilities
Here are the high-level responsibilities of a Data Engineer.
- Get data from different sources
- Design Data Marts for reporting
- Process data by applying transformation rules (see the sketch after this list)
  - Row-level transformations
  - Aggregations
  - Sorting
  - Ranking
  - And more
- Port the processed data back to Data Marts for reporting
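To make these responsibilities concrete, here is a minimal PySpark sketch covering each step; the file paths and column names (orders.csv, customer, revenue) are hypothetical, and Spark is just one of the engines discussed later in this course.

```python
# Minimal PySpark sketch of typical Data Engineering transformations.
# All paths and column names here are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, sum as sum_, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ETLSketch").getOrCreate()

# Get data from a source
orders = spark.read.option("header", True).csv("/data/orders.csv")

# Row-level transformation: standardize a column value on each row
cleaned = orders.withColumn("customer", upper(col("customer")))

# Aggregation: total revenue per customer
totals = cleaned.groupBy("customer") \
    .agg(sum_(col("revenue").cast("double")).alias("total_revenue"))

# Sorting
sorted_totals = totals.orderBy(col("total_revenue").desc())

# Ranking: assign a rank to each customer by revenue
ranked = sorted_totals.withColumn(
    "rank", rank().over(Window.orderBy(col("total_revenue").desc()))
)

# Port the result back to a Data Mart (written out as Parquet files here)
ranked.write.mode("overwrite").parquet("/marts/customer_revenue")
```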
Limitations of Conventional Approach
Traditionally, Data Engineering has been termed ETL and delivered using tools like Informatica. Here are the limitations of the conventional approach.
- Scalability is a major challenge
- Hardware Cost
- Licensing
The Big Data ecosystem of tools and technologies solves the problem of scalability.
Technology Stack
The technology stack of Data Engineering can sound overwhelming at first. But at a fundamental level, if you are good at Linux, SQL, and basic programming, you will excel in the field.
Big Data ecosystem
Big Data ecosystem tools and technologies solve the problem of scalability.
High-level Categories
All the technologies in the ecosystem can be grouped into these categories.
- File system
- Data ingestion
- Data processing
  - Batch
  - Real-time or Streaming
- Visualization
- Support
File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming quite popular, as it can cut down operational costs significantly with a pay-as-you-go model. A short access sketch follows the list below.
- HDFS – Hadoop Distributed File System
- AWS S3 – Amazon's cloud-based storage
- Azure Blob – Microsoft Azure's cloud-based storage
- NoSQL file systems
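Here is a minimal PySpark sketch showing that the same read API works across these storage layers; only the URI scheme changes. The bucket, account, and path names are hypothetical, and the corresponding connector jars (s3a, wasbs) must be available on the cluster.

```python
# Minimal sketch: the same DataFrame API reads from different storage layers;
# only the URI scheme changes. All bucket/account/path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageSketch").getOrCreate()

hdfs_df = spark.read.parquet("hdfs:///data/orders")                # HDFS
s3_df = spark.read.parquet("s3a://my-bucket/data/orders")          # AWS S3
blob_df = spark.read.parquet(
    "wasbs://container@myaccount.blob.core.windows.net/data/orders"  # Azure Blob
)
```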
Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
- Sqoop – a MapReduce-based tool to pull data in batches from relational databases into Big Data file systems
- Flume – an agent-based technology that can poll web server logs and push the data to any sink, one category of sink being Big Data technologies
- Kafka – a queue-based technology from which data can be consumed by any technology, including Big Data tools (see the consumer sketch after this list)
- There are many other tools, and at times we might have to customize them to meet our requirements
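As a small illustration of the consuming side, here is a sketch using the kafka-python package; the topic name (web_logs) and broker address are hypothetical.

```python
# Minimal Kafka consumer sketch using the kafka-python package.
# Topic name and broker address are hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "web_logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
)

for message in consumer:
    # Each record arrives as raw bytes; a real pipeline would parse it and
    # persist it to a Big Data file system or hand it to a stream processor.
    print(message.value.decode("utf-8"))
```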
Data Processing
Data processing is categorized into Batch, Real-time, and Streaming.
- Batch
  - MapReduce – I/O (disk) driven
  - Spark – memory driven
- Real-time (real-time operations)
  - NoSQL – HBase/MongoDB/Cassandra, primarily used for operational systems
  - Ad hoc querying – Impala/Presto/Spark SQL
- Streaming (near real-time data processing; see the sketch after this list)
  - Spark Streaming
  - Flink
  - Storm
- Examples
  - Amazon Recommendation Engine
  - LinkedIn Endorsements
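To give a feel for the streaming category, here is a minimal Spark Structured Streaming word count; the socket source and host/port are hypothetical test values (for a quick local test, `nc -lk 9999` can feed it lines).

```python
# Minimal Spark Structured Streaming sketch: word counts over a socket stream.
# The host/port are hypothetical; `nc -lk 9999` can feed it lines for testing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and maintain a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")  # emit the full updated counts each trigger
         .format("console")
         .start())
query.awaitTermination()
```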
Visualization
Once the data is processed, we need to visualize it using standard reporting tools or custom applications.
- Datameer
- d3js
- Tableau
- QlikView
- and many more
Support
There are a bunch of tools that are used to support the clusters.
- Ambari/Cloudera Manager/Ganglia – used to set up and maintain the tools
- ZooKeeper – load balancing and failover
- Kerberos – security (authentication)
- Knox/Ranger – perimeter security and authorization
Categories and Skill Mapping
[Diagram: Job Roles – Skills and Technologies]
Data Engineering vs. Data Science
- Data Science can be practiced even using Excel on smaller volumes of data
- When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to:
  - Ingest data from different sources
  - Process data – data cleansing, standardization, aggregations, etc.
  - Port the data to data science algorithms after processing
  - Apply data science algorithms using Big Data modules such as Mahout, Spark MLlib, etc. (see the sketch after this list)
- Data Scientists should be cognizant of Data Engineering, but need not be hands-on; Data Engineers are the ones who work on the Big Data ecosystem. In smaller organizations, though, one person often has to master both.
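As a small illustration of this hand-off, here is a sketch where engineered data is passed to a Spark MLlib algorithm; the feature columns, sample rows, and cluster count are hypothetical.

```python
# Minimal sketch of the Data Engineering / Data Science hand-off via Spark MLlib.
# The feature columns, sample data, and cluster count (k) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Assume Data Engineers have already ingested, cleansed, and aggregated this data
df = spark.createDataFrame(
    [(1, 120.0, 8.0), (2, 35.0, 2.0), (3, 410.0, 15.0)],
    ["customer_id", "total_spend", "visits"],
)

# Assemble the engineered columns into the feature vector MLlib expects
assembler = VectorAssembler(inputCols=["total_spend", "visits"], outputCol="features")
features = assembler.transform(df)

# The Data Scientist's algorithm runs on top of the engineered data
model = KMeans(k=2, featuresCol="features").fit(features)
model.transform(features).show()
```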
Curriculum
The curriculum is designed based on the roles, responsibilities, and technologies that are relevant today.
Roles and Responsibilities
- Environment – Linux
- Ad hoc querying and reporting – SQL
- Data ingestion – Sqoop, Flume, or Kafka
- Performing ETL
  - Conventional tools such as Informatica
  - Programming languages such as Python or Scala
  - Spark – for heavy volumes of data
- Validations – SQL or shell scripting
- Big Data on Cloud – AWS EMR
- Visualization – Tableau
Required Skills
We are going to cover the majority of these technologies as part of the curriculum.
- Linux Fundamentals
- Database Essentials
- Basics of Programming (Python and Scala)
- Big Data ecosystem tools and technologies
- Building applications at scale
- Data Ingestion
- Streaming Data Pipelines
- Visualization
- Big Data on Cloud