Introduction to Big Data
Beyond the Hype, Big Data Skills and Sources of Big Data, Big Data
Adoption, Research and Changing Nature of Data Repositories, Data
Sharing and Reuse Practices and Their Implications for Repository Data
Curation.
Hadoop
Introduction to Big Data programming with Hadoop, The ecosystem and
stack, The Hadoop Distributed File System (HDFS), Components of
Hadoop, Design of HDFS, Java interfaces to HDFS, Architecture
overview, Development Environment, Hadoop distribution and basic
commands, Eclipse development, The HDFS command line and web
interfaces, The HDFS Java API (lab), Analyzing the Data with Hadoop,
Scaling Out, Hadoop event stream processing, complex event
processing, MapReduce Introduction, Developing a MapReduce
Application, How MapReduce Works, Anatomy of a MapReduce Job Run,
Failures, Job Scheduling, Shuffle and Sort, Task Execution,
MapReduce Types and Formats, MapReduce Features, Real-World
MapReduce.
Hadoop Environment
Setting up a Hadoop Cluster, Cluster specification, Cluster Setup
and Installation, Hadoop Configuration, Security in Hadoop,
Administering Hadoop, HDFS – Monitoring & Maintenance, Hadoop
benchmarks.
Apache Airflow
Introduction to Data warehousing and Data lakes, Designing Data
warehousing for an ETL Data Pipeline, Designing Data Lakes for ETL
Data Pipeline, ETL vs ELT.
Introduction to HIVE
Programming with Hive: Data warehouse system for Hadoop, Optimizing
with Combiners and Partitioners (lab), Bucketing, more common
algorithms: sorting, indexing and searching (lab), Relational
manipulation: map-side and reduce-side joins (lab), evolution,
purpose and use, Case Studies on Ingestion and warehousing.
HBase
Overview, comparison and architecture, Java client API, CRUD
operations and security.
Apache Spark APIs for large-scale data processing
Overview, Linking with Spark,
Initializing Spark, Resilient Distributed Datasets (RDDs), External
Datasets, RDD Operations, Passing Functions to Spark, Job
optimization, Working with Key-Value Pairs, Shuffle operations, RDD
Persistence, Removing Data, Shared Variables, EDA using PySpark,
Deploying to a Cluster, Spark Streaming, Spark MLlib and ML APIs,
Spark DataFrames/Spark SQL, Integration of Spark and Kafka, Setting
up Kafka Producer and Consumer, Kafka Connect API, MapReduce,
Connecting databases with Spark.