Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Hands-on Workshop: Basics via Databricks Environment

  • RDD API Exercises
  • Basic Transformation and Action Functions
  • PairRDDs
  • Joins
  • Caching Strategies
  • DataFrame API Exercises
  • Spark SQL
  • DataFrame Operations: select, filter, group, sort
  • UDFs (User-Defined Functions)
  • Introduction to the Dataset API
  • Streaming

Hands-on Workshop: Deployment via AWS Environment

  • Overview of AWS Glue
  • Differences between Amazon EMR and AWS Glue
  • Example Jobs on Both Platforms
  • Evaluating Pros and Cons

Additional Topics:

  • Introduction to Apache Airflow Orchestration

Requirements

  • Programming proficiency (preferably in Python or Scala)
  • Foundational knowledge of SQL

Duration: 21 Hours
