Course Outline

  • Introduction
    • Hadoop history and core concepts
    • The Hadoop ecosystem
    • Distributions
    • High-level architecture
    • Common Hadoop myths
    • Hadoop challenges (hardware and software)
    • Labs: Discussing Big Data projects and associated problems
  • Planning and installation
    • Selecting software and Hadoop distributions
    • Cluster sizing and planning for future growth
    • Selecting hardware and network infrastructure
    • Rack topology
    • Installation procedures
    • Multi-tenancy
    • Directory structures and log management
    • Benchmarking
    • Labs: Cluster installation and running performance benchmarks
  • HDFS operations
    • Core concepts (horizontal scaling, replication, data locality, rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring
    • Administration via command-line and browser interfaces
    • Expanding storage and replacing faulty drives
    • Labs: Getting familiar with the HDFS command-line tools
  • Data ingestion
    • Using Flume for log and other data ingestion into HDFS
    • Using Sqoop to import data from SQL databases into HDFS and export it back to SQL
    • Hadoop data warehousing with Hive
    • Transferring data between clusters (distcp)
    • Leveraging S3 as a complement to HDFS
    • Best practices and architectures for data ingestion
    • Labs: Setting up and using Flume and Sqoop
  • MapReduce operations and administration
    • Parallel computing prior to MapReduce: comparing HPC with Hadoop administration
    • MapReduce cluster load management
    • Nodes and daemons (JobTracker, TaskTracker)
    • Walk-through of the MapReduce UI
    • MapReduce configuration
    • Job configuration
    • Optimizing MapReduce performance
    • Ensuring robustness in MR: guidance for developers
    • Labs: Executing MapReduce examples
  • YARN: a new architecture and enhanced capabilities
    • YARN design objectives and implementation architecture
    • New actors: ResourceManager, NodeManager, Application Master
    • Installing YARN
    • Job scheduling under YARN
    • Labs: Investigating job scheduling mechanisms
  • Advanced topics
    • Hardware monitoring
    • Cluster-wide monitoring
    • Adding and removing servers, upgrading Hadoop
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Setting up monitoring systems
  • Optional tracks
    • Cloudera Manager: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are conducted in the Cloudera distribution environment (CDH5).
    • Ambari: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are conducted with the Ambari cluster manager and the Hortonworks Data Platform (HDP 2.0).

Requirements

  • Proficiency in basic Linux system administration
  • Fundamental scripting capabilities

Prior knowledge of Hadoop or distributed computing is not required; these topics are introduced and explained during the course.

Lab Environment

Zero Installation Required: Students are not required to install Hadoop software on their personal devices. A fully operational Hadoop cluster will be provided for use.

Participants will need to have the following tools available:

  • An SSH client (Linux and Mac systems include this by default; for Windows users, PuTTY is recommended)
  • A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.

Duration: 21 hours