Course Outline
Introduction
- Hadoop history and core concepts
- The Hadoop ecosystem
- Distributions
- High-level architecture
- Common Hadoop myths
- Hadoop challenges (hardware and software)
- Labs: Discussing Big Data projects and associated problems
Planning and installation
- Selecting software and Hadoop distributions
- Cluster sizing and planning for future growth
- Selecting hardware and network infrastructure
- Rack topology
- Installation procedures
- Multi-tenancy
- Directory structures and log management
- Benchmarking
- Labs: Cluster installation and running performance benchmarks
HDFS operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring
- Administration via command-line and browser interfaces
- Expanding storage and replacing faulty drives
- Labs: Getting familiar with the HDFS command line
Data ingestion
- Using Flume to ingest logs and other data into HDFS
- Using Sqoop to import data from SQL databases into HDFS and export it back to SQL
- Hadoop data warehousing with Hive
- Transferring data between clusters (distcp)
- Leveraging S3 as a complement to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and using Flume and Sqoop
MapReduce operations and administration
- Parallel computing prior to MapReduce: comparing HPC with Hadoop administration
- MapReduce cluster load management
- Nodes and daemons (JobTracker, TaskTracker)
- Walk-through of the MapReduce UI
- MapReduce configuration
- Job configuration
- Optimizing MapReduce performance
- Ensuring robustness in MR: guidance for developers
- Labs: Executing MapReduce examples
YARN: a new architecture and enhanced capabilities
- YARN design objectives and implementation architecture
- New actors: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling under YARN
- Labs: Investigating job scheduling mechanisms
Advanced topics
- Hardware monitoring
- Cluster-wide monitoring
- Adding and removing servers, upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop high availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring systems
Optional tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Cloudera distribution environment (CDH5).
- Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
Requirements
- Proficiency in basic Linux system administration
- Fundamental scripting capabilities
Prior knowledge of Hadoop or distributed computing is not required, as these topics will be introduced and explained during the course.
Lab Environment
Zero Installation Required: Students are not required to install Hadoop software on their personal devices. A fully operational Hadoop cluster will be provided for use.
Participants will need to have the following tools available:
- An SSH client (Linux and Mac systems include this by default; for Windows users, PuTTY is recommended)
- A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.
21 Hours
Testimonials (1)
Hands-on exercises. The class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from already working with NiFi.