The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework for the distributed storage and processing of large data sets across clusters of computers, using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
This course is a brief, general introduction to Apache Hadoop, covering the principles outlined below.
The problem space and example applications
Why don’t traditional approaches scale?
Requirements
Hadoop Background
Hadoop History
The ecosystem and stack: HDFS, MapReduce, Hive, Pig…
Cluster architecture overview
Development Environment
Hadoop distribution and basic commands
Eclipse development
HDFS Introduction
The HDFS command line and web interfaces
The HDFS Java API (lab)
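As a taste of the HDFS Java API lab, here is a minimal sketch that opens a file in HDFS and prints it, the Java counterpart of the command-line hadoop fs -cat. It assumes a Hadoop client configuration (core-site.xml) on the classpath; the path /user/student/sample.txt is a placeholder, not part of the course materials.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
            FileSystem fs = FileSystem.get(conf);       // client handle to the distributed filesystem

            Path path = new Path("/user/student/sample.txt");  // placeholder path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }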
MapReduce Introduction
Key philosophy: move computation, not data
Core concepts: Mappers, reducers, drivers
The MapReduce Java API (lab)
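To make mappers, reducers, and drivers concrete before the lab, here is the canonical word-count job against the org.apache.hadoop.mapreduce API. The computation ships to the data as a jar, per the philosophy above; input and output paths come from the command line. (Job.getInstance is the Hadoop 2.x factory method; older releases used new Job(conf, name).)

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in its input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: receives (word, [1, 1, ...]) after the shuffle and sums the counts.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures the job and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }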
Real-World MapReduce
Optimizing with Combiners and Partitioners (lab; see the sketch below)
More common algorithms: sorting, indexing and searching (lab)
Testing with MRUnit
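A sketch for the combiner and partitioner lab, under stated assumptions. The word-count reducer above can double as a combiner because integer addition is associative and commutative, so partial sums computed map-side shrink the shuffle without changing the result. FirstLetterPartitioner is a hypothetical illustration of key routing, not a course artifact.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: routes words by first letter so each reducer's
    // output file covers a roughly alphabetical slice of the key space.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;
            }
            int first = Character.toLowerCase(key.toString().charAt(0));
            return first % numPartitions;  // chars are non-negative, so this stays in range
        }
    }

    // Wiring in the word-count driver:
    //   job.setCombinerClass(IntSumReducer.class);              // runs map-side
    //   job.setPartitionerClass(FirstLetterPartitioner.class);

MRUnit pairs naturally with this lab: its MapDriver and ReduceDriver classes let you assert on a single mapper's or reducer's output without standing up a cluster.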
Higher-level Tools
Patterns to abstract “thinking in MapReduce”
The Cascading library (practical; see the sketch below)
The Hive database (practical)
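To show what the higher-level tools buy you, here is word count once more, this time as a Cascading pipe assembly instead of explicit mappers and reducers. This sketch follows the style of the Cascading 1.x user-guide example; in Cascading 2.x several of these classes moved packages (for example, Hfs to cascading.tap.hadoop), so treat the imports as assumptions to check against your version.

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class CascadingWordCount {
        public static void main(String[] args) {
            // Source and sink taps over HDFS paths supplied on the command line.
            Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
            Tap sink = new Hfs(new TextLine(new Fields("word", "count")), args[1], SinkMode.REPLACE);

            // The pipe assembly: split lines into words, group by word, count each group.
            Pipe assembly = new Pipe("wordcount");
            assembly = new Each(assembly, new Fields("line"),
                    new RegexGenerator(new Fields("word"), "\\w+"));
            assembly = new GroupBy(assembly, new Fields("word"));
            assembly = new Every(assembly, new Count(new Fields("count")));

            // Cascading plans the assembly into one or more MapReduce jobs and runs them.
            Properties properties = new Properties();
            FlowConnector.setApplicationJarClass(properties, CascadingWordCount.class);
            Flow flow = new FlowConnector(properties).connect("wordcount", source, sink, assembly);
            flow.complete();
        }
    }

The payoff of the pattern: you reason about named fields flowing through pipes, and the planner handles the translation to MapReduce.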
Try a free 30-minute webinar. Choose your own topic from our course list.