The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework for the distributed storage and processing of large data sets across clusters of computers, using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
This course briefly introduces these principles as a general introduction to Apache Hadoop. Topics include:
The problem space and example applications
Why don’t traditional approaches scale?
The ecosystem and stack: HDFS, MapReduce, Hive, Pig…
Cluster architecture overview
Hadoop distribution and basic commands
The HDFS command line and web interfaces
The HDFS Java API (lab)
Key philosophy: move computation, not data
Core concepts: mappers, reducers, and drivers
The MapReduce Java API (lab)
Optimizing with Combiners and Partitioners (lab)
More common algorithms: sorting, indexing and searching (lab)
Testing with MRUnit
Patterns to abstract “thinking in MapReduce”
The Cascading library (practical)
The Hive database (practical)
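To give a feel for the mapper/reducer/driver concepts listed above, here is a minimal plain-Java sketch of the classic word-count example. It simulates the map, shuffle, and reduce phases in a single process; the class and method names are illustrative only and are not part of the Hadoop MapReduce API, which the labs cover in full.

```java
import java.util.*;

// Conceptual sketch of the MapReduce flow (not the Hadoop API):
// map emits (word, 1) pairs, shuffle groups them by key,
// reduce sums the counts per word.
public class WordCountSketch {

    // Mapper: split a line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group values by key (in a real cluster, the
    // framework performs this step between map and reduce).
    static Map<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        return grouped;
    }

    // Reducer: sum the counts emitted for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: wire the phases together over some input lines.
    public static void main(String[] args) {
        String[] lines = { "the quick brown fox", "the lazy dog" };
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) mapped.addAll(map(line));
        for (Map.Entry<String, List<Integer>> e
                : shuffle(mapped).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

The same three roles appear in the real Hadoop Java API as `Mapper`, `Reducer`, and a driver class that configures and submits the job; the key difference is that Hadoop runs the phases in parallel across the cluster, moving the computation to where the data blocks live.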