Hadoop (http://hadoop.apache.org/), an open source volunteer project under
the Apache Software Foundation, is a framework for running applications on
large clusters built of commodity hardware. It lets one easily write and run
applications that process vast amounts of data (terabytes to petabytes).
Hadoop implements a computational paradigm named Map-Reduce, where the
application is divided into many small fragments of work, each of which may
be executed or reexecuted on any node in the cluster. In addition, it
provides a distributed file system (HDFS) that stores data on the compute
nodes, providing very high aggregate bandwidth across the cluster. Yahoo! is
one of the main contributors to Hadoop and uses it extensively to manage
large clusters of machines.
I hope to engage the open-source community on Hadoop and encourage
participation in its development. I will present an overview of Hadoop and
its architecture with a focus on the Map-Reduce component. I will describe
the engineering challenges and briefly talk about how Hadoop clusters are
used in Yahoo!.