Recently, I decided that I need to begin a new learning exercise. This time around, I’ve chosen Apache Hadoop, partly because of the weird name, and partly because of how it is defined. To add clarity, here’s the official definition from the Apache site:
What Is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
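As far as I can tell, the “simple programming model” in that definition is MapReduce. To get a feel for the idea before touching a real cluster, here’s a minimal sketch in plain Python. It is not Hadoop’s API and runs in a single process; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, made up purely for illustration. It just counts words by mapping each line to (word, 1) pairs, grouping the pairs by word, and summing each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the step a framework runs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values collected for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence (hypothetical helper, single process only)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data everywhere"]))
```

On a real cluster, the map and reduce steps would presumably run on many machines at once, with the framework handling the grouping and any failed nodes. This sketch only shows the shape of the model.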
I’ve actually heard of some of the subcomponents before (such as Cassandra), and I love the fact that the entire platform is designed for parallelism and distribution. To get started, I decided to choose a couple of the main subcomponents to focus on – below:
What I’ll do is post a new article each time I’ve spent some time investigating the nuts and bolts of one of the subcomponents. That way, you can follow my notes and pick up the salient points as I go.
For a full list of the Hadoop platform’s subcomponents, check out the Apache site – there are quite a number that are worth looking at. I’m a little unsure what the hardware requirements might be if you wanted to play with this locally, but I’m sure you’d be able to throw something onto a cloud infrastructure for some fun at low cost.
In the meantime, here are the “getting started” pages for the components I’ve selected:
- Single Node Setup: http://hadoop.apache.org/common/docs/current/single_node_setup.html
- Cluster Setup: http://hadoop.apache.org/common/docs/stable/cluster_setup.html
- Cassandra Setup: http://wiki.apache.org/cassandra/GettingStarted
This really looks quite interesting.
Here’s an architectural diagram of HDFS:
I’ve categorized this as ‘Cloud Computing’ but it’s really distributed computing. Forgive me. Now I’m really starting to look forward to seeing what this can do. Check back soon.
“Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.”
I’ll set up on Windows Server 2008 or Windows 7, but production systems are obviously GNU/Linux based. Later I might dabble in re-installing Gentoo when my new home office is set up in a couple of weeks.