Tuesday, September 9, 2008

Installing Hadoop

These are my notes, written while following the installation documentation on Hadoop's webpage:
http://hadoop.apache.org/core/docs/current/quickstart.html
I installed as root, but I'm not sure that's necessary.

Step 0. You need ssh, rsync, and a Java VM on your machine. I used:
1) ssh OpenSSH_4.3p2, OpenSSL 0.9.8b 04
2) rsync version 2.6.8
3) java 1.5.0_12
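You can check what is already installed with the usual version flags:

$ ssh -V
$ rsync --version
$ java -version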

Step 1. Download software from a Hadoop distribution site.
http://hadoop.apache.org/core/releases.html

Step 2. Untar the file.
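For example, assuming the 0.18.0 tarball (substitute whatever release you actually downloaded):

$ tar xzf hadoop-0.18.0.tar.gz
$ cd hadoop-0.18.0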

Step 3. Set JAVA_HOME in your_hadoop_dir/conf/hadoop-env.sh.
*note: I had JAVA_HOME defined in my .bashrc file, but I had to specify it again in hadoop-env.sh.
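The setting is just a shell export; the path below is an example for a 1.5.0_12 JDK and will differ on your machine:

# in conf/hadoop-env.sh -- point this at your own JDK install
export JAVA_HOME=/usr/java/jdk1.5.0_12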

Step 4. Now you can run the standalone operation as is. The example job greps the copied config files for matches of the regular expression 'dfs[a-z.]+' and writes the match counts to the output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

Step 5. For the Pseudo-Distributed Operation, which runs each Hadoop daemon as a separate Java process on a single node so that it imitates a real distributed file system, you have to set up the configuration in conf/hadoop-site.xml.
The 'name' elements are defined by the Hadoop system, so you can just use the names from the example on the Hadoop page. I changed the value of fs.default.name to hdfs://localhost:54310, and that of mapred.job.tracker to localhost:54311.
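With those values, conf/hadoop-site.xml comes out something like the sketch below; the dfs.replication property is an extra that the quickstart's example also sets, so HDFS doesn't try to replicate blocks to datanodes that don't exist:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>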

Step 6. Check ssh localhost.
In my case, I could not connect to localhost, though I could connect to my numerical IP address. I changed my /etc/hosts.allow to include ALL:127.0.0.1, and it started to recognize localhost.
If it asks for your passphrase, generate a passphraseless key:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
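If sshd still prompts for a password after this, overly loose permissions on ~/.ssh are a common culprit; tightening them is harmless:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys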

If you still prefer to use a passphrase, it will cause problems when starting the daemons.

Step 7. Format the namenode and start the daemons

$ bin/hadoop namenode -format

$ bin/start-all.sh
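If the JDK's jps tool is on your path, it is a quick way to confirm the daemons actually came up; on a working pseudo-distributed node it should list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:

$ jps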

Now you can check your namenode at http://localhost:50070/

Your job tracker is also available at http://localhost:50030/

Step 8. Test functions

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input

Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
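Other fs subcommands follow the same pattern; for example, listing a directory or removing the output directory before re-running a job (Hadoop refuses to overwrite an existing output directory):

$ bin/hadoop fs -ls input
$ bin/hadoop fs -rmr output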

Step 9. Stop the daemons

$ bin/stop-all.sh
