Tag Archives: hadoop

Hadoop, Unix and lots of command line…

I decided to try hadoop for some huge files processing.

Basically, I’m doing some testing for one of the kaggle problems and needed to process 2-8G files in some way which requires a lot of CPU power.

I decided to try Amazon EMR with their pre-configured hadoop machines.

EMR is actually very good, but I have found for myself to have 1 special cluster  running all the time for tests – to check jobs before submitting large files to big clusters to save time on testing on a small inputs beforehand.

Discovered that Hive is not probably the best choice for you if you have  a lot of logic or very complex queries to run.

For myself I’m using custom jar clusters only.

How do I make a test before submitting job to big cluster? Connect to master machine and run:

hadoop jar myjar.jar input-files-from-s3

How to check what is the status of jobs you are running?

1. Look at monitoring status on Amazon screens

Amazon EMR monitoring

2.  Portforward to Hadoop web interface and look there – recommended way:

ssh -i your-ssh-key.pem -L <br />9100:amazon-public-ip-for-master-node:9100 <br />hadoop@amazon-public-ip-for-master-node

And then – just open http://localhost:9100 in browser to see hadoop web-console.