I decided to try Hadoop for processing some huge files.
Basically, I'm experimenting with one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.
I decided to try Amazon EMR with its pre-configured Hadoop machines.
EMR is actually very good, but I've found it worthwhile to keep one dedicated cluster running all the time for tests: checking a job on small inputs there before submitting the large files to a big cluster saves a lot of time.
I also discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.
For myself, I'm using custom JAR clusters only.
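
With the custom-JAR route, the job itself is plain Hadoop MapReduce code. Here is a minimal sketch, assuming the classic word-count shape; MyJob, MyMapper, and MyReducer are illustrative names, not from any particular project:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {

  // Emits (token, 1) for every whitespace-separated token of the input line.
  public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts emitted for each token.
  public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "my-job");
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyReducer.class); // same logic as the reducer, cuts shuffle traffic
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3:// path on EMR
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Build this into myjar.jar and the same artifact runs unchanged on the small test cluster and the big one; reusing the reducer as a combiner is worth it on multi-gigabyte inputs.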
How do I test a job before submitting it to the big cluster? Connect to the master machine and run:
hadoop jar myjar.jar input-files-from-s3
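For a full run, the command usually also names the main class and an output location. A hedged example matching the sketch above (the bucket paths are made up):

hadoop jar myjar.jar MyJob s3://my-bucket/input s3://my-bucket/output

If the JAR's manifest declares a main class, you can leave the class name off the command line.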
How do you check the status of the jobs you are running?
1. Look at the monitoring status in the Amazon console.

2. Port-forward to the Hadoop web interface and look there (the recommended way):
ssh -i your-ssh-key.pem -L 9100:amazon-public-ip-for-master-node:9100 hadoop@amazon-public-ip-for-master-node
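Here -L 9100:amazon-public-ip-for-master-node:9100 forwards local port 9100 through the SSH connection to port 9100 on the master node, so you don't need to open that port in the cluster's security group.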
Then just open http://localhost:9100 in a browser to see the Hadoop web console.
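(On the Hadoop 1.x AMIs EMR runs, port 9100 is the JobTracker UI; if memory serves, the NameNode UI sits on 9101 and can be tunneled the same way.)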