I decided to try Hadoop for processing some huge files.

Basically, I’m testing approaches to one of the Kaggle problems and need to process 2-8 GB files in a way that requires a lot of CPU power.

I went with Amazon EMR and its pre-configured Hadoop machines.

EMR is actually very good, but I have found it useful to keep one small cluster running all the time for tests. I check each job there on small inputs before submitting large files to the big clusters, which saves a lot of time.

I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.

Personally, I use custom JAR clusters only.

How do I test a job before submitting it to a big cluster? Connect to the master machine and run:

hadoop jar myjar.jar input-files-from-s3
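To have a small input for this kind of test run, one option is to cut a sample from the big file first. The following is only a sketch, assuming line-oriented input (CSV, for example); `make_sample` and the file names are hypothetical and not part of EMR or Hadoop. Note that it takes the first lines of the file, not a random sample.

```shell
# make_sample is a hypothetical helper: copy the first N lines of a large,
# line-oriented file into a small file that is quick to test with.
# Arguments: source file, destination file, number of lines (default 1000).
make_sample() {
    src=$1
    dst=$2
    lines=${3:-1000}
    # head keeps the first line too, so a CSV header survives in the sample.
    head -n "$lines" "$src" > "$dst"
}
```

For example, `make_sample train.csv train-sample.csv 1000` would write the first 1000 lines of `train.csv` to `train-sample.csv`, which can then be uploaded to S3 with whatever tool you already use and passed to the job in place of the full input.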

How do you check the status of the jobs you are running?

1. Look at the monitoring status on the Amazon EMR screens

[Screenshot: Amazon EMR monitoring]

2. Port-forward to the Hadoop web interface and look there (the recommended way):

ssh -i your-ssh-key.pem -L 9100:amazon-public-ip-for-master-node:9100 hadoop@amazon-public-ip-for-master-node

Then just open http://localhost:9100 in a browser to see the Hadoop web console.
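If you open this tunnel often, the command can be wrapped in a small function. This is a sketch rather than anything EMR provides: `open_tunnel` is a hypothetical name, the port defaults to the 9100 used above, and the `-N` flag (forward ports only, do not start a remote shell) is my addition to the original command.

```shell
# open_tunnel is a hypothetical wrapper around the port-forwarding command.
# Arguments: path to the SSH key, public address of the master node,
# and optionally the port (defaults to 9100, as in the command above).
open_tunnel() {
    key=$1
    master=$2
    port=${3:-9100}
    # -L forwards local $port to $port on the master node;
    # -N keeps the session open without running a remote command.
    ssh -i "$key" -N -L "$port:$master:$port" "hadoop@$master"
}
```

Calling `open_tunnel your-ssh-key.pem amazon-public-ip-for-master-node` keeps the tunnel open until you press Ctrl-C.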

  1. The principal feature of Hadoop is data locality: the ability to send code to the data nodes and process the data locally instead of transferring terabytes of data to a central app server. In your setting, with only 2-8 GB of memory and CPU-intensive tasks, Hadoop is unlikely to give good performance. A much better approach would be to use Apache Spark [1] and make sure the data is processed in parallel (many map-like and few reduce-like operations).

    [1]: https://spark.apache.org/
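The "many map-like, few reduce-like" shape can be illustrated without any cluster at all. The sketch below is plain shell, not Spark: it assumes the input has already been split into chunk files, runs a map-like step on each chunk in parallel with `xargs -P` (available in GNU and BSD `xargs`), and finishes with a single reduce-like step. `parallel_line_count` is a hypothetical name, and counting lines stands in for the real CPU-heavy work.

```shell
# parallel_line_count is a hypothetical helper showing the pattern:
#   map    = count the lines of each chunk file, several chunks at a time
#   reduce = sum the per-chunk counts into one total
# Arguments: number of parallel workers, then the chunk files.
parallel_line_count() {
    workers=$1
    shift
    printf '%s\n' "$@" \
        | xargs -P "$workers" -I{} sh -c 'wc -l < "$1"' _ {} \
        | awk '{ total += $1 } END { print total + 0 }'
}
```

For example, `parallel_line_count 4 chunk-*.txt` would count the chunks with up to four workers at once and print the combined total.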

