I decided to try Hadoop for processing some huge files.
Basically, I'm doing some testing for one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.
I decided to try Amazon EMR with its pre-configured Hadoop machines.
EMR is actually very good, but I have found it useful to keep one special cluster running all the time for tests – checking jobs on small inputs beforehand saves time before submitting large files to big clusters.
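One cheap way to build such a small test input (a sketch with hypothetical file names, not something from the post) is to slice the first few thousand lines off the big file:

```python
def sample_head(src_path, dst_path, n=10000):
    """Copy the first n lines of src_path to dst_path to make a small test input."""
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if written >= n:
                break
            dst.write(line)
            written += 1
    return written

# Hypothetical usage: sample_head("big-input.txt", "test-input.txt", 5000)
```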
I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.
For myself, I'm using custom JAR clusters only.
How do I test a job before submitting it to the big cluster? Connect to the master machine and run:
hadoop jar myjar.jar input-files-from-s3
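The post uses custom Java JARs, but the kind of smoke test it describes is easy to illustrate with a Hadoop-Streaming-style word count sketch in Python (illustrative code, not the author's job): a mapper emits tab-separated key/value pairs and a reducer sums them over sorted input, exactly the contract Hadoop enforces between the map and reduce phases.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair per word -- the Hadoop Streaming convention."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(pairs):
    """Sum counts per word; input must be sorted by key, as Hadoop guarantees."""
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

Locally, piping a sample file through `mapper | sort | reducer` mimics the map, shuffle, and reduce phases, which is exactly the kind of cheap check you want before paying for a big cluster.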
How do you check the status of the jobs you are running?
1. Look at the monitoring status in the Amazon console
2. Port-forward to the Hadoop web interface and look there – the recommended way:
ssh -i your-ssh-key.pem \
    -L 9100:amazon-public-ip-for-master-node:9100 \
    hadoop@amazon-public-ip-for-master-node
Then just open http://localhost:9100 in a browser to see the Hadoop web console.
1 thought on “Hadoop, Unix and lots of command line…”
The principal feature of Hadoop is data locality – the ability to send code to data nodes and process the data locally instead of transferring terabytes of data to a central app server. In your setting, with only 2-8 GB of data and CPU-intensive tasks, Hadoop is unlikely to give good performance. A much better approach would be to use Apache Spark and make sure data is processed in parallel (many map-like and few reduce-like operations).
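The commenter's "many map-like, few reduce-like operations" pattern can be sketched in plain Python with a process pool (a stand-in for Spark's transformations, not actual Spark code; function names are made up for illustration):

```python
from multiprocessing import Pool
from functools import reduce

def cpu_heavy(x):
    """Stand-in for a CPU-intensive per-record transformation (the map side)."""
    return x * x

def combine(a, b):
    """Cheap associative combiner (the reduce side)."""
    return a + b

def parallel_sum_of_squares(data, workers=4):
    with Pool(workers) as pool:
        mapped = pool.map(cpu_heavy, data)   # many map-like operations, run in parallel
    return reduce(combine, mapped, 0)        # one final, cheap reduce-like operation
```

The point of the shape is that the expensive work parallelizes freely across cores (or, in Spark, across a cluster), while the reduce step stays small and associative so it can be combined in any order.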