I decided to try Hadoop for processing some huge files.
Basically, I'm doing some testing for one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.
I decided to try Amazon EMR with its pre-configured Hadoop machines.
EMR is actually very good, but I have found it useful to keep one special cluster running all the time for tests – checking jobs on small inputs beforehand saves time before submitting large files to big clusters.
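One cheap way to build such a small test input (a sketch with hypothetical file names, not something from the post) is to slice the first few thousand lines off the big file:

```python
def sample_head(src_path, dst_path, n=10000):
    """Copy the first n lines of src_path to dst_path to make a small test input."""
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if written >= n:
                break
            dst.write(line)
            written += 1
    return written

# Hypothetical usage: sample_head("big-input.txt", "test-input.txt", 5000)
```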
I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.
For myself, I'm using custom JAR clusters only.
How do I test a job before submitting it to the big cluster? Connect to the master machine and run:
hadoop jar myjar.jar input-files-from-s3
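The post uses custom Java JARs, but the kind of smoke test it describes is easy to illustrate with a Hadoop-Streaming-style word count sketch in Python (illustrative code, not the author's job): a mapper emits tab-separated key/value pairs and a reducer sums them over sorted input, exactly the contract Hadoop enforces between the map and reduce phases.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair per word -- the Hadoop Streaming convention."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(pairs):
    """Sum counts per word; input must be sorted by key, as Hadoop guarantees."""
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

Locally, piping a sample file through `mapper | sort | reducer` mimics the map, shuffle, and reduce phases, which is exactly the kind of cheap check you want before paying for a big cluster.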
How do you check the status of the jobs you are running?
1. Look at the monitoring status in the Amazon console
2. Port-forward to the Hadoop web interface and look there – the recommended way:
ssh -i your-ssh-key.pem \
    -L 9100:amazon-public-ip-for-master-node:9100 \
    hadoop@amazon-public-ip-for-master-node
Then just open http://localhost:9100 in a browser to see the Hadoop web console.
1 thought on “Hadoop, Unix and lots of command line…”
The principal feature of Hadoop is data locality – the ability to send code to data nodes and process the data locally instead of transferring terabytes of data to a central app server. In your setting, with only 2-8 GB of data and CPU-intensive tasks, Hadoop is unlikely to give good performance. A much better approach would be to use Apache Spark and make sure data is processed in parallel (many map-like and few reduce-like operations).
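The commenter's "many map-like, few reduce-like operations" pattern can be sketched in plain Python with a process pool (a stand-in for Spark's transformations, not actual Spark code; function names are made up for illustration):

```python
from multiprocessing import Pool
from functools import reduce

def cpu_heavy(x):
    """Stand-in for a CPU-intensive per-record transformation (the map side)."""
    return x * x

def combine(a, b):
    """Cheap associative combiner (the reduce side)."""
    return a + b

def parallel_sum_of_squares(data, workers=4):
    with Pool(workers) as pool:
        mapped = pool.map(cpu_heavy, data)   # many map-like operations, run in parallel
    return reduce(combine, mapped, 0)        # one final, cheap reduce-like operation
```

The point of the shape is that the expensive work parallelizes freely across cores (or, in Spark, across a cluster), while the reduce step stays small and associative so it can be combined in any order.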