amazon emr | Maxim Galushka

I decided to try hadoop for some huge files processing.

Basically, I’m doing some testing for one of the kaggle problems and needed to process 2-8G files in some way which requires a lot of CPU power.

I decided to try Amazon EMR with their pre-configured hadoop machines.

EMR is actually very good, but I have found for myself to have 1 special cluster running all the time for tests – to check jobs before submitting large files to big clusters to save time on testing on a small inputs beforehand.

Discovered that Hive is not probably the best choice for you if you have a lot of logic or very complex queries to run.

For myself I’m using custom jar clusters only.

How do I make a test before submitting job to big cluster? Connect to master machine and run:

hadoop jar myjar.jar input-files-from-s3

How to check what is the status of jobs you are running?

1. Look at monitoring status on Amazon screens

2. Portforward to Hadoop web interface and look there – recommended way:

ssh -i your-ssh-key.pem -L <br />9100:amazon-public-ip-for-master-node:9100 <br />hadoop@amazon-public-ip-for-master-node

And then – just open http://localhost:9100 in browser to see hadoop web-console.

Maxim Galushka

Tag Archives: amazon emr

Hadoop, Unix and lots of command line…