How to configure Presto/Hive/HDFS on Mac

It is quite a pain to set everything up, so here are the tricks that helped me significantly.

Use Java 1.7 with the newest Hadoop/HDFS and Hive 2.0.0.

To create the metastore, go to $HIVE_HOME/bin and run:

schematool -initSchema -dbType derby

Derby is an embedded Java database. In Hive's default embedded mode it accepts only one connection at a time, so it will not allow you to run the Hive metastore service (required for Presto) and the Hive CLI simultaneously; consider using MySQL for the metastore instead.
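If you go the MySQL route, the metastore connection is configured in $HIVE_HOME/conf/hive-site.xml. A minimal sketch, assuming a local MySQL server and a metastore database and user you created yourself (host, database name and credentials below are placeholders), with the MySQL JDBC driver jar dropped into $HIVE_HOME/lib:

<configuration>
  <!-- JDBC connection to the MySQL database backing the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>

Then re-run schematool with -dbType mysql instead of derby.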

Then install Presto by going through the instructions on prestodb.io.
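Presto talks to Hive through a catalog file. A minimal sketch of etc/catalog/hive.properties under the Presto install, assuming the metastore runs on its default Thrift port 9083:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083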

So to use Presto, you need to shut down the Hive CLI and start the metastore service from the same directory where Derby was set up with schematool (embedded Derby creates its metastore_db folder in the current working directory). To start the metastore:

hive --service metastore

To check which components of Hive/HDFS are running on the machine, run:

jps
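Illustrative output (the PIDs are made up and your set of daemons will differ; the Hive metastore shows up under the generic RunJar name):

11201 NameNode
11302 DataNode
11410 SecondaryNameNode
11785 RunJar
11903 PrestoServer
11950 Jps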

To start a datanode:

hdfs datanode
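The namenode can be started in the foreground the same way:

hdfs namenode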

Create two aliases in ~/.bashrc to start/stop Hadoop HDFS and YARN (adjust the Cellar path to your Homebrew Hadoop version):

alias hstart="/usr/local/Cellar/hadoop/2.7.1/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.1/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.7.1/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.1/sbin/stop-dfs.sh"
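Then reload the config and bring everything up:

source ~/.bashrc
hstart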

Hadoop, Unix and lots of command line…

I decided to try Hadoop for processing some huge files.

Basically, I'm doing some testing for one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.

I decided to try Amazon EMR with their pre-configured Hadoop machines.

EMR is actually very good, but I have found it useful to keep one special cluster running all the time for tests: checking jobs on small inputs there before submitting large files to the big clusters saves a lot of testing time.

I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.

For myself, I'm using custom JAR clusters only.

How do I test a job before submitting it to the big cluster? Connect to the master machine and run:

hadoop jar myjar.jar input-files-from-s3
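For context, a custom JAR is just a class with a main method that configures and submits a MapReduce job. Here is a minimal sketch, the classic word count against the org.apache.hadoop.mapreduce API, not my actual Kaggle job (class names are illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (token, 1) for every whitespace-separated token in a line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts for each token.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // safe here: summing is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3:// input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the main class recorded in the jar manifest, hadoop jar myjar.jar then only needs the job arguments, as in the command above.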

How do you check the status of the jobs you are running?

1. Look at the monitoring status in the Amazon EMR console:

[Screenshot: Amazon EMR monitoring]

2. Port-forward to the Hadoop web interface and look there (the recommended way):

ssh -i your-ssh-key.pem -L 9100:amazon-public-ip-for-master-node:9100 hadoop@amazon-public-ip-for-master-node

And then just open http://localhost:9100 in a browser to see the Hadoop web console.