Find process_id to be killed

Very often I need to kill a background job in Unix.

To do this, I need to find its process ID and pass it to

kill -9 process_id

so that the job is actually terminated.

Here is a quick way to find the process ID of a specific job:

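# grep -v grep drops the grep process itself, which would otherwise match its own filter
ps aux | grep 'my-fancy-filter-to-find-a-task' | grep -v grep |\
         awk '{print $2}'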

This prints just the process ID of the task to be killed.

Caution! Use this with care: if your grep matches a process other than the one you expect, you may kill the wrong one and get into trouble.
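To make the "check before you kill" step easier, the filter can be wrapped in a small function that only lists the matching process IDs. This is a minimal sketch, not a standard tool: pids_matching is a hypothetical name, and it assumes the usual ps aux layout with a header line and the process ID in the second column. The pattern is passed through the environment so that it does not show up in the command line of any process in the pipeline.

```shell
# pids_matching is a hypothetical helper: it reads `ps aux` output on stdin
# and prints the PID (second column) of every line containing the pattern.
pids_matching() {
  # PAT travels via the environment, so awk's own command line
  # never contains the pattern and cannot match itself.
  PAT="$1" awk 'NR > 1 && index($0, ENVIRON["PAT"]) > 0 { print $2 }'
}

# Usage (placeholder pattern):
#   ps aux | pids_matching 'my-job'            # review the list first
#   kill $(ps aux | pids_matching 'my-job')    # then kill what was listed
```

Reviewing the list before passing it to kill is the point of keeping the two steps separate.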

Watch git/mercurial branch in command prompt

Sometimes it is crucial not to commit to the wrong branch by mistake.

To help avoid this type of error, show the branch you are currently on in your prompt.

The following code works for both git and mercurial branches; put it into your ~/.bashrc file.

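# Print the current git branch as " (branch)"; prints nothing outside a repository.
function parse_git_branch () {
  git branch 2> /dev/null |
      sed -e '/^[^*]/d' -e 's/* \(.*\)/ (\1)/'
}

# Print the current mercurial branch and, if one is active, the bookmark as " (bookmark)".
function hg_branch() {
      hg branch 2> /dev/null |
           awk '{ printf "\033[37;0m\033[35;40m%s", $1 }'
      hg bookmarks 2> /dev/null |
           awk '/\*/ { printf " (%s)", $2 }'
}

# GREEN, YELLOW and NO_COLOR are assumed to be defined earlier in ~/.bashrc as color escape sequences.
PS1="$GREEN\u@\h$NO_COLOR:\w$YELLOW\$(parse_git_branch)$YELLOW\$(hg_branch)$NO_COLOR\$ "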

For mercurial it also displays the bookmark you are currently on.

This is how it looks on my command prompt now:

[Image: Command prompt with git/mercurial branch]
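To see what the sed filter in parse_git_branch does without touching a real repository, it can be fed some sample git branch output. This is only an illustration: sample_branches and current_branch_label are hypothetical helper names, and the branch names are made up.

```shell
# Made-up `git branch` output: the current branch is marked with "*".
sample_branches() {
  printf '%s\n' '  develop' '* feature/login' '  master'
}

# Same sed filter as in parse_git_branch: drop every line that does not
# start with "*", then rewrite "* name" as " (name)".
current_branch_label() {
  sed -e '/^[^*]/d' -e 's/* \(.*\)/ (\1)/'
}

sample_branches | current_branch_label   # prints " (feature/login)"
```

The leading space in the result is what separates the branch label from the working directory in the prompt.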

Hadoop, Unix and lots of command line…

I decided to try Hadoop for processing some huge files.

Basically, I’m doing some testing for one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.

I decided to try Amazon EMR with its pre-configured Hadoop machines.

EMR is actually very good, but I have found it useful to keep one special cluster running all the time for tests: checking a job there on small inputs first saves time before submitting large files to big clusters.

I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.

For myself, I use custom JAR clusters only.

How do I test a job before submitting it to a big cluster? Connect to the master machine and run:

hadoop jar myjar.jar input-files-from-s3
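One possible way to get a small input for such a test run is to cut a sample from the top of a large file. This is a sketch under assumptions, not a description of my exact setup: make_sample is a hypothetical helper, the default of 1000 lines is arbitrary, and the file names in the usage comment are placeholders.

```shell
# make_sample is a hypothetical helper: copy the first N lines of a large
# input file into a small sample file for a quick test run.
make_sample() {
  input=$1
  output=$2
  lines=${3:-1000}   # assumed default sample size
  head -n "$lines" "$input" > "$output"
}

# Usage (placeholder file names):
#   make_sample big-input.csv sample.csv 5000
```

The sample then has to be placed wherever the job reads its input from before running the test.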

How do you check the status of the jobs you are running?

1. Look at the monitoring status on the Amazon EMR screens

[Image: Amazon EMR monitoring]

2. Forward a port to the Hadoop web interface and look there (the recommended way):

ssh -i your-ssh-key.pem -L \
    9100:amazon-public-ip-for-master-node:9100 \
    hadoop@amazon-public-ip-for-master-node

Then just open http://localhost:9100 in a browser to see the Hadoop web console.
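The tunnel command above can be wrapped in a small helper so the key, host and port are easy to change. This is a sketch under assumptions: emr_tunnel_cmd is a hypothetical name, it only prints the command instead of running it, and the -N flag (a standard OpenSSH option that forwards ports without running a remote command) is an addition that the command above does not use.

```shell
# emr_tunnel_cmd is a hypothetical helper: print the ssh port-forwarding
# command for a given key file, master node address and port.
emr_tunnel_cmd() {
  key=$1
  host=$2
  port=${3:-9100}   # 9100 is the port used in the command above
  printf 'ssh -i %s -N -L %s:%s:%s hadoop@%s\n' \
    "$key" "$port" "$host" "$port" "$host"
}

# Usage (placeholder values):
#   emr_tunnel_cmd your-ssh-key.pem amazon-public-ip-for-master-node
```

Printing the command first makes it easy to review before running it by hand.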