I decided to try Hadoop for processing some huge files.
Basically, I'm experimenting with one of the Kaggle problems and needed to process 2-8 GB files in a way that requires a lot of CPU power.
I decided to try Amazon EMR with its pre-configured Hadoop machines.
EMR is actually very good, but I've found it worthwhile to keep one dedicated cluster running all the time for tests: checking a job on small inputs there before submitting the large files to a big cluster saves a lot of time.
I also discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.
For myself, I'm using custom JAR clusters only.
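
With the custom-JAR route, the job itself is plain Hadoop MapReduce code. Here is a minimal sketch, assuming the classic word-count shape; MyJob, MyMapper, and MyReducer are illustrative names, not from any particular project:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {

  // Emits (token, 1) for every whitespace-separated token of the input line.
  public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts emitted for each token.
  public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "my-job");
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyReducer.class); // same logic as the reducer, cuts shuffle traffic
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3:// path on EMR
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Build this into myjar.jar and the same artifact runs unchanged on the small test cluster and the big one; reusing the reducer as a combiner is worth it on multi-gigabyte inputs.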
How do I test a job before submitting it to the big cluster? Connect to the master machine and run:
hadoop jar myjar.jar input-files-from-s3
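For a full run, the command usually also names the main class and an output location. A hedged example matching the sketch above (the bucket paths are made up):

hadoop jar myjar.jar MyJob s3://my-bucket/input s3://my-bucket/output

If the JAR's manifest declares a main class, you can leave the class name off the command line.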
How do you check the status of the jobs you are running?
1. Look at the monitoring status in the Amazon console.

2. Port-forward to the Hadoop web interface and look there (the recommended way):
ssh -i your-ssh-key.pem -L 9100:amazon-public-ip-for-master-node:9100 hadoop@amazon-public-ip-for-master-node
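Here -L 9100:amazon-public-ip-for-master-node:9100 forwards local port 9100 through the SSH connection to port 9100 on the master node, so you don't need to open that port in the cluster's security group.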
Then just open http://localhost:9100 in a browser to see the Hadoop web console.
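(On the Hadoop 1.x AMIs EMR runs, port 9100 is the JobTracker UI; if memory serves, the NameNode UI sits on 9101 and can be tunneled the same way.)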