I decided to try Hadoop for processing some huge files.
Basically, I’m doing some testing for one of the Kaggle problems and needed to process 2–8 GB files in a way that requires a lot of CPU power.
I decided to try Amazon EMR with their pre-configured Hadoop machines.
EMR is actually very good, but I have found it useful to keep one special cluster running all the time for tests – checking jobs on small inputs before submitting large files to big clusters saves a lot of time.
I discovered that Hive is probably not the best choice if you have a lot of logic or very complex queries to run.
For myself, I’m using custom JAR clusters only.
How do I test a job before submitting it to a big cluster? Connect to the master machine and run:
hadoop jar myjar.jar input-files-from-s3
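Even before that, I like to sanity-check the map/reduce logic itself on a tiny local file, with no cluster involved at all. A minimal sketch in plain Python – the word-count logic and all names here are purely illustrative, not the actual Kaggle job:

```python
from collections import defaultdict

# Illustrative mapper: emit (word, 1) pairs, like a trivial word-count job.
def map_line(line):
    for word in line.split():
        yield word.lower(), 1

# Illustrative reducer: sum all the counts collected for one key.
def reduce_counts(key, values):
    return key, sum(values)

def run_local(lines):
    """Simulate the shuffle phase: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in sorted(groups.items()))

if __name__ == "__main__":
    sample = ["to be or not to be"]
    print(run_local(sample))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

If the logic survives this, packaging it into the real JAR and running it on the small test cluster is a much safer bet.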
How do you check the status of the jobs you are running?
1. Look at the monitoring graphs in the Amazon console.
2. Port-forward to the Hadoop web interface and look there – the recommended way:
ssh -i your-ssh-key.pem -L 9100:amazon-public-ip-for-master-node:9100 hadoop@amazon-public-ip-for-master-node
And then just open http://localhost:9100 in your browser to see the Hadoop web console.
This is just about using the creative part of yourself while doing otherwise boring programming work.
I’ve watched this course, which revealed to me the power of Processing:
The whole point is that you don’t need to make your program excellent from the first day of development. It is quite possible that you will not have enough time or motivation to complete it otherwise.
My idea is to use Processing for quick prototyping and then – only if the application is promising, which you can check with real users – go ahead with a more powerful solution.
First, create a mockup which may be rough or even broken, but is able to communicate the idea to the user. Only after that, fix and polish it.
That is what Processing was created for – to test, experiment and create quickly, without much hassle or heavyweight development tools.
I spent last weekend at #douhack (in Donetsk), where I was creating a program to count the number of people walking along the street in front of a web camera.
This turned out not to be such a simple task. To detect moving objects I used the simple technique of background subtraction: the current frame captured from the camera is subtracted pixel-by-pixel from the previous one, revealing the regions that changed from one frame to the next.
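The frame-differencing step can be sketched in a few lines of NumPy, treating grayscale frames as 2-D arrays. This is a simplified illustration, not the code from the project, and the threshold value is arbitrary – in practice it has to be tuned for the camera and the lighting:

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=30):
    """Background subtraction by frame differencing.

    Subtract the previous grayscale frame from the current one
    pixel-by-pixel; pixels whose absolute difference exceeds the
    threshold are marked as moving.
    """
    # Cast to a signed type first so the subtraction doesn't wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)  # 1 = motion, 0 = background

# Toy example: a bright 2x2 "person" appears in the second frame.
prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1:3, 1:3] = 200
print(motion_mask(prev, curr).sum())  # 4 moving pixels
```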
A more advanced algorithm is described in the documentation (see the referenced works).
For tracking a moving person I tried a few techniques; the CamShift algorithm didn’t really help. The reason is that the algorithm doesn’t have enough “memory” to keep tracking objects that disappear behind other objects on the street. So I did a hack: I linearize the person’s movement to estimate where the moving object will reappear.
Here is a demo of how it works (pretty lame anyway):
And a GitHub link with the sources:
I strongly recommend this book to understand the basics of OpenCV and object tracking (if the PDF is not available – give me a shout and I will update the link).
Big special thanks to Mateusz Stankiewicz for his blog post regarding the topic.