What to do with your cluster?
I have been downloading data from the UK Met Office for years using a simple script. Each file holds the observations from all UK weather stations for the previous 24 hours, in JSON format.
So, how can we extract the data? I tried a few things, including R, but one way is to load the files into Hadoop and define a table over them in Hive. The scripts used for the processing pipeline can be found on GitHub (WordPress seems to mangle the table definition). The Brickhouse jar is handy for operating on JSON and can easily be compiled from its git repo.
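To give a feel for what "extracting" means here, the following is a minimal Python sketch that flattens one nested JSON file into tab-separated rows, the kind of flat layout a Hive table can sit on top of. It is not the Hive/Brickhouse pipeline from the GitHub scripts, just a stand-alone illustration. The sample document, its field names (SiteRep, DV, Location, Period, Rep, T, $) and the station in it are assumptions made up for the example and may not match the real files.

```python
import json

# Hypothetical sample shaped like one downloaded file. The field names and
# the station are invented for illustration; the real feed may differ.
SAMPLE = """
{"SiteRep": {"DV": {"dataDate": "2015-01-01T12:00:00Z",
  "Location": [
    {"i": "0001", "name": "EXAMPLE STATION",
     "Period": [{"value": "2015-01-01Z",
                 "Rep": [{"T": "5.1", "$": "660"},
                         {"T": "4.8", "$": "720"}]}]}
  ]}}}
"""

# Column order for the flat output (assumed, not taken from the original scripts).
COLUMNS = ["station_id", "station_name", "day", "minutes", "temperature"]


def as_list(value):
    """Wrap a lone object in a list so single and repeated items loop the same way."""
    return value if isinstance(value, list) else [value]


def flatten(doc):
    """Yield one flat dict per observation found in the nested document."""
    for loc in as_list(doc["SiteRep"]["DV"]["Location"]):
        for period in as_list(loc["Period"]):
            for rep in as_list(period["Rep"]):
                yield {
                    "station_id": loc["i"],
                    "station_name": loc["name"],
                    "day": period["value"],
                    "minutes": int(rep["$"]),        # assumed: minutes after midnight
                    "temperature": float(rep["T"]),  # assumed: degrees Celsius
                }


def to_tsv(rows, columns):
    """Render rows as tab-separated lines, one observation per line."""
    return "\n".join("\t".join(str(row[c]) for c in columns) for row in rows)


if __name__ == "__main__":
    observations = list(flatten(json.loads(SAMPLE)))
    print(to_tsv(observations, COLUMNS))
```

Doing the flattening inside Hive instead keeps the raw JSON files untouched in Hadoop, which is the route taken in this post.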