
Use Python to download weather data and upload them to HBase

The goal of this tutorial is to write Python code that downloads data from my weather station (https://fulmanski.pl/data/weather/home/) and then uploads the data to HBase.

  1. Prerequisites
  2. Data source characteristic
  3. Download data with Python
  4. Choose suitable HBase data organization
  5. Upload data to HBase with Python
  6. Retrieving data from HBase with Python
  7. Summary


Prerequisites

The prerequisite for this tutorial is to have HBase installed along with Python and the happybase Python library. I described this process in my tutorial Install and work with Apache HBase. I use PyCharm to simplify managing the Python code -- you can do the same or work in whatever way suits you.

Start by running the Thrift service, which you will use to communicate with HBase:

and next HBase:

Start the HBase shell to interact with HBase directly:


Data source characteristic

To get data you have to make a request of the following form:

For example, if you type the following URL in your web browser:

you will get the following one-row data:

This is a typical JSON document, organized as described below:

Comment about the data field: the value of this field is text. It consists of multiple JSON documents separated by the newline character \n. Each JSON document contains data for a given moment in time. It starts with the timestamp_server field (with the value "20230905000006" in the above data), followed by a data field whose first field is the weather station timestamp. The two timestamps should differ by no more than a few seconds, except when the clocks have been changed to winter or summer time -- in our case you have "timestamp": "20230904230021". If you are curious, the server time is the correct one, because I don't adjust the time settings in my weather station when the clocks are shifted by +/-1 hour at the seasonal time change.
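To make the nesting concrete, here is a minimal parsing sketch. It assumes the outer response is a JSON object whose data field holds the newline-separated records, and that each record's own data field is a JSON object. The function name parse_records and the temperature field in the sample are hypothetical, only the two timestamp values come from the description above.

```python
import json


def parse_records(response_text):
    """Split the outer JSON's text-valued "data" field into one dict per line."""
    outer = json.loads(response_text)
    records = []
    for line in outer["data"].split("\n"):
        line = line.strip()
        if line:  # skip blank lines, e.g. one left by a trailing newline
            records.append(json.loads(line))
    return records


# Hypothetical sample: the "temperature" field is made up for illustration.
inner = [
    {"timestamp_server": "20230905000006",
     "data": {"timestamp": "20230904230021", "temperature": 17.5}},
    {"timestamp_server": "20230905000011",
     "data": {"timestamp": "20230904230026", "temperature": 17.4}},
]
sample_response = json.dumps({"data": "\n".join(json.dumps(r) for r in inner)})

records = parse_records(sample_response)
print(len(records), records[0]["timestamp_server"], records[0]["data"]["timestamp"])
```

Each element of records is then an ordinary dictionary, so both timestamps can be read without any further string handling.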


Download data with Python

Now you are ready to write Python code to download the data and parse it. The code I present below is simple rather than sophisticated, because I want to keep it understandable:
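As an illustration only (not necessarily the code used in this tutorial), the download step can be sketched with the standard library. The function name fetch_text and its opener parameter are hypothetical. The request URL has to be built in the form described in the previous section, so the sketch just takes a complete URL as an argument. A stand-in opener is used so the example runs without network access.

```python
import io
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen, timeout=30):
    """Return the body found at `url`, decoded as UTF-8 text."""
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fake_opener(url, timeout=None):
    # Stand-in for urlopen so the sketch runs offline; it ignores the URL.
    return io.BytesIO(b'{"data": ""}')


# With the default opener this would perform a real HTTP request.
print(fetch_text("https://fulmanski.pl/data/weather/home/", opener=fake_opener))
```

Passing the opener as a parameter is a design choice that keeps the function testable. In normal use you would call fetch_text(url) and let urllib.request.urlopen do the work.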

For simplicity I decided to paste this code into the main.py file of a new PyCharm Python project. You can run it directly from PyCharm or alternatively from the command line as:

where /home/nosql/Pulpit/hbase_weather_data_task/pycharm_projects/get_weather_data_from_server/ is the location of the project.

Regardless of the method, the result was the same: 8264 lines in total, with the first line:

and the last:


Choose suitable HBase data organization

Now comes the most important part of working with NoSQL databases: you have to decide how you want to aggregate (organize) your data.

For example, you can group your data (for one day) by hour. This way a date like 20230905 would be a row key and the hours from 0 to 23 would be the keys of column families. Then you can put as many columns in each family as you want -- for example, you can create their keys from the minutes part of the timestamp: for the last row this would be 59. Now if you provide the key:

you will get all data collected at 23:10 of the September 5th in 2023 regardless of the seconds.
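The mapping from a timestamp to its place in the table can be sketched as a small helper. The name hbase_coordinates is hypothetical, and the sketch assumes 14-digit timestamps of the form YYYYMMDDhhmmss with zero-padded two-digit hours used as family names.

```python
def hbase_coordinates(timestamp):
    """Split 'YYYYMMDDhhmmss' into (row key, column family, column qualifier)."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError(f"expected 14 digits, got {timestamp!r}")
    # date -> row key, hour -> column family, minute -> column qualifier
    return timestamp[:8], timestamp[8:10], timestamp[10:12]


row, family, qualifier = hbase_coordinates("20230905231042")
print(row, f"{family}:{qualifier}")  # 20230905 23:10
```

The seconds are deliberately dropped, which is why every record from the same minute ends up under the same column.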

Of course there are many different ways to aggregate your data, for example by sensor type or by parameters like temperature or humidity -- the right choice depends strongly on your future needs.

Whatever you decide, you have to prepare the database. Log in to the HBase shell and do:
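If you prefer to stay in Python, a table with one column family per hour could also be created through happybase. This is a sketch under assumptions: the helper names are hypothetical, the families are named 00 to 23, and the connection argument is expected to be a happybase.Connection to a running Thrift server.

```python
def hourly_families():
    """Map each family name '00'..'23' to an empty options dict (HBase defaults)."""
    return {f"{hour:02d}": {} for hour in range(24)}


def create_weather_table(connection, name="weatherstation"):
    # `connection` is assumed to be a happybase.Connection.
    connection.create_table(name, hourly_families())


# Usage against a running Thrift server (not executed here):
#   import happybase
#   connection = happybase.Connection("localhost")
#   create_weather_table(connection)

print(sorted(hourly_families())[:3], len(hourly_families()))
```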

To complete my way of aggregation you have to add a new function to your code:
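One possible shape for such a grouping function is sketched below. It is an illustration, not necessarily the function used in this tutorial. The name group_by_hour_and_minute is hypothetical, and the sketch assumes that grouping is done on the timestamp_server field.

```python
from collections import defaultdict


def group_by_hour_and_minute(records, key="timestamp_server"):
    """Return {hour: {minute: [records...]}} keyed by two-digit strings."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        stamp = record[key]  # 'YYYYMMDDhhmmss'
        grouped[stamp[8:10]][stamp[10:12]].append(record)
    # Convert to plain dicts so missing keys raise instead of being created.
    return {hour: dict(minutes) for hour, minutes in grouped.items()}


records = [
    {"timestamp_server": "20230905005901"},
    {"timestamp_server": "20230905005906"},
    {"timestamp_server": "20230905231042"},
]
grouped = group_by_hour_and_minute(records)
print(sorted(grouped))           # ['00', '23']
print(len(grouped["00"]["59"]))  # 2
```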

Next you have to change the "main" part:

When you run it, you will see:


Upload data to HBase with Python

Having the above, you are ready to write code that uploads data to HBase. Create a new Python project -- in my case it's again a PyCharm project with the name upload_weather_data_to_hbase.

Next copy and paste all the previous code and change the "main" part:

Now you will implement the uploading procedure uploadData(year, month, day). Start with some testing code to make sure that your data are ready to be uploaded:

The above code prints data for the first familyKey. You can test it directly in PyCharm or in a terminal:

The final part of the res.txt file contains:

As you can see, you have all the seconds for the 59th minute of hour 00 of September 5th, 2023.

Now you will change the code to push this data to HBase:
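The write itself can be sketched with happybase's batch interface, which sends the collected puts when the with block exits. This is an illustration under assumptions, not necessarily the tutorial's code: the name upload_grouped is hypothetical, each cell is assumed to hold the JSON text of all records for one minute, and the small stand-in classes exist only so the sketch runs without HBase.

```python
import json


def upload_grouped(table, row_key, grouped):
    """Write one cell per minute into `row_key`; return the number of cells."""
    cells = 0
    with table.batch() as batch:  # puts are sent when the block exits
        for hour, minutes in grouped.items():
            for minute, records in minutes.items():
                column = f"{hour}:{minute}".encode("utf-8")  # family:qualifier
                value = json.dumps(records).encode("utf-8")
                batch.put(row_key.encode("utf-8"), {column: value})
                cells += 1
    return cells


class FakeBatch:
    """In-memory stand-in for a happybase batch (illustration only)."""
    def __init__(self, store):
        self.store = store
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def put(self, row, data):
        self.store.setdefault(row, {}).update(data)


class FakeTable:
    """In-memory stand-in for a happybase table (illustration only)."""
    def __init__(self):
        self.store = {}
    def batch(self):
        return FakeBatch(self.store)


table = FakeTable()
grouped = {"00": {"58": [{"timestamp_server": "20230905005801"}],
                  "59": [{"timestamp_server": "20230905005901"}]}}
print(upload_grouped(table, "20230905", grouped))  # 2
print(sorted(table.store[b"20230905"]))            # [b'00:58', b'00:59']
```

With a real server, table would instead come from happybase.Connection("localhost").table("weatherstation").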

When executed you will see:

Check the HBase contents -- type scan 'weatherstation' in the HBase shell:

You will see a lot of text flooding your screen, but pay attention to the final part. In my case it is:

Can you see the text 1 row(s) in the penultimate line of the output? Yes, all your data (according to your code) were placed in one row. The key of this row is 20230905. My last column is column=00:59 and it is preceded by column=00:58.

To avoid the flood you can simply ask for the total number of rows in your table:

You still have one row:

Check that the other families are not empty. For example, try to get data from row 20230905 and column family 23:

Again there is a lot of data, and the final part in my case looks like this:

As you can see, there is no column= phrase in the output, because the column indicator, preceded by the family name, is printed on the left margin -- notice for example 23:59, indicating family 23 and column 59.

Great, all your data are now in HBase.


Retrieving data from HBase with Python

In the final part of this tutorial you will update the code to work in the following way:

  • When you execute it from the command line, you have to provide the year, month and day.
  • The script should check whether data for the specified date exist in HBase.
  • If the data exist, you should get them directly from HBase and print some info (for example, the outside temperature averaged by hour).
  • If the data are not present in the database, then you should first upload them and then perform the action from the previous step.

To complete this, keep the functions you wrote previously (getDataFromServer(year, month, day), grupData(data, by=None) and uploadData(year, month, day)) unchanged, add new functions (getData(year, month, day), printHBaseData(hbaseData) and main()), and change the final call of the main module:
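The "upload only when missing" decision can be sketched on its own, independently of the functions listed above. The name ensure_uploaded and its parameters are hypothetical. The sketch relies on a happybase table returning an empty dict from row() for a row that does not exist, and it uses a small in-memory stand-in so it runs without HBase.

```python
def ensure_uploaded(table, row_key, upload, force=False):
    """Run `upload()` only if the row is missing or `force` is set.

    Returns True when an upload was performed.
    """
    # Fetching the whole row is the simplest existence check, though not the cheapest.
    exists = bool(table.row(row_key.encode("utf-8")))
    if exists and not force:
        return False
    upload()
    return True


class FakeTable:
    """In-memory stand-in for a happybase table (illustration only)."""
    def __init__(self):
        self.rows = {}
    def row(self, key):
        return self.rows.get(key, {})


table = FakeTable()


def upload():
    # Stands in for the real download-and-upload step.
    table.rows[b"20230906"] = {b"00:00": b"[]"}


print(ensure_uploaded(table, "20230906", upload))              # True
print(ensure_uploaded(table, "20230906", upload))              # False
print(ensure_uploaded(table, "20230906", upload, force=True))  # True
```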

When done you can run it:

And finally test it on a new date:

If you re-run it with the same parameters, the uploading step is skipped:

You can confirm the presence of the data in HBase (there should be two rows):

You can also print all row keys:

As you can see, the keys are always sorted (data are sorted according to keys).
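The same listing could be done from Python by iterating over a happybase scan, which yields (row key, cells) pairs in key order. The helper name row_keys is hypothetical, and the stand-in table only imitates that ordering so the sketch runs without HBase. Note that a plain scan transfers the cell data as well, so for large tables a server-side filter would be the better choice.

```python
def row_keys(table):
    """Return all row keys of `table` as text, in the order the scan yields them."""
    return [key.decode("utf-8") for key, _cells in table.scan()]


class FakeTable:
    """In-memory stand-in that yields rows in key order, as HBase does."""
    def __init__(self, rows):
        self.rows = rows
    def scan(self):
        for key in sorted(self.rows):
            yield key, self.rows[key]


table = FakeTable({b"20230906": {}, b"20230905": {}})
keys = row_keys(table)
print(len(keys), keys)  # 2 ['20230905', '20230906']
```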


Summary

At this point you have a fully functional Python script for conditional HBase data uploading. Conditional, because only data that are not yet present are downloaded from the weather server and then uploaded to HBase. You still have the option to repeat this process for data that are already present by specifying the force parameter in the command line call.
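A command line interface of this kind can be sketched with argparse. The exact form of the call is an assumption here: the sketch uses three positional integers and a hypothetical --force flag, and builds the row key in the YYYYMMDD form used throughout this tutorial.

```python
import argparse


def parse_cli(argv=None):
    """Parse year, month, day and an optional --force flag."""
    parser = argparse.ArgumentParser(description="Load one day of weather data into HBase.")
    parser.add_argument("year", type=int)
    parser.add_argument("month", type=int)
    parser.add_argument("day", type=int)
    parser.add_argument("--force", action="store_true",
                        help="upload again even if the row already exists")
    return parser.parse_args(argv)


args = parse_cli(["2023", "9", "5", "--force"])
row_key = f"{args.year:04d}{args.month:02d}{args.day:02d}"
print(row_key, args.force)  # 20230905 True
```

Calling parse_cli() without arguments reads sys.argv, which is what a real script would do.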

As always, there are a lot of places where you can improve the existing code, or you can even write new code, for example to turn it into a web application.