
Saving and retrieving data in Hadoop HDFS


Before we start

Note that in this tutorial hadoop is the Hadoop superuser account, while nosql is a typical, unprivileged user. I will write one of the following when the user is important:

or

In other cases I will simply write:
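For illustration, the three prompt forms could look like this (the host name is just a placeholder of my choosing):

    hadoop@host:~$ command    # run as the Hadoop superuser (hadoop)
    nosql@host:~$ command     # run as the regular user (nosql)
    $ command                 # when the user does not matter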


Install what you need

For this part you need to have installed:

  • Java
  • Hadoop
  • Sqoop
  • PostgreSQL

To install all required components, please follow the related documentation, or you can check how I did it:

The easiest way to install Java and PostgreSQL is with a package manager, like Synaptic, or simply from the command line (on Linux systems).
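For example, on a Debian/Ubuntu-like system this could be a sketch of the installation (exact package names, such as openjdk-8-jdk, depend on your distribution):

    # Install Java (OpenJDK) and PostgreSQL from the distribution repositories
    sudo apt update
    sudo apt install openjdk-8-jdk postgresql postgresql-contrib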


Getting data

Fireballs and bolides are astronomical terms for exceptionally bright meteors that are spectacular enough to be seen over a very wide area (see Fireball and Bolide Data).

You can get data about fireballs using the simple, publicly exposed Fireball Data API.

Create working directory

Either using wget

or with curl
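As a sketch of these steps (the working directory and output file names are my own choice; the Fireball Data API endpoint is https://ssd-api.jpl.nasa.gov/fireball.api):

    # Create a working directory for the data
    mkdir -p ~/fireball && cd ~/fireball

    # Download the fireball data as JSON, either with wget ...
    wget -O fireball.json "https://ssd-api.jpl.nasa.gov/fireball.api"

    # ... or with curl
    curl -o fireball.json "https://ssd-api.jpl.nasa.gov/fireball.api"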

Result

Now we will use a few command-line commands to extract the important data from this JSON and save it in a CSV file:

Notice that this part:

replaces the following, more complex, construction:

We can put part of this long command into a file named json2csv so we can reuse it more easily:
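The original pipeline is not reproduced here; as a rough equivalent, a json2csv script could be sketched with jq (the field handling is an assumption based on the API's "fields" and "data" arrays):

    #!/bin/bash
    # json2csv - convert Fireball API JSON (read from stdin) to CSV.
    # The API returns a "fields" array (the header) and a "data" array of rows;
    # both are emitted as CSV lines.
    jq -r '.fields, .data[] | @csv'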

Set the correct permissions so you can execute this file:

Now we can call it as

As all tests are positive, you can save the result to the fireball_data.csv file:
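A sketch of these three steps, assuming the json2csv script and the file names used above:

    # Make the script executable
    chmod +x json2csv

    # Test it on the downloaded JSON
    ./json2csv < fireball.json | head

    # Save the result to a CSV file
    ./json2csv < fireball.json > fireball_data.csv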


Create database and import data into it

Login to PostgreSQL. To do this, you have three options:

  1. best: add another superuser without touching the default postgres one [1];
  2. change the password of the default postgres user;
  3. add a "normal" user.

Please also read the following materials:

  1. How to Install PostgreSQL and phpPgAdmin on Ubuntu 20.04 LTS
  2. How to install PostgreSQL and phpPgAdmin
  3. What's the default superuser username/password for postgres after a new install?

I select the first option, and I'm going to add another superuser without touching the default postgres superuser.
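A minimal sketch of this option, assuming the new superuser is called nosql:

    # Create an additional PostgreSQL superuser (you will be asked for its password)
    sudo -u postgres createuser --superuser --pwprompt nosql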

Now it is possible to use phpPgAdmin at 127.0.0.1/phppgadmin:



Create database nosql
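If you prefer the command line over phpPgAdmin, a sketch could be:

    # Create the nosql database, owned by the nosql user
    sudo -u postgres createdb -O nosql nosql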



Create table


Put this code in the SQL textarea:
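The exact table definition depends on which fields you kept in the CSV; a sketch based on the Fireball API fields (column names and types are my assumptions) can be run with psql, or you can paste just the SQL into the phpPgAdmin textarea:

    # Create the table with psql (or paste the SQL part into phpPgAdmin)
    psql -U nosql -d nosql -c "
    CREATE TABLE fireball_data (
        date     TIMESTAMP,
        energy   NUMERIC,
        impact_e NUMERIC,
        lat      NUMERIC,
        lat_dir  CHAR(1),
        lon      NUMERIC,
        lon_dir  CHAR(1),
        alt      NUMERIC,
        vel      NUMERIC
    );"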




Import a CSV file into a table using the COPY statement
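A sketch of the import, assuming the table sketched above and the CSV created earlier (adjust the absolute path to your working directory):

    # Server-side COPY: the file must be readable by the PostgreSQL server process
    psql -U nosql -d nosql -c "COPY fireball_data FROM '/home/nosql/fireball/fireball_data.csv' DELIMITER ',' CSV HEADER;"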


In case of problems with reading:


Please verify all rights to the directories and files. For example:
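For instance, a quick check and a possible fix (the paths are placeholders):

    # Check who can traverse the directory and read the file
    ls -ld /home/nosql/fireball
    ls -l  /home/nosql/fireball/fireball_data.csv

    # If needed, let others (including the postgres service user) read them
    chmod o+rx /home/nosql/fireball
    chmod o+r  /home/nosql/fireball/fireball_data.csv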


Export data from a table to CSV using the COPY statement
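A sketch of the export, mirroring the import above (the target path is an assumption; it must be writable by the PostgreSQL server process):

    psql -U nosql -d nosql -c "COPY fireball_data TO '/tmp/fireball_data_from_postgresql.csv' DELIMITER ',' CSV HEADER;"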

In case of problems with saving,


create an empty file manually and set the correct permissions:
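For example (assuming the export path from the sketch above):

    # Pre-create the target file and make it writable for the PostgreSQL server process
    touch /tmp/fireball_data_from_postgresql.csv
    chmod o+w /tmp/fireball_data_from_postgresql.csv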

Compare the exported file (fireball_data_from_postgresql.csv):

with the original file (fireball_data.csv):
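For instance:

    # No output from diff means the files are identical
    diff fireball_data.csv fireball_data_from_postgresql.csv

    # Or just compare the first few lines by eye
    head -n 5 fireball_data.csv fireball_data_from_postgresql.csv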

You can also try to use the command line for the above, following this pattern (this is only an example - you have to adjust it to this tutorial's data):
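As a sketch of such a pattern, \copy runs on the client side, so ordinary user permissions apply instead of server-side file access (names are the ones used in this tutorial):

    # Client-side copy: psql itself reads/writes the file
    psql -U nosql -d nosql -c "\copy fireball_data TO 'fireball_data_from_postgresql.csv' DELIMITER ',' CSV HEADER"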


Start Hadoop

If it's not running yet, start Hadoop. Do this as a Hadoop superuser (hadoop in my case):
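A sketch, assuming the standard Hadoop sbin scripts are on the hadoop user's PATH:

    # Start HDFS (NameNode, DataNodes, SecondaryNameNode) and YARN
    start-dfs.sh
    start-yarn.sh

    # Verify which daemons are running
    jps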


Prepare user account

Now we have to be sure that all the necessary HDFS and Hadoop accounts for the user exist. In order to enable a new user to use your Hadoop cluster, follow these general steps.


Create OS Hadoop group and user

  1. Create the group
  2. If the user doesn't exist

    Create an OS account on the Linux system from which you want to let a user execute Hadoop jobs.

    Note:

    • -g The group name or number of the user's initial login group.
    • -G A list of supplementary groups which the user is also a member of.

    According to my tests, the -g option should be used to pass the Hadoop user verification process.

  3. If the user exists

    Create an OS account on the Linux system from which you want to let a user execute Hadoop jobs.

    Note:

    • -a appends the group to the user's existing groups. Without this, the new group will overwrite all existing supplementary groups when -G is used.
  4. To make the new group membership active you have to re-login (log out and then log in); a combined sketch of these steps follows this list:
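Assuming the group is called hadoopuser (as used later in this tutorial) and the user is nosql:

    # 1. Create the group
    sudo groupadd hadoopuser

    # 2. If the user doesn't exist yet: create it with hadoopuser as the initial login group (-g)
    sudo useradd -m -g hadoopuser nosql

    # 3. If the user already exists: append (-a) the supplementary group (-G)
    sudo usermod -a -G hadoopuser nosql

    # 4. Log out and log back in, then check the membership
    groups nosql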


Create HDFS user home directory and set permissions

  1. In order to create a new HDFS user, you need to create a directory under the /user directory. This directory will serve as the HDFS home directory for the user.
  2. Change the ownership of the directory, since you don’t want to use the default owner/group (hadoop/supergroup) for this directory.

    User nosql can now store the output of his/her MapReduce and other jobs under that user’s home directory in HDFS.
  3. Refresh the user and group mappings to let the NameNode know about the new user:
  4. Make sure that the permissions on the Hadoop temp directory (which is specified in the core-site.xml file) are set so that all Hadoop users can access it. The default temp directory is defined as below:

    • Check the existing ownership
    • Create the temp directory for the nosql user
    • Change the ownership of the newly created directory (owner to nosql, group to hadoopuser)
    • Change the rights of the temp directory

The new user can now log into the gateway servers and execute his or her Hadoop jobs and store data in HDFS.
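A combined sketch of the steps above, run as the hadoop superuser; the temp directory path is only an assumption based on the default hadoop.tmp.dir pattern (/tmp/hadoop-${user.name}), which your core-site.xml may override:

    # 1-2. Create the HDFS home directory for nosql and hand it over to the user
    hdfs dfs -mkdir -p /user/nosql
    hdfs dfs -chown nosql:hadoopuser /user/nosql

    # 3. Let the NameNode pick up the new user/group mapping
    hdfs dfsadmin -refreshUserToGroupsMappings

    # 4. Local temp directory (path assumed from the default hadoop.tmp.dir pattern)
    ls -ld /tmp/hadoop-*                      # check existing ownership
    sudo mkdir -p /tmp/hadoop-nosql           # create temp directory for the nosql user
    sudo chown nosql:hadoopuser /tmp/hadoop-nosql
    sudo chmod 755 /tmp/hadoop-nosql          # adjust rights as needed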


Import to HDFS


Using Sqoop

  • Put the JDBC connector in the correct directory (/usr/lib/sqoop/lib in my case; connector postgresql-42.2.18.jar).
  • Check the version
  • First try and first problem. Solving the java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils problem

    Locate all files having commons-lang in their name:

    Move the correct file to the Sqoop folder:
  • Second try and second problem. This problem may be related to incorrect JDBC URL syntax. You need to ensure that the JDBC URL conforms to the JDBC driver documentation, and keep in mind that it is usually case sensitive (see The infamous java.sql.SQLException: No suitable driver found: 2. Or, JDBC URL is in wrong syntax).
    • In the case of PostgreSQL it takes one of the following forms:
    • In the case of MySQL it takes the following form:
    • In the case of Oracle there are two URL syntaxes: the old syntax, which works only with an SID, and the new one, which uses the Oracle service name:
  • Success
  • Verify correctness
  • Use the -m, --num-mappers argument to use n map tasks to import in parallel (a combined sketch follows this list)
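A sketch of the JDBC URL forms and of a sqoop import invocation, assuming the database, table and user names used earlier in this tutorial (host, port and target directory are placeholders):

    # JDBC URL forms (case sensitive):
    #   PostgreSQL: jdbc:postgresql://host:port/database
    #   MySQL:      jdbc:mysql://host:port/database
    #   Oracle:     jdbc:oracle:thin:@host:port:SID          (old, SID-based)
    #               jdbc:oracle:thin:@//host:port/service    (new, service-name-based)

    sqoop import \
      --connect jdbc:postgresql://127.0.0.1:5432/nosql \
      --username nosql -P \
      --table fireball_data \
      --target-dir /user/nosql/fireball_data \
      -m 1

    # Verify correctness
    hdfs dfs -ls /user/nosql/fireball_data
    hdfs dfs -cat /user/nosql/fireball_data/part-m-00000 | head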


Using copyFromLocal command

Displays first kilobyte of the file to stdout.
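A sketch using the file created earlier:

    # Copy the local CSV into the user's HDFS home directory
    hdfs dfs -copyFromLocal fireball_data.csv /user/nosql/fireball_data.csv

    # Display the first kilobyte of the file on stdout
    hdfs dfs -head /user/nosql/fireball_data.csv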


Getting data back from HDFS