Hadoop on a laptop

For a while I have been trying to install Hadoop on my personal laptop (an ancient Dell with 16 GB RAM and a 2 TB SSD). Here is a memo to self, to remind me what to do the next time I have to do this.

Recently, I finally had some success. Getting Hadoop and HDFS to run is fairly straightforward, and I managed to add Hive to the mix as well, though currently only with MapReduce (MR) as the execution engine.

There are many pages on the interwebs detailing how to install Hadoop and I followed several of them. Most problems can be solved using Google.

The first thing to do is to ensure the ssh server is installed and running. Next, it is a good idea to let the account that is going to run Hadoop etc. log in to the local machine without a password. If this is not done, you need to type in passwords every time you bring the services up and down. For example this page is helpful.
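
The password-less login described above can be set up roughly as follows. This is only a sketch: the function name setup_local_ssh_key is made up for illustration, and it assumes an RSA key with an empty passphrase in ~/.ssh is acceptable on your machine.

```shell
# Sketch: allow the current user to ssh into localhost without a password.
# setup_local_ssh_key is an illustrative helper, not part of Hadoop or OpenSSH.
setup_local_ssh_key() {
    local key_dir="${1:-$HOME/.ssh}"
    local key_file="$key_dir/id_rsa"

    mkdir -p "$key_dir" && chmod 700 "$key_dir"

    # Create a key pair with an empty passphrase unless one already exists.
    if [ ! -f "$key_file" ]; then
        ssh-keygen -q -t rsa -N "" -f "$key_file"
    fi

    # Authorise the public key for logins to this machine, but only once.
    touch "$key_dir/authorized_keys"
    if ! grep -qxFf "$key_file.pub" "$key_dir/authorized_keys"; then
        cat "$key_file.pub" >> "$key_dir/authorized_keys"
    fi
    chmod 600 "$key_dir/authorized_keys"
}
```

After calling setup_local_ssh_key once, ssh localhost should log you in without asking for a password.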

Next we need to install Java. I first installed the latest version, which caused all sorts of problems, and ended up using apt-get install openjdk-8-jdk. There is no need to uninstall the other Java versions; update-alternatives --config java lets you select a suitable one.

Then we need Hadoop. I used hadoop-2.7.3; there are plenty of websites telling you what changes need to be made to the configuration files. I opted to use my normal user to run Hadoop and put everything in a directory called /opt/data_platform. You also need a few environment variables set when you log in:
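
To check which Java is actually selected, something like the following may help. It is a small sketch; the helper names java_major_version and current_java_version are my own, and the parsing assumes the usual version "..." line that java -version prints.

```shell
# Print the major version for a Java version string,
# e.g. "1.8.0_222" gives 8 and "11.0.4" gives 11.
java_major_version() {
    local version="$1"
    case "$version" in
        1.*) version="${version#1.}" ;;   # legacy scheme: 1.8.0_222 becomes 8.0_222
    esac
    echo "${version%%[._-]*}"
}

# Read the version string of the java currently first on the PATH.
# java -version writes to stderr, hence the redirect.
current_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

if command -v java >/dev/null 2>&1; then
    echo "Selected Java major version: $(java_major_version "$(current_java_version)")"
fi
```

If this does not report 8, update-alternatives --config java is the place to switch.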

#export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# java -XshowSettings:properties -version
#  update-alternatives --config java

export HADOOP_HOME=/opt/data_platform/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export HIVE_HOME=/opt/data_platform/hive
export PATH=$PATH:$HIVE_HOME/bin
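
A quick sanity check of these variables can save some head-scratching later. The sketch below uses a made-up helper, check_platform_dirs, which only verifies that each variable is set and points to an existing directory.

```shell
# Sketch: verify that the platform variables are set and point to directories.
# check_platform_dirs is an illustrative helper name.
check_platform_dirs() {
    local status=0 name value
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR HIVE_HOME; do
        eval "value=\${$name:-}"          # indirect lookup of the variable
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        fi
    done
    return $status
}
```

Running check_platform_dirs after logging in prints nothing and returns 0 when everything is in place.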

Then you need to initialise your HDFS file system (like formatting a hard disk), after which you can run start-yarn.sh; start-dfs.sh and Hadoop is up and running. You may have to wait a few seconds or minutes, depending on the speed of your box, before everything is available. After that you should be able to issue commands like

planck:/opt/data_platform>hdfs dfs -ls /
19/10/27 17:08:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 10 items
drwxr-xr-x   - uh supergroup          0 2019-10-25 20:35 /data

Now you can add data to your HDFS partition and check what YARN is doing.
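
The format, start and load steps can be sketched as below. The run wrapper, the DRY_RUN switch and the function names are my own inventions for illustration; by default the commands are only printed, so nothing is touched until DRY_RUN=0 is set.

```shell
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One-off step: initialise the NameNode metadata.
# Reformatting later discards the existing file system metadata.
format_hdfs() {
    run hdfs namenode -format
}

# Bring up the services and list the running JVMs to see the daemons.
start_hadoop() {
    run start-yarn.sh
    run start-dfs.sh
    run jps
}

# Copy a local file into an HDFS directory and look at the result.
load_into_hdfs() {
    local src="$1" dest="$2"
    run hdfs dfs -mkdir -p "$dest"
    run hdfs dfs -put "$src" "$dest"
    run hdfs dfs -ls "$dest"
    run yarn application -list
}
```

For example, load_into_hdfs /tmp/a.csv /data/raw (both paths are placeholders) prints the four commands it would run.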

Adding a database

Now we can install Hive (I used apache-hive-2.3.4-bin; the link does not mention that you need to use schematool to initialise MySQL) and turn this into a useful database. Again we need to adjust a few configuration files: Hive needs to know where Java is (this is where I ran into problems with my first choice of Java) and of course where the HDFS partition is. From this point onwards Hive should work on the command line and you can create and query tables in SQL style.

To be able to connect to Hive via JDBC or remotely you need to start hiveserver2. I needed to make a few configuration changes to be able to impersonate my own user.

After that you can connect R to the Hive database:
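
One way to tell Hive where Java and Hadoop live is a small hive-env.sh. The sketch below writes such a file; write_hive_env is a made-up helper, the defaults are the paths used in this memo, and the exact set of variables your setup needs may differ.

```shell
# Sketch: write a minimal hive-env.sh to the given path.
# Values come from the current environment, falling back to the paths in this memo.
write_hive_env() {
    local target="$1"
    cat > "$target" <<EOF
# Minimal hive-env.sh (illustrative)
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}
export HADOOP_HOME=${HADOOP_HOME:-/opt/data_platform/hadoop}
export HIVE_CONF_DIR=${HIVE_HOME:-/opt/data_platform/hive}/conf
EOF
}
```

Something like write_hive_env "$HIVE_HOME/conf/hive-env.sh" would put the file where Hive looks for it. The metastore schema is then initialised once with schematool -dbType mysql -initSchema, assuming the MySQL connection settings are already in hive-site.xml.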
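
I did not note down the exact changes, but a common fix for impersonation errors is to add Hadoop proxy-user properties for the account that runs hiveserver2. The sketch below only prints the XML; proxyuser_properties is a made-up helper, and the wide-open * values are an assumption that is fine on a laptop but not on a shared cluster.

```shell
# Sketch: print core-site.xml properties that let a user impersonate others.
# Defaults to the current user when no name is given.
proxyuser_properties() {
    local user="${1:-$(id -un)}"
    cat <<EOF
<property>
  <name>hadoop.proxyuser.${user}.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.${user}.groups</name>
  <value>*</value>
</property>
EOF
}

proxyuser_properties
```

The output goes inside the configuration element of core-site.xml, followed by a restart of HDFS and YARN. Once hiveserver2 is running, beeline -u jdbc:hive2://localhost:10000 -n uh is a quick way to test the connection before involving R.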

  Sys.setenv(LD_LIBRARY_PATH = "/usr/java/jdk1.8.0_65/jre/lib/amd64/server:/usr/lib64/R/lib:/usr/local/lib64")

  options( java.parameters = "-Xmx8g" )
  library(RHive) # why do we need that? connection fails otherwise
  # follow instruction on https://github.com/nexr/RHive#loading-rhive-and-connecting-to-hive ant can live in userspace

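  library(RJDBC) # provides JDBC(), dbConnect() and dbGetQuery()
  # classpath for the Hive JDBC driver; add any further jars the driver depends on to this vector
  cp <- c("/opt/data_platform/hive/lib/hive-jdbc-2.3.4.jar")
  drv <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath = cp)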
  conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/logs", "uh")
  SQLtext <- "select * from logs.future_i_v limit 10"
  dbResponse <- dbGetQuery(conn, SQLtext)

RHive needs to be compiled by hand, and the other libraries need the right kind of Java to be in the right place. The best thing is to install those libraries from the command line, not in RStudio.