For a while I have been trying to install Hadoop on my personal laptop (an ancient Dell with 16GB RAM and a 2TB SSD), and here is a memo to self to remind me the next time I have to do this.
Recently, I finally had some success. Getting Hadoop and HDFS to run is fairly straightforward, and I did manage to add Hive to the mix as well, though currently just with MapReduce (MR) as the execution engine.
There are many pages on the interwebs detailing how to install Hadoop, and I followed several of them. Most problems can be solved using Google.
The first thing to do is to ensure the ssh server is installed and running. Next, it is a good idea to enable the account that is going to run Hadoop etc. to log in to the local machine without a password; if this is not done, you need to type in passwords every time you bring the services up and down. For example, this page is helpful.
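A minimal sketch of that setup, assuming a Debian/Ubuntu box (the empty passphrase is what lets the start/stop scripts run unattended):

sudo apt-get install openssh-server
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa    # key without a passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost                               # should log in without prompting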
Next we need to install Java. I first installed the latest version, which caused all sorts of problems, and ended up using apt-get install openjdk-8-jdk. There is no need to uninstall the other Javas; update-alternatives --config java lets you select a suitable version.
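Concretely (the last line is just to confirm that Java 8 is the one being picked up):

sudo apt-get install openjdk-8-jdk
sudo update-alternatives --config java   # interactively pick the Java 8 entry
java -version                            # should report 1.8.x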
Then we need Hadoop. I used hadoop-2.7.3; there are plenty of websites telling you what changes need to be made to the configuration files. I opted to run Hadoop as my normal user and put everything in a directory called /opt/data_platform. You also need a few environment variables set when you log in:
#export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# java -XshowSettings:properties -version
# update-alternatives --config java
export HADOOP_HOME=/opt/data_platform/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HIVE_HOME=/opt/data_platform/hive
export PATH=$PATH:$HIVE_HOME/bin
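For completeness, the core of a single-node configuration, written here as shell heredocs (the property names are standard Hadoop 2.x; the paths under /opt/data_platform and the port are assumptions matching my setup):

cat > $HADOOP_CONF_DIR/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > $HADOOP_CONF_DIR/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:/opt/data_platform/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:/opt/data_platform/hdfs/datanode</value></property>
</configuration>
EOF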
Then you need to initialise your HDFS file system (like formatting a hard disk), after which start-yarn.sh; start-dfs.sh brings Hadoop up.
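Spelled out, that is something like this (the format step is one-time only and destroys any existing HDFS data; jps is my addition as a quick check that the daemons came up):

hdfs namenode -format   # one-time: initialise the HDFS filesystem
start-yarn.sh
start-dfs.sh
jps                     # should list NameNode, DataNode, ResourceManager, NodeManager, ...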
You may have to wait a few seconds or minutes, depending on the speed of your box, before everything is available. But then you should be able to issue commands like:
planck:/opt/data_platform>hdfs dfs -ls /
19/10/27 17:08:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 10 items
drwxr-xr-x   - uh supergroup          0 2019-10-25 20:35 /data
...
And you can add data to your HDFS partition and check what YARN is doing.
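A few everyday commands, plus the web UIs (the ports are the Hadoop 2.x defaults; logfile.csv is just a stand-in name):

hdfs dfs -mkdir -p /data
hdfs dfs -put logfile.csv /data/   # copy a local file into HDFS
hdfs dfs -du -h /data              # check space usage
yarn application -list            # what is YARN running?
# web UIs: http://localhost:50070 (HDFS) and http://localhost:8088 (YARN)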
Adding a database
Now we can install Hive (I used apache-hive-2.3.4-bin; note that the linked instructions do not mention that you need to use the schematool to initialise MySQL) and turn this into a useful database. Again we need to adjust a few configuration files: Hive needs to know where Java is (this is where I ran into problems with my first choice of Java) and of course where the HDFS partition is. From this point onwards Hive should work on the command line and you can create and query tables in SQL style.
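The metastore step that tripped me up, as shell commands (this assumes hive-site.xml already points at a MySQL metastore; with the embedded Derby default you would pass -dbType derby instead):

schematool -dbType mysql -initSchema   # one-time metastore schema setup
schematool -dbType mysql -info         # verify the schema version
hive                                   # the CLI should now start cleanly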
To be able to connect to Hive via JDBC or remotely you need to start hiveserver2. I needed to make a few configuration changes to be able to impersonate my own user.
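Those impersonation changes live in Hadoop's core-site.xml, not Hive's (the proxyuser properties are standard; substitute your own username for uh and restart the Hadoop daemons afterwards):

# add inside <configuration> in $HADOOP_CONF_DIR/core-site.xml:
#   <property><name>hadoop.proxyuser.uh.hosts</name><value>*</value></property>
#   <property><name>hadoop.proxyuser.uh.groups</name><value>*</value></property>
hiveserver2 &                                    # start the server
beeline -u jdbc:hive2://localhost:10000 -n uh    # quick JDBC smoke test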
After that you can connect R to the Hive database:
Sys.setenv(HIVE_HOME = '/opt/data_platform/hive')
Sys.setenv(HADOOP_HOME = '/opt/data_platform/hadoop')
Sys.setenv(JAVA_HOME = '/usr/lib/jvm/java-8-openjdk-amd64/jre')
Sys.setenv(LD_LIBRARY_PATH = "/usr/java/jdk1.8.0_65/jre/lib/amd64/server:/usr/lib64/R/lib:/usr/local/lib64")
options(java.parameters = "-Xmx8g")

library(rJava)
library(RJDBC)
library(Rserve)
library(RHive)  # why do we need that? connection fails otherwise

# follow instructions on https://github.com/nexr/RHive#loading-rhive-and-connecting-to-hive
# ant can live in userspace
cp <- c("/opt/data_platform/hive/lib/hive-jdbc-2.3.4.jar",
        "/opt/data_platform/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar")
.jinit(classpath = cp)

drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/opt/data_platform/hive/lib/hive-jdbc-2.3.4.jar",
            identifier.quote = "`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/logs", "uh")

SQLtext <- "select * from logs.future_i_v limit 10"
dbResponse <- dbGetQuery(conn, SQLtext)
RHive needs to be compiled by hand, and the other libraries need a Java of the right kind in the right place. The best thing is to install those libraries on the command line, not in RStudio.
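Roughly what that looks like from a shell (R CMD javareconf points R at the Java chosen earlier; the hand-compilation of RHive itself follows the README in its repository):

sudo R CMD javareconf    # make R pick up the Java 8 JDK configured above
Rscript -e 'install.packages(c("rJava", "RJDBC", "Rserve"))'
# RHive is built by hand from https://github.com/nexr/RHive
# (it needs Apache Ant, which can live in user space; see the repo README)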