Setup Hadoop 3 and Hive 3

Written on: June 11, 2020
Last modified on: May 06, 2021

I’ve been working for a while in the company that provides data analytics platform and enjoying interesting storage layer problems. I may roughly understand the whole picture of our Presto / Hive based MPP architecture. However, I recently feels that I need deeper understanding of their internal behavior to tackle complicated performance problems

This work is to create a sandbox environment to study the latest Hive and Hadoop.

Goal

Executing queries by hive on top of HDFS and MapReduce / Tez. I also think to use this environment to run Presto in the next step.

Note: I setup my enviroment just as a sandbox so security, reliability and availability are not considered well.

Environment and Software Versions

Hardware

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
Stepping:                        10
CPU MHz:                         2394.513
CPU max MHz:                     4100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        9 MiB
NUMA node0 CPU(s):               0-5

All processes run on single server.

Software

$ uname -srvmo
Linux 5.4.0-29-generic #33-Ubuntu SMP Wed Apr 29 14:32:27 UTC 2020 x86_64 GNU/Linux

$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

$ ./bin/hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737

Hadoop run by pseudo-distributed mode .

Hive 3.1.2
- replaced guava jar bundled on hive with the one of Hadoop as a workaround of guava version incompatibility problem

PostgreSQL 12.3 (for metastore)

$ docker run --name pg-12 -e POSTGRES_PASSWORD= -e POSTGRES_HOST_AUTH_METHOD=trust -d postgres:12.3

Tez 0.9.2

Setup HDFS & YARN

Follow steps of pseudo-distributed mode

Install dependencies
```
# apt-get install ssh pdsh
```
Check ssh to the localhost without password
Get hadoop distribution
Edit configurations
- etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Format HDFS

$ $HADOOP_HOME/bin/hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh

Note: HDFS data dir (dfs.datanode.data.dir) is /tmp/hadoop-${user.name}/dfs/data so it disappears by reboot.

Now HDFS NameNode and YARN ResourceManager Web UI are available

NameNode UI: dfs.namenode.http-address (default: namenode_host:9870)
ResourceManager UI: yarn.resourcemanager.webapp.address (default: resourcemanager_host:8088)

Setup Hive

Follow GettingStarted page.

Get Hive distribution and extract
Set HADOOP_HOME and HIVE_HOME

Create directories to store Hive tables

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse  # hive.metastore.warehouse.dir
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Edit configure to use PostgreSQL for metastore
- hive-site.xml (see here for configuration defail)

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://${HOST}:${PORT}/hive_metastore</value>
        <description>
            JDBC connect string for a JDBC metastore.
            To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
            For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>postgres</value>
        <description>Username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value></value>
        <description>password to use against metastore database</description>
    </property>

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
        <description>
            Setting this property to true will have HiveServer2 execute
            Hive operations as the user making the calls to it.
        </description>
    </property>
</configuration>

Initialize metastore schema

$ $HIVE_HOME/bin/schematool -dbType postgres -initSchema

Boot hiveserver2
```
$ $HIVE_HOME/bin/hiveserver2
```

HiveServer2 Web UI is available at hive.server2.webui.host:hive.server2.webui.port (default: localhost:10002).

Connect to Hiveserver

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

Note: Bootstrap of hiveserver may take some time. Now it’s ready to execute queries.

Install Tez

Follow Install instruction.

Get tez binary tarball and extract
```
$ tar zxvf apache-tez-0.9.2-bin.tar.gz
$ cd apache-tez-0.9.2-bin
```
Given TEZ_HOME is the directory tez binary tarball was extracted after here.

Server side

Copy full tarball on HDFS
```
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /apps/tez-0.9.2
$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal share/tez.tar.gz /apps/tez-0.9.2
```
Note: apache-tez-0.9.2-bin.tar.gz is not the one to upload but ${TEZ_HOME}/share/tez.tar.gz is the one. see here for the detail.

Workaround of HDFS-12920
To avoid the issue, following configuration was required on ${HADOOP_HOME}/conf/hdfs-site.xml.

  <property>
      <name>dfs.namenode.decommission.interval</name>
      <value>30</value>
  </property>
  <property>
      <name>dfs.client.datanode-restart.timeout</name>
      <value>30</value>
  </property>

Client side

To try tez applications, following configurations are required on a client.

Create tez-site.xml on ${TEZ_HOME}/conf

<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.9.2/tez.tar.gz</value>
    </property>
</configuration>

Note: tez-default-template.xml is a template of configuration file.

Set hadop classpath

export HADOOP_CLASSPATH=${TEZ_HOME}/conf:${TEZ_HOME}/*:${TEZ_HOME}/lib/*

Hive on Tez

Set Hive configuration as follows and start hiveserver2

${HIVE_HOME}/conf/hive-env.sh

export HADOOP_CLASSPATH=${TEZ_HOME}/conf:${TEZ_HOME}/*:${TEZ_HOME}/lib/*:${HADOOP_CLASSPATH}

${HIVE_HOME}/conf/hive-site.xml

    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
    </property>

Start hiveserver2

That’s it. Now Hive is set up on Tez. I will set up Presto on the next post.

← → Top