Setup Hadoop 3 and Hive 3

I’ve been working for a while in the company that provides data analytics platform and enjoying interesting storage layer problems. I may roughly understand the whole picture of our Presto / Hive based MPP architecture. However, I recently feels that I need deeper understanding of their internal behavior to tackle complicated performance problems

This work is to create a sandbox environment to study the latest Hive and Hadoop.

Goal

Executing queries by hive on top of HDFS and MapReduce / Tez. I also think to use this environment to run Presto in the next step.

Note: I setup my enviroment just as a sandbox so security, reliability and availability are not considered well.

Environment and Software Versions

Hardware

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
Stepping:                        10
CPU MHz:                         2394.513
CPU max MHz:                     4100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        9 MiB
NUMA node0 CPU(s):               0-5

All processes run on single server.

Software

$ uname -srvmo
Linux 5.4.0-29-generic #33-Ubuntu SMP Wed Apr 29 14:32:27 UTC 2020 x86_64 GNU/Linux

$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

$ ./bin/hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737

Hadoop run by pseudo-distributed mode .

Setup HDFS & YARN

Follow steps of pseudo-distributed mode

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Note: HDFS data dir (dfs.datanode.data.dir) is /tmp/hadoop-${user.name}/dfs/data so it disappears by reboot.

Now HDFS NameNode and YARN ResourceManager Web UI are available

Setup Hive

Follow GettingStarted page.

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://${HOST}:${PORT}/hive_metastore</value>
        <description>
            JDBC connect string for a JDBC metastore.
            To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
            For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>postgres</value>
        <description>Username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value></value>
        <description>password to use against metastore database</description>
    </property>

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
        <description>
            Setting this property to true will have HiveServer2 execute
            Hive operations as the user making the calls to it.
        </description>
    </property>
</configuration>

HiveServer2 Web UI is available at hive.server2.webui.host:hive.server2.webui.port (default: localhost:10002).

Connect to Hiveserver

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

Note: Bootstrap of hiveserver may take some time. Now it’s ready to execute queries.

Install Tez

Follow Install instruction.

Server side

Client side

To try tez applications, following configurations are required on a client.

<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.9.2/tez.tar.gz</value>
    </property>
</configuration>

Note: tez-default-template.xml is a template of configuration file.

Hive on Tez

Set Hive configuration as follows and start hiveserver2

export HADOOP_CLASSPATH=${TEZ_HOME}/conf:${TEZ_HOME}/*:${TEZ_HOME}/lib/*:${HADOOP_CLASSPATH}
    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
    </property>

That’s it. Now Hive is set up on Tez. I will set up Presto on the next post.