Setup Hadoop 3 and Hive 3
I’ve been working for a while in the company that provides data analytics platform and enjoying interesting storage layer problems. I may roughly understand the whole picture of our Presto / Hive based MPP architecture. However, I recently feels that I need deeper understanding of their internal behavior to tackle complicated performance problems
This work is to create a sandbox environment to study the latest Hive and Hadoop.
Goal
Executing queries by hive on top of HDFS and MapReduce / Tez. I also think to use this environment to run Presto in the next step.
Note: I setup my enviroment just as a sandbox so security, reliability and availability are not considered well.
Environment and Software Versions
Hardware
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
Stepping: 10
CPU MHz: 2394.513
CPU max MHz: 4100.0000
CPU min MHz: 800.0000
BogoMIPS: 6000.00
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 9 MiB
NUMA node0 CPU(s): 0-5
All processes run on single server.
Software
$ uname -srvmo
Linux 5.4.0-29-generic #33-Ubuntu SMP Wed Apr 29 14:32:27 UTC 2020 x86_64 GNU/Linux
$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
$ ./bin/hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
Hadoop run by pseudo-distributed mode .
- Hive 3.1.2
- replaced guava jar bundled on hive with the one of Hadoop as a workaround of guava version incompatibility problem
- PostgreSQL 12.3 (for metastore)
$ docker run --name pg-12 -e POSTGRES_PASSWORD= -e POSTGRES_HOST_AUTH_METHOD=trust -d postgres:12.3
- Tez 0.9.2
Setup HDFS & YARN
Follow steps of pseudo-distributed mode
- Install dependencies
# apt-get install ssh pdsh
- Check ssh to the localhost without password
- Get hadoop distribution
- Edit configurations
- etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- Format HDFS
$ $HADOOP_HOME/bin/hdfs namenode -format $ $HADOOP_HOME/sbin/start-dfs.sh $ $HADOOP_HOME/sbin/start-yarn.sh
Note: HDFS data dir (dfs.datanode.data.dir) is /tmp/hadoop-${user.name}/dfs/data
so it disappears by reboot.
Now HDFS NameNode and YARN ResourceManager Web UI are available
- NameNode UI: dfs.namenode.http-address (default: namenode_host:9870)
- ResourceManager UI: yarn.resourcemanager.webapp.address (default: resourcemanager_host:8088)
Setup Hive
Follow GettingStarted page.
- Get Hive distribution and extract
- Set
HADOOP_HOME
andHIVE_HOME
- Create directories to store Hive tables
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp $ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse # hive.metastore.warehouse.dir $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
- Edit configure to use PostgreSQL for metastore
- hive-site.xml (see here for configuration defail)
<configuration>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://${HOST}:${PORT}/hive_metastore</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>postgres</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value></value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
</configuration>
- Initialize metastore schema
$ $HIVE_HOME/bin/schematool -dbType postgres -initSchema
- Boot hiveserver2
$ $HIVE_HOME/bin/hiveserver2
HiveServer2 Web UI is available at hive.server2.webui.host:hive.server2.webui.port (default: localhost:10002).
Connect to Hiveserver
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000
Note: Bootstrap of hiveserver may take some time. Now it’s ready to execute queries.
Install Tez
Follow Install instruction.
- Get tez binary tarball and extract
$ tar zxvf apache-tez-0.9.2-bin.tar.gz $ cd apache-tez-0.9.2-bin
Given
TEZ_HOME
is the directory tez binary tarball was extracted after here.
Server side
- Copy full tarball on HDFS
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /apps/tez-0.9.2 $ $HADOOP_HOME/bin/hadoop fs -copyFromLocal share/tez.tar.gz /apps/tez-0.9.2
Note:
apache-tez-0.9.2-bin.tar.gz
is not the one to upload but${TEZ_HOME}/share/tez.tar.gz
is the one. see here for the detail. - Workaround of HDFS-12920
To avoid the issue, following configuration was required on${HADOOP_HOME}/conf/hdfs-site.xml
.<property> <name>dfs.namenode.decommission.interval</name> <value>30</value> </property> <property> <name>dfs.client.datanode-restart.timeout</name> <value>30</value> </property>
Client side
To try tez applications, following configurations are required on a client.
- Create tez-site.xml on
${TEZ_HOME}/conf
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/apps/tez-0.9.2/tez.tar.gz</value>
</property>
</configuration>
Note: tez-default-template.xml
is a template of configuration file.
- Set hadop classpath
export HADOOP_CLASSPATH=${TEZ_HOME}/conf:${TEZ_HOME}/*:${TEZ_HOME}/lib/*
Hive on Tez
Set Hive configuration as follows and start hiveserver2
${HIVE_HOME}/conf/hive-env.sh
export HADOOP_CLASSPATH=${TEZ_HOME}/conf:${TEZ_HOME}/*:${TEZ_HOME}/lib/*:${HADOOP_CLASSPATH}
${HIVE_HOME}/conf/hive-site.xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
- Start hiveserver2
That’s it. Now Hive is set up on Tez. I will set up Presto on the next post.