Study Hive 3

Written on: June 21, 2020
Last modified on: May 06, 2021

Continues from the previous post. This is randam study memo to understand internal behavior of Hive.

Play with Queries

Simple insert By MR

Create table by Beeline

> CREATE TABLE test (id int, value int);
...
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=...); Time taken: 0.056 seconds
...

Insert

> INSERT INTO insert into test values (1, 1), (2, 2), (3, 3);
...
INFO  : Total jobs = 3
INFO  : Launching Job 1 out of 3
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1591875950345_0001
INFO  : Executing with tokens: []
...
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2020-06-11 23:00:23,692 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-06-11 23:00:27,818 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.02 sec
INFO  : 2020-06-11 23:00:32,923 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.19 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 190 msec
INFO  : Ended Job = job_1591875950345_0002
INFO  : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Starting task [Stage-4:MOVE] in serial mode
INFO  : Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test/.hive-staging_hive_2020-06-11_23-00-19_051_1627075465863944744-2/-ext-10000 from hdfs://localhost:9000/user/hive/warehouse/test/.hive-staging_hive_2020-06-11_23-00-19_051_1627075465863944744-2/-ext-10002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table default.test from hdfs://localhost:9000/user/hive/warehouse/test/.hive-staging_hive_2020-06-11_23-00-19_051_1627075465863944744-2/-ext-10000
INFO  : Starting task [Stage-2:STATS] in serial mode
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.19 sec   HDFS Read: 14302 HDFS Write: 263 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 190 msec
INFO  : Completed executing command(queryId=...); Time taken: 14.843 seconds

Check file structure and format on FS

Default file format (‘hive.default.fileformat’) is textfile that columns are separated by ascii 001

partition files are stored under ${hive.metastore.warehouse.dir}/${table_name}/

$ ./bin/hadoop fs -ls /user/hive/warehouse/test
Found 1 items
-rw-r--r--   1 ... supergroup         12 2020-06-11 23:00 /user/hive/warehouse/test/000000_0
$ ./bin/hadoop fs -get /user/hive/warehouse/test/000000_0 .
$ od -Ax -tx1z -v 000000_0
000000 31 01 31 0a 32 01 32 0a 33 01 33 0a              >1.1.2.2.3.3.<
00000c
$ od -t c 000000_0
0000000   1 001   1  \n   2 001   2  \n   3 001   3  \n
0000014

Insert by Tez and ORC format

I enabled explain output (hive.log.explain.output = true) this time.

Create table by Beeline

> create table test_orc (id int, value int) stored as orc;
...
INFO  : EXPLAIN output for queryid ... : STAGE DEPENDENCIES:
  Stage-0 is a root stage [DDL]

STAGE PLANS:
  Stage: Stage-0
      Create Table Operator:
        Create Table
          columns: id int, value int
          input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
          output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
          serde name: org.apache.hadoop.hive.ql.io.orc.OrcSerde
          name: default.test_orc


INFO  : Completed compiling command(queryId=...); Time taken: 0.031 seconds
...
No rows affected (0.112 seconds)

Insert

0: jdbc:hive2://localhost:10000> insert into test_orc values (1, 1), (2, 2), (3, 3);
...
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:int, comment:null)], properties:null)
INFO  : EXPLAIN output for queryid ... : STAGE DEPENDENCIES:
  Stage-1 is a root stage [MAPRED]
  Stage-2 depends on stages: Stage-1 [DEPENDENCY_COLLECTION]
  Stage-0 depends on stages: Stage-2 [MOVE]
  Stage-3 depends on stages: Stage-0 [STATS]

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: ...
      Edges:
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: ...
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: _dummy_table
                  Row Limit Per Split: 1
                  Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
                  GatherStats: false
                  Select Operator
                    expressions: array(const struct(1,1),const struct(2,2),const struct(3,3)) (type: array<struct<col1:int,col2:int>>)
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
                    UDTF Operator
                      Statistics: Num rows: 1 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
                      function name: inline
                      Select Operator
                        expressions: col1 (type: int), col2 (type: int)
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                        File Output Operator
                          compressed: false
                          GlobalTableId: 1
                          directory: hdfs://localhost:9000/user/hive/warehouse/test_orc/.hive-staging_hive_2020-06-16_22-39-50_981_8937473827142056315-1/-ext-10000
                          NumFilesPerFileSink: 1
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Stats Publishing Key Prefix: hdfs://localhost:9000/user/hive/warehouse/test_orc/.hive-staging_hive_2020-06-16_22-39-50_981_8937473827142056315-1/-ext-10000/
                          table:
                              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                              properties:
                                COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","value":"true"}}
                                bucket_count -1
                                bucketing_version 2
                                column.name.delimiter ,
                                columns id,value
                                columns.comments
                                columns.types int:int
                                file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                                location hdfs://localhost:9000/user/hive/warehouse/test_orc
                                name default.test_orc
                                numFiles 0
                                numRows 0
                                rawDataSize 0
                                serialization.ddl struct test_orc { i32 id, i32 value}
                                serialization.format 1
                                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                                totalSize 0
                                transient_lastDdlTime 1592309097
                              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                              name: default.test_orc
                          TotalFiles: 1
                          GatherStats: true
                          MultiFileSpray: false
                        Select Operator
                          expressions: _col0 (type: int), _col1 (type: int)
                          outputColumnNames: id, value
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Group By Operator
                            aggregations: compute_stats(id, 'hll'), compute_stats(value, 'hll')
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              null sort order:
                              sort order:
                              Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
                              tag: -1
                              value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
                              auto parallelism: false
            Path -> Alias:
              hdfs://localhost:9000/tmp/hive/...
            Path -> Partition:
              hdfs://localhost:9000/tmp/hive/...
                Partition
                  base file name: dummy_path
                  input format: org.apache.hadoop.hive.ql.io.NullRowsInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    bucketing_version 2
                    column.name.delimiter ,
                    columns
                    columns.comments
                    columns.types
                    file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location hdfs://localhost:9000/tmp/hive/...
                    name _dummy_database._dummy_table
                    serialization.ddl struct _dummy_table { }
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.NullStructSerDe
                  serde: org.apache.hadoop.hive.serde2.NullStructSerDe

                    input format: org.apache.hadoop.hive.ql.io.NullRowsInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      bucket_count -1
                      bucketing_version 2
                      column.name.delimiter ,
                      columns
                      columns.comments
                      columns.types
                      file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location hdfs://localhost:9000/tmp/hive/...
                      name _dummy_database._dummy_table
                      serialization.ddl struct _dummy_table { }
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.serde2.NullStructSerDe
                    serde: org.apache.hadoop.hive.serde2.NullStructSerDe
                    name: _dummy_database._dummy_table
                  name: _dummy_database._dummy_table
            Truncated Path -> Alias:
              hdfs://localhost:9000/tmp/hive/...
        Reducer 2
            Needs Tagging: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  directory: hdfs://localhost:9000/tmp/hive/...
                  NumFilesPerFileSink: 1
                  Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
                  Stats Publishing Key Prefix: hdfs://localhost:9000/tmp/hive/...
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      properties:
                        columns _col0,_col1
                        columns.types struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,numdistinctvalues:bigint,ndvbitvector:binary>:struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,numdistinctvalues:bigint,ndvbitvector:binary>
                        escape.delim \
                        hive.serialization.extend.additional.nesting.levels true
                        serialization.escape.crlf true
                        serialization.format 1
                        serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  TotalFiles: 1
                  GatherStats: false
                  MultiFileSpray: false

  Stage: Stage-2
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          source: hdfs://localhost:9000/user/hive/warehouse/test_orc/.hive-staging_hive_2020-06-16_22-39-50_981_8937473827142056315-1/-ext-10000
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              properties:
                COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","value":"true"}}
                bucket_count -1
                bucketing_version 2
                column.name.delimiter ,
                columns id,value
                columns.comments
                columns.types int:int
                file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                location hdfs://localhost:9000/user/hive/warehouse/test_orc
                name default.test_orc
                numFiles 0
                numRows 0
                rawDataSize 0
                serialization.ddl struct test_orc { i32 id, i32 value}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                totalSize 0
                transient_lastDdlTime 1592309097
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.test_orc

  Stage: Stage-3
    Stats Work
      Basic Stats Work:
          Stats Aggregation Key Prefix: hdfs://localhost:9000/user/hive/warehouse/test_orc/.hive-staging_hive_2020-06-16_22-39-50_981_8937473827142056315-1/-ext-10000/
      Column Stats Desc:
          Columns: id, value
          Column Types: int, int
          Table: default.test_orc
          Is Table Level Stats: true

INFO  : Completed compiling command(queryId=...); Time taken: 0.565 seconds
...
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
...
No rows affected (8.489 seconds)

Check file structure and format on FS

$ hdfs dfs -ls '/user/hive/warehouse/test_orc'
Found 1 items
-rw-r--r--   1 ... supergroup        246 2020-06-16 22:39 /user/hive/warehouse/test_orc/000000_0

Metastore

hive_metastore=# select "TBL_ID", "DB_ID", "RETENTION", "SD_ID", "TBL_NAME", "TBL_TYPE", "IS_REWRITE_ENABLED" from "TBLS";
 TBL_ID | DB_ID | RETENTION | SD_ID | TBL_NAME |   TBL_TYPE    | IS_REWRITE_ENABLED
--------|-------|-----------|-------|----------|---------------|--------------------
      6 |     1 |         0 |     6 | test     | MANAGED_TABLE | f
     11 |     1 |         0 |    11 | test_orc | MANAGED_TABLE | f
(2 rows)

hive_metastore=# select * from "TABLE_PARAMS";
 TBL_ID |       PARAM_KEY       |                            PARAM_VALUE
--------|-----------------------|--------------------------------------------------------------------
      6 | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","value":"true"}}
      6 | bucketing_version     | 2
      6 | transient_lastDdlTime | 1592307774
      6 | totalSize             | 12
      6 | numRows               | 34
      6 | rawDataSize           | 102
      6 | numFiles              | 1
     11 | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","value":"true"}}
     11 | bucketing_version     | 2
     11 | transient_lastDdlTime | 1592314799
     11 | totalSize             | 246
     11 | numRows               | 3
     11 | rawDataSize           | 24
     11 | numFiles              | 1
(14 rows)

hive_metastore=# select * from "TAB_COL_STATS";
 CS_ID | CAT_NAME | DB_NAME | TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | TBL_ID | LONG_LOW_VALUE | LONG_HIGH_VALUE | DOUBLE_LOW_VALUE | DOUBLE_HIGH_VALUE | BIG_DECIMAL_LOW_VALUE | BIG_DECIMAL_HIGH_VALUE | NUM_NULLS | NUM_DISTINCTS |                BIT_VECTOR                | AVG_COL_LEN | MAX_COL_LEN | NUM_TRUES | NUM_FALSES | LAST_ANALYZED
-------|----------|---------|------------|-------------|-------------|--------|----------------|-----------------|------------------|-------------------|-----------------------|------------------------|-----------|---------------|------------------------------------------|-------------|-------------|-----------|------------|---------------
     6 | hive     | default | test       | id          | int         |      6 |              1 |               3 |                  |                   |                       |                        |         0 |             3 | \x484c4ca00303c1f3be48c1bbac62c2d2f48e03 |             |             |           |            |    1592307774
     7 | hive     | default | test       | value       | int         |      6 |              1 |               4 |                  |                   |                       |                        |         0 |             3 | \x484c4ca00303c1f3be48c1bbac62c2d2f48e03 |             |             |           |            |    1592307774
    11 | hive     | default | test_orc   | id          | int         |     11 |              1 |               3 |                  |                   |                       |                        |         0 |             3 | \x484c4ca00303c1f3be48c1bbac62c2d2f48e03 |             |             |           |            |    1592314799
    12 | hive     | default | test_orc   | value       | int         |     11 |              1 |               3 |                  |                   |                       |                        |         0 |             3 | \x484c4ca00303c1f3be48c1bbac62c2d2f48e03 |             |             |           |            |    1592314799
(4 rows)

← → Top