Slide 1

Slide 1 text

Ryu Kobayashi Apache Hive 4 on Treasure Data

Slide 2

Slide 2 text

About me ● Ryu Kobayashi ● Query Engine Team ● Background ○ Hadoop, Cassandra, Machine Learning ○ Past publications

Slide 3

Slide 3 text

© 2024 Treasure Data, Inc. Confidential—Internal Use Only 3 Our Hadoop and Hive history

Slide 4

Slide 4 text

© 2024 Treasure Data, Inc. Confidential Our Hadoop and Hive history 4 CDH3 CDH4 HDP2 Apache Hadoop, Hive and Tez

Slide 5

Slide 5 text

© 2024 Treasure Data, Inc. Confidential Our Hadoop and Hive history 5 CDH3 CDH4 HDP2 Apache Hadoop, Hive and Tez We stopped using the distribution

Slide 6

Slide 6 text

© 2024 Treasure Data, Inc. Confidential Why did we stop using the distribution? 6 ● Unable to upgrade Hadoop, Hive and Tez individually ● Our own patches and bug fixes ○ Since various customer queries are running, corner case bugs and more are happening ■ We will create a patch to fix this. ○ Difficult to source control the above ○ Patches must be applied on every upgrade unless reflected upstream

Slide 7

Slide 7 text

© 2024 Treasure Data, Inc. Confidential Current our Hive version 7 ● Stable: 2.3.2 base ● Experimental: 4.0.0-beta1 base ○ previous version: ■ 4.0.0-alpha1 base ■ 4.0.0-alpha2 base ● Skiped 3.x.x

Slide 8

Slide 8 text

© 2024 Treasure Data, Inc. Confidential We do for each upgrade 8 ● We have something like a Hive plugin and we need to update it. ○ We update the API to the latest. ○ We do not use HDFS to store data, we have our own data store called Plazma. ■ We need to update the metastore API when it is updated.

Slide 9

Slide 9 text

© 2024 Treasure Data, Inc. Confidential We do for each upgrade 9 ● Upgrade of Hivemall ○ The Hivemall podling retired on 2022-09-01 ○ The Hivemall supports up to Hive 2 ■ We are using Hivemall's UDF in CDP etc ■ So we need to upgrade Hivemall. ○ Currently, we use our own Hivemall for Hive 4

Slide 10

Slide 10 text

© 2024 Treasure Data, Inc. Confidential How do we test? 10 ● system test ○ general purpose test ● elephant-testing ○ Automatically test queries that have had issues in the past(about 100 tests) ● hive query simulator ○ Simulate and test previously executed queries

Slide 11

Slide 11 text

© 2024 Treasure Data, Inc. Confidential—Internal Use Only 11 Apache Hive 4 Overview

Slide 12

Slide 12 text

© 2024 Treasure Data, Inc. Confidential 12

Slide 13

Slide 13 text

© 2024 Treasure Data, Inc. Confidential 13

Slide 14

Slide 14 text

© 2024 Treasure Data, Inc. Confidential Why did it take 4 months? 14

Slide 15

Slide 15 text

© 2024 Treasure Data, Inc. Confidential 15

Slide 16

Slide 16 text

© 2024 Treasure Data, Inc. Confidential 16 HIVE-27858

Slide 17

Slide 17 text

© 2024 Treasure Data, Inc. Confidential—Internal Use Only 17 Apache Hive 4 Overview

Slide 18

Slide 18 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 18 ● Iceberg Integration ○ Advanced Snapshot management ○ Branches & Tags support ○ DML (insert/update/delete/merge) ○ COW & MOR modes ○ Vectorised Reads & Writes ○ Table migration command ○ LOAD DATA statements support ○ Partition-level operations support ○ Improved statistics (column stats support)

Slide 19

Slide 19 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 19 ● Hive ACID ○ Use sequences for TXN_ID generation (performance) ○ Read-only transactions optimization ○ Zero-wait readers ○ Optimistic and Pessimistic concurrency control ○ Lockless reads

Slide 20

Slide 20 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 20 ● Hive Metastore ○ API optimization (performance) ○ Dynamic leader election ○ External data sources support ○ HMS support for Thrift over HTTP ○ JWT authentication for Thrift over HTTP ○ HMS metadata summary ○ Use Zookeeper for service discovery

Slide 21

Slide 21 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 21 ● HiveServer2 ○ Support SAML 2.0/JWT authentication mode ○ Support both Kerberos and LDAP auth methods in parallel ○ Graceful shutdown ○ Easy access to the operation log through web UI

Slide 22

Slide 22 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 22 ● Hive Replication ○ Optimised Bootstrap Solution ○ Support for Snapshot Based Replication for External Table ○ Better Replication Tracking Metrics ○ Support for Checkpointing during Replication

Slide 23

Slide 23 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 23 ● Security ○ Authorizations in alter table/view, UDFs, and Views created from Apache Spark ○ Authorizations on tables created based on storage handlers ○ Critical CVE fixes of transitive dependencies

Slide 24

Slide 24 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 24 ● Compiler ○ Materialized view support for Iceberg tables ○ Improvements to refresh materialized views ○ Date/Timestamp fixes and improvements ○ Anti join support ○ Split update support ○ Branch pruning ○ Column histogram statistics support ○ Calcite upgrade to 1.25 ○ HPL/SQL improvements ○ Scheduled query support

Slide 25

Slide 25 text

© 2024 Treasure Data, Inc. Confidential Overview of Major Changes 25 ● Miscl ○ Support for ESRI GeopSpatial UDF's ○ Added support for Apache Ozone ○ Support Hadoop-3.3.6 ○ Supports Tez 0.10.3 ○ Works with Aarch64 (ARM)

Slide 26

Slide 26 text

© 2024 Treasure Data, Inc. Confidential Summary 26 ● Iceberg Integration ○ Many operations are now available in Iceberg(e.g. DML, Vectorized and more) ● Hive Docker Support ● Compiler Improvements ○ Anti-join support ○ Scheduled query support ○ Date/Timestamp fixes and improvements ■ hive.datetime.formatter(default: DATETIME) ● DATETIME: java.time.format.DateTimeFormatter ● SIMPLE: java.text.SimpleDateFormat ○ improved CBO rules leading to better query plans

Slide 27

Slide 27 text

© 2024 Treasure Data, Inc. Confidential Summary 27 ● Hive CLI Deprecated ● Hive on MapReduce Deprecated and Hive on Spark removed ● Support Hadoop-3.3.6 ● Supports Tez 0.10.3 ● New UDFs

Slide 28

Slide 28 text

© 2024 Treasure Data, Inc. Confidential New UDFs 28 ● quote ○ Enclose the string in quotes

Slide 29

Slide 29 text

© 2024 Treasure Data, Inc. Confidential New UDFs 29 ● quote ○ HIVE-685 was created by 24/Jul/09

Slide 30

Slide 30 text

© 2024 Treasure Data, Inc. Confidential New UDFs 30 ● deserialize ○ Returns plain text string of given message which was compressed in compressionFormat and base64 encoded. Currently, Supports only 'gzip' for Gzip compressed and base 64 encoded strings

Slide 31

Slide 31 text

© 2024 Treasure Data, Inc. Confidential New UDFs 31 ● cosh and tanh ○ Returns the hyperbolic cosine of x and hyperbolic tangent of x

Slide 32

Slide 32 text

© 2024 Treasure Data, Inc. Confidential New UDFs 32 ● array related ○ array_slice ○ array_min ○ array_max ○ array_distinct ○ array_join ○ array_expect ○ array_intersect ○ array_union ○ array_remove

Slide 33

Slide 33 text

© 2024 Treasure Data, Inc. Confidential New UDFs 33 ● typeof ○ Returns the type of the supplied argument

Slide 34

Slide 34 text

© 2024 Treasure Data, Inc. Confidential New UDFs 34 ● ESRI Geospatial UDFs(ST_xxxx UDFs) ○ Add ESRI-based geospatial data and operate it with Hive ○ NOTE: ESRI data is required for use ○ Related blog ■ The blog above has an example to get the number of earthquakes in California ■ Sample data: ● earthquakes.csv ● california-counties.json

Slide 35

Slide 35 text

© 2024 Treasure Data, Inc. Confidential New UDFs 35 ● ESRI Geospatial UDFs(ST_xxxx UDFs) ○ Some things to keep in note ■ california-counties.json uses its original format ■ We need to specify original InputFormat ● org.apache.hadoop.hive.ql.io.esriJson.EnclosedE sriJsonInputFormat ● This data also includes binary

Slide 36

Slide 36 text

© 2024 Treasure Data, Inc. Confidential ESRI Geospatial UDFs 36 ● e.g. CREATE TABLE counties ( Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary) ROW FORMAT SERDE ‘org.apache.hadoop.hive.ql.udf.esri.serde.EsriJsonSerDe’ STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Slide 37

Slide 37 text

© 2024 Treasure Data, Inc. Confidential Hive Docker Support 37 ● This makes it easy to quickly check ● DockerHub ● Quick start ○ Launch the HiveServer2 with an embedded Metastore ○ Launch Standalone Metastore ○ Accessing Beeline ○ Accessing HiveServer2 Web UI: export HIVE_VERSION=4.0.0 docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION} docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION} docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/' Accessed on browser at http://localhost:10002/

Slide 38

Slide 38 text

© 2024 Treasure Data, Inc. Confidential 38 Thank you

Slide 39

Slide 39 text

The power to use every bit of privacy-compliant data to serve with relevance. © 2024 Treasure Data, Inc. Confidential 39