Slides from the talk given at the inaugural Hive London meetup (http://www.meetup.com/HiveLondon/events/69366482/). Covers a few odds and ends around making Hive queries faster through join optimisation etc.
(server, 5 minute time window, entity type) chunks.
• HoneyMaker aggregates these together (atomically), and ensures events are in the correct time bucket. Pushes to HDFS.
• Apiary polls HDFS, and loads data into Hive.
Tasks can be chained together with events propagated through an Observer model.
• Events are fired when data arrives – eliminates the need for cron jobs etc.
• Manages pulling data from, and pushing computations back into, SQL server.
SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id);

In our schema, a bid request gets assigned a creative id if we want to bid on the user. We decide not to bid on most requests, so most creative ids are NULL.
SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id);
Total MapReduce CPU Time Spent: 0 days 2 hours 10 minutes 42 seconds 20 msec

Task                               Time
task_201207020930_3180_r_000000    22mins, 44sec
task_201207020930_3180_r_000001    5mins, 21sec
task_201207020930_3180_r_000002    5mins, 21sec
task_201207020930_3180_r_000003    5mins, 18sec
task_201207020930_3180_r_000004    5mins, 15sec
…                                  …

• Records with the NULL key all end up at the same reducer (hash(NULL) mod p = 0).
• This skews the load, breaking parallelism.
• NULL keys will never be INNER JOINed, so it’s pointless for them to get to the reduce phase.
SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id);
Total MapReduce CPU Time Spent: 0 days 2 hours 10 minutes 42 seconds 20 msec

SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL);
Total MapReduce CPU Time Spent: 0 days 1 hours 9 minutes 44 seconds 660 msec

Moral: Reducing is slow – avoid it if possible.
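The NOT NULL check doesn’t have to live in the join predicate; an equivalent formulation (a sketch, same tables as above) strips the NULL keys in a subquery, so they never reach the shuffle at all:

    SELECT COUNT(1)
    FROM (
      SELECT creativeid
      FROM bidrequest
      WHERE creativeid IS NOT NULL   -- drop un-joinable rows on the map side
    ) BR
    JOIN creative C ON (BR.creativeid = C.id);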
Reducing is slow – can we avoid it altogether?

bidrequest:   Datetime         Site             Creative Id
              20120708154633   www.sample.com   7

creative:     Id   Width   Height
              7    300     250

[Diagram: the small table feeds the Map phase directly, emitting joined tuples.]
The small table is stored in memory as a hash-map, present on each mapper.
SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL);
Total MapReduce CPU Time Spent: 55 minutes 38 seconds 570 msec
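The timing above is the map-join version of the earlier query. A minimal sketch of how a map join is requested – the MAPJOIN hint (which also appears later in these slides), plus the configuration-based route; hive.auto.convert.join availability depends on your Hive version:

    -- optional, version-dependent: let Hive convert joins to map joins automatically
    SET hive.auto.convert.join = true;

    SELECT /*+ MAPJOIN(C) */ COUNT(1)
    FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL);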
Joins are carried out in order, one at a time, left-to-right.
• Joining to a big table is expensive.

SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL)
JOIN adgroup A ON (C.adgroupid = A.id);
Total MapReduce CPU Time Spent: 0 days 1 hours 19 minutes 10 seconds 80 msec
SELECT COUNT(1) FROM bidrequest BR
JOIN creative C ON (…)
JOIN adgroup A ON (C.adgroupid = A.id);
Total MapReduce CPU Time Spent: 0 days 1 hours 19 minutes 10 seconds 80 msec

SELECT COUNT(1) FROM adgroup A
JOIN creative C ON (C.adgroupid = A.id)
JOIN bidrequest BR ON (…);
Total MapReduce CPU Time Spent: 0 days 1 hours 2 minutes 57 seconds 980 msec
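Related, though not covered on the slide: when the ordering can’t be changed, Hive’s STREAMTABLE hint marks which table should be streamed through the reducers rather than buffered. A sketch, keeping the original (slow) ordering:

    SELECT /*+ STREAMTABLE(BR) */ COUNT(1)
    FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL)
    JOIN adgroup A ON (C.adgroupid = A.id);

By default Hive streams the rightmost table; the hint lets the big table stay first in the query while still being streamed.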
You can’t always reorder your joins, but you can always group them:

SELECT /*+ MAPJOIN(AC) */ COUNT(1) FROM bidrequest BR
JOIN (
  SELECT C.id AS creativeid, A.id AS adgroupid
  FROM creative C JOIN adgroup A ON (C.adgroupid = A.id)
) AC ON (BR.creativeid = AC.creativeid AND BR.creativeid IS NOT NULL);
Total MapReduce CPU Time Spent: 59 minutes 56 seconds 490 msec
SQL Server allows you to write:

SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY c) AS rownum
FROM srctable;

• Often useful for picking out the first row from an arbitrary partition.
Hive doesn’t support this, but you can replicate the behaviour with a UDF and a little syntactic sugar.

SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY c) AS rownum
FROM srctable;

becomes:

SELECT x, y, partitionedRowNumber(a, b) AS rownum
FROM (
  SELECT x, y, a, b FROM srctable
  DISTRIBUTE BY a, b
  SORT BY a, b, c
) iq;
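A usage sketch – partitionedRowNumber is the custom UDF named on the slide; DISTRIBUTE BY sends each (a, b) partition to a single reducer and SORT BY orders rows within it, so the UDF only has to increment a counter and reset it when the key pair changes. Picking the first row of each partition:

    SELECT x, y
    FROM (
      SELECT x, y, partitionedRowNumber(a, b) AS rownum
      FROM (
        SELECT x, y, a, b FROM srctable
        DISTRIBUTE BY a, b
        SORT BY a, b, c
      ) iq
    ) ranked
    WHERE rownum = 1;   -- first row per (a, b), as ordered by c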
GROUP BYs are implemented in Hive using map-side aggregations (unless you turn it off via hive.map.aggr).
• It works a little like a very aggressive combiner.
• It uses a HashMap to store GROUP BY keys against values in memory.
• There are obviously a few ways in which this can run out of memory…
There are a few things which you can do to stop this (see the sketch after this list):
• Decrease mapred.max.split.size
• Decrease hive.groupby.mapaggr.checkinterval
• Decrease hive.map.aggr.hash.percentmemory
• Decrease hive.map.aggr.hash.force.flush.memory.threshold
• There are no golden rules – it all depends on your data and queries.
• Keep experimenting – the more you decrease, the slower the query.
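A sketch of what the knob-turning looks like in a Hive session – the values are illustrative, not recommendations:

    SET mapred.max.split.size = 67108864;                      -- smaller splits → fewer rows per mapper
    SET hive.groupby.mapaggr.checkinterval = 50000;            -- check the hash-map size more often
    SET hive.map.aggr.hash.percentmemory = 0.3;                -- cap the hash-map's share of mapper memory
    SET hive.map.aggr.hash.force.flush.memory.threshold = 0.7; -- force a flush earlier
    -- or, if all else fails, turn map-side aggregation off entirely:
    SET hive.map.aggr = false;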
If you care more about disk space than CPU, then this is a neat trick:

INSERT OVERWRITE TABLE bigtable PARTITION(part_key='clustered')
SELECT … FROM bigtable
WHERE part_key='not_clustered'
CLUSTER BY …;

Generally 10-30% more efficiently packed. Cluster by columns which take up the most disk space, but have little entropy – URLs are a good example.
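A hypothetical concrete instance – the table layout and a url column are assumptions for illustration, following the URL example above:

    INSERT OVERWRITE TABLE bidrequest PARTITION(part_key='clustered')
    SELECT datetime, url, creativeid
    FROM bidrequest
    WHERE part_key='not_clustered'
    CLUSTER BY url;

    -- identical URLs now sit in adjacent rows, so RCFile's column-wise
    -- compression sees long runs of repeated values and packs them tightly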
In summary:
1. Filter out NULL join keys (push what you can into the join predicate).
2. Use map-side joins.
3. Group & order joins.
4. Replicating RANK() / OVER (PARTITION BY…).
5. Optimise GROUP BYs.
6. Cluster within RCFiles.