Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Hive Optimisation - Tips & Tricks

Hive Optimisation - Tips & Tricks

Slides from the talk given at the inaugural Hive London meetup (http://www.meetup.com/HiveLondon/events/69366482/). Covers a few odds and ends around making Hive queries faster through join optimisation etc.

Philip Tromans

July 10, 2012
Tweet

Other Decks in Programming

Transcript

  1. Agenda • Infrastructure & data pipeline. • Tips: • Join

    optimization • Replicating RANK() • Optimizing GROUP BY • More efficient file storage with RCFile and CLUSTER BY
  2. Our Infrastructure • SoftLayer Bare Metal Servers • Intel Xeon-SandyBridge

    E3-1270-Quadcore [3.4GHz] • 16GB RAM • 4x2TB Hard Disks (1 local, 3 JBOD for HDFS) • Hive 0.9.0 (approx)
  3. Getting Data Into Hive • Log files arrive, divided into

    (server, 5 minute time window, entity type) chunks. • HoneyMaker aggregates these together (atomically), and ensures events are in the correct time bucket. Pushes to HDFS. • Apiary polls HDFS, and loads data into Hive.
  4. Apiary (Workflow Management) • Allows the creation of tasks. •

    Tasks can be chained together with events propagated through an Observer model. • Events are fired when data arrives – eliminates the need for cron jobs etc. • Manages pulling data from, and pushing computations back into SQL server.
  5. MapReduce Joins SELECT * FROM impressions I JOIN creative C

    ON (I.creativeid = C.id); (7, { datetime: ‘20120708154633’, site: ‘www.sample.com’ }) (7, { width: 300, height: 250 }) Map 20120708154633 www.sample.com 7 300 250 Reduce
  6. Tip 1: Avoid NULLs in joins SELECT COUNT(1) FROM bidrequest

    BR JOIN creative C ON (BR.creativeid = C.id); In our schema, a bid request gets assigned a creative id if we want to bid on the user. We decide not to bid on most requests, so most creative ids are NULL.
  7. Tip 1: Avoid NULLs in joins SELECT COUNT(1) FROM bidrequest

    BR JOIN creative C ON (BR.creativeid = C.id); Total MapReduce CPU Time Spent: 0 days 2 hours 10 minutes 42 seconds 20 msec Task Time task_201207020930_3180_r_000000 22mins, 44sec task_201207020930_3180_r_000001 5mins, 21sec task_201207020930_3180_r_000002 5mins, 21sec task_201207020930_3180_r_000003 5mins, 18sec task_201207020930_3180_r_000004 5mins, 15sec … …
  8. Tip 1: Avoid NULLs in joins SELECT COUNT(1) FROM bidrequest

    BR JOIN creative C ON (BR.creativeid = C.id); Total MapReduce CPU Time Spent: 0 days 2 hours 10 minutes 42 seconds 20 msec Task Time task_201207020930_3180_r_000000 22mins, 44sec task_201207020930_3180_r_000001 5mins, 21sec task_201207020930_3180_r_000002 5mins, 21sec task_201207020930_3180_r_000003 5mins, 18sec task_201207020930_3180_r_000004 5mins, 15sec … … • Records with the NULL key all end up at the same reducer (hash(NULL) mod p = 0) • This skews the load, breaking parallelism. • Null keys will never be INNER JOINed, so it’s pointless for them to get to the reduce phase.
  9. Tip 1: Avoid NULLs in joins SELECT COUNT(1) FROM bidrequest

    BR JOIN creative C ON (BR.creativeid = C.id); Total MapReduce CPU Time Spent: 0 days 2 hours 10 minutes 42 seconds 20 msec SELECT COUNT(1) FROM bidrequest BR JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL); Total MapReduce CPU Time Spent: 0 days 1 hours 9 minutes 44 seconds 660 msec Moral: Reducing is slow, avoid it if possible.
  10. Tip 2: Use map-side joins • If reducing is slow,

    can we avoid it altogether? Datetime Site Creative Id 20120708154633 www.sample.com 7 Id Width Height 7 300 250 Map Joined tuples Small table stored in memory as a hash-map, present on each mapper.
  11. Tip 2: Use map-side joins SET hive.auto.convert.join = true; SELECT

    COUNT(1) FROM bidrequest BR JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL); Total MapReduce CPU Time Spent: 55 minutes 38 seconds 570 msec
  12. Tip 2: Use map-side joins • Map-side joins can also

    be marked explicitly. SELECT /*+ MAPJOIN(C) */ COUNT(1) FROM bidrequest BR JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL);
  13. Tip 3: Group & Order Joins • Joins are carried

    out in order, one at a time, left-to-right. • Joining to a big table is expensive. SELECT COUNT(1) FROM bidrequest BR JOIN creative C ON (BR.creativeid = C.id AND BR.creativeid IS NOT NULL) JOIN adgroup A ON (C.adgroupid = A.id) Total MapReduce CPU Time Spent: 0 days 1 hours 19 minutes 10 seconds 80 msec
  14. Tip 3: Group & Order Joins SELECT COUNT(1) FROM bidrequest

    BR JOIN creative C ON (…) JOIN adgroup A ON (C.adgroupid = A.id) Total MapReduce CPU Time Spent: 0 days 1 hours 19 minutes 10 seconds 80 msec SELECT COUNT(1) FROM adgroup A JOIN creative C ON (C.adgroupid = A.id) JOIN bidrequest BR ON (…) Total MapReduce CPU Time Spent: 0 days 1 hours 2 minutes 57 seconds 980 msec
  15. Tip 3: Group & Order Joins You can’t always reorder

    your joins, but you can always group them: SELECT /*+ MAPJOIN(AC) */ COUNT(1) FROM bidrequest BR JOIN ( SELECT C.id AS creativeid, A.id AS adgroupid FROM creative C JOIN adgroup A ON (C.adgroupid = A.id) ) AC ON (BR.creativeid = AC.creativeid AND BR.creativeid IS NOT NULL); Total MapReduce CPU Time Spent: 59 minutes 56 seconds 490 msec
  16. Tip 4: Replicating OVER … PARTITION BY • MS SQL

    Server allows you to write: SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY c) AS rownum FROM srctable; • Often useful for picking out the first row from an arbitrary partition.
  17. Tip 4: Replicating OVER … PARTITION BY • Hive doesn’t

    support this, but you can replicate the behaviour with a UDF and a little syntactic sugar. SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY c) AS rownum FROM srctable; becomes: SELECT x, y, partitionedRowNumber(a, b) AS rownum FROM ( SELECT x, y, a, b FROM srctable DISTRIBUTE BY a, b SORT BY a, b, c ) iq;
  18. Tip 4: Replicating OVER … PARTITION BY • UDF code

    available at: https://github.com/philiptromans/hive-udf
  19. Tip 5: Optimise GROUP BYs • GROUP BY is implemented

    in Hive using map-side aggregations (unless you turn it off via hive.map.aggr). • It works a little like a very aggressive combiner. • It uses a HashMap to store GROUP BY keys against values in memory. • There are obviously a few ways in which this can run out of memory…
  20. Tip 5: Optimise GROUP BYs • There are a few

    things which you can do to stop this: • Decrease mapred.max.split.size • Decrease hive.groupby.mapaggr.checkinterval • Decrease hive.map.aggr.hash.percentmemory • Decrease hive.map.aggr.hash.force.flush.memory.threshold • There are no golden rules – it all depends on your data and queries. • Keep experimenting – the more you decrease, the slower the query.
  21. Tip 6: Cluster within RCFiles • If you’re more worried

    about disk space than CPU, then this is a neat trick. INSERT OVERWRITE TABLE bigtable PARTITION(part_key=‘clustered’) SELECT … FROM bigtable WHERE part_key=‘not_clustered’ CLUSTER BY …;
  22. Tip 6: Cluster within RCFiles • If you’re more worried

    about disk space than CPU, then this is a neat trick. INSERT OVERWRITE TABLE bigtable PARTITION(part_key=‘clustered’) SELECT … FROM bigtable WHERE part_key=‘not_clustered’ CLUSTER BY …; Generally 10-30% more efficiently packed. Cluster by columns which take up the most disk space, but have little entropy. URLs are a good example.
  23. Final Tips • Look through conf/hive-default.xml.template in the Hive checkout.

    All sorts of useful options, e.g.: • hive.exec.parallel • Hive has an EXPLAIN keyword. Use it.
  24. Summary 1. Avoid NULLs in joins (push as much as

    you can into the join predicate). 2. Use map-side joins. 3. Group & order joins. 4. Replicating RANK() / OVER (PARTITION BY…). 5. Optimise GROUP BYs. 6. Cluster within RCFiles.