Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Hive Optimisation - Tips & Tricks

Hive Optimisation - Tips & Tricks

Slides from the talk given at the inaugural Hive London meetup (http://www.meetup.com/HiveLondon/events/69366482/). Covers a few odds and ends around making Hive queries faster through join optimisation etc.

Philip Tromans

July 10, 2012
Tweet

Other Decks in Programming

Transcript

  1. HIVE OPTIMISATION –
    TIPS & TRICKS
    Philip Tromans
    Scientific Engineer
    10th July 2012

    View Slide

  2. Agenda
    • Infrastructure & data pipeline.
    • Tips:
    • Join optimization
    • Replicating RANK()
    • Optimizing GROUP BY
    • More efficient file storage with RCFile and CLUSTER BY

    View Slide

  3. Our Infrastructure
    • SoftLayer Bare Metal Servers
    • Intel Xeon-SandyBridge E3-1270-Quadcore [3.4GHz]
    • 16GB RAM
    • 4x2TB Hard Disks (1 local, 3 JBOD for HDFS)
    • Hive 0.9.0 (approx)

    View Slide

  4. Data Pipeline
    Ad
    Servers
    RTB
    Servers
    Recommendation
    Servers
    Log
    Broker
    SQL
    Hive
    DR

    View Slide

  5. Getting Data Into Hive
    • Log files arrive, divided into (server, 5 minute time
    window, entity type) chunks.
    • HoneyMaker aggregates these together (atomically), and
    ensures events are in the correct time bucket. Pushes to
    HDFS.
    • Apiary polls HDFS, and loads data into Hive.

    View Slide

  6. Apiary (Workflow Management)
    • Allows the creation of tasks.
    • Tasks can be chained together with events propagated
    through an Observer model.
    • Events are fired when data arrives – eliminates the need
    for cron jobs etc.
    • Manages pulling data from, and pushing computations
    back into SQL server.

    View Slide

  7. MapReduce Joins
    SELECT * FROM impressions I JOIN creative C ON (I.creativeid = C.id);
    (7, { datetime: ‘20120708154633’, site:
    ‘www.sample.com’ })
    (7, { width: 300, height: 250 })
    Map
    20120708154633 www.sample.com 7 300 250
    Reduce

    View Slide

  8. Tip 1: Avoid NULLs in joins
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id);
    In our schema, a bid request gets assigned a creative id if
    we want to bid on the user. We decide not to bid on most
    requests, so most creative ids are NULL.

    View Slide

  9. Tip 1: Avoid NULLs in joins
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id);
    Total MapReduce CPU Time Spent: 0 days 2 hours
    10 minutes 42 seconds 20 msec
    Task Time
    task_201207020930_3180_r_000000 22mins, 44sec
    task_201207020930_3180_r_000001 5mins, 21sec
    task_201207020930_3180_r_000002 5mins, 21sec
    task_201207020930_3180_r_000003 5mins, 18sec
    task_201207020930_3180_r_000004 5mins, 15sec
    … …

    View Slide

  10. Tip 1: Avoid NULLs in joins
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id);
    Total MapReduce CPU Time Spent: 0 days 2 hours
    10 minutes 42 seconds 20 msec
    Task Time
    task_201207020930_3180_r_000000 22mins, 44sec
    task_201207020930_3180_r_000001 5mins, 21sec
    task_201207020930_3180_r_000002 5mins, 21sec
    task_201207020930_3180_r_000003 5mins, 18sec
    task_201207020930_3180_r_000004 5mins, 15sec
    … …
    • Records with the NULL key all end up at the same
    reducer (hash(NULL) mod p = 0)
    • This skews the load, breaking parallelism.
    • Null keys will never be INNER JOINed, so it’s pointless for
    them to get to the reduce phase.

    View Slide

  11. Tip 1: Avoid NULLs in joins
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id);
    Total MapReduce CPU Time Spent:
    0 days 2 hours 10 minutes 42 seconds 20 msec
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id
    AND BR.creativeid IS NOT NULL);
    Total MapReduce CPU Time Spent:
    0 days 1 hours 9 minutes 44 seconds 660 msec
    Moral: Reducing is slow, avoid it if possible.

    View Slide

  12. Tip 2: Use map-side joins
    • If reducing is slow, can we avoid it altogether?
    Datetime Site Creative Id
    20120708154633 www.sample.com 7
    Id Width Height
    7 300 250
    Map
    Joined tuples
    Small table stored in memory as a hash-map, present on
    each mapper.

    View Slide

  13. Tip 2: Use map-side joins
    SET hive.auto.convert.join = true;
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id
    AND BR.creativeid IS NOT NULL);
    Total MapReduce CPU Time Spent: 55 minutes 38
    seconds 570 msec

    View Slide

  14. Tip 2: Use map-side joins
    • Map-side joins can also be marked explicitly.
    SELECT /*+ MAPJOIN(C) */ COUNT(1) FROM
    bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id
    AND BR.creativeid IS NOT NULL);

    View Slide

  15. Tip 3: Group & Order Joins
    • Joins are carried out in order, one at a time, left-to-right.
    • Joining to a big table is expensive.
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (BR.creativeid = C.id
    AND BR.creativeid IS NOT NULL)
    JOIN adgroup A ON (C.adgroupid = A.id)
    Total MapReduce CPU Time Spent:
    0 days 1 hours 19 minutes 10 seconds 80 msec

    View Slide

  16. Tip 3: Group & Order Joins
    SELECT COUNT(1) FROM bidrequest BR
    JOIN creative C ON (…)
    JOIN adgroup A ON (C.adgroupid = A.id)
    Total MapReduce CPU Time Spent:
    0 days 1 hours 19 minutes 10 seconds 80 msec
    SELECT COUNT(1) FROM adgroup A
    JOIN creative C ON (C.adgroupid = A.id)
    JOIN bidrequest BR ON (…)
    Total MapReduce CPU Time Spent:
    0 days 1 hours 2 minutes 57 seconds 980 msec

    View Slide

  17. Tip 3: Group & Order Joins
    You can’t always reorder your joins, but you can always group
    them:
    SELECT /*+ MAPJOIN(AC) */ COUNT(1) FROM bidrequest
    BR
    JOIN (
    SELECT C.id AS creativeid, A.id AS
    adgroupid FROM creative C
    JOIN adgroup A ON (C.adgroupid = A.id)
    ) AC
    ON (BR.creativeid = AC.creativeid AND
    BR.creativeid IS NOT NULL);
    Total MapReduce CPU Time Spent:
    59 minutes 56 seconds 490 msec

    View Slide

  18. Tip 4: Replicating OVER … PARTITION BY
    • MS SQL Server allows you to write:
    SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a,
    b ORDER BY c) AS rownum FROM srctable;
    • Often useful for picking out the first row from an arbitrary
    partition.

    View Slide

  19. Tip 4: Replicating OVER … PARTITION BY
    • Hive doesn’t support this, but you can replicate the
    behaviour with a UDF and a little syntactic sugar.
    SELECT x, y, ROW_NUMBER() OVER (PARTITION BY a,
    b ORDER BY c) AS rownum FROM srctable;
    becomes:
    SELECT x, y, partitionedRowNumber(a, b) AS
    rownum FROM (
    SELECT x, y, a, b FROM srctable
    DISTRIBUTE BY a, b
    SORT BY a, b, c
    ) iq;

    View Slide

  20. Tip 4: Replicating OVER … PARTITION BY
    • UDF code available at:
    https://github.com/philiptromans/hive-udf

    View Slide

  21. Tip 5: Optimise GROUP BYs
    • GROUP BY is implemented in Hive using map-side
    aggregations (unless you turn it off via hive.map.aggr).
    • It works a little like a very aggressive combiner.
    • It uses a HashMap to store GROUP BY keys against
    values in memory.
    • There are obviously a few ways in which this can run out
    of memory…

    View Slide

  22. Tip 5: Optimise GROUP BYs
    • There are a few things which you can do to stop this:
    • Decrease mapred.max.split.size
    • Decrease hive.groupby.mapaggr.checkinterval
    • Decrease hive.map.aggr.hash.percentmemory
    • Decrease hive.map.aggr.hash.force.flush.memory.threshold
    • There are no golden rules – it all depends on your data and queries.
    • Keep experimenting – the more you decrease, the slower the query.

    View Slide

  23. Tip 6: Cluster within RCFiles
    • If you’re more worried about disk space than CPU, then
    this is a neat trick.
    INSERT OVERWRITE TABLE bigtable
    PARTITION(part_key=‘clustered’)
    SELECT … FROM bigtable WHERE
    part_key=‘not_clustered’
    CLUSTER BY …;

    View Slide

  24. Tip 6: Cluster within RCFiles
    • If you’re more worried about disk space than CPU, then
    this is a neat trick.
    INSERT OVERWRITE TABLE bigtable
    PARTITION(part_key=‘clustered’)
    SELECT … FROM bigtable WHERE
    part_key=‘not_clustered’
    CLUSTER BY …;
    Generally 10-30% more efficiently packed.
    Cluster by columns which take up
    the most disk space, but have little
    entropy. URLs are a good example.

    View Slide

  25. Final Tips
    • Look through conf/hive-default.xml.template in the Hive
    checkout. All sorts of useful options, e.g.:
    • hive.exec.parallel
    • Hive has an EXPLAIN keyword. Use it.

    View Slide

  26. Summary
    1. Avoid NULLs in joins (push as much as you can into the
    join predicate).
    2. Use map-side joins.
    3. Group & order joins.
    4. Replicating RANK() / OVER (PARTITION BY…).
    5. Optimise GROUP BYs.
    6. Cluster within RCFiles.

    View Slide

  27. Any Questions?
    [email protected]
    • philiptromans.me
    Struq are hiring!
    [email protected]
    http://blog.struq.com/category/careers/

    View Slide