Yokozuna: Scaling Solr With Riak

Ryan Zezeski

June 04, 2013

Transcript

  1. 2.

    WHO AM I? • DEVELOPER @ BASHO TECHNOLOGIES • PREVIOUS @ AOL FOR ADVERTISING.COM • MOST EXPERIENCE IN JAVA & ERLANG • 2+ YEARS WORKING ON SEARCH • @RZEZESKI ON TWITTER
  2. 4.

    AGENDA • OVERVIEW OF RIAK & YOKOZUNA • DATA PARTITIONING & OWNERSHIP • HIGH AVAILABILITY & CONSISTENCY • SELF HEALING (ANTI ENTROPY) • DEMOS
  3. 5.

    WHAT IS RIAK? • KEY-VALUE STORE (+ SOME EXTRAS) • DISTRIBUTED • HIGHLY AVAILABLE • MASTERLESS • EVENTUALLY CONSISTENT • SCALE UP/DOWN
  4. 6.

    DATABASE • KEY/VALUE MODEL • BASIC SECONDARY INDEX SUPPORT • MAP/REDUCE (NOT LIKE HADOOP) • SEARCH (YOKOZUNA/SOLR)
  5. 7.

    DISTRIBUTED • MANY NODES IN LAN • RECOMMEND STARTING WITH 5 NODES • ENTERPRISE REPLICATION CAN SPAN WAN
  6. 8.

    HIGH AVAILABILITY • ALWAYS TAKE WRITES • ALWAYS SERVICE READS • FAVORS YIELD OVER HARVEST • IMPLIES EVENTUAL CONSISTENCY
  7. 9.

    MASTERLESS • NO NOTION OF MASTER OR SLAVE • ANY NODE MAY SERVICE READ/WRITE/QUERY
  8. 10.

    EVENTUALLY CONSISTENT • READS CAN BE STALE • CONCURRENT WRITES CAN CAUSE SIBLINGS • VALUES EVENTUALLY CONVERGE
  9. 11.

    YOKOZUNA • INTEGRATION OF RIAK AND SOLR • INDEX RIAK DATA WITH SOLR • DISTRIBUTE SOLR WITH RIAK • TOGETHER DO WHAT EACH ALONE CANNOT
  10. 12.

    YOKOZUNA • EACH NODE RUNS A LOCAL SOLR INSTANCE • CREATE AN INDEX WITH THE SAME NAME AS THE BUCKET • DOCUMENT IS “EXTRACTED” FROM VALUE • SUPPORTS PLAIN TEXT, XML, AND JSON • SOLR CELL SUPPORT COMING SOON
  11. 13.

    YOKOZUNA • SUPPORTS “TAGGING” • USE SOLR QUERY SYNTAX • PARAMETERS PASSED VERBATIM • IF DISTRIBUTED SEARCH SUPPORTS IT - YOKOZUNA SUPPORTS IT • NO SOLR CLOUD INVOLVED
  12. 15.

    NAIVE HASHING • NODE # = HASH(KEY) % NUM_NODES • NH(Ka) = 0 • NH(Kb) = 1 • NH(Kc) = 2 • NH(Kd) = 0 • ...
  13. 16.

    NAIVE HASHING (DIAGRAM) • KEYS Ka THROUGH Kr SPREAD ACROSS NODE 0, NODE 1, NODE 2
  14. 17.

    NAIVE HASHING (DIAGRAM) • NODE 3 ADDED • KEYS Ka THROUGH Kr SPREAD ACROSS NODE 0, NODE 1, NODE 2, NODE 3
  15. 18.

    NAIVE HASHING • K * (NN - 1) / NN => K • K = # OF KEYS • NN = # OF NODES • AS NN GROWS THE FACTOR (NN - 1) / NN ESSENTIALLY BECOMES 1, THUS NEARLY ALL KEYS MOVE
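The cost of naive hashing can be checked with a small simulation. This is a minimal sketch, assuming SHA-1 as the hash function and made-up key names (neither comes from the slides); it counts how many keys land on a different node when a cluster grows from 3 to 4 nodes.

```python
import hashlib

def stable_hash(key: str) -> int:
    # SHA-1 is an assumption here; any uniform hash would do.
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")

def naive_node(key: str, num_nodes: int) -> int:
    # NODE # = HASH(KEY) % NUM_NODES
    return stable_hash(key) % num_nodes

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    # A key "moves" when its node number differs before and after resizing.
    moved = sum(naive_node(k, old_nodes) != naive_node(k, new_nodes) for k in keys)
    return moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
frac = fraction_moved(keys, 3, 4)
print(f"fraction of keys that moved: {frac:.2f}")
```

With a uniform hash the result should sit near (NN - 1) / NN = 3/4 for NN = 4, which matches the factor on the slide.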
  16. 19.

    CONSISTENT HASHING • PARTITION # = HASH(KEY) % PARTITIONS • # PARTITIONS REMAINS CONSTANT • KEY ALWAYS MAPS TO SAME PARTITION • NODES OWN PARTITIONS • PARTITIONS CONTAIN KEYS • EXTRA LEVEL OF INDIRECTION
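The extra level of indirection can be sketched in a few lines. The example below assumes 8 partitions, SHA-1 as the hash, and an illustrative ownership table in which a new node claims two partitions; none of these specifics come from the slides. Keys never change partition, so only keys in the reassigned partitions change node.

```python
import hashlib

PARTITIONS = 8  # fixed for the life of the cluster (assumed value)

def partition_of(key: str) -> int:
    # PARTITION # = HASH(KEY) % PARTITIONS
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % PARTITIONS

def node_of(key: str, owners: list) -> str:
    # Two-step lookup: key -> partition -> owning node.
    return owners[partition_of(key)]

# Hypothetical ownership tables: node3 joins and claims partitions 0 and 4.
owners_before = [f"node{p % 3}" for p in range(PARTITIONS)]
owners_after = list(owners_before)
owners_after[0] = "node3"
owners_after[4] = "node3"

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
moved_keys = [k for k in keys if node_of(k, owners_before) != node_of(k, owners_after)]
fraction_moved = len(moved_keys) / len(keys)
print(f"fraction of keys that moved: {fraction_moved:.2f}")
```

Two of eight partitions changed hands, so roughly a quarter of the keys move, compared with about three quarters under naive hashing for the same 3 to 4 node change.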
  17. 20.

    CONSISTENT HASHING (DIAGRAM) • PARTITIONS P1 THROUGH P9 • NODE 0, NODE 1, NODE 2 • KEYS Ka THROUGH Kr
  18. 21.

    CONSISTENT HASHING (DIAGRAM) • PARTITIONS P1 THROUGH P9 • NODE 0, NODE 1, NODE 2 • KEYS Ka THROUGH Kr • NODE 3 ADDED
  19. 22.

    CONSISTENT HASHING • NN * K/Q => K/Q • K = # OF KEYS • NN = # OF NODES • Q = # OF PARTITIONS • AS K GROWS NN BECOMES CONSTANT, THUS K/Q KEYS MOVE
  20. 23.

    CONSISTENT HASHING • EVENLY DIVIDES KEYSPACE • LOGICAL PARTITIONING SEPARATED FROM PHYSICAL PARTITIONING • UNIFORM HASH GIVES UNIFORM DISTRIBUTION
  21. 25.

    THE RING • P1: ND0 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND1 • P6: ND2 • P7: ND0 • P8: ND1
  22. 26.

    THE RING • P1: ND3 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND3 • P6: ND2 • P7: ND0 • P8: ND1
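The two ring tables above can be compared directly to see what joining ND3 costs. This sketch copies the ownership lists from the two slides; the variable names are illustrative only.

```python
from collections import Counter

# Ownership of P1..P8 before and after ND3 joins, as listed on the slides.
ring_before = ["ND0", "ND1", "ND2", "ND0", "ND1", "ND2", "ND0", "ND1"]
ring_after = ["ND3", "ND1", "ND2", "ND0", "ND3", "ND2", "ND0", "ND1"]

# A partition is transferred when its owner differs between the two rings.
transfers = [
    (f"P{i + 1}", old, new)
    for i, (old, new) in enumerate(zip(ring_before, ring_after))
    if old != new
]
load_before = Counter(ring_before)
load_after = Counter(ring_after)

print(transfers)
print(dict(load_after))
```

Only P1 and P5 change owner, and afterwards every node owns two partitions. The other six partitions, and every key in them, stay where they were.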
  23. 27.

    THE RING • GOSSIPED BETWEEN NODES • EPOCH CONSENSUS BASED • MASTERLESS - ANY NODE CAN SERVICE ANY REQUEST
  24. 28.

    WRITES (INDEX) • NODE 0: INDEX Ia Id Ig Ij Im Ip • NODE 1: INDEX Ib Ie Ih Ik In Iq • NODE 2: INDEX Ic If Ii Il Io Ir • P7 P4 P1: Ka Kd Kg Kj Km Kp • P8 P5 P2: Kb Ke Kh Kk Kn Kq • P9 P6 P3: Kc Kf Ki Kl Ko Kr
  25. 29.

    READS (QUERY) • NODE 0: INDEX Ia Id Ig Ij Im Ip • NODE 1: INDEX Ib Ie Ih Ik In Iq • NODE 2: INDEX Ic If Ii Il Io Ir • Q => Q + SHARDS
  26. 32.

    “IF THE SYSTEM IS ‘DOWN’ AND NO ONE MAKES A REQUEST, IS IT REALLY DOWN?” ~ ME
  27. 36.

    DURING FAILURE OR OVERLOAD - FOR A GIVEN QUERY - YOU MUST DECIDE BETWEEN HARVEST AND YIELD
  28. 38.

    REPLICATION • N VALUE - # OF REPLICAS TO STORE • DEFAULT OF 3 • MORE REPLICAS TRADE IOPS + SPACE FOR MORE HARVEST
  29. 39.

    WRITES (DIAGRAM) • PARTITIONS P1 THROUGH P9 ON NODE 0, NODE 1, NODE 2 • KEY K AND INDEX ENTRY I
  30. 40.

    REPLICATED WRITES (DIAGRAM) • PARTITIONS P1 THROUGH P9 ON NODE 0, NODE 1, NODE 2 • KEY K WITH REPLICAS K1 K2 K3 AND INDEX ENTRIES I1 I2 I3
  31. 41.

    QUERY + REPLICATION • NOT ALL NODES NEED TO BE QUERIED • FIND COVERING SUBSET OF PARTITIONS/NODES • YOKOZUNA BUILDS THE COVERAGE PLAN - SOLR EXECUTES THE DISTRIBUTED QUERY • NO USE OF SOLR CLOUD
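Finding a covering subset is a set-cover problem, and a greedy pass gives the flavor. This is a sketch under stated assumptions, not Yokozuna's actual planner: it assumes a ring of 8 partitions, N = 3, and that the keyspace of partition p is replicated on partitions p, p + 1, and p + 2 (wrapping around). The function name and the `down` parameter are hypothetical.

```python
def coverage_plan(num_partitions: int, n_val: int, down=frozenset()) -> dict:
    """Greedy cover: map each chosen partition to the keyspaces it answers for."""
    uncovered = set(range(num_partitions))
    plan = {}
    while uncovered:
        best, best_cover = None, set()
        for v in range(num_partitions):
            if v in down or v in plan:
                continue
            # Partition v holds replicas of keyspaces v, v-1, ..., v-(n_val-1).
            cover = {(v - j) % num_partitions for j in range(n_val)} & uncovered
            if len(cover) > len(best_cover):
                best, best_cover = v, cover
        if best is None:
            # No live replica remains for some keyspace: refuse the query.
            raise RuntimeError("cannot cover all partitions")
        plan[best] = sorted(best_cover)
        uncovered -= best_cover
    return plan

plan = coverage_plan(8, 3)
print(plan)
```

With all partitions up, three of the eight are enough to see every keyspace once. Passing `down={1, 2, 3}` removes every replica of keyspace 1, so the sketch raises instead of returning a partial plan.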
  32. 42.

    SLOPPY QUORUM • N REPLICAS IMPLIES IDEA OF “PREFERENCE LIST” • SOME PARTITIONS ARE THE “PRIMARIES” - OTHERS ARE “SECONDARIES” • SLOPPY = ALLOW NON-PRIMARY TO STORE REPLICAS • 100% YIELD - BUT POTENTIALLY DEGRADED HARVEST
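A preference list with fallbacks can be sketched as a walk around the ring. The example assumes the 8-partition ring with ND0 through ND3 shown earlier in the deck, N = 3, and a simple "keep walking until N live partitions are found" rule; Riak's real fallback selection may differ, and the function name is hypothetical.

```python
def preference_list(ring: list, start: int, n_val: int, down: set) -> list:
    """Return up to n_val (partition, role) pairs, skipping partitions on down nodes."""
    size = len(ring)
    # The first n_val partitions from the start are the primaries.
    primaries = [(start + i) % size for i in range(n_val)]
    chosen = [(p, "primary") for p in primaries if ring[p] not in down]
    # Sloppy part: keep walking the ring and let secondaries stand in.
    step = n_val
    while len(chosen) < n_val and step < size:
        p = (start + step) % size
        if ring[p] not in down:
            chosen.append((p, "fallback"))
        step += 1
    return chosen

# Index 0 is P1, index 7 is P8.
ring = ["ND3", "ND1", "ND2", "ND0", "ND3", "ND2", "ND0", "ND1"]
healthy = preference_list(ring, 0, 3, down=set())
degraded = preference_list(ring, 0, 3, down={"ND1"})
print(healthy)
print(degraded)
```

With ND1 down the write still finds three places to land, so yield stays at 100%, but one of them is a fallback that later reads of the primaries will not see until the data is handed back.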
  33. 43.

    TUNABLE QUORUM • R - # OF PARTITIONS TO VERIFY READ • W - # OF PARTITIONS TO VERIFY WRITE • PR/PW - # OF PARTITIONS WHICH MUST BE PRIMARY • ALLOWS YOU TO TRADE YIELD FOR HARVEST - PER REQUEST
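The per-request trade can be illustrated with a tiny check. This is a sketch, assuming each reply is tagged as coming from a primary or a fallback partition; the function and its parameters are hypothetical stand-ins for R/W and PR/PW, not a client API.

```python
def request_succeeds(replies: list, needed: int, primaries_needed: int = 0) -> bool:
    """needed plays the role of R or W; primaries_needed plays the role of PR or PW."""
    primaries = sum(1 for kind in replies if kind == "primary")
    return len(replies) >= needed and primaries >= primaries_needed

# One primary is down, so a fallback answered in its place.
replies = ["primary", "primary", "fallback"]

print(request_succeeds(replies, needed=2))                      # lenient request
print(request_succeeds(replies, needed=3, primaries_needed=3))  # strict request
```

The same cluster state serves the lenient request and rejects the strict one: asking for more primaries gives up yield in exchange for harvest.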
  34. 44.

    SIBLINGS • NO MASTER TO SERIALIZE OPS • CONCURRENT ACTORS ON SAME KEY • OPERATIONS CAN INTERLEAVE • USE VCLOCKS TO DETECT CONFLICT • CREATE SIBLINGS - LET CLIENT FIX • INDEX ALL SIBLINGS
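Conflict detection with vector clocks comes down to a "descends" test. The sketch below models a clock as a dict from actor name to counter, which is a simplification (real Riak vclocks carry more, such as timestamps for pruning), and the function names are illustrative.

```python
def descends(a: dict, b: dict) -> bool:
    """True when clock a has seen every event that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def put(siblings: list, clock: dict, value: str) -> list:
    """siblings is a list of (clock, value) pairs; return the list after a write."""
    if any(descends(existing, clock) for existing, _ in siblings):
        return siblings  # the write is stale or a duplicate
    # Drop values the new write supersedes; keep concurrent ones as siblings.
    kept = [(c, v) for c, v in siblings if not descends(clock, c)]
    return kept + [(clock, value)]

state = put([], {"x": 1}, "v1")
state = put(state, {"x": 1, "y": 1}, "v2")  # descends from v1: replaces it
state = put(state, {"x": 2}, "v3")          # concurrent with v2: sibling
print([value for _, value in state])
```

Neither of the last two clocks descends from the other, so both values survive as siblings for the client to resolve, and per the slide each sibling would be indexed.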
  35. 46.

    HINTED HANDOFF • WHEN NODES GO DOWN DATA IS WRITTEN TO SECONDARY PARTITIONS • WHEN NODES COME BACK NEED TO GIVE THE DATA TO PRIMARY OWNER • AS DATA IS HANDED OFF INDEX IT ON DESTINATION NODE
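The handoff step can be sketched as a move plus a re-index. This assumes plain dicts for the fallback and primary stores and a callback standing in for indexing into the destination node's local Solr; all names are hypothetical.

```python
def hinted_handoff(fallback: dict, primary: dict, index_on_destination) -> None:
    """Move every hinted object to its primary owner and index it there."""
    for key in list(fallback):
        value = fallback.pop(key)         # the fallback gives up its copy
        primary[key] = value              # the primary owner stores it
        index_on_destination(key, value)  # index as the data arrives

fallback_store = {"K2": "doc body"}  # written while the primary was down
primary_store = {}
indexed = []

hinted_handoff(fallback_store, primary_store, lambda k, v: indexed.append(k))
print(primary_store, indexed)
```

Indexing during handoff means the destination's index catches up with its key-value data in the same pass, rather than waiting for a later repair.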
  36. 47.

    HINTED HANDOFF (DIAGRAM) • RING P1: ND3 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND3 • P6: ND2 • P7: ND0 • P8: ND1 • WRITE K • REPLICAS K1 K2 K3
  37. 48.

    HINTED HANDOFF (DIAGRAM) • RING P1: ND3 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND3 • P6: ND2 • P7: ND0 • P8: ND1 • WRITE K • REPLICAS K1 K2 K3
  38. 49.

    HINTED HANDOFF (DIAGRAM) • RING P1: ND3 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND3 • P6: ND2 • P7: ND0 • P8: ND1 • WRITE K • REPLICAS K1 K2 K3
  39. 50.

    HINTED HANDOFF (DIAGRAM) • RING P1: ND3 • P2: ND1 • P3: ND2 • P4: ND0 • P5: ND3 • P6: ND2 • P7: ND0 • P8: ND1 • WRITE K • REPLICAS K1 K2 K3 • K2 SHOWN IN A SECOND POSITION
  40. 51.

    READ REPAIR • REPLICAS MAY NOT AGREE • REPLICAS MAY BE LOST • CHECK REPLICA VALUES DURING READ • FIX IF THEY DISAGREE • SEND NEW VALUE TO EACH REPLICA
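Read repair can be sketched with a version number per replica. Using a single integer version is an assumption made to keep the example short (the deck describes vclocks for this); replica names and the function are hypothetical.

```python
def read_repair(replicas: dict):
    """replicas maps name -> (version, value), or None when the replica is lost.

    Returns the winning value and the names of the replicas that were fixed.
    """
    present = [entry for entry in replicas.values() if entry is not None]
    newest = max(present, key=lambda entry: entry[0])
    # Any replica that is missing or behind gets the newest value.
    repaired = [name for name, entry in replicas.items() if entry != newest]
    for name in repaired:
        replicas[name] = newest
    return newest[1], repaired

replicas = {"r1": (2, "new"), "r2": (1, "old"), "r3": None}
value, repaired = read_repair(replicas)
print(value, repaired)
```

The read returns the newest value and, as a side effect, leaves all three replicas in agreement.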
  41. 52.

    ACTIVE ANTI-ENTROPY • 2 SYSTEMS (RIAK & SOLR) - GREATER CHANCE FOR INCONSISTENCY • FILES CAN BECOME TRUNCATED/CORRUPTED • ACCIDENTAL RM -RF • SEGFAULT AT THE RIGHT TIME • ETC
  42. 58.

    AAE - MERKLE TREES • IT’S A HASH TREE • IT’S ABOUT EFFICIENCY • BILLIONS OF OBJECTS CAN BE COMPARED AT COST OF COMPARING 2 HASHES (WIN!)
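The efficiency claim can be seen in a small in-memory hash tree. This is an illustrative sketch, not Riak's on-disk implementation: it assumes SHA-1, 8 leaf segments, and two dicts standing in for the key-value data and the index.

```python
import hashlib

SEGMENTS = 8  # must be a power of two for the pairwise tree below

def digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def segment_of(key: str) -> int:
    return int.from_bytes(digest(key.encode()), "big") % SEGMENTS

def build_tree(objects: dict) -> list:
    """Return hash levels: tree[0] holds the leaf segments, tree[-1] the root."""
    buckets = [[] for _ in range(SEGMENTS)]
    for key in sorted(objects):
        entry = key.encode() + b"\x00" + digest(objects[key].encode())
        buckets[segment_of(key)].append(entry)
    level = [digest(b"".join(bucket)) for bucket in buckets]
    tree = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def differing_segments(a: list, b: list) -> list:
    if a[-1] == b[-1]:
        return []  # one comparison of two root hashes: everything agrees
    suspects = [0]
    for depth in range(len(a) - 2, -1, -1):
        # Descend only into children whose hashes disagree.
        suspects = [child for parent in suspects
                    for child in (2 * parent, 2 * parent + 1)
                    if a[depth][child] != b[depth][child]]
    return suspects

kv_data = {f"key-{i}": f"value-{i}" for i in range(1000)}
index_data = dict(kv_data)
index_data["key-42"] = "stale"  # simulate one diverged entry

print(differing_segments(build_tree(kv_data), build_tree(index_data)))
```

When the two sides agree, the comparison stops at the roots. When they diverge, only the path down to the affected segment is examined, so repair work is limited to the keys in that segment.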
  43. 65.

    AAE • DURABLE TREES • UPDATED IN REAL TIME • NON-BLOCKING • PERIODICALLY EXCHANGED • INVOKE READ-REPAIR AND RE-INDEX ON DIVERGENCE • PERIODICALLY REBUILT
  44. 76.

    INDEX SOME DATA • COMMIT LOG HISTORY OF VARIOUS BASHO REPOS • INDEX REPO NAME AND COMMIT AUTHOR, DATE, SUBJECT, BODY • USED BASHO BENCH TO LOAD DATA
  45. 77.
  46. 78.
  47. 79.

    QUERY • QUERY FROM ANY NODE • USE SOLR SYNTAX • RETURN SOLR RESULT VERBATIM • CAN USE EXISTING SOLR CLIENTS (FOR QUERY, NOT WRITE)
  48. 84.

    QUERY (DOWN) • INDEX REPLICATION ALLOWS FOR QUERY AVAILABILITY • JUST NEED 1 REPLICA OF INDEX • IF TOO MANY NODES GO DOWN YOKOZUNA WILL REFUSE QUERY • PREFERS 100% HARVEST