
Yokozuna: Scaling Solr With Riak

Ryan Zezeski

June 04, 2013

Transcript

  1. Yokozuna, Scaling Solr With Riak Ryan Zezeski Berlin Buzzwords -

    June 4th 2013 1
  2. WHO AM I? • DEVELOPER @ BASHO TECHNOLOGIES • PREVIOUS

    @ AOL FOR ADVERTISING.COM • MOST EXPERIENCE IN JAVA & ERLANG • 2+ YEARS WORKING ON SEARCH • @RZEZESKI ON TWITTER 2
  3. NOT TALKING ABOUT SEARCH 3

  4. AGENDA • OVERVIEW OF RIAK & YOKOZUNA • DATA PARTITIONING

    & OWNERSHIP • HIGH AVAILABILITY & CONSISTENCY • SELF HEALING (ANTI ENTROPY) • DEMOS 4
  5. WHAT IS RIAK? • KEY-VALUE STORE (+ SOME EXTRAS) •

    DISTRIBUTED • HIGHLY AVAILABLE • MASTERLESS • EVENTUALLY CONSISTENT • SCALE UP/DOWN 5
  6. DATABASE • KEY/VALUE MODEL • BASIC SECONDARY INDEX SUPPORT •

    MAP/REDUCE (NOT LIKE HADOOP) • SEARCH (YOKOZUNA/SOLR) 6
  7. DISTRIBUTED • MANY NODES IN LAN • RECOMMEND STARTING WITH

    5 • ENTERPRISE REPLICATION CAN SPAN WAN 7
  8. HIGH AVAILABILITY • ALWAYS TAKE WRITES • ALWAYS SERVICE READS

    • FAVORS YIELD OVER HARVEST • IMPLIES EVENTUAL CONSISTENCY 8
  9. MASTERLESS • NO NOTION OF MASTER OR SLAVE • ANY

    NODE MAY SERVICE READ/WRITE/ QUERY 9
  10. EVENTUALLY CONSISTENT • READS CAN BE STALE • CONCURRENT WRITES

    CAN CAUSE SIBLINGS • EVENTUALLY VALUES CONVERGE 10
  11. YOKOZUNA • INTEGRATION OF RIAK AND SOLR • INDEX RIAK

    DATA WITH SOLR • DISTRIBUTE SOLR WITH RIAK • TOGETHER DO WHAT EACH ALONE CANNOT 11
  12. YOKOZUNA • EACH NODE RUNS A LOCAL SOLR INSTANCE •

    CREATE AN INDEX SAME AS BUCKET NAME • DOCUMENT IS “EXTRACTED” FROM VALUE • SUPPORTS PLAIN TEXT, XML, AND JSON • SOLR CELL SUPPORT COMING SOON 12
  13. YOKOZUNA • SUPPORTS “TAGGING” • USE SOLR QUERY SYNTAX •

    PARAMETERS PASSED VERBATIM • IF DISTRIBUTED SEARCH SUPPORTS IT - YOKOZUNA SUPPORTS IT • NO SOLR CLOUD INVOLVED 13
  14. PARTITIONING & OWNERSHIP Aufteilen & Eigentum 14

  15. NAIVE HASHING NODE # = HASH(KEY) % NUM_NODES NH(Ka) =

    0 NH(Kb) = 1 NH(Kc) = 2 NH(Kd) = 0 ... 15
  16. NAIVE HASHING NODE 0 NODE 1 NODE 2 Ka Kb

    Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr 16
  17. NAIVE HASHING NODE 0 NODE 1 NODE 2 Ka Kb

    Kc Kd Kg Ki NODE 3 Ke Kf Kh Kj Kk Kl Km Kn Ko Kp Kq Kr 17
  18. NAIVE HASHING K * (NN - 1) / NN =>

    K • K = # OF KEYS • NN = # OF NODES • AS NN GROWS FACTOR ESSENTIALLY BECOMES 1, THUS ALL KEYS MOVE 18
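The arithmetic above can be checked directly. A minimal sketch (plain Python, not Riak code) hashes 10,000 keys onto 3 nodes, adds a fourth, and counts the keys that change owner:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # NODE # = HASH(KEY) % NUM_NODES
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"key{i}" for i in range(10_000)]
before = {k: node_for(k, 3) for k in keys}
after = {k: node_for(k, 4) for k in keys}  # grow the cluster: 3 -> 4 nodes
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # ~ (NN-1)/NN = 75%
```

With NN = 4 the factor (NN - 1)/NN predicts roughly 75% of keys moving, and the measured fraction lands there.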
  19. CONSISTENT HASHING PARTITION # = HASH(KEY) % PARTITIONS • #

    PARTITIONS REMAINS CONSTANT • KEY ALWAYS MAPS TO SAME PARTITION • NODES OWN PARTITIONS • PARTITIONS CONTAIN KEYS • EXTRA LEVEL OF INDIRECTION 19
  20. P9 P6 P3 P8 P5 P2 P7 P4 P1 CONSISTENT

    HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr 20
  21. P9 P6 P3 P8 P5 P2 P7 P4 P1 CONSISTENT

    HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr NODE 3 21
  22. CONSISTENT HASHING (Q/NN) * (K/Q) => K/NN • K =

    # OF KEYS • NN = # OF NODES • Q = # OF PARTITIONS • EACH PARTITION HOLDS ~K/Q KEYS AND A NEW NODE CLAIMS ~Q/NN OF THEM, THUS ONLY ~K/NN KEYS MOVE 22
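The same experiment under consistent hashing, as a hedged sketch: the ring size Q = 8 and the round-robin claim below are toy choices (Riak's default ring is larger), but the effect is the point. Keys map to stable partitions, and only the partitions the new node claims move:

```python
import hashlib

Q = 8  # partition count, fixed for the life of the cluster (toy-sized)

def partition_for(key: str) -> int:
    # PARTITION # = HASH(KEY) % PARTITIONS -- stable as nodes come and go
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % Q

keys = [f"key{i}" for i in range(10_000)]
owners3 = {p: p % 3 for p in range(Q)}  # 3 nodes claim partitions round-robin
owners4 = dict(owners3)
for p in (3, 7):                        # new node claims ~Q/NN = 2 partitions
    owners4[p] = 3
moved = sum(1 for k in keys
            if owners3[partition_for(k)] != owners4[partition_for(k)])
print(f"{moved / len(keys):.0%} of keys moved")  # ~2 partitions' worth, ~25%
```

Only the keys living in the two claimed partitions move; every other key keeps its owner.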
  23. CONSISTENT HASHING • EVENLY DIVIDES KEYSPACE • LOGICAL PARTITIONING SEPARATED

    FROM PHYSICAL PARTITIONING • UNIFORM HASH GIVES UNIFORM DISTRIBUTION 23
  24. THE RING P1 P2 P3 P4 P5 P6 P7 P8

    24
  25. THE RING P1 P2 P3 P4 P5 P6 P7 P8

    ND0 ND1 ND2 ND0 ND1 ND2 ND0 ND1 25
  26. THE RING P1 P2 P3 P4 P5 P6 P7 P8

    ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 26
  27. THE RING • GOSSIPED BETWEEN NODES • EPOCH CONSENSUS BASED

    • MASTERLESS - ANY NODE CAN SERVICE ANY REQUEST 27
  28. WRITES (INDEX) NODE 0 NODE 1 NODE 2 Ia Id

    Ig Ij Im Ip Ib Ie Ih Ik In Iq Ic If Ii Il Io Ir P7 P4 P1 Ka Kd Kg Kj Km Kp P8 P5 P2 Kb Ke Kh Kk Kn Kq P9 P6 P3 Kc Kf Ki Kl Ko Kr 28
  29. READS (QUERY) NODE 0 NODE 1 NODE 2 Ia Id

    Ig Ij Im Ip Ib Ie Ih Ik In Iq Ic If Ii Il Io Ir Q Q + SHARDS 29
  30. HIGH AVAILABILITY Hochverfügbarkeit 30

  31. UPTIME IS A POOR METRIC 31

  32. “IF THE SYSTEM IS ‘DOWN’ AND NO ONE MAKES A

    REQUEST, IS IT REALLY DOWN?” ~ ME 32
  33. HARVEST VS YIELD 33

  34. YIELD = QUERIES COMPLETED / QUERIES OFFERED 34

  35. HARVEST = DATA AVAILABLE / COMPLETE DATA 35

  36. DURING FAILURE OR OVERLOAD - FOR A GIVEN QUERY -

    YOU MUST DECIDE BETWEEN HARVEST OR YIELD 36
  37. MAINTAIN HARVEST VIA REPLICATION 37

  38. REPLICATION • N VALUE - # OF REPLICAS TO STORE

    • DEFAULT OF 3 • MORE REPLICAS TRADES IOPS + SPACE FOR MORE HARVEST 38
  39. P9 P8 P7 P6 P5 P4 P3 P2 P1 WRITES

    NODE 0 NODE 1 NODE 2 K I K 39
  40. P9 P8 P7 P6 P5 P4 P3 P2 P1 REPLICATED

    WRITES NODE 0 NODE 1 NODE 2 K1 K2 K3 I1 I2 I3 K 40
  41. QUERY + REPLICATION • NOT ALL NODES NEED TO BE

    QUERIED • FIND COVERING SUBSET OF PARTITIONS/NODES • YOKOZUNA BUILDS THE COVERAGE PLAN - SOLR EXECUTES THE DISTRIBUTED QUERY • NO USE OF SOLR CLOUD 41
  42. SLOPPY QUORUM • N REPLICAS IMPLIES IDEA OF “PREFERENCE LIST”

    • SOME PARTITIONS ARE THE “PRIMARIES” - OTHERS ARE “SECONDARY” • SLOPPY = ALLOW NON-PRIMARY TO STORE REPLICAS • 100% YIELD - BUT POTENTIALLY DEGRADED HARVEST 42
  43. TUNABLE QUORUM • R - # OF PARTITIONS TO VERIFY READ

    • W - # OF PARTITIONS TO VERIFY WRITE • PR/PW - # OF PARTITIONS WHICH MUST BE PRIMARY • ALLOWS YOU TO TRADE YIELD FOR HARVEST - PER REQUEST 43
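A toy model (not Riak's implementation) of how W and PW interact on a single write; `Ack` and `write_quorum_met` are illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Ack:
    partition: int
    primary: bool  # primary owner acked, vs. a sloppy-quorum fallback

def write_quorum_met(acks: list, w: int, pw: int) -> bool:
    # W acknowledgements total, PW of which must come from primaries
    primaries = sum(1 for a in acks if a.primary)
    return len(acks) >= w and primaries >= pw

# N=3 write during a failure: one primary is down, a fallback stood in
acks = [Ack(1, True), Ack(2, True), Ack(3, False)]
print(write_quorum_met(acks, w=3, pw=3))  # False: only 2 primary acks
print(write_quorum_met(acks, w=3, pw=2))  # True: yield over harvest
```

Lowering PW lets the same request succeed during the failure, which is the per-request yield-for-harvest trade the slide describes.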
  44. SIBLINGS • NO MASTER TO SERIALIZE OPS • CONCURRENT ACTORS

    ON SAME KEY • OPERATIONS CAN INTERLEAVE • USE VCLOCKS TO DETECT CONFLICT • CREATE SIBLINGS - LET CLIENT FIX • INDEX ALL SIBLINGS 44
  45. SELF HEALING Selbstheilung 45

  46. HINTED HANDOFF • WHEN NODES GO DOWN DATA WRITTEN TO

    SECONDARY PARTITIONS • WHEN NODES COME BACK NEED TO GIVE THE DATA TO PRIMARY OWNER • AS DATA IS HANDED OFF INDEX IT ON DESTINATION NODE 46
  47. P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1

    ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 47
  48. P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1

    ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 48
  49. P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1

    ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 49
  50. P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1

    ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE K2 HINTED HANDOFF 50
  51. READ REPAIR • REPLICAS MAY NOT AGREE • REPLICAS MAY

    BE LOST • CHECK REPLICA VALUES DURING READ • FIX IF THEY DISAGREE • SEND NEW VALUE TO EACH REPLICA 51
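A hedged sketch of the read-repair idea; it uses simple (version, value) pairs where Riak actually compares vector clocks:

```python
def read_with_repair(replicas: dict, key: str):
    # a toy model: (version, value) pairs stand in for Riak's vclocks
    found = {n: store[key] for n, store in replicas.items() if key in store}
    winner = max(found.values())        # highest version wins here
    for node, store in replicas.items():
        if store.get(key) != winner:    # stale or missing replica...
            store[key] = winner         # ...gets the new value pushed back
    return winner

replicas = {
    "nd0": {"k": (2, "new")},
    "nd1": {"k": (1, "old")},  # stale replica
    "nd2": {},                 # lost replica
}
print(read_with_repair(replicas, "k"))             # (2, 'new')
print(replicas["nd1"]["k"], replicas["nd2"]["k"])  # both repaired
```

The read itself still succeeds; repair happens as a side effect, which is why divergence only costs you until the next read of that key.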
  52. ACTIVE ANTI-ENTROPY • 2 SYSTEMS (RIAK & SOLR) -

    GREATER CHANCE FOR INCONSISTENCY • FILES CAN BECOME TRUNCATED/CORRUPTED • ACCIDENTAL RM -RF • SEGFAULT AT THE RIGHT TIME • ETC 52
  53. MYRIAD OF FAILURE SCENARIOS - FROM OBVIOUS TO NEARLY INVISIBLE

    53
  54. AAE - MERKLE TREES 54

  55. AAE - MERKLE TREES EACH SEGMENT IS LIST OF KEY-HASH

    PAIRS 55
  56. AAE - MERKLE TREES HASH OF HASHES IN SEGMENT 56

  57. AAE - MERKLE TREES HASH OF HASHES OF HASHES OF

    HASHES :) 57
  58. AAE - MERKLE TREES • IT’S A HASH TREE •

    IT’S ABOUT EFFICIENCY • BILLIONS OF OBJECTS CAN BE COMPARED AT COST OF COMPARING 2 HASHES (WIN!) 58
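The hash-tree comparison can be sketched in a few lines. This is illustrative Python, not riak_core's AAE code, and for brevity the mismatch step compares leaf hashes directly rather than walking down level by level:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(segments):
    # leaves: one hash per segment of key/hash pairs; each upper level
    # hashes pairs of child hashes until a single root remains
    level = [h(repr(sorted(seg.items())).encode()) for seg in segments]
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def divergent_segments(a, b):
    if a[-1] == b[-1]:
        return []  # roots match: whole trees compared with one check
    # roots differ: find the bad segments among the leaf hashes
    # (a real exchange narrows down level by level instead)
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

# 64 keys spread over 4 segments, then corrupt one key's hash
kv = {f"k{i}": h(f"v{i}".encode()) for i in range(64)}
segments_a = [dict(list(kv.items())[i::4]) for i in range(4)]
segments_b = [dict(seg) for seg in segments_a]
segments_b[2]["k2"] = h(b"corrupted")
print(divergent_segments(build(segments_a), build(segments_b)))  # [2]
```

When nothing diverges, the exchange costs one root comparison no matter how many keys sit underneath; only a mismatch forces a descent into the tree.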
  59. AAE - EXCHANGE 59

  60. AAE - EXCHANGE TOP HASHES DON’T MATCH - SOMETHING IS

    DIFFERENT 60
  61. AAE - EXCHANGE NARROW DOWN THE DIVERGENT SEGMENT 61

  62. AAE - EXCHANGE NARROW DOWN THE DIVERGENT SEGMENT CONT... 62

  63. AAE - EXCHANGE ITERATE FINAL LIST OF HASHES TO FIND

    DIVERGENT KEYS 63
  64. AAE - EXCHANGE REPAIR (RE-INDEX) KEYS THAT ARE DIVERGENT (RED)

    64
  65. AAE • DURABLE TREES • UPDATED IN REAL TIME •

    NON-BLOCKING • PERIODICALLY EXCHANGED • INVOKE READ-REPAIR AND RE-INDEX ON DIVERGENCE • PERIODICALLY REBUILT 65
  66. CODE FOR DETECTION AND REPAIR - NOT PREVENTION 66

  67. DEMONSTRATION Vorführung 67

  68. CREATE CLUSTER 68

  69. START 5 NODES 69

  70. JOIN NODES 70

  71. CREATE PLAN 71

  72. COMMIT PLAN 72

  73. CHECK MEMBERSHIP 73

  74. STORE SCHEMA 74

  75. CREATE INDEX 75

  76. INDEX SOME DATA • COMMIT LOG HISTORY OF VARIOUS BASHO

    REPOS • INDEX REPO NAME AND COMMIT AUTHOR, DATE, SUBJECT, BODY • USED BASHO BENCH TO LOAD DATA 76
  77. QUERY 77

  78. QUERY 78

  79. QUERY • QUERY FROM ANY NODE • USE SOLR SYNTAX

    • RETURN SOLR RESULT VERBATIM • CAN USE EXISTING SOLR CLIENTS (FOR QUERY, NOT WRITE) 79
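A hedged sketch of building such a query URL. The /search/<index> path on Riak's HTTP port 8098 and the `author_s` field are assumptions drawn from Yokozuna conventions of this era; verify both against your version's docs before relying on them:

```python
from urllib.parse import urlencode

def search_url(host: str, index: str, query: str, **params) -> str:
    # extra parameters are passed through to Solr verbatim
    qs = urlencode({"q": query, "wt": "json", **params})
    return f"http://{host}:8098/search/{index}?{qs}"

url = search_url("localhost", "commits", "author_s:zezeski", rows=5)
print(url)
```

Because the result comes back as a verbatim Solr response, any Solr client that can be pointed at a URL like this can parse it for queries (writes still go through Riak).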
  80. WHAT HAPPENS IF YOU TAKE 2 NODES DOWN? 80

  81. DOWN 2 NODES 81

  82. VERIFY DOWN 82

  83. QUERY (DOWN) 83

  84. QUERY (DOWN) • INDEX REPLICATION ALLOWS FOR QUERY AVAILABILITY •

    JUST NEED 1 REPLICA OF INDEX • IF TOO MANY NODES GO DOWN YOKOZUNA WILL REFUSE QUERY • PREFERS 100% HARVEST 84
  85. WHAT HAPPENS IF YOU WRITE DATA WHILE NODES ARE DOWN?

    85
  86. VERIFY 0 RESULTS 86

  87. ADD NEW DATA 87

  88. QUERY NODE 1 88

  89. DISABLE HANDOFF 89

  90. START NODE 4 & 5 90

  91. QUERY SOLR NODE 4 91

  92. ENABLE HANDOFF 92

  93. TAIL LOGS 93

  94. QUERY SOLR NODE 4 94

  95. QUERY NODE 4 95

  96. WHAT HAPPENS IF YOU LOSE YOUR INDEX DATA? 96

  97. QUERY SOLR NODE 4 NOTICE NUM FOUND IS 6747 97

  98. RM -RF THE INDEX 98

  99. KILL -9 JVM 99

  100. YOKO RESTART JVM 100

  101. QUERY SOLR NODE 4 NUM FOUND 0 BECAUSE INDEX WAS

    DELETED 101
  102. AAE DETECT/REPAIR 102

  103. QUERY SOLR NODE 4 NUM FOUND IS 6747 AGAIN THANKS

    TO AAE 103
  104. Danke sehr! HTTP://GITHUB.COM/BASHO/YOKOZUNA 104