Yokozuna: Scaling Solr With Riak

Ryan Zezeski

June 04, 2013

Transcript

  1. Yokozuna,
    Scaling Solr With Riak
    Ryan Zezeski
    Berlin Buzzwords - June 4th 2013

  2. WHO AM I?
    • DEVELOPER @ BASHO TECHNOLOGIES
    • PREVIOUSLY @ AOL / ADVERTISING.COM
    • MOST EXPERIENCE IN JAVA & ERLANG
    • 2+ YEARS WORKING ON SEARCH
    • @RZEZESKI ON TWITTER

  3. NOT TALKING
    ABOUT SEARCH

  4. AGENDA
    • OVERVIEW OF RIAK & YOKOZUNA
    • DATA PARTITIONING & OWNERSHIP
    • HIGH AVAILABILITY & CONSISTENCY
    • SELF HEALING (ANTI ENTROPY)
    • DEMOS

  5. WHAT IS RIAK?
    • KEY-VALUE STORE (+ SOME EXTRAS)
    • DISTRIBUTED
    • HIGHLY AVAILABLE
    • MASTERLESS
    • EVENTUALLY CONSISTENT
    • SCALE UP/DOWN

  6. DATABASE
    • KEY/VALUE MODEL
    • BASIC SECONDARY INDEX SUPPORT
    • MAP/REDUCE (NOT LIKE HADOOP)
    • SEARCH (YOKOZUNA/SOLR)

  7. DISTRIBUTED
    • MANY NODES IN LAN
    • RECOMMEND STARTING WITH 5
    • ENTERPRISE REPLICATION CAN SPAN
    WAN

  8. HIGH AVAILABILITY
    • ALWAYS TAKE WRITES
    • ALWAYS SERVICE READS
    • FAVORS YIELD OVER HARVEST
    • IMPLIES EVENTUAL CONSISTENCY

  9. MASTERLESS
    • NO NOTION OF MASTER OR SLAVE
    • ANY NODE MAY SERVICE READ/WRITE/
    QUERY

  10. EVENTUALLY
    CONSISTENT
    • READS CAN BE STALE
    • CONCURRENT WRITES CAN CAUSE
    SIBLINGS
    • EVENTUALLY VALUES CONVERGE

  11. YOKOZUNA
    • INTEGRATION OF RIAK AND SOLR
    • INDEX RIAK DATA WITH SOLR
    • DISTRIBUTE SOLR WITH RIAK
    • TOGETHER DO WHAT EACH ALONE
    CANNOT

  12. YOKOZUNA
    • EACH NODE RUNS A LOCAL SOLR
    INSTANCE
    • CREATE AN INDEX WITH THE SAME
    NAME AS THE BUCKET
    • DOCUMENT IS “EXTRACTED” FROM
    VALUE
    • SUPPORTS PLAIN TEXT, XML, AND JSON
    • SOLR CELL SUPPORT COMING SOON
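
A sketch of the extraction step may help. The Python below flattens a JSON value into flat Solr-style fields; it is an illustration only, and the dot-separated field-naming scheme is an assumption, not necessarily Yokozuna's exact extractor output.

    # Hedged sketch of JSON "extraction" into flat Solr-style fields.
    # The dot-separated field names are an assumption for illustration.
    def extract(obj: dict, prefix: str = "") -> dict:
        fields = {}
        for k, v in obj.items():
            name = f"{prefix}{k}"
            if isinstance(v, dict):
                fields.update(extract(v, name + "."))  # recurse into nested objects
            else:
                fields[name] = v
        return fields

    print(extract({"name": "ryan", "employer": {"name": "basho"}}))
    # {'name': 'ryan', 'employer.name': 'basho'}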

  13. YOKOZUNA
    • SUPPORTS “TAGGING”
    • USE SOLR QUERY SYNTAX
    • PARAMETERS PASSED VERBATIM
    • IF DISTRIBUTED SEARCH SUPPORTS IT -
    YOKOZUNA SUPPORTS IT
    • NO SOLR CLOUD INVOLVED

  14. PARTITIONING &
    OWNERSHIP
    Aufteilen & Eigentum ("Partitioning & Ownership")

  15. NAIVE HASHING
    NODE # = HASH(KEY) % NUM_NODES
    NH(Ka) = 0
    NH(Kb) = 1
    NH(Kc) = 2
    NH(Kd) = 0
    ...
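
To see why modulo placement is fragile, here is a minimal Python sketch (an illustration, not Riak code). Adding a fourth node changes NUM_NODES, so most keys suddenly hash to a different node:

    import hashlib

    # Naive hashing: node = hash(key) % num_nodes.
    def naive_node(key: str, num_nodes: int) -> int:
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return h % num_nodes

    keys = [f"K{i}" for i in range(1000)]
    before = {k: naive_node(k, 3) for k in keys}
    after = {k: naive_node(k, 4) for k in keys}       # one node added
    moved = sum(1 for k in keys if before[k] != after[k])
    print(f"{moved}/{len(keys)} keys moved")          # roughly 3/4 of them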

  16. NAIVE HASHING
    [Figure: keys Ka–Kr spread evenly across NODE 0, NODE 1, and NODE 2]

  17. NAIVE HASHING
    [Figure: after NODE 3 is added, nearly every key Ka–Kr maps to a different node]

  18. NAIVE HASHING
    K * (NN - 1) / NN => K
    • K = # OF KEYS
    • NN = # OF NODES
    • AS NN GROWS THE FACTOR APPROACHES 1,
    THUS NEARLY ALL KEYS MOVE

  19. CONSISTENT
    HASHING
    PARTITION # = HASH(KEY) % PARTITIONS
    • # PARTITIONS REMAINS CONSTANT
    • KEY ALWAYS MAPS TO SAME PARTITION
    • NODES OWN PARTITIONS
    • PARTITIONS CONTAIN KEYS
    • EXTRA LEVEL OF INDIRECTION

  20. CONSISTENT HASHING
    [Figure: partitions P1–P9 form a ring; NODE 0, NODE 1, and NODE 2 each own three partitions, and keys Ka–Kr live in the partitions they hash to]

  21. CONSISTENT HASHING
    [Figure: NODE 3 joins and claims whole partitions; only the keys in those partitions move]

  22. CONSISTENT HASHING
    EACH PARTITION HOLDS ~K/Q KEYS
    • K = # OF KEYS
    • NN = # OF NODES
    • Q = # OF PARTITIONS
    • OWNERSHIP CHANGES BY WHOLE PARTITIONS,
    SO A NODE JOIN MOVES ONLY THE ~K/Q KEYS
    IN EACH TRANSFERRED PARTITION - NOT ALL K
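
A minimal Python sketch of the same point (an illustration, not Riak's actual claim algorithm). The partition a key hashes to never changes; only partition ownership does, so handing roughly a quarter of the partitions to a new node moves roughly a quarter of the keys:

    import hashlib

    Q = 64                                   # partition count stays constant

    def partition(key: str) -> int:
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return h % Q

    keys = [f"K{i}" for i in range(1000)]
    own3 = {p: p % 3 for p in range(Q)}      # partitions claimed by nodes 0-2
    own4 = dict(own3)
    for p in range(0, Q, 4):                 # hand every 4th partition to new node 3
        own4[p] = 3

    moved = sum(1 for k in keys if own3[partition(k)] != own4[partition(k)])
    print(f"{moved}/{len(keys)} keys moved")  # about a quarter, not nearly all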

  23. CONSISTENT
    HASHING
    • EVENLY DIVIDES KEYSPACE
    • LOGICAL PARTITIONING SEPARATED
    FROM PHYSICAL PARTITIONING
    • UNIFORM HASH GIVES UNIFORM
    DISTRIBUTION

  24. THE RING
    [Figure: the keyspace drawn as a ring of partitions P1–P8]

  25. THE RING
    [Figure: partitions P1–P8 claimed round-robin by nodes ND0, ND1, and ND2]

  26. THE RING
    [Figure: after ND3 joins, it claims some partitions from ND0–ND2]

  27. THE RING
    • GOSSIPED BETWEEN NODES
    • EPOCH CONSENSUS BASED
    • MASTERLESS - ANY NODE CAN
    SERVICE ANY REQUEST

  28. WRITES (INDEX)
    [Figure: each node stores the keys (Ka–Kr) of the partitions it owns (P1–P9) and indexes them as documents (Ia–Ir) in its local Solr instance]

  29. READS (QUERY)
    [Figure: a query Q sent to one node fans out as a sharded query (Q + SHARDS) across the nodes' local indexes]

  30. HIGH AVAILABILITY
    Hochverfügbarkeit ("High Availability")

  31. UPTIME IS A POOR
    METRIC

  32. “IF THE SYSTEM IS ‘DOWN’ AND NO ONE
    MAKES A REQUEST, IS IT REALLY DOWN?”
    ~ ME

  33. HARVEST VS YIELD

  34. YIELD
    YIELD = QUERIES COMPLETED / QUERIES OFFERED

  35. HARVEST
    HARVEST = DATA AVAILABLE / COMPLETE DATA

  36. DURING FAILURE OR
    OVERLOAD - FOR A
    GIVEN QUERY - YOU
    MUST DECIDE
    BETWEEN HARVEST
    AND YIELD

  37. MAINTAIN HARVEST
    VIA REPLICATION

  38. REPLICATION
    • N VALUE - # OF REPLICAS TO STORE
    • DEFAULT OF 3
    • MORE REPLICAS TRADE IOPS + SPACE
    FOR MORE HARVEST

  39. WRITES
    [Figure: a write of key K hashes to a partition on the ring (P1–P9); the owning node stores K and its index entry I]

  40. REPLICATED WRITES
    [Figure: with N=3, key K is written as replicas K1–K3 to three partitions, each indexed locally as I1–I3]

  41. QUERY +
    REPLICATION
    • NOT ALL NODES NEED TO BE QUERIED
    • FIND COVERING SUBSET OF
    PARTITIONS/NODES
    • YOKOZUNA BUILDS THE COVERAGE
    PLAN - SOLR EXECUTES THE
    DISTRIBUTED QUERY
    • NO USE OF SOLR CLOUD
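
A minimal sketch of the covering-subset idea (the concept only, not Yokozuna's actual coverage planner). If every key is replicated on N consecutive partitions of the ring, then querying every Nth partition sees every key at least once:

    # Assumes q is divisible by n; with N replicas on consecutive
    # partitions, every n-th partition together covers the whole keyspace.
    def coverage_plan(q: int, n: int) -> list:
        return list(range(0, q, n))

    print(coverage_plan(q=9, n=3))   # [0, 3, 6] - query only 1/3 of the partitions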

  42. SLOPPY QUORUM
    • N REPLICAS IMPLIES IDEA OF
    “PREFERENCE LIST”
    • SOME PARTITIONS ARE THE
    “PRIMARIES” - OTHERS ARE
    “SECONDARY”
    • SLOPPY = ALLOW NON-PRIMARY TO
    STORE REPLICAS
    • 100% YIELD - BUT POTENTIALLY
    DEGRADED HARVEST

  43. TUNABLE QUORUM
    • R - # OF PARTITIONS TO VERIFY READ
    • W - # OF PARTITIONS TO VERIFY WRITE
    • PR/PW - # OF PARTITIONS WHICH
    MUST BE PRIMARY
    • ALLOWS YOU TO TRADE YIELD FOR
    HARVEST - PER REQUEST
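
A toy Python sketch of the per-request trade (concept only, not Riak's internals): a lower R keeps answering through failures (yield), while a higher R insists on more replicas agreeing (harvest):

    # A read succeeds once R of the N replicas have responded.
    def read_succeeds(replies: int, r: int) -> bool:
        return replies >= r

    n, down = 3, 2                        # N=3 replicas, 2 of them unreachable
    print(read_succeeds(n - down, r=1))   # True  - favors yield
    print(read_succeeds(n - down, r=3))   # False - favors harvest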

  44. SIBLINGS
    • NO MASTER TO SERIALIZE OPS
    • CONCURRENT ACTORS ON SAME KEY
    • OPERATIONS CAN INTERLEAVE
    • USE VCLOCKS TO DETECT CONFLICT
    • CREATE SIBLINGS - LET CLIENT FIX
    • INDEX ALL SIBLINGS

  45. SELF HEALING
    Selbstheilung ("Self Healing")

  46. HINTED HANDOFF
    • WHEN A NODE GOES DOWN, DATA IS
    WRITTEN TO SECONDARY PARTITIONS
    • WHEN THE NODE COMES BACK, THE DATA
    IS HANDED BACK TO THE PRIMARY OWNER
    • AS DATA IS HANDED OFF, IT IS INDEXED
    ON THE DESTINATION NODE
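
A rough Python sketch of the fallback idea (concept only; Riak's real preference lists walk a ring of vnodes, which this simplifies to bare positions). When a primary is unreachable, the next live position stands in and later hands the data back:

    # Walk the ring from the key's partition, skipping down positions;
    # the first N live positions take the replicas.
    def preference_list(start: int, q: int, n: int, up: set) -> list:
        ring = [(start + i) % q for i in range(q)]
        return [p for p in ring if p in up][:n]

    up = set(range(9)) - {4}                  # position 4 is down
    print(preference_list(start=3, q=9, n=3, up=up))
    # [3, 5, 6] - a secondary stands in for the downed primary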

  47. HINTED HANDOFF
    [Figure: key K is written; replicas K1–K3 target three partitions on the ring]

  48. HINTED HANDOFF
    [Figure: one target node is down, so a secondary partition accepts its replica]

  49. HINTED HANDOFF
    [Figure: the downed node comes back up]

  50. HINTED HANDOFF
    [Figure: the secondary hands replica K2 back to the primary owner, which indexes it]

  51. READ REPAIR
    • REPLICAS MAY NOT AGREE
    • REPLICAS MAY BE LOST
    • CHECK REPLICA VALUES DURING READ
    • FIX IF THEY DISAGREE
    • SEND NEW VALUE TO EACH REPLICA
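
A toy Python sketch of read repair (concept only; real Riak compares vector clocks rather than integer versions):

    # During a read, compare replica versions and push the newest
    # value back to any stale replica.
    def read_repair(replicas: dict) -> str:
        winner = max(replicas.values())       # (version, value)
        for node, val in list(replicas.items()):
            if val != winner:
                replicas[node] = winner       # stands in for the repair write
        return winner[1]

    reps = {"nd0": (2, "new"), "nd1": (1, "old"), "nd2": (2, "new")}
    print(read_repair(reps), reps["nd1"])     # new (2, 'new')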

  52. ACTIVE ANTI-
    ENTROPY
    • 2 SYSTEMS (RIAK & SOLR) - GREATER
    CHANCE FOR INCONSISTENCY
    • FILES CAN BECOME TRUNCATED/
    CORRUPTED
    • ACCIDENTAL RM -RF
    • SEGFAULT AT THE RIGHT TIME
    • ETC

  53. MYRIAD OF FAILURE
    SCENARIOS - FROM
    OBVIOUS TO NEARLY
    INVISIBLE

  54. AAE - MERKLE TREES
    [Figure: a hash tree built over the keyspace]

  55. AAE - MERKLE TREES
    EACH SEGMENT IS A LIST OF
    KEY-HASH PAIRS

  56. AAE - MERKLE TREES
    HASH OF HASHES IN
    SEGMENT

  57. AAE - MERKLE TREES
    HASH OF HASHES OF
    HASHES OF HASHES :)

  58. AAE - MERKLE TREES
    • IT’S A HASH TREE
    • IT’S ABOUT EFFICIENCY
    • BILLIONS OF OBJECTS CAN BE
    COMPARED AT COST OF COMPARING
    2 HASHES (WIN!)
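
A compact Python sketch of the structure (an illustration, not riak_core's hashtree module): hash each segment's key-hash pairs, then hash pairs of hashes upward until a single root remains:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def segment_hash(segment: dict) -> bytes:
        # A segment maps keys to their object hashes; hash them in key order.
        return h(b"".join(k.encode() + v for k, v in sorted(segment.items())))

    def root(seg_hashes: list) -> bytes:
        # Hash of hashes (of hashes ...) up to the top of the tree.
        level = list(seg_hashes)
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])       # pad odd levels
            level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
        return level[0]

    # Two replicas agree iff their roots match: billions of objects
    # compared at the cost of comparing two hashes.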

  59. AAE - EXCHANGE
    [Figure: two nodes begin comparing their hash trees]

  60. AAE - EXCHANGE
    TOP HASHES DON’T
    MATCH -
    SOMETHING IS
    DIFFERENT

  61. AAE - EXCHANGE
    NARROW DOWN
    THE DIVERGENT
    SEGMENT

  62. AAE - EXCHANGE
    NARROW DOWN
    THE DIVERGENT
    SEGMENT CONT...

  63. AAE - EXCHANGE
    ITERATE THE FINAL LIST OF
    HASHES TO FIND
    DIVERGENT KEYS

  64. AAE - EXCHANGE
    REPAIR (RE-INDEX)
    KEYS THAT ARE
    DIVERGENT (RED)
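
A minimal Python sketch of the exchange walk (concept only, not the actual AAE protocol): once the roots differ, compare hashes level by level to isolate the divergent segments, then compare the key-hash pairs inside them:

    # Which segments differ between two trees?
    def divergent_segments(a: list, b: list) -> list:
        return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    # Within one divergent segment, which keys differ or are missing?
    def divergent_keys(seg_a: dict, seg_b: dict) -> set:
        return {k for k in set(seg_a) | set(seg_b)
                if seg_a.get(k) != seg_b.get(k)}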

  65. AAE
    • DURABLE TREES
    • UPDATED IN REAL TIME
    • NON-BLOCKING
    • PERIODICALLY EXCHANGED
    • INVOKE READ-REPAIR AND RE-INDEX
    ON DIVERGENCE
    • PERIODICALLY REBUILT

  66. CODE FOR
    DETECTION AND
    REPAIR - NOT
    PREVENTION

  67. DEMONSTRATION
    Vorführung ("Demonstration")

  68. CREATE CLUSTER

  69. START 5 NODES

  70. JOIN NODES

  71. CREATE PLAN

  72. COMMIT PLAN

  73. CHECK MEMBERSHIP

  74. STORE SCHEMA

  75. CREATE INDEX

  76. INDEX SOME DATA
    • COMMIT LOG HISTORY OF VARIOUS
    BASHO REPOS
    • INDEX REPO NAME AND COMMIT
    AUTHOR, DATE, SUBJECT, BODY
    • USED BASHO BENCH TO LOAD DATA
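
For a feel of what loading looks like over Riak's HTTP interface, here is a hedged Python sketch; the host, port, bucket, key, and document contents are assumptions for illustration (the demo itself used Basho Bench):

    import json
    import requests

    # Hedged sketch: PUT one commit document into a Riak bucket over HTTP.
    doc = {"repo": "riak_core", "author": "rzezeski",
           "date": "2013-06-04", "subject": "example commit subject"}
    requests.put(
        "http://localhost:8098/buckets/commits/keys/riak_core-abc123",
        headers={"Content-Type": "application/json"},
        data=json.dumps(doc),
    )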

  77. QUERY

  78. QUERY

  79. QUERY
    • QUERY FROM ANY NODE
    • USE SOLR SYNTAX
    • RETURN SOLR RESULT VERBATIM
    • CAN USE EXISTING SOLR CLIENTS
    (FOR QUERY, NOT WRITE)
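
A hedged sketch of querying from Python. The /search/&lt;index&gt; path below is an assumption (the exact search URL varied across early Yokozuna builds); the query string itself is plain Solr syntax, passed through verbatim:

    import requests

    # Hedged sketch: a Solr-syntax query against any node's search endpoint.
    resp = requests.get(
        "http://localhost:8098/search/commits",        # path is an assumption
        params={"q": "author:rzezeski", "wt": "json"},
    )
    print(resp.json()["response"]["numFound"])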

  80. WHAT HAPPENS IF
    YOU TAKE 2 NODES
    DOWN?

  81. DOWN 2 NODES

  82. VERIFY DOWN

  83. QUERY (DOWN)

  84. QUERY (DOWN)
    • INDEX REPLICATION ALLOWS FOR
    QUERY AVAILABILITY
    • JUST NEED 1 REPLICA OF INDEX
    • IF TOO MANY NODES GO DOWN,
    YOKOZUNA WILL REFUSE THE QUERY
    • PREFERS 100% HARVEST

  85. WHAT HAPPENS IF
    YOU WRITE DATA
    WHILE NODES ARE
    DOWN?

  86. VERIFY 0 RESULTS

  87. ADD NEW DATA

  88. QUERY NODE 1

  89. DISABLE HANDOFF

  90. START NODE 4 & 5

  91. QUERY SOLR NODE 4

  92. ENABLE HANDOFF

  93. TAIL LOGS

  94. QUERY SOLR NODE 4

  95. QUERY NODE 4

  96. WHAT HAPPENS IF
    YOU LOSE YOUR
    INDEX DATA?

  97. QUERY SOLR NODE 4
    NOTICE NUM
    FOUND IS 6747

  98. RM -RF THE INDEX

  99. KILL -9 JVM

  100. YOKO RESTART JVM

  101. QUERY SOLR NODE 4
    NUM FOUND 0
    BECAUSE INDEX
    WAS DELETED

  102. AAE DETECT/REPAIR

  103. QUERY SOLR NODE 4
    NUM FOUND IS
    6747 AGAIN THANKS
    TO AAE

  104. Danke sehr! (Thank you very much!)
    HTTP://GITHUB.COM/BASHO/YOKOZUNA