Slide 1

Slide 1 text

Yokozuna, Scaling Solr With Riak Ryan Zezeski Berlin Buzzwords - June 4th 2013 1

Slide 2

Slide 2 text

WHO AM I? • DEVELOPER @ BASHO TECHNOLOGIES • PREVIOUSLY @ AOL FOR ADVERTISING.COM • MOST EXPERIENCE IN JAVA & ERLANG • 2+ YEARS WORKING ON SEARCH • @RZEZESKI ON TWITTER 2

Slide 3

Slide 3 text

NOT TALKING ABOUT SEARCH 3

Slide 4

Slide 4 text

AGENDA • OVERVIEW OF RIAK & YOKOZUNA • DATA PARTITIONING & OWNERSHIP • HIGH AVAILABILITY & CONSISTENCY • SELF HEALING (ANTI ENTROPY) • DEMOS 4

Slide 5

Slide 5 text

WHAT IS RIAK? • KEY-VALUE STORE (+ SOME EXTRAS) • DISTRIBUTED • HIGHLY AVAILABLE • MASTERLESS • EVENTUALLY CONSISTENT • SCALE UP/DOWN 5

Slide 6

Slide 6 text

DATABASE • KEY/VALUE MODEL • BASIC SECONDARY INDEX SUPPORT • MAP/REDUCE (NOT LIKE HADOOP) • SEARCH (YOKOZUNA/SOLR) 6

Slide 7

Slide 7 text

DISTRIBUTED • MANY NODES IN LAN • RECOMMEND STARTING WITH 5 • ENTERPRISE REPLICATION CAN SPAN WAN 7

Slide 8

Slide 8 text

HIGH AVAILABILITY • ALWAYS TAKE WRITES • ALWAYS SERVICE READS • FAVORS YIELD OVER HARVEST • IMPLIES EVENTUAL CONSISTENCY 8

Slide 9

Slide 9 text

MASTERLESS • NO NOTION OF MASTER OR SLAVE • ANY NODE MAY SERVICE READ/WRITE/QUERY 9

Slide 10

Slide 10 text

EVENTUALLY CONSISTENT • READS CAN BE STALE • CONCURRENT WRITES CAN CAUSE SIBLINGS • EVENTUALLY VALUES CONVERGE 10

Slide 11

Slide 11 text

YOKOZUNA • INTEGRATION OF RIAK AND SOLR • INDEX RIAK DATA WITH SOLR • DISTRIBUTE SOLR WITH RIAK • TOGETHER DO WHAT EACH ALONE CANNOT 11

Slide 12

Slide 12 text

YOKOZUNA • EACH NODE RUNS A LOCAL SOLR INSTANCE • CREATE AN INDEX SAME AS BUCKET NAME • DOCUMENT IS “EXTRACTED” FROM VALUE • SUPPORTS PLAIN TEXT, XML, AND JSON • SOLR CELL SUPPORT COMING SOON 12

Slide 13

Slide 13 text

YOKOZUNA • SUPPORTS “TAGGING” • USE SOLR QUERY SYNTAX • PARAMETERS PASSED VERBATIM • IF DISTRIBUTED SEARCH SUPPORTS IT - YOKOZUNA SUPPORTS IT • NO SOLR CLOUD INVOLVED 13

Slide 14

Slide 14 text

PARTITIONING & OWNERSHIP Aufteilen & Eigentum 14

Slide 15

Slide 15 text

NAIVE HASHING NODE # = HASH(KEY) % NUM_NODES NH(Ka) = 0 NH(Kb) = 1 NH(Kc) = 2 NH(Kd) = 0 ... 15

Slide 16

Slide 16 text

NAIVE HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr 16

Slide 17

Slide 17 text

NAIVE HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Kg Ki NODE 3 Ke Kf Kh Kj Kk Kl Km Kn Ko Kp Kq Kr 17

Slide 18

Slide 18 text

NAIVE HASHING K * (NN - 1) / NN => K • K = # OF KEYS • NN = # OF NODES • AS NN GROWS FACTOR ESSENTIALLY BECOMES 1, THUS ALL KEYS MOVE 18
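The claim that nearly every key moves under naive hashing can be checked with a small simulation. This is a minimal sketch, not code from the talk: the key names are made up, and SHA-1 stands in for whatever hash function a real system would use.

```python
import hashlib

def stable_hash(key: str) -> int:
    # Python's built-in hash() is salted per process, so use SHA-1 instead.
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")

def naive_node(key: str, num_nodes: int) -> int:
    # NODE # = HASH(KEY) % NUM_NODES
    return stable_hash(key) % num_nodes

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    # A key "moves" when its node number changes after the cluster is resized.
    moved = sum(1 for k in keys
                if naive_node(k, old_nodes) != naive_node(k, new_nodes))
    return moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical key names
for old, new in [(3, 4), (9, 10), (99, 100)]:
    share = fraction_moved(keys, old, new)
    print(f"{old} -> {new} nodes: {share:.1%} of keys move")
```

The measured share should track (NN - 1) / NN for the new node count, so it approaches 100% as the cluster grows.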

Slide 19

Slide 19 text

CONSISTENT HASHING PARTITION # = HASH(KEY) % PARTITIONS • # PARTITIONS REMAINS CONSTANT • KEY ALWAYS MAPS TO SAME PARTITION • NODES OWN PARTITIONS • PARTITIONS CONTAIN KEYS • EXTRA LEVEL OF INDIRECTION 19

Slide 20

Slide 20 text

P9 P6 P3 P8 P5 P2 P7 P4 P1 CONSISTENT HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr 20

Slide 21

Slide 21 text

P9 P6 P3 P8 P5 P2 P7 P4 P1 CONSISTENT HASHING NODE 0 NODE 1 NODE 2 Ka Kb Kc Kd Ke Kf Kg Kh Ki Kj Kk Km Kl Kp Kn Ko Kq Kr NODE 3 21

Slide 22

Slide 22 text

CONSISTENT HASHING NN * K/Q => K/Q • K = # OF KEYS • NN = # OF NODES • Q = # OF PARTITIONS • AS K GROWS NN BECOMES CONSTANT, THUS K/Q KEYS MOVE 22
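The extra level of indirection can be sketched the same way. Keys hash to a fixed number of partitions, and only partition ownership changes when a node joins. The partition count of 64 and the `add_node` rebalancing rule are assumptions chosen for illustration; this is not Riak's actual claim algorithm.

```python
import hashlib
from collections import Counter

Q = 64  # hypothetical partition count, fixed for the life of the cluster

def partition_of(key: str, q: int = Q) -> int:
    # PARTITION # = HASH(KEY) % PARTITIONS; never depends on the node count.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % q

def initial_owners(num_nodes: int, q: int = Q) -> list:
    # Round-robin assignment: owners[p] is the node that owns partition p.
    return [p % num_nodes for p in range(q)]

def add_node(owners: list, new_node: int) -> list:
    # Illustrative rule: hand the new node a fair share of partitions,
    # taking one at a time from whichever existing node owns the most.
    owners = list(owners)
    fair_share = len(owners) // (len(set(owners)) + 1)
    for _ in range(fair_share):
        counts = Counter(n for n in owners if n != new_node)
        donor = max(counts, key=lambda n: (counts[n], -n))
        owners[owners.index(donor)] = new_node
    return owners

before = initial_owners(3)
after = add_node(before, new_node=3)
keys = [f"key-{i}" for i in range(10_000)]  # hypothetical key names
moved = sum(1 for k in keys if before[partition_of(k)] != after[partition_of(k)])
print(f"{moved / len(keys):.1%} of keys changed node")
```

Only keys in the reassigned partitions change node; every other key stays where it was, which is the contrast with naive hashing.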

Slide 23

Slide 23 text

CONSISTENT HASHING • EVENLY DIVIDES KEYSPACE • LOGICAL PARTITIONING SEPARATED FROM PHYSICAL PARTITIONING • UNIFORM HASH GIVES UNIFORM DISTRIBUTION 23

Slide 24

Slide 24 text

THE RING P1 P2 P3 P4 P5 P6 P7 P8 24

Slide 25

Slide 25 text

THE RING P1 P2 P3 P4 P5 P6 P7 P8 ND0 ND1 ND2 ND0 ND1 ND2 ND0 ND1 25

Slide 26

Slide 26 text

THE RING P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 26

Slide 27

Slide 27 text

THE RING • GOSSIPED BETWEEN NODES • EPOCH CONSENSUS BASED • MASTERLESS - ANY NODE CAN SERVICE ANY REQUEST 27

Slide 28

Slide 28 text

WRITES (INDEX) NODE 0 NODE 1 NODE 2 Ia Id Ig Ij Im Ip Ib Ie Ih Ik In Iq Ic If Ii Il Io Ir P7 P4 P1 Ka Kd Kg Kj Km Kp P8 P5 P2 Kb Ke Kh Kk Kn Kq P9 P6 P3 Kc Kf Ki Kl Ko Kr 28

Slide 29

Slide 29 text

READS (QUERY) NODE 0 NODE 1 NODE 2 Ia Id Ig Ij Im Ip Ib Ie Ih Ik In Iq Ic If Ii Il Io Ir Q Q + SHARDS 29

Slide 30

Slide 30 text

HIGH AVAILABILITY Hochverfügbarkeit 30

Slide 31

Slide 31 text

UPTIME IS A POOR METRIC 31

Slide 32

Slide 32 text

“IF THE SYSTEM IS ‘DOWN’ AND NO ONE MAKES A REQUEST, IS IT REALLY DOWN?” ~ ME 32

Slide 33

Slide 33 text

HARVEST VS YIELD 33

Slide 34

Slide 34 text

YIELD = QUERIES COMPLETED / QUERIES OFFERED

Slide 35

Slide 35 text

HARVEST = DATA AVAILABLE / COMPLETE DATA

Slide 36

Slide 36 text

DURING FAILURE OR OVERLOAD - FOR A GIVEN QUERY - YOU MUST DECIDE BETWEEN HARVEST OR YIELD 36

Slide 37

Slide 37 text

MAINTAIN HARVEST VIA REPLICATION 37

Slide 38

Slide 38 text

REPLICATION • N VALUE - # OF REPLICAS TO STORE • DEFAULT OF 3 • MORE REPLICAS TRADE IOPS + SPACE FOR MORE HARVEST 38

Slide 39

Slide 39 text

P9 P8 P7 P6 P5 P4 P3 P2 P1 WRITES NODE 0 NODE 1 NODE 2 K I K 39

Slide 40

Slide 40 text

P9 P8 P7 P6 P5 P4 P3 P2 P1 REPLICATED WRITES NODE 0 NODE 1 NODE 2 K1 K2 K3 I1 I2 I3 K 40

Slide 41

Slide 41 text

QUERY + REPLICATION • NOT ALL NODES NEED TO BE QUERIED • FIND COVERING SUBSET OF PARTITIONS/NODES • YOKOZUNA BUILDS THE COVERAGE PLAN - SOLR EXECUTES THE DISTRIBUTED QUERY • NO USE OF SOLR CLOUD 41
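One way to picture a covering subset is as a greedy set cover over the nodes that are still up. This is a simplified sketch under stated assumptions: round-robin ownership, replicas on consecutive partitions, and node-level rather than partition-level planning. The function names are hypothetical, and Yokozuna's real planner is more involved.

```python
def replica_holders(q: int, num_nodes: int, n_val: int) -> dict:
    # Assumption: partition p's data is replicated to the owners of
    # partitions p, p+1, ..., p+n_val-1 on a round-robin ring.
    owners = [p % num_nodes for p in range(q)]
    return {p: {owners[(p + i) % q] for i in range(n_val)} for p in range(q)}

def coverage_plan(holders: dict, up_nodes: set) -> list:
    # Greedily pick the node that covers the most still-uncovered partitions.
    uncovered = set(holders)
    plan = []
    while uncovered:
        gains = {node: {p for p in uncovered if node in holders[p]}
                 for node in sorted(up_nodes)}
        best = max(gains, key=lambda node: len(gains[node]), default=None)
        if best is None or not gains[best]:
            # Some partition has no live replica: refuse rather than
            # return partial results.
            raise RuntimeError("not enough replicas up to cover the keyspace")
        plan.append(best)
        uncovered -= gains[best]
    return plan

holders = replica_holders(q=64, num_nodes=5, n_val=3)
print(coverage_plan(holders, {0, 1, 2, 3, 4}))  # fewer than 5 nodes suffice
print(coverage_plan(holders, {0, 1, 2}))        # still coverable with 2 nodes down
```

With N = 3 on five nodes, any two nodes can be down and a plan still exists. With three nodes down, some partition loses all its replicas and the sketch raises instead of answering with incomplete data.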

Slide 42

Slide 42 text

SLOPPY QUORUM • N REPLICAS IMPLIES IDEA OF “PREFERENCE LIST” • SOME PARTITIONS ARE THE “PRIMARIES” - OTHERS ARE “SECONDARY” • SLOPPY = ALLOW NON-PRIMARY TO STORE REPLICAS • 100% YIELD - BUT POTENTIALLY DEGRADED HARVEST 42

Slide 43

Slide 43 text

TUNABLE QUORUM • R - # OF PARTITIONS TO VERIFY READ • W - # OF PARTITIONS TO VERIFY WRITE • PR/PW - # OF PARTITIONS WHICH MUST BE PRIMARY • ALLOWS YOU TO TRADE YIELD FOR HARVEST - PER REQUEST 43
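Sloppy and tunable quorums can be illustrated together. The sketch below builds a preference list that falls back to non-primary partitions when a primary's node is down, then checks a write against W and PW. The ring is the eight-partition, four-node layout from the ring slides; the function names and the ring-walk fallback rule are assumptions for illustration.

```python
def preference_list(first_partition: int, owners: list, n_val: int, down: set) -> list:
    # Walk the ring clockwise from the key's partition. The first n_val
    # partitions are the primaries; later ones serve as fallbacks.
    q = len(owners)
    walk = [(first_partition + i) % q for i in range(q)]
    chosen = [(p, owners[p], True) for p in walk[:n_val] if owners[p] not in down]
    for p in walk[n_val:]:
        if len(chosen) == n_val:
            break
        if owners[p] not in down:
            chosen.append((p, owners[p], False))  # sloppy: non-primary replica
    return chosen

def write_allowed(preflist: list, w: int, pw: int = 0) -> bool:
    # W counts any replica that can acknowledge; PW counts only primaries.
    primaries = sum(1 for _, _, is_primary in preflist if is_primary)
    return len(preflist) >= w and primaries >= pw

# Ring from the slides: P1..P8 owned by ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1.
owners = [3, 1, 2, 0, 3, 2, 0, 1]
plist = preference_list(first_partition=2, owners=owners, n_val=3, down={0})
print(plist)
print(write_allowed(plist, w=2))        # sloppy quorum keeps yield
print(write_allowed(plist, w=2, pw=3))  # demanding 3 primaries trades yield away
```

Raising PW (or PR on reads) makes a request fail when primaries are unreachable, which is the per-request trade of yield for harvest that the slide describes.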

Slide 44

Slide 44 text

SIBLINGS • NO MASTER TO SERIALIZE OPS • CONCURRENT ACTORS ON SAME KEY • OPERATIONS CAN INTERLEAVE • USE VCLOCKS TO DETECT CONFLICT • CREATE SIBLINGS - LET CLIENT FIX • INDEX ALL SIBLINGS 44
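Conflict detection with vector clocks can be sketched in a few lines. Here a clock is a plain dict mapping actor to counter; this is a simplified model for illustration, not Riak's vclock encoding, and the `put` helper is hypothetical.

```python
def increment(clock: dict, actor: str) -> dict:
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    # True when clock a has seen every event recorded in clock b.
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def merge(a: dict, b: dict) -> dict:
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}

def put(siblings: list, clock: dict, value: str) -> list:
    # Drop every stored value that the incoming write supersedes.
    survivors = [(c, v) for c, v in siblings if not descends(clock, c)]
    if any(descends(c, clock) for c, _ in survivors):
        return survivors  # incoming write is stale; keep what is stored
    return survivors + [(clock, value)]  # concurrent writes become siblings

base = increment({}, "a")
stored = put([], base, "v1")
# Two clients read v1 and write concurrently, each unaware of the other.
cx, cy = increment(base, "x"), increment(base, "y")
stored = put(stored, cx, "from-x")
stored = put(stored, cy, "from-y")
print(len(stored))  # two siblings: neither clock descends the other
# A client resolves the conflict by writing with the merged clock.
stored = put(stored, increment(merge(cx, cy), "x"), "resolved")
print(len(stored))  # back to a single value
```

Until a client resolves the conflict, both values coexist, which is why the slide notes that all siblings are indexed.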

Slide 45

Slide 45 text

SELF HEALING Selbstheilung 45

Slide 46

Slide 46 text

HINTED HANDOFF • WHEN NODES GO DOWN DATA WRITTEN TO SECONDARY PARTITIONS • WHEN NODES COME BACK NEED TO GIVE THE DATA TO PRIMARY OWNER • AS DATA IS HANDED OFF INDEX IT ON DESTINATION NODE 46

Slide 47

Slide 47 text

P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 47

Slide 48

Slide 48 text

P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 48

Slide 49

Slide 49 text

P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE HINTED HANDOFF 49

Slide 50

Slide 50 text

P1 P2 P3 P4 P5 P6 P7 P8 ND3 ND1 ND2 ND0 ND3 ND2 ND0 ND1 K K1 K2 K3 WRITE K2 HINTED HANDOFF 50
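The handoff sequence in these frames can be modeled with a toy in-memory cluster. This is a minimal sketch assuming one primary and one fallback per write; the `Node` class and its dict-based "index" are stand-ins for Riak's storage and the local Solr instance, not real APIs.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.kv = {}      # key -> value (stand-in for the KV store)
        self.index = {}   # key -> document (stand-in for the local Solr index)
        self.hints = {}   # intended owner name -> {key: value}

    def store(self, key, value, hint=None):
        if hint is None:
            self.kv[key] = value
            self.index[key] = value  # index where the data lands
        else:
            # Held on behalf of a primary that is currently down.
            self.hints.setdefault(hint, {})[key] = value

def write(key, value, primary: Node, fallback: Node):
    if primary.up:
        primary.store(key, value)
    else:
        fallback.store(key, value, hint=primary.name)

def handoff(fallback: Node, nodes: dict):
    # Give hinted data back to every primary owner that has returned.
    for owner_name in list(fallback.hints):
        owner = nodes[owner_name]
        if owner.up:
            for key, value in fallback.hints.pop(owner_name).items():
                owner.store(key, value)  # re-indexed on the destination node

nd0, nd3 = Node("ND0"), Node("ND3")
nodes = {"ND0": nd0, "ND3": nd3}
nd0.up = False
write("K2", "replica 2", primary=nd0, fallback=nd3)
nd0.up = True
handoff(nd3, nodes)
print(nd0.index)  # the primary now holds and indexes K2
```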

Slide 51

Slide 51 text

READ REPAIR • REPLICAS MAY NOT AGREE • REPLICAS MAY BE LOST • CHECK REPLICA VALUES DURING READ • FIX IF THEY DISAGREE • SEND NEW VALUE TO EACH REPLICA 51
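Read repair can be sketched as a read that compares replicas and rewrites any that lag or are missing. For brevity this example orders versions with a plain integer, which is an assumption; Riak compares vector clocks instead.

```python
def read_with_repair(key, replicas: list):
    # Each replica is a dict of key -> (version, value).
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # fix stale or lost replicas as a side effect
    return newest[1]

replicas = [
    {"k": (2, "new")},   # up to date
    {"k": (1, "old")},   # stale
    {},                  # lost
]
print(read_with_repair("k", replicas))
print(replicas)  # all three replicas now agree
```

Because repair only happens on read, keys that are never read are never fixed this way, which is the gap active anti-entropy fills.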

Slide 52

Slide 52 text

ACTIVE ANTI-ENTROPY • 2 SYSTEMS (RIAK & SOLR) - GREATER CHANCE FOR INCONSISTENCY • FILES CAN BECOME TRUNCATED/CORRUPTED • ACCIDENTAL RM -RF • SEGFAULT AT THE RIGHT TIME • ETC 52

Slide 53

Slide 53 text

MYRIAD OF FAILURE SCENARIOS - FROM OBVIOUS TO NEARLY INVISIBLE 53

Slide 54

Slide 54 text

AAE - MERKLE TREES 54

Slide 55

Slide 55 text

AAE - MERKLE TREES EACH SEGMENT IS LIST OF KEY-HASH PAIRS 55

Slide 56

Slide 56 text

AAE - MERKLE TREES HASH OF HASHES IN SEGMENT 56

Slide 57

Slide 57 text

AAE - MERKLE TREES HASH OF HASHES OF HASHES OF HASHES :) 57

Slide 58

Slide 58 text

AAE - MERKLE TREES • IT’S A HASH TREE • IT’S ABOUT EFFICIENCY • BILLIONS OF OBJECTS CAN BE COMPARED AT COST OF COMPARING 2 HASHES (WIN!) 58

Slide 59

Slide 59 text

AAE - EXCHANGE 59

Slide 60

Slide 60 text

AAE - EXCHANGE TOP HASHES DON’T MATCH - SOMETHING IS DIFFERENT 60

Slide 61

Slide 61 text

AAE - EXCHANGE NARROW DOWN THE DIVERGENT SEGMENT 61

Slide 62

Slide 62 text

AAE - EXCHANGE NARROW DOWN THE DIVERGENT SEGMENT CONT... 62

Slide 63

Slide 63 text

AAE - EXCHANGE ITERATE FINAL LIST OF HASHES TO FIND DIVERGENT KEYS 63

Slide 64

Slide 64 text

AAE - EXCHANGE REPAIR (RE-INDEX) KEYS THAT ARE DIVERGENT (RED) 64
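The tree build and exchange steps can be sketched end to end. This is an in-memory illustration with assumed parameters (16 segments, fanout of 4, SHA-1); the trees used by AAE are durable and far larger, and all names here are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(key_hashes: dict, segments: int = 16, fanout: int = 4):
    # Leaves: each segment is a list of (key, object-hash) pairs.
    seg_lists = [[] for _ in range(segments)]
    for key in sorted(key_hashes):
        seg = int.from_bytes(h(key.encode()), "big") % segments
        seg_lists[seg].append((key, key_hashes[key]))
    # Level 0 holds one hash per segment; each higher level hashes groups
    # of child hashes until a single top hash remains.
    level = [h(b"".join(k.encode() + v for k, v in seg)) for seg in seg_lists]
    levels = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + fanout])) for i in range(0, len(level), fanout)]
        levels.append(level)
    return seg_lists, levels

def divergent_keys(tree_a, tree_b, fanout: int = 4) -> set:
    (segs_a, lv_a), (segs_b, lv_b) = tree_a, tree_b
    suspects = [0]  # start at the top hash
    for depth in range(len(lv_a) - 1, 0, -1):
        # Keep only mismatching nodes, then descend into their children.
        suspects = [i for i in suspects if lv_a[depth][i] != lv_b[depth][i]]
        width = len(lv_a[depth - 1])
        suspects = [c for i in suspects
                    for c in range(i * fanout, min((i + 1) * fanout, width))]
    keys = set()
    for s in (i for i in suspects if lv_a[0][i] != lv_b[0][i]):
        # Walk the final lists of key-hash pairs to find the divergent keys.
        keys |= {k for k, _ in set(segs_a[s]) ^ set(segs_b[s])}
    return keys

kv = {f"key-{i}": h(f"value-{i}".encode()) for i in range(1000)}
solr = dict(kv)
del solr["key-7"]              # document lost from the index
solr["key-42"] = h(b"stale")   # index holds an old version
print(sorted(divergent_keys(build_tree(kv), build_tree(solr))))
```

When the two top hashes match, the loop discards the only suspect immediately and no segments are read, which is where the efficiency comes from.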

Slide 65

Slide 65 text

AAE • DURABLE TREES • UPDATED IN REAL TIME • NON-BLOCKING • PERIODICALLY EXCHANGED • INVOKE READ-REPAIR AND RE-INDEX ON DIVERGENCE • PERIODICALLY REBUILT 65

Slide 66

Slide 66 text

CODE FOR DETECTION AND REPAIR - NOT PREVENTION 66

Slide 67

Slide 67 text

DEMONSTRATION Vorführung 67

Slide 68

Slide 68 text

CREATE CLUSTER 68

Slide 69

Slide 69 text

START 5 NODES 69

Slide 70

Slide 70 text

JOIN NODES 70

Slide 71

Slide 71 text

CREATE PLAN 71

Slide 72

Slide 72 text

COMMIT PLAN 72

Slide 73

Slide 73 text

CHECK MEMBERSHIP 73

Slide 74

Slide 74 text

STORE SCHEMA 74

Slide 75

Slide 75 text

CREATE INDEX 75

Slide 76

Slide 76 text

INDEX SOME DATA • COMMIT LOG HISTORY OF VARIOUS BASHO REPOS • INDEX REPO NAME AND COMMIT AUTHOR, DATE, SUBJECT, BODY • USED BASHO BENCH TO LOAD DATA 76

Slide 77

Slide 77 text

QUERY 77

Slide 78

Slide 78 text

QUERY 78

Slide 79

Slide 79 text

QUERY • QUERY FROM ANY NODE • USE SOLR SYNTAX • RETURN SOLR RESULT VERBATIM • CAN USE EXISTING SOLR CLIENTS (FOR QUERY, NOT WRITE) 79
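Since parameters are passed through to Solr, a query is an ordinary HTTP GET with Solr parameters. The sketch below only builds the URL. The host, port, `/search/<index>` path, index name, and field name are all assumptions for illustration; check them against the Yokozuna version and schema in use.

```python
from urllib.parse import urlencode

def search_url(host: str, port: int, index: str, **solr_params) -> str:
    # Assumed endpoint layout: http://<host>:<port>/search/<index>?<solr params>
    return f"http://{host}:{port}/search/{index}?{urlencode(solr_params)}"

# Hypothetical index "commits" with a hypothetical "author" field.
url = search_url("localhost", 8098, "commits", q="author:ryan", wt="json", rows=5)
print(url)

# Any node can take the request, so only the host and port change:
# urllib.request.urlopen(url).read() would return Solr's response body as is.
```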

Slide 80

Slide 80 text

WHAT HAPPENS IF YOU TAKE 2 NODES DOWN? 80

Slide 81

Slide 81 text

DOWN 2 NODES 81

Slide 82

Slide 82 text

VERIFY DOWN 82

Slide 83

Slide 83 text

QUERY (DOWN) 83

Slide 84

Slide 84 text

QUERY (DOWN) • INDEX REPLICATION ALLOWS FOR QUERY AVAILABILITY • JUST NEED 1 REPLICA OF INDEX • IF TOO MANY NODES GO DOWN YOKOZUNA WILL REFUSE QUERY • PREFERS 100% HARVEST 84

Slide 85

Slide 85 text

WHAT HAPPENS IF YOU WRITE DATA WHILE NODES ARE DOWN? 85

Slide 86

Slide 86 text

VERIFY 0 RESULTS 86

Slide 87

Slide 87 text

ADD NEW DATA 87

Slide 88

Slide 88 text

QUERY NODE 1 88

Slide 89

Slide 89 text

DISABLE HANDOFF 89

Slide 90

Slide 90 text

START NODES 4 & 5 90

Slide 91

Slide 91 text

QUERY SOLR NODE 4 91

Slide 92

Slide 92 text

ENABLE HANDOFF 92

Slide 93

Slide 93 text

TAIL LOGS 93

Slide 94

Slide 94 text

QUERY SOLR NODE 4 94

Slide 95

Slide 95 text

QUERY NODE 4 95

Slide 96

Slide 96 text

WHAT HAPPENS IF YOU LOSE YOUR INDEX DATA? 96

Slide 97

Slide 97 text

QUERY SOLR NODE 4 NOTICE NUM FOUND IS 6747 97

Slide 98

Slide 98 text

RM -RF THE INDEX 98

Slide 99

Slide 99 text

KILL -9 JVM 99

Slide 100

Slide 100 text

YOKO RESTART JVM 100

Slide 101

Slide 101 text

QUERY SOLR NODE 4 NUM FOUND 0 BECAUSE INDEX WAS DELETED 101

Slide 102

Slide 102 text

AAE DETECT/REPAIR 102

Slide 103

Slide 103 text

QUERY SOLR NODE 4 NUM FOUND IS 6747 AGAIN THANKS TO AAE 103

Slide 104

Slide 104 text

Thank you very much! HTTP://GITHUB.COM/BASHO/YOKOZUNA 104