Yokozuna,
Scaling Solr With Riak
Ryan Zezeski
Berlin Buzzwords - June 4th 2013
1
Slide 2
Slide 2 text
WHO AM I?
• DEVELOPER @ BASHO TECHNOLOGIES
• PREVIOUSLY @ AOL FOR
ADVERTISING.COM
• MOST EXPERIENCE IN JAVA & ERLANG
• 2+ YEARS WORKING ON SEARCH
• @RZEZESKI ON TWITTER
2
Slide 3
Slide 3 text
NOT TALKING
ABOUT SEARCH
3
Slide 4
Slide 4 text
AGENDA
• OVERVIEW OF RIAK & YOKOZUNA
• DATA PARTITIONING & OWNERSHIP
• HIGH AVAILABILITY & CONSISTENCY
• SELF HEALING (ANTI ENTROPY)
• DEMOS
4
Slide 5
Slide 5 text
WHAT IS RIAK?
• KEY-VALUE STORE (+ SOME EXTRAS)
• DISTRIBUTED
• HIGHLY AVAILABLE
• MASTERLESS
• EVENTUALLY CONSISTENT
• SCALE UP/DOWN
5
Slide 6
Slide 6 text
DATABASE
• KEY/VALUE MODEL
• BASIC SECONDARY INDEX SUPPORT
• MAP/REDUCE (NOT LIKE HADOOP)
• SEARCH (YOKOZUNA/SOLR)
6
Slide 7
Slide 7 text
DISTRIBUTED
• MANY NODES IN LAN
• RECOMMEND STARTING WITH 5
• ENTERPRISE REPLICATION CAN SPAN
WAN
7
Slide 8
Slide 8 text
HIGH AVAILABILITY
• ALWAYS TAKE WRITES
• ALWAYS SERVICE READS
• FAVORS YIELD OVER HARVEST
• IMPLIES EVENTUAL CONSISTENCY
8
Slide 9
Slide 9 text
MASTERLESS
• NO NOTION OF MASTER OR SLAVE
• ANY NODE MAY SERVICE READ/WRITE/
QUERY
9
Slide 10
Slide 10 text
EVENTUALLY
CONSISTENT
• READS CAN BE STALE
• CONCURRENT WRITES CAN CAUSE
SIBLINGS
• EVENTUALLY VALUES CONVERGE
10
Slide 11
Slide 11 text
YOKOZUNA
• INTEGRATION OF RIAK AND SOLR
• INDEX RIAK DATA WITH SOLR
• DISTRIBUTE SOLR WITH RIAK
• TOGETHER DO WHAT EACH ALONE
CANNOT
11
Slide 12
Slide 12 text
YOKOZUNA
• EACH NODE RUNS A LOCAL SOLR
INSTANCE
• CREATE AN INDEX WITH THE SAME
NAME AS THE BUCKET
• DOCUMENT IS “EXTRACTED” FROM
VALUE
• SUPPORTS PLAIN TEXT, XML, AND JSON
• SOLR CELL SUPPORT COMING SOON
12
Slide 13
Slide 13 text
YOKOZUNA
• SUPPORTS “TAGGING”
• USE SOLR QUERY SYNTAX
• PARAMETERS PASSED VERBATIM
• IF DISTRIBUTED SEARCH SUPPORTS IT -
YOKOZUNA SUPPORTS IT
• NO SOLR CLOUD INVOLVED
13
NAIVE HASHING
NODE 0: Ka Kd Kg Kj Km Kp
NODE 1: Kb Ke Kh Kk Kn Kq
NODE 2: Kc Kf Ki Kl Ko Kr
16
Slide 17
Slide 17 text
NAIVE HASHING
[DIAGRAM: KEYS Ka-Kr SPREAD ACROSS NODE 0, NODE 1, NODE 2, AND NEW NODE 3]
17
Slide 18
Slide 18 text
NAIVE HASHING
K * (NN - 1) / NN => K
• K = # OF KEYS
• NN = # OF NODES
• AS NN GROWS THE FACTOR APPROACHES
1, THUS NEARLY ALL KEYS MOVE
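The key movement under naive hashing can be measured with a small sketch. It assumes SHA-1 as the hash and made-up key names; `naive_node` and `fraction_moved` are illustrative helpers, not Riak code. NN here is the node count after the new node joins.

```python
import hashlib

def stable_hash(key: str) -> int:
    # SHA-1 is used only as a stable, roughly uniform hash for illustration.
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

def naive_node(key: str, num_nodes: int) -> int:
    # Naive hashing: the node is derived directly from the key hash.
    return stable_hash(key) % num_nodes

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    # Fraction of keys whose node changes when the node count changes.
    moved = sum(
        1 for k in keys if naive_node(k, old_nodes) != naive_node(k, new_nodes)
    )
    return moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
print(fraction_moved(keys, 3, 4))   # roughly 3/4 of the keys move
print(fraction_moved(keys, 9, 10))  # roughly 9/10 of the keys move
```

With a uniform hash the moved fraction tracks (NN - 1) / NN, so larger clusters move almost every key on each membership change.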
18
Slide 19
Slide 19 text
CONSISTENT
HASHING
PARTITION # = HASH(KEY) % PARTITIONS
• # PARTITIONS REMAINS CONSTANT
• KEY ALWAYS MAPS TO SAME PARTITION
• NODES OWN PARTITIONS
• PARTITIONS CONTAIN KEYS
• EXTRA LEVEL OF INDIRECTION
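A minimal sketch of the indirection listed above, assuming SHA-1 as the hash, 9 partitions, and a hypothetical rebalance in which a new node claims two partitions. The names `partition_for`, `node_for`, and the ownership tables are illustrative, not Riak's API.

```python
import hashlib

NUM_PARTITIONS = 9  # fixed when the cluster is created

def partition_for(key: str) -> int:
    """PARTITION # = HASH(KEY) % PARTITIONS."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def node_for(key: str, owners: dict) -> int:
    """Extra level of indirection: key -> partition -> owning node."""
    return owners[partition_for(key)]

# Three nodes own three partitions each (partition index -> node).
owners_before = {p: p % 3 for p in range(NUM_PARTITIONS)}

# Hypothetical rebalance: node 3 joins and claims partitions 0 and 4.
owners_after = dict(owners_before)
owners_after[0] = 3
owners_after[4] = 3

keys = [f"key-{i}" for i in range(9_000)]  # hypothetical keys
moved = [
    k for k in keys if node_for(k, owners_before) != node_for(k, owners_after)
]
print(len(moved) / len(keys))  # close to 2/9: only the two claimed partitions move
```

The key-to-partition mapping never changes; only the small ownership table does, so the keys that move are exactly those in the reassigned partitions.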
19
Slide 20
Slide 20 text
CONSISTENT
HASHING
NODE 0: PARTITIONS P1 P4 P7 | KEYS Ka Kd Kg Kj Km Kp
NODE 1: PARTITIONS P2 P5 P8 | KEYS Kb Ke Kh Kk Kn Kq
NODE 2: PARTITIONS P3 P6 P9 | KEYS Kc Kf Ki Kl Ko Kr
20
Slide 21
Slide 21 text
CONSISTENT
HASHING
[DIAGRAM: PARTITIONS P1-P9 AND KEYS Ka-Kr ACROSS NODE 0, NODE 1, NODE 2, AND NEW NODE 3]
21
Slide 22
Slide 22 text
CONSISTENT
HASHING
NN * K/Q => K/Q
• K = # OF KEYS
• NN = # OF NODES
• Q = # OF PARTITIONS
• AS K GROWS NN BECOMES
CONSTANT, THUS K/Q KEYS MOVE
22
Slide 23
Slide 23 text
CONSISTENT
HASHING
• EVENLY DIVIDES KEYSPACE
• LOGICAL PARTITIONING SEPARATED
FROM PHYSICAL PARTITIONING
• UNIFORM HASH GIVES UNIFORM
DISTRIBUTION
23
Slide 24
Slide 24 text
THE RING
P1
P2
P3
P4
P5
P6
P7
P8
24
Slide 25
Slide 25 text
THE RING
P1 ND0
P2 ND1
P3 ND2
P4 ND0
P5 ND1
P6 ND2
P7 ND0
P8 ND1
25
Slide 26
Slide 26 text
THE RING
P1 ND3
P2 ND1
P3 ND2
P4 ND0
P5 ND3
P6 ND2
P7 ND0
P8 ND1
26
Slide 27
Slide 27 text
THE RING
• GOSSIPED BETWEEN NODES
• EPOCH CONSENSUS BASED
• MASTERLESS - ANY NODE CAN
SERVICE ANY REQUEST
27
Slide 28
Slide 28 text
WRITES (INDEX)
NODE 0: INDEX Ia Id Ig Ij Im Ip | PARTITIONS P1 P4 P7 | KEYS Ka Kd Kg Kj Km Kp
NODE 1: INDEX Ib Ie Ih Ik In Iq | PARTITIONS P2 P5 P8 | KEYS Kb Ke Kh Kk Kn Kq
NODE 2: INDEX Ic If Ii Il Io Ir | PARTITIONS P3 P6 P9 | KEYS Kc Kf Ki Kl Ko Kr
28
Slide 29
Slide 29 text
READS (QUERY)
NODE 0: INDEX Ia Id Ig Ij Im Ip
NODE 1: INDEX Ib Ie Ih Ik In Iq
NODE 2: INDEX Ic If Ii Il Io Ir
Q
Q + SHARDS
29
Slide 30
Slide 30 text
HIGH AVAILABILITY
Hochverfügbarkeit
30
Slide 31
Slide 31 text
UPTIME IS A POOR
METRIC
31
Slide 32
Slide 32 text
“IF THE SYSTEM IS
‘DOWN’ AND NO
ONE MAKES A
REQUEST, IS IT
REALLY DOWN?” ~
ME
32
Slide 33
Slide 33 text
HARVEST VS YIELD
33
Slide 34
Slide 34 text
YIELD = QUERIES COMPLETED / QUERIES OFFERED
34
Slide 35
Slide 35 text
HARVEST = DATA AVAILABLE / COMPLETE DATA
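The two ratios can be written out as a short sketch. The function names and the sample numbers are made up for illustration.

```python
def yield_ratio(queries_completed: int, queries_offered: int) -> float:
    # Yield: fraction of offered queries that were answered at all.
    return queries_completed / queries_offered

def harvest_ratio(data_available: int, complete_data: int) -> float:
    # Harvest: fraction of the full data set reflected in an answer.
    return data_available / complete_data

# Hypothetical numbers: 90 of 100 queries answered,
# each answer built from 2 of 3 partitions.
print(yield_ratio(90, 100))
print(harvest_ratio(2, 3))
```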
35
Slide 36
Slide 36 text
DURING FAILURE OR
OVERLOAD - FOR A
GIVEN QUERY - YOU
MUST DECIDE
BETWEEN HARVEST
OR YIELD
36
Slide 37
Slide 37 text
MAINTAIN HARVEST
VIA REPLICATION
37
Slide 38
Slide 38 text
REPLICATION
• N VALUE - # OF REPLICAS TO STORE
• DEFAULT OF 3
• MORE REPLICAS TRADE IOPS + SPACE
FOR MORE HARVEST
38
Slide 39
Slide 39 text
WRITES
[DIAGRAM: RING OF PARTITIONS P1-P9 ON NODE 0, NODE 1, NODE 2; KEY (K) AND INDEX (I) WRITES]
39
QUERY +
REPLICATION
• NOT ALL NODES NEED TO BE QUERIED
• FIND COVERING SUBSET OF
PARTITIONS/NODES
• YOKOZUNA BUILDS THE COVERAGE
PLAN - SOLR EXECUTES THE
DISTRIBUTED QUERY
• NO USE OF SOLR CLOUD
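One way to picture a covering subset is a greedy set cover over the nodes that are up. This is a simplified sketch, not Yokozuna's actual planner; the ring layout (10 partitions, 5 nodes, N = 3) and the names `build_replica_map` and `coverage_plan` are assumptions for illustration.

```python
def build_replica_map(ring_size: int, num_nodes: int, n_val: int) -> dict:
    # Data for partition p is replicated to the owners of p, p+1, ..., p+n_val-1.
    # Ownership is assumed to be round-robin: partition p belongs to node p % num_nodes.
    replica_map = {n: set() for n in range(num_nodes)}
    for p in range(ring_size):
        for i in range(n_val):
            owner = ((p + i) % ring_size) % num_nodes
            replica_map[owner].add(p)
    return replica_map

def coverage_plan(replica_map: dict, up_nodes: set):
    # Greedily pick up nodes until every partition's data is covered once.
    needed = set().union(*replica_map.values())
    plan = {}
    while needed:
        candidates = [n for n in sorted(up_nodes) if n not in plan]
        best = max(
            candidates, key=lambda n: len(replica_map[n] & needed), default=None
        )
        if best is None or not (replica_map[best] & needed):
            return None  # cannot cover everything: refuse the query
        plan[best] = replica_map[best] & needed
        needed -= plan[best]
    return plan

replica_map = build_replica_map(ring_size=10, num_nodes=5, n_val=3)
print(coverage_plan(replica_map, {0, 1, 2, 3, 4}))  # a subset of nodes suffices
print(coverage_plan(replica_map, {0, 1}))           # None: too many nodes down
```

Returning `None` when coverage is impossible mirrors the choice to refuse a query rather than return a partial result.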
41
Slide 42
Slide 42 text
SLOPPY QUORUM
• N REPLICAS IMPLIES IDEA OF
“PREFERENCE LIST”
• SOME PARTITIONS ARE THE
“PRIMARIES” - OTHERS ARE
“SECONDARY”
• SLOPPY = ALLOW NON-PRIMARY TO
STORE REPLICAS
• 100% YIELD - BUT POTENTIALLY
DEGRADED HARVEST
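A simplified sketch of a preference list with sloppy fallback. It assumes the N primaries are consecutive partitions on the ring and that a fallback is simply the next partition whose owner is up; the real selection in Riak is more involved, and `preference_list` is a hypothetical helper.

```python
def preference_list(first_partition: int, owners: list, n_val: int, up_nodes: set):
    # owners[p] is the node that owns partition p.
    ring_size = len(owners)
    ring = [(first_partition + i) % ring_size for i in range(ring_size)]
    primaries = ring[:n_val]
    # Candidate fallbacks: partitions past the primaries whose owner is up.
    fallbacks = iter(p for p in ring[n_val:] if owners[p] in up_nodes)
    plist = []
    for p in primaries:
        if owners[p] in up_nodes:
            plist.append((p, owners[p], "primary"))
        else:
            f = next(fallbacks, None)
            if f is not None:
                # Sloppy: a non-primary partition stores the replica for now.
                plist.append((f, owners[f], "secondary"))
    return plist

owners = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # 9 partitions on 3 nodes
print(preference_list(0, owners, 3, {0, 1, 2}))  # all primaries
print(preference_list(0, owners, 3, {0, 2}))     # node 1 down: one secondary
```

In the second call all three replicas are still written, which is how the write keeps succeeding while a primary is unavailable.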
42
Slide 43
Slide 43 text
TUNABLE QUORUM
• R - # OF PARTITIONS TO VERIFY READ
• W - # OF PARTITIONS TO VERIFY WRITE
• PR/PW - # OF PARTITIONS WHICH
MUST BE PRIMARY
• ALLOWS YOU TO TRADE YIELD FOR
HARVEST - PER REQUEST
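The per-request check can be sketched as a count over replica responses. The response shape `(ok, is_primary)` and the helper `quorum_met` are assumptions for illustration, not Riak's internal representation.

```python
def quorum_met(responses, required: int, primaries_required: int = 0) -> bool:
    # responses: list of (ok, is_primary) tuples, one per replica contacted.
    ok = [r for r in responses if r[0]]
    ok_primary = [r for r in ok if r[1]]
    return len(ok) >= required and len(ok_primary) >= primaries_required

# Hypothetical write with N = 3: one primary is down, a secondary took its replica.
responses = [(True, True), (True, False), (False, True)]
print(quorum_met(responses, required=2))                        # W = 2
print(quorum_met(responses, required=2, primaries_required=2))  # W = 2, PW = 2
```

Raising PR/PW makes the request fail when too few primaries answer, giving up yield to protect harvest.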
43
Slide 44
Slide 44 text
SIBLINGS
• NO MASTER TO SERIALIZE OPS
• CONCURRENT ACTORS ON SAME KEY
• OPERATIONS CAN INTERLEAVE
• USE VCLOCKS TO DETECT CONFLICT
• CREATE SIBLINGS - LET CLIENT FIX
• INDEX ALL SIBLINGS
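Conflict detection with vclocks can be sketched by modeling a clock as a dict of actor to counter. That representation and the names `descends` and `compare` are assumptions; Riak's actual encoding differs.

```python
def descends(a: dict, b: dict) -> bool:
    # a descends b if a has seen every event that b has seen.
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "concurrent"  # neither saw the other's write: keep both as siblings

base = {"x": 1}
write_a = {"x": 2}          # actor x updates the object
write_b = {"x": 1, "y": 1}  # actor y updates from the same base, concurrently
print(compare(write_a, base))
print(compare(write_a, write_b))
```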
44
Slide 45
Slide 45 text
SELF HEALING
Selbstheilung
45
Slide 46
Slide 46 text
HINTED HANDOFF
• WHEN NODES GO DOWN DATA
WRITTEN TO SECONDARY PARTITIONS
• WHEN NODES COME BACK NEED TO
GIVE THE DATA TO PRIMARY OWNER
• AS DATA IS HANDED OFF INDEX IT ON
DESTINATION NODE
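A toy sketch of the handoff step, assuming plain dicts for the stores and a set standing in for the Solr index on the destination node. `hinted_handoff` is a hypothetical helper, not Riak's handoff code.

```python
def hinted_handoff(fallback_data: dict, primary_data: dict, primary_index: set) -> int:
    # Move everything the secondary partition held on behalf of the primary.
    handed_off = 0
    for key in list(fallback_data):
        primary_data[key] = fallback_data.pop(key)
        primary_index.add(key)  # stand-in for indexing the document on the destination
        handed_off += 1
    return handed_off

fallback = {"a": "doc-a", "b": "doc-b"}  # written while the primary was down
primary, index = {}, set()
print(hinted_handoff(fallback, primary, index))
```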
46
READ REPAIR
• REPLICAS MAY NOT AGREE
• REPLICAS MAY BE LOST
• CHECK REPLICA VALUES DURING READ
• FIX IF THEY DISAGREE
• SEND NEW VALUE TO EACH REPLICA
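The read path above can be sketched with an integer version standing in for the vclock. That substitution, the dict-based replicas, and the name `read_with_repair` are all assumptions for illustration.

```python
def read_with_repair(replicas: list, key: str):
    # replicas: one dict per replica, mapping key -> (version, value).
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda vv: vv[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # repair a stale or missing replica
    return newest[1]

replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
print(read_with_repair(replicas, "k"))
print(replicas)  # every replica now holds the newest version
```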
51
Slide 52
Slide 52 text
ACTIVE ANTI-
ENTROPY
• 2 SYSTEMS (RIAK & SOLR) - GREATER
CHANCE FOR INCONSISTENCY
• FILES CAN BECOME TRUNCATED/
CORRUPTED
• ACCIDENTAL RM -RF
• SEGFAULT AT THE RIGHT TIME
• ETC
52
Slide 53
Slide 53 text
MYRIAD OF FAILURE
SCENARIOS - FROM
OBVIOUS TO NEARLY
INVISIBLE
53
Slide 54
Slide 54 text
AAE - MERKLE TREES
54
Slide 55
Slide 55 text
AAE - MERKLE TREES
EACH SEGMENT IS LIST OF
KEY-HASH PAIRS
55
Slide 56
Slide 56 text
AAE - MERKLE TREES
HASH OF HASHES IN
SEGMENT
56
Slide 57
Slide 57 text
AAE - MERKLE TREES
HASH OF HASHES OF
HASHES OF HASHES :)
57
Slide 58
Slide 58 text
AAE - MERKLE TREES
• IT’S A HASH TREE
• IT’S ABOUT EFFICIENCY
• BILLIONS OF OBJECTS CAN BE
COMPARED AT COST OF COMPARING
2 HASHES (WIN!)
58
Slide 59
Slide 59 text
AAE - EXCHANGE
59
Slide 60
Slide 60 text
AAE - EXCHANGE
TOP HASHES DON’T
MATCH -
SOMETHING IS
DIFFERENT
60
Slide 61
Slide 61 text
AAE - EXCHANGE
NARROW DOWN
THE DIVERGENT
SEGMENT
61
Slide 62
Slide 62 text
AAE - EXCHANGE
NARROW DOWN
THE DIVERGENT
SEGMENT CONT...
62
Slide 63
Slide 63 text
AAE - EXCHANGE
ITERATE FINAL LIST OF
HASHES TO FIND
DIVERGENT KEYS
63
Slide 64
Slide 64 text
AAE - EXCHANGE
REPAIR (RE-INDEX)
KEYS THAT ARE
DIVERGENT (RED)
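The tree and the exchange can be sketched with a two-level hash tree: segments of key-hash pairs, one hash per segment, and a top hash. Real AAE trees have more levels and are stored durably; SHA-1, the segment count, and the names `build_tree` and `divergent_keys` are assumptions for illustration.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha1(data.encode("utf-8")).hexdigest()

def build_tree(objects: dict, num_segments: int = 4):
    # Each segment is a list of key-hash pairs, held here as a dict.
    segments = [dict() for _ in range(num_segments)]
    for key, value in objects.items():
        segments[int(h(key), 16) % num_segments][key] = h(value)
    # Hash of the hashes in each segment, then a hash over those.
    segment_hashes = [
        h("".join(f"{k}:{v}" for k, v in sorted(s.items()))) for s in segments
    ]
    top = h("".join(segment_hashes))
    return top, segment_hashes, segments

def divergent_keys(tree_a, tree_b) -> set:
    top_a, seg_hashes_a, segs_a = tree_a
    top_b, seg_hashes_b, segs_b = tree_b
    if top_a == top_b:
        return set()  # one comparison proves both sides agree
    diff = set()
    for i, (ha, hb) in enumerate(zip(seg_hashes_a, seg_hashes_b)):
        if ha != hb:  # narrow down to the divergent segment
            for key in segs_a[i].keys() | segs_b[i].keys():
                if segs_a[i].get(key) != segs_b[i].get(key):
                    diff.add(key)
    return diff

kv = {f"k{i}": f"v{i}" for i in range(100)}  # hypothetical key/value data
index = dict(kv)
del index["k7"]        # entry missing from the index
index["k9"] = "stale"  # entry out of date in the index
print(divergent_keys(build_tree(kv), build_tree(index)))  # keys to re-index
```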
64
Slide 65
Slide 65 text
AAE
• DURABLE TREES
• UPDATED IN REAL TIME
• NON-BLOCKING
• PERIODICALLY EXCHANGED
• INVOKE READ-REPAIR AND RE-INDEX
ON DIVERGENCE
• PERIODICALLY REBUILT
65
Slide 66
Slide 66 text
CODE FOR
DETECTION AND
REPAIR - NOT
PREVENTION
66
Slide 67
Slide 67 text
DEMONSTRATION
Vorführung
67
Slide 68
Slide 68 text
CREATE CLUSTER
68
Slide 69
Slide 69 text
START 5 NODES
69
Slide 70
Slide 70 text
JOIN NODES
70
Slide 71
Slide 71 text
CREATE PLAN
71
Slide 72
Slide 72 text
COMMIT PLAN
72
Slide 73
Slide 73 text
CHECK MEMBERSHIP
73
Slide 74
Slide 74 text
STORE SCHEMA
74
Slide 75
Slide 75 text
CREATE INDEX
75
Slide 76
Slide 76 text
INDEX SOME DATA
• COMMIT LOG HISTORY OF VARIOUS
BASHO REPOS
• INDEX REPO NAME AND COMMIT
AUTHOR, DATE, SUBJECT, BODY
• USED BASHO BENCH TO LOAD DATA
76
Slide 77
Slide 77 text
QUERY
77
Slide 78
Slide 78 text
QUERY
78
Slide 79
Slide 79 text
QUERY
• QUERY FROM ANY NODE
• USE SOLR SYNTAX
• RETURN SOLR RESULT VERBATIM
• CAN USE EXISTING SOLR CLIENTS
(FOR QUERY, NOT WRITE)
79
Slide 80
Slide 80 text
WHAT HAPPENS IF
YOU TAKE 2 NODES
DOWN?
80
Slide 81
Slide 81 text
DOWN 2 NODES
81
Slide 82
Slide 82 text
VERIFY DOWN
82
Slide 83
Slide 83 text
QUERY (DOWN)
83
Slide 84
Slide 84 text
QUERY (DOWN)
• INDEX REPLICATION ALLOWS FOR
QUERY AVAILABILITY
• JUST NEED 1 REPLICA OF INDEX
• IF TOO MANY NODES GO DOWN
YOKOZUNA WILL REFUSE QUERY
• PREFERS 100% HARVEST
84
Slide 85
Slide 85 text
WHAT HAPPENS IF
YOU WRITE DATA
WHILE NODES ARE
DOWN?
85
Slide 86
Slide 86 text
VERIFY 0 RESULTS
86
Slide 87
Slide 87 text
ADD NEW DATA
87
Slide 88
Slide 88 text
QUERY NODE 1
88
Slide 89
Slide 89 text
DISABLE HANDOFF
89
Slide 90
Slide 90 text
START NODE 4 & 5
90
Slide 91
Slide 91 text
QUERY SOLR NODE 4
91
Slide 92
Slide 92 text
ENABLE HANDOFF
92
Slide 93
Slide 93 text
TAIL LOGS
93
Slide 94
Slide 94 text
QUERY SOLR NODE 4
94
Slide 95
Slide 95 text
QUERY NODE 4
95
Slide 96
Slide 96 text
WHAT HAPPENS IF
YOU LOSE YOUR
INDEX DATA?
96
Slide 97
Slide 97 text
QUERY SOLR NODE 4
NOTICE NUM
FOUND IS 6747
97
Slide 98
Slide 98 text
RM -RF THE INDEX
98
Slide 99
Slide 99 text
KILL -9 JVM
99
Slide 100
Slide 100 text
YOKOZUNA RESTARTS JVM
100
Slide 101
Slide 101 text
QUERY SOLR NODE 4
NUM FOUND 0
BECAUSE INDEX
WAS DELETED
101
Slide 102
Slide 102 text
AAE DETECT/REPAIR
102
Slide 103
Slide 103 text
QUERY SOLR NODE 4
NUM FOUND IS
6747 AGAIN THANKS
TO AAE
103