Slide 1

Slide 1 text

Leaving the Ivory Tower: Research in the Real World

Slide 2

Slide 2 text

Armon Dadgar Co-Founder and CTO at HashiCorp

Slide 3

Slide 3 text

Copyright © 2018 HashiCorp
HashiCorp Suite
Provision: Operations
Secure: Security
Deploy: Development
Connect: Networking
Private Cloud, AWS, Azure, GCP
Common Cloud Operating Model

Slide 4

Slide 4 text

Research Origins Mitchell Hashimoto Armon Dadgar

Slide 5

Slide 5 text

Contributing Back

Slide 6

Slide 6 text

Standing on the Shoulders of Giants Or The Value of Research ▪ Discover the “State of the Art” ▪ Relevant works to challenge thinking ▪ Understand fundamental tradeoffs (e.g. FLP Theorem) ▪ Metrics for evaluation

Slide 7

Slide 7 text

Building Consul: A Story of (Service) Discovery

Slide 8

Slide 8 text

Immutable + Micro-services Front End API Layer Data Layer Immutable Artifact

Slide 9

Slide 9 text

Common Solutions Circa 2012 ▪ Hard Coded IP of Host / Virtual IP / Load Balancer ▪ Config Management “Convergence Runs” ▪ Custom Zookeeper based systems

Slide 10

Slide 10 text

Imagining Solutions API Layer Data Layer Database:3306 10.0.1.25:3306 API Layer Data Layer 10.0.1.25:3306

Slide 11

Slide 11 text

Entirely Peer to Peer B C A D

Slide 12

Slide 12 text

Exploring the Literature Centralized Decentralized Central Servers “Super Peers” Peer To Peer

Slide 13

Slide 13 text

Exploring the Literature Structured Unstructured Rings Spanning Trees Binary Trees Adaptive Structure Hybrid Structures Epidemic Broadcast Mesh Network Randomized

Slide 14

Slide 14 text

Exploring the Literature Limited Visibility Full Visibility Few Members Known “Neighbors” Known All Members Known

Slide 15

Slide 15 text

Imposing Constraints
Cloud Datacenter Environment
▪ Low Latency and High Bandwidth. We are operating within a cloud datacenter, where we expect low latencies and high bandwidth, relative to IoT or Internet-wide applications.
▪ Few Nodes (< 5K). The operating environment was not large-scale peer-to-peer public networks for file sharing, but private infrastructure. The scale is much smaller than some other target environments.
▪ Simple To Implement. Keep It Simple Stupid (KISS) was a goal. We wanted the simplest possible implementation, and no simpler. Complex protocols are more difficult to implement correctly.

Slide 16

Slide 16 text

The SWIM Approach

Slide 17

Slide 17 text

SWIM Properties ▪ Completely Decentralized ▪ Unstructured, with Epidemic Dissemination ▪ Full Visibility, All Members Known ▪ Trades more bandwidth use for simplicity and fault tolerance
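
The "epidemic dissemination" property can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not SWIM itself and not Serf's implementation: the function name `rounds_to_spread`, the fanout of 3, and the synchronous round model are all hypothetical choices made for illustration.

```python
import random


def rounds_to_spread(num_nodes: int, fanout: int = 3, seed: int = 0) -> int:
    """Count gossip rounds until every node has heard a single update.

    Assumed model: in each round, every informed node forwards the update
    to `fanout` peers chosen uniformly at random (requires fanout <= num_nodes).
    """
    rng = random.Random(seed)
    informed = {0}  # node 0 originates the update
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for _ in informed:
            # Targets are random, so some of them are already informed.
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds
```

Calling `rounds_to_spread` for a few cluster sizes up to the 5K node bound shows the round count growing slowly with cluster size, which is the bandwidth-for-simplicity trade the slide describes.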

Slide 18

Slide 18 text

Closely Considered ▪ Plumtree. Hybrid tree and epidemic style. ▪ T-Man. Adaptive, can change internal style. ▪ HyParView. Limited view of membership. ▪ Implementation complexity deemed not worthwhile ▪ Size of clusters not a concern for full view ▪ Expected traffic minimal

Slide 19

Slide 19 text

Adaptations Used ▪ Bi-Modal Multicast. Active Push/Pull Synchronization. ▪ Steady State vs Recovery Messages. Optimize for efficient distribution in steady state. ▪ Lamport Clocks. Provide a causal relationship between messages. ▪ Vivaldi. Network Coordinates to determine “distance” of peers.
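
The Lamport clock bullet can be made concrete with the textbook update rule. This is a minimal sketch, assuming the classic formulation; the class and method names (`LamportClock`, `tick`, `witness`) are hypothetical and are not taken from Serf's code.

```python
class LamportClock:
    """Logical clock: a counter that only ever moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event or just before sending."""
        self.time += 1
        return self.time

    def witness(self, received: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        # Jump past whatever the sender had seen, so the receive event
        # is ordered after the send event.
        self.time = max(self.time, received) + 1
        return self.time
```

If node A ticks and sends its timestamp to node B, B's `witness` call returns a strictly larger value, which is what gives messages a causal order without synchronized wall clocks.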

Slide 20

Slide 20 text

Serf Product (serf.io)

Slide 21

Slide 21 text

Gossip For Service Discovery B C A D “Web” at IP1 “DB” at IP2 “Cache” at IP3 “LB” at IP4

Slide 22

Slide 22 text

Serf in Practice ▪ (+) Immutable Simplified ▪ (+) Fault Tolerant, Easy to Operate ▪ (-) Eventual Consistency ▪ (-) No Key/Value Configuration ▪ (-) No “Central” API or UI

Slide 23

Slide 23 text

Rethinking Architecture B C A D “Web” at IP1 “DB” at IP2 “Cache” at IP3 “LB” at IP4 Server

Slide 24

Slide 24 text

Central Servers Challenges ▪ High Availability ▪ Durability of State ▪ Strong Consistency

Slide 25

Slide 25 text

Paxos or How Hard is it to Agree?

Slide 26

Slide 26 text

Paxos Made Simple (?)

Slide 27

Slide 27 text

Exploring The Literature ▪ Multi Paxos ▪ Egalitarian Paxos ▪ Fast Paxos ▪ Cheap Paxos ▪ Generalized Paxos

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Raft or Paxos Made Simple

Slide 30

Slide 30 text

Consul Product (consul.io) Hybrid CP / AP Design - Strongly consistent servers (Raft) - Weakly consistent membership (SWIM) - Centralized API and State - Decentralized Operation
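
The "strongly consistent servers" side rests on majority agreement: a Raft cluster makes progress only while more than half of its servers can communicate. A minimal sketch of that arithmetic follows; the function names are hypothetical helpers, not part of Consul.

```python
def quorum_size(servers: int) -> int:
    """Smallest group that is a strict majority of the cluster."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """Servers that can be lost while a majority still remains."""
    return servers - quorum_size(servers)
```

Under this rule a 3-server cluster tolerates one failure and a 5-server cluster tolerates two, while going from 3 to 4 servers adds no extra tolerance.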

Slide 31

Slide 31 text

Work Embedded in Consul (and Serf) ▪ Consensus ▪ Gossip Protocols ▪ Network Tomography ▪ Capabilities Based Security ▪ Concurrency Control (MVCC) ▪ Lamport / Vector Clocks

Slide 32

Slide 32 text

Research across Products - Security Systems (Kerberos) - Security Protocols - Access Control Systems - Cryptography - Graph Theory - Type Theory - Automata Theory - Scheduler Design (Mesos, Borg, Omega) - Bin Packing - Pre-emption - Consensus - Gossip

Slide 33

Slide 33 text

Forming HashiCorp Research

Slide 34

Slide 34 text

Industrial Research Group Jon Currey joins as Director of Research

Slide 35

Slide 35 text

Focus on industrial research, working 18 to 24 months ahead of engineering, on novel work. HashiCorp Research Charter

Slide 36

Slide 36 text

Research Goals Problem Novel Solution Existing Solution Publish Integrate Product

Slide 37

Slide 37 text

Customer Problem Frontend Backend Internet

Slide 38

Slide 38 text

Customer Problem Frontend Backend Internet

Slide 39

Slide 39 text

Research Process Collect Data Make Hypothesis Design Solution Design Experiment Validate Hypothesis Validate Solution

Slide 40

Slide 40 text

Gossip FSM Suspect Healthy Dead Ping Timeout Suspect Timeout Refute Dead Refute Suspect
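
The state machine on this slide can be written as a transition table. This is a sketch under an assumption: the slide text lists only the state and edge labels, so the pairing of each edge with its source and target state below is inferred, and the names `State`, `Event`, and `step` are hypothetical.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    DEAD = "dead"


class Event(Enum):
    PING_TIMEOUT = "ping timeout"
    SUSPECT_TIMEOUT = "suspect timeout"
    REFUTE_SUSPECT = "refute suspect"
    REFUTE_DEAD = "refute dead"


# Assumed edges: (current state, event) -> next state.
TRANSITIONS = {
    (State.HEALTHY, Event.PING_TIMEOUT): State.SUSPECT,
    (State.SUSPECT, Event.SUSPECT_TIMEOUT): State.DEAD,
    (State.SUSPECT, Event.REFUTE_SUSPECT): State.HEALTHY,
    (State.DEAD, Event.REFUTE_DEAD): State.HEALTHY,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Read this way, a member that is slow to process messages can miss its chance to refute and be driven from Healthy to Dead purely by timers.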

Slide 41

Slide 41 text

Untimely Processing Suspect Healthy Dead Ping Timeout Suspect Timeout Refute Dead Refute Suspect

Slide 42

Slide 42 text

Reducing Sensitivity
Exponential Convergence - Replace Fixed Timers - Use Redundant Confirmations - Insight from Bloom Filters, K independent hashes
Local Health Awareness - Measure Local Health - Tune sensitivity as health changes
Early Notification - Send Suspicion Early - Send Suspicion Redundantly - Enable faster refutation
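
Two of these ideas can be sketched numerically: a suspicion timeout that shrinks as independent confirmations arrive, and a probe timeout that loosens when the local node itself looks unhealthy. This is an illustrative sketch only. The logarithmic decay, the `(score + 1)` multiplier, and all names and parameters are assumptions made here, not a statement of what the published work or the product implements.

```python
import math


def suspicion_timeout(confirmations: int, k: int, min_s: float, max_s: float) -> float:
    """Shrink the timeout from max_s toward min_s as confirmations arrive.

    `k` is the assumed number of independent confirmations needed to
    reach the minimum timeout.
    """
    if k < 1:
        return min_s
    frac = math.log(confirmations + 1) / math.log(k + 1)
    # Clamp so extra confirmations beyond k never go below min_s.
    return max(min_s, max_s - (max_s - min_s) * frac)


def scaled_probe_timeout(base_s: float, local_health_score: int) -> float:
    """Wait longer for replies when this node's own health score is poor.

    A score of 0 means healthy; higher values mean more local trouble.
    """
    return base_s * (local_health_score + 1)
```

With no confirmations the full `max_s` applies, so a lone accuser must wait longest before declaring a peer dead; each additional confirmation shortens the wait, replacing the single fixed timer.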

Slide 43

Slide 43 text

Evaluation of Solution

Slide 44

Slide 44 text

Publishing Lifeguard

Slide 45

Slide 45 text

Integration with Product

Slide 46

Slide 46 text

Picking the Problem

Slide 47

Slide 47 text

Vault Audit Logs User Action Audit Log

Slide 48

Slide 48 text

Vault Anomaly Detector Anomaly Detection Audit Log User Action

Slide 49

Slide 49 text

Anomaly Detector Unexpected Expected Event Detector Model

Slide 50

Slide 50 text

Exploring the Literature Few False Negatives Few False Positives Lots of false positives Lots of false negatives

Slide 51

Slide 51 text

Applications to Vault Screen Millions of Events Security Issues Missed
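
The tension between screening millions of events and missing security issues comes down to base-rate arithmetic. The sketch below works it through; the function name and every rate passed to it are hypothetical inputs chosen for illustration, not measurements from Vault.

```python
def alert_breakdown(total_events: int, anomaly_rate: float,
                    false_positive_rate: float, false_negative_rate: float) -> dict:
    """Split a stream of events into true alerts, false alerts, and misses."""
    anomalies = total_events * anomaly_rate
    normal = total_events - anomalies
    false_alerts = normal * false_positive_rate  # normal events flagged anyway
    missed = anomalies * false_negative_rate     # real issues not flagged
    true_alerts = anomalies - missed
    return {
        "true_alerts": true_alerts,
        "false_alerts": false_alerts,
        "missed": missed,
    }
```

For example, with 1,000,000 events, an assumed anomaly rate of 0.01%, and 1% error rates in both directions, roughly 10,000 false alerts would bury about 99 true ones. Tuning the detector to cut false alerts pushes the miss count up instead.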

Slide 52

Slide 52 text

Defining a Model Unexpected Expected Event Detector Model

Slide 53

Slide 53 text

Refining Configuration Vault Advisor Audit Log User Action Configuration

Slide 54

Slide 54 text

Vault Advisor in Depth

Slide 55

Slide 55 text

Research Status Problem Novel Solution Existing Solution Publish Integrate Product

Slide 56

Slide 56 text

Lifeguard Integration Pull Request Upstream Research Team Project Fork Eng Team

Slide 57

Slide 57 text

Productization Research Team | Advisor Prototype Eng Team Train Develop Publish Research Embedded

Slide 58

Slide 58 text

What’s Coming Problem Novel Solution Existing Solution Publish Integrate Product

Slide 59

Slide 59 text

Research Culture

Slide 60

Slide 60 text

Fostering Research Culture ▪ Product / Engineering is 100x bigger than Research ▪ Cultural approach needed ▪ Consuming research

Slide 61

Slide 61 text

Publishing PRD / RFCs

Slide 62

Slide 62 text

Slack #talk-research

Slide 63

Slide 63 text

Brown bags and Conferences

Slide 64

Slide 64 text

Sponsorships & Memberships

Slide 65

Slide 65 text

Cultural Goals ▪ Build awareness of research ▪ Give access to published academic work ▪ Create channels to engage internally ▪ Promote involvement in external community ▪ Involve Research in Engineering, and vice versa

Slide 66

Slide 66 text

Conclusion

Slide 67

Slide 67 text

Real world value ▪ Leverage the “State of the Art”, instead of naive design ▪ Apply domain constraints against fundamental tradeoffs ▪ Improve product performance, security, and usability

Slide 68

Slide 68 text

Research used from Day 1 ▪ Academic research fundamental to HashiCorp Products ▪ Day 1 core designs based on the literature ▪ Day 2+ improvements from literature

Slide 69

Slide 69 text

HashiCorp Research ▪ Focused on Industrial Research ▪ Publishing work, not just consuming ▪ Advocate for research culture internally ▪ Features like Lifeguard ▪ New products like Vault Advisor

Slide 70

Slide 70 text

Promoting Research ▪ Build a culture around research ▪ Enable access, encourage consumption ▪ Create bridges between Research and Engineering ▪ Vocalize the benefits

Slide 71

Slide 71 text

Thank You www.hashicorp.com