Slide 1

Migratory Patterns KubeCon + CloudNativeCon North America 2024 Salt Lake City @thepete.net

Slide 2

✨ AI

Slide 3

✨AI-POWERED ✨GenAI✨ Service Our AI service uses RAG, and needed a vector store (just a specialized database) @thepete.net

Slide 4

Vector Store ✨AI-POWERED ✨GenAI✨ Service Our AI service uses RAG, and needed a vector store (just a specialized database) @thepete.net

Slide 5

Vector Store ✨AI-POWERED ✨GenAI✨ Service Our AI service uses RAG, and needed a vector store (just a specialized database) We were loading data into that vector store via a data ingestion pipeline @thepete.net

Slide 6

@thepete.net ✨AI-POWERED ✨GenAI✨ Service Data Ingestion Vector Store Our AI service uses RAG, and needed a vector store (just a specialized database) We were loading data into that vector store via a data ingestion pipeline

Slide 7

@thepete.net ✨AI-POWERED ✨GenAI✨ Service Data Ingestion Vector Store Pinecone Vector Store Our AI service uses RAG, and needed a vector store (just a specialized database) We were loading data into that vector store via a data ingestion pipeline For our initial proof-of-concept, we’d used Pinecone for our vector store technology

Slide 8

@thepete.net ✨AI-POWERED ✨GenAI✨ Service Data Ingestion Vector Store Pinecone Vector Store Our AI service uses RAG, and needed a vector store (just a specialized database) We were loading data into that vector store via a data ingestion pipeline For our initial proof-of-concept, we’d used Pinecone for our vector store technology For various reasons, we wanted to re-platform that vector store to Postgres (w. pgvector) Pinecone Postgres
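
For context on what "Postgres (w. pgvector)" means in practice: pgvector is a Postgres extension that adds a vector column type and similarity operators. A minimal setup sketch, not from the talk (the table name and embedding dimensions are illustrative assumptions):

import psycopg

def create_vector_store(conn_str: str, dimensions: int = 1536):
    # hypothetical setup sketch: enable pgvector and create a table
    # with a vector column sized to the embedding model's output
    with psycopg.connect(conn_str) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            f"""CREATE TABLE IF NOT EXISTS embeddings (
                    id text PRIMARY KEY,
                    content text,
                    embedding vector({dimensions})
                )"""
        )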

Slide 9

@thepete.net Pinecone Postgres Writer Reader Pinecone Postgres Writer Reader BEFORE AFTER

Slide 10

BIG BANG MIGRATION Pinecone Postgres Writer Reader @thepete.net

Slide 11

Writer Reader Pinecone 1. STOP THE WORLD Postgres BIG BANG MIGRATION @thepete.net

Slide 12

Writer Reader Pinecone Postgres 1. STOP THE WORLD 2. BACKFILL DATA BIG BANG MIGRATION @thepete.net

Slide 13

Writer Reader Pinecone Postgres 1. STOP THE WORLD 2. BACKFILL DATA 3. CUT-OVER BIG BANG MIGRATION @thepete.net

Slide 14

Writer Reader Pinecone Postgres 1. STOP THE WORLD 2. BACKFILL DATA 3. CUT-OVER 4. START THE WORLD BIG BANG MIGRATION @thepete.net

Slide 15

Disadvantages of Big Bang Migrations
• We have to stop the world
  • downtime for our users
  • very stressful for us!
• No safe way to test production changes
• No plan B if things go wrong
@thepete.net

Slide 16

Writer Reader Changes have only been written to Postgres. Not feasible to fall back to Pinecone. NO PLAN B Pinecone Postgres @thepete.net

Slide 17

EXPAND/CONTRACT MIGRATION BIG BANG MIGRATION @thepete.net

Slide 18

EXPAND/CONTRACT MIGRATION Writer Reader Pinecone Postgres @thepete.net

Slide 19

EXPAND/CONTRACT MIGRATION 1. DUAL WRITE Writer Reader Pinecone Postgres @thepete.net

Slide 20

EXPAND/CONTRACT MIGRATION 1. DUAL WRITE 2. BACKFILL Writer Reader Pinecone Postgres @thepete.net
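
The backfill step can be a simple batch job. A minimal sketch, assuming hypothetical read_all_from_pinecone and upsert_batch_to_postgres helpers (not shown in the talk): because dual writes are already live, the job only has to copy records that predate them, and upserts keep it safe to re-run.

def backfill_to_postgres(self, batch_size: int = 500):
    # dual writes are already capturing new data in both stores, so
    # this job only has to copy what existed before they went live
    for batch in self.read_all_from_pinecone(batch_size=batch_size):
        # upserts make the backfill idempotent: re-running it (or racing
        # with a concurrent dual write) just rewrites the same data
        self.upsert_batch_to_postgres(batch)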

Slide 21

EXPAND/CONTRACT MIGRATION 1. DUAL WRITE 2. BACKFILL 3. CUT-OVER Writer Reader Pinecone Postgres @thepete.net

Slide 22

EXPAND/CONTRACT MIGRATION 1. DUAL WRITE 2. BACKFILL 3. CUT-OVER 4. WRAP UP Pinecone Writer Reader Pinecone Postgres @thepete.net

Slide 23

Expand/Contract enables confident migrations
A sequence of steps, where every step has an option to fall back
Safety -> Courage -> Speed

Slide 24

Show me the code

Slide 25

DUAL WRITE, IRL

def run_pipeline(self):
    print("loading projects...")
    projects = self._hippocampus_db.read_projects()
    print("building project descriptions...")
    build_docs(projects)
    print("prepping upload...")
    upload_df = prep_upload(projects)
    # at this stage of our migration from pinecone to postgres we're
    # dual-writing to both pinecone and local PG
    print("uploading to pinecone vectorstore...")
    self.upload_to_pinecone_vectorstore(upload_df)
    print("uploading to pg vectorstore...")
    self.upload_to_postgres_vectorstore(upload_df)
    print("PIPELINE COMPLETE!")

Slide 26

CUT-OVER, IRL

def lookup_candidates_for(self, description: str):
    if feature_flags.use_postgres_for_vector_store():
        vector_store = self._postgres_vector_store
    else:
        vector_store = self._pinecone_vector_store
    results = vector_store.similarity_search(
        description, self._number_of_results
    )
    return self._candidates_from(results)

Slide 27

Pinecone Postgres Writer Reader feature flag 3. CUT-OVER @thepete.net

Slide 28

Pinecone Postgres Writer Reader feature flag 3. CUT-OVER @thepete.net

Slide 29

Pinecone Postgres Reader feature flag @thepete.net

Slide 30

Pinecone Postgres Reader feature flag 0% @thepete.net

Slide 31

Pinecone Postgres Reader feature flag 5% @thepete.net

Slide 32

Pinecone Postgres Reader feature flag 50% @thepete.net
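
A gradual ramp like 0% -> 5% -> 50% usually relies on sticky bucketing, which feature-flag services implement for you. A minimal sketch of the idea (illustrative, not the talk's implementation): hash a stable identifier so each user consistently lands in or out of the canary as the percentage grows.

import hashlib

def use_postgres_for_vector_store(user_id: str, rollout_percent: float) -> bool:
    # hash a stable id into a bucket from 0-99; the same user always
    # lands in the same bucket, so their experience stays consistent
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent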

Slide 33

Feature Flags & Observability @thepete.net

Slide 34

spot the canary @thepete.net

Slide 35

start of 5% canary spot the canary @thepete.net

Slide 36

spot the canary start of 5% canary read_from_new_system: true read_from_new_system: false @thepete.net
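
Tagging every read with the flag's value is what makes a chart like this possible. The talk shows the read_from_new_system attribute but not the instrumentation; a minimal sketch using OpenTelemetry (one way to do it, reusing the lookup code from slide 26):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def lookup_candidates_for(self, description: str):
    use_postgres = feature_flags.use_postgres_for_vector_store()
    # annotate the span so latency can be grouped by flag state
    with tracer.start_as_current_span("similarity_search") as span:
        span.set_attribute("read_from_new_system", use_postgres)
        store = (self._postgres_vector_store if use_postgres
                 else self._pinecone_vector_store)
        results = store.similarity_search(description, self._number_of_results)
    return self._candidates_from(results)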

Slide 37

spot the canary start of 5% canary read_from_new_system: true read_from_new_system: false ~ twice as slow! @thepete.net

Slide 38

@thepete.net

Slide 39

One Standard Many Vendors AppDynamics (Cisco) Aria by VMware (Wavefront) Arize Phoenix Aspecto Axiom Better Stack BugSnag Causely Centreon Chronosphere Control Plane Coralogix Cribl Dash0 DaoCloud Datadog Dynatrace Elastic F5 observIQ OneUptime OpenObserve OpenText Oracle qryn Red Hat Sentry Software ServicePilot SigNoz SolarWinds Splunk Sumo Logic TelemetryHub Traceloop Uptrace Google Cloud Platform Grafana Labs Helios Highlight Honeycomb HyperDX Immersive Fusion Instana ITRS KloudFuse KloudMate ServiceNow Cloud Observability (Lightstep) Last9 Levitate LogicMonitor LogScale by Crowdstrike (Humio) Lumigo MetricsHub Middleware New Relic Observe, Inc. Apache SkyWalking Fluent Bit Jaeger ObserveAny GreptimeDB TingYun VictoriaMetrics Tracetest Alibaba Cloud Seq VuNet Systems Bonree Embrace groundcover @thepete.net

Slide 40

Old Thing New Thing Consumer PARALLEL EXECUTION 1. Stand up the new thing 2. Keep state synchronized between old and new things 3. Choose when to send traffic to the new thing @thepete.net

Slide 41

@thepete.net DB schema changes API schema changes Re-platforming to different infrastructure Extracting a microservice Switching service providers {v1} {v2}

Slide 42

Old Thing New Thing Consumer PARALLEL EXECUTION 1. Stand up the new thing 2. Keep state synchronized between old and new things 3. Choose when to send traffic to the new thing @thepete.net

Slide 43

The Monolith Accounting Module Extracting a Microservice Accounting Service @thepete.net

Slide 44

The Monolith Accounting Module Extracting a Microservice Accounting Service - Put a shim in front of the module @thepete.net

Slide 45

The Monolith Accounting Module Extracting a Microservice Accounting Service - Put a shim in front of the module - Have all internal calls route through the shim @thepete.net

Slide 46

The Monolith Accounting Module Extracting a Microservice Accounting Service - Put a shim in front of the module - Have all internal calls route through the shim - Shim routes calls to internal module or external service (based on feature flag) @thepete.net
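
A minimal sketch of what that shim might look like (class and method names are illustrative, not from the talk): internal callers depend only on the shim, and the feature flag decides whether a call stays in-process or crosses the network to the new service.

class AccountingShim:
    def __init__(self, accounting_module, accounting_service_client, feature_flags):
        self._module = accounting_module            # old: in-process module
        self._service = accounting_service_client   # new: external service
        self._flags = feature_flags

    def create_invoice(self, order):
        # the flag controls routing, so rollback is just flipping it off
        if self._flags.use_accounting_service():
            return self._service.create_invoice(order)
        return self._module.create_invoice(order)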

Slide 47

Accounting Service Accounting Module Shim feature flag Consumer @thepete.net

Slide 48

Accounting Service Accounting Module Shim feature flag Consumer @thepete.net

Slide 49

Accounting Service Accounting Module Consumer RESULT FROM OLD MODULE RESULT FROM NEW SERVICE CHECK RESULTS AGREE RESULT FROM OLD MODULE PARALLEL RUN pattern RESULT FROM NEW SERVICE discard parallel result CALL BOTH IMPLEMENTATIONS @thepete.net

Slide 50

PARALLEL RUN, IRL

def similarity_search(self, description: str) -> List[Chunk]:
    # dark launch w. parallel run: we will use both pinecone- and
    # postgres-backed vector stores to find candidate comans,
    # and compare the results to ensure that the postgres-backed
    # vector store is working as expected.
    pinecone_results = self._pinecone_store.similarity_search(description)
    postgres_results = self._postgres_store.similarity_search(description)
    self._check_for_parallel_run_discrepancy(pinecone_results, postgres_results)
    # *FOR NOW*, WE THROW AWAY THE POSTGRES-BACKED RESULTS
    # AND ONLY RETURN THE PINECONE-BACKED RESULTS
    return pinecone_results

Slide 51

PARALLEL RUN, IRL

def _check_for_parallel_run_discrepancy(
    self,
    pinecone_results: list[PineconeEntry],
    postgres_results: list[PostgresEntry],
):
    if len(pinecone_results) != len(postgres_results):
        self._record_discrepancy(
            "different number of results",
            {
                "pinecone_count": len(pinecone_results),
                "postgres_count": len(postgres_results),
            },
        )
        # no point in continuing comparison if we're not comparing the same entries!
        return

    for pinecone_entry, postgres_entry in zip(pinecone_results, postgres_results):
        # pinecone ids have a "pmatch:" prefix which we need to take into
        # account when comparing
        if pinecone_entry.coman_id != "pmatch:" + postgres_entry.coman_id:
            self._record_discrepancy(
                "different coman ids",
                {
                    "pinecone_coman_id": pinecone_entry.coman_id,
                    "postgres_coman_id": postgres_entry.coman_id,
                },
            )
            # no point in continuing comparison if we're not comparing the same entries!
            continue

        # a very small variance is possible due to the way that the vector
        # stores calculate vector distance
        if abs(pinecone_entry.similarity - postgres_entry.similarity) > 1e-5:
            self._record_discrepancy(
                "different similarity scores",
                {
                    "pinecone_similarity": pinecone_entry.similarity,
                    "postgres_similarity": postgres_entry.similarity,
                },
            )
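
The talk doesn't show _record_discrepancy itself; a minimal sketch (the logger and counter are illustrative assumptions) is to log the details and bump a metric, so discrepancies surface in dashboards without ever affecting what the user sees:

import logging

logger = logging.getLogger(__name__)

def _record_discrepancy(self, reason: str, details: dict):
    # observe-only: record the mismatch for humans to investigate, and
    # never raise, so the parallel run can't break the live code path
    logger.warning("parallel run discrepancy: %s %s", reason, details)
    self._discrepancy_counter.add(1, {"reason": reason})  # e.g. an OTel counter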

Slide 52

Dark Launch

Slide 53

Dark Launch
“The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a ‘dark launch’ period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page.”
https://engineering.fb.com/2008/05/13/web/facebook-chat/

Slide 54

in conclusion
• Big-bang migrations are risky
• Expand/contract reduces the stress
• The pattern is applicable to a surprising variety of migrations
• There are fancy variants, but they all build on the same core ideas
• Feature flags and observability make things even better
@thepete.net

Slide 55

@thepete.net

Slide 56

Thanks! Questions? @thepete.net

Slide 57

Thanks! Questions? Slides will be in my socials. I love talking about this stuff! Come chat. beingagile Pete Hodgson https://thepete.net @ph1 @thepete.net