Managing Data Chaos in The World of Microservices

Microservices have been one of the hottest topics in recent years, and the industry is shifting toward splitting applications into smaller and smaller independent units. This is happening for very good reasons: you can gain a lot in terms of both technology and organizational scalability. Many infrastructure tools have been developed to support the movement, from schedulers, deploy automation, and service discovery systems to development tools like distributed tracers, log aggregators, and analyzers, and we’ve invented and reinvented protocols to make microservice communication even more efficient. However, one problem is often overlooked: the data layer is being diluted by the active encapsulation that microservices need in order to grow and evolve.

As we move toward more independently encapsulated services, we’re experiencing dramatically increased challenges managing data, including:

* Observability, knowledge sharing, and data discovery (Who owns that piece of the data? Where can I find that thing?)
* Querying the data (What API should I expose for others? How can I get this info from that dataset? Should I cache this or re-query when necessary?)
* Structural and semantic changes in the datasets (Can I add a new field here? Who’s using this record, and how can I update it without breaking any other services?)

These problems are common, but most of our effort and attention is directed at infrastructure, which is easier to find generic solutions for. On the other hand, making sense of the data is hardly a generalizable problem. There have been many attempts to tame the chaos associated with independent dataset management. Oleksii Kachaiev discusses high-level approaches to build a sharable abstraction layer separating “physical” details from logical concerns as well as specific technologies you can leverage.

The growing complexity of your data layer may overshadow the benefits of microservices architecture you deployed, so the sooner you start working on the solution, the easier it will be to manage the chaos.

Oleksii Kachaiev

September 13, 2018
Transcript

  1. 2.

    @me • CTO at Attendify • 6+ years with Clojure in production • Creator of Muse (Clojure) & Fn.py (Python) • Aleph & Netty contributor • More: protocols, algebras, Haskell, Idris • @kachayev on Twitter & GitHub
  2. 3.

    The Landscape • microservices are common nowadays • mostly we talk about deployment, discovery, tracing • rarely we talk about protocols and error handling • we almost never talk about data access • we almost never think about data access in advance
  3. 4.

    The Landscape • infrastructure questions are "generalizable" • data is a pretty peculiar phenomenon • number of use cases is way larger • but we still can summarize something
  4. 5.

    The Landscape • service SHOULD encapsulate data access • meaning, no direct access to DB, caches etc • otherwise you have a distributed monolith • ... and even more problems
  5. 6.

    The Landscape • data access/manipulation: • reads • writes • mixed transactions • each one is a separate topic
  6. 7.

    The Landscape • reads • transactions (a.k.a. "real-time", mostly API responses) • analysis (a.k.a. "offline", mostly preprocessing) • will talk mostly about transaction reads • it's a complex topic with microservices
  7. 8.

    The Landscape • early days: monolith with a single storage • (mostly) relational, (mostly) with SQL interface • now: a LOT of services • backed by different storages • with different access protocols • with different transactional semantics
  8. 9.

    Across Services... • no "JOINS" • no transactions • no foreign keys • no migrations • no standard access protocol
  9. 10.

    Across Services... • no manual "JOINS" • no manual transactions • no manual foreign keys • no manual migrations • no standard manually crafted access protocol
  10. 11.

    Across Services... • "JOINS" turned out to be "glue code" • transaction integrity is a problem, you end up fighting with • dirty & non-repeatable reads • phantom reads • no ideal solution for referential integrity
  11. 12.

    Use Case • typical messenger application • users (microservice "Users") • chat threads & messages (service "Messages") • now you need a list of unread messages with senders • hmmm...
  12. 13.

    JOINs: Monolith & "SQL" Storage

        SELECT m.id, m.text, m.created_at,
               u.email, u.first_name, u.last_name,
               u.photo->>'thumb_url' AS photo_url
        FROM messages AS m
        JOIN users AS u ON m.sender_id = u.id
        WHERE m.status = 'unread' AND m.sent_by = :user_id
        LIMIT 20
  13. 15.

    JOINs: How? • on the client side • Falcor by Netflix • not a very popular approach • due to "almost" obvious problems • impl. complexity • "too much" information on the client
  14. 16.

    JOINs: How? • on the server side • either put this as a new RPC to existing service • or add new "proxy"-level functionality • you still need to implement this...
  15. 18.

    Glue Code: Manual JOIN

        ;; d: presumably manifold.deferred; fetch-user and fetch-last-messages are service calls
        (defn inject-sender [{:keys [sender-id] :as message}]
          (d/chain' (fetch-user sender-id)
                    (fn [user] (assoc message :sender user))))

        (defn fetch-thread [thread-id]
          (d/chain' (fetch-last-messages thread-id 20)
                    (fn [messages]
                      (->> messages
                           (map inject-sender)
                           (apply d/zip')))))
  16. 19.

    Glue Code: Manual JOIN • it's kinda simple at first glance • we're all engineers, we know how to write code! • it's super boring doing this each time • your CI server is happy, but there're a lot of problems • the key problem: it's messy • we're mixing nodes, relations, fetching etc
  17. 20.

    Glue Code: Keep In Mind • concurrency, scheduling • request deduplication • how many times will you fetch each user in the example? • batches • error handling • traceability, debuggability
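
The deduplication and batching points above can be sketched in a few lines. This is an illustration only, not how any of the libraries in this talk work internally: `fetch_users_batch` is a hypothetical stand-in for a batched call to the "Users" service, and `CALLS` exists purely to show how many downstream requests were made.

```python
import asyncio

# Records each batch sent "downstream" so the effect is visible (illustration only).
CALLS = []

async def fetch_users_batch(user_ids):
    """Hypothetical batched RPC to the Users service."""
    CALLS.append(sorted(user_ids))
    await asyncio.sleep(0)  # stand-in for network IO
    return {uid: {"id": uid, "first_name": f"user-{uid}"} for uid in user_ids}

async def fetch_thread_with_senders(messages):
    # Deduplicate: each sender is requested once, however many messages they sent.
    sender_ids = {m["sender_id"] for m in messages}
    # Batch: one downstream call instead of one call per message.
    users = await fetch_users_batch(sender_ids)
    return [{**m, "sender": users[m["sender_id"]]} for m in messages]

MESSAGES = [
    {"id": 1, "text": "hi", "sender_id": 7},
    {"id": 2, "text": "hello", "sender_id": 8},
    {"id": 3, "text": "again", "sender_id": 7},
]

RESULT = asyncio.run(fetch_thread_with_senders(MESSAGES))
```

With the naive per-message join, user 7 would be fetched twice; here the three messages cost a single call for users 7 and 8.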
  18. 21.

    Glue Code: Libraries • Stitch (Scala, Twitter), 2014 (?) • Haxl (Haskell, Facebook), 2014 • Clump (Scala, SoundCloud), 2014 • Muse (Clojure, Attendify), 2015 • Fetch (Scala, 47 Degrees), 2016 • ... a lot more
  19. 22.

    Glue Code: How? • declare data sources • declare relations • let the library & compiler do the rest of the job • data nodes traversal & dependencies walking • caching • parallelization
  20. 23.

    Glue Code: Muse

        ;; declare data nodes
        (defrecord User [id]
          muse/DataSource
          (fetch [_] ...))

        (defrecord ChatThread [id]
          muse/DataSource
          (fetch [_] (fetch-last-messages id 20)))

        ;; implement relations
        (defn inject-sender [{:keys [sender-id] :as m}]
          (muse/fmap (partial assoc m :sender) (User. sender-id)))

        (defn fetch-thread [thread-id]
          (muse/traverse inject-sender (ChatThread. thread-id)))
  21. 24.

    Glue Code: How's It Going? • pros: less code & more predictability • separate nodes & relations • executor might be optimized as a library • cons: requires a library to be adopted • can we do more? • ... pair your glue code with access protocol!
  22. 25.

    Glue Code: Being Smarter • take data nodes & relations declarations • declare what part of the data graph we want to fetch • make data nodes traversal smart enough to: • fetch only those relations we mentioned • include data fetch spec into subqueries
  23. 26.

    Glue Code: Being Smarter

        (defrecord ChatMessage [id]
          DataSource
          (fetch [_]
            (d/chain' (fetch-message {:message-id id})
                      (fn [{:keys [sender-id] :as message}]
                        (assoc message
                               :status (MessageDelivery. id)
                               :sender (User. sender-id)
                               :attachments (MessageAttachments. id))))))
  24. 27.

    Glue Code: Being Smarter

        (muse/run!! (pull (ChatMessage. "9V5x8slpS")))
        ;; ... everything!

        (muse/run!! (pull (ChatMessage. "9V5x8slpS") [:text]))
        ;; {:text "Hello there!"}

        (muse/run!! (pull (ChatMessage. "9V5x8slpS") [:text {:sender [:firstName]}]))
        ;; {:text "Hello there!"
        ;;  :sender {:firstName "Shannon"}}
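
To make the "fetch only those relations we mentioned" idea concrete, here is a minimal Python sketch of a pull-style traversal. It is not Muse's implementation: the in-memory `USERS` and `MESSAGES` tables, the `email` value, and all function names are assumptions made for illustration. Relations are stored as thunks, so a relation that the spec does not name is never fetched.

```python
# Hypothetical in-memory "services"; FETCH_LOG records which nodes were actually fetched.
FETCH_LOG = []

USERS = {"u1": {"firstName": "Shannon", "email": "shannon@example.com"}}
MESSAGES = {"9V5x8slpS": {"text": "Hello there!", "sender_id": "u1"}}

def fetch_user(user_id):
    FETCH_LOG.append(("user", user_id))
    return dict(USERS[user_id])

def fetch_message(message_id):
    FETCH_LOG.append(("message", message_id))
    raw = MESSAGES[message_id]
    # The relation is a thunk: it is only resolved when the spec asks for it.
    return {"text": raw["text"], "sender": lambda: fetch_user(raw["sender_id"])}

def pull(node, spec):
    """Select the keys named in spec; a {key: subspec} entry descends into a relation."""
    result = {}
    for item in spec:
        if isinstance(item, dict):
            for key, subspec in item.items():
                result[key] = pull(node[key](), subspec)
        else:
            value = node[item]
            result[item] = value() if callable(value) else value
    return result
```

Calling `pull(fetch_message("9V5x8slpS"), ["text"])` touches only the message, while adding `{"sender": ["firstName"]}` to the spec triggers exactly one extra user fetch.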
  25. 28.
  26. 29.

    Glue Code: Being Smarter • no requirements for the downstream • still pretty powerful • even though it doesn't cover 100% of use cases • now we have query analyzer, query planner and query executor • I think we saw this before...
  27. 30.

    Glue Code: A Few Notes • things we don't have a perfect solution for (yet?)... • foreign keys are now managed manually • read-level transaction guarantees are not "given" • you have to expose them as a part of your API • at least through documentation
  28. 31.

    Glue Code: Are We Good? • messages.fetchMessages • messages.fetchMessagesWithSender • messages.fetchMessagesWithoutSender • messages.fetchWithSenderAndDeliveryStatus • ☹ • did someone say "GraphQL"?
  29. 33.

    Protocol: GraphQL • typical response nowadays • the truth: it doesn't solve the problem • it just shapes it in another form • GraphQL vs REST is an unfair comparison • GraphQL vs SQL is a fair one (no kidding!)
  30. 34.

    Protocol: GraphQL

        {
          messages(sentBy: $userId, status: "unread", latest: 20) {
            id
            text
            createdAt
            sender {
              email
              firstName
              lastName
              photo { thumbUrl }
            }
          }
        }
  31. 35.

    Protocol: SQL

        SELECT m.id, m.text, m.created_at,
               u.email, u.first_name, u.last_name,
               u.photo->>'thumb_url' AS photo_url
        FROM messages AS m
        JOIN users AS u ON m.sender_id = u.id
        WHERE m.status = 'unread' AND m.sent_by = :user_id
        LIMIT 20
  32. 36.

    Protocol: GraphQL, SQL • implicit (GraphQL) VS explicit (SQL) JOINs • hidden (GraphQL) VS exposed (SQL) underlying data structure • predefined filters (GraphQL) VS flexible select rules (SQL)
  33. 37.

    Protocol: GraphQL, SQL • no silver bullet! • GraphQL looks nicer for nested data • SQL works better for SELECT ... WHERE ... • and ORDER BY, and LIMIT etc • revealing how the data is structured is not all bad • ... gives you predictability on performance
  34. 38.

    Protocol: What About SQL? • you can use SQL as a client-facing protocol • seriously • even if you're not a database • why? • widely known • a lot of tools to leverage
  35. 39.

    Protocol: How to SQL? • Apache Calcite: define SQL engine • Apache Avatica: run SQL server • documentation is not perfect, look into examples • impressive list of adopters • do not trust "no sql" movement • use whatever works for you
  36. 40.

    Protocol: How to SQL? • working on a library on top of Calcite • hope it will be released next month • to turn your service into a "table" • so you can easily run SQL proxy to fetch your data • hardest part: • how to convey what part of SQL is supported
  37. 41.

    Protocol: More Protocols! • a lot of interesting examples for inspiration • e.g. Datomic datalog queries • e.g. SPARQL (with data distribution in place) • ... and more!
  38. 43.

    Versioning • can I change this field "slightly"? • this field is outdated, can I remove it? • someone broke our API calls, I can't figure out who!
  39. 44.

    Versioning • sounds familiar, huh? • API versioning * data versioning • ... * # of your teams • that's a lot!
  40. 45.

    Versioning • first step: describe everything • API calls • IO reads/writes... to files/cache/db • second step: collect all declarations to a single place • no need to reinvent, git repo is a good start
  41. 46.

    Versioning • kinda obvious, but hard to enforce organizationally • you don't need a "perfect solution™" • just start from something & evolve as it goes
  42. 48.

    Versioning: Refine Your Types! • most of the time we use primitives: String, Float etc • ... and collections: Maps, Arrays, (very rarely) Sets • that's not enough! • came from memory management • doesn't work for bigger systems
  43. 49.

    Versioning: Refine Your Types! • you should be as precise as you can! • type theory to the rescue • refined types in Haskell, Scala, Clojure • basic type + a predicate
  44. 50.

    Versioning: Refine Your Types!

        (def LatCoord (r/refined double (r/OpenClosedInterval -90.0 90.0)))
        (def LngCoord (r/OpenClosedIntervalOf double -180.0 180.0))
        (def GeoPoint {:lat LatCoord :lng LngCoord})
        (def Route (r/BoundedListOf GeoPoint 2 50))
        ;; Route again, this time via r/refined
        (def Route (r/refined [GeoPoint] (BoundedSize 2 50)))
        (def RouteFromZurich (r/refined Route (r/First InZurich)))
  45. 51.

    Versioning: Refine Your Types! • precise types for all IO operations • runtime check is a decent start • serialize type definitions to file • make sure that's possible when picking a library • you can also auto-convert storage metadata • char (30) → (r/BoundedSizeStr 0 30)
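
As a rough illustration of "basic type + a predicate" with a runtime check, here is a small Python sketch. The `Refined` class and `check_geo_point` are hypothetical names invented for this example and are not the API of any refined-types library; the open-closed bounds mirror the coordinate definitions on the earlier slide.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Refined:
    """A base type plus a predicate (illustrative, not a library API)."""
    name: str
    base: type
    predicate: Callable[[Any], bool]

    def check(self, value):
        # First the plain type check, then the refinement.
        if not isinstance(value, self.base):
            raise TypeError(
                f"{self.name}: expected {self.base.__name__}, got {type(value).__name__}"
            )
        if not self.predicate(value):
            raise ValueError(f"{self.name}: {value!r} fails the predicate")
        return value

# Open-closed intervals: the lower bound is excluded, the upper bound included.
LatCoord = Refined("LatCoord", float, lambda x: -90.0 < x <= 90.0)
LngCoord = Refined("LngCoord", float, lambda x: -180.0 < x <= 180.0)

def check_geo_point(point):
    """Validate a {'lat': ..., 'lng': ...} mapping at an IO boundary."""
    return {"lat": LatCoord.check(point["lat"]), "lng": LngCoord.check(point["lng"])}
```

A check like this would typically sit right where data enters or leaves a service, so a latitude of `123.0` is rejected at the boundary instead of propagating further.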
  46. 52.

    Versioning: Type Twice • never rely on a single point of view • each request/response should be declared twice • by the service and the caller • each data format (e.g. DB table) • by storage & by the reader • ... all readers
  47. 53.

    Versioning: Type Twice • data "owner": strongest guarantees possible • reader/user: relaxed to what's (truly) necessary
  48. 54.

    Versioning: Type Twice

        (def EmailFromStorage
          (refined NonEmptyStr (BoundedSize _ 64) valid-email-re))

        ;; simply show on the screen?
        (def Reader1 (refined NonEmptyStr (BoundedSize _ 64)))

        ;; I will truncate anyways :)
        (def Reader2 NonEmptyStr)

        ;; I need to show "email me" button :(
        (def Reader3 (refined NonEmptyStr valid-email-re))
  49. 55.

    Versioning: Type Twice • playing with predicates you're changing the scope • scopes might intersect or be independent
  50. 56.
  51. 57.
  52. 58.

    Versioning: Type Twice • most protocols support backward and forward compatibility • Protobuf, Thrift, FlatBuffers & others • rules are kinda implicit • defined by protocol & libraries • that's not enough!
  53. 59.

    Versioning: Type Twice • having all readers' & owners' types in a repo... • anytime you change your types you know who's affected • writer guarantees >= reader expects • that's why you need "double definitions" • make it part of your CI cycle!
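
One very simple way to picture the "writer guarantees >= reader expects" check is to treat each type as a set of named predicates and compare sets. The sketch below does that in Python; the predicate names and `affected_readers` are made up for illustration, and the comparison is purely syntactic (it does not know that a 64-character bound implies a 128-character one).

```python
# The owner's declaration: the strongest guarantees about the stored email.
OWNER_GUARANTEES = {"non-empty", "max-size-64", "valid-email"}

# Each reader declares only what it truly needs.
READERS = {
    "Reader1": {"non-empty", "max-size-64"},   # simply shows it on the screen
    "Reader2": {"non-empty"},                  # truncates anyway
    "Reader3": {"non-empty", "valid-email"},   # needs an "email me" button
}

def affected_readers(owner_guarantees, readers):
    """Readers that expect something the owner does not (or no longer) guarantee."""
    return sorted(
        name for name, expects in readers.items()
        if not expects <= owner_guarantees  # subset test: expects must be covered
    )
```

Run as a CI step over the repo of declarations, an empty result means the change is safe; dropping `valid-email` from the owner's guarantees would report `Reader3` as affected.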
  54. 60.

    Versioning: Refinements • no theoretical generic solution (yet?) • you can cover a lot of use cases "manually" • "if-else" driven type checker • provide "manual" proof in case of ambiguity • at least you have git blame now • advanced: run QuickCheck to double test that
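
The "run QuickCheck to double test that" idea can be approximated without any library: generate random values and look for one that the stronger (owner) type accepts but the weaker (reader) type rejects. The Python sketch below uses only the standard library; the predicates, the deliberately simplistic email regex, and the generator are all assumptions for illustration, and a real setup would more likely use a property-testing library.

```python
import random
import re
import string

# Deliberately simplistic email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def owner_accepts(s):
    return 0 < len(s) <= 64 and EMAIL_RE.match(s) is not None

def reader_accepts(s):
    return 0 < len(s) <= 64

def random_candidate(rng):
    alphabet = string.ascii_lowercase + "@. "
    raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 80)))
    # Half of the time produce something email-shaped so the owner predicate can fire.
    return raw if rng.random() < 0.5 else f"{raw[:10] or 'a'}@example.com"

def find_counterexample(stronger, weaker, trials=2000, seed=0):
    """Return a value accepted by `stronger` but rejected by `weaker`, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = random_candidate(rng)
        if stronger(value) and not weaker(value):
            return value
    return None
```

Finding no counterexample is evidence, not proof, that the owner's type implies the reader's; swapping the arguments quickly produces a string the reader would accept but the owner never guarantees.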
  55. 62.

    Summary • JOINs: we did a lot, we still have room to be smarter • protocol: choose wisely, don't be shy • versioning: type your data (twice), keep types organized