
Epistemology (and disks)

Alonisser
September 14, 2021



  1. Castles of the mind

    “The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination.” Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering
  2. Epistemology And Disks

  3. Hi, I’m Alon Nisser and I work at zencity.io

    I like open source and exploring new tech. Distrust software. @alonisser on Twitter, Medium, GitHub, Gmail. We’re helping data-driven decision making for local governments of all sizes and shapes. Powered by AI, Zencity transforms data from all of the touchpoints residents have with their city into actionable insights. Passionate about local government? Want to build new systems in a growing startup? Come work with us. We’re hiring across the board.
  4. Epistemology

    From Greek: epistēmē 'knowledge'. The branch of philosophy concerned with knowledge. Epistemologists study the nature, origin, and scope of knowledge, epistemic justification, the rationality of belief, and various related issues.
  5. Epistemology

    From Greek: epistēmē 'knowledge'. Asks questions like: What is knowledge? How can a belief be justified? How do we know that something is true?
  6. So naturally..

  7. Let’s take a look at the data lake

  8. First came the costs

  9. Now that’s real big data

    150 TiB!! $6,000 a month just for storage. But that doesn’t seem to be our data size.
  10. Then I took a look in the Databricks UI

    Can you spot a problem here?
  11. Should I believe the “table” UI showing my mere 100GB of data? Or the Azure portal UI showing a storage account with 150TiB?

    Epistemology: the branch of philosophy concerned with knowledge. Epistemologists study the nature, origin, and scope of knowledge, epistemic justification, the rationality of belief, and various related issues.
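To get a feel for how far apart the two numbers are, here is a small back-of-the-envelope sketch. It assumes "GB" on the slide means decimal gigabytes and that the bill scales linearly with size, which is a simplification and not necessarily how the provider prices storage.

```python
# Sizes quoted on the slides; the unit interpretation is an assumption.
TIB = 2**40   # bytes in one tebibyte
GB = 10**9    # bytes in one decimal gigabyte (the slide may have meant GiB)

storage_account_bytes = 150 * TIB   # what the Azure portal reported
table_ui_bytes = 100 * GB           # what the Databricks table UI reported
monthly_cost_usd = 6000             # storage bill quoted on the slides

ratio = storage_account_bytes / table_ui_bytes
cost_per_tib = monthly_cost_usd / 150

# Hypothetical: what the "visible" table data would cost at the same flat rate.
visible_cost_usd = cost_per_tib * table_ui_bytes / TIB

print(f"storage account is ~{ratio:,.0f}x larger than the table UI figure")
print(f"implied price: ${cost_per_tib:.2f} per TiB-month")
print(f"100 GB at that rate: ~${visible_cost_usd:.2f} per month")
```

Under these assumptions the gap is roughly three orders of magnitude, which is why the bill looked so out of place next to the table view.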
  12. Let’s talk about search (Azure search)

  13. Tragedy unfolding in the Slack channel

  14. Tragedy was unfolding in the Slack channel

    Obviously we need to remove data. We try to remove quite a lot of data (in unused fields and inactive clients), but data usage is not going down. We’ve reached out to Azure for help. They are as clueless as we are. Resizing the cluster has some effect, but far from what we need. What is going on?
  15. What is happening here?

    What is reality? How can I trust my senses?
  16. How can two very different observations of “reality” both be truthful at the same time?

  17. Achilles and the tortoise

    Those aren’t new questions.
  18. Epistemology

    While there is a mathematical “solution” to this problem (at least when the series converges, that is, when Achilles and the tortoise aren’t moving at speeds that produce a non-converging series), the main point here is the duality of time/space (or of numbers), which can be seen as an infinite series divisible into measurable yet ever smaller units, but is also a continuum. So not only is reality different from what we perceive, but different representations of “things” can also be “true” at the same time. (This doesn’t mean there is no truth, or that all representations are truthful.)
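The converging-series “solution” can be sketched in a few lines. The speeds and head start below are hypothetical, picked only to make the numbers easy to follow; the function names are mine, not from the talk.

```python
def zeno_partial_sums(head_start, achilles_speed, tortoise_speed, steps):
    """Yield Achilles' position after each Zeno "stage".

    In each stage Achilles runs to where the tortoise was at the start
    of the stage; meanwhile the tortoise moves ahead by a smaller gap.
    """
    ratio = tortoise_speed / achilles_speed  # each gap shrinks by this factor
    position = 0.0
    gap = head_start
    for _ in range(steps):
        position += gap
        gap *= ratio
        yield position


def catch_up_point(head_start, achilles_speed, tortoise_speed):
    """Closed-form limit of the series; only valid when Achilles is faster."""
    if achilles_speed <= tortoise_speed:
        raise ValueError("series does not converge: Achilles never catches up")
    return head_start / (1 - tortoise_speed / achilles_speed)


# Hypothetical numbers: 100 m head start, 10 m/s vs 1 m/s.
stages = list(zeno_partial_sums(100, 10, 1, steps=12))
limit = catch_up_point(100, 10, 1)
print(stages[:4], "...", stages[-1])
print("limit:", limit)
```

Both views describe the same race: the loop sees an endless sequence of ever smaller stages, while the closed form sees a single point on a continuum where Achilles draws level.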
  19. Let’s talk about the data lake

  20. Both measurements are true

  21. We are actually measuring different things

  22. What is a delta table

    • A delta table is an abstraction that lets us query a bunch of Parquet files with Spark, using SQL-like syntax, while maintaining “ACID”-like guarantees.
    • But the delta_log files track changes, keep versions for possible rollback, and reference lots of files that aren’t in use anymore: results of OPTIMIZE commands, awaiting a VACUUM command to remove them.
    • Alas, no VACUUM command ever came.
  23. What is a delta table

    • The table UI measured the bytes of data in the abstraction: the data in a “table” that can be queried by Spark SQL.
    • The storage account measured the actual bytes on disk.

    Both are bytes. But different.
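A toy replay of a transaction log shows how the two measurements drift apart. This is a minimal sketch, not how Databricks or Azure compute their figures: the log entries are hypothetical and heavily simplified, keeping only `add`/`remove` actions with a path and a size.

```python
import json

# Hypothetical, heavily simplified log entries. Real Delta logs store one
# JSON action per line under the table's _delta_log directory and carry
# many more fields than the two used here.
LOG_LINES = [
    '{"add": {"path": "part-0001.parquet", "size": 400}}',
    '{"add": {"path": "part-0002.parquet", "size": 350}}',
    # An OPTIMIZE-style rewrite: two small files compacted into one.
    '{"remove": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0002.parquet"}}',
    '{"add": {"path": "part-0003.parquet", "size": 700}}',
]


def measure(log_lines):
    """Return (bytes the table references now, bytes still on disk)."""
    live = {}      # path -> size for files in the current table version
    on_disk = {}   # path -> size for every file ever written, never vacuumed
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            path, size = action["add"]["path"], action["add"]["size"]
            live[path] = size
            on_disk[path] = size
        elif "remove" in action:
            # The file leaves the table but stays on disk until VACUUM.
            live.pop(action["remove"]["path"], None)
    return sum(live.values()), sum(on_disk.values())


table_bytes, disk_bytes = measure(LOG_LINES)
print("table view:", table_bytes, "bytes")
print("storage view:", disk_bytes, "bytes")
```

Every rewrite leaves the old files behind, so without a VACUUM the storage view keeps growing while the table view stays roughly the same.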
  24. Let’s talk about search

  25. The plot thickens (And an explanation)

  26. Let’s talk about Mongo

  27. Running Hosted Mongo in the clouds

    • It’s managed by someone else! It’s running in our datacenter, so latency is in the low milliseconds.
    • It’s fast and scalable...
    • ...until the number of writes/reads jumps over a certain threshold and suddenly it’s SLOWWWWW.
  28. Running Hosted Mongo in the clouds

    • Underneath this great abstraction, data is still being read from disks. And while SSDs are fast, our cloud provider sets a limit on provisioned IOPS (basically, disk operations per second). As our usage increased, we hit the limit, and then read/write IO was throttled by the cloud provider.
  29. Running Hosted Mongo in the clouds

    • And just to make things a bit more confusing, the cloud provider decides the “allowed” IOPS level on premium SSD disks according to the disk size...
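The throttling effect can be sketched with a tiny simulation. The tier table below is made up for illustration (real providers publish their own size-to-IOPS tables), and the queueing model is deliberately naive: anything above the cap simply waits for the next second.

```python
# Hypothetical tiers: (max disk size in GiB, provisioned IOPS cap).
# Real providers publish their own tables; these numbers are made up.
TIERS = [(128, 500), (512, 2000), (1024, 5000)]


def iops_cap(disk_gib):
    """Return the IOPS cap of the smallest tier that fits the disk."""
    for max_gib, cap in TIERS:
        if disk_gib <= max_gib:
            return cap
    raise ValueError("disk larger than any tier in this sketch")


def simulate(demand_per_second, disk_gib):
    """Serve at most `cap` operations each second; queue the rest."""
    cap = iops_cap(disk_gib)
    backlog = 0
    history = []
    for demand in demand_per_second:
        pending = backlog + demand
        served = min(pending, cap)
        backlog = pending - served
        history.append(backlog)
    return history


# Same workload, two disk sizes: only the smaller disk builds a backlog.
workload = [400, 600, 900, 900, 300]
print("128 GiB disk backlog:", simulate(workload, 128))
print("512 GiB disk backlog:", simulate(workload, 512))
```

In this sketch the database and the workload are identical in both runs; only the disk size differs, yet one run queues up operations and the other does not. That is the confusing part: a bigger disk can mean a faster database even when the extra space is never used.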
  30. Running Hosted Mongo in the clouds

    Both of those represent the same thing, but in different layers.
  31. Castles of the mind and castles of dirt and flesh

    While our technological achievements often obscure the dirt underneath, the laws of physics still apply to our great castles of the mind. And underneath is the messy reality of silicon, disks spinning and seeking the correct sector to read from, and messy data transmitted over noisy networks. Both representations are true.
  32. Epistemology and Programming - Wrapping up

    • Different realities might exist at different levels of abstraction. Accept that.
    • Try to build a mental model of the technology you use. (Mental model != knowing every knob and detail.)
    • Don’t be afraid to peek underneath.
    • The truth is out there (at least multiple, eventually consistent versions of it ¯\_(ツ)_/¯).
    • Uncovering it might require us to venture out of our comfort zone.
  33. Questions?

  34. Read/watch

    • https://www.slideshare.net/holograph/how-shit-works-storage (and the rest of the “how things work” YouTube videos by Tomer Gabel)
    • Explaining the Zeno paradox to a child (in Slate)
    • Numbers every programmer should know
  35. Thanks for listening!