
Epistemology (and disks)

Alonisser
September 14, 2021



Transcript

  1. Castles of the mind
    “The programmer, like the poet, works only slightly removed from pure
    thought-stuff. He builds his castles in the air, from air, creating by exertion
    of the imagination.”
    Frederick P. Brooks Jr., The Mythical Man-Month:
    Essays on Software Engineering


  2. Epistemology
    And Disks


  3. Hi, I’m Alon Nisser
    and I work in
    zencity.io
    I like open source and exploring new tech.
    Distrust software
    @alonisser on Twitter, Medium, GitHub, Gmail
    We’re helping data-driven decision making
    for local governments of all sizes and
    shapes.
    Powered by AI, Zencity transforms data
    from all of the touchpoints residents have
    with their city into actionable insights.
    Passionate about local government? Want
    to build new systems in a growing startup?
    Come work with us.
    We’re hiring across the board.


  4. Epistemology
    From Greek: epistēmē 'knowledge'
    The branch of philosophy concerned with knowledge.
    Epistemologists study the nature, origin, and scope of knowledge,
    epistemic justification, the rationality of belief, and various related
    issues.


  5. Epistemology
    From Greek: epistēmē 'knowledge'
    Asks questions like:
    What is knowledge?
    How can a belief be justified?
    How do we know that something is true?


  6. So naturally..


  7. Let’s take a look at the
    data lake


  8. First came the costs


  9. Now that’s real big data
    150 TiB!!
    $6,000 a month just for storage
    But that doesn’t seem to be our data
    size


  10. Then I took a look in the Databricks UI
    Can you spot a
    problem here?


  11. Should I believe the “table”
    UI showing my mere 100 GB
    of data? Or the Azure portal
    UI showing a storage
    account with 150 TiB?
    Epistemology: The branch of philosophy concerned with
    knowledge. Epistemologists study the nature, origin, and
    scope of knowledge, epistemic justification, the rationality of
    belief, and various related issues.
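
    To put the gap between the two readings in numbers, here is a rough
    back-of-the-envelope sketch. It assumes "TiB" is a binary unit, "GB" is a
    decimal unit, and that the storage bill scales linearly with size; none of
    these assumptions come from the slides.

```python
# Figures quoted on the slides; the unit handling is an assumption.
TIB = 2**40   # bytes in one tebibyte (binary unit)
GB = 10**9    # bytes in one gigabyte (decimal unit)

storage_account_bytes = 150 * TIB   # what the Azure portal reported
table_ui_bytes = 100 * GB           # what the table UI reported
monthly_cost_usd = 6000             # storage bill from the slides

# How many times larger is the storage account than the table?
ratio = storage_account_bytes / table_ui_bytes

# Assumption: cost is proportional to bytes stored.
cost_if_only_table_data = monthly_cost_usd / ratio

print(f"storage account is about {ratio:,.0f}x the table size")
print(f"the table data alone would cost about ${cost_if_only_table_data:.2f} a month")
```

    The two numbers differ by three orders of magnitude, which is too large to
    be a rounding or unit mistake.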


  12. Let’s talk about search
    (Azure search)


  13. Tragedy unfolding in the Slack channel


  14. Tragedy was unfolding in the Slack channel
    Obviously we need to remove data. We tried to remove quite a lot of data (in unused fields and inactive clients), but
    data usage is not going down.
    We’ve reached out to Azure for help.
    They are as clueless as we are.
    Resizing the cluster has some effect, but far from what we need. What is going on?


  15. What is happening here?
    What is reality?
    How can I trust my senses?


  16. How can two very different
    observations of “reality”
    be truthful at the same
    time?


  17. Achilles and the tortoise
    Those aren’t new questions.


  18. Epistemology
    While there is a mathematical “solution” to this problem (at least for converging series, given
    that Achilles and the tortoise aren’t moving at speeds that produce a non-converging series), the main
    point here is the duality of time/space (or of numbers), which can be seen as an infinite series
    divisible into ever smaller, measurable units but is also a continuum.
    So reality is not only different from what we perceive, but different representations of
    “things” can also be “true” at the same time. (This doesn’t mean there is no truth, or that all
    representations are truthful.)
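
    The converging-series “solution” can be sketched in a few lines. This is
    only an illustration: the function names, speeds, and head start are made
    up for the example, and constant speeds are assumed.

```python
def catch_up_time(head_start, achilles_speed, tortoise_speed):
    """Continuum view: the single moment Achilles draws level with the tortoise."""
    if achilles_speed <= tortoise_speed:
        raise ValueError("the series does not converge: Achilles is not faster")
    return head_start / (achilles_speed - tortoise_speed)


def zeno_partial_sum(head_start, achilles_speed, tortoise_speed, steps):
    """Series view: total time of the first `steps` stages of Zeno's subdivision.

    In each stage Achilles runs to where the tortoise was when the stage began.
    """
    total = 0.0
    gap = head_start
    for _ in range(steps):
        stage_time = gap / achilles_speed
        total += stage_time
        # While Achilles closed the old gap, the tortoise opened a smaller one.
        gap = tortoise_speed * stage_time
    return total


# Hypothetical numbers: 100 m head start, Achilles 10 m/s, tortoise 1 m/s.
limit = catch_up_time(100, 10, 1)
for steps in (1, 2, 5, 20):
    print(steps, zeno_partial_sum(100, 10, 1, steps), "->", limit)
```

    The partial sums never reach the limit in a finite number of stages, yet
    they approach the same moment the closed-form expression gives: two
    descriptions of one event.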


  19. Let’s talk about the data
    lake


  20. Both measurements are true


  21. We are actually measuring different things


  22. What is a delta
    table
    ● A delta table is an abstraction allowing
    us to query a bunch of parquet files with
    Spark, using SQL-like syntax, while
    maintaining “ACID”-like guarantees
    ● But the _delta_log files track changes,
    keep versions for possible rollback, and
    reference lots of files that aren’t in use
    anymore: results of OPTIMIZE
    commands, awaiting a VACUUM
    command to remove them.
    ● Alas, no VACUUM command ever came


  23. What is a delta
    table
    ● The table UI measured the bytes of
    data in the abstraction: the data
    in a “table” that can be queried by
    Spark SQL
    ● While the storage account
    measured actual bytes on disk
    Both are bytes. But different.
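
    A toy model makes the two measurements concrete. This is not Delta's real
    log format or API: `ToyTable` and its methods are hypothetical names, and
    the model ignores details such as retention periods. It only shows how
    "bytes in the current version" and "bytes on disk" drift apart when
    compaction leaves old files behind.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy commit log over immutable data files (an illustration only)."""
    files_on_disk: dict = field(default_factory=dict)  # path -> size in bytes
    active: set = field(default_factory=set)           # paths the latest version uses

    def write(self, path, size):
        """Commit a new data file to the table."""
        self.files_on_disk[path] = size
        self.active.add(path)

    def compact(self, paths, new_path):
        """Rewrite several files as one; the old files stay on disk for rollback."""
        size = sum(self.files_on_disk[p] for p in paths)
        self.active -= set(paths)
        self.write(new_path, size)

    def logical_bytes(self):
        """What a table UI would report: bytes in the current version."""
        return sum(self.files_on_disk[p] for p in self.active)

    def physical_bytes(self):
        """What the storage account would report: every byte still on disk."""
        return sum(self.files_on_disk.values())

    def vacuum(self):
        """Delete files the current version no longer references."""
        self.files_on_disk = {
            p: s for p, s in self.files_on_disk.items() if p in self.active
        }


table = ToyTable()
for i in range(4):
    table.write(f"part-{i}.parquet", 100)
table.compact([f"part-{i}.parquet" for i in range(4)], "compacted-0.parquet")

before = (table.logical_bytes(), table.physical_bytes())
table.vacuum()
after = (table.logical_bytes(), table.physical_bytes())
print("before vacuum:", before, "after vacuum:", after)
```

    In this sketch a single compaction doubles the bytes on disk while the
    table size stays the same; only the vacuum step brings the two back
    together.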


  24. Let’s talk about search


  25. The plot thickens (And an explanation)


  26. Let’s talk about Mongo


  27. Running Hosted Mongo in the clouds
    ● It’s managed by someone else! It’s running in our datacenter, so latency is in the low milliseconds
    ● It’s fast, and scalable
    ● … until the number of writes/reads jumps over a certain threshold and suddenly it’s
    SLOWWWWW.


  28. Running Hosted Mongo in the clouds
    ● Underneath this great abstraction, data is still being read from disks. And while SSDs are
    fast, our cloud provider sets a limit on provisioned IOPS (basically disk operations per second). As
    our usage increased, we hit the limit, and read/write IO was then throttled by
    the cloud provider
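
    Why a fixed IOPS cap turns "fast" into "slow" so abruptly can be seen in a
    minimal simulation. The function name, the workload, and the limit of 500
    are all hypothetical; real providers throttle with more elaborate schemes
    (bursting, credits), which this sketch ignores.

```python
def simulate_throttling(demand_per_second, iops_limit):
    """Return the backlog of pending operations after each one-second tick.

    demand_per_second: operations requested during each second.
    iops_limit: maximum operations the disk is allowed to serve per second.
    """
    backlog = 0
    history = []
    for demand in demand_per_second:
        pending = backlog + demand
        served = min(pending, iops_limit)  # the cap applies no matter how fast the SSD is
        backlog = pending - served         # unserved operations wait for the next tick
        history.append(backlog)
    return history


# Hypothetical workload: comfortably under the cap, then a burst above it.
demand = [300, 400, 700, 900, 900, 200]
backlog = simulate_throttling(demand, iops_limit=500)
print(backlog)
```

    Below the cap the backlog stays at zero and nothing looks wrong. Once
    demand crosses it, waiting operations pile up and keep piling up, and the
    queue outlives the burst that caused it.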


  29. Running Hosted Mongo in the clouds
    ● And just to make things a bit more confusing, the cloud provider decides the
    “allowed” IOPS level on premium SSD disks according to the disk size...
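
    The practical consequence is that a disk may have to be sized for IOPS
    rather than for data. The sketch below uses made-up tiers (they are not
    any provider's real numbers) and a hypothetical helper name, purely to
    show the shape of the decision.

```python
# Hypothetical tiers: (disk size in GiB, IOPS allowed at that size).
# These numbers are invented for illustration.
HYPOTHETICAL_TIERS = [(100, 500), (500, 2000), (1000, 5000)]


def smallest_disk_for(required_iops, required_gib, tiers=HYPOTHETICAL_TIERS):
    """Pick the smallest tier that satisfies both the space and the IOPS need."""
    for size_gib, iops in sorted(tiers):
        if size_gib >= required_gib and iops >= required_iops:
            return size_gib, iops
    raise ValueError("no tier satisfies the requirement")


# 80 GiB of data that needs 1,800 IOPS cannot live on the 100 GiB disk.
choice = smallest_disk_for(required_iops=1800, required_gib=80)
print(choice)
```

    With these invented tiers, 80 GiB of data ends up on a 500 GiB disk: the
    size is bought for the IOPS allowance, not for the space.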


  30. Running Hosted Mongo in the clouds
    Both of those represent the same thing, but in different layers.


  31. Castles of the mind and castles of dirt and
    flesh
    While our technological achievements often obscure the dirt underneath, the laws of physics
    still apply to our great castles of the mind. And underneath lies the messy reality of silicon,
    disks spinning and seeking the correct sector to read from, and messy data transmitted
    over noisy networks.
    Both
    representations
    are true


  32. Epistemology and Programming -
    Wrapping up
    ● Different realities might exist at different levels of abstraction - accept that
    ● Try to build a mental model of the technology you use. (Mental model != know every knob
    and detail)
    ● Don’t be afraid to peek underneath.
    ● The truth is out there (at least multiple, eventually consistent versions of it ¯\_(ツ)_/¯)
    ● Uncovering it might require us to venture out of our comfort zone


  33. Questions?


  34. Read/watch
    ● https://www.slideshare.net/holograph/how-shit-works-storage (and the rest of the “how
    things work” YouTube videos by Tomer Gabel)
    ● Explaining the Zeno paradox to a child (in Slate)
    ● Numbers every programmer should know


  35. Thanks for listening!
