Slide 1

Slide 1 text

Alice and travelling back in time

Slide 2

Slide 2 text

Roksolana Diachuk • Big Data Developer at Captify • Diversity & Inclusion ambassador at Captify • Women Who Code Kyiv Data Engineering Lead • Speaker

Slide 3

Slide 3 text

In previous talks…

Slide 4

Slide 4 text

Functional forest

Slide 5

Slide 5 text

magic-db-cluster-0

Slide 6

Slide 6 text

2 years later

Slide 7

Slide 7 text

magic-db-cluster-0
Long time no see!

Slide 8

Slide 8 text

NAME          READY  STATUS   AGE
launcher-crd  1/1    Running  33s

magic-db-cluster-0

Slide 9

Slide 9 text

3 months passed

Slide 10

Slide 10 text

{ "predictions": [ [ "B-natural phenomenon", "O", "B-geographical entity", "B-time indicator" ] ] }

Slide 11

Slide 11 text

{ "instances": ["Pods and higher-order functions are in danger"] }
Test

Slide 12

Slide 12 text

5 days passed

Slide 13

Slide 13 text

“Oh, don’t be rude, Alice! I will only entertain you a bit with my riddle”
“Like it explains anything”

Slide 14

Slide 14 text

What was lost can be found only in the anomalies of this world

Slide 15

Slide 15 text

ML model Source data Anomalies detected

Slide 16

Slide 16 text

01010100 01101000 01100101 00100000 01110000 01100001 01110011 …

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

The past is the key to the future

Slide 19

Slide 19 text

1 day later

Slide 20

Slide 20 text

Alice kept thinking of the note she got and the journey which led her to it. Was someone helping her? And what did this note actually mean?

Slide 21

Slide 21 text

“I need to check magic-db”

Slide 22

Slide 22 text

alice% _

Slide 23

Slide 23 text

alice% _ You’re now connected to the magic-db

Slide 24

Slide 24 text

magicdb
  event_date = 20140608
    event_hour = 1402185600
    event_hour = 1402196720
    …
  event_date = 20140609
  …
  event_date = 20211031

Slide 25

Slide 25 text

“Wait, there are some tools that allow looking into data versions”

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Cloud storage
Delta Lake storage layer
Apache Spark services

Slide 28

Slide 28 text

Data files Metadata Storage layer

Slide 29

Slide 29 text

Ingestion tables Refined tables Agg data store

Slide 30

Slide 30 text

ACID Atomicity Consistency Isolation Durability

Slide 31

Slide 31 text

Transaction issues in Spark
Failure during write -> data loss
Data update -> data is not consistent
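The failed-write problem on this slide can be illustrated with a toy model, assuming a table whose readers trust only the files listed in a commit log. `ToyTable`, `Commit`, and `failBeforeCommit` are hypothetical names made up for this illustration; the sketch shows the general idea of an atomic commit and is not Delta Lake's actual implementation.

```scala
// Toy model only: hypothetical names, not Delta Lake's implementation.
final case class Commit(version: Long, addedFiles: Set[String])

final class ToyTable {
  private var storedFiles = Set.empty[String] // files physically present in storage
  private var log = Vector.empty[Commit]      // committed versions, oldest first

  // Data files are written first; the log entry is appended last.
  // `failBeforeCommit` simulates a crash between those two steps.
  def write(files: Set[String], failBeforeCommit: Boolean = false): Unit = {
    storedFiles = storedFiles ++ files
    if (!failBeforeCommit) {
      log = log :+ Commit(log.size.toLong, files)
    }
  }

  // Everything in storage, including leftovers from failed writes.
  def allStoredFiles: Set[String] = storedFiles

  // What a reader sees: only files referenced by committed log entries.
  def visibleFiles: Set[String] = log.flatMap(_.addedFiles).toSet
}

object ToyTableDemo {
  // Returns (files in storage, files visible to readers) after one good and one failed write.
  def run(): (Set[String], Set[String]) = {
    val table = new ToyTable
    table.write(Set("part-0"))
    table.write(Set("part-1"), failBeforeCommit = true) // crash mid-write
    (table.allStoredFiles, table.visibleFiles)
  }
}
```

In this model the half-written `part-1` stays in storage but is never exposed to readers, because no log entry references it.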

Slide 32

Slide 32 text

ACID transactions
Schema enforcement
Metadata handling
Upserts and deletes
Time travel

Slide 33

Slide 33 text

Time travel “That’s exactly what I need”

Slide 34

Slide 34 text

magicdb
  event_date = 20140608
    event_hour = 1402185600
    event_hour = 1402196720
    …
  event_date = 20140609
  …
  _metadata

Slide 35

Slide 35 text

“Hm, there are already metadata files available”

Slide 36

Slide 36 text

magicdb
  …
  _metadata
    _last_checkpoint
    00000000000000005536
    00000000000000005537
    00000000000000005538
    00000000000000005539
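The long numbers on this slide look like table versions. In a real Delta table the log directory is named `_delta_log`, and each commit is a JSON file whose name is the version zero-padded to 20 digits; the slide's `_metadata` directory and extension-less names are presumably a simplification for the story (an assumption). A minimal sketch of that naming rule, with `LogFileNames` and its methods being hypothetical helpers rather than part of the Delta API:

```scala
import scala.util.Try

// Hypothetical helper for illustration; not part of the Delta Lake API.
object LogFileNames {
  // Commit file name for a version: the version zero-padded to 20 digits, plus ".json".
  def commitFileName(version: Long): String = {
    require(version >= 0, "version must be non-negative")
    f"$version%020d.json"
  }

  // Parses a commit file name back into its version; None if the name does not match.
  def versionOf(fileName: String): Option[Long] = {
    val stem = fileName.stripSuffix(".json")
    if (fileName.endsWith(".json") && stem.length == 20 && stem.forall(_.isDigit))
      Try(stem.toLong).toOption
    else
      None
  }
}
```

For example, `LogFileNames.commitFileName(5537L)` gives `00000000000000005537.json`, which matches the shape of the entries shown above.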

Slide 37

Slide 37 text

Delta Lake setup

// build.sbt
libraryDependencies += "io.delta" %% "delta-core" % "1.1.0"

// Application code
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

Slide 38

Slide 38 text

Timestamp   Entity_name       Entity_type  Belongs to
2014-06-07  factory-worker-0  Pod          StatefulSet
2014-06-07  factory-worker-1  Pod          StatefulSet
2014-06-07  factory-worker-2  Pod          StatefulSet
…

spark.read.format("delta").load("/magic-db")

Slide 39

Slide 39 text

“It’s possible to look directly into the history of the data!”

Slide 40

Slide 40 text

Version  Timestamp   User       Operation  Comment
1        2014-06-07  Architect  CREATE     Default namespace city
2        2014-06-08  Architect  CREATE     Factory
…

import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/magic-db")
deltaTable.history()
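To try `history()` without access to `/magic-db`, the sketch below builds a throwaway table with two commits and lists its versions. It assumes Spark and `delta-core` are on the classpath; `HistorySketch`, the temporary path, and the sample rows are made up for illustration. The real history output uses columns such as `version`, `timestamp`, `userName`, and `operation`, so the User and Comment columns in the table above are best read as a stylized view.

```scala
import java.nio.file.Files
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative sketch; object name, path, and rows are hypothetical.
object HistorySketch {
  // Creates a table with two commits and returns (version, operation) pairs, newest first.
  def run(): Seq[(Long, String)] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("history-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    val path = Files.createTempDirectory("magic-db-history").toString
    val columns = Seq("timestamp", "entity_name", "entity_type")

    // Commit 0: create the table.
    Seq(("2014-06-07", "factory-worker-0", "Pod")).toDF(columns: _*)
      .write.format("delta").mode(SaveMode.Overwrite).save(path)
    // Commit 1: append one more row.
    Seq(("2014-06-08", "factory-worker-1", "Pod")).toDF(columns: _*)
      .write.format("delta").mode(SaveMode.Append).save(path)

    // history() returns one row per commit, most recent first.
    val result = DeltaTable.forPath(spark, path)
      .history()
      .select("version", "operation")
      .as[(Long, String)]
      .collect()
      .toSeq

    spark.stop()
    result
  }
}
```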

Slide 41

Slide 41 text

0. World creation

Slide 42

Slide 42 text

1. Default namespace city creation

Slide 43

Slide 43 text

2. Factory creation

Slide 44

Slide 44 text

3. Forest discovery

Slide 45

Slide 45 text

4. The first human’s visit

Slide 46

Slide 46 text

5. Pod lost

NAME                    READY  STATUS   AGE
pod/magic-db-cluster-0  1/1    Error    15m
pod/magic-db-cluster-1  1/1    Running  15m
pod/magic-db-cluster-2  1/1    Running  15m

Slide 47

Slide 47 text

spark.read
  .format("delta")
  .option("versionAsOf", "4")
  .load("/magic-db")
  .filter($"entity_type" === "Pod")
  .count

Result: 10400

Slide 48

Slide 48 text

spark.read
  .format("delta")
  .option("versionAsOf", "5")
  .load("/magic-db")
  .filter($"entity_type" === "Pod")
  .count

Result: 10399
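Counting pods at versions 4 and 5 shows that one is missing, but not which one. One possible way to find it is to read both versions and subtract one from the other. The sketch below does this on a throwaway table, assuming Spark and `delta-core` are on the classpath; `VersionDiffSketch`, the temporary path, and the three sample rows are hypothetical and stand in for the real `/magic-db` data.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Illustrative sketch; object name, path, and rows are hypothetical.
object VersionDiffSketch {
  // Returns the names of pods present at version 0 but missing at version 1.
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("version-diff-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    val path = Files.createTempDirectory("magic-db-diff").toString
    val columns = Seq("timestamp", "entity_name", "entity_type")

    // Version 0: three pods.
    Seq(
      ("2021-10-31", "magic-db-cluster-0", "Pod"),
      ("2021-10-31", "magic-db-cluster-1", "Pod"),
      ("2021-10-31", "magic-db-cluster-2", "Pod")
    ).toDF(columns: _*).write.format("delta").mode(SaveMode.Overwrite).save(path)

    // Version 1: the table is overwritten without magic-db-cluster-0.
    Seq(
      ("2021-10-31", "magic-db-cluster-1", "Pod"),
      ("2021-10-31", "magic-db-cluster-2", "Pod")
    ).toDF(columns: _*).write.format("delta").mode(SaveMode.Overwrite).save(path)

    // Time travel: load the table as it was at a given version.
    def podsAt(version: Long): DataFrame =
      spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
        .filter($"entity_type" === "Pod")

    // Rows in the older version that no longer exist in the newer one.
    val lost = podsAt(0L).except(podsAt(1L))
      .select("entity_name")
      .as[String]
      .collect()
      .toSeq

    spark.stop()
    lost
  }
}
```

The same pattern should carry over to `/magic-db` by swapping in its path and the versions 4 and 5.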

Slide 49

Slide 49 text

6. Anomalies PodLost PodLost PodLost PodLost PodLost PodLost

Slide 50

Slide 50 text

spark.read
  .format("delta")
  .option("versionAsOf", "6")
  .load("/magic-db")
  .filter($"entity_type" === "Pod")
  .count

Result: 6130

Slide 51

Slide 51 text

“What happened to the pods?”

Slide 52

Slide 52 text

7. Second human’s visit

Slide 53

Slide 53 text

8. Factory failure

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

“Although there was this mysterious visitor…”

Slide 56

Slide 56 text

Alice was sitting at the cafe. Even her favourite cheesecake could not console her. She knew she had to do something to save the pods and higher-order functions.

Slide 57

Slide 57 text

“I can just write them back into the database”

Slide 58

Slide 58 text

val newPods = Seq(
  ("2021-11-01", "magic-db-0", "Pod", "CRD"),
  ("2021-11-01", "magic-db-1", "Pod", "CRD"),
  ("2021-11-01", "magic-db-2", "Pod", "CRD")
).toDF("timestamp", "entity_name", "entity_type", "belongs_to")

Slide 59

Slide 59 text

// Append the new pods as a single file
newPods.repartition(1)
  .write
  .format("delta")
  .mode(SaveMode.Append)
  .save("/magic-db")

// Read the table back, newest pods first
spark.read.format("delta")
  .load("/magic-db")
  .filter($"entity_type" === "Pod")
  .orderBy($"timestamp".desc)

Slide 60

Slide 60 text

Timestamp   Entity_name  Entity_type  Belongs to
2021-11-01  magic-db-0   Pod          CRD
2021-11-01  magic-db-1   Pod          CRD
2021-11-01  magic-db-2   Pod          CRD
…

Slide 61

Slide 61 text

val deltaTable = DeltaTable.forPath("/magic-db")
deltaTable.vacuum(3)

Slide 62

Slide 62 text

val deltaTable = DeltaTable.forPath("/magic-db")
deltaTable.vacuum(3)

Slide 63

Slide 63 text

Version  Timestamp   User    Operation  Comment
9        2021-11-01  Alice   CREATE     New pods created
10       2021-11-01  System  DELETE     All pods lost

val deltaTable = DeltaTable.forPath(spark, "/magic-db")
deltaTable.history()

Slide 64

Slide 64 text

10. All pods lost

Slide 65

Slide 65 text

spark.read
  .format("delta")
  .option("versionAsOf", "10")
  .load("/magic-db")
  .filter($"entity_type" === "Pod")
  .count

Result: 0

Slide 66

Slide 66 text

“Hm, there was nothing about the functional forest”

Slide 67

Slide 67 text

spark.read
  .format("delta")
  .option("versionAsOf", "4")
  .load("/magic-db")
  .select("entity_type")
  .distinct

Result: Pod, ReplicaSet, Deployment, CRD…

Slide 68

Slide 68 text

“What if the architect could help me?”

Slide 69

Slide 69 text

“I should find a way to help them”

Slide 70

Slide 70 text

“I need to get back to the world of pods and higher-order functions”

Slide 71

Slide 71 text

To be continued…

Slide 72

Slide 72 text

Thank you for your attention

Slide 73

Slide 73 text

My contact info dead_flowers22 roksolana-d roksolanadiachuk roksolanad