Slide 1

Slide 1 text

Fuel Prices Exposed Leveraging MongoDB’s Aggregation Pipeline

Slide 2

Slide 2 text

Fuel Prices Exposed Leveraging MongoDB’s Aggregation Pipeline

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Markttransparenzstelle für Kraftstoffe

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

A Look At Data

Slide 8

Slide 8 text

Federated Database Skip the import

Slide 9

Slide 9 text

Station Data What are we working with? { "_id": "0e18d0d3-ed38-4e7f-a18e-507a78ad901d", "brand": "OIL!", "city": "München", "first_active": "1970-01-01 01:00:00+01", "house_number": "", "latitude": "48.1807", "longitude": "11.4609", "name": "OIL! Tankstelle München", "openingtimes_json": "...", "post_code": "80999", "street": "Eversbuschstraße 33" }

Slide 10

Slide 10 text

Price Reports All prices { "_id": ..., "date": "2022-01-01 00:01:08+01", "station_uuid": "227a99f5-0be5-49f1-8f58-645ab5da32a8", "diesel": "1.609", "e5": "1.769", "e10": "0.000", "dieselchange": "1", "e5change": "1", "e10change": "0" }

Slide 11

Slide 11 text

Every engineer, ever “I guess I’ll write some code”

Slide 12

Slide 12 text

Aggregation Pipeline

Slide 13

Slide 13 text

MongoDB Queries db.stations_import.find({post_code: "80999"})

Slide 14

Slide 14 text

SQL Queries It can be so easy SELECT FROM JOIN WHERE GROUP BY HAVING ORDER BY OFFSET FETCH FIRST

Slide 15

Slide 15 text

SQL Queries Do we really SELECT FROM? JOIN WHERE GROUP BY HAVING ORDER BY OFFSET FETCH FIRST SELECT FROM

Slide 16

Slide 16 text

SQL Queries Who cares about order? JOIN WHERE GROUP BY HAVING ORDER BY OFFSET FETCH FIRST FROM SELECT

Slide 17

Slide 17 text

SQL Queries Tomayto, Tomahto FROM JOIN GROUP BY SELECT ORDER BY OFFSET FETCH FIRST WHERE HAVING

Slide 18

Slide 18 text

Aggregation Pipeline Write it like this db.collection.aggregate([ { $match: { ... }}, { $lookup: { ... }}, { $match: { ... }}, { $group: { ... }}, { $match: { ... }}, { $project: { ... }}, { $sort: { ... }}, { $skip: { ... }}, { $limit: { ... }}, ]);

Slide 19

Slide 19 text

db.collection.aggregate([ { $lookup: { ... }}, { $group: { ... }}, { $project: { ... }}, { $sort: { ... }}, { $skip: { ... }}, { $limit: { ... }}, ]); Aggregation Pipeline Duplicate stages { $match: { ... }}, { $match: { ... }}, { $match: { ... }},

Slide 20

Slide 20 text

Aggregation Pipeline Debug: run part of a pipeline db.collection.aggregate([ { $match: { ... }}, { $lookup: { ... }}, { $match: { ... }}, { $group: { ... }}, { $match: { ... }}, { $project: { ... }}, { $sort: { ... }}, { $skip: { ... }}, { $limit: { ... }}, ]);

Slide 21

Slide 21 text

Aggregation Pipeline Create views db.createView('viewName', 'collection', [ { $match: { ... }}, { $lookup: { ... }}, { $match: { ... }}, { $group: { ... }}, { $match: { ... }}, { $project: { ... }}, { $sort: { ... }}, { $skip: { ... }}, { $limit: { ... }}, ]);

Slide 22

Slide 22 text

Aggregation Pipeline On-demand materialised views { $out: { ... }}, db.collection.aggregate([ { $match: { ... }}, { $lookup: { ... }}, { $match: { ... }}, { $group: { ... }}, { $match: { ... }}, { $project: { ... }}, { $sort: { ... }}, { $skip: { ... }}, { $limit: { ... }}, ]);

Slide 23

Slide 23 text

Shaping Data

Slide 24

Slide 24 text

Aggregation Pipeline Station Data { $replaceWith: { _id: "$_id", name: "$name", brand: "$brand", location: { type: "Point", coordinates: [ { $toDouble: "$longitude" }, { $toDouble: "$latitude" }, ], }, address: { street: "$street", number: "$house_number", postCode: "$post_code", city: "$city", }, }}

Slide 25

Slide 25 text

What’s A Station? Cleaned up and organised { "_id": "0e18d0d3-ed38-4e7f-a18e-507a78ad901d", "name": "OIL! Tankstelle München", "brand": "OIL!", "location": { "type": "Point", "coordinates": [11.4609, 48.1807] }, "address": { "street": "Eversbuschstraße 33", "number": "", "postCode": "80999", "city": "München" } }

Slide 26

Slide 26 text

What’s A Station? Cleaned up and organised { "_id": "0e18d0d3-ed38-4e7f-a18e-507a78ad901d", "name": "OIL! Tankstelle München", "brand": "OIL!", "location": { "type": "Point", "coordinates": [11.4609, 48.1807] }, } "address": { "street": "Eversbuschstraße 33", "number": "", "postCode": "80999", "city": "München" }

Slide 27

Slide 27 text

{ "_id": "0e18d0d3-ed38-4e7f-a18e-507a78ad901d", "name": "OIL! Tankstelle München", "brand": "OIL!", "address": { "street": "Eversbuschstraße 33", "number": "", "postCode": "80999", "city": "München" } } What’s A Station? Cleaned up and organised "location": { "type": "Point", "coordinates": [11.4609, 48.1807] }, }

Slide 28

Slide 28 text

Aggregation Pipeline Price reports [ { $set: { date: { $dateFromString: { dateString: "$date" } }, station: { _id: "$station_uuid" }, changes: [ { fuel: "diesel", changed: { $toBool: { $toInt: "$dieselchange" } }, price: { $toDouble: "$diesel" }, }, { fuel: "e10", changed: { $toBool: { $toInt: "$e10change" } }, price: { $toDouble: "$e10" }, }, { fuel: "e5", changed: { $toBool: { $toInt: "$e5change" } }, price: { $toDouble: "$e5" }, }, ], }}, { $set: { day: { $dateTrunc: { date: "$date", unit: "day", timezone: "+01", }}, changes: { $filter: { input: "$changes", cond: { $eq: ["$$this.changed", true] }, }}, }}, { $unwind: "$changes" }, { $replaceWith: { _id: { $concat: ["$_id", "-", "$changes.fuel"] }, date: "$date", day: "$day", station: "$station", fuel: "$changes.fuel", price: "$changes.price", }}, ]

Slide 29

Slide 29 text

Price Change Make it easier to handle { "_id": "2022-01-01-prices-5-diesel", "date": { "$date": "2021-12-31T23:01:08.000Z" }, "day": { "$date": "2021-12-31T23:00:00.000Z" }, "station": { "_id": "227a99f5-0be5-49f1-8f58-645ab5da32a8" }, "fuel": "diesel", "price": 1.609 } { "_id": "2022-01-01-prices-5-e5", "date": { "$date": "2021-12-31T23:01:08.000Z" }, "day": { "$date": "2021-12-31T23:00:00.000Z" }, "station": { "_id": "227a99f5-0be5-49f1-8f58-645ab5da32a8" }, "fuel": "e5", "price": 1.769 }

Slide 30

Slide 30 text

Import ALL data

Slide 31

Slide 31 text

Import All Data Clone the repository git clone https://[email protected]/... Cloning into 'tankerkoenig-data'... remote: Azure Repos remote: Found 19574 objects to send. (376 ms) Receiving objects: 0% (64/19574), 83.04 MiB | 20.74 MiB/s

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Import All Data What have we got? $ du -hcs * 8.0K README.md 57G prices 7.1G stations 64G total $ wc -l prices/*/*/*.csv ... 731239757 total

Slide 34

Slide 34 text

Import All Data Reduce the dataset $ wc -l prices/2022/*/*.csv ... 123735340 total $ wc -l stations/2022/01/2022-01-01-stations.csv 16480 stations/2022/01/2022-01-01-stations.csv

Slide 35

Slide 35 text

Import All Data 2022 in numbers • ~124 million price records • 339000 per day • 16480 stations

Slide 36

Slide 36 text

~20 Daily Price Changes Per Station

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

php import.php mongodb://localhost ../tankerkoenig-data/prices/2022/*/*.csv Processing file ../tankerkoenig-data/prices/2022/01/2022-01-01-prices.csv ... Done in 1173.451 s Inserted 123734975 documents in 349.873 s (353657.03 documents/s)

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

Optimise Storage Bucket Pattern { "day": { "$date": "2021-12-31T23:00:00.000Z" }, "station": { "_id": "227a99f5-0be5-49f1-8f58-645ab5da32a8" }, "fuel": "diesel", "prices": [ { "_id": "2022-01-01-prices-5-diesel", "date": { "$date": "2021-12-31T23:01:08.000Z" }, "price": 1.609 }, { "_id": "2022-01-01-prices-224783-diesel", "date": { "$date": "2022-01-01T16:59:10.000Z" }, "price": 1.579 } ] }

Slide 41

Slide 41 text

weightedAverage: { $round: [ { $divide: [ { $reduce: { input: "$prices", initialValue: 0, in: { $add: [ "$$value", { $multiply: [ "$$this.seconds", "$$this.price" ] }, ]}, }}, { $dateDiff: { startDate: "$day", endDate: { $dateAdd: { startDate: "$day", unit: "day", amount: 1, }}, unit: "second", }}, ]}, 3, ]}, }}, ]

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Daily Bucket Metadata { "_id": { "$oid": "650063b7630a232653ab7817" }, "day": { "$date": "2022-01-01T23:00:00.000Z" }, "fuel": "diesel", "station": { "_id": "278130b1-e062-4a0f-80cc-19e486b4c024", "name": "Aral Tankstelle", "brand": "ARAL", "location": { "type": "Point", "coordinates": [13.4214869, 52.5141525] }, "address": { "postCode": "10179" } }, ... }

Slide 45

Slide 45 text

Daily Bucket Full price list { ... "prices": [ { "_id": "2022-01-02-prices-16852-diesel", "date": { "$date": "2022-01-02T06:02:11.000Z" }, "price": 1.639, "previousPrice": 1.569 }, { "_id": "2022-01-02-prices-49287-diesel", "date": { "$date": "2022-01-02T08:01:08.000Z" }, "price": 1.689, "previousPrice": 1.639 } ], ... }

Slide 46

Slide 46 text

Daily Bucket Daily price data { ... "openingPrice": 1.569, "closingPrice": 1.629, "weightedAverage": 1.591, "lowestPrice": { "_id": "2022-01-02-prices-127016-diesel", "date": { "$date": "2022-01-02T12:05:08.000Z" }, "price": 1.549 }, "highestPrice": { "_id": "2022-01-02-prices-76907-diesel", "date": { "$date": "2022-01-02T09:31:09.000Z" }, "price": 1.709 }, ... }

Slide 47

Slide 47 text

Daily Bucket Daily averages { ... "dailyAverage": { "lowestPrice": 1.399, "highestPrice": 1.939, "averagePrice": 1.575, "percentiles": { "p50": 1.572, "p90": 1.62, "p95": 1.636, "p99": 1.725 } } }

Slide 48

Slide 48 text

Numbers, At Last Di ff erent Format? Rank Brand Changes Diesel e5 e10 1 LD Tankestelle Westheim 101.91 34.04 33.94 33.94 2 Rapp - Freie Tankstelle 85.71 27.87 29.30 29.29 3 Freie Tankstelle Endrass 85.71 26.35 29.68 29.68 4 Sb 79.48 25.89 26.75 26.84 5 DBV 78.93 22.50 22.86 33.56

Slide 49

Slide 49 text

Let’s build some charts

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

Thanks! @alcaeus github.com/alcaeus symfony-devs: @alcaeus