Slide 1

Slide 1 text

Data Processing PHPers Summit 2024 https://flow-php.com

Slide 2

Slide 2 text

Problem https://flow-php.com

Slide 3

Slide 3 text

How to process a CSV Report? https://flow-php.com

Slide 4

Slide 4 text

Orders Report • order_id – uuid • created_at – datetime • updated_at – datetime • discount – float (nullable) • address – structure{street: string, city: string, zip: string, country: string} • notes – list • items – list https://flow-php.com

Slide 5

Slide 5 text

file_get_contents

Slide 6

Slide 6 text

fopen/fgets/feof

Slide 7

Slide 7 text

fopen/fgetcsv/feof

Slide 8

Slide 8 text

generators

Slide 9

Slide 9 text

generators

Slide 10

Slide 10 text

What about column types? https://flow-php.com

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Manual Type Casting https://flow-php.com

Slide 13

Slide 13 text

Issues & limitations? • Flexibility • Maintainability • Extendability • Scalability* https://flow-php.com

Slide 14

Slide 14 text

Are there any other solutions? https://flow-php.com

Slide 15

Slide 15 text

ETL https://flow-php.com

Slide 16

Slide 16 text

ETL • Extract https://flow-php.com

Slide 17

Slide 17 text

ETL • Extract • Transform https://flow-php.com

Slide 18

Slide 18 text

ETL • Extract • Transform • Load https://flow-php.com

Slide 19

Slide 19 text

Extraction • Database • File • API • Streams • Queues / Topics https://flow-php.com

Slide 20

Slide 20 text

• Filtering • Merging • Cleaning • Grouping • Aggregating • Deduplicating • Sorting • Partitioning Transformation https://flow-php.com

Slide 21

Slide 21 text

• Database • File • Stream • Projection • API Loading https://flow-php.com

Slide 22

Slide 22 text

Examples php/scala/python https://flow-php.com

Slide 23

Slide 23 text

How to process a CSV Report? With Flow PHP https://flow-php.com

Slide 24

Slide 24 text

generators approach

Slide 25

Slide 25 text

https://flow-php.com ETL processing pipeline approach

Slide 26

Slide 26 text

+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+ | order_id | created_at | updated_at | discount | address | notes | items | +----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+ | 48f7b4b3-48dc-3095-8 | 2024-04-23T01:35:12+ | 2024-04-23T01:35:12+ | | {"street":"56896 Pow | ["Sed cumque sit vol | [{"sku":"SKU_0005"," | | b8670686-1e52-36ee-9 | 2024-04-14T09:00:12+ | 2024-04-14T09:00:12+ | | {"street":"596 Derek | ["Fugiat saepe atque | [{"sku":"SKU_0004"," | | dc052d5e-2b2c-3b2a-9 | 2024-03-03T08:03:02+ | 2024-03-03T08:03:02+ | 40.08 | {"street":"51760 Koe | ["Aliquid voluptatem | [{"sku":"SKU_0004"," | | 6984b96b-6a27-367f-9 | 2024-04-03T16:18:07+ | 2024-04-03T16:18:07+ | | {"street":"9722 Doll | ["Est sit atque quos | [{"sku":"SKU_0004"," | | cb21141a-5494-33ea-9 | 2024-04-25T17:47:49+ | 2024-04-25T17:47:49+ | 2.38 | {"street":"11398 Abs | ["Est atque doloremq | [{"sku":"SKU_0005"," | | c9dc07fc-fa46-3f32-9 | 2024-03-27T12:44:03+ | 2024-03-27T12:44:03+ | | {"street":"78980 Bri | ["Sit aut laudantium | [{"sku":"SKU_0003"," | | 9b828e2d-b509-3485-b | 2024-04-12T06:33:52+ | 2024-04-12T06:33:52+ | | {"street":"6434 Chet | ["Ad consequuntur qu | [{"sku":"SKU_0005"," | | 6f619e18-05aa-306b-8 | 2024-06-10T21:17:45+ | 2024-06-10T21:17:45+ | | {"street":"8038 Crai | ["Dolorum recusandae | [{"sku":"SKU_0005"," | | 7814b135-500f-3137-9 | 2024-05-14T07:39:00+ | 2024-05-14T07:39:00+ | | {"street":"26190 Cor | ["Est quis necessita | [{"sku":"SKU_0005"," | +----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+ 10 rows Output https://flow-php.com

Slide 27

Slide 27 text

How to process a CSV Report? With Apache Spark (scala) https://flow-php.com

Slide 28

Slide 28 text

https://flow-php.com

Slide 29

Slide 29 text

+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+ |order_id |created_at |updated_at |discount|address |notes |items | +------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+ |7833e6cb-b123-37f7-bee5-c0fea6dd6787|2024-04-26T06:01:52+00:00|2024-04-26T06:01:52+00:00|2.09 |"{""street"":""64428 Nitzsche Locks"" |""city"":""Lake Deontechester"" |""zip"":""15111"" | |5aa4fb2b-7bc5-3d7c-a9a9-04e88f831b00|2024-01-23T22:53:49+00:00|2024-01-23T22:53:49+00:00|46.87 |"{""street"":""5751 Jamal Drive"" |""city"":""Port Delmer"" |""zip"":""57385"" | |77a297db-b911-3800-a017-7f16502f324f|2024-02-20T00:44:03+00:00|2024-02-20T00:44:03+00:00|29.19 |"{""street"":""831 Murphy Haven"" |""city"":""West Alessandroport""|""zip"":""65846-1195""| |ed030a18-df55-38ce-90a3-c3461032c150|2024-02-15T22:03:47+00:00|2024-02-15T22:03:47+00:00|null |"{""street"":""8617 Lebsack Cape Suite 285""|""city"":""New Leonel"" |""zip"":""43725"" | |d4b7921e-7729-322f-89a1-51e1c5198678|2024-04-12T04:26:42+00:00|2024-04-12T04:26:42+00:00|10.44 |"{""street"":""523 Charlene Mount Apt. 694""|""city"":""Bruenstad"" |""zip"":""40291"" | |ff342e29-a6f8-3df3-b1d0-0adb376557fd|2024-03-24T08:49:27+00:00|2024-03-24T08:49:27+00:00|45.48 |"{""street"":""822 Carmel Common Apt. 560"" |""city"":""Abigailport"" |""zip"":""64470"" | |9771e63f-16e6-311a-a974-a0d60a06fea4|2024-02-14T00:23:18+00:00|2024-02-14T00:23:18+00:00|null |"{""street"":""87471 Jaylon Place"" |""city"":""Cummingsmouth"" |""zip"":""11956-0536""| |eaf137da-c206-3252-a5f5-428b1d4eb4f1|2024-05-03T18:01:13+00:00|2024-05-03T18:01:13+00:00|19.16 |"{""street"":""4494 Kunze Tunnel Apt. 465"" |""city"":""Lake Sabinaland"" |""zip"":""60381-1971""| |64a4ee3d-66e3-3b5e-9830-562c051e6576|2024-01-02T17:57:03+00:00|2024-01-02T17:57:03+00:00|27.79 |"{""street"":""425 Oren Manors"" |""city"":""Lake Vincent"" |""zip"":""69860"" | |24099ba6-9131-3714-84da-d3a59ede3cd1|2024-01-27T22:47:54+00:00|2024-01-27T22:47:54+00:00|45.2 |"{""street"":""328 Daniel Inlet Apt. 768"" |""city"":""Jedediahville"" |""zip"":""77120-2693""| +------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+ Output https://flow-php.com

Slide 30

Slide 30 text

How to process a CSV Report? With Pandas (python)

Slide 31

Slide 31 text

https://flow-php.com

Slide 32

Slide 32 text

order_id created_at updated_at discount address notes items 0 e13d7098-5a78-33... 2024-06-17T19:24... 2024-06-17T19:24... 12.45 {"street":"9742 ... ["Doloremque cum... [{"sku":"SKU_000... 1 947df050-3abb-3f... 2024-02-23T19:18... 2024-02-23T19:18... NaN {"street":"37051... ["Neque dolor et... [{"sku":"SKU_000... 2 6315f9e2-86bf-33... 2024-04-02T11:30... 2024-04-02T11:30... 47.10 {"street":"792 G... ["Et porro fugia... [{"sku":"SKU_000... 3 4cccb632-fade-34... 2024-05-06T00:17... 2024-05-06T00:17... 19.76 {"street":"30203... ["Aliquam saepe ... [{"sku":"SKU_000... 4 82384f8c-9adb-38... 2024-05-10T11:17... 2024-05-10T11:17... NaN {"street":"757 T... ["Beatae nesciun... [{"sku":"SKU_000... 5 e3fcf736-0f8c-3d... 2024-01-25T20:14... 2024-01-25T20:14... NaN {"street":"9088 ... ["Provident quam... [{"sku":"SKU_000... 6 b987a49a-b4c5-37... 2024-06-03T23:22... 2024-06-03T23:22... NaN {"street":"6867 ... ["Quibusdam maio... [{"sku":"SKU_000... 7 663523a9-713b-33... 2024-03-22T23:31... 2024-03-22T23:31... 25.88 {"street":"1577 ... ["In rem maxime ... [{"sku":"SKU_000... 8 6259fa2c-ec68-36... 2024-05-10T10:12... 2024-05-10T10:12... 21.67 {"street":"987 L... ["Voluptatem non... [{"sku":"SKU_000... 9 f7153c83-34b6-37... 2024-02-26T09:20... 2024-02-26T09:20... 18.93 {"street":"2039 ... ["Culpa error re... [{"sku":"SKU_000... Output https://flow-php.com

Slide 33

Slide 33 text

How ETL works? https://flow-php.com

Slide 34

Slide 34 text

Dataset Processing Visualization https://flow-php.com

Slide 35

Slide 35 text

Dataset Processing Visualization Size of data frame defines memory consuption Memory = Size of columns in rows * number of rows *simplified version https://flow-php.com

Slide 36

Slide 36 text

What is a transformation? https://flow-php.com

Slide 37

Slide 37 text

Don’t think in objects/functions https://flow-php.com

Slide 38

Slide 38 text

Think in tables, rows, columns and cells https://flow-php.com

Slide 39

Slide 39 text

Just like managing excel sheet https://flow-php.com

Slide 40

Slide 40 text

Dataset - Table https://flow-php.com

Slide 41

Slide 41 text

Data Frame – Rows https://flow-php.com

Slide 42

Slide 42 text

Row - Columns https://flow-php.com

Slide 43

Slide 43 text

Row & Column - Cell https://flow-php.com

Slide 44

Slide 44 text

Data Types https://flow-php.com

Slide 45

Slide 45 text

https://flow-php.com Data Types Usually there are two categories of data types

Slide 46

Slide 46 text

https://flow-php.com Native Types • Integer - scalar • String - scalar • Boolean - scalar • Float - scalar • Object • Resource • Null • Enum • Callable • Array

Slide 47

Slide 47 text

https://flow-php.com Logical Types • DateTime • Uuid • Json • List • Map • Structure • XML • XMLElement

Slide 48

Slide 48 text

https://flow-php.com Logical Types Logical types are more specific implementations of native types Different programming languages will provide different logical/native types

Slide 49

Slide 49 text

https://flow-php.com Logical Types • DateTime (object) • Uuid (object) • Json (array) • List (array) • Map (array) • Structure (array) • XML (object) • XMLElement (object)

Slide 50

Slide 50 text

https://flow-php.com Logical Types: List A list is a collection of elements where each element is indexed by its position in the list

Slide 51

Slide 51 text

https://flow-php.com Logical Types: Map A map (also known as a dictionary or associative array) stores key-value pairs, where each key is unique and associated with a single value. * In the PHP, main purpose of map is to guarantee a type of keys and values since regular array is not enforcing them.

Slide 52

Slide 52 text

https://flow-php.com Logical Types: Structure A complex data type grouping multiple fields (native and logical types)

Slide 53

Slide 53 text

https://flow-php.com Nullability All types can be nullable But not all programming languages handles nulls the same way

Slide 54

Slide 54 text

https://flow-php.com FlowPHP Types Logical Types Native Types

Slide 55

Slide 55 text

What is a transformation? https://flow-php.com

Slide 56

Slide 56 text

https://flow-php.com

Slide 57

Slide 57 text

Transformation is a process of converting, cleansing and structuring data into usable format. https://flow-php.com example of transforming string into Date Time object

Slide 58

Slide 58 text

Data transformations usually happens on a single cell level https://flow-php.com

Slide 59

Slide 59 text

Transform something! https://flow-php.com

Slide 60

Slide 60 text

Type Casting https://flow-php.com

Slide 61

Slide 61 text

https://flow-php.com Before

Slide 62

Slide 62 text

https://flow-php.com Cast Types

Slide 63

Slide 63 text

https://flow-php.com ref - reference

Slide 64

Slide 64 text

https://flow-php.com After

Slide 65

Slide 65 text

Filtering https://flow-php.com

Slide 66

Slide 66 text

https://flow-php.com Only discounted orders

Slide 67

Slide 67 text

Conditional Transformations https://flow-php.com

Slide 68

Slide 68 text

https://flow-php.com Replacing nulls with zeros

Slide 69

Slide 69 text

Sorting https://flow-php.com

Slide 70

Slide 70 text

https://flow-php.com Top 20 recently updated orders

Slide 71

Slide 71 text

How sorting can be memory efficient? https://flow-php.com

Slide 72

Slide 72 text

External sorting is a type of sorting algorithms that can handle large amounts of data https://flow-php.com

Slide 73

Slide 73 text

Grouping & Aggregation https://flow-php.com

Slide 74

Slide 74 text

https://flow-php.com Daily Orders Count

Slide 75

Slide 75 text

Joins https://flow-php.com

Slide 76

Slide 76 text

https://flow-php.com Joining orders with products Products Dataset

Slide 77

Slide 77 text

https://flow-php.com Unpack Order Line Items Order Items - List

Slide 78

Slide 78 text

First we need to turn our orders dataset into order line items dataset https://flow-php.com

Slide 79

Slide 79 text

https://flow-php.com Data Flattening Turning nested structures into flat rows

Slide 80

Slide 80 text

https://flow-php.com Unpack Order Line Items

Slide 81

Slide 81 text

https://flow-php.com Expand Order row with 2 items will be turned into two rows with the same order id

Slide 82

Slide 82 text

https://flow-php.com Unpack Unpack will turn each structure element into column

Slide 83

Slide 83 text

https://flow-php.com Full Pipeline

Slide 84

Slide 84 text

https://flow-php.com Output

Slide 85

Slide 85 text

What can we do next? https://flow-php.com

Slide 86

Slide 86 text

Calculate daily revenue Calculate daily profit Find top selling products ... https://flow-php.com

Slide 87

Slide 87 text

Data transformations are applied in steps by adding/replacing columns https://flow-php.com

Slide 88

Slide 88 text

Writing data to different sources https://flow-php.com

Slide 89

Slide 89 text

https://flow-php.com Require packages composer require flow-php/etl-adapter-parquet composer require flow-php/etl-adapter-json composer require flow-php/etl-adapter-doctrine

Slide 90

Slide 90 text

https://flow-php.com Write to

Slide 91

Slide 91 text

Dataset Schema https://flow-php.com

Slide 92

Slide 92 text

By defining column types we are defining dataset schema https://flow-php.com

Slide 93

Slide 93 text

Schema Definition https://flow-php.com

Slide 94

Slide 94 text

Schema can be used to either validate dataset or to improve extraction performance https://flow-php.com

Slide 95

Slide 95 text

Providing schema to extractor https://flow-php.com

Slide 96

Slide 96 text

https://flow-php.com In this case type casting is not needed

Slide 97

Slide 97 text

Using schema to validate rows before loading them to destination https://flow-php.com

Slide 98

Slide 98 text

https://flow-php.com Lets try to validate schema of our joined dataset

Slide 99

Slide 99 text

Will it work? https://flow-php.com

Slide 100

Slide 100 text

https://flow-php.com

Slide 101

Slide 101 text

How it can be fixed? https://flow-php.com

Slide 102

Slide 102 text

https://flow-php.com We can make discount not nullable

Slide 103

Slide 103 text

While working with big datasets and complex transformations schema validation is necessary to guarantee data quality https://flow-php.com

Slide 104

Slide 104 text

Garbage In, Garbage Out https://flow-php.com Why should we care about dataset schema?

Slide 105

Slide 105 text

What are typical use cases of ETL’s? https://flow-php.com

Slide 106

Slide 106 text

https://flow-php.com Use cases Building data storages (lakehouses/warehouses/lakes) Generating reports Consuming API’s Systems synchronizations Building projections Converting datasets formats Initial data analysis

Slide 107

Slide 107 text

Data engineering makes data analysis and data science much easier (cheaper) https://flow-php.com

Slide 108

Slide 108 text

What about performance? https://flow-php.com

Slide 109

Slide 109 text

https://flow-php.com 5 mln rows

Slide 110

Slide 110 text

https://flow-php.com Results

Slide 111

Slide 111 text

I leave the decision up to you https://flow-php.com Is it fast?

Slide 112

Slide 112 text

Norbert Orzechowicz https://flow-php.com GitHub: https://github.com/norberttech LinkedIn: https://www.linkedin.com/in/norberttech X (Twitter): https://x.com/norbert_tech Email: [email protected] Discord: https://discord.gg/bUeTc8f9GD Flow PHP

Slide 113

Slide 113 text

That’s all for today! https://flow-php.com Questions?