Data Processing in PHP
Norbert Orzechowicz
May 15, 2026
Transcript
Data Processing in PHP, phpday 2026 https://flow-php.com
WHOAMI
Independent consultant helping companies organize and access their data.
Fan of Lego & BMD
Problem
How to process a CSV File?
Orders
• order_id – uuid
• created_at – datetime
• updated_at – datetime
• discount – float (nullable)
• address – structure{street: string, city: string, zip: string, country: string}
• notes – list<string>
• items – list<structure{sku: string, quantity: string, price: float}>
JSON enriched CSV
file_get_contents
fopen fgets str_getcsv fclose
fopen fgetcsv fclose
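The code for this slide is not part of the transcript. A minimal sketch of the `fopen` / `fgetcsv` / `fclose` approach could look like the block below. The function name `read_csv_rows` and the assumption that the first line is a header are illustrative, not taken from the talk.

```php
<?php

declare(strict_types=1);

/**
 * Reads a whole CSV file with fopen/fgetcsv/fclose (PHP 8+).
 * Hypothetical helper: assumes the first line holds the column names.
 *
 * @return list<array<string, string|null>>
 */
function read_csv_rows(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }

    try {
        // Separator, enclosure and escape are passed explicitly; an empty
        // escape string disables PHP's non-standard escape handling.
        $header = fgetcsv($handle, null, ',', '"', '');
        if ($header === false) {
            return [];
        }

        $rows = [];
        while (($line = fgetcsv($handle, null, ',', '"', '')) !== false) {
            if ($line === [null]) {
                continue; // fgetcsv reports a blank line as [null]
            }
            // Throws a ValueError when a line has a different column count.
            $rows[] = array_combine($header, $line);
        }

        return $rows;
    } finally {
        fclose($handle);
    }
}
```

Every value comes back as a string, and all rows are still collected in memory, which is what the following slides address.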
Generators
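As a rough illustration of the generator approach (the slide's own code is not in the transcript), the reader can `yield` one row at a time instead of building an array, so memory use no longer grows with the file size. `stream_csv_rows` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Lazily yields CSV rows as associative arrays (PHP 8+).
 * Hypothetical helper: assumes the first line holds the column names.
 *
 * @return Generator<int, array<string, string|null>>
 */
function stream_csv_rows(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }

    try {
        $header = fgetcsv($handle, null, ',', '"', '');
        if ($header === false) {
            return;
        }

        while (($line = fgetcsv($handle, null, ',', '"', '')) !== false) {
            if ($line === [null]) {
                continue; // skip blank lines
            }
            // Only the current row is held in memory.
            yield array_combine($header, $line);
        }
    } finally {
        // Runs when the generator finishes or is destroyed early.
        fclose($handle);
    }
}
```

Usage is a plain `foreach (stream_csv_rows('orders.csv') as $row) { ... }`, where `orders.csv` is a placeholder path.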
Generators (batching)
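Batching can be layered on top of any row generator. The sketch below groups rows into fixed-size batches, which is useful when the destination prefers bulk writes. The name `in_batches` is an assumption made for this example.

```php
<?php

declare(strict_types=1);

/**
 * Groups any iterable of rows into lists of at most $size rows.
 * Hypothetical helper; only one batch is held in memory at a time.
 *
 * @return Generator<int, list<mixed>>
 */
function in_batches(iterable $rows, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the last, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}
```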
What about data types?
Manual Type Casting
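A sketch of what manual type casting might look like for the Orders dataset described earlier. The column names come from the Orders slide; the function name `cast_order` and the convention that an empty CSV cell means `null` are assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Casts one raw CSV row (all strings) into typed values.
 * Hypothetical helper; an empty discount cell is treated as null.
 *
 * @param array<string, string> $raw
 * @return array<string, mixed>
 */
function cast_order(array $raw): array
{
    // JSON columns become PHP arrays; invalid JSON throws JsonException.
    $json = static fn (string $value): array
        => json_decode($value, true, 512, JSON_THROW_ON_ERROR);

    return [
        'order_id'   => $raw['order_id'],
        'created_at' => new DateTimeImmutable($raw['created_at']),
        'updated_at' => new DateTimeImmutable($raw['updated_at']),
        'discount'   => $raw['discount'] === '' ? null : (float) $raw['discount'],
        'address'    => $json($raw['address']),
        'notes'      => $json($raw['notes']),
        'items'      => $json($raw['items']),
    ];
}
```

Every new column or format change means editing this function by hand, which leads to the limitations listed on the next slide.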
Issues & limitations?
• Flexibility
• Maintainability
• Extensibility
• Scalability*
Are there any other solutions?
Let’s talk about ETL
ETL
• Extract
• Transform
• Load
Extraction
• Database
• File
• API
• Streams
• Queues / Topics
Transformation
• Filtering
• Merging
• Cleaning
• Grouping
• Aggregating
• Deduplicating
• Sorting
• Partitioning
Loading
• Database
• File
• Stream
• Projection
• API
Examples: PHP / Scala / Python
How to process a CSV File? With PHP
Raw PHP
Flow PHP
Output
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
| order_id             | created_at           | updated_at           | discount | address              | notes                | items                |
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
| 48f7b4b3-48dc-3095-8 | 2024-04-23T01:35:12+ | 2024-04-23T01:35:12+ |          | {"street":"56896 Pow | ["Sed cumque sit vol | [{"sku":"SKU_0005"," |
| b8670686-1e52-36ee-9 | 2024-04-14T09:00:12+ | 2024-04-14T09:00:12+ |          | {"street":"596 Derek | ["Fugiat saepe atque | [{"sku":"SKU_0004"," |
| dc052d5e-2b2c-3b2a-9 | 2024-03-03T08:03:02+ | 2024-03-03T08:03:02+ | 40.08    | {"street":"51760 Koe | ["Aliquid voluptatem | [{"sku":"SKU_0004"," |
| 6984b96b-6a27-367f-9 | 2024-04-03T16:18:07+ | 2024-04-03T16:18:07+ |          | {"street":"9722 Doll | ["Est sit atque quos | [{"sku":"SKU_0004"," |
| cb21141a-5494-33ea-9 | 2024-04-25T17:47:49+ | 2024-04-25T17:47:49+ | 2.38     | {"street":"11398 Abs | ["Est atque doloremq | [{"sku":"SKU_0005"," |
| c9dc07fc-fa46-3f32-9 | 2024-03-27T12:44:03+ | 2024-03-27T12:44:03+ |          | {"street":"78980 Bri | ["Sit aut laudantium | [{"sku":"SKU_0003"," |
| 9b828e2d-b509-3485-b | 2024-04-12T06:33:52+ | 2024-04-12T06:33:52+ |          | {"street":"6434 Chet | ["Ad consequuntur qu | [{"sku":"SKU_0005"," |
| 6f619e18-05aa-306b-8 | 2024-06-10T21:17:45+ | 2024-06-10T21:17:45+ |          | {"street":"8038 Crai | ["Dolorum recusandae | [{"sku":"SKU_0005"," |
| 7814b135-500f-3137-9 | 2024-05-14T07:39:00+ | 2024-05-14T07:39:00+ |          | {"street":"26190 Cor | ["Est quis necessita | [{"sku":"SKU_0005"," |
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
10 rows
How to process a CSV File? With Apache Spark (Scala)
Output
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
|order_id                            |created_at               |updated_at               |discount|address                                     |notes                           |items                 |
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
|7833e6cb-b123-37f7-bee5-c0fea6dd6787|2024-04-26T06:01:52+00:00|2024-04-26T06:01:52+00:00|2.09    |"{""street"":""64428 Nitzsche Locks""       |""city"":""Lake Deontechester"" |""zip"":""15111""     |
|5aa4fb2b-7bc5-3d7c-a9a9-04e88f831b00|2024-01-23T22:53:49+00:00|2024-01-23T22:53:49+00:00|46.87   |"{""street"":""5751 Jamal Drive""           |""city"":""Port Delmer""        |""zip"":""57385""     |
|77a297db-b911-3800-a017-7f16502f324f|2024-02-20T00:44:03+00:00|2024-02-20T00:44:03+00:00|29.19   |"{""street"":""831 Murphy Haven""           |""city"":""West Alessandroport""|""zip"":""65846-1195""|
|ed030a18-df55-38ce-90a3-c3461032c150|2024-02-15T22:03:47+00:00|2024-02-15T22:03:47+00:00|null    |"{""street"":""8617 Lebsack Cape Suite 285""|""city"":""New Leonel""         |""zip"":""43725""     |
|d4b7921e-7729-322f-89a1-51e1c5198678|2024-04-12T04:26:42+00:00|2024-04-12T04:26:42+00:00|10.44   |"{""street"":""523 Charlene Mount Apt. 694""|""city"":""Bruenstad""          |""zip"":""40291""     |
|ff342e29-a6f8-3df3-b1d0-0adb376557fd|2024-03-24T08:49:27+00:00|2024-03-24T08:49:27+00:00|45.48   |"{""street"":""822 Carmel Common Apt. 560"" |""city"":""Abigailport""        |""zip"":""64470""     |
|9771e63f-16e6-311a-a974-a0d60a06fea4|2024-02-14T00:23:18+00:00|2024-02-14T00:23:18+00:00|null    |"{""street"":""87471 Jaylon Place""         |""city"":""Cummingsmouth""      |""zip"":""11956-0536""|
|eaf137da-c206-3252-a5f5-428b1d4eb4f1|2024-05-03T18:01:13+00:00|2024-05-03T18:01:13+00:00|19.16   |"{""street"":""4494 Kunze Tunnel Apt. 465"" |""city"":""Lake Sabinaland""    |""zip"":""60381-1971""|
|64a4ee3d-66e3-3b5e-9830-562c051e6576|2024-01-02T17:57:03+00:00|2024-01-02T17:57:03+00:00|27.79   |"{""street"":""425 Oren Manors""            |""city"":""Lake Vincent""       |""zip"":""69860""     |
|24099ba6-9131-3714-84da-d3a59ede3cd1|2024-01-27T22:47:54+00:00|2024-01-27T22:47:54+00:00|45.2    |"{""street"":""328 Daniel Inlet Apt. 768""  |""city"":""Jedediahville""      |""zip"":""77120-2693""|
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
How to process a CSV File? With Pandas (Python)
Output
              order_id           created_at           updated_at  discount              address                notes                items
0  e13d7098-5a78-33...  2024-06-17T19:24...  2024-06-17T19:24...     12.45  {"street":"9742 ...  ["Doloremque cum...  [{"sku":"SKU_000...
1  947df050-3abb-3f...  2024-02-23T19:18...  2024-02-23T19:18...       NaN  {"street":"37051...  ["Neque dolor et...  [{"sku":"SKU_000...
2  6315f9e2-86bf-33...  2024-04-02T11:30...  2024-04-02T11:30...     47.10  {"street":"792 G...  ["Et porro fugia...  [{"sku":"SKU_000...
3  4cccb632-fade-34...  2024-05-06T00:17...  2024-05-06T00:17...     19.76  {"street":"30203...  ["Aliquam saepe ...  [{"sku":"SKU_000...
4  82384f8c-9adb-38...  2024-05-10T11:17...  2024-05-10T11:17...       NaN  {"street":"757 T...  ["Beatae nesciun...  [{"sku":"SKU_000...
5  e3fcf736-0f8c-3d...  2024-01-25T20:14...  2024-01-25T20:14...       NaN  {"street":"9088 ...  ["Provident quam...  [{"sku":"SKU_000...
6  b987a49a-b4c5-37...  2024-06-03T23:22...  2024-06-03T23:22...       NaN  {"street":"6867 ...  ["Quibusdam maio...  [{"sku":"SKU_000...
7  663523a9-713b-33...  2024-03-22T23:31...  2024-03-22T23:31...     25.88  {"street":"1577 ...  ["In rem maxime ...  [{"sku":"SKU_000...
8  6259fa2c-ec68-36...  2024-05-10T10:12...  2024-05-10T10:12...     21.67  {"street":"987 L...  ["Voluptatem non...  [{"sku":"SKU_000...
9  f7153c83-34b6-37...  2024-02-26T09:20...  2024-02-26T09:20...     18.93  {"street":"2039 ...  ["Culpa error re...  [{"sku":"SKU_000...
Data Frame
The size of a data frame defines memory consumption
Memory = size of the columns in a row * number of rows*
*simplified version
What is a transformation?
Don’t think in objects/functions
Think in tables, rows, columns and cells
Just like managing Excel
Dataset - Table
Data Frame
Row / Columns
Row & Column - Cell
Data Types
Data Types
Usually there are two categories of data types
Physical (primitive) Types
• Integer - scalar
• String - scalar
• Boolean - scalar
• Float - scalar
• Object
• Resource
• Null
• Enum
• Callable
• Array
Logical Types
• DateTime
• Uuid
• Json
• List
• Map
• Structure
• XML
• XMLElement
Logical Types
Logical types are more specific implementations of native types
Different programming languages will provide different logical/native types
Logical Types
• DateTime (object)
• Uuid (object)
• Json (array)
• List (array)
• Map (array)
• Structure (array)
• XML (object)
• XMLElement (object)
Logical Types: List
A list is a collection of elements where each element is indexed by its position in the list
Logical Types: Map
A map (also known as a dictionary or associative array) stores key-value pairs, where each key is unique and associated with a single value.
* In PHP, the main purpose of a map is to guarantee the types of keys and values, since a regular array does not enforce them.
Logical Types: Structure
A complex data type grouping multiple fields (physical and logical types)
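In plain PHP all three logical types are backed by the same `array`, so the distinction has to be checked at runtime. The sketch below shows one way to express those checks. The helper names are hypothetical and this is not how Flow PHP implements its types.

```php
<?php

declare(strict_types=1);

/** List: sequential 0..n-1 keys, every element of one type (PHP 8.1+ for array_is_list). */
function is_list_of(array $value, callable $isElement): bool
{
    if (!array_is_list($value)) {
        return false;
    }
    foreach ($value as $element) {
        if (!$isElement($element)) {
            return false;
        }
    }

    return true;
}

/** Map: arbitrary keys, but key type and value type are both enforced. */
function is_map_of(array $value, callable $isKey, callable $isValue): bool
{
    foreach ($value as $key => $element) {
        if (!$isKey($key) || !$isValue($element)) {
            return false;
        }
    }

    return true;
}

/**
 * Structure: a fixed set of named fields, each with its own type check.
 *
 * @param array<string, callable> $fields field name => type check
 */
function is_structure(array $value, array $fields): bool
{
    foreach ($fields as $name => $isField) {
        if (!array_key_exists($name, $value) || !$isField($value[$name])) {
            return false;
        }
    }

    return true;
}
```

For example, `notes` from the Orders dataset would be checked with `is_list_of($notes, 'is_string')`, and one element of `items` with `is_structure($item, ['sku' => 'is_string', 'price' => 'is_float'])`.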
Nullability
All types can be nullable
But not all programming languages handle nulls the same way
Flow PHP Types: Logical Types, Native Types
What is a transformation?
Transformation is a process of converting, cleansing and structuring data into a usable format.
Example of transforming a string into a DateTime object
Data transformations usually happen on a single cell level
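A cell-level transformation can be sketched in plain PHP as a function that adds or replaces one column in every row. `with_column` is a hypothetical helper written for this example; it is not Flow PHP's API.

```php
<?php

declare(strict_types=1);

/**
 * Adds or replaces a single column in every row.
 * Hypothetical helper: $transform receives the whole row and returns the new cell value.
 *
 * @param iterable<array<string, mixed>> $rows
 * @return Generator<int, array<string, mixed>>
 */
function with_column(iterable $rows, string $column, callable $transform): Generator
{
    foreach ($rows as $row) {
        $row[$column] = $transform($row);
        yield $row;
    }
}

// Example: turn the created_at string cell into a DateTimeImmutable object.
$rows = [['order_id' => 'a', 'created_at' => '2024-04-23T01:35:12+00:00']];
$typed = with_column(
    $rows,
    'created_at',
    static fn (array $row): DateTimeImmutable => new DateTimeImmutable($row['created_at'])
);
```

Because the helper returns a generator, several such steps can be chained without materializing intermediate datasets.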
Transform something!
Type Casting
Before
Cast Types
ref - reference
After
Filtering
Only discounted orders
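In plain PHP the same filter can be sketched as a generator that passes through only matching rows. This assumes "discounted" means the `discount` cell is not null; `only_discounted` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Keeps only orders whose discount cell is set.
 * Assumption: a null (or missing) discount means "no discount".
 *
 * @param iterable<array<string, mixed>> $orders
 * @return Generator<int, array<string, mixed>>
 */
function only_discounted(iterable $orders): Generator
{
    foreach ($orders as $order) {
        if (($order['discount'] ?? null) !== null) {
            yield $order;
        }
    }
}
```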
Conditional Transformations
Replacing nulls with zeros
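A conditional transformation changes a cell only when a condition holds. For the null-to-zero case, a minimal plain PHP sketch can rely on the null coalescing assignment operator. The helper name `nulls_to_zero` is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Replaces null (or missing) values in one column with 0.0 and leaves other values untouched.
 *
 * @param iterable<array<string, mixed>> $rows
 * @return Generator<int, array<string, mixed>>
 */
function nulls_to_zero(iterable $rows, string $column): Generator
{
    foreach ($rows as $row) {
        // ??= assigns only when the current value is null or the key is absent.
        $row[$column] ??= 0.0;
        yield $row;
    }
}
```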
Sorting
Top 20 recently updated orders
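For a dataset that fits in memory, "top N recently updated" is a descending sort followed by a limit. The sketch below assumes `updated_at` has already been cast to a `DateTimeImmutable`; the function name is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Returns the $limit most recently updated orders, newest first.
 * In-memory only: the whole input array is sorted.
 *
 * @param list<array<string, mixed>> $orders
 * @return list<array<string, mixed>>
 */
function top_recently_updated(array $orders, int $limit = 20): array
{
    // DateTimeInterface objects can be compared directly with <=>.
    usort($orders, static fn (array $a, array $b): int => $b['updated_at'] <=> $a['updated_at']);

    return array_slice($orders, 0, $limit);
}
```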
How can sorting be memory efficient?
External sort is a class of sorting algorithms that can handle amounts of data too large to fit in memory
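To make the idea concrete, here is a simplified external merge sort for integers. It sorts small chunks in memory, spills each chunk to a temporary file, then merges the files with a min-heap so that only one value per chunk is held at a time. This is an illustrative sketch under those assumptions, not Flow PHP's implementation.

```php
<?php

declare(strict_types=1);

/**
 * External merge sort sketch for integers.
 * $chunkSize is the maximum number of values sorted in memory at once.
 *
 * @param iterable<int> $values
 * @return Generator<int, int>
 */
function external_sort(iterable $values, int $chunkSize): Generator
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }

    // Phase 1: sort fixed-size chunks and spill each one to a temp file.
    $files = [];
    $spill = static function (array $chunk) use (&$files): void {
        sort($chunk);
        $file = tmpfile();
        foreach ($chunk as $value) {
            fwrite($file, $value . "\n");
        }
        rewind($file);
        $files[] = $file;
    };

    $chunk = [];
    foreach ($values as $value) {
        $chunk[] = $value;
        if (count($chunk) === $chunkSize) {
            $spill($chunk);
            $chunk = [];
        }
    }
    if ($chunk !== []) {
        $spill($chunk);
    }

    // Phase 2: k-way merge. The heap holds [value, file index] pairs,
    // one per chunk file, and always exposes the smallest value.
    $heap = new SplMinHeap();
    foreach ($files as $index => $file) {
        if (($line = fgets($file)) !== false) {
            $heap->insert([(int) $line, $index]);
        }
    }

    while (!$heap->isEmpty()) {
        [$value, $index] = $heap->extract();
        yield $value;
        // Refill from the file the emitted value came from.
        if (($line = fgets($files[$index])) !== false) {
            $heap->insert([(int) $line, $index]);
        }
    }

    foreach ($files as $file) {
        fclose($file); // tmpfile() handles are deleted on close
    }
}
```

Sorting rows instead of integers follows the same two phases, with a serialized row per line and a comparison on the sort column.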
Grouping & Aggregation
Daily Orders Count
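Grouping with a count aggregation can be sketched in plain PHP by using the group key as an array key. The example assumes `created_at` is already a `DateTimeImmutable` and groups by the date in that object's own time zone; `daily_orders_count` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Counts orders per calendar day of created_at.
 * Only one counter per distinct day is kept in memory.
 *
 * @param iterable<array<string, mixed>> $orders
 * @return array<string, int> day (Y-m-d) => number of orders, sorted by day
 */
function daily_orders_count(iterable $orders): array
{
    $counts = [];
    foreach ($orders as $order) {
        $day = $order['created_at']->format('Y-m-d');
        $counts[$day] = ($counts[$day] ?? 0) + 1;
    }
    ksort($counts);

    return $counts;
}
```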
Joins
Joining orders with products
Products Dataset
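One common way to join a large dataset with a small one is a hash join: index the small side by the join key, then stream the large side and look each row up. The sketch below is an inner join and assumes the key is unique on the right side; the talk does not state which join type or algorithm is used, and the function name is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Inner hash join of a streamed left side with a small right side.
 * Assumptions: join keys are strings or integers, and each key appears
 * at most once on the right side.
 *
 * @param iterable<array<string, mixed>> $left
 * @param iterable<array<string, mixed>> $right
 * @return Generator<int, array<string, mixed>>
 */
function inner_hash_join(iterable $left, iterable $right, string $leftKey, string $rightKey): Generator
{
    // Build phase: index the smaller dataset by its join key.
    $index = [];
    foreach ($right as $row) {
        $index[$row[$rightKey]] = $row;
    }

    // Probe phase: stream the larger dataset and emit only matching rows.
    foreach ($left as $row) {
        $match = $index[$row[$leftKey]] ?? null;
        if ($match !== null) {
            // On a column name collision the left value wins.
            yield $row + $match;
        }
    }
}
```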
Unpack Order Line Items
Order Items - List
First we need to turn our orders dataset into an order line items dataset
Data Flattening
Turning nested structures into flat rows
Unpack Order Line Items
Expand
An order row with 2 items will be turned into two rows with the same order id
Unpack
Unpack will turn each structure element into a column
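The two flattening steps can be sketched in plain PHP as follows. `expand_column` and `unpack_column` are hypothetical helpers, and the `column.field` naming of the new columns as well as dropping the original column are assumptions of this example rather than documented Flow PHP behavior.

```php
<?php

declare(strict_types=1);

/**
 * Expand: one output row per element of the list stored in $column.
 * Rows with an empty list produce no output rows in this sketch.
 */
function expand_column(iterable $rows, string $column): Generator
{
    foreach ($rows as $row) {
        foreach ($row[$column] as $element) {
            $expanded = $row;
            $expanded[$column] = $element;
            yield $expanded;
        }
    }
}

/**
 * Unpack: each field of the structure stored in $column becomes its own
 * column named "{$column}.{$field}"; the original column is removed.
 */
function unpack_column(iterable $rows, string $column): Generator
{
    foreach ($rows as $row) {
        $structure = $row[$column];
        unset($row[$column]);
        foreach ($structure as $field => $value) {
            $row["{$column}.{$field}"] = $value;
        }
        yield $row;
    }
}
```

Chaining them as `unpack_column(expand_column($orders, 'items'), 'items')` turns an orders dataset into a flat order line items dataset.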
Full Pipeline
Output
What can we do next?
Calculate daily revenue
Calculate daily profit
Find top selling products
...
Data transformations are applied in steps by adding/replacing columns
Writing data to different sources
Granular by design
composer require flow-php/etl-adapter-parquet
composer require flow-php/etl-adapter-json
composer require flow-php/etl-adapter-doctrine
Write to
Dataset Schema
By defining column types we are defining the dataset schema
Schema Definition
A schema can be used either to validate a dataset or to improve extraction performance
Providing schema to extractor
In this case type casting is unnecessary
Using a schema to validate rows before loading them to the destination
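A schema check before loading can be sketched in plain PHP as a function that compares each row against a list of column definitions. The schema shape used here (`type` plus `nullable` per column) and the function name `validate_row` are assumptions made for this example; they are not Flow PHP's schema API.

```php
<?php

declare(strict_types=1);

/**
 * Validates one row against a schema and returns a list of problems (PHP 8+).
 * Hypothetical schema shape: column => ['type' => string, 'nullable' => bool],
 * where type is a get_debug_type() name ("int", "float", "string", "bool",
 * "array") or a class/interface name.
 *
 * @param array<string, mixed> $row
 * @param array<string, array{type: string, nullable: bool}> $schema
 * @return list<string>
 */
function validate_row(array $row, array $schema): array
{
    $errors = [];
    foreach ($schema as $column => $definition) {
        if (!array_key_exists($column, $row)) {
            $errors[] = "{$column}: missing column";
            continue;
        }

        $value = $row[$column];
        if ($value === null) {
            if (!$definition['nullable']) {
                $errors[] = "{$column}: null is not allowed";
            }
            continue;
        }

        $type = $definition['type'];
        // Scalars and arrays match by type name, objects by instanceof.
        if (get_debug_type($value) !== $type && !($value instanceof $type)) {
            $errors[] = "{$column}: expected {$type}, got " . get_debug_type($value);
        }
    }

    return $errors;
}
```

An empty result means the row matches the schema; a loader could skip or reject rows with a non-empty result.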
Let's try to validate the schema of our joined dataset
Will it work?
How can it be fixed?
We can make discount not nullable
While working with big datasets and complex transformations, schema validation is necessary to guarantee data quality
Why should we care about dataset schema?
Garbage In, Garbage Out
What are typical use cases of ETLs?
Use cases
Building data storages (lakehouses/warehouses/lakes)
Generating reports
Consuming APIs
System synchronization
Building projections
Converting dataset formats
Initial data analysis
Data engineering makes data analysis and data science much easier (cheaper)
What about performance?
5 million rows, a file of roughly 3 GB
Results
Was it fast?
Flow PHP
Reach out!
https://flow-php.com
https://norbert.tech
https://www.linkedin.com/in/norberttech/
https://x.com/norbert_tech
https://phpc.social/@norbert
That’s all for today! Questions?