Data Processing in PHP - PHPers 2024 Poznań
Norbert Orzechowicz
June 24, 2024
Transcript
Data Processing PHPers Summit 2024 https://flow-php.com
Problem
How to process a CSV Report?
Orders Report
• order_id – uuid
• created_at – datetime
• updated_at – datetime
• discount – float (nullable)
• address – structure{street: string, city: string, zip: string, country: string}
• notes – list<string>
• items – list<structure{sku: string, quantity: string, price: float}>
file_get_contents
fopen/fgets/feof
fopen/fgetcsv/feof
generators
generators
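The code for the generator slides is not part of this transcript. As a rough illustration of the idea, here is a minimal Python sketch (the talk itself uses PHP) of reading a CSV report lazily, one row at a time, so the whole file never has to sit in memory. The function name and the sample data are hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO


def read_rows(stream: TextIO) -> Iterator[dict[str, str]]:
    """Yield one CSV row at a time as a dict keyed by the header names."""
    for row in csv.DictReader(stream):
        yield row


# In-memory sample standing in for a large report file (hypothetical data).
sample = io.StringIO(
    "order_id,discount\n"
    "a1,10.5\n"
    "b2,\n"
)

rows = list(read_rows(sample))
```

Every value comes back as a string, including the empty discount, which is what leads to the column types question on the next slide.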
What about column types?
Manual Type Casting
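Manual type casting for the Orders Report could look roughly like the Python sketch below. It assumes that nested columns (address, notes, items) are stored as JSON strings and that an empty discount cell means null; both are assumptions, and the sample values are made up.

```python
import json
import uuid
from datetime import datetime
from typing import Any


def cast_order(raw: dict[str, str]) -> dict[str, Any]:
    """Cast every string cell of a raw CSV row to the type from the report spec."""
    return {
        "order_id": uuid.UUID(raw["order_id"]),
        "created_at": datetime.fromisoformat(raw["created_at"]),
        "updated_at": datetime.fromisoformat(raw["updated_at"]),
        # Assumption: an empty cell represents a null discount.
        "discount": float(raw["discount"]) if raw["discount"] != "" else None,
        # Assumption: nested columns are serialized as JSON.
        "address": json.loads(raw["address"]),
        "notes": json.loads(raw["notes"]),
        "items": json.loads(raw["items"]),
    }


# Hypothetical raw row, shaped like the report described above.
raw_row = {
    "order_id": "00000000-0000-0000-0000-000000000001",
    "created_at": "2024-04-23T01:35:12+00:00",
    "updated_at": "2024-04-23T01:35:12+00:00",
    "discount": "",
    "address": '{"street": "1 Main St", "city": "Poznan", "zip": "60-001", "country": "PL"}',
    "notes": '["first note"]',
    "items": '[{"sku": "SKU_0005", "quantity": "2", "price": 9.99}]',
}

order = cast_order(raw_row)
```

Every new column or format change means editing this function by hand, which is the maintainability problem the following slides point at.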
Issues & limitations?
• Flexibility
• Maintainability
• Extensibility
• Scalability*
Are there any other solutions?
ETL
• Extract
• Transform
• Load
Extraction
• Database
• File
• API
• Streams
• Queues / Topics
Transformation
• Filtering
• Merging
• Cleaning
• Grouping
• Aggregating
• Deduplicating
• Sorting
• Partitioning
Loading
• Database
• File
• Stream
• Projection
• API
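The three ETL stages listed above can be sketched as three small composable functions. This is an illustrative Python sketch with made-up data, not Flow PHP's API: the extractor reads CSV, the transformer filters and casts, and the loader writes JSON lines to a sink.

```python
import csv
import io
import json
from typing import Iterable, Iterator, TextIO


def extract(stream: TextIO) -> Iterator[dict]:
    """Extract: read rows lazily from a CSV source."""
    yield from csv.DictReader(stream)


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: drop rows without a discount and cast the rest to float."""
    for row in rows:
        if row["discount"] == "":
            continue
        yield {"order_id": row["order_id"], "discount": float(row["discount"])}


def load(rows: Iterable[dict], sink: TextIO) -> int:
    """Load: write each row as one JSON line and return the number written."""
    count = 0
    for row in rows:
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


source = io.StringIO("order_id,discount\na1,10.5\nb2,\nc3,2.38\n")
sink = io.StringIO()
written = load(transform(extract(source)), sink)
```

Because each stage is a generator, rows stream through the pipeline one at a time and any stage can be swapped without touching the others.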
Examples php/scala/python
How to process a CSV Report? With Flow PHP
generators approach
ETL processing pipeline approach
Output
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
| order_id             | created_at           | updated_at           | discount | address              | notes                | items                |
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
| 48f7b4b3-48dc-3095-8 | 2024-04-23T01:35:12+ | 2024-04-23T01:35:12+ |          | {"street":"56896 Pow | ["Sed cumque sit vol | [{"sku":"SKU_0005"," |
| b8670686-1e52-36ee-9 | 2024-04-14T09:00:12+ | 2024-04-14T09:00:12+ |          | {"street":"596 Derek | ["Fugiat saepe atque | [{"sku":"SKU_0004"," |
| dc052d5e-2b2c-3b2a-9 | 2024-03-03T08:03:02+ | 2024-03-03T08:03:02+ | 40.08    | {"street":"51760 Koe | ["Aliquid voluptatem | [{"sku":"SKU_0004"," |
| 6984b96b-6a27-367f-9 | 2024-04-03T16:18:07+ | 2024-04-03T16:18:07+ |          | {"street":"9722 Doll | ["Est sit atque quos | [{"sku":"SKU_0004"," |
| cb21141a-5494-33ea-9 | 2024-04-25T17:47:49+ | 2024-04-25T17:47:49+ | 2.38     | {"street":"11398 Abs | ["Est atque doloremq | [{"sku":"SKU_0005"," |
| c9dc07fc-fa46-3f32-9 | 2024-03-27T12:44:03+ | 2024-03-27T12:44:03+ |          | {"street":"78980 Bri | ["Sit aut laudantium | [{"sku":"SKU_0003"," |
| 9b828e2d-b509-3485-b | 2024-04-12T06:33:52+ | 2024-04-12T06:33:52+ |          | {"street":"6434 Chet | ["Ad consequuntur qu | [{"sku":"SKU_0005"," |
| 6f619e18-05aa-306b-8 | 2024-06-10T21:17:45+ | 2024-06-10T21:17:45+ |          | {"street":"8038 Crai | ["Dolorum recusandae | [{"sku":"SKU_0005"," |
| 7814b135-500f-3137-9 | 2024-05-14T07:39:00+ | 2024-05-14T07:39:00+ |          | {"street":"26190 Cor | ["Est quis necessita | [{"sku":"SKU_0005"," |
+----------------------+----------------------+----------------------+----------+----------------------+----------------------+----------------------+
10 rows
How to process a CSV Report? With Apache Spark (scala)
Output
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
|order_id                            |created_at               |updated_at               |discount|address                                     |notes                           |items                 |
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
|7833e6cb-b123-37f7-bee5-c0fea6dd6787|2024-04-26T06:01:52+00:00|2024-04-26T06:01:52+00:00|2.09    |"{""street"":""64428 Nitzsche Locks""       |""city"":""Lake Deontechester"" |""zip"":""15111""     |
|5aa4fb2b-7bc5-3d7c-a9a9-04e88f831b00|2024-01-23T22:53:49+00:00|2024-01-23T22:53:49+00:00|46.87   |"{""street"":""5751 Jamal Drive""           |""city"":""Port Delmer""        |""zip"":""57385""     |
|77a297db-b911-3800-a017-7f16502f324f|2024-02-20T00:44:03+00:00|2024-02-20T00:44:03+00:00|29.19   |"{""street"":""831 Murphy Haven""           |""city"":""West Alessandroport""|""zip"":""65846-1195""|
|ed030a18-df55-38ce-90a3-c3461032c150|2024-02-15T22:03:47+00:00|2024-02-15T22:03:47+00:00|null    |"{""street"":""8617 Lebsack Cape Suite 285""|""city"":""New Leonel""         |""zip"":""43725""     |
|d4b7921e-7729-322f-89a1-51e1c5198678|2024-04-12T04:26:42+00:00|2024-04-12T04:26:42+00:00|10.44   |"{""street"":""523 Charlene Mount Apt. 694""|""city"":""Bruenstad""          |""zip"":""40291""     |
|ff342e29-a6f8-3df3-b1d0-0adb376557fd|2024-03-24T08:49:27+00:00|2024-03-24T08:49:27+00:00|45.48   |"{""street"":""822 Carmel Common Apt. 560"" |""city"":""Abigailport""        |""zip"":""64470""     |
|9771e63f-16e6-311a-a974-a0d60a06fea4|2024-02-14T00:23:18+00:00|2024-02-14T00:23:18+00:00|null    |"{""street"":""87471 Jaylon Place""         |""city"":""Cummingsmouth""      |""zip"":""11956-0536""|
|eaf137da-c206-3252-a5f5-428b1d4eb4f1|2024-05-03T18:01:13+00:00|2024-05-03T18:01:13+00:00|19.16   |"{""street"":""4494 Kunze Tunnel Apt. 465"" |""city"":""Lake Sabinaland""    |""zip"":""60381-1971""|
|64a4ee3d-66e3-3b5e-9830-562c051e6576|2024-01-02T17:57:03+00:00|2024-01-02T17:57:03+00:00|27.79   |"{""street"":""425 Oren Manors""            |""city"":""Lake Vincent""       |""zip"":""69860""     |
|24099ba6-9131-3714-84da-d3a59ede3cd1|2024-01-27T22:47:54+00:00|2024-01-27T22:47:54+00:00|45.2    |"{""street"":""328 Daniel Inlet Apt. 768""  |""city"":""Jedediahville""      |""zip"":""77120-2693""|
+------------------------------------+-------------------------+-------------------------+--------+--------------------------------------------+--------------------------------+----------------------+
How to process a CSV Report? With Pandas (python)
Output
              order_id           created_at           updated_at  discount              address                notes                items
0  e13d7098-5a78-33...  2024-06-17T19:24...  2024-06-17T19:24...     12.45  {"street":"9742 ...  ["Doloremque cum...  [{"sku":"SKU_000...
1  947df050-3abb-3f...  2024-02-23T19:18...  2024-02-23T19:18...       NaN  {"street":"37051...  ["Neque dolor et...  [{"sku":"SKU_000...
2  6315f9e2-86bf-33...  2024-04-02T11:30...  2024-04-02T11:30...     47.10  {"street":"792 G...  ["Et porro fugia...  [{"sku":"SKU_000...
3  4cccb632-fade-34...  2024-05-06T00:17...  2024-05-06T00:17...     19.76  {"street":"30203...  ["Aliquam saepe ...  [{"sku":"SKU_000...
4  82384f8c-9adb-38...  2024-05-10T11:17...  2024-05-10T11:17...       NaN  {"street":"757 T...  ["Beatae nesciun...  [{"sku":"SKU_000...
5  e3fcf736-0f8c-3d...  2024-01-25T20:14...  2024-01-25T20:14...       NaN  {"street":"9088 ...  ["Provident quam...  [{"sku":"SKU_000...
6  b987a49a-b4c5-37...  2024-06-03T23:22...  2024-06-03T23:22...       NaN  {"street":"6867 ...  ["Quibusdam maio...  [{"sku":"SKU_000...
7  663523a9-713b-33...  2024-03-22T23:31...  2024-03-22T23:31...     25.88  {"street":"1577 ...  ["In rem maxime ...  [{"sku":"SKU_000...
8  6259fa2c-ec68-36...  2024-05-10T10:12...  2024-05-10T10:12...     21.67  {"street":"987 L...  ["Voluptatem non...  [{"sku":"SKU_000...
9  f7153c83-34b6-37...  2024-02-26T09:20...  2024-02-26T09:20...     18.93  {"street":"2039 ...  ["Culpa error re...  [{"sku":"SKU_000...
How does ETL work?
Dataset Processing Visualization
The size of a data frame defines memory consumption
Memory = size of the columns in a row * number of rows in the data frame
* simplified version
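To make the memory rule concrete: if a row takes roughly 200 bytes and a data frame holds 1,000 rows, one frame needs about 200 * 1,000 = 200,000 bytes, regardless of how many rows the whole dataset has. The Python sketch below shows the batching idea under that simplified model; the helper names and numbers are illustrative assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Split a stream of rows into data frames of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def estimated_memory(row_bytes: int, rows_per_frame: int) -> int:
    """Simplified estimate: bytes per row times rows held at once."""
    return row_bytes * rows_per_frame


# Ten rows streamed lazily, processed four at a time.
rows = ({"id": i} for i in range(10))
frame_sizes = [len(frame) for frame in batches(rows, 4)]
```

Only one batch is materialized at a time, so peak memory is driven by the batch size rather than by the dataset size.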
What is a transformation?
Don’t think in objects/functions
Think in tables, rows, columns and cells
Just like managing an Excel sheet
Dataset - Table
Data Frame – Rows
Row - Columns
Row & Column - Cell
Data Types
Usually there are two categories of data types
Native Types
• Integer - scalar
• String - scalar
• Boolean - scalar
• Float - scalar
• Object
• Resource
• Null
• Enum
• Callable
• Array
Logical Types
• DateTime
• Uuid
• Json
• List
• Map
• Structure
• XML
• XMLElement
Logical Types
Logical types are more specific implementations of native types
Different programming languages will provide different logical/native types
Logical Types
• DateTime (object)
• Uuid (object)
• Json (array)
• List (array)
• Map (array)
• Structure (array)
• XML (object)
• XMLElement (object)
Logical Types: List
A list is a collection of elements where each element is indexed by its position in the list
Logical Types: Map
A map (also known as a dictionary or associative array) stores key-value pairs, where each key is unique and associated with a single value.
* In PHP, the main purpose of a map is to guarantee the types of keys and values, since a regular array does not enforce them.
Logical Types: Structure
A complex data type grouping multiple fields (native and logical types)
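The difference between list, map and structure comes down to what each one guarantees about its contents. A minimal Python sketch of those guarantees follows; the helper names are hypothetical, and a Python dict stands in for the PHP array.

```python
from typing import Any


def is_list_of(value: Any, element_type: type) -> bool:
    """List: every element has the same declared type."""
    return isinstance(value, list) and all(
        isinstance(element, element_type) for element in value
    )


def is_map_of(value: Any, key_type: type, value_type: type) -> bool:
    """Map: every key and every value has the declared type."""
    return isinstance(value, dict) and all(
        isinstance(k, key_type) and isinstance(v, value_type)
        for k, v in value.items()
    )


def is_structure(value: Any, fields: dict[str, type]) -> bool:
    """Structure: a fixed set of named fields, each with its own type."""
    return (
        isinstance(value, dict)
        and value.keys() == fields.keys()
        and all(isinstance(value[name], t) for name, t in fields.items())
    )


address_fields = {"street": str, "city": str, "zip": str, "country": str}
address = {"street": "1 Main St", "city": "Poznan", "zip": "60-001", "country": "PL"}
```

A map allows any number of keys of one type, while a structure fixes both the field names and the type of each field.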
Nullability
All types can be nullable
But not all programming languages handle nulls the same way
FlowPHP Types
Logical Types
Native Types
What is a transformation?
Transformation is the process of converting, cleansing and structuring data into a usable format.
Example of transforming a string into a DateTime object
Data transformations usually happen at the level of a single cell
Transform something!
Type Casting
Before
Cast Types
ref - reference
After
Filtering
Only discounted orders
Conditional Transformations
Replacing nulls with zeros
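The slides' code for these two steps is not in the transcript. A plain-Python sketch of the same ideas, assuming "discounted" means a non-null discount, might look like this (function names and data are made up):

```python
from typing import Iterable, Iterator


def only_discounted(rows: Iterable[dict]) -> Iterator[dict]:
    """Filtering: keep rows whose discount is not null."""
    return (row for row in rows if row["discount"] is not None)


def nulls_to_zero(rows: Iterable[dict]) -> Iterator[dict]:
    """Conditional transformation: replace a null discount with 0.0."""
    for row in rows:
        discount = row["discount"] if row["discount"] is not None else 0.0
        # Build a new row instead of mutating the input.
        yield {**row, "discount": discount}


orders = [
    {"order_id": "a", "discount": 40.08},
    {"order_id": "b", "discount": None},
    {"order_id": "c", "discount": 2.38},
]

discounted = list(only_discounted(orders))
filled = list(nulls_to_zero(orders))
```

Filtering changes the number of rows, while the conditional transformation keeps every row and only rewrites one cell.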
Sorting
Top 20 recently updated orders
How can sorting be memory efficient?
External sorting is a class of sorting algorithms that can handle large amounts of data
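One common form of external sorting is an external merge sort: sort small chunks in memory, spill each sorted chunk ("run") to disk, then merge the runs lazily. The Python sketch below illustrates that technique with temporary files and `heapq.merge`. It is not necessarily how Flow PHP implements sorting; the function names, JSON-lines run format and sample data are assumptions.

```python
import heapq
import json
import tempfile
from itertools import islice
from typing import IO, Iterable, Iterator


def _write_run(sorted_rows: list[dict]) -> IO[str]:
    """Spill one sorted chunk to a temporary file, one JSON row per line."""
    run = tempfile.TemporaryFile(mode="w+")
    for row in sorted_rows:
        run.write(json.dumps(row) + "\n")
    run.seek(0)
    return run


def _read_run(run: IO[str]) -> Iterator[dict]:
    """Stream rows back from a run without loading the whole file."""
    for line in run:
        yield json.loads(line)


def external_sort(
    rows: Iterable[dict], key: str, run_size: int, descending: bool = False
) -> Iterator[dict]:
    """Sort rows by `key` while holding at most `run_size` rows in memory."""
    iterator = iter(rows)
    runs = []
    while True:
        chunk = list(islice(iterator, run_size))
        if not chunk:
            break
        chunk.sort(key=lambda row: row[key], reverse=descending)
        runs.append(_write_run(chunk))
    try:
        # Merge the sorted runs lazily; only one row per run is in memory.
        yield from heapq.merge(
            *(_read_run(run) for run in runs),
            key=lambda row: row[key],
            reverse=descending,
        )
    finally:
        for run in runs:
            run.close()


data = [
    {"updated_at": "2024-03-03"},
    {"updated_at": "2024-04-23"},
    {"updated_at": "2024-04-14"},
    {"updated_at": "2024-06-10"},
    {"updated_at": "2024-05-14"},
]

# "Top N recently updated": sort descending and take the first N rows.
top_three = list(islice(external_sort(data, "updated_at", 2, descending=True), 3))
```

Memory use depends on the run size and the number of runs, not on the total number of rows.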
Grouping & Aggregation
Daily Orders Count
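A daily orders count groups rows by the calendar day of `created_at` and counts the rows in each group. A small Python sketch of that aggregation, using hypothetical sample rows:

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def daily_order_counts(rows: Iterable[dict]) -> dict[str, int]:
    """Group orders by the date part of created_at and count each group."""
    counts: Counter = Counter()
    for row in rows:
        day = datetime.fromisoformat(row["created_at"]).date().isoformat()
        counts[day] += 1
    # Return the groups ordered by day for stable output.
    return dict(sorted(counts.items()))


orders = [
    {"order_id": "a", "created_at": "2024-04-23T01:35:12+00:00"},
    {"order_id": "b", "created_at": "2024-04-14T09:00:12+00:00"},
    {"order_id": "c", "created_at": "2024-04-23T17:47:49+00:00"},
]

per_day = daily_order_counts(orders)
```

Only one counter per distinct day is kept in memory, so the aggregation works on a stream of any length.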
Joins
Joining orders with products
Products Dataset
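A join matches rows from two datasets on a shared key. The sketch below shows an inner hash join in Python, assuming the order items are already flat rows carrying a `sku` column and that each `sku` appears once in the products dataset. The product columns (`name`) and all data are hypothetical.

```python
from typing import Iterable, Iterator


def hash_join(left: Iterable[dict], right: Iterable[dict], key: str) -> Iterator[dict]:
    """Inner join: index the right side by key, then stream the left side."""
    index = {row[key]: row for row in right}
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Merge columns from both sides into one row.
            yield {**row, **match}


order_items = [
    {"order_id": "a", "sku": "SKU_0004", "quantity": "2"},
    {"order_id": "a", "sku": "SKU_0005", "quantity": "1"},
    {"order_id": "b", "sku": "SKU_9999", "quantity": "3"},
]
products = [
    {"sku": "SKU_0004", "name": "Keyboard"},
    {"sku": "SKU_0005", "name": "Mouse"},
]

joined = list(hash_join(order_items, products, "sku"))
```

Only the smaller dataset (products here) is held in memory; the larger one is streamed row by row.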
Unpack Order Line Items
Order Items - List
First we need to turn our orders dataset into an order line items dataset
Data Flattening
Turning nested structures into flat rows
Unpack Order Line Items
Expand
An order row with 2 items will be turned into two rows with the same order id
Unpack
Unpack will turn each structure element into a column
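The two flattening steps can be sketched in Python as follows. Expand emits one row per list element, and unpack promotes each field of a structure to its own column. The `items.sku` style column naming is an assumption made for this sketch, and the sample order is made up.

```python
from typing import Iterable, Iterator


def expand(rows: Iterable[dict], column: str) -> Iterator[dict]:
    """Emit one row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            yield {**row, column: element}


def unpack(rows: Iterable[dict], column: str) -> Iterator[dict]:
    """Replace the structure in `column` with one column per field."""
    for row in rows:
        rest = {name: value for name, value in row.items() if name != column}
        fields = {f"{column}.{name}": value for name, value in row[column].items()}
        yield {**rest, **fields}


orders = [
    {
        "order_id": "a",
        "items": [
            {"sku": "SKU_0004", "quantity": "2", "price": 10.0},
            {"sku": "SKU_0005", "quantity": "1", "price": 5.5},
        ],
    }
]

line_items = list(unpack(expand(orders, "items"), "items"))
```

After both steps every order line item is a flat row that can be filtered, joined or aggregated like any other column data.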
Full Pipeline
Output
What can we do next?
Calculate daily revenue
Calculate daily profit
Find top selling products
...
Data transformations are applied in steps by adding/replacing columns
Writing data to different sources
Require packages
composer require flow-php/etl-adapter-parquet
composer require flow-php/etl-adapter-json
composer require flow-php/etl-adapter-doctrine
Write to
Dataset Schema
By defining column types we are defining the dataset schema
Schema Definition
A schema can be used either to validate a dataset or to improve extraction performance
Providing schema to extractor
In this case type casting is not needed
Using schema to validate rows before loading them to destination
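Validating rows against a schema before loading can be sketched as below. This is an illustrative Python model, not Flow PHP's schema API: the `Column` class, the `violations` helper and the message wording are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One schema entry: column name, expected type and nullability."""
    name: str
    type: type
    nullable: bool = False


def violations(row: dict, schema: list[Column]) -> list[str]:
    """Return a human-readable list of schema violations for one row."""
    problems = []
    for column in schema:
        if column.name not in row:
            problems.append(f"{column.name}: missing")
            continue
        value = row[column.name]
        if value is None:
            if not column.nullable:
                problems.append(f"{column.name}: null not allowed")
        elif not isinstance(value, column.type):
            problems.append(
                f"{column.name}: expected {column.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


nullable_schema = [Column("order_id", str), Column("discount", float, nullable=True)]
strict_schema = [Column("order_id", str), Column("discount", float)]

row = {"order_id": "a", "discount": None}
ok = violations(row, nullable_schema)
rejected = violations(row, strict_schema)
```

The same row passes or fails depending only on the nullability declared in the schema, which is why nullable columns need attention when a schema is validated.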
Let's try to validate the schema of our joined dataset
Will it work?
How can it be fixed?
We can make discount not nullable
When working with big datasets and complex transformations, schema validation is necessary to guarantee data quality
Why should we care about dataset schema?
Garbage In, Garbage Out
What are typical use cases of ETL?
Use cases
Building data storages (lakehouses/warehouses/lakes)
Generating reports
Consuming APIs
System synchronization
Building projections
Converting dataset formats
Initial data analysis
Data engineering makes data analysis and data science much easier (cheaper)
What about performance?
5 mln rows
Results
Is it fast?
I leave the decision up to you
Norbert Orzechowicz
GitHub: https://github.com/norberttech
LinkedIn: https://www.linkedin.com/in/norberttech
X (Twitter): https://x.com/norbert_tech
Email: [email protected]
Discord: https://discord.gg/bUeTc8f9GD Flow PHP
That’s all for today!
Questions?