A performant application, even with Doctrine!

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Good evening! I am Thomas Calvet. I am a Symfony enthusiast. I work at ekino. You can find me on GitHub and on the Symfony Devs Slack as fancyweb, and on Twitter as @fancyweb_.

Slide 3

Slide 3 text

ORM internals
Simply and quickly

Slide 4

Slide 4 text

Object Relational Mapping (ORM)
▪ A technique to convert data between incompatible type systems using object-oriented programming languages
▪ Doctrine ORM is an Object Relational Mapper
▪ Its goal is to simplify the translation between database rows and the PHP object model

Slide 5

Slide 5 text

UnitOfWork (UoW)
▪ A single class of 3k+ lines of code
▪ Knows all the managed entities
▪ Responsible for tracking changes to those entities
▪ Responsible for writing out those changes in the correct order
The private heart of the ORM

Slide 6

Slide 6 text

EntityManager (EM)
▪ A facade to the UoW
▪ And a facade to all the other ORM subsystems:
  ▫ Metadata
  ▫ Repositories
  ▫ Queries, etc.
The central public access point of the ORM

Slide 7

Slide 7 text

The identity map
▪ A big associative array in the UoW
▪ Class property ($identityMap)
▪ Stores a reference to every managed entity
▪ Ensures that a managed entity is loaded only once, into a single in-memory object
The unique source of truth

Slide 8

Slide 8 text

In this example, only one SQL query is executed since the entity with the id 1 is already in the identity map on the second “find” call.
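
The slide's code is not part of this transcript. A minimal sketch of the scenario it describes could look like this, assuming a hypothetical App\Entity\Article entity and a hypothetical loadTwice() helper:

```php
<?php

use App\Entity\Article; // hypothetical entity, not from the talk
use Doctrine\ORM\EntityManagerInterface;

/**
 * Looks up the same identifier twice and reports whether both calls
 * returned the very same object.
 */
function loadTwice(EntityManagerInterface $em): bool
{
    // First call: the entity is not managed yet, so a SELECT is executed
    // and the hydrated object is registered in the identity map.
    $first = $em->find(Article::class, 1);

    // Second call: the identifier is found in the identity map,
    // so no SQL query is needed.
    $second = $em->find(Article::class, 1);

    return $first === $second;
}
```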

Slide 9

Slide 9 text

In this example, two SQL queries are executed but, thanks to the identity map, the repository returns the same entity instance.
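
Again, the slide's code is missing here. One way to reproduce the situation, assuming the same hypothetical App\Entity\Article entity, is to load an entity with "find" and then fetch it again through a DQL query:

```php
<?php

use App\Entity\Article; // hypothetical entity, not from the talk
use Doctrine\ORM\EntityManagerInterface;

/**
 * Loads an article by id, then fetches it again through DQL, and reports
 * whether both results are the same object.
 */
function findThenQuery(EntityManagerInterface $em): bool
{
    // Query 1: SELECT by primary key, the entity becomes managed.
    $found = $em->find(Article::class, 1);

    // Query 2: a DQL query is sent to the database even though the entity
    // is already managed. During hydration the UoW recognises the identifier
    // and hands back the instance it already knows.
    $queried = $em
        ->createQuery('SELECT a FROM App\Entity\Article a WHERE a.id = :id')
        ->setParameter('id', 1)
        ->getOneOrNullResult();

    return $found === $queried;
}
```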

Slide 10

Slide 10 text

The states
▪ A big associative array in the UoW
▪ Cached in a class property ($entityStates)
▪ Every entity has a state
▪ There are four possible states
Crucial for optimisation

Slide 11

Slide 11 text

An entity is “managed” when it is known by the identity map. It is added to this map in several cases: for example, when it is hydrated through the ORM or after being inserted into the database.
1 = UnitOfWork::STATE_MANAGED

Slide 12

Slide 12 text

An entity is “new” when it is not known by the identity map and when it has no identifier or an unknown database identifier.
2 = UnitOfWork::STATE_NEW

Slide 13

Slide 13 text

An entity is “detached” when it was removed from the identity map, or when it is not known by the identity map but has a known database identifier. “Detached” is also the default assumed state.
3 = UnitOfWork::STATE_DETACHED

Slide 14

Slide 14 text

An entity is “removed” when it has been scheduled for deletion from the database on the next UoW commit.
4 = UnitOfWork::STATE_REMOVED
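
To see which of the four states an entity is in, the UoW can be asked directly. The sketch below uses UnitOfWork::getEntityState() and the four constants listed above; the describeState() helper and its labels are only illustrative:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;
use Doctrine\ORM\UnitOfWork;

/**
 * Returns a readable label for the state the UoW assigns to an entity.
 */
function describeState(EntityManagerInterface $em, object $entity): string
{
    $state = $em->getUnitOfWork()->getEntityState($entity);

    switch ($state) {
        case UnitOfWork::STATE_MANAGED:
            return 'managed';
        case UnitOfWork::STATE_NEW:
            return 'new';
        case UnitOfWork::STATE_DETACHED:
            return 'detached';
        case UnitOfWork::STATE_REMOVED:
            return 'removed';
        default:
            throw new \LogicException('Unexpected entity state: ' . $state);
    }
}
```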

Slide 15

Slide 15 text

The changesets
▪ A big associative array in the UoW
▪ Class property ($entityChangeSets)
▪ The differences between the last synchronized (from the database) states and the current states of the entities
▪ Computed at the beginning of the commit and cleared afterwards
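
To make the idea concrete, here is a deliberately simplified plain-PHP model of a changeset computation. It is not Doctrine's implementation (the real UoW also deals with associations, collections and type comparison), and computeChangeSet() is a hypothetical name. The result uses the "field => [old value, new value]" shape that UnitOfWork::getEntityChangeSet() returns:

```php
<?php

/**
 * Compares the snapshot taken when the entity was last synchronized with
 * its current values and keeps only the fields that differ.
 *
 * @param array<string, mixed> $original values as last loaded from the database
 * @param array<string, mixed> $current  values read from the entity at commit time
 *
 * @return array<string, array{0: mixed, 1: mixed}> field => [old, new]
 */
function computeChangeSet(array $original, array $current): array
{
    $changeSet = [];

    foreach ($current as $field => $newValue) {
        $oldValue = $original[$field] ?? null;

        // Strict comparison: a field is reported only if its value differs.
        if ($oldValue !== $newValue) {
            $changeSet[$field] = [$oldValue, $newValue];
        }
    }

    return $changeSet;
}
```

Every managed entity goes through a comparison of this kind on commit, which is why the number of managed entities matters so much.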

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

The pending changes
▪ Big associative arrays in the UoW
▪ Class properties ($entityInsertions, $collectionUpdates, etc.)
▪ All pending information about what to insert, update or delete (entities and collections)
▪ Computed at the beginning of the commit and cleared afterwards

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

To summarize
With a diagram

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

16 big associative arrays in the UoW!
The UoW contains highly performance-sensitive code

Slide 24

Slide 24 text

“Always keep the identity map and internals in mind. It will greatly improve the performance of your code!”

Slide 25

Slide 25 text

Flush carefully
Flush once

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Transactional write-behind
▪ A strategy used by the UoW
▪ Delays the execution of SQL queries
▪ The goal is to execute them in the most efficient way
▪ They are executed in the shortest possible transaction so that all write locks are released quickly

Slide 28

Slide 28 text

Flushing is a heavy operation
▪ It computes all the changesets
▪ It dispatches all lifecycle events
▪ It generates all needed SQL queries
▪ It executes them in a transaction
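
Since each flush repeats all of these steps, the usual advice is to persist inside the loop and flush once after it. A minimal sketch, assuming a hypothetical importArticles() helper that receives already-built entities:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Persists every entity, then commits all of them with a single flush.
 *
 * @param iterable<object> $articles new entities to insert
 */
function importArticles(EntityManagerInterface $em, iterable $articles): void
{
    foreach ($articles as $article) {
        // persist() only schedules the insertion, no SQL is executed here.
        $em->persist($article);
    }

    // One changeset computation and one transaction for the whole set.
    $em->flush();
}
```

Calling flush() inside the loop instead would redo the changeset computation and open a transaction on every iteration.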

Slide 29

Slide 29 text

Don’t

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Batch your flushes
When it is necessary

Slide 32

Slide 32 text

One heavy flush can be catastrophic
▪ The more changes there are on managed entities, the longer they take to be processed internally
▪ Sometimes, splitting a heavy task into x smaller ones is more efficient
▪ Overly large database transactions are bad for concurrency because they keep the tables locked for too long
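
A common way to split one heavy flush into smaller ones is to flush and clear every N entities. The sketch below follows that pattern; importInBatches() and the default batch size of 100 are assumptions to tune for your own workload:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Persists entities and commits them in batches of $batchSize.
 *
 * @param iterable<object> $articles new entities to insert
 */
function importInBatches(EntityManagerInterface $em, iterable $articles, int $batchSize = 100): void
{
    $pending = 0;

    foreach ($articles as $article) {
        $em->persist($article);

        if (++$pending % $batchSize === 0) {
            $em->flush(); // commit the current batch in its own transaction
            $em->clear(); // drop the managed entities to free memory
        }
    }

    // Commit whatever is left in the last, possibly incomplete, batch.
    $em->flush();
    $em->clear();
}
```

Keep in mind that clear() detaches everything: any entity you still need after a batch has to be fetched again.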

Slide 33

Slide 33 text

Don’t

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Use SQL
Get back to the basics

Slide 36

Slide 36 text
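
The code shown on this slide is not in the transcript. One possible reading of "use SQL" is to run a bulk statement through the DBAL connection instead of loading and flushing entities. In this sketch the table and column names are hypothetical, and Connection::executeStatement() assumes DBAL 2.11 or later (older versions expose executeUpdate()):

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Archives old articles with one UPDATE statement instead of hydrating,
 * modifying and flushing each entity. Returns the number of affected rows.
 */
function archiveOldArticles(EntityManagerInterface $em, string $before): int
{
    // "article", "archived" and "published_at" are hypothetical names.
    return (int) $em->getConnection()->executeStatement(
        'UPDATE article SET archived = 1 WHERE published_at < :before',
        ['before' => $before]
    );
}
```

Such a statement bypasses the UoW: no lifecycle events are dispatched and entities already in memory are not refreshed.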

Slide 37

Slide 37 text

Clear your Entity Manager
To start over

Slide 38

Slide 38 text

Reduce the memory usage
▪ The UoW stores a large amount of information in all its class properties
▪ Some are cleared after each commit
▪ Some are never cleared automatically
▪ That leftover data can end up using a lot of memory

Slide 39

Slide 39 text

Slide 40

Slide 40 text

What happens when the UoW is cleared?
▪ The UoW is reset to its initial state
▪ All its “stack” class properties are set to an empty array, thus freeing a lot of memory
▪ Consequently, all managed entities become “detached”

Slide 41

Slide 41 text

⚠ Legacy clear, detach and merge
They all need to die

Slide 42

Slide 42 text

⚠ Clearing a single entity class is not recommended because it breaks in many scenarios. It won’t be available anymore in Doctrine ORM 3.
Don’t

Slide 43

Slide 43 text

⚠ Detaching and merging entities is not recommended. It won’t be available anymore in Doctrine ORM 3.
Don’t

Slide 44

Slide 44 text

Know the tracking policies
To choose the right one

Slide 45

Slide 45 text

Change tracking
▪ At some point, the ORM determines what changed on the entities it manages thanks to a tracking policy
▪ Each tracking policy has advantages and disadvantages
▪ Each tracking policy has a different impact on the overall performance
▪ There are three different tracking policies

Slide 46

Slide 46 text

The Deferred Implicit tracking policy
▪ The ORM checks all managed entities for changes
▪ It checks all property values one by one
▪ It checks for new entities that are referenced by other managed entities
▪ It obviously takes longer as the UoW grows
Automatic, but the worst performance

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

The Deferred Explicit tracking policy
▪ Same behavior as the Deferred Implicit tracking policy
▪ Except that the ORM only checks entities that have been explicitly marked through a “persist” call
▪ Better for a large UoW
Some manual work, but much better performance
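
In ORM 2.x annotation mapping, the policy is chosen per entity class. The Article entity and the renameArticle() helper below are hypothetical; the point of the sketch is that a modified entity must be passed to "persist" again, otherwise the flush ignores it:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\ChangeTrackingPolicy("DEFERRED_EXPLICIT")
 */
class Article
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /** @ORM\Column(type="string") */
    private $title = '';

    public function getTitle(): string
    {
        return $this->title;
    }

    public function setTitle(string $title): void
    {
        $this->title = $title;
    }
}

function renameArticle(EntityManagerInterface $em, Article $article, string $title): void
{
    $article->setTitle($title);

    // With DEFERRED_EXPLICIT, only entities marked through persist()
    // are checked for changes on flush, even if they are already managed.
    $em->persist($article);
    $em->flush();
}
```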

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

The Notify tracking policy
▪ The entities notify interested listeners of every change to their properties
▪ You have full control over when you consider a property changed or not
▪ The best for a very large UoW
A lot of manual work, but the best performance

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Load as few entities as possible
The out of memory (OOM) problem

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Every loaded entity has an impact
▪ It costs both time and memory because it increases the overall number of managed entities
▪ Hydrating an entity is heavy
▪ Think about the total number of entities your method call could potentially load
▪ Don’t use the “findAll()” method
▪ Avoid filterless queries

Slide 56

Slide 56 text

Don’t
There could be 50 000 comments one day.

Slide 57

Slide 57 text

Use the “iterate” feature
▪ Avoids hydrating all the entities returned by the query into memory at once
▪ Hydrates the entities one by one instead
▪ Limited to queries that don’t need to fetch-join a collection-valued association
▪ ⚠ You still need to clear the EM regularly to free the memory
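
A sketch of the pattern with the ORM 2.x Query::iterate() method, where each row is an array holding the entity at index 0 (from ORM 2.8, toIterable() plays the same role and yields the entities directly). The exportArticles() helper, the $export callback and the App\Entity\Article entity are hypothetical:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Streams all articles through $export without holding them all in memory.
 * Returns the number of processed entities.
 */
function exportArticles(EntityManagerInterface $em, callable $export, int $batchSize = 100): int
{
    $query = $em->createQuery('SELECT a FROM App\Entity\Article a');
    $count = 0;

    foreach ($query->iterate() as $row) {
        // With iterate(), every row is an array and the entity sits at index 0.
        $export($row[0]);

        // Hydrated entities stay managed until the EM is cleared,
        // so free them regularly.
        if (++$count % $batchSize === 0) {
            $em->clear();
        }
    }

    $em->clear();

    return $count;
}
```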

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Use SQL
Yes, again

Slide 60

Slide 60 text

Slide 61

Slide 61 text

“Pre-select” the associations
The N+1 problem

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Watch the query count
▪ Entity associations are lazy loaded
▪ An SQL query is executed on demand, when you access the uninitialized collection for the first time
▪ When you iterate over an array of entities, the ORM needs to execute an SQL query for every collection you access
▪ Therefore, the number of queries multiplies with each level of nested iteration
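
Loading N articles and then reading each article's comments costs 1 + N queries. A fetch join "pre-selects" the association so that everything arrives in one query. This is a sketch that assumes hypothetical App\Entity\Article entities with a "comments" association:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Loads the articles together with their comments in a single SQL query.
 *
 * @return object[] articles whose "comments" collections are already initialized
 */
function loadArticlesWithComments(EntityManagerInterface $em): array
{
    // Selecting both aliases ("a, c") makes this a fetch join: the joined
    // comments are hydrated into each article's collection.
    return $em
        ->createQuery('SELECT a, c FROM App\Entity\Article a LEFT JOIN a.comments c')
        ->getResult();
}
```

Iterating over the comments of these articles afterwards triggers no additional query.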

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

Slide 67

Slide 67 text

⚠ Big inverse side associations

Slide 68

Slide 68 text

Don’t

Slide 69

Slide 69 text

Slide 70

Slide 70 text

Don’t set up big inverse side associations
▪ In general, avoid non-essential associations
▪ Each association implies more work for the ORM, i.e. consumed time and memory
▪ Not having “big” inverse side associations prevents you from using their getter
▪ Consequently, it prevents you from loading a large number of entities into memory at once and from running into the N+1 problem
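
Without the inverse side there is no "getComments()" to call, so the related entities are fetched with an explicit, bounded query from the owning side. A sketch under assumptions: App\Entity\Comment, its "article" association and its "createdAt" field are hypothetical names:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Fetches at most $limit comments of an article through the owning side
 * of the association, instead of loading a whole inverse side collection.
 *
 * @return object[]
 */
function latestComments(EntityManagerInterface $em, int $articleId, int $limit = 20): array
{
    return $em
        ->createQuery(
            'SELECT c FROM App\Entity\Comment c WHERE c.article = :article ORDER BY c.createdAt DESC'
        )
        ->setParameter('article', $articleId)
        ->setMaxResults($limit) // the number of hydrated entities stays bounded
        ->getResult();
}
```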

Slide 71

Slide 71 text

Use caches
Do not process the same things again and again

Slide 72

Slide 72 text

Metadata cache
▪ Removes the overhead of parsing the mapping files
▪ The mapping files don’t need to be parsed on each request because the same mapping files always produce the same class metadata
▪ Your application should never be in production without a configured metadata cache

Slide 73

Slide 73 text

Query cache
▪ Removes the DQL to SQL conversion overhead
▪ The same DQL query will always be converted to the same SQL (for the same platform)
▪ Your application should never be in production without a configured query cache

Slide 74

Slide 74 text

Result cache
▪ Caches the results of a query
▪ Avoids querying the database again and rehydrating the resulting entities
▪ You can specify the time to live (TTL) per query
▪ Use it particularly on slow queries, even with a short TTL
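
A sketch of a per-query TTL, assuming a result cache is configured on the ORM and ORM 2.7 or later for enableResultCache() (earlier 2.x versions use useResultCache()). The entity, its "published" field and the 60-second TTL are hypothetical:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;

/**
 * Counts published articles and keeps the result in the result cache
 * for 60 seconds.
 */
function countPublishedArticles(EntityManagerInterface $em): int
{
    return (int) $em
        ->createQuery('SELECT COUNT(a.id) FROM App\Entity\Article a WHERE a.published = true')
        ->enableResultCache(60) // TTL in seconds, for this query only
        ->getSingleScalarResult();
}
```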

Slide 75

Slide 75 text

Slide 76

Slide 76 text

Second level cache
▪ Reduces the number of necessary database accesses
▪ A cache between the identity map and the database
▪ Three caching modes:
  ▫ READ_ONLY
  ▫ NONSTRICT_READ_WRITE
  ▫ READ_WRITE
▪ Experimental and very complex
▪ Limited to a single application and to single (non-composite) primary keys

Slide 77

Slide 77 text

Prefer entity listeners
Over event listeners and event subscribers

Slide 78

Slide 78 text

When it concerns one kind of entity only. Don’t

Slide 79

Slide 79 text

When it concerns one kind of entity only. Don’t

Slide 80

Slide 80 text

Prefer entity listeners
▪ In general, prefer listeners
▪ Entity listeners are better for performance because they are called only for entities of the class they were configured for, so they avoid executing useless code paths
▪ Moreover, they can be lazy, i.e. instantiated only when they are actually used
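
A minimal sketch of an entity listener in ORM 2.x annotation mapping. The Article entity and the ArticleListener class are hypothetical. Doctrine also passes the event arguments as a second parameter, which this listener simply does not declare:

```php
<?php

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\EntityListeners({"ArticleListener"})
 */
class Article
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /** @ORM\Column(type="datetime_immutable") */
    public ?\DateTimeImmutable $createdAt = null;
}

class ArticleListener
{
    /**
     * Called for Article entities only, right before they are inserted.
     *
     * @ORM\PrePersist
     */
    public function prePersist(Article $article): void
    {
        if ($article->createdAt === null) {
            $article->createdAt = new \DateTimeImmutable();
        }
    }
}
```

An event listener or subscriber doing the same job would be invoked for every persisted entity and would need an "instanceof" check first.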

Slide 81

Slide 81 text

Slide 82

Slide 82 text

Slide 83

Slide 83 text

⚠ Composite primary keys

Slide 84

Slide 84 text

Don’t

Slide 85

Slide 85 text

Avoid composite primary keys
▪ They require additional internal work, i.e. they imply time and memory overheads
▪ They have a much higher probability of errors because almost nobody uses them
▪ Use single primary keys
▪ Use UUIDs

Slide 86

Slide 86 text

⚠ Table inheritance

Slide 87

Slide 87 text

Don’t

Slide 88

Slide 88 text

Don’t use table inheritance
▪ Single Table Inheritance (STI) implies having weak constraints on the table, for example many nullable columns
▪ Class Table Inheritance (CTI) implies multiple join operations for any query
▪ Representing OOP inheritance in a database is never a good idea

Slide 89

Slide 89 text

Use MappedSuperclass
▪ It provides reusable mapping information and reusable concrete code logic but is not an entity itself
▪ Think of it as a trait for your entities but with a bonus: the mapped superclass is part of the child classes’ hierarchy
▪ Each “child” entity is independent: it has its own table and is easily refactorable
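
A sketch in ORM 2.x annotation mapping, with hypothetical class names. The mapped superclass carries a shared column and its logic, and each child entity gets its own table containing that column:

```php
<?php

use Doctrine\ORM\Mapping as ORM;

/** @ORM\MappedSuperclass */
abstract class TimestampedEntity
{
    /** @ORM\Column(type="datetime_immutable") */
    protected $createdAt;

    public function __construct()
    {
        $this->createdAt = new \DateTimeImmutable();
    }

    public function getCreatedAt(): \DateTimeImmutable
    {
        return $this->createdAt;
    }
}

/** @ORM\Entity */
class Article extends TimestampedEntity
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;
}

/** @ORM\Entity */
class Comment extends TimestampedEntity
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;
}
```

Unlike a trait, the superclass can be used in type declarations and "instanceof" checks.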

Slide 90

Slide 90 text

Slide 91

Slide 91 text

⚠ Cascade removing

Slide 92

Slide 92 text

Don’t

Slide 93

Slide 93 text

Don’t use cascade removing
▪ Cascade removing is done at runtime, is synchronous, and involves a full in-memory load of the entities to delete
▪ On big associations, it consumes a lot of time and memory, and it often leads to OOM errors
▪ Use SQL “ON DELETE CASCADE” clauses
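
In ORM 2.x annotation mapping, the "ON DELETE CASCADE" clause is declared on the join column of the owning side, so the database removes the child rows by itself. The entities and the deleteArticle() helper below are hypothetical:

```php
<?php

use Doctrine\ORM\EntityManagerInterface;
use Doctrine\ORM\Mapping as ORM;

/** @ORM\Entity */
class Article
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;
}

/** @ORM\Entity */
class Comment
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    // No cascade={"remove"} on the ORM side: the foreign key constraint
    // generated for this column carries ON DELETE CASCADE.
    /**
     * @ORM\ManyToOne(targetEntity="Article")
     * @ORM\JoinColumn(nullable=false, onDelete="CASCADE")
     */
    private $article;
}

/**
 * Deletes one article with a DQL DELETE. Its comments are removed by the
 * database, without being loaded into memory. Returns the affected row count.
 */
function deleteArticle(EntityManagerInterface $em, int $articleId): int
{
    return (int) $em
        ->createQuery('DELETE FROM Article a WHERE a.id = :id')
        ->setParameter('id', $articleId)
        ->execute();
}
```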

Slide 94

Slide 94 text

Slide 95

Slide 95 text

⚠ “Remove” lifecycle events

Slide 96

Slide 96 text

Don’t

Slide 97

Slide 97 text

Don’t use the “remove” lifecycle events
▪ In general, avoid having to do things on removal
▪ If you don’t use cascade removing, those events won’t be dispatched for the associated entities to remove, and you don’t want to do everything manually yourself
▪ Use domain events that are handled asynchronously
▪ It’s really easy now thanks to the Symfony Messenger component

Slide 98

Slide 98 text

⚠ Profiling and logs
Very useful but...

Slide 99

Slide 99 text

They have a cost
▪ Collecting data for the profiler takes time and memory
▪ Writing logs involves IO operations and thus takes time and memory
▪ Consider disabling the profiling and the logging, especially in your batch processing
▪ Do it case by case

Slide 100

Slide 100 text

Slide 101

Slide 101 text

No time to explain
It’s already long enough

Slide 102

Slide 102 text

More topics worth checking out
▪ Extra lazy associations
▪ Partial objects and references
▪ Read-only entities and properties
▪ Using criteria on non-loaded associations
▪ Using the “filters” feature
▪ Aggregating data directly with SQL
▪ Multi-step hydration

Slide 103

Slide 103 text

“Always keep the identity map and internals in mind”

Slide 104

Slide 104 text

Thanks! Any questions?
You can find me at:
▪ fancyweb on GitHub and on the Symfony Devs Slack
▪ @fancyweb_ on Twitter
▪ [email protected] by mail