A performant application, even with Doctrine!

Good evening! I am Thomas Calvet
▪ I am a Symfony enthusiast
▪ I work at ekino
▪ You can find me on GitHub and on the Symfony Devs Slack as fancyweb, and on Twitter as @fancyweb_

ORM internals
Simply and quickly

Object Relational Mapping (ORM)
▪ A technique to convert data between incompatible type systems using object-oriented programming languages
▪ Doctrine ORM is an Object Relational Mapper
▪ Its goal is to simplify the translation between database rows and the PHP object model

UnitOfWork (UoW)
▪ A single class of 3k+ lines of code
▪ Knows all the managed entities
▪ Responsible for tracking the changes made to those entities
▪ Responsible for writing out those changes in the correct order
The private heart of the ORM

EntityManager (EM)
▪ A facade to the UoW
▪ And a facade to all the other ORM subsystems:
  ▫ Metadata
  ▫ Repositories
  ▫ Queries, etc.
The central public access point of the ORM

The identity map
▪ A big associative array in the UoW
▪ Stored in a class property ($identityMap)
▪ Stores a reference to every managed entity
▪ Ensures that a managed entity is loaded only once, always as the same in-memory object
The unique source of truth

In this example, only one SQL query is executed since
the entity with the id 1 is already in the identity map on the second “find” call.

In this example, two SQL queries are executed but thanks
to the identity map the same entity instance is returned by the repository.
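
Both scenarios can be imitated with a toy model in plain PHP. It is not Doctrine's implementation: `ToyUserRepository`, `User` and `findOneByName()` are made up for the illustration, and `queryCount` stands in for the SQL queries.

```php
<?php

// Toy model of an identity map. This is NOT Doctrine's code.
final class User
{
    public function __construct(public int $id, public string $name) {}
}

final class ToyUserRepository
{
    /** @var array<int, User> managed entities, indexed by identifier */
    private array $identityMap = [];

    /** Number of simulated SQL queries. */
    public int $queryCount = 0;

    /** @param array<int, array{id: int, name: string}> $rows fake table, indexed by id */
    public function __construct(private array $rows) {}

    public function find(int $id): ?User
    {
        // Lookup by identifier: the identity map can answer without any query.
        if (isset($this->identityMap[$id])) {
            return $this->identityMap[$id];
        }

        $this->queryCount++;

        return isset($this->rows[$id]) ? $this->hydrate($this->rows[$id]) : null;
    }

    public function findOneByName(string $name): ?User
    {
        // Lookup by another criterion: the query always runs...
        $this->queryCount++;
        foreach ($this->rows as $row) {
            if ($row['name'] === $name) {
                // ...but hydration reuses the instance that is already managed.
                return $this->hydrate($row);
            }
        }

        return null;
    }

    /** @param array{id: int, name: string} $row */
    private function hydrate(array $row): User
    {
        return $this->identityMap[$row['id']] ??= new User($row['id'], $row['name']);
    }
}

$repository = new ToyUserRepository([1 => ['id' => 1, 'name' => 'thomas']]);

$a = $repository->find(1);                 // 1 query
$b = $repository->find(1);                 // no query: served by the identity map
$c = $repository->findOneByName('thomas'); // 1 more query, same instance returned
```

After these three calls, `$a`, `$b` and `$c` are the same object and only two queries were counted.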

The states
▪ A big associative array in the UoW
▪ Cached in a class property ($entityStates)
▪ Every entity has a state
▪ There are four possible states
Crucial for optimisation

An entity is “managed” when it is known by the identity map. It is added to this map in multiple cases: for example, when it is hydrated through the ORM or after being inserted in the database.
1 = UnitOfWork::STATE_MANAGED

An entity is “new” when it is not known by the identity map and when it has no identifier or an unknown database identifier.
2 = UnitOfWork::STATE_NEW

An entity is “detached” when it was removed from the identity map or when it is not known by the identity map and when it has a known database identifier. “detached” is also the default assumed state.
3 = UnitOfWork::STATE_DETACHED

An entity is “removed” when it has been scheduled for deletion from the database on the next UoW commit.
4 = UnitOfWork::STATE_REMOVED

The changesets
▪ A big associative array in the UoW
▪ Stored in a class property ($entityChangeSets)
▪ The differences between the last synchronized (from the database) states and the current states of the entities
▪ Computed at the beginning of the commit and cleared afterwards
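
To make the idea of a changeset concrete, here is a minimal sketch in plain PHP. `computeChangeSet()` is a hypothetical helper, not the UoW's code; it only assumes that a changeset maps each changed field to an `[old value, new value]` pair.

```php
<?php

/**
 * Toy change detection: compares a snapshot taken at load time with the
 * current values and returns [field => [old value, new value]].
 */
function computeChangeSet(array $originalData, array $currentData): array
{
    $changeSet = [];
    foreach ($currentData as $field => $newValue) {
        $oldValue = $originalData[$field] ?? null;
        // Strict comparison: only fields whose value really changed are kept.
        if ($oldValue !== $newValue) {
            $changeSet[$field] = [$oldValue, $newValue];
        }
    }

    return $changeSet;
}

$snapshot = ['title' => 'Hello', 'views' => 10];
$current  = ['title' => 'Hello world', 'views' => 10];

$changeSet = computeChangeSet($snapshot, $current);
// ['title' => ['Hello', 'Hello world']]
```

Doing this comparison for every property of every managed entity is why the cost of a commit grows with the size of the UoW.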

The pending changes
▪ Big associative arrays in the UoW
▪ Stored in class properties ($entityInsertions, $collectionUpdates, etc.)
▪ All pending information about what to insert, update or delete (entities and collections)
▪ Computed at the beginning of the commit and cleared afterwards

To summarize
With a schema

16 big associative arrays in the UoW!
The UoW contains highly performance-sensitive code

“Always keep the identity map and internals in mind. It will greatly improve the performance of your code!”

Flush carefully
Flush once

Transactional write-behind
▪ A strategy used by the UoW
▪ Delays the execution of SQL queries
▪ The goal is to execute them in the most efficient way
▪ They are executed in the shortest possible transaction so that all write locks are quickly released

Flushing is a heavy operation
▪ It computes all the changesets
▪ It dispatches all lifecycle events
▪ It generates all needed SQL queries
▪ It executes them in a transaction

Don’t

Batch your flushes
When it is necessary

One heavy flush can be catastrophic
▪ The more changes there are on managed entities, the longer they take to be processed internally
▪ Sometimes, splitting a heavy task into x smaller ones is more efficient
▪ Overly large database transactions are bad for concurrency because they keep the tables locked for too long
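
A minimal sketch of batched flushing. `persistInBatches()` and the batch size of 100 are illustrative choices; `$em` is typed as `object` only so that the snippet runs without Doctrine installed, and is assumed to expose the `persist()`, `flush()` and `clear()` methods of the EntityManager.

```php
<?php

/**
 * Persists entities in batches: one flush (one transaction) per batch,
 * followed by a clear to release the managed entities.
 *
 * $em is assumed to behave like Doctrine's EntityManagerInterface.
 */
function persistInBatches(object $em, iterable $entities, int $batchSize = 100): void
{
    $pending = 0;
    foreach ($entities as $entity) {
        $em->persist($entity);

        // Flush only when a full batch is pending, never on every iteration.
        if (++$pending === $batchSize) {
            $em->flush();
            $em->clear();
            $pending = 0;
        }
    }

    // Flush the last, possibly incomplete, batch.
    if ($pending > 0) {
        $em->flush();
        $em->clear();
    }
}
```

Keep in mind that `clear()` detaches every managed entity, including the ones loaded before the loop started, so this shape fits self-contained import or batch jobs best.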

Don’t

Use SQL
Get back to the basics

Clear your Entity Manager
To start over

Reduce the memory usage
▪ The UoW stores a large amount of information in all its class properties
▪ Some are cleared after each commit
▪ Some are never cleared automatically
▪ This leftover data can end up using a lot of memory

What happens when the UoW is cleared?
▪ It is reset to its initial state
▪ All its “stack” class properties are set to an empty array, thus freeing a lot of memory
▪ Consequently, all managed entities become “detached”

⚠ Legacy clear, detach and merge
Everything needs to die

⚠ Clearing a single entity class is not recommended because many scenarios are broken. It won’t be available anymore in Doctrine ORM 3.
Don’t

⚠ Detaching and merging entities is not recommended. It won’t be available anymore in Doctrine ORM 3.
Don’t

Know the tracking policies
To choose the right one

Change tracking
▪ At some point, the ORM determines what changed on the entities it manages thanks to a tracking policy
▪ Each tracking policy has advantages and disadvantages
▪ Each tracking policy has a different impact on the overall performance
▪ There are three different tracking policies

The Deferred Implicit tracking policy
▪ The ORM checks all managed entities for changes
▪ It checks all property values one by one
▪ It checks for new entities that are referenced by other managed entities
▪ It obviously takes longer as the UoW grows
Automatic, but the worst performance

The Deferred Explicit tracking policy
▪ Same behavior as the Deferred Implicit tracking policy
▪ Except that the ORM only checks entities that have been explicitly marked through a “persist” call
▪ Better for a large UoW
Some manual work, but much better performance

The Notify tracking policy
▪ The entities notify interested listeners of every change to their properties
▪ You have full control over when you consider a property changed or not
▪ The best for a very large UoW
A lot of manual work, but the best performance
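
The tracking policy is chosen per entity class in its mapping. A minimal sketch, assuming PHP 8 attribute mapping (with annotations the equivalent is `@ORM\ChangeTrackingPolicy("DEFERRED_EXPLICIT")`); the `Article` entity and its fields are hypothetical.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

// Hypothetical entity; only the ChangeTrackingPolicy line matters here.
#[ORM\Entity]
#[ORM\ChangeTrackingPolicy('DEFERRED_EXPLICIT')]
class Article
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;

    #[ORM\Column(type: 'string')]
    private string $title = '';

    public function rename(string $title): void
    {
        $this->title = $title;
    }
}
```

With this policy, a change made through `rename()` on an already managed `Article` is only considered at flush time if `persist()` was called on that entity again.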

Load as few entities as possible
The out of memory (OOM) problem

Every loaded entity has an impact
▪ An impact on both time and memory because it increases the overall number of managed entities
▪ Hydrating an entity is heavy
▪ Think about the potential total number of entities your method call could load
▪ Don’t use the “findAll()” method
▪ Avoid filterless queries

Don’t
There could be 50 000 comments one day.

Use the “iterate” feature
▪ Avoids hydrating all the entities returned by a query in memory at once
▪ Hydrates the entities one by one instead
▪ Limited to queries that don’t need to fetch join a collection-valued association
▪ ⚠ You still need to clear the EM regularly to free the memory
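
A minimal sketch of one-by-one processing with regular clears. It assumes a query object exposing `toIterable()`, the method that recent ORM 2 releases offer next to the older `iterate()`; `processOneByOne()` and `$handle` are hypothetical names. `$em` and `$query` are typed as `object` so that the snippet runs without Doctrine installed.

```php
<?php

/**
 * Processes the entities of a query one by one, clearing the EntityManager
 * every $batchSize entities so that hydrated entities can be freed.
 *
 * $em and $query are assumed to behave like Doctrine's
 * EntityManagerInterface and Query.
 */
function processOneByOne(object $em, object $query, callable $handle, int $batchSize = 100): int
{
    $processed = 0;
    foreach ($query->toIterable() as $entity) {
        $handle($entity);

        // Without this, every hydrated entity would stay managed until the end.
        if (++$processed % $batchSize === 0) {
            $em->clear();
        }
    }
    $em->clear();

    return $processed;
}
```

If `$handle` modifies the entities, a `flush()` is needed before each `clear()`, otherwise the pending changes are lost.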

Use SQL
Yes, again

“Pre select” the associations
The N+1 problem

Watch the query count
▪ Entity associations are lazy loaded
▪ A SQL query is executed on demand, when you access an uninitialized collection for the first time
▪ When you iterate over an array of entities, the ORM needs to execute a SQL query for every collection you access
▪ Therefore, with nested iterations, the number of queries is multiplied at every nesting level
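
“Pre selecting” an association usually means a DQL fetch join. A minimal sketch: the `App\Entity\Post` entity and its `comments` association are hypothetical, and `$em` is typed as `object` so that the snippet runs without Doctrine installed (it is assumed to expose `createQuery()` like the EntityManager).

```php
<?php

/**
 * Loads posts together with their comments in a single query (fetch join).
 * "App\Entity\Post" and its "comments" association are hypothetical.
 */
function findPostsWithComments(object $em): array
{
    $dql = 'SELECT p, c FROM App\Entity\Post p LEFT JOIN p.comments c';

    // Selecting "c" as well as "p" hydrates each post's comments collection
    // from the same result set, so iterating over it triggers no extra query.
    return $em->createQuery($dql)->getResult();
}
```

A plain `LEFT JOIN p.comments c` without `c` in the SELECT clause would only filter or join, and the collections would still be lazy loaded one by one.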

⚠ Big inverse side associations

Don’t

Don’t set up big inverse side associations
▪ In general, avoid non-essential associations
▪ Each association implies more work for the ORM, i.e. consumed time and memory
▪ Not having “big” inverse side associations prevents you from using their getter
▪ Consequently, it prevents you from loading large numbers of entities in memory at once and from running into the N+1 problem

Use caches
Do not process the same things again and again

Metadata cache
▪ Removes the mapping file parsing overhead
▪ Mapping files don’t need to be parsed on each request because the same mapping files will always produce the same class metadata
▪ Your application should never be in production without a configured metadata cache

Query cache
▪ Removes the DQL to SQL conversion overhead
▪ The same DQL query will always be converted to the same SQL (for the same platform)
▪ Your application should never be in production without a configured query cache

Result cache
▪ Caches the results of a query
▪ Avoids querying the database again and rehydrating the resulting entities
▪ You can specify the time to live (TTL) per query
▪ Use it particularly on slow queries, even with a short TTL
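
A minimal sketch of a per-query TTL. It assumes a query object exposing `enableResultCache()` (older ORM 2 code used `useResultCache()` for the same purpose) and an application where a result cache is configured; `cachedResult()` is a hypothetical helper and `$query` is typed as `object` so that the snippet runs without Doctrine installed.

```php
<?php

/**
 * Runs a query with the result cache enabled for $ttl seconds.
 * $query is assumed to behave like a Doctrine Query.
 */
function cachedResult(object $query, int $ttl = 60): array
{
    return $query
        ->enableResultCache($ttl) // cache lifetime in seconds
        ->getResult();
}
```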

Second level cache
▪ Reduces the number of database accesses needed
▪ A cache between the identity map and the database
▪ Three caching modes:
  ▫ READ_ONLY
  ▫ NONSTRICT_READ_WRITE
  ▫ READ_WRITE
▪ Experimental and very complex
▪ Limited to a single application and to single primary keys

Prefer entity listeners
Over event listeners and event subscribers

When it concerns one kind of entity only.
Don’t

Prefer entity listeners
▪ In general, prefer listeners
▪ Entity listeners are better for performance because they are called only for entities of the class they were configured for, so they avoid executing useless code paths
▪ Moreover, they can be lazy, i.e. instantiated only when they are actually used
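
A minimal sketch of an entity listener, assuming PHP 8 attribute mapping. `Article`, `ArticleListener` and `stampCreation()` are hypothetical names; the second argument Doctrine passes to the method (the event arguments) is simply ignored here.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

// Hypothetical listener: it is only invoked for Article entities.
class ArticleListener
{
    #[ORM\PrePersist]
    public function stampCreation(Article $article): void
    {
        $article->createdAt ??= new DateTimeImmutable();
    }
}

#[ORM\Entity]
#[ORM\EntityListeners([ArticleListener::class])]
class Article
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    public ?int $id = null;

    #[ORM\Column(type: 'datetime_immutable')]
    public ?DateTimeImmutable $createdAt = null;
}
```

A global event listener doing the same job would be called for every persisted entity and would need an `instanceof` check to skip all the other classes.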

⚠ Composite primary keys

Don’t

Avoid composite primary keys
▪ They require additional internal work, i.e. they imply time and memory overheads
▪ They have a much higher probability of errors because almost nobody uses them
▪ Use single primary keys
▪ Use UUIDs

⚠ Table inheritance

Don’t

Don’t use table inheritance
▪ Single Table Inheritance (STI) implies having weak constraints on the table, for example many nullable columns
▪ Class Table Inheritance (CTI) implies multiple join operations for any query
▪ Representing OOP inheritance in a database is never a good idea

Use MappedSuperclass
▪ It provides reusable mapping information and reusable concrete code logic but is not an entity itself
▪ Think of it as a trait for your entities, but with a bonus: the mapped superclass is part of the child classes’ hierarchy
▪ Each “child” entity is independent: it has its own table and is easy to refactor
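
A minimal sketch of a mapped superclass, assuming PHP 8 attribute mapping. `Publishable`, `Article` and `Podcast` are hypothetical classes.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

// Hypothetical base class: shared mapping and logic, but no table of its own.
#[ORM\MappedSuperclass]
abstract class Publishable
{
    #[ORM\Column(type: 'datetime_immutable', nullable: true)]
    protected ?DateTimeImmutable $publishedAt = null;

    public function publish(): void
    {
        $this->publishedAt ??= new DateTimeImmutable();
    }

    public function isPublished(): bool
    {
        return $this->publishedAt !== null;
    }
}

// Each child is an independent entity with its own table.
#[ORM\Entity]
class Article extends Publishable
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;
}

#[ORM\Entity]
class Podcast extends Publishable
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;
}
```

Unlike with a trait, `$article instanceof Publishable` is true, so the base class can be used in type declarations.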

⚠ Cascade removing

Don’t

Don’t use cascade removing
▪ Cascade removing is done at runtime, is synchronous, and involves a full in-memory load of the entities to delete
▪ On big associations, it consumes a lot of time and memory and often leads to OOM errors
▪ Use SQL “ON DELETE CASCADE” clauses
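
A minimal sketch of delegating the cascade to the database, assuming PHP 8 attribute mapping. `Post` and `Comment` are hypothetical entities; the `onDelete` option of the join column is what is meant to produce the “ON DELETE CASCADE” clause on the foreign key once the schema is updated.

```php
<?php

use Doctrine\ORM\Mapping as ORM;

#[ORM\Entity]
class Post
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;
}

#[ORM\Entity]
class Comment
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column(type: 'integer')]
    private ?int $id = null;

    // No cascade: ['remove'] here: the database deletes the comments of a
    // deleted post itself, without loading them into memory.
    #[ORM\ManyToOne(targetEntity: Post::class)]
    #[ORM\JoinColumn(nullable: false, onDelete: 'CASCADE')]
    private Post $post;

    public function __construct(Post $post)
    {
        $this->post = $post;
    }
}
```

The trade-off is that the ORM is not aware of rows deleted this way: comments that are already managed stay in the UoW until it is cleared.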

⚠ “Remove” lifecycle events

Don’t

Don’t use the “remove” lifecycle events
▪ In general, avoid having to do things on removal
▪ If you don’t use cascade removing, those events won’t be dispatched for the associated entities to remove, and you don’t want to do everything manually yourself
▪ Use domain events that are handled asynchronously
▪ It’s really easy now thanks to the Symfony Messenger component

⚠ Profiling and logs
Very useful but...

They have a cost
▪ Collecting data for the profiler takes time and memory
▪ Writing logs involves I/O operations and thus takes time and memory
▪ Consider disabling the profiling and the logging, especially in your batch processing
▪ Do it case by case

No time to explain
It’s already long enough

More topics worth checking out
▪ Extra lazy associations
▪ Partial objects and references
▪ Read only entities and properties
▪ Using criteria on non loaded associations
▪ Using the “filters” feature
▪ Aggregate data directly with SQL
▪ Multi-step hydration

“Always keep the identity map and internals in mind”

Thanks! Any questions?
You can find me at:
▪ fancyweb on GitHub and on the Symfony Devs Slack
▪ @fancyweb_ on Twitter
▪ [email protected] by mail
