Optimize your Python & Django apps with easy-to-learn Postgres superpowers

Transcript

Louise Grandjonc | @louisemeta @AzureDBPostgres @citusdata

A bit about me—and these <fish> Louise Grandjonc @louisemeta @AzureDBPostgres

@citusdata • Python developer • Postgres team at Microsoft (formerly of Citus Data) • Love both Python & Postgres • Really missing PyCon this year • Enjoy complex and modern SQL • Want applications to use database at full potential

AGENDA: Easy-to-learn Postgres superpowers #1: What can cause performance issues?

#2: What tools can you use to catch them? Loops Limiting query set Pagination Improving Slow Queries @louisemeta

What can cause performance issues? @louisemeta

ORMs are very handy tools, helping us build powerful applications

in short amount of time… @louisemeta

We have to be careful @louisemeta 1. ORM executes queries

you might not expect 2. Queries might not be optimised—& you won’t know about it 3. Fast query in your dev environment might behave very differently with your full dataset & under load

Postgres superpowers to help you catch performance issues @louisemeta

pg_stat_statements / for production & dev • Postgres extension •

Tracks statistics on queries executed by a server (once enabled) • Great to catch slow queries • Add to production environment • And to dev environment • Installed / enabled by default on Azure Database for PostgreSQL @louisemeta

pg_stat_statements / analyzing performance @louisemeta 1 2 3 4 SELECT

total_time, min_time, max_time, mean_time, calls, query FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 100;

pg_stat_statements / analyzing performance -[ RECORD 1 ]------------------------------------------------------ total_time |

643.624 min_time | 256.424 max_time | 531.417 mean_time | 383.666666667 calls | 3 query | SELECT COUNT(*) FROM letters; @louisemeta 1 2 3 4 SELECT total_time, min_time, max_time, mean_time, calls, query FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 100;

pg_stat_statements / analyzing performance -[ RECORD 1 ]------------------------------------------------------ total_time |

643.624 min_time | 256.424 max_time | 531.417 mean_time | 383.666666667 calls | 3 query | SELECT COUNT(*) FROM letters; @louisemeta 1 2 3 4 SELECT total_time, min_time, max_time, mean_time, calls, query FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 100;

• Identify long-running queries • Uses pg_stat_statements • Top 5

queries by average duration per execution • Aggregated in 15-min intervals • Alter time window • View additional queries Query Performance Insights / on Azure Wait statistics in Query Performance Insight, on Microsoft Azure @louisemeta

• 2 types of custom suggestions • Create Index: suggests

new indexes for frequently-run or time- consuming queries • Drop Index: suggests drop rarely-used or redundant indexes Performance Recommendations / on Azure Azure portal performance recommendations @louisemeta

@louisemeta django-debug-toolbar

django-debug-toolbar / for dev environments • Lists queries in a

template view • Gives EXPLAIN plan for query • Trace back queries • Only when DEBUG=True in Django settings @louisemeta

Logs on Postgres / are they useful? @louisemeta 1 2

3 4 5 6 /* WARNING: Only do this in local environment */ /* or you might end up with very large log files */ log_statement = 'all' logging_collector = on log_min_duration_statement = 0 • Best if have different environments (Golang API, django apps, etc.) • If template is making Ajax calls to API, django-debug-toolbar won't display queries, which makes Postgres logs especially useful! • To have Postgres log all your queries, you have to change your config:

Logs on Postgres / are they useful? @louisemeta 1 2

Loops, once upon a thousand times @louisemeta

About today’s sample application: ”Ad & Campaign App” https://aka.ms/django-PG-ad-app

About the schema

On my website, have a page to list Campaigns

Loops, once upon a thousand times Here is the code

for this view @louisemeta 1 2 3 4 5 6 7 8 class CampaignListView(LoginRequiredMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' def get_queryset(self): queryset = super().get_queryset() return queryset.filter(company=self.request.user.employee.company)

Loops, once upon a thousand times Here is the code

In django-debug-toolbar, I have over 100 queries

Loops, once upon a thousand times Thanks to django-debug-toolbar, I

can trace back that it comes from my template. Indeed, I have the following code: @louisemeta 1 2 3 4 5 6 7 8 9 10 {% for campaign in campaigns %} <li class="treeview"> <ul class="treeview-menu"> <li><a href="{% url 'campaign_detail' campaign.id %}">Manage Campaign</a></li> {% for ad in campaign.ads.all %} <li><a href="{% url 'campaign_ad_detail' campaign.id ad.id %}">Ad #{{ ad.name|truncatechars:30 }}</a></li> {% endfor %} </ul> </li> {% endfor %}

Loops, once upon a thousand times Thanks to django-debug-toolbar, I

1 2 3 4 5 6 7 8 9 10

{% for campaign in campaigns %} <li class="treeview"> <ul class="treeview-menu"> <li><a href="{% url 'campaign_detail' campaign.id %}">Manage Campaign</a></li> {% for ad in campaign.ads.all %} <li><a href="{% url 'campaign_ad_detail' campaign.id ad.id %}">Ad #{{ ad.name|truncatechars:30 }}</a></li> {% endfor %} </ul> </li> {% endfor %} Loops, once upon a thousand times Thanks to django-debug-toolbar, I can trace back that it comes from my template. Indeed, I have the following code: @louisemeta

Loops, once upon a thousand times Loops can easily be

fixed, Django provides 2 functions: • select_related: usually for ForeignKey • prefetch_related: usually for ManyToMany or related objects @louisemeta 1 2 3 4 5 6 7 8 class CampaignListView(LoginRequiredMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' def get_queryset(self): queryset = super().get_queryset() return queryset.filter(company=self.request.user.employee.company) .prefetch_related('ads')

Loops, once upon a thousand times Loops can easily be

fixed, Django provides 2 functions: • select_related: usually for ForeignKey • prefetch_related: usually for ManyToMany or related objects @louisemeta 1 2 3 4 5 6 7 8 class CampaignListView(LoginRequiredMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' def get_queryset(self): queryset = super().get_queryset() return queryset.filter(company=self.request.user.employee.company) .prefetch_related('ads')

Going from 100 queries down to 6 queries

@louisemeta

If working with small sets, you might not notice This

`pg_stat_statements` query helps you ID most common queries: @louisemeta 1 2 select query,calls from pg_stat_statements ORDER BY calls DESC limit 50; To avoid catastrophes on production, look at queries during dev phase to avoid loops like this What I want you to remember

Limiting what you are selecting @louisemeta

Limiting what you are selecting SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id", "ad_analytics_ads"."image",

"ad_analytics_ads"."target_url", "ad_analytics_ads"."impressions_count", "ad_analytics_ads"."clicks_count", "ad_analytics_ads"."created_at", "ad_analytics_ads"."updated_at" FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) By default, Django ORM selects all columns. When we do prefetch_related, here is the query: 1 2

SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id", "ad_analytics_ads"."image", "ad_analytics_ads"."target_url", "ad_analytics_ads"."impressions_count", "ad_analytics_ads"."clicks_count", "ad_analytics_ads"."created_at", "ad_analytics_ads"."updated_at"

FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) Limiting what you are selecting @louisemeta 1 2 By default, Django ORM selects all columns. When we do prefetch_related, here is the query:

Limiting what you are selecting However, in menu, we only

display name; we only need id & campaign_id for the link. So, the query should be: @louisemeta 1 2 SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id", "ad_analytics_ads"."image", "ad_analytics_ads"."target_url", "ad_analytics_ads"."impressions_count", "ad_analytics_ads"."clicks_count", "ad_analytics_ads"."created_at", "ad_analytics_ads"."updated_at" FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) 1 2 SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id" FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) By default, Django ORM selects all columns. When we do prefetch_related, here is the query:

1 2 Limiting what you are selecting However, in menu,

we only display name; we only need id & campaign_id for the link. So, the query should be: @louisemeta 1 2 SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id", "ad_analytics_ads"."image", "ad_analytics_ads"."target_url", "ad_analytics_ads"."impressions_count", "ad_analytics_ads"."clicks_count", "ad_analytics_ads"."created_at", "ad_analytics_ads"."updated_at" FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) SELECT "ad_analytics_ads"."id", "ad_analytics_ads"."name", "ad_analytics_ads"."campaign_id" FROM "ad_analytics_ads" WHERE "ad_analytics_ads"."campaign_id" IN (...) By default, Django ORM selects all columns. When we do prefetch_related, here is the query:

Limiting what you are selecting You can limit columns retrieved

by Django ORM using function only() If table has tens of columns, limiting what you pull from the database will improve page performance @louisemeta Don't let your Django ORM dictate what you should SELECT :) What I want you to remember

Limiting what you are selecting You can limit columns retrieved

by Django ORM using function only() If table has tens of columns, limiting what you pull from the database will improve page performance @louisemeta 1 2 Example of how to use the function only() Campaign.objects.filter(company_id=1).only('name')

Limiting what you are selecting However—ads are in prefetch_related—so to

limit queryset to the right fields, we use Prefetch class @louisemeta 1 2 3 4 5 6 7 8 9 10 from django.db.models import Prefetch … def get_queryset(self): queryset = super().get_queryset() return (queryset .filter(company=self.request.user.employee.company) .prefetch_related(Prefetch('ads', queryset=Ads.objects.all().only('id', 'campaign_id', 'name'))) .order_by('id'))

Limiting what you are selecting However—ads are in prefetch_related—so to

Pagination @louisemeta

Suppose I generated 1M campaigns for my company. And now

the list page is very, very slow @louisemeta

Pagination / why is paginating so important? 1. Queries retrieving

data for 1M rows is slower than 25 rows 2. The more rows, the slower! 3. When you do prefetch_related, you are doing a query with campaign_id IN () , you can imagine what it would look like for 1M rows! 4. You never actually need to see all of your rows in one page @louisemeta

Pagination / Default paginator class @louisemeta 1 2 3 4

5 6 7 8 9 10 11 12 13 14 >>> from django.core.paginator import Paginator >>> queryset = Campaign.objects.filter(company_id=1).order_by('id') >>> p = Paginator(queryset, 25) >>> p.num_pages 40000 >>> page = p.page(1) >>> page.has_next() True >>> page.has_previous() False >>> page.next_page_number() 2 >>> page.object_list <QuerySet [<Campaign: Campaign object (1)>, <Campaign: Campaign object (2)>, The Paginator class in Django takes care of splitting queryset into pages

Pagination / Default paginator class @louisemeta 1 2 3 4

Pagination / Integration in ListView (python) The generic view ListView,

that we use for the campaigns page, provides a built-in way to paginate—by simply adding this: @louisemeta 1 2 3 4 5 6 7 8 9 class CampaignListView(LoginRequiredMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' paginate_by = 20 def get_queryset(self): ...

Pagination / Integration in ListView (python) The generic view ListView,

that we use for the campaigns page, provides a built-in way to paginate—by simply adding this: @louisemeta 1 2 3 4 5 6 7 8 9 class CampaignListView(LoginRequiredMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' paginate_by = 20 def get_queryset(self): ...

Pagination / Integration in ListView (html) 1 2 3 4

5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 <div class="pagination"> <span class="step-links"> <span class="page-links previous"> <a href="?page=1">« first</a> {% if page_obj.has_previous %} <a href="?page={{ page_obj.previous_page_number }}">previous</a> {% endif %} </span> <span class="page-links current"> Page {{ page_obj.number }} of {{ page_obj.paginator.num_pages }}. </span> <span class="page-links next"> {% if page_obj.has_next %} <a href="?page={{ page_obj.next_page_number }}">next</a> {% endif %} <a href="?page={{ page_obj.paginator.num_pages }}">last »</a> </span> </span> </div>

Pagination / Integration in ListView (html) 1 2 3 4

Pagination / Integration in ListView

Pagination / 2 problems with default pagination 1. Count used

by Paginator to determine # of pages is slow—the more rows, the slower it will get! 2. OFFSET … LIMIT 20 gets slower in high pages OFFSET is used to skip first N rows of a query—to do that, database must fetch rows & order, to return last ones @louisemeta

Keyset pagination / why you should care Example of what

queries are usually like with OFFSET .. LIMIT 20: 1 2 SELECT * FROM campaigns ORDER BY id LIMIT 20 OFFSET 20;

Keyset pagination / why you should care Instead of having

an OFFSET query like this: With keyset pagination, the query will have a filter on the ORDER BY fields: To learn about keyset pagination, I recommend this article by Markus Winand on his use-the-index-luke dot com website: https://aka.ms/luke-no-offset. Markus’s website is great if you want to learn more about SQL performance! 1 2 SELECT * FROM campaigns WHERE id > ? ORDER BY id LIMIT 20; 1 2 SELECT * FROM campaigns ORDER BY id LIMIT 20 OFFSET 20;

Keyset pagination / django-keyset-pagination-plus An application has been written to

simplify your use of keyset pagination: https://pypi.org/project/django-keyset-pagination-plus/ @louisemeta 1 2 3 4 5 6 7 8 9 10 11 from keyset_pagination.paginator import KeysetPaginator from keyset_pagination.mixin import PaginateMixin class CampaignListView(LoginRequiredMixin, PaginateMixin, ListView): model = Campaign template_name = 'ad_analytics/campaigns/list.html' context_object_name = 'campaigns' paginator_class = KeysetPaginator paginate_by = 20

Keyset pagination / django-keyset-pagination-plus An application has been written to

Keyset pagination

Fixing slow queries @louisemeta

Fixing a slow query / pg_stat_statements SQL code using pg_stat_statements

to get the 10 slowest queries @louisemeta 1 2 3 SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;

Fixing a slow query / pg_stat_statements SQL code using pg_stat_statements

to get the 10 slowest queries @louisemeta 1 2 3 SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; query | SELECT * FROM ad_analytics_campaign WHERE archived is not true average_time | 219.482032024528 One of the slow queries I discover is this one:

Fixing a slow query / Looking into EXPLAIN @louisemeta 1

2 3 4 5 6 7 8 9 10 11 12 postgres=> EXPLAIN SELECT * FROM ad_analytics_campaign WHERE archived is not true; QUERY PLAN ------------------------------------------------------------------------------------ ------ Gather (cost=1000.00..30858.95 rows=133 width=62) Workers Planned: 2 -> Parallel Seq Scan on ad_analytics_campaign (cost=0.00..29845.65 rows=55 width=62) Filter: (archived IS NOT TRUE) (4 rows) Time: 100.985 ms

Fixing a slow query / Looking into EXPLAIN @louisemeta 1

Fixing a slow query / adding a partial index Adding

a partial index on the column: @louisemeta 1 2 3 4 5 6 7 class Campaign(models.Model): ... class Meta: indexes = [models.Index(fields=['archived'], name='campaign_archived_idx', condition=Q(archived=False))]

Fixing a slow query / adding a partial index Adding

a partial index on the column: @louisemeta 1 2 3 4 5 6 7 class Campaign(models.Model): ... class Meta: indexes = [models.Index(fields=['archived'], name='campaign_archived_idx', condition=Q(archived=False))] New time for the query: 100.388 ms

Without partial index, this query took 219 ms SQL code

using pg_stat_statements to get the 10 slowest queries @louisemeta 1 2 3 SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10; query | SELECT * FROM ad_analytics_campaign WHERE archived is not true average_time | 219.482032024528 Remember we discovered this slow query using pg_stat_statements

Fixing a slow query / scaling UP your server What

if you already optimized your code, have perfect indexes—and things are still slow? Or getting slower and slower, even if you didn’t change anything? There can be many different reasons! More traffic, more usage of CPU, low memory… Your hardware resources might not be sufficient. @louisemeta This might be a good moment to scale UP—by adding disk, cpu, memory. Tip

Improving performance / scaling UP your server Recommend this talk

on how to debug, & what tools you can use. To understand if it’s time to scale up your server. https://aka.ms/all-goes-wrong-PG-talk @louisemeta

What if your app is a big success? What if

you need more scale than you can get from a single machine? @louisemeta

Fixing slow queries / scaling OUT with Citus • Open

source extension to PostgreSQL (not a fork) • Scales out Postgres horizontally • Distributes your data + your SQL queries • Multi-tenant SaaS, real-time analytics, & HTAP • Looks like single-node Postgres to your app • Now available as Hyperscale (Citus), part of Azure Database for PostgreSQL @louisemeta https://aka.ms/citus

APPLICATION SELECT FROM GROUP BY company_id, avg(spend) AS avg_campaign_spend compaigns

company_id; METADATA COORDINATOR NODE WORKER NODES W1 W2 W3 … Wn SELECT company_id sum(spend), count(spend) … FROM campaigns_2009 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2001 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2017 … How Citus Shards Postgres Across Nodes: Distributed Aggregate

https://aka.ms/hyperscale-citus-quickstart Fixing slow queries / Quickstart docs for Hyperscale (Citus)

on Azure

Your 8 takeaways Your Postgres superpowers @louisemeta

Your 8 takeaways, the Postgres superpowers 1. pg_stat_statements to find

slow queries 2. django-debug-toolbar is 3. Postgres logs for 100% visibility into queries 4. Loops—be careful 5. Don’t let your ORM dictate what you SELECT 6. Replace your old OFFSET pagination by keyset pagination 7. Indexes—and partial indexes—can help 8. Scale up—or scale out with Hyperscale (Citus) on Azure @louisemeta

grazie ευχαριστώ gracias ध"यवाद Louise Grandjonc • Hyperscale (Citus) Quickstart Docs: https://aka.ms/hyperscale-citus-quickstart • Talk on Debugging Postgres: https://aka.ms/all-goes-wrong-PG-talk • DjangoCon talk on Indexes: https://aka.ms/django-PG-index-talk • GitHub repo for Django Ad App: https://aka.ms/django-PG-ad-app • Git repo for Citus open source: https://aka.ms/citus • Join the Citus technical newsletter: https://aka.ms/citus-newsletter @louisemeta @AzureDBPostgres @citusdata

Optimize your Python & Django apps with easy-to...

Optimize your Python & Django apps with easy-to-learn Postgres superpowers | PyCon 2020 Remote | Louise Grandjonc

More Decks by Azure Database for PostgreSQL

Other Decks in Technology

Featured

Transcript