Making smarter queries with advanced ORM resources

Hi :) Mariana Bedran Lesche Full-stack developer at Labcodes Software
Studio @maribedran twitter.com gmail.com github.com https://github.com/labcodes/ dados_brasil_io

Motivation Exploring the advantages and disadvantages of the advanced ORM
resources comparing to the more usual ones in terms of performance and readability. After learning a bit of SQL I could write queries complex enough that I didn’t know how to implement using Django. Could I put the two of them together?

When to use this? Ran out of database optimization strategies?
✔ Index tables ✔ Paginate requests ✔ Use select_related and prefetch_related on querysets ✔ Use values, values_list, only and defer on querysets ✔ Setup database cache ✔ Use assertNumQueries method on tests

Methodology - Choose a big dataset - Come up with
some complex questions - Try to answer them with the things I had only seen on the Django documentation and never tried to apply

What went wrong - Procrastination - Had to validate the
data while importing it - My machine cannot handle loading too much data to memory at once - Importing the data took forever - Putting the technical knowledge together with the domain was much harder then I had thought

Open data of public interest from Brazil Brazil has a
law stating that all public data that’s not secret must be publicly available to anyone that has interest in it: the Information Access Bill (Lei de Acesso à Informação).

Real life is not that simple Though the data is
there, the access is not guaranteed for non technical people who would benefit from the information they provide. - Not every government agency complies with the law or take a long time to make the data available - The data is spread among multiple sources - Many datasets are in non open formats - Most of the people who have interest on the data don’t have the technical skills to process it

The Brasil IO project A project created by Álvaro Justen
(@turicas) to gather public data, clean it and make it available for those who want to research it. - Datasets: https://brasil.io/datasets - GitHub: https://github.com/turicas/brasil.io - Support the project: https://apoia.se/brasilio

Company partners from Brazil - Company CNPJ (unique identifier of
the National Legal Entities Registration) - Partner’s name - Partner’s category - Legal Entity - Natural Person - Foreign Partner - Partnership category - Partner’s CNPJ (if it’s a legal entity)

Company Empresa Natural Person Pessoa Física Foreigner Estrangeiro Partnership Sociedade
State Estado

Chamber of Deputies spents Deputies on National Congress can spend
money on products and services related to the parliamentary activity and receive a refund for that. - Deputy’s unique identifier - Deputy’s party - Deputy’s name - Date of the expense - Reference month - Reference year - Amount spent - Description - Company CNPJ (if it’s company) - Some other fields...

Expense Gasto Party Partido Deputy Deputado Company Empresa State Estado

Prefetch objects are great

# Deputy class DeputadoSerializer(serializers.ModelSerializer): class Meta: model = Deputado fields
= [ 'id', 'nome', # name 'partido', # party 'uf', # state 'gastos' # expenses ] depth = 2

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(
'partido', 'uf' # party, state ).prefetch_related( 'gastos' # expenses ) # Will show all related expenses

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(
'partido', 'uf' ) # Set default month/year filters today = date.today() filters = self.request.query_params.dict() filters.setdefault('gastos_mes', today.month) filters.setdefault('gastos_ano', today.year) return queryset.prefetch_gastos(**{ # prefetch expenses field.replace('gastos_', ''): value for field, value in filters.items() if field.startswith('gastos_') # expenses filters

from django.db.models import Prefetch class DeputadoQuerySet(models.QuerySet): # Deputy QuerySet def
prefetch_gastos(self, **kwargs): # prefetch expenses # Expenses QuerySet gastos_qs = GastoCotaParlamentar.objects.select_related( 'empresa' # company ).filter(**kwargs) # Never do that on a real project! prefetch = Prefetch('gastos', queryset=gastos_qs) return self.prefetch_related(prefetch)

Filters • month 3 gastos_mes=3 • year 2018 gastos_ano=2018 •
net value greather than 100 gastos_valor_liquido__gt=100

Aggregates can also be filtered

from django.db.models import Q, Sum # Avg, Max, Min, Variance
class DeputadoQuerySet(models.QuerySet): # annotate deputy’s expenses by month/year def annotate_gasto_no_mes_por_deputado(self, mes, ano): annotation = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # net value filter=Q( gastos__mes=mes, # month gastos__ano=ano # year ) ) } return self.annotate(**annotation)

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM("politicos_gastocotaparlamentar"."valor_liquido") FILTER (
WHERE ( "politicos_gastocotaparlamentar"."ano" = 2018 AND "politicos_gastocotaparlamentar"."mes" = 1 ) ) AS "gastos_2018_01" FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" ON ( "politicos_deputado"."id" = "politicos_gastocotaparlamentar"."deputado_id" ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;

But there’s also FilteredRelations for that

from django.db.models import FilteredRelation, Q, Sum class DeputadoQuerySet(models.QuerySet): def annotate_gasto_no_mes_por_deputado2(self
, mes, ano): return self.annotate( gastos_filtrados=FilteredRelation( # filtered expenses 'gastos', condition=Q( gastos__mes=mes, # month gastos__ano=ano, # year ) ) ).annotate(**{ f'gastos_{ano}_{mes:02}': Sum( 'gastos_filtrados__valor_liquido' ) # filtered expense’s net value })

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM(gastos_filtrados."valor_liquido") AS "gastos_2018_01"
FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" gastos_filtrados ON ( "politicos_deputado"."id" = gastos_filtrados."deputado_id" AND ( ( gastos_filtrados."ano" = 2018 AND gastos_filtrados."mes" = 1 ) ) ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;

You can also aggregate on annotations

from django.db.models import Avg, Sum, Q class DeputadoQuerySet(models.QuerySet): ... #
annotate months expenses per deputy def annotate_gasto_mensal_por_deputado(self): meses = range(1, 13) # months anos = range(2009, 2019) # years annotations = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # expenses net value filter=Q(gastos__mes=mes, gastos__ano=ano) ) for ano in anos for mes in meses } return self.annotate(**annotations)

from django.db.models import Avg class DeputadoQuerySet(models.QuerySet): ... def get_media_mensal(self): #
get months average meses = range(1, 13) # month anos = range(2009, 2019) # year aggregations = { f'media_{ano}_{mes:02}': Avg( # ‘media_2018_01’ f'gastos_{ano}_{mes:02}' # gastos_2018_01 ) for ano in anos for mes in meses } # annotate month expenses per deputy return self.annotate_gasto_mensal_por_deputado() \ .aggregate(**aggregations)

Subqueries

from django.db.models import Exists, OuterRef class DeputadoQuerySet(models.QuerySet): # annotate companys
def annotate_empresas(self): empresas_qs = Empresa.objects.filter( # Companys # partnerships - natural person - name sociedades__socio_pessoa_fisica__nome=OuterRef('nome'), uf=OuterRef('uf') ) return self.annotate( empresas=Exists(empresas_qs) )

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", EXISTS( SELECT U0."cnpj",
U0."nome", U0."uf_id" FROM "empresas_empresa" U0 INNER JOIN "empresas_sociedade" U1 ON (U0."cnpj" = U1."empresa_id") INNER JOIN "empresas_pessoafisica" U2 ON (U1."socio_pessoa_fisica_id" = U2."id") WHERE ( U2."nome" = ("politicos_deputado"."nome") AND U0."uf_id" = ("politicos_deputado"."uf_id") ) ) AS "empresas" FROM "politicos_deputado" ;

Making smarter queries with advanced ORM resources

Making smarter queries with advanced ORM resources

Labcodes Software Studio

More Decks by Labcodes Software Studio

Other Decks in Programming

Featured

Transcript

Hi :) Mariana Bedran Lesche Full-stack developer at Labcodes Software

Motivation Exploring the advantages and disadvantages of the advanced ORM

When to use this? Ran out of database optimization strategies?

Methodology - Choose a big dataset - Come up with

What went wrong - Procrastination - Had to validate the

Open data of public interest from Brazil Brazil has a

Real life is not that simple Though the data is

The Brasil IO project A project created by Álvaro Justen

Company partners from Brazil - Company CNPJ (unique identifier of

Company Empresa Natural Person Pessoa Física Foreigner Estrangeiro Partnership Sociedade

Chamber of Deputies spents Deputies on National Congress can spend

Expense Gasto Party Partido Deputy Deputado Company Empresa State Estado

Prefetch objects are great

# Deputy class DeputadoSerializer(serializers.ModelSerializer): class Meta: model = Deputado fields

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related(

from django.db.models import Prefetch class DeputadoQuerySet(models.QuerySet): # Deputy QuerySet def

Filters • month 3 gastos_mes=3 • year 2018 gastos_ano=2018 •

Aggregates can also be filtered

from django.db.models import Q, Sum # Avg, Max, Min, Variance

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM("politicos_gastocotaparlamentar"."valor_liquido") FILTER (

But there’s also FilteredRelations for that

from django.db.models import FilteredRelation, Q, Sum class DeputadoQuerySet(models.QuerySet): def annotate_gasto_no_mes_por_deputado2(self

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM(gastos_filtrados."valor_liquido") AS "gastos_2018_01"

You can also aggregate on annotations

from django.db.models import Avg, Sum, Q class DeputadoQuerySet(models.QuerySet): ... #

from django.db.models import Avg class DeputadoQuerySet(models.QuerySet): ... def get_media_mensal(self): #

Subqueries

from django.db.models import Exists, OuterRef class DeputadoQuerySet(models.QuerySet): # annotate companys

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", EXISTS( SELECT U0."cnpj",