Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Hi :) Mariana Bedran Lesche Full-stack developer at Labcodes Software Studio @maribedran twitter.com gmail.com github.com https://github.com/labcodes/ dados_brasil_io

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Motivation Exploring the advantages and disadvantages of the advanced ORM resources comparing to the more usual ones in terms of performance and readability. After learning a bit of SQL I could write queries complex enough that I didn’t know how to implement using Django. Could I put the two of them together?

Slide 5

Slide 5 text

When to use this? Ran out of database optimization strategies? ✔ Index tables ✔ Paginate requests ✔ Use select_related and prefetch_related on querysets ✔ Use values, values_list, only and defer on querysets ✔ Setup database cache ✔ Use assertNumQueries method on tests

Slide 6

Slide 6 text

Methodology - Choose a big dataset - Come up with some complex questions - Try to answer them with the things I had only seen on the Django documentation and never tried to apply

Slide 7

Slide 7 text

What went wrong - Procrastination - Had to validate the data while importing it - My machine cannot handle loading too much data to memory at once - Importing the data took forever - Putting the technical knowledge together with the domain was much harder then I had thought

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Open data of public interest from Brazil Brazil has a law stating that all public data that’s not secret must be publicly available to anyone that has interest in it: the Information Access Bill (Lei de Acesso à Informação).

Slide 10

Slide 10 text

Real life is not that simple Though the data is there, the access is not guaranteed for non technical people who would benefit from the information they provide. - Not every government agency complies with the law or take a long time to make the data available - The data is spread among multiple sources - Many datasets are in non open formats - Most of the people who have interest on the data don’t have the technical skills to process it

Slide 11

Slide 11 text

The Brasil IO project A project created by Álvaro Justen (@turicas) to gather public data, clean it and make it available for those who want to research it. - Datasets: https://brasil.io/datasets - GitHub: https://github.com/turicas/brasil.io - Support the project: https://apoia.se/brasilio

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Company partners from Brazil - Company CNPJ (unique identifier of the National Legal Entities Registration) - Partner’s name - Partner’s category - Legal Entity - Natural Person - Foreign Partner - Partnership category - Partner’s CNPJ (if it’s a legal entity)

Slide 14

Slide 14 text

Company Empresa Natural Person Pessoa Física Foreigner Estrangeiro Partnership Sociedade State Estado

Slide 15

Slide 15 text

Chamber of Deputies spents Deputies on National Congress can spend money on products and services related to the parliamentary activity and receive a refund for that. - Deputy’s unique identifier - Deputy’s party - Deputy’s name - Date of the expense - Reference month - Reference year - Amount spent - Description - Company CNPJ (if it’s company) - Some other fields...

Slide 16

Slide 16 text

Expense Gasto Party Partido Deputy Deputado Company Empresa State Estado

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Prefetch objects are great

Slide 19

Slide 19 text

# Deputy class DeputadoSerializer(serializers.ModelSerializer): class Meta: model = Deputado fields = [ 'id', 'nome', # name 'partido', # party 'uf', # state 'gastos' # expenses ] depth = 2

Slide 20

Slide 20 text

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related( 'partido', 'uf' # party, state ).prefetch_related( 'gastos' # expenses ) # Will show all related expenses

Slide 21

Slide 21 text

class DeputadoListView(generics.ListAPIView): serializer_class = DeputadoSerializer def get_queryset(self): queryset = Deputado.objects.all().select_related( 'partido', 'uf' ) # Set default month/year filters today = date.today() filters = self.request.query_params.dict() filters.setdefault('gastos_mes', today.month) filters.setdefault('gastos_ano', today.year) return queryset.prefetch_gastos(**{ # prefetch expenses field.replace('gastos_', ''): value for field, value in filters.items() if field.startswith('gastos_') # expenses filters

Slide 22

Slide 22 text

from django.db.models import Prefetch class DeputadoQuerySet(models.QuerySet): # Deputy QuerySet def prefetch_gastos(self, **kwargs): # prefetch expenses # Expenses QuerySet gastos_qs = GastoCotaParlamentar.objects.select_related( 'empresa' # company ).filter(**kwargs) # Never do that on a real project! prefetch = Prefetch('gastos', queryset=gastos_qs) return self.prefetch_related(prefetch)

Slide 23

Slide 23 text

Filters ● month 3 gastos_mes=3 ● year 2018 gastos_ano=2018 ● net value greather than 100 gastos_valor_liquido__gt=100

Slide 24

Slide 24 text

Aggregates can also be filtered

Slide 25

Slide 25 text

from django.db.models import Q, Sum # Avg, Max, Min, Variance class DeputadoQuerySet(models.QuerySet): # annotate deputy’s expenses by month/year def annotate_gasto_no_mes_por_deputado(self, mes, ano): annotation = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # net value filter=Q( gastos__mes=mes, # month gastos__ano=ano # year ) ) } return self.annotate(**annotation)

Slide 26

Slide 26 text

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM("politicos_gastocotaparlamentar"."valor_liquido") FILTER ( WHERE ( "politicos_gastocotaparlamentar"."ano" = 2018 AND "politicos_gastocotaparlamentar"."mes" = 1 ) ) AS "gastos_2018_01" FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" ON ( "politicos_deputado"."id" = "politicos_gastocotaparlamentar"."deputado_id" ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;

Slide 27

Slide 27 text

But there’s also FilteredRelations for that

Slide 28

Slide 28 text

from django.db.models import FilteredRelation, Q, Sum class DeputadoQuerySet(models.QuerySet): def annotate_gasto_no_mes_por_deputado2(self , mes, ano): return self.annotate( gastos_filtrados=FilteredRelation( # filtered expenses 'gastos', condition=Q( gastos__mes=mes, # month gastos__ano=ano, # year ) ) ).annotate(**{ f'gastos_{ano}_{mes:02}': Sum( 'gastos_filtrados__valor_liquido' ) # filtered expense’s net value })

Slide 29

Slide 29 text

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", SUM(gastos_filtrados."valor_liquido") AS "gastos_2018_01" FROM "politicos_deputado" LEFT OUTER JOIN "politicos_gastocotaparlamentar" gastos_filtrados ON ( "politicos_deputado"."id" = gastos_filtrados."deputado_id" AND ( ( gastos_filtrados."ano" = 2018 AND gastos_filtrados."mes" = 1 ) ) ) GROUP BY "politicos_deputado"."id" ORDER BY "politicos_deputado"."id" ASC LIMIT 1;

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

You can also aggregate on annotations

Slide 33

Slide 33 text

from django.db.models import Avg, Sum, Q class DeputadoQuerySet(models.QuerySet): ... # annotate months expenses per deputy def annotate_gasto_mensal_por_deputado(self): meses = range(1, 13) # months anos = range(2009, 2019) # years annotations = { f'gastos_{ano}_{mes:02}': Sum( # ‘gastos_2018_01’ 'gastos__valor_liquido', # expenses net value filter=Q(gastos__mes=mes, gastos__ano=ano) ) for ano in anos for mes in meses } return self.annotate(**annotations)

Slide 34

Slide 34 text

from django.db.models import Avg class DeputadoQuerySet(models.QuerySet): ... def get_media_mensal(self): # get months average meses = range(1, 13) # month anos = range(2009, 2019) # year aggregations = { f'media_{ano}_{mes:02}': Avg( # ‘media_2018_01’ f'gastos_{ano}_{mes:02}' # gastos_2018_01 ) for ano in anos for mes in meses } # annotate month expenses per deputy return self.annotate_gasto_mensal_por_deputado() \ .aggregate(**aggregations)

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Subqueries

Slide 37

Slide 37 text

from django.db.models import Exists, OuterRef class DeputadoQuerySet(models.QuerySet): # annotate companys def annotate_empresas(self): empresas_qs = Empresa.objects.filter( # Companys # partnerships - natural person - name sociedades__socio_pessoa_fisica__nome=OuterRef('nome'), uf=OuterRef('uf') ) return self.annotate( empresas=Exists(empresas_qs) )

Slide 38

Slide 38 text

SELECT "politicos_deputado"."id", "politicos_deputado"."nome", "politicos_deputado"."partido_id", "politicos_deputado"."uf_id", "politicos_deputado"."id_legislatura", "politicos_deputado"."carteira_parlamentar", EXISTS( SELECT U0."cnpj", U0."nome", U0."uf_id" FROM "empresas_empresa" U0 INNER JOIN "empresas_sociedade" U1 ON (U0."cnpj" = U1."empresa_id") INNER JOIN "empresas_pessoafisica" U2 ON (U1."socio_pessoa_fisica_id" = U2."id") WHERE ( U2."nome" = ("politicos_deputado"."nome") AND U0."uf_id" = ("politicos_deputado"."uf_id") ) ) AS "empresas" FROM "politicos_deputado" ;

Slide 39

Slide 39 text

No content