Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A pythonic full-text search - PyCon US 2023

A pythonic full-text search - PyCon US 2023

-- A talk I gave at PyCon US 2023

Keeping in mind the Pythonic principle that “simple is better than complex” we’ll see how to implement full-text search in a web service using only the latest versions of Django and PostgreSQL and we’ll analyze the advantages compared to more complex solutions based on external services.

https://www.paulox.net/2023/04/23/pycon-us-2023/

Paolo Melchiorre

April 23, 2023
Tweet

More Decks by Paolo Melchiorre

Other Decks in Technology

Transcript

  1. Paolo Melchiorre paulox.net • CTO @ 20tab • Software engineer

    • Python developer • Django contributor © 2022 Bartek Pawlik (CC BY-NC-SA)
  2. Paolo Melchiorre >>> import this The Zen of Python, by

    Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. ... 4
  3. “ Paolo Melchiorre 5 <… techniques for searching … computer-stored

    document … in a full-text database= — Wikipedia <Full-text search=
  4. Paolo Melchiorre 7 External search engines PROS Popular Full featured

    Resources CONS Driver Query language Synchronization
  5. Paolo Melchiorre 10 PostgreSQL Full text search (v8.3 ~2008) Data

    type (tsquery, tsvector) Special indexes (GIN, GiST) Phrase search (v9.6 ~2016) JSON support (v10 ~2017) Web search (v11 ~2018) New languages (v12-14 ~2019-2021)
  6. “ Paolo Melchiorre 11 <… the unit of searching in

    a full-text search system e.g. a magazine article …= — PostgreSQL <Document=
  7. Paolo Melchiorre 14 Django Full text search (v1.10 ~2016) django.contrib.postgres

    Fields, expressions, functions GIN index (v1.11 ~2017) GiST index (v2.0 ~2018) Phrase search (v2.2 ~2019) Web search (v3.1 ~2020)
  8. Paolo Melchiorre 17 """Blogs models.""" from django.contrib.postgres import search from

    django.db import models class Author(models.Model): name = models.CharField(max_length=200) class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField(Author)
  9. Paolo Melchiorre 18 """Field lookups.""" from blog.models import Author Author.objects.filter(name__contains="Jerry")

    [<Author: Jerry Lewis>] Author.objects.filter(name__icontains="TERRY") [<Author: Terry Gilliam>, <Author: Terry Jones>]
  10. Paolo Melchiorre 19 """Application definition.""" INSTALLED_APPS = [ "django.contrib.admin", "django.contrib.auth",

    "django.contrib.contenttypes", "django.contrib.sessions", "django.contrib.messages", "django.contrib.staticfiles", "django.contrib.postgres", "blog", ]
  11. Paolo Melchiorre 20 """Trigram extension.""" from django.contrib.postgres import operations from

    django.db import migrations class Migration(migrations.Migration): operations = [operations.TrigramExtension()]
  12. Paolo Melchiorre 21 """Trigram similar lookup.""" from blog.models import Author

    Author.objects.filter(name__trigram_similar="jerry ones") [<Author: Terry Jones>, <Author: Jerry Lewis>]
  13. Paolo Melchiorre 22 """Search lookup.""" from blog.models import Entry entries

    = Entry.objects.all() entries.filter(headline__search="any cheeses") [<Entry: Cheese on Toast recipes>]
  14. Paolo Melchiorre 23 """SearchVector function.""" from django.contrib.postgres import search from

    blog.models import Entry V = search.SearchVector("headline", "body_text") entries = Entry.objects.annotate(search=V) entries.filter(search="cheeses") [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]
  15. Paolo Melchiorre 24 """SearchQuery expression.""" from django.contrib.postgres import search from

    blog.models import Entry V = search.SearchVector("headline", "body_text") Q = search.SearchQuery("cheese -top", search_type="websearch") entries = Entry.objects.annotate(search=V) entries.filter(search=Q) [<Entry: Cheese on Toast recipes>]
  16. Paolo Melchiorre 25 """Search configuration expression.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("body_text", config="french") Q = search.SearchQuery("œuf", config="french") entries = Entry.objects.annotate(search=V) entries.filter(search=Q) [<Entry: Pain perdu>]
  17. Paolo Melchiorre """SearchRank function.""" from django.contrib.postgres import search from blog.models

    import Entry V = search.SearchVector("headline","body_text") Q = search.SearchQuery("cheese") R = search.SearchRank(V, Q) entries = Entry.objects.annotate(rank=R).order_by("-rank") entries.filter(rank__gt=0.05).values_list("headline", "rank") [('Cheese on Toast recipes', 0.09), ('Pizza Recipes', 0.06)] 26
  18. Paolo Melchiorre 27 """SearchVector weight attribute.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("headline", weight="A") W = search.SearchVector("body_text", weight="B") Q = search.SearchQuery("cheese") R = search.SearchRank(V + W, Q) entries = Entry.objects.annotate(rank=R).order_by("-rank") entries.filter(rank__gt=0.05).values_list("headline", "rank") [('Cheese on Toast recipes', 0.72), ('Pizza Recipes', 0.24)]
  19. Paolo Melchiorre 28 """SearchHeadline function.""" from django.contrib.postgres import search from

    blog.models import Entry V = search.SearchVector("headline","body_text") Q = search.SearchQuery("cheeses") H = search.SearchHeadline("headline", Q) entries = Entry.objects.annotate(search=V, highlight=H) entries.filter(search=Q).values_list("highlight", flat=True) ["<b>Cheese</b> on Toast recipes", "Pizza Recipes"]
  20. Paolo Melchiorre 29 """SearchVector functional index.""" from django.contrib.postgres import indexes,

    search from django.db import models V = search.SearchVector("headline", config="english") class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField("Author") class Meta: indexes = [indexes.GinIndex(V, name="search_idx")]
  21. Paolo Melchiorre 30 """SearchVector functional index query.""" from django.contrib.postgres import

    search from blog.models import Entry V = search.SearchVector("headline", config="english") Q = search.SearchQuery("cheeses") entries = Entry.objects.annotate(search=V) entries.filter(search=Q) [<Entry: Cheese on Toast recipes>]
  22. Paolo Melchiorre 31 """SearchVector field.""" from django.contrib.postgres import search from

    django.db import models class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField("Author") search_vector = search.SearchVectorField() class Meta: indexes = [indexes.GinIndex(fields=["search_vector"])]
  23. Paolo Melchiorre 32 """SearchVector field update.""" from django.contrib.postgres import aggregates

    as agg, search from django.db import models from blog.models import Entry V = search.SearchVector("headline", "body_text") W = search.SearchVector(agg.StringAgg("authors__name", " ")) entries = Entry.objects.filter(id=models.OuterRef("id")) results = entries.annotate(search=V + W).values("search")[:1] Entry.objects.update(search_vector=models.Subquery(results)) 3
  24. Paolo Melchiorre 33 """SearchVector field query.""" from django.contrib.postgres import search

    from blog.models import Entry Q = search.SearchQuery("cheeses") entries = Entry.objects.all() entries.filter(search_vector=Q) [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]
  25. Paolo Melchiorre 35 An old search • English-only search •

    HTML tag in results • Sphinx generation • PostgreSQL database • External search engine
  26. Paolo Melchiorre 36 Django developers feedback PROS Maintenance Light setup

    Dogfooding CONS Work to do Features DB workload
  27. Paolo Melchiorre 40 """Django documentation search document definition.""" from django.contrib.postgres.search

    import SearchVector as V from django.db.models import F from django.db.models.fields.json import KeyTextTransform as K DOCUMENT_SEARCH_VECTOR = ( V('title', weight='A', config=F('config')) + V(K('slug', 'metadata'), weight='A', config=F('config')) + V(K('toc', 'metadata'), weight='B', config=F('config')) + V(K('body', 'metadata'), weight='C', config=F('config')) + V(K('parents', 'metadata'), weight='D', config=F('config')) )
  28. Paolo Melchiorre 41 djangoproject.com full-text search features • Multilingual •

    PostgreSQL-only • Clean results • Low maintenance • Easier to setup • Web search syntax
  29. Paolo Melchiorre 43 DjangoProject.com search supported languages* Arabic Armenian Basque

    Catalan Danish Dutch English Finnish French German Greek Hindi Hungarian Indonesian Irish Italian Lithuanian Nepali Norwegian Portuguese Romanian Russian Serbian Spanish Swedish Tamil Turkish Yiddish
  30. Paolo Melchiorre 44 What’s next • Misspelling support • Search

    suggestions • Search statistics • Autocomplete • ...
  31. Paolo Melchiorre 45 Tips and Tricks • docs in djangoproject.com

    • details in postgresql.org • source code in github.com • questions in stackoverflow.com
  32. Paolo Melchiorre 48 Thanks • Participants • Speakers • Organizers

    • Volunteers from all conferences Grazie /ˈɡrat.t ͡ sje/ © 2017 Alessia Peviani (CC BY-NC)