Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A pythonic full-text search - DjangoCon US 2022

A pythonic full-text search - DjangoCon US 2022

-- A talk I gave at DjangoCon US 2022

Keeping in mind the pythonic principle that “simple is better than complex” we’ll see how to implement full-text search in a web service using only latest versions of Django and PostgreSQL and we’ll analyze the advantages compared to more complex solutions based on external services.

https://www.paulox.net/2022/10/19/djangocon-us-2022/

Paolo Melchiorre

October 19, 2022
Tweet

More Decks by Paolo Melchiorre

Other Decks in Technology

Transcript

  1. Paolo Melchiorre ~ @pauloxnet @pauloxnet • CTO @ 20tab •

    Software engineer • Python developer • Django contributor Paolo Melchiorre 3 DjangoCon Europe 2019 - Bartek Pawlik (CC BY-NC-SA)
  2. Paolo Melchiorre ~ @pauloxnet 4 Pythonic >>> import this The

    Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. ...
  3. Paolo Melchiorre ~ @pauloxnet 5 Full-text search “… techniques for

    searching … computer-stored document … in a full-text database.” — “Full-text search”, Wikipedia
  4. Paolo Melchiorre ~ @pauloxnet 7 External search engines PROS Popular

    Full featured Resources CONS Driver Query language Synchronization
  5. Paolo Melchiorre ~ @pauloxnet PostgreSQL 9 Photo by Nam Anh

    on Unsplash Elephant walking during daytime 2019
  6. Paolo Melchiorre ~ @pauloxnet 10 PostgreSQL Full text search (v8.3

    ~2008) Data type (tsquery, tsvector) Special indexes (GIN, GiST) Phrase search (v9.6 ~2016) JSON support (v10 ~2017) Web search (v11 ~2018) New languages (v12-14 ~2019-2021)
  7. Paolo Melchiorre ~ @pauloxnet 11 Document “… the unit of

    searching in a full-text search system; e.g., a magazine article …” — “Full Text Search”, PostgreSQL Documentation
  8. Paolo Melchiorre ~ @pauloxnet Django 13 William Gottlieb (Public domain)

    Django Reinhardt at the Aquarium jazz club in New York, NY 1946
  9. Paolo Melchiorre ~ @pauloxnet 14 Django Full text search (v1.10

    ~2016) django.contrib.postgres Fields, expressions, functions GIN index (v1.11 ~2017) GiST index (v2.0 ~2018) Phrase search (v2.2 ~2019) Web search (v3.1 ~2020)
  10. Paolo Melchiorre ~ @pauloxnet 15 Document-based search • Weighting •

    Categorization • Highlighting • Multiple languages
  11. Paolo Melchiorre ~ @pauloxnet """Blogs models.""" from django.contrib.postgres import search

    from django.db import models class Author(models.Model): name = models.CharField(max_length=200) class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField(Author) 17
  12. Paolo Melchiorre ~ @pauloxnet """Field lookups.""" from blog.models import Author

    Author.objects.filter(name__contains="Jerry") [<Author: Jerry Lewis>] Author.objects.filter(name__icontains="TERRY") [<Author: Terry Gilliam>, <Author: Terry Jones>] 18
  13. Paolo Melchiorre ~ @pauloxnet """Application definition.""" INSTALLED_APPS = [ "django.contrib.admin",

    "django.contrib.auth", "django.contrib.contenttypes", "django.contrib.sessions", "django.contrib.messages", "django.contrib.staticfiles", "django.contrib.postgres", "blog", ] 19
  14. Paolo Melchiorre ~ @pauloxnet """Trigram extension.""" from django.contrib.postgres import operations

    from django.db import migrations class Migration(migrations.Migration): operations = [operations.TrigramExtension()] 20
  15. Paolo Melchiorre ~ @pauloxnet """Trigram similar lookup.""" from blog.models import

    Author Author.objects.filter(name__trigram_similar="jerry ones") [<Author: Terry Jones>, <Author: Jerry Lewis>] 21
  16. Paolo Melchiorre ~ @pauloxnet """Search lookup.""" from blog.models import Entry

    Entry.objects.filter(headline__search="any cheeses") [<Entry: Cheese on Toast recipes>] 22
  17. Paolo Melchiorre ~ @pauloxnet """SearchVector function.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("headline", "body_text") entries = Entry.objects.annotate(search=V) entries.filter(search="cheeses") [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 23
  18. Paolo Melchiorre ~ @pauloxnet """SearchQuery expression.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("headline", "body_text") Q = search.SearchQuery("cheese -top", search_type="websearch") entries = Entry.objects.annotate(search=V) entries.filter(search=Q) [<Entry: Cheese on Toast recipes>] 24
  19. Paolo Melchiorre ~ @pauloxnet """SearchConfig expression.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("body_text", config="french") Q = search.SearchQuery("œuf", config="french") entries = Entry.objects.annotate(search=V) entries.filter(search=Q) [<Entry: Pain perdu>] 25
  20. Paolo Melchiorre ~ @pauloxnet """SearchRank function.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("headline","body_text") Q = search.SearchQuery("cheese") R = search.SearchRank(V, Q) entries = Entry.objects.annotate(rank=R).order_by("-rank") entries.filter(rank__gt=0.05).values_list("headline", "rank") [('Cheese on Toast recipes', 0.09), ('Pizza Recipes', 0.06)] 26
  21. Paolo Melchiorre ~ @pauloxnet """SearchVector weight attribute.""" from django.contrib.postgres import

    search from blog.models import Entry V = search.SearchVector("headline", weight="A") W = search.SearchVector("body_text", weight="B") Q = search.SearchQuery("cheese") R = search.SearchRank(V + W, Q) entries = Entry.objects.annotate(rank=R).order_by("-rank") entries.filter(rank__gt=0.05).values_list("headline", "rank") [('Cheese on Toast recipes', 0.72), ('Pizza Recipes', 0.24)] 27
  22. Paolo Melchiorre ~ @pauloxnet """SearchHeadline function.""" from django.contrib.postgres import search

    from blog.models import Entry V = search.SearchVector("headline","body_text") Q = search.SearchQuery("cheeses") H = search.SearchHeadline("headline", Q) entries = Entry.objects.annotate(search=V, highlight=H) entries.filter(search=Q).values_list("highlight", flat=True) ["<b>Cheese</b> on Toast recipes", "Pizza Recipes"] 28
  23. Paolo Melchiorre ~ @pauloxnet """SearchVector functional index.""" from django.contrib.postgres import

    indexes, search from django.db import models V = search.SearchVector("headline", config="english") class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField("Author") class Meta: indexes = [indexes.GinIndex(V, name="search_idx")] 29
  24. Paolo Melchiorre ~ @pauloxnet """SearchVector functional index query.""" from django.contrib.postgres

    import search from blog.models import Entry V = search.SearchVector("headline", config="english") entries = Entry.objects.annotate(search=V) entries.filter(search="cheeses") [<Entry: Cheese on Toast recipes>] 30
  25. Paolo Melchiorre ~ @pauloxnet """SearchVector field.""" from django.contrib.postgres import search

    from django.db import models class Entry(models.Model): headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField("Author") search_vector = search.SearchVectorField() class Meta: indexes = [indexes.GinIndex(fields=["search_vector"])] 31
  26. Paolo Melchiorre ~ @pauloxnet """SearchVector field update.""" from django.contrib.postgres import

    aggregates as agg, search from django.db import models from blog.models import Entry V = search.SearchVector("headline", "body_text") W = search.SearchVector(agg.StringAgg("authors__name", " ")) entries = Entry.objects.filter(id=models.OuterRef("id")) results = entries.annotate(search=V + W).values("search")[:1] Entry.objects.update(search_vector=models.Subquery(results)) 3 32
  27. Paolo Melchiorre ~ @pauloxnet """SearchVector field query.""" from blog.models import

    Entry Entry.objects.filter(search_vector="cheeses") [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 33
  28. Paolo Melchiorre ~ @pauloxnet 35 An old search • English-only

    search • HTML tag in results • Sphinx generation • PostgreSQL database • External search engine
  29. Paolo Melchiorre ~ @pauloxnet 36 Django developers feedback PROS Maintenance

    Light setup Dogfooding CONS Work to do Features DB workload
  30. Paolo Melchiorre ~ @pauloxnet """Django documentation search document definition.""" from

    django.contrib.postgres.search import SearchVector as V from django.db.models import F from django.db.models.fields.json import KeyTextTransform as K DOCUMENT_SEARCH_VECTOR = ( V('title', weight='A', config=F('config')) + V(K('slug', 'metadata'), weight='A', config=F('config')) + V(K('toc', 'metadata'), weight='B', config=F('config')) + V(K('body', 'metadata'), weight='C', config=F('config')) + V(K('parents', 'metadata'), weight='D', config=F('config')) ) 40
  31. Paolo Melchiorre ~ @pauloxnet 41 djangoproject.com Full-text search features •

    Multilingual • PostgreSQL-only • Clean results • Low maintenance • Easier to setup • Web search syntax
  32. Paolo Melchiorre ~ @pauloxnet • Irish • Italian • Lithuanian

    • Nepali • Norwegian • Portuguese • Romanian • Finnish • French • German • Greek • Hindi • Hungarian • Indonesian • Arabic • Armenian • Basque • Catalan • Danish • Dutch • English 43 DjangoProject.com supported languages* • Russian • Serbian • Spanish • Swedish • Tamil • Turkish • Yiddish
  33. Paolo Melchiorre ~ @pauloxnet 44 What’s next • Misspelling support

    • Search suggestions • Search statistics • Autocomplete • ...
  34. Paolo Melchiorre ~ @pauloxnet 45 Tips • docs in djangoproject.com

    • details in postgresql.org • source code in github.com • questions in stackoverflow.com
  35. Paolo Melchiorre ~ @pauloxnet 46 License CC BY-SA 4.0 This

    work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
  36. Paolo Melchiorre ~ @pauloxnet • Participants • Speakers • Organizers

    • Volunteers from all conferences 50 Thanks Grazie /ˈɡrat.t ͡ sje/ DjangoCon US 2022 - Paolo Melchiorre (CC BY-SA)