A pythonic full-text search

A pythonic full-text search

-- A talk I gave at Django London Meetup 2020

Keeping in mind the pythonic principle that "simple is better than complex" we will see how to implement full-text search in a web service using only Django and PostgreSQL and we will analyse the advantages of this solution compared to more complex solutions based on dedicated search engines.

More info on https://www.paulox.net/2020/10/13/django-london-meetup-2020/

6b8e2101579190ad96e747e01c279898?s=128

Paolo Melchiorre

October 13, 2020
Tweet

Transcript

  1. A PYTHONIC FULL-TEXT SEARCH PAOLO MELCHIORRE ~ @pauloxnet

  2. None
  3. CTO @ 20tab • Remote worker • Software engineer •

    Python developer • Django contributor Paolo Melchiorre
  4. Paolo Melchiorre ~ @pauloxnet 4 Pythonic >>> import this “Beautiful

    is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.” — “The Zen of Python”, Tim Peters
  5. Paolo Melchiorre ~ @pauloxnet 5 Full-text search “… techniques for

    searching … computer-stored document … in a full-text database.” — “Full-text search”, Wikipedia
  6. None
  7. Paolo Melchiorre ~ @pauloxnet 7 PostgreSQL Full text search (v8.3

    ~2008) Data type (tsquery, tsvector) Special indexes (GIN, GiST) Phrase search (v9.6 ~2016) JSON support (v10 ~2017) Web search (v11 ~2018) New languages (v12 ~2019)
  8. None
  9. Paolo Melchiorre ~ @pauloxnet 9 Django Full text search (v1.10

    ~2016) django.contrib.postgres Fields, expressions, functions GIN index (v1.11 ~2017) GiST index (v2.0 ~2018) Phrase search (v2.2 ~2019) Web search (v3.1 ~2020)
  10. None
  11. Paolo Melchiorre ~ @pauloxnet """Blogs models.""" from django.contrib.postgres import search

    from django.db import models class Blog(models.Model): name = models.CharField(max_length=100) tagline = models.TextField() class Author(models.Model): name = models.CharField(max_length=200) class Entry(models.Model): blog = models.ForeignKey(Blog, on_delete=models.CASCADE) headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField(Author) search_vector = search.SearchVectorField() 11
  12. Paolo Melchiorre ~ @pauloxnet """Field lookups.""" from blog.models import Author

    Author.objects.filter(name__contains="Terry") [<Author: Terry Gilliam>, <Author: Terry Jones>] Author.objects.filter(name__icontains="ERRY") [<Author: Terry Gilliam>, <Author: Terry Jones>, <Author: Jerry Lewis>] 12
  13. Paolo Melchiorre ~ @pauloxnet """Unaccent extension.""" from django.contrib.postgres import operations

    from django.db import migrations class Migration(migrations.Migration): operations = [operations.UnaccentExtension()] """Unaccent lookup.""" from blog.models import Author Author.objects.filter(name__unaccent="Helene Joy") [<Author: Hélène Joy>] 13
  14. Paolo Melchiorre ~ @pauloxnet """Trigram extension.""" from django.contrib.postgres import operations

    from django.db import migrations class Migration(migrations.Migration): operations = [operations.TrigramExtension()] """Trigram similar lookup.""" from blog.models import Author Author.objects.filter(name__trigram_similar="helena") [<Author: Helen Mirren>, <Author: Helena Bonham Carter>] 14
  15. Paolo Melchiorre ~ @pauloxnet """App installation.""" INSTALLED_APPS = [ #

    … "django.contrib.postgres", ] """Search lookup.""" from blog.models import Entry Entry.objects.filter(body_text__search="cheeses") [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 15
  16. Paolo Melchiorre ~ @pauloxnet """SearchVector function.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text", "blog__name") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search="cheeses") [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 16
  17. Paolo Melchiorre ~ @pauloxnet """SearchQuery expression.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search=SEARCH_QUERY) [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 17
  18. Paolo Melchiorre ~ @pauloxnet """SearchConfig expression.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text", config="french") SEARCH_QUERY = search.SearchQuery("œuf", config="french") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search=SEARCH_QUERY) [<Entry: Pain perdu>] 18
  19. Paolo Melchiorre ~ @pauloxnet """SearchRank function.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch") SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY) entries = Entry.objects.annotate(rank=SEARCH_RANK) entries.order_by("-rank").filter(rank__gt=0.01).values_list("headline", "rank") [('Pizza Recipes', 0.06079271), ('Cheese on Toast recipes', 0.044488445)] 19
  20. Paolo Melchiorre ~ @pauloxnet """SearchVector weight attribute.""" from django.contrib.postgres import

    search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("headline", weight="A") \ + search.SearchVector("body_text", weight="B") SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch") SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY) entries = Entry.objects.annotate(rank=SEARCH_RANK).order_by("-rank") entries.values_list("headline", "rank") [('Cheese on Toast recipes', 0.36), ('Pizza Recipes', 0.24), ('Pain perdu', 0)] 20
  21. Paolo Melchiorre ~ @pauloxnet """SearchHeadline function.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") SEARCH_HEADLINE = search.SearchHeadline("headline", SEARCH_QUERY) entries = Entry.objects.annotate(highlighted_headline=SEARCH_HEADLINE) entries.values_list("highlighted_headline", flat=True) ['Cheese on <b>Toast</b> recipes', '<b>Pizza</b> Recipes', 'Pain perdu'] 21
  22. Paolo Melchiorre ~ @pauloxnet """SearchVector field.""" from django.contrib.postgres import search

    from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") Entry.objects.update(search_vector=SEARCH_VECTOR) Entry.objects.filter(search_vector=SEARCH_QUERY) [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 22
  23. None
  24. Paolo Melchiorre ~ @pauloxnet 24 An old search • English-only

    search • HTML tag in results • Sphinx generation • PostgreSQL database • External search engine
  25. Paolo Melchiorre ~ @pauloxnet 25 Django developers feedback PROS Maintenance

    Light setup Dogfooding CONS Work to do Features Database workload
  26. None
  27. None
  28. Paolo Melchiorre ~ @pauloxnet 28 djangoproject.com Full-text search features •

    Multilingual • PostgreSQL based • Clean results • Low maintenance • Easier to setup
  29. Paolo Melchiorre ~ @pauloxnet """Documentation models.""" from django.contrib.postgres import search

    from django.db import models from docs.managers import DocumentManager class Document(models.Model): release = models.ForeignKey("DocumentRelease", on_delete=models.CASCADE) path = models.CharField(max_length=500) title = models.CharField(max_length=500) metadata = models.JSONField(default=dict) search = search.SearchVectorField(null=True, editable=False) config = models.SlugField(default="simple") objects = DocumentManager() 29
  30. Paolo Melchiorre ~ @pauloxnet """Documentation managers.""" from django.contrib.postgres import search

    from django.db import models class DocumentManager(models.Manager): def search(self, query_text, release_id): search_query = search.SearchQuery(query_text, config=models.F("config")) search_rank = search.SearchRank(models.F("search"), search_query) trigram_similarity = search.TrigramSimilarity("title", query_text) return ( self.get_queryset() .filter(release_id=release_id, search=search_query) .annotate(rank=search_rank + trigram_similarity) .order_by("-rank") ) 30
  31. Paolo Melchiorre ~ @pauloxnet """Documentation search.""" from django.db.models.fields.json import KeyTextTransform

    as Key from django.contrib.postgres.search import SearchVector from django.db.models import F DOCUMENT_SEARCH_VECTOR = ( SearchVector("title", weight="A", config=F("config")) + SearchVector(Key("slug", "metadata"), weight="A", config=F("config")) + SearchVector(Key("toc", "metadata"), weight="B", config=F("config")) + SearchVector(Key("body", "metadata"), weight="C", config=F("config")) + SearchVector(Key("parents", "metadata"), weight="D", config=F("config")) ) 31
  32. Paolo Melchiorre ~ @pauloxnet """Documentation command.""" from django.core.management.base import BaseCommand

    from django.db import transaction from docs.models import Document from docs.search import DOCUMENT_SEARCH_VECTOR class Command(BaseCommand): @transaction.atomic def handle(self, *args, **options): Document.objects.update(search=None) updated = Document.objects.update(search=DOCUMENT_SEARCH_VECTOR) self.stdout.write(f"Successfully indexed {updated} items.") 32
  33. Paolo Melchiorre ~ @pauloxnet """Documentation tests.""" from django.test import TestCase

    from docs.models import Document, DocumentRelease from docs.search import DOCUMENT_SEARCH_VECTOR from docs.tests import DOCUMENTS class DocumentManagerTest(TestCase): def setUp(self): self.release_en = DocumentRelease.objects.create(lang="en") slef.release_fr = DocumentRelease.objects.create(lang="fr") Document.objects.bulk_create(((Document(**d) for d in DOCUMENTS))) Document.objects.update(search=DOCUMENT_SEARCH_VECTOR) 33
  34. Paolo Melchiorre ~ @pauloxnet [ { "metadata": { "body": "<h1>Django

    1.2.1 release notes<a title='Permalink to …", "parents": "releases", "slug": "1.2.1", "toc": "<ul><li><a href='#'>Django 1.2.1 release notes</a></li> …", }, "title": "Django 1.2.1 release notes", }, { "metadata": { "body": "<h1>Notes de publication de Django 1.9.4<a title='Lien …", "parents": "releases", "slug": "1.9.4", "toc": "<ul><li><a href='#'>Notes de publication de Django 1.9.4 …", }, "title": "Notes de publication de Django 1.9.4", }, ] 34
  35. Paolo Melchiorre ~ @pauloxnet """Documentation tests.""" # … def test_search(self):

    queryset = Document.objects.search("django", self.release_en.id) documents = ( ("Django 1.2.1 release notes", 0.969828), ("Django 1.9.4 release notes", 0.949088), ) self.assertCountEqual(queryset.values_list("title", "rank"), documents) def test_multilingual_search(self): queryset = Document.objects.search("publication", self.release_fr.id) documents = ( ("Notes de publication de Django 1.2.1", 1.06933), ("Notes de publication de Django 1.9.4", 1.04587), ) self.assertCountEqual(queryset.values_list("title", "rank"), documents) 35
  36. Paolo Melchiorre ~ @pauloxnet 36 What’s next • Highlighted results

    • Web search syntax • … • Misspelling support • Search suggestions • Search statistics
  37. None
  38. Paolo Melchiorre ~ @pauloxnet 38 Tips • docs in djangoproject.com

    • details in postgresql.org • source code in github.com • questions in stackoverflow.com
  39. Paolo Melchiorre ~ @pauloxnet 39 License CC BY-SA 4.0 This

    work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
  40. None
  41. @20tab 20tab 20tab info@20tab.com 20tab.com

  42. @pauloxnet paolomelchiorre pauloxnet paolo@melchiorre.org paulox.net