Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A pythonic full-text search

A pythonic full-text search

-- A talk I gave at Django London Meetup 2020

Keeping in mind the pythonic principle that "simple is better than complex" we will see how to implement full-text search in a web service using only Django and PostgreSQL and we will analyse the advantages of this solution compared to more complex solutions based on dedicated search engines.

More info on https://www.paulox.net/2020/10/13/django-london-meetup-2020/

Paolo Melchiorre

October 13, 2020
Tweet

More Decks by Paolo Melchiorre

Other Decks in Technology

Transcript

  1. A PYTHONIC FULL-TEXT SEARCH
    PAOLO MELCHIORRE ~ @pauloxnet

    View Slide

  2. View Slide

  3. CTO @ 20tab
    • Remote worker
    • Software engineer
    • Python developer
    • Django contributor
    Paolo Melchiorre

    View Slide

  4. Paolo Melchiorre ~ @pauloxnet
    4
    Pythonic
    >>> import this
    “Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.”
    — “The Zen of Python”, Tim Peters

    View Slide

  5. Paolo Melchiorre ~ @pauloxnet
    5
    Full-text search
    “… techniques for searching
    … computer-stored document …
    in a full-text database.”
    — “Full-text search”, Wikipedia

    View Slide

  6. View Slide

  7. Paolo Melchiorre ~ @pauloxnet
    7
    PostgreSQL
    Full text search (v8.3 ~2008)
    Data type (tsquery, tsvector)
    Special indexes (GIN, GiST)
    Phrase search (v9.6 ~2016)
    JSON support (v10 ~2017)
    Web search (v11 ~2018)
    New languages (v12 ~2019)

    View Slide

  8. View Slide

  9. Paolo Melchiorre ~ @pauloxnet
    9
    Django
    Full text search (v1.10 ~2016)
    django.contrib.postgres
    Fields, expressions, functions
    GIN index (v1.11 ~2017)
    GiST index (v2.0 ~2018)
    Phrase search (v2.2 ~2019)
    Web search (v3.1 ~2020)

    View Slide

  10. View Slide

  11. Paolo Melchiorre ~ @pauloxnet
    """Blogs models."""
    from django.contrib.postgres import search
    from django.db import models
    class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()
    class Author(models.Model):
    name = models.CharField(max_length=200)
    class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    headline = models.CharField(max_length=255)
    body_text = models.TextField()
    authors = models.ManyToManyField(Author)
    search_vector = search.SearchVectorField()
    11

    View Slide

  12. Paolo Melchiorre ~ @pauloxnet
    """Field lookups."""
    from blog.models import Author
    Author.objects.filter(name__contains="Terry")
    [, ]
    Author.objects.filter(name__icontains="ERRY")
    [, , ]
    12

    View Slide

  13. Paolo Melchiorre ~ @pauloxnet
    """Unaccent extension."""
    from django.contrib.postgres import operations
    from django.db import migrations
    class Migration(migrations.Migration):
    operations = [operations.UnaccentExtension()]
    """Unaccent lookup."""
    from blog.models import Author
    Author.objects.filter(name__unaccent="Helene Joy")
    []
    13

    View Slide

  14. Paolo Melchiorre ~ @pauloxnet
    """Trigram extension."""
    from django.contrib.postgres import operations
    from django.db import migrations
    class Migration(migrations.Migration):
    operations = [operations.TrigramExtension()]
    """Trigram similar lookup."""
    from blog.models import Author
    Author.objects.filter(name__trigram_similar="helena")
    [, ]
    14

    View Slide

  15. Paolo Melchiorre ~ @pauloxnet
    """App installation."""
    INSTALLED_APPS = [
    # …
    "django.contrib.postgres",
    ]
    """Search lookup."""
    from blog.models import Entry
    Entry.objects.filter(body_text__search="cheeses")
    [, ]
    15

    View Slide

  16. Paolo Melchiorre ~ @pauloxnet
    """SearchVector function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("body_text", "blog__name")
    entries = Entry.objects.annotate(search=SEARCH_VECTOR)
    entries.filter(search="cheeses")
    [, ]
    16

    View Slide

  17. Paolo Melchiorre ~ @pauloxnet
    """SearchQuery expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("body_text")
    SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch")
    entries = Entry.objects.annotate(search=SEARCH_VECTOR)
    entries.filter(search=SEARCH_QUERY)
    [, ]
    17

    View Slide

  18. Paolo Melchiorre ~ @pauloxnet
    """SearchConfig expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("body_text", config="french")
    SEARCH_QUERY = search.SearchQuery("œuf", config="french")
    entries = Entry.objects.annotate(search=SEARCH_VECTOR)
    entries.filter(search=SEARCH_QUERY)
    []
    18

    View Slide

  19. Paolo Melchiorre ~ @pauloxnet
    """SearchRank function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("body_text")
    SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch")
    SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY)
    entries = Entry.objects.annotate(rank=SEARCH_RANK)
    entries.order_by("-rank").filter(rank__gt=0.01).values_list("headline", "rank")
    [('Pizza Recipes', 0.06079271), ('Cheese on Toast recipes', 0.044488445)]
    19

    View Slide

  20. Paolo Melchiorre ~ @pauloxnet
    """SearchVector weight attribute."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("headline", weight="A") \
    + search.SearchVector("body_text", weight="B")
    SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch")
    SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY)
    entries = Entry.objects.annotate(rank=SEARCH_RANK).order_by("-rank")
    entries.values_list("headline", "rank")
    [('Cheese on Toast recipes', 0.36), ('Pizza Recipes', 0.24), ('Pain perdu', 0)]
    20

    View Slide

  21. Paolo Melchiorre ~ @pauloxnet
    """SearchHeadline function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch")
    SEARCH_HEADLINE = search.SearchHeadline("headline", SEARCH_QUERY)
    entries = Entry.objects.annotate(highlighted_headline=SEARCH_HEADLINE)
    entries.values_list("highlighted_headline", flat=True)
    ['Cheese on Toast recipes', 'Pizza Recipes', 'Pain perdu']
    21

    View Slide

  22. Paolo Melchiorre ~ @pauloxnet
    """SearchVector field."""
    from django.contrib.postgres import search
    from blog.models import Entry
    SEARCH_VECTOR = search.SearchVector("body_text")
    SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch")
    Entry.objects.update(search_vector=SEARCH_VECTOR)
    Entry.objects.filter(search_vector=SEARCH_QUERY)
    [, ]
    22

    View Slide

  23. View Slide

  24. Paolo Melchiorre ~ @pauloxnet
    24
    An old search
    • English-only search
    • HTML tag in results
    • Sphinx generation
    • PostgreSQL database
    • External search engine

    View Slide

  25. Paolo Melchiorre ~ @pauloxnet
    25
    Django developers feedback
    PROS
    Maintenance
    Light setup
    Dogfooding
    CONS
    Work to do
    Features
    Database workload

    View Slide

  26. View Slide

  27. View Slide

  28. Paolo Melchiorre ~ @pauloxnet
    28
    djangoproject.com
    Full-text search features
    • Multilingual
    • PostgreSQL based
    • Clean results
    • Low maintenance
    • Easier to setup

    View Slide

  29. Paolo Melchiorre ~ @pauloxnet
    """Documentation models."""
    from django.contrib.postgres import search
    from django.db import models
    from docs.managers import DocumentManager
    class Document(models.Model):
    release = models.ForeignKey("DocumentRelease", on_delete=models.CASCADE)
    path = models.CharField(max_length=500)
    title = models.CharField(max_length=500)
    metadata = models.JSONField(default=dict)
    search = search.SearchVectorField(null=True, editable=False)
    config = models.SlugField(default="simple")
    objects = DocumentManager()
    29

    View Slide

  30. Paolo Melchiorre ~ @pauloxnet
    """Documentation managers."""
    from django.contrib.postgres import search
    from django.db import models
    class DocumentManager(models.Manager):
    def search(self, query_text, release_id):
    search_query = search.SearchQuery(query_text, config=models.F("config"))
    search_rank = search.SearchRank(models.F("search"), search_query)
    trigram_similarity = search.TrigramSimilarity("title", query_text)
    return (
    self.get_queryset()
    .filter(release_id=release_id, search=search_query)
    .annotate(rank=search_rank + trigram_similarity)
    .order_by("-rank")
    )
    30

    View Slide

  31. Paolo Melchiorre ~ @pauloxnet
    """Documentation search."""
    from django.db.models.fields.json import KeyTextTransform as Key
    from django.contrib.postgres.search import SearchVector
    from django.db.models import F
    DOCUMENT_SEARCH_VECTOR = (
    SearchVector("title", weight="A", config=F("config"))
    + SearchVector(Key("slug", "metadata"), weight="A", config=F("config"))
    + SearchVector(Key("toc", "metadata"), weight="B", config=F("config"))
    + SearchVector(Key("body", "metadata"), weight="C", config=F("config"))
    + SearchVector(Key("parents", "metadata"), weight="D", config=F("config"))
    )
    31

    View Slide

  32. Paolo Melchiorre ~ @pauloxnet
    """Documentation command."""
    from django.core.management.base import BaseCommand
    from django.db import transaction
    from docs.models import Document
    from docs.search import DOCUMENT_SEARCH_VECTOR
    class Command(BaseCommand):
    @transaction.atomic
    def handle(self, *args, **options):
    Document.objects.update(search=None)
    updated = Document.objects.update(search=DOCUMENT_SEARCH_VECTOR)
    self.stdout.write(f"Successfully indexed {updated} items.")
    32

    View Slide

  33. Paolo Melchiorre ~ @pauloxnet
    """Documentation tests."""
    from django.test import TestCase
    from docs.models import Document, DocumentRelease
    from docs.search import DOCUMENT_SEARCH_VECTOR
    from docs.tests import DOCUMENTS
    class DocumentManagerTest(TestCase):
    def setUp(self):
    self.release_en = DocumentRelease.objects.create(lang="en")
    slef.release_fr = DocumentRelease.objects.create(lang="fr")
    Document.objects.bulk_create(((Document(**d) for d in DOCUMENTS)))
    Document.objects.update(search=DOCUMENT_SEARCH_VECTOR)
    33

    View Slide

  34. Paolo Melchiorre ~ @pauloxnet
    [
    {
    "metadata": {
    "body": "Django 1.2.1 release notesDjango 1.2.1 release notes …",
    },
    "title": "Django 1.2.1 release notes",
    },
    {
    "metadata": {
    "body": "Notes de publication de Django 1.9.4Notes de publication de Django 1.9.4 …",
    },
    "title": "Notes de publication de Django 1.9.4",
    },
    ]
    34

    View Slide

  35. Paolo Melchiorre ~ @pauloxnet
    """Documentation tests."""
    # …
    def test_search(self):
    queryset = Document.objects.search("django", self.release_en.id)
    documents = (
    ("Django 1.2.1 release notes", 0.969828),
    ("Django 1.9.4 release notes", 0.949088),
    )
    self.assertCountEqual(queryset.values_list("title", "rank"), documents)
    def test_multilingual_search(self):
    queryset = Document.objects.search("publication", self.release_fr.id)
    documents = (
    ("Notes de publication de Django 1.2.1", 1.06933),
    ("Notes de publication de Django 1.9.4", 1.04587),
    )
    self.assertCountEqual(queryset.values_list("title", "rank"), documents)
    35

    View Slide

  36. Paolo Melchiorre ~ @pauloxnet
    36
    What’s next
    • Highlighted results
    • Web search syntax
    • …
    • Misspelling support
    • Search suggestions
    • Search statistics

    View Slide

  37. View Slide

  38. Paolo Melchiorre ~ @pauloxnet
    38
    Tips
    • docs in djangoproject.com
    • details in postgresql.org
    • source code in github.com
    • questions in stackoverflow.com

    View Slide

  39. Paolo Melchiorre ~ @pauloxnet
    39
    License
    CC BY-SA 4.0
    This work is licensed under
    a Creative Commons
    Attribution-ShareAlike 4.0
    International License.

    View Slide

  40. View Slide

  41. @20tab
    20tab
    20tab
    [email protected]
    20tab.com

    View Slide

  42. @pauloxnet
    paolomelchiorre
    pauloxnet
    [email protected]
    paulox.net

    View Slide