A pythonic full-text search - DjangoCon US 2022

-- A talk I gave at DjangoCon US 2022

Keeping in mind the Pythonic principle that “simple is better than complex”, we’ll see how to implement full-text search in a web service using only the latest versions of Django and PostgreSQL, and we’ll analyze its advantages over more complex solutions based on external services.

https://www.paulox.net/2022/10/19/djangocon-us-2022/

Paolo Melchiorre

October 19, 2022

Transcript

  1. A PYTHONIC FULL-TEXT SEARCH
    PAOLO MELCHIORRE ~ @pauloxnet

  3. Paolo Melchiorre (@pauloxnet)
    • CTO @ 20tab
    • Software engineer
    • Python developer
    • Django contributor
    DjangoCon Europe 2019 - Bartek Pawlik (CC BY-NC-SA)

  4. Pythonic
    >>> import this
    The Zen of Python, by Tim Peters
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    ...

  5. Full-text search
    “… techniques for searching
    … computer-stored document …
    in a full-text database.”
    — “Full-text search”, Wikipedia

  6. Popular search engines

  7. External search engines
    PROS
    Popular
    Full featured
    Resources
    CONS
    Driver
    Query language
    Synchronization

  8. External search engines synchronization

  9. PostgreSQL
    Photo by Nam Anh on Unsplash
    Elephant walking during daytime 2019

  10. PostgreSQL
    Full text search (v8.3 ~2008)
    Data type (tsquery, tsvector)
    Special indexes (GIN, GiST)
    Phrase search (v9.6 ~2016)
    JSON support (v10 ~2017)
    Web search (v11 ~2018)
    New languages (v12-14 ~2019-2021)

  11. Document
    “… the unit of searching
    in a full-text search system;
    e.g., a magazine article …”
    — “Full Text Search”, PostgreSQL Documentation

  13. Django
    William Gottlieb (Public domain)
    Django Reinhardt at the Aquarium jazz club in New York, NY 1946

  14. Django
    Full text search (v1.10 ~2016)
    django.contrib.postgres
    Fields, expressions, functions
    GIN index (v1.11 ~2017)
    GiST index (v2.0 ~2018)
    Phrase search (v2.2 ~2019)
    Web search (v3.1 ~2020)

  15. Document-based search
    • Weighting
    • Categorization
    • Highlighting
    • Multiple languages

  17. """Blogs models."""
    from django.contrib.postgres import search
    from django.db import models
    class Author(models.Model):
        name = models.CharField(max_length=200)
    class Entry(models.Model):
        headline = models.CharField(max_length=255)
        body_text = models.TextField()
        authors = models.ManyToManyField(Author)

  18. """Field lookups."""
    from blog.models import Author
    Author.objects.filter(name__contains="Jerry")
    []
    Author.objects.filter(name__icontains="TERRY")
    [, ]

  19. """Application definition."""
    INSTALLED_APPS = [
        "django.contrib.admin",
        "django.contrib.auth",
        "django.contrib.contenttypes",
        "django.contrib.sessions",
        "django.contrib.messages",
        "django.contrib.staticfiles",
        "django.contrib.postgres",
        "blog",
    ]

  20. """Trigram extension."""
    from django.contrib.postgres import operations
    from django.db import migrations
    class Migration(migrations.Migration):
        operations = [operations.TrigramExtension()]

  21. """Trigram similar lookup."""
    from blog.models import Author
    Author.objects.filter(name__trigram_similar="jerry ones")
    [, ]
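
The `trigram_similar` lookup relies on PostgreSQL's `pg_trgm` extension, which compares strings by the three-character sequences they share. What follows is a minimal pure-Python sketch of that idea, not the extension's actual implementation: the helper names `trigrams` and `similarity` are illustrative, only ASCII letters and digits are treated as word characters, and the 0.3 cutoff mentioned in the comment is the extension's documented default threshold.

```python
import re


def trigrams(text):
    """Return the set of trigrams of each word, padded as pg_trgm describes."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(left, right):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    a, b = trigrams(left), trigrams(right)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


score = similarity("jerry ones", "Terry Jones")
print(round(score, 4))  # 0.4375, above the default 0.3 threshold
```

The two strings share 7 of 16 distinct trigrams, which is why a misspelled name can still count as similar enough to be returned.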

  22. """Search lookup."""
    from blog.models import Entry
    Entry.objects.filter(headline__search="any cheeses")
    []

  23. """SearchVector function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search="cheeses")
    [, ]
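
`SearchVector` maps to PostgreSQL's `to_tsvector`, which turns text into normalized lexemes together with their positions. A rough pure-Python sketch of that data structure follows. The function name `to_vector` and the tiny `STOP_WORDS` set are assumptions made for illustration, and no stemming is applied, so unlike PostgreSQL this version would not match "cheeses" against "cheese".

```python
import re

STOP_WORDS = {"a", "an", "and", "on", "the"}  # illustrative, far from complete


def to_vector(text):
    """Map each kept word to the 1-based positions where it occurs."""
    vector = {}
    for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        if word in STOP_WORDS:
            continue  # skipped, but it still consumes a position
        vector.setdefault(word, []).append(position)
    return vector


print(to_vector("a fat cat sat on a mat"))
# {'fat': [2], 'cat': [3], 'sat': [4], 'mat': [7]}
```

Keeping positions is what makes phrase search possible later: two words form a phrase when their positions are adjacent.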

  24. """SearchQuery expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    Q = search.SearchQuery("cheese -top", search_type="websearch")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search=Q)
    []
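
With `search_type="websearch"`, the query string uses the syntax familiar from web search engines: plain words are combined with AND, `or` means OR, a leading `-` negates a term, and double quotes mark a phrase. The sketch below is a simplified, hypothetical translator (`websearch_to_query` is not a Django or PostgreSQL function) that turns such input into a tsquery-like string. It does no stemming or stop-word removal, so its output only resembles what PostgreSQL's `websearch_to_tsquery` would produce.

```python
import re

# A token is an optionally negated quoted phrase, or any run of non-spaces.
TOKEN = re.compile(r'-?"[^"]*"|\S+')


def websearch_to_query(text):
    """Translate web-search syntax into a tsquery-like string (simplified)."""
    parts = []
    pending_or = False
    for token in TOKEN.findall(text):
        if token.lower() == "or":
            pending_or = bool(parts)  # ignore a leading "or"
            continue
        negated = token.startswith("-")
        body = token[1:] if negated else token
        words = re.findall(r"\w+", body.lower())
        if not words:
            continue
        operand = " <-> ".join(words)  # <-> means "immediately followed by"
        if negated:
            operand = "!" + (f"({operand})" if len(words) > 1 else operand)
        if parts:
            parts.append("|" if pending_or else "&")
        parts.append(operand)
        pending_or = False
    return " ".join(parts)


print(websearch_to_query("cheese -top"))             # cheese & !top
print(websearch_to_query('"cheese toast" or pizza'))  # cheese <-> toast | pizza
```

For the query on the slide, the result reads as "contains cheese and does not contain top".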

  25. """SearchConfig expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("body_text", config="french")
    Q = search.SearchQuery("œuf", config="french")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search=Q)
    []

  26. """SearchRank function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    Q = search.SearchQuery("cheese")
    R = search.SearchRank(V, Q)
    entries = Entry.objects.annotate(rank=R).order_by("-rank")
    entries.filter(rank__gt=0.05).values_list("headline", "rank")
    [('Cheese on Toast recipes', 0.09), ('Pizza Recipes', 0.06)]

  27. """SearchVector weight attribute."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", weight="A")
    W = search.SearchVector("body_text", weight="B")
    Q = search.SearchQuery("cheese")
    R = search.SearchRank(V + W, Q)
    entries = Entry.objects.annotate(rank=R).order_by("-rank")
    entries.filter(rank__gt=0.05).values_list("headline", "rank")
    [('Cheese on Toast recipes', 0.72), ('Pizza Recipes', 0.24)]
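
Weights let a match in the headline count for more than a match in the body. PostgreSQL's documented default values for the labels are D=0.1, C=0.2, B=0.4 and A=1.0. The sketch below is a deliberately simplified scoring function, not the `ts_rank` formula, so it will not reproduce the ranks shown on the slide. It assumes each document is a plain dict from weight label to text, `score` is a hypothetical helper, and the body texts are made-up sample data.

```python
import re

WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}  # PostgreSQL default label values


def score(document, term):
    """Sum the label value for every occurrence of term in each weighted field."""
    total = 0.0
    for label, text in document.items():
        words = re.findall(r"\w+", text.lower())
        total += WEIGHTS[label] * words.count(term.lower())
    return total


# Made-up sample data: headline under label A, body text under label B.
entries = {
    "Cheese on Toast recipes": {"A": "Cheese on Toast recipes", "B": "Melt the cheese."},
    "Pizza Recipes": {"A": "Pizza Recipes", "B": "Add cheese on top."},
}
ranked = sorted(entries, key=lambda title: score(entries[title], "cheese"), reverse=True)
print(ranked)  # ['Cheese on Toast recipes', 'Pizza Recipes']
```

The entry that mentions the term in its headline comes first because the A label contributes more than the B label.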

  28. """SearchHeadline function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    Q = search.SearchQuery("cheeses")
    H = search.SearchHeadline("headline", Q)
    entries = Entry.objects.annotate(search=V, highlight=H)
    entries.filter(search=Q).values_list("highlight", flat=True)
    ["<b>Cheese</b> on Toast recipes", "Pizza Recipes"]
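
`SearchHeadline` wraps the words that match the query in markers, which default to `<b>` and `</b>` in PostgreSQL's `ts_headline`. The following sketch imitates that behaviour in plain Python under strong assumptions: `normalize` is a toy stand-in for real stemming that only lowercases and strips one trailing "s", and `highlight` is a hypothetical helper, not part of Django.

```python
import re


def normalize(word):
    """Toy normalization: lowercase and drop a single plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def highlight(text, query, start="<b>", stop="</b>"):
    """Wrap every word of text whose normalized form appears in the query."""
    wanted = {normalize(word) for word in re.findall(r"\w+", query)}

    def mark(match):
        word = match.group(0)
        return f"{start}{word}{stop}" if normalize(word) in wanted else word

    return re.sub(r"\w+", mark, text)


print(highlight("Cheese on Toast recipes", "cheeses"))
# <b>Cheese</b> on Toast recipes
print(highlight("Pizza Recipes", "cheeses"))
# Pizza Recipes
```

The original casing of the text is preserved, and only the comparison uses the normalized form.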

  29. """SearchVector functional index."""
    from django.contrib.postgres import indexes, search
    from django.db import models
    V = search.SearchVector("headline", config="english")
    class Entry(models.Model):
        headline = models.CharField(max_length=255)
        body_text = models.TextField()
        authors = models.ManyToManyField("Author")
        class Meta:
            indexes = [indexes.GinIndex(V, name="search_idx")]

  30. """SearchVector functional index query."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", config="english")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search="cheeses")
    []
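
GIN stands for Generalized Inverted Index: instead of scanning every row, the database looks up each lexeme in an index that points to the rows containing it. The sketch below shows the idea with a plain dictionary. `build_index` and `lookup` are hypothetical helpers, the headlines are sample data, and words are only lowercased, whereas PostgreSQL would index stemmed lexemes.

```python
import re
from collections import defaultdict


def build_index(rows):
    """Map each lowercased word to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(row_id)
    return index


def lookup(index, query):
    """Return the ids of rows containing every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


headlines = {1: "Cheese on Toast recipes", 2: "Pizza Recipes"}  # sample data
index = build_index(headlines)
print(lookup(index, "recipes"))         # {1, 2}
print(lookup(index, "cheese recipes"))  # {1}
```

The index is built once and reused for every query, which is the trade-off a database index makes: extra work on write in exchange for faster reads.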

  31. """SearchVector field."""
    from django.contrib.postgres import indexes, search
    from django.db import models
    class Entry(models.Model):
        headline = models.CharField(max_length=255)
        body_text = models.TextField()
        authors = models.ManyToManyField("Author")
        search_vector = search.SearchVectorField()
        class Meta:
            indexes = [indexes.GinIndex(fields=["search_vector"])]

  32. """SearchVector field update."""
    from django.contrib.postgres import aggregates as agg, search
    from django.db import models
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    W = search.SearchVector(agg.StringAgg("authors__name", " "))
    entries = Entry.objects.filter(id=models.OuterRef("id"))
    results = entries.annotate(search=V + W).values("search")[:1]
    Entry.objects.update(search_vector=models.Subquery(results))
    3

  33. """SearchVector field query."""
    from blog.models import Entry
    Entry.objects.filter(search_vector="cheeses")
    [, ]

  35. An old search
    • English-only search
    • HTML tag in results
    • Sphinx generation
    • PostgreSQL database
    • External search engine

  36. Django developers feedback
    PROS
    Maintenance
    Light setup
    Dogfooding
    CONS
    Work to do
    Features
    DB workload

  37. EuroPython Sprints 2017
    Paolo Melchiorre (CC BY-SA)

  40. """Django documentation search document definition."""
    from django.contrib.postgres.search import SearchVector as V
    from django.db.models import F
    from django.db.models.fields.json import KeyTextTransform as K
    DOCUMENT_SEARCH_VECTOR = (
        V('title', weight='A', config=F('config')) +
        V(K('slug', 'metadata'), weight='A', config=F('config')) +
        V(K('toc', 'metadata'), weight='B', config=F('config')) +
        V(K('body', 'metadata'), weight='C', config=F('config')) +
        V(K('parents', 'metadata'), weight='D', config=F('config'))
    )

  41. djangoproject.com
    Full-text search features
    • Multilingual
    • PostgreSQL-only
    • Clean results
    • Low maintenance
    • Easier to setup
    • Web search syntax

  43. DjangoProject.com supported languages*
    • Arabic
    • Armenian
    • Basque
    • Catalan
    • Danish
    • Dutch
    • English
    • Finnish
    • French
    • German
    • Greek
    • Hindi
    • Hungarian
    • Indonesian
    • Irish
    • Italian
    • Lithuanian
    • Nepali
    • Norwegian
    • Portuguese
    • Romanian
    • Russian
    • Serbian
    • Spanish
    • Swedish
    • Tamil
    • Turkish
    • Yiddish

  44. What’s next
    • Misspelling support
    • Search suggestions
    • Search statistics
    • Autocomplete
    • ...

  45. Tips
    • docs in djangoproject.com
    • details in postgresql.org
    • source code in github.com
    • questions in stackoverflow.com

  46. License
    CC BY-SA 4.0
    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

  48. 20tab
    @20tab
    20tab.com

  49. @pauloxnet
    paolomelchiorre
    pauloxnet
    paulox.net

  50. Thanks
    Grazie /ˈɡrat.t͡sje/
    • Participants
    • Speakers
    • Organizers
    • Volunteers from all conferences
    DjangoCon US 2022 - Paolo Melchiorre (CC BY-SA)