Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A pythonic full-text search - PyCon US 2023

A pythonic full-text search - PyCon US 2023

-- A talk I gave at PyCon US 2023

Keeping in mind the Pythonic principle that “simple is better than complex” we’ll see how to implement full-text search in a web service using only the latest versions of Django and PostgreSQL and we’ll analyze the advantages compared to more complex solutions based on external services.

https://www.paulox.net/2023/04/23/pycon-us-2023/

Paolo Melchiorre

April 23, 2023
Tweet

More Decks by Paolo Melchiorre

Other Decks in Technology

Transcript

  1. A pythonic
    full-text search
    PAOLO MELCHIORRE ~ paulox.net

    View Slide

  2. View Slide

  3. Paolo Melchiorre
    paulox.net
    ● CTO @ 20tab
    ● Software engineer
    ● Python developer
    ● Django contributor
    © 2022 Bartek Pawlik (CC BY-NC-SA)

    View Slide

  4. Paolo Melchiorre
    >>> import this
    The Zen of Python, by Tim Peters
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    ...
    4

    View Slide


  5. Paolo Melchiorre
    5
    <… techniques for searching
    … computer-stored document …
    in a full-text database=
    — Wikipedia

    View Slide

  6. Paolo Melchiorre
    6
    Popular search engines

    View Slide

  7. Paolo Melchiorre
    7
    External search engines
    PROS
    Popular
    Full featured
    Resources
    CONS
    Driver
    Query language
    Synchronization

    View Slide

  8. Paolo Melchiorre
    8
    External search engines
    synchronization

    View Slide

  9. Paolo Melchiorre
    © 2019 Nam Anh (Unsplash Free Use)
    9

    View Slide

  10. Paolo Melchiorre
    10
    PostgreSQL
    Full text search (v8.3 ~2008)
    Data type (tsquery, tsvector)
    Special indexes (GIN, GiST)
    Phrase search (v9.6 ~2016)
    JSON support (v10 ~2017)
    Web search (v11 ~2018)
    New languages (v12-14 ~2019-2021)

    View Slide


  11. Paolo Melchiorre
    11
    <… the unit of searching
    in a full-text search system
    e.g. a magazine article …=
    — PostgreSQL

    View Slide

  12. View Slide

  13. Paolo Melchiorre
    13
    © 1946 William Gottlieb (Public Domain)

    View Slide

  14. Paolo Melchiorre
    14
    Django
    Full text search (v1.10 ~2016)
    django.contrib.postgres
    Fields, expressions, functions
    GIN index (v1.11 ~2017)
    GiST index (v2.0 ~2018)
    Phrase search (v2.2 ~2019)
    Web search (v3.1 ~2020)

    View Slide

  15. Paolo Melchiorre
    15
    Document-based
    search
    ● Weighting
    ● Categorization
    ● Highlighting
    ● Multiple languages

    View Slide

  16. View Slide

  17. Paolo Melchiorre
    17
    """Blogs models."""
    from django.contrib.postgres import search
    from django.db import models
    class Author(models.Model):
    name = models.CharField(max_length=200)
    class Entry(models.Model):
    headline = models.CharField(max_length=255)
    body_text = models.TextField()
    authors = models.ManyToManyField(Author)

    View Slide

  18. Paolo Melchiorre
    18
    """Field lookups."""
    from blog.models import Author
    Author.objects.filter(name__contains="Jerry")
    []
    Author.objects.filter(name__icontains="TERRY")
    [, ]

    View Slide

  19. Paolo Melchiorre
    19
    """Application definition."""
    INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "django.contrib.postgres",
    "blog",
    ]

    View Slide

  20. Paolo Melchiorre
    20
    """Trigram extension."""
    from django.contrib.postgres import operations
    from django.db import migrations
    class Migration(migrations.Migration):
    operations = [operations.TrigramExtension()]

    View Slide

  21. Paolo Melchiorre
    21
    """Trigram similar lookup."""
    from blog.models import Author
    Author.objects.filter(name__trigram_similar="jerry ones")
    [, ]

    View Slide

  22. Paolo Melchiorre
    22
    """Search lookup."""
    from blog.models import Entry
    entries = Entry.objects.all()
    entries.filter(headline__search="any cheeses")
    []

    View Slide

  23. Paolo Melchiorre
    23
    """SearchVector function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search="cheeses")
    [, ]

    View Slide

  24. Paolo Melchiorre
    24
    """SearchQuery expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    Q = search.SearchQuery("cheese -top", search_type="websearch")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search=Q)
    []

    View Slide

  25. Paolo Melchiorre
    25
    """Search configuration expression."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("body_text", config="french")
    Q = search.SearchQuery("œuf", config="french")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search=Q)
    []

    View Slide

  26. Paolo Melchiorre
    """SearchRank function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline","body_text")
    Q = search.SearchQuery("cheese")
    R = search.SearchRank(V, Q)
    entries = Entry.objects.annotate(rank=R).order_by("-rank")
    entries.filter(rank__gt=0.05).values_list("headline", "rank")
    [('Cheese on Toast recipes', 0.09), ('Pizza Recipes', 0.06)]
    26

    View Slide

  27. Paolo Melchiorre
    27
    """SearchVector weight attribute."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", weight="A")
    W = search.SearchVector("body_text", weight="B")
    Q = search.SearchQuery("cheese")
    R = search.SearchRank(V + W, Q)
    entries = Entry.objects.annotate(rank=R).order_by("-rank")
    entries.filter(rank__gt=0.05).values_list("headline", "rank")
    [('Cheese on Toast recipes', 0.72), ('Pizza Recipes', 0.24)]

    View Slide

  28. Paolo Melchiorre
    28
    """SearchHeadline function."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline","body_text")
    Q = search.SearchQuery("cheeses")
    H = search.SearchHeadline("headline", Q)
    entries = Entry.objects.annotate(search=V, highlight=H)
    entries.filter(search=Q).values_list("highlight", flat=True)
    ["Cheese on Toast recipes", "Pizza Recipes"]

    View Slide

  29. Paolo Melchiorre
    29
    """SearchVector functional index."""
    from django.contrib.postgres import indexes, search
    from django.db import models
    V = search.SearchVector("headline", config="english")
    class Entry(models.Model):
    headline = models.CharField(max_length=255)
    body_text = models.TextField()
    authors = models.ManyToManyField("Author")
    class Meta:
    indexes = [indexes.GinIndex(V, name="search_idx")]

    View Slide

  30. Paolo Melchiorre
    30
    """SearchVector functional index query."""
    from django.contrib.postgres import search
    from blog.models import Entry
    V = search.SearchVector("headline", config="english")
    Q = search.SearchQuery("cheeses")
    entries = Entry.objects.annotate(search=V)
    entries.filter(search=Q)
    []

    View Slide

  31. Paolo Melchiorre
    31
    """SearchVector field."""
    from django.contrib.postgres import search
    from django.db import models
    class Entry(models.Model):
    headline = models.CharField(max_length=255)
    body_text = models.TextField()
    authors = models.ManyToManyField("Author")
    search_vector = search.SearchVectorField()
    class Meta:
    indexes = [indexes.GinIndex(fields=["search_vector"])]

    View Slide

  32. Paolo Melchiorre
    32
    """SearchVector field update."""
    from django.contrib.postgres import aggregates as agg, search
    from django.db import models
    from blog.models import Entry
    V = search.SearchVector("headline", "body_text")
    W = search.SearchVector(agg.StringAgg("authors__name", " "))
    entries = Entry.objects.filter(id=models.OuterRef("id"))
    results = entries.annotate(search=V + W).values("search")[:1]
    Entry.objects.update(search_vector=models.Subquery(results))
    3

    View Slide

  33. Paolo Melchiorre
    33
    """SearchVector field query."""
    from django.contrib.postgres import search
    from blog.models import Entry
    Q = search.SearchQuery("cheeses")
    entries = Entry.objects.all()
    entries.filter(search_vector=Q)
    [, ]

    View Slide

  34. View Slide

  35. Paolo Melchiorre
    35
    An old search
    ● English-only search
    ● HTML tag in results
    ● Sphinx generation
    ● PostgreSQL database
    ● External search engine

    View Slide

  36. Paolo Melchiorre
    36
    Django developers feedback
    PROS
    Maintenance
    Light setup
    Dogfooding
    CONS
    Work to do
    Features
    DB workload

    View Slide

  37. Paolo Melchiorre
    37
    © 2017 Paolo Melchiorre (CC BY-SA)

    View Slide

  38. View Slide

  39. A
    C B
    D
    A

    View Slide

  40. Paolo Melchiorre
    40
    """Django documentation search document definition."""
    from django.contrib.postgres.search import SearchVector as V
    from django.db.models import F
    from django.db.models.fields.json import KeyTextTransform as K
    DOCUMENT_SEARCH_VECTOR = (
    V('title', weight='A', config=F('config')) +
    V(K('slug', 'metadata'), weight='A', config=F('config')) +
    V(K('toc', 'metadata'), weight='B', config=F('config')) +
    V(K('body', 'metadata'), weight='C', config=F('config')) +
    V(K('parents', 'metadata'), weight='D', config=F('config'))
    )

    View Slide

  41. Paolo Melchiorre
    41
    djangoproject.com
    full-text search features
    ● Multilingual
    ● PostgreSQL-only
    ● Clean results
    ● Low maintenance
    ● Easier to setup
    ● Web search syntax

    View Slide

  42. View Slide

  43. Paolo Melchiorre
    43
    DjangoProject.com search
    supported languages*
    Arabic
    Armenian
    Basque
    Catalan
    Danish
    Dutch
    English
    Finnish
    French
    German
    Greek
    Hindi
    Hungarian
    Indonesian
    Irish
    Italian
    Lithuanian
    Nepali
    Norwegian
    Portuguese
    Romanian
    Russian
    Serbian
    Spanish
    Swedish
    Tamil
    Turkish
    Yiddish

    View Slide

  44. Paolo Melchiorre
    44
    What’s next
    ● Misspelling support
    ● Search suggestions
    ● Search statistics
    ● Autocomplete
    ● ...

    View Slide

  45. Paolo Melchiorre
    45
    Tips and
    Tricks
    ● docs in djangoproject.com
    ● details in postgresql.org
    ● source code in github.com
    ● questions in stackoverflow.com

    View Slide

  46. License
    This work is licensed under a
    Creative Commons
    Attribution-ShareAlike 4.0
    International License.
    CC BY-SA 4.0

    View Slide

  47. View Slide

  48. Paolo Melchiorre
    48
    Thanks
    ● Participants
    ● Speakers
    ● Organizers
    ● Volunteers
    from all conferences
    Grazie /ˈɡrat.t
    ͡ sje/
    © 2017 Alessia Peviani (CC BY-NC)

    View Slide

  49. @[email protected]
    @pauloxnet
    @paolomelchiorre
    @pauloxnet
    20tab.com
    Paolo
    Melchiorre
    paulox.net

    View Slide