Slide 1

Slide 1 text

A PYTHONIC FULL-TEXT SEARCH PAOLO MELCHIORRE ~ @pauloxnet

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

CTO @ 20tab • Remote worker • Software engineer • Python developer • Django contributor Paolo Melchiorre

Slide 4

Slide 4 text

Paolo Melchiorre ~ @pauloxnet 4 Pythonic >>> import this “Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.” — “The Zen of Python”, Tim Peters

Slide 5

Slide 5 text

Paolo Melchiorre ~ @pauloxnet 5 Full-text search “… techniques for searching … computer-stored document … in a full-text database.” — “Full-text search”, Wikipedia

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Paolo Melchiorre ~ @pauloxnet 7 PostgreSQL Full text search (v8.3 ~2008) Data type (tsquery, tsvector) Special indexes (GIN, GiST) Phrase search (v9.6 ~2016) JSON support (v10 ~2017) Web search (v11 ~2018) New languages (v12 ~2019)

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Paolo Melchiorre ~ @pauloxnet 9 Django Full text search (v1.10 ~2016) django.contrib.postgres Fields, expressions, functions GIN index (v1.11 ~2017) GiST index (v2.0 ~2018) Phrase search (v2.2 ~2019) Web search (v3.1 ~2020)

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Paolo Melchiorre ~ @pauloxnet """Blogs models.""" from django.contrib.postgres import search from django.db import models class Blog(models.Model): name = models.CharField(max_length=100) tagline = models.TextField() class Author(models.Model): name = models.CharField(max_length=200) class Entry(models.Model): blog = models.ForeignKey(Blog, on_delete=models.CASCADE) headline = models.CharField(max_length=255) body_text = models.TextField() authors = models.ManyToManyField(Author) search_vector = search.SearchVectorField() 11

Slide 12

Slide 12 text

Paolo Melchiorre ~ @pauloxnet """Field lookups.""" from blog.models import Author Author.objects.filter(name__contains="Terry") [, ] Author.objects.filter(name__icontains="ERRY") [, , ] 12

Slide 13

Slide 13 text

Paolo Melchiorre ~ @pauloxnet """Unaccent extension.""" from django.contrib.postgres import operations from django.db import migrations class Migration(migrations.Migration): operations = [operations.UnaccentExtension()] """Unaccent lookup.""" from blog.models import Author Author.objects.filter(name__unaccent="Helene Joy") [] 13

Slide 14

Slide 14 text

Paolo Melchiorre ~ @pauloxnet """Trigram extension.""" from django.contrib.postgres import operations from django.db import migrations class Migration(migrations.Migration): operations = [operations.TrigramExtension()] """Trigram similar lookup.""" from blog.models import Author Author.objects.filter(name__trigram_similar="helena") [, ] 14

Slide 15

Slide 15 text

Paolo Melchiorre ~ @pauloxnet """App installation.""" INSTALLED_APPS = [ # … "django.contrib.postgres", ] """Search lookup.""" from blog.models import Entry Entry.objects.filter(body_text__search="cheeses") [, ] 15

Slide 16

Slide 16 text

Paolo Melchiorre ~ @pauloxnet """SearchVector function.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text", "blog__name") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search="cheeses") [, ] 16

Slide 17

Slide 17 text

Paolo Melchiorre ~ @pauloxnet """SearchQuery expression.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search=SEARCH_QUERY) [, ] 17

Slide 18

Slide 18 text

Paolo Melchiorre ~ @pauloxnet """SearchConfig expression.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text", config="french") SEARCH_QUERY = search.SearchQuery("œuf", config="french") entries = Entry.objects.annotate(search=SEARCH_VECTOR) entries.filter(search=SEARCH_QUERY) [] 18

Slide 19

Slide 19 text

Paolo Melchiorre ~ @pauloxnet """SearchRank function.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch") SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY) entries = Entry.objects.annotate(rank=SEARCH_RANK) entries.order_by("-rank").filter(rank__gt=0.01).values_list("headline", "rank") [('Pizza Recipes', 0.06079271), ('Cheese on Toast recipes', 0.044488445)] 19

Slide 20

Slide 20 text

Paolo Melchiorre ~ @pauloxnet """SearchVector weight attribute.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("headline", weight="A") \ + search.SearchVector("body_text", weight="B") SEARCH_QUERY = search.SearchQuery("cheese OR meat", search_type="websearch") SEARCH_RANK = search.SearchRank(SEARCH_VECTOR, SEARCH_QUERY) entries = Entry.objects.annotate(rank=SEARCH_RANK).order_by("-rank") entries.values_list("headline", "rank") [('Cheese on Toast recipes', 0.36), ('Pizza Recipes', 0.24), ('Pain perdu', 0)] 20

Slide 21

Slide 21 text

Paolo Melchiorre ~ @pauloxnet """SearchHeadline function.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") SEARCH_HEADLINE = search.SearchHeadline("headline", SEARCH_QUERY) entries = Entry.objects.annotate(highlighted_headline=SEARCH_HEADLINE) entries.values_list("highlighted_headline", flat=True) ['Cheese on Toast recipes', 'Pizza Recipes', 'Pain perdu'] 21

Slide 22

Slide 22 text

Paolo Melchiorre ~ @pauloxnet """SearchVector field.""" from django.contrib.postgres import search from blog.models import Entry SEARCH_VECTOR = search.SearchVector("body_text") SEARCH_QUERY = search.SearchQuery("pizzas OR toasts", search_type="websearch") Entry.objects.update(search_vector=SEARCH_VECTOR) Entry.objects.filter(search_vector=SEARCH_QUERY) [, ] 22

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Paolo Melchiorre ~ @pauloxnet 24 An old search • English-only search • HTML tag in results • Sphinx generation • PostgreSQL database • External search engine

Slide 25

Slide 25 text

Paolo Melchiorre ~ @pauloxnet 25 Django developers feedback PROS Maintenance Light setup Dogfooding CONS Work to do Features Database workload

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Paolo Melchiorre ~ @pauloxnet 28 djangoproject.com Full-text search features • Multilingual • PostgreSQL based • Clean results • Low maintenance • Easier to setup

Slide 29

Slide 29 text

Paolo Melchiorre ~ @pauloxnet """Documentation models.""" from django.contrib.postgres import search from django.db import models from docs.managers import DocumentManager class Document(models.Model): release = models.ForeignKey("DocumentRelease", on_delete=models.CASCADE) path = models.CharField(max_length=500) title = models.CharField(max_length=500) metadata = models.JSONField(default=dict) search = search.SearchVectorField(null=True, editable=False) config = models.SlugField(default="simple") objects = DocumentManager() 29

Slide 30

Slide 30 text

Paolo Melchiorre ~ @pauloxnet """Documentation managers.""" from django.contrib.postgres import search from django.db import models class DocumentManager(models.Manager): def search(self, query_text, release_id): search_query = search.SearchQuery(query_text, config=models.F("config")) search_rank = search.SearchRank(models.F("search"), search_query) trigram_similarity = search.TrigramSimilarity("title", query_text) return ( self.get_queryset() .filter(release_id=release_id, search=search_query) .annotate(rank=search_rank + trigram_similarity) .order_by("-rank") ) 30

Slide 31

Slide 31 text

Paolo Melchiorre ~ @pauloxnet """Documentation search.""" from django.db.models.fields.json import KeyTextTransform as Key from django.contrib.postgres.search import SearchVector from django.db.models import F DOCUMENT_SEARCH_VECTOR = ( SearchVector("title", weight="A", config=F("config")) + SearchVector(Key("slug", "metadata"), weight="A", config=F("config")) + SearchVector(Key("toc", "metadata"), weight="B", config=F("config")) + SearchVector(Key("body", "metadata"), weight="C", config=F("config")) + SearchVector(Key("parents", "metadata"), weight="D", config=F("config")) ) 31

Slide 32

Slide 32 text

Paolo Melchiorre ~ @pauloxnet """Documentation command.""" from django.core.management.base import BaseCommand from django.db import transaction from docs.models import Document from docs.search import DOCUMENT_SEARCH_VECTOR class Command(BaseCommand): @transaction.atomic def handle(self, *args, **options): Document.objects.update(search=None) updated = Document.objects.update(search=DOCUMENT_SEARCH_VECTOR) self.stdout.write(f"Successfully indexed {updated} items.") 32

Slide 33

Slide 33 text

Paolo Melchiorre ~ @pauloxnet """Documentation tests.""" from django.test import TestCase from docs.models import Document, DocumentRelease from docs.search import DOCUMENT_SEARCH_VECTOR from docs.tests import DOCUMENTS class DocumentManagerTest(TestCase): def setUp(self): self.release_en = DocumentRelease.objects.create(lang="en") slef.release_fr = DocumentRelease.objects.create(lang="fr") Document.objects.bulk_create(((Document(**d) for d in DOCUMENTS))) Document.objects.update(search=DOCUMENT_SEARCH_VECTOR) 33

Slide 34

Slide 34 text

Paolo Melchiorre ~ @pauloxnet [ { "metadata": { "body": "

Django 1.2.1 release notesDjango 1.2.1 release notes …", }, "title": "Django 1.2.1 release notes", }, { "metadata": { "body": "

Notes de publication de Django 1.9.4Notes de publication de Django 1.9.4 …", }, "title": "Notes de publication de Django 1.9.4", }, ] 34

Slide 35

Slide 35 text

Paolo Melchiorre ~ @pauloxnet """Documentation tests.""" # … def test_search(self): queryset = Document.objects.search("django", self.release_en.id) documents = ( ("Django 1.2.1 release notes", 0.969828), ("Django 1.9.4 release notes", 0.949088), ) self.assertCountEqual(queryset.values_list("title", "rank"), documents) def test_multilingual_search(self): queryset = Document.objects.search("publication", self.release_fr.id) documents = ( ("Notes de publication de Django 1.2.1", 1.06933), ("Notes de publication de Django 1.9.4", 1.04587), ) self.assertCountEqual(queryset.values_list("title", "rank"), documents) 35

Slide 36

Slide 36 text

Paolo Melchiorre ~ @pauloxnet 36 What’s next • Highlighted results • Web search syntax • … • Misspelling support • Search suggestions • Search statistics

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Paolo Melchiorre ~ @pauloxnet 38 Tips • docs in djangoproject.com • details in postgresql.org • source code in github.com • questions in stackoverflow.com

Slide 39

Slide 39 text

Paolo Melchiorre ~ @pauloxnet 39 License CC BY-SA 4.0 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

@20tab 20tab 20tab [email protected] 20tab.com

Slide 42

Slide 42 text

@pauloxnet paolomelchiorre pauloxnet [email protected] paulox.net