Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Full-Text Search in Django with PostgreSQL - EP2017

Full-Text Search in Django with PostgreSQL - EP2017

Demonstrating how we used Django Full-Text Search and PostgreSQL in a Real Project.

Originally presented at #EuroPython 2017 in RImini, Italy

Link
====
https://ep2017.europython.eu/conference/talks/full-text-search-in-django-with-postgresql

Abstract
=======
From Django version 1.10, the potgres.search module is present, and it simplifies the use of PostgreSQL full-text search engine.The fact that PostgreSQL is used in a lot of Django based projects, using its full-text search engine, permits to use this feature in a very easy way, without having to configure external tools and not having to worry about synchronizing the data between different systems.We will view how we have used this full-text search feature in a real project.

Outline
======
- Full-Text Searchin Django with PostgreSQL
- Paolo Melchiorre
- Goal
- Motivation
- Agenda
- Full-Text Search
- Features of a FTS
- Tested Solutions
- Elasticsearch
- Apache Solr
- Existing Solutions
- FTS in PostgreSQL
- What are documents
- Language Support
- Django Support
- Making queries
- Standard queries
- Unaccented query
- Trigram similar
- The search lookup
- SearchVector
- SearchQuery
- SearchRank
- Search configuration
- Weighting queries
- SearchVectorField
- www.concertiaroma.com
- Version 2.0
- Version 3.0
- Band Manager
- Band Tests Setup
- Band Tests Method
- What’s next
- Conclusions
- Resources
- Acknowledgements
- Thank you
- Social Time
- Questions ?
- Contacts

Description
===========

After some experiences in the implementation of full-text search functionality with different system, we have decided to use PostgreSQL to implement full-text search functionality in our next project, a website to search for shows, venues, bands and festivals.

In the past, I have worked in two different projects, a mobile platform to sell and buy used items and a sport videos sharing platform, where I used two of the most currently famous full-text search software (Elasticsearch or Solr) but I had some synchronization and management problems.

After that, in my company, we searched for new Django support of full-text search PostgreSQL implementation and we decided to use it to avoid any problems that I had in the past.

I’m going to start speaking about the full-text search in a general context and I want to show the problems I encountered implementing it in the past.

Afterwards, I’m going to talk about the PostgreSQL functionality to implement the full-text search functionality and also present the django.contrib.potgres.search module, with step-by-step demonstrations of its functions with real world data.

Finally, I’m going to show the way we use and test this functionality in our project and which functionality lacks us to have a complete implementation of full-text search in our project.

At the end, I want to present my conclusions about our solution and I want to explore some new features that will be present in the next versions of Django and PostgreSQL.

Paolo Melchiorre

July 12, 2017
Tweet

More Decks by Paolo Melchiorre

Other Decks in Technology

Transcript

  1. Full-Text Search in Django with PostgreSQL EuroPython 2017 - Rimini,

    2017-07-12  Paolo Melchiorre - @pauloxnet  |
  2.  Paolo Melchiorre | ▪Computer Science Engineer ▪Backend Python Developer

    (>10yrs) ▪Django Developer (~5yrs) ▪Senior Software Engineer @ 20Tab ▪Happy Remote Worker ▪PostgreSQL user, not a DBA 2
  3.  Goal | “To show how we have used Django

    Full-Text Search and PostgreSQL in a Real Project” 3
  4.  Motivation | “To implement Full-Text Search using only Django

    and PostgreSQL functionalities, without resorting to external tools.” 4
  5.  Agenda | ▪Full-Text Search ▪Existing Solutions ▪PostgreSQL Full-Text Search

    ▪Django Full-Text Search Support ▪www.concertiaroma.com project ▪What’s next ▪Conclusions ▪Questions 5
  6.  Full-Text Search | “… Full-Text Search* refers to techniques

    for Searching a single computer-stored Document or a Collection in a Full-Text Database …” -- Wikipedia * FTS = Full-Text Search 6
  7.  Features of a FTS | ▪Stemming ▪Ranking ▪Stop-words ▪Multiple

    languages support ▪Accent support ▪Indexing ▪Phrase search 7
  8.  Elasticsearch | Project: Snap Market (~500k mobile users) Issues:

    ▪Management problems ▪Patching a Java plug-in @@ -52,7 +52,8 @@ public class DecompoundTokenFilter … { - posIncAtt.setPositionIncrement(0); + if (!subwordsonly) + posIncAtt.setPositionIncrement(0); return true; } 9
  9.  Apache Solr | Project: GoalScout (~25k videos) Issues: ▪Synchronization

    problems ▪All writes to PostgreSQL and reads from Solr 10
  10.  Existing Solutions | PROS  ▪Full featured solutions ▪Resources

    (documentations, articles, …) CONS  ▪Synchronization ▪Mandatory use of driver (haystack, bungiesearch…) ▪Ops Oriented: focus on system integrations 11
  11.  FTS in PostgreSQL | ▪FTS Support since version 8.3

    (~2008) ▪TSVECTOR to represent text data ▪TSQUERY to represent search predicates ▪Special Indexes (GIN, GIST) ▪Phrase Search since version 9.6 (~2016) 12
  12.  What are Documents | “… a Document is the

    Unit of searching in a Full-Text Search system; for example, a magazine Article or email Message …” -- PostgreSQL documentation 13
  13.  Django Support | ▪Module: django.contrib.postgres ▪FTS Support since version

    1.10 (2016) ▪BRIN and GIN indexes since version 1.11 (2017) ▪Dev Oriented: focus on programming 14
  14.  Making queries | class Blog(models.Model): name = models.CharField(max_length=100) tagline

    = models.TextField() class Author(models.Model): name = models.CharField(max_length=200) email = models.EmailField() class Entry(models.Model): blog = models.ForeignKey(Blog) headline = models.CharField(max_length=255) body_text = models.TextField() pub_date = models.DateField() authors = models.ManyToManyField(Author) 15
  15.  Standard queries | >>> Author.objects.filter(name__contains='Terry') [<Author: Terry Gilliam>, <Author:

    Terry Jones>] >>> Author.objects.filter(name__icontains='Erry') [<Author: Terry Gilliam>, <Author: Terry Jones>, <Author: Jerry Lewis>] 16
  16.  Unaccented query | >>> from django.contrib.postgres.operations import UnaccentExtension >>>

    UnaccentExtension() >>> Author.objects.filter(name__unaccent__icontains='Hélène') [<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>] 17
  17.  Trigram similar | >>> from django.contrib.postgres.operations import TrigramExtension >>>

    TrigramExtension() >>> Author.objects.filter(name__unaccent__trigram_similar='Hélèn') [<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>] 18
  18.  SearchVector | >>> from django.contrib.postgres.search import SearchVector >>> Entry.objects.annotate(

    ... search=SearchVector('body_text', 'blog__tagline'), ... ).filter(search='Cheese') [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>] 20
  19.  SearchQuery | >>> from django.contrib.postgres.search import SearchQuery >>> SearchQuery('potato')

    & SearchQuery('ireland') # potato AND ireland >>> SearchQuery('potato') | SearchQuery('penguin') # potato OR penguin >>> ~SearchQuery('sausage') # NOT sausage 21
  20.  SearchRank | >>> from django.contrib.postgres.search import ( ... SearchQuery,

    SearchRank, SearchVector ... ) >>> vector = SearchVector('body_text') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate( ... rank=SearchRank(vector, query) ... ).order_by('-rank') [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>] 22
  21.  Search confguration | >>> from django.contrib.postgres.search import ( ...

    SearchQuery, SearchVector ... ) >>> Entry.objects.annotate( ... search=SearchVector('body_text', config='french'), ... ).filter(search=SearchQuery('œuf', config='french')) [<Entry: Pain perdu>] >>> from django.db.models import F >>> Entry.objects.annotate( ... search=SearchVector('body_text', config=F('blog__lang')), ... ).filter(search=SearchQuery('œuf', config=F('blog__lang'))) [<Entry: Pain perdu>] 23
  22.  Weighting queries | >>> from django.contrib.postgres.search import ( ...

    SearchQuery, SearchRank, SearchVector ... ) >>> vector = SearchVector('body_text', weight='A') + ... SearchVector('blog__tagline', weight='B') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate( ... rank=SearchRank(vector, query) ... ).filter(rank__gte=0.3).order_by('rank') 24
  23.  SearchVectorField | >>> Entry.objects.update( ... search_vector=SearchVector('body_text') ... ) >>>

    Entry.objects.filter(search_vector='cheese') [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>] 25
  24.  www.concertiaroma.com| “… today's shows in the Capital” * The

    numbers of the project: ~ 1k venues > 12k bands > 15k shows ~ 200 festivals ~ 30k user/month * since ~2014 26
  25.  Version 2.0 | Python 2.7 - Django 1.7 -

    PostgreSQL 9.1 - SQL LIKE 27
  26.  Band Manager | LANG = 'english' class BandManager(models.Manager): def

    search(self, text): vector = ( SearchVector('nickname', weight='A', config=LANG) + SearchVector('genres__name', weight='B', config=LANG)+ SearchVector('description', weight='D', config=LANG) ) query = SearchQuery(text, config=LANG) rate = SearchRank(vector, query) return self.get_queryset().annotate(rate=rate).filter( search=query).annotate(search=vector).distinct( 'id', 'rate').order_by('-rate', 'id') 29
  27.  Band Test Setup | class BandTest(TestCase): def setUp(self): metal,

    _ = Genre.objects.get_or_create(name='Metal') doom, _ = Genre.objects.get_or_create(name='Doom') doomraiser, _ = Contact.objects.get_or_create( nickname='Doom raiser', description='Lorem…') doomraiser.genres.add(doom) forgotten_tomb, _ = Contact.objects.get_or_create( nickname='Forgotten Tomb', description='Lorem…') forgotten_tomb.genres.add(doom) .... 30
  28.  Band Test Method | class BandTest(TestCase): def setUp(self): ...

    def test_band_search(self): band_queryset = Band.objects.search( 'doom').values_list('nickname', 'rate') band_list = [ ('Doom raiser', 0.675475), ('The Foreshadowin', 0.258369), ('Forgotten Tomb', 0.243171)] self.assertSequenceEqual( list(OrderedDict(band_queryset).items()), band_list) 31
  29.  What’s next | ▪Misspelling support ▪Multiple language configuration ▪Search

    suggestions ▪SearchVectorField with triggers ▪JSON/JSONB Full-Text Search ▪RUM indexing 32
  30.  Conclusions | Conditions to implement this solution: ▪No extra

    dependencies ▪Not too complex searches ▪Easy management ▪No need to synchronize data ▪PostgreSQL already in your stack ▪Python-only environment 33
  31.  Thank you |   BY -  SA

    (Attribution-ShareAlike) creativecommons.org/licenses/by-sa  Slides speakerdeck.com/pauloxnet 36
  32.  Questions ? | After the talk, Please!  *

    * Speak Slowly I'm not a native English speaker 37