Slide 1

Slide 1 text

Full-Text Search in Django with PostgreSQL #pycon8 - PyDatabase - Firenze, 2017-04-08  Paolo Melchiorre @pauloxnet  |

Slide 2

Slide 2 text

 Paolo Melchiorre | ▪Computer Science Engineer ▪Backend Python Developer since 2006 ▪Django Developer since 2011 ▪Senior Software Engineer @ TwentyTab ▪Remote Worker ▪PostgreSQL user, not a DBA

Slide 3

Slide 3 text

 Goal | To show how we have used Django Full Text Search and PostgreSQL in a real project.

Slide 4

Slide 4 text

 Motivation | To implement Full-Text Search using only Django and PostgreSQL functionalities, without resorting to external tools.

Slide 5

Slide 5 text

 Agenda | ▪Full-Text Search ▪Existing Solutions ▪PostgreSQL FTS ▪Django Support ▪Concerti@Roma ▪What’s next ▪Conclusions ▪Questions

Slide 6

Slide 6 text

 Full-Text Search | “… Full-Text Search* refers to techniques for searching a single computer-stored document or a collection in a full text database …” -- Wikipedia * FTS = Full-Text Search

Slide 7

Slide 7 text

 Features of a FTS | ▪Stemming ▪Ranking ▪Stopwords ▪Multiple languages support ▪Accent support ▪Indexing ▪Phrase search

Slide 8

Slide 8 text

 Tested Solutions |

Slide 9

Slide 9 text

 Elasticsearch | Project: Snap Market (~500,000 mobile users) Issues: ▪Management problems ▪Patching a Java plugin @@ -52,7 +52,8 @@ public class DecompoundTokenFilter … { - posIncAtt.setPositionIncrement(0); + if (!subwordsonly) + posIncAtt.setPositionIncrement(0); return true; }

Slide 10

Slide 10 text

 Apache Solr | Project: GoalScout (~25,000 videos) Issues: ▪Synchronization problems ▪All writes to PostgreSQL and reads from Solr

Slide 11

Slide 11 text

 Existing Solutions | PROS ▪Full featured solutions ▪Resources (documentations, articles, …) CONS ▪Synchronization ▪Mandatory use of driver (haystack, bungiesearch…) ▪Ops oriented: focus on system integrations

Slide 12

Slide 12 text

 FTS in PostgreSQL | ▪FTS Support since version 8.3 ▪TSVECTOR to represent text data ▪TSQUERY to represent search predicates ▪Special Indexes (GIN, GIST) ▪Phrase Search since version 9.6

Slide 13

Slide 13 text

 What are documents | “… a document is the unit of searching in a full-text search system; for example, a magazine article or email message …” -- PostgreSQL documentation

Slide 14

Slide 14 text

 Language Support | Built-in text search for many languages $ cat /usr/share/postgresql/9.6/tsearch_data/italian.stop

Slide 15

Slide 15 text

 Django Support | ▪Module django.contrib.postgres ▪FTS Support since version 1.10 ▪BRIN and GIN indexes since version 1.11 ▪Dev oriented: focus on programming

Slide 16

Slide 16 text

 Making queries | class Blog(models.Model): name = models.CharField(max_length=100) tagline = models.TextField() class Author(models.Model): name = models.CharField(max_length=200) email = models.EmailField() class Entry(models.Model): blog = models.ForeignKey(Blog) headline = models.CharField(max_length=255) body_text = models.TextField() pub_date = models.DateField() authors = models.ManyToManyField(Author)

Slide 17

Slide 17 text

 Standard queries | >>> Author.objects.filter(name__contains='Terry') [, ] >>> Author.objects.filter(name__icontains='Erry') [, , ]

Slide 18

Slide 18 text

 Unaccented query | >>> Author.objects.filter(name__unaccent__icontains='Helen') [, , ]

Slide 19

Slide 19 text

 Trigram similar | >>> Author.objects.filter( name__unaccent__lower__trigram_similar='Hélène') [, , ]

Slide 20

Slide 20 text

 The search lookup | >>> Entry.objects.filter(body_text__search='Cheese') [, ]

Slide 21

Slide 21 text

 SearchVector | >>> from django.contrib.postgres.search import SearchVector >>> Entry.objects.annotate( ... search=SearchVector('body_text', 'blog__tagline'), ... ).filter(search='Cheese') [, ]

Slide 22

Slide 22 text

 SearchQuery | >>> from django.contrib.postgres.search import SearchQuery >>> SearchQuery('potato') & SearchQuery('ireland') # potato AND ireland >>> SearchQuery('potato') | SearchQuery('penguin') # potato OR penguin >>> ~SearchQuery('sausage') # NOT sausage

Slide 23

Slide 23 text

 SearchRank | >>> from django.contrib.postgres.search import ( SearchQuery, SearchRank, SearchVector) >>> vector = SearchVector('body_text') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate( rank=SearchRank(vector, query)).order_by('-rank') [, ]

Slide 24

Slide 24 text

 Search confguration | >>> from django.contrib.postgres.search import ( SearchQuery, SearchVector) >>> Entry.objects.annotate( ... search=SearchVector('body_text', config='french'), ... ).filter(search=SearchQuery('œuf', config='french')) [] >>> from django.db.models import F >>> Entry.objects.annotate( ... search=SearchVector('body_text', config=F('blog__lang')), ... ).filter(search=SearchQuery('œuf', config=F('blog__lang'))) []

Slide 25

Slide 25 text

 Weighting queries | >>> from django.contrib.postgres.search import ( SearchQuery, SearchRank, SearchVector) >>> vector = SearchVector('body_text', weight='A') + SearchVector('blog__tagline', weight='B') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate(rank=SearchRank( vector, query)).filter(rank__gte=0.3).order_by('rank')

Slide 26

Slide 26 text

 SearchVectorField | >>> Entry.objects.update(search_vector=SearchVector('body_text')) >>> Entry.objects.filter(search_vector='cheese') [, ]

Slide 27

Slide 27 text

 Concerti@Roma | The numbers of the project: ▪> 11,000 bands ▪~ 1,000 venues ▪~ 1,500 shows ▪> 100 festivals ▪~ 15,000 user/month

Slide 28

Slide 28 text

 Concerti@Roma 2 | ▪Python2.7 / Django 1.7 / PostgreSQL 9.1 / Like

Slide 29

Slide 29 text

 Concerti@Roma 3 | ▪Python3.6 / Django1.11 / PostgreSQL 9.6 / FTS

Slide 30

Slide 30 text

 C@R Manager | class BandManager(models.Manager): def search(self, text): vector = ( SearchVector('nickname', weight='A', config=LANG) + SearchVector('genres__name', weight='B', config=LANG)+ SearchVector('description', weight='D', config=LANG) ) query = SearchQuery(text, config=LANG) rate = SearchRank(vector, query) return self.get_queryset().annotate(rate=rate).filter( search=query).annotate(search=vector).distinct( 'id', 'rate').order_by('-rate', 'id')

Slide 31

Slide 31 text

 C@R Test Setup | class BandTest(TestCase): def setUp(self): metal, _ = Genre.objects.get_or_create(name='Metal') doom, _ = Genre.objects.get_or_create(name='Doom') doomraiser, _ = Contact.objects.get_or_create( nickname='Doom raiser', description='Lorem…') doomraiser.genres.add(doom) forgotten_tomb, _ = Contact.objects.get_or_create( nickname='Forgotten Tomb', description='Lorem…') forgotten_tomb.genres.add(doom) ....

Slide 32

Slide 32 text

 C@R Test Method | class BandTest(TestCase): def setUp(self): ... def test_band_search(self): band_queryset = Band.objects.search( 'doom').values_list('nickname', 'rate') band_list = [ ('Doom raiser', 0.675475), ('The Foreshadowin', 0.258369), ('Forgotten Tomb', 0.243171)] self.assertSequenceEqual(list( OrderedDict(band_queryset).items()), band_list)

Slide 33

Slide 33 text

 What’s next | ▪Multiple language ranking ▪Search suggestions ▪SearchVectorField with triggers ▪JSON/JSONB Full-Text Search ▪RUM indexing

Slide 34

Slide 34 text

 Conclusions | Conditions to implement this solution: ▪No extra dependencies ▪Not too complex searches ▪Easy management ▪No need to synchronize data ▪PostgreSQL already in your stack ▪Python-only enviroment

Slide 35

Slide 35 text

 Resources | ▪https://docs.djangoproject.com/en/1.11/ref/contri b/postgres/search/ ▪https://www.postgresql.org/docs/9.6/static/textse arch.html ▪https://github.com/damoti/django-tsvector-field ▪https://en.wikipedia.org/wiki/Full-text_search

Slide 36

Slide 36 text

 Acknowledgements | ▪20tab ▪Marc Tamlyn django.contrib.postgres

Slide 37

Slide 37 text

 Thank you |   BY -  SA (Attribution-ShareAlike) ▪https://creativecommons.org/licenses/by-sa/4.0/  Slides ▪https://speakerdeck.com/pauloxnet

Slide 38

Slide 38 text

 Social Time |

Slide 39

Slide 39 text

 Questions ? | 

Slide 40

Slide 40 text

 Contacts |  https://twitter.com/pauloxnet  https://linkedin.com/in/paolomelchiorre  https://github.com/pauloxnet