Postgres Full Text Search Is Good Enough

Slide 1

Slide 1 text

Postgres full-text search is Good Enough! Introduction to full-text search using Postgres

Slide 2

Slide 2 text

About me • I m not a DBA but an application developer • Co-founder of Lost Property (service company) • Python developer ﬁddling with Erlang and Haskell • Love using Postgres • Running the Pyramid London Meetup Twitter: @rachbelaid

Slide 3

Slide 3 text

Why full-text search? The magnifying glass is something that we now add to wireframes without even knowing yet what we are going to search.

Slide 4

Slide 4 text

Existing Solutions • Elasticsearch & SOLR : solution base on Lucene • Xapian: C++ with binding in most languages • Sphinx: Well integrated with RDMS

Slide 5

Slide 5 text

Maybe you don’t want an extra dependency

Slide 6

Slide 6 text

Maybe your search needs are simple

Slide 7

Slide 7 text

Maybe you want something simpler to manage

Slide 8

Slide 8 text

Maybe you don’t want to think about syncing data

Slide 9

Slide 9 text

Maybe you are already using Postgres

Slide 10

Slide 10 text

Search Core Concepts • Document based • Never just looking through a string • Stemming and lexemes • N-gram • Relevance

Slide 11

Slide 11 text

Terminology • Engine • Documents (what we are searching in) • Stopword (eg: ‘and’, ‘a’, ‘the’, ‘but’) • Stemming (Finding the root of a word) • Relevance (Algo used to rank the results of the query) • Boost (enhancing the relevance of certains document based on condition) • Position (Position of a token inside the document)

Slide 12

Slide 12 text

Example

Slide 13

Slide 13 text

Schema Author id name Post id content title author_id language Tag id name 1:N N:M

Slide 14

Slide 14 text

Steps • Building our search document • Tokenize and stemming the documents • Build query to ﬁnd our document

Slide 15

Slide 15 text

Document structure • Post title • Post content • Post author name • Post tags

Slide 16

Slide 16 text

Build our document SELECT post.title || ' ' || post.content || ' ' || author.name || ' ' || coalesce((string_agg(tag.name, ' ')), '') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id;

Slide 17

Slide 17 text

Tokenization: tsvector Let's try to convert a simple string into a tsvector. ! SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value'); ! Result: ! to_tsvector ---------------------------------------------------------- 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17 ! ! • less words than in the original sentence, • some of the words are different (try became tri) • they are all followed by numbers. (position)

Slide 18

Slide 18 text

Tokenization: tsvector A tsvector value is a sorted list of distinct lexemes which are words that have been normalized to make variants of the same word look alike. ! For example: ! - folding upper-case letters to lower-case - removal of sufﬁxes (such as 's', 'es' or 'ing' in English). ! This allows searches to ﬁnd variant forms of the same word without tediously entering all the possible variants. ! The numbers represent the position of the lexeme in the original string.

Slide 19

Slide 19 text

Build Tokenized Documents CREATE VIEW documents AS( SELECT post.id as post_id,  to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ‘)), ‘')) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );

Slide 20

Slide 20 text

Querying SELECT to_tsvector('If you can dream it, you can do it') @@ ‘dream' as result; ! result ---------- t ! SELECT to_tsvector('It''s kind of fun to do the impossible') @@ ‘impossible' as result; ! result ---------- f

Slide 21

Slide 21 text

Querying The second query returns false because we need to build a tsquery which creates the same lexemes and, using the operator @@, casts the string into a tsquery. SELECT ‘impossible’::tsquery,   to_tsquery('impossible'); tsquery | to_tsquery  -------------+------------  'impossible' | 'imposs'

Slide 22

Slide 22 text

Querying we will use to_tsquery for querying documents ! SELECT to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery(’impossible’) as result; ! result ---------- t

Slide 23

Slide 23 text

Querying A tsquery value stores lexemes that are to be searched for, and combines them honoring the Boolean operators: & (AND) | (OR) ! (NOT). ! Parentheses can be used to enforce grouping of the operators

Slide 24

Slide 24 text

Querying SELECT to_tsvector('If the facts don't fit the theory, change the facts') @@ to_tsquery('! fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory’); -> TRUE

Slide 25

Slide 25 text

Querying our document SELECT post.id, post.title FROM documents JOIN post ON post.id = document.post_id WHERE document.content @@ to_tsquery('Endangered & Species');

Slide 26

Slide 26 text

Language support Postgres provides built-in text search for many languages:     Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portugese, Romanian, Russian, Spanish, Swedish, Turkish  ...and extensions for others.

Slide 27

Slide 27 text

Language support: Stemming SELECT to_tsvector('english', 'We are running'); ‘run':3 ! SELECT to_tsvector('french', 'We are running'); 'are':2 'running':3 'we':1

Slide 28

Slide 28 text

CREATE VIEW documents AS( SELECT post.id as post_id,  to_tsvector(post.language::regconfig, post.title) || to_tsvector(post.language::regconfig, post.content) || to_tsvector(‘simple’, author.name) || to_tsvector( ‘simple’,  coalesce((string_agg(tag.name, ‘’)), ‘')  ) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id ); Language support:   Build new documents

Slide 29

Slide 29 text

Accented Character Supporting query of document in different languages require to normalise accents.  Eg: difﬁcult to search for french accent with an english keyboard. ! CREATE EXTENSION unaccent;  SELECT unaccent('èéêë');    unaccent  ----------  eeee  (1 row)

Slide 30

Slide 30 text

Custom dictionary We can unaccent our document or create a new dictionary: CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );  ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING  FOR hword, hword_part, word WITH unaccent, french_stem;    When we are using this new text search conﬁg, we can see the lexemes: SELECT to_tsvector('french', 'il était une fois');  to_tsvector  -------------  'fois':4  SELECT to_tsvector('fr', 'il était une fois');  to_tsvector  ——————————  'etait':2 'fois':4

Slide 31

Slide 31 text

Custom Dictionnary • controle of synonyms • controle of thesaurus • controle of stopwords • ….

Slide 32

Slide 32 text

Ranking / Boost • When you build a search engine you want to be able to get search results ordered by relevance. • To order our results by revelance PostgreSQL provides a few functions but we will be using only 2 of them : ts_rank() and setweight(). • The function setweight() allows us to assign a weight value to a tsvector(); the value can be 'A', 'B', 'C' or 'D'

Slide 33

Slide 33 text

Ranking / Boost: Relevancy SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | document')) as relevancy; -> 0.0607927  SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | unkown')) as relevancy; -> 0.0303964  SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & document')) as relevancy; -> 0.0985009   SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & unknown')) as relevancy; -> 1e-20

Slide 34

Slide 34 text

Ranking / Boost: Documents CREATE VIEW documents AS( SELECT post.id as post_id, post.title as title, setweight(to_tsvector(post.language::regconfig,   post.title), 'A') || setweight(to_tsvector(post.language::regconfig, post.content), 'B') || setweight(to_tsvector('simple', author.name), 'C') || setweight(to_tsvector(‘simple', coalesce(string_agg(tag.name, ' '))), 'B') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );

Slide 35

Slide 35 text

Ranking / Boost: Querying SELECT post_id, title FROM documents WHERE   documents.content @@ to_tsquery(‘english', QUERY ) ORDER BY   ts_rank(document.content, to_tsquery(‘english', QUERY)) DESC; ! ! QUERY = - ‘Endangered & Species’ - ‘Endangered | Species’

Slide 36

Slide 36 text

Indexing: single table document PostgreSQL supports function based indexes so you can simply create a GIN index around the tsvector() function. (one table only) CREATE INDEX idx_fts_post   ON post   USING gin(   setweight(  to_tsvector(language, title),’A’)||  setweight(  to_tsvector(language, content),'B')  );

Slide 37

Slide 37 text

Indexing:   multiple tables document In our schema example; the document is spread accross multiple tables with different weights.   For a better performance it's necessary to denormalize the data : • MATERIALIZED VIEW (if can have indexing delay) • TRIGGERS • Application logic • ?? suggestions ??

Slide 38

Slide 38 text

Extra PG search features • Fuzzy search via trigrams extension • Highlighting

Slide 39

Slide 39 text

Compare to ES/SOLR Keep in mind that FTS is a feature of Postgres but ElasticSearch and SOLR are tool dedicated to do FTS: • PG required to have the tsquery build properly where ES make it easier to do query with custom analyser on part of document • PG make it easier to not think about data sync or canonical datastore problem • ES offers great languages (and better out of the box) than PG. Lucene • PG is rock solid and easy to run • ES better unicode case folding

Slide 40

Slide 40 text

For some cases:  Postgres full text search is good enough!!

Slide 41

Slide 41 text

References • https://speakerdeck.com/daniellindsley/building-a- python-based-search-engine • http://blog.lostpropertyhq.com/postgres-full-text- search-is-good-enough/

Slide 42

Slide 42 text

Questions?