Postgres Full Text Search Is Good Enough

Postgres full-text search is Good Enough! Introduction to full-text search
using Postgres

About me • I m not a DBA but an
application developer • Co-founder of Lost Property (service company) • Python developer ﬁddling with Erlang and Haskell • Love using Postgres • Running the Pyramid London Meetup Twitter: @rachbelaid

Why full-text search? The magnifying glass is something that we
now add to wireframes without even knowing yet what we are going to search.

Existing Solutions • Elasticsearch & SOLR : solution base on
Lucene • Xapian: C++ with binding in most languages • Sphinx: Well integrated with RDMS

Maybe you don’t want an extra dependency

Maybe your search needs are simple

Maybe you want something simpler to manage

Maybe you don’t want to think about syncing data

Maybe you are already using Postgres

Search Core Concepts • Document based • Never just looking
through a string • Stemming and lexemes • N-gram • Relevance

Terminology • Engine • Documents (what we are searching in)
• Stopword (eg: ‘and’, ‘a’, ‘the’, ‘but’) • Stemming (Finding the root of a word) • Relevance (Algo used to rank the results of the query) • Boost (enhancing the relevance of certains document based on condition) • Position (Position of a token inside the document)

Example

Schema Author id name Post id content title author_id language
Tag id name 1:N N:M

Steps • Building our search document • Tokenize and stemming
the documents • Build query to ﬁnd our document

Document structure • Post title • Post content • Post
author name • Post tags

Build our document SELECT post.title || ' ' || post.content
|| ' ' || author.name || ' ' || coalesce((string_agg(tag.name, ' ')), '') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id;

Tokenization: tsvector Let's try to convert a simple string into
a tsvector. ! SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value'); ! Result: ! to_tsvector ---------------------------------------------------------- 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17 ! ! • less words than in the original sentence, • some of the words are different (try became tri) • they are all followed by numbers. (position)

Tokenization: tsvector A tsvector value is a sorted list of
distinct lexemes which are words that have been normalized to make variants of the same word look alike. ! For example: ! - folding upper-case letters to lower-case - removal of sufﬁxes (such as 's', 'es' or 'ing' in English). ! This allows searches to ﬁnd variant forms of the same word without tediously entering all the possible variants. ! The numbers represent the position of the lexeme in the original string.

Build Tokenized Documents CREATE VIEW documents AS( SELECT post.id as
post_id,  to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ‘)), ‘')) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );

Querying SELECT to_tsvector('If you can dream it, you can do
it') @@ ‘dream' as result; ! result ---------- t ! SELECT to_tsvector('It''s kind of fun to do the impossible') @@ ‘impossible' as result; ! result ---------- f

Querying The second query returns false because we need to
build a tsquery which creates the same lexemes and, using the operator @@, casts the string into a tsquery. SELECT ‘impossible’::tsquery,   to_tsquery('impossible'); tsquery | to_tsquery  -------------+------------  'impossible' | 'imposs'

Querying we will use to_tsquery for querying documents ! SELECT
to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery(’impossible’) as result; ! result ---------- t

Querying A tsquery value stores lexemes that are to be
searched for, and combines them honoring the Boolean operators: & (AND) | (OR) ! (NOT). ! Parentheses can be used to enforce grouping of the operators

Querying SELECT to_tsvector('If the facts don't fit the theory, change
the facts') @@ to_tsquery('! fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory’); -> TRUE

Querying our document SELECT post.id, post.title FROM documents JOIN post
ON post.id = document.post_id WHERE document.content @@ to_tsquery('Endangered & Species');

Language support Postgres provides built-in text search for many languages:
    Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portugese, Romanian, Russian, Spanish, Swedish, Turkish  ...and extensions for others.

Language support: Stemming SELECT to_tsvector('english', 'We are running'); ‘run':3 !
SELECT to_tsvector('french', 'We are running'); 'are':2 'running':3 'we':1

CREATE VIEW documents AS( SELECT post.id as post_id,  to_tsvector(post.language::regconfig, post.title)
|| to_tsvector(post.language::regconfig, post.content) || to_tsvector(‘simple’, author.name) || to_tsvector( ‘simple’,  coalesce((string_agg(tag.name, ‘’)), ‘')  ) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id ); Language support:   Build new documents

Accented Character Supporting query of document in different languages require
to normalise accents.  Eg: difﬁcult to search for french accent with an english keyboard. ! CREATE EXTENSION unaccent;  SELECT unaccent('èéêë');    unaccent  ----------  eeee  (1 row)

Custom dictionary We can unaccent our document or create a
new dictionary: CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );  ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING  FOR hword, hword_part, word WITH unaccent, french_stem;    When we are using this new text search conﬁg, we can see the lexemes: SELECT to_tsvector('french', 'il était une fois');  to_tsvector  -------------  'fois':4  SELECT to_tsvector('fr', 'il était une fois');  to_tsvector  ——————————  'etait':2 'fois':4

Custom Dictionnary • controle of synonyms • controle of thesaurus
• controle of stopwords • ….

Ranking / Boost • When you build a search engine
you want to be able to get search results ordered by relevance. • To order our results by revelance PostgreSQL provides a few functions but we will be using only 2 of them : ts_rank() and setweight(). • The function setweight() allows us to assign a weight value to a tsvector(); the value can be 'A', 'B', 'C' or 'D'

Ranking / Boost: Relevancy SELECT ts_rank(to_tsvector('This is an example of
document'), to_tsquery('example | document')) as relevancy; -> 0.0607927  SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | unkown')) as relevancy; -> 0.0303964  SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & document')) as relevancy; -> 0.0985009   SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & unknown')) as relevancy; -> 1e-20

Ranking / Boost: Documents CREATE VIEW documents AS( SELECT post.id
as post_id, post.title as title, setweight(to_tsvector(post.language::regconfig,   post.title), 'A') || setweight(to_tsvector(post.language::regconfig, post.content), 'B') || setweight(to_tsvector('simple', author.name), 'C') || setweight(to_tsvector(‘simple', coalesce(string_agg(tag.name, ' '))), 'B') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );

Ranking / Boost: Querying SELECT post_id, title FROM documents WHERE
  documents.content @@ to_tsquery(‘english', QUERY ) ORDER BY   ts_rank(document.content, to_tsquery(‘english', QUERY)) DESC; ! ! QUERY = - ‘Endangered & Species’ - ‘Endangered | Species’

Indexing: single table document PostgreSQL supports function based indexes so
you can simply create a GIN index around the tsvector() function. (one table only) CREATE INDEX idx_fts_post   ON post   USING gin(   setweight(  to_tsvector(language, title),’A’)||  setweight(  to_tsvector(language, content),'B')  );

Indexing:   multiple tables document In our schema example; the
document is spread accross multiple tables with different weights.   For a better performance it's necessary to denormalize the data : • MATERIALIZED VIEW (if can have indexing delay) • TRIGGERS • Application logic • ?? suggestions ??

Extra PG search features • Fuzzy search via trigrams extension
• Highlighting

Compare to ES/SOLR Keep in mind that FTS is a
feature of Postgres but ElasticSearch and SOLR are tool dedicated to do FTS: • PG required to have the tsquery build properly where ES make it easier to do query with custom analyser on part of document • PG make it easier to not think about data sync or canonical datastore problem • ES offers great languages (and better out of the box) than PG. Lucene • PG is rock solid and easy to run • ES better unicode case folding

For some cases:  Postgres full text search is good enough!!

References • https://speakerdeck.com/daniellindsley/building-a- python-based-search-engine • http://blog.lostpropertyhq.com/postgres-full-text- search-is-good-enough/

Questions?

Postgres Full Text Search Is Good Enough

Postgres Full Text Search Is Good Enough

More Decks by Rach Belaid

Other Decks in Technology

Featured

Transcript