
Addok presentation

Presentation of Addok at the Paris Redis meetup.

Yohan Boniface

December 05, 2016

Transcript

  1. Context
     Part of the “Base adresse nationale” (BAN) project
     • Etalab mission
     • Partnership with La Poste, IGN and OpenStreetMap
     • Collaborative database, shared across all French public services (IGN, La Poste, firefighters, emergency services…)
     • Data published under an open license
     • V0 currently running (read only), v1 to be released in the first quarter of 2017 (with a CRUD API)
  2. Challenges
     • Speed
     • Autocompletion
     • Typo tolerance
     • Location bias
     • Batch geocoding (e.g. CSV)
     • Filters
     • Reverse geocoding
     • Easy to debug internals
  3. Facts and numbers
     • Redis as database
     • Python code base (and a bit of Lua)
     • 1324 LoC
     • Open source (WTFPL licensed)
     • Stable version: 0.5; 1.0.0 release coming
     • ~ 30 ms average search time on our server
     • ~ 2000 searches/s on our server (6 cores)
     • ~ 16 GB of RAM for the full France data
     • ~ 2 million requests/day
  4. ElasticSearch? We tried, but:
     – TF/IDF is not really suited to addresses
     – Hard to debug, a “black box”
     – Very good for the first 80%, very hard for the next 20%
     – Many features cannot be mixed (fuzziness + edge-ngrams for autocompletion + location bias…)
     – Quite slow with an advanced configuration mixing features
  5. Main principle
     [Slide diagram: one column of document ids per word (“rue”, “boulevard”, “des”, “lilas”, “rosiers”), each column listing the ids (1357, 2468, 3579…) of the documents containing that word]
     • Each word is a Redis set whose values are the ids of the documents containing this word (note: it’s actually a sorted set, but let’s keep it simple for now).
     • To issue a search, we intersect the sets of the searched words.
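
     A minimal sketch of this principle with redis-py, using plain sets; the “w|” key prefix and the document ids are made up for illustration, not Addok’s actual key layout:

        import redis

        r = redis.Redis()

        # Hypothetical key layout: one set per word, members are document ids.
        r.sadd("w|rue", "doc:1357", "doc:2468", "doc:3579")
        r.sadd("w|des", "doc:1357", "doc:2468", "doc:9687")
        r.sadd("w|lilas", "doc:1357", "doc:8769")

        # A search is an intersection of the sets of the searched words.
        print(r.sinter("w|rue", "w|des", "w|lilas"))  # {b'doc:1357'}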
  6. Search pattern overview
     • Clean the search string (remove duplicate letters, replace synonyms…), as sketched below
     • Split the search string into tokens (“ru”, “de”, “lila”)
     • First retain only the “rare” tokens
     • Get all candidates by intersecting the matching sets (the one for “ru”, the one for “de”, the one for “lila”)
     • If needed, adapt the number of candidates by adding/removing tokens, autocompleting, trying fuzzy words…
     • Sort all candidates by comparing the search string with the candidates’ labels (“rue rose” with “rue des roses 59240 Dunkerque”, “rue des rosiers 59220 Henain”…), plus document importance (“boulevard de la République” in Paris vs “boulevard de la République” in a small village), plus the optional location bias.
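
     As a rough illustration of the cleaning and tokenizing steps, a toy Python sketch; the synonym table and the duplicate-letter rule here are simplified stand-ins for Addok’s real, configurable processors:

        import re

        SYNONYMS = {"bd": "boulevard", "av": "avenue"}  # illustrative values

        def clean_and_tokenize(query):
            query = query.lower()
            # Collapse runs of duplicate letters: "ruee" -> "rue".
            query = re.sub(r"(.)\1+", r"\1", query)
            tokens = [t for t in re.split(r"[\s,]+", query) if t]
            # Replace known synonyms/abbreviations.
            return [SYNONYMS.get(t, t) for t in tokens]

        print(clean_and_tokenize("ruee des Lilas"))  # ['rue', 'des', 'lilas']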
  7. Filters
     • Allow targeting a subset of documents: one can filter by citycode, postcode, type, etc.
     • The available filters are configurable per Addok instance
     • They are just other sets, added to the set intersection used to find the candidates
     • For example, with a citycode filter, each citycode value is a set whose values are the ids of the documents with this citycode (see the sketch below)
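
     Continuing the same toy index, a filter is just one more key in the intersection; the “f|citycode|” naming is again hypothetical:

        import redis

        r = redis.Redis()

        # Hypothetical filter key: one set per filter value, members are
        # the ids of the documents having that citycode.
        r.sadd("f|citycode|75056", "doc:1357", "doc:4680")

        # The filter set simply joins the word sets in the intersection.
        candidates = r.sinter("w|rue", "w|lilas", "f|citycode|75056")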
  8. Autocompletion
     • Only the last searched word is tried (and only if we don’t find any good candidates without it)
     • Dedicated “edge-ngrams”-like index: the set key is the first letters of a word, and the values are the words starting with those letters
     • For example, we may have a “fri” set containing “frigo”, “friche”, “frioul”…, but also a “frig” set containing only “frigo”, “frigau”…
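
     A possible sketch of such an edge-ngram index; the “p|” prefix and the minimum prefix length are assumptions for illustration:

        import redis

        r = redis.Redis()

        def index_edge_ngrams(word, min_len=3):
            # Each prefix of the word points to the full words sharing it.
            for i in range(min_len, len(word)):
                r.sadd("p|" + word[:i], word)

        index_edge_ngrams("frigo")   # fills "p|fri" and "p|frig"
        index_edge_ngrams("friche")  # fills "p|fri", "p|fric", "p|frich"

        print(r.smembers("p|fri"))   # {b'frigo', b'friche'}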
  9. Fuzzy matching
     • Allows increasing typo tolerance
     • Fuzzy possibilities are computed in Python (“frigo” would give “firgo”, “rfigo”, “friko”, “brigo”…); we then retain only the ones that exist in the index (in this case “friko” and “brigo”) and that have been seen with the other searched words
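
     A toy version of this idea in plain Python, limited to adjacent transpositions and one-letter substitutions, with an in-memory set standing in for the Redis existence check:

        import string

        def fuzzy_variants(word):
            # Transpositions of adjacent letters: "frigo" -> "rfigo", "firgo"…
            for i in range(len(word) - 1):
                yield word[:i] + word[i + 1] + word[i] + word[i + 2:]
            # One-letter substitutions: "frigo" -> "brigo", "friko"…
            for i in range(len(word)):
                for letter in string.ascii_lowercase:
                    if letter != word[i]:
                        yield word[:i] + letter + word[i + 1:]

        # Stand-in for "does this word exist in the index?" (a Redis
        # lookup in the real thing).
        known_words = {"friko", "brigo"}
        print([v for v in fuzzy_variants("frigo") if v in known_words])
        # ['brigo', 'friko']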
  10. Location bias
     • Allows giving a geographical center (lat, lon) to prioritize results
     • Not yet using the new Redis GEO commands
     • The geohash is computed in Python, then used as another set in the intersection of document ids
     • Results in the center geohash and its neighbours are always selected
     • A score boost is then applied to closer results in the sort step
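
     A sketch of the bias step, assuming the python-geohash package and the same hypothetical key scheme as above (“g|” prefix for geohash cells):

        import geohash  # python-geohash
        import redis

        r = redis.Redis()

        # Hypothetical geohash index: one set of document ids per cell.
        center = geohash.encode(48.8566, 2.3522, precision=5)
        cells = [center] + geohash.neighbors(center)

        # Union of the center cell and its neighbours, intersected with
        # the word sets, so nearby documents are always among the candidates.
        r.sunionstore("tmp|geo", *["g|" + c for c in cells])
        biased = r.sinter("w|rue", "w|lilas", "tmp|geo")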
  11. Only common words case
     • Redis intersect time complexity depends (roughly) on the smallest set being intersected
     • There are cases where we only have big sets, i.e. very common words, for example the search “rue de la saint…”
     • The first thing is to try to avoid such a situation: if there is a filter or a center, those will usually greatly reduce the intersect time
     • In the worst case, we do a manual scan: we select the first “500” documents of the least common searched term and manually check whether those documents are in the sets of the other words (see the sketch below)
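
     A sketch of that manual scan with redis-py, assuming the word keys are sorted sets (as they actually are in Addok) and the “w|” naming from above:

        import redis

        r = redis.Redis()

        def manual_scan(rarest_key, other_keys, limit=500):
            # Take the best-scored `limit` ids of the least common word…
            ids = r.zrevrange(rarest_key, 0, limit - 1)
            # …and keep only the ids present in every other word's set.
            return [i for i in ids
                    if all(r.zscore(key, i) is not None for key in other_keys)]

        candidates = manual_scan("w|saint", ["w|rue", "w|de", "w|la"])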
  12. Reverse geocoding
     • Allows giving a lat and lon position and getting the closest address
     • Geohash based (the same geohashes used for the location bias)
     • Not yet using the Redis GEO API
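
     A minimal reverse-geocoding sketch along those lines, with the same hypothetical “g|” geohash sets; the final distance sort is left to plain Python:

        import geohash  # python-geohash
        import redis

        r = redis.Redis()

        def reverse(lat, lon):
            # Look up the ids indexed in the cell of the queried point,
            # widening to the neighbouring cells when it is empty.
            cell = geohash.encode(lat, lon, precision=6)
            ids = r.smembers("g|" + cell)
            if not ids:
                ids = r.sunion(*["g|" + c for c in geohash.neighbors(cell)])
            # The closest address is then picked by computing the
            # distances to the candidates in Python.
            return ids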
  13. Document<=>word weighting
     • The document<=>word relation is not uniform: some documents are more tied to some words than others
     • For example, in France, there is a city called “Rue”
     • This is why we use a sorted set, where the score is the “weight” of the document<=>word relation
     • First, we add a “boost” to some fields: “name” is boosted more than “city”
     • Then we divide this boost by the number of words in the field. For example, “name” has a boost of 4, so “Rue” the city gets a final score of 4, while in “rue des Lilas” the final score of “rue” is 4/3 ≈ 1.33 (field boost / number of words)
     • When we select candidate documents, the documents with the most weighted relations come first
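
     A sketch of that scoring rule at indexing time, with made-up boost values and redis-py’s sorted-set API:

        import redis

        r = redis.Redis()

        FIELD_BOOSTS = {"name": 4, "city": 1}  # illustrative values

        def index_field(doc_id, field, value):
            words = value.lower().split()
            score = FIELD_BOOSTS[field] / len(words)  # boost / number of words
            for word in words:
                r.zadd("w|" + word, {doc_id: score})

        index_field("doc:rue-city", "name", "Rue")         # "rue" scores 4.0
        index_field("doc:lilas", "name", "rue des lilas")  # "rue" scores 4/3

        # Highest-weighted documents come out first.
        print(r.zrevrange("w|rue", 0, -1, withscores=True))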
  14. Ongoing work
     • Hooks for plugins (custom indexes, custom endpoints, post-search…)
     • Trigram experiment (split into trigrams instead of words: “rue”, “ue ”, “e d”, “ de”…) via a plugin
     • Document storage in a third-party DB (SQLite, Postgres…) via a plugin
     • Use the new Redis GEO pseudo-types
     • Various optimizations
     • Lua scripts
     • 1.0.0 freeze
  15. Limits
     • Everything goes in RAM (but we are working on storing the documents themselves elsewhere)
     • No OR in filters (postcode is 12345 OR 23456)
     • Only one configuration file (not possible to mix English and French in the same instance, for example)