Exploring complex data with Elasticsearch and Python

Simon Willison, PyBay, 20th Aug 2016

Introducing the denormalized query engine design pattern

http://aaronland.info/talks/mw10_machinetags/#91

• Global data? Query the search index
• Looking at your own stuff? Query the DB directly, which avoids any risk of “where are my changes?”

Denormalized query engine design pattern
• Point of truth is your relational database, kept as normalized as possible
• Denormalize all relevant data to a separate search index
• Invest a lot of effort in synchronizing the two
• Smartly route queries to database or search depending on tolerance for lag
• Optional: query search engine for object IDs, then load directly from the database to display the content
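The routing rule and the optional "IDs from search, content from the database" step can be sketched in a few lines. This is only an illustration under stated assumptions: `choose_backend` and `search_then_load` are hypothetical names, and a plain dict stands in for the relational database.

```python
def choose_backend(viewer_id, owner_id=None):
    """Pick where a read should go.

    Data owned by the viewer is read from the database so that their own
    edits show up immediately; everything else tolerates indexing lag and
    goes to the search index.
    """
    if owner_id is not None and owner_id == viewer_id:
        return "database"
    return "search"


def search_then_load(hit_ids, database):
    """Load fresh rows for the object IDs returned by a search.

    `hit_ids` keeps the ranking order chosen by the search engine.
    `database` is a dict standing in for the relational database
    (an assumption made for this sketch).
    """
    rows = []
    for object_id in hit_ids:
        row = database.get(object_id)
        if row is not None:  # skip rows deleted since the last index update
            rows.append(row)
    return rows
```

Loading by ID keeps the displayed content as fresh as the database, while the search engine only decides which objects match and in what order.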

Why do this?
• “Search” engines aren’t just good at text search
• In exchange for a few seconds of indexing delay, you get:
  • horizontal scalability
  • powerful new types of query
  • aggregations

Elasticsearch
• Open source search engine built on Apache Lucene
• Interface is all JSON over HTTP, so it is easy to use from any language
• Claims to be “real-time”; it’s close enough
• Insanely powerful query language (a JSON DSL)
• Strong focus on analytics in addition to text search
• Elastic means elastic: highly horizontally scalable

Let’s build a search engine for Django docs

import hashlib
import json
import os
import sys
from urllib.parse import urljoin


def walk_documentation(path='.', base_url=''):
    path = os.path.realpath(path)
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.txt'):
                filepath = os.path.join(dirpath, filename)
                with open(filepath, encoding='utf-8') as f:
                    content = f.read()
                # find_title() is not shown on the slide; see the gist below
                title = find_title(content)
                # Figure out path relative to original
                relpath = os.path.relpath(filepath, path)
                # Drop the ".txt" extension and add a trailing slash
                url = urljoin(base_url, relpath[:-4] + '/')
                yield {
                    'title': title,
                    'path': relpath,
                    'top_folder': relpath.split('/')[0],
                    'url': url,
                    'content': content,
                    'id': hashlib.sha1(url.encode('utf-8')).hexdigest(),
                }


# Emit the bulk format: one action line, then the document itself
path, base_url = sys.argv[1], sys.argv[2]
for document in walk_documentation(path, base_url):
    print(json.dumps({'index': {'_id': document['id']}}))
    print(json.dumps(document))

https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9

{"index": {"_id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}}
{"title": "Django documentation contents", "url": "https://docs.djangoproject.com/en/1.10/contents/", "content": "=============================\nDjango documentation contents\n=============================\n\n.. toctree::\n :hidden:\n\n index\n\n.. toctree::\n :maxdepth: 3\n\n intro/index\n topics/index\n howto/index\n faq/index\n ref/index\n misc/index\n glossary\n releases/index\n internals/index\n\nIndices, glossary and tables\n============================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :doc:`glossary`\n", "top_folder": "contents.txt", "path": "contents.txt", "id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}
{"index": {"_id": "2633212db84c83b86479856e6f34494b3433a66a"}}
{"title": "Glossary", "url": "https://docs.djangoproject.com/en/1.10/glossary/", "content": "========\nGlossary\n========\n\n.. glossary::\n\n concrete model\n A non-abstract (:attr:`abstract=False\n <django.db.models.Options.abstract>`) model.\n\n field\n An attribute on a :term:`model`; a given field usually maps directly to\n a single database column.\n\n See :doc:`/topics/db/models`.\n\n generic view\n A higher-order :term:`view` function that provides an abstract/generic\n implementation of a common idiom or pattern found in view development.\n\n See :doc:`/topics/class-based-views/index`.\n\n model\n Models store your application's data.\n\n See :doc:`/topics/db/models`.\n\n MTV\n \"Model-template-view\"; a software pattern, similar in style to MVC, but\n a better description of the way Django does things.\n\n See :ref:`the FAQ entry <faq-mtv>`.\n\n MVC\n `Model-view-controller`__; a software pattern. Django :ref:`follows MVC\n to some extent <faq-mtv>`.\n\n __ https://en.wikipedia.org/wiki/Model-view-controller\n\n project\n A Python package -- i.e. a directory of code -- that contains all the\n settings for an instance of Django. This would include database\n configuration, Django-specific options and application-specific\n settings.\n\n property\n Also known as \"managed attributes\", and a feature of Python since\n version 2.2. This is a neat way to implement attributes whose usage\n resembles attribute access, but whose implementation uses method calls.\n\n See :class:`property`.\n\n queryset\n An object representing some set of rows to be fetched from the database.\n\n See :doc:`/topics/db/queries`.\n\n slug\n A short label for something, containing only letters, numbers,\n underscores or hyphens. They're generally used in URLs. For\n example, in a typical blog entry URL:\n\n .. parsed-literal::\n\n https://www.djangoproject.com/weblog/2008/apr/12/**spring**/\n\n the last bit (``spring``) is the slug.\n\n template\n A chunk of text that acts as formatting for representing data. A\n template helps to abstract the presentation of data from the data\n itself.\n\n See :doc:`/topics/templates`.\n\n view\n A function responsible for rendering a page.\n", "top_folder": "glossary.txt", "path": "glossary.txt", "id": "2633212db84c83b86479856e6f34494b3433a66a"}

python index_docs.py django/docs/ \
    https://docs.djangoproject.com/en/1.10/ | \
    curl -s -XPOST localhost:9200/docsearch/doc/_bulk \
    --data-binary @-

http://localhost:9200/docsearch/doc/_search?q=prefetch_related
http://localhost:9200/docsearch/doc/_search?q=prefetch_related+-top_folder:releases

Indexing PyPI

{
  "info": {
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "cheesecake_code_kwalitee_id": null,
    "keywords": "",
    "package_url": "http://pypi.python.org/pypi/Django",
    "author": "Django Software Foundation",
    "author_email": "[email protected]",
    "download_url": "",
    "platform": "",
    "version": "1.10",
    "cheesecake_documentation_id": null,
    "_pypi_hidden": false,
    "description": "UNKNOWN\n\n\n",
    "release_url": "http://pypi.python.org/pypi/Django/1.10",
    "downloads": {
      "last_month": 1473,
      "last_week": 0,
      "last_day": 0
    },
    "_pypi_ordering": 121,
    "requires_dist": [
      "bcrypt; extra == 'bcrypt'",
      "argon2-cffi (>=16.1.0); extra == 'argon2'"
    ],
    "classifiers": [
      "Development Status :: 5 - Production/Stable",
      "Environment :: Web Environment",
      "Framework :: Django",
      "Intended Audience :: Developers",
      …

• https://pypi.python.org/pypi/Django/json

Mapping

PUT /pypi/package/_mapping
{
  "package": {
    "properties": {
      "keywords": {"type": "string", "analyzer": "snowball"},
      "summary": {"type": "string", "analyzer": "snowball"},
      "name": {"index": "not_analyzed", "type": "string"},
      "classifiers": {"index": "not_analyzed", "type": "string"},
      "description": {"type": "string", "analyzer": "snowball"}
    }
  }
}

elasticsearch_dsl

from elasticsearch_dsl import DocType, String, Date, Integer, Boolean

class Package(DocType):
    name = String(index='not_analyzed')
    summary = String(analyzer='snowball')
    description = String(analyzer='snowball')
    keywords = String(analyzer='snowball')
    classifiers = String(index='not_analyzed', multi=True)

    class Meta:
        index = 'package'

elasticsearch_dsl

# Create the mapping in Elasticsearch (do this only once)
Package.init()

# Save a package to the index
Package(
    meta={'id': data['info']['name']},
    name=data['info']['name'],
    summary=data['info']['summary'],
    description=data['info']['description'],
    keywords=data['info']['keywords'],
    classifiers=data['info']['classifiers'],
).save()
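The `data` dictionary above is the parsed JSON from the PyPI endpoint shown earlier. A minimal sketch of fetching it and picking out the indexed fields, using only the standard library: `fetch_package` and `package_fields` are hypothetical helpers, and the URL pattern is assumed from the Django example.

```python
import json
from urllib.request import urlopen


def fetch_package(name):
    """Download and parse the JSON metadata for one package (needs network)."""
    with urlopen("https://pypi.python.org/pypi/%s/json" % name) as response:
        return json.loads(response.read().decode("utf-8"))


def package_fields(data):
    """Pick out the fields covered by the mapping, tolerating missing keys."""
    info = data["info"]
    return {
        "name": info["name"],
        "summary": info.get("summary") or "",
        "description": info.get("description") or "",
        "keywords": info.get("keywords") or "",
        "classifiers": info.get("classifiers") or [],
    }
```

The result of `package_fields(fetch_package("Django"))` can be passed as keyword arguments to a document class like `Package`.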

Kibana

http://0.0.0.0:5601/app/kibana

https://www.elastic.co/blog/kibana-4-literally

RDBMS sync strategies
• needs_indexing boolean flag
• last_touched timestamp
• Redis/Kafka queue
• Subscribe to database replication log

needs_indexing

from django.db import models

class Conference(models.Model):
    name = models.CharField(max_length=128)
    url = models.URLField(blank=True)
    # …
    needs_indexing = models.BooleanField(default=True, db_index=True)

# Reindex all conferences when associated guide is edited:
guide.conferences.all().update(needs_indexing=True)
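The other half of this strategy is a periodic job that picks up flagged rows and clears the flag. A rough sketch, with plain dicts standing in for model instances; `reindex_flagged` and the `index_row` callback are hypothetical names, not part of Django or Elasticsearch.

```python
def reindex_flagged(rows, index_row):
    """Index every row whose needs_indexing flag is set, then clear the flag.

    `rows` is a list of dicts standing in for database rows, and
    `index_row` is whatever function writes one row to the search index.
    Returns the number of rows that were reindexed.
    """
    count = 0
    for row in rows:
        if row["needs_indexing"]:
            index_row(row)
            row["needs_indexing"] = False
            count += 1
    return count
```

With a real database, a row that is edited between being indexed and having its flag cleared could be missed, so the flag should only be cleared if the row has not changed in the meantime.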

last_touched

import datetime
from django.db import models

class Conference(models.Model):
    name = models.CharField(max_length=128)
    url = models.URLField(blank=True)
    # …
    last_touched = models.DateTimeField(
        db_index=True,
        default=datetime.datetime.utcnow,
    )

# Reindex all conferences when associated guide is edited:
guide.conferences.all().update(last_touched=datetime.datetime.utcnow())

Indexing code needs to track the most recently seen last_touched datetime
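Tracking the most recently seen timestamp might look like the following. It is a sketch only: `rows_to_reindex` is a hypothetical helper and the rows are plain dicts rather than model instances.

```python
import datetime


def rows_to_reindex(rows, last_seen):
    """Return rows touched after `last_seen`, oldest first, plus the new mark.

    The caller stores the returned mark and passes it back in on the next
    run, so each run only looks at rows touched since the previous one.
    """
    changed = sorted(
        (row for row in rows if row["last_touched"] > last_seen),
        key=lambda row: row["last_touched"],
    )
    new_last_seen = changed[-1]["last_touched"] if changed else last_seen
    return changed, new_last_seen
```

Because the comparison is strict, a row written with exactly the same timestamp as the stored mark after the previous run would be skipped; comparing with `>=` and reindexing a few rows twice is a safer trade-off if that matters.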

Redis/Kafka queue
• Any time an object needs reindexing, add the type/ID to a queue
• Every few seconds, clear the queue, dedupe the item IDs, fetch from database and reindex them
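The drain-and-dedupe step is sketched below with an in-memory `collections.deque` standing in for Redis or Kafka. The function names and the `load` and `index` callbacks are assumptions made for illustration.

```python
from collections import deque


def drain_and_dedupe(queue):
    """Empty the queue and return unique (type, id) pairs in first-seen order."""
    seen = set()
    unique = []
    while queue:
        item = queue.popleft()
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def reindex_batch(queue, load, index):
    """Fetch each queued object once from the database and reindex it.

    `load(object_type, object_id)` reads the current row (or returns None
    if it has been deleted); `index(object_type, object_id, record)` writes
    it to the search index. Returns how many objects were reindexed.
    """
    count = 0
    for object_type, object_id in drain_and_dedupe(queue):
        record = load(object_type, object_id)
        if record is not None:
            index(object_type, object_id, record)
            count += 1
    return count
```

Deduping matters because a busy object may be queued many times within a few seconds, yet it only needs to be fetched and indexed once per batch.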

Replication log
• We built a system at Eventbrite called Dilithium, which subscribes to the MySQL replication log and writes interesting moments (e.g. order.updated) to Kafka
• Re-indexing code subscribes to Kafka
• github.com/noplay/python-mysql-replication

More use-cases

Faceted search

Recommendations
• Search for events where saved-by matches my-friend-1 or my-friend-2 or my-friend-3 or …
• Find events similar to my-last-10-saved-events
• Search engines are great at scoring! Boost by in-same-city-as-me, boost more by saved-by-my-friends
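One way these ideas could be combined into a single query body is sketched here. Everything specific is an assumption: the index name `events`, the fields `saved_by`, `city`, `name` and `description`, and the boost values are invented for illustration, and the exact shape of the `like` entries in `more_like_this` differs between Elasticsearch versions (older releases also expect a `_type`).

```python
import json


def recommendation_query(friend_ids, recent_event_ids, city, index="events"):
    """Build a bool query whose `should` clauses each add to the score."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Saved by any of my friends: the strongest signal
                    {"terms": {"saved_by": friend_ids, "boost": 3.0}},
                    # In the same city as me: a weaker signal
                    {"term": {"city": {"value": city, "boost": 1.5}}},
                    # Textually similar to events I saved recently
                    {"more_like_this": {
                        "fields": ["name", "description"],
                        "like": [
                            {"_index": index, "_id": event_id}
                            for event_id in recent_event_ids
                        ],
                    }},
                ]
            }
        }
    }


body = recommendation_query(["friend-1", "friend-2"], ["event-42"], "san-francisco")
print(json.dumps(body, indent=2))
```

Because every clause sits under `should`, an event matching several signals scores higher than one matching a single signal, which is the "boost, then boost more" behaviour described above.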

Analyzing user activity
• Elasticsearch + Kibana are popular tools for log analysis and can easily handle enormous amounts of traffic
• Feed user actions into a custom index: search.executed, user.followed etc
• Can then write application logic that varies depending on recent user activity

Analyzing patterns
• Faceted search makes it easy to analyze large datasets
• Create “characteristics” for your users, e.g. uses_linkedin, signed_up_in_2015, referred_by_a_friend
• Use Kibana to explore interesting relationships
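Outside Kibana, the same kind of facet counts can be requested with a terms aggregation. The sketch below only builds the request body; the `characteristics` field name and the `characteristics_facets` helper are assumptions for illustration.

```python
def characteristics_facets(must_have=None, field="characteristics", size=20):
    """Build a search body that counts users per characteristic.

    `must_have` optionally narrows the counts to users who already have
    all of the listed characteristics, which is how facets are drilled into.
    """
    body = {
        "size": 0,  # only the aggregation buckets are wanted, not the hits
        "aggs": {
            "by_characteristic": {"terms": {"field": field, "size": size}}
        },
    }
    if must_have:
        body["query"] = {
            "bool": {"filter": [{"term": {field: value}} for value in must_have]}
        }
    return body
```

For example, `characteristics_facets(["uses_linkedin"])` asks which other characteristics are most common among users who use LinkedIn.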

Why does this even work?
• Search is all about set intersections: the set of documents containing “dogs” intersected with the set of documents containing “skateboarding”
• This is very distributable: query a dozen shards, then merge and return the results
• Relevance is a first-class concept
• map/reduce in real-time (unlike Hadoop)
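The set-intersection idea and the shard merge can be shown with a toy inverted index in pure Python. This is a deliberately simplified sketch (whitespace tokenizing, no relevance scoring) and not how Lucene is actually implemented.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_shard(index, terms):
    """Intersect the document sets for every term on a single shard."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search(shards, terms):
    """Query each shard independently (map), then merge the results (reduce)."""
    results = set()
    for index in shards:
        results |= search_shard(index, terms)
    return sorted(results)


shard_a = build_index({1: "dogs skateboarding in the park", 2: "cats sleeping"})
shard_b = build_index({3: "skateboarding dogs", 4: "dogs barking"})
print(search([shard_a, shard_b], ["dogs", "skateboarding"]))
```

Each shard answers from its own documents alone, so the per-shard work can run in parallel on separate machines and only the small result sets need to be merged.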

In summary… denormalize to a query engine! Elasticsearch is pretty good
