
2016 - Simon Willison - Exploring Complex Data with Elasticsearch and Python

PyBay
August 20, 2016


Description
Elasticsearch is a powerful open-source search and analytics engine with applications that stretch far beyond adding text-based search to a website. Learn how Elasticsearch can be used with Python and Django to crunch through complex datasets and quickly build powerful interfaces for exploring information.

Bio
Simon Willison is an engineering director at Eventbrite, a Bay Area ticketing company working to bring the world together through live experiences. Simon works as part of a small product research and prototyping lab, helping develop new concepts for Eventbrite products and features. He joined Eventbrite through its acquisition of Lanyrd, a Y Combinator-funded company he co-founded in 2010. He is a co-creator of the Django Web Framework.

https://youtu.be/QMs-v-z0-as

Transcript

  1. • Global data? Query the search index
     • Looking at your own stuff? Query the DB directly - avoid any risk of “where are my changes?”
  2. denormalized query engine design pattern
     • Point of truth is your relational database, kept as normalized as possible
     • Denormalize all relevant data to a separate search index
     • Invest a lot of effort in synchronizing the two
     • Smartly route queries to database or search depending on tolerance for lag
     • Optional: query search engine for object IDs, then load directly from the database to display the content
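The optional last step - query the search engine for IDs, then hydrate from the database - can be sketched as a small helper. This is a minimal sketch, not Eventbrite's implementation; the hit shape and the `fetch_by_ids` accessor are hypothetical stand-ins:

```python
def hydrate(hits, fetch_by_ids):
    """Given search hits (dicts with an '_id' key), load the matching rows
    from the relational point of truth and return them in the search
    engine's relevance order. fetch_by_ids is a hypothetical DB accessor
    that may return rows in any order."""
    ids = [hit['_id'] for hit in hits]
    rows = {row['id']: row for row in fetch_by_ids(ids)}
    # Skip IDs the database no longer has - the index may lag slightly
    return [rows[i] for i in ids if i in rows]
```

Note the dropped-ID case: because the index lags the database by a few seconds, a hit can reference a row that was just deleted, so the helper silently skips it rather than erroring.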
  3. Why do this?
     • “Search” engines aren’t just good at text search
     • In exchange for a few seconds of indexing delay, you get:
       • horizontal scalability
       • powerful new types of query
       • aggregations
  4. • Open source search engine built on Apache Lucene
     • Interface is all JSON over HTTP - easy to use from any language
     • Claims to be “real-time” - it’s close enough
     • Insanely powerful query language (a JSON DSL)
     • Strong focus on analytics in addition to text search
     • Elastic means elastic: highly horizontally scalable
  5. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relative_path = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relative_path[:-4])
                     yield {
                         'title': title,
                         'path': relative_path,
                         'top_folder': relative_path.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  6. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relpath = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                     yield {
                         'title': title,
                         'path': relpath,
                         'top_folder': relpath.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  7. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relpath = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                     yield {
                         'title': title,
                         'path': relpath,
                         'top_folder': relpath.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  8. {"index": {"_id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}} {"title": "Django documentation contents", "url": "https://docs.djangoproject.com/en/1.10/ contents/",

    "content": "=============================\nDjango documentation contents \n=============================\n\n.. toctree::\n :hidden:\n\n index\n\n.. toctree:: \n :maxdepth: 3\n\n intro/index\n topics/index\n howto/index\n faq/index\n ref/ index\n misc/index\n glossary\n releases/index\n internals/index\n\nIndices, glossary and tables\n============================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :doc:`glossary` \n", "top_folder": "contents.txt", "path": "contents.txt", "id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"} {"index": {"_id": "2633212db84c83b86479856e6f34494b3433a66a"}} {"title": "Glossary", "url": "https://docs.djangoproject.com/en/1.10/glossary/", "content": "========\nGlossary\n========\n\n.. glossary::\n\n concrete model\n A non-abstract (:attr:`abstract=False\n <django.db.models.Options.abstract>`) model.\n\n field\n An attribute on a :term:`model`; a given field usually maps directly to\n a single database column.\n\n See :doc:`/topics/db/models`.\n\n generic view\n A higher- order :term:`view` function that provides an abstract/generic\n implementation of a common idiom or pattern found in view development.\n\n See :doc:`/topics/class-based-views/index`.\n \n model\n Models store your application's data.\n\n See :doc:`/topics/db/models`. \n\n MTV\n \"Model-template-view\"; a software pattern, similar in style to MVC, but\n a better description of the way Django does things.\n\n See :ref:`the FAQ entry <faq-mtv>`.\n \n MVC\n `Model-view-controller`__; a software pattern. Django :ref:`follows MVC\n to some extent <faq-mtv>`.\n\n __ https://en.wikipedia.org/wiki/Model-view-controller\n\n project\n A Python package -- i.e. a directory of code -- that contains all the\n settings for an instance of Django. This would include database\n configuration, Django- specific options and application-specific\n settings.\n\n property\n Also known as \"managed attributes\", and a feature of Python since\n version 2.2. 
This is a neat way to implement attributes whose usage\n resembles attribute access, but whose implementation uses method calls.\n\n See :class:`property`.\n\n queryset\n An object representing some set of rows to be fetched from the database.\n\n See :doc:`/topics/db/queries`.\n\n slug\n A short label for something, containing only letters, numbers,\n underscores or hyphens. They're generally used in URLs. For\n example, in a typical blog entry URL:\n\n .. parsed-literal::\n\n https://www.djangoproject.com/weblog/2008/apr/12/**spring**/\n\n the last bit (``spring``) is the slug.\n\n template\n A chunk of text that acts as formatting for representing data. A\n template helps to abstract the presentation of data from the data\n itself.\n\n See :doc:`/topics/templates`.\n\n view\n A function responsible for rendering a page.\n", "top_folder": "glossary.txt", "path": "glossary.txt", "id": "2633212db84c83b86479856e6f34494b3433a66a"}
  9. { "info": { "maintainer": "", "docs_url": null, "requires_python": "", "maintainer_email":

    "", "cheesecake_code_kwalitee_id": null, "keywords": "", "package_url": "http://pypi.python.org/pypi/Django", "author": "Django Software Foundation", "author_email": "[email protected]", "download_url": "", "platform": "", "version": "1.10", "cheesecake_documentation_id": null, "_pypi_hidden": false, "description": "UNKNOWN\n\n\n", "release_url": "http://pypi.python.org/pypi/Django/1.10", "downloads": { "last_month": 1473, "last_week": 0, "last_day": 0 }, "_pypi_ordering": 121, "requires_dist": [ "bcrypt; extra == 'bcrypt'", "argon2-cffi (>=16.1.0); extra == 'argon2'" ], "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Web Environment", "Framework :: Django", "Intended Audience :: Developers", • https://pypi.python.org/pypi/Django/json
  10. Mapping

      PUT /pypi/package/_mapping
      {
          "package": {
              "properties": {
                  "keywords": {"type": "string", "analyzer": "snowball"},
                  "summary": {"type": "string", "analyzer": "snowball"},
                  "name": {"index": "not_analyzed", "type": "string"},
                  "classifiers": {"index": "not_analyzed", "type": "string"},
                  "description": {"type": "string", "analyzer": "snowball"}
              }
          }
      }
  11. elasticsearch_dsl

      from elasticsearch_dsl import DocType, String, Date, Integer, Boolean

      class Package(DocType):
          name = String(index='not_analyzed')
          summary = String(analyzer='snowball')
          description = String(analyzer='snowball')
          keywords = String(analyzer='snowball')
          classifiers = String(index='not_analyzed', multi=True)

          class Meta:
              index = 'package'
  12. elasticsearch_dsl

      # Create the mapping in Elasticsearch (do this only once)
      Package.init()

      # Save a package to the index
      Package(
          meta={'id': data['info']['name']},
          name=data['info']['name'],
          summary=data['info']['summary'],
          description=data['info']['description'],
          keywords=data['info']['keywords'],
          classifiers=data['info']['classifiers'],
      ).save()
  13. RDBMS sync strategies
      • needs_indexing boolean flag
      • last_touched timestamp
      • Redis/Kafka queue
      • Subscribe to database replication log
  14. needs_indexing

      class Conference(models.Model):
          name = models.CharField(max_length=128)
          url = models.URLField(blank=True)
          # …
          needs_indexing = models.BooleanField(default=True, db_index=True)

      # Reindex all conferences when associated guide is edited:
      guide.conferences.all().update(needs_indexing=True)
  15. last_touched

      class Conference(models.Model):
          name = models.CharField(max_length=128)
          url = models.URLField(blank=True)
          # …
          last_touched = models.DateTimeField(
              db_index=True,
              default=datetime.datetime.utcnow,
          )

      # Reindex all conferences when associated guide is edited:
      guide.conferences.all().update(last_touched=datetime.datetime.utcnow())

      Indexing code needs to track the most recently seen last_touched datetime
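That high-water-mark tracking can be sketched in a few lines; this is an in-memory Python 3 illustration with a hypothetical record shape (dicts with a `last_touched` value), not the Django query the indexer would really run:

```python
def records_to_index(records, last_seen):
    """Return records touched since last_seen (oldest first) plus the
    new high-water mark the indexer should remember for its next pass."""
    batch = sorted(
        (r for r in records if r['last_touched'] > last_seen),
        key=lambda r: r['last_touched'],
    )
    new_mark = batch[-1]['last_touched'] if batch else last_seen
    return batch, new_mark
```

Processing oldest-first matters: if the indexer crashes mid-batch, restarting from the saved mark re-indexes a few rows rather than skipping any.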
  16. Redis/Kafka queue
      • Any time an object needs reindexing, add the type/ID to a queue
      • Every few seconds, clear the queue, dedupe the item IDs, fetch from database and reindex them
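The dedupe step can be illustrated with a plain list standing in for the drained queue (a sketch only; in practice the items would come from Redis or Kafka):

```python
def dedupe_queue_items(items):
    """Collapse duplicate (type, id) pairs drained from the reindexing
    queue, preserving first-seen order, so each object is fetched and
    reindexed only once per pass."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

This is why the queue approach scales well under write bursts: an object edited a hundred times in a few seconds still costs only one reindex.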
  17. Replication log
      • We built a system at Eventbrite called Dilithium, which subscribes to the MySQL replication log and writes interesting moments (e.g. order.updated) to Kafka
      • Re-indexing code subscribes to Kafka
      • github.com/noplay/python-mysql-replication
  18. Recommendations
      • Search for events where saved-by matches my-friend-1 or my-friend-2 or my-friend-3 or …
      • Find events similar to my-last-10-saved-events
      • Search engines are great at scoring! Boost by in-same-city-as-me, boost more by saved-by-my-friends
  19. Analyzing user activity
      • Elasticsearch + Kibana are popular tools for log analysis - can easily handle enormous amounts of traffic
      • Feed user actions into a custom index - search.executed, user.followed etc
      • Can then write application logic that varies depending on recent user activity
  20. Analyzing patterns
      • Faceted search makes it easy to analyze large datasets
      • Create “characteristics” for your users - e.g. uses_linkedin, signed_up_in_2015, referred_by_a_friend
      • Use Kibana to explore interesting relationships
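One terms aggregation per characteristic is a straightforward way to get those facet counts. A sketch of the request body builder, reusing the hypothetical characteristic field names from the slide:

```python
def characteristics_aggs(fields, size=10):
    """Build a request body with one terms aggregation per
    characteristic field and no hits ("size": 0) - counts only."""
    return {
        'size': 0,
        'aggs': {
            field: {'terms': {'field': field, 'size': size}}
            for field in fields
        },
    }
```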
  21. Why does this even work?
      • Search is all about set intersections: intersecting the set of documents containing “dogs” with the set of documents containing “skateboarding”
      • This is very distributable: query a dozen shards, then merge and return the results
      • Relevance is a first-class concept
      • map/reduce in real-time (unlike Hadoop)