
2016 - Simon Willison - Exploring Complex Data with Elasticsearch and Python

PyBay
August 20, 2016


Description
Elasticsearch is a powerful open-source search and analytics engine with applications that stretch far beyond adding text-based search to a website. Learn how Elasticsearch can be used with Python and Django to crunch through complex datasets and quickly build powerful interfaces for exploring information.

Bio
Simon Willison is an engineering director at Eventbrite, a Bay Area ticketing company working to bring the world together through live experiences. Simon works as part of a small product research and prototyping lab, helping develop new concepts for Eventbrite products and features. He joined Eventbrite through its acquisition of Lanyrd, a Y Combinator-funded company he co-founded in 2010. He is a co-creator of the Django Web Framework.

https://youtu.be/QMs-v-z0-as

Transcript

  1. • Global data? Query the search index
     • Looking at your own stuff? Query the DB directly - avoid any risk of “where are my changes?”
  2. denormalized query engine design pattern
     • Point of truth is your relational database, kept as normalized as possible
     • Denormalize all relevant data to a separate search index
     • Invest a lot of effort in synchronizing the two
     • Smartly route queries to database or search depending on tolerance for lag
     • Optional: query search engine for object IDs, then load directly from the database to display the content
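The optional last step - query the search engine for IDs, then hydrate from the database - can be sketched as a small helper. This is a minimal sketch, not Eventbrite's implementation; the hit shape and the `fetch_by_ids` accessor are hypothetical stand-ins:

```python
def hydrate(hits, fetch_by_ids):
    """Given search hits (dicts with an '_id' key), load the matching rows
    from the relational point of truth and return them in the search
    engine's relevance order. fetch_by_ids is a hypothetical DB accessor
    that may return rows in any order."""
    ids = [hit['_id'] for hit in hits]
    rows = {row['id']: row for row in fetch_by_ids(ids)}
    # Skip IDs the database no longer has - the index may lag slightly
    return [rows[i] for i in ids if i in rows]
```

Note the dropped-ID case: because the index lags the database by a few seconds, a hit can reference a row that was just deleted, so the helper silently skips it rather than erroring.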
  3. Why do this?
     • “Search” engines aren’t just good at text search
     • In exchange for a few seconds of indexing delay, you get:
       • horizontal scalability
       • powerful new types of query
       • aggregations
  4. • Open source search engine built on Apache Lucene
     • Interface is all JSON over HTTP - easy to use from any language
     • Claims to be “real-time” - it’s close enough
     • Insanely powerful query language (a JSON DSL)
     • Strong focus on analytics in addition to text search
     • Elastic means elastic: highly horizontally scalable
  5. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relative_path = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relative_path[:-4])
                     yield {
                         'title': title,
                         'path': relative_path,
                         'top_folder': relative_path.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  6. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relpath = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                     yield {
                         'title': title,
                         'path': relpath,
                         'top_folder': relpath.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  7. def walk_documentation(path='.', base_url=''):
         path = os.path.realpath(path)
         for dirpath, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 if filename.endswith('.txt'):
                     filepath = os.path.join(dirpath, filename)
                     content = open(filepath).read()
                     title = find_title(content)
                     # Figure out path relative to original
                     relpath = os.path.relpath(filepath, path)
                     url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                     yield {
                         'title': title,
                         'path': relpath,
                         'top_folder': relpath.split('/')[0],
                         'url': url,
                         'content': content,
                         'id': hashlib.sha1(url).hexdigest(),
                     }

     for document in walk_documentation(path, base_url):
         print json.dumps({'index': {'_id': document['id']}})
         print json.dumps(document)

     https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  8. {"index": {"_id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}} {"title": "Django documentation contents", "url": "https://docs.djangoproject.com/en/1.10/ contents/",

    "content": "=============================\nDjango documentation contents \n=============================\n\n.. toctree::\n :hidden:\n\n index\n\n.. toctree:: \n :maxdepth: 3\n\n intro/index\n topics/index\n howto/index\n faq/index\n ref/ index\n misc/index\n glossary\n releases/index\n internals/index\n\nIndices, glossary and tables\n============================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :doc:`glossary` \n", "top_folder": "contents.txt", "path": "contents.txt", "id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"} {"index": {"_id": "2633212db84c83b86479856e6f34494b3433a66a"}} {"title": "Glossary", "url": "https://docs.djangoproject.com/en/1.10/glossary/", "content": "========\nGlossary\n========\n\n.. glossary::\n\n concrete model\n A non-abstract (:attr:`abstract=False\n <django.db.models.Options.abstract>`) model.\n\n field\n An attribute on a :term:`model`; a given field usually maps directly to\n a single database column.\n\n See :doc:`/topics/db/models`.\n\n generic view\n A higher- order :term:`view` function that provides an abstract/generic\n implementation of a common idiom or pattern found in view development.\n\n See :doc:`/topics/class-based-views/index`.\n \n model\n Models store your application's data.\n\n See :doc:`/topics/db/models`. \n\n MTV\n \"Model-template-view\"; a software pattern, similar in style to MVC, but\n a better description of the way Django does things.\n\n See :ref:`the FAQ entry <faq-mtv>`.\n \n MVC\n `Model-view-controller`__; a software pattern. Django :ref:`follows MVC\n to some extent <faq-mtv>`.\n\n __ https://en.wikipedia.org/wiki/Model-view-controller\n\n project\n A Python package -- i.e. a directory of code -- that contains all the\n settings for an instance of Django. This would include database\n configuration, Django- specific options and application-specific\n settings.\n\n property\n Also known as \"managed attributes\", and a feature of Python since\n version 2.2. 
This is a neat way to implement attributes whose usage\n resembles attribute access, but whose implementation uses method calls.\n\n See :class:`property`.\n\n queryset\n An object representing some set of rows to be fetched from the database.\n\n See :doc:`/topics/db/queries`.\n\n slug\n A short label for something, containing only letters, numbers,\n underscores or hyphens. They're generally used in URLs. For\n example, in a typical blog entry URL:\n\n .. parsed-literal::\n\n https://www.djangoproject.com/weblog/2008/apr/12/**spring**/\n\n the last bit (``spring``) is the slug.\n\n template\n A chunk of text that acts as formatting for representing data. A\n template helps to abstract the presentation of data from the data\n itself.\n\n See :doc:`/topics/templates`.\n\n view\n A function responsible for rendering a page.\n", "top_folder": "glossary.txt", "path": "glossary.txt", "id": "2633212db84c83b86479856e6f34494b3433a66a"}
  9. { "info": { "maintainer": "", "docs_url": null, "requires_python": "", "maintainer_email":

    "", "cheesecake_code_kwalitee_id": null, "keywords": "", "package_url": "http://pypi.python.org/pypi/Django", "author": "Django Software Foundation", "author_email": "[email protected]", "download_url": "", "platform": "", "version": "1.10", "cheesecake_documentation_id": null, "_pypi_hidden": false, "description": "UNKNOWN\n\n\n", "release_url": "http://pypi.python.org/pypi/Django/1.10", "downloads": { "last_month": 1473, "last_week": 0, "last_day": 0 }, "_pypi_ordering": 121, "requires_dist": [ "bcrypt; extra == 'bcrypt'", "argon2-cffi (>=16.1.0); extra == 'argon2'" ], "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Web Environment", "Framework :: Django", "Intended Audience :: Developers", • https://pypi.python.org/pypi/Django/json
  10. Mapping

      PUT /pypi/package/_mapping
      {
          "package": {
              "properties": {
                  "keywords": {"type": "string", "analyzer": "snowball"},
                  "summary": {"type": "string", "analyzer": "snowball"},
                  "name": {"index": "not_analyzed", "type": "string"},
                  "classifiers": {"index": "not_analyzed", "type": "string"},
                  "description": {"type": "string", "analyzer": "snowball"}
              }
          }
      }
  11. elasticsearch_dsl

      from elasticsearch_dsl import DocType, String, Date, Integer, Boolean

      class Package(DocType):
          name = String(index='not_analyzed')
          summary = String(analyzer='snowball')
          description = String(analyzer='snowball')
          keywords = String(analyzer='snowball')
          classifiers = String(index='not_analyzed', multi=True)

          class Meta:
              index = 'package'
  12. elasticsearch_dsl

      # Create the mapping in Elasticsearch (do this only once)
      Package.init()

      # Save a package to the index
      Package(
          meta={'id': data['info']['name']},
          name=data['info']['name'],
          summary=data['info']['summary'],
          description=data['info']['description'],
          keywords=data['info']['keywords'],
          classifiers=data['info']['classifiers'],
      ).save()
  13. RDBMS sync strategies
      • needs_indexing boolean flag
      • last_touched timestamp
      • Redis/Kafka queue
      • Subscribe to database replication log
  14. needs_indexing

      class Conference(models.Model):
          name = models.CharField(max_length=128)
          url = models.URLField(blank=True)
          # …
          needs_indexing = models.BooleanField(default=True, db_index=True)

      # Reindex all conferences when associated guide is edited:
      guide.conferences.all().update(needs_indexing=True)
  15. last_touched

      class Conference(models.Model):
          name = models.CharField(max_length=128)
          url = models.URLField(blank=True)
          # …
          last_touched = models.DateTimeField(
              db_index=True,
              default=datetime.datetime.utcnow,
          )

      # Reindex all conferences when associated guide is edited:
      guide.conferences.all().update(last_touched=datetime.datetime.utcnow())

      Indexing code needs to track the most recently seen last_touched datetime
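That high-water-mark tracking can be sketched in a few lines; this is an in-memory Python 3 illustration with a hypothetical record shape (dicts with a `last_touched` value), not the Django query the indexer would really run:

```python
def records_to_index(records, last_seen):
    """Return records touched since last_seen (oldest first) plus the
    new high-water mark the indexer should remember for its next pass."""
    batch = sorted(
        (r for r in records if r['last_touched'] > last_seen),
        key=lambda r: r['last_touched'],
    )
    new_mark = batch[-1]['last_touched'] if batch else last_seen
    return batch, new_mark
```

Processing oldest-first matters: if the indexer crashes mid-batch, restarting from the saved mark re-indexes a few rows rather than skipping any.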
  16. Redis/Kafka queue
      • Any time an object needs reindexing, add the type/ID to a queue
      • Every few seconds, clear the queue, dedupe the item IDs, fetch from database and reindex them
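The dedupe step can be illustrated with a plain list standing in for the drained queue (a sketch only; in practice the items would come from Redis or Kafka):

```python
def dedupe_queue_items(items):
    """Collapse duplicate (type, id) pairs drained from the reindexing
    queue, preserving first-seen order, so each object is fetched and
    reindexed only once per pass."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

This is why the queue approach scales well under write bursts: an object edited a hundred times in a few seconds still costs only one reindex.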
  17. Replication log
      • We built a system at Eventbrite called Dilithium, which subscribes to the MySQL replication log and writes interesting moments (e.g. order.updated) to Kafka
      • Re-indexing code subscribes to Kafka
      • github.com/noplay/python-mysql-replication
  18. Recommendations
      • Search for events where saved-by matches my-friend-1 or my-friend-2 or my-friend-3 or …
      • Find events similar to my-last-10-saved-events
      • Search engines are great at scoring! Boost by in-same-city-as-me, boost more by saved-by-my-friends
  19. Analyzing user activity
      • Elasticsearch + Kibana are popular tools for log analysis - can easily handle enormous amounts of traffic
      • Feed user actions into a custom index - search.executed, user.followed etc
      • Can then write application logic that varies depending on recent user activity
  20. Analyzing patterns
      • Faceted search makes it easy to analyze large datasets
      • Create “characteristics” for your users - e.g. uses_linkedin, signed_up_in_2015, referred_by_a_friend
      • Use Kibana to explore interesting relationships
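One terms aggregation per characteristic is a straightforward way to get those facet counts. A sketch of the request body builder, reusing the hypothetical characteristic field names from the slide:

```python
def characteristics_aggs(fields, size=10):
    """Build a request body with one terms aggregation per
    characteristic field and no hits ("size": 0) - counts only."""
    return {
        'size': 0,
        'aggs': {
            field: {'terms': {'field': field, 'size': size}}
            for field in fields
        },
    }
```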
  21. Why does this even work?
      • Search is all about set intersections: intersecting the set of documents containing “dogs” with the set of documents containing “skateboarding”
      • This is very distributable: query a dozen shards, then merge and return the results
      • Relevance is a first-class concept
      • map/reduce in real-time (unlike Hadoop)