
Exploring complex data with Elasticsearch and Python


Talk I gave at PyBay 2016. Code accompanying the talk can be found here: https://github.com/simonw/pybay-2016-elasticsearch-talk


Simon Willison

August 20, 2016

Transcript

  1. Exploring complex data with Elasticsearch and Python
    Simon Willison, PyBay - 20th Aug 2016
  2. Introducing the denormalized query engine design pattern

  3. 2016

  4. 2005

  5. None
  6. None
  7. None
  8. http://aaronland.info/talks/mw10_machinetags/#91

  9.
    • Global data? Query the search index
    • Looking at your own stuff? Query the DB directly - avoid any risk of “where are my changes?”
  10. denormalized query engine design pattern
    • Point of truth is your relational database, kept as normalized as possible
    • Denormalize all relevant data to a separate search index
    • Invest a lot of effort in synchronizing the two
    • Smartly route queries to database or search depending on tolerance for lag
    • Optional: query search engine for object IDs, then load directly from the database to display the content
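The routing idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Event` class, the `DATABASE` and `SEARCH_INDEX` dicts, and the `find_events` function are hypothetical stand-ins for a real relational database and a real search index, not code from the talk.

```python
from dataclasses import dataclass


@dataclass
class Event:
    id: int
    owner: str
    title: str


# Hypothetical stand-ins: DATABASE is the point of truth,
# SEARCH_INDEX is the denormalized copy, which may lag behind it.
DATABASE = {1: Event(1, "alice", "PyBay 2016"), 2: Event(2, "bob", "DjangoCon 2016")}
SEARCH_INDEX = {1: "pybay 2016", 2: "djangocon 2016"}


def search_ids(term):
    """Ask the (possibly stale) search index for matching object IDs only."""
    term = term.lower()
    return [doc_id for doc_id, text in sorted(SEARCH_INDEX.items()) if term in text]


def find_events(term, viewer=None):
    if viewer is not None:
        # "Looking at your own stuff": no tolerance for lag, so query the database.
        return [e for e in DATABASE.values()
                if e.owner == viewer and term.lower() in e.title.lower()]
    # Global query: get IDs from search, then load content from the database,
    # skipping anything deleted since the index was last updated.
    return [DATABASE[i] for i in search_ids(term) if i in DATABASE]


# Alice creates an event that has not been indexed yet.
DATABASE[3] = Event(3, "alice", "PyBay 2017")
global_results = [e.id for e in find_events("pybay")]
own_results = [e.id for e in find_events("pybay", viewer="alice")]
```

The global search misses the brand-new event until the index catches up, while Alice's own view reads the database and sees it immediately.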
  11. Why do this?
    • “Search” engines aren’t just good at text search
    • In exchange for a few seconds of indexing delay, you get:
      • horizontal scalability
      • powerful new types of query
      • aggregations
  12. None
  13.
    • Open source search engine built on Apache Lucene
    • Interface is all JSON over HTTP - easy to use from any language
    • Claims to be “real-time” - it’s close enough
    • Insanely powerful query language (a JSON DSL)
    • Strong focus on analytics in addition to text search
    • Elastic means elastic: highly horizontally scalable
  14. Let’s build a search engine for Django docs

  15.
        import hashlib
        import json
        import os
        import urlparse  # Python 2 module; urllib.parse in Python 3


        def walk_documentation(path='.', base_url=''):
            path = os.path.realpath(path)
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    if filename.endswith('.txt'):
                        filepath = os.path.join(dirpath, filename)
                        content = open(filepath).read()
                        title = find_title(content)  # helper not shown on the slide
                        # Figure out path relative to original
                        relative_path = os.path.relpath(filepath, path)
                        url = urlparse.urljoin(base_url, relative_path[:-4])
                        yield {
                            'title': title,
                            'path': relative_path,
                            'top_folder': relative_path.split('/')[0],
                            'url': url,
                            'content': content,
                            'id': hashlib.sha1(url).hexdigest(),
                        }


        # path and base_url are not defined on the slide
        for document in walk_documentation(path, base_url):
            print json.dumps({'index': {'_id': document['id']}})
            print json.dumps(document)

    https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  16.
        import hashlib
        import json
        import os
        import urlparse  # Python 2 module; urllib.parse in Python 3


        def walk_documentation(path='.', base_url=''):
            path = os.path.realpath(path)
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    if filename.endswith('.txt'):
                        filepath = os.path.join(dirpath, filename)
                        content = open(filepath).read()
                        title = find_title(content)  # helper not shown on the slide
                        # Figure out path relative to original
                        relpath = os.path.relpath(filepath, path)
                        # Drop the ".txt" suffix and add a trailing slash
                        url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                        yield {
                            'title': title,
                            'path': relpath,
                            'top_folder': relpath.split('/')[0],
                            'url': url,
                            'content': content,
                            'id': hashlib.sha1(url).hexdigest(),
                        }


        # path and base_url are not defined on the slide
        for document in walk_documentation(path, base_url):
            print json.dumps({'index': {'_id': document['id']}})
            print json.dumps(document)

    https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  17.
        import hashlib
        import json
        import os
        import urlparse  # Python 2 module; urllib.parse in Python 3


        def walk_documentation(path='.', base_url=''):
            path = os.path.realpath(path)
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    if filename.endswith('.txt'):
                        filepath = os.path.join(dirpath, filename)
                        content = open(filepath).read()
                        title = find_title(content)  # helper not shown on the slide
                        # Figure out path relative to original
                        relpath = os.path.relpath(filepath, path)
                        url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                        yield {
                            'title': title,
                            'path': relpath,
                            'top_folder': relpath.split('/')[0],
                            'url': url,
                            'content': content,
                            'id': hashlib.sha1(url).hexdigest(),
                        }


        # Emit one action line and one document line per file (bulk format);
        # path and base_url are not defined on the slide
        for document in walk_documentation(path, base_url):
            print json.dumps({'index': {'_id': document['id']}})
            print json.dumps(document)

    https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9
  18.
        {"index": {"_id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}}
        {"title": "Django documentation contents", "url": "https://docs.djangoproject.com/en/1.10/contents/", "content": "=============================\nDjango documentation contents\n=============================\n\n.. toctree::\n :hidden:\n\n index\n\n.. toctree::\n :maxdepth: 3\n\n intro/index\n topics/index\n howto/index\n faq/index\n ref/index\n misc/index\n glossary\n releases/index\n internals/index\n\nIndices, glossary and tables\n============================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :doc:`glossary`\n", "top_folder": "contents.txt", "path": "contents.txt", "id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}
        {"index": {"_id": "2633212db84c83b86479856e6f34494b3433a66a"}}
        {"title": "Glossary", "url": "https://docs.djangoproject.com/en/1.10/glossary/", "content": "========\nGlossary\n========\n\n.. glossary::\n\n concrete model\n A non-abstract (:attr:`abstract=False\n <django.db.models.Options.abstract>`) model.\n\n field\n An attribute on a :term:`model`; a given field usually maps directly to\n a single database column.\n\n See :doc:`/topics/db/models`.\n\n generic view\n A higher-order :term:`view` function that provides an abstract/generic\n implementation of a common idiom or pattern found in view development.\n\n See :doc:`/topics/class-based-views/index`.\n\n model\n Models store your application's data.\n\n See :doc:`/topics/db/models`.\n\n MTV\n \"Model-template-view\"; a software pattern, similar in style to MVC, but\n a better description of the way Django does things.\n\n See :ref:`the FAQ entry <faq-mtv>`.\n\n MVC\n `Model-view-controller`__; a software pattern. Django :ref:`follows MVC\n to some extent <faq-mtv>`.\n\n __ https://en.wikipedia.org/wiki/Model-view-controller\n\n project\n A Python package -- i.e. a directory of code -- that contains all the\n settings for an instance of Django. This would include database\n configuration, Django-specific options and application-specific\n settings.\n\n property\n Also known as \"managed attributes\", and a feature of Python since\n version 2.2. This is a neat way to implement attributes whose usage\n resembles attribute access, but whose implementation uses method calls.\n\n See :class:`property`.\n\n queryset\n An object representing some set of rows to be fetched from the database.\n\n See :doc:`/topics/db/queries`.\n\n slug\n A short label for something, containing only letters, numbers,\n underscores or hyphens. They're generally used in URLs. For\n example, in a typical blog entry URL:\n\n .. parsed-literal::\n\n https://www.djangoproject.com/weblog/2008/apr/12/**spring**/\n\n the last bit (``spring``) is the slug.\n\n template\n A chunk of text that acts as formatting for representing data. A\n template helps to abstract the presentation of data from the data\n itself.\n\n See :doc:`/topics/templates`.\n\n view\n A function responsible for rendering a page.\n", "top_folder": "glossary.txt", "path": "glossary.txt", "id": "2633212db84c83b86479856e6f34494b3433a66a"}
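The output above is in the bulk format: one action line, then one document line, each a complete JSON object on its own line. A minimal Python 3 sketch of producing that body follows. The helper names `make_document` and `bulk_lines` are made up for illustration and the document content is a placeholder; the talk's own script is Python 2.

```python
import hashlib
import json


def make_document(title, url, content):
    # Same ID scheme as the slides: SHA-1 of the URL.
    # Python 3 needs the URL encoded to bytes first.
    return {
        "title": title,
        "url": url,
        "content": content,
        "id": hashlib.sha1(url.encode("utf-8")).hexdigest(),
    }


def bulk_lines(documents):
    """Yield an action line followed by a document line for each document."""
    for document in documents:
        yield json.dumps({"index": {"_id": document["id"]}})
        yield json.dumps(document)


docs = [
    make_document(
        "Glossary",
        "https://docs.djangoproject.com/en/1.10/glossary/",
        "placeholder content",
    )
]
# The bulk body is newline-delimited and ends with a final newline.
body = "\n".join(bulk_lines(docs)) + "\n"
```

Because `json.dumps` escapes newlines inside strings, each document stays on a single line even when its content spans many lines.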
  19.
        python index_docs.py django/docs/ \
            https://docs.djangoproject.com/en/1.10/ | \
            curl -s -XPOST localhost:9200/docsearch/doc/_bulk \
            --data-binary @-
  20.
    http://localhost:9200/docsearch/doc/_search?q=prefetch_related
    http://localhost:9200/docsearch/doc/_search?q=prefetch_related+-top_folder:releases
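These URLs use the `q` parameter, which takes query-string syntax. A small sketch of building such a URL from Python, alongside what I believe is the equivalent JSON request body using a `query_string` query; treat that equivalence as an assumption. The `search_url` and `search_body` helpers are hypothetical names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://localhost:9200/docsearch/doc/_search"


def search_url(query):
    """Build a URI search; urlencode turns spaces into '+' and escapes ':'."""
    return BASE + "?" + urlencode({"q": query})


def search_body(query):
    """Assumed equivalent request body for a POST to the same endpoint."""
    return {"query": {"query_string": {"query": query}}}


# Documents mentioning prefetch_related, excluding the release notes.
url = search_url("prefetch_related -top_folder:releases")
body = search_body("prefetch_related -top_folder:releases")
```

The encoded `%3A` decodes back to the `:` that separates the field name from its value.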

  21. Indexing PyPI

  22.
        {
          "info": {
            "maintainer": "",
            "docs_url": null,
            "requires_python": "",
            "maintainer_email": "",
            "cheesecake_code_kwalitee_id": null,
            "keywords": "",
            "package_url": "http://pypi.python.org/pypi/Django",
            "author": "Django Software Foundation",
            "author_email": "foundation@djangoproject.com",
            "download_url": "",
            "platform": "",
            "version": "1.10",
            "cheesecake_documentation_id": null,
            "_pypi_hidden": false,
            "description": "UNKNOWN\n\n\n",
            "release_url": "http://pypi.python.org/pypi/Django/1.10",
            "downloads": {
              "last_month": 1473,
              "last_week": 0,
              "last_day": 0
            },
            "_pypi_ordering": 121,
            "requires_dist": [
              "bcrypt; extra == 'bcrypt'",
              "argon2-cffi (>=16.1.0); extra == 'argon2'"
            ],
            "classifiers": [
              "Development Status :: 5 - Production/Stable",
              "Environment :: Web Environment",
              "Framework :: Django",
              "Intended Audience :: Developers",

    • https://pypi.python.org/pypi/Django/json
  23. Mapping
        PUT /pypi/package/_mapping
        {
          "package": {
            "properties": {
              "keywords": {"type": "string", "analyzer": "snowball"},
              "summary": {"type": "string", "analyzer": "snowball"},
              "name": {"index": "not_analyzed", "type": "string"},
              "classifiers": {"index": "not_analyzed", "type": "string"},
              "description": {"type": "string", "analyzer": "snowball"}
            }
          }
        }
  24. elasticsearch_dsl
        from elasticsearch_dsl import DocType, String, Date, Integer, Boolean


        class Package(DocType):
            name = String(index='not_analyzed')
            summary = String(analyzer='snowball')
            description = String(analyzer='snowball')
            keywords = String(analyzer='snowball')
            classifiers = String(index='not_analyzed', multi=True)

            class Meta:
                index = 'package'
  25. elasticsearch_dsl
        # Create the mapping in Elasticsearch (do this only once)
        Package.init()

        # Save a package to the index, using the package name as the document ID
        Package(
            meta={
                'id': data['info']['name']
            },
            name=data['info']['name'],
            summary=data['info']['summary'],
            description=data['info']['description'],
            keywords=data['info']['keywords'],
            classifiers=data['info']['classifiers'],
        ).save()
  26. Sense

  27. None
  28. Kibana

  29. http://0.0.0.0:5601/app/kibana

  30. https://www.elastic.co/blog/kibana-4-literally

  31. RDBMS sync strategies
    • needs_indexing boolean flag
    • last_touched timestamp
    • Redis/Kafka queue
    • Subscribe to database replication log
  32. needs_indexing
        from django.db import models


        class Conference(models.Model):
            name = models.CharField(max_length=128)
            url = models.URLField(blank=True)
            # …
            needs_indexing = models.BooleanField(default=True, db_index=True)


        # Reindex all conferences when associated guide is edited:
        guide.conferences.all().update(needs_indexing=True)
  33. last_touched
        import datetime

        from django.db import models


        class Conference(models.Model):
            name = models.CharField(max_length=128)
            url = models.URLField(blank=True)
            # …
            last_touched = models.DateTimeField(
                db_index=True,
                default=datetime.datetime.utcnow,
            )


        # Reindex all conferences when associated guide is edited:
        guide.conferences.all().update(last_touched=datetime.datetime.utcnow())

    The indexing code needs to track the most recently seen last_touched datetime.
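Tracking the most recently seen `last_touched` value could look roughly like this. It is a sketch under stated assumptions: `ROWS` is a hypothetical in-memory stand-in for the `Conference` table, `index` is a plain dict standing in for the search index, and `reindex_pass` is an invented name.

```python
import datetime

# Hypothetical stand-in for the Conference table.
ROWS = [
    {"id": 1, "name": "PyBay", "last_touched": datetime.datetime(2016, 8, 20, 9, 0)},
    {"id": 2, "name": "DjangoCon", "last_touched": datetime.datetime(2016, 8, 20, 9, 5)},
]


def rows_touched_since(since):
    # With the Django ORM this would be something like
    # Conference.objects.filter(last_touched__gt=since).order_by("last_touched")
    touched = [r for r in ROWS if since is None or r["last_touched"] > since]
    return sorted(touched, key=lambda r: r["last_touched"])


def reindex_pass(since, index):
    """Index every row touched after `since`; return the new high-water mark."""
    for row in rows_touched_since(since):
        index[row["id"]] = row["name"]
        since = row["last_touched"]
    return since


index = {}
mark = reindex_pass(None, index)  # first pass indexes everything

# Row 1 is edited later, so only it is picked up by the next pass.
ROWS[0]["name"] = "PyBay 2016"
ROWS[0]["last_touched"] = datetime.datetime(2016, 8, 20, 9, 10)
mark = reindex_pass(mark, index)
```

One caveat of the strict `>` comparison: a row written with exactly the same timestamp as the current mark would be skipped, so a real indexer might re-scan a small overlap window.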
  34. Redis/Kafka queue
    • Any time an object needs reindexing, add the type/ID to a queue
    • Every few seconds, clear the queue, dedupe the item IDs, fetch from database and reindex them
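The drain-and-dedupe step can be sketched with an in-process queue. A `collections.deque` stands in for Redis or Kafka here, and the `enqueue` and `drain_and_dedupe` names are hypothetical.

```python
from collections import deque

# Stand-in for a Redis list or Kafka topic.
queue = deque()


def enqueue(object_type, object_id):
    """Record that an object needs reindexing."""
    queue.append((object_type, object_id))


def drain_and_dedupe():
    """Empty the queue, returning each (type, id) pair once, in first-seen order."""
    seen = {}
    while queue:
        seen.setdefault(queue.popleft(), None)
    return list(seen)


# The same conference is touched twice before the indexer wakes up.
enqueue("conference", 1)
enqueue("guide", 7)
enqueue("conference", 1)
batch = drain_and_dedupe()
```

Each pair in `batch` would then be fetched from the database and reindexed once, however many times it was queued.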
  35. Replication log
    • We built a system at Eventbrite called Dilithium, which subscribes to the MySQL replication log and writes interesting moments (e.g. order.updated) to Kafka
    • Re-indexing code subscribes to Kafka
    • github.com/noplay/python-mysql-replication
  36. More use-cases

  37. Faceted search
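Faceted search in Elasticsearch is usually built on a `terms` aggregation. The sketch below builds such a request body and reads the bucket counts out of a response. The `facet_request` and `facet_counts` helpers are hypothetical, the field names follow the PyPI mapping from the earlier slides, and the sample response is hand-written for illustration rather than real output.

```python
def facet_request(field, text=None, size=10):
    """Request body asking for counts per distinct value of `field`."""
    query = {"match": {"description": text}} if text else {"match_all": {}}
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": query,
        "aggs": {"facet": {"terms": {"field": field, "size": size}}},
    }


def facet_counts(response):
    """Turn the aggregation buckets into a {value: count} dict."""
    buckets = response["aggregations"]["facet"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


request = facet_request("classifiers", text="search")

# Hand-written sample in the shape of a terms aggregation response.
sample_response = {
    "aggregations": {
        "facet": {
            "buckets": [
                {"key": "Framework :: Django", "doc_count": 12},
                {"key": "Intended Audience :: Developers", "doc_count": 9},
            ]
        }
    }
}
counts = facet_counts(sample_response)
```

This works best on fields mapped as `not_analyzed`, such as `classifiers`, so that each full value becomes a single bucket.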

  38. Recommendations
    • Search for events where saved-by matches my-friend-1 or my-friend-2 or my-friend-3 or …
    • Find events similar to my-last-10-saved-events
    • Search engines are great at scoring! Boost by in-same-city-as-me, boost more by saved-by-my-friends
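One way to express the boosting idea is a `bool` query with weighted `should` clauses. This is a sketch, not the query used at Eventbrite: the `saved_by` and `city` field names, the boost values, and the `recommendation_query` helper are all assumptions.

```python
def recommendation_query(friend_ids, city):
    """Events saved by any friend or located in my city, friends weighted higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Boost more when one of my friends saved the event.
                    {"terms": {"saved_by": friend_ids, "boost": 3.0}},
                    # Smaller boost for events in the same city as me.
                    {"term": {"city": {"value": city, "boost": 1.5}}},
                ]
            }
        }
    }


query = recommendation_query(["my-friend-1", "my-friend-2"], "san-francisco")
```

Events matching both clauses should score highest, which gives a ranked list of recommendations without any separate scoring code.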
  39. Analyzing user activity
    • Elasticsearch + Kibana are popular tools for log analysis - can easily handle enormous amounts of traffic
    • Feed user actions into a custom index - search.executed, user.followed etc
    • Can then write application logic that varies depending on recent user activity
  40. Analyzing patterns
    • Faceted search makes it easy to analyze large datasets
    • Create “characteristics” for your users - e.g. uses_linkedin, signed_up_in_2015, referred_by_a_friend
    • Use Kibana to explore interesting relationships
  41. Why does this even work?
    • Search is all about set intersections: the set of documents containing “dogs” intersected with the set of documents containing “skateboarding”
    • This is very distributable: query a dozen shards, then merge and return the results
    • Relevance is a first-class concept
    • map/reduce in real-time (unlike Hadoop)
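The set-intersection and shard-merge idea can be shown with a toy inverted index. This is a deliberately simplified sketch: the documents are invented, results are merged by document ID rather than by relevance score, and none of it reflects how Lucene is actually implemented.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """Inverted index: each word maps to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_shard(index, terms):
    """Intersect the posting sets for every term on a single shard."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Two shards, each holding its own slice of the documents.
shards = [
    build_index({1: "dogs skateboarding", 2: "dogs sleeping"}),
    build_index({3: "skateboarding dogs compilation", 4: "cats skateboarding"}),
]


def search(terms):
    """Query every shard independently, then merge the sorted per-shard results."""
    return list(heapq.merge(*(search_shard(shard, terms) for shard in shards)))


results = search(["dogs", "skateboarding"])
```

Each shard answers using only its own data, which is why the work spreads across machines so easily; only the small per-shard result lists need to be combined.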
  42. in summary…
    denormalize to a query engine!
    Elasticsearch is pretty good