Slide 1

Slide 1 text

Exploring complex data with Elasticsearch and Python
Simon Willison, PyBay - 20th Aug 2016

Slide 2

Slide 2 text

Introducing the denormalized query engine design pattern

Slide 3

Slide 3 text

2016

Slide 4

Slide 4 text

2005

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

http://aaronland.info/talks/mw10_machinetags/#91

Slide 9

Slide 9 text

• Global data? Query the search index
• Looking at your own stuff? Query the DB directly - avoid any risk of “where are my changes?”

Slide 10

Slide 10 text

denormalized query engine design pattern
• Point of truth is your relational database, kept as normalized as possible
• Denormalize all relevant data to a separate search index
• Invest a lot of effort in synchronizing the two
• Smartly route queries to database or search depending on tolerance for lag
• Optional: query search engine for object IDs, then load directly from the database to display the content
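
The routing and "search for IDs, then load from the database" steps above can be sketched in Python 3. This is a minimal illustration under stated assumptions, not the talk's implementation: `choose_backend`, `search_then_load`, and the two callables standing in for the search engine and the database are hypothetical names.

```python
from typing import Callable, Dict, List


def choose_backend(viewing_own_content: bool) -> str:
    """Pick a backend based on tolerance for indexing lag."""
    # A user's own recent edits must show up immediately, so read those
    # from the database; everything else can tolerate a few seconds of lag.
    return "database" if viewing_own_content else "search"


def search_then_load(
    search_ids: Callable[[str], List[int]],
    load_by_ids: Callable[[List[int]], Dict[int, dict]],
    query: str,
) -> List[dict]:
    """Ask the search engine for matching IDs, then load fresh rows."""
    ids = search_ids(query)
    rows = load_by_ids(ids)
    # Keep the search engine's ranking and drop IDs that were deleted
    # from the database after they were indexed.
    return [rows[i] for i in ids if i in rows]
```

Loading by ID means the displayed content always comes from the point of truth, even if the index is slightly stale.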

Slide 11

Slide 11 text

Why do this?
• “Search” engines aren’t just good at text search
• In exchange for a few seconds of indexing delay, you get:
  • horizontal scalability
  • powerful new types of query
  • aggregations

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

• Open source search engine built on Apache Lucene
• Interface is all JSON over HTTP - easy to use from any language
• Claims to be “real-time” - it’s close enough
• Insanely powerful query language (a JSON DSL)
• Strong focus on analytics in addition to text search
• Elastic means elastic: highly horizontally scalable

Slide 14

Slide 14 text

Let’s build a search engine for Django docs

Slide 15

Slide 15 text

import hashlib
import json
import os
import urlparse


def walk_documentation(path='.', base_url=''):
    path = os.path.realpath(path)
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.txt'):
                filepath = os.path.join(dirpath, filename)
                content = open(filepath).read()
                title = find_title(content)
                # Figure out path relative to original
                relative_path = os.path.relpath(filepath, path)
                url = urlparse.urljoin(base_url, relative_path[:-4])
                yield {
                    'title': title,
                    'path': relative_path,
                    'top_folder': relative_path.split('/')[0],
                    'url': url,
                    'content': content,
                    'id': hashlib.sha1(url).hexdigest(),
                }


for document in walk_documentation(path, base_url):
    print json.dumps({'index': {'_id': document['id']}})
    print json.dumps(document)

https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9

Slide 16

Slide 16 text

import hashlib
import json
import os
import urlparse


def walk_documentation(path='.', base_url=''):
    path = os.path.realpath(path)
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.txt'):
                filepath = os.path.join(dirpath, filename)
                content = open(filepath).read()
                title = find_title(content)
                # Figure out path relative to original
                relpath = os.path.relpath(filepath, path)
                url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                yield {
                    'title': title,
                    'path': relpath,
                    'top_folder': relpath.split('/')[0],
                    'url': url,
                    'content': content,
                    'id': hashlib.sha1(url).hexdigest(),
                }


for document in walk_documentation(path, base_url):
    print json.dumps({'index': {'_id': document['id']}})
    print json.dumps(document)

https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9

Slide 17

Slide 17 text

import hashlib
import json
import os
import urlparse


def walk_documentation(path='.', base_url=''):
    path = os.path.realpath(path)
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.txt'):
                filepath = os.path.join(dirpath, filename)
                content = open(filepath).read()
                title = find_title(content)
                # Figure out path relative to original
                relpath = os.path.relpath(filepath, path)
                url = urlparse.urljoin(base_url, relpath[:-4] + '/')
                yield {
                    'title': title,
                    'path': relpath,
                    'top_folder': relpath.split('/')[0],
                    'url': url,
                    'content': content,
                    'id': hashlib.sha1(url).hexdigest(),
                }


for document in walk_documentation(path, base_url):
    print json.dumps({'index': {'_id': document['id']}})
    print json.dumps(document)

https://gist.github.com/simonw/273caa2e47b1065af9b75087cb78fdd9

Slide 18

Slide 18 text

{"index": {"_id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}}
{"title": "Django documentation contents", "url": "https://docs.djangoproject.com/en/1.10/contents/", "content": "=============================\nDjango documentation contents\n=============================\n\n.. toctree::\n :hidden:\n\n index\n\n.. toctree::\n :maxdepth: 3\n\n intro/index\n topics/index\n howto/index\n faq/index\n ref/index\n misc/index\n glossary\n releases/index\n internals/index\n\nIndices, glossary and tables\n============================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :doc:`glossary`\n", "top_folder": "contents.txt", "path": "contents.txt", "id": "de72ca631bca86f405aa301b9ee8590a4cf4e7c8"}
{"index": {"_id": "2633212db84c83b86479856e6f34494b3433a66a"}}
{"title": "Glossary", "url": "https://docs.djangoproject.com/en/1.10/glossary/", "content": "========\nGlossary\n========\n\n.. glossary::\n\n concrete model\n A non-abstract (:attr:`abstract=False\n `) model.\n\n field\n An attribute on a :term:`model`; a given field usually maps directly to\n a single database column.\n\n See :doc:`/topics/db/models`.\n\n generic view\n A higher-order :term:`view` function that provides an abstract/generic\n implementation of a common idiom or pattern found in view development.\n\n See :doc:`/topics/class-based-views/index`.\n\n model\n Models store your application's data.\n\n See :doc:`/topics/db/models`.\n\n MTV\n \"Model-template-view\"; a software pattern, similar in style to MVC, but\n a better description of the way Django does things.\n\n See :ref:`the FAQ entry `.\n\n MVC\n `Model-view-controller`__; a software pattern. Django :ref:`follows MVC\n to some extent `.\n\n __ https://en.wikipedia.org/wiki/Model-view-controller\n\n project\n A Python package -- i.e. a directory of code -- that contains all the\n settings for an instance of Django. This would include database\n configuration, Django-specific options and application-specific\n settings.\n\n property\n Also known as \"managed attributes\", and a feature of Python since\n version 2.2. This is a neat way to implement attributes whose usage\n resembles attribute access, but whose implementation uses method calls.\n\n See :class:`property`.\n\n queryset\n An object representing some set of rows to be fetched from the database.\n\n See :doc:`/topics/db/queries`.\n\n slug\n A short label for something, containing only letters, numbers,\n underscores or hyphens. They're generally used in URLs. For\n example, in a typical blog entry URL:\n\n .. parsed-literal::\n\n https://www.djangoproject.com/weblog/2008/apr/12/**spring**/\n\n the last bit (``spring``) is the slug.\n\n template\n A chunk of text that acts as formatting for representing data. A\n template helps to abstract the presentation of data from the data\n itself.\n\n See :doc:`/topics/templates`.\n\n view\n A function responsible for rendering a page.\n", "top_folder": "glossary.txt", "path": "glossary.txt", "id": "2633212db84c83b86479856e6f34494b3433a66a"}
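
The output above is the Elasticsearch bulk format: an action line followed by the document itself, one JSON object per line, with a newline after the last line. A minimal Python 3 sketch of producing it (the slides themselves use Python 2); `bulk_body` is a hypothetical helper name introduced here for illustration.

```python
import json
from typing import Iterable


def bulk_body(documents: Iterable[dict]) -> str:
    """Serialize documents as newline-delimited action/document pairs."""
    lines = []
    for document in documents:
        # Action line: index this document under its own ID.
        lines.append(json.dumps({"index": {"_id": document["id"]}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(document))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Using a deterministic ID (the slides hash the URL) means re-running the indexer overwrites documents instead of duplicating them.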

Slide 19

Slide 19 text

python index_docs.py django/docs/ \
    https://docs.djangoproject.com/en/1.10/ | \
    curl -s -XPOST localhost:9200/docsearch/doc/_bulk \
    --data-binary @-

Slide 20

Slide 20 text

http://localhost:9200/docsearch/doc/_search?q=prefetch_related
http://localhost:9200/docsearch/doc/_search?q=prefetch_related+-top_folder:releases
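
The `q` parameter takes Lucene query-string syntax, so the second URL asks for documents matching `prefetch_related` while excluding any whose `top_folder` is `releases`. A small Python 3 sketch of building such URLs; `search_url` is a hypothetical helper, and the default base URL is simply the one shown on the slide.

```python
from urllib.parse import urlencode


def search_url(
    query: str,
    base: str = "http://localhost:9200/docsearch/doc/_search",
) -> str:
    """Build a URI-search URL for a query-string query."""
    # safe=":" keeps the field:value separator readable instead of %3A;
    # spaces are encoded as "+".
    return base + "?" + urlencode({"q": query}, safe=":")
```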

Slide 21

Slide 21 text

Indexing PyPI

Slide 22

Slide 22 text

{
  "info": {
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "cheesecake_code_kwalitee_id": null,
    "keywords": "",
    "package_url": "http://pypi.python.org/pypi/Django",
    "author": "Django Software Foundation",
    "author_email": "[email protected]",
    "download_url": "",
    "platform": "",
    "version": "1.10",
    "cheesecake_documentation_id": null,
    "_pypi_hidden": false,
    "description": "UNKNOWN\n\n\n",
    "release_url": "http://pypi.python.org/pypi/Django/1.10",
    "downloads": {
      "last_month": 1473,
      "last_week": 0,
      "last_day": 0
    },
    "_pypi_ordering": 121,
    "requires_dist": [
      "bcrypt; extra == 'bcrypt'",
      "argon2-cffi (>=16.1.0); extra == 'argon2'"
    ],
    "classifiers": [
      "Development Status :: 5 - Production/Stable",
      "Environment :: Web Environment",
      "Framework :: Django",
      "Intended Audience :: Developers",

• https://pypi.python.org/pypi/Django/json

Slide 23

Slide 23 text

Mapping

PUT /pypi/package/_mapping
{
  "package": {
    "properties": {
      "keywords": {"type": "string", "analyzer": "snowball"},
      "summary": {"type": "string", "analyzer": "snowball"},
      "name": {"index": "not_analyzed", "type": "string"},
      "classifiers": {"index": "not_analyzed", "type": "string"},
      "description": {"type": "string", "analyzer": "snowball"}
    }
  }
}

Slide 24

Slide 24 text

elasticsearch_dsl

from elasticsearch_dsl import DocType, String, Date, Integer, Boolean


class Package(DocType):
    name = String(index='not_analyzed')
    summary = String(analyzer='snowball')
    description = String(analyzer='snowball')
    keywords = String(analyzer='snowball')
    classifiers = String(index='not_analyzed', multi=True)

    class Meta:
        index = 'package'

Slide 25

Slide 25 text

elasticsearch_dsl

# Create the mapping in Elasticsearch (do this only once)
Package.init()

# Save a package to the index
Package(
    meta={
        'id': data['info']['name']
    },
    name=data['info']['name'],
    summary=data['info']['summary'],
    description=data['info']['description'],
    keywords=data['info']['keywords'],
    classifiers=data['info']['classifiers'],
).save()

Slide 26

Slide 26 text

Sense

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Kibana

Slide 29

Slide 29 text

http://0.0.0.0:5601/app/kibana

Slide 30

Slide 30 text

https://www.elastic.co/blog/kibana-4-literally

Slide 31

Slide 31 text

RDBMS sync strategies
• needs_indexing boolean flag
• last_touched timestamp
• Redis/Kafka queue
• Subscribe to database replication log

Slide 32

Slide 32 text

needs_indexing

from django.db import models


class Conference(models.Model):
    name = models.CharField(max_length=128)
    url = models.URLField(blank=True)
    # …
    needs_indexing = models.BooleanField(default=True, db_index=True)


# Reindex all conferences when associated guide is edited:
guide.conferences.all().update(needs_indexing=True)

Slide 33

Slide 33 text

last_touched

import datetime

from django.db import models


class Conference(models.Model):
    name = models.CharField(max_length=128)
    url = models.URLField(blank=True)
    # …
    last_touched = models.DateTimeField(
        db_index=True,
        default=datetime.datetime.utcnow,
    )


# Reindex all conferences when associated guide is edited:
guide.conferences.all().update(last_touched=datetime.datetime.utcnow())

Indexing code needs to track the most recently seen last_touched datetime
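
The tracking step can be sketched in plain Python 3, independent of Django. This is an illustration under stated assumptions: `changed_since` is a hypothetical helper, and rows are modeled as `(object ID, last_touched)` tuples rather than model instances.

```python
import datetime
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, datetime.datetime]  # (object ID, last_touched)


def changed_since(
    rows: Iterable[Row],
    last_seen: Optional[datetime.datetime],
) -> Tuple[List[int], Optional[datetime.datetime]]:
    """Return IDs touched after last_seen, plus the new high-water mark."""
    changed = sorted(
        (row for row in rows if last_seen is None or row[1] > last_seen),
        key=lambda row: row[1],
    )
    ids = [object_id for object_id, _ in changed]
    # If nothing changed, keep the previous high-water mark.
    new_last_seen = changed[-1][1] if changed else last_seen
    return ids, new_last_seen
```

In a Django project the filter would presumably be a queryset lookup such as `last_touched__gt=last_seen`. The strict `>` comparison can miss a row written with exactly the same timestamp as `last_seen`; comparing with `>=` instead re-indexes a few rows redundantly but does not skip any.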

Slide 34

Slide 34 text

Redis/Kafka queue
• Any time an object needs reindexing, add the type/ID to a queue
• Every few seconds, clear the queue, dedupe the item IDs, fetch from database and reindex them
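
The "clear the queue and dedupe" step can be sketched as follows. This is a minimal Python 3 illustration: a `collections.deque` stands in for Redis or Kafka, and `drain_and_dedupe` is a hypothetical helper name.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


def drain_and_dedupe(queue: Deque[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Empty the queue and return unique object IDs grouped by type."""
    pending = defaultdict(set)
    while queue:
        object_type, object_id = queue.popleft()
        # An object edited several times between runs is reindexed once.
        pending[object_type].add(object_id)
    return {object_type: sorted(ids) for object_type, ids in pending.items()}
```

Grouping by type lets the indexer fetch each type's rows from the database in a single query before reindexing them.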

Slide 35

Slide 35 text

Replication log
• We built a system at Eventbrite called Dilithium, which subscribes to the MySQL replication log and writes interesting moments (e.g. order.updated) to Kafka
• Re-indexing code subscribes to Kafka
• github.com/noplay/python-mysql-replication

Slide 36

Slide 36 text

More use-cases

Slide 37

Slide 37 text

Faceted search
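
In Elasticsearch, facet counts come from a terms aggregation over a `not_analyzed` field. A minimal Python 3 sketch that builds the request body and reads the buckets back; the `description` and `classifiers` field names follow the PyPI mapping shown earlier, while `facet_request`, `facet_counts`, and the aggregation name `facet` are hypothetical.

```python
from typing import List, Tuple


def facet_request(text: str, facet_field: str, size: int = 10) -> dict:
    """Search body that returns hits plus counts per facet value."""
    return {
        "query": {"match": {"description": text}},
        "aggs": {"facet": {"terms": {"field": facet_field, "size": size}}},
    }


def facet_counts(response: dict) -> List[Tuple[str, int]]:
    """Extract (value, document count) pairs from a search response."""
    buckets = response["aggregations"]["facet"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```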

Slide 38

Slide 38 text

Recommendations
• Search for events where saved-by matches my-friend-1 or my-friend-2 or my-friend-3 or …
• Find events similar to my-last-10-saved-events
• Search engines are great at scoring! Boost by in-same-city-as-me, boost more by saved-by-my-friends
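
One way to express the friend matching and boosting is a `bool` query. The sketch below only builds the request body; the `saved_by` and `city` field names, the boost values, and `recommendation_query` itself are assumptions made for illustration, not the talk's actual schema.

```python
from typing import List


def recommendation_query(friend_ids: List[str], my_city: str) -> dict:
    """Events saved by at least one friend, boosted if in the same city."""
    # One term clause per friend: scores of matching should-clauses add up,
    # so events saved by more friends rank higher.
    friend_clauses = [
        {"term": {"saved_by": {"value": friend_id, "boost": 2.0}}}
        for friend_id in friend_ids
    ]
    return {
        "query": {
            "bool": {
                # Required: saved by at least one friend.
                "must": [
                    {"bool": {"should": friend_clauses, "minimum_should_match": 1}}
                ],
                # Optional: a smaller score boost for events in the same city.
                "should": [
                    {"term": {"city": {"value": my_city, "boost": 1.0}}}
                ],
            }
        }
    }
```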

Slide 39

Slide 39 text

Analyzing user activity
• Elasticsearch + Kibana are popular tools for log analysis - can easily handle enormous amounts of traffic
• Feed user actions into a custom index - search.executed, user.followed etc
• Can then write application logic that varies depending on recent user activity

Slide 40

Slide 40 text

Analyzing patterns
• Faceted search makes it easy to analyze large datasets
• Create “characteristics” for your users - e.g. uses_linkedin, signed_up_in_2015, referred_by_a_friend
• Use Kibana to explore interesting relationships

Slide 41

Slide 41 text

Why does this even work?
• Search is all about set intersections: the set of documents containing “dogs” with the set of documents containing “skateboarding”
• This is very distributable: query a dozen shards, then merge and return the results
• Relevance is a first-class concept
• map/reduce in real-time (unlike Hadoop)
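
The set-intersection and shard-merge ideas above can be shown with a toy inverted index. This is a deliberately simplified Python 3 sketch, not how Lucene is implemented: it ignores relevance scoring entirely, and `build_index`, `search_shard`, and `search_all` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_shard(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """Documents on one shard that contain every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_all(shards: Iterable[Dict[str, Set[int]]], terms: List[str]) -> List[int]:
    """Query every shard independently, then merge the results."""
    matches = set()
    for index in shards:
        matches |= search_shard(index, terms)
    return sorted(matches)
```

Each shard answers without knowing about the others, which is why the work spreads across machines; a real engine would merge by relevance score rather than by document ID.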

Slide 42

Slide 42 text

in summary…
denormalize to a query engine!
Elasticsearch is pretty good