Data Science Master. ElasticSearch and Kibana. Session 2: ElasticSearch

Introduction to indexes and how to work with them, mappings, and related concepts.

Daniel Izquierdo Cortazar

March 30, 2017

Transcript

  1. Outline

    Introduction
    ElasticSearch Terminology
    Mappings
    Uploading and Accessing Data
    REST API

    Daniel Izquierdo Cortázar. Máster en Data Science. ETSII.
  2. Introduction

    Search engine (based on Lucene)
    REST API interface
    Schema-free JSON documents
    Developed in Java
    Apache License
    Part of the ‘ELK stack’
  3. ElasticSearch Terminology

    Relational Database    ElasticSearch
    Database               Index
    Table                  Type
    Row                    Document
    Field                  Field

    Warning: joins are extremely expensive in ES
  4. Deal with Indexes

    Create Index
    $ curl -XPUT -k localhost:9200/test
    >{"acknowledged":true}
    If the index already exists:
    >{"error":{"root_cause":[{"type":"index_already_exists_exception","reason":"already exists","index":"test"}],"type":"index_already_exists_exception","reason":"already exists","index":"test"},"status":400}
  5. Deal with Indexes

    Delete Index
    $ curl -XDELETE -k localhost:9200/test
    >{"acknowledged":true}
    If the index does not exist:
    >{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"test","index":"test"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"test","index":"test"},"status":404}
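
The two response shapes shown for index creation and deletion (an acknowledgement, or an error object with a status code) can be told apart programmatically. The following is a minimal Python sketch, not part of the original deck; the helper name `summarize_response` is hypothetical, and the error body is a shortened copy of the one on the slide.

```python
import json


def summarize_response(body: str) -> str:
    """Return a one-line summary of an index create/delete response body."""
    data = json.loads(body)
    if data.get("acknowledged"):
        return "ok"
    error = data.get("error", {})
    return f'{data.get("status")} {error.get("type")}: {error.get("reason")}'


# Response bodies copied from the slides (the error one is shortened).
created = '{"acknowledged":true}'
missing = (
    '{"error":{"root_cause":[{"type":"index_not_found_exception",'
    '"reason":"no such index","index":"test"}],'
    '"type":"index_not_found_exception","reason":"no such index",'
    '"index":"test"},"status":404}'
)

print(summarize_response(created))
print(summarize_response(missing))
```

Checking the `error.type` field rather than the HTTP status alone makes it possible to distinguish "already exists" from other 400 errors.
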
  6. Deal with Indexes

    List Indexes
    $ curl -k localhost:9200/_cat/indices
    yellow open test_dump 5 1 4040 0 2.8mb 2.8mb
    yellow open .kibana 1 1 181 0 186.2kb 186.2kb
    ...
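
The `_cat/indices` output is plain whitespace-separated text, so it is easy to turn into records. The sketch below is an illustration only: the column names in `CAT_COLUMNS` are an assumption based on the nine columns visible on the slide (appending `?v` to the URL makes ElasticSearch print the real header row), and `parse_cat_indices` is a hypothetical helper.

```python
# Assumed column names for the nine fields shown on the slide.
CAT_COLUMNS = [
    "health", "status", "index", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices(text):
    """Parse `_cat/indices` text output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(CAT_COLUMNS):
            continue  # skip blank lines or lines with an unexpected layout
        rows.append(dict(zip(CAT_COLUMNS, parts)))
    return rows


sample = """yellow open test_dump 5 1 4040 0 2.8mb 2.8mb
yellow open .kibana 1 1 181 0 186.2kb 186.2kb"""

for row in parse_cat_indices(sample):
    print(row["index"], row["health"], row["docs.count"])
```
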
  7. Mappings

    Fields and mapping types do not need to be defined in advance
    There’s a dynamic mapping
    $ curl -k localhost:9200/test/_mapping/
    >{"test":{"mappings":{}}}
  8. Mappings

    Upload a mapping
    $ curl -XPUT -k localhost:9200/test/_mapping/item -d '{"properties":{"name":{"type":"string"}}}'
    >{"acknowledged":true}
  9. Mappings

    Delete a mapping
    It’s not possible to delete a mapping associated with a type
    We need to delete the index and create the types we need again
  10. Mappings

    It’s not possible to change the type of a field that is already mapped (delete + create)
    $ curl -XPUT -k localhost:9200/test/_mapping/item -d '{"properties":{"name":{"type":"integer"}}}'
    >{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [name] cannot be changed from type [string] to [int]"}],"type":"illegal_argument_exception","reason":"mapper [name] cannot be changed from type [string] to [int]"},"status":400}
  11. Mappings

    List the available mappings in an index
    $ curl -XGET localhost:9200/test/_mapping
    >{"test":{"mappings":{"item":{"proper…
    Hint: json_reformat prettifies the JSON
  12. Mappings

    Available types:

    Basic datatypes
    string
    long, integer, short, byte, double, float
    date
    boolean
    binary
    Ranges: [integer|float…]_range

    Specialized datatypes
    Geo-point datatype
    IP datatype
    Nested datatype
    others...
  13. Mappings

    Fields are unique across the mappings
    Fields in each mapping type are not independent of each other
    Fields with the same name in different mapping types of the same index map to the same field internally
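
Because same-named fields are shared across the mapping types of an index, a new mapping can be checked against the existing ones before it is uploaded. This is a minimal sketch of that check, not ElasticSearch's own logic; `find_conflicts` is a hypothetical helper, and the dictionaries mirror the shape of the `_mapping` responses shown in the slides.

```python
def find_conflicts(index_mappings, new_properties):
    """Return {field: (existing_type, requested_type)} for clashing fields.

    index_mappings: the "mappings" object of one index, keyed by mapping type.
    new_properties: the "properties" object of the mapping to be uploaded.
    """
    # Collect the type of every field already known in the index,
    # regardless of which mapping type declared it.
    existing = {}
    for type_mapping in index_mappings.values():
        for field, spec in type_mapping.get("properties", {}).items():
            existing.setdefault(field, spec.get("type"))

    conflicts = {}
    for field, spec in new_properties.items():
        old_type = existing.get(field)
        if old_type is not None and old_type != spec.get("type"):
            conflicts[field] = (old_type, spec.get("type"))
    return conflicts


mappings = {"item": {"properties": {"name": {"type": "string"}}}}

print(find_conflicts(mappings, {"name": {"type": "integer"}}))
print(find_conflicts(mappings, {"height": {"type": "integer"}}))
```

A non-empty result corresponds to the situation where ElasticSearch answers with an `illegal_argument_exception`; an empty result means the new fields do not clash with existing ones.
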
  14. Mappings

    Testing automated mapping
    $ curl -XPUT localhost:9200/test/item/1 -d '{"name":"Daniel"}'
    >{"_index":"test","_type":"item","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}
  15. Mappings

    Testing automated mapping
    $ curl -XPUT localhost:9200/test/item/2 -d '{"name":"Daniel", "age":"34"}'
    $ curl -XGET localhost:9200/test/_mapping/
    >{"test":{"mappings":{"item":{"properties":{"age":{"type":"string"},"name":{"type":"string"}}}}}}
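
The reason `age` ends up as a string is that dynamic mapping looks at the JSON type of the value, and `"34"` is a JSON string. The sketch below imitates that decision in Python. It is a simplification under stated assumptions: the type names follow the ElasticSearch 2.x-era defaults used in these slides, date detection and arrays are not modelled, and `guess_type` and `guess_mapping` are hypothetical helpers.

```python
import json


def guess_type(value):
    """Guess the field type dynamic mapping would pick for a JSON value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    return None  # nulls and arrays are not handled in this sketch


def guess_mapping(document):
    """Map each top-level field of a JSON document to a guessed type."""
    return {field: guess_type(v) for field, v in json.loads(document).items()}


print(guess_mapping('{"name":"Daniel", "age":"34"}'))
print(guess_mapping('{"name":"Daniel", "age":34}'))
```

Sending the age as a number instead of a quoted string is what would let dynamic mapping pick a numeric type.
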
  16. Mappings

    Remember the cross-shared fields!
    $ curl -XPUT -k localhost:9200/_mapping/item2 -d '{"properties":{"age":{"type":"integer"}}}'
    >{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [age] cannot be changed from type [string] to [int]"}],"type":"illegal_argument_exception","reason":"mapper [age] cannot be changed from type [string] to [int]"},"status":400}
  17. Mappings

    Let’s use a new field
    $ curl -XPUT -k localhost:9200/_mapping/item2 -d '{"properties":{"height":{"type":"integer"}}}'
    >{"acknowledged":true}
    $ curl -XPUT localhost:9200/test/item/1 -d '{"height":"185"}'
    >{"_index":"test","_type":"item","_id":"1","_version":2,"_shards":{"total":2,"successful":1,"failed":0},"created":false}
  18. Mappings

    Let’s use a new field
    $ curl -XGET localhost:9200/test/_mapping/item
    >{"test":{"mappings":{"item":{"properties":{"age":{"type":"string"},"height":{"type":"integer"},"name":{"type":"string"}}}}}}
  19. Mappings

    Analyzed and not analyzed fields
    By default, string fields are “analyzed”
    “Analyzed” means the text goes through a segmentation algorithm
    Hint: Standard Tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html
  20. Mappings

    Analyzed example (DIY):
    “The truth is out there” is segmented into [“The”, “truth”, “is”, “out”, “there”]
    The keyword “truth” matches that field
  21. Mappings

    Not analyzed example:
    “The truth is out there” is not segmented
    The keyword “truth” does not match that field
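
The difference between the two examples can be reproduced with a toy matcher. This is only a sketch of the idea: the regular expression stands in for the Standard Tokenizer, the real standard analyzer also lowercases tokens (not modelled here), and `tokenize` and `matches` are hypothetical helpers.

```python
import re


def tokenize(text):
    """Split text into word tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text)


def matches(term, field_value, analyzed=True):
    """Check whether a search term matches a stored field value."""
    if analyzed:
        # Analyzed: the term is compared against each token of the field.
        return term in tokenize(field_value)
    # Not analyzed: the field is one single term, so only an exact match counts.
    return term == field_value


text = "The truth is out there"

print(tokenize(text))
print(matches("truth", text, analyzed=True))
print(matches("truth", text, analyzed=False))
print(matches(text, text, analyzed=False))
```

Not analyzed fields are therefore a good fit for values that should be treated as a whole, such as identifiers or names used for exact filtering and aggregation.
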
  22. Mappings

    Not analyzed mapping
    "name": { "type": "string", "index": "not_analyzed" }
  23. Soft. Dev. Analytics Dataset

    Software development analytics aims at providing useful insights about the software development process
    Mostly focused on activity, community and process
  24. Soft. Dev. Analytics Dataset

    Some examples of data sources:
    Bugzilla, Jira, Redmine, etc.
    Git, Mercurial, SVN
    Mailing lists, Askbot, IRC, Slack
    Twitter, Stack Overflow, Telegram
    Jenkins, Travis
    Gerrit, Jira, Bugzilla
  25. Soft. Dev. Analytics Dataset

    This needs a data science approach:
    Mining + storing + cleaning + massaging + visualizing + storytelling
  26. Soft. Dev. Analytics Dataset

    Example with Git:
    commit e065f08d71609de44151ecdd9d9cb152dbf8713b 3feac22f25fc2228851547c53018742b10927f8a
    Author: Daniel Izquierdo <[email protected]>
    AuthorDate: Thu May 26 19:34:07 2016 +0200
    Commit: Daniel Izquierdo <[email protected]>
    CommitDate: Thu May 26 19:34:07 2016 +0200

    Add module to deal with format and unification of data
    Methods receive as entry a pandas.DataFrame. Those return a new dataframe with some basic formating activity such as adding, removing, filling and dates formating.

    :000000 100644 0000000... 18e533f... A format.py
    :000000 100644 0000000... e69de29... A tests/__init__.py
    :000000 100644 0000000... d8685c1... A tests/run_tests.py
    :000000 100644 0000000... ab74263... A tests/test_format.py

    113 0 format.py
    0 0 tests/__init__.py
    30 0 tests/run_tests.py
    55 0 tests/test_format.py
  27. Soft. Dev. Analytics Dataset

    For a commit we have:
    Author date and commit date
    Author and committer
    Files touched
    Added and removed lines
    Hash and message
  28. Soft. Dev. Analytics Dataset

    Let’s parse that info
    Store it in a database
    And start querying
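
Parsing a commit into the fields listed above (dates, author and committer, files, added and removed lines, hash and message) could look like the sketch below. It is an illustration, not the parser used for the course dataset. It assumes `git log` text in the layout of the Git example, with tab-separated line counts and the message indented by four spaces; the e-mail address and the helper `parse_commit` are placeholders.

```python
import json
import re

# Shortened commit in the layout of the Git example (placeholder e-mail).
SAMPLE = """commit e065f08d71609de44151ecdd9d9cb152dbf8713b
Author:     Daniel Izquierdo <author@example.com>
AuthorDate: Thu May 26 19:34:07 2016 +0200
Commit:     Daniel Izquierdo <author@example.com>
CommitDate: Thu May 26 19:34:07 2016 +0200

    Add module to deal with format and unification of data

113\t0\tformat.py
30\t0\ttests/run_tests.py
"""

HEADER = re.compile(r"^(Author|AuthorDate|Commit|CommitDate):\s+(.*)$")
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


def parse_commit(text):
    """Turn the text of one commit into a dict ready to be stored as JSON."""
    commit = {"files": []}
    message_lines = []
    for line in text.splitlines():
        header = HEADER.match(line)
        numstat = NUMSTAT.match(line)
        if line.startswith("commit "):
            commit["hash"] = line.split()[1]
        elif header:
            commit[header.group(1)] = header.group(2)
        elif line.startswith("    "):
            message_lines.append(line.strip())
        elif numstat:
            added, removed, path = numstat.groups()
            commit["files"].append({
                "path": path,
                # Binary files are reported as "-"; they count as 0 here.
                "added": 0 if added == "-" else int(added),
                "removed": 0 if removed == "-" else int(removed),
            })
    commit["message"] = "\n".join(message_lines)
    return commit


parsed = parse_commit(SAMPLE)
print(json.dumps(parsed, indent=2))
```

The resulting dictionary is a flat JSON document, which is the shape ElasticSearch expects when the commit is later indexed.
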
  29. Soft. Dev. Analytics Dataset

    Some examples:
    Eclipse dashboard: eclipse.biterg.io
    OPNFV dashboard: opnfv.biterg.io
    Diversity: https://speakerdeck.com/bitergia/diversity-in-open-source-panel
    Inner Source: as in Inner Source Commons
  30. ElasticDump

    Great for dumps (similar to mysqldump)
    Used to export and import data
    Works with both mappings and datasets
  31. ElasticDump

    Export mapping:
    $ elasticdump --input=http://localhost:9200 --input-index=test --output=test_mapping.json --type=mapping
    Mon, 20 Mar 2017 19:40:05 GMT | starting dump
    Mon, 20 Mar 2017 19:40:05 GMT | got 1 objects from source elasticsearch (offset: 0)
    Mon, 20 Mar 2017 19:40:05 GMT | sent 1 objects to destination file, wrote 1
    Mon, 20 Mar 2017 19:40:05 GMT | got 0 objects from source elasticsearch (offset: 1)
    Mon, 20 Mar 2017 19:40:05 GMT | Total Writes: 1
    Mon, 20 Mar 2017 19:40:05 GMT | dump complete
  32. ElasticDump

    Export data:
    $ elasticdump --input=http://localhost:9200 --input-index=test --type=data --output=test.json
    Mon, 20 Mar 2017 19:39:02 GMT | starting dump
    Mon, 20 Mar 2017 19:39:02 GMT | got 2 objects from source elasticsearch (offset: 0)
    Mon, 20 Mar 2017 19:39:02 GMT | sent 2 objects to destination file, wrote 2
    Mon, 20 Mar 2017 19:39:02 GMT | got 0 objects from source elasticsearch (offset: 2)
    Mon, 20 Mar 2017 19:39:02 GMT | Total Writes: 2
    Mon, 20 Mar 2017 19:39:02 GMT | dump complete
  33. ElasticDump

    Import mapping:
    $ elasticdump --input=test_mapping.json --type=mapping --output=http://localhost:9200 --output-index=testing
    Import data:
    $ elasticdump --input=test.json --output=http://localhost:9200 --type=data --output-index=testing
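
Before importing a data dump it can be useful to inspect it. The sketch below assumes that the exported data file holds one JSON object per line with `_index`, `_type`, `_id` and `_source` keys; that layout is an assumption about the dump format, the two sample lines are made up from the documents indexed earlier in the deck, and `read_dump` is a hypothetical helper.

```python
import json


def read_dump(lines):
    """Yield (index, type, id, source) tuples from line-delimited JSON hits."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        hit = json.loads(line)
        yield (
            hit.get("_index"),
            hit.get("_type"),
            hit.get("_id"),
            hit.get("_source", {}),
        )


# Assumed shape of a data dump, one document per line.
sample_lines = [
    '{"_index":"test","_type":"item","_id":"1","_source":{"height":"185"}}',
    '{"_index":"test","_type":"item","_id":"2","_source":{"name":"Daniel","age":"34"}}',
]

docs = list(read_dump(sample_lines))
print(len(docs), "documents")
for index, doc_type, doc_id, source in docs:
    print(index, doc_type, doc_id, sorted(source))
```

With a real file, the same function can be fed an open file handle, for example `read_dump(open("test.json"))`, since it only iterates over lines.
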
  34. Importing Large Datasets

    Let’s import the example mapping and dataset
  35. REST API

    In brief:
    You can Create, Read, Update and Delete data (CRUD)
    And the REST API offers a unique URL for each resource
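
The CRUD idea maps naturally onto HTTP verbs applied to one URL per document. The sketch below builds such requests with Python's standard library without sending them. It is a simplification: the `index/type/id` URL layout follows the curl examples in this deck, "update" is modelled as replacing the whole document with PUT, and `document_url` and `crud_request` are hypothetical helpers.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples

# CRUD action -> HTTP verb (update here means replacing the whole document).
METHODS = {"create": "PUT", "read": "GET", "update": "PUT", "delete": "DELETE"}


def document_url(index, doc_type, doc_id):
    """Build the unique URL of one document."""
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"


def crud_request(action, index, doc_type, doc_id, body=None):
    """Build (but do not send) the HTTP request for a CRUD action."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        document_url(index, doc_type, doc_id),
        data=data,
        method=METHODS[action],
        headers={"Content-Type": "application/json"},
    )


create = crud_request("create", "test", "item", 1, {"name": "Daniel"})
delete = crud_request("delete", "test", "item", 1)

print(create.get_method(), create.full_url)
print(delete.get_method(), delete.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would send it to a running ElasticSearch instance.
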