
Toulouse JUG and Data Science


Talk given for Toulouse JUG and TDS user group
https://speakerdeck.com/elastic/toulouse-jug-and-data-science

Elastic Co

May 21, 2015

Transcript

  1. #elasticsearch


  2. www.elastic.co Copyright Elastic 2015 Copying, publishing and/or
    distributing without written permission is strictly prohibited
    Who?
    $ curl http://localhost:9200/talk/speaker/dpilato
    {
      "nom" : "David Pilato",
      "jobs" : [
        { "boite" : "SRA Europe (IT services company)", "mission" : "general dogsbody", "date" : "1995" },
        { "boite" : "SFR", "mission" : "jack of all trades", "date" : "1997" },
        { "boite" : "e-Brands / Vivendi", "mission" : "project manager", "date" : "2000" },
        { "boite" : "DGDDI (French customs)", "mission" : "five-legged sheep (rare bird)", "date" : "2005" },
        { "boite" : "IDEO Technologies", "mission" : "technical director", "date" : "2012" },
        { "boite" : "elastic", "mission" : "developer", "date" : "2013" } ],
      "passions" : [ "family", "job", "deejay" ],
      "blog" : "http://dev.david.pilato.fr/",
      "twitter" : [ "@dadoonet", "@elasticfr", "@scrutmydocs" ],
      "email" : "[email protected]"
    }

  3. (no text on slide)

  4. elastic.co
    • Founded in 2012 by the authors of Elasticsearch
    • Training
    • Development support
    • Production support
    • Marvel
    • Shield
    • Watcher
    • Found by elastic

  5. Old school search
    Find me a document from December 2011 about France that contains "produit" and "david".
    In SQL:
    SELECT
      doc.*, pays.*
    FROM
      doc, pays
    WHERE
      doc.pays_code = pays.code AND
      doc.date_doc >= to_date('2011-12', 'yyyy-mm') AND
      doc.date_doc < to_date('2012-01', 'yyyy-mm') AND
      lower(pays.libelle) = 'france' AND
      lower(doc.commentaire) LIKE '%produit%' AND
      lower(doc.commentaire) LIKE '%david%';

  6. Graphical User Interface

  7. Search engine?
    • An engine that indexes documents
    • An engine that searches those indices
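The two roles named on this slide (indexing documents, then searching the index) can be illustrated with a toy inverted index. This is only a sketch in plain Python: the sample documents and the function names `build_inverted_index` and `search` are made up for illustration, and Lucene's real index structures are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of the documents that contain every one of the words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents, not taken from the talk.
DOCS = {
    1: "new product announced by david",
    2: "david on holiday",
    3: "product catalogue for december",
}

INDEX = build_inverted_index(DOCS)
print(sorted(search(INDEX, "product", "david")))
```

The lookup never scans the document texts: it only intersects the posting sets of the requested words, which is what makes this approach faster than a `LIKE '%...%'` scan.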

  8. elasticsearch
    • Document-oriented NoSQL
    • Apache Lucene
    • HTTP / REST / JSON
    • Distributed, scalable, cloud ready
    • Apache 2 License
    • Simple: start in 30 seconds (not 5 minutes)
    • Efficient: just start new nodes!
    • Powerful: answers in a few ms!
    • Complete: built-in + plugins

  9. Think document!
    • Document: an object representing the data (in the NoSQL sense).
    Thinking "search" means forgetting the RDBMS and thinking "documents".
    {
      "text": "Bienvenue à la conférence #elasticsearch pour #JUG",
      "created_at": "2012-04-06T20:45:36.000Z",
      "source": "Twitter for iPad",
      "truncated": false,
      "retweet_count": 0,
      "hashtag": [ { "text": "elasticsearch", "start": 27, "end": 40 },
                   { "text": "JUG", "start": 47, "end": 55 } ],
      "user": { "id": 51172224, "name": "David Pilato",
                "screen_name": "dadoonet", "location": "France",
                "description": "Soft Architect, Project Manager, Senior Developper.\r\nAt this time, enjoying NoSQL world : CouchDB, ElasticSearch.\r\nDeeJay 4 times a year, just for fun !" }
    }
    • Type: groups documents of the same kind
    • Index: logical storage space for documents whose types are functionally related

  10. Index
    {
    "_index":"twitter",
    "_type":"tweet",
    "_id":"1"
    }
    $ curl -XPUT localhost:9200/twitter/tweet/1 -d '
    {
    "text": "Bienvenue à la conférence #elasticsearch pour #JUG",
    "created_at": "2012-04-06T20:45:36.000Z",
    "source": "Twitter for iPad",
    "truncated": false,
    "retweet_count": 0,
    "hashtag": [ { "text": "elasticsearch", "start": 27, "end": 40 },
    { "text": "JUG", "start": 47, "end": 55 } ],
    "user": { "id": 51172224, "name": "David Pilato",
    "screen_name": "dadoonet", "location": "France",
    "description": "Soft Architect, Project Manager, Senior Developper.\r\nAt this time, enjoying NoSQL world : CouchDB, ElasticSearch.\r\nDeeJay 4 times a year, just for fun !" }
    }'


  11. search
    $ curl localhost:9200/twitter/tweet/_search?q=elasticsearch
    {
    "took" : 24,
    "timed_out" : false,
    "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
    "hits" : {
    "total" : 1,
    "max_score" : 0.227,
    "hits" : [ {
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1",
    "_score" : 0.227, "_source" : {
    "text": "Bienvenue à la conférence #elasticsearch pour #JUG",
    "created_at": "2012-04-06T20:45:36.000Z",
    "source": "Twitter for iPad",
    […]
    }
    } ]
    }
    }
    Callouts on the response:
    • hits.total: number of documents
    • _index, _type, _id: coordinates
    • _score: relevance
    • _source: source document

  12. advanced search (Query DSL)
    $ curl localhost:9200/twitter/tweet/_search -d '{
      "query" : {
        "bool" : {
          "must" : {
            "term" : { "user" : "kimchy" }
          },
          "must_not" : {
            "range" : {
              "age" : { "from" : 10, "to" : 20 }
            }
          },
          "should" : [
            { "term" : { "tag" : "wow" } },
            { "match" : { "tag" : "elasticsearch is cool" } }
          ]
        }
      }
    }'
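To make the roles of `must`, `must_not` and `should` concrete, here is a rough Python sketch that evaluates the same kind of bool query against in-memory documents. It is a simplification under stated assumptions: the helpers `term`, `in_range`, `match` and `matches_bool` are hypothetical, the range bounds are treated as inclusive, and the "score" is just the number of satisfied `should` clauses, whereas Elasticsearch computes a real relevance score.

```python
def term(field, value):
    """Exact match on a field (or on any element when the field holds a list)."""
    def clause(doc):
        actual = doc.get(field)
        return value in actual if isinstance(actual, list) else actual == value
    return clause


def in_range(field, low, high):
    """Assumption: both bounds are inclusive."""
    def clause(doc):
        actual = doc.get(field)
        return actual is not None and low <= actual <= high
    return clause


def match(field, text):
    """Very rough full-text match: true when any query word appears in the field."""
    wanted = set(text.lower().split())
    def clause(doc):
        actual = doc.get(field, [])
        values = actual if isinstance(actual, list) else [actual]
        words = {w for v in values for w in str(v).lower().split()}
        return bool(wanted & words)
    return clause


def matches_bool(doc, must=(), must_not=(), should=()):
    """Return (matched, score); score counts the satisfied `should` clauses."""
    if not all(clause(doc) for clause in must):
        return False, 0
    if any(clause(doc) for clause in must_not):
        return False, 0
    return True, sum(1 for clause in should if clause(doc))


QUERY = dict(
    must=[term("user", "kimchy")],
    must_not=[in_range("age", 10, 20)],
    should=[term("tag", "wow"), match("tag", "elasticsearch is cool")],
)

# Hypothetical document, not taken from the talk.
print(matches_bool({"user": "kimchy", "age": 30, "tag": ["wow"]}, **QUERY))
```

In this sketch `must` and `must_not` decide whether a document is returned at all, while `should` clauses only influence how well it ranks.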

  13. (no text on slide)

  14. (no text on slide)

  15. The power of aggregations (aka facets)
    Make sense of your (BIG) data!
    (And in near real time, please!)
    Compute

  16. Tweets
    ID Username Date Hashtag
    1 dadoonet 2012-04-18 1
    2 talk 2012-04-18 5
    3 elasticsearch 2012-04-18 2
    4 dadoonet 2012-04-18 2
    5 talk 2012-04-18 6
    6 elasticsearch 2012-04-19 3
    7 dadoonet 2012-04-19 3
    8 talk 2012-04-19 7
    9 elasticsearch 2012-04-20 4


  17. Terms
    ID Username Date Hashtag
    1 dadoonet 2012-04-18 1
    2 talk 2012-04-18 5
    3 elasticsearch 2012-04-18 2
    4 dadoonet 2012-04-18 2
    5 talk 2012-04-18 6
    6 elasticsearch 2012-04-19 3
    7 dadoonet 2012-04-19 3
    8 talk 2012-04-19 7
    9 elasticsearch 2012-04-20 4
    Username Count
    dadoonet 3
    talk 3
    elasticsearch 3


  18. Terms
    ID Username Date Hashtag
    1 dadoonet 2012-04-18 1
    2 talk 2012-04-18 5
    3 elasticsearch 2012-04-18 2
    4 dadoonet 2012-04-18 2
    5 talk 2012-04-18 6
    6 elasticsearch 2012-04-19 3
    7 dadoonet 2012-04-19 3
    8 talk 2012-04-19 7
    9 elasticsearch 2012-04-20 4
    Request:
    "aggregations" : {
      "users" : { "terms" : { "field" : "username" } }
    }
    Response:
    "aggregations" : {
      "users" : {
        "buckets" : [
          { "key" : "dadoonet", "doc_count" : 3 },
          { "key" : "talk", "doc_count" : 3 },
          { "key" : "elasticsearch", "doc_count" : 3 }
        ]
      }
    }
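What a terms aggregation computes can be reproduced on the nine sample tweets with a few lines of Python. This is a sketch of the counting logic only, not of how Elasticsearch executes it across shards; the function name `terms_aggregation` is hypothetical.

```python
from collections import Counter

# The sample table from the slides: (id, username, date, hashtag).
TWEETS = [
    (1, "dadoonet", "2012-04-18", 1),
    (2, "talk", "2012-04-18", 5),
    (3, "elasticsearch", "2012-04-18", 2),
    (4, "dadoonet", "2012-04-18", 2),
    (5, "talk", "2012-04-18", 6),
    (6, "elasticsearch", "2012-04-19", 3),
    (7, "dadoonet", "2012-04-19", 3),
    (8, "talk", "2012-04-19", 7),
    (9, "elasticsearch", "2012-04-20", 4),
]


def terms_aggregation(rows, column):
    """Count rows per distinct value of the given column, most frequent first."""
    counts = Counter(row[column] for row in rows)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


USERNAME = 1  # position of the username column in each row
for bucket in terms_aggregation(TWEETS, USERNAME):
    print(bucket)
```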

  19. Date Histogram
    Date Hashtag
    2012-04-18 1
    2012-04-18 5
    2012-04-18 2
    2012-04-18 2
    2012-04-18 6
    2012-04-19 3
    2012-04-19 3
    2012-04-19 7
    2012-04-20 4
    Per month
    Date Count
    2012-04 9
    Per day
    Date Count
    2012-04-18 5
    2012-04-19 3
    2012-04-20 1

  20. Date Histogram
    Date Hashtag
    2012-04-18 1
    2012-04-18 5
    2012-04-18 2
    2012-04-18 2
    2012-04-18 6
    2012-04-19 3
    2012-04-19 3
    2012-04-19 7
    2012-04-20 4
    Request:
    "aggregations" : {
      "perday" : {
        "date_histogram" : {
          "field" : "date",
          "interval" : "day",
          "format" : "yyyy-MM-dd"
        }
      }
    }
    Response:
    "aggregations" : {
      "perday" : {
        "buckets" : [
          { "key_as_string": "2012-04-18", "key": 1334700000000, "doc_count": 5 },
          { "key_as_string": "2012-04-19", "key": 1334786400000, "doc_count": 3 },
          { "key_as_string": "2012-04-20", "key": 1334872800000, "doc_count": 1 }
        ]
      }
    }
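The bucketing behind a date histogram can be sketched in Python on the same dates. The function `date_histogram` below is a hypothetical illustration that only supports the "day" and "month" intervals shown on the slides and ignores time zones.

```python
from collections import Counter
from datetime import date

# Dates of the nine sample tweets.
DATES = ["2012-04-18"] * 5 + ["2012-04-19"] * 3 + ["2012-04-20"]


def date_histogram(days, interval):
    """Group ISO dates into "day" or "month" buckets and count each bucket."""
    if interval not in ("day", "month"):
        raise ValueError("this sketch only supports 'day' and 'month'")

    def bucket_key(day):
        parsed = date.fromisoformat(day)
        return parsed.strftime("%Y-%m") if interval == "month" else parsed.isoformat()

    counts = Counter(bucket_key(day) for day in days)
    return sorted(counts.items())


print(date_histogram(DATES, "month"))
print(date_histogram(DATES, "day"))
```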

  21. Range + Stats
    Date Hashtag
    2012-04-18 1
    2012-04-18 5
    2012-04-18 2
    2012-04-18 2
    2012-04-18 6
    2012-04-19 3
    2012-04-19 3
    2012-04-19 7
    2012-04-20 4
    Hashtag range  Count  Min  Max  Avg   Sum
    x < 3          3      1    2    1.67  5
    3 <= x < 5     3      3    4    3.33  10
    x >= 5         3      5    7    6     18

  22. Range + Stats
    Date Hashtag
    2012-04-18 1
    2012-04-18 5
    2012-04-18 2
    2012-04-18 2
    2012-04-18 6
    2012-04-19 3
    2012-04-19 3
    2012-04-19 7
    2012-04-20 4
    Request:
    "aggregations" : {
      "hashtags" : {
        "range" : {
          "field" : "hashtag",
          "ranges" : [
            { "to" : 3 },
            { "from" : 3, "to" : 5 },
            { "from" : 5 }
          ]
        },
        "aggregations" : {
          "hashtag_stats" : { "stats" : { "field" : "hashtag" } }
        }
      }
    }
    Response:
    "aggregations" : {
      "hashtags" : {
        "buckets" : [
          { "to": 3, "doc_count": 3,
            "hashtag_stats" : { "min": 1, "max": 2, "sum": 5, "avg": 1.667 } },
          { "from": 3, "to": 5, "doc_count": 3,
            "hashtag_stats" : { "min": 3, "max": 4, "sum": 10, "avg": 3.333 } },
          { "from": 5, "doc_count": 3,
            "hashtag_stats" : { "min": 5, "max": 7, "sum": 18, "avg": 6 } }
        ]
      }
    }
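A range aggregation with a nested stats aggregation amounts to "split the values into ranges, then summarize each range". The sketch below recomputes the figures of the slide in plain Python; `range_stats` is a hypothetical helper, with `from` inclusive and `to` exclusive as in the "3 <= x < 5" bucket.

```python
# Hashtag counts of the nine sample tweets.
HASHTAGS = [1, 5, 2, 2, 6, 3, 3, 7, 4]


def range_stats(values, ranges):
    """For each (low, high) range, count the members and compute basic stats.

    `low` is inclusive, `high` is exclusive, and None means "unbounded".
    """
    buckets = []
    for low, high in ranges:
        members = [v for v in values
                   if (low is None or v >= low) and (high is None or v < high)]
        bucket = {"from": low, "to": high, "doc_count": len(members)}
        if members:
            bucket.update(min=min(members), max=max(members),
                          sum=sum(members), avg=sum(members) / len(members))
        buckets.append(bucket)
    return buckets


for bucket in range_stats(HASHTAGS, [(None, 3), (3, 5), (5, None)]):
    print(bucket)
```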

  23. E-commerce site
    Range
    Terms
    Terms
    Range

  24. Real-time analytics
    Terms
    Date histogram

  25. Map facets

  26. Back to our search form
    Full-text search

  27. Back to our search form

  28. Make sense of your (BIG) data
    Demo time!
    http://onemilliontweetmap.com/

  29. Percolation
    aka Inverted Search

  30. Index a search request
    $ curl -XPOST localhost:9200/twitter/.percolator/dadoonet -d '{
      "query" : { "term" : { "user.screen_name" : "dadoonet" } }
    }'
    $ curl -XPOST localhost:9200/twitter/.percolator/elasticsearch -d '{
      "query" : { "match" : { "hashtag.text" : "elasticsearch" } }
    }'
    $ curl -XPOST localhost:9200/twitter/.percolator/mycomplexquery -d '{
      "query" : {
        "bool" : {
          "must" : {
            "term" : { "user" : "kimchy" }
          },
          "must_not" : {
            "range" : {
              "age" : { "from" : 10, "to" : 20 }
            }
          }
        }
      }
    }'

  31. Execute a document
    $ curl localhost:9200/twitter/tweet/_percolate -d '{
    "doc": {
    "text": "Bienvenue à la conférence #elasticsearch pour #JUG",
    "created_at": "2012-04-06T20:45:36.000Z",
    "source": "Twitter for iPad",
    "truncated": false,
    "retweet_count": 0,
    "hashtag": [ { "text": "elasticsearch", "start": 27, "end": 40 },
    { "text": "JUG", "start": 47, "end": 55 } ],
    "user": { "id": 51172224, "name": "David Pilato",
    "screen_name": "dadoonet", "location": "France",
    "description": "Soft Architect, Project Manager, Senior Developper.\r\nAt this time, enjoying NoSQL world : CouchDB, ElasticSearch.\r\nDeeJay 4 times a year, just for fun !" }
    }
    }'
    {
    "took" : 19,
    "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
    "total" : 2,
    "matches" : [ {
    "_index" : "twitter",
    "_id" : "dadoonet"
    }, {
    "_index" : "twitter",
    "_id" : "elasticsearch"
    } ] }

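Percolation inverts the usual flow: queries are stored first, and each incoming document is then checked against them. A minimal in-memory sketch of that idea follows. The `Percolator` class is hypothetical, and the stored queries are plain Python predicates standing in for the three queries registered in the talk, not Query DSL.

```python
class Percolator:
    """Store named queries, then report which of them match a given document."""

    def __init__(self):
        self.queries = {}

    def register(self, query_id, predicate):
        self.queries[query_id] = predicate

    def percolate(self, doc):
        return [query_id for query_id, predicate in self.queries.items()
                if predicate(doc)]


percolator = Percolator()
percolator.register(
    "dadoonet",
    lambda d: isinstance(d.get("user"), dict)
    and d["user"].get("screen_name") == "dadoonet")
percolator.register(
    "elasticsearch",
    lambda d: any(h.get("text", "").lower() == "elasticsearch"
                  for h in d.get("hashtag", [])))
percolator.register(
    "mycomplexquery",
    lambda d: d.get("user") == "kimchy" and not 10 <= d.get("age", 0) <= 20)

# A trimmed-down version of the tweet used in the talk.
TWEET = {
    "text": "Bienvenue à la conférence #elasticsearch pour #JUG",
    "hashtag": [{"text": "elasticsearch"}, {"text": "JUG"}],
    "user": {"name": "David Pilato", "screen_name": "dadoonet"},
}

print(percolator.percolate(TWEET))
```

A typical use is alerting: each stored query represents a subscription, and every new document is percolated to find the subscribers to notify.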

  32. Analysis and Mapping
    Should we index everything?

  33. Standard analyzer
    $ curl -XPOST 'localhost:9200/test/_analyze?analyzer=standard&pretty=1' -d 'The quick brown fox jumps over the lazy Dog'
    {
      "tokens" : [ {
        "token" : "quick",
        "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 2
      }, {
        "token" : "brown",
        "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 3
      }, {
        "token" : "fox",
        "start_offset": 16, "end_offset": 19, "type": "<ALPHANUM>", "position": 4
      }, {
        "token" : "jumps",
        "start_offset": 20, "end_offset": 25, "type": "<ALPHANUM>", "position": 5
      }, {
        "token" : "over",
        "start_offset": 26, "end_offset": 30, "type": "<ALPHANUM>", "position": 6
      }, {
        "token" : "lazy",
        "start_offset": 35, "end_offset": 39, "type": "<ALPHANUM>", "position": 8
      }, {
        "token" : "dog",
        "start_offset": 40, "end_offset": 43, "type": "<ALPHANUM>", "position": 9
      } ]
    }

  34. Whitespace analyzer
    $ curl -XPOST 'localhost:9200/test/_analyze?analyzer=whitespace&pretty=1' -d 'The quick brown fox jumps over the lazy Dog'
    {
      "tokens" : [ {
        "token" : "The", ...
      }, {
        "token" : "quick", ...
      }, {
        "token" : "brown", ...
      }, {
        "token" : "fox", ...
      }, {
        "token" : "jumps", ...
      }, {
        "token" : "over", ...
      }, {
        "token" : "the", ...
      }, {
        "token" : "lazy", ...
      }, {
        "token" : "Dog", ...
      } ]
    }
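The difference between the two analyzers can be approximated in Python. This is a rough imitation, not the real implementation: `standard_like_analyzer` is a hypothetical helper that splits on non-word characters, lowercases, and drops a one-word stopword list chosen to mirror the output shown in the talk, whereas the actual standard analyzer follows Unicode text segmentation rules and its stopword handling depends on version and configuration.

```python
import re

TEXT = "The quick brown fox jumps over the lazy Dog"
STOPWORDS = {"the"}  # assumption: minimal list mirroring the slide output


def whitespace_analyzer(text):
    """Split on whitespace only: case and punctuation are kept."""
    return text.split()


def standard_like_analyzer(text, stopwords=STOPWORDS):
    """Split on non-word characters, lowercase, drop stopwords, keep positions."""
    tokens = []
    for position, found in enumerate(re.finditer(r"\w+", text), start=1):
        token = found.group().lower()
        if token not in stopwords:
            tokens.append({"token": token,
                           "start_offset": found.start(),
                           "end_offset": found.end(),
                           "position": position})
    return tokens


print(whitespace_analyzer(TEXT))
print([t["token"] for t in standard_like_analyzer(TEXT)])
```

Positions keep counting the dropped stopwords, which is why the remaining tokens do not have consecutive positions.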

  35. Analyzer?

  36. Tokenizer / Token filters
    • whitespace
    "the dog!" -> "the", "dog!"
    • standard
    "the dog!" -> "the", "dog"
    • asciifolding
    "éléphant" -> "elephant"
    • stemmer (french)
    "elephants" -> "eleph"
    "prenez" -> "prendre"
    • stopword (french): le, la, un, une, être, avoir, …
    • ngram or edge ngram
    "eleph" -> ["el", "ele", "elep", "eleph"]
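Two of the token filters listed above are easy to approximate in Python. The sketch below is illustrative only: `ascii_fold` relies on Unicode decomposition and therefore covers fewer characters than the real asciifolding filter, and the `min_gram`/`max_gram` defaults are assumptions picked to reproduce the example on the slide.

```python
import unicodedata


def ascii_fold(token):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=2, max_gram=5):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def ngrams(token, min_gram=2, max_gram=3):
    """All substrings of the token whose length is between min_gram and max_gram."""
    return [token[start:start + size]
            for size in range(min_gram, max_gram + 1)
            for start in range(len(token) - size + 1)]


print(ascii_fold("éléphant"))
print(edge_ngrams("eleph"))
print(ngrams("eleph"))
```

Edge n-grams are the usual choice for search-as-you-type, since only prefixes are needed; full n-grams also match in the middle of a word at the cost of a larger index.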

  37. Custom analyzer
    "analysis":{
    "analyzer":{
    "francais":{
    "type":"custom",
    "tokenizer":"standard",
    "filter":["lowercase", "stop_francais", "fr_stemmer", "asciifolding", "elision"]
    }
    },
    "filter":{
    "stop_francais":{
    "type":"stop",
    "stopwords":["_french_", "twitter"]
    },
    "fr_stemmer" : {
    "type" : "stemmer",
    "name" : "french"
    },
    "elision" : {
    "type" : "elision",
    "articles" : ["l", "m", "t", "qu", "n", "s", "j", "d"]
    }
    }
    }

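A custom analyzer is a tokenizer followed by an ordered chain of token filters. The Python sketch below mimics that chaining for the "francais" analyzer, with several loud assumptions: the stemmer step is left out, the stopword set is a tiny hand-picked subset rather than the real `_french_` list, and the tokenizer is a simple regular expression that keeps apostrophes inside words. All function names are hypothetical.

```python
import re
import unicodedata

ARTICLES = ("l", "m", "t", "qu", "n", "s", "j", "d")
STOPWORDS = {"le", "la", "un", "une", "twitter"}  # assumption: small subset


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def elision(tokens):
    """Drop a leading elided article such as l' or qu'."""
    result = []
    for token in tokens:
        head, apostrophe, tail = token.partition("'")
        result.append(tail if apostrophe and tail and head in ARTICLES else token)
    return result


def analyze(text, filters=(lowercase, stop, asciifolding, elision)):
    """Tokenize, then apply each filter in order (stemming is omitted here)."""
    tokens = re.findall(r"[\w']+", text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("L'éléphant suit le compte Twitter"))
```

The order of the filters matters: here stopwords are removed after lowercasing, so "Twitter" is dropped even though the stopword list only contains the lowercase form.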

  38. Define your mapping!
    "type1" : {
    "properties" : {
    "text1" : { "type" : "string", "analyzer" : "francais" },
    "text2" : { "type" : "string", "index_analyzer" : "ngram",
    "search_analyzer" : "simple"
    },
    "text3" : {
    "type" : "string",
    "analyzer" : "francais",
    "fields" : {
    "ngram" : {
    "type" : "string",
    "analyzer" : "ngram"
    },
    "facet" : {
    "type" : "string",
    "index" : "not_analyzed"
    }
    }
    }
    }
    }

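The `text2` field uses one analyzer at index time and another at search time. The Python sketch below shows why that pairing is useful for prefix search. It rests on assumptions: the "ngram" analyzer is not defined in the talk, so an edge n-gram analyzer with made-up `min_gram`/`max_gram` values stands in for it, and the `Field` class is a hypothetical toy index.

```python
import re
from collections import defaultdict


def simple_analyzer(text):
    """Lowercase and split on anything that is not a letter."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_analyzer(text, min_gram=2, max_gram=10):
    """Assumption: edge n-grams of each word, standing in for the "ngram" analyzer."""
    grams = []
    for token in simple_analyzer(text):
        longest = min(max_gram, len(token))
        grams.extend(token[:size] for size in range(min_gram, longest + 1))
    return grams


class Field:
    """A toy field with separate index-time and search-time analyzers."""

    def __init__(self, index_analyzer, search_analyzer):
        self.index_analyzer = index_analyzer
        self.search_analyzer = search_analyzer
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in self.index_analyzer(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        terms = self.search_analyzer(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


text2 = Field(index_analyzer=ngram_analyzer, search_analyzer=simple_analyzer)
text2.index(1, "Elasticsearch rocks")
text2.index(2, "Elephant")

print(sorted(text2.search("ela")))
print(sorted(text2.search("EL")))
```

The prefixes are generated once, when the document is indexed, so the query only needs light analysis and a plain term lookup.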

  39. Users and community

  40. Users

  41. FR users

  42. elasticfr
    @elasticfr
    discuss.elastic.co