Software stack for the visualization and analysis around Apache Solr with Parallel SQL

Minoru Osuka

June 11, 2016

Transcript

  1. Software stack for the visualization and analysis around Apache Solr with Parallel SQL

    [email protected]
    Some rights reserved by Sebastian Sikora
  2. Self-introduction

    Minoru OSUKA (大須賀 稔)
    [email protected]
    Committer and PMC Member on the ManifoldCF Project at the Apache Software Foundation
    http://manifoldcf.apache.org
    Contributor on the Solr Project at the Apache Software Foundation
    http://lucene.apache.org/solr/
    Author of "Introduction to Apache Solr, Revised New Edition" ([改訂新版] Apache Solr 入門)
    http://gihyo.jp/book/2014/978-4-7741-6163-1
    Some rights reserved by QuinnDombrowski
  3. Agenda

    Parallel SQL
    SQL Request Handler
    JDBC
    SQL Syntax
    ETL and Search
    The Elastic Stack
    Apache Family
    Visualizing Data
    Demonstration
    Some rights reserved by scui3asteveo
  4. Parallel SQL

    Aggregation uses MapReduce-like shuffling or the JSON Facet API.
    SQL is translated into Solr queries by the Presto SQL Parser, and data is retrieved through the Streaming API.
    The feature is exposed through the SQL Request Handler and the JDBC Driver.
    Some rights reserved by cogdogblog
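  5. Parallel SQL

    SQL statements are compiled into Streaming Expressions that run in parallel across multiple worker nodes in SolrCloud.
    A SolrCloud Collection is abstracted as a relational table.
    The WHERE clause supports Lucene / Solr query syntax.
    Many operations, such as grouping and aggregation, are parallelized automatically using the real-time MapReduce capability of Streaming Expressions.
    Grouping and aggregation can be slow; the JSON Facet API can improve their performance.
    Only SolrCloud mode is currently supported; standalone mode is not.
    The SQL feature is currently experimental, and not all SQL syntax is implemented.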
  6. SQL Request Handler

    By default, Solr provides a request handler named /sql that accepts SQL.
    Issuing an SQL statement to this request handler extracts the documents (data) that match the conditions from the Solr index.
    Some rights reserved by christiaan_008
  7. SQL Request Handler

    Request sample:

      $ curl --data-urlencode \
          'stmt=SELECT to, count(*) FROM collection4 GROUP BY to ORDER BY count(*) desc LIMIT 10' \
          'http://localhost:8983/solr/collection4/sql?aggregationMode=facet'

    Response sample:

      {"result-set":{"docs":[
      {"count(*)":9158,"to":"[email protected]"},
      {"count(*)":6244,"to":"[email protected]"},
      {"count(*)":5874,"to":"[email protected]"},
      {"count(*)":5867,"to":"[email protected]"},
      {"count(*)":5595,"to":"[email protected]"},
      {"count(*)":4904,"to":"[email protected]"},
      {"count(*)":4622,"to":"[email protected]"},
      {"count(*)":3819,"to":"[email protected]"},
      {"count(*)":3678,"to":"[email protected]"},
      {"count(*)":3653,"to":"[email protected]"},
      {"EOF":"true","RESPONSE_TIME":10}]}
      }
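The request and response on slide 7 could be handled from a script roughly as follows. This is a minimal sketch rather than anything shown in the deck: the function names `build_sql_request` and `read_result_set` are hypothetical, the sample addresses are placeholders, and the only response structure assumed is what the slide shows (data tuples followed by a final tuple carrying `EOF` and `RESPONSE_TIME`).

```python
import json
from urllib.parse import urlencode


def build_sql_request(base_url, collection, stmt, aggregation_mode="facet"):
    """Return (url, form_body) for a POST to the collection's /sql handler."""
    query = urlencode({"aggregationMode": aggregation_mode})
    url = f"{base_url}/{collection}/sql?{query}"
    body = urlencode({"stmt": stmt})  # URL-encode the statement for the form body
    return url, body


def read_result_set(response_text):
    """Split a /sql response into data rows and the RESPONSE_TIME of the EOF tuple."""
    docs = json.loads(response_text)["result-set"]["docs"]
    rows, response_time = [], None
    for doc in docs:
        if "EOF" in doc:  # the last tuple marks the end of the stream
            response_time = doc.get("RESPONSE_TIME")
            break
        rows.append(doc)
    return rows, response_time


if __name__ == "__main__":
    # Placeholder data in the same shape as the response sample on the slide.
    sample = (
        '{"result-set":{"docs":['
        '{"count(*)":9158,"to":"user-a@example.com"},'
        '{"count(*)":6244,"to":"user-b@example.com"},'
        '{"EOF":"true","RESPONSE_TIME":10}]}}'
    )
    rows, elapsed = read_result_set(sample)
    for row in rows:
        print(row["to"], row["count(*)"])
    print("response time:", elapsed)
```

Separating the EOF tuple from the data rows keeps the bookkeeping entry out of any table or chart built from the result.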
  8. JDBC Driver

    Along with SQL support, a JDBC driver has been built into SolrJ, the long-standing Java client library.
    Through the JDBC driver, the Solr index can be accessed via the SQL Request Handler.
    Some rights reserved by WashuOtaku
  9. JDBC Driver

    To use the JDBC driver, place the JAR files listed below in a directory on the Java classpath.

    JDBC driver class name:
      org.apache.solr.client.solrj.io.sql.DriverImpl
    Connection string:
      jdbc:solr://<zk_connection_string>/?collection=<collection>

      $ find solr/solr-6.0.1/dist -name '*.jar' | grep -E "(/solr-solrj-.*\.jar)|(/solrj-lib/.*\.jar)"
      solr/solr-6.0.1/dist/solr-solrj-6.0.1.jar
      solr/solr-6.0.1/dist/solrj-lib/commons-io-2.4.jar
      solr/solr-6.0.1/dist/solrj-lib/httpclient-4.4.1.jar
      solr/solr-6.0.1/dist/solrj-lib/httpcore-4.4.1.jar
      solr/solr-6.0.1/dist/solrj-lib/httpmime-4.4.1.jar
      solr/solr-6.0.1/dist/solrj-lib/jcl-over-slf4j-1.7.7.jar
      solr/solr-6.0.1/dist/solrj-lib/noggit-0.6.jar
      solr/solr-6.0.1/dist/solrj-lib/slf4j-api-1.7.7.jar
      solr/solr-6.0.1/dist/solrj-lib/stax2-api-3.1.4.jar
      solr/solr-6.0.1/dist/solrj-lib/woodstox-core-asl-4.4.1.jar
      solr/solr-6.0.1/dist/solrj-lib/zookeeper-3.4.6.jar
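As an illustration of the two pieces of configuration on slide 9, the sketch below assembles a connection string in the `jdbc:solr://<zk_connection_string>/?collection=<collection>` form and collects the same JAR files that the `find` command lists. The helper names `solr_jdbc_url` and `solrj_classpath` are hypothetical, and the ZooKeeper address in the usage lines is only an example value.

```python
import os
from pathlib import Path


def solr_jdbc_url(zk_hosts, collection):
    """Build a connection string of the form shown on the slide."""
    zk_connection_string = ",".join(zk_hosts)
    return f"jdbc:solr://{zk_connection_string}/?collection={collection}"


def solrj_classpath(dist_dir):
    """Join the SolrJ JAR and its dependencies into one classpath string."""
    dist = Path(dist_dir)
    jars = sorted(dist.glob("solr-solrj-*.jar"))          # the SolrJ client itself
    jars += sorted((dist / "solrj-lib").glob("*.jar"))    # its dependency JARs
    return os.pathsep.join(str(jar) for jar in jars)


if __name__ == "__main__":
    print(solr_jdbc_url(["localhost:9983"], "collection4"))
    print(solrj_classpath("solr/solr-6.0.1/dist"))
```

The resulting classpath string can be passed to any JVM-based tool that needs to load `org.apache.solr.client.solrj.io.sql.DriverImpl`.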
  10. SQL Syntax

    SELECT statement
    • WHERE clause
    • LIMIT clause
    • ORDER BY clause
    • DISTINCT clause
    • GROUP BY clause

    Aggregate functions: count, min, max, sum, avg
  11. SQL Syntax

    Search fieldC for the phrase 'term1 term2':
      WHERE fieldC = 'term1 term2'
    Search fieldC for 'term1' OR 'term2':
      WHERE fieldC = '(term1 term2)'
    Search for fieldC from 0 to 100 inclusive (Solr range query):
      WHERE fieldC = '[0 TO 100]'
    Combining multiple conditions:
      WHERE ((fieldC = 'term1' AND fieldA = 'term2') OR (fieldB = 'term3'))
    NOT search:
      WHERE (fieldA = 'term1') AND NOT (fieldB = 'term2')
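The first three predicate forms on slide 11 differ only in how the right-hand string is written, which a few small helpers can make explicit. This is a sketch with hypothetical function names (`phrase`, `any_of`, `in_range`); it does no quoting or escaping, so it assumes the terms contain no single quotes.

```python
def phrase(field, text):
    """Phrase search: the whole quoted string is matched as one phrase."""
    return f"{field} = '{text}'"


def any_of(field, *terms):
    """OR search: terms wrapped in parentheses are matched individually."""
    joined = " ".join(terms)
    return f"{field} = '({joined})'"


def in_range(field, low, high):
    """Inclusive range search using Solr's [low TO high] syntax."""
    return f"{field} = '[{low} TO {high}]'"


if __name__ == "__main__":
    for predicate in (
        phrase("fieldC", "term1 term2"),
        any_of("fieldC", "term1", "term2"),
        in_range("fieldC", 0, 100),
    ):
        print("WHERE " + predicate)
```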
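  12. Visualization and Analysis

    • Data Source
      The external origin of the data, such as a database.
    • Data Lake
      Storage for handling large volumes of data, such as a file store.
    • Data Mart
      Data extracted and aggregated for a particular use or purpose, keeping only what is needed, and stored in an easy-to-use form.
    • User Interface
      Searches and extracts data with ad hoc queries and renders tables, graphs, and similar views.
    Some rights reserved by MMU Engage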
  13. Visualization and Analysis

    [Diagram. Labels: Data Source, Data Stream, Data Lake, Data Mart, User Interface; Data, Service, Log, Storage, Search Engine, KVS, RDBMS, Search, Graph, Chart.]
  14. The Elastic Stack

    Logstash
    https://www.elastic.co/products/logstash
    A tool for managing events and logs. It can collect logs and time-based event data from arbitrary systems.

    Elasticsearch
    https://www.elastic.co/products/elasticsearch
    A flexible, powerful, open-source, distributed, real-time search engine based on Lucene.

    Kibana
    https://www.elastic.co/products/kibana
    Data visualization software with dashboard features that uses Elasticsearch as its backend.

    Some rights reserved by paisleyorguk
  15. Apache Family

    Apache Flume
    http://flume.apache.org
    A distributed service for efficiently collecting, aggregating, and moving large amounts of data.

    Apache Solr
    http://lucene.apache.org/solr/
    A fast, open-source search platform that grew out of the Apache Lucene project.

    Apache Zeppelin
    https://zeppelin.incubator.apache.org
    A web-based interactive UI. It can format the results of SQL and Streaming commands as tables and plot them as graphs.

    Some rights reserved by QuinnDombrowski
  16. In The Case of Solr

    [Diagram. Labels: Data Source, Data Stream, Data Lake / Data Mart, User Interface, Banana / Silk, SQuirreL SQL, github.com/mosuka/logstash-output-solr, github.com/mosuka/fluent-plugin-output-solr.]
  17. Apache Solr

    Some rights reserved by NASA Goddard Photo and Video
    [Diagram repeated from slide 13.]
  18. Apache Solr

    Solr is used both as the Data Lake, to persist data, and as the Data Mart, to extract the data that is needed.
    Apply the data-driven schemaless mode and enable autoCommit and autoSoftCommit.
    If some fields would be misinterpreted by the add-unknown-fields-to-the-schema Update Request Processor Chain, define them in advance with the Schema API.
    Solr 6.0.1 has a bug: in a single-shard SolrCloud environment, setting the SQL Request Handler aggregation mode to facet (aggregationMode=facet) raises an exception.
      ClassCastException occurs in /sql handler with GROUP BY aggregationMode=facet and single shard
    Use the latest source from https://github.com/apache/lucene-solr.
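Defining a field ahead of time through the Schema API, as slide 18 recommends, amounts to posting an `add-field` command to the collection's `/schema` endpoint. The sketch below only builds the URL and JSON body and does not send anything. The helper names are hypothetical, and the field name `to`, the type `string`, and the collection name in the usage lines are assumptions chosen for illustration; the deck does not say which fields were predefined.

```python
import json


def schema_url(base_url, collection):
    """URL of the Schema API endpoint for one collection."""
    return f"{base_url}/{collection}/schema"


def add_field_command(name, field_type, stored=True, indexed=True):
    """JSON body for a Schema API add-field command."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,  # must name a field type that exists in the schema
            "stored": stored,
            "indexed": indexed,
        }
    }
    return json.dumps(command)


if __name__ == "__main__":
    # Assumed example: keep a hypothetical "to" field as a plain string
    # instead of letting schemaless mode guess its type.
    print(schema_url("http://localhost:8983/solr", "collection4"))
    print(add_field_command("to", "string"))
```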
  19. Apache Flume

    Some rights reserved by post406
    [Diagram repeated from slide 13.]
  20. Apache Flume

    Flume is used to extract data from the Data Source and transfer it to the Data Lake. Because it processes data as a stream, near-real-time processing is possible.
    The current release, Flume 1.6.0, supports only Solr 4.3, which is old.
    It can send data to Solr in standalone mode, but because of API and specification changes since Solr 5.x, it cannot send data to Solr in SolrCloud mode.
    https://issues.apache.org/jira/browse/FLUME-2919
    Use https://github.com/mosuka/flume, which adds Solr 6.x support.
  21. Apache Zeppelin

    Some rights reserved by probabilistic
    [Diagram repeated from slide 13.]
  22. Apache Zeppelin

    Zeppelin is used as the User Interface, enabling interactive querying and analysis of Data Mart data with ad hoc queries as needed.
    The current release, Zeppelin 0.5.6-incubating, does not support JDBC drivers. JDBC support is planned for 0.6.0.
    Use https://github.com/apache/incubator-zeppelin, which has JDBC driver support.
    How to configure the connection to Solr is also described at the URL below.
    Solr JDBC - Apache Zeppelin (incubating)
    https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=63406991
  23. Lucidworks Banana

    Some rights reserved by danbri
    [Diagram repeated from slide 13.]
  24. Lucidworks Banana

    https://github.com/lucidworks/banana
    Banana is used as the User Interface, making it possible to visualize Data Mart data in real time.
    Banana is a port of Kibana 3.x to Solr.
    The current release, Banana 1.6.0, has a bug: in a multi-node configuration, a remote Solr cannot be referenced when saving dashboard settings to Solr. In addition, because of a change in the rules for Solr collection names, the name of the collection that stores dashboard settings must be changed.
    Pull Request: https://github.com/lucidworks/banana/pull/270
    Use https://github.com/mosuka/banana, which fixes the bug above.
  25. Lucidworks Silk

    Some rights reserved by karen2754
    [Diagram repeated from slide 13.]
  26. Lucidworks Silk

    https://github.com/lucidworks/silk
    Silk is used as the User Interface, making it possible to visualize Data Mart data in real time.
    Silk is a port of Kibana 4.x to Solr.
    The current dev branch contains field type definitions that the schema.xml of Solr 6.x and later does not support.
    Pull Request: https://github.com/lucidworks/silk/pull/11
    Use https://github.com/mosuka/silk, which addresses this.
  27. SQuirreL SQL

    Some rights reserved by likeaduck
    [Diagram repeated from slide 13.]
  28. SQuirreL SQL

    http://squirrel-sql.sourceforge.net
    SQuirreL SQL is a graphical desktop application.
    It is written in Java and runs on any OS, as long as a JVM is installed on the machine.
    It can connect to any relational database that provides a JDBC driver.
    How to configure the connection to Solr is also described at the URL below.
    Solr JDBC - SQuirreL SQL
    https://cwiki.apache.org/confluence/display/solr/Solr+JDBC+-+SQuirreL+SQL
  29. Demonstration

    Some rights reserved by Sebastian Sikora
    [Diagrams repeated from slides 13 and 16.]
  30. Documents and Software

    Apache Solr Reference Guide
    https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
    Flume 1.6.0 User Guide
    https://flume.apache.org/FlumeUserGuide.html
    Zeppelin 0.6.0-SNAPSHOT document
    https://zeppelin.incubator.apache.org/docs/0.6.0-SNAPSHOT/
    lucidworks/banana: Banana for Solr - A Port of Kibana
    https://github.com/lucidworks/banana
    lucidworks/silk: Silk is a port of Kibana 4 project.
    https://github.com/lucidworks/silk
    Cloudera Morphlines Reference Guide
    http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl
    Building a data visualization platform by combining the OSS tools "Solr", "Flume", and "Banana" (OSSのツール「Solr」「Flume」「Banana」の組み合わせによるデータ可視化プラットフォーム構築)
    https://codezine.jp/article/detail/8707
    Introduction to Apache Solr, Revised New Edition: Open-Source Full-Text Search Engine ([改訂新版] Apache Solr入門)
    http://gihyo.jp/book/2014/978-4-7741-6163-1
    Server/Infrastructure Engineer Training Reader: Log Collection to Visualization (サーバ/インフラエンジニア養成読本 ログ収集〜可視化編)
    http://gihyo.jp/book/2014/978-4-7741-6983-5
    Some rights reserved by Miss Dilettante
  31. Documents and Software

    Apache Solr
    https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
    Apache Flume
    https://flume.apache.org/FlumeUserGuide.html
    Logstash
    https://www.elastic.co/products/logstash
    logstash-output-solr
    https://github.com/mosuka/logstash-output-solr
    Fluentd
    https://www.fluentd.org
    fluent-plugin-output-solr
    https://github.com/mosuka/fluent-plugin-output-solr
    Apache Zeppelin
    https://zeppelin.incubator.apache.org/docs/0.6.0-SNAPSHOT/
    Lucidworks Banana
    https://github.com/lucidworks/banana
    Lucidworks Silk
    https://github.com/lucidworks/silk
    SQuirreL SQL
    http://squirrel-sql.sourceforge.net