
Search-based business intelligence and reverse data engineering with Apache Solr

We are searching the unknown. How can you find hidden and unknown relationships in unrelated data silos? How can you find relevant information in a 10^56-dimensional space? Sounds impossible? This talk presents a case study and success story of how Apache Solr was used to build a search-based business intelligence and information research application that answers these questions. The talk was delivered at the Apache: Big Data 2015 conference in Budapest. #apachebigdataeu2015

M.-Leander Reimer

September 29, 2015

Transcript

  1. Apache: Big Data Europe 2015: Search-based business intelligence and reverse data engineering with Apache Solr

    Mario-Leander Reimer, Chief Technologist
  2. This talk will …

    o Give a brief overview of the AIR system's architecture
    o Show reverse data engineering using Solr and MIR
    o Talk about the fight for our right to Solr
    o Describe solutions for the problem of combinatorial explosion
    o Outline a flexible and lightweight ETL approach for Solr

    Mario-Leander Reimer, 28 September 2015
  3. AIR system architecture (diagram)

    o Application clusters: AIR Repository, AIR Loader
    o Systems: AIR Central, AIR Control
    o Clients: AIR Client, Browser
    o External systems: 3rd Party Application, 3rd Party iOS App, AIR Bus, Backend Systems
    o Infrastructure subsystems: Apache Solr, .NET WPF, Spring Framework, JEE 5, Jenkins
    o Application subsystems: Solr Extensions, Solr Access, JSF Web UI, REST API, AIR Fork DLL, AIR Call DLL, AIR iOS Lib, WS Clients, File Storage, Protocol, Watchlist, Masterdata, Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Vehicles, Measures, Parts, Retrofits, Repair Overview, ...
    o Users: Mechanic, Service Technician, Independent Workshop
    o Data stores: AIR DB, Document Storage, Backend Databases and Systems
    o Flows: Launch, Search and Display, Query, Execute, Load
    o Solr index: 20 languages, 800 GB
  4. AIR Loader, Solr replication and MIR (diagram)

    o Application clusters: AIR Repository, AIR Loader
    o Systems: AIR Control, MIR
    o Infrastructure subsystems: Apache Solr Master, Apache Solr Slave, Jenkins, Spring Framework
    o Application subsystems: Solr Extensions (on master and slave), Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Vehicles, Repair Overview, ...
    o Data sources: Backend Databases and Systems
    o Flows: Execute, Load, Replicate, Search
    o Solr index (master and slave): 20 languages, 800 GB
  5. Let's go back to when it all began …

    Source: http://www.october212015.com/images/timecircuits.jpg
  6. The project vision: find the right information in fewer than 3 clicks.

    The situation:
    o Users had to use up to 7 different applications for their daily work.
    o The systems were poorly integrated.
    o Finding the correct information was laborious and error-prone.

    The idea:
    o Combine the data into a consistent information network.
    o Make the information network and its data searchable and navigable.
    o Replace the existing applications with one easy-to-use application.
  7. But how do we find the originating system for the desired data?

    Where to find the vehicle data? 60 potential systems and 5,000 entities.

    [Diagram: vehicle data and other data spread across System A, System B, System C and System D]
  8. And how do we find the hidden relations between the systems and their data?

    How is the data linked to each other? 400,000 potential relations.

    [Diagram: Vehicle, Customer, Documents and other data spread across System A, System B, System C and System D]
  9. Meta Information Research (MIR)

    Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg
  10. MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.

    o MIR manages meta information about the source systems (the data models and record descriptions).
    o MIR lets you navigate and search the metadata and drill into it using facets.
    o MIR also manages the target data model and the Solr schema description.

    [Diagram: the Meta Information Research system consists of Apache Solr, the MIR User Interface, the MIR Loader and the MIR Generators. Labels: Backend Databases and Systems, Sources (Java, XML), Magic Draw, Read, Metadata Index, 25 MB]
  11. MIR user interface (annotated screenshot)

    o Wildcard queries
    o Faceted drill-down
    o Tree view of systems, tables and attributes
    o Search results
    o Found potential synonyms for the chassis number
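The combination of wildcard queries and faceted drill-down shown on this slide can be sketched as a plain Solr request. This is only an illustration: the parameter names (q, rows, facet, facet.field, facet.mincount) are standard Solr, while the core name `mir` and the field names `ATTRIBUTE_NAME`, `SYSTEM` and `TABLE` are assumptions that do not appear in the talk.

```python
from urllib.parse import urlencode


def build_metadata_query(term, facet_fields, rows=20):
    """Build the query string for a wildcard search with faceting.

    The field names are hypothetical; only the parameter names are
    standard Solr request parameters.
    """
    params = [
        ("q", f"ATTRIBUTE_NAME:*{term}*"),  # hypothetical metadata field
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    # One facet.field parameter per facet the user can drill into.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


query_string = build_metadata_query("VIN", ["SYSTEM", "TABLE"])
# "/solr/mir/select" is an assumed handler path, not taken from the slides.
print("/solr/mir/select?" + query_string)
```

Each facet value returned for `SYSTEM` or `TABLE` could then be added as a filter query to narrow the result set step by step.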
  12. EAT YOUR OWN DOG FOOD. The AIR Solr schema definition is modelled and defined within MIR.

    o Solr schema attributes
    o Solr entities for each release
  13. def sourceGenerator = MIR + Solr + Maven;
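The slides do not say what the MIR generators emit. As a rough sketch, assuming they turn attribute descriptions held in the metadata index into Solr schema field definitions, a generator step could look like this. The attribute list and the `multi_valued` key are hypothetical; the `<field>` attributes (name, type, indexed, stored, multiValued) are the ones Solr's schema uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute descriptions, as a metadata tool might export them.
ATTRIBUTES = [
    {"name": "INFO_TYPE", "type": "string", "multi_valued": False},
    {"name": "VIN", "type": "string", "multi_valued": False},
    {"name": "E_SERIES", "type": "string", "multi_valued": True},
]


def to_schema_fields(attributes):
    """Render one Solr <field> element per attribute description."""
    lines = []
    for attr in attributes:
        field = ET.Element("field", {
            "name": attr["name"],
            "type": attr["type"],
            "indexed": "true",
            "stored": "true",
            "multiValued": "true" if attr["multi_valued"] else "false",
        })
        lines.append(ET.tostring(field, encoding="unicode"))
    return lines


for line in to_schema_fields(ATTRIBUTES):
    print(line)
```

Running such a step from a build tool keeps the schema and the modelled metadata in sync for every release.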
  14. But Solr is a full-text search engine. You have to use an Oracle DB for your application data! NO!
  15. Some of the AIR requirements were ...

    o Focus is on search. Transactions are not required.
    o High demands on request volume and performance.
    o Free navigation on data model and content.
    o Support for full-text search and faceted search.
    o Offline capabilities.
    o Scalability from low-end device to server to cloud.
  16. Apache Solr outperformed Oracle significantly in query time as well as index size.

    Test data set: 150,000 records. Disk space: 132 MB Solr vs. 385 MB Oracle.

    SELECT * FROM VEHICLE WHERE VIN='V%'
    INFO_TYPE:VEHICLE AND VIN:V*

    SELECT * FROM MEASURE WHERE TEXT='engine'
    INFO_TYPE:MEASURE AND TEXT:engine

    SELECT * FROM VEHICLE WHERE VIN='%X%'
    INFO_TYPE:VEHICLE AND VIN:*X*

    | 038 ms | 000 ms | 000 ms | 383 ms | 384 ms | 383 ms |
    | 092 ms | 000 ms | 000 ms | 389 ms | 387 ms | 386 ms |
    | 039 ms | 000 ms | 000 ms | 859 ms | 379 ms | 383 ms |
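The pairing of SQL statements and Solr queries above follows a simple pattern, sketched here under the assumption that the SQL side uses LIKE semantics for the % wildcard. The helper names are made up for this illustration, and escaping of Solr special characters is left out.

```python
def like_to_solr(field, like_pattern):
    """Translate a simple SQL LIKE pattern into a Solr wildcard clause.

    Only % (any run of characters) and _ (exactly one character) are
    handled; escaping of Solr special characters is deliberately omitted.
    """
    solr_pattern = like_pattern.replace("%", "*").replace("_", "?")
    return f"{field}:{solr_pattern}"


def solr_query(info_type, field, like_pattern):
    # INFO_TYPE plays the role of the table name in the single Solr index.
    return f"INFO_TYPE:{info_type} AND {like_to_solr(field, like_pattern)}"


print(solr_query("VEHICLE", "VIN", "V%"))
print(solr_query("VEHICLE", "VIN", "%X%"))
```

The two calls reproduce the prefix and infix queries from the slide.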
  17. Dirt Race Use Case:

    o Low-end devices
    o No Internet

    Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg
  18. Running Solr and AIR-2-Go on Raspberry Pi Model B worked like a charm.

    o Running Debian Linux + JDK8
    o Jetty Container with the Solr and AIR WARs deployed
    o Reduced Solr data set with approx. 1.5 million documents

    Model B Hardware Specs:
    o ARMv6 CPU at 700 MHz
    o 512 MB RAM
    o 32 GB SD card
  19. No silver bullet. A careful schema design is crucial for your Solr performance.
  20. Naive data denormalization can quickly lead to combinatorial explosion.

    [Entity-relationship diagram: the entities below are linked by m:n relationships and support relationship navigation]

    o 33,071,137 Vehicles
    o 648,129 Technical Documents
    o 14,830,197 Flat Rate Units
    o 5,078,411 FRU Groups
    o 55,000 Parts
    o 648,129 Measures
    o 18,573 Repair Instructions
    o 6,180 Fault Indications
    o 1,678,667 Packages
    o 41,385 Types

    Num Docs: 55,777,706
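To make the explosion concrete, here is a small worked example. The counts are invented for illustration and are not taken from the AIR data: flattening every combination of related entities into one row multiplies the counts, while storing each entity once only adds them.

```python
from math import prod


def flattened_rows(related_counts):
    """Rows needed when every combination of related entities becomes one row."""
    return prod(related_counts)


def separate_documents(related_counts):
    """Documents needed when the parent and each related entity are stored once."""
    return 1 + sum(related_counts)


# Hypothetical example: one vehicle with 3 documents, 4 flat rate units, 2 measures.
counts = [3, 4, 2]
print(flattened_rows(counts), separate_documents(counts))
```

With these toy numbers the flattened form needs 24 rows against 10 separate documents, and the gap widens with every additional m:n relationship.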
  21. Multi-valued fields can efficiently store 1..n relations but may result in false positives.

    {
      "INFO_TYPE": "AWPOS_GROUP",
      "NUMMER":    [ "1134190", "1235590" ],
      "BAUSTAND":  [ "1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z" ],
      "E_SERIES":  [ "F10", "E30" ]
    }

    Each array holds one related record at index 0 and one at index 1.

    q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10 (true match: both values are at index 0)
    q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30 (false positive: the values are at index 0 and index 1)

    If false positives are acceptable at query time, post-filter the results in your application.
    Note: recent Solr versions support nested child documents; use those instead.
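The false positive and the post-filtering step can be sketched in a few lines. This assumes that the multi-valued fields are index-aligned and returned in insertion order, as the "Index 0 / Index 1" labels on the slide suggest; the function names are made up for this example.

```python
doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "E_SERIES": ["F10", "E30"],
}


def solr_like_match(doc, nummer, e_series):
    """Mimic multi-valued matching: each field is checked independently."""
    return nummer in doc["NUMMER"] and e_series in doc["E_SERIES"]


def post_filter(doc, nummer, e_series):
    """Keep the hit only if both values sit at the same array index."""
    return any(
        n == nummer and e == e_series
        for n, e in zip(doc["NUMMER"], doc["E_SERIES"])
    )


# The pair (1134190, E30) never existed, yet the independent check matches.
print(solr_like_match(doc, "1134190", "E30"))
# The index-aligned check in the application rejects it.
print(post_filter(doc, "1134190", "E30"))
```

The application-side check is cheap because it only runs on the documents Solr has already returned.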
  22. Technical documents and their validity were expressed in a binary representation.

    o Validity expressions may have up to 46 characteristics.
    o Validity expressions use 5 different boolean operators (AND, NOT, …).
    o Validity expressions can be nested and complex.
    o Some characteristics are dynamic and not even known at index time.

    Solution: transform the validity expressions into the equivalent JavaScript terms and evaluate these terms at query time using a custom function query filter.
  23. Binary validity expression example.

    Type(53078923) = 'Brand', Value(53086475) = 'BMW PKW'
    Type(53088651) = 'E-Series', Value(53161483) = 'F10'
    Type(64555275) = 'Transmission', Value(53161483) = 'MECH'
  24. Transformation of binary validity terms into their JavaScript equivalent at index time.

    Validity term: AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')
    JavaScript term: ((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))

    Indexed document:

    {
      "INFO_TYPE": "TECHNISCHES_DOKUMENT",
      "DOKUMENT_TITEL": "Getriebe aus- und einbauen",
      "DOKUMENT_ART": "reparaturanleitung",
      "VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
      "BRAND": ["BMW PKW"],
      ...
    }
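A minimal sketch of such a transformation, assuming the decoded validity expression is available as a nested tuple tree. The tree shape, the `to_js` helper and the OR operator are assumptions for illustration; the slides name only AND and NOT among the five operators.

```python
# OR is an assumed operator; the slides list only "AND, NOT, …".
OPERATORS = {"AND": "&&", "OR": "||"}


def to_js(node):
    """Render a nested validity expression as a JavaScript boolean term.

    A node is either a (name, value) leaf or a tuple
    (operator, child, child, ...). Values are assumed not to contain quotes.
    """
    head = node[0]
    if head == "NOT":
        return f"(!{to_js(node[1])})"
    if head in OPERATORS:
        joined = OPERATORS[head].join(to_js(child) for child in node[1:])
        return f"({joined})"
    name, value = node
    # Field names follow the upper-case, underscore style used on the slide.
    field = name.upper().replace("-", "_").replace(" ", "_")
    return f"({field}=='{value}')"


expr = ("AND", ("Brand", "BMW PKW"), ("E-Series", "F10"), ("Transmission", "MECH"))
print(to_js(expr))
```

For the example expression this yields the same JavaScript term as shown on the slide; nested expressions are handled by the recursion.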
  25. The JavaScript validity term is evaluated at query time using a custom function query.

    &fq=INFO_TYPE:TECHNISCHES_DOKUMENT
    &fq=DOKUMENT_ART:reparaturanleitung
    &fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
        jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU
        1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
        wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV
        kJBVVJFSUhFIjoiWCcifQ==)

    Base64 decode: { "BRAND":"BMW PKW", "E_SERIES":"F10", "TRANSMISSION":"MECH" }

    http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
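On the client side, building this filter amounts to serializing the vehicle context as JSON, Base64-encoding it and wrapping it in the frange local parameters. The sketch below assumes exactly that; `jsTerm` is the custom function named on the slide, while `encode_context` and `validity_filter` are hypothetical helper names.

```python
import base64
import json


def encode_context(context):
    """Serialize the vehicle context to JSON and Base64-encode it."""
    raw = json.dumps(context, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def validity_filter(context, field="VALIDITY"):
    """Build the frange filter around the custom jsTerm function."""
    local_params = "{!frange l=1 u=1 incl=true incu=true cache=false cost=500}"
    return f"{local_params}jsTerm({field},{encode_context(context)})"


context = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
fq = validity_filter(context)
print(fq)
```

The resulting value still has to be URL-encoded when sent as an `fq` parameter, because Base64 output may contain `+`, `/` and `=`. Setting cache=false together with a cost of 100 or more asks Solr to run a filter as a post filter where the query type supports it, which keeps the expensive evaluation to documents that already passed the cheaper filters.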
  26. How often do we load data? How do we ensure data consistency?
  27. A traditional approach using a DWH and ETL: too inflexible, heavyweight and expensive.

    [Diagram: ETL jobs connect System A, System B, System C, files and databases to a Data Warehouse and on to AIR Solr]

    o ETL jobs would usually be implemented with Informatica.
    o Significant business logic is required, depending on the source database.
  28. Flexible and lightweight ETL combined with Continuous Delivery and DevOps.

    [Diagram]
    o Roles: Developer, Operations
    o Systems: AIR Search, AIR Loader Slave, AIR Loader, Jenkins Master, Jenkins Slave, Apache Maven, Nexus Repository, Apache Solr (two instances, each with a Solr Index)
    o Data sources: Data Source A … Data Source n
    o Flows: Build, Build & Deploy, Start, Run, Extract, Load, Replicate
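The extract and load steps of such a lightweight loader can be sketched as three small functions. This is not the AIR Loader itself: the `vehicle` table, the sample rows and the document `id` scheme are invented, and an in-memory SQLite database stands in for a backend data source.

```python
import json
import sqlite3


def extract(connection):
    """Read rows from the (hypothetical) source table."""
    cursor = connection.execute("SELECT vin, e_series FROM vehicle ORDER BY vin")
    return cursor.fetchall()


def transform(rows):
    """Map source rows to flat Solr documents tagged with an INFO_TYPE."""
    return [
        {"id": f"VEHICLE-{vin}", "INFO_TYPE": "VEHICLE", "VIN": vin, "E_SERIES": e_series}
        for vin, e_series in rows
    ]


def load(documents):
    """Serialize the batch as a JSON array, a shape Solr's JSON update handler accepts."""
    return json.dumps(documents)


# In-memory stand-in for a backend database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE vehicle (vin TEXT, e_series TEXT)")
source.executemany("INSERT INTO vehicle VALUES (?, ?)", [("V123", "F10"), ("V456", "E30")])

payload = load(transform(extract(source)))
print(payload)
```

In a setup like the one on the slide, a build server job would run such a loader on a schedule and post the payload to the Solr master, from which the index is replicated.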
  30. Apache Solr has become a powerful tool for data analytics applications. Be creative.

    Our next big project using Apache Solr is already on its way. High-performance application to predict and calculate the bill of materials for all required parts and orders.

    o Apache Solr as a compressed, scalable and high-performance time series database. FOSDEM'15, Florian Lautenschlager, QAware GmbH
    o Leveraging the Power of SOLR and SPARK. Apache: Big Data 2015, Johannes Weigend, QAware GmbH
  31. And with Apache Solr you can search and find the answers you are looking for.