Slide 1

Slide 1 text

Apache: Big Data Europe 2015
Search-based business intelligence and reverse data engineering with Apache Solr
Mario-Leander Reimer, Chief Technologist

Slide 2

Slide 2 text

This talk will …
o Give a brief overview of the AIR system's architecture
o Show reverse data engineering using Solr and MIR
o Talk about the fight for our right to Solr
o Describe solutions for the problem of combinatorial explosion
o Outline a flexible and lightweight ETL approach for Solr
Mario-Leander Reimer, 28 September 2015

Slide 3

Slide 3 text

[Architecture diagram: AIR system overview]
o Users: Mechanic, Service Technician, Independent Workshop
o Components: AIR Repository, AIR Loader, AIR Central, AIR Client, AIR Control, AIR Bus, AIR Fork DLL, AIR Call DLL, AIR iOS Lib, Solr Extensions, Solr Access, JSF Web UI, REST API, WS Clients, File Storage, Browser, 3rd Party Application, 3rd Party iOS App
o Infrastructure: Apache Solr, .NET WPF, Spring Framework, JEE 5, Jenkins, Backend Systems
o Data: Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Vehicles, Measures, Repair Overview, Parts, Protocol, Watchlist, Masterdata, Retrofits, ...
o Stores: AIR DB, Document Storage, Backend Databases and Systems
o Flows: Launch, Query, Execute, Load, Search and Display
o Key figures: 20 languages, 800 GB Solr index

Slide 4

Slide 4 text

[Architecture diagram: loading and replication]
o Components: AIR Repository, AIR Loader, AIR Control, Solr Extensions, MIR
o Infrastructure: Apache Solr Master, Apache Solr Slave, Jenkins, Spring Framework
o Data: Maintenance, Defects, Flat Rates, Service Bulletins, Documents, Vehicles, Repair Overview, ...
o Stores: Backend Databases and Systems
o Flows: Execute, Load, Replicate, Search
o Key figures: 20 languages, 800 GB Solr index (on both master and slave)

Slide 5

Slide 5 text

Let's go back to when it all began …
Source: http://www.october212015.com/images/timecircuits.jpg

Slide 6

Slide 6 text

The project vision: find the right information in less than 3 clicks.
The situation:
o Users had to use up to 7 different applications for their daily work.
o The systems were poorly integrated.
o Finding the correct information was laborious and error-prone.
The idea:
o Combine the data into a consistent information network.
o Make the information network and its data searchable and navigable.
o Replace the existing applications with one easy-to-use application.

Slide 7

Slide 7 text

But how do we find the originating system for the desired data?
Where to find the vehicle data? 60 potential systems and 5,000 entities.
[Diagram: vehicle data and other data spread across System A, System B, System C and System D]

Slide 8

Slide 8 text

And how do we find the hidden relations between the systems and their data?
How is the data linked to each other? 400,000 potential relations.
[Diagram: Vehicle, Customer, Documents and other data spread across System A, System B, System C and System D]

Slide 9

Slide 9 text

Meta Information Research (MIR)
Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg

Slide 10

Slide 10 text

MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.
o MIR manages meta information about the source systems (the data models and record descriptions).
o MIR lets you navigate and search the metadata and drill into it using facets.
o MIR also manages the target data model and the Solr schema description.
[Diagram labels: Meta Information Research, Apache Solr, MIR User Interface, MIR Loader, MIR Generators, Backend Databases and Systems, Metadata Index, Read, Sources (Java, XML), Magic Draw, 25 MB]

Slide 11

Slide 11 text

[Screenshot: MIR user interface]
o Wildcard queries
o Faceted drill-down
o Tree view of systems, tables and attributes
o Search results
o Found potential synonyms for the chassis number

Slide 12

Slide 12 text

EAT YOUR OWN DOG FOOD. The AIR Solr schema definition is modelled and defined within MIR.
[Screenshot labels: Solr schema attributes; Solr entities for each release]

Slide 13

Slide 13 text

def sourceGenerator = MIR + Solr + Maven;

Slide 14

Slide 14 text

But Solr is a full text search engine. You have to use an Oracle DB for your application data! NO!

Slide 15

Slide 15 text

Some of the AIR requirements were ...
o Focus is on search. Transactions are not required.
o High demands on request volume and performance.
o Free navigation on data model and content.
o Support for full text search and faceted search.
o Offline capabilities.
o Scalability from low-end device to server to cloud.

Slide 16

Slide 16 text

Apache Solr outperformed Oracle significantly in query time as well as index size.

Query                                        | Timings
Oracle: SELECT * FROM VEHICLE WHERE VIN LIKE 'V%'    | 383 ms | 384 ms | 383 ms
Solr:   INFO_TYPE:VEHICLE AND VIN:V*                 |  38 ms |   0 ms |   0 ms
Oracle: SELECT * FROM MEASURE WHERE TEXT='engine'    | 389 ms | 387 ms | 386 ms
Solr:   INFO_TYPE:MEASURE AND TEXT:engine            |  92 ms |   0 ms |   0 ms
Oracle: SELECT * FROM VEHICLE WHERE VIN LIKE '%X%'   | 859 ms | 379 ms | 383 ms
Solr:   INFO_TYPE:VEHICLE AND VIN:*X*                |  39 ms |   0 ms |   0 ms

Test data set: 150,000 records
Disk space: 132 MB Solr vs. 385 MB Oracle

Slide 17

Slide 17 text

Dirt Race Use Case:
o Low-end devices
o No Internet
Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg

Slide 18

Slide 18 text

Running Solr and AIR-2-Go on Raspberry Pi Model B worked like a charm.
o Running Debian Linux + JDK8
o Jetty Container with the Solr and AIR WARs deployed
o Reduced Solr data set with approx. 1.5 million documents
Model B Hardware Specs:
o ARMv6 CPU at 700 MHz
o 512 MB RAM
o 32 GB SD card

Slide 19

Slide 19 text

YOU GOTTA FIGHT FOR YOUR RIGHT TO SOLR!

Slide 20

Slide 20 text

No silver bullet. A careful schema design is crucial for your Solr performance.

Slide 21

Slide 21 text

Naive data denormalization can quickly lead to combinatorial explosion.
[Diagram: Relationship Navigation. Entities linked by m:n relationships]
o 33,071,137 Vehicles
o 648,129 Technical Documents
o 14,830,197 Flat Rate Units
o 5,078,411 FRU Groups
o 55,000 Parts
o 648,129 Measures
o 18,573 Repair Instructions
o 6,180 Fault Indications
o 1,678,667 Packages
o 41,385 Types
Num Docs: 55,777,706

Slide 22

Slide 22 text

Multi-valued fields can efficiently store 1..n relations but may result in false positives.
{
  "INFO_TYPE": "AWPOS_GROUP",
  "NUMMER":    ["1134190", "1235590"],
  "BAUSTAND":  ["1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z"],
  "E_SERIES":  ["F10", "E30"]
}
The values at index 0 of each array belong together, as do the values at index 1. Both of the following queries match the document, although only the first one describes a real combination:
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10
q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30
If false positives from Solr are acceptable, post-filter the results in your application.
Note: recent Solr versions support nested child documents. Use them instead.
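The false positive and the application-side post filter can be illustrated with a minimal Python sketch. It is not the AIR implementation: the function names `solr_like_match` and `same_index_match` are hypothetical, and the first function only mimics how each multi-valued field is matched independently of the others.

```python
from typing import Dict, List, Union

Doc = Dict[str, Union[str, List[str]]]

# Document from the slide: values at the same array index belong together.
DOC: Doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "E_SERIES": ["F10", "E30"],
}


def solr_like_match(doc: Doc, criteria: Dict[str, str]) -> bool:
    """Mimic multi-valued matching: every field is checked on its own."""
    for field, wanted in criteria.items():
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        if wanted not in values:
            return False
    return True


def same_index_match(doc: Doc, criteria: Dict[str, str]) -> bool:
    """Post filter: all criteria must hold at the same array index.

    Only pass multi-valued (list) fields in `criteria`.
    """
    fields = list(criteria)
    columns = [doc[field] for field in fields]
    return any(
        all(value == criteria[field] for field, value in zip(fields, row))
        for row in zip(*columns)
    )


real = {"NUMMER": "1134190", "E_SERIES": "F10"}
mixed = {"NUMMER": "1134190", "E_SERIES": "E30"}

print(solr_like_match(DOC, real), same_index_match(DOC, real))
print(solr_like_match(DOC, mixed), same_index_match(DOC, mixed))
```

The second combination passes the field-by-field check but is rejected by the positional post filter, which is the behaviour the slide warns about.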

Slide 23

Slide 23 text

Technical documents and their validity were expressed in a binary representation.
o Validity expressions may have up to 46 characteristics.
o Validity expressions use 5 different boolean operators (AND, NOT, …).
o Validity expressions can be nested and complex.
o Some characteristics are dynamic and not even known at index time.
Solution: transform the validity expressions into the equivalent JavaScript terms and evaluate these terms at query time using a custom function query filter.

Slide 24

Slide 24 text

Binary validity expression example.
Type(53078923) = 'Brand', Value(53086475) = 'BMW PKW'
Type(53088651) = 'E-Series', Value(53161483) = 'F10'
Type(64555275) = 'Transmission', Value(53161483) = 'MECH'

Slide 25

Slide 25 text

Transformation of binary validity terms into their JavaScript equivalent at index time.
AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')
((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))
{
  "INFO_TYPE": "TECHNISCHES_DOKUMENT",
  "DOKUMENT_TITEL": "Getriebe aus- und einbauen",
  "DOKUMENT_ART": "reparaturanleitung",
  "VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
  "BRAND": ["BMW PKW"],
  ...
}
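A minimal Python sketch of such a transformation follows. The slides do not show the binary format or the real converter, so several things here are assumptions: the expression is taken to be already decoded into nested tuples, the node kinds `EQ`, `AND`, `OR` and `NOT` are hypothetical (only AND and NOT are named on the slides), and `field_name` and `to_js` are illustrative names.

```python
from typing import Sequence, Tuple, Union

# Hypothetical decoded form of a validity expression:
#   ("EQ", characteristic, value)
#   ("NOT", child)
#   ("AND" | "OR", [child, ...])
Node = Union[Tuple[str, str, str], Tuple[str, object]]

JS_OPERATORS = {"AND": "&&", "OR": "||"}


def field_name(characteristic: str) -> str:
    """Derive an upper-case field name, e.g. 'E-Series' -> 'E_SERIES'."""
    return characteristic.upper().replace("-", "_").replace(" ", "_")


def to_js(node: Sequence) -> str:
    """Render a decoded validity expression as a JavaScript boolean term."""
    kind = node[0]
    if kind == "EQ":
        _, characteristic, value = node
        # Escape backslashes and single quotes for a JS string literal.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"({field_name(characteristic)}=='{escaped}')"
    if kind == "NOT":
        return f"(!{to_js(node[1])})"
    operator = JS_OPERATORS[kind]  # raises KeyError for unknown node kinds
    return "(" + operator.join(to_js(child) for child in node[1]) + ")"


expression = ("AND", [
    ("EQ", "Brand", "BMW PKW"),
    ("EQ", "E-Series", "F10"),
    ("EQ", "Transmission", "MECH"),
])
print(to_js(expression))
```

For the example expression this produces the same term as on the slide. The resulting string would be stored in the `VALIDITY` field at index time.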

Slide 26

Slide 26 text

The JavaScript validity term is evaluated at query time using a custom function query.
&fq=INFO_TYPE:TECHNISCHES_DOKUMENT
&fq=DOKUMENT_ART:reparaturanleitung
&fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
    jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU
    1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
    wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV
    kJBVVJFSUhFIjoiWCcifQ==)
Base64 decode:
{ "BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH" }
See also: http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
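On the client side, a filter of this shape could be assembled as in the following Python sketch. `jsTerm` is the custom function named on the slide and is not part of stock Solr; `build_validity_filter` is a hypothetical helper, and the assumption that the second argument is Base64-encoded JSON is taken from the "Base64 decode" note above.

```python
import base64
import json
from typing import Dict, Optional


def build_validity_filter(context: Dict[str, Optional[str]],
                          field: str = "VALIDITY") -> str:
    """Build an frange filter query that passes the vehicle context to jsTerm.

    The context is serialized to compact JSON and Base64-encoded so that it
    survives as a single function argument.
    """
    payload = json.dumps(context, separators=(",", ":")).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    # l=1 u=1 keeps only documents for which the function returns exactly 1.
    # cache=false with a cost of 100 or more makes Solr run it as a post filter.
    local_params = "{!frange l=1 u=1 incl=true incu=true cache=false cost=500}"
    return f"{local_params}jsTerm({field},{encoded})"


context = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
fq = build_validity_filter(context)
print(fq)
```

When the filter is sent over HTTP, the value still has to be URL-encoded, for example with `urllib.parse.urlencode`.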

Slide 27

Slide 27 text

How often do we load data? How do we ensure data consistency?

Slide 28

Slide 28 text

A traditional approach using a DWH and ETL: too inflexible, heavyweight and expensive.
[Diagram: System A, System B, System C, File, DB, Data Warehouse and AIR Solr connected by ETL jobs]
o ETL jobs would usually be implemented with Informatica
o Significant business logic required depending on the source database

Slide 29

Slide 29 text

Flexible and lightweight ETL combined with Continuous Delivery and DevOps.
[Diagram: build and load pipeline]
o Roles: Developer, Operations
o Components: AIR Loader, AIR Loader Slave, AIR Search
o Infrastructure: Jenkins Master, Jenkins Slave, Apache Maven, Nexus Repository, Apache Solr
o Data: Data Source A ... Data Source n, Solr Index
o Flows: Start, Build, Build & Deploy, Run, Extract, Load, Replicate
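The extract and load steps of such a lightweight loader can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AIR Loader: the function names, the upper-casing of field names and the batch size are all hypothetical, and only the `INFO_TYPE` field is taken from the slides.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def transform(row: Dict[str, object], info_type: str) -> Dict[str, object]:
    """Map one extracted source row to a flat Solr document."""
    doc = {key.upper(): value for key, value in row.items() if value is not None}
    doc["INFO_TYPE"] = info_type
    return doc


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` entries."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def update_payloads(rows: Iterable[Dict[str, object]], info_type: str,
                    batch_size: int = 1000) -> Iterator[bytes]:
    """Yield JSON request bodies, each holding one batch of documents."""
    docs = (transform(row, info_type) for row in rows)
    for chunk in batches(docs, batch_size):
        yield json.dumps(chunk).encode("utf-8")


rows = [{"vin": "V1", "model": None}, {"vin": "V2"}, {"vin": "V3"}]
payloads = list(update_payloads(rows, "VEHICLE", batch_size=2))
print(len(payloads))
```

Each payload is a JSON array of documents, which is the form Solr's `/update` handler accepts when posted with the content type `application/json`. Because the whole job is plain code, it can be built, versioned and run by a CI server such as Jenkins, as the diagram suggests.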

Slide 30

Slide 30 text


Slide 31

Slide 31 text

Apache Solr has become a powerful tool for data analytics applications. Be creative.
o Our next big project using Apache Solr is already on its way. High performance application to predict and calculate the bill of materials for all required parts and orders.
o Apache Solr as a compressed, scalable and high performance time series database (FOSDEM'15, Florian Lautenschlager, QAware GmbH)
o Leveraging the Power of SOLR and SPARK (Apache: Big Data 2015, Johannes Weigend, QAware GmbH)

Slide 32

Slide 32 text

Business intelligence is about asking the right questions about your data.

Slide 33

Slide 33 text

And with Apache Solr you can search and find the answers you are looking for.

Slide 34

Slide 34 text

Mario-Leander Reimer
Chief Technologist, QAware GmbH
https://twitter.com/leanderreimer/
https://slideshare.net/MarioLeanderReimer/
https://speakerdeck.com/lreimer/