Search-based business intelligence and reverse data engineering with Apache Solr

Apache: Big Data Europe 2015
Search-based business intelligence and reverse data engineering with Apache Solr
Mario-Leander Reimer, Chief Technologist

This talk will …
o Give a brief overview of the AIR system’s architecture
o Show reverse data engineering using Solr and MIR
o Talk about the fight for our right to Solr
o Describe solutions for the problem of combinatorial explosion
o Outline a flexible and lightweight ETL approach for Solr
Mario-Leander Reimer, 28. September 2015

[Architecture diagram of the AIR system. AIR Repository: AIR Loader (Spring Framework), Apache Solr with Solr Extensions, and AIR Control (Jenkins); labels: Execute, Load, Backend Databases and Systems, 20 languages, 800 GB Solr index. AIR Central: JSF Web UI and REST API on Spring Framework and JEE 5, with Solr Access, WS Clients, File Storage, AIR DB and Document Storage, connected to the AIR Bus and backend systems. Clients: the AIR Client (.NET WPF) for mechanics and service technicians, browsers in independent workshops, third-party applications via the AIR Fork and AIR Call DLLs, and a third-party iOS app via the AIR iOS Lib.]

[Architecture diagram: AIR Control (Jenkins) executes the AIR Loader, which loads data from the backend databases and systems into the Apache Solr master (20 languages, 800 GB Solr index). The index is replicated to an Apache Solr slave, which the MIR system searches.]

Let's go back to when it all began …
Source: http://www.october212015.com/images/timecircuits.jpg

The project vision: find the right information in fewer than 3 clicks.
The situation:
o Users had to use up to 7 different applications for their daily work.
o The systems were not well integrated.
o Finding the correct information was laborious and error-prone.
The idea:
o Combine the data into a consistent information network.
o Make the information network and its data searchable and navigable.
o Replace the existing applications with one easy-to-use application.

But how do we find the originating system for the desired data?
Where to find the vehicle data? 60 potential systems and 5000 entities.
[Diagram: Systems A, B, C and D, holding vehicle data and other data.]

And how do we find the hidden relations between the systems and their data?
How is the data linked to each other? 400.000 potential relations.
[Diagram: Systems A, B, C and D, holding vehicle, customer, document and other data.]

Meta Information Research (MIR)
Source: http://www.thewallpapers.org/photo/31865/Mir_space_station_12_June_1998.jpg

MIR is a simple and lightweight data reverse engineering and analysis tool based on Solr.
o MIR manages meta information about the source systems (the data models and record descriptions).
o MIR lets you navigate and search the metadata; you can drill into the metadata using facets.
o MIR also manages the target data model and the Solr schema description.
[Diagram: the Meta Information Research system with Apache Solr, the MIR User Interface, the MIR Loader and the MIR Generators. Labels: Backend Databases and Systems, Read Sources (Java, XML), MagicDraw, Metadata Index, 25 MB.]

[Screenshot of the MIR user interface, annotated: wildcard queries; faceted drill-down; tree view of systems, tables and attributes; search results; potential synonyms found for the chassis number.]
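A wildcard search with faceted drill-down over such a metadata index could be requested roughly as follows. This is only a sketch using Solr's standard `q`, `fq` and `facet.field` parameters; the core name `mir` and the field names `SYSTEM`, `TABLE_NAME` and `ATTRIBUTE_NAME` are assumptions for illustration and are not taken from the talk.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def metadata_search_url(base_url, query, facet_fields, filters=None):
    """Build a Solr select URL with facet counts and optional drill-down filters."""
    params = [
        ("q", query),
        ("rows", "20"),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each drill-down step becomes its own filter query (fq).
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, chosen only for this example.
url = metadata_search_url(
    "http://localhost:8983/solr/mir",
    "ATTRIBUTE_NAME:*VIN*",
    ["SYSTEM", "TABLE_NAME"],
    {"SYSTEM": "System A"},
)
print(parse_qs(urlsplit(url).query)["facet.field"])
```

Keeping each drill-down step as a separate `fq` lets Solr cache the filters independently of the main query.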

EAT YOUR OWN DOG FOOD.
The AIR Solr schema definition is modelled and defined within MIR.
[Screenshot, annotated: Solr schema attributes; Solr entities for each release.]

def sourceGenerator = MIR + Solr + Maven;

But Solr is a full-text search engine. You have to use an Oracle DB for your application data!
NO!

Some of the AIR requirements were ...
o Focus is on search. Transactions are not required.
o High demands on request volume and performance.
o Free navigation on data model and content.
o Support for full-text search and faceted search.
o Offline capabilities.
o Scalability from low-end device to server to cloud.

Apache Solr outperformed Oracle significantly in query time as well as index size.

| System | Query                                       | Query times            |
| Solr   | INFO_TYPE:VEHICLE AND VIN:V*                | 38 ms, 0 ms, 0 ms      |
| Oracle | SELECT * FROM VEHICLE WHERE VIN LIKE 'V%'   | 383 ms, 384 ms, 383 ms |
| Solr   | INFO_TYPE:MEASURE AND TEXT:engine           | 92 ms, 0 ms, 0 ms      |
| Oracle | SELECT * FROM MEASURE WHERE TEXT='engine'   | 389 ms, 387 ms, 386 ms |
| Solr   | INFO_TYPE:VEHICLE AND VIN:*X*               | 39 ms, 0 ms, 0 ms      |
| Oracle | SELECT * FROM VEHICLE WHERE VIN LIKE '%X%'  | 859 ms, 379 ms, 383 ms |

Test data set: 150.000 records
Disk space: 132 MB Solr vs. 385 MB Oracle
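The query pairs on this slide map SQL wildcard patterns onto Solr wildcard syntax. A minimal sketch of that mapping is given here; the helper `like_to_solr` is hypothetical, not part of AIR or Solr, and it only assumes that `%` corresponds to `*` and `_` to `?`, with Solr query syntax characters escaped by a backslash.

```python
# Characters with a special meaning in the Solr query syntax (plus the space).
SOLR_SPECIALS = set('+-&|!(){}[]^"~*?:\\/ ')


def like_to_solr(info_type, field, pattern):
    """Translate a SQL LIKE pattern into a Solr wildcard query (illustrative helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append("*")        # any number of characters
        elif ch == "_":
            parts.append("?")        # exactly one character
        elif ch in SOLR_SPECIALS:
            parts.append("\\" + ch)  # escape query syntax characters
        else:
            parts.append(ch)
    return f"INFO_TYPE:{info_type} AND {field}:{''.join(parts)}"


print(like_to_solr("VEHICLE", "VIN", "V%"))
print(like_to_solr("VEHICLE", "VIN", "%X%"))
```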

Dirt Race Use Case:
o Low-end devices
o No Internet
Source: http://www.dirtbikerider.com/news/images/anotherimpressivegpweekendforhusqvarna_553db21addaaa.jpg

Running Solr and AIR-2-Go on a Raspberry Pi Model B worked like a charm.
o Running Debian Linux + JDK 8
o Jetty container with the Solr and AIR WARs deployed
o Reduced Solr data set with approx. 1.5 million documents
Model B hardware specs:
o ARMv6 CPU at 700 MHz
o 512 MB RAM
o 32 GB SD card

YOU GOTTA FIGHT FOR YOUR RIGHT TO SOLR!

No silver bullet. A careful schema design is crucial for your Solr performance.

Naive data denormalization can quickly lead to combinatorial explosion.
[Diagram "Relationship Navigation": the entity types below, connected by relations with m and n multiplicities.]
o 33.071.137 Vehicles
o 648.129 Technical Documents
o 14.830.197 Flat Rate Units
o 5.078.411 FRU Groups
o 55.000 Parts
o 648.129 Measures
o 18.573 Repair Instructions
o 6.180 Fault Indications
o 1.678.667 Packages
o 41.385 Types
Num Docs: 55.777.706
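To see why naive denormalization explodes, consider a single document that is related to several entity types at once. The sketch below flattens every combination of related values into its own row; the field names and fan-out sizes are made up for illustration and are not figures from the talk.

```python
from itertools import product


def denormalize(doc_id, related):
    """Flatten one document and its related entities into one row per combination."""
    keys = list(related)
    return [
        dict(zip(keys, combo), DOC=doc_id)
        for combo in product(*related.values())
    ]


# Hypothetical fan-outs: 3 types, 4 measures and 5 parts linked to one document.
related = {
    "TYPE": ["T1", "T2", "T3"],
    "MEASURE": ["M1", "M2", "M3", "M4"],
    "PART": ["P1", "P2", "P3", "P4", "P5"],
}

rows = denormalize("D1", related)
separate_docs = 1 + sum(len(values) for values in related.values())

print(len(rows))       # flattened rows: product of the fan-outs
print(separate_docs)   # one document per entity instead
```

The flattened row count grows with the product of the fan-outs, while keeping one document per entity grows only with their sum.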

Multi-valued fields can efficiently store 1..n relations but may result in false positives.

  {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER":   [ "1134190", "1235590" ],
    "BAUSTAND": [ "1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z" ],
    "E_SERIES": [ "F10", "E30" ]
  }

In each array, the first value sits at index 0 and the second at index 1.

  q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:F10
  q=INFO_TYPE:AWPOS_GROUP AND NUMMER:1134190 AND E_SERIES:E30

Both queries match this document, although NUMMER 1134190 (index 0) and E_SERIES E30 (index 1) belong to different entries.
If this doesn't matter, perform post-filtering in your application.
Note: recent Solr versions support nested child documents. Use them instead.
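The false positive and the post-filter can be reproduced without Solr. The following sketch mimics multi-valued matching, where each field only has to contain its value somewhere, and then applies an application-side filter that requires all values to sit at the same array index. Both helper functions are illustrative assumptions and presume the arrays are parallel and of equal length.

```python
doc = {
    "INFO_TYPE": "AWPOS_GROUP",
    "NUMMER": ["1134190", "1235590"],
    "BAUSTAND": ["1969-12-31T23:00:00Z", "1975-12-31T23:00:00Z"],
    "E_SERIES": ["F10", "E30"],
}


def multi_valued_match(doc, **criteria):
    """Mimic a multi-valued field match: every value appears somewhere in its field."""
    return all(value in doc[field] for field, value in criteria.items())


def post_filter(doc, **criteria):
    """Keep the hit only if all requested values share the same array index."""
    wanted = tuple(criteria.values())
    entries = zip(*(doc[field] for field in criteria))
    return any(entry == wanted for entry in entries)


print(multi_valued_match(doc, NUMMER="1134190", E_SERIES="E30"))  # index-level hit
print(post_filter(doc, NUMMER="1134190", E_SERIES="E30"))         # rejected afterwards
print(post_filter(doc, NUMMER="1134190", E_SERIES="F10"))         # genuine entry
```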

Technical documents and their validity were expressed in a binary representation.
o Validity expressions may have up to 46 characteristics.
o Validity expressions use 5 different boolean operators (AND, NOT, …).
o Validity expressions can be nested and complex.
o Some characteristics are dynamic and not even known at index time.
Solution: transform the validity expressions into the equivalent JavaScript terms and evaluate these terms at query time using a custom function query filter.

Binary validity expression example.
Type(53078923) = 'Brand', Value(53086475) = 'BMW PKW'
Type(53088651) = 'E-Series', Value(53161483) = 'F10'
Type(64555275) = 'Transmission', Value(53161483) = 'MECH'

Transformation of binary validity terms into their JavaScript equivalent at index time.

Validity expression:
  AND(Brand='BMW PKW', E-Series='F10', Transmission='MECH')
JavaScript term:
  ((BRAND=='BMW PKW')&&(E_SERIES=='F10')&&(TRANSMISSION=='MECH'))
Indexed document:
  {
    "INFO_TYPE": "TECHNISCHES_DOKUMENT",
    "DOKUMENT_TITEL": "Getriebe aus- und einbauen",
    "DOKUMENT_ART": "reparaturanleitung",
    "VALIDITY": "((BRAND=='BMW PKW')&&((E_SERIES=='F10')&&(...))",
    "BRAND": ["BMW PKW"],
    ...
  }
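A rough sketch of such a transformation is given here. It assumes the binary expression has already been decoded into a nested tuple tree; the tree layout, the `EQ` node and the helper names are assumptions made for this example, and only the AND and NOT operators named in the talk are covered.

```python
import re


def js_name(characteristic):
    """Turn a characteristic name such as 'E-Series' into a JavaScript identifier."""
    return re.sub(r"[^A-Za-z0-9]", "_", characteristic).upper()


def to_js(node):
    """Render a decoded validity expression tree as a JavaScript boolean term."""
    op = node[0]
    if op == "EQ":
        _, name, value = node
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"({js_name(name)}=='{escaped}')"
    if op == "AND":
        return "(" + "&&".join(to_js(child) for child in node[1:]) + ")"
    if op == "NOT":
        return "(!" + to_js(node[1]) + ")"
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical decoded form of the example expression.
expr = (
    "AND",
    ("EQ", "Brand", "BMW PKW"),
    ("EQ", "E-Series", "F10"),
    ("EQ", "Transmission", "MECH"),
)
print(to_js(expr))
```

Because the term is stored as a plain string field, nested expressions only need the recursion above; no schema change is required when new characteristics appear.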

The JavaScript validity term is evaluated at query time using a custom function query.

  &fq=INFO_TYPE:TECHNISCHES_DOKUMENT
  &fq=DOKUMENT_ART:reparaturanleitung
  &fq={!frange l=1 u=1 incl=true incu=true cache=false cost=500}
      jsTerm(VALIDITY,eyJNT1RPUl9LUkFGVFNUT0ZGQVJUX01PVE9SQVJCRUlUU
      1ZFUkZBSFJFTiI6IkIiLCJFX01BU0NISU5FX0tSQUZUU1RPRkZBUlQiOm51bG
      wsIlNJQ0hFUkhFSVRTRkFIUlpFVUciOiIwIiwiQU5UUklFQiI6IkFXRCIsIkV
      kJBVVJFSUhFIjoiWCcifQ==)

Base64 decode:
  { "BRAND":"BMW PKW", "E_SERIES":"F10", "TRANSMISSION":"MECH" }

http://qaware.blogspot.de/2014/11/how-to-write-postfilter-for-solr-49.html
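On the client side, the second argument of the filter is a Base64-encoded JSON object with the vehicle characteristics. A minimal sketch of building and decoding that parameter is shown here; `jsTerm` is the custom function named on the slide, whereas the helper names and the compact JSON encoding are assumptions for illustration.

```python
import base64
import json

FRANGE = "{!frange l=1 u=1 incl=true incu=true cache=false cost=500}"


def encode_context(context):
    """Serialize the vehicle context as compact JSON and Base64-encode it."""
    raw = json.dumps(context, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_context(token):
    """Reverse of encode_context, roughly what the server side has to do first."""
    return json.loads(base64.b64decode(token))


def validity_filter(context, field="VALIDITY"):
    """Build the fq value that keeps documents whose validity term evaluates to 1."""
    return f"{FRANGE}jsTerm({field},{encode_context(context)})"


context = {"BRAND": "BMW PKW", "E_SERIES": "F10", "TRANSMISSION": "MECH"}
print(validity_filter(context))
print(decode_context(encode_context(context)))
```

The `cache=false cost=500` local parameters on the slide make Solr run this filter after the cheaper `fq` clauses, so the JavaScript term is evaluated only for documents that survived them.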

How often do we load data? How do we ensure data consistency?

A traditional approach using a DWH and ETL: too inflexible, heavyweight and expensive.
o ETL jobs would usually be implemented with Informatica.
o Significant business logic is required, depending on the source database.
[Diagram: Systems A, B and C, files and databases feed a data warehouse and AIR Solr through a chain of ETL jobs.]

Flexible and lightweight ETL combined with Continuous Delivery and DevOps.
[Diagram: developers build and deploy the AIR Loader using Apache Maven, a Jenkins master and a Nexus repository. Operations start loader runs on Jenkins slaves, where AIR Loader slaves extract from data sources A to n and load into an Apache Solr index. That index is replicated to the Apache Solr index used by AIR Search.]

Apache Solr has become a powerful tool for data analytics applications. Be creative.
Our next big project using Apache Solr is already on its way:
o High-performance application to predict and calculate the bill of materials for all required parts and orders.
Related talks:
o Apache Solr as a compressed, scalable and high performance time series database. FOSDEM’15, Florian Lautenschlager, QAware GmbH
o Leveraging the Power of SOLR and SPARK. Apache: Big Data 2015, Johannes Weigend, QAware GmbH

Business intelligence is about asking the right questions about your data.

And with Apache Solr you can search and find the answers you are looking for.

Mario-Leander Reimer
Chief Technologist, QAware GmbH
https://twitter.com/leanderreimer/
https://slideshare.net/MarioLeanderReimer/
https://speakerdeck.com/lreimer/
