Slide 1

Slide 1 text

From idea to production with NLP, Scala and Spark: a story from the trenches

Slide 2

Slide 2 text

Me
Niek Bartholomeus (@niekbartho)
Background as a developer
Spent 6 years evangelizing DevOps in big organizations
Switched to data engineering and data science early 2016
http://niek.bartholomeus.be

Slide 3

Slide 3 text

MeetMatch
Building a search engine for innovative partnerships between companies and individuals through complementarity (not similarity)
(Diagram labels: YOU, TOO CLOSE, TOO FAR, PERFECT FIT)

Slide 4

Slide 4 text

MeetMatch - presenter notes MeetMatch is a small startup, founded in 2015, that focuses on improving innovative cooperation between and within organisations. It does this by introducing the concept of complementarity: companies that work in areas that, when combined, have the potential of delivering great value. These areas should be neither too far from nor too close to each other. This approach is very different from what Google and most others do: they always look for the closest, most similar results to the search term.

Slide 5

Slide 5 text

Business networking events
Organization of business speed dating events around Europe where the participants are paired by an algorithm that uses this concept of complementarity

Slide 6

Slide 6 text

Natural language processing
We need to find out what companies are doing …
… which is easy to learn from the company website, Wikipedia, social media, …
But we need a little help if we want to do this at a large scale …

Slide 7

Slide 7 text

Natural language processing - presenter notes
We need to find out what the companies do (and then be able to compare the results). How?
- Most of the time this is well described on their website, or even on Wikipedia, social media, forums, etc.
- Problem of scale: how can we efficiently read all of this information for lots of companies (tens or hundreds of thousands)?
- Solution: use natural language processing to let computers do the work instead of humans.
- Our assumption is that the clusters formed by the words that appear relatively frequently in a company's text are a representation of that company's activities.

Slide 8

Slide 8 text

NLP pipeline
Crawl webpages → NLP → Word clustering
Our assumption: the clusters of words that are found in the text of a company are an indication of the activities of that company
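A minimal sketch of what such a pipeline could look like with Spark ML; the paths, column names and parameters below are illustrative, not the actual MeetMatch code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, Word2Vec}
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder.appName("nlp-pipeline").getOrCreate()

// Assume the crawler stored (companyId, text) pairs as Parquet; path is made up.
val pages = spark.read.parquet("/data/crawled_pages")

// Split the raw page text into words.
val tokens = new RegexTokenizer()
  .setInputCol("text").setOutputCol("words")
  .transform(pages)

// Learn a vector per word from the whole corpus.
val w2vModel = new Word2Vec()
  .setInputCol("words").setOutputCol("docVector")
  .setVectorSize(100).setMinCount(5)
  .fit(tokens)

// Cluster the word vectors; each cluster stands in for one "activity" topic.
val wordVectors = w2vModel.getVectors // columns: word, vector
val clustered = new KMeans()
  .setK(50).setFeaturesCol("vector")
  .fit(wordVectors)
  .transform(wordVectors) // adds a "prediction" (cluster id) column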

Slide 9

Slide 9 text

Macro view of word embeddings using t-SNE https://lvdmaaten.github.io/tsne/
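To produce a view like this, the learned word vectors can be exported and projected to 2D with the t-SNE implementation linked above. A small sketch, reusing the hypothetical w2vModel from the pipeline sketch (the output path is illustrative):

import org.apache.spark.ml.linalg.Vector

// Dump "word<TAB>v1,v2,..." lines that an external t-SNE tool can read.
w2vModel.getVectors.rdd
  .map(row => row.getString(0) + "\t" + row.getAs[Vector](1).toArray.mkString(","))
  .saveAsTextFile("/data/word_vectors_tsv")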

Slide 10

Slide 10 text

Macro view - zoom in

Slide 11

Slide 11 text

Word clusters indicating a company’s activities

Slide 12

Slide 12 text

Complementarity
The word clusters are compared and receive a complementarity score
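The actual scoring function is not shown in the talk. The hypothetical sketch below only illustrates the core idea from slide 3: the score should peak somewhere between "too close" and "too far", so both near-identical and unrelated clusters score low.

import org.apache.spark.ml.linalg.Vector

// Cosine similarity between two cluster centroids.
def cosine(a: Vector, b: Vector): Double = {
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.toArray.map(x => x * x).sum) *
    math.sqrt(b.toArray.map(x => x * x).sum)
  dot / norms
}

// Hypothetical score: 1.0 at the sweet spot, falling off towards
// "too close" (similarity near 1) and "too far" (similarity near 0).
def complementarity(a: Vector, b: Vector, sweetSpot: Double = 0.5): Double =
  1.0 - math.abs(cosine(a, b) - sweetSpot) / math.max(sweetSpot, 1.0 - sweetSpot)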

Slide 13

Slide 13 text

Dilemma
Small team
Need to crunch big datasets
Searching for vague patterns in a not so well understood domain requires a fast development & feedback cycle
Last-minute registrations on networking events require fast incremental processing

Slide 14

Slide 14 text

Technical solution
(Architecture diagram: a web crawler and a MySQL legacy system feed Parquet; Angular/d3 front ends serve the data scientist and the power user; Cassandra serves the event app attendee.)

Slide 15

Slide 15 text

Technical solution - presenter notes
The cornerstone of the target platform that covers the challenging requirements mentioned before is … Spark! The first version of the solution was a prototype built on PHP and MySQL, which was ported to Spark and Scala.
Main requirements:
1. scalability: we need to process big datasets, and this will only grow when expanding the geographic scope of the companies, possibly worldwide (=> Spark)
2. processing speed (=> Scala: compiled rather than interpreted; => Spark: engine optimizations + streaming support)
3. maintainability of the code base (=> Scala: functional programming)
4. fast feedback cycle (=> Zeppelin notebook + Parquet + Angular)
In order to quickly visualize intermediate results, the choice was made to use Zeppelin as the front end, Spark SQL as the data access API and Parquet as the storage format. This turned out to be an excellent choice so far: great value for very low (upfront) effort, with minimal effort spent on the ceremony of setting up and running the platform. It is great for ad hoc querying, especially OLAP-style aggregations.
However, it became clear that this solution is not well suited for OLTP-style queries (e.g. searching for data by company), especially in a multi-user setting. The decision was made to add a web server to the stack over time to cover these requirements, in the hope that the effort already spent on building interactive notebooks with Angular and D3 could be reused on the web server stack. Once the UI and query requirements have stabilized in the Zeppelin/Parquet context, they can easily be ported to an "online" portal based on Cassandra, where they are accessible to many users with minimal computing or memory needs on the infrastructure side.
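A hedged illustration of this OLAP vs OLTP trade-off (the table name, path and schema are made up, and the spark session from the earlier sketch is assumed):

val companies = spark.read.parquet("/data/companies")
companies.createOrReplaceTempView("companies")

// OLAP-style aggregation: Parquet's columnar layout makes this cheap and fast.
spark.sql("SELECT country, count(*) AS n FROM companies GROUP BY country ORDER BY n DESC").show()

// OLTP-style point lookup: it works, but it scans far more data than a
// keyed store like Cassandra would need to read for the same answer.
spark.sql("SELECT * FROM companies WHERE name = 'Acme'").show()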

Slide 16

Slide 16 text

Spark challenges 1. Lazy calculations 2. Application-focused logging

Slide 17

Slide 17 text

Spark challenges - presenter notes
1. Lazy calculations are not intuitive: it is more difficult to add logging, to debug, and to know what will happen when, especially when the result of a lazy calculation is used more than once (persist, unpersist).
2. Logging: the Spark web UI is good for finding out what happens at a deep infrastructural level, but an application-level layer is missing; without it, it is hard to align your own code with the web UI.
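A small sketch of the persist/unpersist pattern behind point 1, assuming the spark session and the illustrative Parquet path from the earlier pipeline sketch:

import org.apache.spark.storage.StorageLevel
import spark.implicits._

// All of this is lazy: nothing has been read or computed yet.
val words = spark.read.parquet("/data/crawled_pages")
  .select($"companyId".as[Int], $"text".as[String])
  .flatMap { case (id, text) => text.split("\\s+").map(word => (id, word)) }

// Cache before firing two actions; otherwise each action re-reads and re-tokenizes everything.
words.persist(StorageLevel.MEMORY_AND_DISK)

val totalWords = words.count()                      // first action: triggers the real work
val companies  = words.map(_._1).distinct().count() // second action: served from the cache

words.unpersist() // release the cached partitions once all consumers are done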

Slide 18

Slide 18 text

Spark logging
(Screenshots: the Jobs and Job details pages of the Spark web UI.)
Hard to align the jobs and job details with the structure of the Scala code base and find out what really happened. We need a better layer of abstraction.

Slide 19

Slide 19 text

Spark challenges
1. Lazy calculations
2. Application-focused logging
3. Catalyst Optimizer doesn't work with the typed Dataset API* (illustrated below)
4. Lack of roll-forward semantics
* see task 4 of the workshop!
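A minimal sketch of challenge 3 (the schema and path are illustrative): the same filter written against the typed and the untyped API behaves very differently under Catalyst.

import spark.implicits._

case class Sentence(language: String, text: String)
val sentences = spark.read.parquet("/data/sentences").as[Sentence]

// Typed API: the lambda is a black box to Catalyst, so the filter cannot be
// pushed down to the Parquet reader; every row is deserialized into a Sentence first.
val typed = sentences.filter(s => s.language == "en")

// Untyped Column expression: Catalyst understands it and can push the predicate
// down to the storage layer, skipping non-matching row groups entirely.
val untyped = sentences.filter($"language" === "en")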

Slide 20

Slide 20 text

Reusable platform
- Framework
- Package and deployment toolkit
- Online platform
- Application-focused logging
- Data flow
https://github.com/tolomaus/languagedetector

Slide 21

Slide 21 text

Reusable platform - Framework
(Diagram: the workflow and module abstractions.)
https://github.com/tolomaus/languagedetector
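A hypothetical sketch of what a workflow/module abstraction along these lines could look like; the real Framework code lives in the repo above and may differ. The module name DetectLanguage is taken from the workshop tasks on the last slide; the paths are made up.

import org.apache.spark.sql.SparkSession

// Every unit of work is a module; a workflow chains modules together.
trait Module {
  def execute(implicit spark: SparkSession): Unit
}

object DetectLanguage extends Module {
  override def execute(implicit spark: SparkSession): Unit = {
    val sentences = spark.read.parquet("/data/sentences")
    // ... detect the language of each sentence here ...
    sentences.write.mode("overwrite").parquet("/data/sentences_by_language")
  }
}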

Slide 22

Slide 22 text

Reusable platform - Package and deployment toolkit
(Diagram: commit to VCS → package to the repository → deploy to Dev, Test, Acceptance and Production.)
https://github.com/tolomaus/languagedetector

Slide 23

Slide 23 text

Reusable platform - Online platform
https://github.com/tolomaus/languagedetector

Slide 24

Slide 24 text

Reusable platform - Application-focused logging
(Screenshot: per workflow and module, the reads and writes, warnings, Spark jobs, sizes and diffs, times and durations, with a link to the Spark UI.)
https://github.com/tolomaus/languagedetector

Slide 25

Slide 25 text

Application-focused logging - presenter notes
Monitoring, troubleshooting and tweaking a system while millions of lines of data are processed is a huge challenge. It is very important to have tools that give visibility into the system not only at the micro scale (e.g. seeing how one particular company is processed) but also at the macro/global scale. Simply adding logging to the code base to see what's happening at execution time is not sufficient, because you will end up with millions of log lines. As already mentioned, the existing monitoring solution in Spark, the web UI, is not well suited for monitoring at the application level, so I built a custom solution on top of it that shows a history of the calculations that have run, any warnings or errors that may have occurred, how long it took to process a particular company, etc. Each calculation consists of one or more Spark jobs, which link to their corresponding job pages in the Spark UI for further investigation.
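One building block for such a layer is Spark's job-group API, which lets you tag all jobs spawned by one logical calculation so they can be traced back from the Spark UI to the application code. A hedged sketch of the idea (the actual languagedetector implementation may differ; runCalculation is a made-up helper name):

import org.apache.spark.SparkContext

def runCalculation[T](sc: SparkContext, name: String)(body: => T): T = {
  val start = System.currentTimeMillis()
  sc.setJobGroup(name, s"calculation: $name") // the description shows up in the Spark UI
  try {
    val result = body
    println(s"$name finished in ${System.currentTimeMillis() - start} ms")
    result
  } finally sc.clearJobGroup()
}

// Usage: every Spark job fired by the body is now grouped under "DetectLanguage".
// runCalculation(spark.sparkContext, "DetectLanguage") { sentences.count() }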

Slide 26

Slide 26 text

Reusable platform - Data flow
Demo application: (screenshot; censored)
https://github.com/tolomaus/languagedetector

Slide 27

Slide 27 text

Reusable platform - Data flow
Real application: (screenshot; censored)
https://github.com/tolomaus/languagedetector

Slide 28

Slide 28 text

Important guidelines
1. Reduce accidental complexity
2. Zoom in / zoom out

Slide 29

Slide 29 text

Important guidelines - presenter notes
1. Reduce accidental complexity
The most important rule when building a system. Complexity (i.e. managing all of the inter-dependencies) is probably the biggest reason why projects fail prematurely. Reducing accidental complexity (by avoiding dependencies that are not inherent to the problem) should therefore be taken very seriously. The good thing is that the time spent here starts to pay off very quickly: it allows you to stay agile and reactive to new customer needs. Functional programming (as opposed to the more "straightforward" style) encourages thinking deeply about the underlying structure before building the solution.
2. Zoom in / zoom out
Another important guideline is to regularly change your viewpoint between the micro world and the macro world when designing a solution. Keep the big picture in mind while making decisions at the detailed level, and vice versa. This applies to:
- the technical level (architecture, code base)
- the analytical level (interpreting the results of the calculations, e.g. the t-SNE macro view vs the company micro view on the previous slides)
- the operational level (troubleshooting one company vs improving the average computing time for all companies; see also monitoring on the next slide)
- the idea-to-production-and-back flow (keep the operational concerns in mind from the very start)

Slide 30

Slide 30 text

Workshop tasks
TASK 1: DetectLanguage
TASK 2: CountSentencesByLanguage
TASK 3: CountWrongDetectionsByLanguage
TASK 4: CountWrongDetectionsByLanguageAsDataFrame
https://github.com/tolomaus/languagedetector/blob/master/src/main/scala/biz/meetmatch/modules