From idea to production with NLP, Scala and Spark

How I used NLP, Scala and Spark to build an intelligent business speed dating engine

Niek Bartholomeus

March 08, 2017


  1. From idea to production with NLP, Scala and Spark: a story from the trenches
  2. Me: Niek Bartholomeus (@niekbartho). Background as a developer. Spent 6 years evangelizing DevOps in big organizations. Switched to data engineering and data science in early 2016.
  3. MeetMatch: building a search engine for innovative partnerships between companies and individuals through complementarity (not similarity). [Slide diagram labels: YOU, TOO CLOSE, TOO FAR, PERFECT FIT]
  4. MeetMatch - presenter notes MeetMatch is a small startup founded in 2015 that focuses on improving innovative cooperation between and within organisations. This is done by introducing a concept of complementarity: companies that work in areas that, when combined, have the potential of delivering great value. These areas should be neither too far from nor too close to each other. This approach is very different from what Google and most others do: they always look for the closest, most similar results to the search term.
  5. Business networking events: organization of business speed dating events around Europe where the participants are paired by an algorithm that uses this concept of complementarity.
  6. Natural language processing We need to find out what companies are doing… which is easy to find out from the company website, Wikipedia, social media, etc. But we need a little help if we want to apply this at large scale.
  7. Natural language processing - presenter notes We need to find out what the companies do (and then be able to compare the results). How? Most of the time this is well described on their website, or even on Wikipedia, social media, forums, etc. Problem of scale: how can we efficiently read all of this information for lots of companies (tens or hundreds of thousands)? Solution: use natural language processing to let computers do the work instead of humans. Our assumption is that clusters of the words that appear relatively frequently in the text of a company are a representation of the activities of that company.
  8. NLP pipeline: crawl webpages → NLP → word clustering. Our assumption: the clusters of words that are found in the text of a company are an indication of the activities of that company.
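The first step behind this assumption, extracting the words that appear relatively frequently in a company's crawled text, can be sketched in plain Scala. The real pipeline runs on Spark; the tokenizer and the stop-word list below are deliberately simplified assumptions, not the actual implementation:

```scala
object FrequentWords {
  // Tiny stop-word list; the real pipeline would use a proper, per-language one.
  val stopWords: Set[String] = Set("the", "a", "an", "and", "of", "to", "in", "we", "our")

  // Naive tokenizer: lowercase, keep alphabetic runs only.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // The n most frequent non-stop words with their relative frequency in the text.
  def topWords(text: String, n: Int): Seq[(String, Double)] = {
    val tokens = tokenize(text).filterNot(stopWords.contains)
    if (tokens.isEmpty) Seq.empty
    else {
      val total = tokens.size.toDouble
      tokens.groupBy(identity)
        .map { case (word, occurrences) => (word, occurrences.size / total) }
        .toSeq
        .sortBy { case (word, freq) => (-freq, word) }
        .take(n)
    }
  }
}
```

On real data, these relative frequencies would feed the word-embedding and clustering steps shown on the following slides.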
  9. Macro view of word embeddings using t-SNE

  10. Macro view - zoom in

  11. Word clusters indicating a company’s activities

  12. Complementarity The word clusters are compared and receive a complementarity score.
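As a toy illustration of the "too close / too far / perfect fit" idea from the earlier slide: map a cluster similarity in [0, 1] to a complementarity score that peaks inside a sweet-spot band. The band boundaries and the linear fall-off are illustrative assumptions, not MeetMatch's actual formula:

```scala
object Complementarity {
  // Sweet-spot band for cluster similarity: below it the activities are too far
  // apart to combine, above it they are too similar. Boundaries are illustrative.
  val tooFar   = 0.2
  val tooClose = 0.7

  // Map a similarity in [0, 1] to a complementarity score in [0, 1]:
  // maximal inside the band, falling off linearly towards both extremes.
  def score(similarity: Double): Double =
    if (similarity < tooFar) similarity / tooFar
    else if (similarity > tooClose) (1.0 - similarity) / (1.0 - tooClose)
    else 1.0
}
```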

  13. Dilemma: small team; need to crunch big datasets; searching for vague patterns in a not-so-well-understood domain requires a fast development and feedback cycle; last-minute registrations at networking events require fast incremental processing.
  14. Technical solution. [Architecture diagram: web crawler, legacy MySQL, Parquet, Cassandra; Angular/d3 front ends for the event app (attendee), the data scientist, and the power user]
  15. Technical solution - presenter notes The cornerstone of the target platform to cover the challenging requirements mentioned before is… Spark! The first version of the solution was a prototype built on PHP and MySQL, which was ported to Scala and Spark. Main requirements:
    1. scalability: need to process big datasets, and this will only grow when expanding the geographic scope of the companies, possibly worldwide (=> Spark)
    2. processing speed (=> Scala: compiled rather than interpreted; => Spark: engine optimizations and streaming support)
    3. maintainability of the code base (=> Scala: functional programming)
    4. fast feedback cycle (=> Zeppelin notebook + Parquet + Angular)
    In order to quickly visualize intermediate results, the choice was made to use Zeppelin as the front end, Spark SQL as the data access API and Parquet as the storage format. This turned out to give great value for very low (upfront) effort and has been an excellent choice so far: minimal effort was spent on the ceremony of setting up and running the platform, and it is great for ad hoc querying, especially OLAP-style aggregations. However, it became clear that this solution is not well suited for OLTP-style queries (e.g. searching data by company), especially in a multi-user setting. The decision was made to add a web server to the stack over time to cover these requirements, in the hope that the efforts spent on building interactive notebooks with Angular and D3 could be reused on the web server stack. Once the UI and query requirements have stabilized in the Zeppelin/Parquet context, they can easily be ported to an "online" portal based on Cassandra, where they are accessible to many users with minimal computing or memory needs on the infrastructure side.
  16. Spark challenges 1. Lazy calculations 2. Application-focused logging

  17. Spark challenges - presenter notes Lazy calculations are not intuitive: it is more difficult to add logging, to debug, and to know what will happen when, especially when using previous lazy calculations more than once (persist, unpersist). Logging: the Spark web UI is good for finding out what happens at a deep infrastructural level, but an application-level layer is missing; without this it is hard to align your own code with the web UI.
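The recomputation pitfall can be illustrated without a cluster: Scala collection views are lazy in much the same way as an unpersisted Spark Dataset, so consuming a lazy result twice reruns the whole computation unless you materialize it first (the analogue of `persist`). A minimal pure-Scala sketch of the same trap:

```scala
object LazyPitfall {
  var evaluations = 0 // counts how often the "expensive" step actually runs

  def expensive(x: Int): Int = { evaluations += 1; x * 2 }

  // Returns (evaluations without caching, evaluations with caching).
  def run(): (Int, Int) = {
    evaluations = 0
    // Lazy pipeline, like an unpersisted Dataset: nothing runs yet.
    val lazyResult = (1 to 3).view.map(expensive)
    lazyResult.sum // first "action": the map runs for all 3 elements
    lazyResult.max // second "action": the map runs again for all 3 elements
    val withoutCaching = evaluations

    evaluations = 0
    // Materialize once, like persist(): later "actions" reuse the result.
    val cached = (1 to 3).view.map(expensive).toList
    cached.sum
    cached.max
    (withoutCaching, evaluations)
  }
}
```

Running `LazyPitfall.run()` yields `(6, 3)`: six evaluations when the lazy pipeline is consumed twice, three when it is materialized first. In Spark the cost is not a few function calls but whole stages being re-executed.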
  18. Spark logging Jobs and job details: it is hard to align the jobs and job details with the structure of the Scala code base and find out what really happened. We need a better layer of abstraction.
  19. Spark challenges 1. Lazy calculations 2. Application-focused logging 3. Catalyst Optimizer doesn't work with the typed Dataset API* 4. Lack of roll-forward semantics (* see task 4 of the workshop!)
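Challenge 3 can be illustrated without Spark. A DataFrame filter such as `$"age" > 21` is an expression tree that Catalyst can inspect (for predicate pushdown, column pruning, etc.), while a typed `Dataset.filter(_.age > 21)` hands the optimizer an opaque compiled function it can only call. A toy model of that difference, purely as a sketch:

```scala
// A declarative predicate is data: an expression tree the engine can inspect.
sealed trait Expr
case class Col(name: String)       extends Expr
case class Lit(value: Int)         extends Expr
case class Gt(left: Expr, right: Expr) extends Expr

object ToyCatalyst {
  // An "optimizer" can see which columns a declarative predicate touches,
  // so it can prune columns or push the filter down to the data source.
  def referencedColumns(e: Expr): Set[String] = e match {
    case Col(n)   => Set(n)
    case Lit(_)   => Set.empty
    case Gt(l, r) => referencedColumns(l) ++ referencedColumns(r)
  }

  // A typed filter is just a function: the engine cannot look inside it,
  // so no column information can be recovered.
  def referencedColumnsOf(f: Map[String, Int] => Boolean): Option[Set[String]] = None
}
```

This is why, in Spark, `df.filter($"age" > 21)` can be optimized end to end while `ds.filter(_.age > 21)` forces deserialization and a black-box function call per row.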
  20. Reusable platform: framework, package and deployment toolkit, online platform, application-focused logging, data flow.
  21. Reusable platform: framework. [Diagram: workflow and module concepts]

  22. Reusable platform: package and deployment toolkit. [Pipeline diagram: commit to VCS, package to repository, deploy to Dev, Test, Acceptance and Production]
  23. Reusable platform Online platform

  24. Reusable platform: application-focused logging. [Logging view per workflow and module: reads, writes, warnings, Spark jobs, sizes and diffs, times and durations, link to the Spark UI]
  25. Application-focused logging - presenter notes Monitoring, troubleshooting and tweaking a system while millions of lines of data are processed is a huge challenge. It is very important to have the tools available to get visibility into the system not only at the micro scale (e.g. seeing how one particular company is processed) but also at the macro/global scale. Simply adding logging to the code base to see what's happening at execution time is not sufficient, because you will end up with millions of lines. As already mentioned, the existing monitoring solution in Spark, the web UI, is not well suited for monitoring at the application level, so I built a custom solution on top of it: I can see a history of the calculations that have run, any warnings or errors that may have occurred, how long it took to process a particular company, etc. Each calculation consists of one or more Spark jobs, which link to their corresponding job page in the Spark UI for further investigation.
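The custom layer itself is not shown in the deck; a minimal sketch of the idea, a wrapper that records per-calculation durations and warnings so the run history can be aligned with the code, might look like this (the names `CalcLog`, `AppLogger` and `logged` are assumptions, not the presenter's actual API):

```scala
import java.time.{Duration, Instant}
import scala.collection.mutable.ListBuffer

// One entry per application-level calculation (which may span several Spark jobs).
case class CalcLog(module: String, durationMs: Long, warnings: Seq[String])

object AppLogger {
  val history = ListBuffer.empty[CalcLog]

  // Run a named calculation, timing it and collecting any warnings it reports.
  def logged[T](module: String)(body: ListBuffer[String] => T): T = {
    val warnings = ListBuffer.empty[String]
    val start = Instant.now()
    try body(warnings)
    finally {
      val ms = Duration.between(start, Instant.now()).toMillis
      history += CalcLog(module, ms, warnings.toList)
    }
  }
}
```

A module run would then be wrapped as `AppLogger.logged("DetectLanguage") { warnings => ... }`, and `AppLogger.history` becomes the queryable history of calculations, durations and warnings described above (with links to the Spark UI added per job in the real solution).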
  26. Reusable platform: data flow (censored). Demo application:

  27. Reusable platform: data flow (censored). Real application:

  28. Important guidelines 1. Reduce accidental complexity 2. Zoom in / zoom out

  29. Important guidelines - presenter notes 1. Reduce accidental complexity. The most important rule when building a system. Complexity (i.e. managing all of the inter-dependencies) is probably the biggest reason why projects fail prematurely. Reducing accidental complexity (by avoiding dependencies that are not inherent to the problem) should therefore be taken very seriously. The good thing is that all the time spent here can start to pay off very quickly: it allows you to stay agile and reactive to new customer needs. Functional programming (as opposed to the more "straightforward" style) encourages thinking deeply about the underlying structure before building the solution. 2. Zoom in / zoom out. Another important guideline is to regularly change your viewpoint between the micro world and the macro world when designing a solution: keep in mind the big picture while making decisions at the detailed level, and vice versa. This applies to the technical level (architecture, code base), the analytical level (interpreting the results of the calculations, e.g. the t-SNE macro view vs the company micro view on the previous slides), the operational level (troubleshooting one company vs improving average computing time for all companies, see also the monitoring slide), and the idea-to-production-and-back flow (keep the operational concerns in mind from the early start).
  30. Workshop tasks (src/main/scala/biz/meetmatch/modules): TASK 1: DetectLanguage, TASK 2: CountSentencesByLanguage, TASK 3: CountWrongDetectionsByLanguage, TASK 4: CountWrongDetectionsByLanguageAsDataFrame.
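As a rough illustration of what TASK 1 involves, here is a naive stop-word-based language detector in plain Scala. The actual workshop modules in src/main/scala/biz/meetmatch/modules have their own interfaces and would run on Spark with a real detection library; this stand-in, including its tiny stop-word profiles, is purely hypothetical:

```scala
object DetectLanguage {
  // Tiny stop-word profiles; a real detector would use a proper library or model.
  val profiles: Map[String, Set[String]] = Map(
    "en" -> Set("the", "and", "is", "of", "to"),
    "nl" -> Set("de", "het", "en", "is", "van"),
    "fr" -> Set("le", "la", "et", "est", "de")
  )

  // Pick the language whose stop words cover the most tokens of the sentence.
  // Very naive: short or ambiguous sentences will often be misdetected, which
  // is exactly what tasks 2 and 3 (counting sentences and wrong detections
  // by language) would then measure.
  def detect(sentence: String): String = {
    val tokens = sentence.toLowerCase.split("\\W+").toSeq
    profiles.maxBy { case (_, words) => tokens.count(words.contains) }._1
  }
}
```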