From idea to production with NLP, Scala and Spark

How I used NLP, Scala and Spark to build an intelligent business speed dating engine

Niek Bartholomeus

March 08, 2017


  1. From idea to production with NLP, Scala and Spark: a story from the trenches
  2. Me: Niek Bartholomeus (@niekbartho). Background as a developer. Spent 6 years evangelizing DevOps in big organizations. Switched to data engineering and data science in early 2016.
  3. MeetMatch: building a search engine for innovative partnerships between companies and individuals through complementarity (not similarity). [Slide diagram labels: YOU, TOO CLOSE, TOO FAR, PERFECT FIT]
  4. MeetMatch - presenter notes MeetMatch is a small startup founded in 2015 that focuses on improving innovative cooperation between and within organisations. This is done by introducing a concept of complementarity: companies that work in areas that, when combined, have the potential of delivering great value. These areas should be neither too far from nor too close to each other. This approach is very different from what Google and most others do: they always look for the closest, most similar results to the search term.
  5. Business networking events: organization of business speed dating events around Europe where the participants are paired by an algorithm that uses this concept of complementarity.
  6. Natural language processing We need to find out what companies are doing… which is easy to find out from the company website, Wikipedia, social media, etc. But we need a little help if we want to apply this at large scale.
  7. Natural language processing - presenter notes We need to find out what the companies do (and then be able to compare the results). How? Most of the time this is well described on their website, or even on Wikipedia, social media, forums, etc. Problem of scale: how can we efficiently read all of this information for lots of companies (tens or hundreds of thousands)? Solution: use natural language processing to let computers do the work instead of humans. Our assumption is that clusters of the words that appear relatively frequently in the text of a company are a representation of the activities of that company.
  8. NLP pipeline: crawl webpages → NLP → word clustering. Our assumption: the clusters of words that are found in the text of a company are an indication of the activities of that company.
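The first step behind this assumption, extracting the words that appear relatively frequently in a company's crawled text, can be sketched in plain Scala. The real pipeline runs on Spark; the tokenizer and the stop-word list below are deliberately simplified assumptions, not the actual implementation:

```scala
object FrequentWords {
  // Tiny stop-word list; the real pipeline would use a proper, per-language one.
  val stopWords: Set[String] = Set("the", "a", "an", "and", "of", "to", "in", "we", "our")

  // Naive tokenizer: lowercase, keep alphabetic runs only.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // The n most frequent non-stop words with their relative frequency in the text.
  def topWords(text: String, n: Int): Seq[(String, Double)] = {
    val tokens = tokenize(text).filterNot(stopWords.contains)
    if (tokens.isEmpty) Seq.empty
    else {
      val total = tokens.size.toDouble
      tokens.groupBy(identity)
        .map { case (word, occurrences) => (word, occurrences.size / total) }
        .toSeq
        .sortBy { case (word, freq) => (-freq, word) }
        .take(n)
    }
  }
}
```

On real data, these relative frequencies would feed the word-embedding and clustering steps shown on the following slides.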
  9. Macro view of word embeddings using t-SNE

  10. Macro view - zoom in

  11. Word clusters indicating a company’s activities

  12. Complementarity The word clusters are compared and receive a complementarity score.
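As a toy illustration of the "too close / too far / perfect fit" idea from the earlier slide: map a cluster similarity in [0, 1] to a complementarity score that peaks inside a sweet-spot band. The band boundaries and the linear fall-off are illustrative assumptions, not MeetMatch's actual formula:

```scala
object Complementarity {
  // Sweet-spot band for cluster similarity: below it the activities are too far
  // apart to combine, above it they are too similar. Boundaries are illustrative.
  val tooFar   = 0.2
  val tooClose = 0.7

  // Map a similarity in [0, 1] to a complementarity score in [0, 1]:
  // maximal inside the band, falling off linearly towards both extremes.
  def score(similarity: Double): Double =
    if (similarity < tooFar) similarity / tooFar
    else if (similarity > tooClose) (1.0 - similarity) / (1.0 - tooClose)
    else 1.0
}
```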

  13. Dilemma: small team; need to crunch big datasets; searching for vague patterns in a not-so-well-understood domain requires a fast development and feedback cycle; last-minute registrations at networking events require fast incremental processing.
  14. Technical solution. [Architecture diagram: web crawler, legacy MySQL, Parquet, Cassandra; Angular/d3 front ends for the event app (attendee), the data scientist, and the power user]
  15. Technical solution - presenter notes The cornerstone of the target platform to cover the challenging requirements mentioned before is… Spark! The first version of the solution was a prototype built on PHP and MySQL, which was ported to Scala and Spark. Main requirements:
    1. scalability: need to process big datasets, and this will only grow when expanding the geographic scope of the companies, possibly worldwide (=> Spark)
    2. processing speed (=> Scala: compiled rather than interpreted; => Spark: engine optimizations and streaming support)
    3. maintainability of the code base (=> Scala: functional programming)
    4. fast feedback cycle (=> Zeppelin notebook + Parquet + Angular)
    In order to quickly visualize intermediate results, the choice was made to use Zeppelin as the front end, Spark SQL as the data access API and Parquet as the storage format. This turned out to give great value for very low (upfront) effort and has been an excellent choice so far: minimal effort was spent on the ceremony of setting up and running the platform, and it is great for ad hoc querying, especially OLAP-style aggregations. However, it became clear that this solution is not well suited for OLTP-style queries (e.g. searching data by company), especially in a multi-user setting. The decision was made to add a web server to the stack over time to cover these requirements, in the hope that the efforts spent on building interactive notebooks with Angular and D3 could be reused on the web server stack. Once the UI and query requirements have stabilized in the Zeppelin/Parquet context, they can easily be ported to an "online" portal based on Cassandra, where they are accessible to many users with minimal computing or memory needs on the infrastructure side.
  16. Spark challenges 1. Lazy calculations 2. Application-focused logging

  17. Spark challenges - presenter notes Lazy calculations are not intuitive: it is more difficult to add logging, to debug, and to know what will happen when, especially when using previous lazy calculations more than once (persist, unpersist). Logging: the Spark web UI is good for finding out what happens at a deep infrastructural level, but an application-level layer is missing; without this it is hard to align your own code with the web UI.
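The recomputation pitfall can be illustrated without a cluster: Scala collection views are lazy in much the same way as an unpersisted Spark Dataset, so consuming a lazy result twice reruns the whole computation unless you materialize it first (the analogue of `persist`). A minimal pure-Scala sketch of the same trap:

```scala
object LazyPitfall {
  var evaluations = 0 // counts how often the "expensive" step actually runs

  def expensive(x: Int): Int = { evaluations += 1; x * 2 }

  // Returns (evaluations without caching, evaluations with caching).
  def run(): (Int, Int) = {
    evaluations = 0
    // Lazy pipeline, like an unpersisted Dataset: nothing runs yet.
    val lazyResult = (1 to 3).view.map(expensive)
    lazyResult.sum // first "action": the map runs for all 3 elements
    lazyResult.max // second "action": the map runs again for all 3 elements
    val withoutCaching = evaluations

    evaluations = 0
    // Materialize once, like persist(): later "actions" reuse the result.
    val cached = (1 to 3).view.map(expensive).toList
    cached.sum
    cached.max
    (withoutCaching, evaluations)
  }
}
```

Running `LazyPitfall.run()` yields `(6, 3)`: six evaluations when the lazy pipeline is consumed twice, three when it is materialized first. In Spark the cost is not a few function calls but whole stages being re-executed.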
  18. Spark logging Jobs and job details: it is hard to align the jobs and job details with the structure of the Scala code base and find out what really happened. We need a better layer of abstraction.
  19. Spark challenges 1. Lazy calculations 2. Application-focused logging 3. Catalyst Optimizer doesn't work with the typed Dataset API* 4. Lack of roll-forward semantics (* see task 4 of the workshop!)
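Challenge 3 can be illustrated without Spark. A DataFrame filter such as `$"age" > 21` is an expression tree that Catalyst can inspect (for predicate pushdown, column pruning, etc.), while a typed `Dataset.filter(_.age > 21)` hands the optimizer an opaque compiled function it can only call. A toy model of that difference, purely as a sketch:

```scala
// A declarative predicate is data: an expression tree the engine can inspect.
sealed trait Expr
case class Col(name: String)       extends Expr
case class Lit(value: Int)         extends Expr
case class Gt(left: Expr, right: Expr) extends Expr

object ToyCatalyst {
  // An "optimizer" can see which columns a declarative predicate touches,
  // so it can prune columns or push the filter down to the data source.
  def referencedColumns(e: Expr): Set[String] = e match {
    case Col(n)   => Set(n)
    case Lit(_)   => Set.empty
    case Gt(l, r) => referencedColumns(l) ++ referencedColumns(r)
  }

  // A typed filter is just a function: the engine cannot look inside it,
  // so no column information can be recovered.
  def referencedColumnsOf(f: Map[String, Int] => Boolean): Option[Set[String]] = None
}
```

This is why, in Spark, `df.filter($"age" > 21)` can be optimized end to end while `ds.filter(_.age > 21)` forces deserialization and a black-box function call per row.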
  20. Reusable platform: framework, package and deployment toolkit, online platform, application-focused logging, data flow.
  21. Reusable platform: framework. [Diagram: workflow and module concepts]

  22. Reusable platform: package and deployment toolkit. [Pipeline diagram: commit to VCS, package to repository, deploy to Dev, Test, Acceptance and Production]
  23. Reusable platform Online platform

  24. Reusable platform: application-focused logging. [Logging view per workflow and module: reads, writes, warnings, Spark jobs, sizes and diffs, times and durations, link to the Spark UI]
  25. Application-focused logging - presenter notes Monitoring, troubleshooting and tweaking a system while millions of lines of data are processed is a huge challenge. It is very important to have the tools available to get visibility into the system not only at the micro scale (e.g. seeing how one particular company is processed) but also at the macro/global scale. Simply adding logging to the code base to see what's happening at execution time is not sufficient, because you will end up with millions of lines. As already mentioned, the existing monitoring solution in Spark, the web UI, is not well suited for monitoring at the application level, so I built a custom solution on top of it: I can see a history of the calculations that have run, any warnings or errors that may have occurred, how long it took to process a particular company, etc. Each calculation consists of one or more Spark jobs, which link to their corresponding job page in the Spark UI for further investigation.
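The custom layer itself is not shown in the deck; a minimal sketch of the idea, a wrapper that records per-calculation durations and warnings so the run history can be aligned with the code, might look like this (the names `CalcLog`, `AppLogger` and `logged` are assumptions, not the presenter's actual API):

```scala
import java.time.{Duration, Instant}
import scala.collection.mutable.ListBuffer

// One entry per application-level calculation (which may span several Spark jobs).
case class CalcLog(module: String, durationMs: Long, warnings: Seq[String])

object AppLogger {
  val history = ListBuffer.empty[CalcLog]

  // Run a named calculation, timing it and collecting any warnings it reports.
  def logged[T](module: String)(body: ListBuffer[String] => T): T = {
    val warnings = ListBuffer.empty[String]
    val start = Instant.now()
    try body(warnings)
    finally {
      val ms = Duration.between(start, Instant.now()).toMillis
      history += CalcLog(module, ms, warnings.toList)
    }
  }
}
```

A module run would then be wrapped as `AppLogger.logged("DetectLanguage") { warnings => ... }`, and `AppLogger.history` becomes the queryable history of calculations, durations and warnings described above (with links to the Spark UI added per job in the real solution).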
  26. Reusable platform: data flow (censored). Demo application:

  27. Reusable platform: data flow (censored). Real application:

  28. Important guidelines 1. Reduce accidental complexity 2. Zoom in / zoom out

  29. Important guidelines - presenter notes 1. Reduce accidental complexity. The most important rule when building a system. Complexity (i.e. managing all of the inter-dependencies) is probably the biggest reason why projects fail prematurely. Reducing accidental complexity (by avoiding dependencies that are not inherent to the problem) should therefore be taken very seriously. The good thing is that all the time spent here can start to pay off very quickly: it allows you to stay agile and reactive to new customer needs. Functional programming (as opposed to the more "straightforward" style) encourages thinking deeply about the underlying structure before building the solution. 2. Zoom in / zoom out. Another important guideline is to regularly change your viewpoint between the micro world and the macro world when designing a solution: keep in mind the big picture while making decisions at the detailed level, and vice versa. This applies to the technical level (architecture, code base), the analytical level (interpreting the results of the calculations, e.g. the t-SNE macro view vs the company micro view on the previous slides), the operational level (troubleshooting one company vs improving average computing time for all companies, see also the monitoring slide), and the idea-to-production-and-back flow (keep the operational concerns in mind from the early start).
  30. Workshop tasks (src/main/scala/biz/meetmatch/modules): TASK 1: DetectLanguage, TASK 2: CountSentencesByLanguage, TASK 3: CountWrongDetectionsByLanguage, TASK 4: CountWrongDetectionsByLanguageAsDataFrame.
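As a rough illustration of what TASK 1 involves, here is a naive stop-word-based language detector in plain Scala. The actual workshop modules in src/main/scala/biz/meetmatch/modules have their own interfaces and would run on Spark with a real detection library; this stand-in, including its tiny stop-word profiles, is purely hypothetical:

```scala
object DetectLanguage {
  // Tiny stop-word profiles; a real detector would use a proper library or model.
  val profiles: Map[String, Set[String]] = Map(
    "en" -> Set("the", "and", "is", "of", "to"),
    "nl" -> Set("de", "het", "en", "is", "van"),
    "fr" -> Set("le", "la", "et", "est", "de")
  )

  // Pick the language whose stop words cover the most tokens of the sentence.
  // Very naive: short or ambiguous sentences will often be misdetected, which
  // is exactly what tasks 2 and 3 (counting sentences and wrong detections
  // by language) would then measure.
  def detect(sentence: String): String = {
    val tokens = sentence.toLowerCase.split("\\W+").toSeq
    profiles.maxBy { case (_, words) => tokens.count(words.contains) }._1
  }
}
```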