in 2015 that focuses on improving innovative cooperation between and within organisations. This is done by introducing the concept of complementarity: companies that work in areas that, when combined, have the potential to deliver great value. These areas should be neither too far from nor too close to each other. This approach is very different from what Google and most others do: they always look for the closest, most similar results to the search term.
are doing … … which is easy to find out from the company website, Wikipedia, social media, … But we need a little help if we want to apply this on a large scale …
then be able to compare the results). How? Most of the time this is well described on their website, or even on Wikipedia, social media, forums, etc.
Problem of scale: how can we efficiently read all of this information for lots of companies (tens or hundreds of thousands)?
Solution: use natural language processing to let computers do the work instead of humans. Our assumption is that, for the words that appear relatively frequently in the text of a company, the clusters of these words are a representation of the activities of that company.
Natural language processing - presenter notes
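The notes don't show how this representation is actually built, so the following is only a minimal sketch of the idea (per-company term frequencies with Spark ML); the input path and column names (companyId, text) are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, CountVectorizer}

// Hypothetical input: one row per company with the scraped text of its
// website / Wikipedia page / social media (columns: companyId, text).
val spark = SparkSession.builder.appName("company-activities").getOrCreate()
val companyTexts = spark.read.parquet("/data/company_texts.parquet")

// Split the raw text into lowercase word tokens.
val tokenizer = new RegexTokenizer()
  .setInputCol("text").setOutputCol("words").setPattern("\\W+")

// Drop very common words that carry no signal about activities.
val remover = new StopWordsRemover()
  .setInputCol("words").setOutputCol("filtered")

// Count term frequencies per company: words that appear relatively often
// are taken as a (rough) representation of that company's activities.
val vectorizer = new CountVectorizer()
  .setInputCol("filtered").setOutputCol("termFrequencies")
  .setVocabSize(20000).setMinDF(2.0)

val tokens   = remover.transform(tokenizer.transform(companyTexts))
val tfModel  = vectorizer.fit(tokens)
val features = tfModel.transform(tokens)  // companyId -> sparse term-frequency vector
```

From these per-company vectors, the clustering and comparison steps described in the talk would follow; the concrete algorithms used there are not specified in these notes.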
patterns in a not-so-well-understood domain requires a fast development & feedback cycle. Last-minute registrations for networking events require fast incremental processing.
Dilemma
challenging requirements mentioned before is … Spark! The first version of the solution was a prototype built on PHP and MySQL, which was then ported to Spark and Scala. Main requirements:
1. Scalability: we need to process big datasets, and this will only grow when expanding the geographic scope of the companies, possibly worldwide (=> Spark)
2. Processing speed (=> Scala: compiled rather than interpreted; => Spark: engine optimizations + streaming support)
3. Maintainability of the code base (=> Scala: functional programming)
4. Fast feedback cycle (=> Zeppelin notebook + Parquet + Angular)
To quickly visualize intermediate results, the choice was made to use Zeppelin as the front end, Spark SQL as the data access API and Parquet as the storage format. This has turned out to be an excellent choice so far, giving great value for very low (upfront) effort: minimal effort was spent on the ceremony of setting up and running the platform, and it is great for ad hoc querying, especially OLAP-style aggregations. However, it became clear that this solution is not well suited for OLTP-style queries (e.g. searching for data by company), especially in a multi-user setting. The decision was made to add a web server to the stack over time to cover these requirements. The hope was that the precious efforts we spent on building interactive notebooks with Angular and D3 could be re-used on the web server stack. Once the UI and query requirements have stabilized in the Zeppelin/Parquet context, they can easily be ported to an "online" portal based on Cassandra, where they are accessible to many users with minimal computing or memory needs on the infrastructure side.
Technical solution - presenter notes
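As an illustration of that Zeppelin + Spark SQL + Parquet combination, a minimal sketch of the two query styles mentioned above; the Parquet path, view name and columns are hypothetical, not taken from the actual project:

```scala
// `spark` is the SparkSession (predefined in a Zeppelin note).
val activities = spark.read.parquet("/data/company_activities.parquet")
activities.createOrReplaceTempView("activities")

// OLAP-style aggregation: e.g. number of companies per region and sector,
// easy to chart directly in Zeppelin.
val bySector = spark.sql("""
  SELECT region, sector, COUNT(DISTINCT companyId) AS companies
  FROM activities
  GROUP BY region, sector
  ORDER BY companies DESC
""")
bySector.show(20)

// OLTP-style lookup of a single company works, but this is exactly the kind
// of query (many small reads, many concurrent users) the stack is less suited for.
spark.sql("SELECT * FROM activities WHERE companyId = 'ACME-001'").show()
```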
more difficult to add logging, to debug, and to know what will happen when, especially when reusing earlier lazy calculations more than once (persist, unpersist). Logging: the Spark web UI is good for finding out what happens at a deep infrastructural level, but an application-level layer is missing; without it, it is hard to align your own code with what you see in the web UI
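A small sketch of the persist/unpersist situation described above (the DataFrame and the derived column are made up for illustration); the point is that without persist() every action re-runs the whole lazy lineage, which makes it harder to predict what will happen when:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// Hypothetical DataFrame standing in for an expensive, lazily defined result.
val scored = spark.read.parquet("/data/company_activities.parquet")
  .withColumn("textLength", length(col("text")))  // placeholder for a costly derivation

// Cache before reusing; otherwise each action below re-runs the full lazy
// lineage and the same stages show up twice in the Spark web UI.
scored.persist(StorageLevel.MEMORY_AND_DISK)

val total  = scored.count()               // first action: materializes the cache
val sample = scored.limit(10).collect()   // second action: served from the cache

scored.unpersist()                        // free the cached blocks when done
```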
data are processed is a huge challenge. It is very important to have tools available that give visibility into the system not only at the micro scale (e.g. seeing how one particular company is processed), but also at the macro/global scale. Simply adding logging to the code base to see what happens at execution time is not sufficient, because you end up with millions of lines. As already mentioned, the existing monitoring solution in Spark, the web UI, is not well suited for monitoring at the application level, so I built a custom solution on top of it that shows a history of the calculations that have run, any warnings or errors that occurred, how long it took to process a particular company, etc. Each calculation consists of one or more Spark jobs, which link to their corresponding job pages in the Spark UI for further investigation.
Application-focused logging - presenter notes
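The custom solution itself is not shown in these notes; the sketch below only illustrates, under assumptions, the standard Spark hooks such an application-level layer can build on: setJobGroup to tag all jobs belonging to one calculation (so they can be traced back to their job pages in the Spark UI), and a SparkListener to record per-job timings and outcomes. The processCompany helper and the log format are hypothetical.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.concurrent.TrieMap

// Record how long each Spark job takes, at the application level.
val jobStartTimes = TrieMap.empty[Int, Long]

spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(start: SparkListenerJobStart): Unit =
    jobStartTimes(start.jobId) = start.time
  override def onJobEnd(end: SparkListenerJobEnd): Unit = {
    val elapsed = end.time - jobStartTimes.getOrElse(end.jobId, end.time)
    println(s"job ${end.jobId} finished in $elapsed ms (${end.jobResult})")
  }
})

// Tag every job triggered while processing one company, so the jobs appear
// as a named group in the Spark UI and can be traced back to that company.
def processCompany(companyId: String): Unit = {
  spark.sparkContext.setJobGroup(companyId, s"calculations for company $companyId")
  try {
    // ... run the actual calculations for this company here ...
  } finally {
    spark.sparkContext.clearJobGroup()
  }
}
```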
system. Complexity (i.e. managing all of the inter-dependencies) is probably the biggest reason why projects fail prematurely. Reducing accidental complexity (by avoiding dependencies that are not inherent to the problem) should therefore be taken very seriously. The good thing is that all the time spent here can start to pay off very quickly: it allows you to stay agile and reactive to new customer needs. Functional programming (as opposed to the more "straightforward" style) encourages thinking deeply about the underlying structure before building the solution.
2. Zoom in / zoom out
Another important guideline is to regularly change your viewpoint between the micro world and the macro world when designing a solution. Keep the big picture in mind while making decisions at the detailed level, and vice versa. This applies to:
- the technical level (architecture, code base)
- the analytical level (interpreting the results of the calculations, e.g. the t-SNE macro view vs the company micro view on the previous slides)
- the operational level (troubleshooting one company vs improving the average computing time for all companies, see also monitoring on the next slide)
- the idea-to-production-and-back flow (keep the operational concerns in mind from the very start)
Important guidelines - presenter notes