Talk at MLconf Atlanta 2015 about the machine learning techniques used at Wildcard to do Structured Learning without Conditional Random Fields but with a trick from Deep Learning.
a future native mobile web experience through cards • Cards: a new UI paradigm for content on mobile, for which we schematize unstructured web content • Surfaced in the Wildcard iOS app and in other card ecosystems.
first 20 characters identical to the page’s meta title, … • BoW text: bag-of-words of visible text • BoW meta: bag-of-words of CSS classes and other non-visible information inside HTML tags • HTML tag • Optional info from emulation: (x, y), (w, h), font-family, font-size, font-weight, …
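A minimal sketch of how such per-node features could be assembled for scikit-learn; the field names and the `node` structure are illustrative assumptions, not the production schema.

```python
# Illustrative sketch only: assemble the per-node features listed above into a
# dict for scikit-learn's DictVectorizer. Field names and the `node` structure
# are assumptions, not the production schema.
from sklearn.feature_extraction import DictVectorizer


def node_features(node, meta_title):
    """Build a flat feature dict for one node of the content tree."""
    features = {
        # binary: first 20 characters identical to the page's meta title
        'first20_matches_meta_title': node['text'][:20] == meta_title[:20],
        # the node's HTML tag as a categorical feature
        'tag=' + node['tag']: 1,
    }
    # BoW text: bag-of-words over the visible text
    for token in node['text'].lower().split():
        key = 'bow_text=' + token
        features[key] = features.get(key, 0) + 1
    # BoW meta: CSS classes and other non-visible information in the tags
    for css_class in node.get('css_classes', []):
        features['bow_meta=' + css_class] = 1
    # optional layout info from browser emulation
    if 'layout' in node:
        layout = node['layout']
        features.update({'x': layout['x'], 'y': layout['y'],
                         'w': layout['w'], 'h': layout['h'],
                         'font_size': layout['font_size']})
    return features


vectorizer = DictVectorizer(sparse=True)  # dicts -> sparse feature matrix
```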
Spark. Starts from a list of urls. • Scrapes web pages. • Constructs Content Tree. • Matches labels. • Filters for quality. • Need the same processing for a single webpage but with low latency and small resource requirements: → pysparkling: pure Python implementation of Spark’s RDD interface
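A rough sketch of the pipeline shape with pysparkling’s Spark-compatible RDD interface; `scrape`, `build_content_tree`, `match_labels`, and `passes_quality` are hypothetical placeholders for the steps listed above.

```python
# Sketch of the batch pipeline using pysparkling's RDD interface, which mirrors
# Spark's. The helper functions are placeholders for the steps listed above.
import pysparkling


def scrape(url): ...               # fetch and parse the page
def build_content_tree(page): ...  # construct the Content Tree
def match_labels(tree): ...        # attach training labels
def passes_quality(tree): ...      # quality filter


sc = pysparkling.Context()
content_trees = (
    sc.parallelize(['http://example.com/some-article'])  # list of urls
      .map(scrape)
      .map(build_content_tree)
      .map(match_labels)
      .filter(passes_quality)
      .collect()
)
```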
no dependence on the JVM • pysparkling.fileio can access local files, S3, HTTP, HDFS with a load-dump interface • used in Python micro-service endpoint applying scikit-learn classifiers • used in labeling and evaluation tools and local development • used in dataset preparation tools (train-test split, split urls by domain, …)
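A small sketch of the load-dump idea: the same call reads from local disk, S3, HTTP, or HDFS depending on the path scheme. The paths are made up and the exact call signatures may differ between pysparkling versions.

```python
# Sketch of the fileio load-dump interface; paths are made up and exact
# signatures may differ between pysparkling versions.
import io
from pysparkling import fileio

raw = fileio.File('s3://my-bucket/data/train_urls.txt').load().read()  # load
fileio.File('/tmp/train_urls.txt').dump(io.BytesIO(raw))               # dump
```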
“256GB ought to be enough for anybody” (for machine learning) - Andreas Mueller • multithread support, fast • use provided structured data (e.g. meta tags) as much as possible
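As one way to “use provided structured data”, a hedged sketch of pulling Open Graph meta tags with lxml; which tags the production system actually relies on is not specified in the talk.

```python
# Illustrative only: read structured metadata (here, Open Graph <meta> tags)
# that publishers already provide, before falling back to learned extraction.
import lxml.html


def provided_metadata(html):
    """Collect Open Graph meta tags from raw HTML."""
    doc = lxml.html.fromstring(html)
    return {
        tag.get('property'): tag.get('content')
        for tag in doc.xpath("//meta[starts-with(@property, 'og:')]")
    }

# e.g. provided_metadata(page_html).get('og:title')
```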
want to extend this to content types other than news articles • clustering is too noisy: ads in between paragraphs; cannot “cluster” authors after titles • CRF: complexity beyond a linear-chain CRF grows too quickly • want a “single step” process: multi-step algorithms erase information. Example: if the first step is to remove ads, then the second step cannot use information about ads to infer content.
start with a guess (using a zeroth-order type classifier) • generate variations of that guess with a proposal function • evaluate an objective function based on a document-wide likelihood function of the classification probabilities
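A heavily simplified sketch of this guess / propose / evaluate loop; the classifier, the proposal function, the objective, and the greedy acceptance rule are illustrative stand-ins, not Wildcard’s actual implementation.

```python
# Simplified sketch of the inference loop: start from a per-node guess, propose
# label changes, and keep proposals that improve a document-wide objective.
import math
import random

N_TYPES = 8  # number of node types (illustrative)


def initial_guess(nodes, classifier):
    """Zeroth-order guess: label each node independently."""
    return [int(p.argmax()) for p in classifier.predict_proba(nodes)]


def propose(labels):
    """Proposal function: re-label one randomly chosen node."""
    candidate = list(labels)
    candidate[random.randrange(len(candidate))] = random.randrange(N_TYPES)
    return candidate


def objective(nodes, labels, classifier):
    """Document-wide log-likelihood of a labeling under the classifier
    (the real objective would also include document-level terms)."""
    probas = classifier.predict_proba(nodes)
    return sum(math.log(p[l] + 1e-12) for p, l in zip(probas, labels))


def infer(nodes, classifier, n_steps=1000):
    labels = initial_guess(nodes, classifier)
    score = objective(nodes, labels, classifier)
    for _ in range(n_steps):
        candidate = propose(labels)
        candidate_score = objective(nodes, candidate, classifier)
        if candidate_score > score:  # greedy acceptance keeps the sketch simple
            labels, score = candidate, candidate_score
    return labels
```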
“scene description”. Traditionally done with scene graphs and CRFs. • With Deep Learning, can avoid building a graph and go straight to assigning a label to every pixel • Clément Farabet, 2011: http://www.clement.farabet.net/research.html#parsing
by an order of magnitude • No significant degradation in quality • Training: from urls ~2 hours; with cached external calls <1 hour • Introduced Forward Model • Bucket Model for Load
posts, Facebook posts, Facebook videos, and YouTube videos • On the right, a New York Magazine article on the train crash in Philadelphia: http://nymag.com/daily/intelligencer/2015/05/amtrak-train-derails-philadelphia.html
emulation (e.g. websites with pure AngularJS) • fixed individual publishers with high visibility in our app • comparison to competition: third-party 71–82%, in-house 83% ± 4%
evaluation tools, labeling tools, and training and inference strategies implemented over the past year • chose tools that allow quick iteration: simple parallel processing, ML on a single node • two open source projects: databench (databench.trivial.io, pip install databench) and pysparkling (pysparkling.trivial.io, pip install pysparkling) • competitive performance: 54% of cards in Wildcard are powered by pure ML • @svenkreiss