
Deep ML-inspired Architecture at Wildcard

Sven Kreiss
September 18, 2015


Talk at MLconf Atlanta 2015 about the machine learning techniques used at Wildcard to do Structured Learning without Conditional Random Fields but with a trick from Deep Learning.


Transcript

  1. I am a Data Scientist at Wildcard. We launched last month and were featured in the App Store as “Best New App”. We are looking to grow our data team.
  2. Wildcard
     • founded in 2013
     • develop technologies for a future native mobile web experience through cards
     • Cards: a new UI paradigm for content on mobile, for which we schematize unstructured web content. Surfaced in the Wildcard iOS app and in other card ecosystems.
  3. ML Challenge
     • Extract online content through ML. The microservice described in this talk powers 54% of cards in Wildcard.
     • url → {“title”: “…
  4. Dataset
     • Scrape articles from a diverse set of sources.
     • Custom labeling tools based on Databench: databench.trivial.io.
  5. Labeling Tools: Tree Based and Visual
     • cross-matched labels between the tools
     • in-house label sessions before handing off to offshore labelers (usability)
     • assign labels to page elements
  6. Features
     • Text properties: length, capitalization, special characters, numbers, first 20 characters identical to the page’s meta title, …
     • BoW text: bag-of-words of the visible text
     • BoW meta: bag-of-words of CSS classes and other non-visible information inside HTML tags
     • HTML tag
     • Optional info from emulation: (x, y), (w, h), font-family, font-size, font-weight, …
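The text-property features above could be sketched as a small feature function. The helper name, feature names, and values below are illustrative stand-ins, not Wildcard's actual code.

```python
# Hypothetical sketch of the text-property features described above.
import re

def text_features(text, meta_title=""):
    """Compute simple text-property features for one page element."""
    return {
        "length": len(text),
        "n_capitalized": sum(w[:1].isupper() for w in text.split()),
        "n_special": len(re.findall(r"[^\w\s]", text)),
        "n_digits": sum(c.isdigit() for c in text),
        # first 20 characters identical to the page's meta title?
        "matches_meta_title": text[:20] == meta_title[:20],
    }

features = text_features("Amtrak Train Derails in Philadelphia",
                         meta_title="Amtrak Train Derails in Philadelphia")
```

Such per-element dictionaries can then be combined with the bag-of-words vectors before classification.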
  7. Pipeline
     • Parallelized document processing into features using Apache Spark. Starts from a list of URLs.
     • Scrapes web pages.
     • Constructs the Content Tree.
     • Matches labels.
     • Filters for quality.
     • Need the same processing for a single web page, but with low latency and small resource requirements:
       → pysparkling: a pure-Python implementation of Spark’s RDD interface
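The batch stages above can be sketched roughly as follows; the scrape/parse helpers here are toy stand-ins, not the actual pipeline code. With Spark or pysparkling, the list comprehension becomes a chain of RDD transformations over the list of URLs.

```python
# Illustrative sketch of the pipeline stages (stub implementations).
def scrape(url):
    return {"url": url, "html": "<html>…</html>"}            # fetch the page

def content_tree(page):
    return {"url": page["url"], "tree": ["h1", "p", "span"]}  # parse into a tree

def match_labels(doc, labels):
    doc["labels"] = labels.get(doc["url"], {})                # attach labels
    return doc

def good_quality(doc):
    return bool(doc["labels"])                                # keep labeled docs

urls = ["http://example.com/a", "http://example.com/b"]
labels = {"http://example.com/a": {"h1": "title"}}

# With Spark/pysparkling: sc.parallelize(urls).map(scrape).map(content_tree)…
docs = [match_labels(content_tree(scrape(u)), labels) for u in urls]
docs = [d for d in docs if good_quality(d)]
```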
  8. pysparkling
     • interface-compatible with SparkContext and RDD, but with no dependence on the JVM
     • pysparkling.fileio can access local files, S3, HTTP and HDFS with a load-dump interface
     • used in the Python microservice endpoint that applies scikit-learn classifiers
     • used in labeling and evaluation tools and in local development
     • used in dataset preparation tools (train-test split, split urls by domain, …)
  9. Pipeline II
     • single-machine Random Forest training
     • “256Gb ought to be enough for anybody” (for machine learning) - Andreas Mueller
     • multithread support, fast
     • use provided structured data (e.g. meta tags) as much as possible
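Single-machine Random Forest training as described might look like the following sketch; the toy feature vectors (text length, starts-capitalized) and labels are illustrative, not the actual training data.

```python
# Single-node Random Forest training with scikit-learn.
from sklearn.ensemble import RandomForestClassifier

X = [[10, 1], [3, 0], [16, 1], [4, 0]]   # (text length, starts capitalized?)
y = ["title", "navigation", "title", "navigation"]

# n_jobs=-1 uses all cores ("multithread support, fast")
clf = RandomForestClassifier(n_estimators=10, bootstrap=False,
                             n_jobs=-1, random_state=0)
clf.fit(X, y)
pred = clf.predict([[17, 1]])[0]
```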
  10. Architecture
  11. Algorithm: zeroth order
     (diagram: a scikit-learn Random Forest maps each page element, identified by its XPath, e.g. /html/body/div[2]/div/div/div/ul/li[5], /html/body/div[3]/h1, /html/body/div[3]/span, independently to a label such as “title”, “navigation” or “author”)
  12. Algorithm: first order
     (diagram: page elements mapped to the labels “title”, “navigation”, “author”)
  13. Algorithm: second order
     (diagram: page elements mapped to the labels “title”, “navigation”, “author”)
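The zeroth-order step can be sketched as each page element being classified independently from its own features. The training data and feature values below are toy stand-ins; the XPaths are the ones shown on the slide.

```python
# Zeroth-order sketch: classify every page element independently.
from sklearn.ensemble import RandomForestClassifier

# toy (text length, inside a list?) features for three element types
X_train = [[40, 0], [5, 1], [12, 0]]
y_train = ["title", "navigation", "author"]
clf = RandomForestClassifier(n_estimators=10, bootstrap=False,
                             random_state=0).fit(X_train, y_train)

elements = {
    "/html/body/div[3]/h1": [38, 0],
    "/html/body/div[2]/div/div/div/ul/li[5]": [4, 1],
    "/html/body/div[3]/span": [11, 0],
}
labels = {xpath: clf.predict([f])[0] for xpath, f in elements.items()}
```

First and second order then widen the context each element sees beyond its own features.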
  14. Requirements
     • text-density-based labeling is too rigid: we want to extend this to types other than news articles
     • clustering is too noisy:
       • ads in between paragraphs
       • cannot “cluster” authors after titles
     • CRF: complexity beyond a linear-chain CRF grows too quickly
     • want a “single-step” process: multi-step algorithms erase information. Example: if the first step removes ads, then the second step cannot use information about ads to infer content.
  15. First Attempt: Hypothesis Generation using Sampling
     • start from a guess (using the zeroth-order classifier)
     • generate variations of that guess with a proposal function
     • evaluate an objective function based on a document-wide likelihood function of classification probabilities
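A runnable sketch of that loop under toy assumptions: the per-element probabilities stand in for classifier outputs, and the document-wide term is an illustrative penalty requiring exactly one “title” per page. The greedy accept rule and penalty weight are my illustrative choices, not the talk's actual objective.

```python
# Sketch of hypothesis generation by sampling (toy stand-ins).
import math
import random

LABELS = ["title", "navigation", "author"]
# per-element classification probabilities, e.g. from predict_proba
probs = [{"title": 0.6, "navigation": 0.3, "author": 0.1},
         {"title": 0.5, "navigation": 0.4, "author": 0.1},
         {"title": 0.1, "navigation": 0.2, "author": 0.7}]

def log_likelihood(hypothesis):
    # per-element log probabilities plus a document-wide penalty:
    # a page should have exactly one "title"
    ll = sum(math.log(p[label]) for p, label in zip(probs, hypothesis))
    return ll - 2.0 * abs(hypothesis.count("title") - 1)

def propose(hypothesis, rng):
    # variation of the current guess: flip one element's label at random
    h = list(hypothesis)
    h[rng.randrange(len(h))] = rng.choice(LABELS)
    return h

rng = random.Random(0)
best = [max(p, key=p.get) for p in probs]   # zeroth-order guess: two titles
best_ll = log_likelihood(best)
for _ in range(200):
    candidate = propose(best, rng)
    ll = log_likelihood(candidate)
    if ll > best_ll:                         # keep improvements only
        best, best_ll = candidate, ll
```

The document-wide term is what the independent zeroth-order classifier cannot express; the sampling repairs the two-title guess.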
  16. First Attempt: Hypothesis Generation using Sampling
     (diagram: sampling over page elements and the labels “title”, “navigation”, “author”)
  17. First Attempt: Hypothesis Generation using Sampling
     • decent results
     • training coverage questionable
     • slow inference
  18. Second Attempt: “Deep Learning Inspired”
     • Borrow ideas from “scene description”, traditionally done with scene graphs and CRFs.
     • With Deep Learning, one can avoid building a graph and go straight to assigning a label to every pixel. Clément Farabet, 2011: http://www.clement.farabet.net/research.html#parsing
  19. Second Attempt: “Deep Learning Inspired”
     (diagram: page elements mapped to the labels “title”, “navigation”, “author” in two passes)
  20. Second Attempt: “Deep Learning Inspired”
     (diagram: page elements mapped to the labels “title”, “navigation”, “author” in two passes)
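The feed-forward idea can be sketched under toy assumptions (synthetic features, an illustrative neighbor layout, not Wildcard's actual model): a first Random Forest labels every element, then a second one re-labels each element with its neighbors' first-pass class probabilities appended to its features.

```python
# Two-pass, feed-forward labeling sketch: neighbor predictions as features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[40.0, 0], [5, 1], [12, 0], [38, 0], [4, 1], [11, 0]])
y = np.array(["title", "navigation", "author"] * 2)

first = RandomForestClassifier(n_estimators=10, bootstrap=False,
                               random_state=0).fit(X, y)
P = first.predict_proba(X)                     # first-pass probabilities

def with_neighbors(P):
    # previous and next element's probabilities, zero-padded at the edges
    prev = np.vstack([np.zeros(P.shape[1]), P[:-1]])
    nxt = np.vstack([P[1:], np.zeros(P.shape[1])])
    return np.hstack([prev, nxt])

X2 = np.hstack([X, with_neighbors(P)])          # augmented features
second = RandomForestClassifier(n_estimators=10, bootstrap=False,
                                random_state=0).fit(X2, y)
pred = second.predict(X2)
```

Both passes are plain feed-forward classifier applications, so inference needs no graph construction and no sampling loop.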
  21. Feed-forward process is much faster
     • Processing time dropped by an order of magnitude. No significant degradation in quality.
     • Training: from URLs: ~2 hours; with cached external calls: <1 hour
     • Introduced: Forward Model, Bucket Model for Load
  22. Business-visible Successes
     • embedded media content: Twitter cards, Instagram posts, Facebook posts, Facebook videos and YouTube videos
     • On the right, a New York Magazine article on the train crash in Philadelphia: http://nymag.com/daily/intelligencer/2015/05/amtrak-train-derails-philadelphia.html
  23. Preliminary Business-visible Successes
     • enabling domains that require JavaScript emulation (e.g. websites built with pure AngularJS)
     • fixed individual publishers with high visibility in our app
     • comparison to competition: third party 71-82%, in-house 83% +/- 4%
  24. Summary
     • dataset creation, processing pipeline, content tree creation, evaluation tools, labeling tools, and training and inference strategies implemented over the past year
     • chose tools that allow quick iteration: simple parallel processing, ML on a single node
     • two open source projects: databench.trivial.io (pip install databench) and pysparkling.trivial.io (pip install pysparkling)
     • competitive performance: 54% of cards in Wildcard are powered by pure ML
     @svenkreiss