Live-Hack: Analyzing 7 years of Buzzwords (at Scale)

Talk given by Christoph Tavan at Berlin Buzzwords 2016, June 7.

https://www.berlinbuzzwords.de/session/live-hack-analyzing-7-years-buzzwords-scale

We're coming together for Berlin Buzzwords' 7th edition, and over the years a lot has changed in the Big Data technology ecosystem. Once-hot buzzwords have vanished and new ones have arisen.

Where you would once probably have written a MapReduce job in Java to crawl the web and analyze it on a massive scale, this has now become much simpler with tools like Spark and Flink at hand.

I want to do a live coding session in which I show that it is possible today to write a scalable web crawler and analytics tool that scrapes the past 6 years of Berlin Buzzwords websites and reveals some interesting insights into the Big Data trends of those years. Although I will run the tool on the very limited data set of the historical Berlin Buzzwords websites, I want to highlight that it would in principle scale to crawl millions of websites and analyze petabytes of data.
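The analysis step described above boils down to counting how often each term appears per conference year. The following is only a minimal local sketch of that idea in plain Python, not the Spark or Flink code from the talk. The sample sessions, the `BUZZWORDS` set and the function name `count_buzzwords` are hypothetical placeholders chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical sample data: (year, abstract text) pairs standing in for
# scraped session pages. These are NOT real Berlin Buzzwords abstracts.
SAMPLE_SESSIONS = [
    (2010, "Hadoop and MapReduce for search with Lucene"),
    (2010, "Scaling Hadoop clusters"),
    (2016, "Stream processing with Flink and Spark"),
]

# Hypothetical list of terms to track (an assumption for this sketch).
BUZZWORDS = {"hadoop", "mapreduce", "lucene", "spark", "flink"}


def count_buzzwords(sessions, buzzwords):
    """Return a dict mapping year -> Counter of buzzword occurrences."""
    counts = {}
    for year, text in sessions:
        # Lowercase and split into alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counter = counts.setdefault(year, Counter())
        # Only count tokens that are in the tracked buzzword set.
        counter.update(token for token in tokens if token in buzzwords)
    return counts


if __name__ == "__main__":
    for year, counter in sorted(count_buzzwords(SAMPLE_SESSIONS, BUZZWORDS).items()):
        print(year, dict(counter))
```

The same per-year grouping and counting is what a distributed engine would parallelize over many pages; here it simply runs in a single process.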

Christoph Tavan

June 07, 2016

Transcript

  1. Live-Hack: Analyzing 7 years of Buzzwords (at Scale)
    Berlin Buzzwords – June 7, 2016
    Christoph Tavan – @ctavan
  2. • CTO @ mbr targeting (Real-Time Bidding)
    • Attending Buzzwords since 2012
    • First Talk @ Buzzwords 2015
    We’re Hiring!
  3. So What Were The Actual Buzzwords?
    1) Scrape [2010-2016].berlinbuzzwords.de
    2) Extract and Analyze Buzzwords
    … aaaaaand
    • Live!
    • Scalable!
    • In 30 Minutes...
  4. Scraping – 2 Options
    A) Scrape all content, filter later (If you don’t know about the structure)
    B) Scrape only relevant content (If you know what you’re looking for)
  5. Scraping – 2 Options
    A) Scrape all content, filter later (If you don’t know about the structure): http://nutch.apache.org/
    B) Scrape only relevant content (If you know what you’re looking for): http://scrapy.org/
    Comprehensive Overview: http://www.ijser.org/researchpaper%5CComparison-of-Open-Source-Crawlers--A-Review.pdf
  7. Is a Page a Session Page?
    .date-display-single ? ✔ ✔ ✘ ✘ ✔ ✔ ✔
    2012 and 2013: .field-field-session-slot
    else: .date-display-single
  8. Who’s the Speaker?
    .field-field-speaker ? ✔ ✔ ✘ ✘ ✘ ✘ ✘
    2010 and 2011: .field-field-speaker a
    2012: .field-field-indiv-speakers a
    2013: .field-field-speakers a
    2014-2016: .field-name-field-session-speaker a
  9. And Where’s the Session Abstract?
    2010 – 2013: #main p
    2014 – 2016: article p
  10. Interactive Notebooks
    • Commercial:
      ◦ https://databricks.com/
      ◦ https://www.qubole.com/
      ◦ http://www.cloudwick.com/
      ◦ Review: http://www.infoworld.com/article/3068519/artificial-intelligence/review-6-machine-learning-clouds.html
    • Open Source:
      ◦ http://jupyter.org/
      ◦ https://zeppelin.incubator.apache.org/
      ◦ http://spark-notebook.io/
  11. So What Were The Actual Buzzwords?
    1) Scrape [2010-2016].berlinbuzzwords.de
    2) Extract and Analyze Buzzwords
    … aaaaaand
    • Live! ✔
    • Scalable! ✔ (at least analysis)
    • In 30 Minutes… ✔
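
Slides 7 to 9 list CSS selectors that differ between the yearly versions of the conference website. One way to keep a scraper readable is to collect these rules in small lookup functions. The following is a minimal sketch in plain Python under that assumption: the selector strings are copied from the slides, while the function names and the year validation are hypothetical and not taken from the talk.

```python
# Years covered by the slides (2010 through 2016).
SUPPORTED_YEARS = range(2010, 2017)


def _check_year(year):
    """Reject years the slides do not cover (an assumption of this sketch)."""
    if year not in SUPPORTED_YEARS:
        raise ValueError(f"no selector rules known for year {year}")


def session_marker_selector(year):
    """Selector whose presence marks a page as a session page (slide 7)."""
    _check_year(year)
    if year in (2012, 2013):
        return ".field-field-session-slot"
    return ".date-display-single"


def speaker_selector(year):
    """Selector for the speaker links (slide 8)."""
    _check_year(year)
    if year in (2010, 2011):
        return ".field-field-speaker a"
    if year == 2012:
        return ".field-field-indiv-speakers a"
    if year == 2013:
        return ".field-field-speakers a"
    return ".field-name-field-session-speaker a"  # 2014-2016


def abstract_selector(year):
    """Selector for the paragraphs of the session abstract (slide 9)."""
    _check_year(year)
    return "#main p" if year <= 2013 else "article p"


if __name__ == "__main__":
    for y in SUPPORTED_YEARS:
        print(y, session_marker_selector(y), speaker_selector(y), abstract_selector(y))
```

A scraper built with a tool such as Scrapy could pass these strings to its CSS selector mechanism once it knows which year's site a page belongs to.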