Slide 1

Slide 1 text

TechTrends Because trends matter.

Slide 2

Slide 2 text

Idea
● recognizing trends is becoming increasingly important, especially in IT
● find and visualize trends in the field of computer science
● the more people talk about a topic, the more popular it is
● use community-generated content instead of big news sites
● finding the right trends is key to making good decisions in the future
PIA - Tech Trends | Raphael Brand | Thomas Uhrig | Hannes Pernpeintner

Slide 3

Slide 3 text

Demo: http://techtrends.mi.hdm-stuttgart.de

Slide 4

Slide 4 text

Agenda
1. Our Story
2. Architecture
   a. Crawling
   b. Content Extraction
   c. Preprocessing
   d. Training & Indexing
   e. Jobs & Scheduling
   f. Public API
   g. User Interface
3. Get ready for production
4. Problems & Learned Lessons
5. Outcome
6. What's next
7. Sources

Slide 5

Slide 5 text

Agenda (current item) 2. Architecture: Components, Parts

Slide 6

Slide 6 text

Agenda (current item) 2a. Crawling: Link Aggregators, Blogs, News

Slide 7

Slide 7 text

Agenda (current item) 2b. Content Extraction: Heterogeneous Sources, Arc90, Boilerpipe

Slide 8

Slide 8 text

Agenda (current item) 2c. Preprocessing: NLTK, Blacklist

Slide 9

Slide 9 text

Agenda (current item) 2d. Training & Indexing: Gensim

Slide 10

Slide 10 text

Agenda (current item) 2e. Jobs & Scheduling: Crawler, Learner

Slide 11

Slide 11 text

Agenda (current item) 2f. Public API: JSON

Slide 12

Slide 12 text

Agenda (current item) 2g. User Interface: Visualizing, Bootstrap, D3.js

Slide 13

Slide 13 text

Agenda (current item) 3. Get ready for production: Memcache, Preprocessing Cache, NGINX, Content Pipeline

Slide 14

Slide 14 text

Agenda (current item) 4. Problems & Learned Lessons: Crawling, Preprocessing, VM, Learning, Garbage in Database

Slide 15

Slide 15 text

Agenda (current item) 5. Outcome: It's running!, Mathias Haas

Slide 16

Slide 16 text

Agenda (current item) 6. What's next: Ideas, Further plans

Slide 17

Slide 17 text

Agenda (current item) 7. Sources: Libraries, Documents, Websites

Slide 18

Slide 18 text

Architecture (image: http://www.unjournalism.com)

Slide 19

Slide 19 text

Components (architecture diagram)
Modules: Crawler, Content Extraction, Preprocessing, Similarity Server, Database, Web API, Web UI (<< Server - Client >>)
Libraries and resources: requests, BS4, Boilerpipe, Arc90, nltk, gensim, SQLite, flask, uWSGI, NGINX, Memcache, HTML, CSS, JS, D3.js, jQuery, DataTables, Bootstrap
Sources: HN, Reddit, Blogs
Languages: Python, Java, SQL

Slide 20

Slide 20 text

Crawling (image: https://farm8.staticflickr.com)

Slide 21

Slide 21 text

Link aggregators

Slide 22

Slide 22 text

Crawling
● get the front page of Hackernews and Reddit
● parse the HTML and create a list of links with votes, comments and the current date
● for each new link, fetch the HTML, compress it and save it
● for each existing link, update the meta information (votes, comments)

Slide 23

Slide 23 text

Data Storage
● SQLite (since we have just one user)
● storing each HTML page as a zipped BLOB (fallback)
● Gensim Similarity Server to store a pre-calculated index; a new index is built every night

Tables
Links: id INTEGER, link STRING, title STRING
Hackernews: id INTEGER, discussion STRING, votes INTEGER, comments INTEGER, first TIMESTAMP, last TIMESTAMP
Reddit: id INTEGER, discussion STRING, votes INTEGER, comments INTEGER, first TIMESTAMP, last TIMESTAMP
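The storage scheme can be sketched with Python's standard library alone. This is a minimal sketch under stated assumptions, not the project's actual code: the `html` column on the `links` table, the use of `zlib` for the "zipped" BLOB, and the function names `save_or_update` and `load_html` are all hypothetical. Only the Hackernews table is shown; the Reddit table would look the same.

```python
import sqlite3
import zlib
from datetime import datetime, timezone

# Schema modeled on the slide; the html column and its placement are assumptions.
SCHEMA = """
CREATE TABLE links (
    id    INTEGER PRIMARY KEY,
    link  TEXT UNIQUE NOT NULL,
    title TEXT,
    html  BLOB
);
CREATE TABLE hackernews (
    id         INTEGER PRIMARY KEY REFERENCES links(id),
    discussion TEXT,
    votes      INTEGER,
    comments   INTEGER,
    "first"    TIMESTAMP,
    "last"     TIMESTAMP
);
"""


def save_or_update(conn, link, title, html, votes, comments, discussion=""):
    """Insert a new link with compressed HTML, or refresh votes/comments of a known one."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute("SELECT id FROM links WHERE link = ?", (link,)).fetchone()
    if row is None:
        # New link: store the page once, compressed.
        blob = zlib.compress(html.encode("utf-8"))
        cur = conn.execute(
            "INSERT INTO links (link, title, html) VALUES (?, ?, ?)",
            (link, title, blob),
        )
        conn.execute(
            'INSERT INTO hackernews (id, discussion, votes, comments, "first", "last") '
            "VALUES (?, ?, ?, ?, ?, ?)",
            (cur.lastrowid, discussion, votes, comments, now, now),
        )
        return cur.lastrowid
    # Known link: only the meta information changes.
    conn.execute(
        'UPDATE hackernews SET votes = ?, comments = ?, "last" = ? WHERE id = ?',
        (votes, comments, now, row[0]),
    )
    return row[0]


def load_html(conn, link_id):
    """Decompress and return the stored HTML of a link."""
    blob = conn.execute("SELECT html FROM links WHERE id = ?", (link_id,)).fetchone()[0]
    return zlib.decompress(blob).decode("utf-8")
```

Keeping the raw HTML means the preprocessing can be re-run later without crawling the page again.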

Slide 24

Slide 24 text

Content Extraction (image: http://static4.businessinsider.com)

Slide 25

Slide 25 text

Problem
Since we use Hackernews and Reddit to find links, we have a lot of different sources. Every page looks different and has a different HTML structure. Some are easy to parse, some are really hard - most are in between.

Slide 26

Slide 26 text

Two solutions
Arc90's Readability
○ pure Python
○ easy and short algorithm
○ good results in most cases
Boilerpipe
○ Java
○ complex library
○ nearly always good results

Slide 27

Slide 27 text

Our solution
● we use Arc90's Readability as the default solution shipped with TechTrends
● however, if Boilerpipe is installed, we use Boilerpipe (and Arc90's Readability only as a fallback)
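The "preferred extractor with fallback" idea can be sketched as a small chain of callables. This is an illustration only: `extract_with_fallback`, `unavailable_extractor` and `naive_tag_stripper` are hypothetical names, and the two stand-in extractors merely simulate a missing Boilerpipe installation and a simpler always-available extractor. They are not the real Boilerpipe or Readability APIs.

```python
import re
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # Extractor missing or failed on this page: try the next one.
            continue
        if text and text.strip():
            return text.strip()
    return None


def unavailable_extractor(html: str) -> str:
    """Stand-in for an optional extractor that is not installed."""
    raise ImportError("optional extractor is not installed")


def naive_tag_stripper(html: str) -> str:
    """Stand-in for the default extractor: simply drops all tags."""
    return re.sub(r"<[^>]+>", " ", html)


print(extract_with_fallback("<p>Hello world</p>", [unavailable_extractor, naive_tag_stripper]))
```

Ordering the list from best to simplest keeps the default installation free of the Java dependency while still using the better extractor where it is available.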

Slide 28

Slide 28 text

Blacklisted and ignored content
○ Images
○ PDFs
○ YouTube
○ Animation-only sites
○ Flash
○ sites that rely heavily on asynchronous JavaScript rendering (e.g. Twitter)
(image: http://csimg.shopwahl.de)

Slide 29

Slide 29 text

Preprocessing (image: http://www.recklinghaeuser-zeitung.de)

Slide 30

Slide 30 text

Garbage in, garbage out! But: less is more!

Slide 31

Slide 31 text

What we did
1. extract content from raw HTML (see Arc90/Boilerpipe)
2. remove stopwords (English stopword list from NLTK)
3. remove words with length < 3 and digits
4. remove non-alphanumeric strings
5. lemmatization (WordNetLemmatizer from NLTK)
6. apply a small hand-made blacklist
7. generate a word list
8. cache the word list
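Steps 2 to 7 can be sketched in plain Python. This is a simplified sketch, not the project's pipeline: the tiny `STOPWORDS` and `BLACKLIST` sets are placeholders for NLTK's English stopword list and the hand-made blacklist, and lemmatization is an optional callable (for example, the `lemmatize` method of NLTK's WordNetLemmatizer could be passed in). The filter is also stricter than the slide: it drops every token that contains a digit or symbol.

```python
# Placeholder lists; the real pipeline used NLTK's stopwords and its own blacklist.
STOPWORDS = {"the", "and", "are", "for", "from", "with", "this", "that", "was", "but", "not"}
BLACKLIST = {"http", "www", "com"}


def preprocess(text, stopwords=STOPWORDS, blacklist=BLACKLIST, lemmatize=None):
    """Turn extracted article text into a cleaned word list."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,;:!?()[]\"'")
        if len(token) < 3:
            continue  # too short to carry meaning
        if not token.isalpha():
            continue  # contains digits or other non-letter characters
        if token in stopwords or token in blacklist:
            continue
        if lemmatize is not None:
            token = lemmatize(token)
        words.append(token)
    return words


print(preprocess("The crawler fetched 42 pages from www and it was fast!"))
# prints ['crawler', 'fetched', 'pages', 'fast']
```

Since this step dominates the runtime, caching the resulting word list per document (step 8) avoids repeating it on every nightly training run.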

Slide 32

Slide 32 text

Preprocessing was the most computationally intensive step!
Preprocessing alone took 6 minutes and 45 seconds for 13500 links on an i7 ultrabook. Training and learning a model took 3 minutes and 40 seconds.

Slide 33

Slide 33 text

Training & Indexing (image: http://cinemascopeloid.files.wordpress.com)

Slide 34

Slide 34 text

Training & Indexing
Implemented with Gensim and Latent Semantic Indexing (LSI).
Our goal: train and learn a model of our data that can be queried with a token list (e.g. "java") or a document ID to find similar documents.

Slide 35

Slide 35 text

Workflow (diagram labels): Documents, Dictionary, Document Vectors, TF-IDF, Model (unsupervised learning), Index; query input: Document ID or Keyword(s)
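The workflow can be illustrated with a deliberately simplified stand-in written in pure Python. This sketch is an assumption-laden illustration, not the Gensim pipeline: it builds TF-IDF vectors and a cosine-similarity index and skips the LSI projection entirely. The function names and the smoothed IDF formula are choices made for this example.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Map a token list to a length-normalized TF-IDF vector (dict of term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """docs maps document ID -> token list; returns (idf table, index of vectors)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    # Smoothed IDF so that terms occurring in every document keep a small weight.
    idf = {t: math.log(n / count) + 1.0 for t, count in df.items()}
    index = {doc_id: vectorize(tokens, idf) for doc_id, tokens in docs.items()}
    return idf, index


def query(tokens, idf, index, min_similarity=0.0):
    """Return (doc_id, cosine similarity) pairs above the threshold, best first."""
    q = vectorize(tokens, idf)
    hits = []
    for doc_id, vec in index.items():
        sim = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if sim > min_similarity:
            hits.append((doc_id, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


docs = {
    1: ["java", "jvm", "garbage"],
    2: ["python", "flask", "api"],
    3: ["java", "spring", "api"],
}
idf, index = build_index(docs)
print(query(["java"], idf, index))   # keyword query
print(query(docs[1], idf, index))    # "similar documents" query for document 1
```

A query by document ID reuses the same path: the document's own token list is the query. Unlike LSI, this stand-in only matches documents that share literal terms.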

Slide 36

Slide 36 text

LSI
Map documents into a multidimensional space!
All documents lie in only a small subspace, but this subspace is still extremely high-dimensional. LSI reduces it to its main components.
(plot: Documents over Dimension a and Dimension b)

Slide 37

Slide 37 text

LSI
Find the dimensions with the largest variance!
(plot: Documents over Dimension a and Dimension b)

Slide 38

Slide 38 text

LSI
Reduce the space to the dimensions that vary significantly!
(plot: Documents)
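In the standard textbook formulation (the symbols below are introduced here and do not appear on the slides), LSI is a truncated singular value decomposition of the term-document matrix X with m terms and n documents. Only the k largest singular values are kept, and a query or new document d is projected into the reduced space before similarities are computed:

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n)

\hat{d} = \Sigma_k^{-1} U_k^{\top} d, \qquad
\operatorname{sim}(\hat{d}_1, \hat{d}_2) = \frac{\hat{d}_1^{\top} \hat{d}_2}{\lVert \hat{d}_1 \rVert \, \lVert \hat{d}_2 \rVert}
```

Here the columns of U_k span the k retained dimensions, and the similarity is the cosine between two projected documents.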

Slide 39

Slide 39 text

Jobs & Scheduling (image: http://cdn.imore.com)

Slide 40

Slide 40 text

Crawler
● the crawler runs as an independent daemon: if the web crawler blows up, it will not affect the indexing server!
● one crawler for each source (e.g. for Reddit)
● has been running every 15 minutes for more than four months
● collects more than 500 articles per week
● 5000 articles are approximately 60 MB
● currently we have more than 13,300 articles!
(image: http://simple-article10.blogspot.de)

Slide 41

Slide 41 text

Training
● training and building of a new search index every night
● restart the server every night
● Linux cronjobs: crawl every 15 min, learning every night

Slide 42

Slide 42 text

Public API (image: http://www.pnwphotos.com)

Slide 43

Slide 43 text

Public API
● a public JSON API to access all data (documents, queries and topics)
● different parameters defined as URL options
● our UI makes use of this API
● clean architecture with a clear dependency between components (UI and server)
● a lot of work is delegated to the client to unburden the server
● documented on the website itself and easy to explore
● open for more clients (e.g. a mobile version or a reporting tool)

Slide 44

Slide 44 text

Public API (diagram)
(1) user input in the Web UI (client; D3.js, JS)
(2) URL call to the Web API (server; flask, Python)
(3) JSON response
(4) rendering in the Web UI

Slide 45

Slide 45 text

Word Query
● search for one or more words
● most important for the user
● parameters to specify dates, minimal similarity and preprocessing of the input query
/search?query=facebook
/search?query=facebook+home
/search?query=facebook+home&min=0.7&from=2013-03-01&to=2013-04-30
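How a server might read these URL options can be sketched with the standard library. The parameter names `query`, `min`, `from` and `to` come from the example URLs; the function name `parse_search_url`, the returned dictionary layout and the default minimal similarity of 0.0 are assumptions made for this illustration.

```python
from datetime import date
from urllib.parse import parse_qs, urlsplit


def parse_search_url(url):
    """Extract the word query and its optional filters from a /search URL."""
    params = parse_qs(urlsplit(url).query)

    def first(name, default=None):
        return params.get(name, [default])[0]

    return {
        # "+" in the query string decodes to a space, so this yields one entry per word.
        "words": first("query", "").split(),
        "min": float(first("min", "0")),
        "from": date.fromisoformat(first("from")) if "from" in params else None,
        "to": date.fromisoformat(first("to")) if "to" in params else None,
    }


print(parse_search_url("/search?query=facebook+home&min=0.7&from=2013-03-01&to=2013-04-30"))
```

Keeping every option in the URL makes each result page addressable, which also suits caching complete responses per URL.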

Slide 46

Slide 46 text

Similar Document Query
● search for a specific document in our database
● parameters to specify dates and minimal similarity
/search?id=42
/search?id=42&min=0.7&from=2013-03-01&to=2013-04-30

Slide 47

Slide 47 text

Example Query
● search with a sample link
● interesting use case for the user
● parameters to specify dates, minimal similarity and number of topics of the link
● dangerous since we depend on external content
/example?link=http://www.lastwordonnothing.com/2013/04/22/dumped-by-google/
(diagram) 1: send link, 2: fetch (Web), 3: preprocessing, 4: query (Index)
Image 1 - http://icons.iconarchive.com, Image 2 - https://si0.twimg.com

Slide 48

Slide 48 text

User Interface (image: http://2.bp.blogspot.com)

Slide 49

Slide 49 text

Landing Page
○ gain the attention of the user
○ provide all basic information in a compact form
○ call to action

Slide 50

Slide 50 text

Search Page
○ present search results in different forms (charts, lists, popovers)
○ single-page rich application with URL location rewriting
○ short load time since the actual page is loaded only once

Slide 51

Slide 51 text

Other Pages
○ additional information (e.g. what we do, who we are)
○ documentation (e.g. our public JSON API)

Slide 52

Slide 52 text

Behind the scenes
○ built with HTML, CSS and JavaScript
○ Bootstrap as CSS framework
○ D3.js to draw charts with JSON data from our API
(diagram, << Client - Server >>) Search Page and JSON API: (1) JSON calls, (2) JSON data from the Index, (3) rendering

Slide 53

Slide 53 text

Get ready for production (image: http://www.nasa.gov)

Slide 54

Slide 54 text

Environment (architecture diagram)
Modules: Crawler, Content Extraction, Preprocessing, Similarity Server, Database, Web API, Web UI (<< Server - Client >>)
Libraries and resources: requests, BS4, Boilerpipe, Arc90, nltk, gensim, SQLite, flask, uWSGI, NGINX, Memcache, HTML, CSS, JS, D3.js, jQuery, DataTables, Bootstrap
Sources: HN, Reddit, Blogs
Languages: Python, Java, SQL

Slide 55

Slide 55 text

Nginx
○ built-in support for the uwsgi protocol, which uWSGI uses to serve Web Server Gateway Interface (WSGI) applications
○ less complex and lower memory footprint than Apache
○ handles gzip compression, serving static files, cache headers ...

Slide 56

Slide 56 text

uWSGI
○ application server
○ very easy to install and configure
○ good documentation

Slide 57

Slide 57 text

Memcached
○ the response for a unique URL won't change for 24 hours
○ cache the complete HTTP response
○ very easy to install and use with your application; speeds it up enormously if your data doesn't change
○ is cleared once a new model has been learned
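The caching policy (one entry per URL, valid for 24 hours, flushed after nightly training) can be sketched with an in-memory stand-in. `ResponseCache` and `cached_response` are hypothetical names and the dictionary-based store only imitates the behavior; the real deployment used Memcached, whose client API is not shown here.

```python
import time


class ResponseCache:
    """In-memory stand-in for a response cache keyed by URL with a time-to-live."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[url]  # entry is older than the TTL
            return None
        return response

    def set(self, url, response):
        self._store[url] = (self._clock() + self._ttl, response)

    def clear(self):
        """Call after a new model has been learned so stale results disappear."""
        self._store.clear()


def cached_response(cache, url, compute):
    """Return the cached response for url, computing and storing it on a miss."""
    response = cache.get(url)
    if response is None:
        response = compute(url)
        cache.set(url, response)
    return response
```

Because the index only changes once per night, clearing the whole cache after training is simpler than tracking which responses a new model invalidates.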

Slide 58

Slide 58 text

Performance Improvements
○ precalculate everything (document index, variance, ...)
○ no CPU-heavy routes
○ cache everything possible
○ try different configurations for your web server and application server
○ measure, e.g. with Siege!

Slide 59

Slide 59 text

Flask Assets
○ minification and combination of CSS and JavaScript
○ reduces file size and the number of HTTP requests
○ done with Flask Assets (an extension for Flask)

    from flask_assets import Bundle

    # Combine and minify all JavaScript files into one packed file.
    js = Bundle(
        'js/jquery.min.js',
        # ...
        'js/techtrends.js',
        filters='jsmin',
        output='gen/packed.js'
    )

    # Combine all stylesheets into one packed file.
    css = Bundle(
        'css/jquery.dataTables.css',
        # ...
        'css/bootstrap-responsive.min.css',
        output='gen/packed.css'
    )

    # assets is the application's flask_assets Environment instance.
    assets.register('js_all', js)
    assets.register('css_all', css)

Slide 60

Slide 60 text

PageSpeed Insights
○ no 100 % score because of third-party widgets (social buttons)
○ minification
○ browser caching
○ gzip compression

Slide 61

Slide 61 text

Problems & Learned Lessons (image: http://barrygruff.files.wordpress.com)

Slide 62

Slide 62 text

What's popularity?
An article...
...is present (on Hackernews, Reddit etc.)
...stays (on Hackernews, Reddit etc. for a long time)
...has votes
...has comments
(image: http://www.horusmedia.de)

Slide 63

Slide 63 text

It's a mathematical indicator!
popularity = x * duration + y * votes + z * comments
Weights: x = y = z = 1

Slide 64

Slide 64 text

But math is always a problem...
How do we compare these values?
○ each value has its own average, variance and maximum
○ are Reddit votes better than those from Hackernews?
○ statistical methods were applied to make all values comparable
○ completely calculated in the database (SQL)
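One common way to make such values comparable is to standardize each one before applying the weights. The slides do not say which statistical method was used, so the z-score normalization below is an assumption, and the calculation is shown in Python although the original was done in SQL. The function names are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # all values equal: no article stands out
    return [(v - mu) / sigma for v in values]


def popularity(durations, votes, comments, x=1.0, y=1.0, z=1.0):
    """popularity = x * duration + y * votes + z * comments on standardized inputs."""
    d, v, c = zscores(durations), zscores(votes), zscores(comments)
    return [x * di + y * vi + z * ci for di, vi, ci in zip(d, v, c)]


# Three articles from one source: duration on the front page, votes, comments.
print(popularity([1, 2, 3], [10, 20, 30], [0, 5, 10]))
```

Standardizing per source (Reddit and Hackernews separately) would address the question of whether a vote on one site counts as much as a vote on the other.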

Slide 65

Slide 65 text

Visualization

Slide 66

Slide 66 text

Respect robots.txt, or you get banned.
(image: http://www.abakus-internet-marketing.de)
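Python's standard library ships a robots.txt parser, so a crawler can check a URL before fetching it. The sketch below parses an already downloaded robots.txt text; the helper name `make_robots_checker` and the user agent string `TechTrendsBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="TechTrendsBot"):
    """Build a function that tells whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


robots = "User-agent: *\nDisallow: /private/\n"
allowed = make_robots_checker(robots)
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/page.html"))
```

In a crawler, the check would run once per host before any page of that host is requested.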

Slide 67

Slide 67 text

"640 kB ought to be enough for anybody." - Bill Gates (maybe ;))
Well...
(image: http://www.fastcoexist.com)

Slide 68

Slide 68 text

You're never done.
(image: http://www.monsterzeug.de)

Slide 69

Slide 69 text

You never know...
○ Google App Engine has a lot of limitations and things that you can't decide or control on your own!
○ SKLearn is very low-level.
○ D3.js is very low-level, and customizing charts to look good is sometimes time-consuming.
○ Sometimes you will come across strange problems in the third-party libraries you use (e.g. an explicit "show" on the Bootstrap popover, which didn't work well).
○ Gensim claims to have online learning, but, well, it doesn't...

Slide 70

Slide 70 text

User tests
● we asked friends to beta-test our website
● mostly very useful responses
● but it feels strange when your work gets judged
Feedback we received:
"Why is the page in English and why do I have to scroll on the front page?"
"Do you really need JavaScript?"
"The design is a little bit dark and gray."
"What is the meaning of that little green arrow?"
"Is the popular topics box related to my search?"
...
(image: http://static.guim.co.uk)

Slide 71

Slide 71 text

Outcome (image: http://upload.wikimedia.org)

Slide 72

Slide 72 text

Media Night
A lot of interested people with a lot of interesting questions. We already had a meeting with somebody willing to use a customized version of TechTrends for his business.

Slide 73

Slide 73 text

A lot of work...
The first crawler and preprocessing were built very quickly - but we made a lot of mistakes. We kept maintaining the crawling and preprocessing throughout the whole project, for more than three months, to get it running in the end.

Slide 74

Slide 74 text

...but:
● It's running!
● We learned a lot:
  ■ Data Mining (e.g. crawling)
  ■ Machine Learning Algorithms (e.g. LDA/LSI)
  ■ Web Architecture (e.g. Bootstrap)
  ■ API Design (e.g. JSON)
  ■ Performance (e.g. Caching)
  ■ UI Design (e.g. user experience)
  ■ Working as a team

Slide 75

Slide 75 text

What's next? (image: http://www.beautys.de)

Slide 76

Slide 76 text

Further ideas
● Modularisation with a plugin concept for crawlers
● Customization of TechTrends for different domains
● A mobile version?! Some reporting tool?
● Suggest similar search queries to the user
● Let it run for a longer time
Our project is available on Bitbucket, for everyone!

Slide 77

Slide 77 text

Sources (image: http://blog.zdf.de)

Slide 78

Slide 78 text

Libraries
○ Gensim, http://radimrehurek.com/gensim/
○ NLTK, http://nltk.org/
○ Flask, http://flask.pocoo.org/
○ Flask Assets, http://elsdoerfer.name/docs/flask-assets/
○ D3.js, http://d3js.org/
○ Bootstrap, http://twitter.github.io/bootstrap/
○ DataTables, https://datatables.net/
○ BeautifulSoup, http://www.crummy.com/software/BeautifulSoup/
○ Requests, http://python-requests.org

Slide 79

Slide 79 text

Algorithms
○ Arc90, http://lab.arc90.com/2009/03/02/readability/
○ LSI/LSA, http://de.wikipedia.org/wiki/Latente_Semantische_Analyse, http://radimrehurek.com/gensim/wiki.html

Slide 80

Slide 80 text

Tools
○ Boilerpipe, https://code.google.com/p/boilerpipe/
○ Memcache, http://memcached.org/
○ NGINX, http://nginx.com/
○ SQLite, http://www.sqlite.org/