TechTrends - searching trends on HN and Reddit

TechTrends is an intelligent search for Hackernews and Reddit. It lets you search for articles and provides a trend chart, similar to Google Trends. It is built in Python on top of Gensim.
This is our final university presentation, in which we show in detail how we built the project.
The project is online at http://techtrends.mi.hdm-stuttgart.de/. More information is available at http://tuhrig.de/tag/techtrends/.

Thomas Uhrig

July 25, 2013

Transcript

  1. TechTrends Because trends matter.

  2. Idea
      • recognizing trends is becoming more and more important, especially in IT
      • find and visualize trends in the field of computer science
      • the more people talk about a topic, the more popular it is
      • use community-generated content instead of big news sites
      • finding the right trends is key to making good decisions in the future
      PIA - Tech Trends | Raphael Brand | Thomas Uhrig | Hannes Pernpeintner
  3. Demo: http://techtrends.mi.hdm-stuttgart.de
  4. Agenda
      1. Our Story
      2. Architecture
         a. Crawling
         b. Content Extraction
         c. Preprocessing
         d. Training & Indexing
         e. Jobs & Scheduling
         f. Public API
         g. User Interface
      3. Get ready for production
      4. Problems & Learned Lessons
      5. Outcome
      6. What's next
      7. Sources
  5. Agenda (2. Architecture): Components, Parts
  6. Agenda (2a. Crawling): Link Aggregators, Blogs, News
  7. Agenda (2b. Content Extraction): Heterogeneous Sources, Arc90, Boilerpipe
  8. Agenda (2c. Preprocessing): NLTK, Blacklist
  9. Agenda (2d. Training & Indexing): Gensim
  10. Agenda (2e. Jobs & Scheduling): Crawler, Learner
  11. Agenda (2f. Public API): JSON
  12. Agenda (2g. User Interface): Visualizing, Bootstrap, D3.js
  13. Agenda (3. Get ready for production): Memcache, Preprocessing Cache, NGINX, Content Pipeline
  14. Agenda (4. Problems & Learned Lessons): Crawling, Preprocessing, VM, Learning, Garbage in Database
  15. Agenda (5. Outcome): It's running!, Mathias Haas
  16. Agenda (6. What's next): Ideas, Further plans
  17. Agenda (7. Sources): Libraries, Documents, Websites
  18. Architecture (image: http://www.unjournalism.com)
  19. Components [architecture diagram, Server - Client]
      Legend: Library, Module, Resource, Language
      Elements shown: Crawler, Content Extraction, Preprocessing, Similarity Server, Database, Web API, Web UI, requests, BS4, Boilerpipe, Arc90, nltk, gensim, flask, SQLite, D3.js, jQuery, DataTables, Bootstrap, NGINX, UWSGI, MEMCache, HN, Reddit, Blogs, Python, Java, SQL, HTML, CSS, JS
  20. Crawling (image: https://farm8.staticflickr.com)
  21. Link aggregators
  22. Crawling
      • get the front page of Hackernews and Reddit
      • parse the HTML and create a list of links with votes, comments and the current date
      • for each new link: fetch the HTML, compress it and save it
      • for each existing link: update the meta information
      [screenshot labels: Votes, Comments]
  23. Data Storage
      • SQLite (since we have just one user)
      • each HTML page is stored as a zipped BLOB (fallback)
      • Gensim Similarity Server stores a pre-calculated index; a new index is built every night
      Tables:
        Links:      id : INTEGER, link : STRING, title : STRING
        Hackernews: id : INTEGER, discussion : STRING, votes : INTEGER, comments : INTEGER, first : TIMESTAMP, last : TIMESTAMP
        Reddit:     id : INTEGER, discussion : STRING, votes : INTEGER, comments : INTEGER, first : TIMESTAMP, last : TIMESTAMP
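The storage idea (compressed HTML in a BLOB column, meta information updated on every crawl) can be sketched with Python's standard library. This is a minimal sketch under assumptions, not the project's actual code: the table layout is modeled on the slide, but the `html` column, the column names `first_seen`/`last_seen` (renamed from `first`/`last`) and the helper names `save_link` and `load_html` are hypothetical. Only the Hackernews table is shown; a Reddit table would look the same.

```python
import sqlite3
import zlib

# Hypothetical schema modeled on the slide; names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    id    INTEGER PRIMARY KEY,
    link  TEXT UNIQUE,
    title TEXT,
    html  BLOB
);
CREATE TABLE IF NOT EXISTS hackernews (
    id         INTEGER PRIMARY KEY REFERENCES links(id),
    discussion TEXT,
    votes      INTEGER,
    comments   INTEGER,
    first_seen TIMESTAMP,
    last_seen  TIMESTAMP
);
"""


def save_link(conn, link, title, html, discussion, votes, comments, now):
    """Store a new link with compressed HTML, or update meta data of a known one."""
    row = conn.execute("SELECT id FROM links WHERE link = ?", (link,)).fetchone()
    if row is None:
        # New link: compress the raw HTML and keep it as a BLOB.
        blob = zlib.compress(html.encode("utf-8"))
        cur = conn.execute(
            "INSERT INTO links (link, title, html) VALUES (?, ?, ?)",
            (link, title, blob),
        )
        conn.execute(
            "INSERT INTO hackernews "
            "(id, discussion, votes, comments, first_seen, last_seen) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (cur.lastrowid, discussion, votes, comments, now, now),
        )
        return cur.lastrowid
    # Existing link: only refresh votes, comments and the last-seen time.
    conn.execute(
        "UPDATE hackernews SET votes = ?, comments = ?, last_seen = ? WHERE id = ?",
        (votes, comments, now, row[0]),
    )
    return row[0]


def load_html(conn, link_id):
    """Read the BLOB back and decompress it into the original HTML string."""
    blob = conn.execute("SELECT html FROM links WHERE id = ?", (link_id,)).fetchone()[0]
    return zlib.decompress(blob).decode("utf-8")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keeping the raw HTML around means that content extraction and preprocessing can be re-run later without crawling the pages again.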
  24. Content Extraction (image: http://static4.businessinsider.com)
  25. Problem
      Since we use Hackernews and Reddit to find links, we have a lot of different sources. Every page looks different and has a different HTML structure. Some are easy to parse, some are really hard; most are in between.
  26. Two solutions
      Arc90's Readability
        ◦ pure Python
        ◦ easy and short algorithm
        ◦ good results in most cases
      Boilerpipe
        ◦ Java
        ◦ complex library
        ◦ nearly always good results
  27. Our solution
      • we use Arc90's Readability as the default solution shipped with TechTrends
      • however, if Boilerpipe is installed, we use Boilerpipe (and Arc90's Readability only as a fallback)
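The "preferred extractor with fallback" idea can be sketched as follows. This is an illustration under assumptions: the function names `available_extractors` and `extract_content` are hypothetical, and `strip_tags` is only a crude stand-in for a real extractor such as Readability or Boilerpipe, whose actual APIs are not shown here.

```python
import re


def strip_tags(html):
    """Crude stand-in for a real extractor: drop tags, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()


def available_extractors(readability_extract, boilerpipe_extract=None):
    """Order extractors by preference: Boilerpipe first if present, Readability last."""
    extractors = []
    if boilerpipe_extract is not None:
        extractors.append(("boilerpipe", boilerpipe_extract))
    extractors.append(("readability", readability_extract))
    return extractors


def extract_content(html, extractors):
    """Return (name, text) from the first extractor that yields non-empty text."""
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this extractor failed on the page; try the next one
        if text and text.strip():
            return name, text.strip()
    return None, ""
```

A page for which every extractor fails or returns nothing comes back as `(None, "")`, so the caller can skip it instead of indexing garbage.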
  28. Blacklisted and ignored content (image: http://csimg.shopwahl.de)
      ◦ Images
      ◦ PDFs
      ◦ YouTube
      ◦ Animation-only sites
      ◦ Flash
      ◦ sites that rely heavily on asynchronous JavaScript rendering (e.g. Twitter)
  29. Preprocessing (image: http://www.recklinghaeuser-zeitung.de)
  30. Garbage in, garbage out! But: less is more!
  31. What we did
      1. extract content from raw HTML (see Arc90/Boilerpipe)
      2. remove stopwords (English stopword list from NLTK)
      3. remove words with length < 3 and digits
      4. remove non-alphanumeric strings
      5. lemmatization (WordNetLemmatizer from NLTK)
      6. apply a small hand-made blacklist
      7. generate a word list
      8. cache the word list
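Steps 2 to 8 above can be sketched in plain Python. This is a simplified illustration, not the project's code: the tiny `STOPWORDS` set stands in for NLTK's English stopword list, `BLACKLIST` is a made-up example, the lemmatizer is passed in as a function (identity by default) where the project used NLTK's WordNetLemmatizer, and "digits" is interpreted here as tokens consisting only of digits.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
STOPWORDS = {"the", "and", "for", "are", "with", "this", "that", "was", "has", "have", "not", "but"}
# Hypothetical hand-made blacklist.
BLACKLIST = {"http", "www", "com"}


def preprocess(text, lemmatize=lambda word: word, cache=None):
    """Turn extracted article text into a cleaned word list."""
    if cache is not None and text in cache:
        return cache[text]  # step 8: reuse a cached word list
    words = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()[]")
        if len(token) < 3 or token in STOPWORDS:
            continue  # steps 2 and 3: stopwords and very short words
        if token.isdigit() or not token.isalnum():
            continue  # steps 3 and 4: digits and non-alphanumeric strings
        token = lemmatize(token)  # step 5
        if token in BLACKLIST:
            continue  # step 6
        words.append(token)  # step 7
    if cache is not None:
        cache[text] = words
    return words
```

Passing a dictionary as `cache` avoids repeating the work for a text that was already processed, which matters because preprocessing is the expensive part of the pipeline.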
  32. Preprocessing was the most computing-intensive process!
      13,500 links took 6 minutes and 45 seconds on an i7 ultrabook for preprocessing alone! Training and learning a model took 3 minutes and 40 seconds.
  33. Training & Indexing (image: http://cinemascopeloid.files.wordpress.com)
  34. Training & Indexing
      Implemented with Gensim and Latent Semantic Indexing (LSI).
      Our goal: train a model of our data that can be queried with a token list (e.g. "java") or a document ID to find similar documents.
  35. Workflow [diagram]
      Elements shown: Documents, Dictionary, Document Vectors, TF-IDF, Model (unsupervised learning), Index
      Query input: Document ID or Keyword(s)
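To make the workflow concrete, here is a deliberately simplified sketch in plain Python: documents become TF-IDF vectors, the vectors form the index, and a keyword list or a whole document is scored against the index by cosine similarity. It leaves out the LSI step and does not use Gensim, which provides its own dictionary, TF-IDF, LSI and similarity-index classes; the function names and the smoothed IDF formula are assumptions made for this example.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Build a length-normalized TF-IDF vector (as a dict) for a token list."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """docs is a list of token lists; returns (idf weights, list of vectors)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return idf, [vectorize(doc, idf) for doc in docs]


def query(tokens, idf, index, min_similarity=0.0):
    """Return (document id, cosine similarity) pairs, best match first."""
    q = vectorize(tokens, idf)
    scores = []
    for doc_id, vec in enumerate(index):
        sim = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if sim > min_similarity:
            scores.append((doc_id, sim))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Querying with the token list of a stored document covers the "Document ID" case: that document matches itself with similarity 1, followed by its nearest neighbours.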
  36. LSI: Transfer documents into a multidimensional space!
      [plot: Documents over Dimension a and Dimension b]
      All documents lie in only a small subspace, but this subspace is still extremely high-dimensional. LSI reduces this subspace to its main components.
  37. LSI: Find the dimensions of biggest variance!
      [plot: Documents over Dimension a and Dimension b]
  38. LSI: Reduce the space to the dimensions that vary significantly!
      [plot: Documents]
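In formula form, the three LSI slides correspond to a truncated singular value decomposition. The notation below is the standard textbook one and is not taken from the slides: X is the term-document matrix (for example with TF-IDF weights), k is the number of dimensions that are kept, d_j is the column of document j, and q is a query vector in the same term space.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} d_j,
\qquad
\operatorname{sim}(q, d_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Keeping only the k largest singular values is the "reduce the space" step: documents and queries are projected onto the k strongest directions and compared there by cosine similarity.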
  39. Jobs & Scheduling (image: http://cdn.imore.com)
  40. Crawler (image: http://simple-article10.blogspot.de)
      • the crawler runs as an independent daemon: if the web crawler blows up, it will not affect the indexing server!
      • one crawler for each source (e.g. for Reddit)
      • has been running every 15 minutes for more than four months
      • collects more than 500 articles per week
      • 5,000 articles are approximately 60 MB
      • currently we have more than 13,300 articles!
  41. Training
      • a new search index is trained and built every night
      • the server is restarted every night
      • scheduled with Linux cron jobs
      [timeline: crawl every 15 min, learning every night]
  42. Public API (image: http://www.pnwphotos.com)
  43. Public API
      • a public JSON API to access all data (documents, queries and topics)
      • different parameters defined as URL options
      • our UI makes use of this API
      • clean architecture and clear dependencies between components (UI and server)
      • a lot of work is delegated to the client to unburden the server
      • documented on the website itself and easy to explore
      • open for more clients (e.g. a mobile version or a reporting tool)
  44. Public API [diagram]
      Client: Web UI (D3.js, JS); Server: Web API (flask, Python)
      (1) user input -> (2) URL call -> (3) JSON response -> (4) rendering
      Legend: Library, Module, Language
  45. Word Query
      • search for one or more words
      • most important for the user
      • parameters to specify dates, minimal similarity and preprocessing of the input query
      /search?query=facebook
      /search?query=facebook+home
      /search?query=facebook+home&min=0.7&from=2013-03-01&to=2013-04-30
  46. Similar Document Query
      • search for a specific document in our database
      • parameters to specify dates and minimal similarity
      /search?id=42
      /search?id=42&min=0.7&from=2013-03-01&to=2013-04-30
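A client could assemble these query URLs as in the sketch below. The path `/search` and the parameter names `query`, `id`, `min`, `from` and `to` are taken from the example URLs on the slides; the helper `build_search_url` and its argument names are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(query=None, doc_id=None, min_similarity=None,
                     date_from=None, date_to=None):
    """Build a relative search URL for either a word query or a document query."""
    if (query is None) == (doc_id is None):
        raise ValueError("pass exactly one of query or doc_id")
    params = {"query": query} if query is not None else {"id": doc_id}
    if min_similarity is not None:
        params["min"] = min_similarity
    if date_from is not None:
        params["from"] = date_from
    if date_to is not None:
        params["to"] = date_to
    # urlencode escapes values and turns spaces into "+", as in the slide examples.
    return "/search?" + urlencode(params)
```

For example, `build_search_url(query="facebook home")` yields `/search?query=facebook+home`.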
  47. Example Query
      • search with a sample link
      • interesting use case for the user
      • parameters to specify dates, minimal similarity and number of topics of the link
      • dangerous since we depend on external content
      /example?link=http://www.lastwordonnothing.com/2013/04/22/dumped-by-google/
      [diagram: 1: send link, 2: fetch (Web), 3: preprocessing, 4: query (Index)]
      Image 1 - http://icons.iconarchive.com, Image 2 - https://si0.twimg.com
  48. User Interface (image: http://2.bp.blogspot.com)
  49. Landing Page
      ◦ gain the attention of the user
      ◦ provide all basic information in a compact form
      ◦ call to action
  50. Search Page
      ◦ present search results in different forms (charts, lists, popovers)
      ◦ rich single-page application with URL location rewriting
      ◦ short load time since the actual page is loaded only once
  51. Other Pages
      ◦ additional information (e.g. what we do, who we are)
      ◦ documentation (e.g. our public JSON API)
  52. Behind the scenes
      ◦ built with HTML, CSS and JavaScript
      ◦ Bootstrap as CSS framework
      ◦ D3.js to draw charts with JSON data from our API
      [diagram, Client - Server: Search Page, JSON API, Index; (1) JSON calls, (2) JSON data, (3) rendering]
  53. Get ready for production (image: http://www.nasa.gov)
  54. Environment [architecture diagram, same elements as slide 19]
  55. Nginx
      ◦ built-in support for serving WSGI applications (via the uwsgi protocol)
      ◦ less complex and lower memory footprint than Apache
      ◦ handles gzip compression, serving static files, cache headers ...
  56. uWSGI
      ◦ application server
      ◦ very easy to install and configure
      ◦ good documentation
  57. Memcached
      ◦ the response for a unique URL won't change for 24 hours
      ◦ cache the complete HTTP response
      ◦ very easy to install and use with your application; speeds it up enormously if your data doesn't change
      ◦ the cache is cleared once a new model has been learned
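The caching policy described here (complete response per URL, valid for 24 hours, flushed after a new model is learned) can be sketched with an in-memory dictionary. The class `ResponseCache` is a hypothetical stand-in for Memcached and does not use a Memcached client; it only illustrates the policy.

```python
import time


class ResponseCache:
    """In-memory stand-in for Memcached: one cached response per URL."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry can be tested
        self._store = {}

    def get_or_compute(self, url, compute):
        """Return the cached response for url, computing it on a miss or expiry."""
        now = self.clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        response = compute(url)
        self._store[url] = (now, response)
        return response

    def clear(self):
        """Drop everything, e.g. after the nightly training built a new model."""
        self._store.clear()
```

Because the index only changes once per night, the same URL always produces the same response in between, which is what makes caching the whole response safe.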
  58. Performance Improvements
      ◦ pre-calculate everything
        ◦ document index, variance, ...
      ◦ no CPU-heavy routes
      ◦ cache everything possible
      ◦ try different configurations for your web server and application server
      ◦ measure, e.g. with Siege!
  59. Flask Assets
      ◦ minification and combination of CSS and JavaScript
      ◦ reduces file size and the number of HTTP requests
      ◦ done with Flask Assets (an extension for Flask)

      js = Bundle(
          'js/jquery.min.js',
          ...
          'js/techtrends.js',
          filters='jsmin',
          output='gen/packed.js'
      )
      css = Bundle(
          'css/jquery.dataTables.css',
          ...
          'css/bootstrap-responsive.min.css',
          output='gen/packed.css'
      )
      assets.register('js_all', js)
      assets.register('css_all', css)
  60. PageSpeed Insights
      ◦ no 100 % score because of third-party widgets (social buttons)
      ◦ minification
      ◦ browser caching
      ◦ gzip compression
  61. Problems & Learned Lessons (image: http://barrygruff.files.wordpress.com)
  62. What's popularity? (image: http://www.horusmedia.de)
      An article...
      ...is present (on Hackernews, Reddit etc.)
      ...stays (on Hackernews, Reddit etc. for a long time)
      ...has votes
      ...has comments
  63. It's a mathematical indicator!
      popularity = x * duration + y * votes + z * comments
      Weights: x = y = z = 1
  64. But math is always a problem...
      But how to compare these values?
      ◦ each value has its own average, variance and maximum
      ◦ Are Reddit votes better than those from Hackernews?
      ◦ statistical methods were applied to make all values comparable
      ◦ completely calculated in the database (SQL)
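One way to make duration, votes and comments comparable before weighting them is to standardize each value (z-score). The slides do not say which statistical method was used, and the real calculation ran in SQL, so the following Python sketch is only an assumption-laden illustration of the formula with x = y = z = 1; the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def popularity(durations, votes, comments, x=1.0, y=1.0, z=1.0):
    """popularity = x * duration + y * votes + z * comments, on standardized inputs."""
    d, v, c = standardize(durations), standardize(votes), standardize(comments)
    return [x * di + y * vi + z * ci for di, vi, ci in zip(d, v, c)]
```

Standardizing separately per source (once for Hackernews, once for Reddit) would address the question of whether votes from the two sites can be compared directly.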
  65. Visualization
  66. Respect robots.txt, or you get banned. (image: http://www.abakus-internet-marketing.de)
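Checking robots.txt before fetching a page is possible with Python's standard library alone. The sketch below parses robots.txt content that has already been downloaded; the helper name `make_robots_checker` and the user agent string `TechTrendsBot` are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="TechTrendsBot"):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

A crawler would build one checker per host and skip every URL for which it returns `False`.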
  67. "640 kB ought to be enough for anybody." - Bill Gates (maybe ;)) Well... (image: http://www.fastcoexist.com)
  68. You're never done. (image: http://www.monsterzeug.de)
  69. You never know...
      ◦ Google App Engine has a lot of limitations and things that you can't decide or control on your own!
      ◦ SKLearn is very low-level.
      ◦ D3.js is very low-level, and customizing charts to look good is sometimes time-consuming.
      ◦ Sometimes you will come across strange problems in the third-party libraries you use (e.g. the explicit "show" on the Bootstrap popover, which didn't work well).
      ◦ Gensim claims to have online learning, but, well, it doesn't...
  70. User tests (image: http://static.guim.co.uk)
      • we asked friends to beta-test our website
      • mostly very useful responses
      • but it feels strange when your work gets judged
      "Why is the page in English and why do I have to scroll on the front page?"
      "Do you really need JavaScript?"
      "The design is a little bit dark and gray."
      "What is the meaning of that little green arrow?"
      "Is the popular topics box related to my search?"
      ...
  71. Outcome (image: http://upload.wikimedia.org)
  72. Media Night
      A lot of interested people with a lot of interesting questions. We already had a meeting with somebody willing to use a customized version of TechTrends for his business.
  73. A lot of work...
      The first crawler and preprocessing were done very fast, but we made a lot of mistakes. We maintained the crawling and preprocessing during the whole project, for more than three months, to get it running in the end.
  74. ...but:
      • It's running!
      • We learned a lot:
        ▪ Data Mining (e.g. crawling)
        ▪ Machine Learning Algorithms (e.g. LDA/LSI)
        ▪ Web Architecture (e.g. Bootstrap)
        ▪ API Design (e.g. JSON)
        ▪ Performance (e.g. Caching)
        ▪ UI Design (e.g. user experience)
        ▪ Working as a team
  75. What's next? (image: http://www.beautys.de)
  76. Further ideas
      • modularisation with a plugin concept for crawlers
      • customization of TechTrends for different domains
      • a mobile version?! Some reporting tool?
      • suggest similar search queries to the user
      • let it run for a longer time
      Our project is available on Bitbucket, for everyone!
  77. Sources (image: http://blog.zdf.de)
  78. Libraries
      ◦ Gensim, http://radimrehurek.com/gensim/
      ◦ NLTK, http://nltk.org/
      ◦ Flask, http://flask.pocoo.org/
      ◦ Flask Assets, http://elsdoerfer.name/docs/flask-assets/
      ◦ D3.js, http://d3js.org/
      ◦ Bootstrap, http://twitter.github.io/bootstrap/
      ◦ DataTables, https://datatables.net/
      ◦ BeautifulSoup, http://www.crummy.com/software/BeautifulSoup/
      ◦ Requests, http://python-requests.org
  79. Algorithms
      ◦ Arc90, http://lab.arc90.com/2009/03/02/readability/
      ◦ LSI/LSA, http://de.wikipedia.org/wiki/Latente_Semantische_Analyse, http://radimrehurek.com/gensim/wiki.html
  80. Tools
      ◦ Boilerpipe, https://code.google.com/p/boilerpipe/
      ◦ Memcache, http://memcached.org/
      ◦ NGINX, http://nginx.com/
      ◦ SQLite, http://www.sqlite.org/