Connecting the Dots Between News Articles

Philipe Fatio
October 23, 2014

This is a presentation I gave on the paper "Connecting the Dots Between News Articles" for the seminar "Advanced Topics in Machine Learning" at ETH Zurich.

Transcript

  1. Connecting the Dots Between News Articles: a paper by Dafna Shahaf & Carlos Guestrin, Carnegie Mellon, 2010; presented by Philipe Fatio
  2. • information overload • easy to miss big picture • increasing need to present data in effective way
  3. • chronological and coherent chain of articles • discover hidden connections • better understanding of topic after reading the chain
  4. Algorithm [Figure: build-up showing the endpoints s and t and the chain d1 · · · dn]
  5. Clinton Microsoft EU Markets Palestinians Vote: Ex-Intern Testimony on Clinton; Judge Sides with Gov’t; The Next Microsoft; Palestinians Offer Bonds; Clinton Watches as Pales. Vote; Contest the Vote
  6. Clinton Lewinsky Impeachment Gore Vote: Ex-Intern Testimony on Clinton; Clinton Admits Lewinsky; Predicts Impeachment of Clinton; Clinton Impeached; Clinton’s Acquittal; Clinton Angered As Gore…; Election Draws Near; Contest the Vote
  7. chain only as strong as weakest link $\mathrm{Coherence}(d_1, \ldots, d_n) = \sum_{i=1}^{n-1} \sum_{w} \mathbb{1}(w \in d_i \cap d_{i+1})$
  8. influential words potentially missing from articles; some words more important than others $\mathrm{Coherence}(d_1, \ldots, d_n) = \min_{i=1 \ldots n-1} \sum_{w} \mathbb{1}(w \in d_i \cap d_{i+1})$
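The two word-overlap scores on slides 7 and 8 can be compared with a minimal Python sketch. It assumes each article is reduced to a set of words; the function names and the toy chain are made up for illustration and are not from the paper.

```python
from typing import List, Set


def _check(chain: List[Set[str]]) -> None:
    # A chain needs at least one transition d_i -> d_{i+1}.
    if len(chain) < 2:
        raise ValueError("a chain needs at least two articles")


def coherence_sum(chain: List[Set[str]]) -> int:
    """Slide 7: total number of words shared across all transitions."""
    _check(chain)
    return sum(len(chain[i] & chain[i + 1]) for i in range(len(chain) - 1))


def coherence_min(chain: List[Set[str]]) -> int:
    """Slide 8: number of shared words on the weakest transition."""
    _check(chain)
    return min(len(chain[i] & chain[i + 1]) for i in range(len(chain) - 1))


# Hypothetical toy chain: a strong first link, then a topic jump.
toy_chain = [
    {"clinton", "lewinsky"},
    {"clinton", "lewinsky", "impeachment"},
    {"gore", "vote"},
]

print(coherence_sum(toy_chain))  # 2: the strong first link hides the break
print(coherence_min(toy_chain))  # 0: the weakest link exposes it
```

The sum rewards the chain for one strong transition even though the second transition shares no words, which is the weakness the min-based score addresses.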
  9. d0: Judge Lance Ito lifted his ban on TV coverage of O.J. Simpson trial; d1: O.J. Simpson’s defense lawyers do not object to DNA evidence; d2: Winning three Super Bowls would be a historic accomplishment [Figure: words w (dna, opening, judge, defense, championship) with Influence(d0, d1 | w) and Influence(d0, d2 | w)]
  10. does not prevent jittery word activation patterns of topics $\mathrm{Coherence}(d_1, \ldots, d_n) = \min_{i=1 \ldots n-1} \sum_{w} \mathrm{Influence}(d_i, d_{i+1} \mid w)$
  11. activating all words everywhere leads to higher score $\mathrm{Coherence}(d_1, \ldots, d_n) = \max_{\text{activations}} \min_{i=1 \ldots n-1} \sum_{w} \mathrm{Influence}(d_i, d_{i+1} \mid w)\, \mathbb{1}(w \text{ active in } d_i, d_{i+1})$
  12. linear program • maximize weakest link, subject to: ‣ limit total number of active words ‣ limit number of active words per transition ‣ allow each word to be activated at most once
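One way to write the program on slide 12 explicitly is sketched below. All symbols are assumptions introduced here, not notation from the slides: $a_{w,i}$ is the activation level of word $w$ in transition $i$, $s_{w,i}$ marks that word $w$ is first activated at transition $i$, $m$ is the strength of the weakest link, and $k_{\text{total}}$, $k_{\text{trans}}$ are the two budgets.

```latex
\begin{align*}
\max_{a,\,s,\,m}\quad & m \\
\text{s.t.}\quad
 & m \le \sum_{w} a_{w,i}\,\mathrm{Influence}(d_i, d_{i+1} \mid w)
   && i = 1, \ldots, n-1 \\
 & \sum_{w}\sum_{i} s_{w,i} \le k_{\text{total}} \\
 & \sum_{w} a_{w,i} \le k_{\text{trans}}
   && i = 1, \ldots, n-1 \\
 & \sum_{i} s_{w,i} \le 1
   && \text{for every word } w \\
 & a_{w,i} \le a_{w,i-1} + s_{w,i}, \quad a_{w,0} = 0
   && \text{for every word } w,\ i = 1, \ldots, n-1 \\
 & 0 \le a_{w,i},\ s_{w,i} \le 1
\end{align*}
```

The linking constraint $a_{w,i} \le a_{w,i-1} + s_{w,i}$ is one assumed way to enforce "activated at most once": a word can be active in a transition only if it was already active in the previous one or is being activated now.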
  13. [Figure: bipartite graph of articles (Contest the Vote, The Next Microsoft, Judge Sides with Gov’t, Clinton admits Liaison) and words (Clinton, Judge, Microsoft, Gore)]
  14. [Figure: bipartite graph of articles (Contest the Vote, The Next Microsoft, Judge Sides with Gov’t, Clinton admits Liaison) and words (Clinton, Judge, Microsoft, Gore); edge weights 0.7, 0.2, 0.1, 0.6, 0.4]
  15. [Figure: bipartite graph of articles (Contest the Vote, The Next Microsoft, Judge Sides with Gov’t, Clinton admits Liaison) and words (Clinton, Judge, Microsoft, Gore); edge weights 0.7, 0.2, 0.1, 0.6, 0.4; normalized weights 0.6/(0.6+0.1) = 0.86, 0.7/0.7 = 1, 0.1/(0.6+0.1) = 0.14]
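The normalization arithmetic on slide 15 can be reproduced with a short Python sketch: each node's outgoing edge weights are divided by their sum so that they form transition probabilities. Which edge connects which article and word is not recoverable from the transcript, so the neighbour labels `a`, `b`, `c` below are placeholders.

```python
from typing import Dict


def normalize_out_edges(weights: Dict[str, float]) -> Dict[str, float]:
    """Turn one node's outgoing edge weights into transition probabilities."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("node has no outgoing weight to normalize")
    return {neighbour: w / total for neighbour, w in weights.items()}


# A node with two outgoing edges of weight 0.6 and 0.1 (placeholder labels).
two_edges = normalize_out_edges({"a": 0.6, "b": 0.1})
# A node with a single outgoing edge of weight 0.7.
one_edge = normalize_out_edges({"c": 0.7})

print(round(two_edges["a"], 2))  # 0.86
print(round(two_edges["b"], 2))  # 0.14
print(one_edge["c"])             # 1.0
```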
  16. short random walks • $P_i(v)$: fraction of time we land on v starting from i • $P_i^w(v)$: same but treat w as sink node
  17. [Figure: graph over d1, w1, d2, w2 with edge weights 0.7, 0.8, 0.2, 0.1] $P_1(d_2)$: large; $P_1^{w_1}(d_2)$: still large; $P_1^{w_2}(d_2)$: small
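The sink-node idea on slides 16 and 17 can be illustrated with a small Python sketch. It makes several assumptions that the slides do not spell out: "short" walks are modelled by restarting at the start node with a fixed probability, influence is taken as the drop $P_i(v) - P_i^w(v)$ caused by turning word w into a sink, and the assignment of the four weights to edges is invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]


def build_graph(edges: List[Tuple[str, str, float]]) -> Graph:
    """Undirected weighted graph stored as a symmetric adjacency map."""
    graph: Graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def visit_distribution(graph: Graph, start: str, sink: Optional[str] = None,
                       restart: float = 0.25, iters: int = 200) -> Dict[str, float]:
    """Fraction of time a restarting random walk from `start` spends at each node.

    If `sink` is given, the walk is absorbed there: its outgoing mass is dropped.
    """
    pi = {v: 0.0 for v in graph}
    pi[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        nxt[start] += restart  # restart keeps the walks short
        for u, out in graph.items():
            if u == sink:
                continue  # sink node: nothing flows out
            total = sum(out.values())
            for v, w in out.items():
                nxt[v] += (1.0 - restart) * pi[u] * w / total
        pi = nxt
    return pi


def influence(graph: Graph, d_i: str, d_j: str, word: str) -> float:
    """Assumed definition: how much reaching d_j from d_i depends on `word`."""
    base = visit_distribution(graph, d_i)[d_j]
    blocked = visit_distribution(graph, d_i, sink=word)[d_j]
    return base - blocked


# Hypothetical edge assignment using the four weights shown on slide 17.
toy = build_graph([("d1", "w1", 0.2), ("d1", "w2", 0.7),
                   ("d2", "w1", 0.1), ("d2", "w2", 0.8)])
via_w1 = influence(toy, "d1", "d2", "w1")
via_w2 = influence(toy, "d1", "d2", "w2")
print(via_w1 < via_w2)  # True: most of the d1 -> d2 traffic passes through w2
```

With this toy assignment, blocking w1 barely changes how often the walk reaches d2, while blocking w2 removes most of the connection, so w2 carries most of the influence.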
  18. finding a good chain • use local search combined with scoring function • tendency to get stuck in local optimum • jointly optimize over words and chains
  19. finding a good chain • add chain restrictions to linear program • objective: maximize strength of weakest link • yields best next node for each node
  20. scaling up • select subset of documents ‣ documents similar to chain ends ‣ documents reached in random walk on bipartite graph • speed up influence calculation
  21. user study • > 500’000 news articles • select initial subset of 500 - 1000 articles • 18 users evaluate resulting chains
  22. chain effectiveness [Bar chart: improvement in familiarity (0 to 1) per topic (Elections, Afghanistan, Lewinsky, OJ, Enron) for connecting the dots, Google News Timeline, shortest-path, event threading]
  23. chain properties [Bar charts, panels Simple and Complex: fraction of times preferred (0 to 1) on Relevance, Coherence, Non-Redundancy for connecting the dots, Google News Timeline, shortest-path, event threading]
  24. + novel approach + applicable to other domains + incorporate user interactions and preferences – time performance of algorithm not covered – fixed endpoints required
  25. • formalizing story coherence • formalizing influence without link structure • provide algorithm for connecting two fixed endpoints while maximizing chain coherence • allow user interaction