Estimating stock price correlations using Wikipedia

PyData Berlin 2016 talk

Description: Building an equities portfolio is a challenging task for a finance professional because it requires, among other inputs, estimates of future correlations between stock prices. As this data is not always available, in this talk I look at an alternative to historical correlations as a proxy for future correlations: using graph analysis techniques and text similarity measures based on Wikipedia data.


Delia Rusu

May 21, 2016


  1. Estimating stock price correlations using Wikipedia Delia Rusu

  2. About Me • Chief Data Scientist @Knowsis (London, UK) • PhD, Natural Language Processing and Machine Learning • Interests: unstructured data & finance
  3. This Talk

  4. Financial Data

  5. Sources of Unstructured Data • Annual reports • Broker research • Conference call transcripts • Investor relations presentations • News and press releases • In-house content
  6. What about unconventional datasets?

  7. None
  8. Wikipedia • 10 edits per second • the English Wikipedia • currently has over 5M articles • average of 800 new articles per day • the German Wikipedia (4th largest) • currently has over 1.9M articles
  9. Structured Views of Wikipedia - Context Vectors • … DAX German stock market index
  10. Word2Vec • Word2Vec model (Mikolov et al. 2013) • distributional hypothesis: words in similar contexts have similar meanings • shallow, 2-layer neural network • training objective - learn word vector representations which can predict nearby words (in context)
  11. Word2Vec (architecture diagram) • CBOW model: input w(t-2), w(t-1), w(t+1), w(t+2), output w(t) • Skip-gram model: input w(t), output w(t-2), w(t-1), w(t+1), w(t+2)
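To make the two architectures concrete, here is a minimal sketch of how training examples could be drawn from a sentence with a context window of 2, matching the w(t-2) … w(t+2) notation on the slide. The function names and the lowercased example sentence are illustrative assumptions; real Word2Vec implementations add subsampling, negative sampling and other details not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word w(t) is used to predict every context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the context words together are used to predict the center word w(t)."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Illustrative sentence (assumption: simple whitespace tokenization).
sentence = "the dax is a blue chip stock market index".split()
pairs = skipgram_pairs(sentence, window=2)
examples = cbow_examples(sentence, window=2)
```

Both functions see the same windows; they differ only in which side of the window is treated as input and which as the prediction target.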
  12. Word and Article Vectors • generate context vectors for: • words which appear in Wikipedia articles • Wikipedia articles themselves (topics) • model: wiki2vec • example: "The DAX is a blue chip stock market index" → "The DAX_ID is a blue_chip_ID stock_market_index_ID"
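The article-token rewriting in the example can be sketched as a simple phrase replacement. This is an assumption-laden illustration, not wiki2vec's actual pipeline: `mark_article_mentions` and the hand-written `anchors` list are hypothetical, whereas a real pipeline would derive the linked phrases from Wikipedia's own hyperlink markup.

```python
import re


def mark_article_mentions(text, anchors):
    """Replace each anchor phrase with a single token ending in _ID."""
    # Longest phrases first, so a long phrase is not split by a shorter one.
    for phrase in sorted(anchors, key=len, reverse=True):
        token = phrase.replace(" ", "_") + "_ID"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, token, text)
    return text


sentence = "The DAX is a blue chip stock market index"
anchors = ["DAX", "blue chip", "stock market index"]  # hypothetical link targets
marked = mark_article_mentions(sentence, anchors)
```

After this step each linked article is one token, so a Word2Vec-style model trained on the rewritten text learns a vector for the article itself alongside the vectors for ordinary words.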
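  13. Vector Similarity • θ is the angle between the Deutsche Bank and BMW vectors • $\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$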
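The cosine formula translates directly into a few lines of code. The three-dimensional vectors below are made up for illustration; actual article vectors would come from the trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not real embeddings.
bmw = [1.0, 2.0, 0.0]
deutsche_bank = [2.0, 1.0, 0.0]
sim = cosine_similarity(bmw, deutsche_bank)
```

The result lies between -1 and 1 and depends only on the direction of the vectors, not on their length.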
  14. Structured Views of Wikipedia - Graph • (graph diagram) DAX, blue chip, stock market index, XETRA, … …
  15. Wikipedia Graph • Types of nodes: • articles, categories • Types of directed edges: • hyperlinks from article to article • infobox links from article to article • links from article to category • links from category to category
  16. None
  17. Graph-based Similarity • using the distance between nodes in the graph: $\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$ • using random walks on graphs • Personalized PageRank (Haveliwala, 2002), a variation of PageRank (Brin and Page, 1998) • the user query defines how important each node is, so that PageRank prefers nodes in the vicinity of the query node
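Both graph-based measures can be sketched on a toy directed graph. Everything below is an illustrative assumption: the edges are hand-picked rather than taken from Wikipedia, the damping factor 0.85 is a conventional default, and sending the rank of dangling nodes back to the query node is one common convention among several.

```python
from collections import deque


def shortest_distance(graph, source, target):
    """Number of edges on the shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Similarity(v1, v2) = 1 / Distance(v1, v2); 0.0 if v2 is unreachable."""
    dist = shortest_distance(graph, v1, v2)
    if dist is None:
        return 0.0
    # Convention (assumption): a node is maximally similar to itself.
    return 1.0 if dist == 0 else 1.0 / dist


def personalized_pagerank(graph, query, damping=0.85, iterations=100):
    """Power iteration in which every teleport jumps back to the query node."""
    rank = {node: (1.0 if node == query else 0.0) for node in graph}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: return its rank to the query node.
                new_rank[query] += damping * rank[node]
        new_rank[query] += 1.0 - damping  # teleport mass
        rank = new_rank
    return rank


# Toy graph with hypothetical edges (not real Wikipedia links).
graph = {
    "BMW": ["DAX", "Automotive industry"],
    "Daimler": ["DAX", "Automotive industry"],
    "Deutsche Bank": ["DAX", "Banking"],
    "DAX": ["BMW", "Daimler", "Deutsche Bank"],
    "Automotive industry": ["BMW", "Daimler"],
    "Banking": ["Deutsche Bank"],
}
scores = personalized_pagerank(graph, "BMW")
```

In this toy graph the shortest-path measure gives Daimler and Deutsche Bank the same similarity to BMW, because both are two hops away through the DAX node, while the random walk ranks Daimler higher because it is reachable through more paths.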
  18. DAX Analysis

  19. DAX Dataset • 30 companies part of the index …

  20. Wikipedia Similarity • using gensim's Word2Vec for model building and similarity computation • obtain pairwise similarity
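The pairwise step can be sketched without any library, given one vector per company. The vectors below are made-up placeholders; in the talk's setup they would come from the trained gensim model (for example via `model.wv[...]`, or the similarity could be read directly with `model.wv.similarity`), and the exact article tokens used are not given in the slides.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarity(vectors):
    """Map each unordered pair of names to the cosine similarity of their vectors."""
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Placeholder vectors (assumption), standing in for learned article vectors.
vectors = {
    "BMW": [0.9, 0.1, 0.3],
    "Daimler": [0.8, 0.2, 0.4],
    "Deutsche Bank": [0.1, 0.9, 0.2],
}
sims = pairwise_similarity(vectors)
```

For the 30 companies in the index this yields 30 × 29 / 2 = 435 unordered pairs.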
  21. Pricing Data

  22. Returns • Obtain daily returns for each DAX company
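A minimal sketch of the return calculation, assuming simple (arithmetic) daily returns computed from a list of closing prices; the slides do not say whether simple or log returns were used, and the prices below are invented.

```python
def daily_returns(prices):
    """Simple returns: (price_today - price_yesterday) / price_yesterday."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]


# Invented closing prices for three consecutive trading days.
closes = [100.0, 102.0, 99.96]
returns = daily_returns(closes)
```

A series of n prices produces n - 1 returns, so the return series of two companies line up only if their price series cover the same trading days.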

  23. Correlating Daily Returns • Calculate the correlation for each possible
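The correlation step can be sketched as follows, assuming the Pearson correlation coefficient and return series of equal length; the slides do not name the correlation measure, and the tickers and numbers below are placeholders.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def pairwise_correlations(returns_by_company):
    """Correlation of daily returns for every unordered pair of companies."""
    return {
        (a, b): pearson(returns_by_company[a], returns_by_company[b])
        for a, b in combinations(sorted(returns_by_company), 2)
    }


# Placeholder return series (assumption), one list per company.
returns_by_company = {
    "A": [0.01, -0.02, 0.03, 0.00],
    "B": [0.02, -0.04, 0.06, 0.00],
    "C": [-0.01, 0.02, -0.03, 0.00],
}
correlations = pairwise_correlations(returns_by_company)
```

Using the same sorted pair keys as for the similarity scores makes it straightforward to line up one similarity value with one return correlation per company pair.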

  24. Does the Wikipedia Similarity Explain Correlation of Returns?

  25. Not Really… R2 = 0.091
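An R² of this kind can be computed as below, assuming an ordinary least-squares line with the Wikipedia similarity of each company pair as x and the correlation of their returns as y. The slides do not state how the fit was done, so this is one plausible reading, and the data points are invented to keep the arithmetic easy to check.

```python
def r_squared(x, y):
    """R^2 of a least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Invented points: x plays the role of similarity, y of return correlation.
similarity = [0.0, 1.0, 2.0, 3.0]
correlation = [0.0, 1.0, 0.0, 1.0]
fit_quality = r_squared(similarity, correlation)
```

With a single explanatory variable, R² equals the squared Pearson correlation between x and y, so a value near 0.09 corresponds to a correlation of roughly 0.3 in absolute value.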

  26. What Happened? R2 = 0.091

  27. What Happened? R2 = 0.091

  28. What Happened? R2 = 0.091 Commerzbank!

  29. None
  30. Check R2 = 0.218

  31. Using Adjusted Close Returns R2 = 0.166

  32. Remarks • Wikipedia, an unconventional source of unstructured data, has explanatory power for financial variables • Graph-based similarity • does not capture similarity so well within a specific topic • categories are too broad - e.g. "Companies based in Frankfurt" • Context-based similarity (Word2Vec) • more powerful measure, captures similarity between words, topics • A lot more to be gained from industry-specific documents • e.g. annual reports, conference call transcripts
  33. Applications • predicting financial figures solely based on unstructured data or using hybrid models (unstructured + structured data) • estimating financial figures, ratios or statistical moments when relevant quantitative data is not available • e.g. the time preceding an IPO
  34. Resources • Notebook for this presentation • Word2Vec and Wiki2Vec • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013. • Word2Vec code repository • Word2Vec in gensim • Wiki2Vec • Graph-Based Similarity • Eneko Agirre, Ander Barrena, and Aitor Soroa. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. arXiv preprint arXiv:1503.01655, 2015. • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.
  35. Thank You • Questions?