Estimating stock price correlations using Wikipedia

PyData Berlin 2016 talk

Description: Building an equities portfolio is a challenging task for a finance professional, as it requires, among other things, future correlations between stock prices. As this data is not always available, in this talk I look at an alternative to historical correlations as a proxy for future correlations: using graph analysis techniques and text similarity measures based on Wikipedia data.


Delia Rusu

May 21, 2016


  1. Estimating stock price correlations using Wikipedia Delia Rusu

  2. About Me • Chief Data Scientist @Knowsis (London, UK) •

    PhD, Natural Language Processing and Machine Learning • Interests: unstructured data & finance • http://deliarusu.github.io/
  3. This Talk

  4. Financial Data

  5. Sources of Unstructured Data • Annual reports • Broker research

    • Conference call transcripts • Investor relations presentations • News and press releases • In-house content
  6. What about unconventional datasets?

  7. None
  8. Wikipedia • 10 edits per second • the English Wikipedia

    • currently has over 5M articles • average of 800 new articles per day • the German Wikipedia (4th largest) • currently has over 1.9M articles
  9. Structured Views of Wikipedia - Context Vectors … DAX German

    stock market index
  10. Word2Vec • Word2Vec model (Mikolov et al. 2013) • distributional

    hypothesis: words in similar contexts have similar meanings • shallow, 2-layer neural network • training objective - learn word vector representations which can predict nearby words (in context)
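To make the training objective concrete, here is a minimal sketch of how (center word, nearby word) training pairs could be generated for a skip-gram style model. The function name `skipgram_pairs` and the window size of 2 are illustrative assumptions, not taken from the talk; a real Word2Vec implementation adds subsampling, negative sampling and the network itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours.

    Hypothetical helper for illustration: each pair is one training
    example in which the model learns to predict `context` from `center`.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the dax is a blue chip stock market index".split()
pairs = skipgram_pairs(sentence, window=2)
```

With a window of 2, the word "dax" is paired with "the", "is" and "a", but not with "blue", which lies three positions away.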
  11. Word2Vec: Skip-gram model and CBOW model

    [Figure: the two model architectures. CBOW predicts the word w(t) from its context words w(t-2), w(t-1), w(t+1), w(t+2); Skip-gram predicts the context words from w(t).]
  12. Word and Article Vectors • generate context vectors for: •

    words which appear in Wikipedia articles • Wikipedia articles themselves (topics) • model: wiki2vec • https://github.com/idio/wiki2vec
    Example: "The DAX is a blue chip stock market index" becomes "The DAX_ID is a blue_chip_ID stock_market_index_ID"
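The rewriting of a sentence into article tokens can be sketched as follows. This is only an illustration under stated assumptions: the `mentions` dictionary and the `annotate` helper are hypothetical, and plain string matching stands in for the link information that a tool like wiki2vec would read from the Wikipedia markup itself.

```python
def annotate(tokens, mentions, max_len=3):
    """Replace known phrases with article tokens, longest match first.

    `mentions` maps a surface phrase to a (hypothetical) article token.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in mentions:
                out.append(mentions[phrase])
                i += n
                break
        else:
            # No phrase matched: keep the plain word.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical mapping, mirroring the example on the slide.
mentions = {
    "DAX": "DAX_ID",
    "blue chip": "blue_chip_ID",
    "stock market index": "stock_market_index_ID",
}
annotated = annotate("The DAX is a blue chip stock market index".split(), mentions)
```

Training on text annotated this way yields vectors for ordinary words and for article tokens in the same space.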
  13. Vector Similarity: Deutsche Bank and BMW

    $\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$
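A minimal sketch of the cosine similarity between two article vectors follows. The three-dimensional vectors are made-up toy values chosen so the arithmetic is easy to follow; real article vectors would come from the trained model and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only, not real embeddings.
bmw = [1.0, 2.0, 2.0]
deutsche_bank = [2.0, 1.0, 2.0]
sim = cosine_similarity(bmw, deutsche_bank)
```

Here the dot product is 8 and both norms are 3, so the similarity is 8/9.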
  14. Structured Views of Wikipedia - Graph DAX blue chip stock

    market index XETRA … …
  15. Wikipedia Graph • Types of nodes: • articles, categories •

    Types of directed edges: • hyperlinks from article to article • infobox links from article to article • links from article to category • links from category to category
  16. None
  17. Graph-based Similarity • using the distance between nodes in the

    graph • using random walks on graphs • Personalized PageRank (Haveliwala, 2002), a variation of PageRank (Brin and Page, 1998) • the user query defines how important the node is, such that PageRank will prefer nodes in the vicinity of the query node.
    $\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$
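Both graph-based measures can be sketched on a toy graph. The assumptions here are mine, not the talk's: distance is taken as the number of hops on the shortest directed path, Personalized PageRank is computed by plain power iteration with all teleport mass returned to the query node, and the edges in `graph` are hypothetical.

```python
import math
from collections import deque


def shortest_distance(graph, source, target):
    """Hop count of the shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Similarity = 1 / Distance; 0.0 if unreachable, inf for identical nodes."""
    d = shortest_distance(graph, v1, v2)
    if d is None:
        return 0.0
    return math.inf if d == 0 else 1.0 / d


def personalized_pagerank(graph, query, damping=0.85, iterations=50):
    """Power iteration in which every random jump lands on the query node."""
    nodes = set(graph) | {m for targets in graph.values() for m in targets}
    rank = {n: 1.0 if n == query else 0.0 for n in nodes}
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # Dangling node: send its mass back to the query node.
                new[query] += damping * rank[n]
        new[query] += 1.0 - damping  # teleport mass (ranks sum to 1)
        rank = new
    return rank


# Hypothetical toy graph: node -> list of nodes it links to.
graph = {
    "DAX": ["BMW", "Deutsche Bank"],
    "BMW": ["DAX"],
    "Deutsche Bank": ["DAX", "Frankfurt"],
    "Frankfurt": [],
}
sim = distance_similarity(graph, "BMW", "Deutsche Bank")
ppr = personalized_pagerank(graph, "BMW")
```

In this toy graph BMW reaches Deutsche Bank in two hops, so the distance-based similarity is 0.5, while the PageRank scores concentrate on nodes close to the query node BMW.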
  18. DAX Analysis

  19. DAX Dataset • 30 companies part of the index …

  20. Wikipedia Similarity • using gensim's Word2Vec for model building and

    similarity computation • obtain pairwise similarity
  21. Pricing Data

  22. Returns • Obtain daily returns for each DAX company
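As a minimal sketch, daily returns can be derived from a series of closing prices as shown below. This assumes simple (not logarithmic) returns, which the slide does not specify, and the prices are made-up values.

```python
def daily_returns(prices):
    """Simple returns r_t = p_t / p_(t-1) - 1 for consecutive prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Made-up closing prices for illustration.
closes = [100.0, 110.0, 99.0]
returns = daily_returns(closes)
```

A series of n prices yields n - 1 returns; here a 10% gain is followed by a 10% loss.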

  23. Correlating Daily Returns • Calculate the correlation for each possible
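The correlation step might look like the sketch below, assuming it refers to the Pearson correlation of daily returns for each pair of companies. The tickers and return values are made up, and `pairwise_correlations` is a hypothetical helper name.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long series with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(returns_by_ticker):
    """Correlation for every unordered pair of tickers."""
    return {
        (a, b): pearson(returns_by_ticker[a], returns_by_ticker[b])
        for a, b in combinations(sorted(returns_by_ticker), 2)
    }


# Made-up daily returns for three hypothetical tickers.
returns_by_ticker = {
    "A": [0.01, 0.02, -0.01],
    "B": [0.02, 0.04, -0.02],
    "C": [-0.01, -0.02, 0.01],
}
correlations = pairwise_correlations(returns_by_ticker)
```

For the 30 companies in the index, this produces 30 × 29 / 2 = 435 pairs.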

  24. Does the Wikipedia Similarity Explain Correlation of Returns?

  25. Not Really… R² = 0.091
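An R² value of this kind could be obtained by regressing the return correlation of each company pair on its Wikipedia similarity. The sketch below assumes an ordinary least squares fit with a single predictor, which the slides do not confirm, and uses made-up numbers rather than the talk's data.

```python
def r_squared(x, y):
    """R² of the least squares line y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Made-up values: one entry per company pair.
similarity = [0.1, 0.2, 0.3, 0.4]
correlation = [0.1, 0.3, 0.2, 0.4]
r2 = r_squared(similarity, correlation)
```

R² is the share of the variance in the correlations that the fitted line accounts for; with a single predictor it equals the squared Pearson correlation between the two series.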

  26. What Happened? R² = 0.091

  27. What Happened? R² = 0.091

  28. What Happened? R² = 0.091 Commerzbank!

  29. None
  30. Check R² = 0.218

  31. Using Adjusted Close Returns R² = 0.166

  32. Remarks • Wikipedia, an unconventional source of unstructured data, has

    explanatory power for financial variables • Graph-based similarity • does not capture similarity so well within a specific topic • categories are too broad - e.g. "Companies based in Frankfurt" • Context-based similarity (Word2Vec) • more powerful measure, captures similarity between words, topics • A lot more to be gained from industry-specific documents • e.g. annual reports, conference call transcripts
  33. Applications • predicting financial figures solely based on unstructured data

    or using hybrid models (unstructured + structured data) • estimating financial figures, ratios or statistical moments when relevant quantitative data is not available • e.g. the time preceding an IPO
  34. Resources • Notebook for this presentation • https://github.com/deliarusu/wikipedia-correlation • Word2Vec

    and Wiki2Vec • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013. • Word2Vec code repository: https://code.google.com/archive/p/word2vec/ • Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html • Wiki2Vec: https://github.com/idio/wiki2vec • Graph-Based Similarity • Agirre, Eneko, Ander Barrena, and Aitor Soroa. "Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation." arXiv preprint arXiv:1503.01655, 2015. • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.
  35. Thank You • Questions?