Estimating stock price correlations using Wikipedia

PyData Berlin 2016 talk

Description: Building an equities portfolio is a challenging task for a finance professional because it requires, among other inputs, estimates of future correlations between stock prices. As this data is not always available, in this talk I look at an alternative to using historical correlations as a proxy for future correlations: graph analysis techniques and text similarity measures based on Wikipedia data.


Delia Rusu

May 21, 2016


  1. 2.

    About Me
    • Chief Data Scientist @Knowsis (London, UK)
    • PhD, Natural Language Processing and Machine Learning
    • Interests: unstructured data & finance
  2. 5.

    Sources of Unstructured Data
    • Annual reports
    • Broker research
    • Conference call transcripts
    • Investor relations presentations
    • News and press releases
    • In-house content
  3. 7.
  4. 8.

    Wikipedia
    • 10 edits per second
    • the English Wikipedia
      • currently has over 5M articles
      • average of 800 new articles per day
    • the German Wikipedia (4th largest)
      • currently has over 1.9M articles
  5. 10.

    Word2Vec
    • Word2Vec model (Mikolov et al. 2013)
    • distributional hypothesis: words in similar contexts have similar meanings
    • shallow, 2-layer neural network
    • training objective: learn word vector representations which can predict nearby words (in context)
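The training objective on this slide can be written out for the skip-gram variant. The notation below follows Mikolov et al. (2013) rather than the slides: $w_1, \dots, w_T$ is the training sequence of words, $c$ is the size of the context window, $v_w$ and $v'_w$ are the input and output vector representations of word $w$, and $W$ is the vocabulary size. The model maximizes the average log probability of the words surrounding each position:

$$\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where the basic formulation defines the probability with a softmax over the vocabulary:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

In practice the full softmax is expensive for large vocabularies, which is why the same paper proposes cheaper approximations such as negative sampling.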
  6. 11.

    Word2Vec
    [Diagram: the two Word2Vec architectures]
    • Skip-gram model: input w(t); outputs w(t-2), w(t-1), w(t+1), w(t+2)
    • CBOW model: inputs w(t-2), w(t-1), w(t+1), w(t+2); output w(t)
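To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a window of two words on either side. It only illustrates the input/output layout; the function names, the window size and the lower-cased example sentence are assumptions made for this illustration, and no neural network is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word w(t) is the input, each neighbour is an output."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the neighbours are the input, the word w(t) is the output."""
    examples = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        examples.append((context, center))
    return examples


tokens = "the dax is a blue chip stock market index".split()
sg = skipgram_pairs(tokens)
cbow = cbow_examples(tokens)

print(sg[:4])   # first few (input, output) pairs for skip-gram
print(cbow[2])  # (context words, target word) for the word "is"
```

Words near the start or end of the sentence simply get a shorter context, which is one common way of handling boundaries.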
  7. 12.

    Word and Article Vectors
    • generate context vectors for:
      • words which appear in Wikipedia articles
      • Wikipedia articles themselves (topics)
    • model: wiki2vec
    • example:
      • original text: The DAX is a blue chip stock market index
      • annotated text: The DAX_ID is a blue_chip_ID stock_market_index_ID
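The annotation step in the example above, where mentions of Wikipedia articles are replaced by article tokens before training, might be sketched as follows. This is an illustration of the idea only, not wiki2vec's actual implementation: the phrase dictionary `ARTICLE_TOKENS` and the function `annotate` are hypothetical, and the real tool may derive the mapping differently, for example from the links in the article markup.

```python
import re

# Hypothetical mapping from surface phrases to article tokens.
# "stock market" is included only to show that the longest phrase wins.
ARTICLE_TOKENS = {
    "DAX": "DAX_ID",
    "blue chip": "blue_chip_ID",
    "stock market index": "stock_market_index_ID",
    "stock market": "stock_market_ID",
}


def annotate(text, article_tokens):
    """Replace every known phrase in `text` with its article token."""
    # Longer phrases come first in the alternation, so the regex engine
    # tries "stock market index" before "stock market".
    phrases = sorted(article_tokens, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    return pattern.sub(lambda m: article_tokens[m.group(1)], text)


sentence = "The DAX is a blue chip stock market index"
annotated = annotate(sentence, ARTICLE_TOKENS)
print(annotated)
```

After this step each article token is a single vocabulary item, so the model learns one vector per article alongside the vectors for ordinary words.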
  8. 13.

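    Vector Similarity
    [Diagram: the vectors for Deutsche Bank and BMW, separated by the angle θ]

    $$\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$$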
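The cosine similarity on this slide is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; the three-dimensional vectors are made up for illustration and are not taken from a trained model, where vectors typically have a few hundred dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
bmw = [1.0, 2.0, 2.0]
deutsche_bank = [2.0, 1.0, 2.0]

sim = cosine_similarity(bmw, deutsche_bank)
print(round(sim, 4))
```

Because the measure depends only on the angle, scaling either vector leaves the result unchanged.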
  9. 15.

    Wikipedia Graph
    • Types of nodes:
      • articles, categories
    • Types of directed edges:
      • hyperlinks from article to article
      • infobox links from article to article
      • links from article to category
      • links from category to category
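One way to hold such a graph in memory is an adjacency list that records a type for every node and every edge. The sketch below is an assumption about a convenient representation, not the structure used in the talk: the class `WikiGraph`, the edge-type labels, the category name "Companies of Germany" and all the example edges are invented for illustration and were not extracted from Wikipedia.

```python
from collections import defaultdict

NODE_TYPES = {"article", "category"}
EDGE_TYPES = {
    "hyperlink",             # article -> article
    "infobox",               # article -> article
    "article_to_category",   # article -> category
    "category_to_category",  # category -> category
}


class WikiGraph:
    """Directed graph with typed nodes and typed edges."""

    def __init__(self):
        self.node_type = {}                 # node name -> node type
        self.out_edges = defaultdict(list)  # source -> [(target, edge type)]

    def add_node(self, name, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.node_type[name] = node_type

    def add_edge(self, source, target, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        for name in (source, target):
            if name not in self.node_type:
                raise KeyError(f"unknown node: {name}")
        self.out_edges[source].append((target, edge_type))

    def neighbours(self, source, edge_type=None):
        """Targets reachable in one step, optionally filtered by edge type."""
        return [
            target
            for target, kind in self.out_edges.get(source, [])
            if edge_type is None or kind == edge_type
        ]


# Illustrative nodes and edges only.
graph = WikiGraph()
graph.add_node("BMW", "article")
graph.add_node("Deutsche Bank", "article")
graph.add_node("DAX", "article")
graph.add_node("Companies based in Frankfurt", "category")
graph.add_node("Companies of Germany", "category")

graph.add_edge("BMW", "DAX", "hyperlink")
graph.add_edge("Deutsche Bank", "DAX", "infobox")
graph.add_edge("Deutsche Bank", "Companies based in Frankfurt", "article_to_category")
graph.add_edge("Companies based in Frankfurt", "Companies of Germany", "category_to_category")

print(graph.neighbours("Deutsche Bank"))
print(graph.neighbours("Deutsche Bank", edge_type="article_to_category"))
```

Keeping the edge type on each edge makes it possible to compute similarity over a chosen subset of links, for instance hyperlinks only.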
  10. 16.
  11. 17.

    Graph-based Similarity
    • using the distance between nodes in the graph

      $$\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$$

    • using random walks on graphs
      • Personalized PageRank (Haveliwala, 2002), a variation of PageRank (Brin and Page, 1998)
      • the user query defines how important the node is, such that PageRank will prefer nodes in the vicinity of the query node.
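Both measures on this slide can be sketched on a small adjacency-list graph. The code below is a simplified illustration under several assumptions that the slide does not specify: distance is taken to be the number of edges on the shortest directed path, unreachable pairs get similarity 0, identical nodes get similarity 1, the damping factor is 0.85, the walk runs for a fixed 50 iterations, and rank held by nodes without outgoing links is returned to the query node. The example graph is invented.

```python
from collections import deque


def shortest_distance(graph, source, target):
    """Edges on the shortest directed path, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Similarity(v1, v2) = 1 / Distance(v1, v2)."""
    dist = shortest_distance(graph, v1, v2)
    if dist is None:
        return 0.0  # assumption: unreachable nodes are not similar at all
    if dist == 0:
        return 1.0  # assumption: a node is maximally similar to itself
    return 1.0 / dist


def personalized_pagerank(graph, query, damping=0.85, iterations=50):
    """Power iteration in which every restart jumps back to the query node."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {n: (1.0 if n == query else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a node without out-links returns its rank to the query.
                new_rank[query] += damping * rank[node]
        # Restart step: the total rank is 1, so this adds the teleport mass.
        new_rank[query] += 1.0 - damping
        rank = new_rank
    return rank


# Invented example graph: node -> list of nodes it links to.
graph = {
    "BMW": ["DAX", "Munich"],
    "Deutsche Bank": ["DAX", "Frankfurt"],
    "DAX": ["BMW", "Deutsche Bank"],
    "Munich": [],
    "Frankfurt": ["Deutsche Bank"],
}

dist_sim = distance_similarity(graph, "BMW", "Deutsche Bank")
ppr = personalized_pagerank(graph, "BMW")
print(dist_sim)
for node, score in sorted(ppr.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

The slide does not say how the Personalized PageRank scores are turned into a pairwise similarity. One option, offered here only as a possibility, is to run the walk once per company and compare the resulting score vectors.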
  12. 20.

    Wikipedia Similarity
    • using gensim's Word2Vec for model building and similarity computation
    • obtain pairwise similarity
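The pairwise step can be sketched without a trained model by starting from a dictionary of vectors. The two-dimensional vectors below are placeholders chosen so that the results are easy to check by hand, and the function names are hypothetical. With gensim, the vectors would instead come from the trained model (`model.wv[token]`), and a single pair can be scored with `model.wv.similarity(a, b)`.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def pairwise_similarity(vectors):
    """Return {(name_a, name_b): similarity} for every unordered pair."""
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Placeholder vectors, not learned from Wikipedia.
vectors = {
    "BMW": [1.0, 0.0],
    "DAX": [1.0, 1.0],
    "Deutsche Bank": [0.0, 1.0],
}

similarities = pairwise_similarity(vectors)
for (a, b), score in similarities.items():
    print(f"{a} / {b}: {score:.3f}")
```

For n companies this yields n(n-1)/2 scores, the same shape as the upper triangle of a correlation matrix, which makes the two easy to compare.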
  13. 29.
  14. 32.

    Remarks
    • Wikipedia, an unconventional source of unstructured data, has explanatory power for financial variables
    • Graph-based similarity
      • does not capture similarity so well within a specific topic
      • categories are too broad, e.g. "Companies based in Frankfurt"
    • Context-based similarity (Word2Vec)
      • more powerful measure, captures similarity between words, topics
    • A lot more to be gained from industry-specific documents
      • e.g. annual reports, conference call transcripts
  15. 33.

    Applications
    • predicting financial figures solely based on unstructured data or using hybrid models (unstructured + structured data)
    • estimating financial figures, ratios or statistical moments when relevant quantitative data is not available
      • e.g. the time preceding an IPO
  16. 34.

    Resources
    • Notebook for this presentation
    • Word2Vec and Wiki2Vec
      • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
      • Word2Vec code repository:
      • Word2Vec in gensim:
      • Wiki2Vec:
    • Graph-Based Similarity
      • Eneko Agirre, Ander Barrena, and Aitor Soroa. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. arXiv preprint arXiv:1503.01655, 2015.
      • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.