Estimating stock price correlations using Wikipedia

PyData Berlin 2016 talk

Description: Building an equities portfolio is a challenging task for a finance professional because it requires, among other inputs, estimates of future correlations between stock prices. As this data is not always available, in this talk I look at an alternative to historical correlations as a proxy for future correlations: using graph analysis techniques and text similarity measures based on Wikipedia data.

Delia Rusu

May 21, 2016

Transcript

  1. Estimating stock price
    correlations using
    Wikipedia
    Delia Rusu

  2. About Me
    • Chief Data Scientist @Knowsis (London, UK)
    • PhD, Natural Language Processing and Machine
    Learning
    • Interests: unstructured data & finance
    • http://deliarusu.github.io/

  3. Financial Data

  4. Sources of Unstructured
    Data
    • Annual reports
    • Broker research
    • Conference call transcripts
    • Investor relations presentations
    • News and press releases
    • In-house content

  5. What about
    unconventional datasets?

  6. Wikipedia
    • 10 edits per second
    • the English Wikipedia
      • currently has over 5M articles
      • average of 800 new articles per day
    • the German Wikipedia (4th largest)
      • currently has over 1.9M articles

  7. Structured Views of
    Wikipedia - Context Vectors

    [Figure: "DAX" with the context words "German" and "stock market index"]

  8. Word2Vec
    • Word2Vec model (Mikolov et al. 2013)
    • distributional hypothesis: words in similar contexts
    have similar meanings
    • shallow, 2-layer neural network
    • training objective - learn word vector
    representations which can predict nearby words (in
    context)

  9. Word2Vec
    [Figure: the two Word2Vec architectures]
    • Skip-gram model: input w(t); output w(t-2), w(t-1), w(t+1), w(t+2)
    • CBOW model: input w(t-2), w(t-1), w(t+1), w(t+2); output w(t)
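
To make the two architectures concrete, here is a minimal sketch of how training examples could be drawn from a tokenised sentence. The function names and the window size of 2 are illustrative assumptions (the window matches the w(t-2)…w(t+2) range in the figure); this is not the reference Word2Vec implementation, which also involves subsampling and negative sampling or hierarchical softmax.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: the centre word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: the neighbours predict the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Toy sentence, lower-cased and split on whitespace (an assumed tokenisation).
sentence = "the dax is a blue chip stock market index".split()
pairs = skipgram_pairs(sentence, window=2)
examples = cbow_examples(sentence, window=2)
```

Both functions walk the same sliding window; the only difference is which side of the window is treated as the input and which as the prediction target.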

  10. Word and Article Vectors
    • generate context vectors for:
      • words which appear in Wikipedia articles
      • Wikipedia articles themselves (topics)
    • model: wiki2vec
      • https://github.com/idio/wiki2vec
    • example: "The DAX is a blue chip stock market index" becomes
      "The DAX_ID is a blue_chip_ID stock_market_index_ID"
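
The sentence pair above suggests that linked phrases are replaced by single topic tokens before training, so that articles get their own vectors. A rough sketch of that idea follows; the helper `annotate_topics`, the list of linked phrases and the `_ID` suffix handling are assumptions made for illustration and do not reproduce wiki2vec's actual pipeline.

```python
import re


def annotate_topics(sentence, linked_phrases, suffix="_ID"):
    """Replace each linked phrase with one underscore-joined topic token."""
    # Longer phrases go first so that a multi-word phrase is not split
    # by a shorter phrase it happens to contain.
    for phrase in sorted(linked_phrases, key=len, reverse=True):
        token = phrase.replace(" ", "_") + suffix
        pattern = r"\b" + re.escape(phrase) + r"\b"
        sentence = re.sub(pattern, token, sentence)
    return sentence


text = "The DAX is a blue chip stock market index"
annotated = annotate_topics(text, ["DAX", "blue chip", "stock market index"])
```

After this step, a tokeniser that splits on whitespace treats `stock_market_index_ID` as one vocabulary item, which is what lets the model learn a vector for the article rather than for its individual words.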

  11. Vector Similarity
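    [Figure: the vectors for "Deutsche Bank" and "BMW", separated by the angle θ]

    $\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$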
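
As a small illustration of the formula, the cosine similarity of two vectors can be computed as below. The three-dimensional vectors are made-up toy values, not real embeddings for either company.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values for illustration only).
bmw = [1.0, 2.0, 0.0]
deutsche_bank = [2.0, 1.0, 0.0]
similarity = cosine_similarity(bmw, deutsche_bank)  # approximately 0.8
```

Because the measure depends only on the angle between the vectors, scaling either vector leaves the similarity unchanged.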

  12. Structured Views of
    Wikipedia - Graph
    [Figure: graph with the nodes "DAX", "blue chip", "stock market index" and "XETRA"]

  13. Wikipedia Graph
    • Types of nodes:
      • articles, categories
    • Types of directed edges:
      • hyperlinks from article to article
      • infobox links from article to article
      • links from article to category
      • links from category to category

  14. Graph-based Similarity
    • using the distance between nodes in the graph
      • $\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$
    • using random walks on graphs
      • Personalized PageRank (Haveliwala, 2002), a
        variation of PageRank (Brin and Page, 1998)
      • the user query defines how important the node is,
        such that PageRank will prefer nodes in the vicinity
        of the query node.
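
Both graph measures can be sketched on a tiny directed graph. The sketch below assumes that distance means the number of hops on the shortest directed path, that similarity is the reciprocal of that distance, and that Personalized PageRank teleports only to the query node. The edges in the toy graph, the damping factor of 0.85 and the iteration count are illustrative choices, not values from the talk.

```python
from collections import deque


def shortest_path_distance(graph, source, target):
    """Breadth-first search hop count from source to target; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Similarity as the reciprocal of the shortest-path distance."""
    dist = shortest_path_distance(graph, v1, v2)
    if dist is None:
        return 0.0  # assumption: unreachable nodes count as not similar
    if dist == 0:
        return 1.0  # assumption: a node is maximally similar to itself
    return 1.0 / dist


def personalized_pagerank(graph, query, damping=0.85, iterations=100):
    """Power iteration in which every teleport jumps back to the query node."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    rank = {n: (1.0 if n == query else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node in nodes:
            out = graph.get(node, ())
            if out:
                share = damping * rank[node] / len(out)
                for t in out:
                    new_rank[t] += share
            else:
                # Dangling node: hand its mass back to the query node.
                new_rank[query] += damping * rank[node]
        new_rank[query] += 1.0 - damping  # teleport mass (ranks sum to 1)
        rank = new_rank
    return rank


# Hypothetical edges, loosely modelled on the nodes in the graph figure.
graph = {
    "DAX": ["Stock market index", "XETRA"],
    "XETRA": ["Frankfurt Stock Exchange"],
    "Stock market index": ["Blue chip"],
    "Blue chip": [],
    "Frankfurt Stock Exchange": ["DAX"],
}
sim = distance_similarity(graph, "DAX", "Blue chip")
ppr = personalized_pagerank(graph, "DAX")
```

The resulting scores sum to 1 and concentrate on the query node and its neighbourhood, which is the "prefers nodes in the vicinity of the query node" behaviour described above. For larger graphs a library implementation would be the more practical choice.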

  15. DAX Analysis

  16. DAX Dataset
    • 30 companies are part of the index

  17. Wikipedia Similarity
    • using gensim's Word2Vec for model building and similarity
    computation
    • obtain pairwise similarity

  18. Pricing Data

  19. Returns
    • Obtain daily returns for each DAX company
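
A minimal sketch of this step, assuming simple percentage returns computed from consecutive closing prices (the slide does not say whether simple or log returns were used). The prices are made-up values.

```python
def daily_returns(prices):
    """Simple returns r_t = p_t / p_(t-1) - 1 for consecutive closing prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Toy closing prices for one company (illustrative only).
closes = [100.0, 102.0, 99.96]
returns = daily_returns(closes)  # roughly [0.02, -0.02]
```

A series of n prices yields n - 1 returns, so all companies need to be aligned on the same trading days before their returns are compared.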

  20. Correlating Daily Returns
    • Calculate the correlation for each possible pair
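
The pairwise step could look like the sketch below, which assumes the Pearson correlation coefficient and uses three hypothetical tickers with toy return series. With the 30 DAX companies, the same loop would produce 435 unordered pairs.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(returns_by_ticker):
    """Correlation for every unordered pair of tickers, keyed by (a, b)."""
    return {
        (a, b): pearson(returns_by_ticker[a], returns_by_ticker[b])
        for a, b in combinations(sorted(returns_by_ticker), 2)
    }


# Hypothetical tickers with toy daily returns.
returns_by_ticker = {
    "AAA": [0.01, -0.02, 0.03, 0.00],
    "BBB": [0.02, -0.04, 0.06, 0.00],
    "CCC": [-0.01, 0.02, -0.03, 0.00],
}
correlations = pairwise_correlations(returns_by_ticker)
```

Sorting the tickers before pairing gives each pair a single canonical key, which makes it easy to line the correlations up with the pairwise Wikipedia similarities later.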

  21. Does the Wikipedia
    Similarity Explain
    Correlation of Returns?

  22. Not Really…
    R² = 0.091
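
For readers who want to reproduce this kind of check, here is a sketch of R² for a simple least-squares line, assuming the comparison is a one-variable linear regression of return correlation on Wikipedia similarity (the slides do not spell out the model). The numbers are toy values and are unrelated to the DAX results.

```python
def r_squared(x, y):
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or ss_tot == 0.0:
        raise ValueError("R² is undefined when x or y is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / ss_tot


# Toy data: one similarity score and one correlation per company pair.
similarities = [1.0, 2.0, 3.0, 4.0]
return_correlations = [1.0, 3.0, 2.0, 4.0]
r2 = r_squared(similarities, return_correlations)  # approximately 0.64
```

In the one-variable case R² equals the squared Pearson correlation between the two series, so a low value means the similarity scores explain little of the variation in return correlations.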

  23. What Happened?
    R² = 0.091

  24. What Happened?
    R² = 0.091

  25. What Happened?
    R² = 0.091
    Commerzbank!

  26. Check
    R² = 0.218

  27. Using Adjusted Close
    Returns
    R² = 0.166

  28. Remarks
    • Wikipedia, an unconventional source of unstructured data, has
    explanatory power for financial variables
    • Graph-based similarity
      • does not capture similarity so well within a specific topic
      • categories are too broad - e.g. "Companies based in Frankfurt"
    • Context-based similarity (Word2Vec)
      • more powerful measure, captures similarity between words, topics
    • A lot more to be gained from industry-specific documents
      • e.g. annual reports, conference call transcripts

  29. Applications
    • predicting financial figures solely based on
    unstructured data or using hybrid models
    (unstructured + structured data)
    • estimating financial figures, ratios or statistical
    moments when relevant quantitative data is not
    available
      • e.g. in the period preceding an IPO

  30. Resources
    • Notebook for this presentation
      • https://github.com/deliarusu/wikipedia-correlation
    • Word2Vec and Wiki2Vec
      • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed
        Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
      • Word2Vec code repository: https://code.google.com/archive/p/word2vec/
      • Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html
      • Wiki2Vec: https://github.com/idio/wiki2vec
    • Graph-Based Similarity
      • Agirre, Eneko, Ander Barrena, and Aitor Soroa. "Studying the Wikipedia Hyperlink Graph for
        Relatedness and Disambiguation." arXiv preprint arXiv:1503.01655, 2015.
      • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.

  31. Thank You
    • Questions?
