Estimating stock price correlations using Wikipedia

PyData Berlin 2016 talk

Description: Building an equities portfolio is a challenging task for a finance professional, as it requires, among other things, future correlations between stock prices. As this data is not always available, in this talk I look at an alternative to historical correlations as a proxy for future correlations: using graph analysis techniques and text similarity measures based on Wikipedia data.

Delia Rusu

May 21, 2016

Transcript

  1. Estimating stock price correlations using Wikipedia
    Delia Rusu

  2. About Me
    • Chief Data Scientist @Knowsis (London, UK)
    • PhD, Natural Language Processing and Machine Learning
    • Interests: unstructured data & finance
    • http://deliarusu.github.io/

  3. This Talk

  4. Financial Data

  5. Sources of Unstructured Data
    • Annual reports
    • Broker research
    • Conference call transcripts
    • Investor relations presentations
    • News and press releases
    • In-house content

  6. What about unconventional datasets?

  8. Wikipedia
    • 10 edits per second
    • the English Wikipedia
      • currently has over 5M articles
      • average of 800 new articles per day
    • the German Wikipedia (4th largest)
      • currently has over 1.9M articles

  9. Structured Views of Wikipedia - Context Vectors
    (Figure labels: DAX; German; stock market index)

  10. Word2Vec
    • Word2Vec model (Mikolov et al. 2013)
    • distributional hypothesis: words in similar contexts have similar meanings
    • shallow, 2-layer neural network
    • training objective: learn word vector representations which can predict nearby words (in context)
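
For the skip-gram variant, the training objective above can be written out explicitly. The notation follows Mikolov et al. (2013): T is the number of words in the training corpus, c is the size of the context window, and training maximizes the average log probability

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p\left(w_{t+j} \mid w_t\right)
```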

  11. Word2Vec
    • Skip-gram model: input w(t); output w(t-2), w(t-1), w(t+1), w(t+2)
    • CBOW model: input w(t-2), w(t-1), w(t+1), w(t+2); output w(t)
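
To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be generated from a token sequence with a window of two words on each side. The function names and the whitespace tokenization are assumptions made for illustration; this is not the word2vec or gensim implementation, which adds subsampling, negative sampling and other details.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, output word) pairs: each word predicts its neighbours."""
    pairs = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context words, output word) examples: neighbours predict the word."""
    examples = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        examples.append((context, center))
    return examples


# Naive whitespace tokenization, for illustration only.
tokens = "the DAX is a blue chip stock market index".split()
sg = skipgram_pairs(tokens)
cbow = cbow_examples(tokens)
```

Skip-gram yields one training pair per (word, neighbour) combination, whereas CBOW yields one example per position with the whole window as input.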

  12. Word and Article Vectors
    • generate context vectors for:
      • words which appear in Wikipedia articles
      • Wikipedia articles themselves (topics)
    • model: wiki2vec
      • https://github.com/idio/wiki2vec
    • example:
      • text: The DAX is a blue chip stock market index
      • annotated: The DAX_ID is a blue_chip_ID stock_market_index_ID
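
The annotation step in the example above can be sketched as a dictionary-based replacement of article mentions by article tokens. The `TOPICS` mapping and the `annotate` function are hypothetical and only reproduce the slide's example; wiki2vec itself works from the links in the Wikipedia markup, so this is a simplified stand-in rather than its actual pipeline.

```python
import re

# Hypothetical mapping from surface forms to article tokens; a real system
# would derive these from Wikipedia link anchors rather than a fixed dict.
TOPICS = {
    "DAX": "DAX_ID",
    "blue chip": "blue_chip_ID",
    "stock market index": "stock_market_index_ID",
}


def annotate(text, topics=TOPICS):
    """Replace each known surface form with its article token, longest form first."""
    forms = sorted(topics, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")
    return pattern.sub(lambda m: topics[m.group(1)], text)


annotated = annotate("The DAX is a blue chip stock market index")
```

Training Word2Vec on text annotated this way yields vectors for ordinary words and for article tokens in the same space.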

  13. Vector Similarity
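    • example vectors: Deutsche Bank, BMW (θ is the angle between them)
    $\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$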
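
The cosine measure on this slide is straightforward to compute by hand. The sketch below uses plain Python lists; the three-dimensional vectors are invented for illustration and are not the real article vectors for BMW or Deutsche Bank.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only.
bmw = [1.0, 2.0, 2.0]
deutsche_bank = [2.0, 1.0, 2.0]
similarity = cosine_similarity(bmw, deutsche_bank)
```

The value depends only on the angle between the vectors, not on their lengths, which is why it is a common choice for comparing embeddings.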

  14. Structured Views of Wikipedia - Graph
    (Figure labels: DAX; blue chip; stock market index; XETRA)

  15. Wikipedia Graph
    • Types of nodes:
      • articles, categories
    • Types of directed edges:
      • hyperlinks from article to article
      • infobox links from article to article
      • links from article to category
      • links from category to category

  17. Graph-based Similarity
    • using the distance between nodes in the graph:
      $\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$
    • using random walks on graphs
      • Personalized PageRank (Haveliwala, 2002), a variation of PageRank (Brin and Page, 1998)
      • the user query defines how important a node is, so that PageRank prefers nodes in the vicinity of the query node
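
Both graph-based measures can be sketched on a toy graph. The sketch reads the slide's distance formula as the inverse of the shortest-path distance, which is an assumption, and implements Personalized PageRank as a power iteration that always restarts at the query node. The graph, the damping factor of 0.85, the iteration count and the handling of unreachable or dangling nodes are all illustrative choices, not details taken from the talk.

```python
from collections import deque

# Toy directed graph (article -> linked articles), invented for illustration.
# Every link target must also appear as a key.
GRAPH = {
    "DAX": ["XETRA", "Stock market index"],
    "XETRA": ["DAX", "Frankfurt Stock Exchange"],
    "Stock market index": ["DAX"],
    "Frankfurt Stock Exchange": ["XETRA"],
}


def distance(graph, source, target):
    """Number of edges on the shortest directed path, or None if unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Inverse shortest-path distance between two distinct nodes."""
    d = distance(graph, v1, v2)
    if d is None:
        return 0.0  # assumption: unreachable nodes get similarity 0
    if d == 0:
        raise ValueError("similarity is undefined for identical nodes")
    return 1.0 / d


def personalized_pagerank(graph, query, damping=0.85, iterations=100):
    """Power iteration in which the random walk restarts at the query node."""
    rank = {node: 0.0 for node in graph}
    rank[query] = 1.0
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in graph}
        new_rank[query] += 1.0 - damping  # restart mass goes to the query
        for node, out_links in graph.items():
            if not out_links:
                # assumption: a dangling node sends its mass back to the query
                new_rank[query] += damping * rank[node]
                continue
            share = damping * rank[node] / len(out_links)
            for nxt in out_links:
                new_rank[nxt] += share
        rank = new_rank
    return rank


ppr = personalized_pagerank(GRAPH, "DAX")
```

With "DAX" as the query, the scores decrease with distance from the query node, so `ppr[v]` can serve as a relatedness score between "DAX" and `v`.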

  18. DAX Analysis

  19. DAX Dataset
    • 30 companies part of the index

  20. Wikipedia Similarity
    • using gensim's Word2Vec for model building and similarity computation
    • obtain pairwise similarity

  21. Pricing Data

  22. Returns
    • Obtain daily returns for each DAX company
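
The slide does not say whether simple or logarithmic returns are used. The sketch below assumes simple returns, r_t = p_t / p_(t-1) - 1, computed from a chronological list of closing prices; the prices are invented for illustration.

```python
def daily_returns(prices):
    """Simple returns r_t = p_t / p_(t-1) - 1 for a chronological price list."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Invented closing prices, for illustration only.
closes = [100.0, 102.0, 99.96, 99.96]
returns = daily_returns(closes)
```

A series of n prices yields n - 1 returns, so all companies need to be aligned on the same trading days before their returns are compared.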

  23. Correlating Daily Returns
    • Calculate the correlation for each possible pair
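
A minimal sketch of this step, assuming the Pearson correlation coefficient (the slide does not name the measure) and return series of equal length. The tickers and return values are invented for illustration.

```python
from itertools import combinations
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(returns_by_ticker):
    """Correlation for every unordered pair of tickers."""
    return {
        (a, b): pearson(returns_by_ticker[a], returns_by_ticker[b])
        for a, b in combinations(sorted(returns_by_ticker), 2)
    }


# Hypothetical tickers with invented daily returns.
toy_returns = {
    "AAA": [0.01, -0.02, 0.03, 0.00],
    "BBB": [0.02, -0.04, 0.06, 0.00],
    "CCC": [-0.01, 0.02, -0.03, 0.00],
}
correlations = pairwise_correlations(toy_returns)
```

For the 30 DAX companies this gives 30 · 29 / 2 = 435 pairs.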

  24. Does the Wikipedia Similarity Explain Correlation of Returns?

  25. Not Really…
    R² = 0.091
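
The slide does not state how this figure was computed. One common reading, assumed here, is the coefficient of determination of a least-squares line that regresses the pairwise return correlation on the pairwise Wikipedia similarity. The numbers in the sketch are invented and are not the talk's data.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = a + b * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant series")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


# Invented values: one entry per company pair.
toy_similarity = [0.1, 0.2, 0.3, 0.4]
toy_correlation = [0.3, 0.5, 0.4, 0.6]
r2 = r_squared(toy_similarity, toy_correlation)
```

Under this reading, an R² of 0.091 would mean that the similarity measure accounts for roughly 9% of the variance in the pairwise return correlations.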

  28. What Happened?
    R² = 0.091
    Commerzbank!

  30. Check
    R² = 0.218

  31. Using Adjusted Close Returns
    R² = 0.166

  32. Remarks
    • Wikipedia, an unconventional source of unstructured data, has explanatory power for financial variables
    • Graph-based similarity
      • does not capture similarity so well within a specific topic
      • categories are too broad, e.g. "Companies based in Frankfurt"
    • Context-based similarity (Word2Vec)
      • more powerful measure, captures similarity between words and topics
    • A lot more to be gained from industry-specific documents
      • e.g. annual reports, conference call transcripts

  33. Applications
    • predicting financial figures solely based on unstructured data or using hybrid models (unstructured + structured data)
    • estimating financial figures, ratios or statistical moments when relevant quantitative data is not available
      • e.g. the time preceding an IPO

  34. Resources
    • Notebook for this presentation
      • https://github.com/deliarusu/wikipedia-correlation
    • Word2Vec and Wiki2Vec
      • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
      • Word2Vec code repository: https://code.google.com/archive/p/word2vec/
      • Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html
      • Wiki2Vec: https://github.com/idio/wiki2vec
    • Graph-Based Similarity
      • Eneko Agirre, Ander Barrena, and Aitor Soroa. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. arXiv preprint arXiv:1503.01655, 2015.
      • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.

  35. Thank You
    • Questions?
