Estimating stock price correlations using Wikipedia

Delia Rusu

About Me
• Chief Data Scientist @Knowsis (London, UK)
• PhD, Natural Language Processing and Machine Learning
• Interests: unstructured data & finance
• http://deliarusu.github.io/

This Talk

Financial Data

Sources of Unstructured Data
• Annual reports
• Broker research
• Conference call transcripts
• Investor relations presentations
• News and press releases
• In-house content

What about unconventional datasets?

Wikipedia
• 10 edits per second
• the English Wikipedia
  • currently has over 5M articles
  • average of 800 new articles per day
• the German Wikipedia (4th largest)
  • currently has over 1.9M articles

Structured Views of Wikipedia - Context Vectors
[Figure: context vectors; example article: DAX, German stock market index]

Word2Vec
• Word2Vec model (Mikolov et al. 2013)
• distributional hypothesis: words in similar contexts have similar meanings
• shallow, 2-layer neural network
• training objective: learn word vector representations which can predict nearby words (in context)

Word2Vec
[Figure: two architectures. Skip-gram model: input w(t), output w(t-2), w(t-1), w(t+1), w(t+2). CBOW model: input w(t-2), w(t-1), w(t+1), w(t+2), output w(t).]
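The two architectures differ only in which side of the (word, context) relation is the input. A minimal sketch of how each one frames its training examples, assuming a symmetric window of 2 as in the w(t-2) … w(t+2) notation; the function names are hypothetical and this is not the word2vec or gensim implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the centre word w(t) is the input, each neighbour an output."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the centre word w(t) the output."""
    examples = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        if context:  # a one-word sentence has no context to learn from
            examples.append((context, center))
    return examples


sentence = "the dax is a blue chip stock market index".split()
print(skipgram_pairs(sentence)[:4])
print(cbow_examples(sentence)[2])
```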

Word and Article Vectors
• generate context vectors for:
  • words which appear in Wikipedia articles
  • Wikipedia articles themselves (topics)
• model: wiki2vec
  • https://github.com/idio/wiki2vec
• example (original sentence, then the same sentence with article IDs):
  The DAX is a blue chip stock market index
  The DAX_ID is a blue_chip_ID stock_market_index_ID
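The idea of turning article mentions into single tokens can be sketched as follows. This is an illustration only: it uses plain string matching on a hand-supplied phrase list and the `_ID` notation from the slide, whereas wiki2vec itself works from Wikipedia link markup and has its own token format. The function name `annotate_with_article_ids` is hypothetical.

```python
import re


def annotate_with_article_ids(text, linked_phrases):
    """Replace each linked phrase with one token so the article gets its own vector."""
    # Longest phrases first, so a long phrase is not broken up by a shorter one.
    for phrase in sorted(linked_phrases, key=len, reverse=True):
        token = phrase.replace(" ", "_") + "_ID"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, lambda match, tok=token: tok, text)
    return text


sentence = "The DAX is a blue chip stock market index"
phrases = ["DAX", "blue chip", "stock market index"]  # assumed to be article links
print(annotate_with_article_ids(sentence, phrases))
```

After this step each article token is treated like any other word during training, which is what gives articles (topics) their own vectors.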

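Vector Similarity
Vectors for Deutsche Bank and BMW, with angle θ between them:
$$\cos(\theta) = \frac{\mathbf{BMW} \cdot \mathbf{DeutscheBank}}{\lVert \mathbf{BMW} \rVert \, \lVert \mathbf{DeutscheBank} \rVert}$$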
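A minimal sketch of the cosine similarity computation in plain Python. The two three-dimensional vectors are made-up toy values, not real embeddings for BMW or Deutsche Bank:

```python
import math


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only (assumed values).
bmw = [0.2, 0.7, 0.1]
deutsche_bank = [0.6, 0.3, 0.4]
print(cosine_similarity(bmw, deutsche_bank))
```

The result depends only on the angle between the vectors, not on their lengths, so scaling either vector leaves the similarity unchanged.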

Structured Views of Wikipedia - Graph
[Figure: graph fragment with nodes DAX, blue chip, stock market index, XETRA, …]

Wikipedia Graph
• Types of nodes:
  • articles, categories
• Types of directed edges:
  • hyperlinks from article to article
  • infobox links from article to article
  • links from article to category
  • links from category to category

Graph-based Similarity
• using the distance between nodes in the graph
  $$\mathrm{Similarity}(v_1, v_2) = \frac{1}{\mathrm{Distance}(v_1, v_2)}$$
• using random walks on graphs
  • Personalized PageRank (Haveliwala, 2002), a variation of PageRank (Brin and Page, 1998)
  • the user query defines how important the node is, such that PageRank will prefer nodes in the vicinity of the query node
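Both graph-based measures can be sketched on a toy adjacency list. This is a rough illustration under several assumptions, none taken from the talk: distance is the shortest directed path length found by breadth-first search, unreachable pairs get similarity 0, identical nodes get similarity 1, and Personalized PageRank is approximated by power iteration with a damping factor of 0.85 where all teleport mass returns to the query node.

```python
from collections import deque


def shortest_distance(graph, source, target):
    """Edges on the shortest directed path from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_similarity(graph, v1, v2):
    """Similarity(v1, v2) = 1 / Distance(v1, v2)."""
    dist = shortest_distance(graph, v1, v2)
    if dist is None:
        return 0.0  # assumption: unreachable nodes are not similar at all
    if dist == 0:
        return 1.0  # assumption: a node is maximally similar to itself
    return 1.0 / dist


def personalized_pagerank(graph, query, damping=0.85, iterations=50):
    """Power iteration in which every restart jumps back to the query node."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets} | {query}
    rank = {n: 0.0 for n in nodes}
    rank[query] = 1.0
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node in nodes:
            targets = graph.get(node, ())
            if targets:
                share = damping * rank[node] / len(targets)
                for nxt in targets:
                    new_rank[nxt] += share
            else:
                # Dangling node: return its mass to the query node.
                new_rank[query] += damping * rank[node]
        new_rank[query] += 1.0 - damping
        rank = new_rank
    return rank


# Toy graph; node names echo the slides, the edges are made up.
toy = {"DAX": ["blue chip", "stock market index", "XETRA"],
       "XETRA": ["stock market index"]}
print(distance_similarity(toy, "DAX", "XETRA"))
print(personalized_pagerank(toy, "DAX"))
```

Because the restart always lands on the query node, nodes close to it accumulate more rank than distant ones, which is the "vicinity" preference described above.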

DAX Analysis

DAX Dataset
• 30 companies part of the index
…

Wikipedia Similarity
• using gensim's Word2Vec for model building and similarity computation
• obtain pairwise similarity

Pricing Data

Returns
• Obtain daily returns for each DAX company
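A minimal sketch of the daily return calculation, assuming simple (arithmetic) returns computed from consecutive closing prices; the talk does not say whether simple or log returns were used, and the prices below are made up:

```python
def daily_returns(prices):
    """Simple return for each day: r_t = p_t / p_(t-1) - 1."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


closes = [100.0, 110.0, 99.0]  # toy closing prices, not real data
print(daily_returns(closes))
```

A series of n prices yields n - 1 returns, so all companies need to be aligned on the same trading days before their returns are compared.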

Correlating Daily Returns
• Calculate the correlation for each possible pair
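With 30 companies there are 30 · 29 / 2 = 435 unordered pairs. A minimal sketch of the pairwise step, assuming the Pearson correlation coefficient (the talk does not name the measure) and return series of equal length; the tickers and numbers are toy values:

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(returns_by_ticker):
    """Correlation for every unordered pair of tickers."""
    return {
        (a, b): pearson(returns_by_ticker[a], returns_by_ticker[b])
        for a, b in combinations(sorted(returns_by_ticker), 2)
    }


# Toy return series keyed by hypothetical tickers.
toy_returns = {
    "AAA": [0.01, -0.02, 0.03, 0.00],
    "BBB": [0.02, -0.01, 0.02, 0.01],
    "CCC": [-0.01, 0.02, -0.03, 0.00],
}
for pair, corr in pairwise_correlations(toy_returns).items():
    print(pair, round(corr, 3))
```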

Does the Wikipedia Similarity Explain Correlation of Returns?

Not Really… R² = 0.091

What Happened? R² = 0.091

What Happened? R² = 0.091 Commerzbank!

Check R² = 0.218

Using Adjusted Close Returns R² = 0.166
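The R² values on these slides say how much of the variance in pairwise return correlation is accounted for by the Wikipedia similarity. A minimal sketch of that calculation, assuming a simple least-squares line with an intercept fitted to (similarity, correlation) points; the talk does not spell out the regression setup, and the data below is made up:

```python
def r_squared(x, y):
    """R^2 of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("R^2 is undefined when x or y is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / ss_tot


# Toy points: one per company pair (assumed values, not the talk's data).
similarity = [0.10, 0.25, 0.40, 0.55, 0.70]
correlation = [0.30, 0.28, 0.45, 0.41, 0.60]
print(r_squared(similarity, correlation))
```

For a one-variable fit like this, R² equals the squared Pearson correlation between the two series, so a single unusual company that appears in many pairs can move it noticeably.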

Remarks
• Wikipedia, an unconventional source of unstructured data, has explanatory power for financial variables
• Graph-based similarity
  • does not capture similarity so well within a specific topic
  • categories are too broad - e.g. "Companies based in Frankfurt"
• Context-based similarity (Word2Vec)
  • more powerful measure, captures similarity between words, topics
• A lot more to be gained from industry-specific documents
  • e.g. annual reports, conference call transcripts

Applications
• predicting financial figures solely based on unstructured data or using hybrid models (unstructured + structured data)
• estimating financial figures, ratios or statistical moments when relevant quantitative data is not available
  • e.g. the time preceding an IPO

Resources
• Notebook for this presentation
  • https://github.com/deliarusu/wikipedia-correlation
• Word2Vec and Wiki2Vec
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
  • Word2Vec code repository: https://code.google.com/archive/p/word2vec/
  • Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html
  • Wiki2Vec: https://github.com/idio/wiki2vec
• Graph-Based Similarity
  • Eneko Agirre, Ander Barrena, and Aitor Soroa. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. arXiv preprint arXiv:1503.01655, 2015.
  • Delia Rusu. Text Annotation using Background Knowledge. PhD Thesis. Ljubljana, 2014.

Thank You • Questions?
