Creating a Next-Generation Financial Dataset from Scratch with NLP & Active Learning


whois
Patrick Harrison, Director of AI Engineering @ S&P Global
We are a group of data scientists and machine learning engineers working to build production AI-powered applications at S&P Global.

In this talk…
1. A little bit about S&P Global
2. A major structural trend in financial data: ESG
3. Creating an ESG dataset from scratch with spaCy, BERT, and active learning
4. Why S&P Global is a great place for NLP practitioners

S&P Global

S&P Global is…
• A financial data & technology company
• Several divisions:
  • Ratings: credit ratings for governments, corporations, institutions
  • Indices: Dow Jones, S&P 500, S&P Europe 350, …
  • Market Intelligence: data, analytics, research, news
  • Platts: energy analytics
• A member of the Fortune 500 ($50B+ market capitalization)

Customers
• Companies
• Banks & Investors
• Governments & Policy-makers
• Professional Services
• Academic Researchers
“I need to make the best possible decisions for my organization.”

“I need data.” This is where we can help.

Relevant, accurate data makes better decisions possible. When our customers make better decisions, it can lead to economic growth and better governance.

Some of our datasets…
• Cross-industry:
  • Conventional financial performance metrics
  • News & events
  • Professionals
  • Transcripts of earnings conference calls
  • Many more…
• Industry-specific:
  • Natural gas pipeline network & operations
  • Many more…

The accuracy guarantee
• …or what you might call the “100% precision, 100% recall rule”
• If there is a fact in the public domain and it falls within the scope of S&P’s information coverage, typically we guarantee:
  • That fact will be in our datasets, and
  • the data will be correct
• If you find an example where data is missing or incorrect, we will send you $50
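To make the “100% precision, 100% recall” phrasing concrete, here is a minimal sketch of how the two numbers could be computed for a dataset of facts. The function name, the tuple layout, and the toy facts are all hypothetical and chosen only for illustration; nothing here describes how S&P Global actually audits its data.

```python
def precision_recall(dataset_facts, true_facts):
    """Return (precision, recall) of a dataset against the true in-scope facts."""
    dataset_facts = set(dataset_facts)
    true_facts = set(true_facts)
    correct = dataset_facts & true_facts
    # Precision: share of facts in the dataset that are correct.
    precision = len(correct) / len(dataset_facts) if dataset_facts else 1.0
    # Recall: share of true, in-scope facts that made it into the dataset.
    recall = len(correct) / len(true_facts) if true_facts else 1.0
    return precision, recall


# Hypothetical toy facts as (company, field, value) tuples.
truth = {("ExampleCo", "ceo", "J. Doe"), ("ExampleCo", "hq", "Springfield")}
data = {("ExampleCo", "ceo", "J. Doe"), ("ExampleCo", "hq", "Shelbyville")}

p, r = precision_recall(data, truth)
print(p, r)  # one wrong value costs both precision and recall
```

Under this reading, the guarantee amounts to both values being 1.0: a wrong value lowers precision, and a missing fact lowers recall.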

let’s talk about ESG data

In the past, when financial analysts did research on a company, conventional financial performance metrics were paramount.

Today, customers are clamoring for new types of information about
companies they research.

Environmental | Social | Governance

Example ESG attributes
• Has this company made a public commitment to reduce or eliminate deforestation from its business operations?
• Does the company disclose the investments it is making, if any, to promote sustainable water use in its business operations?
• Does the company have standards in place to prohibit child labor practices in its business operations? In its supply chain?
• Has this company made a public commitment regarding animal welfare practices?
• Is the CEO’s compensation linked to company performance on sustainability metrics?
• Does the company have targets in place for diversity and inclusion in its workforce?
• … (hundreds more)
• This data is really hard to get today!

We want to create the best ESG dataset in the world
Problem: collecting standardized ESG data for thousands of companies is hard

Why is collecting ESG data hard?
Conventional Data | ESG Data
Typically regulated | Typically unregulated (for now)
Disclosure is mandatory | Disclosure is voluntary, non-disclosure is common
Companies report similar metrics | Companies report a variety of metrics, or no metrics
Reported in standard formats | Companies report data in various formats and channels
Reported at predictable times | Companies may report whenever they like

evidence for this ESG attribute:
Does the company assess risks related to water issues at least once a year?

Summarizing the task
• We need to identify spans of text that contain relevant evidence for a company’s ESG attributes…
• …which may or may not be disclosed anywhere
• …for hundreds of ESG attributes
• …from a variety of document or web sources
• …across thousands of companies
• …and system accuracy has to be 100%.
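One way to picture the output of this task is as a list of evidence spans, each tied to a company, an ESG attribute, and a source document. The sketch below is an illustration only: the `EvidenceSpan` class, its field names, and the exact-match `find_evidence` helper are hypothetical stand-ins, and the string search is a placeholder for a real model, not the approach used in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceSpan:
    company: str    # company the evidence is about
    attribute: str  # ESG attribute the span supports
    source: str     # document or web page the span came from
    start: int      # character offset where the span begins
    end: int        # character offset just past the end of the span

    def text(self, document: str) -> str:
        """Recover the span text from the source document."""
        return document[self.start:self.end]


def find_evidence(document, company, attribute, source, phrase):
    """Placeholder for a model: return a span for each verbatim match of `phrase`."""
    spans = []
    pos = document.find(phrase)
    while pos != -1:
        spans.append(EvidenceSpan(company, attribute, source, pos, pos + len(phrase)))
        pos = document.find(phrase, pos + 1)
    return spans


# Hypothetical document text and attribute name.
doc = "We assess water-related risks annually. Revenue grew this year."
phrase = "assess water-related risks annually"
spans = find_evidence(doc, "ExampleCo", "water_risk_assessed_annually",
                      "sustainability_report", phrase)
missing = find_evidence(doc, "ExampleCo", "deforestation_commitment",
                        "sustainability_report", "deforestation")
```

An empty result, as in `missing`, is a legitimate outcome here, since the attribute may not be disclosed anywhere. Storing character offsets rather than copied text keeps every span traceable to its exact place in the source.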

creating an ESG dataset from scratch with spaCy, BERT, and
active learning

nlp modeling pipeline

active learning lifecycle
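The lifecycle diagram from this slide is not reproduced in the transcript. As a generic illustration of one common active learning step, and not necessarily the workflow shown in the talk, the sketch below applies uncertainty sampling: from a pool of unlabeled examples scored by the current model, it picks the ones whose predicted probability is closest to 0.5 and sends those to annotators. The function name, example IDs, and scores are hypothetical.

```python
def select_for_annotation(scores, k):
    """Pick the k unlabeled examples the model is least certain about.

    `scores` is a list of (example_id, probability) pairs, where probability
    is the model's predicted chance that the example is relevant evidence.
    """
    # Distance from 0.5 measures confidence; smaller means more uncertain.
    ranked = sorted(scores, key=lambda item: abs(item[1] - 0.5))
    return [example_id for example_id, _ in ranked[:k]]


# Hypothetical model scores for four unlabeled sentences.
pool = [("s1", 0.97), ("s2", 0.52), ("s3", 0.08), ("s4", 0.41)]
picked = select_for_annotation(pool, k=2)
print(picked)  # ['s2', 's4']
```

In a full loop, the annotated examples would be added to the training set, the model retrained, and the remaining pool rescored before the next round of selection.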

Next steps
• This modeling approach and workflow are currently in “internal production” as S&P Global builds out its ESG dataset across thousands of companies
• Members of our AI Engineering group build and maintain the models, workflow tools, and infrastructure that make the active learning model development lifecycle and production workflow possible

S&P: a great place for nlp practitioners

The corpus
• Documents are our bread and butter: we work with hundreds of millions of professionally-produced documents
• Enough text data to do some really interesting things, like creating customized word embeddings and pre-trained language models for the financial services domain

The people
• We have large teams of analysts and subject-matter experts on staff who can assist with annotating data, no crowd-sourcing required
• The data-first mindset: as a data company, we have a lot of people who have been thinking hard about the storage, management, and quality of data for a long time

The data
“Where did you find that?”
“How many shares of Apple, Inc. stock are outstanding?”

… … “Source Tagging”

The impact
• Processing text is fundamental to the core operations of our business
• The business opportunity for NLP is large and direct
• Lots of internal and external customers really care about the results of your work

Closing thoughts
• It’s not always the best-performing model that wins
• It’s the end-to-end system that provides value in a specific business context, potentially including human-machine collaboration
• We are hiring
• A big thank you to the folks at Explosion and the rest of the Python data science ecosystem!
