Creating a Next-Generation Financial Dataset from Scratch with NLP & Active Learning

by Patrick Harrison

Slide 1

Slide 1 text

Creating a next-generation financial dataset from scratch with NLP & active learning

Slide 2

Slide 2 text

whois
Patrick Harrison
Director of AI Engineering @ S&P Global
[email protected]
We are a group of data scientists and machine learning engineers working to build production AI-powered applications at S&P Global.

Slide 3

Slide 3 text

In this talk…
1. A little bit about S&P Global
2. A major structural trend in financial data: ESG
3. Creating an ESG dataset from scratch with spaCy, BERT, and active learning
4. Why S&P Global is a great place for NLP practitioners

Slide 4

Slide 4 text

S&P Global

Slide 5

Slide 5 text

S&P Global is…
• A financial data & technology company
• Several divisions:
  • Ratings: credit ratings for governments, corporations, institutions
  • Indices: Dow Jones, S&P 500, S&P Europe 350, …
  • Market Intelligence: data, analytics, research, news
  • Platts: energy analytics
• A member of the Fortune 500 ($50B+ market capitalization)

Slide 6

Slide 6 text

Customers
• Companies
• Banks & Investors
• Governments & Policy-makers
• Professional Services
• Academic Researchers
“I need to make the best possible decisions for my organization.”

Slide 7

Slide 7 text

“I need data.” This is where we can help.

Slide 8

Slide 8 text

Relevant, accurate data makes better decisions possible. When our customers make better decisions, it can lead to economic growth and better governance.

Slide 9

Slide 9 text

Some of our datasets…
• Cross-industry:
  • Conventional financial performance metrics
  • News & events
  • Professionals
  • Transcripts of earnings conference calls
  • Many more…
• Industry-specific:
  • Natural gas pipeline network & operations
  • Many more…

Slide 18

Slide 18 text

The accuracy guarantee
• …or what you might call the “100% precision, 100% recall rule”
• If there is a fact in the public domain and it falls within the scope of S&P’s information coverage, typically we guarantee:
  • That fact will be in our datasets, and
  • the data will be correct
• If you find an example where data is missing or incorrect, we will send you $50
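
To make the "100% precision, 100% recall" phrasing concrete, here is a minimal sketch of how the two measures could be computed for a dataset of facts. The company names, attribute names, and the `precision_recall` helper are hypothetical illustrations, not part of the talk or of any S&P Global system.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, bool]  # (company, attribute, value); hypothetical shape


def precision_recall(published: Set[Fact], true_facts: Set[Fact]) -> Tuple[float, float]:
    """Return (precision, recall) of the published facts against the ground truth."""
    correct = published & true_facts
    # Precision: share of published facts that are correct.
    precision = len(correct) / len(published) if published else 1.0
    # Recall: share of true, in-scope facts that were published.
    recall = len(correct) / len(true_facts) if true_facts else 1.0
    return precision, recall


# Hypothetical in-scope public facts.
true_facts = {
    ("Acme", "has_water_policy", True),
    ("Acme", "ceo_pay_linked_to_esg", False),
    ("Globex", "has_water_policy", True),
}
# Hypothetical dataset: one wrong value and one missing fact.
published = {
    ("Acme", "has_water_policy", True),
    ("Acme", "ceo_pay_linked_to_esg", True),
}

p, r = precision_recall(published, true_facts)
print(f"precision={p:.2f} recall={r:.2f}")
```

In these terms, the guarantee corresponds to both numbers being 1.0: nothing published is wrong (precision) and nothing in scope is missing (recall).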

Slide 19

Slide 19 text

let’s talk about ESG data

Slide 20

Slide 20 text

In the past, when financial analysts did research on a company, conventional financial performance metrics were paramount.

Slide 21

Slide 21 text

Today, customers are clamoring for new types of information about companies they research.

Slide 22

Slide 22 text

ESG

Slide 23

Slide 23 text

Environmental | Social | Governance

Slide 24

Slide 24 text

Example ESG attributes
• Has this company made a public commitment to reduce or eliminate deforestation from its business operations?
• Does the company disclose the investments it is making, if any, to promote sustainable water use in its business operations?
• Does the company have standards in place to prohibit child labor practices in its business operations? In its supply chain?
• Has this company made a public commitment regarding animal welfare practices?
• Is the CEO’s compensation linked to company performance on sustainability metrics?
• Does the company have targets in place for diversity and inclusion in its workforce?
• … (hundreds more)
• This data is really hard to get today!

Slide 25

Slide 25 text

We want to create the best ESG dataset in the world
Problem: collecting standardized ESG data for thousands of companies is hard

Slide 26

Slide 26 text

Why is collecting ESG data hard?
Conventional Data | ESG Data
Typically regulated | Typically unregulated (for now)
Disclosure is mandatory | Disclosure is voluntary, non-disclosure is common
Companies report similar metrics | Companies report a variety of metrics, or no metrics
Reported in standard formats | Companies report data in various formats and channels
Reported at predictable times | Companies may report whenever they like

Slide 31

Slide 31 text

evidence for this ESG attribute
Does the company assess risks related to water issues at least once a year?

Slide 32

Slide 32 text

Summarizing the task
• We need to identify spans of text that contain relevant evidence for a company’s ESG attributes…
• …which may or may not be disclosed anywhere
• …for hundreds of ESG attributes
• …from a variety of document or web sources
• …across thousands of companies
• …and system accuracy has to be 100%.
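
One way to picture the task is as a record type for an evidence span plus a function that proposes candidate spans. The sketch below is an illustration only: the `EvidenceSpan` fields, the `candidate_spans` helper, and the keyword matching are assumptions, and the keyword matcher is a naive stand-in for the trained models the talk refers to.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EvidenceSpan:
    """A stretch of source text offered as evidence for one ESG attribute."""
    company: str
    attribute: str
    source: str  # document or URL the text came from
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    text: str


def candidate_spans(company: str, attribute: str, source: str,
                    document: str, keywords: Sequence[str]) -> List[EvidenceSpan]:
    """Naive stand-in for a model: keep sentences that mention any keyword."""
    spans = []
    # Treat each run of text ending in '.', '!' or '?' as one sentence.
    for match in re.finditer(r"[^.!?]+[.!?]", document):
        sentence = match.group()
        if any(k.lower() in sentence.lower() for k in keywords):
            # Skip leading whitespace while keeping offsets aligned with the document.
            start = match.start() + len(sentence) - len(sentence.lstrip())
            spans.append(EvidenceSpan(company, attribute, source,
                                      start, match.end(), document[start:match.end()]))
    return spans


document = ("Acme opened two plants in 2019. "
            "We assess water-related risks annually across all sites. "
            "Revenue grew this year.")
spans = candidate_spans("Acme", "assesses_water_risk_annually",
                        "sustainability-report.pdf", document, ["water"])
for span in spans:
    print(span.start, span.end, span.text)
```

An empty result is a valid outcome here, since an attribute may not be disclosed anywhere. Storing offsets alongside the text keeps each span traceable to its source.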

Slide 33

Slide 33 text

creating an ESG dataset from scratch with spaCy, BERT, and active learning

Slide 34

Slide 34 text

nlp modeling pipeline

Slide 40

Slide 40 text

active learning lifecycle
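
The lifecycle diagrams did not survive in this transcript, so the following is a generic sketch of one active learning round, not the workflow shown in the talk. It assumes uncertainty sampling as the query strategy (a common choice; the talk's actual strategy is not stated), and every name in it (`select_for_annotation`, `active_learning_round`, the toy scores) is hypothetical.

```python
from typing import Callable, Dict, List


def uncertainty(prob_positive: float) -> float:
    """1.0 when the model is maximally unsure (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_annotation(pool: List[str],
                          predict_proba: Callable[[str], float],
                          batch_size: int) -> List[str]:
    """Pick the unlabeled examples the current model is least sure about."""
    ranked = sorted(pool, key=lambda text: uncertainty(predict_proba(text)), reverse=True)
    return ranked[:batch_size]


def active_learning_round(pool: List[str],
                          labeled: Dict[str, bool],
                          predict_proba: Callable[[str], float],
                          annotate: Callable[[str], bool],
                          batch_size: int) -> List[str]:
    """Ask the annotator about one batch, record the labels, return the remaining pool."""
    for text in select_for_annotation(pool, predict_proba, batch_size):
        labeled[text] = annotate(text)
    # Retraining on `labeled` would happen here before the next round.
    return [text for text in pool if text not in labeled]


# Toy model scores standing in for a real classifier's probabilities.
toy_scores = {
    "We review water risk every year.": 0.55,
    "Our office has a water cooler.": 0.48,
    "We sell furniture.": 0.02,
    "Water risk is assessed annually by the board.": 0.97,
}
labeled: Dict[str, bool] = {}
remaining = active_learning_round(
    pool=list(toy_scores),
    labeled=labeled,
    predict_proba=toy_scores.__getitem__,
    annotate=lambda text: "risk" in text,  # stand-in for a human analyst
    batch_size=2,
)
print(labeled)
print(remaining)
```

The point of the loop is that annotation effort goes to the examples where a label is most informative, rather than to examples the model already handles confidently.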

Slide 58

Slide 58 text

Next steps
• This modeling approach and workflow is currently in “internal production” as S&P Global builds out its ESG dataset across thousands of companies
• Members of our AI Engineering group build and maintain models, workflow tools, and infrastructure that make the active learning model development lifecycle and production workflow possible

Slide 59

Slide 59 text

S&P: a great place for nlp practitioners

Slide 60

Slide 60 text

The corpus
• Documents are our bread and butter: we work with hundreds of millions of professionally-produced documents
• Enough text data to do some really interesting things, like creating customized word embeddings and pre-trained language models for the financial services domain

Slide 61

Slide 61 text

The people
• We have large teams of analysts and subject-matter experts on staff who can assist with annotating data, with no crowd-sourcing required
• The data-first mindset: as a data company, we have a lot of people who have been thinking hard about the storage, management, and quality of data for a long time

Slide 62

Slide 62 text

The data
“Where did you find that?”
“How many shares of Apple, Inc. stock are outstanding?”

Slide 63

Slide 63 text

… … “Source Tagging”

Slide 64

Slide 64 text

The impact
• Processing text is fundamental to the core operations of our business
• The business opportunity for NLP is large and direct
• Lots of internal and external customers really care about the results of your work

Slide 65

Slide 65 text

Closing thoughts
• It’s not always the best-performing model that wins
• It’s the end-to-end system that provides value in a specific business context, potentially including human-machine collaboration
• We are hiring
• A big thank you to the folks at Explosion and the rest of the Python data science ecosystem!
[email protected]