
Creating a Next-Generation Financial Dataset from Scratch with NLP & Active Learning

Presented at the spaCy IRL 2019 conference in Berlin, Germany. Video available at https://www.youtube.com/watch?v=rdmaR4WRYEM.

Patrick Harrison

July 06, 2019

Transcript

  1. whois Patrick Harrison Director of AI Engineering @ S&P Global

    [email protected] We are a group of data scientists and machine learning engineers working to build production AI-powered applications at S&P Global.
  2. In this talk…
    1. A little bit about S&P Global
    2. A major structural trend in financial data: ESG
    3. Creating an ESG dataset from scratch with spaCy, BERT, and active learning
    4. Why S&P Global is a great place for NLP practitioners
  3. S&P Global is…
    • A financial data & technology company
    • Several divisions:
      • Ratings: credit ratings for governments, corporations, institutions
      • Indices: Dow Jones, S&P 500, S&P Europe 350, …
      • Market Intelligence: data, analytics, research, news
      • Platts: energy analytics
    • A member of the Fortune 500 ($50B+ market capitalization)
  4. Customers
    • Companies
    • Banks & Investors
    • Governments & Policy-makers
    • Professional Services
    • Academic Researchers
    “I need to make the best possible decisions for my organization.”
  5. Relevant, accurate data makes better decisions possible. When our customers make better decisions, it can lead to economic growth and better governance.
  6. Some of our datasets…
    • Cross-industry:
      • Conventional financial performance metrics
      • News & events
      • Professionals
      • Transcripts of earnings conference calls
      • Many more…
    • Industry-specific:
      • Natural gas pipeline network & operations
      • Many more…
  7. The accuracy guarantee
    • …or what you might call the “100% precision, 100% recall rule”
    • If there is a fact in the public domain and it falls within the scope of S&P’s information coverage, typically we guarantee:
      • That fact will be in our datasets, and
      • the data will be correct
    • If you find an example where data is missing or incorrect, we will send you $50
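To make the “100% precision, 100% recall” phrasing concrete, here is a minimal sketch of how the two metrics are conventionally computed over sets of facts. The function name and the example facts are illustrative assumptions and do not come from the talk or from S&P Global's tooling.

```python
def precision_recall(reported, true_facts):
    """Return (precision, recall) for a collection of reported facts.

    precision: share of reported facts that are correct
    recall:    share of true facts that were reported
    """
    reported, true_facts = set(reported), set(true_facts)
    correct = reported & true_facts
    # Treat an empty denominator as a perfect score (nothing to get wrong).
    precision = len(correct) / len(reported) if reported else 1.0
    recall = len(correct) / len(true_facts) if true_facts else 1.0
    return precision, recall


# Hypothetical example: one true fact is missing and one reported fact is wrong.
true_facts = {"fact_a", "fact_b", "fact_c", "fact_d"}
reported = {"fact_a", "fact_b", "fact_c", "fact_x"}
precision, recall = precision_recall(reported, true_facts)
```

Under this definition, the guarantee on the slide corresponds to both values being 1.0: nothing reported is wrong, and nothing in scope is missing.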
  8. In the past, when financial analysts did research on a company, conventional financial performance metrics were paramount.
  9. ESG

  10. Example ESG attributes
    • Has this company made a public commitment to reduce or eliminate deforestation from its business operations?
    • Does the company disclose the investments it is making, if any, to promote sustainable water use in its business operations?
    • Does the company have standards in place to prohibit child labor practices in its business operations? In its supply chain?
    • Has this company made a public commitment regarding animal welfare practices?
    • Is the CEO’s compensation linked to company performance on sustainability metrics?
    • Does the company have targets in place for diversity and inclusion in its workforce?
    • … (hundreds more)
    • This data is really hard to get today!
  11. We want to create the best ESG dataset in the world. Problem: collecting standardized ESG data for thousands of companies is hard.
  12. Why is collecting ESG data hard?

    | Conventional Data | ESG Data |
    | --- | --- |
    | Typically regulated | Typically unregulated (for now) |
    | Disclosure is mandatory | Disclosure is voluntary, non-disclosure is common |
    | Companies report similar metrics | Companies report a variety of metrics, or no metrics |
    | Reported in standard formats | Companies report data in various formats and channels |
    | Reported at predictable times | Companies may report whenever they like |
  13. Evidence for this ESG attribute: Does the company assess risks related to water issues at least once a year?
  14. Summarizing the task
    • We need to identify spans of text that contain relevant evidence for a company’s ESG attributes…
    • …which may or may not be disclosed anywhere
    • …for hundreds of ESG attributes
    • …from a variety of document or web sources
    • …across thousands of companies
    • …and system accuracy has to be 100%.
  15. Next steps
    • This modeling approach and workflow are currently in “internal production” as S&P Global builds out its ESG dataset across thousands of companies
    • Members of our AI Engineering group build and maintain the models, workflow tools, and infrastructure that make the active learning model development lifecycle and production workflow possible
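The transcript does not say which active learning strategy the team uses. As a generic illustration only, the sketch below shows uncertainty sampling, one common strategy: the model's least confident predictions are sent to human annotators first. The function name, span identifiers, and scores are all hypothetical.

```python
def select_for_annotation(scores, k):
    """Pick the k unlabeled items whose scores are closest to 0.5.

    scores: dict mapping an item id to a model's predicted probability
            that the item contains relevant evidence (assumed binary task).
    """
    # A score near 0.5 means the model is least sure, so a label helps most.
    ranked = sorted(scores, key=lambda item: abs(scores[item] - 0.5))
    return ranked[:k]


# Hypothetical model scores for four candidate text spans.
scores = {"span_1": 0.97, "span_2": 0.52, "span_3": 0.08, "span_4": 0.41}
to_label = select_for_annotation(scores, k=2)
```

In a loop of this kind, the newly labeled items would be added to the training set, the model retrained, and the selection repeated.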
  16. The corpus
    • Documents are our bread and butter: we work with hundreds of millions of professionally-produced documents
    • Enough text data to do some really interesting things, like creating customized word embeddings and pre-trained language models for the financial services domain
  17. The people
    • We have large teams of analysts and subject-matter experts on staff who can assist with annotating data, so no crowd-sourcing is required
    • The data-first mindset: as a data company, we have a lot of people who have been thinking hard about the storage, management, and quality of data for a long time
  18. The data
    “Where did you find that?”
    “How many shares of Apple, Inc. stock are outstanding?”
  19. The impact
    • Processing text is fundamental to the core operations of our business
    • The business opportunity for NLP is large and direct
    • Lots of internal and external customers really care about the results of your work
  20. Closing thoughts
    • It’s not always the best-performing model that wins
    • It’s the end-to-end system that provides value in a specific business context, potentially including human-machine collaboration
    • We are hiring
    • A big thank you to the folks at Explosion and the rest of the Python data science ecosystem!
    [email protected]