Creating a Next-Generation Financial Dataset from Scratch with NLP & Active Learning

Presented at the spaCy IRL 2019 conference in Berlin, Germany. Video available at https://www.youtube.com/watch?v=rdmaR4WRYEM.

Patrick Harrison

July 06, 2019


  1. Creating a next-generation financial dataset from scratch with NLP & active learning
  2. whois Patrick Harrison

    Director of AI Engineering @ S&P Global, patrick@hrsn.me. We are a group of data scientists and machine learning engineers working to build production AI-powered applications at S&P Global.
  3. In this talk…

    1. A little bit about S&P Global
    2. A major structural trend in financial data: ESG
    3. Creating an ESG dataset from scratch with spaCy, BERT, and active learning
    4. Why S&P Global is a great place for NLP practitioners
  4. S&P Global

  5. S&P Global is…

    • A financial data & technology company
    • Several divisions:
      • Ratings: credit ratings for governments, corporations, institutions
      • Indices: Dow Jones, S&P 500, S&P Europe 350, …
      • Market Intelligence: data, analytics, research, news
      • Platts: energy analytics
    • A member of the Fortune 500 ($50B+ market capitalization)
  6. Customers

    • Companies
    • Banks & Investors
    • Governments & Policy-makers
    • Professional Services
    • Academic Researchers

    “I need to make the best possible decisions for my organization.”
  7. “I need data.” This is where we can help.

  8. Relevant, accurate data makes better decisions possible.

    When our customers make better decisions, it can lead to economic growth and better governance.
  9. Some of our datasets…

    • Cross-industry:
      • Conventional financial performance metrics
      • News & events
      • Professionals
      • Transcripts of earnings conference calls
      • Many more…
    • Industry-specific:
      • Natural gas pipeline network & operations
      • Many more…
  10. None
  11. None
  12. None
  13. None
  14. None
  15. None
  16. None
  17. None
  18. The accuracy guarantee

    • …or what you might call the “100% precision, 100% recall rule”
    • If there is a fact in the public domain and it falls within the scope of S&P’s information coverage, typically we guarantee:
      • That fact will be in our datasets, and
      • the data will be correct
    • If you find an example where data is missing or incorrect, we will send you $50
  19. let’s talk about ESG data

  20. In the past, when financial analysts did research on a company, conventional financial performance metrics were paramount.
  21. Today, customers are clamoring for new types of information about companies they research.
  22. ESG

  23. Environmental | Social | Governance

  24. Example ESG attributes

    • Has this company made a public commitment to reduce or eliminate deforestation from its business operations?
    • Does the company disclose the investments it is making, if any, to promote sustainable water use in its business operations?
    • Does the company have standards in place to prohibit child labor practices in its business operations? In its supply chain?
    • Has this company made a public commitment regarding animal welfare practices?
    • Is the CEO’s compensation linked to company performance on sustainability metrics?
    • Does the company have targets in place for diversity and inclusion in its workforce?
    • … (hundreds more)
    • This data is really hard to get today!
  25. We want to create the best ESG dataset in the world

    Problem: collecting standardized ESG data for thousands of companies is hard
  26. Why is collecting ESG data hard?

    Conventional Data | ESG Data
    Typically regulated | Typically unregulated (for now)
    Disclosure is mandatory | Disclosure is voluntary, non-disclosure is common
    Companies report similar metrics | Companies report a variety of metrics, or no metrics
    Reported in standard formats | Companies report data in various formats and channels
    Reported at predictable times | Companies may report whenever they like
  27. None
  28. None
  29. None
  30. None
  31. evidence for this ESG attribute

    Does the company assess risks related to water issues at least once a year?
  32. Summarizing the task

    • We need to identify spans of text that contain relevant evidence for a company’s ESG attributes…
    • …which may or may not be disclosed anywhere
    • …for hundreds of ESG attributes
    • …from a variety of document or web sources
    • …across thousands of companies
    • …and system accuracy has to be 100%.
  33. creating an ESG dataset from scratch with spaCy, BERT, and active learning
  34. nlp modeling pipeline

  35. None
  36. None
  37. None
  38. None
  39. None
  40. active learning lifecycle

  41. None
  42. None
  43. None
  44. None
  45. None
  46. None
  47. None
  48. None
  49. None
  50. None
  51. None
  52. None
  53. None
  54. None
  55. None
  56. None
  57. None
  58. Next steps

    • This modeling approach and workflow is currently in “internal production” as S&P Global builds out its ESG dataset across thousands of companies
    • Members of our AI Engineering group build and maintain models, workflow tools, and infrastructure that make the active learning model development lifecycle and production workflow possible
  59. S&P: a great place for nlp practitioners

  60. The corpus

    • Documents are our bread and butter: we work with hundreds of millions of professionally-produced documents
    • Enough text data to do some really interesting things, like creating customized word embeddings and pre-trained language models for the financial services domain
  61. The people

    • We have large teams of analysts and subject-matter experts on staff who can assist with annotating data, so no crowd-sourcing is required
    • The data-first mindset: as a data company, we have a lot of people who have been thinking hard about the storage, management, and quality of data for a long time
  62. The data

    “Where did you find that?” “How many shares of Apple, Inc. stock are outstanding?”
  63. … … “Source Tagging”

  64. The impact

    • Processing text is fundamental to the core operations of our business
    • The business opportunity for NLP is large and direct
    • Lots of internal and external customers really care about the results of your work
  65. Closing thoughts

    • It’s not always the best-performing model that wins
    • It’s the end-to-end system that provides value in a specific business context, potentially including human-machine collaboration
    • We are hiring
    • A big thank you to the folks at Explosion and the rest of the Python data science ecosystem!

    patrick@hrsn.me
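
Editor's sketch (not from the deck): the slides covering the active learning lifecycle (40 to 57) have no transcribed text, so the query strategy actually used is not known from this transcript. As one common possibility, the step where a model hands spans to analysts for annotation can be illustrated with uncertainty sampling: send the candidate evidence spans whose relevance score is closest to 0.5. The `Candidate` structure, the function names, the company labels, the example sentences, and the scores below are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate evidence span for one ESG attribute (hypothetical structure)."""
    company: str
    attribute: str
    text: str
    score: float  # model's estimated probability that the span is relevant evidence


def uncertainty(score: float) -> float:
    """Return 1.0 when the model is least sure (score 0.5), 0.0 when it is certain."""
    return 1.0 - abs(score - 0.5) * 2.0


def select_for_annotation(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Pick the `budget` candidates the model is least certain about."""
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


# Invented examples and scores, for illustration only.
ATTRIBUTE = "Assesses water-related risks at least once a year"
candidates = [
    Candidate("Company A", ATTRIBUTE, "We review water risk every year.", 0.97),
    Candidate("Company B", ATTRIBUTE, "Water matters are reviewed periodically.", 0.52),
    Candidate("Company C", ATTRIBUTE, "Our headquarters moved in 2018.", 0.08),
    Candidate("Company D", ATTRIBUTE, "Risk assessments cover several topics.", 0.41),
]

# With a budget of two, the two spans scored nearest 0.5 go to the annotators.
batch = select_for_annotation(candidates, budget=2)
for c in batch:
    print(f"{c.company}: {c.text!r} (score={c.score})")
```

In a loop of this kind, the labels that analysts give to the selected batch would be added to the training set and the model retrained before the next round. Spans scored near 0 or 1 are skipped here, which would not by itself satisfy a 100% accuracy requirement; human review of the final output, as the closing slide suggests, would still be needed.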