Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Search at Bloomberg: Challenges, Opportunities,...

Edgar Meij
July 13, 2022

Search at Bloomberg: Challenges, Opportunities, and Lessons Learned

SIGIR 2022 Industry Track - Keynote

Edgar Meij

July 13, 2022
Tweet

More Decks by Edgar Meij

Other Decks in Science

Transcript

  1. © 2018 Bloomberg Finance L.P. All rights reserved. © 2022

    Bloomberg Finance L.P. All rights reserved. Search at Bloomberg: Challenges, Opportunities, and Lessons Learned SIGIR 2022 – SIRIP 2022 Keynote July 12, 2022 Edgar Meij, Ph.D. Head of AI Search and Discovery @edgarmeij | [email protected]
  2. © 2018 Bloomberg Finance L.P. All rights reserved. Search @

    Bloomberg • Terminal / Enterprise search ◦ Data ◦ Information ◦ Analytics ◦ News ◦ Functionality ◦ … • Query understanding • Question answering • Autocomplete / Query suggestions • Intent detection • Related entity suggestions • Recommender systems • …
  3. © 2018 Bloomberg Finance L.P. All rights reserved. Today's topic

    Search and discovery, applied research, and corresponding challenges/opportunities – all in the context of the financial services domain
  4. © 2018 Bloomberg Finance L.P. All rights reserved. Bloomberg is

    just finance, right? • A technology company founded in New York City in 1981 • 325,000+ subscribers in 170 countries • Over 20,000 employees in 163 locations, including over 7,000 software engineers – with more than 200 engineers and data scientists working on AI and related problems • Increased use of and contributions to open source software • Increased presence in academic research
  5. © 2018 Bloomberg Finance L.P. All rights reserved. Bloomberg DATA

    ANALYTICS NEWS COMMUNITY …to facilitate financial decision-making. 11
  6. © 2022 Bloomberg Finance L.P. All rights reserved. The Bloomberg

    Terminal is software that delivers a diverse array of information, news and analytics to facilitate financial decision-making.
  7. © 2018 Bloomberg Finance L.P. All rights reserved. Data is

    the backbone of the financial markets • Historically mostly “structured” market data (ticks/quotes/trades) ◦ Well-understood ◦ Enabling advanced forms of automation • Other types of data/information ◦ Real-world events, natural disasters ◦ Sociocultural phenomena ◦ Economic indicators ◦ Sales, revenue forecasts, futures, etc. ◦ Government policies ◦ Legal proceedings and litigation ◦ The weather ◦ … Our Challenge Identify financially-relevant signal from noisy, complex tangentially-related datasets
  8. © 2018 Bloomberg Finance L.P. All rights reserved. Data is

    the backbone of the financial markets • Increasingly non-traditional factors, based on “alternative” data, such as: ◦ Satellite images / CO2 emissions over factories ◦ Sentiment analytics on news ◦ Shopping mall footfall traffic ◦ Number of people riding the subway ◦ “Pret index” ◦ Credit card transactions ◦ etc. • But also “unstructured” data… Our Challenge Identify financially-relevant signal from noisy, complex tangentially-related datasets
  9. © 2018 Bloomberg Finance L.P. All rights reserved. Data is

    the backbone of the financial markets • Increasingly non-traditional factors, based on “alternative” data, such as: ◦ Satellite images / CO2 emissions over factories ◦ Sentiment analytics on news ◦ Shopping mall footfall traffic ◦ Number of people riding the subway ◦ “Pret index” ◦ Credit card transactions ◦ etc. • But also “unstructured” data… Our Challenge Identify financially-relevant signal from noisy, complex tangentially-related datasets
  10. © 2018 Bloomberg Finance L.P. All rights reserved. Data is

    the backbone of the financial markets • Increasingly non-traditional factors, based on “alternative” data, such as: ◦ Satellite images / CO2 emissions over factories ◦ Sentiment analytics on news ◦ Shopping mall footfall traffic ◦ Number of people riding the subway ◦ “Pret index” ◦ Credit card transactions ◦ etc. • But also “unstructured” data… Challenge: identify financially-relevant signal from noisy, complex tangentially-related datasets.
  11. © 2018 Bloomberg Finance L.P. All rights reserved. “Unstructured” data

    • 80% of data exists in the form of “raw”, unstructured text, e.g., ◦ Company filings, earnings call transcripts ◦ Tweets, Reddit, Facebook posts, news stories ◦ Research analyst reports, CRMs ◦ Economic policy, govt communications ◦ Press releases ◦ Web pages ◦ Chats & e-mail, client feedback ◦ etc. ◦ Lots of jargon and custom terminology (sometimes even firm-specific!) Our Challenge Identify financially-relevant signal from noisy, complex tangentially-related datasets
  12. © 2022 Bloomberg Finance L.P. All rights reserved. Sanford Bernstein’s

    Toni Sacconaghi “And so, where specifically will you be in terms of capital requirements?” Real-time multi-modal data moves markets speech recognition entity recognition linking salience topic classification summarization Elon Musk “Excuse me. Next. Boring, bonehead questions are not cool. Next?”
  13. © 2022 Bloomberg Finance L.P. All rights reserved. Even within

    a single role, context matters financial research analyst REVIEW FORECASTS sell-side research press releases IDENTIFY RISK liquidity analysis new market conditions LISTEN TO CONFERENCE CALLS assess tone note new guidance / comments take notes for later reference FIND & READ RELEVANT DOCS financial reports presentations sell-side research ASSESS FUTURE PERFORMANCE competition ESG picture news / media sentiment tone in management calls ANNOTATE & STRUCTURE FINDINGS highlight research copy and share snippets set up alerts for new events EARNINGS SEASON NEW COMPANY
  14. © 2022 Bloomberg Finance L.P. All rights reserved. Even within

    a single role, context matters portfolio manager IDEATION sell-side research press releases searching/browsing inefficiencies in the market FIND & READ RELEVANT DOCS financial reports presentations sell-side research RISK ANALYSIS liquidity analysis valuation peer analysis sell-side analyst recommendations ASSESS FUTURE PERFORMANCE industry comparison ESG news / media sentiment backtest TRADE factor analysis pricing information forecast market conditions MONITOR identify anomalies news alerts sell-side analyst recommendations IDEA GENERATION TRADE
  15. © 2018 Bloomberg Finance L.P. All rights reserved. User models

    • Most of our clients use the Terminal in their day-to-day workflows to: ◦ Trade ◦ Spot inefficiencies/opportunities in the market ◦ Find signal ◦ Keep abreast of developments ◦ etc. • Deeply-engrained muscle memory for executing Bloomberg functions • Limited room for “discovery”
  16. © 2018 Bloomberg Finance L.P. All rights reserved. Today's topic

    Search and discovery, applied research, and corresponding challenges/opportunities – all in the context of the financial services domain
  17. 10K+ Functions (verticals) Structured Data* EQS Equity Screening PEOP People

    Search SRCH Bond Search … Unstructured Data BI BBG Intelligence NSE News HELP Help … Commands* GP Charting ALRT Alerts MAP Mapping …
  18. © 2018 Bloomberg Finance L.P. All rights reserved. Bloomberg Terminal

    commandline search • Autocomplete: vast majority is navigational (i.e., no search intent) • Returning results of vastly different types ◦ Securities ◦ Non-securities ▪ People ▪ Companies ▪ Wikipedia ▪ Help pages + FAQs ▪ Contributors ▪ Issuers ▪ Research categories/topics/analysts ▪ Definitions ▪ Fields ▪ Functions
  19. © 2018 Bloomberg Finance L.P. All rights reserved. Federating across

    many disparate units of retrieval • Many disparate sources ◦ Some indexed by us ◦ Some indexed by others/owners • How do you normalize scores for results from different back-ends, and then merge and present a single list? ◦ Account for different “document” lengths and collection statistics ◦ Account for different “document” fields ◦ Account for multiple languages ◦ Account for multiple ranking functions ◦ Account for different, perhaps non-comparable scores
  20. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  21. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  22. Show me the EPS of Chinese companies that had had

    dividend yield over 5% last year Show me the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  23. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  24. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  25. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  26. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  27. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  28. What are the market caps of German pharmaceuticals? Show me

    the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year EQS
  29. What are the market caps of German pharmaceuticals? EQS Show

    me the EPS of Chinese pharmaceutical companies that had dividend yield over 5% last year
  30. © 2018 Bloomberg Finance L.P. All rights reserved. Semantic parsing

    framework • Reuse, reuse, reuse! Lots of domains, but they share a lot in common: language about time, currency, aggregation ops, … • Developer efficiency Build a semantic parser for a new domain fast, then iterate • Flexibility To support language for new semantic operations • Performance Interactive times, on par with the typical search engine • Interpretability Not just an answer, but how it was derived Developer User
  31. © 2022 Bloomberg Finance L.P. All rights reserved. DATE/TIME, TOPIC,

    ENTITY RECOGNITION oil last week (topic:OIL, time(-1,week,now)) COMPLEX LOGICAL STRUCTURES german or french parliament elections ((topic:GEPARM or topic:FRPARM) AND topic:ELECTIONS) OPEN VOCABULARY pandora papers ("pandora papers") vs. ("papers" AND company:PNDORA@DC) Query understanding for News sony or toyota last two weeks in japanese german or french parliament elections oil last week pandora papers
  32. © 2022 Bloomberg Finance L.P. All rights reserved. Query understanding

    is used across the Bloomberg Terminal Bonds Charts Economic Equities Holdings News People Show all floating Asian tech bonds maturing in the next 10 years Yearly net income of Google and IBM in last 20 years Show GDP of China and Germany in Q1 2016 What are the top 10 Asian tech companies with eps >= 4 German holders of French ETFs Show me news about oil from the FT from the last month Who are the UMich alumni that work for GS in NYC?
  33. © 2018 Bloomberg Finance L.P. All rights reserved. Aggregated autocomplete

    suggestions Prime Terminal real estate BondsAC NewsAC 5W1H AC ChartAC … Balance: Relevance (short term) Get things done • Clicks • Task completion Discovery (long term) Serendipity and utility • Adoption • Workflows Personalization Semantically Driven Auto-completion, CIKM 2019 Auto-completion for Question Answering Systems at Bloomberg, SIGIR 2018
  34. © 2018 Bloomberg Finance L.P. All rights reserved. TextQA Document

    Stores Retriever (tens of documents) Scorer & Answer Extractor Multiple domains Multiple backends Multiple extractors Confidence Modeling
  35. © 2018 Bloomberg Finance L.P. All rights reserved. Multiple domains

    Multiple backends Multiple extractors Confidence Model High confidence Moderate confidence Regular Result Best Answer Low confidence Φ No Show TextQA Document Stores Retriever (tens of documents) Scorer & Answer Extractor Show/No show
  36. © 2018 Bloomberg Finance L.P. All rights reserved. Some more

    exotic examples • How much new railway development is planned in Europe? Should projects work as planned and remain on schedule, Europe could have more than 17,200 km of new and upgraded railway by 2025, out of a total planned pipeline of 39,180 km. This equates to about $97 billion of investment in 2022 and $110 billion in 2023. High-speed rail could provide about 22,500 additional km. Industrial and materials producers are closely monitoring higher demand. The U.K., France, Russia, Italy, Germany and Sweden -- where some listed construction companies operate -- are showing up with leading projects, both in execution or in the planning phase.EU governments are making significant efforts to use the co-funding available -- as much as possible -- to boost their economies. From 2000-17, Spain spent 47.3% of EU funds on high-speed rail. • When will the cruise industry return to profit? RECENT EVENT REACTION: Carnival's deflated January bookings for 2H cruises elevates our concern that the industry's targeted 2H return to profit could falter if the omicron variant's impact on sales lingers. Though Carnival expects to operate over 96% of disclosed 1Q capacity days despite virus disruption, even that shortfall could clip 1Q sales by 4% vs. consensus, our analysis shows.
  37. © 2018 Bloomberg Finance L.P. All rights reserved. Challenges (and

    thus Opportunities!) • Federating across many disparate units of retrieval • Partial observability: incomplete/noisy interactions • Augmented intelligence • Staying performant
  38. © 2018 Bloomberg Finance L.P. All rights reserved. Federating across

    many disparate units of retrieval • Search-as-a-platform • In-domain vs. cross-domain search • Domain specialists vs. novice users ◦ Aliases in-context ◦ How to quality control? • Sample live queries as much as you can
  39. © 2018 Bloomberg Finance L.P. All rights reserved. Federating across

    many disparate units of retrieval • Entity / Intent / Domain detection, for partial as well as for full queries ◦ In some cases NER is sufficient (when the label is unique) ◦ Typeahead prediction first, and then NER/NED on the full query instead of on partial queries seems to work better than NER/NED on partial queries • Adjust downstream ranking and presentation accordingly Identifying Named Entities as they are Typed, EACL 2021
  40. © 2018 Bloomberg Finance L.P. All rights reserved. NLP in

    Practice • Generation 1: Write a bunch of rules (“templates”, “grammars”) ◦ High-precision ◦ Slow, manual, difficult to maintain or update • Generation 2: Train a statistical classifier ◦ For sequence tagging: conditional random fields ◦ For document classification: logistic regression, SVMs, decision trees/random forests ◦ Need labeled data • Generation 3: Deep learning and human in the loop ◦ Need a lot of labeled data, or distant supervision ◦ May be slower
  41. © 2018 Bloomberg Finance L.P. All rights reserved. Learning to

    Rank in practice • Generation 1: Parametrized BM25F / LM for IR • Generation 2: LambdaRank, RFs, GBRTs, (“deeper” learning) ◦ Address cold-start issues, add contextual info, enrich instances • Generation 3: Beyond supervised learning ◦ Neural IR / Dense retrievers / Vector-based similarity ◦ Reinforcement learning ◦ Want to optimize for long-term “utility” and stickiness
  42. © 2018 Bloomberg Finance L.P. All rights reserved. Partial observability

    • How to train LTR models with (i) limited amounts of (ii) weakly supervised data? ◦ System bias: only feedback on seen items ◦ What if you don’t have that many users? ◦ What if you have cohorts with outliers/different behaviours? • Cold-start issues around sampling questions from logs for training ◦ If users don’t know they can ask questions they probably won’t ◦ Generate + paraphrase questions instead Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning, AAAI 2022
  43. © 2018 Bloomberg Finance L.P. All rights reserved. Temporal relevance

    • How to deal with annotations/relevance assessments that change/decay over time? ◦ Developments in the real world ◦ Timeliness of answers, stale answers, expiring answers ◦ Temporally-anchored answers • Need online metrics, ability to blocklist, continuous (re)training
  44. © 2018 Bloomberg Finance L.P. All rights reserved. “Augmented” intelligence

    • In industry, no model is static ◦ New entities, new vocabulary, new contexts, new relationships, new data, etc. • Humans in the loop ◦ To generate questions, to annotate results, to judge relevance ◦ Help prevent model drift and ameliorate lack of recall, precision ◦ Provide important training data • “Automation is augmentation, not replacement” ◦ Need effective, easy to use tools for humans to work with algorithms
  45. © 2018 Bloomberg Finance L.P. All rights reserved. Practical considerations

    • Interpretability, explainability • Regulatory / compliance ◦ Data permissioning: per-role/per-person ◦ On-/Off-prem data storage, compute ◦ Encrypting data at-rest and in-transfer ◦ Private data flows, separated networks ◦ Right to be forgotten
  46. © 2018 Bloomberg Finance L.P. All rights reserved. Practical considerations

    • Buy vs. build: invest in resources and build from scratch, partner with vendor(s), or look at (and potentially improve) open source? ◦ Type of problem, type of data ◦ Where is the “alpha”? ◦ Accuracy ◦ Transparency ◦ Customization ◦ Maintenance, ease adding more/different data ◦ Time to market ◦ Privacy/Regulatory concerns ◦ Cost
  47. © 2018 Bloomberg Finance L.P. All rights reserved. Staying performant

    • Being dependent on a (search) stack ◦ Elastic? Solr? ◦ Migrations, removing tech debt ◦ Patching up old systems or redesigning? ◦ Second stage re-ranker in-/outside of Solr • Legacy systems and patchwork processes ◦ Allocate time to disentangle and move to modern platforms and architectures
  48. © 2018 Bloomberg Finance L.P. All rights reserved. Conclusion •

    Search at Bloomberg: making structured and unstructured data machine- readable, human-interpretable, discoverable, and findable ◦ At scale with high accuracy and low latency, to enable swift and effective financial decision-making • Deliver value by pushing the state-of-the-art through applied research ◦ Address challenges encountered in “production” scenarios (cold-start issues, confidence modeling, partially observed behavior, system-induced biases, and more) ◦ Validation through scientific peer review, open source contributions • Generate data and perform continuous annotations/training with a human-in- the-loop, to address (some of) these issues
  49. © 2018 Bloomberg Finance L.P. All rights reserved. © 2022

    Bloomberg Finance L.P. All rights reserved. https://TechAtBloomberg.com/AI https://TechAtBloomberg.com/data-science-research-grant-program/ https://www.bloomberg.com/careers @edgarmeij | [email protected] Thank you