Search at Bloomberg: Challenges, Opportunities, and Lessons Learned
SIGIR 2022 – SIRIP 2022 Keynote, July 12, 2022
Edgar Meij, Ph.D., Head of AI Search and Discovery
@edgarmeij | [email protected]
Not just finance, right?
• A technology company founded in New York City in 1981
• 325,000+ subscribers in 170 countries
• Over 20,000 employees in 163 locations, including over 7,000 software engineers, with more than 200 engineers and data scientists working on AI and related problems
• Increased use of and contributions to open source software
• Increased presence in academic research
Data: the backbone of the financial markets
• Historically mostly “structured” market data (ticks/quotes/trades)
◦ Well understood
◦ Enabling advanced forms of automation
• Other types of data/information
◦ Real-world events, natural disasters
◦ Sociocultural phenomena
◦ Economic indicators
◦ Sales, revenue forecasts, futures, etc.
◦ Government policies
◦ Legal proceedings and litigation
◦ The weather
◦ …
Our challenge: identify financially-relevant signal from noisy, complex, tangentially-related datasets.
Data: the backbone of the financial markets
• Increasingly non-traditional factors, based on “alternative” data, such as:
◦ Satellite images / CO2 emissions over factories
◦ Sentiment analytics on news
◦ Shopping mall footfall traffic
◦ Number of people riding the subway
◦ The “Pret index”
◦ Credit card transactions
◦ etc.
• But also “unstructured” data…
Our challenge: identify financially-relevant signal from noisy, complex, tangentially-related datasets.
Data: the backbone of the financial markets
• 80% of data exists in the form of “raw”, unstructured text, e.g.:
◦ Company filings, earnings call transcripts
◦ Tweets, Reddit and Facebook posts, news stories
◦ Research analyst reports, CRMs
◦ Economic policy, government communications
◦ Press releases
◦ Web pages
◦ Chats & e-mail, client feedback
◦ etc.
• Lots of jargon and custom terminology (sometimes even firm-specific!)
Our challenge: identify financially-relevant signal from noisy, complex, tangentially-related datasets.
Real-time multi-modal data moves markets
• Example exchange between an analyst and an executive on an earnings call:
◦ Toni Sacconaghi: “And so, where specifically will you be in terms of capital requirements?”
◦ Elon Musk: “Excuse me. Next. Boring, bonehead questions are not cool. Next?”
• NLP applied in real time: speech recognition, entity recognition, linking, salience, topic classification, summarization
Not a single role: context matters
• The workflow of a financial research analyst, e.g., during earnings season or when picking up a new company:
◦ Review forecasts: sell-side research, press releases
◦ Identify risk: liquidity analysis, new market conditions
◦ Listen to conference calls: assess tone, note new guidance/comments, take notes for later reference
◦ Find & read relevant docs: financial reports, presentations, sell-side research
◦ Assess future performance: competition, ESG picture, news/media sentiment, tone in management calls
◦ Annotate & structure findings: highlight research, copy and share snippets, set up alerts for new events
• Most of our clients use the Terminal in their day-to-day workflows to:
◦ Trade
◦ Spot inefficiencies/opportunities in the market
◦ Find signal
◦ Keep abreast of developments
◦ etc.
• Deeply ingrained muscle memory for executing Bloomberg functions
• Limited room for “discovery”
Many disparate units of retrieval
• Many disparate sources
◦ Some indexed by us
◦ Some indexed by others/owners
• How do you normalize scores for results from different back-ends, and then merge and present a single list? (See the sketch below.)
◦ Account for different “document” lengths and collection statistics
◦ Account for different “document” fields
◦ Account for multiple languages
◦ Account for multiple ranking functions
◦ Account for different, perhaps non-comparable scores
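As a minimal sketch of the merging question (not Bloomberg's production logic), one simple baseline is to min-max normalize scores per back-end so they are roughly comparable, then merge into a single ranked list; the backend names and weights below are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Hit:
        doc_id: str
        backend: str
        score: float

    def normalize_and_merge(results_per_backend, weights=None):
        """results_per_backend: dict mapping backend name -> list of Hit."""
        merged = []
        for backend, hits in results_per_backend.items():
            if not hits:
                continue
            lo = min(h.score for h in hits)
            hi = max(h.score for h in hits)
            span = (hi - lo) or 1.0  # avoid division by zero for constant scores
            w = (weights or {}).get(backend, 1.0)
            for h in hits:
                merged.append((w * (h.score - lo) / span, h))
        # Present a single list, highest normalized score first
        return [h for _, h in sorted(merged, key=lambda x: x[0], reverse=True)]

Real systems must still account for the other bullets (collection statistics, fields, languages), which per-backend normalization alone does not solve.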
Semantic parsing framework
• Reuse, reuse, reuse!
◦ Lots of domains, but they share a lot in common: language about time, currency, aggregation ops, …
• Developer efficiency
◦ Build a semantic parser for a new domain fast, then iterate
• Flexibility
◦ To support language for new semantic operations
• Performance
◦ Interactive times, on par with a typical search engine
• Interpretability
◦ Not just an answer, but how it was derived
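To illustrate the reuse point (a hypothetical toy, not the actual framework): a shared time-expression parser that multiple domain parsers call, so “last week” means the same thing whether the query is about news or bonds.

    import re

    TIME_PATTERN = re.compile(r"last\s+(\d+\s+)?(day|week|month|year)s?", re.I)

    def parse_time(text):
        m = TIME_PATTERN.search(text)
        if not m:
            return None
        count = int(m.group(1)) if m.group(1) else 1
        return {"op": "time", "offset": -count, "unit": m.group(2).lower(), "anchor": "now"}

    def parse_news_query(text):
        return {"domain": "news", "time": parse_time(text), "text": TIME_PATTERN.sub("", text).strip()}

    def parse_bond_query(text):
        return {"domain": "bonds", "time": parse_time(text), "text": TIME_PATTERN.sub("", text).strip()}

    print(parse_news_query("oil last week"))
    # {'domain': 'news', 'time': {'op': 'time', 'offset': -1, 'unit': 'week', 'anchor': 'now'}, 'text': 'oil'}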
Query understanding for News
• Entity recognition
◦ "oil last week" → (topic:OIL, time(-1, week, now))
• Complex logical structures
◦ "german or french parliament elections" → ((topic:GEPARM OR topic:FRPARM) AND topic:ELECTIONS)
• Open vocabulary
◦ "pandora papers" → ("pandora papers") vs. ("papers" AND company:PNDORA@DC)
• Other example queries: "sony or toyota last two weeks in japanese"
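A hedged sketch of the “complex logical structures” case (illustrative only, not the actual News query language): represent the parsed query as a small expression tree and render it for a hypothetical boolean back-end.

    def render(node):
        if isinstance(node, tuple):          # ("AND"/"OR", [children])
            op, children = node
            return "(" + f" {op} ".join(render(c) for c in children) + ")"
        return node                           # leaf, e.g. "topic:ELECTIONS"

    # "german or french parliament elections"
    parsed = ("AND", [("OR", ["topic:GEPARM", "topic:FRPARM"]), "topic:ELECTIONS"])
    print(render(parsed))
    # ((topic:GEPARM OR topic:FRPARM) AND topic:ELECTIONS)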
Query understanding is used across the Bloomberg Terminal
• Domains: Bonds, Charts, Economic, Equities, Holdings, News, People
• Example queries:
◦ “Show all floating Asian tech bonds maturing in the next 10 years”
◦ “Yearly net income of Google and IBM in last 20 years”
◦ “Show GDP of China and Germany in Q1 2016”
◦ “What are the top 10 Asian tech companies with eps >= 4”
◦ “German holders of French ETFs”
◦ “Show me news about oil from the FT from the last month”
◦ “Who are the UMich alumni that work for GS in NYC?”
TextQA
• Pipeline: document stores → retriever (tens of documents) → scorer & answer extractor → confidence model → show/no show
• Multiple backends, multiple extractors
• The confidence model decides presentation (see the sketch below):
◦ High confidence → Best Answer
◦ Moderate confidence → regular result
◦ Low confidence → no show (Φ)
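A minimal sketch of the show/no-show decision; the thresholds and return format are illustrative assumptions, not the production confidence model.

    HIGH, MODERATE = 0.85, 0.60

    def present(answer, confidence):
        if confidence >= HIGH:
            return {"display": "best_answer", "answer": answer}
        if confidence >= MODERATE:
            return {"display": "regular_result", "answer": answer}
        return {"display": "no_show"}  # suppress low-confidence answers

    print(present("17,200 km by 2025", 0.9))   # shown as a Best Answer
    print(present("17,200 km by 2025", 0.3))   # suppressed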
Exotic examples
• Q: How much new railway development is planned in Europe?
◦ A: Should projects work as planned and remain on schedule, Europe could have more than 17,200 km of new and upgraded railway by 2025, out of a total planned pipeline of 39,180 km. This equates to about $97 billion of investment in 2022 and $110 billion in 2023. High-speed rail could provide about 22,500 additional km. Industrial and materials producers are closely monitoring higher demand. The U.K., France, Russia, Italy, Germany and Sweden -- where some listed construction companies operate -- are showing up with leading projects, both in execution or in the planning phase. EU governments are making significant efforts to use the co-funding available -- as much as possible -- to boost their economies. From 2000-17, Spain spent 47.3% of EU funds on high-speed rail.
• Q: When will the cruise industry return to profit?
◦ A: RECENT EVENT REACTION: Carnival's deflated January bookings for 2H cruises elevates our concern that the industry's targeted 2H return to profit could falter if the omicron variant's impact on sales lingers. Though Carnival expects to operate over 96% of disclosed 1Q capacity days despite virus disruption, even that shortfall could clip 1Q sales by 4% vs. consensus, our analysis shows.
Many disparate units of retrieval
• Search-as-a-platform
• In-domain vs. cross-domain search
• Domain specialists vs. novice users
◦ Aliases in context
◦ How to do quality control?
• Sample live queries as much as you can
Many disparate units of retrieval
• Entity / intent / domain detection, for partial as well as full queries
◦ In some cases NER is sufficient (when the label is unique)
◦ Typeahead prediction first, then NER/NED on the predicted full query, works better than NER/NED directly on partial queries (see the sketch below)
• Adjust downstream ranking and presentation accordingly
• See: “Identifying Named Entities as they are Typed”, EACL 2021
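A hedged sketch of the “complete first, then tag” idea: predict the most likely completion of the partial input, run NER/NED on the completed query, and reuse those entities downstream. The `typeahead_model` and `run_ner` objects are hypothetical stand-ins, not real APIs.

    def complete_query(prefix, typeahead_model):
        # typeahead_model.suggest is assumed to return ranked completions
        candidates = typeahead_model.suggest(prefix)
        return candidates[0] if candidates else prefix

    def entities_for_partial_query(prefix, typeahead_model, run_ner):
        completed = complete_query(prefix, typeahead_model)
        # Tag entities on the (more complete) predicted query, not the raw prefix
        return completed, run_ner(completed)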
Practice
• Generation 1: write a bunch of rules (“templates”, “grammars”)
◦ High precision
◦ Slow, manual, difficult to maintain or update
• Generation 2: train a statistical classifier
◦ For sequence tagging: conditional random fields
◦ For document classification: logistic regression, SVMs, decision trees/random forests
◦ Needs labeled data
• Generation 3: deep learning and humans in the loop
◦ Needs a lot of labeled data, or distant supervision
◦ May be slower
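A toy “Generation 2” example: a logistic-regression topic classifier built with scikit-learn. The texts and labels are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "crude futures rally as supply tightens",
        "parliament passes new election law",
        "refinery output falls after storm",
        "coalition talks stall ahead of the vote",
    ]
    labels = ["OIL", "ELECTIONS", "OIL", "ELECTIONS"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    print(clf.predict(["crude supply rally after opec meeting"]))  # -> ['OIL']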
• How to train LTR models with (i) limited amounts of (ii) weakly supervised data?
◦ System bias: only feedback on seen items (see the sketch below)
◦ What if you don’t have that many users?
◦ What if you have cohorts with outliers/different behaviours?
• Cold-start issues around sampling questions from logs for training
◦ If users don’t know they can ask questions, they probably won’t
◦ Generate + paraphrase questions instead
• See: “Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning”, AAAI 2022
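One common way to counter “feedback only on shown items” is inverse propensity scoring (IPS): weight each click by the inverse of the probability that its position was examined. This is a minimal sketch of that general technique, not a description of Bloomberg's LTR setup; the propensity values are assumed.

    def ips_weighted_pairs(clicked, propensity):
        """clicked: list of (doc_id, position, was_clicked); returns training
        examples weighted so clicks at rarely-examined positions count more."""
        examples = []
        for doc_id, pos, was_clicked in clicked:
            if was_clicked:
                weight = 1.0 / max(propensity[pos], 1e-3)  # clip to avoid huge weights
                examples.append((doc_id, weight))
        return examples

    # Positions lower down the list are examined less often (assumed values)
    propensity = {1: 0.9, 2: 0.6, 3: 0.3}
    print(ips_weighted_pairs([("d1", 1, True), ("d7", 3, True), ("d4", 2, False)], propensity))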
• How to deal with annotations/relevance assessments that change or decay over time?
◦ Developments in the real world
◦ Timeliness of answers; stale answers, expiring answers
◦ Temporally-anchored answers
• Need online metrics, the ability to blocklist, and continuous (re)training
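An illustrative sketch of decaying annotations over time: down-weight relevance labels as they age and drop them past an expiry horizon. The half-life and expiry values are assumptions, not numbers from the talk.

    import math
    from datetime import datetime, timezone

    HALF_LIFE_DAYS = 180.0
    EXPIRY_DAYS = 730.0

    def label_weight(labeled_at, now=None):
        now = now or datetime.now(timezone.utc)
        age_days = (now - labeled_at).total_seconds() / 86400.0
        if age_days > EXPIRY_DAYS:
            return 0.0                               # treat as expired; re-annotate
        return 0.5 ** (age_days / HALF_LIFE_DAYS)    # exponential decay

    print(label_weight(datetime(2022, 1, 1, tzinfo=timezone.utc),
                       now=datetime(2022, 7, 12, tzinfo=timezone.utc)))  # ≈ 0.48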
• In industry, no model is static
◦ New entities, new vocabulary, new contexts, new relationships, new data, etc.
• Humans in the loop
◦ To generate questions, to annotate results, to judge relevance
◦ Help prevent model drift and address gaps in recall and precision
◦ Provide important training data
• “Automation is augmentation, not replacement”
◦ Need effective, easy-to-use tools for humans to work with algorithms
• Interpretability, explainability
• Regulatory / compliance
◦ Data permissioning: per-role and per-person (see the sketch below)
◦ On-/off-prem data storage and compute
◦ Encrypting data at rest and in transit
◦ Private data flows, separated networks
◦ Right to be forgotten
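A hedged sketch of search-time permissioning: filter candidate documents by the requesting user's entitlements before ranking and presentation. The field names and entitlement model are illustrative assumptions.

    def allowed(doc, user):
        acl = doc.get("acl", {})
        return (user["id"] in acl.get("users", set())
                or bool(set(user["roles"]) & acl.get("roles", set())))

    def permission_filter(candidates, user):
        return [d for d in candidates if allowed(d, user)]

    docs = [{"id": "memo-1", "acl": {"roles": {"research"}, "users": set()}},
            {"id": "chat-7", "acl": {"roles": set(), "users": {"u42"}}}]
    print([d["id"] for d in permission_filter(docs, {"id": "u42", "roles": ["sales"]})])
    # ['chat-7']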
• Buy vs. build: invest in resources and build from scratch, partner with vendor(s), or look at (and potentially improve) open source?
◦ Type of problem, type of data
◦ Where is the “alpha”?
◦ Accuracy
◦ Transparency
◦ Customization
◦ Maintenance; ease of adding more/different data
◦ Time to market
◦ Privacy/regulatory concerns
◦ Cost
• Being dependent on a (search) stack
◦ Elastic? Solr?
◦ Migrations, removing tech debt
◦ Patching up old systems or redesigning?
◦ Second-stage re-ranker inside or outside of Solr? (see the sketch below)
• Legacy systems and patchwork processes
◦ Allocate time to disentangle them and move to modern platforms and architectures
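A sketch of the “second-stage re-ranker outside of Solr” option: fetch a candidate set from Solr's standard /select endpoint, then re-score it with an external model. The URL, core name, and scoring function are placeholders.

    import requests

    def solr_candidates(base_url, query, k=100):
        resp = requests.get(f"{base_url}/select",
                            params={"q": query, "rows": k, "wt": "json"})
        resp.raise_for_status()
        return resp.json()["response"]["docs"]

    def rerank(query, docs, score_fn):
        # score_fn(query, doc) -> float; e.g. a cross-encoder or GBDT model
        return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)

    # Usage (placeholder URL and scorer):
    # docs = solr_candidates("http://localhost:8983/solr/news", "oil last week")
    # top = rerank("oil last week", docs, score_fn=my_model.score)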
• Search at Bloomberg: making structured and unstructured data machine-readable, human-interpretable, discoverable, and findable
◦ At scale, with high accuracy and low latency, to enable swift and effective financial decision-making
• Deliver value by pushing the state of the art through applied research
◦ Address challenges encountered in “production” scenarios (cold-start issues, confidence modeling, partially observed behavior, system-induced biases, and more)
◦ Validation through scientific peer review and open source contributions
• Generate data and perform continuous annotation/training with a human in the loop, to address (some of) these issues
Thank you
https://TechAtBloomberg.com/AI
https://TechAtBloomberg.com/data-science-research-grant-program/
https://www.bloomberg.com/careers
@edgarmeij | [email protected]
Bloomberg Finance L.P. All rights reserved.