Search at Bloomberg: Challenges, Opportunities, and Lessons Learned

Edgar Meij
July 13, 2022

SIGIR 2022 Industry Track - Keynote

Transcript

  1. © 2018 Bloomberg Finance L.P. All rights reserved.
    © 2022 Bloomberg Finance L.P. All rights reserved.
    Search at Bloomberg:
    Challenges, Opportunities,
    and Lessons Learned
    SIGIR 2022 – SIRIP 2022 Keynote
    July 12, 2022
    Edgar Meij, Ph.D.
    Head of AI Search and Discovery
    @edgarmeij | [email protected]

  7. Search @ Bloomberg
    ● Terminal / Enterprise search
    ○ Data
    ○ Information
    ○ Analytics
    ○ News
    ○ Functionality
    ○ …
    ● Query understanding
    ● Question answering
    ● Autocomplete / Query suggestions
    ● Intent detection
    ● Related entity suggestions
    ● Recommender systems
    ● …

  9. Today's topic
    Search and discovery, applied research, and corresponding
    challenges/opportunities – all in the context of the financial services
    domain

  10. Bloomberg is just finance, right?
    ● A technology company founded in New York City in 1981
    ● 325,000+ subscribers in 170 countries
    ● Over 20,000 employees in 163 locations, including over 7,000
    software engineers – with more than 200 engineers and data
    scientists working on AI and related problems
    ● Increased use of and contributions to open source software
    ● Increased presence in academic research

  11. Bloomberg
    DATA
    ANALYTICS
    NEWS
    COMMUNITY
    …to facilitate financial decision-making.

  12. The Bloomberg Terminal is software that
    delivers a diverse array of information, news and
    analytics to facilitate financial decision-making.

  13. Finding a trading strategy is not unlike MLDC

  14. Data is the backbone of the financial markets
    ● Historically mostly “structured” market data
    (ticks/quotes/trades)
    ○ Well-understood
    ○ Enabling advanced forms of automation
    ● Other types of data/information
    ○ Real-world events, natural disasters
    ○ Sociocultural phenomena
    ○ Economic indicators
    ○ Sales, revenue forecasts, futures, etc.
    ○ Government policies
    ○ Legal proceedings and litigation
    ○ The weather
    ○ …
    Our Challenge
    Identify financially-relevant
    signal from noisy, complex
    tangentially-related datasets

  15. Data is the backbone of the financial markets
    ● Increasingly non-traditional factors, based on
    “alternative” data, such as:
    ○ Satellite images / CO2
    emissions over factories
    ○ Sentiment analytics on news
    ○ Shopping mall footfall traffic
    ○ Number of people riding the subway
    ○ “Pret index”
    ○ Credit card transactions
    ○ etc.
    ● But also “unstructured” data…
    Our Challenge
    Identify financially-relevant
    signal from noisy, complex
    tangentially-related datasets

  18. “Unstructured” data
    ● 80% of data exists in the form of “raw”,
    unstructured text, e.g.,
    ○ Company filings, earnings call transcripts
    ○ Tweets, Reddit, Facebook posts, news stories
    ○ Research analyst reports, CRMs
    ○ Economic policy, govt communications
    ○ Press releases
    ○ Web pages
    ○ Chats & e-mail, client feedback
    ○ etc.
    ○ Lots of jargon and custom terminology
    (sometimes even firm-specific!)
    Our Challenge
    Identify financially-relevant
    signal from noisy, complex
    tangentially-related datasets

  19. Why?

  20. Real-time multi-modal data moves markets
    Sanford Bernstein’s Toni Sacconaghi:
    “And so, where specifically will you
    be in terms of capital requirements?”
    Elon Musk:
    “Excuse me. Next. Boring, bonehead
    questions are not cool. Next?”
    speech recognition
    entity recognition
    linking
    salience
    topic classification
    summarization

  21. Latency matters
    entity recognition
    linking
    salience
    sentiment

  22. Finance professionals
    (i.e., our users)

  23. Roles in finance
    different roles → different needs
    different times → different needs

  24. Even within a single role, context matters
    financial
    research
    analyst
    REVIEW FORECASTS
    sell-side research
    press releases
    IDENTIFY RISK
    liquidity analysis
    new market conditions
    LISTEN TO CONFERENCE CALLS
    assess tone
    note new guidance / comments
    take notes for later reference
    FIND & READ RELEVANT DOCS
    financial reports
    presentations
    sell-side research
    ASSESS FUTURE PERFORMANCE
    competition
    ESG picture
    news / media sentiment
    tone in management calls
    ANNOTATE & STRUCTURE FINDINGS
    highlight research
    copy and share snippets
    set up alerts for new events
    EARNINGS
    SEASON
    NEW
    COMPANY

  25. Even within a single role, context matters
    portfolio
    manager
    IDEATION
    sell-side research
    press releases
    searching/browsing
    inefficiencies in the market
    FIND & READ RELEVANT DOCS
    financial reports
    presentations
    sell-side research
    RISK ANALYSIS
    liquidity analysis
    valuation peer analysis
    sell-side analyst recommendations
    ASSESS FUTURE PERFORMANCE
    industry comparison
    ESG
    news / media sentiment
    backtest
    TRADE
    factor analysis
    pricing information
    forecast market conditions
    MONITOR
    identify anomalies
    news alerts
    sell-side analyst recommendations
    IDEA
    GENERATION
    TRADE

  26. User models
    ● Most of our clients use the Terminal in their day-to-day workflows to:
    ○ Trade
    ○ Spot inefficiencies/opportunities in the market
    ○ Find signal
    ○ Keep abreast of developments
    ○ etc.
    ● Deeply ingrained muscle memory for executing Bloomberg functions
    ● Limited room for “discovery”

  28. So where does one start?

  29. Today's topic
    Search and discovery, applied research, and corresponding
    challenges/opportunities – all in the context of the financial services
    domain

  32. 10K+
    Functions
    (verticals)
    Structured Data*
    EQS Equity Screening
    PEOP People Search
    SRCH Bond Search

    Unstructured Data
    BI BBG Intelligence
    NSE News
    HELP Help

    Commands*
    GP Charting
    ALRT Alerts
    MAP Mapping

  34. Bloomberg Terminal commandline search
    ● Autocomplete: vast majority is navigational (i.e., no search intent)
    ● Returning results of vastly different types
    ○ Securities
    ○ Non-securities
    ■ People
    ■ Companies
    ■ Wikipedia
    ■ Help pages + FAQs
    ■ Contributors
    ■ Issuers
    ■ Research categories/topics/analysts
    ■ Definitions
    ■ Fields
    ■ Functions

  35. Federating across many disparate units of retrieval
    ● Many disparate sources
    ○ Some indexed by us
    ○ Some indexed by others/owners
    ● How do you normalize scores for results from different back-ends,
    and then merge and present a single list?
    ○ Account for different “document” lengths and collection statistics
    ○ Account for different “document” fields
    ○ Account for multiple languages
    ○ Account for multiple ranking functions
    ○ Account for different, perhaps non-comparable scores
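
One common baseline for the merging question on this slide, sketched under assumptions: normalize each back-end's scores separately (here with a z-score) and sort the union. The slide does not say which method is used in production. The back-end names, the scores, and the `weights` parameter below are hypothetical, and the sketch only addresses non-comparable score scales, not document lengths, fields, or languages.

```python
from statistics import mean, pstdev


def zscore_normalize(results):
    """Rescale one back-end's (doc_id, raw_score) pairs to zero mean, unit variance."""
    scores = [score for _, score in results]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All scores identical (or a single result): nothing to rank on.
        return [(doc_id, 0.0) for doc_id, _ in results]
    return [(doc_id, (score - mu) / sigma) for doc_id, score in results]


def merge(results_by_backend, weights=None):
    """Normalize per back-end, then return one list sorted by weighted score."""
    weights = weights or {}
    merged = []
    for backend, results in results_by_backend.items():
        if not results:
            continue  # mean() would raise on an empty list
        weight = weights.get(backend, 1.0)
        for doc_id, z in zscore_normalize(results):
            merged.append((doc_id, backend, weight * z))
    return sorted(merged, key=lambda item: item[2], reverse=True)


# Hypothetical raw scores on two incomparable scales.
example = {
    "news": [("n1", 12.0), ("n2", 9.0), ("n3", 3.0)],
    "people": [("p1", 0.9), ("p2", 0.5), ("p3", 0.1)],
}
for doc_id, backend, score in merge(example):
    print(f"{doc_id:3} {backend:7} {score:+.2f}")
```

A z-score assumes each back-end returns enough results for the mean and spread to be meaningful; with one or two results per source, a learned merging model or rank-based interleaving may be a better fit.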

  40. EQS
    An example workflow in 30 seconds.

  41. Show me the EPS of Chinese pharmaceutical
    companies that had dividend yield over 5% last year
    EQS

  43. What are the market
    caps of German
    pharmaceuticals?
    Show me the EPS of Chinese pharmaceutical
    companies that had dividend yield over 5% last year
    EQS

  45. Show me the EPS of
    Chinese companies
    that had dividend
    yield over 5% last year
    Show me the EPS of Chinese pharmaceutical
    companies that had dividend yield over 5% last year
    EQS

  53. Semantic parsing framework
    • Reuse, reuse, reuse!
    Lots of domains, but they share a lot in common:
    language about time, currency, aggregation ops, …
    • Developer efficiency
    Build a semantic parser for a new domain fast, then iterate
    • Flexibility
    To support language for new semantic operations
    • Performance
    Interactive times, on par with the typical search engine
    • Interpretability
    Not just an answer, but how it was derived
    Developer
    User

  54. Query understanding for News
    DATE/TIME, TOPIC, ENTITY RECOGNITION
    oil last week
    (topic:OIL, time(-1,week,now))
    COMPLEX LOGICAL STRUCTURES
    german or french parliament elections
    ((topic:GEPARM or topic:FRPARM)
    AND topic:ELECTIONS)
    OPEN VOCABULARY
    pandora papers
    ("pandora papers") vs.
    ("papers" AND company:[email protected])
    sony or toyota last two weeks in japanese
    german or french parliament elections
    oil last week
    pandora papers
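
To make the mapping from free text to a structured query concrete, here is a toy rule-based sketch. It is not the semantic parsing framework described in the talk: the `TOPICS` lexicon, the regular expression, and the function name are all assumptions made for illustration. It handles only a relative time expression, known topic words, and a fallback keyword phrase, so boolean operators, language filters, and the entity reading of "pandora papers" are out of scope.

```python
import re

# Hypothetical topic lexicon; only the OIL code appears on the slide.
TOPICS = {"oil": "OIL", "elections": "ELECTIONS"}
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4}
TIME_RE = re.compile(
    r"\blast\s+(?:(\d+|two|three|four)\s+)?(day|week|month|year)s?\b"
)


def parse_news_query(query):
    """Return a structured query string for a small subset of news queries."""
    text = query.lower()
    time_clause = None
    match = TIME_RE.search(text)
    if match:
        count_token, unit = match.groups()
        if count_token is None:
            count = 1
        elif count_token.isdigit():
            count = int(count_token)
        else:
            count = NUMBER_WORDS[count_token]
        time_clause = f"time(-{count},{unit},now)"
        # Remove the time expression so it is not parsed again as keywords.
        text = text[:match.start()] + " " + text[match.end():]

    topics, keywords = [], []
    for token in text.split():
        if token in TOPICS:
            topics.append(f"topic:{TOPICS[token]}")
        else:
            keywords.append(token)

    clauses = topics
    if keywords:
        clauses.append('"' + " ".join(keywords) + '"')  # open-vocabulary phrase
    if time_clause:
        clauses.append(time_clause)
    return "(" + ", ".join(clauses) + ")"


print(parse_news_query("oil last week"))
print(parse_news_query("pandora papers"))
```

The first call yields the same structured form as the slide's "oil last week" example; the second falls back to a quoted phrase, which corresponds to only one of the two readings the slide contrasts.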

  55. Query understanding is used across the Bloomberg Terminal
    Bonds: Show all floating Asian tech bonds maturing in the next 10 years
    Charts: Yearly net income of Google and IBM in last 20 years
    Economic: Show GDP of China and Germany in Q1 2016
    Equities: What are the top 10 Asian tech companies with eps >= 4
    Holdings: German holders of French ETFs
    News: Show me news about oil from the FT from the last month
    People: Who are the UMich alumni that work for GS in NYC?

  56. Aggregated autocomplete suggestions
    Prime Terminal
    real estate
    BondsAC NewsAC 5W1H AC ChartAC …
    Balance:
    Relevance (short term)
    Get things done
    • Clicks
    • Task completion
    Discovery (long term)
    Serendipity and utility
    • Adoption
    • Workflows
    Personalization
    Semantically Driven Auto-completion, CIKM 2019
    Auto-completion for Question Answering Systems at Bloomberg, SIGIR 2018

  57. (Text-based) Question Answering

  60. TextQA
    Document
    Stores
    Retriever
    (tens of documents)
    Scorer &
    Answer
    Extractor
    Multiple domains
    Multiple backends
    Multiple extractors
    Confidence Modeling

  61. Multiple domains
    Multiple backends
    Multiple extractors
    Confidence
    Model
    High
    confidence
    Moderate
    confidence
    Regular
    Result
    Best Answer
    Low confidence
    Φ No Show
    TextQA
    Document
    Stores
    Retriever
    (tens of documents)
    Scorer &
    Answer
    Extractor
    Show/No show
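
The show/no-show decision in this diagram can be sketched as a simple thresholding step. This reads the flattened diagram as: high confidence leads to a "Best Answer" presentation, moderate confidence to a regular result, and low confidence to no answer at all. That reading, the threshold values, and all names below are assumptions; the slide gives no numbers.

```python
from enum import Enum


class Presentation(Enum):
    BEST_ANSWER = "best answer"
    REGULAR_RESULT = "regular result"
    NO_SHOW = "no show"


def route(confidence, high=0.8, moderate=0.5):
    """Map a confidence score in [0, 1] to a presentation decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Presentation.BEST_ANSWER
    if confidence >= moderate:
        return Presentation.REGULAR_RESULT
    return Presentation.NO_SHOW  # low confidence: suppress the answer


# Hypothetical confidence scores from a confidence model.
for score in (0.93, 0.61, 0.12):
    print(score, route(score).value)
```

In practice the thresholds would be tuned per domain and extractor against labeled data, since a confidently wrong answer is more costly than a missing one.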

  62. Some more exotic examples
    ● How much new railway development is planned in Europe?
    Should projects work as planned and remain on schedule, Europe could have more than 17,200 km of new and
    upgraded railway by 2025, out of a total planned pipeline of 39,180 km. This equates to about $97 billion of investment
    in 2022 and $110 billion in 2023. High-speed rail could provide about 22,500 additional km. Industrial and materials
    producers are closely monitoring higher demand. The U.K., France, Russia, Italy, Germany and Sweden -- where
    some listed construction companies operate -- are showing up with leading projects, both in execution or in the
    planning phase. EU governments are making significant efforts to use the co-funding available -- as much as possible --
    to boost their economies. From 2000-17, Spain spent 47.3% of EU funds on high-speed rail.
    ● When will the cruise industry return to profit?
    RECENT EVENT REACTION: Carnival's deflated January bookings for 2H cruises elevates our concern that the
    industry's targeted 2H return to profit could falter if the omicron variant's impact on sales lingers. Though Carnival
    expects to operate over 96% of disclosed 1Q capacity days despite virus disruption, even that shortfall could clip 1Q
    sales by 4% vs. consensus, our analysis shows.

  64. Challenges (and thus Opportunities!)
    ● Federating across many disparate units of retrieval
    ● Partial observability: incomplete/noisy interactions
    ● Augmented intelligence
    ● Staying performant

  66. Federating across many disparate units of retrieval
    ● Search-as-a-platform
    ● In-domain vs. cross-domain search
    ● Domain specialists vs. novice users
    ○ Aliases in-context
    ○ How to quality control?
    ● Sample live queries as much as you can

  71. Federating across many disparate units of retrieval
    ● Entity / Intent / Domain detection, for partial as well as for full
    queries
    ○ In some cases NER is sufficient (when the label is unique)
    ○ Running typeahead prediction first and then NER/NED on the completed query seems
    to work better than running NER/NED directly on partial queries
    ● Adjust downstream ranking and presentation accordingly
    Identifying Named Entities as they are Typed, EACL 2021
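
The "complete first, then tag" idea can be illustrated with a toy sketch. A prefix lookup over a query log stands in for the typeahead model and a dictionary lookup stands in for NER; both are deliberate simplifications, and the log counts, entity dictionary, and function names are hypothetical.

```python
# Hypothetical query log (query -> frequency) and entity dictionary.
QUERY_LOG = {"tesla earnings call": 120, "tesla bonds": 40, "tencent news": 75}
ENTITIES = {"tesla": "COMPANY", "tencent": "COMPANY"}


def complete(prefix):
    """Return the most frequent logged query starting with the prefix, if any."""
    candidates = [q for q in QUERY_LOG if q.startswith(prefix.lower())]
    return max(candidates, key=QUERY_LOG.get) if candidates else None


def tag_entities(query):
    """Dictionary lookup standing in for a real NER model."""
    return [(tok, ENTITIES[tok]) for tok in query.split() if tok in ENTITIES]


def entities_for_partial(prefix):
    """Predict the full query first, then tag entities on the prediction."""
    completion = complete(prefix)
    return tag_entities(completion if completion is not None else prefix.lower())


print(tag_entities("tes"))          # tagging the partial query directly
print(entities_for_partial("tes"))  # tagging the predicted completion
```

Tagging the raw prefix finds nothing because the entity name is incomplete, while tagging the predicted completion recovers it. The trade-off is that a wrong completion propagates into the entity prediction.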

  72. Learning to Rank, in practice

  73. NLP in Practice
    ● Generation 1: Write a bunch of rules (“templates”, “grammars”)
    ○ High-precision
    ○ Slow, manual, difficult to maintain or update
    ● Generation 2: Train a statistical classifier
    ○ For sequence tagging: conditional random fields
    ○ For document classification: logistic regression, SVMs, decision trees/random forests
    ○ Need labeled data
    ● Generation 3: Deep learning and human in the loop
    ○ Need a lot of labeled data, or distant supervision
    ○ May be slower

  74. Learning to Rank in practice
    ● Generation 1: Parametrized BM25F / LM for IR
    ● Generation 2: LambdaRank, RFs, GBRTs, (“deeper” learning)
    ○ Address cold-start issues, add contextual info, enrich instances
    ● Generation 3: Beyond supervised learning
    ○ Neural IR / Dense retrievers / Vector-based similarity
    ○ Reinforcement learning
    ○ Want to optimize for long-term “utility” and stickiness
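
As a reference point for "Generation 1", here is a minimal sketch of plain BM25 scoring over tokenized documents. Note the substitution: the slide names BM25F, the fielded variant, whereas this sketch scores a single field. The defaults `k1=1.2` and `b=0.75` are conventional starting values rather than anything from the talk, and they are the parameters one would tune in a "parametrized" setup. The documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25.

    Assumes docs is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical tokenized documents.
docs = [
    ["oil", "prices", "rise"],
    ["bond", "yields", "fall"],
    ["oil", "oil", "supply", "cut"],
]
print(bm25_scores(["oil"], docs))
```

BM25F extends this by computing a weighted, length-normalized term frequency per field before applying the saturation step, which matters when titles, tickers, and body text should contribute differently.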

  75. Partial observability
    ● How to train LTR models with (i) limited amounts of (ii) weakly
    supervised data?
    ○ System bias: only feedback on seen items
    ○ What if you don’t have that many users?
    ○ What if you have cohorts with outliers/different behaviours?
    ● Cold-start issues around sampling questions from logs for training
    ○ If users don’t know they can ask questions they probably won’t
    ○ Generate + paraphrase questions instead
    Novelty Controlled Paraphrase Generation with Retrieval Augmented
    Conditional Prompt Tuning, AAAI 2022

  76. Temporal relevance
    ● How to deal with annotations/relevance assessments that
    change/decay over time?
    ○ Developments in the real world
    ○ Timeliness of answers, stale answers, expiring answers
    ○ Temporally-anchored answers
    ● Need online metrics, ability to blocklist, continuous (re)training
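
One simple way to act on stale, expiring, and blocklisted answers at serving time is sketched below: drop blocklisted or expired answers, then discount the remaining scores by age. Exponential decay with a half-life is an assumption chosen for illustration, not a method stated in the talk; the `Answer` fields, the 30-day half-life, and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Answer:
    answer_id: str
    score: float
    published: datetime
    expires: Optional[datetime] = None  # None means the answer never expires


def rerank(answers, now, blocklist=frozenset(), half_life_days=30.0):
    """Filter blocklisted/expired answers and decay scores by age."""
    kept = []
    for answer in answers:
        if answer.answer_id in blocklist:
            continue
        if answer.expires is not None and answer.expires <= now:
            continue
        age_days = max((now - answer.published).total_seconds() / 86400.0, 0.0)
        # The score halves every half_life_days.
        decayed = answer.score * 0.5 ** (age_days / half_life_days)
        kept.append((answer.answer_id, decayed))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical answers.
now = datetime(2022, 7, 13)
answers = [
    Answer("fresh", 0.6, now - timedelta(days=1)),
    Answer("old", 0.9, now - timedelta(days=90)),
    Answer("expired", 1.0, now - timedelta(days=2), expires=now - timedelta(days=1)),
    Answer("blocked", 1.0, now),
]
print(rerank(answers, now, blocklist={"blocked"}))
```

A fixed half-life is a blunt instrument: temporally anchored answers (for example, a figure for a specific past quarter) should not decay at all, which is one reason the slide calls for online metrics and continuous retraining rather than a static rule.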

  77. “Augmented” intelligence
    ● In industry, no model is static
    ○ New entities, new vocabulary, new contexts, new relationships, new data, etc.
    ● Humans in the loop
    ○ To generate questions, to annotate results, to judge relevance
    ○ Help prevent model drift and ameliorate lack of recall, precision
    ○ Provide important training data
    ● “Automation is augmentation, not replacement”
    ○ Need effective, easy to use tools for humans to work with algorithms

  79. Practical considerations
    ● Interpretability, explainability
    ● Regulatory / compliance
    ○ Data permissioning: per-role/per-person
    ○ On-/Off-prem data storage, compute
    ○ Encrypting data at rest and in transit
    ○ Private data flows, separated networks
    ○ Right to be forgotten

  80. Practical considerations
    ● Buy vs. build: invest in resources and build from scratch, partner with
    vendor(s), or look at (and potentially improve) open source?
    ○ Type of problem, type of data
    ○ Where is the “alpha”?
    ○ Accuracy
    ○ Transparency
    ○ Customization
    ○ Maintenance, ease of adding more/different data
    ○ Time to market
    ○ Privacy/Regulatory concerns
    ○ Cost

  81. Staying performant
    ● Being dependent on a (search) stack
    ○ Elastic? Solr?
    ○ Migrations, removing tech debt
    ○ Patching up old systems or redesigning?
    ○ Second stage re-ranker in-/outside of Solr
    ● Legacy systems and patchwork processes
    ○ Allocate time to disentangle and move to
    modern platforms and architectures

  82. Conclusion
    ● Search at Bloomberg: making structured and unstructured data machine-
    readable, human-interpretable, discoverable, and findable
    ○ At scale with high accuracy and low latency, to enable swift and effective financial
    decision-making
    ● Deliver value by pushing the state-of-the-art through applied research
    ○ Address challenges encountered in “production” scenarios (cold-start issues,
    confidence modeling, partially observed behavior, system-induced biases, and more)
    ○ Validation through scientific peer review, open source contributions
    ● Generate data and perform continuous annotations/training with a human-in-
    the-loop, to address (some of) these issues

  83. Thank you
    https://TechAtBloomberg.com/AI
    https://TechAtBloomberg.com/data-science-research-grant-program/
    https://www.bloomberg.com/careers
    @edgarmeij | [email protected]