All the Data That’s Fit to Find: Search at The New York Times

Elastic Co
February 19, 2016

Hear how The New York Times put all 15 million of its articles published over the last 160 years into Elasticsearch. From article archives and the ingestion pipeline to cluster setup and the APIs provided to the organization, Boerge will show what search really means at this world-renowned news outlet.

Transcript

  1. All the Data That's Fit to Find: Search at The New York Times
     Boerge Svingen, Senior Manager, @bsvingen, 2016-02-19
  2. Agenda
     1 Some background
     2 How we use search
     3 Our current setup
     4 Where we are going
     5 Conclusions and comments
  3. Agenda: 1 Some background, 2 How we use search, 3 Our current setup, 4 Where we are going, 5 Conclusions and comments
  4. Agenda: 1 Some background, 2 How we use search, 3 Our current setup, 4 Where we are going, 5 Conclusions and comments
  5. Explicit search
     • nytimes.com
     • Native apps
     • Public search API (http://developer.nytimes.com/docs/read/article_search_api_v2)
     • TimesMachine (http://timesmachine.nytimes.com)
     • Internal tools
     • …
  6. Ingestion
     Low rate
     • A few thousand updates per day
     Sources
     • CMS
     • Legacy systems
     • File-based archives
     • OCR
     • …
     Latency
     • < 1 s
     • Low latency is important
     • (Especially for implicit search.)
  7. Agenda: 1 Some background, 2 How we use search, 3 Our current setup, 4 Where we are going, 5 Conclusions and comments
  8. All AWS
     • The search stack runs entirely on AWS
     • Deployed using an in-house infrastructure tool
     • Nagios and New Relic for monitoring
     • Sumo Logic for log management and analytics
  9. System setup
     • Two full production clusters
     • DNS-based failover
     • Also used for maintenance/major changes/reindexing
     • 16-node ES clusters in production (m1.xlarge)
     [Diagram labels: Prod A, Prod B, Staging, Dev, DNS]
  10. [Pipeline diagram; labels as extracted: SQS CMS API Handler Elasticsearch Search API Other data sources Handlers Handlers Handlers Normalization Merge S3 Indexer MongoDB Beanstalk Semantic platform]
  11. Agenda: 1 Some background, 2 How we use search, 3 Our current setup, 4 Where we are going, 5 Conclusions and comments
  12. 1. Full documents in Elasticsearch
      • Store and index full documents in Elasticsearch.
      • No external lookups necessary in the API, no MongoDB.
      • Demand-driven API: clients decide what they want.
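The demand-driven API on slide 12 could look roughly like the sketch below: the client names the fields it wants, and the API passes that list to Elasticsearch as `_source` filtering so that only those fields come back from the stored document. The function name and the field names (`body`, `headline`, `byline`) are hypothetical; the deck does not show the actual API or mapping.

```python
from typing import Iterable


def build_search_body(query_text: str, fields: Iterable[str], size: int = 10) -> dict:
    """Build an Elasticsearch search request body that returns only the requested fields.

    Assumes full documents are stored in Elasticsearch (slide 12), so no
    external lookup is needed to fill in the response.
    """
    return {
        # Hypothetical: full-text match against a field called "body".
        "query": {"match": {"body": query_text}},
        # Source filtering: Elasticsearch returns only these fields of _source.
        "_source": list(fields),
        "size": size,
    }


# A client that only renders headlines and bylines asks for just those fields.
body = build_search_body("election", ["headline", "byline"])
```

The resulting dictionary can be sent as the JSON body of a `_search` request by whatever HTTP or Elasticsearch client is in use.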
  13. 2. Replay
      • Use Kafka.
      • All content will go through Kafka.
      • Kafka will persist all content, source-of-truth for search.
      • Easy to create new search clusters through Kafka replay.
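The replay idea on slide 13 can be illustrated without a broker. The sketch below uses an in-memory append-only log as a stand-in for a Kafka topic and a plain dictionary as a stand-in for a search index; `ContentLog`, `Record`, and `replay` are hypothetical names, and none of this is the Kafka client API. The point is only that a fresh index can be rebuilt by reading the persisted log from the first offset.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional


class Record(NamedTuple):
    offset: int   # position in the log
    key: str      # article id
    value: dict   # full article document


class ContentLog:
    """In-memory stand-in for a persisted, append-only topic."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: dict) -> int:
        record = Record(len(self._records), key, value)
        self._records.append(record)
        return record.offset

    def read_from(self, offset: int = 0) -> Iterator[Record]:
        return iter(self._records[offset:])


def replay(log: ContentLog, index: Optional[Dict[str, dict]] = None,
           from_offset: int = 0) -> Dict[str, dict]:
    """Apply every record from from_offset onward; the latest version of a key wins."""
    index = {} if index is None else index
    for record in log.read_from(from_offset):
        index[record.key] = record.value
    return index


log = ContentLog()
log.append("a1", {"headline": "Draft"})
log.append("a2", {"headline": "Other story"})
log.append("a1", {"headline": "Final"})

# A brand-new "cluster" is populated by replaying the whole log.
new_index = replay(log)
```

Because the log, not the index, is the source of truth, a new cluster or a changed normalization step only requires another replay.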
  14. 3. Keep all clusters busy
      • All production traffic will be replayed from active production cluster to standby production and to staging.
      • Gor
      • Makes sure that standby cache is always warm.
      • Changes to staging will be exposed to production traffic.
      • If standby cluster crashes we will know.
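The traffic mirroring on slide 14 can be sketched at the application level as follows. Gor itself captures and replays HTTP traffic at the network level, so this is not how Gor works internally; it only illustrates the behavior the slide describes. The function `mirror` and the handler names are hypothetical. Only the active cluster's response is returned to the caller, while failures on the mirrored clusters are recorded so that a crashed standby is noticed.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], str]  # takes a request, returns a response


def mirror(request: str, primary: Handler, shadows: Dict[str, Handler],
           failures: List[str]) -> str:
    """Serve from the primary and replay the same request to every shadow cluster."""
    response = primary(request)
    for name, shadow in shadows.items():
        try:
            shadow(request)  # response is discarded; the call keeps caches warm
        except Exception as exc:
            # A failing shadow never affects the user, but it is recorded.
            failures.append(f"{name}: {exc}")
    return response


def healthy(request: str) -> str:
    return "hits for " + request


def crashed(request: str) -> str:
    raise RuntimeError("connection refused")


failures: List[str] = []
response = mirror("q=election", healthy, {"standby": healthy, "staging": crashed}, failures)
```

In a real deployment the mirrored calls would be asynchronous so that slow shadow clusters cannot add latency to production requests.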
  15. 4. Virtualize
      • Vagrant boxes for everything.
      • Make it easy to run full pipeline locally.
      • Provision Vagrant boxes using same system as production servers.
  16. Agenda: 1 Some background, 2 How we use search, 3 Our current setup, 4 Where we are going, 5 Conclusions and comments
  17. • Know your users.
      • Search should be an integrated part of the product.
      • Be consistent across devices and platforms.
      • Focus where you can make a difference.
  18. Think about how to …
      • do deployments
      • do upgrades
      • reindex content
      • change the way you normalize data
      • create new search cluster
      … and make it easy.
  19. Things to do
      • Put your infrastructure in version control.
      • Vagrant everything.
      • Have immutable servers, replace instead of changing.
      • Make deploying a new stack easy and automatic.
  20. Elasticsearch 2.0 new features
      Table 1 (columns: Topic, Issue, Type). Entries are grouped by topic; the type is shown in brackets.

      Aggregations
      • Add HDRHistogram as an option in percentiles and percentile_ranks aggregations #12362 (https://github.com/elastic/elasticsearch/pull/12362) (issue: #8324 (https://github.com/elastic/elasticsearch/issues/8324)) [Analytics]
      • Aggregations: Adds other bucket to filters aggregation #11948 (https://github.com/elastic/elasticsearch/pull/11948) (issue: #11289 (https://github.com/elastic/elasticsearch/issues/11289)) [Analytics]
      • Aggregations: Pipeline Aggregation to filter buckets based on a script #11941 (https://github.com/elastic/elasticsearch/pull/11941) [Analytics]
      • Adds cumulative sum aggregation #11825 (https://github.com/elastic/elasticsearch/pull/11825) [Analytics]
      • Allow users to perform simple arithmetic operations on histogram aggregations #11601 (https://github.com/elastic/elasticsearch/pull/11601) (issue: #11029 (https://github.com/elastic/elasticsearch/issues/11029)) [Analytics]
      • Aggregations: add serial differencing pipeline aggregation #11196 (https://github.com/elastic/elasticsearch/pull/11196) (issue: #10190 (https://github.com/elastic/elasticsearch/issues/10190)) [Analytics]
      • Add Holt-Winters to moving_avg aggregation #11043 (https://github.com/elastic/elasticsearch/pull/11043) [Analytics]
      • Make it possible to configure missing values. #11042 (https://github.com/elastic/elasticsearch/pull/11042) (issue: #5324 (https://github.com/elastic/elasticsearch/issues/5324)) [Analytics]
      • Adding Sum Bucket Aggregation #11013 (https://github.com/elastic/elasticsearch/pull/11013) (issue: #11007 (https://github.com/elastic/elasticsearch/issues/11007)) [Analytics]
      • Adding Average Bucket Aggregation #11010 (https://github.com/elastic/elasticsearch/pull/11010) (issue: #11006 (https://github.com/elastic/elasticsearch/issues/11006)) [Analytics]
      • min_bucket aggregation #10900 (https://github.com/elastic/elasticsearch/pull/10900) (issue: #9999 (https://github.com/elastic/elasticsearch/issues/9999)) [Analytics]
      • Pipeline aggregations: Ability to perform computations on aggregations #10568 (https://github.com/elastic/elasticsearch/pull/10568) (issues: #10000 (https://github.com/elastic/elasticsearch/issues/10000), #10002 (https://github.com/elastic/elasticsearch/issues/10002), #9293 (https://github.com/elastic/elasticsearch/issues/9293), #9876 (https://github.com/elastic/elasticsearch/issues/9876)) [Analytics]
      • Sampler aggregation #10221 (https://github.com/elastic/elasticsearch/pull/10221) (issue: #8108 (https://github.com/elastic/elasticsearch/issues/8108)) [Analytics]
      • PercentageScore heuristic for significant_terms #9747 (https://github.com/elastic/elasticsearch/pull/9747) (issue: #9720 (https://github.com/elastic/elasticsearch/issues/9720)) [Analytics]
      • Return the sum of the doc counts of other buckets in terms aggregations. #8213 (https://github.com/elastic/elasticsearch/pull/8213) [Analytics]
      • Significant terms: add scriptable significance heuristic #7850 (https://github.com/elastic/elasticsearch/pull/7850) [Analytics]
      • Scriptable Metrics Aggregation #7075 (https://github.com/elastic/elasticsearch/pull/7075) (issue: #5923 (https://github.com/elastic/elasticsearch/issues/5923)) [Analytics]
      • Added pre and post offset to histogram aggregation #6980 (https://github.com/elastic/elasticsearch/pull/6980) (issue: #6605 (https://github.com/elastic/elasticsearch/issues/6605)) [Analytics]
      • Added Filters aggregation #6974 (https://github.com/elastic/elasticsearch/pull/6974) (issues: #6118 (https://github.com/elastic/elasticsearch/issues/6118), #6119 (https://github.com/elastic/elasticsearch/issues/6119)) [Analytics]
      • Add children aggregation #6936 (https://github.com/elastic/elasticsearch/pull/6936) [Analytics]
      • Significant Terms: Add google normalized distance and chi square #6858 (https://github.com/elastic/elasticsearch/pull/6858) [Analytics]
      • Infrastructure for changing easily the significance terms heuristic #6561 (https://github.com/elastic/elasticsearch/pull/6561) [Analytics]
      • Added percentile rank aggregation #6432 (https://github.com/elastic/elasticsearch/pull/6432) (issue: #6386 (https://github.com/elastic/elasticsearch/issues/6386)) [Analytics]
      • Deferred aggregations prevent combinatorial explosion #6128 (https://github.com/elastic/elasticsearch/pull/6128) [Analytics]
      • Support bounding box aggregation on geo_shape/geo_point data types. [ISSUE] #5634 (https://github.com/elastic/elasticsearch/pull/5634) [Analytics]
      • Add reverse nested aggregation [ISSUE] #5485 (https://github.com/elastic/elasticsearch/pull/5485) [Analytics]
      • Cardinality aggregation [ISSUE] #5426 (https://github.com/elastic/elasticsearch/pull/5426) [Analytics]
      • Percentiles aggregation [ISSUE] #5323 (https://github.com/elastic/elasticsearch/pull/5323) [Analytics]
      • Significant_terms aggregation #5146 (https://github.com/elastic/elasticsearch/pull/5146) [Analytics]
      • Add preserve original token option to ASCIIFolding #5115 (https://github.com/elastic/elasticsearch/pull/5115) (issue: #4931 (https://github.com/elastic/elasticsearch/issues/4931)) [Analytics]
      • Add script support to value_count aggregations. #5007 (https://github.com/elastic/elasticsearch/pull/5007) (issue: #5001 (https://github.com/elastic/elasticsearch/issues/5001)) [Other]

      Allocation
      • Cancel replica recovery on another sync option copy found #12421 (https://github.com/elastic/elasticsearch/pull/12421) [Other]
      • Optional Delayed Allocation on Node leave #11712 (https://github.com/elastic/elasticsearch/pull/11712) [Other]

      Analysis
      • Add keep_types for filtering by token type #7120 (https://github.com/elastic/elasticsearch/pull/7120) [Search]
      • Add uppercase token filter #5539 (https://github.com/elastic/elasticsearch/pull/5539) [Search]

      CAT API
      • Add _cat/nodeattrs API #12534 (https://github.com/elastic/elasticsearch/pull/12534) (issue: #8000 (https://github.com/elastic/elasticsearch/issues/8000)) [Other]
      • Add wildcard support for header names #11367 (https://github.com/elastic/elasticsearch/pull/11367) (issue: #10811 (https://github.com/elastic/elasticsearch/issues/10811)) [Other]
      • Show open and closed indices in _cat/indices #7936 (https://github.com/elastic/elasticsearch/pull/7936) (issue: #7907 (https://github.com/elastic/elasticsearch/issues/7907)) [Other]
      • Add /_cat/fielddata to display fielddata usage #6086 (https://github.com/elastic/elasticsearch/pull/6086) (issue: #4593 (https://github.com/elastic/elasticsearch/issues/4593)) [Other]
      • Add _cat/plugins endpoint [ISSUE] #4824 (https://github.com/elastic/elasticsearch/pull/4824) [Other]
      • Add _cat/segments [ISSUE] #4711 (https://github.com/elastic/elasticsearch/pull/4711) [Other]

      CRUD
      • Update API: Add support for scripted upserts. #7144 (https://github.com/elastic/elasticsearch/pull/7144) [Other]
      • Update API - allow scripted upserts [ISSUE] #7143 (https://github.com/elastic/elasticsearch/pull/7143) [Other]
      • Update API: Detect noop updates when using doc #6862 (https://github.com/elastic/elasticsearch/pull/6862) (issue: #6822 (https://github.com/elastic/elasticsearch/issues/6822)) [Other]
      • Introducing VersionType.FORCE & VersionType.EXTERNAL_GTE #4993 (https://github.com/elastic/elasticsearch/pull/4993) (issues: #2946 (https://github.com/elastic/elasticsearch/issues/2946), #4213 (https://github.com/elastic/elasticsearch/issues/4213)) [Other]

      Cache
      • Query Cache: Support shard level query response caching #7161 (https://github.com/elastic/elasticsearch/pull/7161) [Other]

      Circuit Breakers
      • Allow setting individual breakers to "noop" breakers #8135 (https://github.com/elastic/elasticsearch/pull/8135) [Other]
      • Add NoopCircuitBreaker used in NoneCircuitBreakerService #8063 (https://github.com/elastic/elasticsearch/pull/8063) [Other]

      Core
      • Expose auto-IO-throttle from Lucene’s ConcurrentMergeScheduler #9243 (https://github.com/elastic/elasticsearch/pull/9243) (issue: #9133 (https://github.com/elastic/elasticsearch/issues/9133)) [Other]
      • Switch auto-generated IDs to Flake IDs from random UUIDs #7531 (https://github.com/elastic/elasticsearch/pull/7531) (issues: #5941 (https://github.com/elastic/elasticsearch/issues/5941), #6004 (https://github.com/elastic/elasticsearch/issues/6004)) [Other]

      Dates
      • Added epoch date formats to configure parsing of unix dates #11453 (https://github.com/elastic/elasticsearch/pull/11453) (issues: #10971 (https://github.com/elastic/elasticsearch/issues/10971), #5328 (https://github.com/elastic/elasticsearch/issues/5328)) [Analytics]

      Index APIs
      • Add date math support in index names #12209 (https://github.com/elastic/elasticsearch/pull/12209) (issue: #12059 (https://github.com/elastic/elasticsearch/issues/12059)) [Other]
      • Added GET Index API #7234 (https://github.com/elastic/elasticsearch/pull/7234) (issue: #4069 (https://github.com/elastic/elasticsearch/issues/4069)) [Other]
      • Force single-segment merges [ISSUE] #5243 (https://github.com/elastic/elasticsearch/pull/5243) [Other]
      • Create index to support aliases [ISSUE] #4920 (https://github.com/elastic/elasticsearch/pull/4920) [Other]
      • Add Recovery API. #4802 (https://github.com/elastic/elasticsearch/pull/4802) (issue: #4637 (https://github.com/elastic/elasticsearch/issues/4637)) [Other]

      Index Templates
      • Made template filtering generic and extensible #7454 (https://github.com/elastic/elasticsearch/pull/7454) (issue: #7459 (https://github.com/elastic/elasticsearch/issues/7459)) [Other]
      • Added support for aliases to index templates #5180 (https://github.com/elastic/elasticsearch/pull/5180) (issues: #1825 (https://github.com/elastic/elasticsearch/issues/1825), #2739 (https://github.com/elastic/elasticsearch/issues/2739)) [Other]

      Indexed Scripts/Templates
      • Allow search templates stored in an index to be retrieved and used at search time #5921 (https://github.com/elastic/elasticsearch/pull/5921) (issues: #5484 (https://github.com/elastic/elasticsearch/issues/5484), #5637 (https://github.com/elastic/elasticsearch/issues/5637)) [Other]

      Internal
      • Added an option to add arbitrary headers to the client requests #7127 (https://github.com/elastic/elasticsearch/pull/7127) [Other]

      Java API
      • Improved Suggest Client API #7507 (https://github.com/elastic/elasticsearch/pull/7507) (issue: #7435 (https://github.com/elastic/elasticsearch/issues/7435)) [Other]

      Logging
      • Infra for deprecation logging #11285 (https://github.com/elastic/elasticsearch/pull/11285) (issue: #11033 (https://github.com/elastic/elasticsearch/issues/11033)) [Other]
      • Infra for deprecation logging #11033 (https://github.com/elastic/elasticsearch/pull/11033) [Other]
      • Add ability to specify a SizeBasedTriggeringPolicy for log configuration #10373 (https://github.com/elastic/elasticsearch/pull/10373) (issue: #10371 (https://github.com/elastic/elasticsearch/issues/10371)) [Other]

      Mapping
      • Bring back numeric_resolution #10420 (https://github.com/elastic/elasticsearch/pull/10420) (issue: #10072 (https://github.com/elastic/elasticsearch/issues/10072)) [Other]
      • Add new default option for timestamp field #7036 (https://github.com/elastic/elasticsearch/pull/7036) (issue: #4718 (https://github.com/elastic/elasticsearch/issues/4718)) [Other]
      • Add transform to document before index. #6599 (https://github.com/elastic/elasticsearch/pull/6599) (issue: #6566 (https://github.com/elastic/elasticsearch/issues/6566)) [Other]
      • Add doc values for binary field #5669 (https://github.com/elastic/elasticsearch/pull/5669) [Other]

      More Like This
      • Add an unlike parameter #8674 (https://github.com/elastic/elasticsearch/pull/8674) [Other]
      • Support for artificial documents in MLT query #7725 (https://github.com/elastic/elasticsearch/pull/7725) [Other]

      Percolator
      • Enable percolation of nested documents #5082 (https://github.com/elastic/elasticsearch/pull/5082) [Other]

      Plugin Delete By Query
      • Add delete-by-query plugin #11516 (https://github.com/elastic/elasticsearch/pull/11516) [Other]

      Plugins
      • add list parse methods to XContentParser #10455 (https://github.com/elastic/elasticsearch/pull/10455) [Other]
      • Migration advisory plugin [ISSUE] #10214 (https://github.com/elastic/elasticsearch/pull/10214) [Other]

      Query DSL
      • Query DSL: Add filter clauses to bool queries. #11142 (https://github.com/elastic/elasticsearch/pull/11142) [Other]
      • Add span within/containing queries. #10913 (https://github.com/elastic/elasticsearch/pull/10913) [Analytics]
      • Add time_zone setting for query_string #8164 (https://github.com/elastic/elasticsearch/pull/8164) (issue: #7880 (https://github.com/elastic/elasticsearch/issues/7880)) [Analytics]
      • Add format support for date range filter and queries #7821 (https://github.com/elastic/elasticsearch/pull/7821) (issue: #7189 (https://github.com/elastic/elasticsearch/issues/7189)) [Analytics]
      • Add min_score parameter to function score query to only match docs above this threshold #7814 (https://github.com/elastic/elasticsearch/pull/7814) (issue: #6952 (https://github.com/elastic/elasticsearch/issues/6952)) [Analytics]
      • Function score multi values #5940 (https://github.com/elastic/elasticsearch/pull/5940) (issue: #3960 (https://github.com/elastic/elasticsearch/issues/3960)) [Other]
      • Add the field_value_factor function to the function_score query #5519 (https://github.com/elastic/elasticsearch/pull/5519) [Other]
      • Added cross_fields type to multi_match query #5005 (https://github.com/elastic/elasticsearch/pull/5005) (issue: #2959 (https://github.com/elastic/elasticsearch/issues/2959)) [Other]
      • Allow for executing queries based on pre-defined templates [ISSUE] #4879 (https://github.com/elastic/elasticsearch/pull/4879) [Other]

      REST
      • API: Add response filtering with filter_path parameter #10980 (https://github.com/elastic/elasticsearch/pull/10980) (issue: #7401 (https://github.com/elastic/elasticsearch/issues/7401)) [Other]
      • Render REST errors in a structural way #10643 (https://github.com/elastic/elasticsearch/pull/10643) (issue: #3303 (https://github.com/elastic/elasticsearch/issues/3303)) [Other]
      • Add CBOR data format support #5509 (https://github.com/elastic/elasticsearch/pull/5509) (issue: #4860 (https://github.com/elastic/elasticsearch/issues/4860)) [Other]

      Recovery
      • Add basic recovery prioritization to GatewayAllocator #11975 (https://github.com/elastic/elasticsearch/pull/11975) (issue: #11787 (https://github.com/elastic/elasticsearch/issues/11787)) [Other]
      • Move index sealing terminology to synced flush #11336 (https://github.com/elastic/elasticsearch/pull/11336) (issues: #10032 (https://github.com/elastic/elasticsearch/issues/10032), #11179 (https://github.com/elastic/elasticsearch/issues/11179), #11251 (https://github.com/elastic/elasticsearch/issues/11251)) [Other]
      • Seal indices for faster recovery #11179 (https://github.com/elastic/elasticsearch/pull/11179) (issue: #10032 (https://github.com/elastic/elasticsearch/issues/10032)) [Other]

      Scripting
      • Add Multi-Valued Field Methods to Expressions #11105 (https://github.com/elastic/elasticsearch/pull/11105) [Other]
      • Add support for fine-grained settings #10116 (https://github.com/elastic/elasticsearch/pull/10116) (issues: #10274 (https://github.com/elastic/elasticsearch/issues/10274), #6418 (https://github.com/elastic/elasticsearch/issues/6418)) [Other]
      • Add script engine for Lucene expressions #6819 (https://github.com/elastic/elasticsearch/pull/6819) (issue: #6818 (https://github.com/elastic/elasticsearch/issues/6818)) [Other]
      • Add Groovy as a scripting language, add groovy sandboxing #6233 (https://github.com/elastic/elasticsearch/pull/6233) [Other]
      • Add Groovy as a scripting language, switching default from Mvel → Groovy #6106 (https://github.com/elastic/elasticsearch/pull/6106) [Other]

      Search
      • Validate API: provide more verbose explanation #10147 (https://github.com/elastic/elasticsearch/pull/10147) (issues: #1412 (https://github.com/elastic/elasticsearch/issues/1412), #88247 (https://github.com/elastic/elasticsearch/issues/88247)) [Other]
      • Add inner hits to nested and parent/child queries #8153 (https://github.com/elastic/elasticsearch/pull/8153) (issues: #3022 (https://github.com/elastic/elasticsearch/issues/3022), #3152 (https://github.com/elastic/elasticsearch/issues/3152)) [Analytics]
      • Sorting: Allow _geo_distance to handle many to many geo point distance #7097 (https://github.com/elastic/elasticsearch/pull/7097) (issue: #3926 (https://github.com/elastic/elasticsearch/issues/3926)) [Other]
      • Add search-exists API to check if any matching documents exist for a given query #7026 (https://github.com/elastic/elasticsearch/pull/7026) (issue: #6995 (https://github.com/elastic/elasticsearch/issues/6995)) [Analytics]
      • Add an option to early terminate document collection when searching/counting #6885 (https://github.com/elastic/elasticsearch/pull/6885) (issue: #6876 (https://github.com/elastic/elasticsearch/issues/6876)) [Analytics]
      • Sequential rescores [ISSUE] #4748 (https://github.com/elastic/elasticsearch/pull/4748) [Other]

      Search Templates
      • Search Templates: Adds API endpoint to render search templates as a response #11570 (https://github.com/elastic/elasticsearch/pull/11570) (issue: #6821 (https://github.com/elastic/elasticsearch/issues/6821)) [Other]

      Settings
      • Add ability to prompt for selected settings on startup #10918 (https://github.com/elastic/elasticsearch/pull/10918) (issue: #10838 (https://github.com/elastic/elasticsearch/issues/10838)) [Other]
      • bootstrap.mlockall for Windows (VirtualLock) #10887 (https://github.com/elastic/elasticsearch/pull/10887) (issues: #8480 (https://github.com/elastic/elasticsearch/issues/8480), #9186 (https://github.com/elastic/elasticsearch/issues/9186), #9923 (https://github.com/elastic/elasticsearch/issues/9923)) [Other]
      • Add checksum option for index.shard.check_on_startup #9183 (https://github.com/elastic/elasticsearch/pull/9183) [Other]

      Shadow Replicas
      • Allow shards on shared filesystems to be recovered on any node #10960 (https://github.com/elastic/elasticsearch/pull/10960) (issue: #10932 (https://github.com/elastic/elasticsearch/issues/10932)) [Other]
      • Shadow replicas on shared filesystems #9727 (https://github.com/elastic/elasticsearch/pull/9727) (issue: #8976 (https://github.com/elastic/elasticsearch/issues/8976)) [Other]

      Stats
      • Add script compilation stats #12733 (https://github.com/elastic/elasticsearch/pull/12733) (issue: #12673 (https://github.com/elastic/elasticsearch/issues/12673)) [Other]
      • Add OS name to _nodes and _cluster/nodes #11807 (https://github.com/elastic/elasticsearch/pull/11807) [Other]
      • Add an API to locate unrecovered shards and their state #11545 (https://github.com/elastic/elasticsearch/pull/11545) (issue: #10952 (https://github.com/elastic/elasticsearch/issues/10952)) [Other]
      • Cluster Health: Add wait time for pending task and recovery percentage #11393 (https://github.com/elastic/elasticsearch/pull/11393) (issue: #10805 (https://github.com/elastic/elasticsearch/issues/10805)) [Other]
      • Add field stats api #10523 (https://github.com/elastic/elasticsearch/pull/10523) [Other]

      Store
      • Add index.data_path setting #9033 (https://github.com/elastic/elasticsearch/pull/9033) (issues: #8819 (https://github.com/elastic/elasticsearch/issues/8819), #8976 (https://github.com/elastic/elasticsearch/issues/8976)) [Other]
      • Add best_compression option for indices #8863 (https://github.com/elastic/elasticsearch/pull/8863) [Other]

      Suggesters
      • Phrase Suggester: Add option to filter out phrase suggestions #6773 (https://github.com/elastic/elasticsearch/pull/6773) (issue: #3482 (https://github.com/elastic/elasticsearch/issues/3482)) [Search]
      • ContextSuggester #4044 (https://github.com/elastic/elasticsearch/pull/4044) (issue: #3959 (https://github.com/elastic/elasticsearch/issues/3959)) [Search]

      Term Vectors
      • Return term vectors as part of the search response #10729 (https://github.com/elastic/elasticsearch/pull/10729) (issue: #10823 (https://github.com/elastic/elasticsearch/issues/10823)) [Search]
      • Support terms filtering #9561 (https://github.com/elastic/elasticsearch/pull/9561) [Search]

      Top Hits
      • Add top_hits aggregation #6124 (https://github.com/elastic/elasticsearch/pull/6124) [Analytics]

      Upgrade API
      • Add API to upgrade old Lucene indices to the latest version #7922 (https://github.com/elastic/elasticsearch/pull/7922) (issue: #7884 (https://github.com/elastic/elasticsearch/issues/7884)) [Other]
  21. Elasticsearch 2.0 new search features
      • Add keep_types for filtering by token type #7120 (https://github.com/elastic/elasticsearch/pull/7120)
      • Add uppercase token filter #5539 (https://github.com/elastic/elasticsearch/pull/5539)
      • Phrase Suggester: Add option to filter out phrase suggestions #6773 (https://github.com/elastic/elasticsearch/pull/6773)
      • ContextSuggester #4044 (https://github.com/elastic/elasticsearch/pull/4044)
      • Return term vectors as part of the search response #10729 (https://github.com/elastic/elasticsearch/pull/10729)
      • Support terms filtering #9561 (https://github.com/elastic/elasticsearch/pull/9561)