The ELK & The Eagle: Search & Analytics for the US Government

Elastic Co
March 11, 2015

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

An ELK journey in two parts. In the first act, the story highlights how the DigitalGov Search team built a search engine to power 1,500 government websites. The second act is all about analytics or, put another way, where the search engine leaves off: at the logs. Loren will discuss how the team migrated its entire Hadoop & MySQL-based data warehouse to ELK. Come hear the stories and find out how they replaced a complicated batch pipeline with a near-real-time system that is intuitive to use for exploration, discovery, reporting, and visualization.

Presented by Loren Siebert, DigitalGov Search

Transcript

  1. The ELK & The Eagle: Search & Analytics for the U.S. Government

    Loren Siebert
    DigitalGov Search
    U.S. General Services Administration
  2. Two Stories; One Talk

    License: CC-BY-ND 4.0

    You know, for search
    • Life with the DSL
    • Recall & Relevancy
    • Chains that Bind Us
    • Pitfalls & Epiphanies

    You know, for analytics
    • Logstash Lessons
    • Kibana Koolness
    • Capacity Capers
    • More Pitfalls & More Epiphanies
  3. DigitalGov Search: Overview

    • Service provided by the U.S. General Services Administration
    • Powers the search box on government sites
    • Improves the public's search experience on .gov & .mil websites
    • Helps agencies better manage their site's search
  4. DigitalGov Search: Brief History

    Year   Sites
    2000   1
    2005   ~200
    2010   307
    2015   1505
  5. Searchers' Experience

    Screenshot labels: Branding, Type-ahead, Spelling, Promoted, 3rd Party, Twitter, YouTube, Flickr/MRSS/Instagram
  6. First Story: Search

    • Customer document types vary from tweets to PDFs
    • Tiny => large (~50MB)
    • Very structured => less structured
    • Custom rules for retrieval
    • Experimented with MySQL and Solr
  7. Search (Information Retrieval)

    • Easy to get started
    • Hard to get right
    • Effort went into analysis chain, not scaling/ops
    • No "one size fits all" analysis chain

    Elasticsearch is a database with an opinion
  8. When to Show What

    The ACL is a useful model here:
    1. Accept All (with some exceptions)
    2. Deny All (with some exceptions)

    Don't Get Whipped by the Long Tail

    [Figure: ranked results list (1. Result, 2. Result, ... 5. And so on, 6. Etc) alongside Promoted, Tweets, and People modules]
  9. Analysis Chains

    • one of our popular analysis chains for English:

        Filter               Example
        asciifolding         é=>e
        lowercase            E=>e
        en_stop_filter       the/an
        en_protected_filter  irs
        en_stem_filter       jobs=>job
        en_synonym           isis=>isil

    • minimal_english, light_spanish
    • open-sourced our stopwords/synonyms/protwords
    • refinement is ongoing; new terms pop up daily
    • have a plan to reindex w/o downtime

    Before choosing a stemmer, you must ask: "In which way do I prefer to be wrong?"
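The filter chain on this slide could be expressed as Elasticsearch index settings roughly as follows. This is a minimal sketch, not the team's actual configuration: the filter names and their order come from the slide, but the analyzer name `en_analyzer`, the `standard` tokenizer, the choice of `minimal_english` for `en_stem_filter`, and the tiny word lists are all assumptions made for illustration.

```python
import json

# Hypothetical word lists, seeded only with the examples shown on the slide.
STOPWORDS = ["the", "an"]
PROTECTED_WORDS = ["irs"]
SYNONYMS = ["isis => isil"]

# Filter order as listed on the slide.
CHAIN = [
    "asciifolding",         # built-in: é => e
    "lowercase",            # built-in: E => e
    "en_stop_filter",       # drops stopwords such as the/an
    "en_protected_filter",  # marks words like "irs" so the stemmer skips them
    "en_stem_filter",       # jobs => job
    "en_synonym",           # isis => isil
]


def build_english_analysis() -> dict:
    """Build an `analysis` settings block for a custom English analyzer."""
    return {
        "filter": {
            "en_stop_filter": {"type": "stop", "stopwords": STOPWORDS},
            "en_protected_filter": {
                "type": "keyword_marker",
                "keywords": PROTECTED_WORDS,
            },
            # Assumption: the slide does not say which stemmer backs this filter.
            "en_stem_filter": {"type": "stemmer", "language": "minimal_english"},
            "en_synonym": {"type": "synonym", "synonyms": SYNONYMS},
        },
        "analyzer": {
            # "en_analyzer" and the standard tokenizer are assumptions.
            "en_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": CHAIN,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps({"settings": {"analysis": build_english_analysis()}}, indent=2))
```

Placing the protected-words filter before the stemmer matters: `keyword_marker` only has an effect on filters that come after it in the chain.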
  10. Methodology

    1. Customer need!
    2. Apply some DSL feature
    3. Minimal effort; seems like magic; ship it!
    4. Uh oh (performance, recall, relevancy, ...)
    5. Go behind the curtain and troubleshoot the magic
  11. Example: Fuzziness

    Magic
        Query On    Results For
        Bohner      Boehner
        measels     measles

    Uh oh
        Query On    Results For
        contracts   contacts
        parents     patents
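Both halves of this slide follow from the same mechanism: fuzzy matching in Elasticsearch is based on edit distance, and by default a transposition of two adjacent characters counts as a single edit. The sketch below computes that distance (the optimal string alignment variant) in plain Python for the four pairs on the slide. It illustrates the metric only and is not Elasticsearch's implementation; the function name is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute, and adjacent swap each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "el" <-> "le".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


PAIRS = [
    ("bohner", "boehner"),      # magic: missing letter
    ("measels", "measles"),     # magic: swapped letters
    ("contracts", "contacts"),  # uh oh: both are valid words
    ("parents", "patents"),     # uh oh: both are valid words
]

if __name__ == "__main__":
    for query, result in PAIRS:
        print(query, "->", result, "distance", osa_distance(query, result))
```

All four pairs are one edit apart, so a distance threshold alone cannot tell a helpful correction from an unwanted match between two correctly spelled words.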
  12. Search Summary

    Pitfalls
    • "advanced" queries
    • multiple datacenters
    • relevancy regressions

    Epiphanies
    • zero downtime reindexing
    • internal search vs public search
    • drawing the line
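The slides list zero downtime reindexing without saying how it was done. One widely used pattern, shown here purely as an assumption about what such a plan might look like, is to have clients query an alias, build a new index alongside the old one, and then move the alias in a single atomic `_aliases` request. The index and alias names below are hypothetical.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body of a POST /_aliases request that repoints an alias.

    Both actions are applied atomically, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: documents reindexed from v1 into v2 with a new analysis chain.
    body = alias_swap_actions("documents", "documents_v1", "documents_v2")
    print(json.dumps(body, indent=2))
```

The old index can be kept until the new one has been verified, which also gives a quick rollback path if a relevancy regression shows up.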
  13. Second Story: Analytics

    • Customers want to see and use their search analytics
    • Pageloads/Searches/Impressions/Clicks
    • Need to fine tune and promote their content
    • Nobody wants yesterday's insights tomorrow

    It's a dessert topping and a floor wax!
  14. Count All the Things...

    We got more searches on terms today that didn't get any searches yesterday than we got searches today on all the terms that did get searched on yesterday.

    ...but beware long tail aggregations
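The observation on this slide can be checked with a simple comparison of two days of query counts. The sketch below is an illustration with made-up data, not the team's pipeline: it splits today's search volume into searches on terms that had no searches yesterday and searches on terms that did.

```python
from collections import Counter
from typing import Tuple


def split_by_novelty(today: Counter, yesterday: Counter) -> Tuple[int, int]:
    """Return (searches today on new terms, searches today on terms seen yesterday)."""
    new_volume = 0
    repeat_volume = 0
    for term, count in today.items():
        if yesterday.get(term, 0) > 0:
            repeat_volume += count
        else:
            new_volume += count
    return new_volume, repeat_volume


# Hypothetical per-term search counts for two consecutive days.
YESTERDAY = Counter({"passport": 5, "jobs": 3})
TODAY = Counter({"passport": 2, "ebola": 1, "form 1040": 1, "measles": 1})

if __name__ == "__main__":
    new_volume, repeat_volume = split_by_novelty(TODAY, YESTERDAY)
    print("new terms:", new_volume, "repeat terms:", repeat_volume)
```

When the first number exceeds the second, most of the day's volume sits in the long tail, which is why aggregations keyed on raw query terms can grow very large.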
  15. Capacity Planning: Use Math!

    Our needs:
    • Anticipated indexing rate (and near real-time requirements)
    • Anticipated queries (and latency requirements)

    Some variables to consider:
    • Document count * size * replicas
    • Indexed Fields: types, analyzers, term vectors, position offsets
    • Various caches

    Hardware: Nodes, RAM, Disk, I/O
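The "document count * size * replicas" term can be turned into a back-of-the-envelope estimate like the one below. This is a rough sketch under stated assumptions: every number in the example is hypothetical, and `overhead_factor` is an invented knob standing in for the effect of analyzers, term vectors, and offsets, which the slide names but does not quantify.

```python
def estimate_index_bytes(
    doc_count: int,
    avg_doc_bytes: int,
    replicas: int,
    overhead_factor: float = 1.0,
) -> int:
    """Estimate on-disk bytes for an index.

    Each replica is a full extra copy, so the data is stored (1 + replicas) times.
    overhead_factor is a hypothetical multiplier for indexing overhead.
    """
    copies = 1 + replicas
    return int(doc_count * avg_doc_bytes * copies * overhead_factor)


if __name__ == "__main__":
    # Hypothetical workload: 1M events per day, 2 KB each, one replica, 30 days retained.
    per_day = estimate_index_bytes(1_000_000, 2_000, replicas=1)
    print("per day (GB):", per_day / 1e9)
    print("30 days (GB):", 30 * per_day / 1e9)
```

An estimate like this only bounds disk usage; caches and heap need their own measurements, which is the point of the next slide.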
  16. Capacity Planning: My Way

    "You never know what is enough unless you know what is more than enough." - William Blake
  17. How We Use Analytics Data

    1. Identify a metric that we'd like to improve
    2. Try to identify the root problem and fix it
    3. Verify that the metric improved
  18. Example: Click-Thru Ratios (CTRs)

    Module     CTR%
    Main       70%
    Promoted   35%
    Jobs       23%
    Tweets     9%

    Filter: Jobs
    Site       CTR%
    abc.gov    26%
    def.gov    25%
    ghi.gov    21%
    jkl.gov    1%

    Filter: jkl.gov & Jobs
    Query        CTR%
    internal     1%
    employment   10%
    internship   11%
    job          10%
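The drill-down on this slide (CTR by module, then by site within one module, then by query within one site and module) amounts to filtering impression events and grouping by a different field each time. The sketch below shows that logic in plain Python, assuming CTR means clicks divided by impressions. The event layout and sample data are hypothetical, and in the talk's setup this work would be done by aggregations rather than application code.

```python
from collections import defaultdict
from typing import Dict, Iterable


def ctr_by(events: Iterable[dict], key: str, **filters: str) -> Dict[str, float]:
    """Group impression events by `key` and return click-through percent per group.

    Only events whose fields equal every value in `filters` are counted.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        if any(event.get(field) != value for field, value in filters.items()):
            continue
        shown[event[key]] += 1
        if event["clicked"]:
            clicked[event[key]] += 1
    return {group: 100.0 * clicked[group] / shown[group] for group in shown}


# Hypothetical impression events; each records whether the result was clicked.
EVENTS = [
    {"module": "Jobs", "site": "abc.gov", "query": "job", "clicked": True},
    {"module": "Jobs", "site": "jkl.gov", "query": "internal", "clicked": False},
    {"module": "Jobs", "site": "jkl.gov", "query": "job", "clicked": True},
    {"module": "Main", "site": "abc.gov", "query": "job", "clicked": True},
]

if __name__ == "__main__":
    print(ctr_by(EVENTS, "module"))
    print(ctr_by(EVENTS, "site", module="Jobs"))
    print(ctr_by(EVENTS, "query", module="Jobs", site="jkl.gov"))
```

Each call narrows the filter by one more field, mirroring the steps from the module table to the site table to the query table.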
  19. Phases of Data Utility

    1. Stored it, but couldn't really query it
    2. Queryable, but no rollups/aggregations
    3. Aggregations, but no visuals
    4. Visuals, but canned
    5. Composable visualizations, ad hoc exploration

    When questions are cheap, people ask more of them.
  20. First Scaling Hurdle (Or, "When the wheels came off")

    • Analytics queries on clicks got slow
    • Failed Nagios check (but Marvel was green)
    • 1GB for Field Data... every day
    • { _type: "click", url: "http://url1" }
    • { _type: "pageload", url: "http://url2" }
  21. Analytics Summary

    Pitfalls
    • capacity planning can be hard
    • testing analytics can be harder
    • glimpsing the obvious

    Epiphanies
    • data structures matter
    • give everyone flashlights
    • mouse not keyboard
  22. Live Demo

    Pick a site: nih.gov, whitehouse.gov, army.mil, sec.gov
    Pick a query term: aggregation, synonym, fuzziness, stemming
  23. Open Source Repositories

    • https://github.com/GSA/asis (Image Search)
    • https://github.com/GSA/jobs_api (Job Search)
    • https://github.com/GSA/punchcard (Linguistics)
  24. Example: trending terms

    • anomaly detection
    • oooh, significant_terms looks cool
    • what is the background set exactly?
    • relative changes vs absolute changes?
    • how do I tune this thing?
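For readers unfamiliar with `significant_terms`, a request along these lines is one way to ask for trending terms. It is a hedged sketch, not the query used in the talk: the field names `query` and `@timestamp`, the aggregation name `trending`, and the time window are assumptions. The search query selects the foreground set (recent searches). The background set defaults to all documents in the targeted indices, and the optional `background_filter` narrows it.

```python
import json
from typing import Optional


def trending_terms_request(
    field: str = "query",
    since: str = "now-1d",
    size: int = 10,
    background_filter: Optional[dict] = None,
) -> dict:
    """Build a search body that runs significant_terms over recent documents."""
    agg = {"field": field, "size": size}
    if background_filter is not None:
        # Restrict the background set instead of using every document in the index.
        agg["background_filter"] = background_filter
    return {
        "size": 0,  # return only aggregation results, no hits
        "query": {"range": {"@timestamp": {"gte": since}}},  # foreground set
        "aggs": {"trending": {"significant_terms": agg}},
    }


if __name__ == "__main__":
    # Hypothetical: compare the last day against the last 30 days only.
    body = trending_terms_request(
        background_filter={"range": {"@timestamp": {"gte": "now-30d"}}}
    )
    print(json.dumps(body, indent=2))
```

The choice of background set is exactly the tuning question the slide raises: a term is "significant" only relative to whatever it is being compared against.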