
Elastic{ON} Tour D.C. - Elastic at NARA for Search

Elastic Co
October 27, 2017


Elastic{ON} Tour D.C. - October 26, 2017

The National Archives and Records Administration is building the next generation of electronic records archives. See how they're planning to use Elasticsearch to power discovery for millions of archived records from federal agencies, judicial proceedings, congressional work, and presidential administrations.

Kevin McCarthy | System Architect | National Archives & Records Administration



Transcript

  1. Elastic at NARA for Search: Electronic Records Archives (ERA 2.0)

    Kevin McCarthy, System Architect, National Archives & Records Administration
    October 26, 2017
  2. What is NARA? What do we do?

    • Independent agency of the U.S. Government
    • Charged with preserving and describing Federal and historical records
    • Also charged with providing public access to those documents (when allowed)
    • Administers 14 Presidential Libraries (from Herbert Hoover on)
    • Includes the Office of the Federal Register
    • Works with several digitizing partners to convert paper records to electronic form
  3. What is NARA? (Cont.) Issues that our agency faces from our mandate

    • Must preserve and provide access to Executive, Presidential, Legislative, Judicial, personnel (military and civil), and donated records
    • Must handle dozens of digital format types, some 30-40 years old
    • Volumes already in storage are in the PB range, soon XB
    • Must ensure integrity of everything, forever
    • Under certain, very prescribed situations, must be able to respond to special requests (FOIA, Special Access, litigation, subpoenas)
    • Records are subject to an immense number of laws and regulations governing their management and access
  4. What is ERA? A brief history of ERA

    ERA 1.0 – a “System of Systems”
    • Planning started in 1997
    • ERA Base Instance for Federal Records – 2008
    • EOP Instance – 2008
    • Congressional Records Instance (CRI) – 2009
    • Census (2000 and 2010) – 2013
    • More were planned, but never materialized (e.g. Classified)

    But there were challenges:
    • Scalability of Ingest
    • Verification, Processing & Preservation capabilities
    • Managing non-agency digital materials (Digital Surrogates, Legislative, Supreme Court, etc.)
    • Search & Discovery
    • System Maintainability & Extensibility
  5. What is ERA 2.0? What’s gone into this version?

    • Evolution of original ERA concepts
    • Lessons learned -- simple, flexible, service oriented
    • Highly scalable
    • Cloud based
    • Data at rest -- bring the application to the data
    • Comprehensive and extensive search/discovery capabilities (that’s where Elasticsearch comes in)
  6. ERA – Migration from 1.0 to 2.0

    [Diagram; labels as extracted]
    • ERA 1.0 (Base)
    • Processing & verification capabilities
    • Safe storage & retrieval
    • Metadata mgmt
    • Business workflow & forms mgmt
    • Digital Processing Environment (DPE)
    • Digital Object Repository (DOR)
    • Business Object Management (BOM)
    • ERA 2.0
  7. ERA 2.0 Architecture

    [Architecture diagram; labels as extracted]
    • AWS Region – us-east-1, US East (N. Virginia)
    • VPC – Pilot/Production
    • Security groups: Public, Private, Common
    • Networking: Elastic IP, Private IP, Internet Gateway/NAT Router, Elastic Load Balancer, Route 53 DNS
    • Connections: Pilot/Prod-to-DIT VPG Peer Connection, NARANet-to-Pilot FortyCloud VPN Connection
    • Common: OpenVPN, FortyCloud, ApacheDS, Admin tools
    • Background Agent Processing
    • Data Warehouse & Application Data
    • DPE Elasticsearch
    • DOR Elasticsearch
    • Archivist Workbenches
    • Web/Application Servers
    • External Agency Upload S3 Bucket
    • EBS Volumes: Databases & Application Data
    • S3 Buckets: Content, Metadata, DPE Storage, Application Logs
  8. Where we use Elastic in ERA 2.0

    Elasticsearch
    • Currently using the AWS ES service as an accelerator
    • However, we cannot adequately secure that service to obtain our ATO
    • Also discussed using the Elastic Cloud service
    • As a result, we will instead build the open source version of Elasticsearch in our own VPC
    • Metadata discovery in DPE -- one large index to rule them all
    • Content and metadata discovery in DOR -- indexes segregated by Preservation Group
    • Makes searches in DOR easy; just assemble the indexes to use based on the user’s group membership
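The last bullet on this slide (assembling the DOR indexes from a user's group membership) could look roughly like the sketch below. It is only an illustration, not NARA's implementation: the `dor-<group>` index naming scheme and the function names are hypothetical. The one piece it relies on from Elasticsearch itself is that a search request can target several indexes by listing them, comma-separated, in the request path.

```python
from typing import Iterable, List


def dor_index_name(group: str) -> str:
    """Map a Preservation Group name to an index name.

    The "dor-<group>" scheme is an assumption made for this sketch.
    """
    return "dor-" + group.strip().lower().replace(" ", "-")


def indexes_for_user(groups: Iterable[str]) -> List[str]:
    """Return the de-duplicated, sorted indexes a user may search."""
    return sorted({dor_index_name(g) for g in groups})


def search_path(groups: Iterable[str]) -> str:
    """Build a multi-index search path such as /idx-a,idx-b/_search."""
    indexes = indexes_for_user(groups)
    if not indexes:
        # Without this guard the path would be "/_search",
        # which targets every index in the cluster.
        raise PermissionError("user belongs to no Preservation Group")
    return "/" + ",".join(indexes) + "/_search"
```

Refusing to build a path for a user with no groups matters here, because an empty index list would otherwise widen the search instead of narrowing it.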
  9. ERA 2.0 Preservation Groups

    Digital Objects are assigned to a Preservation Group at ingest; access to each object is determined through the Group’s access profile. Users inherit access privileges from all groups to which they belong.
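One way to read the Preservation Group rule is sketched below. The data model is hypothetical: the slide does not say what an access profile contains, so it is modeled here as a plain set of privilege names, and `PreservationGroup`, `can_access`, and `inherited_privileges` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set


@dataclass(frozen=True)
class PreservationGroup:
    name: str
    # Assumed shape of an access profile: the privileges granted to members.
    access_profile: FrozenSet[str]


def can_access(user_groups: Set[str], object_group: PreservationGroup,
               privilege: str) -> bool:
    """An object is governed by the profile of the group it was ingested into."""
    return (object_group.name in user_groups
            and privilege in object_group.access_profile)


def inherited_privileges(user_groups: Set[str],
                         groups: Iterable[PreservationGroup]
                         ) -> Dict[str, FrozenSet[str]]:
    """Collect the profile of every group the user belongs to."""
    return {g.name: g.access_profile for g in groups if g.name in user_groups}
```

Under this reading, membership in one group grants nothing on objects held by another group, while a user in several groups accumulates the profiles of all of them.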
  10. Where we use Elastic in ERA 2.0

    Logstash
    • Logging occurs in all components and at various levels (CloudTrail, application layer)
    • Each server runs its own Logstash, which collects these logs and dumps that data into an S3 bucket
    • Once in the S3 bucket, Lambda functions perform ETL on the data to populate our data warehouse
    • In addition, Logstash pushes the same log data to Elasticsearch for indexing
    • We use Kibana to visualize that day’s logs (but will use it for more later)
    • Elasticsearch keeps log indexes for 1 day, then creates a snapshot for preservation
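The one-day retention rule in the last bullet can be illustrated with a small selection function. This is a sketch under assumptions: it supposes daily indexes named `logstash-YYYY.MM.DD` (the slides do not give the naming pattern), and it interprets "keeps log indexes for 1 day" as "only the current day's index stays live". Taking the snapshot and removing the index are left out; the function only decides which indexes are due.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List

PREFIX = "logstash-"  # assumed daily index prefix


def indexes_to_snapshot(names: Iterable[str], today: date,
                        keep_days: int = 1) -> List[str]:
    """Return the daily log indexes that have aged out of the live cluster."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not a log index (e.g. a DOR or DPE index)
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if day <= cutoff:
            expired.append(name)
    return sorted(expired)
```

Skipping names that do not parse as dates keeps the function from ever nominating a non-log index for removal.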
  11. Where we plan to use Elastic Stack

    Logstash
    • Rebuilding our Data Warehouse into a Data Lake
    • Use Logstash (and perhaps Beats, still investigating) to populate it

    Kibana
    • Reports -- using Kibana against our Data Lake
    • Kibana will be used to produce ad-hoc reports based on end users’ experience level
  12. Where we are and where we’re going

    • ‘Broke ground’ Feb. 2015
    • Initially developed a Pilot system, to garner user feedback
    • Pilot phase completed Sep. 2017
    • Currently working on an ATO to put ERA 2.0 into Production
    • Production release currently scheduled for Aug. 2018
    • After that, the ‘Great Migration’
    • Our goal is to realize the original dream envisioned with ERA 1.0