SEEKing Truth Among 10 Billion Logs

Elastic Co
December 10, 2015

SEEK Limited is a global leader in employment, education, and volunteer opportunities in 17 countries, hosting 100 million job seeker profiles and over 3 million available job opportunities at any given time. With websites attracting over 375 million monthly visits, SEEK needed to build a mechanism to centralize logs from multiple sources, simplify search across all of them, and generate timely results with visual presentation.

Christopher Phan | Elastic{ON} Tour Melbourne | December 10, 2015

Transcript

  1. Introduction

    Christopher Phan
    •  Recently graduated from Swinburne University
    •  Studied Bachelor of Information Technology
    •  Currently working at SEEK in the Security Operations Team

    SEEK Limited
    •  Global leader in employment, education and volunteer marketplaces, which span 17 countries
    •  SEEK hosts 100 million job seeker profiles and has over 3 million job opportunities available at any given time globally
    •  SEEK websites attract over 375 million visits per month
  2. How has SEEK used Elasticsearch

    What did we need to achieve our goal?
    •  Provide stakeholders the ability to search and correlate results from multiple log sources in an effort to promote continuous delivery, proactive security, and improve end-user experience.

    What did we need to do?
    •  Build a mechanism to centralise logs from multiple sources and simplify search across all log sources.
    •  Generate timely results and be able to present them visually.
  3. Problems before Elasticsearch

    •  Log files stored in multiple locations
    •  Data logged in large flat files (>300 MB per file)
    •  Lack of tools to effectively search and correlate data
    •  Unable to search all required log sources to get a complete picture
    •  Time intensive (search performance: slow and manual)
    •  Limited retention periods (data often deleted before search could be performed)
  4. Our Journey with Elasticsearch

    Stages shown on the timeline: Elastic POC, Evaluation, Production, Implementation, Onwards

    Steps
    •  Conducted POC using a physical environment
    •  Discovered key learnings and challenges
    •  Evolved requirements for production
    •  Build a distributed cluster
    •  Migrated hosting to cloud for scalability and availability
    •  Integrate data with other visualisation platforms

    Details shown under the timeline
    •  Single node cluster; Windows Server 2008; Elasticsearch v. 0.90; 10 day retention; 3 log sources
    •  Platform reliability; limited scalability; performance issues; security concerns
    •  Horizontal scalability; self-maintained solution; 1 year log retention; multiple log sources
    •  Integrate logging and searching tools; provide a visual interface to end-users in real-time
  5. Elasticsearch POC

    We wanted to focus the POC to prove we could correlate search and results from three different teams.

    Security
    •  Proactive security
    •  Track firewall violations to weblog patterns
    •  Monitor DDoS & brute force attempts

    DevOps
    •  Operational monitoring leading to continuous deployment
    •  Track deployment success
    •  Webserver health checks

    Fraud
    •  Providing meaningful information in real-time to end-users
    •  Determine impact of fraudulent activity
  6. Key Learnings from our POC

    Platform Reliability
    •  Java memory leak – CircuitBreakerException
    •  Elasticsearch.conf not read
    •  System required manual recovery

    Limited Scalability
    •  Hardware constraints due to single physical node

    Performance Issues
    •  Searching could take minutes to complete

    Security Concerns
    •  Authentication not granular enough
  7. Our Journey with Elastic

    Stages shown on the timeline: Elastic POC, Evaluation, Production, Implementation, Onwards

    Steps
    •  Conducted POC using a physical environment
    •  Discovered key learnings and challenges
    •  Evolved requirements for production
    •  Build a distributed cluster
    •  Migrated hosting to cloud for scalability and availability
    •  Integrate data with other visualisation platforms

    Details shown under the timeline
    •  Single node cluster; Windows Server 2008; Elasticsearch v. 0.90; 10 day retention; 3 log sources
    •  Platform reliability; limited scalability; performance issues; security concerns
    •  Horizontal scalability; self-maintained solution; 1 year log retention; multiple log sources
    •  Integrate logging and searching tools; provide a visual interface to end-users in real-time
  8. Addressing Challenges identified from POC

    Data Management
    •  Growth of log repository to 30TB
        §  Using Curator plugin to manage data
        §  Hot – Warm – Cold model
    •  Rules for retention and archiving
        o  Close > 15 days
        o  Delete > 30 days
        o  Daily S3 snapshots
        o  S3 to Glacier > 60 days

    Security Concerns around user access
    •  Move from Apache to Shield
        §  Active Directory groups permissions
        §  Index + Alias = Permissions
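
    The retention rules on this slide can be sketched as a small age-based decision function. This is only an illustration of the thresholds (close after 15 days, delete after 30 days, move snapshots to Glacier after 60 days); it assumes daily indices and daily snapshots, and the function and action names are hypothetical. In the deployment described, Curator and S3 perform the actual work.

```python
from datetime import date

# Thresholds taken from the slide; everything else here is a hypothetical sketch.
CLOSE_AFTER_DAYS = 15
DELETE_AFTER_DAYS = 30
GLACIER_AFTER_DAYS = 60


def index_action(index_date: date, today: date) -> str:
    """Decide what happens to a daily index in the live cluster."""
    age_days = (today - index_date).days
    if age_days > DELETE_AFTER_DAYS:
        return "delete"
    if age_days > CLOSE_AFTER_DAYS:
        return "close"
    return "keep_open"


def snapshot_storage(snapshot_date: date, today: date) -> str:
    """Decide where a daily snapshot should live."""
    age_days = (today - snapshot_date).days
    return "glacier" if age_days > GLACIER_AFTER_DAYS else "s3"


if __name__ == "__main__":
    today = date(2015, 12, 10)
    for d in (date(2015, 12, 1), date(2015, 11, 20), date(2015, 11, 1)):
        print(d, index_action(d, today), snapshot_storage(d, today))
```

    Under these rules an index deleted from the cluster is still recoverable from its snapshot, which is how a 30-day on-disk window can coexist with a one-year retention period.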
  9. Addressing Challenges identified from POC – cont.

    Maintaining 99% uptime
    •  Separate Marvel cluster
        §  Isolate Marvel logs
    •  Automated Watcher Alerts
        §  0 logs returned
        §  Field data > 90%
        §  Email alerts
    •  Cronjob curls
        §  Webhook notifications
        §  Alert emails
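
    The two alert conditions named on this slide (no logs returned, field data above 90%) can be expressed as simple checks. The sketch below is not Watcher configuration; it is a plain Python illustration with hypothetical function and parameter names, and it assumes the field data figure is a ratio of bytes used to a configured limit.

```python
from typing import List

FIELDDATA_THRESHOLD = 0.90  # "Field data > 90%" from the slide


def evaluate_alerts(log_count: int,
                    fielddata_used_bytes: int,
                    fielddata_limit_bytes: int) -> List[str]:
    """Return the alert messages that should fire for one check interval."""
    alerts = []
    if log_count == 0:
        alerts.append("no logs returned")
    if fielddata_limit_bytes > 0:
        usage = fielddata_used_bytes / fielddata_limit_bytes
        if usage > FIELDDATA_THRESHOLD:
            alerts.append("field data above 90%")
    return alerts


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    print(evaluate_alerts(log_count=0,
                          fielddata_used_bytes=95,
                          fielddata_limit_bytes=100))
```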
  10. Building to Scale

    [Architecture diagram]

    Cluster
    •  7 Data Nodes (Native TCP)
    •  3 Master Nodes (TCP)
    •  2 Client Nodes
    •  Elastic Load Balancer
    •  S3/Glacier

    Users
    •  Security, DevOps, Product, Fraud, Internal Systems, Data Analytics

    Log shippers
    •  Syslog: Firewall & DC logs
    •  River & SeriLog: Database logs
    •  nxLog: Webserver & Windows event logs

    Protocols shown on the diagram
    •  TCP, HTTP
    •  TCP, UDP, HTTP

    Log Sources
    •  Web Application Firewall
    •  Web Server Logs
    •  DB Errors
    •  Application Logs
    •  Internal Firewall
    •  Domain Controller
    •  Windows Event Logs
    •  Cloudtrail Logs
  11. Use Case 1 - Investigating an Infected Device

    BEFORE
    •  Manual data gathering
    •  Anti-virus logs
    •  Firewall logs
    •  Forensics on the machine

    AFTER
    •  Live internal network monitoring
    •  Chain firewall, Windows event logs and domain controller
    •  Flag connections to blacklisted IPs and URLs
    •  Track events on a user/host level
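
    As a rough illustration of flagging connections to blacklisted IPs and tracking them per host, the sketch below groups matching events by originating host. The event field names (`host`, `dest_ip`) and the sample addresses are assumptions for the example, not SEEK's log schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def flag_blacklisted(events: Iterable[dict],
                     blacklist: Set[str]) -> Dict[str, List[dict]]:
    """Group events whose destination IP is blacklisted by originating host."""
    flagged = defaultdict(list)
    for event in events:
        if event["dest_ip"] in blacklist:
            flagged[event["host"]].append(event)
    return dict(flagged)


if __name__ == "__main__":
    # Hypothetical firewall events using documentation-range addresses.
    sample_events = [
        {"host": "pc-1", "dest_ip": "203.0.113.5"},
        {"host": "pc-1", "dest_ip": "198.51.100.7"},
        {"host": "pc-2", "dest_ip": "203.0.113.5"},
    ]
    for host, hits in flag_blacklisted(sample_events, {"203.0.113.5"}).items():
        print(host, len(hits))
```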
  12. Use Case 2 – Measuring Customer Response

    •  Monitor for a decrease in candidate applications
    •  Determine cause of registration drop-off
    •  Live tracking of Blue-Green testing
  13. Use Case 3 – Scraping before Elasticsearch

    BEFORE
    •  Time + volume based rules
    •  Block cloud services
    •  High false positive count
    •  Difficult to track new behaviour

    AFTER
    •  Monitor specific end-points
    •  Track sources without users
    •  Count unique URLs per source
    •  Analyze cookie + referrer + user-agent patterns
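
    Two of the "after" signals, unique URLs per source and sources with no associated user, can be sketched as simple aggregations. In the deployment described these would be queries against Elasticsearch; the plain Python below only illustrates the logic, and the record fields (`source_ip`, `url`, `user_id`) are assumed names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def unique_urls_per_source(requests: Iterable[dict]) -> Dict[str, int]:
    """Count the distinct URLs requested by each source IP."""
    urls = defaultdict(set)
    for request in requests:
        urls[request["source_ip"]].add(request["url"])
    return {source: len(seen) for source, seen in urls.items()}


def sources_without_users(requests: Iterable[dict]) -> Set[str]:
    """Return source IPs that never appear with a logged-in user."""
    all_sources = set()
    with_user = set()
    for request in requests:
        all_sources.add(request["source_ip"])
        if request.get("user_id"):
            with_user.add(request["source_ip"])
    return all_sources - with_user


if __name__ == "__main__":
    # Hypothetical web log records for illustration only.
    sample = [
        {"source_ip": "203.0.113.9", "url": "/job/1", "user_id": None},
        {"source_ip": "203.0.113.9", "url": "/job/2", "user_id": None},
        {"source_ip": "198.51.100.2", "url": "/job/1", "user_id": "u42"},
    ]
    print(unique_urls_per_source(sample))
    print(sources_without_users(sample))
```

    A source that requests many distinct URLs and never carries a user identifier is a candidate scraper, which is a narrower test than the time and volume rules listed under BEFORE.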
  14. What Elasticsearch means to SEEK

    Cluster
    •  5000 docs indexed per second
    •  ~850 shards
    •  >100 ‘live’ indexes

    Storage
    •  3-5 billion docs searchable
    •  7-10 billion docs on disk
    •  ~100 billion retrievable documents
    •  1 year retention period
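
    As a rough sanity check on these figures, the arithmetic below projects document counts from the stated indexing rate. It assumes the 5000 docs per second rate is constant, which the slide does not state, and it assumes that "searchable", "on disk" and "retrievable" correspond to the 15-day close, 30-day delete and 1-year retention windows described earlier in the deck.

```python
DOCS_PER_SECOND = 5000          # from the slide
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def docs_indexed(days: int, rate: int = DOCS_PER_SECOND) -> int:
    """Documents indexed over `days` at a constant rate (an assumption)."""
    return rate * SECONDS_PER_DAY * days


if __name__ == "__main__":
    for label, days in [("open indices, 15 days", 15),
                        ("on disk, 30 days", 30),
                        ("retained, 365 days", 365)]:
        print(f"{label}: {docs_indexed(days) / 1e9:.2f} billion")
    # open indices, 15 days: 6.48 billion
    # on disk, 30 days: 12.96 billion
    # retained, 365 days: 157.68 billion
```

    These upper bounds are somewhat higher than the slide's 3-5 billion, 7-10 billion and ~100 billion, which would be consistent with an average indexing rate below 5000 docs per second.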
  15. Conclusion

    Teams: Security, Internal Systems, Fraud, DevOps, Product, Data Analytics

    •  Watch and identify scrapers
    •  Monitor for DDoS/Brute Force
    •  Malicious behavior analysis
    •  Incident Response timeline
    •  Web farm health monitoring
    •  Volume monitoring for error logs
    •  Track effects of phased deployments
    •  Monitor fraudulent user activity
    •  Find identifiers of fraudulent users
    •  Measure customer response times
    •  Track application status codes
    •  Identify cause of user drop-off
    •  Link firewall logs to Windows event logs
    •  Monitor domain controller events
    •  Combine big data visualisation with drill-down capability
    •  Fluctuations in expected behavior
  16. Thank you for listening

    Christopher Phan
    Email: [email protected]
    LinkedIn: https://au.linkedin.com/in/christopher-phan-6a500051