Massively Scalable Services at AVAST: Case Study

Protection against zero-day attacks and polymorphic malware, like computer security in general, is moving more and more to the "cloud". We build massively scalable, low-latency backend systems with REST APIs that must respond to tens of thousands of requests every second. Such a demanding task requires modern technologies: distributed NoSQL data stores, asynchronous HTTP handling and the latest algorithms. We will show you how we went about creating one such system, FileRep (a service that provides the reputation of potentially harmful files), using Netty, Cassandra and Scala.
These slides were presented at WebExpo 2014 in Prague.

Jakub Janeček

September 13, 2014

Transcript

  1. Massively Scalable Services at AVAST: Case Study. Jakub Janeček, janecek@avast.com

  2. Motivation: How to detect malware before it is discovered?

  3. Motivation: Word.exe (CLEAN)

  4. Motivation: Word.exe (CLEAN), CodeRed.exe (MALWARE)

  5. Motivation: Word.exe (CLEAN), CodeRed.exe (MALWARE), ?

  6. Observations: File suspicious if new or unique in our user base.

  7. Observations: File suspicious if new or unique in our user base. “Cloud” can deliver detections almost instantaneously.

  8. Observations: File suspicious if new or unique in our user base. “Cloud” can deliver detections almost instantaneously. File Reputation service.

  9. FileRep

  10. FileRep: suspicious.exe

  11. FileRep: suspicious.exe

  12. FileRep: suspicious.exe, FileRep

  13. FileRep: suspicious.exe, FileRep

  14. FileRep: suspicious.exe, FileRep

  15. Problem: What? DB of all files in our user base? Requests coming from millions of users at once?

  16. Quiz: How many unique files does FileRep know?

  17. Data

  18. Data: Loners

  19. Data: Loners, Topstars

  20. Terminology: Loner: new or unique file, considered suspicious.

  21. Terminology: Loner: new or unique file, considered suspicious. Topstar: well-known file, usually safe.

  22. Terminology: Loner: new or unique file, considered suspicious. Topstar: well-known file, usually safe. Prevalence: number of unique users having seen the file.

  23. Terminology: Loner: new or unique file, considered suspicious. Topstar: well-known file, usually safe. Prevalence: number of unique users having seen the file. Emergence: the first time the file was seen.

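The four terms above can be read as a small data model. The following is a minimal sketch only: the names `FileInfo`, `Reputation` and `classify`, and all three thresholds, are hypothetical and are not taken from the slides, which give no concrete values.

```scala
import java.time.Instant

// Hypothetical record of what the service knows about one file.
// prevalence = number of unique users having seen the file
// emergence  = the first time the file was seen
final case class FileInfo(prevalence: Long, emergence: Instant)

sealed trait Reputation
case object Loner extends Reputation        // new or unique, considered suspicious
case object Topstar extends Reputation      // well-known, usually safe
case object Unclassified extends Reputation // everything in between

object ReputationSketch {
  // Illustrative thresholds only; the slides do not state real values.
  val LonerMaxPrevalence = 5L
  val TopstarMinPrevalence = 100000L
  val NewFileWindowSeconds = 24L * 60 * 60

  def classify(info: FileInfo, now: Instant): Reputation = {
    val ageSeconds = now.getEpochSecond - info.emergence.getEpochSecond
    if (info.prevalence <= LonerMaxPrevalence || ageSeconds < NewFileWindowSeconds) Loner
    else if (info.prevalence >= TopstarMinPrevalence) Topstar
    else Unclassified
  }
}
```

In this sketch a file counts as a loner when it is either rare (low prevalence) or new (recent emergence), which mirrors the "new or unique" wording of the definition.
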
  24. Node Architecture: FileRep

  25. Node Architecture: FileRep

  26. Node Architecture: FileRep, Cassandra

  27. Node Architecture: FileRep, Cassandra, Mucker

  28. Node Architecture: FileRep, Cassandra, Mucker, PostgreSQL

  29. Cluster Architecture: FileRep, Cassandra, PostgreSQL, Mucker, FileRep, Cassandra, FileRep, Cassandra, FileRep, Cassandra, FileRep, Cassandra, FileRep, Cassandra, DC1, DC2, DC3

  30. Mucker: FileRep

  31. Mucker: FileRep, Mucker

  32. Mucker: FileRep, Mucker

  33. Mucker: FileRep, Mucker

  34. Mucker: FileRep, Mucker

  35. Merger: Mucker, FileRep, Mucker, Disk A, Disk B

  36. Merger: Mucker, FileRep, Mucker, Disk A, Disk B

  37. Merger: Mucker, FileRep, Mucker, Disk A, Disk B

  38. Merger: Mucker, FileRep, Mucker, Disk A, Disk B, PostgreSQL

  39. Merger: Mucker, FileRep, Mucker, Disk A, Disk B, PostgreSQL

  40. Platform: Is there an existing platform for that?

  41. Platform

  42. Platform

  43. Platform

  44. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response }

  45. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response } boss thread

  46. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response } boss thread, worker threads

  47. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response } boss thread, worker threads, app threads

  48. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response } boss thread, worker threads, app threads

  49. Platform: class Handler extends RequestHandler[Buffer, Buffer] { def handle(c: Context, r: Buffer): Response } boss thread, worker threads, app threads

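The handler signature and the three thread roles on these slides suggest the usual Netty-style split: a boss thread accepts connections, worker threads do the socket I/O, and application threads run the handler so that slow work never blocks I/O. The sketch below illustrates only the last step. `Buffer`, `Context`, `Response` and `RequestHandler` are stand-ins written for this example (the slides show the names but not their definitions), and `dispatch` is a hypothetical helper, not part of the platform shown in the talk.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.concurrent.{ExecutionContext, Future}

object PlatformSketch {
  // Stand-ins for the platform types named on the slide; their real shape is not shown there.
  type Buffer = Array[Byte]
  final case class Context(remoteAddress: String)
  final case class Response(status: Int, body: Buffer)

  trait RequestHandler[Req, Resp] {
    def handle(c: Context, r: Req): Response
  }

  // Same shape as the slide's Handler; the echo body is purely illustrative.
  class Handler extends RequestHandler[Buffer, Buffer] {
    def handle(c: Context, r: Buffer): Response =
      Response(200, ("echo:" + new String(r, UTF_8)).getBytes(UTF_8))
  }

  // Hypothetical dispatcher: an I/O (worker) thread would call this and return
  // immediately, while the handler itself runs on the application thread pool.
  def dispatch(handler: RequestHandler[Buffer, Buffer], c: Context, r: Buffer)(
      implicit appThreads: ExecutionContext): Future[Response] =
    Future(handler.handle(c, r))
}
```

A caller would supply the application pool as the implicit `ExecutionContext`, for example one built with `ExecutionContext.fromExecutor` over a fixed-size thread pool.
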
  50. Problem: The amount of data grew and Mucker could not keep up.

  51. FileRep v2: Evolution of FileRep v1.

  52. FileRep v2: Evolution of FileRep v1. The idea and functionality still the same.

  53. FileRep v2: Evolution of FileRep v1. The idea and functionality still the same. Implementation: simplification - Mucker replaced by HLL++,

  54. FileRep v2: Evolution of FileRep v1. The idea and functionality still the same. Implementation: simplification - Mucker replaced by HLL++, cleanup - rewritten in Scala.

  55. Simplified Node Architecture: FileRep, Cassandra, Mucker, PostgreSQL

  56. Simplified Node Architecture: FileRep, Cassandra, PostgreSQL

  57. Simplified Node Architecture: FileRep, Cassandra

  58. Topstar Prevalence: How to store prevalence of topstars?

  59. Topstar Prevalence: Prevalence ≅ 1 000 000

  60. Topstar Prevalence: Prevalence ≅ 1 000 000. User ID = 16 B.

  61. Topstar Prevalence: Prevalence ≅ 1 000 000. User ID = 16 B. 1 000 000 * 16 B = 16 000 000 B ≅ 15 MB

  62. Topstar Prevalence: Prevalence ≅ 1 000 000. User ID = 16 B. 1 000 000 * 16 B = 16 000 000 B ≅ 15 MB. 15 MB * 2 000 000 ≅ 30 TB

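Multiplying the slide's figures through shows why storing every user ID explicitly does not scale. This is a worked calculation only; the object and helper names are made up for the example, and the inputs are the three numbers quoted on the slide.

```scala
object TopstarStorage {
  val prevalence  = 1000000L // unique users per topstar (from the slide)
  val userIdBytes = 16L      // size of one user ID in bytes (from the slide)
  val topstars    = 2000000L // number of topstars used in the slide's estimate

  val bytesPerTopstar: Long = prevalence * userIdBytes  // 16 000 000 B
  val totalBytes: Long      = bytesPerTopstar * topstars // 32 000 000 000 000 B

  def toMiB(bytes: Long): Double = bytes / (1024.0 * 1024.0)
  def toTiB(bytes: Long): Double = bytes / math.pow(1024.0, 4)
}
```

Here `toMiB(bytesPerTopstar)` is about 15.3, so roughly 15 MB per topstar, and `toTiB(totalBytes)` is about 29.1, so on the order of tens of terabytes for exact user sets across two million topstars.
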
  63. HyperLogLog++: user1, user2, user3, …, userX. Operations: add, cardinality, union, intersection.

  64. HyperLogLog++: user1, user2, user3, …, userX. Only around 16 kB for 99% accuracy.

  65. HyperLogLog++: user1, user2, user3; user1, user4, user5

  66. HyperLogLog++: user1, user2, user3; user1, user4, user5; union

  67. HyperLogLog++: user1, user2, user3, user4, user5. Still 16 kB with the same accuracy.

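To make the add, cardinality and union operations concrete, here is a minimal sketch of plain HyperLogLog. It is not HLL++ (it omits the bias correction and sparse encoding that HLL++ adds) and it is not the implementation used in FileRep. The precision of 14 bits is an assumption chosen because it gives 16 384 one-byte registers, about 16 kB, with a standard error near 0.8%, which is consistent with the figures on the slides. The 64-bit hash built from two `MurmurHash3.stringHash` calls is likewise just a convenience for the example.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  val P = 14     // precision: number of hash bits used to pick a register (assumed)
  val M = 1 << P // 16 384 registers, one byte each, about 16 kB
  private val Alpha = 0.7213 / (1.0 + 1.079 / M)

  // 64-bit hash assembled from two seeded 32-bit MurmurHash3 hashes.
  private def hash64(s: String): Long = {
    val hi = MurmurHash3.stringHash(s, 17)
    val lo = MurmurHash3.stringHash(s, 31)
    (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
  }

  final class Hll(private val registers: Array[Byte] = new Array[Byte](M)) {
    def add(item: String): Unit = {
      val h = hash64(item)
      val idx = (h >>> (64 - P)).toInt // first P bits select the register
      val rest = h << P                // remaining bits determine the rank
      val rank = math.min(java.lang.Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1)
      if (rank > registers(idx)) registers(idx) = rank.toByte
    }

    def cardinality: Long = {
      val sum = registers.foldLeft(0.0)((acc, r) => acc + math.pow(2.0, -r.toDouble))
      val raw = Alpha * M * M / sum
      val zeros = registers.count(_ == 0)
      // Small-range correction: fall back to linear counting while registers are mostly empty.
      val est = if (raw <= 2.5 * M && zeros > 0) M * math.log(M.toDouble / zeros) else raw
      math.round(est)
    }

    // Union is a register-wise maximum, so the result has the same size as its inputs.
    def union(other: Hll): Hll =
      new Hll(registers.zip(other.registers).map { case (a, b) => if (a > b) a else b })
  }
}
```

Adding `user1`, `user2`, `user3` to one sketch and `user1`, `user4`, `user5` to another, then taking the union, yields a sketch of the same size whose estimate covers the five distinct users, which is the property the slides rely on. Intersection is not shown; it is usually derived from unions by inclusion-exclusion.
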
  68. HyperLogLog++: How to synchronize them?

  69. Synchronization of HLL++: Compare&Swap*. FileRep, FileRep

  70. Synchronization of HLL++: Compare&Swap*. FileRep, FileRep, Cassandra, DATA

  71. Synchronization of HLL++: Compare&Swap* (*Cassandra 2.0 only, CASSANDRA-6284). FileRep, FileRep, Cassandra, DATA

  72. Synchronization of HLL++: Compare&Swap* (*Cassandra 2.0 only, CASSANDRA-6284). FileRep, FileRep, Cassandra, DATA

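The compare-and-swap approach can be sketched as a read, merge, conditional-write loop. The slides do not show the actual queries, so the example below replaces Cassandra with an in-memory `Store` whose `compareAndSwap` plays the role of a conditional write; `Store`, `publish` and the `Vector[Byte]` representation of a sketch are all assumptions made for illustration.

```scala
import java.util.concurrent.ConcurrentHashMap

object CasSketch {
  // Immutable stand-in for the serialized registers of one HLL++ sketch.
  type Sketch = Vector[Byte]

  // In-memory stand-in for a data store that supports conditional writes.
  final class Store {
    private val data = new ConcurrentHashMap[String, Sketch]()

    def read(key: String): Option[Sketch] = Option(data.get(key))

    // Writes `updated` only if the stored value still equals `expected`.
    def compareAndSwap(key: String, expected: Option[Sketch], updated: Sketch): Boolean =
      expected match {
        case None      => data.putIfAbsent(key, updated) == null
        case Some(old) => data.replace(key, old, updated)
      }
  }

  // Register-wise maximum, i.e. the union of two sketches.
  def merge(a: Sketch, b: Sketch): Sketch =
    a.zip(b).map { case (x, y) => if (x > y) x else y }

  // Merge the local sketch into the shared one, retrying if another node won the race.
  @annotation.tailrec
  def publish(store: Store, key: String, local: Sketch): Sketch = {
    val current = store.read(key)
    val merged = current.fold(local)(merge(_, local))
    if (store.compareAndSwap(key, current, merged)) merged
    else publish(store, key, local)
  }
}
```

Because union is commutative and idempotent, it does not matter which node wins a given round; a losing node simply re-reads the shared value and merges again.
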
  73. Synchronization of HLL++: Time-slotting. FileRep, FileRep

  74. Synchronization of HLL++: Time-slotting. FileRep, FileRep, Cassandra

  75. Synchronization of HLL++: Time-slotting. FileRep, FileRep, Cassandra
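
The slides name time-slotting without describing how it works, so the following is only one plausible reading, not the authors' design: each node writes its sketch into a cell keyed by file, time slot and node, so writers never touch the same cell, and a reader takes the union of all cells for a file. `SlotKey`, `SlotMillis`, `write` and `read` are hypothetical names, and an immutable `Map` stands in for the Cassandra table.

```scala
object TimeSlotSketch {
  // Immutable stand-in for the serialized registers of one HLL++ sketch.
  type Sketch = Vector[Byte]

  final case class SlotKey(fileId: String, slot: Long, nodeId: String)

  val SlotMillis = 60000L // hypothetical slot length of one minute

  def slotOf(timestampMillis: Long): Long = timestampMillis / SlotMillis

  // Register-wise maximum, i.e. the union of two sketches.
  def merge(a: Sketch, b: Sketch): Sketch =
    a.zip(b).map { case (x, y) => if (x > y) x else y }

  // Each node updates only its own (file, slot, node) cell, so writers never contend.
  def write(table: Map[SlotKey, Sketch], fileId: String, nodeId: String,
            timestampMillis: Long, local: Sketch): Map[SlotKey, Sketch] = {
    val key = SlotKey(fileId, slotOf(timestampMillis), nodeId)
    table.updated(key, table.get(key).fold(local)(merge(_, local)))
  }

  // A reader unions every cell stored for the file, across slots and nodes.
  def read(table: Map[SlotKey, Sketch], fileId: String): Option[Sketch] =
    table.collect { case (k, v) if k.fileId == fileId => v }
      .reduceOption((a, b) => merge(a, b))
}
```

Under this reading the cost moves from write-time coordination to read-time merging, which stays cheap because a union of fixed-size sketches is a single pass over the registers.
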

  76. Generalization: The idea of reputation can also be applied to:

  77. Generalization: The idea of reputation can also be applied to: domains,

  78. Generalization: The idea of reputation can also be applied to: domains, Android applications,

  79. Generalization: The idea of reputation can also be applied to: domains, Android applications, and others…

  80. FileRep Statistics: 84 000 threats detected on 62 000 unique computers every day.

  81. FileRep Statistics: 28 000 req/s on average, 50 000 req/s at peak.

  82. FileRep Statistics: Latency under 500 milliseconds (including round-trip and handshake).

  83. Q&A: ?

  84. AVAST: Join the discussion with the AVAST developers. Follow us on Twitter and G+: @avast_devs #AVASTdevs