Osquery Performance @ Scale

My talk from the osquery@scale conference.

Zach Wasserman

January 22, 2020

Transcript

  1. Osquery Performance @ Scale

    Strategies for managing osquery performance in production environments.
  2. Zach Wasserman, Principal Engineer, Dactiv LLC

    Zach has been involved with osquery since the inception of the project at Facebook in 2014. He serves as a member of the Osquery Technical Steering Committee, and works with a variety of clients to integrate osquery into their operations as the Principal Engineer at Dactiv LLC. Zach believes that we can make security accessible to everyone through open-source tools.

    @osqueryatscale @thezachw
  3. An Update From the Technical Steering Committee

    This content represents my opinion and is not an official position of the TSC or the Linux Foundation.
  4. Osquery in Transition

    • Early 2014 - Osquery project is created by Mike Arpaia at Facebook.
    • October 2014 - Facebook open-sources osquery.
    • 2014-2019 - Facebook maintains osquery as an open-source project.
    • June 2019 - Osquery project is handed to the Linux Foundation for a community support model.
    • October 2019 - Osquery 4.0.2 becomes the first stable release of osquery as a project of the Linux Foundation.
    • Current - Community is working to establish a regular release cycle, define a roadmap for the future, raise funds for the project, and help osquery grow.
  5. Get Involved

    • TSC members and the public meet biweekly to discuss status of the project and develop plans.
      ◦ Next meeting: February 4, 10AM PST
      ◦ Join #officehours in osquery Slack for announcements.
    • Donate to the osquery project.
      ◦ Funds are not yet earmarked, but raising funds for the project will allow us to improve testing infrastructure, and hire devs to work on osquery features and maintenance.
    • Volunteer to test releases.
      ◦ Organizations that can deploy beta versions of the agent can help ensure the quality of stable releases.
  6. Osquery Performance @ Scale

    The main event.
  7. Motivations

    • As security practitioners, we need visibility into the state of the systems we manage.
    • Resource utilization has a real impact on the bottom line of our business.
      ◦ Production: Resource utilization is multiplied over each production server. More performance impact = higher cost.
      ◦ Workstations: We need to ensure that security workloads do not interfere with employees doing their jobs.
    • Osquery is built for performance, but it is easy to schedule queries that have a significant performance impact on the system.
  8. Goals

    • Limit the performance impact of osquery.
      ◦ Osquery Watchdog
    • Develop monitoring for resource consumption of queries.
      ◦ The osquery_schedule table
    • Deploy new queries in a controlled manner.
      ◦ Host grouping and query sharding
    • Investigate performance.
      ◦ Profiling and SQLite explain query plan
  9. Osquery Watchdog

    Limit the performance impact of osquery.
  10. Osquery Watchdog

    The view from osqueryi

    osquery> SELECT path, pid, pgroup, parent, cmdline
        ...> FROM processes WHERE name = 'osqueryd';
       path = /usr/local/bin/osqueryd
        pid = 45472
     pgroup = 45472
     parent = 8938
    cmdline = /usr/local/bin/osqueryd osqueryd --flagfile=/osquery.flags

       path = /usr/local/bin/osqueryd
        pid = 45473
     pgroup = 45472
     parent = 45472
    cmdline = /usr/local/bin/osqueryd osqueryd
  11. Osquery Watchdog

    Background

    • When we start osqueryd, we get two processes:
      ◦ Parent process - The “watchdog”
      ◦ Child process - The “worker”
    • Potentially resource-intensive operations are performed in the worker process.
      ◦ Run queries, output logs, etc.
    • The watchdog process checks the utilization stats for the worker on an interval.
      ◦ Resource utilization limits exceeded -> Watchdog kills/respawns worker
  12. Osquery Watchdog

    Managing query execution

    What happens when a query runs on the worker?
    1. Worker writes to RocksDB the name of the query being run.
    2. Query executes.
    3. Worker removes notation of running query.
  13. Osquery Watchdog

    Managing query execution

    Now suppose the watchdog kills the worker during query execution.
    1. Worker writes to RocksDB the name of the query being run.
    2. Query begins executing.
    3. Watchdog kills worker.
    4. Worker respawns, reads RocksDB, and sees that the previous worker was in the middle of execution.
    5. Worker logs the failure during query execution and “blacklists” the query.
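
The five steps above can be modelled in a few lines. This is a minimal sketch rather than osquery's implementation: a plain dict stands in for RocksDB, an exception stands in for the watchdog killing the worker, and the names `run_query`, `recover`, and `is_blacklisted` are hypothetical.

```python
import time

BLACKLIST_SECONDS = 24 * 60 * 60  # blacklist duration used in this model


class WorkerKilled(Exception):
    """Stands in for the watchdog killing the worker mid-query."""


def run_query(store, name, execute):
    """Record the query name, execute it, then clear the record."""
    store["executing"] = name   # 1. note which query is running
    execute()                   # 2. run it (never returns if the worker dies)
    del store["executing"]      # 3. clear the note on success


def recover(store, now=None):
    """On worker start, a leftover record means the last worker died mid-query."""
    now = time.time() if now is None else now
    name = store.pop("executing", None)
    if name is not None:
        store.setdefault("blacklist", {})[name] = now + BLACKLIST_SECONDS
    return name


def is_blacklisted(store, name, now=None):
    """True while the query's blacklist entry has not yet expired."""
    now = time.time() if now is None else now
    return store.get("blacklist", {}).get(name, 0) > now


def simulated_kill():
    raise WorkerKilled()


store = {}  # hypothetical stand-in for the RocksDB database
try:
    run_query(store, "expensive_query", simulated_kill)
except WorkerKilled:
    pass  # the "worker" died before step 3 could run

print(recover(store, now=0))                             # expensive_query
print(is_blacklisted(store, "expensive_query", now=60))  # True
```

The point of the design is that the marker is written before execution starts, so the respawned worker can attribute the crash to a specific query without any help from the process that died.
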
  14. Osquery Watchdog

    Managing query execution

    I0121 08:44:48.398947 270000128 scheduler.cpp:96] Executing scheduled query expensive_query: select 1 from users, users, users, users, users, users
    W0121 08:45:13.591068 127172608 watcher.cpp:331] osqueryd worker (71861) stopping: Maximum sustainable CPU utilization limit exceeded: 21
    I0121 08:45:13.996376 127172608 watcher.cpp:583] osqueryd watcher (71860) executing worker (71928)
    I0121 08:45:14.841640 163079616 init.cpp:415] osquery worker initialized [watcher=71860]
    I0121 08:45:14.842711 163079616 rocksdb.cpp:131] Opening RocksDB handle: /tmp/osquery.db
    ...
    W0121 08:45:23.252063 163079616 config.cpp:317] Scheduled query may have failed: expensive_query
  15. Osquery Watchdog

    Query Blacklisting

    • Queries are “blacklisted” when execution fails.
    • Blacklisted queries are removed from the schedule for 24 hours.
    • This prevents crash-looping, and unnecessary use of resources when queries will be killed.
    • Observe query blacklist status using the blacklisted column of the osquery_schedule table.
      ◦ More on this later.
  16. Osquery Watchdog

    core/watcher.cpp:467

    return SQL::selectFrom(
        {"parent", "user_time", "system_time", "resident_size"},
        "processes", "pid", EQUALS, INTEGER(p));
  17. Osquery Watchdog

    Configuring the Watchdog

    • The watchdog is enabled in osquery by default.
      ◦ Default settings: “normal”
        ▪ CPU - Over 10% for up to 12 seconds
        ▪ Memory - Up to 200MB
      ◦ --watchdog_level=1: “restrictive”
        ▪ CPU - Over 5% for up to 6 seconds
        ▪ Memory - Up to 100MB
      ◦ --watchdog_level=-1: “off”
        ▪ Performance limits are disabled
  18. Osquery Watchdog

    Configuring the Watchdog

    • Settings can be customized to specific needs.
    • --watchdog_utilization_limit
      ◦ Threshold percentage of CPU
      ◦ Time allowed over the threshold is defined by the intervals from --watchdog_level
        ▪ --watchdog_level=0 - 10 seconds above limit
        ▪ --watchdog_level=1 - 5 seconds above limit
    • --watchdog_memory_limit
      ◦ Maximum memory in MB
    • Tradeoff: Visibility <-> Performance safety
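
To build intuition for how a utilization threshold and an allowed time over that threshold interact, here is a simplified model. It is not osquery's watcher code: the function name `first_violation`, the 3-second check interval, and the sample values are all assumptions chosen for illustration, and the limits are passed in explicitly rather than read from any flag.

```python
def first_violation(samples, limit_percent, allowed_seconds, interval_seconds):
    """Return the index of the check at which the worker would be stopped.

    samples holds one CPU utilization percentage per watchdog check.
    The worker is stopped once utilization has stayed above the limit
    for longer than allowed_seconds of consecutive checks.
    Returns None if the limit is never exceeded for that long.
    """
    over = 0.0
    for index, percent in enumerate(samples):
        if percent > limit_percent:
            over += interval_seconds
            if over > allowed_seconds:
                return index
        else:
            over = 0.0  # dropping back under the limit resets the clock
    return None


# Hypothetical utilization readings, one per 3-second check.
samples = [4, 25, 30, 28, 22, 21, 3]

print(first_violation(samples, limit_percent=10, allowed_seconds=12, interval_seconds=3))  # 5
print(first_violation(samples, limit_percent=5, allowed_seconds=6, interval_seconds=3))    # 3
print(first_violation([50, 2, 50, 2], limit_percent=10, allowed_seconds=12, interval_seconds=3))  # None
```

The last call shows why short spikes are tolerated: only sustained utilization above the limit counts against the worker in this model.
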
  19. Osquery Watchdog

    Extensions

    • It is also possible to use the watchdog to limit utilization by osquery extensions.
      ◦ Extensions typically run as child processes spawned by osqueryd (with the --extensions_autoload flag).
      ◦ Use --enable_extensions_watchdog to turn on this feature.
  20. Monitoring

    Use osquery to monitor osquery.

  21. Monitoring

    osquery_schedule

    • Osquery itself provides excellent facilities for monitoring osquery performance.
    • The osquery_schedule table exposes performance information for all scheduled queries.
    • Performance information is collected by looking at the difference in CPU time and memory of the worker process during execution.
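
The "difference before and after" technique can be illustrated with a small Python analog. This is only a sketch of the idea, not how osquery collects its numbers: it measures the current Python process with `os.times()` and `tracemalloc`, and the helper name `measure` is hypothetical.

```python
import os
import tracemalloc


def measure(run_query):
    """Run a callable and report the CPU time and peak memory it added."""
    tracemalloc.start()
    before = os.times()
    result = run_query()
    after = os.times()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {
        "user_time": after.user - before.user,        # seconds of user CPU
        "system_time": after.system - before.system,  # seconds of kernel CPU
        "peak_memory_bytes": peak,                    # Python allocations only
    }


# A stand-in workload; a real query would go here.
rows, stats = measure(lambda: [n * n for n in range(200_000)])
print(len(rows), stats)
```

Because the measurement is a delta taken around the call, anything else the process does during that window is attributed to the query as well, which is worth keeping in mind when reading per-query numbers.
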
  22. osquery_schedule Schema

    What information is available in the osquery_schedule table?
  23. Monitoring

    Blacklisted Queries

    • SELECT * FROM osquery_schedule WHERE blacklisted = 1;
    • Returns information about all of the queries that are currently blacklisted.
    • Depending on your requirements, this may be worth alerting on!
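
If the results of that query are collected centrally, turning them into an alert is mostly a grouping exercise. The sketch below assumes results arrive as (hostname, rows) pairs, where rows are JSON objects with string values; the hostnames, the sample data, and the function name `blacklisted_by_query` are hypothetical.

```python
import json
from collections import defaultdict


def blacklisted_by_query(results):
    """Map each blacklisted query name to the sorted list of hosts reporting it."""
    hosts = defaultdict(set)
    for hostname, rows in results:
        for row in rows:
            if row.get("blacklisted") == "1":  # values arrive as strings here
                hosts[row["name"]].add(hostname)
    return {name: sorted(found) for name, found in hosts.items()}


# Hypothetical rows from osquery_schedule on two hosts.
sample = json.loads("""
[
  ["web-01", [{"name": "expensive_query", "blacklisted": "1"},
              {"name": "listening_ports", "blacklisted": "0"}]],
  ["web-02", [{"name": "expensive_query", "blacklisted": "1"}]]
]
""")

for name, affected in blacklisted_by_query(sample).items():
    print(f"ALERT: {name} is blacklisted on {len(affected)} host(s): {affected}")
```

Grouping by query rather than by host makes it easy to tell a single struggling machine apart from a query that is too expensive everywhere.
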
  24. Monitoring

    Dashboards

    • Consider using the data from osquery_schedule to create osquery performance dashboards.
    • Useful charts:
      ◦ Blacklisted queries
      ◦ Memory usage of queries
      ◦ System + user time usage of queries
    • Visualizing middle and top percentiles can help find outliers.
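
As a rough illustration of the percentile idea, the following sketch summarizes per-execution CPU cost for each query across hosts. It assumes that `user_time` and `system_time` in osquery_schedule are cumulative totals and that `executions` counts runs; the function name `cpu_percentiles` and the sample rows are hypothetical.

```python
from statistics import median, quantiles


def cpu_percentiles(rows):
    """Return p50 and p95 of (user + system time) per execution, per query."""
    costs_by_query = {}
    for row in rows:
        executions = int(row["executions"])
        if executions == 0:
            continue  # never ran on this host, nothing to average
        cost = (int(row["user_time"]) + int(row["system_time"])) / executions
        costs_by_query.setdefault(row["name"], []).append(cost)

    summary = {}
    for name, costs in costs_by_query.items():
        if len(costs) > 1:
            # 19 cut points for n=20; the last one is the 95th percentile.
            p95 = quantiles(costs, n=20, method="inclusive")[-1]
        else:
            p95 = costs[0]
        summary[name] = {"p50": median(costs), "p95": p95}
    return summary


# Hypothetical rows, one per host, for a single scheduled query.
rows = [
    {"name": "processes", "executions": "10", "user_time": "80", "system_time": "20"},
    {"name": "processes", "executions": "10", "user_time": "150", "system_time": "50"},
    {"name": "processes", "executions": "10", "user_time": "250", "system_time": "50"},
]
print(cpu_percentiles(rows))
```

A wide gap between the middle and top percentiles for one query usually points at a handful of hosts where that query behaves very differently from the rest of the fleet.
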
  25. Monitoring

  26. Monitoring

  27. Monitoring

  28. Query Deployment

    Deploy new queries in a controlled manner.
  29. Query Deployment

    Deployment Strategies

    Two major strategies for controlling deployment of new queries:
    1. Segment hosts and deploy queries to groups of hosts.
    2. Use the shard option of scheduled queries to slow roll queries within a host group.
    Use these strategies together for the best control of rollouts.
  30. Query Deployment

    Group Hosts

    • Segment hosts by risk tolerance for performance issues.
      ◦ Start with lower risk hosts.
    • Different techniques can be used depending on the deployment/configuration strategy of osquery.
    • With tools like Chef/Puppet/Ansible:
      ◦ Use the tooling to deploy different pack files to each group of hosts.
    • With plain osquery:
      ◦ Use the discovery query feature of query packs to gate pack execution based on the results of dynamic queries.
    • With Fleet:
      ◦ Use labels to segment hosts and target packs to labels.
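
For the plain-osquery case, a pack gated by a discovery query might be generated like this. It is a sketch under assumptions: the canary file path, the query names, and the `build_pack` helper are hypothetical, and the idea is that the pack only runs on hosts where the discovery query returns rows.

```python
import json


def build_pack(queries, discovery_sql):
    """Return a query pack dict that is gated by a single discovery query."""
    return {"discovery": [discovery_sql], "queries": queries}


pack = build_pack(
    queries={
        "listening_ports": {
            "query": "SELECT pid, port, address FROM listening_ports;",
            "interval": 3600,
        },
    },
    # Hypothetical gate: only hosts carrying this marker file run the pack.
    discovery_sql="SELECT 1 FROM file WHERE path = '/etc/osquery/canary';",
)

print(json.dumps(pack, indent=2))
```

Dropping the marker file onto a small set of low-risk hosts first gives a simple way to widen the group later without editing the pack itself.
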
  31. Query Deployment

    Shard Queries

    • Set the shard option in a scheduled query to enable the query on a subset of hosts that receive the pack.
      ◦ Shard is a percentage of hosts on which to enable the query.
        ▪ 0 - No hosts
        ▪ 100 - All hosts
    • Increase the shard value as confidence in the query performance increases.
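
One way to picture a percentage-based rollout is to give every host a stable bucket and enable the query when the bucket falls within the shard value. The model below is hypothetical and is not osquery's sharding algorithm: the hashing scheme, `in_shard`, `scheduled_query`, and the host names are all assumptions made for illustration.

```python
import hashlib


def in_shard(host_id, shard):
    """Hypothetical model: place a host in a stable bucket from 1 to 100."""
    digest = hashlib.sha1(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100 + 1
    return bucket <= shard


def scheduled_query(sql, interval, shard):
    """Build a scheduled query entry carrying a shard percentage."""
    if not 0 <= shard <= 100:
        raise ValueError("shard must be between 0 and 100")
    return {"query": sql, "interval": interval, "shard": shard}


hosts = [f"host-{n}" for n in range(200)]  # hypothetical fleet
for shard in (0, 10, 50, 100):
    entry = scheduled_query("SELECT * FROM processes;", 3600, shard)
    enabled = sum(in_shard(host, entry["shard"]) for host in hosts)
    print(f"shard={shard}: enabled on {enabled} of {len(hosts)} hosts")
```

Because each host's bucket is fixed, raising the shard value in this model only ever adds hosts, so machines that already run the query keep running it as the rollout widens.
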
  32. Query Deployment

    Monitor Rollout

    • Use each of the rollout techniques to begin sending the new query to hosts.
    • Ensure that you have visibility (alerting, dashboards, etc.) into the performance of osquery on those systems.
    • Deploy to more hosts as confidence increases.
    • Good monitoring dashboards really pay off at this stage.
  33. Query Deployment

  34. Investigate Performance

    Use tooling to understand performance problems.
  35. Investigate Performance

    Tools

    • Osquery provides tools that we can use to investigate the performance of queries.
    • Profiling
      ◦ Use the profile script to preview the performance of query packs on the local machine.
    • Query planning
      ◦ Use SQLite explain query plan to begin debugging performance problems.
  36. Investigate Performance

    Profiling

    $ ./tools/analysis/profile.py --config pack.json --shell /usr/bin/osqueryi
    Profiling query: select * from processes
     U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
    Profiling query: select * from users join user_groups using (uid) join groups using (gid)
     U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
    Profiling query: select * from time
     U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
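
When a pack contains many queries, it can help to sort this output by cost. The sketch below parses result lines of the shape shown on this slide and ranks queries by utilization; the regular expression is inferred from that sample output only, and `parse_profile` is a hypothetical helper rather than part of the profile script.

```python
import re

RESULT = re.compile(
    r"(?P<name>\w+)\s+\(\d+/\d+\):\s+utilization:\s+(?P<utilization>[\d.]+)\s+"
    r"cpu_time:\s+(?P<cpu_time>[\d.]+)\s+memory:\s+(?P<memory>\d+)\s+"
    r"fds:\s+(?P<fds>\d+)\s+duration:\s+(?P<duration>[\d.]+)"
)


def parse_profile(text):
    """Extract one dict per profiled query from profile output text."""
    return [
        {
            "name": match["name"],
            "utilization": float(match["utilization"]),
            "cpu_time": float(match["cpu_time"]),
            "memory": int(match["memory"]),
            "duration": float(match["duration"]),
        }
        for match in RESULT.finditer(text)
    ]


output = """
Profiling query: select * from processes
 U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
Profiling query: select * from users join user_groups using (uid) join groups using (gid)
 U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
Profiling query: select * from time
 U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
"""

ranked = sorted(parse_profile(output), key=lambda r: r["utilization"], reverse=True)
print([r["name"] for r in ranked])  # ['user_groups', 'processes', 'time']
```
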
  37. Investigate Performance

    Query Plan

    osquery> EXPLAIN QUERY PLAN
        ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
        ...> WHERE pid IN (SELECT pid FROM processes WHERE uid = 0);
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 62:            |
    | 8  | 0      | 0       | LIST SUBQUERY 1                                         |
    | 10 | 8      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 68:            |
    | 30 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 64: |
    | 38 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 66:                 |
    +----+--------+---------+---------------------------------------------------------+
  38. Investigate Performance

    Query Plan

    osquery> EXPLAIN QUERY PLAN
        ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
        ...> WHERE uid = 0;
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 70:            |
    | 10 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 71: |
    | 18 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 73:                 |
    +----+--------+---------+---------------------------------------------------------+
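
The same comparison can be tried outside osquery with Python's built-in sqlite3 module. This is a sketch with ordinary SQLite tables standing in for osquery's virtual tables, so the plan text will differ from the slides; the table definitions and the `plan` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processes (pid INTEGER PRIMARY KEY, uid INTEGER, path TEXT);
    CREATE TABLE sockets (pid INTEGER, port INTEGER);
""")


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


with_subquery = plan(
    "SELECT * FROM processes JOIN sockets USING (pid) "
    "WHERE pid IN (SELECT pid FROM processes WHERE uid = 0)"
)
direct = plan(
    "SELECT * FROM processes JOIN sockets USING (pid) WHERE uid = 0"
)

print("With subquery:")
for step in with_subquery:
    print("  ", step)
print("Direct filter:")
for step in direct:
    print("  ", step)
```

Reading the two plans side by side shows how each table is visited and whether the subquery adds work, which is the same habit the slides apply to osquery's virtual tables.
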
  39. Wrapping It Up

    • Use the watchdog to constrain performance.
    • Monitor performance and blacklists with the osquery_schedule table.
    • Be strategic in rollout of new queries.
    • Learn to use the tooling to evaluate performance.
  40. Thank You!

    Email - zach@dactiv.llc
    Twitter - @thezachw
    Osquery Slack - @zwass