Osquery Performance @ Scale
Strategies for managing osquery performance in production environments.
Slide 2
Zach has been involved with osquery since the
inception of the project at Facebook in 2014.
He serves as a member of the Osquery Technical
Steering Committee, and works with a variety of clients
to integrate osquery into their operations as the Principal
Engineer at Dactiv LLC.
Zach believes that we can make security accessible to
everyone through open-source tools.
Zach Wasserman
Principal Engineer
Dactiv LLC
Page 2
@osqueryatscale
@thezachw
Slide 3
An Update From the Technical Steering Committee
This content represents my opinion and is not an official position of the TSC or the Linux
Foundation.
Slide 4
Osquery in Transition
● Early 2014 - Osquery project is created by Mike Arpaia at Facebook.
● October 2014 - Facebook open-sources osquery.
● 2014-2019 - Facebook maintains osquery as an open-source project.
● June 2019 - Osquery project is transferred to the Linux Foundation under a
community support model.
● October 2019 - Osquery 4.0.2 becomes the first stable release of osquery
as a project of the Linux Foundation.
● Current - Community is working to establish a regular release cycle, define
a roadmap for the future, raise funds for the project, and help osquery
grow.
Slide 5
Get Involved
● TSC members and the public meet biweekly to discuss the status of the
project and develop plans.
○ Next meeting: February 4, 10AM PST
○ Join #officehours in osquery Slack for announcements.
● Donate to the osquery project.
○ Funds are not yet earmarked, but raising funds for the project will allow
us to improve testing infrastructure and hire developers to work on
osquery features and maintenance.
● Volunteer to test releases.
○ Organizations that can deploy beta versions of the agent can help
ensure the quality of stable releases.
Slide 6
Osquery Performance @ Scale
The main event.
Slide 7
Motivations
● As security practitioners, we need visibility into the state of the systems we
manage.
● Resource utilization has a real impact on the bottom line of our business.
○ Production: Resource utilization is multiplied over each production
server. More performance impact = higher cost.
○ Workstations: We need to ensure that security workloads do not
interfere with employees doing their jobs.
● Osquery is built for performance, but it is easy to schedule queries that will
have significant performance impacts on the system.
Slide 8
Goals
● Limit the performance impact of osquery.
○ Osquery Watchdog
● Develop monitoring for resource consumption of queries.
○ The osquery_schedule table
● Deploy new queries in a controlled manner.
○ Host grouping and query sharding
● Investigate performance.
○ Profiling and SQLite's EXPLAIN QUERY PLAN
Slide 9
Osquery Watchdog
Limit the performance impact of osquery.
Osquery Watchdog
Background
● When we start osqueryd, we get two processes:
○ Parent process - The “watchdog”
○ Child process - The “worker”
● Potentially resource-intensive operations are performed in the worker
process.
○ Run queries, output logs, etc.
● The watchdog process checks the utilization stats for the worker on an
interval.
○ Resource utilization limits exceeded -> Watchdog kills/respawns
worker
Slide 12
Osquery Watchdog
Managing query execution
What happens when a query runs on the worker?
1. Worker writes to RocksDB the name of the query being run.
2. Query executes.
3. Worker removes the record of the running query.
Slide 13
Osquery Watchdog
Managing query execution
Now suppose the watchdog kills the worker during query execution.
1. Worker writes to RocksDB the name of the query being run.
2. Query begins executing.
3. Watchdog kills worker.
4. Worker respawns, reads RocksDB, and sees that the previous worker was
in the middle of execution.
5. Worker logs the failure during query execution and “blacklists” the query.
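The two sequences above can be modeled in a few lines. This is only an illustrative sketch: a plain dict stands in for RocksDB, and every name in it (Worker, EXECUTING_KEY, WorkerKilled) is hypothetical rather than taken from osquery's source.

```python
# Illustrative model only: a dict stands in for RocksDB, and all names here
# (Worker, EXECUTING_KEY, WorkerKilled) are hypothetical, not osquery internals.
import time

EXECUTING_KEY = "executing_query"
BLACKLIST_SECONDS = 24 * 60 * 60  # blacklisted queries sit out for 24 hours


class WorkerKilled(Exception):
    """Stands in for the watchdog killing the worker process."""


class Worker:
    def __init__(self, store, now=time.time):
        self.store = store  # persistent state that survives a worker restart
        self.now = now
        self.blacklist = store.setdefault("blacklist", {})
        # A leftover marker at startup means the previous worker died mid-query.
        interrupted = store.pop(EXECUTING_KEY, None)
        if interrupted is not None:
            print(f"query {interrupted!r} was executing when the worker stopped")
            self.blacklist[interrupted] = self.now() + BLACKLIST_SECONDS

    def is_blacklisted(self, name):
        return self.blacklist.get(name, 0) > self.now()

    def run(self, name, execute):
        if self.is_blacklisted(name):
            return None                    # skipped while blacklisted
        self.store[EXECUTING_KEY] = name   # 1. record the running query
        result = execute()                 # 2. execute (may never return)
        del self.store[EXECUTING_KEY]      # 3. clear the marker
        return result


def heavy_query():
    raise WorkerKilled                     # simulate a kill during execution


store = {}
first = Worker(store)
first.run("light", lambda: ["row"])
try:
    first.run("heavy", heavy_query)
except WorkerKilled:
    pass                                   # the process is gone; the marker remains

second = Worker(store)                     # the respawned worker reads the store
print(second.is_blacklisted("heavy"), second.is_blacklisted("light"))
```

The key design point is that the marker is written before execution and removed only after it, so a kill at any moment in between leaves evidence for the next worker to find.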
Osquery Watchdog
Query Blacklisting
● Queries are “blacklisted” when execution fails.
● Blacklisted queries are removed from the schedule for 24 hours.
● This prevents crash-looping and avoids wasting resources on queries that
would be killed anyway.
● Observe query blacklist status using the blacklisted column of the
osquery_schedule table.
○ More on this later.
Osquery Watchdog
Configuring the Watchdog
● The watchdog is enabled in osquery by default.
○ Default settings: “normal”
■ CPU - Over 10% for up to 12 seconds
■ Memory - Up to 200MB
○ --watchdog_level=1: “restrictive”
■ CPU - Over 5% for up to 6 seconds
■ Memory - Up to 100MB
○ --watchdog_level=-1: “off”
■ Performance limits are disabled
Slide 18
Osquery Watchdog
Configuring the Watchdog
● Settings can be customized to specific needs.
● --watchdog_utilization_limit
○ Threshold percentage of CPU
○ Time allowed over the threshold is defined by the intervals from
--watchdog_level
■ --watchdog_level=0 - 12 seconds above limit
■ --watchdog_level=1 - 6 seconds above limit
● --watchdog_memory_limit
○ Maximum memory in MB
● Tradeoff: Visibility <-> Performance safety
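As a minimal sketch of how these settings might be kept together, the helper below renders flagfile lines for the three watchdog flags named on the slides. The function name and the example values are assumptions for illustration, not recommended limits.

```python
# Hypothetical helper: renders osquery flagfile lines for the watchdog flags
# named on the slides. The example values are assumptions, not recommendations.
def watchdog_flags(level=0, utilization_limit=None, memory_limit_mb=None):
    if level not in (-1, 0, 1):
        raise ValueError("level must be -1 (off), 0 (normal) or 1 (restrictive)")
    lines = [f"--watchdog_level={level}"]
    # Overrides are optional; when omitted, the defaults for the level apply.
    if utilization_limit is not None:
        lines.append(f"--watchdog_utilization_limit={utilization_limit}")
    if memory_limit_mb is not None:
        lines.append(f"--watchdog_memory_limit={memory_limit_mb}")
    return "\n".join(lines) + "\n"


flagfile = watchdog_flags(level=0, utilization_limit=20, memory_limit_mb=300)
print(flagfile, end="")
```

Raising the limits buys visibility (fewer killed queries) at the cost of performance safety, so values like these are worth reviewing per host group.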
Slide 19
Osquery Watchdog
Extensions
● It is also possible to use the watchdog to limit utilization by osquery
extensions.
○ Extensions typically run as child processes spawned by osqueryd (with
the --extensions_autoload flag).
○ Use --enable_extensions_watchdog to turn on this feature.
Slide 20
Monitoring
Use osquery to monitor osquery.
Slide 21
Monitoring
osquery_schedule
● Osquery itself provides excellent facilities for monitoring osquery
performance.
● The osquery_schedule table exposes performance information for all
scheduled queries.
● Performance information is collected by looking at the difference in CPU
time and memory of the worker process during execution.
Slide 22
osquery_schedule
Schema
What information is available in the
osquery_schedule table?
Slide 23
Monitoring
Blacklisted Queries
● SELECT * FROM osquery_schedule WHERE blacklisted = 1;
● Returns information about all of the queries that are currently blacklisted.
● Depending on your requirements, this may be worth alerting on!
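To show what an alert on this query could look like, the sketch below runs it against an in-memory SQLite table that stands in for osquery_schedule. The rows are made up, and the column subset is an assumption for illustration (only the blacklisted column is named on the slides); in practice the rows would come from osquery results in your logging pipeline.

```python
# Stand-in data: an in-memory SQLite table imitates osquery_schedule.
# Query names, values, and the column subset are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE osquery_schedule ("
    "name TEXT, executions INTEGER, blacklisted INTEGER, "
    "user_time INTEGER, system_time INTEGER, average_memory INTEGER)"
)
conn.executemany(
    "INSERT INTO osquery_schedule VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("pack_baseline_users", 48, 0, 120, 40, 1024),
        ("pack_heavy_hash_scan", 3, 1, 9800, 4100, 262144),
        ("pack_baseline_listening_ports", 48, 0, 310, 90, 4096),
    ],
)

# The same filter as on the slide, narrowed to the query name.
blacklisted = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM osquery_schedule WHERE blacklisted = 1 ORDER BY name"
    )
]
if blacklisted:
    print("ALERT: blacklisted queries:", ", ".join(blacklisted))
```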
Slide 24
Monitoring
Dashboards
● Consider using the data from osquery_schedule to create osquery
performance dashboards.
● Useful charts:
○ Blacklisted queries
○ Memory usage of queries
○ System + user time usage of queries
● Visualizing middle and top percentiles can help find outliers.
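A small sketch of the percentile idea, using only the Python standard library. The sample values are invented, and the unit (combined user and system time per execution, one value per host) is an assumption for illustration.

```python
# Invented sample: combined user + system time of one query, one value per host.
import statistics

cpu_time_per_host = [12, 15, 11, 14, 13, 16, 12, 240, 15, 13]

# quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
cuts = statistics.quantiles(cpu_time_per_host, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

print(f"p50={p50} p95={p95}")
# A p95 far above p50 suggests a few hosts where the query is unusually costly.
outlier_suspected = p95 > 5 * p50
print("outlier suspected:", outlier_suspected)
```

The median stays close to the typical host, while the top percentile is pulled up by the single expensive host, which is why plotting both makes outliers easy to spot.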
Slide 28
Query Deployment
Deploy new queries in a controlled manner.
Slide 29
Query Deployment
Deployment Strategies
Two major strategies for controlling deployment of new queries:
1. Segment hosts and deploy queries to groups of hosts.
2. Use the shard option of scheduled queries to slow-roll queries within a
host group.
Use these strategies together for the best control of rollouts.
Slide 30
Query Deployment
Group Hosts
● Segment hosts by risk tolerance for performance issues.
○ Start with lower risk hosts.
● Different techniques can be used depending on the
deployment/configuration strategy of osquery.
● With tools like Chef/Puppet/Ansible:
○ Use the tooling to deploy different pack files to each group of hosts.
● With plain osquery:
○ Use the discovery query feature of query packs to gate pack execution
based on the results of dynamic queries.
● With Fleet:
○ Use labels to segment hosts and target packs to labels.
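For the plain osquery case, a pack gated by a discovery query might be generated as follows. This is a sketch under assumptions: the marker file path and the query name are hypothetical, and the idea of dropping a marker file on low-risk hosts is just one possible way to define a group. The pack runs only on hosts where every discovery query returns at least one row.

```python
# Sketch of a query pack gated by a discovery query.
# The marker file path and query name are hypothetical examples.
import json

pack = {
    # The pack is active only where all discovery queries return rows.
    "discovery": [
        "SELECT 1 FROM file WHERE path = '/etc/osquery/canary_group';"
    ],
    "queries": {
        "new_listening_ports": {
            "query": "SELECT pid, port, address FROM listening_ports;",
            "interval": 3600,
        }
    },
}

pack_json = json.dumps(pack, indent=2)
print(pack_json)
```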
Slide 31
Query Deployment
Shard Queries
● Set the shard option in a scheduled query to enable the query on a subset
of hosts that receive the pack.
○ Shard is a percentage of hosts on which to enable the query.
■ 0 - No hosts
■ 100 - All hosts
● Increase the shard value as confidence in the query performance increases.
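The sketch below shows a scheduled query with the shard option set, plus a toy model of how a percentage can map to a stable subset of hosts. The bucketing function is an assumption for illustration and is not osquery's actual algorithm; the query name and host names are hypothetical.

```python
# Toy model of sharding. host_bucket() is NOT osquery's algorithm; it only
# illustrates that a stable per-host value makes rollouts grow monotonically.
import hashlib
import json

scheduled_query = {
    "new_process_hashes": {            # hypothetical query name
        "query": "SELECT pid, path FROM processes;",
        "interval": 3600,
        "shard": 10,                   # enable on roughly 10% of hosts
    }
}


def host_bucket(host_id):
    """Map a host identifier to a stable bucket in the range 1..100."""
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 + 1


def query_enabled(host_id, shard):
    return host_bucket(host_id) <= shard


hosts = [f"host-{i}" for i in range(1000)]
at_10 = {h for h in hosts if query_enabled(h, 10)}
at_50 = {h for h in hosts if query_enabled(h, 50)}
print(json.dumps(scheduled_query, indent=2))
print(len(at_10), len(at_50))
```

Because each host keeps the same bucket, raising the shard value only adds hosts; machines that already run the query continue to run it.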
Slide 32
Query Deployment
Monitor Rollout
● Use each of the rollout techniques to begin sending the new query to hosts.
● Ensure that you have visibility (alerting, dashboards, etc.) into the
performance of osquery on those systems.
● Deploy to more hosts as confidence increases.
● Good monitoring dashboards really pay off at this stage.
Investigate Performance
Use tooling to understand performance problems.
Slide 35
Investigate Performance
Tools
● Osquery provides tools that we can use to investigate the performance of
queries.
● Profiling
○ Use the profile script to preview the performance of query packs on the
local machine.
● Query planning
○ Use SQLite's EXPLAIN QUERY PLAN to begin debugging performance
problems.
Investigate Performance
Query Plan
osquery> EXPLAIN QUERY PLAN
...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
...> WHERE pid IN (SELECT pid FROM processes WHERE uid = 0);
+----+--------+---------+---------------------------------------------------------+
| id | parent | notused | detail                                                  |
+----+--------+---------+---------------------------------------------------------+
| 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 62:            |
| 8  | 0      | 0       | LIST SUBQUERY 1                                         |
| 10 | 8      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 68:            |
| 30 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 64: |
| 38 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 66:                 |
+----+--------+---------+---------------------------------------------------------+
Slide 38
Investigate Performance
Query Plan
osquery> EXPLAIN QUERY PLAN
...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
...> WHERE uid = 0;
+----+--------+---------+---------------------------------------------------------+
| id | parent | notused | detail                                                  |
+----+--------+---------+---------------------------------------------------------+
| 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 70:            |
| 10 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 71: |
| 18 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 73:                 |
+----+--------+---------+---------------------------------------------------------+
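The same comparison can be tried outside osquery with Python's built-in sqlite3 module. This is a rough sketch: ordinary empty tables stand in for osquery's virtual tables (their columns here are assumptions), so the plan text will differ from the output above and also varies between SQLite versions. The point is only to compare how many steps each form of the query needs.

```python
# Ordinary tables stand in for osquery's virtual tables; the columns are
# assumptions, and the plan text depends on the local SQLite version.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processes (pid INTEGER, uid INTEGER, path TEXT);
    CREATE TABLE process_open_sockets (pid INTEGER, remote_port INTEGER);
    CREATE TABLE hash (path TEXT, sha256 TEXT);
    """
)

JOINS = (
    "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
    "JOIN hash USING (path) "
)
with_subquery = JOINS + "WHERE pid IN (SELECT pid FROM processes WHERE uid = 0)"
direct_filter = JOINS + "WHERE uid = 0"


def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the 'detail' text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


subquery_plan = plan(with_subquery)
direct_plan = plan(direct_filter)
for label, steps in (("subquery", subquery_plan), ("direct", direct_plan)):
    print(label)
    for step in steps:
        print("  ", step)
```

In the osquery output above, the direct filter avoids the extra subquery step and the second scan of processes, which is the kind of difference to look for when rewriting a slow query.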
Slide 39
Wrapping It Up
● Use the watchdog to constrain performance.
● Monitor performance and blacklists with the osquery_schedule table.
● Be strategic in rollout of new queries.
● Learn to use the tooling to evaluate performance.