Contents>

How to Profile Osquery Performance With Kolide

This article was originally published on June 2, 2018 and refers to Kolide Cloud which was formally retired in February of 2019. For posterity, this post is still available, but we encourage you to read our Device Trust announcement.

“Kyle’s computer is too hot to touch”

Recently, one of Kolide’s own employees reported sporadic degraded performance on his machine throughout the day. Kyle is known to complain (like if his water doesn’t have enough ice, or his bacon is slightly overcooked), but sometimes he complains for good reason, so we decided to investigate the problem.

Collections at Kolide

This spring, Kolide released a feature called “Collections” in our Kolide Cloud product. Collections present the user with an aggregate view of the state of their fleet across a given query, as well as the events that led to this state.

An image of Kolide's Collection Events

Osquery includes a table called osquery_schedule, which keeps a record of all queries run on a machine, along with the performance characteristics of these queries. Every instance of Kolide Cloud ships with an osquery_schedule collection enabled, which continuously aggregates data about query execution and performance. Typical results for this collection are below, with a number of columns removed for clarity.

An image of Kolide's collection results

Spotting the Culprit

Looking at the results for this collection, we can make a few assumptions to narrow down the result set. To begin with, we’ll open up the advanced search dropdown (the gear near the top of the previous screenshot), and add some constraints to our search. In our case, we are interested in:

queries that have run today
queries which have run for more than 5 seconds across all executions
queries have been executed more than once

note: wall_time is the recorded unix time difference between the start and end of a query. If a query is executed within one ‘tick’, this value will be recorded as zero, though system_time and user_time will still increase.

When we sort by wall time, a few repeat offenders become visible. Let’s take a look at the top one.

A table of collection results

Setting a Good Kext-ample

Kolide Cloud ships with a collection called “macOS Kernel Extensions”, which enumerates any kexts found on a user’s machine. The query looks like this:

SELECT k.path, name, version, authority, team_identifier, identifier
FROM kernel_extensions k JOIN signature s ON k.path=s.path
WHERE team_identifier != '';

Osquery’s signature table, joined by path in this query, is very slow (relatively speaking) because generating a signature is computationally expensive. In the worst cases, this query would cause fans to spin up within a second, and last for seconds at a time.

We looked at this query and realized that what we were really trying to find was non-Apple kernel extensions, since the Apple ones are obviously expected to be there. By adding another condition to the end of the WHERE clause, excluding Apple’s kernel extensions by name, we thought that we might increase performance dramatically, without affecting the intent of the query.

SELECT k.path, name, version, authority, team_identifier, identifier
FROM kernel_extensions k JOIN signature s ON k.path=s.path
WHERE team_identifier != '' AND name NOT LIKE 'com.apple_%';

Osquery has great docs on testing query performance, which let you articulate your queries as an osquery config, and benchmark the output.

Here’s our old query and the improved version in a config for testing:

{
  "schedule": {
    "macOS kexts v1": {
      "query": "SELECT k.path, name, version, authority, team_identifier, identifier FROM kernel_extensions k JOIN signature s ON k.path=s.path WHERE team_identifier != '';",
      "interval": 60
    },
    "macOS kexts v2": {
      "query": "SELECT k.path, name, version, authority, team_identifier, identifier FROM kernel_extensions k JOIN signature s ON k.path=s.path WHERE team_identifier != '' AND name NOT LIKE 'com.apple_%';",
      "interval": 60
    }
  }
}

We pipe this into the query profiler and examine the results (output trimmed for clarity).

./tools/analysis/profile.py --config ~/example.conf
 D:3 C:2 M:2 F:0 U:3 macOS kexts v1 (1/1): duration: 7.52999901772 cpu_time: 6.183422208 memory: 24412160 fds: 5 utilization: 77.06875
 D:0 C:0 M:2 F:0 U:0 macOS kexts v2 (1/1): duration: -0.489871025085 cpu_time: 0.024266762 memory: 13934592 fds: 4 utilization: 4.6

The Difference Is Clear

While there’s definitely something funky with the reported duration of our improved query (so fast we saved time running it!), the other fields make it clear that our updated query is WAY faster, while still returning results that match the intent of the original query.

We ran benchmarks multiple times on multiple machines, and found our new query consistently executing in about 1% of the time taken by the old one (or less!). Kolide is rolling out the improved version of this query to our customers this week.

A table of collection results for the updated query

A Final Note

One more interesting wrinkle with profiling performance is that the same query running on every machine in your fleet might perform in a dramatically different way. In the case of this kernel extension query, an older machine might not have the horsepower to crunch signatures as quickly as a newer one would. Additionally, the state of a particular user’s computer might make their machine much slower to complete a query that targets the thing that makes them “different”. It should be no surprise that this query took significantly longer to execute on machines with many kernel extensions.

If you’d like to read more osquery content like this, sign up for our biweekly newsletter.