Contents

Deep Dives

Why You Can't Trust Your NULLs in Osquery

The pain of empty strings in SQLite

Fritz Ifert-Miller

A basic assumption of nearly any database is that the absence of data in a table is recorded as NULL. This fundamental principle guides many of the approaches an individual would take to querying the data within. The behavior of common operators like COUNT, ISNULL, IFNULL, MAX, are predicated on the interpretation of NULL rows.

...but, if you are using the popular open-source endpoint visibility tool osquery, proceed with caution, I am here to show you the quicksand โ€”

A brief background on osquery

Osquery is an endpoint visibility tool that translates native API calls into SQLite virtual tables, allowing a Mac, Linux, or Windows computer to be queried in real-time as if it were a relational database:

SELECT name, pid, parent FROM processes LIMIT 5;
+----------------+-----+--------+
| name           | pid | parent |
+----------------+-----+--------+
| kernel_task    | 0   | 0      |
| launchd        | 1   | 0      |
| syslogd        | 67  | 1      |
| UserEventAgent | 68  | 1      |
| uninstalld     | 71  | 1      |
+----------------+-----+--------+

Created at and later open-sourced by Facebook, osquery is now under the ownership and governance of The Osquery Foundation, which is under The Linux Foundation. Osquery is the underlying agent for dozens of security solutions and runs on millions of devices across the globe.

If you aren't familiar with osquery, you can get started exploring it in less than 5 minutes by installing the agent on your device:

Osquery / Downloads / Official

For those of you who are already familiar with osquery, or using another osquery endpoint manager like Kolide, feel free to dive in and get started exploring why what you thought was NULL may not be what you expect.

Empty String != NULL

A painful lesson I learned early on in my days of writing SQL in osquery was realizing that its output is littered with empty strings.

What is an empty string, you might ask? Let's take a look:

WITH
demonstration_table (example_name,example_data,example_datatype) AS (
  VALUES
    ('first','foo','not empty string'),
    ('second','','empty string'),
    ('third',NULL,'NULL')
)
SELECT * FROM demonstration_table;
+--------------+--------------+------------------+
| example_name | example_data | example_datatype |
+--------------+--------------+------------------+
| first        | foo          | not empty string |
| second       |              | empty string     |
| third        |              | NULL             |
+--------------+--------------+------------------+

The second value '' is what is commonly referred to as an empty string. Despite containing no content, '' represents data.

The third value, on the other hand, is a NULL, which means the absence of data.

While these two values (the second and third rows in the column: example_data) look the same, they are very different, and the way the SQLite engine interprets them could not be more dissimilar.

There are many ways in which databases, including osquery, handle empty strings differently from NULLs.

WHERE NOT NULL & WHERE IS NULL

The first example of difficulties you may encounter is in situations in which you attempt to query for data where a specific column is or is not NULL. Let's take the block_devices table, for example. This table outputs a list of any connected block devices. In this case, my hard drive (with its various volumes) and some external media (thumb drives and external hard drives).

A screenshot of the macOS Disk Utility with an external media drive selected

It also returns the basic mapping of these volumes to their respective parent volumes (e.g. a hard drive may have many partitions).

SELECT name, parent, label FROM block_devices;
+----------------+------------+----------------------------+
| name           | parent     | label                      |
+----------------+------------+----------------------------+
| /dev/disk0     |            | APPLE SSD AP1024M Media    |
| /dev/disk0s1   | /dev/disk0 | EFI System Partition       |
| /dev/disk0s2   | /dev/disk0 | Untitled 2                 |
| /dev/disk1     |            | AppleAPFSMedia             |
| /dev/disk1s1   | /dev/disk1 | Macintosh HD - Data        |
| /dev/disk1s2   | /dev/disk1 | Preboot                    |
| /dev/disk1s3   | /dev/disk1 | Recovery                   |
| /dev/disk1s4   | /dev/disk1 | VM                         |
| /dev/disk1s5   | /dev/disk1 | Macintosh HD               |
| /dev/disk1s6   | /dev/disk1 | Update                     |
| /dev/disk1s5s1 | /dev/disk1 | com.apple.os.update-5DC2B8 |
| /dev/disk2     |            | WD My Passport 25E2 Media  |
| /dev/disk2s1   | /dev/disk2 | EFI System Partition       |
| /dev/disk2s2   | /dev/disk2 | Untitled 2                 |
| /dev/disk2s3   | /dev/disk2 | Booter                     |
| /dev/disk3     |            | Patriot Memory Media       |
+----------------+------------+----------------------------+

The parent volumes are those with no data in the parent column (because they have no parent of their own). Let's say we wanted to return just these parent volumes; we might attempt to do so by running the following:

osquery> SELECT name, parent, label FROM block_devices WHERE parent ISNULL;
osquery>

No results are returned! This is because, as we demonstrated in the previous section, these empty rows are not actually NULLs, they are empty strings. Let's amend our SQL to return the desired output:

SELECT name, parent, label FROM block_devices WHERE parent = '';
+------------+--------+---------------------------+
| name       | parent | label                     |
+------------+--------+---------------------------+
| /dev/disk0 |        | APPLE SSD AP1024M Media   |
| /dev/disk1 |        | AppleAPFSMedia            |
| /dev/disk2 |        | WD My Passport 25E2 Media |
| /dev/disk3 |        | Patriot Memory Media      |
+------------+--------+---------------------------+

Changing our WHERE clause to: WHERE parent = '' results in the output we expect. This basic example lays the foundation for all of the examples to follow.

NULL != NULL

Though we didn't return NULL values in our previous example it is important to point out the significance of the ISNULL operator.

A nuance of true NULL values is that NULL does not equal itself and NULL is not LIKE itself either. You may wonder what that means, let's see it in practice below:

SELECT
  1=1,
  NULL LIKE NULL,
  NULL LIKE '%',
  NULL = NULL,
  NULL ISNULL;
+-----+----------------+---------------+-------------+-------------+
| 1=1 | NULL LIKE NULL | NULL LIKE '%' | NULL = NULL | NULL ISNULL |
+-----+----------------+---------------+-------------+-------------+
| 1   |                |               |             | 1           |
+-----+----------------+---------------+-------------+-------------+

You must use the operator ISNULL to find real NULL values.

Inconsistencies of behavior when aggregating

Many of the issues with empty strings stem from the way they are treated when performing aggregation operations.

COUNT treats empty strings as NOT NULL

When the COUNT operator is used on a column, it records a count of any rows which are not NULL. We can see this behavior using the query below:

WITH
demonstration_table (example_data,example_datatype) AS (
  VALUES
    ('foo','not empty string'),
    ('','empty string'),
    (NULL,'NULL')
)
SELECT COUNT(example_data) FROM demonstration_table;
+---------------------+
| COUNT(example_data) |
+---------------------+
| 2                   |
+---------------------+

While we would expect to see a count of 1, counting only the row containing the example data: 'foo', our empty string is counted alongside 'foo' as the presence of row data.

We have to be mindful of that when looking for something based on the presence of data in a WHERE statement. For example, WHERE path NOT NULL may not return the output we expect.

AVG and other aggregation

The confusion of empty strings vs NULL extends into any SQLite arithmetic due to the different ways each is handled in aggregation functions.

Let's take a look at the AVG function which averages the contents of the data it operates on.

We will do a basic example first to verify the behavior:

WITH
example_data (data_type,sample_data) AS (
VALUES
  ('integer',50),
  ('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 75.0             |
+------------------+

As we expect: (50+100)/2 = 75

But what happens when we throw a NULL row in the mix?

WITH
example_data (data_type,sample_data) AS (
VALUES
  ('true-null',NULL),
  ('integer',50),
  ('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 75.0             |
+------------------+

The absence of data is not data, so the NULL row is ignored when calculating the average of the values. Let's see what happens when we substitute our empty string '' for NULL:

WITH
example_data (data_type,sample_data) AS (
VALUES
  ('empty string',''),
  ('integer',50),
  ('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 50.0             |
+------------------+

The empty string is interpreted as 0 by SQLite leading to it being calculated as part of our average and subsequently producing a value, we may have otherwise not expected.

Empty strings being interpreted as 0 may produce other inconsistencies you do not expect:

  • value * NULL = NULL
  • value * empty string = 0

Let's see that in osqueryi:

WITH
example_data (value_1,value_2,data_type) AS (
VALUES
  (2,'','empty string'),
  (2,NULL,'NULL'),
  (2,50,'integer'),
  (2,100,'integer')
)
SELECT
  value_1,
  value_2,
  data_type,
  (value_1 * value_2) AS multiplied_output
FROM example_data;
+---------+---------+--------------+-------------------+
| value_1 | value_2 | data_type    | multiplied_output |
+---------+---------+--------------+-------------------+
| 2       |         | empty string | 0                 |
| 2       |         | NULL         |                   |
| 2       | 50      | integer      | 100               |
| 2       | 100     | integer      | 200               |
+---------+---------+--------------+-------------------+

Boolean interpretation

Boolean values of TRUE and FALSE are interpreted by SQLite as 1 and 0 respectively. This means that an average of boolean data is susceptible to the same problems mentioned above if empty strings are present:

WITH
example_data (data_type,sample_data) AS (
VALUES
  ('true-null',NULL),
  ('boolean-true',TRUE),
  ('boolean-false',FALSE),
  ('empty string','')
)
SELECT AVG(sample_data) FROM example_data;
+-------------------+
| AVG(sample_data)  |
+-------------------+
| 0.333333333333333 |
+-------------------+

Complicating things further: Boolean fields may actually be Tri-State

Many tables in osquery have columns that appear to encode boolean (TRUE,FALSE) data. However, the prevalence of empty strings complicates the utility of these columns.

Let's take a look at the chrome_extensions table to understand why this might cause us some confusion:

SELECT name, state
FROM users CROSS JOIN chrome_extensions USING(uid);
+----------------------------------------------+-------+
| name                                         | state |
+----------------------------------------------+-------+
| Gmail                                        | 1     |
| Application Launcher For Drive (by Google)   | 1     |
| Advanced REST client                         | 1     |
| The Camelizer                                | 0     |
| EditThisCookie                               | 1     |
| Privacy Badger                               | 1     |
| React Developer Tools                        | 1     |
| 1Password extension (desktop app required)   | 1     |
| Take Webpage Screenshots Entirely - FireShot | 1     |
| Endpoint Verification                        | 1     |
| Chrome Web Store Payments                    | 1     |
| Google Drive                                 | 1     |
| Chrome Media Router                          |       |
| Chrome Media Router                          | 1     |
| Application Launcher For Drive (by Google)   | 0     |
| Google Docs Offline                          | 1     |
| Chrome Web Store Payments                    | 1     |
| Sheets                                       | 1     |
| Gmail                                        | 1     |
| Adobe Acrobat                                | 0     |

If we had naively assumed the state column was a guaranteed boolean, we would have been mistaken and missed our Chrome Media Router row.

And again, that empty row? You guessed it, empty string:

SELECT name,state
FROM users CROSS JOIN chrome_extensions USING(uid)
WHERE state = '';
+---------------------------------+-------+
| name                            | state |
+---------------------------------+-------+
| Chrome Media Router             |       |
+---------------------------------+-------+

If we weren't paying attention, it would be easy to get caught by this inconsistent data.

WITH
chrome_extensions_sum AS (
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN state = TRUE THEN 1 END) AS enabled,
  SUM(CASE WHEN state = FALSE THEN 1 END) AS disabled
FROM users CROSS JOIN chrome_extensions USING(uid)
)
SELECT *,
enabled + disabled AS enabled_plus_disabled
FROM chrome_extensions_sum;
+-------+---------+----------+-----------------------+
| total | enabled | disabled | enabled_plus_disabled |
+-------+---------+----------+-----------------------+
| 53    | 44      | 8        | 52                    |
+-------+---------+----------+-----------------------+

Our reliance on the TRUE, FALSE operators to evaluate this column do not account for the possibility of an empty string.

Sometimes osquery uses NULL? ๐Ÿคท

Like so many things in life, there are few hard and fast rules. While there are a litany of examples of osquery tables with columns that encode empty strings instead of NULLs -- there are also outliers.

For example, the socket column in the process_open_sockets table correctly encodes NULL values:

SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76';
+------+--------+----------------+------------+------+-------------+
| pid  | socket | remote_address | local_port | path | state       |
+------+--------+----------------+------------+------+-------------+
| 1464 |        | 54.230.75.76   | 49501      |      | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+
SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND socket ISNULL;
+------+--------+----------------+------------+------+-------------+
| pid  | socket | remote_address | local_port | path | state       |
+------+--------+----------------+------------+------+-------------+
| 1464 |        | 54.230.75.76   | 49501      |      | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+

Finally! A bit of expected behavior!

๐Ÿคจ Wait...

The path column in that result above also has nothing.

๐Ÿค” ๐Ÿค” ๐Ÿค”

Certainly, it will also be a true NULL, let's verify...

SELECT
COUNT(*)
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND path ISNULL;
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+

๐Ÿ˜‘ Welp, it was a good shot. So much for that moment of hope.

But! It does reveal a characteristic that can help us avoid ambiguity when writing our queries.

Patterns which you can expect to observe:

  • Columns that are encoded as a TEXT field will typically use empty strings in place of NULL;

  • Columns that are numeric (eg. INT, BIGINT, DOUBLE) will typically encode the absence of data as NULL.

Be careful concatenating results with true NULLs

One peculiarity which you should be aware of when using osquery is that the common concatenate operator || will not produce output when you concatenate a NULL with any other string.

You can work around this behavior by instead relying on the CONCAT operator.

Let's see the difference in output in osqueryi below:

WITH
example_data (data_type,sample_data) AS (
VALUES
  ('true-null',NULL),
  ('boolean-true',TRUE),
  ('boolean-false',FALSE),
  ('empty string','')
)
SELECT
  CONCAT(data_type,',',sample_data) AS concat_example,
  (data_type || ',' || sample_data) AS double_pipe_example
FROM example_data;
+-----------------+---------------------+
| concat_example  | double_pipe_example |
+-----------------+---------------------+
| true-null,      |                     |
| boolean-true,1  | boolean-true,1      |
| boolean-false,0 | boolean-false,0     |
| empty string,   | empty string,       |
+-----------------+---------------------+

As we can see above, the true NULL row produces no concatenated output in the || generated column. If we weren't rigorously testing our output, we might have been unaware that we were dropping data on the floor!

When 'null' is not NULL

Sadly, even with the rules above, there are stranger examples of things happening in query outputs.

What about when 'null' is not actually NULL...

Let's take a look at the firefox_addons table.

SELECT
  name,
  creator
FROM users
CROSS JOIN firefox_addons USING(uid);
+---------------------------------+-------------------+
| name                            | creator           |
+---------------------------------+-------------------+
| Adobe Acrobat                   | Adobe Systems Inc.|
| Form Autofill                   | null              |
| Firefox Screenshots             | Mozilla           |
| Web Compatibility Interventions | null              |
| WebCompat Reporter              | Thomas Wisniewski |
| DoH Roll-Out                    | null              |
| Picture-In-Picture              | null              |
| Google                          | null              |
| Amazon.com                      | null              |
| Bing                            | null              |
| DuckDuckGo                      | null              |
| eBay                            | null              |
| Wikipedia (en)                  | null              |
| System theme                    | Mozilla           |
| Dark                            | Mozilla           |
| Light                           | Mozilla           |
| Firefox Alpenglow               | null              |
| Reset Search Defaults           | null              |
| Add-ons Search Detection        | null              |
| LastPass: Free Password Manager | LastPass          |
| 1Password โ€“ Password Manager    | AgileBits         |
+---------------------------------+-------------------+

Oh no! This is not at all what we expect to see. There are rows with the literal text string 'null'.

Let's take a look at the data source osquery uses to populate the firefox_addons table, a JSON file:

~/Library/Application Support/Firefox/Profiles/%.default/extensions.json

A cropped screenshot of a code editor of a JSON file with nulls highlighted

While the value is correctly encoded as NULL in the JSON file, somehow, this data is being parsed as a literal string 'null' instead.

Looping animation of Homer Simpson with a stoic expression slowly walking backwards into a shrub

This is a fantastic example of another general rule that you should keep in mind when working with osquery:

Data is populated from disparate sources and broadly speaking, osquery relays these sources with minimal consideration for whether they are misformatted, misencoded, or inconsistent.

Never forget:

osquery is merely a messenger; you must check your results!

How to avoid the pitfalls of empty strings

The following three rules will help ensure you do not fall victim to any pitfalls described above.

  1. Consider each table's schema (whether a column is NUMERIC vs. TEXT)

  2. Sanitize the output of your queries to ensure they cannot silently fail in ways you do not predict.

  3. Test your queries rigorously before blindly relying on their output.

1. Consider each table's schema

As we discussed above in the process_open_sockets example, there is a typical pattern in osquery that TEXT columns will typically encode empty-strings and NUMERIC columns will typically encode NULL. Knowing the type of columns you are working with will help you predict how they will behave.

When writing a new query, you can spot-check the schema by referencing the osquery Schema Docs or by typing the .schema table_name command in osqueryi:

A cropped screenshot of the osquery schema documentation page focused on the process_open_sockets table. The column "protocol" is highlighted by the cursor.

osquery Schema Documentation
osquery> .schema process_open_sockets
CREATE TABLE process_open_sockets(
`pid` INTEGER,
`fd` BIGINT,
`socket` BIGINT,
`family` INTEGER,
`protocol` INTEGER,
`local_address` TEXT,
`remote_address` TEXT,
`local_port` INTEGER,
`remote_port` INTEGER,
`path` TEXT,
`state` TEXT,
`net_namespace` TEXT HIDDEN,
PRIMARY KEY (`pid`)) WITHOUT ROWID;

Understandably, many queries you write will call columns from more than one table. You can use another unique trick to determine the data types you are dealing with for a given query.

The .types command will give you the datatypes from all the columns present in your query's output:

osquery> .types SELECT * FROM users, logged_in_users WHERE username = user;
+-------------+---------+
| name        | type    |
+-------------+---------+
| uid         | BIGINT  |
| gid         | BIGINT  |
| uid_signed  | BIGINT  |
| gid_signed  | BIGINT  |
| username    | TEXT    |
| description | TEXT    |
| directory   | TEXT    |
| shell       | TEXT    |
| uuid        | TEXT    |
| is_hidden   | INTEGER |
| type        | TEXT    |
| user        | TEXT    |
| tty         | TEXT    |
| host        | TEXT    |
| time        | BIGINT  |
| pid         | INTEGER |
+-------------+---------+

2. Sanitize the output of your queries

Due to the sometimes unpredictable nature of NULL vs empty string, it is often best to take a multi-pronged approach and try to sanitize any strings you wish to perform NULL dependent operations (COUNT etc.) on.

Thankfully, SQLite and, by extension, osquery support the NULLIF operator. This is more or less a shortcut to a CASE statement, but it is still handy to know. You supply the column name you wish to nullify empty strings for and then declare '' as the rule for nullification. We can take any of our earlier examples and nullify the problematic rows using this approach:

SELECT
  name,
  NULLIF(creator,'null') AS creator
FROM users
CROSS JOIN firefox_addons USING(uid);
+---------------------------------+-------------------+
| name                            | creator           |
+---------------------------------+-------------------+
| Adobe Acrobat                   | Adobe Systems Inc.|
| Form Autofill                   |                   |
| Firefox Screenshots             | Mozilla           |
| Web Compatibility Interventions |                   |
| WebCompat Reporter              | Thomas Wisniewski |
| DoH Roll-Out                    |                   |
| Picture-In-Picture              |                   |
| Google                          |                   |
| Amazon.com                      |                   |
| Bing                            |                   |
| DuckDuckGo                      |                   |
| eBay                            |                   |
| Wikipedia (en)                  |                   |
| System theme                    | Mozilla           |
| Dark                            | Mozilla           |
| Light                           | Mozilla           |
| Firefox Alpenglow               |                   |
| Reset Search Defaults           |                   |
| Add-ons Search Detection        |                   |
| LastPass: Free Password Manager | LastPass          |
| 1Password โ€“ Password Manager    | AgileBits         |
+---------------------------------+-------------------+

Let's verify that our COUNT function is now working the way we anticipate:

SELECT
  COUNT(NULLIF(creator,'null')) AS creator_present_count
FROM users
CROSS JOIN firefox_addons USING(uid);
+-----------------------+
| creator_present_count |
+-----------------------+
| 8                     |
+-----------------------+

So there we have it! If there is a table where you suspect empty strings are being recorded in place of NULL all you have to do is make sure to nullify that column using the NULLIF operator or to wrap your aggregated column inside a NULLIF.

3. Test your queries rigorously

Check twice, then check again.

The last point cannot be emphasized enough; you should never assume osquery is just going to 'do the right thing.'

Independently verify the output of your queries in osqueryi on representative devices and supply sample data (including empty strings) using CTEs if you want to play it extra safe.

Never assume that just because output looks consistent at a glance that it will always remain so across a diverse fleet of endpoints, the chrome_extensions example earlier demonstrated how even a single row might be missing a given column.


Further Reading

Interested in more how-to guides on osquery? I recommend reading some of my other posts:

Share this story:

More articles you
might enjoy:

Tutorials
How to Find a Mac's Manufacture Date Using Osquery
Fritz Ifert-Miller
Tutorials
How to Set up Windows File Integrity Monitoring Using Osquery and Kolide
Fritz Ifert-Miller
Deep Dives
Are Your Employees Slack Messages Leaking While Their Screen Is Locked?
Fritz Ifert-Miller
Try Kolide Free
Try Kolide Free