Contents

The File Table: Osquery's Secret Weapon

With Great Power Comes Great Responsibility

This article is part of an ongoing series in which we look at specific osquery tables and discuss what they can and cannot do. In another article: Using Spotlight across your fleet with osquery, we discussed a method for finding files on macOS computers that was incredibly performant and could find files based on the content and metadata inside.

The usage of Spotlight in that example was predicated on its ability to index text contents in files and its ability to avoid the performance hit of recursive filesystem searching. We briefly touched on the limitations of the file table in that article, and here we will take a slightly deeper dive.

This post will look at a cross-platform alternative to finding files across your infrastructure.

The File Table in Osquery

The file table in osquery has an incredible degree of utility across many queries and represents a fundamental cornerstone of osquery’s core capabilities.

Let’s start by taking a look at its basic schema:

+---------------+---------+---------------------------------------+
| COLUMN        | TYPE    | DESCRIPTION                           |
+---------------+---------+---------------------------------------+
| path          | TEXT    | Absolute file path                    |
| directory     | TEXT    | Directory of file(s)                  |
| filename      | TEXT    | Name portion of file path             |
| inode         | BIGINT  | Filesystem inode number               |
| uid           | BIGINT  | Owning user ID                        |
| gid           | BIGINT  | Owning group ID                       |
| mode          | TEXT    | Permission bits                       |
| device        | BIGINT  | Device ID (optional)                  |
| size          | BIGINT  | Size of file in bytes                 |
| block_size    | INTEGER | Block size of filesystem              |
| atime         | BIGINT  | Last access time                      |
| mtime         | BIGINT  | Last modification time                |
| ctime         | BIGINT  | Last status change time               |
| btime         | BIGINT  | (B)irth or (cr)eate time              |
| hard_links    | INTEGER | Number of hard links                  |
| symlink       | INTEGER | 1 if the path is a symlink, else 0    |
| type          | TEXT    | File status                           |
| attributes    | TEXT    | File attrib string.                   |
| volume_serial | TEXT    | Volume serial number                  |
| file_id       | TEXT    | file ID                               |
+---------------+---------+---------------------------------------+

As we can see, there are many metadata attributes that we can use to our advantage both when building queries and refining results. Let’s run a sample query against the file table to inspect a file on our local device:

SELECT *
FROM file
WHERE path = '/Users/fritz-imac/Downloads/github-recovery-codes.txt';

This query returns the following data:

path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
directory = /Users/fritz-imac/Downloads
filename = github-recovery-codes.txt
inode = 20650405
uid = 502
gid = 20
mode = 0644
device = 0
size = 206
block_size = 4194304
atime = 1533646421
mtime = 1532976585
ctime = 1532976860
btime = 1532976585
hard_links = 1
symlink = 1
type = regular

Very cool! We can see that we have a fair bit of information about this github-recovery-codes.txt file. Now, let’s take a moment to clean it up and return the values in a format that we can quickly parse.

Osquery, by default, returns some data in a less than humanly digestible format.

Time is represented in UNIX epoch
Size is defined in bytes.¹

Unix epoch date-times are great because they are the easiest to transform. Using the datetime(value,'unixepoch') syntax, we can convert any date-time result in osquery to an easier read value. Then, we will round our size from bytes to megabytes using the ROUND function ROUND((f.size * 10e-7),4)

Finally, we will join our uid and gid on their respective tables (users and groups) and return their actual names. Let’s try it below:

SELECT
  f.path,
  u.username AS file_owner,
  g.groupname AS group_owner,
  datetime(f.atime, 'unixepoch') AS file_last_access_time,
  datetime(f.mtime, 'unixepoch') AS file_last_modified_time,
  datetime(f.ctime, 'unixepoch') AS file_last_status_change_time,
  datetime(f.btime, 'unixepoch') AS file_created_time,
  ROUND(
    (f.size * 10e - 7),
    4
  ) AS size_megabytes
FROM
  file f
  LEFT JOIN users u ON f.uid = u.uid
  LEFT JOIN groups g ON f.gid = g.gid
WHERE
  path = "/Users/fritz-imac/Downloads/github-recovery-codes.txt"

Running this query results in the following output:

path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
file_owner = fritz-imac
group_owner = staff
file_last_access_time = 2018-08-29 18:43:40
file_last_modified_time = 2018-07-30 18:49:45
file_last_status_change_time = 2018-07-30 18:54:20
file_created_time = 2018-07-30 18:49:45
size_megabytes = 0.0002

Already that is much easier to read! Now we can go hunting for any file we want. But wait, slow down, tiger, there is a big caveat with the file table in osquery.

You Have to Know Where Your File Is First!

Unfortunately, the file table requires a WHERE clause, meaning that you need to know roughly where an item is before you can go querying for it. This protects against the massive recursion that would be necessitated by searching every single directory and their respective subdirectories on the file system.

The clause WHERE path = can thankfully be massaged through the use of wildcards and a LIKE argument, e.g., WHERE path LIKE "/foo/%" but there are some tricky things to look out for in terms of the way wildcards are handled!

Understanding How Single Wildcards ‘%’ Work Within the File Table

SELECT * FROM file WHERE path LIKE "/%";

Single wildcards treat the file table like an ogre in that it’s like an onion. They all have layers

A single % in the file table allows you to find items in a specified layer of the file system. For instance:

SELECT
  *
FROM
  file
WHERE
  path LIKE "/Users/%/Google Drive/%/%";

This query would find any files that were located within directories like:

/Users/username/Google Drive/foo/bar/
/Users/username2/Google Drive/bar/foo/

It will not, however, return any files that were located subsequent subdirectories deeper. For instance:

/Users/username/Google Drive/foo/bar/baz/filename.ext
/Users/username/Google Drive/foo/bar/baz/qux/quux/filename.ext

This means you can’t print out every file on your file system by running:

SELECT
  *
FROM
  file
WHERE
  path LIKE "/%"

Because as long as you only use single % wildcards, you will only ever be able to see all files at that layer of the subdirectory hierarchy. So in this instance, the first layer, items in the root directory:

osquery> SELECT path, btime, size, type FROM file WHERE path LIKE "/%";
+----------------------------+------------+------+-----------+
| path                       | btime      | size | type      |
+----------------------------+------------+------+-----------+
| /Applications/             | 1508949222 | 4114 | directory |
| /Library/                  | 1508949320 | 2312 | directory |
| /Network/                  | 1469907150 | 68   | directory |
| /System/                   | 1508949086 | 136  | directory |
| /Users/                    | 1469911752 | 170  | directory |
| /Volumes/                  | 1469907156 | 136  | directory |
| /bin/                      | 1508949475 | 1292 | directory |
| /cores/                    | 1469907149 | 68   | directory |
| /dev/                      | 0          | 4444 | directory |
| /etc/                      | 1508949384 | 4114 | directory |
| /home/                     | 0          | 1    | directory |
| /installer.failurerequests | 1504228163 | 313  | regular   |
| /net/                      | 0          | 1    | directory |
| /private/                  | 1519081368 | 204  | directory |
| /sbin/                     | 1508949475 | 2142 | directory |
| /tmp/                      | 1519081368 | 1326 | directory |
| /usr/                      | 1508948533 | 306  | directory |
| /var/                      | 1510954365 | 986  | directory |
+----------------------------+------------+------+-----------+

Understanding How Double Wildcards ‘%%’ Work Within the File Table

To recursively search your filesystem, you can use %% the double wildcard.

-- WARNING: Do NOT run the following query it will literally return thousands of
-- files on your computer!!!
SELECT
  *
FROM
  file
WHERE
  path LIKE "/%%"

Double wildcards can only ever be used at the end of a string eg. /foo/bar/%%

Double wildcards can NEVER be used mid-string (infix)

This means the following query will never return results:

SELECT
  path
from
  file
WHERE
  path LIKE "/Users/%%/UX/%%";

However, you can use the double wildcard to your advantage in limited situations where you want to search within a known parent directory and its respective sub-directories. The more possible sub-directories and recursion, the slower and less performant the query will be.

SELECT
  path
FROM
  file
WHERE
  path LIKE "/Users/fritz-imac/Google Drive/UX/%%";

Mixing single ‘%’ and double wildcards ‘%%’ within the file table

The last fun thing you can do is mix wildcard types if you want to restrict certain parts of your WHERE clause to a specific layer and permit recursion at another part of the path.

For instance, if I wanted to see how much percentage of disk space the files located in Google Drive are using per device across my entire fleet, I could run the following query:

SELECT SUM(file.size) AS gdrive_size,
  (SELECT (mounts.blocks * mounts.blocks_size)
   FROM mounts
   WHERE PATH = '/') AS total_disk_size,
       (100.0 * SUM(file.size) /
          (SELECT (mounts.blocks * mounts.blocks_size)
           FROM mounts
           WHERE PATH = '/')) AS gdrive_percentage_used
FROM FILE
WHERE file.path LIKE '/Users/%/Google Drive/%%';

Which produces the following output:

+-------------+-----------------+------------------------+
| gdrive_size | total_disk_size | gdrive_percentage_used |
+-------------+-----------------+------------------------+
| 7492526705  | 379000430592    | 1.97691772890512       |
+-------------+-----------------+------------------------+

These queries can be run, but you need to be mindful of how many files your system is looking through to produce the end result.

Limitations of Wildcard LIKE Searches in the File Table

While undeniably useful, the file table is not without limitations. Knowing what those limitations are and how to avoid them will help you make sure you are returning your expected data and that it is accurate.

It is worth noting that both of these limitations are outlined in a GitHub issue, in the osquery repository #7306, and may someday be addressed.🤞🏻

Symlink loops may prevent complete recursive crawling

If you were to run a query like:

SELECT * FROM file WHERE path LIKE '/%%'

You might expect that it would return every file on your file-system, but in practice it does not. This is because osquery will end the query once it encounters something called a symlink loop.

osquery> SELECT COUNT(*) FROM file WHERE path LIKE '/%%';
+----------+
| COUNT(*) |
+----------+
| 1172     |
+----------+

Symlinks (also known as symbolic links) are frequently referred to as shortcuts. They are files which point to another file or folder on the computer.

A symlink loop is when a symlink points back to a parent directory in its path which creates an infinite loop if traversed. We can observe this behavior in practice by querying a folder which contains one of these loops:

As we can see in the screenshot below, the directory /tmp/directory-level-0/ consists of 6 nested folders.

Nested directories with a symbolic link in directory-level-1

Screenshot of macOS Finder showing nested directories

The tree command shows the expected output of our nested directories:

➜  directory-level-0 tree
.
└── directory-level-1
    ├── directory-level-2
    │   └── directory-level-3
    │       └── directory-level-4
    │           └── directory-level-5
    │               └── directory-level-6
    └── symlink-back-to-level-0 -> /tmp/directory-level-0

However, when we recursively query the parent folder using osquery, we receive an incomplete result. This is because the symbolic link is causing a recursive loop of a child pointing back to a parent which osquery avoids by preemptively terminating the query:

osquery> SELECT path FROM file WHERE path LIKE '/tmp/directory-level-0/%%';
+-------------------------------------------------------------------------------------+
| path                                                                                |
+-------------------------------------------------------------------------------------+
| /tmp/directory-level-0/directory-level-1/                                           |
| /tmp/directory-level-0/directory-level-1/directory-level-2/                         |
| /tmp/directory-level-0/directory-level-1/directory-level-2/directory-level-3/       |
| /tmp/directory-level-0/directory-level-1/symlink-back-to-level-0/                   |
| /tmp/directory-level-0/directory-level-1/symlink-back-to-level-0/directory-level-1/ |
+-------------------------------------------------------------------------------------+

Unfortunately, the recursive searching in the osquery file table is accomplished via glob which has no method for skipping over symlinks.

As a result you have to take a targeted approach when recursively searching using double wildcards to avoid recursing through paths that may contain symlinks.

Hidden files are not returned by recursive searches

If a file is marked as hidden in the file-system (typically by prepending the filename with a .), it will not be returned in the results of any recursive file search. This can be observed in the following example query:

osquery> SELECT * FROM file WHERE path LIKE '/Users/test-macbook/git/kolide/test/%' AND filename = '.git';

osquery>

osquery> SELECT path FROM file WHERE path = '/Users/test-macbook/git/kolide/test/.git';
+------------------------------------------+
| path                                     |
+------------------------------------------+
| /Users/test-macbook/git/kolide/test/.git |
+------------------------------------------+

osquery> SELECT path FROM file WHERE directory = '/Users/test-macbook/git/kolide/test' AND filename = '.git';
+------------------------------------------+
| path                                     |
+------------------------------------------+
| /Users/test-macbook/git/kolide/test/.git |
+------------------------------------------+

Keep this in mind if you are searching for files that might be marked as hidden in the filesystem. Otherwise, you may be missing results you might have expected.

When Should I Use Recursive Queries in the File Table in Osquery?

When there is literally no other option for locating a file of import.
When you can limit the degree of recursion that your search is capable of by scoping it to a particular folder, e.g., ( '/Users/%/Downloads/%%')

If there is ever any doubt on how expensive a query is after you schedule it, osquery comes with great tools to do this analysis. For example, in a product like Kolide, you can run a live query against your devices that enumerates all of the queries in your schedule by simply running:

SELECT * FROM osquery_schedule

A screenshot of the Kolide Live Query feature showing

Live Query allows you to run osquery SQL across all enrolled devices in Kolide and immediately get the results in the browser or via API.

In this screenshot, we can short all the queries in our schedule by the wall_time to get a sense of how long these take to complete. We can even see how much memory they may be using during execution. These are all indicators that we may execute file queries with too much recursion potential.

If you’d like to read more osquery content like this, sign up for our biweekly newsletter.

Osquery sizes are returned in bytes but some comp-sci items (RAM vs Hard Drives) calculate their size in base-10 and some calculate in base-2 ↩