The File Table: Osquery's Secret Weapon
With Great Power Comes Great Responsibility
This article is part of an ongoing series in which we look at specific osquery tables and discuss what they can and cannot do. In another article: Using Spotlight across your fleet with osquery, we discussed a method for finding files on macOS computers that was incredibly performant and could find files based on the content and metadata inside.
The usage of Spotlight in that example was predicated on its ability to index
text contents in files and its ability to avoid the performance hit
of recursive filesystem searching. We briefly touched on the limitations of
the file
table in that article, and here we will take a slightly deeper dive.
This post will look at a cross-platform alternative to finding files across your infrastructure.
The File Table in Osquery
The file
table in osquery has an
incredible degree of utility across many queries and represents a
fundamental cornerstone of osquery's core capabilities.
Let's start by taking a look at its basic schema:
+---------------+---------+---------------------------------------+
| COLUMN | TYPE | DESCRIPTION |
+---------------+---------+---------------------------------------+
| path | TEXT | Absolute file path |
+---------------+---------+---------------------------------------+
| directory | TEXT | Directory of file(s) |
+---------------+---------+---------------------------------------+
| filename | TEXT | Name portion of file path |
+---------------+---------+---------------------------------------+
| inode | BIGINT | Filesystem inode number |
+---------------+---------+---------------------------------------+
| uid | BIGINT | Owning user ID |
+---------------+---------+---------------------------------------+
| gid | BIGINT | Owning group ID |
+---------------+---------+---------------------------------------+
| mode | TEXT | Permission bits |
+---------------+---------+---------------------------------------+
| device | BIGINT | Device ID (optional) |
+---------------+---------+---------------------------------------+
| size | BIGINT | Size of file in bytes |
+---------------+---------+---------------------------------------+
| block_size | INTEGER | Block size of filesystem |
+---------------+---------+---------------------------------------+
| atime | BIGINT | Last access time |
+---------------+---------+---------------------------------------+
| mtime | BIGINT | Last modification time |
+---------------+---------+---------------------------------------+
| ctime | BIGINT | Last status change time |
+---------------+---------+---------------------------------------+
| btime | BIGINT | (B)irth or (cr)eate time |
+---------------+---------+---------------------------------------+
| hard_links | INTEGER | Number of hard links |
+---------------+---------+---------------------------------------+
| symlink | INTEGER | 1 if the path is a symlink, else 0 |
+---------------+---------+---------------------------------------+
| type | TEXT | File status |
+---------------+---------+---------------------------------------+
| attributes | TEXT | File attrib string. |
+---------------+---------+---------------------------------------+
| volume_serial | TEXT | Volume serial number |
+---------------+---------+---------------------------------------+
| file_id | TEXT | file ID |
+---------------+---------+---------------------------------------+
As we can see, there are many metadata attributes that we can use to our advantage both when building queries and refining results. Let's run a sample query against the file table to inspect a file on our local device:
SELECT
*
FROM
file
WHERE
path = "/Users/fritz-imac/Downloads/github-recovery-codes.txt";
This query returns the following data:
path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
directory = /Users/fritz-imac/Downloads
filename = github-recovery-codes.txt
inode = 20650405
uid = 502
gid = 20
mode = 0644
device = 0
size = 206
block_size = 4194304
atime = 1533646421
mtime = 1532976585
ctime = 1532976860
btime = 1532976585
hard_links = 1
symlink = 1
type = regular
Very cool! We can see that we have a fair bit of information about
this github-recovery-codes.txt file.
Now, let's take a moment to
clean it up and return the values in a format that we can quickly parse.
Osquery, by default, returns some data in a less than humanly digestible format.
- Time is represented in UNIX epoch
- Size is defined in bytes.1
Unix epoch date-times are great because they are the easiest to transform. Using the datetime(value,'unixepoch')
syntax, we can convert any date-time result in osquery to an easier read value.
Then, we will round our size from bytes to megabytes
using the ROUND function ROUND((f.size * 10e-7),4)
Finally, we will join our uid
and gid
on their respective tables (users
and groups
)
and return their actual names. Let's try it below:
SELECT
f.path,
u.username AS file_owner,
g.groupname AS group_owner,
datetime(f.atime, 'unixepoch') AS file_last_access_time,
datetime(f.mtime, 'unixepoch') AS file_last_modified_time,
datetime(f.ctime, 'unixepoch') AS file_last_status_change_time,
datetime(f.btime, 'unixepoch') AS file_created_time,
ROUND(
(f.size * 10e - 7),
4
) AS size_megabytes
FROM
file f
LEFT JOIN users u ON f.uid = u.uid
LEFT JOIN groups g ON f.gid = g.gid
WHERE
path = "/Users/fritz-imac/Downloads/github-recovery-codes.txt"
Running this query results in the following output:
path = /Users/fritz-imac/Downloads/github-recovery-codes.txt
file_owner = fritz-imac
group_owner = staff
file_last_access_time = 2018-08-29 18:43:40
file_last_modified_time = 2018-07-30 18:49:45
file_last_status_change_time = 2018-07-30 18:54:20
file_created_time = 2018-07-30 18:49:45
size_megabytes = 0.0002
Already that is much easier to read! Now we can go hunting for any file we want.
But wait, slow down, tiger, there is a big caveat with the file
table in osquery.
You Have To Know Where Your File Is First!
Unfortunately, the file
table requires a WHERE
clause, meaning that you
need to know roughly where an item is before you can go querying for it.
This protects against the massive recursion that would be necessitated by
searching every single directory and their respective subdirectories on the
file system.
The clause WHERE path =
can thankfully be massaged through the use of
wildcards and a LIKE
argument, e.g., WHERE path LIKE "/foo/%"
but there are
some tricky things to look out for in terms of the way wildcards are handled!
Understanding How Single Wildcards '%' Work Within the File Table
SELECT * FROM file WHERE path LIKE "/%";
Single wildcards treat the file table like an ogre in that it's like an onion. They all have layers.
A single %
in the file
table allows you to find items in a
specified layer of the file system. For instance:
SELECT
*
FROM
file
WHERE
path LIKE "/Users/%/Google Drive/%/%";
This query would find any files that were located within directories like:
/Users/username/Google Drive/foo/bar/
/Users/username2/Google Drive/bar/foo/
It will not, however, return any files that were located subsequent subdirectories deeper. For instance:
/Users/username/Google Drive/foo/bar/baz/filename.ext
/Users/username/Google Drive/foo/bar/baz/qux/quux/filename.ext
This means you can't print out every file on your file system by running:
SELECT
*
FROM
file
WHERE
path LIKE "/%"
Because as long as you only use single %
wildcards, you will only ever be
able to see all files at that layer of the subdirectory hierarchy.
So in this instance, the first layer, items in the root directory:
osquery> SELECT path, btime, size, type FROM file WHERE path LIKE "/%";
+----------------------------+------------+------+-----------+
| path | btime | size | type |
+----------------------------+------------+------+-----------+
| /Applications/ | 1508949222 | 4114 | directory |
| /Library/ | 1508949320 | 2312 | directory |
| /Network/ | 1469907150 | 68 | directory |
| /System/ | 1508949086 | 136 | directory |
| /Users/ | 1469911752 | 170 | directory |
| /Volumes/ | 1469907156 | 136 | directory |
| /bin/ | 1508949475 | 1292 | directory |
| /cores/ | 1469907149 | 68 | directory |
| /dev/ | 0 | 4444 | directory |
| /etc/ | 1508949384 | 4114 | directory |
| /home/ | 0 | 1 | directory |
| /installer.failurerequests | 1504228163 | 313 | regular |
| /net/ | 0 | 1 | directory |
| /private/ | 1519081368 | 204 | directory |
| /sbin/ | 1508949475 | 2142 | directory |
| /tmp/ | 1519081368 | 1326 | directory |
| /usr/ | 1508948533 | 306 | directory |
| /var/ | 1510954365 | 986 | directory |
+----------------------------+------------+------+-----------+
Understanding How Double Wildcards '%%' Work Within the File Table
To recursively search your filesystem, you can use %%
the double wildcard.
-- WARNING: Do NOT run the following query it will literally return every
-- file on your computer!!!
SELECT
*
FROM
file
WHERE
path LIKE "/%%"
Double wildcards can only ever be used at the end of a string eg. /foo/bar/%%
Double Wildcards Can NEVER Be Used Mid-String (infix)
This means the following query will never return results:
SELECT
path
from
file
WHERE
path LIKE "/Users/%%/UX/%%";
However, you can use the double wildcard to your advantage in limited situations where you want to search within a known parent directory and its respective sub-directories. The more possible sub-directories and recursion, the slower and less performant the query will be.
SELECT
path
FROM
file
WHERE
path LIKE "/Users/fritz-imac/Google Drive/UX/%%";
Mixing single '%' and double wildcards '%%' within the file table.
The last fun thing you can do is mix wildcard types if you want to restrict
certain parts of your WHERE
clause to a specific layer and permit recursion
at another part of the path.
For instance, if I wanted to see how much percentage of disk space the files located in Google Drive are using per device across my entire fleet, I could run the following query:
SELECT SUM(file.size) AS gdrive_size,
(SELECT (mounts.blocks * mounts.blocks_size)
FROM mounts
WHERE PATH = '/') AS total_disk_size,
(100.0 * SUM(file.size) /
(SELECT (mounts.blocks * mounts.blocks_size)
FROM mounts
WHERE PATH = '/')) AS gdrive_percentage_used
FROM FILE
WHERE file.path LIKE '/Users/%/Google Drive/%%';
Which produces the following output:
+-------------+-----------------+------------------------+
| gdrive_size | total_disk_size | gdrive_percentage_used |
+-------------+-----------------+------------------------+
| 7492526705 | 379000430592 | 1.97691772890512 |
+-------------+-----------------+------------------------+
These queries can be run, but you need to be mindful of how many files your system is looking through to produce the end result.
When should I use recursive queries in the file table in osquery?
When there is literally no other option for locating a file of import.
When you can limit the degree of recursion that your search is capable of by scoping it to a particular folder, e.g., (
'/Users/%/Downloads/%%'
)
If there is ever any doubt on how expensive a query is after you schedule it, osquery comes with great tools to do this analysis. For example, in a product like Kolide, you can run a live query against your devices that enumerates all of the queries in your schedule by simply running:
SELECT * FROM osquery_schedule
In this screenshot, we can short all the queries in our schedule by the wall_time
to get a sense of how long these take to complete. We can even see how much
memory they may be using during execution. These are all indicators that we may
execute file queries with too much recursion potential.
-
Osquery sizes are returned in bytes but some comp-sci items (RAM vs Hard Drives) calculate their size in base-10 and some calculate in base-2 ↩