tagfs

Last modified: Sunday, January 15, 2023

by Dominik Honnef

Note that there is considerable overlap between this and Everything, an RDF-based desktop search engine.

Hard requirements ¶

Must be browsable as a traditional file system, so that existing software can access the data. This does not just mean being able to access the files, but also to execute queries.
Must handle mutable data well. We want to use this as the primary file system, being able to record audio and video files, edit code, and so on. That is, we want to append to a file, not just replace it with a new one.
Must have a way of expressing traditional names and hierarchical structures. ‘go build’ won’t like a loose collection of blobs.
Must sit on top of a traditional file system. We don’t want to lock all data away in a database or custom format.

Considerations ¶

Should we support ontologies, or should they be encoded explicitly in tags? For example, should a file tagged with “berlin” match a query for “germany”?
How will we handle tags that describe distinct objects but share a name, such as Ontario in the USA and in Canada, or people with the same name? Should tags be proper file system objects (i.e. files)?
Should we build something like Perkeep on top of this file system, to gain advantages that tagged hierarchical files alone can’t provide, such as immutability and doing without file names?
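
To make the ontology question concrete: if tags formed a hierarchy, a query for “germany” would be expanded to every narrower tag before matching. The following is only a minimal sketch under that assumption. The `BROADER` map, its contents, and the function names are hypothetical and not part of any decided design.

```python
# Hypothetical "narrower tag -> broader tag" relation; illustrative data only.
BROADER = {
    "berlin": "germany",
    "hamburg": "germany",
    "germany": "europe",
}


def expand(tag):
    """Return the tag plus every tag that is transitively narrower than it."""
    result = {tag}
    changed = True
    while changed:
        changed = False
        for narrow, broad in BROADER.items():
            if broad in result and narrow not in result:
                result.add(narrow)
                changed = True
    return result


def matches(file_tags, query_tag):
    """True if any of the file's tags is the query tag or narrower than it."""
    return bool(expand(query_tag) & set(file_tags))
```

Without such a relation, the same result requires tagging the file with both “berlin” and “germany” explicitly.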

Design decisions ¶

We store metadata in extended attributes, not a custom database. A custom database will only act as an index. This way, if the custom database gets corrupted or falls out of sync with the underlying file system, we can easily reindex all files.

This also allows ZFS snapshots and tools like cp to handle our metadata transparently.
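
A rough sketch of what “the database is only an index” could look like: tags live in extended attributes, and the index can be dropped and rebuilt by walking the file system. The `user.tagfs.` attribute prefix, the table layout, and the function names are assumptions made for illustration. `os.listxattr` and `os.getxattr` are only available on Linux, so the reader is passed in as a parameter and can be swapped out.

```python
import os
import sqlite3

PREFIX = "user.tagfs."  # hypothetical xattr namespace, not defined by these notes


def read_tags(path):
    """Read all tags stored in a file's extended attributes (Linux only)."""
    tags = {}
    for name in os.listxattr(path):
        if name.startswith(PREFIX):
            value = os.getxattr(path, name).decode("utf-8", "replace")
            tags[name[len(PREFIX):]] = value
    return tags


def reindex(root, db, reader=read_tags):
    """Throw away the index and rebuild it from the file system."""
    db.execute("DROP TABLE IF EXISTS tags")
    db.execute("CREATE TABLE tags (path TEXT, name TEXT, value TEXT)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            for name, value in reader(path).items():
                db.execute("INSERT INTO tags VALUES (?, ?, ?)", (path, name, value))
    db.commit()
```

Because the xattrs are the source of truth, losing the SQLite file costs only the time of one full walk.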

Unsolved problems ¶

Because of our hard requirements, we can’t just name every file after a hash of its contents. That, however, means we run into the problem of conflicting file names. A user won’t be able to save two files named “dog.jpg” to ~/Pictures. And if two files with the same name exist in different directories, but a query returns both files, we need a way to display both files with unique names.
How do we efficiently update the index after rolling back to a ZFS snapshot?
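
For the display half of the naming problem (a query returning two files that are both called “dog.jpg”), one possible approach is to add a short digest of the full path to colliding names only. This is a sketch, not a decision. The `~` separator, the 8-character digest length, and the function name are arbitrary assumptions.

```python
import hashlib
import os
from collections import Counter


def display_names(paths):
    """Map each path to a name that is unique within this result set."""
    counts = Counter(os.path.basename(p) for p in paths)
    names = {}
    for path in paths:
        base = os.path.basename(path)
        if counts[base] > 1:
            # Colliding name: insert a short digest of the full path
            # before the extension.
            stem, ext = os.path.splitext(base)
            digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]
            base = f"{stem}~{digest}{ext}"
        names[path] = base
    return names
```

This does not help with saving two files named “dog.jpg” to the same directory, and a generated name could in principle still clash with a real file that happens to carry the same suffix.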

Papers & Resources ¶

Metadata extraction ¶

we want to support metadata stored in containers, to simplify adoption. that is, we want to be able to search by ID3 tags, matroska tags, PDF titles and so on.
we do not want to fetch metadata from online databases. that is, we won’t take an IMDb ID and fetch a movie’s title. This should be implemented in applications that take the IDs, look up the data and store new xattrs. because we provide fast search on tags, these helpers can efficiently find “movies with imdb IDs that we haven’t crawled in 2 weeks” to keep metadata up to date. in an ideal world, something like Jellyfin would store its data in our xattrs.
a possible complication occurs when tags in containers have the same semantics as tags stored in xattrs, but different names. This causes duplicate tags, with potentially conflicting information. We may either have to rely on users and tools to standardize on the same tags, or provide a way of defining tag aliases. Aliases, however, may differ depending on the ontology in use.
should setting a tag modify metadata stored in containers, or always use xattrs?
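
If we went with aliases, merging container tags and xattr tags could look roughly like this. The alias table and function names are hypothetical. TIT2 and TPE1 are ID3v2 frame names and TITLE is a Matroska tag name, but the canonical names they map to here are made up for the example. Letting xattr values win on conflict follows the precedence rule stated in the notes.

```python
# Hypothetical alias table: container-specific tag name -> canonical tag name.
ALIASES = {
    "TIT2": "title",   # ID3v2 title frame
    "TPE1": "artist",  # ID3v2 lead performer frame
    "TITLE": "title",  # Matroska title tag
}


def canonical(name):
    """Resolve an alias; unknown names are lowercased (an assumption)."""
    return ALIASES.get(name, name.lower())


def merge_tags(container_tags, xattr_tags):
    """Combine both sources under canonical names; xattr values win on conflict."""
    merged = {canonical(k): v for k, v in container_tags.items()}
    merged.update({canonical(k): v for k, v in xattr_tags.items()})
    return merged
```

A single global table like this ignores the point that aliases may differ per ontology. A real version would probably need one table per ontology.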

Notes ¶

there is a difference between “find all tracks in all albums that contain tracks by artist X” and “find all tracks by artist X and group them by album”, as the former will include more tracks, namely all tracks in the albums.
tags can either be binary, or have associated values
- binary tags can be encoded as tags with values, ‘tag=<tag name>’, if we allow multiple tags with the same name
tags in xattrs take precedence over tags extracted from files
do we want to provide access to parts of files? should a user be able to find and access the audio track of a movie directly? or the source code of a single function in a Go file?
- this can be split into two parts: being able to find files based on interior data, and being able to directly access interior data.
we want the hierarchical filesystem to coexist with our queries
provide a way to filter down a specific hierarchical directory
simple queries can be encoded in directory names, but for complex queries we may want the full power of SQL or SPARQL or something like that.
we want persistent queries (which could just be symlinks to the correct folder)
derived tags: a movie has an original language and the language of its actual audio track, which should use different tags (such as “original_language” and “audio_language”), but the user may want to easily search for either tag. one might expect a tag “language” to implicitly exist, with both the original and the audio language as values.
- this probably means that we want to support proper taxonomies, or maybe even ontologies
some tags can be inferred from the file extension/mime type
file system access is a nice starting point, but more advanced tools and UIs should be built on top. for example, say a movie has a sequel, we may want to find the sequel on disk. but we probably don’t want to encode this in the file system directly, nor even hard-code this functionality in the file system provider. consider the file system an API to build tools on top of. in this case, it could be as simple as reading the xattr, constructing a new path, and readdir’ing it.
we’ll support full text search somehow
- maybe also support latent semantic analysis?
- what about text inside containers, such as subtitles?
we can’t list all files by default, because some queries match millions of files. however, users, and tools in particular, may really want all files.
tag a media file with the recording date; together with the length, this gives us the end of the recording, so we can do range queries
a read-only file can’t have its xattrs modified. this isn’t ideal for us, because we want to allow tagging of read-only data. this means we may have to implement our own notion of read-only that is separate from chmod.
we may want to support relational queries. e.g. “find all files where mtime > last_index_time”
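
The “mtime > last_index_time” example compares two attributes of the same file rather than an attribute against a constant. A minimal sketch of that query against a SQLite index follows. The `files` table, its columns, and the sample rows are assumptions for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical index schema: one row per file with two timestamps (Unix seconds).
db.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, mtime REAL, indexed_mtime REAL)"
)
db.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("/music/a.flac", 100.0, 100.0),  # unchanged since indexing
        ("/music/b.flac", 250.0, 100.0),  # modified after indexing
    ],
)


def stale_files(db):
    """Return paths whose mtime is newer than the mtime recorded at index time."""
    rows = db.execute(
        "SELECT path FROM files WHERE mtime > indexed_mtime ORDER BY path"
    )
    return [path for (path,) in rows]


print(stale_files(db))  # ['/music/b.flac']
```

A comparison like this is awkward to encode in directory names, which is one argument for exposing a real query language next to the path syntax.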

Useful metadata to track ¶

indexed mtime

PDFs (and similar documents) ¶

type: book, paper, …
read: true/false
DOI
ISBN

Examples ¶

ln -s /tag:/movie /movies
cd /movies/year:/>2020/language:/english/
ls ./title:/ # list all movies after 2020, in english, by title.
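
The path syntax in the example above could be parsed along these lines. This is only a sketch of one reading of that syntax: path components alternate between `key:` and a value, a value may start with a comparison operator, and a trailing `key:` without a value asks for a listing by that key. Only `>` appears in the example. The other operators and the function name are assumptions.

```python
import re


def parse_query(path):
    """Split a query path into (key, operator, value) predicates
    and an optional listing key."""
    parts = [p for p in path.split("/") if p and p != "."]
    predicates = []
    list_by = None
    i = 0
    while i < len(parts):
        key = parts[i]
        if not key.endswith(":"):
            raise ValueError(f"expected 'key:' component, got {key!r}")
        key = key[:-1]
        if i + 1 == len(parts):
            # A trailing "key:" without a value requests a listing by that key.
            list_by = key
            break
        match = re.fullmatch(r"(>=|<=|>|<)?(.*)", parts[i + 1], flags=re.DOTALL)
        op, value = match.groups()
        predicates.append((key, op or "=", value))
        i += 2
    return predicates, list_by
```

For the example, `parse_query("/tag:/movie/year:/>2020/language:/english/title:/")` yields three predicates (tag = movie, year > 2020, language = english) and `title` as the listing key.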

References ¶

[1]

Nayuki, “Designing better file organization around tags, not hierarchies,” Mar. 08, 2017. https://www.nayuki.io/page/designing-better-file-organization-around-tags-not-hierarchies (accessed Dec. 31, 2022).

[2]

B. Schandl and N. Popitsch, “Lifting file systems into the linked data cloud with TripFS,” in Proceedings of the WWW2010 Workshop on Linked Data on the Web, LDOW 2010, Raleigh, USA, April 27, 2010, 2010, vol. 628. Accessed: Dec. 31, 2022. [Online]. Available: http://ceur-ws.org/Vol-628/ldow2010_paper02.pdf

[3]

L. Xu, H. Jiang, X. Liu, L. Tian, Y. Hua, and J. Hu, “Propeller: A scalable metadata organization for a versatile searchable file system.”

[4]

D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O’Toole, “Semantic file systems,” in Proceedings of the thirteenth ACM symposium on Operating systems principles, Sep. 1991, pp. 16–25. doi: 10.1145/121132.121138.

[5]

S. Bloehdorn, O. Görlitz, S. Schenk, and M. Völkel, “TagFS - tag semantics for hierarchical file systems,” 2006. Accessed: Dec. 31, 2022. [Online]. Available: https://www.semanticscholar.org/paper/TagFS-Tag-Semantics-for-Hierarchical-File-Systems-Bloehdorn-G%C3%B6rlitz/974c473825c7186fddb2d14a63ea30f44a369bc8

[6]

Y. Padioleau and O. Ridoux, “A logic file system,” in USENIX 2003 Annual Technical Conference, General Track, Sep. 2003. Accessed: Dec. 31, 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03214497

[7]

K. Voit, “Don’t do complex folder hierarchies - they don’t work and this is why and what to do instead,” Jan. 25, 2020. https://karl-voit.at/2020/01/25/avoid-complex-folder-hierarchies/ (accessed Jan. 01, 2023).