I’ve been having a lot of fun with Tiger, like some other Mac-heads I know. There are lots of neat things about it, but I’ve been most fascinated by Spotlight. Pervasive metadata on your files sounds like a great idea, and a long overdue one.
The model appears to be this: each volume has a metadata store that contains information about most files on disk. The metadata for a given file is represented by a dictionary, with keys like kMDItemAudioSampleRate. The metadata is generated by plugins (extension .mdimporter), each of which is invoked via just one method. The method is given the pathname of the file and returns a dictionary with the attributes of that file. Pretty simple. I want to try making one, but I need to think of a useful plugin idea first. My first thought was Ogg Vorbis files, but those seem to be handled by the QuickTime importer. (That is the only part of the Vorbis plugin that still works, but more on that later.)
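To make the "pathname in, dictionary out" idea concrete, here is a minimal sketch in Python. It is only a toy model of the contract described above: real importers are compiled .mdimporter plugins, not Python functions, and the name toy_importer is made up for illustration. The two keys are borrowed from the mdls output later in this post.

```python
import os
import tempfile

def toy_importer(path):
    """Hypothetical importer: take a pathname, return a dict of attributes."""
    info = os.stat(path)
    return {
        # Key names borrowed from mdls output, purely for illustration.
        "kMDItemFSName": os.path.basename(path),
        "kMDItemFSSize": info.st_size,
    }

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"hello")
    sample_path = f.name

attrs = toy_importer(sample_path)
print(attrs)
os.remove(sample_path)
```

The point is only the shape of the interface: one call per file, one dictionary back.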
Complex metadata is pretty time-consuming to calculate, so the UNIX locate technique of running every night with a cron job just wouldn’t work. (Even for locate, which only indexes filenames, reindexing from scratch every night is annoying.) Instead, the kernel informs the metadata system whenever a file on disk changes, and it invokes the appropriate metadata importer plugin. Now the indexing work is basically distributed over the idle time of the machine, and only repeated as needed.
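As a rough illustration of the difference from a nightly rescan, here is a toy store in Python that re-imports a file only when it is told that file changed. Everything here is hypothetical (ToyMetadataStore, file_changed, the toyWordCount attribute), and file contents are passed in as strings to keep the sketch self-contained. It says nothing about how the kernel actually delivers its notifications.

```python
class ToyMetadataStore:
    """Hypothetical store that re-imports one file per change notification."""

    def __init__(self, importer):
        self.importer = importer
        self.attributes = {}   # path -> attribute dictionary
        self.import_count = 0  # how many times an importer has run

    def file_changed(self, path, contents):
        # Only the file named in the notification is re-imported.
        self.attributes[path] = self.importer(path, contents)
        self.import_count += 1


def word_count_importer(path, contents):
    # Stand-in for an expensive metadata calculation.
    return {"toyWordCount": len(contents.split())}


store = ToyMetadataStore(word_count_importer)
store.file_changed("/tmp/a.txt", "one two three")
store.file_changed("/tmp/b.txt", "four five")
store.file_changed("/tmp/a.txt", "one two three four")  # b.txt is not redone

print(store.import_count, store.attributes)
```

Three notifications cost three importer runs. A from-scratch rescan would instead run the importer on every file each time, changed or not.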
Of course, Apple hides all this stuff behind a very simple user interface. You type in words, and it shows you files. However, if you want to search by a particular attribute, no GUI is provided. In the vast majority of cases, you don’t need it, so Apple opted not to distract you with the options. Fortunately, if you open up a Terminal window, there are several command line utilities to query the metadata and really see what’s going on:
- mdimport - Manually run importer plugins on files.
- mdls - Show the metadata attributes stored for a file.
- mdutil - General management functions, like erasing the metadata store or disabling indexing on a volume.
- mdfind - Query the metadata store.
(Links to man pages provided because it looks like the man pages are not installed by default. Update: I had just borked my MANPATH; the man pages are there.)
For example, here’s the info for a QuickTime movie I downloaded with Safari:
Rover:~/Desktop stan$ mdls RvB_Episode55_LoRes.mov
RvB_Episode55_LoRes.mov -------------
kMDItemAttributeChangeDate = 2005-05-07 11:13:36 -0500
kMDItemAudioBitRate = 127832
kMDItemAudioChannelCount = 2
kMDItemCodecs = (AAC, "Sorenson Video 3")
kMDItemContentCreationDate = 2005-05-03 11:50:43 -0500
kMDItemContentModificationDate = 2005-05-03 11:50:43 -0500
kMDItemContentType = "com.apple.quicktime-movie"
kMDItemContentTypeTree = (
"com.apple.quicktime-movie",
"public.movie",
"public.audiovisual-content",
"public.data",
"public.item",
"public.content"
)
kMDItemDisplayName = "RvB_Episode55_LoRes.mov"
kMDItemDurationSeconds = 302.735
kMDItemFSContentChangeDate = 2005-05-03 11:50:43 -0500
kMDItemFSCreationDate = 2005-05-03 11:50:43 -0500
kMDItemFSCreatorCode = 0
kMDItemFSFinderFlags = 0
kMDItemFSInvisible = 0
kMDItemFSLabel = 0
kMDItemFSName = "RvB_Episode55_LoRes.mov"
kMDItemFSNodeCount = 0
kMDItemFSOwnerGroupID = 501
kMDItemFSOwnerUserID = 501
kMDItemFSSize = 24302686
kMDItemFSTypeCode = 0
kMDItemID = 5274552
kMDItemKind = "QuickTime Movie"
kMDItemLastUsedDate = 2005-05-03 10:50:43 -0500
kMDItemMediaTypes = (Sound, Video)
kMDItemPixelHeight = 240
kMDItemPixelWidth = 360
kMDItemStreamable = 0
kMDItemTotalBitRate = 639232
kMDItemUsedDates = (2005-05-03 10:50:43 -0500)
kMDItemVideoBitRate = 511400
kMDItemWhereFroms = (
"http://files.redvsblue.com/3x55shisno/RvB_Episode55_LoRes.mov",
"http://www.redvsblue.com/archive/"
)
So you see the list includes all sorts of information: when the content in the file was created (this is separate from the creation date of the file), bitrates, dimensions, etc. Especially interesting is the last attribute, kMDItemWhereFroms. Both the original URL of the file and the page that linked to it are included as part of the file metadata. I saw it reported on some blog (which I cannot now locate) that files downloaded with Safari have the URL information included with them, but I haven’t figured out how this is achieved. Because Safari does this with any file you download, it must either inject this information into the metadata store directly, or write some sort of extended attribute to the filesystem, which is later picked up by the metadata importers. (Incidentally, this is how Beagle stores file metadata in general.)
Of course, you can turn it around and search for stuff like “What files have I downloaded from redvsblue.com?”:
Rover:~/Desktop stan$ mdfind "kMDItemWhereFroms == *redvsblue.com*"
/Users/stan/Desktop/RvB_Episode55_LoRes.mov
or “What rock songs do I have that are less than 3 minutes long?”:
Rover:~/Desktop stan$ mdfind "kMDItemMusicalGenre == Rock && kMDItemDurationSeconds < 180"
/Users/stan/Music/iTunes/iTunes Music/Chuck Berry/Blues/12 Route 66.m4p
Of course, you have to be a little careful with this, because attribute queries only work if files follow some sort of attribute standard. This is the nasty, endless argument that every metadata tagging standard has to deal with. That’s probably why the Spotlight GUI doesn’t let you select specific attributes to search on. Then it doesn’t matter if some kinds of files call the producer of a creative work the “Artist,” and others the “Author,” or whatever. Hurling all the attribute values into a big bag and ignoring the keys is remarkably effective.
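Here is a small Python sketch of that "big bag of values" idea. It is a toy, not a description of how Spotlight is implemented: the function bag_search, the file paths, and the attribute dictionaries are all made up for illustration.

```python
def bag_search(store, term):
    """Return paths whose attribute values mention term, ignoring the keys."""
    term = term.lower()
    hits = []
    for path, attrs in store.items():
        for value in attrs.values():
            # Treat list-valued attributes as several separate values.
            values = value if isinstance(value, (list, tuple)) else [value]
            if any(term in str(v).lower() for v in values):
                hits.append(path)
                break  # one matching value is enough for this file
    return hits


# Made-up metadata: two file types disagree on "Artist" versus "Author".
toy_store = {
    "/music/song.m4a": {"Artist": "Jane Doe", "Genre": "Rock"},
    "/docs/essay.pdf": {"Author": "Jane Doe"},
    "/docs/notes.txt": {"Author": "Someone Else"},
}

matches = bag_search(toy_store, "jane doe")
print(matches)
```

Both of the Jane Doe files are found even though they disagree on the key name, which is the whole appeal of ignoring the keys.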
I’m still investigating other questions, like:
- Can multiple mdimporters be run on a single file? If so, the attributes for a file would presumably just be the union of the dictionaries produced by all the relevant plugins. This would allow existing mdimporters to be extended by just making a new importer that extracts some additional attributes. Not as efficient, but useful if you can’t modify the old importer.
- How is textual content indexed? Spotlight clearly returns results based on the contents of PDF/Text/RTF files, but mdls doesn’t show any attributes corresponding to keywords used in the document. Is this information stored somewhere else?
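For the first question, the "union of the dictionaries" I have in mind would look something like the Python sketch below. Whether Spotlight actually works this way is exactly what I don’t know yet, so treat it as a picture of the idea only. The function names and the toyExtension key are hypothetical.

```python
def merge_importers(path, importers):
    """Run every importer on path and merge the resulting dictionaries."""
    merged = {}
    for importer in importers:
        # Assumption: on a key clash, the later importer wins.
        merged.update(importer(path))
    return merged


def base_importer(path):
    # Stand-in for an existing importer that cannot be modified.
    return {"kMDItemDisplayName": path.rsplit("/", 1)[-1]}


def extra_importer(path):
    # Stand-in for a new importer that adds one more attribute.
    return {"toyExtension": path.rsplit(".", 1)[-1]}


merged = merge_importers(
    "/Users/stan/Desktop/RvB_Episode55_LoRes.mov",
    [base_importer, extra_importer],
)
print(merged)
```

The clash rule is the part that would need a real answer: if two importers emit the same key, something has to decide which value is kept.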