Extending Dirent: why some things aren't currently possible #75

Open
GwynethLlewelyn opened this issue Aug 31, 2023 · 0 comments

Hello again,

There are probably better ways to tackle this issue, but bear with me, I'll try to give an example of what I need...

I'm working on a reasonably simple web-based file manager for music files. Essentially, I walk through a hierarchy of media directories (depth is of no concern at this stage, but with the usual artist/album/song layout there will, in most scenarios, be just 3 or possibly 4 levels), grab the name of each entry (i.e. osPathname as defined by the godirwalk callback, sorted alphabetically), check whether it has a valid audio extension (in the future I might do more sophisticated checks) and save it in a one-dimensional array (with the full pathname). Anything which is not an audio file (such as lyrics or .DS_Store, the macOS 'garbage' left all over the place) gets skipped.
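A minimal sketch of the filtering step described above. To keep it self-contained it uses the standard library's filepath.WalkDir (which visits entries in lexical order) as a stand-in for the godirwalk callback; the audioExts list and the isAudio and collectAudio names are hypothetical, not part of godirwalk.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// audioExts is a hypothetical list of accepted extensions; adjust to taste.
var audioExts = map[string]bool{
	".mp3": true, ".flac": true, ".ogg": true, ".m4a": true, ".wav": true,
}

// isAudio reports whether name ends in one of audioExts (case-insensitive).
func isAudio(name string) bool {
	return audioExts[strings.ToLower(filepath.Ext(name))]
}

// collectAudio walks root and returns the path of every audio file found.
// Paths are built from root, so pass an absolute root to get absolute paths.
func collectAudio(root string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && isAudio(d.Name()) {
			found = append(found, path)
		}
		return nil // lyrics, .DS_Store and the like are simply skipped
	})
	return found, err
}

func main() {
	files, err := collectAudio(".")
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```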

So far, so good. My original plan was simply to use an array of Dirents, selecting those that match and skipping over those that don't. Then everything gets pushed into a template, which reads the array of Dirents using a range, and just adds some HTML & CSS around it. Simple.

Currently, Dirent lacks some fields that would make this very simple (#74). You do get the filename, and the ability to check for the file's mode, but you do not get things like the file size or time of last modification.

Therefore, I've extended the Dirent with my own type, something like this:

type PlayListItem struct {
	de godirwalk.Dirent // directory entry data retrieved from godirwalk.

	fullPath string    `validate:"filepath"`           // full path for the directory where this file is.
	cover    string    `validate:"filepath,omitempty"` // path to album cover image for this file.
	modTime  time.Time `validate:"datetime"`           // last modified date (at least on Unix-like systems).
	size     int64     // filesize in bytes, as reported by the system.
	checked  bool      // file checkbox enabled; eventually this will add the file to the playlist.
}

and added the methods that Dirent has (such as Name(), IsDir(), etc.), essentially by calling the 'parent' functions directly.
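As an aside, Go can do the forwarding automatically if the inner type is embedded (declared without a field name) rather than stored in a named field. The sketch below uses a local dirent type as a stand-in for godirwalk.Dirent so that it compiles on its own; dirent and playListItem are hypothetical names. Whether embedding suits the real struct is a judgment call.

```go
package main

import "fmt"

// dirent is a stand-in for godirwalk.Dirent, only so the sketch compiles alone.
type dirent struct {
	name  string
	isDir bool
}

func (de dirent) Name() string { return de.name }
func (de dirent) IsDir() bool  { return de.isDir }

// playListItem embeds dirent, so Name and IsDir are promoted to playListItem
// and no hand-written forwarding methods are needed.
type playListItem struct {
	dirent

	fullPath string
	size     int64
}

func main() {
	item := playListItem{
		dirent:   dirent{name: "01 Intro.flac"},
		fullPath: "/music/artist/album/01 Intro.flac",
	}
	fmt.Println(item.Name(), item.IsDir()) // prints "01 Intro.flac false"
}
```

Embedding does not help with unexported methods: a method such as reset() still cannot be called from outside the package that defines it.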

There are a few catches, though; for instance, Dirent has a reset() method to 'clean up' after itself; unfortunately, this method isn't exported and, as a consequence, cannot be called from a type that extends Dirent. That, in turn, could mean more work for the garbage collector or memory leaks (I've not tested this thoroughly).

Now here is my issue: consider the artist/album/song directory layout mentioned before. Assume, for a moment, that there is an image inside the album directory which corresponds to the album's cover. On the template I wish to display the cover together with the songs for that album.

Now this could be trivially done by going through the directory twice; first to see if there is any image present and save it; and secondly to extract each filename and assign the cover's path to it in the associated PlayListItem struct for that file.

The problem, in this case, is that we don't have a callback for the start of a new directory traversal; we only have one for the end (that is, we have PostChildrenCallback but no equivalent PreChildrenCallback). The best we can currently do is, inside the callback, check whether the current entry is a directory and, if it is, first check whether there is an image file inside it, and then cache it.

This has two problems.

First, it seems to be a waste of resources/time, since you'll be traversing all files in the album directory anyway, looking for audio files only, and will ultimately hit the same image file as before, this time discarding it.

Secondly, what if there is more than one image file inside it? You can only use the first image you find, whatever method you use, because once you find a second image during the directory traversal, you can only update the entries for the files from that point onwards, not the ones that came before the second image was found.

One might argue that getting two, three, or even more image files in what is supposed to be an audio folder is unlikely, but that's not true. In my personal case, my audio library is shared among several different devices and applications, from iTunes to Plex and even Kodi. While all of those, generally speaking, will not touch the audio files themselves (unless that request is made explicitly), they are fond of adding lots of extra files inside, which get used in different scenarios. For instance, for a web page displaying the content of an album, there might be two images, one larger than the other, used in different places; or there might be an image with the same content (and size) but stored under a completely different name, used in a different context. Or, even more likely, each application/device has its own way of retrieving metadata (including album covers), and these are mutually incompatible and therefore get stored separately as different files.

During a directory traversal, therefore, once an image file is found, there might be some extra work to be done, such as checking for size, format, compression level, etc. and picking the 'best-so-far' alternative; if a new file is found further down the directory listing, then the same checks are performed and compared to the 'best-so-far' alternative; if the new file is 'better' (according to whatever criterion has been used), then it gets used instead.
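The 'best-so-far' selection can be sketched as below. The criterion used here (larger pixel area wins) is purely an assumption for illustration, as are the imageCandidate, better and pickBest names; obtaining the dimensions (for example by decoding the image header) is left out.

```go
package main

import "fmt"

// imageCandidate is a hypothetical record for one image found in a directory.
type imageCandidate struct {
	path          string
	width, height int
}

// better reports whether c should replace best.
// Assumed criterion: the larger pixel area wins; ties keep the earlier image.
func better(c, best imageCandidate) bool {
	return c.width*c.height > best.width*best.height
}

// pickBest keeps a running best-so-far while scanning the candidates in order.
// The boolean result is false when there were no candidates at all.
func pickBest(cands []imageCandidate) (imageCandidate, bool) {
	var best imageCandidate
	found := false
	for _, c := range cands {
		if !found || better(c, best) {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	best, ok := pickBest([]imageCandidate{
		{"Folder.jpg", 200, 200},
		{"AlbumArtLarge.jpg", 600, 600},
		{"thumb.png", 64, 64},
	})
	fmt.Println(best.path, ok)
}
```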

The only solution I have for this scenario is to store all found files (audio and images) on a temporary stack, and read it back when PostChildrenCallback is called. At that point, whatever heuristics are applied to the images, one is selected, and all the audio files on that stack get assigned the selected image. The stack is then emptied, PostChildrenCallback returns, and the stack starts to fill again from the next directory to be traversed, according to godirwalk's choice.

(Note that for this solution, BFS or DFS is irrelevant, so long as PostChildrenCallback is called at the end of each directory.)

This works, especially for my own use case, since it's unlikely that an album has a million tracks. Most will have a dozen or so. A few (originally multiple-CD bundles ripped to hard disk) might have a few dozen. I have recently read a review of a "complete collection" of CDs for a 20th-century composer, which included 60 CDs in total; still, that would be less than a thousand entries in a single directory. As such, the stack solution would work, and there would be no reason to believe that memory might get exhausted in this case.

It's also ugly, but, alas, that's the current choice I've got.

But suppose you have a completely different scenario, say a tool that goes through all the files in your Developer folder (whatever it might be called on your system) and saves metadata based on what files it finds there. Imagine that you want to 'group' together header files in C with the corresponding code; these might even be found in separate directories, which means that you have no other choice but to walk through the entire tree at least twice: once to find all the .c files, and once to gather the corresponding .h files, in order to produce metadata that links one to the other. And some applications have very deep directories, crammed full of source code, especially if there is a developer using Emacs, meaning a plethora of files ending with ~, which have to be read (even if they get discarded afterwards); in such a scenario, the stack may grow beyond a reasonable amount of available system memory. Or the code to deal with this becomes huge, too complex, and nastily convoluted.

How could this tree traversal be done only once?
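One possible direction, offered only as a sketch: keep an index keyed by the file's stem (base name without extension) and fill in whichever half arrives, so the order in which .c and .h files are met no longer matters. The pair and pairIndex names are hypothetical, and the sketch assumes stems are unique across the tree. Memory still grows, but with the number of .c/.h files rather than with every entry visited.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pair links a C source file to its header; either side may still be empty.
type pair struct {
	source string
	header string
}

// pairIndex maps a stem such as "foo" to the pair built so far.
type pairIndex map[string]*pair

// add records path if it is a .c or .h file; everything else is ignored,
// including Emacs backups such as "foo.c~", whose extension is ".c~".
func (idx pairIndex) add(path string) {
	ext := filepath.Ext(path)
	if ext != ".c" && ext != ".h" {
		return
	}
	stem := strings.TrimSuffix(filepath.Base(path), ext)
	p := idx[stem]
	if p == nil {
		p = &pair{}
		idx[stem] = p
	}
	if ext == ".c" {
		p.source = path
	} else {
		p.header = path
	}
}

func main() {
	idx := pairIndex{}
	// In a real tool these paths would arrive one by one from the walk callback.
	for _, p := range []string{"include/foo.h", "src/foo.c", "src/foo.c~", "src/bar.c"} {
		idx.add(p)
	}
	for stem, p := range idx {
		fmt.Printf("%s: source=%q header=%q\n", stem, p.source, p.header)
	}
}
```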

P. S. For the sake of completeness: I personally gave up on coming up with a clever data structure to deal with my own use case; I simply noticed that, in general, almost all directories have at least one image file, named Folder.jpg, which is usually sized at 200x200, enough for my own purposes! As such, when first entering a directory, I just check for the existence of Folder.jpg using os.Stat(), store its path (if present)... and that's pretty much it (I don't need anything else). Then, for all audio files found in that directory, I just assign the path of the current Folder.jpg to each file, skipping over the rest of the non-audio files. When PostChildrenCallback is called, I just clear the stored path; the next time godirwalk enters a directory, it will do the same os.Stat() check for Folder.jpg again, and the path to that will be assigned to the files in this new album directory, and so forth. If no "Folder.jpg" is present, but rather a differently-named image file for the cover... well, tough luck, I simply ignore that.

I suppose that I could do a full search for all image files (as opposed to just checking for Folder.jpg with os.Stat()), pick one according to heuristics, store its path, and apply it to all subsequent audio files in that specific album's directory. It's just that this requires a "search within a search", therefore making the code more complex (and more I/O-heavy). As such, I stick with just the existence or absence of a well-known filename. It's a very limited way of implementing the overall concept, with the trade-off that it's very simple to code (and to test!).
