Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strategizing about opendir/readdir #4440

Closed
dubiousjim opened this issue Mar 20, 2020 · 5 comments
Closed

Strategizing about opendir/readdir #4440

dubiousjim opened this issue Mar 20, 2020 · 5 comments
Labels

Comments

@dubiousjim
Copy link
Contributor

I'd like to implement a Dir class, and functions opendir/opendirSync that construct it. Most of this looks to be easy to implement, but there are some design choices it'd be helpful to get feedback on first. (Rather than implementing things one way, then needing to go back and redo it.)

As a reminder, here are some things currently in master:

/** A FileInfo describes a file and is returned by `stat`, `lstat`,
 * `statSync`, `lstatSync`. A list of FileInfo is returned by `readdir`,
 * `readdirSync`. */
export interface FileInfo {
  size: number;
  modified: number | null;
  accessed: number | null;
  created: number | null;
  name: string | null; // not populated by stat/lstat, only by readdir
  isFile(): boolean;
  isDirectory(): boolean;
  isSymlink(): boolean;
  // rest are Linux/Mac only
  dev: number | null;
  ino: number | null;
  mode: number | null;
  nlink: number | null;
  uid: number | null;
  gid: number | null;
  rdev: number | null;
  blksize: number | null;
  blocks: number | null;
}

export class File
  implements
    Reader,
    SyncReader,
    Writer,
    SyncWriter,
    Seeker,
    SyncSeeker,
    Closer {
  readonly rid: number;
  constructor(rid: number);
  write(p: Uint8Array): Promise<number>;
  writeSync(p: Uint8Array): number;
  read(p: Uint8Array): Promise<number | EOF>;
  readSync(p: Uint8Array): number | EOF;
  seek(offset: number, whence: SeekMode): Promise<number>;
  seekSync(offset: number, whence: SeekMode): number;
  close(): void;
}

PRs I have queued in the #4017 series will propose these changes:

export interface FileInfo {
  ...
  type: FileType | null;
  // rest are Linux/Mac only
  ...
  mode: number | null; // masked to 0o7777
}

export enum FileType {
  TYPE_UNKNOWN = 14 << 12,
  TYPE_REGULAR = 8 << 12,
  TYPE_DIRECTORY = 4 << 12,
  TYPE_SYMLINK = 10 << 12,
  TYPE_SOCKET = 12 << 12,
  TYPE_FIFO = 1 << 12,
  TYPE_CHARDEV = 2 << 12,
  TYPE_BLKDEV = 6 << 12
}

export class File {
  ...
  sync(): void;
  datasync(): void;
  truncate(len?: number): Promise<void>;
  truncateSync(len?: number): void;
  chmod(mode: number): Promise<void>;
  chmodSync(mode: number): void;
  utime(atime: number | Date, mtime: number | Date): Promise<void>;
  utimeSync(atime: number | Date, mtime: number | Date): void;
  // chown(uid: number, gid: number): Promise<void>;
  // chownSync(uid: number, gid: number): void;
  stat(): Promise<FileInfo>;
  statSync(): FileInfo;
}

The chown functions are commented because I haven't implemented them yet, and am not sure whether the machinery is there to let me do so.

For opendir and friends, I was thinking something like this:

export class Dir
  implements
    Closer {
  readonly rid: number;
  constructor(rid: number);
  close(): void;
  chmod(mode: number): Promise<void>;
  chmodSync(mode: number): void;
  utime(atime: number | Date, mtime: number | Date): Promise<void>;
  utimeSync(atime: number | Date, mtime: number | Date): void;
  // chown(uid: number, gid: number): Promise<void>;
  // chownSync(uid: number, gid: number): void;
  stat(): Promise<FileInfo>;
  statSync(): FileInfo;
  chdir(): void;
  readdir(): Promise<FileInfo[]>;
  readdirSync(): FileInfo[];
}

function opendir(path: string): Promise<Dir>;
function opendirSync(path: string): Dir;

The factory functions (opendir/opendirSync) would throw an error if path didn't refer to a directory. (I'd suggest at the same time open and openSync should be made to throw an error if path did refer to a directory. Though perhaps not if it refers to a other things that are neither directories nor files.) We could use the existing op_open to implement all of these.

If that all sounds reasonable, the main strategic questions remaining have to do with readdir and FileInfo.

  1. Ideally, there'd be a version of std::fs::read_dir in the Rust libs that took a RawFd as argument rather than a Path. And also a tokio-ized version of the same. But sadly there is neither. I tried poking around in libstd/sys/unix/fs.rs in the Rust sources to see how hard this would be to implement. But to get it to compile, I had to keep pulling in duplicates of more and more of the internal machinery of that file. In the end this didn't seem a promising way to go. Plus it would involve lots of calling into the libc crate with unsafe code blocks. (As the Rust std libs must.)

  2. We could work on getting Rust to include such functionality in their std libs, and even offer them an implementation, but I suspect even if they're open to merging that in the near future it would take a while before it's exposed in a version of Rust we're using.

  3. We're already using the nix package for some things on the Unix side (Linux/Mac). And they have an interface nix::dir::Dir, which can be constructed using either paths or fds. That looks good for our purposes. (Downsides: it's Unix-only, and also blocking, so we'd have to continue using blocking_json from cli/ops/dispatch_json.rs here, rather than moving to async functions from tokio::fs, as proposed in refactor: ops/fs.rs should use tokio::fs where applicable #4188.) I think this is the most promising way to go.

  4. Some issues about the nix::dir::Dir implementation though. The iterator over directory entries that it provides is sparser than the one provided by std::fs::read_dir (and its counterpart in tokio::fs). It only provides something like this for each entry:

    file_name(): string;
    ino(): number
    file_type(): FileTypeEnum | null;
    

    The file_type information is only populated by nix::dir::Dir if it's available without making an extra lstat system call. (Some Linux filesystems would require that extra call. I don't know which ones.) So what's naturally available to us if we go this way would only be a sparser version of our existing FileInfo. At the same time, it would be more efficient than the std::fs::read_dir version, which goes ahead and does the extra lstat call to fill in the additional fields, which won't always be needed. Of course, we could go ahead and do the same with the results of nix's directory iterator (using nix::dir::Dir::openat(dirfd, filename, ...)). But that seems wasteful. Why not instead just provide the minimal information, and give the caller the means to retrieve more info if they want it. (Using calls like fstatat, openat etc that are also present in nix. I'm assuming that from a Dir class, our implementation will only have a rid saved from which it can get an open dirfd. It won't have --- and there need not even anymore exist --- a valid path that could be used to reach that directory.)

  5. If the strategy just sketched seems attractive for opendir(path).readdir(), perhaps we should also consider it for readdir(path). That is, have this not provide a full FileInfo like one gets from stat/lstat, but a sparser thing like described above.

    That is in fact what other languages/hosts do in their std libs.

    • Node's fs.DirEnt has the fields: name: string|Buffer, isFile(): boolean, isDirectory(): boolean, and so on.
    • Python's os.DirEntry has the fields name: bytes|str, path: bytes|str, inode(), is_dir(follow_symlinks=False), is_file(follow_symlinks=False), is_symlink(), and stat(follow_symlinks=False)
    • Go's os.Readdir returns a list of FileInfo structures, same as is returned by its os.Lstat and os.Stat. But this is somewhat sparser than ours; and they also have the function os.Readdirnames that returns just a list of names.
  6. I'm ignoring in all of this the issues raised in [Discussion] Decide on reading directory API for 1.0 release #4277 and Make Deno.readDir actually streaming #4218, which have to do with whether readdir provides its results all at once or as a stream. The latter seems ultimately desirable, but I don't know how to make the Deno Rust guts save the state needed to implement an iterator. In any case, I think those issues are orthogonal to these.

@dubiousjim
Copy link
Contributor Author

cc @ry, @cknight

@dubiousjim
Copy link
Contributor Author

Another issue that I came across while looking into this was #1770. The existing readdir(path) implementation, on the Rust side, already would be able to provide basedir and/or fullpath and/or name without doing any extra system calls. The stat and lstat implementations would be able to provide fullpath without any extra calls at all, and basedir and/or name with only a single split of the Path (which I think is cheap, since the Rust std lib already parses the path, and anyway, it doesn't involve a system call). The opendir(path).readdir() envisaged above couldn't provide a basedir or fullpath, since it would only have an rid/open dirfd for the relevant directory. But it would provide each directory entry's name.

So an API that looks nice to me is to have stat/lstat(path) return a FileInfo that splits path into basedir and name strings. And have readdir(path) return a sparser structure that also contains basedir and name strings. But for opendir(path).readdir(), the basedir is null and only the name string is populated.

Are those issues open to revive discussion on?

@dubiousjim
Copy link
Contributor Author

I wrote above:

At the same time, it would be more efficient than the std::fs::read_dir version, which goes ahead and does the extra lstat call to fill in the additional fields, which won't always be needed. Of course, we could go ahead and do the same with the results of nix's directory iterator (using nix::dir::Dir::openat(dirfd, filename, ...)). But that seems wasteful.

I should have said fstatat there rather than openat, but more importantly, I was misremembering some things about the division of labor between Rust's stdlib and our cli/ops/fs.rs. Rust's std::fs::read_dir doesn't in fact automatically do the lstat call on its directory entries. Instead, that's something we ask it to do, by calling metadata() on the entry (also if we call file_type() on it).

In any case, even though I misdescribed things a bit, I'd still like to raise and discuss the question of whether we should be doing that extra syscall for every directory entry returned. (This is potentially even more wasteful since we're not yet returning a stream of directory entries.)

@cknight
Copy link
Contributor

cknight commented Mar 21, 2020

Thanks for the detailed and thorough write up @dubiousjim, this all seems reasonable to me. I'd prefer the sparse info with stat capabilities after if someone needs that info.

@stale
Copy link

stale bot commented Jan 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 6, 2021
@stale stale bot closed this as completed Jan 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants