Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] GH-73991: Add os.copy() and friends #119079

Closed
wants to merge 4 commits into from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented May 15, 2024

Move these low-level file copy functions from shutil to os:

  • copyfileobj()
  • copyfile()
  • copymode()
  • copystat()
  • copy2() (as copy(), minus support for copying into a destination directory)

The names remain in shutil for backwards compatibility.


📚 Documentation preview 📚: https://cpython-previews--119079.org.readthedocs.build/

Move these low-level file copy functions from `shutil` to `os`:

- `copyfileobj()`
- `copyfile()`
- `copymode()`
- `copystat()`

The names remain in `shutil` for backwards compatibility. The higher level
functions (which resemble the `cp` shell utility) also remain in `shutil`.
@zooba
Copy link
Member

zooba commented May 15, 2024

I would love if we could figure out how to use the Windows copy2 implementation in one of the os.copy* functions. I know that will change the default alias from shutil, but it's a bit sad for "using the fastest way possible" to be false.

IIRC, the reason we didn't just change the normal shutil.copyfile to use the native API was because the native API would preserve metadata that the existing function loses (perhaps @eryksun remembers?). IMHO, preserving it is the better thing to do, and this is a good time to change the behaviour slightly while preserving the old one in shutil.

@barneygale
Copy link
Contributor Author

I'm fine with that. Suggest we move shutil.copy2() to os.copy()? Folks who want to copy only the data and mode can call os.copyfile() then os.copymode().

@barneygale
Copy link
Contributor Author

barneygale commented May 15, 2024

Having seen the (rough) patch, do you think it's worth it? To me it seems like an awful lot of churn just to kick some low-level functions out of shutil, and for reasons that are seem more about module purity than practicality.

@barneygale barneygale changed the title [WIP] GH-73991: Add os.copyfile() and friends [WIP] GH-73991: Add os.copy() and friends May 15, 2024
@eryksun
Copy link
Contributor

eryksun commented May 16, 2024

IIRC, the reason we didn't just change the normal shutil.copyfile to use the native API was because the native API would preserve metadata that the existing function loses (perhaps @eryksun remembers?). IMHO, preserving it is the better thing to do, and this is a good time to change the behaviour slightly while preserving the old one in shutil.

Yes, only shutil.copy2() is a good fit for using the native OS copy routine. It's always been documented that shutil.copyfile() doesn't copy metadata. shutil.copy() also copies the permission bits, which on Windows one could stretch to argue that it should copy the file's security descriptor -- including the owner/group (requires SeRestorePrivilege in general), SACL/DACL (including the owner/group and the SACL security attributes), but shutil has never implemented copying ACLs on any platform.

The new support for COPY_FILE_OPEN_AND_COPY_REPARSE_POINT in Windows 10+ would allow os.copy() to generalize follow_symlinks support for any name-surrogate reparse point (typically symlinks and junctions), which would bring it inline with the behavior of os.stat(). Of course, os.lstat(src) would have to be checked first, or a new nt._path_isnamesurrogate(src) check.

@thesamesam
Copy link
Contributor

With regard to fast copies, there's substantial discussion in #81338 and #93152 worthy of consideration.

@serhiy-storchaka
Copy link
Member

I am strictly against these changes. os for low-level functions, thin wrappers around OS calls, shutil if for high-level functions that correspond OS commands. There are few exceptions, like os.walk and os.makedirs, I believe they should belong to shutil, but it is late to change this.

@barneygale
Copy link
Contributor Author

os for low-level functions, thin wrappers around OS calls

os.copy() is a wrapper around OS calls: CopyFile2() on Windows, fcopyfile() on Mac, sendfile() on Linux. It's not very thin because it has to smooth out the differences between the OSs, but it doesn't seem very fat either - it's on par with most other functionality in os.

@barneygale
Copy link
Contributor Author

@eryksun
Copy link
Contributor

eryksun commented May 21, 2024

@zooba, I just noticed that the base system function BaseCopyStream() was updated to use the new system call NtCopyFileChunk(). This system call implements copying part of a file in the kernel, without needing separate read and write system calls. BaseCopyStream() is used by the CopyFile*() API functions.

This makes it more tempting to implement copyfile() with CopyFile2(). The problem is that copyfile() shouldn't copy timestamps, file attributes, extended attributes, or named data streams. The flag COPY_FILE_SKIP_ALTERNATE_STREAMS skips copying name data streams. Copying the last-write timestamp and file attributes is unavoidable, but it could be worked around as follows:

  • Save the original file attributes from lstat() or stat() if the destination file already exists, or else use FILE_ATTRIBUTE_NORMAL.
  • Restore or reset the file attributes via SetFileAttributesW().
  • Set the last-access and last-write timestamps to the current time via os.utime(). This also updates the file's change timestamp, if it has one (e.g. NTFS, but not FAT32).
  • Or combine the attribute and timestamp update in a single function that calls CreateFileW() and SetFileInformationByHandle(): FileBasicInfo.

The presence of extended attributes could be ignored, I suppose, since the Windows API provides no way to query or explicitly set them. In practice, probably the most common extended attributes are in the reserved "$KERNEL.*" namespace. Those can't be copied or set from user mode. Other extended attributes will be copied, such as the "$CI.CATALOGHINT" extended attribute that Microsoft sets on distributed files.


As discussed in past issues, copytree() and move() should support copying directory metadata and data streams on Windows. copytree() would call os.makedirs() to create the parent directories of dst. The destination directory would be created by copying from src using WinAPI CreateDirectoryExW() or, if that fails because dst exists, fall back on CopyFile2().

CreateDirectoryExW() supports copying the following from the source directory: the case-sensitive flag (crucial to reliably copying a case-sensitive directory), file attributes, extended attributes, named data streams, and a symlink or junction reparse point. It doesn't allow the target directory to already exist.

CopyFile2() supports directories if passed the flag COPY_FILE_DIRECTORY. It copies the last-write timestamp, file attributes, extended attributes, named data streams, and a junction reparse point. Unlike CreateDirectoryExW(), it doesn't copy the case-sensitive flag. Copying a symlink requires the flag COPY_FILE_COPY_SYMLINK. If the destination directory exists, it must be empty in order to copy a symlink or junction.

@zooba
Copy link
Member

zooba commented May 21, 2024

I just noticed that the base system function BaseCopyStream() was updated to use the new system call NtCopyFileChunk(). This system call implements copying part of a file in the kernel, without needing separate read and write system calls. BaseCopyStream() is used by the CopyFile*() API functions.

Copy is getting a decent amount of attention in Windows right now, as we identified it as a significant potential perf improvement when the kernel knows that a plain copy is happening (e.g. AV can ignore the writes if it's already scanned the source; ReFS can use copy-on-write by default, etc.). So it's worth using the native functions wherever we can.

But still, I really dislike the semantics of most of our copy functions. They seem to have been derived from the fact that our implementation used open/write rather than anything native, and I suspect if we'd gone native in the first place that we'd have defined it as "behaves like a native OS copy".

So IMHO, any new copy function should be "as close to standard OS copy semantics as possible when feasible" (whether that's cp ..., CopyFile(), etc.), and the old ones can remain slow for compatibility (and perhaps be marked as deprecated, even if we don't need to remove them).

Anyone can use Python to open/write copy a file, but only we can make it use the OS functions (ignoring ctypes). We should do the thing we can do, and let people who want generic behaviour write it themselves.

@barneygale
Copy link
Contributor Author

FWIW I agree completely! Picking another language/runtime at random, rust has std::fs::copy which seems to be a thin wrapper around platform-specific copy functions. It guarantees that permission bits are copied, but doesn't make any cross-platform guarantees about other metadata. That frees it from the constraints that plague the copy functions in shutil, i.e. that they're overspecified to the point that making obvious-seeming changes is hard.

What should we do to progress this @zooba? Do you think I should create a forum thread, or move straight to a issue titled "Add os.copy"?

@zooba
Copy link
Member

zooba commented May 28, 2024

What should we do to progress this @zooba? Do you think I should create a forum thread, or move straight to a issue titled "Add os.copy"?

Unfortunately, my first instinct is that a "do what the platform does" function belongs in... shutil!

But perhaps what we could do is make it private in os and expose it from pathlib? (shutil.copy2 could use os._copy as well) Proliferation of APIs doesn't help existing users, and barely helps new users if the guidance isn't clear. But the pathlib API is clearer ("if you have a pathlib object, use its copy()") and also solves an extensibility limitation. The concern is about using a high-level module to implement a low-level function.

@barneygale
Copy link
Contributor Author

Thanks very much Steve. I've played around with adding it privately to os and in pathlib. I'll ping you on another PR :)

@barneygale
Copy link
Contributor Author

Closing in favour of #119058.

@barneygale barneygale closed this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants