-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace routerify in file transfer server with axum #2461
Replace routerify in file transfer server with axum #2461
Conversation
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
- Add streaming support to download endpoint - Use FileTransferPaths to avoid repetition - Remove redundant code from upload handler - Improve error messages for upload failures - Modify remaining tests to use API instead of internal furnctions Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
use super::error::FileTransferRequestError as Error; | ||
use super::request_files::FileTransferPaths; | ||
|
||
// TODO is this not just a repeat of code in tedge_config? | ||
const HTTP_FILE_TRANSFER_PORT: u16 = 8000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably outside of the scope of these changes, but it feels like we shouldn't duplicate what tedge_config
knows here. If someone wants the default values, they could surely achieve that with a default config?
Related to this, it feels like the API for HttpConfig
is a bit flawed. We always construct default values and then override them with the builder API, which seems pointless (and leads to the multiple sources of truth issues).
|
||
let relative_path = Utf8PathBuf::from(relative_path); | ||
Some((relative_path, file_name.into())) | ||
fn separate_path_and_file_name(input: &Utf8Path) -> Option<(&Utf8Path, &str)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics of this function are completely different to what was originally written (and very directly tested):
thin-edge.io/crates/core/tedge_agent/src/file_transfer_server/http_rest.rs
Lines 250 to 261 in ccff7ee
#[test_case("/tedge/some/dir/file", "/tedge/some/dir", "file")] | |
#[test_case("/tedge/some/dir/", "/tedge/some/dir", "")] | |
fn test_separate_path_and_file_name( | |
input: &str, | |
expected_path: &str, | |
expected_file_name: &str, | |
) { | |
let (actual_path, actual_file_name) = | |
separate_path_and_file_name(Utf8Path::new(input)).unwrap(); | |
assert_eq!(actual_path, expected_path); | |
assert_eq!(actual_file_name, expected_file_name); | |
} |
However these tests were not testing anything the user could achieve, as the API did path cleanup (which removed trailing slashes) and (I think) the code we call afterwards would have returned an error if we had been able to call this with a trailing slash (as we would have created the entire path as directories and then attempted to create a file with an empty name inside that).
Codecov Report
Additional details and impacted files
|
|
||
if let Some((relative_path, file_name)) = separate_path_and_file_name(&full_path) { | ||
let root_path = file_transfer.data_dir.clone(); | ||
let directories_path = root_path.join(relative_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This relative_path
was very definitely an absolute path, so this code was entirely superfluous, as it just returned relative_path
in any case.
DeleteIoError { path, .. } => { | ||
tracing::error!("{self}"); | ||
( | ||
StatusCode::FORBIDDEN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these really want to be 403 Forbidden
(it's what the code did previously)? It feels like an encapsulation failure to recognise for certain paths that we can't delete/upload to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO "could not delete path due to IO error" could happen due e.g. path not existing, in which case we need to return 404, or due to server not having permissions to do filesystem delete, in which case we should return 500, so if I'm not wrong, then even using a single return code for all IO errors is kinda bad. As for 403 Forbidden
, I also feel that it should be used when the client that made the request doesn't have permissions to do something, which doesn't seem to be the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that was roughly what I was thinking. We currently handle the not found case by returning 202 Accepted
, like the successful deletion case (which I assume is to make DELETE
requests idempotent). Returning 404 Not Found
may also make sense, or if the file previously existed, 410 Gone
, although a success response tells the user the file definitely no longer exists, and the request definitely succeeded.
I think if we've hit the case handled by this code, it's some error accessing the file system, which in my mind should definitely be a 500 Internal Server Error
. This code may also be caused by trying to delete a directory, which I talk about in a different comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
403 doesn't look bad, as the spec mentions the following: The access is tied to the application logic, such as insufficient rights to a resource.
which is the case here, right? I vaguely remember a similar discussion happening when the original author introduced this case in the first place and we settled with 403 for the same reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what application logic is relevant here. With the current API at least, we have no control over permissions. If there's a permissions error on the file system, that implies some other process is interfering with /var/tedge/file-transfer, which is not an application-level error, it's an error from the environment. The question here mostly boils down to whether we want to assume it's some file that should have never been there (in which case encapsulating the error entirely and responding with 404 Not Found is appropriate, or whether we assume the file was uploaded correctly, but something as changed the internal permissions so we can't access it, in which case a 500 Internal Server Error is appropriate.
Thinking more broadly, a useful litmus test to apply here would be "if one user receives a 403 for a resource, there must exist some other user that is able to access the resource in question". If not, and the 403 response is global, then I don't believe this can be considered an authorization issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@didier-wenzek @reubenmiller do you have opinions on this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting a resource that doesn't exist is definitely not an error. Returning 202 in such a case makes the request idempotent.
Trying to delete a non-empty directory, should be considered as a user error as we don't want to support recursive deletes. So 400 makes sense. I would do the same for empty-directories.
If the error is due to missing access, I would say this is a 500 as the file-transfer service should own all the files under file transfer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
403 doesn't look bad, as the spec mentions the following:
The access is tied to the application logic, such as insufficient rights to a resource.
From RFC 9110:
If we get an IO error due to missing file permissions, I would say that the server tried to execute the request, but failed, due to a misconfiguration (e.g. not having write access to file-transfer
directory), in which case 500 would be fitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to delete a non-empty directory, should be considered as a user error as we don't want to support recursive deletes. So 400 makes sense. I would do the same for empty-directories.
I think if it is a directory, 404 makes more sense as it the error is in the URL and the issue is that the resource in question cannot be interacted with (and essentially doesn't exist), not that the request is inherently invalid. I'll make sure the error message clarifies that the file we attempted to delete is a directory.
Other than that, it seems like there is now broad consensus that 500 is the appropriate response for an IO error, so I'll sort that out too.
// One must check that once normalized (i.e. any `..` removed) | ||
// the path is still under the file transfer dir | ||
let clean_path = clean_utf8_path(&full_path); | ||
if clean_path.starts_with(&self.data_dir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was only checking that the file was in e.g. /var/tedge
since all the manipulation in this function was based around file-transfer
being part of the URL path (i.e. in the variable path
in this function), so it was possible for a user to read from/write to /var/tedge
, not just /var/tedge/file-transfer
. I'm assuming this wasn't intentional and I've fixed it by storing file_transfer_dir
inside HttpConfig
instead.
Signed-off-by: James Rhodes <[email protected]>
async fn check_server_does_not_panic_when_port_is_in_use() -> anyhow::Result<()> { | ||
let ttd = TempTedgeDir::new(); | ||
|
||
let http_config = HttpConfig::default() | ||
.with_data_dir(ttd.utf8_path_buf().into()) | ||
.with_port(3746); | ||
let config_clone = http_config.clone(); | ||
|
||
// Spawn HTTP file transfer server | ||
// handle_one uses port 3000. | ||
let builder_one = FileTransferServerBuilder::new(http_config); | ||
let handle_one = tokio::spawn(async move { builder_one.build().run().await }); | ||
|
||
// handle_two will not be able to bind to the same port. | ||
let builder_two = FileTransferServerBuilder::new(config_clone); | ||
let handle_two = tokio::spawn(async move { builder_two.build().run().await }); | ||
|
||
// although the code inside handle_two throws an error it does not panic. | ||
tokio::time::sleep(std::time::Duration::from_millis(100)).await; | ||
// Both servers will attempt to bind to port 3746. | ||
let server_one = FileTransferServerBuilder::new(http_config).build().run(); | ||
// This server will not be able to bind to the same port. | ||
let server_two = FileTransferServerBuilder::new(config_clone).build().run(); | ||
|
||
// to check for the error, we assert that handle_one is still running | ||
// while handle_two is finished. | ||
assert!(!handle_one.is_finished()); | ||
assert!(handle_two.is_finished()); | ||
tokio::select! { | ||
// Ensure we bind server_one first by polling that future first | ||
biased; | ||
res = server_one => bail!("expected second server to finish first, but first finished with: {res:?}"), | ||
res = server_two => ensure!(res.is_err(), "expected server two to fail with port binding error, but no error was found"), | ||
_ = tokio::time::sleep(Duration::from_secs(5)) => bail!("timed out waiting for actor to stop running"), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refactoring is incidental, I noticed the test had both a hard sleep (so it was comparatively slow) and not terribly deterministic (no guarantee server one would attempt to bind first, no guarantee the port binding would fail within 100ms). The timeout-based approach gives a lot more leniency to the servers to fail, and polling server_one
first ensures that it binds to a port first (assuming that is the first thing it attempts to do, which I think is a reasonable assumption).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In response to #2461 (comment), I've now changed this test again, and as a result the remaining complexity around which port binds first (i.e. what I'm relying on the biased
directive for) has disappeared entirely.
Robot Results
|
use FileTransferRequestError::*; | ||
match &self { | ||
FromIo(_) | PathRejection(_) => { | ||
tracing::error!("{self}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: This will log an error not only when a request is handled, but every time .into_response()
is called.
Axum has a tracing
feature that makes it output traces when handling events. Could it be used for logging errors like we try to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure axum's tracing
is quite what we want here. Here, I want to determine whether I'm hiding information from the user (which is when we have an internal server error), and if so, log that information. This avoids polluting the server logs with errors arising from clients misusing the API, but means we don't expose implementation details to the user of the API. If we rely on some sort of middleware for logging, either it would take an input of FileTransferRequestError
, in which case we might as well do everything here as we'd still need to log selectively, or we'd take the converted response, by which point we know the status code, so it's easy to log selectively, but we've thrown away the pertinent information (e.g. an IO error).
DeleteIoError { path, .. } => { | ||
tracing::error!("{self}"); | ||
( | ||
StatusCode::FORBIDDEN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
403 doesn't look bad, as the spec mentions the following: The access is tied to the application logic, such as insufficient rights to a resource.
which is the case here, right? I vaguely remember a similar discussion happening when the original author introduced this case in the first place and we settled with 403 for the same reason.
crates/core/tedge_agent/src/file_transfer_server/request_files.rs
Outdated
Show resolved
Hide resolved
Signed-off-by: James Rhodes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to approve once the DeleteIoError
response is changed from FORBIDDEN
to INTERNAL_SERVER_ERROR
as concluded.
a8f3179
to
0056f9c
Compare
Done in 1ca42b2 |
Signed-off-by: James Rhodes <[email protected]>
0056f9c
to
1ca42b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
The next step will be to enable HTTPS support to this file transfer service, introducing http.cert_path
, http.key_path
and http.ca_path
and considering to reuse http.client.auth.cert_file
and http.client.auth.cert_file
to authenticate tedge-agent
when acting as a client (auth proxy) as well as a server (file transfer).
Co-authored-by: Albin Suresh <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got answers to my question and most pressing feedback was addressed, so approving.
#[error("Cannot upload: {path:?} is a directory, not a file")] | ||
CannotUploadDirectory { path: RequestPath }, | ||
|
||
#[error("Request to delete {path:?} failed: {source}")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: We need to be careful with including source error message in parent error message.
I think the recommendation is to not include source error messages like this because proper error reporting/diagnostics will themselves call .source()
to print messages for the entire error chain. However, for code that is not aware of this, and doesn't depend on any error reporting crates, it will most likely just grab outer error message. It depends where the error is used. For the FTS, I think the errors appear only in HTTP responses and logs, but 1) I'm not sure how they are exactly reported by axum 2) it's all internal so it doesn't matter that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware, but knew exactly where these error messages were going (which is only when we call .to_string()
), and I didn't want to have to convert them to something like anyhow::Error
just to get a backtrace. You're right though, a lot of thin-edge errors are in library code where they're eventually going to become anyhow::Error
and in that case it's really annoying (reqwest
is the example of a public crate that does this that springs to mind).
Proposed changes
This is a rewrite of the file transfer server to use axum instead of routerify, which forms the groundwork for adding HTTPS and certificate authentication support once #2430 is merged.
The basic migration to axum was pretty trivial (see 4051213), but once this was complete, I refactored the code considerably to be more idiomatic axum code. The highlights of these changes are:
StatusCode
where all we do is set the HTTP status code, an error enum for handling any unsuccessful response)/tedge/file-transfer/some/file
, we sayFile not found: "some/file"
rather thanFile not found: "/var/tedge/file-transfer/some/file"
). It also fixes what appeared to be a (very limited in scope) path traversal attack, where a user could access the data directory directly and not just thefile-transfer
subdirectory of thisTypes of changes
Intended to be an improvement, but some minor bugs were fixed along the way, so checking both of those.
Paste Link to the issue
Checklist
cargo fmt
as mentioned in CODING_GUIDELINEScargo clippy
as mentioned in CODING_GUIDELINESFurther comments