Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace routerify in file transfer server with axum #2461

Merged
merged 15 commits into from
Nov 20, 2023

Conversation

jarhodes314
Copy link
Contributor

@jarhodes314 jarhodes314 commented Nov 16, 2023

Proposed changes

This is a rewrite of the file transfer server to use axum instead of routerify, which forms the groundwork for adding HTTPS and certificate authentication support once #2430 is merged.

The basic migration to axum was pretty trivial (see 4051213), but once this was complete, I refactored the code considerably to be more idiomatic axum code. The highlights of these changes are:

  • Meaningful handler return values: the http handlers no longer hand-build responses, they return normal Rust values (e.g. StatusCode where all we do is set the HTTP status code, an error enum for handling any unsuccessful response)
  • Local HTTP testing: the tests no longer listen on a port to test HTTP functionality, the services are now called directly
  • Extractors: boilerplate called by the handlers to extract path information from the URI and convert that into an absolute path to read from/write to the filesystem is now called before a handler runs and passed in as an argument to the handler
  • General security improvements: the API no longer exposes absolute paths to files to the user (e.g. when we respond with 404 not found for /tedge/file-transfer/some/file, we say File not found: "some/file" rather than File not found: "/var/tedge/file-transfer/some/file"). It also fixes what appeared to be a (very limited in scope) path traversal attack, where a user could access the data directory directly and not just the file-transfer subdirectory of this
  • Removing race conditions: the previous code checked that files existed before reading them, and handled errors based on that, but there's a file could be deleted between checking it exists and attempting to read it, which would result in a spurious HTTP 500 error which should be a 404.

Types of changes

Intended to be an improvement, but some minor bugs were fixed along the way, so checking both of those.

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
- Add streaming support to download endpoint
- Use FileTransferPaths to avoid repetition
- Remove redundant code from upload handler
- Improve error messages for upload failures
- Modify remaining tests to use API instead of internal furnctions

Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
use super::error::FileTransferRequestError as Error;
use super::request_files::FileTransferPaths;

// TODO is this not just a repeat of code in tedge_config?
const HTTP_FILE_TRANSFER_PORT: u16 = 8000;
Copy link
Contributor Author

@jarhodes314 jarhodes314 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably outside of the scope of these changes, but it feels like we shouldn't duplicate what tedge_config knows here. If someone wants the default values, they could surely achieve that with a default config?

Related to this, it feels like the API for HttpConfig is a bit flawed. We always construct default values and then override them with the builder API, which seems pointless (and leads to the multiple sources of truth issues).


let relative_path = Utf8PathBuf::from(relative_path);
Some((relative_path, file_name.into()))
fn separate_path_and_file_name(input: &Utf8Path) -> Option<(&Utf8Path, &str)> {
Copy link
Contributor Author

@jarhodes314 jarhodes314 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of this function are completely different to what was originally written (and very directly tested):

#[test_case("/tedge/some/dir/file", "/tedge/some/dir", "file")]
#[test_case("/tedge/some/dir/", "/tedge/some/dir", "")]
fn test_separate_path_and_file_name(
input: &str,
expected_path: &str,
expected_file_name: &str,
) {
let (actual_path, actual_file_name) =
separate_path_and_file_name(Utf8Path::new(input)).unwrap();
assert_eq!(actual_path, expected_path);
assert_eq!(actual_file_name, expected_file_name);
}

However these tests were not testing anything the user could achieve, as the API did path cleanup (which removed trailing slashes) and (I think) the code we call afterwards would have returned an error if we had been able to call this with a trailing slash (as we would have created the entire path as directories and then attempted to create a file with an empty name inside that).

Copy link

codecov bot commented Nov 16, 2023

Codecov Report

Merging #2461 (fd5a907) into main (a030447) will increase coverage by 0.1%.
Report is 3 commits behind head on main.
The diff coverage is 87.7%.

Additional details and impacted files
Files Coverage Δ
crates/core/tedge_agent/src/agent.rs 0.0% <ø> (ø)
...core/tedge_agent/src/file_transfer_server/actor.rs 86.2% <88.8%> (+0.9%) ⬆️
...ge_agent/src/file_transfer_server/request_files.rs 88.2% <88.2%> (ø)
...core/tedge_agent/src/file_transfer_server/error.rs 47.6% <41.1%> (-27.4%) ⬇️
.../tedge_agent/src/file_transfer_server/http_rest.rs 89.6% <90.7%> (+24.2%) ⬆️

... and 8 files with indirect coverage changes


if let Some((relative_path, file_name)) = separate_path_and_file_name(&full_path) {
let root_path = file_transfer.data_dir.clone();
let directories_path = root_path.join(relative_path);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relative_path was very definitely an absolute path, so this code was entirely superfluous, as it just returned relative_path in any case.

DeleteIoError { path, .. } => {
tracing::error!("{self}");
(
StatusCode::FORBIDDEN,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these really want to be 403 Forbidden (it's what the code did previously)? It feels like an encapsulation failure to recognise for certain paths that we can't delete/upload to them.

Copy link
Contributor

@Bravo555 Bravo555 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO "could not delete path due to IO error" could happen due e.g. path not existing, in which case we need to return 404, or due to server not having permissions to do filesystem delete, in which case we should return 500, so if I'm not wrong, then even using a single return code for all IO errors is kinda bad. As for 403 Forbidden, I also feel that it should be used when the client that made the request doesn't have permissions to do something, which doesn't seem to be the case here.

Copy link
Contributor Author

@jarhodes314 jarhodes314 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was roughly what I was thinking. We currently handle the not found case by returning 202 Accepted, like the successful deletion case (which I assume is to make DELETE requests idempotent). Returning 404 Not Found may also make sense, or if the file previously existed, 410 Gone, although a success response tells the user the file definitely no longer exists, and the request definitely succeeded.

I think if we've hit the case handled by this code, it's some error accessing the file system, which in my mind should definitely be a 500 Internal Server Error. This code may also be caused by trying to delete a directory, which I talk about in a different comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

403 doesn't look bad, as the spec mentions the following: The access is tied to the application logic, such as insufficient rights to a resource. which is the case here, right? I vaguely remember a similar discussion happening when the original author introduced this case in the first place and we settled with 403 for the same reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what application logic is relevant here. With the current API at least, we have no control over permissions. If there's a permissions error on the file system, that implies some other process is interfering with /var/tedge/file-transfer, which is not an application-level error, it's an error from the environment. The question here mostly boils down to whether we want to assume it's some file that should have never been there (in which case encapsulating the error entirely and responding with 404 Not Found is appropriate, or whether we assume the file was uploaded correctly, but something as changed the internal permissions so we can't access it, in which case a 500 Internal Server Error is appropriate.

Thinking more broadly, a useful litmus test to apply here would be "if one user receives a 403 for a resource, there must exist some other user that is able to access the resource in question". If not, and the 403 response is global, then I don't believe this can be considered an authorization issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@didier-wenzek @reubenmiller do you have opinions on this point?

Copy link
Contributor

@didier-wenzek didier-wenzek Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting a resource that doesn't exist is definitely not an error. Returning 202 in such a case makes the request idempotent.

Trying to delete a non-empty directory, should be considered as a user error as we don't want to support recursive deletes. So 400 makes sense. I would do the same for empty-directories.

If the error is due to missing access, I would say this is a 500 as the file-transfer service should own all the files under file transfer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

403 doesn't look bad, as the spec mentions the following: The access is tied to the application logic, such as insufficient rights to a resource.

From RFC 9110:

The 403 (Forbidden) status code indicates that the server understood the request but refuses to fulfill it.

If we get an IO error due to missing file permissions, I would say that the server tried to execute the request, but failed, due to a misconfiguration (e.g. not having write access to file-transfer directory), in which case 500 would be fitting.

Copy link
Contributor Author

@jarhodes314 jarhodes314 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to delete a non-empty directory, should be considered as a user error as we don't want to support recursive deletes. So 400 makes sense. I would do the same for empty-directories.

I think if it is a directory, 404 makes more sense as it the error is in the URL and the issue is that the resource in question cannot be interacted with (and essentially doesn't exist), not that the request is inherently invalid. I'll make sure the error message clarifies that the file we attempted to delete is a directory.

Other than that, it seems like there is now broad consensus that 500 is the appropriate response for an IO error, so I'll sort that out too.

// One must check that once normalized (i.e. any `..` removed)
// the path is still under the file transfer dir
let clean_path = clean_utf8_path(&full_path);
if clean_path.starts_with(&self.data_dir) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only checking that the file was in e.g. /var/tedge since all the manipulation in this function was based around file-transfer being part of the URL path (i.e. in the variable path in this function), so it was possible for a user to read from/write to /var/tedge, not just /var/tedge/file-transfer. I'm assuming this wasn't intentional and I've fixed it by storing file_transfer_dir inside HttpConfig instead.

Signed-off-by: James Rhodes <[email protected]>
Comment on lines 147 to 166
async fn check_server_does_not_panic_when_port_is_in_use() -> anyhow::Result<()> {
let ttd = TempTedgeDir::new();

let http_config = HttpConfig::default()
.with_data_dir(ttd.utf8_path_buf().into())
.with_port(3746);
let config_clone = http_config.clone();

// Spawn HTTP file transfer server
// handle_one uses port 3000.
let builder_one = FileTransferServerBuilder::new(http_config);
let handle_one = tokio::spawn(async move { builder_one.build().run().await });

// handle_two will not be able to bind to the same port.
let builder_two = FileTransferServerBuilder::new(config_clone);
let handle_two = tokio::spawn(async move { builder_two.build().run().await });

// although the code inside handle_two throws an error it does not panic.
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
// Both servers will attempt to bind to port 3746.
let server_one = FileTransferServerBuilder::new(http_config).build().run();
// This server will not be able to bind to the same port.
let server_two = FileTransferServerBuilder::new(config_clone).build().run();

// to check for the error, we assert that handle_one is still running
// while handle_two is finished.
assert!(!handle_one.is_finished());
assert!(handle_two.is_finished());
tokio::select! {
// Ensure we bind server_one first by polling that future first
biased;
res = server_one => bail!("expected second server to finish first, but first finished with: {res:?}"),
res = server_two => ensure!(res.is_err(), "expected server two to fail with port binding error, but no error was found"),
_ = tokio::time::sleep(Duration::from_secs(5)) => bail!("timed out waiting for actor to stop running"),
}
Copy link
Contributor Author

@jarhodes314 jarhodes314 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactoring is incidental, I noticed the test had both a hard sleep (so it was comparatively slow) and not terribly deterministic (no guarantee server one would attempt to bind first, no guarantee the port binding would fail within 100ms). The timeout-based approach gives a lot more leniency to the servers to fail, and polling server_one first ensures that it binds to a port first (assuming that is the first thing it attempts to do, which I think is a reasonable assumption).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In response to #2461 (comment), I've now changed this test again, and as a result the remaining complexity around which port binds first (i.e. what I'm relying on the biased directive for) has disappeared entirely.

Copy link
Contributor

github-actions bot commented Nov 16, 2023

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
376 0 3 376 100 55m40.082s

crates/core/tedge_agent/src/file_transfer_server/error.rs Outdated Show resolved Hide resolved
crates/core/tedge_agent/src/file_transfer_server/error.rs Outdated Show resolved Hide resolved
use FileTransferRequestError::*;
match &self {
FromIo(_) | PathRejection(_) => {
tracing::error!("{self}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This will log an error not only when a request is handled, but every time .into_response() is called.

Axum has a tracing feature that makes it output traces when handling events. Could it be used for logging errors like we try to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure axum's tracing is quite what we want here. Here, I want to determine whether I'm hiding information from the user (which is when we have an internal server error), and if so, log that information. This avoids polluting the server logs with errors arising from clients misusing the API, but means we don't expose implementation details to the user of the API. If we rely on some sort of middleware for logging, either it would take an input of FileTransferRequestError, in which case we might as well do everything here as we'd still need to log selectively, or we'd take the converted response, by which point we know the status code, so it's easy to log selectively, but we've thrown away the pertinent information (e.g. an IO error).

Cargo.toml Show resolved Hide resolved
crates/core/tedge_agent/src/file_transfer_server/actor.rs Outdated Show resolved Hide resolved
crates/core/tedge_agent/src/file_transfer_server/error.rs Outdated Show resolved Hide resolved
DeleteIoError { path, .. } => {
tracing::error!("{self}");
(
StatusCode::FORBIDDEN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

403 doesn't look bad, as the spec mentions the following: The access is tied to the application logic, such as insufficient rights to a resource. which is the case here, right? I vaguely remember a similar discussion happening when the original author introduced this case in the first place and we settled with 403 for the same reason.

crates/core/tedge_agent/src/file_transfer_server/actor.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to approve once the DeleteIoError response is changed from FORBIDDEN to INTERNAL_SERVER_ERROR as concluded.

@jarhodes314
Copy link
Contributor Author

jarhodes314 commented Nov 17, 2023

Happy to approve once the DeleteIoError response is changed from FORBIDDEN to INTERNAL_SERVER_ERROR as concluded.

Done in 1ca42b2

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

The next step will be to enable HTTPS support to this file transfer service, introducing http.cert_path, http.key_path and http.ca_path and considering to reuse http.client.auth.cert_file and http.client.auth.cert_file to authenticate tedge-agent when acting as a client (auth proxy) as well as a server (file transfer).

Co-authored-by: Albin Suresh <[email protected]>
Copy link
Contributor

@Bravo555 Bravo555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got answers to my question and most pressing feedback was addressed, so approving.

#[error("Cannot upload: {path:?} is a directory, not a file")]
CannotUploadDirectory { path: RequestPath },

#[error("Request to delete {path:?} failed: {source}")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: We need to be careful with including source error message in parent error message.

I think the recommendation is to not include source error messages like this because proper error reporting/diagnostics will themselves call .source() to print messages for the entire error chain. However, for code that is not aware of this, and doesn't depend on any error reporting crates, it will most likely just grab outer error message. It depends where the error is used. For the FTS, I think the errors appear only in HTTP responses and logs, but 1) I'm not sure how they are exactly reported by axum 2) it's all internal so it doesn't matter that much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware, but knew exactly where these error messages were going (which is only when we call .to_string()), and I didn't want to have to convert them to something like anyhow::Error just to get a backtrace. You're right though, a lot of thin-edge errors are in library code where they're eventually going to become anyhow::Error and in that case it's really annoying (reqwest is the example of a public crate that does this that springs to mind).

@jarhodes314 jarhodes314 merged commit a67dba6 into thin-edge:main Nov 20, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants