Fix HTTP connection error for long running transfers #842

Merged May 15, 2023 (209 commits)

sarahwooders commented May 11, 2023

Implements a few fixes for bugs that caused errors in large transfers:

  • A basic backpressure mechanism: if the queues on a gateway are full, the chunk_requests POST request returns how many chunks were added and the current queue size. This tells the HTTP client making the request to send the remaining chunks (those not added) to a different gateway, or to wait and try again. With this change, I was able to transfer 1TB.
  • This also reduces the total number of HTTP connections per gateway to 64, as opposed to 32 per destination, which seems to have been causing issues.
  • Empty chunks are allowed, since object stores can have empty folders which we still want transferred.
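
For readers unfamiliar with the pattern, the backpressure behaviour in the first bullet can be sketched roughly as follows. This is an illustrative in-memory model, not the code in this PR: `GatewayQueue`, `dispatch`, and the reply keys `n_added` and `qsize` are hypothetical names, and the real exchange happens over HTTP rather than through a method call.

```python
from collections import deque
from typing import Deque, Dict, List


class GatewayQueue:
    """Hypothetical stand-in for a gateway's bounded chunk-request queue."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[str] = deque()

    def post_chunk_requests(self, chunks: List[str]) -> Dict[str, int]:
        """Accept as many chunks as fit; report how many were added and the queue size."""
        free = max(self.capacity - len(self.queue), 0)
        accepted = chunks[:free]
        self.queue.extend(accepted)
        return {"n_added": len(accepted), "qsize": len(self.queue)}


def dispatch(chunks: List[str], gateways: List[GatewayQueue]) -> List[str]:
    """Offer chunks to each gateway in turn; return whatever no gateway accepted."""
    remaining = list(chunks)
    for gateway in gateways:
        if not remaining:
            break
        reply = gateway.post_chunk_requests(remaining)
        # Only the first n_added chunks were enqueued; the rest go to the next gateway.
        remaining = remaining[reply["n_added"]:]
    return remaining


gateways = [GatewayQueue(capacity=2), GatewayQueue(capacity=3)]
leftover = dispatch([f"chunk-{i}" for i in range(6)], gateways)
print(leftover)
```

In this model, anything left in `leftover` is what the client would hold back and retry later, once the gateways have drained their queues.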

There are still issues with SSH connections for long-running transfers, and listing files can take an extremely long time on the client (#841), so these issues need to be fixed for very large transfers.

lynnliu030 and others added 30 commits November 29, 2022 13:36
This change introduces an API for Skyplane Broadcast

Todos:
- [x] Fix provisioning in BroadcastDataplane
  - Reuse provision loop via inheritance
  - Move `_start_gateway` to a class method and override it
  - Adapt broadcast to use `bound_nodes`
- [x] Add BroadcastCopyJob (ideally extend CopyJob)
- [x] Update tracker to monitor broadcast jobs
- [x] Add multipart support 
- [x] Fix dependency issue via adding dockerfile and bc_requirements 
- [x] Integrate with gateway and test the monitoring side 

Co-authored-by: Paras Jain <[email protected]>
Co-authored-by: Sarah Wooders <[email protected]>
except Exception as e:
    UsageClient.log_exception(
        "dispatch job",
        e,
        args,
        self.dataplane.topology.src_region_tag,
        self.dataplane.topology.dest_region_tags,
        self.dataplane.topology.dest_region_tags[0],  # TODO: support multiple destinations
Contributor

Suggested change
- self.dataplane.topology.dest_region_tags[0],  # TODO: support multiple destinations
+ *self.dataplane.topology.dest_region_tags,

You can just "spread" this list as arguments

Contributor Author

what does this mean?
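
For context, "spreading" here most likely refers to Python's `*` argument unpacking, which passes each element of a list as its own positional argument. A minimal sketch, using a hypothetical `log_exception` function that accepts a variable number of region tags (this is an assumption for illustration, not the actual signature of `UsageClient.log_exception`):

```python
def log_exception(location, exc, *region_tags):
    """Hypothetical logger that accepts any number of region tags."""
    return f"{location}: {exc} [{', '.join(region_tags)}]"


dest_region_tags = ["aws:us-east-1", "gcp:us-central1"]

# Passing only the first element drops every other destination.
first_only = log_exception("dispatch job", "boom", dest_region_tags[0])

# The * operator unpacks the list into separate positional arguments, so this is
# equivalent to log_exception("dispatch job", "boom", "aws:us-east-1", "gcp:us-central1").
spread = log_exception("dispatch job", "boom", *dest_region_tags)

print(first_only)
print(spread)
```

Unpacking only helps if the callee accepts a variable number of arguments; with a fixed signature, passing the list itself would be the alternative.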

headers={"Content-Type": "application/json"},
)
reply_json = json.loads(reply.data.decode("utf-8"))
print(server, min_idx, "added", n_added, len(chunk_batch), reply_json)
Contributor

Suggested change
- print(server, min_idx, "added", n_added, len(chunk_batch), reply_json)

Contributor

log debug messages?
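
One way to follow this suggestion is to route the message through a logger at debug level instead of `print`. The sketch below uses the standard library `logging` module; the logger name, the `report_added` helper, and the message wording are all hypothetical, and the project may well prefer its own logging utility.

```python
import logging

logger = logging.getLogger("gateway_client")  # hypothetical logger name


def report_added(server: str, n_added: int, batch_size: int) -> None:
    # Lazy %-style formatting: the message is only built if DEBUG is enabled.
    logger.debug("%s accepted %d of %d chunk requests", server, n_added, batch_size)


logging.basicConfig(level=logging.INFO)
report_added("gateway-0", 3, 8)  # suppressed while the effective level is INFO

logger.setLevel(logging.DEBUG)
report_added("gateway-0", 3, 8)  # emitted once DEBUG is enabled
```

Unlike a bare `print`, the message stays available for troubleshooting but does not clutter normal output.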

@@ -78,11 +78,14 @@ def worker_loop(self, worker_id: int, *args):
        self.worker_id = worker_id
        while not self.exit_flags[worker_id].is_set() and not self.error_event.is_set():
            try:
                # print(f"[{self.handle}:{self.worker_id}] Waiting for chunk, queue size {self.input_queue.size()}")
Contributor

clean up this file a bit?

if chunk_req.chunk.chunk_length_bytes == 0:
    # nothing to do
    # create empty file
    open(fpath, "a").close()
Contributor

Path('path/to/file.txt').touch()
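
As a quick comparison of the two approaches, the sketch below creates one empty file each way inside a temporary directory. The file names are made up for illustration; both calls create the file if it is missing and leave existing contents intact.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    opened = Path(tmp) / "opened.chunk"
    touched = Path(tmp) / "touched.chunk"

    # Approach in the diff: opening in append mode creates the file if missing.
    open(opened, "a").close()

    # Reviewer's suggestion: Path.touch() does the same and states the intent.
    touched.touch()

    # Record the sizes while the temporary directory still exists.
    sizes = (opened.stat().st_size, touched.stat().st_size)

print(sizes)
```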

sarahwooders merged commit a21c50f into skyplane-project:main on May 15, 2023
5 participants