
Implement upload API #1085

Open · wants to merge 48 commits into base: dev
Conversation

@psilabs-dev psilabs-dev commented Oct 5, 2024

Implementation of an archive upload API that uploads the contents of an archive, along with any metadata, to the LRR server. (#1005)

While it is possible to write an LRR plugin for a website W, it is often preferable to use a dedicated, feature-rich, and actively maintained downloader (e.g. PixivUtil2 for Pixiv) that specifically pulls metadata and archives from W. However, moving downloaded content from these downloaders into LRR is still a manual process. In the case of PixivUtil2, metadata is stored in a SQLite database, and the storage, language, and architecture differ for each downloader.

This API implementation (using multipart/form-data encoding) allows users to perform language-agnostic archive migrations or mirrors from their existing archive servers and downloaders to the LRR server via the API, or to write comparatively thin HTTP client wrappers around their downloaders that negotiate with and push to LRR during that application's runtime. The user may also wish to attach the SHA1 hash of the file to check file integrity and catch corruption during transit.

PUT /api/archives/upload

Return values (see documentation)

Include in body: file content (binary), filename, title (optional), summary (optional), tags (optional), category ID (optional), plugin calls (?)

Example of a PUT /api/archives/upload in Python with the "example.zip" archive:

import hashlib
import requests

def calculate_sha1(file_path):
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as fb:
        while chunk := fb.read(8192):
            sha1.update(chunk)
    return sha1.hexdigest()

file_path = "example.zip" # required
file_name = "example.zip" # required, this is the name the file will be stored under on the server
upload_mime = "application/zip"

# metadata
title = "example title"
tags = "example_date,key:value,artist:example artist"
summary = "example summary"
category_id = None

# calculate checksum (optional)
file_checksum = calculate_sha1(file_path)

data = {
    "title": title,
    "tags": tags,
    "summary": summary,
    "file_checksum": file_checksum
}
# category_id is optional; the field name here assumes it matches
# the API's category ID parameter
if category_id is not None:
    data["category_id"] = category_id

with open(file_path, 'rb') as fb:
    files = {'file': (file_name, fb, upload_mime)}
    response = requests.put(
        "http://localhost:3000/api/archives/upload",
        files=files,
        data=data
    )
    print(response.json())

This is a synchronous API implementation. Theoretically, we could also run this API as a Minion job, but this is not a long-running task and doesn't justify creating a job; error handling is also more streamlined this way, and the API should not receive too many requests to begin with (TBD).
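For completeness, a client can inspect the JSON body for the success flag that appears in the server's render calls later in this thread. This is a minimal sketch; the helper name is mine, and the full response schema is in the documentation:

```python
def check_upload_response(body: dict) -> None:
    """Raise if the server reported a failed upload; the response body
    carries operation/success/error fields on failure."""
    if not body.get("success"):
        raise RuntimeError(f"upload failed: {body.get('error', 'unknown error')}")

# Usage with the example above would be: check_upload_response(response.json())
check_upload_response({"operation": "upload", "success": 1})  # passes silently
```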

@psilabs-dev psilabs-dev marked this pull request as ready for review October 14, 2024 02:04

psilabs-dev commented Oct 14, 2024

Open for review and testing. I've tested this on ~50k cbz archives with metadata (without obvious problems). I've also tested it in combination with and without automatic metadata plugins, checksum validation, optional metadata (title, tags, summaries), mandatory fields (filename, file data), and both localhost and reverse-proxy setups.

I'll add that the checksum validation guards against a file being corrupted during transit, not against a file corrupted at the origin: if you upload an already-corrupted file, it will still be corrupted and the check will not catch it.
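To make the scope of that check concrete, here is a hedged sketch (in Python, though the server is Perl) of the comparison involved; the function name is illustrative, not the PR's actual code:

```python
import hashlib

def verify_transit_checksum(received_bytes: bytes, claimed_sha1: str) -> bool:
    """True iff the bytes that arrived hash to the client's claimed SHA1.
    This detects corruption in transit only: if the client hashed an
    already-corrupted file, the hashes still match and the check passes."""
    return hashlib.sha1(received_bytes).hexdigest() == claimed_sha1.lower()

# A file corrupted *before* the client hashed it still passes:
corrupted = b"not the original bytes"
claimed = hashlib.sha1(corrupted).hexdigest()  # client hashed the bad file
assert verify_transit_checksum(corrupted, claimed)
```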


@Difegue Difegue left a comment


This looks good at a glance to me - I'll have to play with it for a bit.

Would it be possible to get the documentation updated?
This would land in https://github.com/Difegue/LANraragi/blob/dev/tools/Documentation/api-documentation/archive-api.md - Please duplicate an existing {swagger} element and edit it to match. Even if it's not perfect I'll help with suggestions afterwards 👍

);

# utf downgrade (see LANraragi::Utils::Minion)
unless (utf8::downgrade($filename, 1)) {
Owner


Since this API doesn't run through a minion job, you might not need the atrocious utf hack.

Contributor Author


I've re-added this, since Redis (used for locking) seems to throw errors when certain filenames are not downgraded.

@Difegue Difegue linked an issue Oct 17, 2024 that may be closed by this pull request

Difegue commented Oct 17, 2024

I guess one additional thought: you mention only one upload can be done at a time with this API, but I'm not seeing any obvious limitation that would prevent running it concurrently.
Is there something that breaks when you call it multiple times in parallel? If so, it might be good to add a lock of some sort to prevent further API calls while an upload is in progress.

I wouldn't put it past client implementations to manhandle this endpoint to death, so we should take precautions if needed 🤔
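Until a server-side lock exists, a client can at least bound its own concurrency so it doesn't manhandle the endpoint. A minimal sketch using a semaphore; the cap, the worker count, and fake_upload are all illustrative stand-ins for the requests.put call in the example above:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_UPLOADS = 2  # illustrative cap, not an API requirement
upload_slots = threading.Semaphore(MAX_CONCURRENT_UPLOADS)

def guarded_upload(upload_fn, *args, **kwargs):
    """Run upload_fn while holding a slot, so at most
    MAX_CONCURRENT_UPLOADS requests are in flight at once."""
    with upload_slots:
        return upload_fn(*args, **kwargs)

def fake_upload(path):  # stand-in for the real PUT request
    return f"uploaded {path}"

# Fan out many uploads, but only two ever run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda p: guarded_upload(fake_upload, p),
                            ["a.zip", "b.zip", "c.zip"]))
print(results)  # ['uploaded a.zip', 'uploaded b.zip', 'uploaded c.zip']
```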

@psilabs-dev

Ahh, you're right, LRR does support concurrent API calls now that I've tested it again. I don't know why my second worker would always hang and lose connections last time...

After testing concurrent uploads, I'm seeing two (probably fixable) issues with concurrency.

Type 1: when the same archive is uploaded concurrently by two clients, on rare occasions the server throws a "The file couldn't be renamed in your content folder: No such file or directory" error to the later client. This can happen even when the two uploads are separated by as much as 30ms.

Type 2: this is the weirder bug, characterized by this response:
<!DOCTYPE html>

<head>
	<title>LANraragi Server Error</title>

	<meta name="viewport" content="width=device-width" />
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="apple-mobile-web-status-bar-style" content="black" />
<meta name="mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-capable" content="yes" />

	<link type="image/png" rel="icon" href="favicon.ico" />
	<link rel="manifest" href="app.webappmanifest" />
</head>

<body style='background: none repeat scroll 0% 0% brown; color: white; font-family: sans-serif; text-align: center'>

	<img src='/img/flubbed.gif' />
	<br />
	<h2>I flubbed it while rendering your page!</h2>
	It's likely there's a bug with the server at this point.<br />
	<p>
		 at /home/koyomi/lanraragi/script/../lib/LANraragi/Utils/Archive.pm line 336.

	</p>

	<h3>Some more info below if available:</h3>
	<pre>{
  "action" => "create_archive",
  "controller" => "api-archive"
}
</pre>

</body>

</html>

I've noticed that when one worker encounters this error, the other worker encounters a concurrency error of the first type. This second type is rarer than the first.
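For reference, the classic mitigation for this rename race (not what the PR implements, just a sketch of the pattern) is to stage each upload under a unique temporary name in the destination directory and promote it with an atomic rename:

```python
import os
import tempfile

def stage_and_promote(content: bytes, final_path: str) -> None:
    """Write to a unique temp file in the destination directory, then
    atomically rename it into place. Concurrent uploads of the same
    archive each get their own temp name, so neither rename can find
    its source missing; the last replace simply wins."""
    dest_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)  # unique per caller
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        os.replace(tmp_path, final_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```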

@psilabs-dev

> This looks good at a glance to me - I'll have to play with it for a bit.
>
> Would it be possible to get the documentation updated? This would land in https://github.com/Difegue/LANraragi/blob/dev/tools/Documentation/api-documentation/archive-api.md - Please duplicate an existing {swagger} element and edit it to match. Even if it's not perfect I'll help with suggestions afterwards 👍

Sure! I'll take a look at this.

@psilabs-dev psilabs-dev marked this pull request as draft October 18, 2024 10:10

psilabs-dev commented Oct 26, 2024

Ugh... there is a race condition when two identical uploads are processed at almost exactly the same time. I thought adding a Redis lock would fix it, but implementing one seems to cause more issues; Utils::Archive will complain at line 242 🧐

edit: maybe I forgot to close the connection... the second implementation no longer causes issues.
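The locking pattern under discussion can be sketched without a live Redis server: emulate SET NX ("set if not exists") for acquisition, plus an explicit release in a finally block, which is the part that is easy to forget. The class and key names here are illustrative, not the PR's code:

```python
import threading

class FakeNXStore:
    """In-process stand-in for Redis SET NX / DEL, for illustration only."""

    def __init__(self):
        self._keys = set()
        self._mu = threading.Lock()

    def set_nx(self, key: str) -> bool:
        with self._mu:
            if key in self._keys:
                return False  # another worker holds the lock
            self._keys.add(key)
            return True

    def delete(self, key: str) -> None:
        with self._mu:
            self._keys.discard(key)

store = FakeNXStore()

def with_upload_lock(key, critical_section):
    """Acquire, run, and always release, so a failed upload cannot
    wedge every later upload of the same archive."""
    if not store.set_nx(key):
        raise RuntimeError("another upload of this archive is in progress")
    try:
        return critical_section()
    finally:
        store.delete(key)

print(with_upload_lock("upload:example.zip", lambda: "written"))  # written
```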

return $self->render(
    json => {
        operation => "upload",
        success   => 0,
-       error     => "Couldn't move uploaded file to $tempfile."
+       error     => "Server Error"
Owner


Not the biggest fan of generic "server error" messages, which are always unhelpful, especially since we have the reason for the error here.

Contributor Author


I can change this back. My opinion is that server-side errors should not be exposed to the client for security reasons, but this is obviously a tradeoff with debuggability.

@Difegue Difegue Nov 9, 2024


Right, I do agree it's better to avoid exposing information about the server's file structure or similar.
Would Couldn't move uploaded file to temporary location work as a compromise? The detailed error with the file path is already logged server-side, so I'd be fine with that.

Contributor Author


No complaints here, should be reflected in the new commits.

lib/LANraragi/Controller/Api/Archive.pm Outdated Show resolved Hide resolved
lib/LANraragi/Controller/Api/Archive.pm Outdated Show resolved Hide resolved
@psilabs-dev

Ok, don't think I'll have anything else to add, I'll be doing final builds and stress tests and opening it up for review later.


Difegue commented Nov 10, 2024

Sounds good, feel free to undraft once you think it's ready and I'll give it one last once-over.

@psilabs-dev psilabs-dev marked this pull request as ready for review November 11, 2024 05:11
@psilabs-dev

One thing that came to mind before I forget: there are no file size upload limits. I tested whether this handles large files well, and I was able to upload an 18 GB archive without issue, although I can't read it due to temp folder limitations.

That said, do we want a 500 MB file size limit?


Difegue commented Nov 12, 2024

No; if there are no limitations on the server, I wouldn't implement an artificial one in the API.
I imagine people running this behind nginx or similar will encounter the same request size limits they already had with the non-API version. (And they can fix them the same way.)
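For reference, the reverse-proxy limit alluded to here is typically nginx's client_max_body_size; this fragment shows how a deployment might raise it for the upload endpoint (all paths, ports, and values are illustrative, not from this PR):

```nginx
# Illustrative nginx fragment; adjust to your deployment.
server {
    listen 80;

    location /api/archives/upload {
        client_max_body_size 2G;      # or 0 to disable the body-size limit
        proxy_request_buffering off;  # stream large uploads instead of buffering
        proxy_pass http://127.0.0.1:3000;
    }
}
```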


@Difegue Difegue left a comment


One last round of comments from me after reviewing the whole thing anew.

my $summary = $self->req->param('summary');

# return error if archive is not supported.
if ( !is_archive($filename) ) {
Owner


You don't really need this bit if it's already done in handle_incoming_file, right?

(I'm aware Controller->Upload does the same, but I think I'll eventually deprecate that code path in favor of just using the API in the webclient)

Contributor Author


is_archive ensures that the filename has a valid extension. I don't know how removing this would affect the tempdir logic; I'll need to test it out.

my $tempdir = tempdir();

my ( $fn, $path, $ext ) = fileparse( $filename, qr/\.[^.]*/ );
my $byte_limit = LANraragi::Model::Config->enable_cryptofs ? 143 : 255;
Owner


I wonder if this shouldn't be enforced at the handle_incoming_file level rather than both here, in Model->Upload->Download_url, and in the old Controller upload.

Might be one for later tho

@psilabs-dev psilabs-dev Nov 15, 2024


What was the initial purpose of the byte limit? I only kept it there to mirror the existing code; I don't know how temporary file creation would be affected if a filename exceeded this limit.
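For context on the numbers (my reading, not stated in the PR): most Linux filesystems cap a single filename at 255 bytes, and stacked encrypted filesystems such as eCryptfs leave only about 143 bytes of plaintext name, because the encrypted form must still fit in 255. A sketch of the check, assuming UTF-8 filenames:

```python
def fits_filename_limit(filename: str, crypto_fs: bool = False) -> bool:
    """Check a filename against the byte budget of the target filesystem:
    255 bytes normally, 143 when a cryptographic filesystem layer is in
    use (mirroring the enable_cryptofs branch above)."""
    byte_limit = 143 if crypto_fs else 255
    return len(filename.encode("utf-8")) <= byte_limit

assert fits_filename_limit("example.zip")
assert not fits_filename_limit("\u3042" * 60, crypto_fs=True)  # 180 UTF-8 bytes
```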

my $uploadMime = $upload->headers->content_type;

# utf downgrade (see LANraragi::Utils::Minion)
$filename = encode('UTF-8', $filename);
Owner


Why is this necessary? To me it looks like you're just passing the filename through utf-8 and back out for no real reason now.

@psilabs-dev psilabs-dev Nov 15, 2024


tbh I don't think I can confidently explain this part of the code, because I don't fully understand Perl encoding; it was a product of experimentation, and it was necessary because it prevented errors and ensured that API uploads behave the same as manual uploads (i.e. LANraragi::Controller::Upload::process_upload). I'll describe a timeline explaining the rationale.

For the minimal example consider an archive with a special character called "~.cbz". Note that this type of file can already be uploaded successfully through the manual upload portal. (It goes without saying that the behavior of the manual upload dictates the expected behavior of the API upload.)

The entire utf encode/downgrade block exists to guard against issues with redis when a file with a special character like this is uploaded. Redis locking is required before any writes to guard against race conditions. Without the utf block, an unsanitized filename with special characters will cause redis to throw an exception.

When the utf8 downgrade is (re)introduced, any upload that would have incurred a Redis exception is instead "handled" at the downgrade stage, which returns an HTTP error. All filenames that pass this downgrade test also do not trigger Redis errors. However, this does not address the underlying issue: a file with special characters should not fail the downgrade in the API at all, when it never failed this stage during a manual upload. So why does this happen?

During the manual process, the file is written to a temporary location first, and the filename is extracted using Perl's built-in File::Find module. This means that somewhere, Perl has already performed encoding/normalization work on the filename, which does not happen in the API path; hence it passes the downgrade stage. (Alternatively, this encoding may happen during the JavaScript phase of the upload.)

To verify my assertion that File::Find performs encoding work (which I haven't done, since I think the issue is resolved), you can do the following: in the API stage, write the file bytes and filename to a temporary directory, then read it back with File::Find, overwriting the filename. If I'm correct, this should pass the utf8 downgrade.

The obvious solution would be to copy the File::Find logic of the manual upload into the API, but the API method cannot afford to move the file to disk first, because the Redis lock needs to be taken before any writes. So one way to do this is to emulate the effect of File::Find and manually encode the filename, hence the utf8 encode stage. Adding this stage gives the API the expected behavior.

hope this makes sense
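For readers unfamiliar with Perl's utf8::downgrade: it asks whether a string can be stored as single bytes (Latin-1) rather than in Perl's internal UTF-8 form, which is roughly what the Redis client needs. Below is a rough Python analogue of the two stages discussed above; the mapping to Perl semantics is my interpretation, and the sample filename is hypothetical:

```python
def can_downgrade(s: str) -> bool:
    """Rough analogue of Perl's utf8::downgrade($s, 1): True iff every
    code point in the string fits in a single byte (Latin-1)."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

filename = "\uff5e.cbz"  # hypothetical non-Latin-1 filename (fullwidth tilde)
assert not can_downgrade(filename)  # the bare string fails the downgrade test

# The "encode" stage turns the decoded name into raw UTF-8 bytes, each of
# which is <= 0xFF by construction, so the byte string is safe to hand on.
as_bytes = filename.encode("utf-8")
assert all(b <= 0xFF for b in as_bytes)
```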

tools/Documentation/api-documentation/archive-api.md Outdated Show resolved Hide resolved
tools/Documentation/api-documentation/archive-api.md Outdated Show resolved Hide resolved
tools/Documentation/api-documentation/archive-api.md Outdated Show resolved Hide resolved
Successfully merging this pull request may close these issues.

Feature Request: Add api for upload archive