
Image analysis job #307

Merged 21 commits into stumpapp:experimental on May 13, 2024
Conversation

JMicheli (Collaborator)

This pull request creates a first draft of an image analysis job (#181), which will need to be updated further before merging.

I chose a slightly broader approach to this job, making an AnalyzeMediaJob which includes an AnalyzeImage task. This is intended to be more extensible, as additional stages to media analysis will probably be wanted later on.

The job isn't fully integrated. Although it is written to have four variants (Individual, Library, Series, and MediaGroup), I've only implemented a client-side way to access the Individual analysis command. I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, which seemed like the right place for it. Implementing the job fully would involve adding the same button to other menus and linking them up on the axum side. The axum side also needs authentication boilerplate added.
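Purely for illustration, the four variants might look something like this in Rust (a sketch only; the ID payload types and names are assumptions, not Stump's actual definitions):

```rust
// Hypothetical sketch of the four variants described above. The String
// ID payloads and the helper below are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum AnalyzeMediaJobVariant {
    /// Analyze a single media item by its ID
    AnalyzeSingleItem(String),
    /// Analyze every media item in a library
    AnalyzeLibrary(String),
    /// Analyze every media item in a series
    AnalyzeSeries(String),
    /// Analyze an arbitrary group of media items
    AnalyzeMediaGroup(Vec<String>),
}

/// Human-readable label for each variant, e.g. for logging.
pub fn variant_label(variant: &AnalyzeMediaJobVariant) -> &'static str {
    match variant {
        AnalyzeMediaJobVariant::AnalyzeSingleItem(_) => "individual",
        AnalyzeMediaJobVariant::AnalyzeLibrary(_) => "library",
        AnalyzeMediaJobVariant::AnalyzeSeries(_) => "series",
        AnalyzeMediaJobVariant::AnalyzeMediaGroup(_) => "media group",
    }
}
```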

Finally, the job currently always updates the number of pages on a media item, as opposed to just validating it. From the issue, it sounds like this isn't the goal, so I can change that alongside other changes.

In the meantime, I think this is a good place to discuss how this could be improved and made ready for integration into develop.


@aaronleopold (Collaborator) left a comment


You said this PR is a first draft towards #181, so I didn't comment on some things I assume you'll fix in follow-up commits, e.g. unwraps and TODOs. I had a few low-priority comments and a couple of nitpicks, but overall I think this is looking great, and I appreciate the time you've put into this so far!

I'll mark this PR as a draft just so that when you're ready for me to take another look you can just change the status and I'll know to look.

> I chose a slightly broader approach to this job, making an AnalyzeMediaJob which includes an AnalyzeImage task. This is intended to be more extensible, as additional stages to media analysis will probably be wanted later on.

I think this is definitely the right approach. I imagine we'll be adding various types of media analysis in the future, so not having to define and maintain separate jobs for each type of analysis is a good idea.

> On that note, I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, it seemed like the right place to put it.

I think so; it satisfies your Individual variant for the job. The other variants would need triggers in different places, as you already mentioned, and I think this is a good starting point.

> Finally, the job currently always updates the number of pages on a media item, as opposed to just validating it. From the issue, it sounds like this isn't the goal, so I can change that alongside other changes.

I think this is the eventual goal. One thing to consider, which might be good to gather input on, is whether we always want to auto-update metadata diffs when a media item is analyzed. An alternative that might be appealing is some sort of reconciliation process, which would require a user to manually review and accept the changes. What do you think?

As for the other functionalities outlined in that feature, things like image dimensions could just be an automatic update to whatever relation winds up holding that information on a media record.

Review threads (outdated, resolved): apps/server/src/routers/api/v1/media.rs, core/src/filesystem/media/analyze_media_job.rs
Comment on lines 29 to 32
```rust
pub struct AnalyzeMediaOutput {
	/// The number of images analyzed
	images_analyzed: u64,
}
```
@aaronleopold (Collaborator)

I think you could likely add more state to this, e.g. media_updated, but this is a good start and I'm fine leaving as-is if that is what you decide
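As an illustration of that suggestion, the extended output could look like the sketch below. Only `images_analyzed` exists in the PR at this point; `media_updated` and the helper method are hypothetical:

```rust
// Sketch of the suggested extension. `media_updated` and
// `record_media_update` are assumptions based on the review comment,
// not the merged code.
#[derive(Debug, Default, PartialEq)]
pub struct AnalyzeMediaOutput {
    /// The number of images analyzed
    pub images_analyzed: u64,
    /// The number of media records updated during analysis
    pub media_updated: u64,
}

impl AnalyzeMediaOutput {
    /// Record that one media record was updated.
    pub fn record_media_update(&mut self) {
        self.media_updated += 1;
    }
}
```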

Review threads (outdated, resolved): core/src/filesystem/media/analyze_media_job.rs, epub.rs, rar.rs, zip.rs
@aaronleopold marked this pull request as draft on April 16, 2024

JMicheli commented Apr 22, 2024

Okay, these last several commits get things into a generally workable state. There are probably still a few things to do. A few that come to mind:

  • Add more doc comments and logging to job.
  • Any other media analysis tasks we need?
  • Should the media's page_count always be updated, or only when it doesn't already have one?
  • Enforce permissions properly in the server when the start-analyze-media-task endpoint is hit.


aaronleopold commented Apr 24, 2024

> Any other media analysis tasks we need?

I think what you have is good for now with respect to what analysis tasks are performed, and you've already built it to be able to add more tasks as needed, which is 🔥

> Should the media's page_count always be updated, or only when it doesn't already have one?

I'm not 100% sure yet. I think if it flat out doesn't exist, it's fine to set it to the value you read. The exception might be EPUB files.

Otherwise, I think some sort of reconciliation would be neat, e.g. the analysis generates a list of actions to be taken on the frontend. In the case of a mismatch, it would tell you what the current value is and what the analyzed value is, and you could decide the action.

However, that in itself is a huge feature. I'll err on the side of caution and say to just persist a JobExecuteLog with a WARN level for now. I think folks are generally anxious about automatic operations on their carefully curated metadata libraries (even if it doesn't change the underlying metadata files).


JMicheli commented Apr 27, 2024

So I was working on this pull request today and I realized that a media item's id is non-null.

Investigating further, to see whether an invalid value might be assigned under some conditions and should be overwritten (versus properly assigned page counts where no update is necessary, per your earlier comment), I found something I hadn't considered: the pages for a file are always counted during initial creation during a scan (see the process function for each FileProcessor implementation).

That initially made the job here seem a bit unnecessary, since it would never produce a new value. Thinking further, though, it might actually make the most sense to always update it. This way, should the page-counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job. In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.
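A minimal sketch of that unification idea, assuming a shared helper that both process and get_page_count could call (the extension list and function names here are illustrative, not Stump's actual code):

```rust
// Hypothetical shared helper: if both `process` and `get_page_count`
// delegate to one page-counting function, the two paths can never
// disagree. The extension list below is an illustrative assumption.
fn is_image_entry(name: &str) -> bool {
    let lower = name.to_ascii_lowercase();
    ["jpg", "jpeg", "png", "webp", "gif", "avif"]
        .iter()
        .any(|ext| lower.ends_with(&format!(".{ext}")))
}

/// Count pages in an archive listing by keeping only image entries.
fn count_image_pages<'a, I: IntoIterator<Item = &'a str>>(entries: I) -> u32 {
    entries.into_iter().filter(|n| is_image_entry(n)).count() as u32
}
```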

Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.


> So I was working on this pull request today and I realized that a media item's id is non-null.

I'm not sure I understand, did you mean to refer to the page being non-nullable?

> the pages for a file are always counted during initial creation during a scan (see the process function for each FileProcessor implementation).

Yeah, good catch. There is a pages field directly embedded in a media file (IIRC it is -1 for EPUBs). This is mostly because there wasn't proper metadata support until a little later on, and that representation has just stuck around.

I think the big goal for the analysis is more about ensuring the metadata matches the actual count, not necessarily the pages field on the media directly (since, like you said, that was generated by Stump already and should just be correct). The metadata read from misc files within e.g. an archive is more subject to error. However, if someone were to manually edit the file, e.g. adding a page, obviously that would lead to a mismatch in both.

> However, thinking further, it might actually make the most sense to always update it. This way, should the page-counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job.

I'm good with this 👍 I do like my idea for some sort of reconciliation flow (I'm starting to overuse this word now lol) but I don't think it is necessarily needed for this small of an operation.

> In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.

> Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.

Yeah I think that makes sense, I agree 👍

I also think the random buttons are good for now, especially since this is based on experimental.


@JMicheli (Collaborator, Author)

This last commit does two things. First, it fixes an error in the way the job counted pages for RARs (it needed to check whether each entry was an image). Second, it addresses a TODO in process that suggested using the same validation as in get_page for counting pages (again, checking whether the entry is an image).

Once you've confirmed that this is what you want, this should be good to merge.

@aaronleopold marked this pull request as ready for review on May 1, 2024

@aaronleopold (Collaborator) left a comment


I had a few nitpicks, but the only one I really care about is the conditional update during analysis. If you have time to tackle any, great; if not, I'm not worried. I think this looks great otherwise, so I'm approving it. Thanks again!!

Let me know if you'd prefer me to handle the conflicts with the base branch 👍

Review threads (outdated, resolved): apps/server/src/routers/api/v1/media.rs, core/src/filesystem/media/analyze_media_job.rs
Comment on lines 176 to 183
```rust
// Update media item in database
let _ = ctx
	.db
	.media()
	.update(media::id::equals(id), vec![media::pages::set(page_count)])
	.exec()
	.await?;
output.media_updated += 1;
```
@aaronleopold (Collaborator)

This update should probably only hit if the page counts are actually different

@JMicheli (Collaborator, Author)

I went ahead and followed this rule in my latest commit: I check whether there's metadata, and if there is and its page_count doesn't match the one we calculated, I update it. If there isn't, I create new metadata with the new page_count.

I'd appreciate it if you could check the logic there; it may not be the most idiomatic Prisma usage.
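The rule described above can be sketched in plain Rust, separate from the actual prisma-client-rust calls (all names here are hypothetical, for illustration only):

```rust
// Sketch of the conditional-update rule: create metadata when none
// exists, update only when the stored page count differs, otherwise do
// nothing. These types stand in for the real database operations.
#[derive(Debug, PartialEq)]
pub enum PageCountAction {
    CreateMetadata(u32),
    UpdateMetadata(u32),
    NoChange,
}

pub fn reconcile_page_count(stored: Option<u32>, computed: u32) -> PageCountAction {
    match stored {
        None => PageCountAction::CreateMetadata(computed),
        Some(existing) if existing != computed => PageCountAction::UpdateMetadata(computed),
        Some(_) => PageCountAction::NoChange,
    }
}
```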

@aaronleopold merged commit 8a8bd86 into stumpapp:experimental on May 13, 2024
@aaronleopold mentioned this pull request on Jul 8, 2024
@JMicheli deleted the image_analysis branch on October 9, 2024