Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update bulk image conversion script to support multiple formats #226

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

Anushlinux
Copy link

@Anushlinux Anushlinux commented Oct 3, 2024

I have updated the conversion code and implemented my function to the convert_image_hook as you required

Pls review the code
(ps: sorry for deleting the previous pr i did it bimistake)

scripts/hooks/convert_images_hook.py Outdated Show resolved Hide resolved
scripts/hooks/convert_images_hook.py Outdated Show resolved Hide resolved
Copy link
Owner

@Udayraj123 Udayraj123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Anushlinux you have understood and handled a lot of code flows(in local scripts) to OMRChecker. Great work on your first PR! 🎉
I've finished my review and I think we will need some discussion to make it even more extensible. Let's discuss async in the discord channel once you go through my comments.

@@ -3,43 +3,20 @@
import glob
import operator
import os

from PIL import Image
from pdf2image import convert_from_path
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new packages in local scripts need to be added to requirements.dev.txt


# Usual pre-processing commands for speedups (useful during re-runs)
# python3 scripts/local/convert_images.py -i inputs/ --replace [--filter-ext png,jpg] --output-ext jpg
# python3 scripts/local/resize_images.py -i inputs/ -o outputs --max-width=1500
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep all these comments as a guide for the contributors to figure out what needs to be done here

extracted_files = []
for _dir, _subdir, _files in os.walk(input_dir):
matching_globs = [
glob(os.path.join(_dir, f"*.{file_extension}"))
glob.glob(os.path.join(_dir, f"*.{file_extension}"))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change the import to from glob import glob instead

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thanks for adding the code comment

@@ -72,29 +52,79 @@ def get_local_argparser():
local_argparser.add_argument(
"-r",
"--recursive",
required=True,
type=bool,
action='store_true',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should we have this default? It's better to take recursive updates as an explicit input

Comment on lines +85 to +88
pages = convert_from_path(input_path)
for i, page in enumerate(pages):
output_path = os.path.join(output_dir, f"page_{i + 1}.jpg")
page.save(output_path, 'JPEG')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also include file name here

Suggested change
pages = convert_from_path(input_path)
for i, page in enumerate(pages):
output_path = os.path.join(output_dir, f"page_{i + 1}.jpg")
page.save(output_path, 'JPEG')
pages = convert_from_path(input_path)
input_path = Path(PurePosixPath(input_path).as_posix())
file_name = input_path.name
for i, page in enumerate(pages):
output_path = os.path.join(output_dir, f"{file_name}_page_{i + 1}.jpg")
print(f"Saving page: {output_path}")
page.save(output_path, 'JPEG')

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also convert_image and convert_pdf_to_jpg should come from image_utils instead of bulk_ops_common.py.
I am thinking of keeping this file only for operations that are common across all sorts of bulk scripts (bulk resize, convert, watermark, etc)

"""
pages = convert_from_path(input_path)
for i, page in enumerate(pages):
output_path = os.path.join(output_dir, f"page_{i + 1}.jpg")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feature extension idea here, see if you can implement -
Let's add a support to extract only the first page based on a flag (in that case the image's output_dir shouldn't be created and the single image should be output directly with same filename without the page_ prefix.

os.makedirs(output_dir, exist_ok=True)
extensions = ['png', 'jpg', 'jpeg', 'tiff', 'tif', 'pdf']

filepaths = walk_and_extract_files(input_dir, extensions)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snake case

Suggested change
filepaths = walk_and_extract_files(input_dir, extensions)
file_paths = walk_and_extract_files(input_dir, extensions)

Bulk converts images and PDFs to the specified format.
"""
os.makedirs(output_dir, exist_ok=True)
extensions = ['png', 'jpg', 'jpeg', 'tiff', 'tif', 'pdf']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep pdf handling in a separate script (check other comments)

Suggested change
extensions = ['png', 'jpg', 'jpeg', 'tiff', 'tif', 'pdf']
extensions = ['png', 'jpg', 'jpeg', 'tiff', 'tif']

Comment on lines +100 to +104
for input_path in filepaths:
relative_path = os.path.relpath(os.path.dirname(input_path), input_dir)
output_subdir = os.path.join(output_dir, relative_path) if not in_place else os.path.dirname(input_path)
os.makedirs(output_subdir, exist_ok=True)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in_place flag handling can be a common bulk ops utility. I propose we define a function bulk_apply_on_files(func, input_dir, file_paths, in_place)

  • The first arg func can be convert_image or convert_pdf_to_image
  • The bulk_apply_on_files applies this passed function with the same arguments
    • if in_place is True and the conversion was successful, the input file gets removed (pdf or image) and the output files are added in the same directory
    • if in_place is False and the conversion was successful, the output files are added in a relative directory output_dir = f"{input_dir}/outputs/"
    • If the conversion was unsuccessful, no file changes happen

Let's discuss more about this in the discord group if needed.


converted_count = convert_images_in_tree(args)
trigger_size = args["trigger_size"]
converted_count = convert_images(args.filenames, args.format.upper())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can call the function bulk_apply_on_files(convert_image, input_dir, file_paths, in_place) or something similar based on our discussion (check other comments below)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants