[input-file] Allow comments after a URL #2808

Closed
AlttiRi opened this issue Aug 7, 2022 · 6 comments

Comments


AlttiRi commented Aug 7, 2022

I would like to be able to add comments in the input file (--input-file) after a URL. Everything after " #" (one or more space characters followed by "#") would be treated as a comment.

For example:

https://example.com/11 # 404
https://example.com/22 # important
https://example.com/3  # todo: download related media

For this input, gallery-dl currently prints an error:

[gallery-dl][error] No suitable extractor found for 'https://example.com/11 # 404'

The following format works, but it is less readable:

# 404
https://example.com/11
# important
https://example.com/22
# todo: download related media 
https://example.com/33

Moreover, gallery-dl already expects only one URL per line, so I can't do this:

https://example.com/11 https://example.com/22
https://example.com/33
[gallery-dl][error] No suitable extractor found for 'https://example.com/11 https://example.com/22'

Also, no valid URL contains a "plain" space character. In a URL, a space appears either as %20 or as + (in the search params).


Addition (why I noted these two things):

...so it will not be a breaking change.
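
The point about spaces can be checked with Python's standard `urllib.parse` module. This is only an illustration of how spaces are percent-encoded; the sample strings are made up and nothing here is taken from gallery-dl.

```python
from urllib.parse import quote, quote_plus, urlencode

# In a path segment, a space is encoded as %20.
path_part = quote("my file.jpg")
print(path_part)   # my%20file.jpg

# In form-style query strings, a space is encoded as +.
query_value = quote_plus("cat ears")
query_string = urlencode({"tags": "cat ears"})
print(query_value)   # cat+ears
print(query_string)  # tags=cat+ears
```

Since an encoded URL never contains a literal space, a space followed by "#" is a reasonably safe marker for the start of a comment.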

Contributor

Hrxn commented Aug 9, 2022

The following format works, but it is less readable:

# 404
https://example.com/11
# important
https://example.com/22
# todo: download related media 
https://example.com/33

Debatable...

I think it's pretty straightforward to understand.
But it's for mikf to decide if he wants to extend parsing support here for trailing comments in line...

Moreover, gallery-dl already expects only one URL per line, so I can't do this:

https://example.com/11 https://example.com/22
https://example.com/33

Yes, this is by design? I mean, some kind of delimiter has to be used..

Also, no valid URL contains a "plain" space character. In a URL, a space appears either as %20 or as +.

Yes, this is actually how it's supposed to be, though..
https://url.spec.whatwg.org/#url-units
https://url.spec.whatwg.org/#url-code-points

mikf added a commit that referenced this issue Aug 10, 2022
everything after the first " #" (space + hash) gets ignored
Owner

mikf commented Aug 10, 2022

The following format works, but it is less readable:

Personally, I add blank lines between comment-URL blocks, so that I can triple-click URLs without having anything extra selected.

# 404
https://example.com/11

# important
https://example.com/22

# todo: download related media 
https://example.com/33

Also, no valid URL contains a "plain" space character. In a URL, a space appears either as %20 or as + (in the search params).

True, but gallery-dl can usually handle both the URL-escaped and the unescaped version of a URL.

mikf closed this as completed Aug 10, 2022
Author

AlttiRi commented Aug 10, 2022

Thanks.

I think \t should also be counted as a space.
Tabs are useful for aligning the comments, since the editor pads them automatically (this works fine in Notepad and Notepad++).


When you have hundreds of URLs (for regularly downloading recently added content with --download-archive and --abort 10), it is easier to just mark a URL with a comment after it than to split the sequence of URLs with a comment line.

https://example.com/124

# other comment
# https://example.com/13

https://example.com/14345
https://example.com/15

# some comment
https://example.com/16

https://example.com/17

# 404
# https://example.com/18

https://example.com/19876

# 404
# https://example.com/404

https://example.com/4

versus:

https://example.com/124
# https://example.com/13   # other comment
https://example.com/14345
https://example.com/15
https://example.com/16     # some comment
https://example.com/17
# https://example.com/18   # 404
https://example.com/19876
# https://example.com/404  # 404
https://example.com/4
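
A minimal sketch of how a file in the second layout could be read, assuming the rule discussed in this thread (a run of spaces or tabs followed by "#" starts a comment, and a line that begins with "#" is ignored entirely). The helper name `active_urls` is hypothetical and this is not gallery-dl's implementation.

```python
import re

# Hypothetical helper for illustration; not gallery-dl's actual parser.
# A trailing comment is a run of spaces/tabs, then "#", then the rest of the line.
_TRAILING_COMMENT = re.compile(r"[ \t]+#.*$")


def active_urls(text):
    """Return the URLs in `text`, skipping blank and commented-out lines."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        urls.append(_TRAILING_COMMENT.sub("", line))
    return urls


sample = """\
https://example.com/124
# https://example.com/13   # other comment
https://example.com/16     # some comment
# https://example.com/18   # 404
https://example.com/4
"""

print(active_urls(sample))
# ['https://example.com/124', 'https://example.com/16', 'https://example.com/4']
```

With this layout, disabling a URL or annotating it is a one-line change and the list stays one URL per line.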

Author

AlttiRi commented Sep 8, 2022

Thank you.

Works fine with some extractors.

However, it does not work with Pixiv if there is more than one space character before the # sign:

gallery-dl: No suitable extractor found for 'https://www.pixiv.net/en/users/42083333    '

if " #" in line:
    line = line.partition(" #")[0]
elif "\t#" in line:
    line = line.partition("\t#")[0]

An example fix:

if " #" in line or "\t#" in line: 
    line = re.split("[ \t]+#", line, 1)[0]
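
The difference between the two approaches can be seen on a single line. This is a small standalone comparison, assuming a comment text ("some note") that is made up for the example.

```python
import re

# Four spaces before "#"; the comment text is illustrative only.
line = "https://www.pixiv.net/en/users/42083333    # some note"

# partition(" #") cuts at the single space directly before "#",
# so the spaces in front of it stay attached to the URL.
with_partition = line.partition(" #")[0]
print(repr(with_partition))

# Splitting on a whole run of spaces/tabs followed by "#" removes the run.
with_regex = re.split(r"[ \t]+#", line, maxsplit=1)[0]
print(repr(with_regex))
```

The first result still ends in whitespace, which is what an extractor with a strict URL pattern then rejects. The second is the bare URL.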

Owner

mikf commented Sep 9, 2022

Sorry for the oversight. I must've only tested with URLs where extra whitespace/characters after a URL don't matter and just assumed it would work for all.

Fixed in bdad9c4

Author

AlttiRi commented Oct 6, 2022

There is a minor bug (with bdad9c4) when both "\t#" and " #" appear in one line:

https://example.com/123      #not-followed #3d
                       ^ \t is here
gallery-dl: No suitable extractor found for 'https://example.com/123      #not-followed'
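
A sketch that reproduces this behavior. It assumes the check order from the snippet quoted earlier in the thread (" #" tested before "\t#") plus a trailing-whitespace strip. Whether bdad9c4 looks exactly like this is an assumption, and both function names are hypothetical.

```python
import re

line = "https://example.com/123\t#not-followed #3d"


def strip_comment_in_order(line):
    # Assumed shape of the earlier logic: " #" is checked before "\t#",
    # so a later " #" wins over an earlier "\t#".
    if " #" in line:
        line = line.partition(" #")[0]
    elif "\t#" in line:
        line = line.partition("\t#")[0]
    return line.rstrip()


def strip_comment_earliest(line):
    # Cut at the first run of spaces/tabs that is followed by "#",
    # regardless of which whitespace character it is.
    return re.split(r"[ \t]+#", line, maxsplit=1)[0]


print(repr(strip_comment_in_order(line)))
# 'https://example.com/123\t#not-followed'
print(repr(strip_comment_earliest(line)))
# 'https://example.com/123'
```

Treating spaces and tabs as one character class avoids having to decide which separator to look for first.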

mikf added a commit that referenced this issue Oct 8, 2022
and move 'parse_inputfile()' to util.py