[Kemono] Downloading revisions similar to the webpage #5013

Closed
1223334444abc opened this issue Jan 4, 2024 · 9 comments
Comments

@1223334444abc

I have carefully examined the “revisions” provided by the API and found that the website actually merges “revisions” with the same “edited” time, treating them as the same and using the earliest “revision_id”. I don’t know how to use a similar strategy to merge historical versions in gallery-dl. Can you support automatic merging of historical records with the same “edited” time in the new version (theoretically, their content should be exactly the same)?

#4706
#4727
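The merging strategy described above could be sketched roughly as follows. This is only an illustration and not gallery-dl code: `merge_by_edited` is a hypothetical helper, and it assumes each revision is a dict with `revision_id` and `edited` keys, as the revisions API appears to return. Revisions without an `edited` value are left unmerged.

```python
def merge_by_edited(revisions):
    """Collapse revisions that share an 'edited' timestamp.

    For each distinct timestamp, the revision with the smallest
    'revision_id' (the earliest one) is kept. Revisions without an
    'edited' value are never merged with each other.
    """
    merged = {}
    for rev in revisions:
        # Fall back to a per-revision key when 'edited' is missing (assumption).
        key = rev.get("edited") or ("no-edited", rev["revision_id"])
        kept = merged.get(key)
        if kept is None or rev["revision_id"] < kept["revision_id"]:
            merged[key] = rev
    # Newest first, assuming larger revision IDs are more recent.
    return sorted(merged.values(), key=lambda r: r["revision_id"], reverse=True)
```

Grouping by timestamp alone treats two revisions as identical without comparing their content, so it is only as reliable as the `edited` field itself.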

In addition, I hope that the current version folder and historical versions will have different names, for example:
For the current version: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}]{title}"],
For historical versions: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}][{revision_id}]{title}"]
How can I write the configuration file correctly to achieve this goal?

@Hrxn
Contributor

Hrxn commented Jan 4, 2024

> For the current version: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}]{title}"],
> For historical versions: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}][{revision_id}]{title}"]
> How can I write the configuration file correctly to achieve this goal?

What do you mean?
You've just given an example for "directory", why don't you use that?

@1223334444abc
Author

1223334444abc commented Jan 4, 2024

> > For the current version: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}]{title}"],
> > For historical versions: "directory": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}][{revision_id}]{title}"]
> > How can I write the configuration file correctly to achieve this goal?
>
> What do you mean? You've just given an example for "directory", why don't you use that?

I don’t know how to use different ‘directory’ settings for the current and historical versions of the Kemono page. Is there a way to set the ‘directory’ condition to ‘current version’? It seems like there is no such parameter.

        "kemonoparty": 
        {
            "username": "123456",
            "password": "123456",
            "metadata": true,
            "comments": true,
            "favorites": "artist",
            "revisions": true,
            "directory":
            {
                "revision_index == 0": ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}]{title}"],
                ""                   : ["[{service}]{username}", "[{date:Olocal/%Y%m%d}][{id}][{revision_id}]{title}"]
            },
            "filename": "{num}.{extension}",
            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post",
                    "filename": "content.txt",
                    "mode": "custom",
                    "format": "{content}\n{embed}\n"
                }]
        },

I tried to write the settings like this, but it didn’t work. The current version still outputs [{revision_id}]=[0].

The output is like this, including the incorrect current version file name and many unnecessary historical versions:

* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][0]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][0]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][0]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][8036383]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][8036383]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][8036383]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7963350]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7963350]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7963350]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7291886]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7291886]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7291886]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7019935]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7019935]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][7019935]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6778173]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6778173]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6778173]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6432407]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6432407]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6432407]ABCDEFGH\3.jpg
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6112308]ABCDEFGH\1.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6112308]ABCDEFGH\2.png
* .\gallery-dl\[fanbox]abcdefg\[20211115][1234567][6112308]ABCDEFGH\3.jpg
 ......

@a84r7a3rga76fg

a84r7a3rga76fg commented Jan 4, 2024

This is Kemono's fault. They refuse to show the ID of the current version to discourage scraping because if they were to ban it outright they'd face more backlash than they can handle.

@1223334444abc
Author

> This is Kemono's fault. They refuse to show the ID of the current version to discourage scraping because if they were to ban it outright they'd face more backlash than they can handle.

I found that using the ‘edited’ time seems to achieve a similar effect as described above, but I don’t know how to obtain content in gallery-dl according to the ‘edited’ time. For example, the latest ‘edited’ time is considered the current version, and ‘edited’ time is used as the basis for obtaining different historical versions.

@1223334444abc
Author

Oh no, it seems that some services do not have an ‘edited’ time.

https://kemono.su/patreon/user/3295915/post/88413981
https://kemono.su/api/v1/patreon/user/3295915/post/88413981/revisions

Now it seems that downloading can only be done by comparing the content.

mikf added a commit that referenced this issue Jan 15, 2024
A SHA1 hexdigest of other relevant metadata fields like
title, content, file and attachment URLs.

This value does NOT reflect which revisions are listed on the website.
Neither does 'edited' or any other metadata field (combinations).

@mikf
Owner

mikf commented Jan 16, 2024

I looked a bit deeper into this whole revisions thing and found that

  1. There is no rhyme or reason for certain revisions being listed on the website.

    For example, ALL 6 versions of your [linked post](https://kemono.su/patreon/user/3295915/post/88413981) as well as ALL 10 of [this one](https://kemono.su/patreon/user/3161935/post/68231671) are 100% identical.

  2. Not all revisions needed to account for every metadata/file change are listed.

    [This post](https://kemono.su/fanbox/user/853087/post/2366569) should have 3 versions, but kemono lists only 2.


Commit 3d68eda adds a revision_hash metadata field when revisions are enabled, which can be used to better detect changed metadata/files between revisions.
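As a rough illustration of what such a hash can look like (this is not the code from commit 3d68eda), the sketch below computes a SHA1 hexdigest over the title, content, file URL and attachment URLs of a revision. The function name and the exact dict layout (`file` and `attachments` entries carrying a `path` key) are assumptions made for the example.

```python
import hashlib
import json


def revision_hash(revision):
    """Return a SHA1 hexdigest over the content-relevant fields of a revision.

    Fields such as 'revision_id' or 'edited' are deliberately left out, so
    two revisions with identical content produce the same digest.
    """
    fields = {
        "title": revision.get("title"),
        "content": revision.get("content"),
        # Assumed layout: 'file' is a dict with a 'path' entry.
        "file": (revision.get("file") or {}).get("path"),
        # Assumed layout: 'attachments' is a list of dicts with 'path'.
        "attachments": [a.get("path") for a in revision.get("attachments") or []],
    }
    # sort_keys gives a stable serialization independent of dict ordering.
    payload = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

Two revisions that differ only in bookkeeping fields then compare equal by hash, while any change to the title, text or linked files yields a different value.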

@1223334444abc
Author

1223334444abc commented Jan 16, 2024

> I looked a bit deeper into this whole revisions thing and found that
>
> 1. There is no rhyme or reason for certain revisions being listed on the website.
>    For example, ALL 6 versions of your [linked post](https://kemono.su/patreon/user/3295915/post/88413981) as well as ALL 10 of [this one](https://kemono.su/patreon/user/3161935/post/68231671) are 100% identical.
>
> 2. Not all revisions needed to account for every metadata/file change are listed.
>    [This post](https://kemono.su/fanbox/user/853087/post/2366569) should have 3 versions, but kemono lists only 2.
>
> Commit 3d68eda adds a revision_hash metadata field when revisions are enabled, which can be used to better detect changed metadata/files between revisions.

I think we don’t need to pay too much attention to which revisions are listed on the website; we just need to save all the different versions (including all file, image and content changes). I hope there can be a switch that automatically merges revisions with the same content, saves the earliest one in each group of identical revisions, and makes the latest group the ‘current version’.

A post with different revisions
For example, this post that contains multiple different versions is downloaded as:

[fanbox]{username}/[20230707][6306504]{…title…} 
[fanbox]{username}/[20230707][6306504][6496356]{…title…} 
[fanbox]{username}/[20230707][6306504][6394907]{…title…} 
[fanbox]{username}/[20230707][6306504][6363466]{…title…}

(I didn’t carefully compare the differences between the versions above, but the download link provided by the author in the first version of this post did disappear in the latest version. Here I assume that different ‘edited’ times indicate different versions.)

I am not sure how to use ‘revision_hash’ to achieve this goal (besides appending it to the file name). Could you please give me some hints? Do I need to write it to the database and set up some skip strategies?
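One possible way to use `revision_hash` for this, sketched as a standalone step outside of gallery-dl: walk the revisions from oldest to newest, start a new group whenever the hash changes, keep the earliest revision of each group, and flag the last group as the current version. `unique_revisions` and the `current` key are hypothetical names, and the input is assumed to be already ordered oldest to newest.

```python
def unique_revisions(revisions):
    """Keep the earliest revision of each run of identical revisions.

    'revisions' must be ordered oldest to newest; each entry needs a
    'revision_hash'. Returns copies with an added boolean 'current' key,
    which is True only for the newest remaining group.
    """
    kept = []
    for rev in revisions:
        # A changed hash marks the start of a new group of identical revisions.
        if not kept or kept[-1]["revision_hash"] != rev["revision_hash"]:
            kept.append(dict(rev, current=False))
    if kept:
        kept[-1]["current"] = True
    return kept
```

Only consecutive duplicates are merged here, so a post that changes from A to B and back to A still yields three versions.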

In addition, some authors will delete published images after a period of time. If you need an example, I can find one for you.

@a84r7a3rga76fg

I believe it's better to download only the unique files from the post {id} or user {user}, periodically fetch the source page of the latest and past versions of the post, and then run a script over the files and metadata to sort the files as reflinks or hard links in a different location. For example, you download them to C:\gallery-dl\kemono and create the reflinks or hard links at C:\gallery-dl\kemono sorted; you can then delete or rearrange anything in C:\gallery-dl\kemono sorted without losing any data.
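The linking step could look something like the sketch below. It is a minimal illustration under stated assumptions: `link_tree` is a hypothetical helper that simply mirrors the download directory with hard links, both directories are on the same filesystem, and the target directory is not nested inside the source. A real script would compute the destination path from the post metadata instead of mirroring the layout.

```python
import os
from pathlib import Path


def link_tree(source_root, sorted_root):
    """Hard-link every file under source_root into sorted_root.

    The relative layout is mirrored. Existing destination files are left
    alone, so the function can be re-run after each download session.
    Returns the list of newly created links.
    """
    source_root = Path(source_root)
    sorted_root = Path(sorted_root)
    created = []
    for src in source_root.rglob("*"):
        if not src.is_file():
            continue
        dst = sorted_root / src.relative_to(source_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        if not dst.exists():
            # A hard link shares the file data; removing dst later
            # does not affect src.
            os.link(src, dst)
            created.append(dst)
    return created
```

Because both names point at the same data, the sorted copy takes no extra disk space and can be deleted freely.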

This is what I'm using for only downloading unique files from the post to C:\gallery-dl\kemono.

"kemonoparty":
{
	"archive-format": "{subcategory}_{user}_{id}_{hash}",
	"archive": "~/gallery-dl/archives/kemono/{subcategory}_{user}.sqlite",
	"base-directory": "C:/gallery-dl/kemono/",
	"directory": ["{subcategory} {user}", "{id} {date!s:.10}"],
	"filename": "{hash}.{extension}",
	"revisions": true,
	"metadata": true,
	"discord":
	{
		"#": "discord-specific settings",
		"archive-format": "{subcategory}_{server}_{channel}_{id}_{hash}",
		"archive": "~/gallery-dl/archives/kemono/{subcategory}_{server}.sqlite",
		"directory": ["{subcategory} {server}", "{channel_name[:25]} {channel}", "{id} {date!s:.10}"],
		"filename": "{hash}.{extension}"
	}
}

mikf added a commit that referenced this issue Jan 26, 2024
set 'revisions' to '"unique"' to have it ignore duplicate revisions

@mikf
Owner

mikf commented Jan 26, 2024

It is now possible to filter duplicate revisions by setting revisions to "unique" (afd20ef).

mikf closed this as completed Jan 26, 2024