Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add differential updates capabilities #192

Closed
ameshkov opened this issue Oct 18, 2023 · 47 comments
Closed

Add differential updates capabilities #192

ameshkov opened this issue Oct 18, 2023 · 47 comments

Comments

@ameshkov
Copy link
Member

ameshkov commented Oct 18, 2023

We would like to add support for differential updates to the filter lists, but for that the compiler should be extended pretty substantially.

Having differential updates will allow ad blockers download filter lists updates much more frequently.

Here's the idea:

  1. When it is building a filter list, the platform directory and its contents may already exist. Let's take a simple case, when we're running there's already this directory structure:

    platforms/android/filters/1.txt
    platforms/android/filters/1_optimized.txt
    
  2. If this is the case, here's what it does:

    1. Retrieves the current filter list version from the filter file.
    2. Calculates the diff between the old state of 1.txt and the new state of 1.txt. We should use standard diff format for that, not unified, no context: https://en.wikipedia.org/wiki/Diff
    3. The diff should be stored in separate files:
    platforms/android/filters/1.diff.json
    platforms/android/filters/1_optimized.diff.json
    

    The diff.json file format:

    {
        "diffs": [
            {
                "old": "1.0.0.0",
                "new": "1.0.0.1",
                "patch": "Here goes the diff"
            },
            {
                "old": "1.0.0.1",
                "new": "1.0.0.2",
                "patch": "Here goes the diff"
            }
        ]
    }

    Note, that the diffs array is sorted (older versions first).

  3. If the diff is larger than X bytes, we remove all diffs altogether. The diff.json will be empty: {} or { diffs: [] } or just an empty file.

  4. Ideally, I would like to make this mechanism available to all filter lists, not just the ones made by AdGuard. In order to do that we should allow indicating diff file URL in the filter list's metadata.

    We need two new metadata fields for that:

    1. Diff-URL: https://xxxxx.com/filter.diff.json -- the address of the file with filter list diffs.
    2. Diff-Expires: 1 hour -- expiration time of the filter list when differential updates are available. Note, that Expires continues to work as it was working before, i.e. once in a while AdGuard will do the so-called "full sync". However, we can and should increase it when differential updates are available.
  5. Also, I'd like us to provide a separate tool that builds diffs instead of having this all inside the filters compiler (which is very AdGuard-specific).

    Here's how it should work:

    ./filters-diff-builder
    	--new-list-path="./filter.txt"
    	--prev-list-path="./filter.old.txt"
    	--diff-expires="1 hour"
    	--diff-url="https://xxxxx.com/filter.diff.json"
    	--diff-path="./filter.diff.json"
    	--max-diff-size=200000
    	--max-diff-history=30
    
    • new-list-path - path to the new version of the filter list. The list will be modified in result, Diff-URL and Diff-Expires will be added (or replaced if they're already there).
    • prev-list-path - path to the previous version of the filter list.
    • diff-expires - value of Diff-Expires to be used.
    • diff-url - URL where the diff.json file will be published.
    • max-diff-size - maximum size of a diff file in bytes. If the resulting diff.json is larger than the specified value, it becomes empty effectively disabling differential updates.
    • max-diff-history - maximum number of diffs stored in the diff.json file.

Changes to how filter lists updates are checked

  1. First of all, differential updates are only used when all three Version, Diff-URL and Diff-Expires are specified in the filter list metadata.
  2. It requires the ad blocker to store the original filter list text.
  3. Every Diff-Expires period ad blocker should download the diff.json and then apply the diffs.

How the diffs are applied

First of all, any error while applying a diff should disable differential updates. In this case the ad blocker should back off and wait until it's time for the full sync.

The algorithm should be the following:

  1. Find the current version in the diffs file. If the version is missing, count it as an error and stop differential updates.

  2. Incrementally apply every patch in the diff.json file starting from that version. While applying the patches, check that the "old version -> new version" chain is continuous and have no gaps.

  3. Verify the resulting filter list:

    1. Check that Version: value is the same as the last one in the diff.json.
    2. If there's a Checksum: field in the filter list, check that it is a correct checksum for the list contents.

When to do full sync

  1. If the user manually triggers filters update check, do the full sync.
  2. Generally, follow the Expires: field.

Remarks and Q&A

  1. Why standard diff and not something more "compact"? This is actually a very good question. Diff is not ideal since it stores the "removed" strings as well. On the other hand, diff has great advantages: everyone knows what it is, it is human-readable and there're plenty of tools that understand it. I suggest implementing an MVP that uses diff and see how large the resulting diff files will be and then decide.
  2. What are the recommended values for max-diff-history? I think that it depends on how often the list is updated. For instance, AdGuard Base filter is updated every hour and keeping last 30 diffs should be good enough so that people that use their computer on a daily basis could continue to receive updates every hour.
  3. What kind of traffic economy we're talking about? Actually, not that much if any. I did some preliminary testing with FiltersRegistry and it seems that the size of a 2-day diff for Base filter is usually under 100kb. The whole list size is 5MB. Currently, we download it once in 4 days so this is about 40MB per month. If we switch to differential updates with Diff-Expires: 1 hour + increase the full sync period to 10 days then in the worst case scenario we'll be downloading about 90MB per month. However, this is very unlikely since computers don't usually work 24*7 so I'd assume we can safely divide it by 2. Anyways, I suggest first implementing the filters-diff-builder and see what real numbers we will be getting and then decide on how to proceed.
@ameshkov ameshkov added the enhancement Improvement of existent feature label Oct 18, 2023
@ameshkov
Copy link
Member Author

@gorhill are you on board with this? If you are, any remarks? Would you like to change/improve the spec?

@piquark6046
Copy link
Member

piquark6046 commented Oct 18, 2023

I think that it can distribute a filters list very quickly with low traffic if the size of complied filters list is huge.

@ameshkov
Copy link
Member Author

@piquark6046 yeah, that's exactly the motivation for this change.

@piquark6046
Copy link
Member

By the way, what is a cryptographic hash function Checksum used for the Checksum metadata?

@ameshkov
Copy link
Member Author

@piquark6046 it's something very old:

    /**
     * Calculates checksum
     * See:
     * https://adblockplus.org/en/filters#special-comments
     * https://hg.adblockplus.org/adblockplus/file/tip/addChecksum.py
     *
     * @param header Header lines
     * @param rules  Rules lines
     */
    const calculateChecksum = function (header, rules) {
        let content = header.concat(rules).join('\n');
        content = normalizeData(content);
        const checksum = crypto.createHash('md5').update(content).digest('base64');

        return `! Checksum: ${stripEnd(checksum.trim(), '=')}`;
    };

@gwarser
Copy link

gwarser commented Oct 18, 2023

There will be no issue with request count?

@piquark6046
Copy link
Member

I don't have expertise in information security enough to claim.
But, I suggest that the differential updates capabilities has to use SHA-256 or higher digests because MD5 hash function has its hash collision1.

Footnotes

  1. https://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value

@ameshkov
Copy link
Member Author

@gwarser good question, it depends on the CDN that's used. Generally, bandwidth is more important than the number of requests unless the CDN is too greedy. On the other hand, if we want to achieve faster filter updates, I don't see how we can avoid a huge number of requests.

@ameshkov
Copy link
Member Author

ameshkov commented Oct 18, 2023

@piquark6046 the current checksum is legacy from the old ABP times, any change to the algorithm now will break backwards compatibility so it'll require a new field. Also, tbh, MD5 is an good&old standard way of calculating checksums and it's good enough to prevent list corruption.

PS: it's not mandatory to the spec, just mentioned because some people may be using it.

@DandelionSprout
Copy link
Member

DandelionSprout commented Oct 18, 2023

The way I understand this, without a whole lot of experience with filters-diff-builder, it sounds like delta updates, which I do have experience with. I approve of this concept as it has been explained to me, but the OP makes it sound like the delta update process takes place within FiltersCompiler after it has already imported the lists from their sources.

If my quarter-way wild guess on that was to be correct, the process could be carried over to ABP if not for their quasi-'feature freeze', but adapting it in uBO would require some advanced work.

@gorhill
Copy link
Contributor

gorhill commented Oct 18, 2023

are you on board with this?

Something that requires thoughts, so I will have to think about it while I will keep following the discussion.

@ameshkov
Copy link
Member Author

but the OP makes it sound like the delta update process takes place within FiltersCompiler

It's actually unrelated, just a repo to open the issue, we'll create a new one if everything is okay with the spec.

filters-diff-builder is a new tool that will build diffs. The input for this tool is two filter lists versions: the previous and the new one.

@gwarser
Copy link

gwarser commented Oct 18, 2023

Did you know if ABP implemented this in some form? https://issues.adblockplus.org/ticket/6833/ ?

@ameshkov
Copy link
Member Author

I am not aware of ABP implementation, I wonder is there any person from eyeo we can tag on GH and ask?

@gorhill
Copy link
Contributor

gorhill commented Oct 18, 2023

@ameshkov
Copy link
Member Author

Seems to be implemented for MV3 list versions.

Here's a list with diff updates support: https://easylist-downloads.adblockplus.org/v3/full/easylist.txt
Diff: https://easylist-downloads.adblockplus.org/v3/diff/easylist/740f51eefafe121afbffdb2be4ef2f97b26cb6c3.json

I am not fully sure about that, but it seems to be a very MV3-specific implementation that does not handle our case. Static ruleset is the same until the extension is updated so the extension simply downloads this 740f51eefafe121afbffdb2be4ef2f97b26cb6c3.json every time and it returns a diff from the static ruleset to the latest state of the list.

@gorhill
Copy link
Contributor

gorhill commented Oct 18, 2023

Ok here is how I see your proposal from the point of view of what is currently there in uAssetsCDN:

Retrieves the current filter list version from the filter file.

This means having to add a ! Version: filed in each list. Not an issue, update commits are now tagged and this would be the version for all lists -- the new ! Version: field would be updated only for lists which content has changed.

Calculates the diff between the old state of 1.txt and the new state of 1.txt. We should use standard diff format for that, not unified, no context: https://en.wikipedia.org/wiki/Diff

Ok.

I tried git diff --unified=0 2023.10.18.97 2023.10.18.408 > /tmp/diff.patch and I got a patch of 49K in size.

For git diff --unified=0 2023.10.18.408 2023.10.18.777 > /tmp/diff.patch and I got a patch of 8.4K in size.

For git diff --unified=0 2023.10.18.408 2023.10.17.1128 > /tmp/diff.patch, I get 54K

The diff should be stored in separate files:

Since a client would not know which files changed, this means to try and fetch a diff file for every single asset.

From the point of view of uAssetsCDN, I am not sure about this given that all changes are committed in batch every few hours, currently every six hours, so it might be more beneficial to have (version => file => diff) in a single file to fetch. The size of the file would grow faster though.

An advantage of batch diffing is that it's beneficial in case of commits affecting many files, which is something occurring by design with uAssetsCDN.


The size of the file would grow faster though.

Probably best to have the "patch" be a file path to a patch rather than the patch itself with batch update -- so maybe it's best for uBO to have its own custom solution.

@ameshkov
Copy link
Member Author

ameshkov commented Oct 19, 2023

so it might be more beneficial to have (version => file => diff) in a single file to fetch

The spec can be extended to cover the possibility of a unified diff as well.

The unified diff.json could look like this:

{
    "easylist": {
        diffs: [....],
    },
    "ublock-unbreak": {
        diffs: [....],
    }
}

This would require adding a new meta field FilterId: to the list and allow multiple filter lists have the same Diff-URL.

An advantage of batch diffing is that it's beneficial in case of commits affecting many files, which is something occurring by design with uAssetsCDN.

With several files there's a little bit higher risk of a race condition, i.e. when the files on the CDN are changed when fetching diffs are in progress. On the other hand, this risk is still there when you're doing the full sync and fetch full lists one by one. The only chance to completely avoid it is to unite everything into a single file.

Probably best to have the "patch" be a file path to a patch rather than the patch itself with batch update -- so maybe it's best for uBO to have its own custom solution.

The point of the proposal is to allow all list maintainers to provide diff-updatable subscriptions. If we all implement custom solutions we only solve this problem for ourselves, but leave other lists without this capability. I am all for incorporating changes to the original proposal if this helps achieve a universal solution.

Regarding having paths instead of diffs, this indeed makes sense and allows to save lots of bandwidth. Initially, I have not added it to the proposal to keep it as simple as possible. On the other hand, taking another look at this all, it does not look too complex to me.

@gorhill
Copy link
Contributor

gorhill commented Oct 20, 2023

I've given more thoughts on this, again from how things are currently done in uBO.

First an observation which I guess might apply to AdGuard too: it's is not possible to apply diff patch to list content which is expanded client-side, i.e. the !#include directive. In uBO, there are expanded client-side, then the final result is cached locally. This can't be diff-patched with a diff computed on the server, the server just doesn't know what is the end result client-side.

So this means that diff-patching can only be done for lists which content is not further modified client-side. This is do-able in uBO for select key lists where diff-patching would be more important than avoiding to expand an !#include directive.

Now regarding batch-commit currently being used in uAssetsCDN, I don't see the need for some central file describing patches etc. The way I see it, a simple field in the header of the file like Commit version: would be enough for a diff-patcher to know what to do next: just get the diff-patch from the server according to the content of the Commit version: field.

For example, if the commit-version field is 2023.10.20.407, then fetch patches/2023.10.20.407.patch from the server. If the patch does not exist (404 or empty file, whichever is best to spare the server), nothing to patch. If it exists, find the changes in the patch which applies to the list. Repeat for other assets (while keeping a copy of the patch for other lists which might need it).

Once done, repeat the process -- as a result of applying patches/2023.10.20.407.patch, the Commit version: field will have been updated, which mean we can repeat the above with the new commit version, until no more diff-patching required for any candidate lists.

This way the only requirement is to have a Commit version: field for lists which are candidate for diff-patching, no need for maintaining a file to be the authority to instruct which patches need to apply to which files.

To make things even fancier with this approach is to keep a short list of last commit versions applied. You may have noticed the commit version currently in use at uAssetsCDN are time-based, and given this the client code could infer the pace of scheduling diff-based update according to the average time difference between the latest commits, so essentially this allows to control server-side how often clients run diff-based updates.

Also, pruning a growing list of patches in patches/ would be simply a matter of deleting all .patch older than a specific time, or delete .patch except for the most recent n ones, or a combination of both, or any other heuristic. In the end, if the client requests a patch which does not exist, it's a noop, and the regular current update cycle will eventually force an update of a list.

So unless there is a flaw in my thinking I am not seeing, this is the approach I want for uBO, and I consider it to be specific, because I would know exactly where to fit this in uBO's code with minimal changes and I feel having to use an external framework would just make things more difficult on my side, i.e. take longer to reach the goal and require modifying core code path in uBO, which I rather avoid.

@ameshkov
Copy link
Member Author

ameshkov commented Oct 20, 2023

@gorhill

First an observation which I guess might apply to AdGuard too: it's is not possible to apply diff patch to list content which is expanded client-side, i.e. the !#include directive

Yeah, it does apply indeed. The way I see it a generic solution requires expanding #include after the patch is applied, i.e. requires storing the original non-expanded list text. It's doable, but not an ideal thing to do.

Regarding the recursive approach with the Commit version: field, it does make sense to me. It should be rather easy to maintain even for an individual filter list author. The only thing that I don't understand is how do you want to make it work with batch patches that you mentioned earlier? Aren't there edge cases that would need to be handled client-side?

I'll try to amend the spec to make it more like this recursive approach that you're going to use for uAssetsCDN so that if at some point you decide to support it, it was easier to do.

Just one more thing that I wanted to propose. Since these metadata fields may influence how different ad blockers process filter lists, I suggest using platform prefix for fields that are not universally supported. Something like uBO Commit version, ADG Diff-URL.

@gorhill
Copy link
Contributor

gorhill commented Oct 20, 2023

how do you want to make it work with batch patches that you mentioned earlier?

Concretely,

Let's use ! Diff-patch version: instead of ! Commit version: .

The Diff-patch version value is kept in meta data describing an asset, let's name it diffpatchVersion for the sake of conversation.

For example in uBO there is already metadata associated with a cached asset: last time it was read from cache, time written to cache, "age" of the resource, etc. So diffpatchVersion is just a new value to keep track of for any given cached asset. Though I suppose if someone doesn't want to keep track of metadata structure, it simply becomes a matter of loading the whole cached content of the asset, then extract the Diff-patch version field whenever needed.

Then a diff-update cycle becomes a matter of iterating through the metadata of all cached assets. If an asset does not have diffpatchVersion value then skip over the asset.

If one does have such value, launch a diff-update operation, i.e. fetch the diff-patch as patches/${diffpatchVersion}.patch. If the fetch fails then skip over the asset.

Then parse and apply the patch -- that will require the most work to first write code to extract the specific subpatch which apply to the asset being patched, then code to apply the patch itself -- this is where a library makes sense. Keep fetched diff-patches around in case they are needed again for other assets during the current cycle in order to avoid fetching from server again.

For the case of one diff-patch per asset, then the same code above would work, except of course that ! Diff-patch version: would also contain the name of the asset, ! Diff-patch version: 2023.10.20.407.filters_filters.min.txt, so if you want to spec this, it should be that a client needs to be prepared that a diff-patch downloaded from a server may contain subpatches for many assets (again, code suitable for a library).

As you said originally, using standard unified=0 diff makes all this easy to spec and implement server-side, it just comes down to library code which could be used by AdGuard and uBO to apply a patch given a (diff-patch, resource name, text), where diff-patch would be standard unified=0 diff like:

Example diff
diff --git a/filters/filters.min.txt b/filters/filters.min.txt
index c79dca64..86639f9d 100644
--- a/filters/filters.min.txt
+++ b/filters/filters.min.txt
@@ -4 +4 @@
-! Last modified: Tue, 17 Oct 2023 19:43:41 +0000
+! Last modified: Wed, 18 Oct 2023 05:46:20 +0000
@@ -27,2 +26,0 @@ youtube.com##ytd-video-masthead-ad-advertiser-info-renderer,ytm-promoted-sparkle
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i)
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true)
@@ -64,6 +61,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\,
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i)
@@ -15844 +15836 @@ watashiwasugoidesu.com##.watas-bottom-vi
-@@||freereceivesms.com/ados.js$script,1p
+@@||freereceivesms.com^$script,1p
@@ -23286,0 +23279 @@ y2down.cc##+js(nowoif)
+@@||cdn.fuseplatform.net/publift/$3p,script,xhr,domain=quackr.io
diff --git a/filters/privacy.min.txt b/filters/privacy.min.txt
index 9f6cb1bf..f65844b3 100644
--- a/filters/privacy.min.txt
+++ b/filters/privacy.min.txt
@@ -7 +7 @@
-! Last modified: Fri, 13 Oct 2023 12:03:17 +0000
+! Last modified: Wed, 18 Oct 2023 05:46:43 +0000
@@ -514,0 +515 @@ natgeotv.com##+js(set, Visitor, {})
+www.lenovo.com##+js(aost, history.replaceState, injectedScript)
diff --git a/filters/quick-fixes.txt b/filters/quick-fixes.txt
index 22c0b3af..9912fc03 100644
--- a/filters/quick-fixes.txt
+++ b/filters/quick-fixes.txt
@@ -4 +4 @@
-! Last modified: Tue, 17 Oct 2023 20:54:25 +0000
+! Last modified: Wed, 18 Oct 2023 03:04:34 +0000
@@ -82,2 +82 @@ plagiarismchecker.co##[class][style*="display"][style*="block"]:has(a img[src^="
-! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i)
+! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/ - 03c8985d
@@ -110 +108,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /"playerAds.*?gutParams":\{"ta
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true)
@@ -119,6 +116,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\,
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i)
@@ -156,4 +147,0 @@ sankaku.app##+js(no-xhr-if, googlesyndication)
-! https://github.com/uBlockOrigin/uAssets/issues/19982
-unsplash.com#@#div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div):remove()
-unsplash.com##div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div)
-

In that example diff, a ! Diff-patch version: field would also be changed as a result of applying the patch, so once all assets are iterated, we repeat the cycle until no patching occurs.

So in summary, a library which take as input the above example diff, the name of the resource to lookup in the patch (might be something which requires its own field too, i.e. ! Diff-patch name: /filters/filters.min.txt), and the text on which to apply the patch is what we can share and work on together. Once we have that code, we are both very close to supporting diff-update.


Forgot to mention, using a time-based version schema helps also to preemptively skip over a pointless fetch. As you see, I currently use year.month.day.minute-of-the-day to generate version. A client can use this to decide that since the version correspond to an asset updated only 37 minutes ago (just an example), it will skip over trying to diff-update the asset. This could be part of a spec (or not). As mentioned in previous comment, such version schema could also be used to dynamically infer schedule period.

@ameshkov
Copy link
Member Author

if you want to spec this, it should be that a client needs to be prepared that a diff-patch downloaded from a server may contain subpatches for many assets (again, code suitable for a library).

Good point, a single patch can indeed contain diffs for several assets there, it's on the client to figure out which list it actually corresponds to and it's doable if the client takes into account the files locations (URLs).

I'll draft an updated spec taking this all into consideration. Just one thing to note, I still think that for a generic case we need to have Diff-Expires. Also, the list author may want to be able to control how often the ad blocker downloads diffs.

@gorhill
Copy link
Contributor

gorhill commented Oct 27, 2023

Patches need to provide the time at which they were created. When repeatedly applying patches to bring a list up to date, there is a yet unknown number of patches to apply. Eventually the last patch will be fetched, but there is no way for the updater to know that this is the last patch, and this will result in a pointless fetch to the server, which will return with 404 or an empty file (as discussed elsewhere).

However if the creation time of a patch is part of the file itself, then the updater can infer that if (patchCreationTime + diffExpires) is in the future, there is no point to attempt a fetch. Currently on my side I use the patch name which is time-related, but if this is not part of the spec, 3rd-party solutions will choose differently and there won't be a way to prevent the pointless fetch at the end of the sequence.

So I propose a special line, the first one in the file, to provide that information:

patch time:[creation time]

Where [creation time] is the number of seconds since the epoch (we don't need ms resolution, and I prefer a single integer than a full blown date-time format). It could be optional, in which case an updater will end up causing pointless fetches as it won't be able to tell whether there is a good chance of a patch existing on the server, which is what (patchCreationTime + diffExpires) < now allows.

@gwarser
Copy link

gwarser commented Oct 28, 2023

There is already ! Last modified: in uAssets lists. Used to improve updates. Won't diff patch this too? It's "human-readable".

@gorhill
Copy link
Contributor

gorhill commented Oct 28, 2023

Yes ok so when batch-updating, the minimum value of (lastModified + diffExpires) of all fetched lists would tell whether there is likely another patch available if the result is before now. Will have to see how this goes in practice next dev cycle.

gorhill added a commit to uBlockOrigin/uAssets that referenced this issue Oct 28, 2023
@gwarser

This comment was marked as resolved.

@gorhill
Copy link
Contributor

gorhill commented Oct 28, 2023

Unfortunately ! Last modified: is not part of any spec and the field name varied across lists, if present at all. For uBO lists I just went with what EasyList was using. Spotted in various lists:

! Updated:
! TimeUpdated:
! Last change:
! Last update:

At least if it's part of the spec here, we will know where to find the lastModified information in a reliable manner.

@ameshkov ameshkov changed the title Add differential updates capabitlities Add differential updates capabilities Oct 29, 2023
@ameshkov
Copy link
Member Author

Tbh, the idea of relying on creation time does not seem reliable. When the patch is created it is known whether the patch is final or not and it can be somehow indicated in the patch file. The downside is that this approach requires removing this marker from the older patch files.

I propose extending our custom diff directive with a new optional field: final: bool. If every diff directive in a file has final: true, then the ad blocker knows that this is a final patch and there's no need to attempt downloading the next file.

@gorhill
Copy link
Contributor

gorhill commented Oct 29, 2023

new optional field: final: bool

This means revisiting all previous patches to change this field to false.

Anyway, on my side it doesn't matter in the end as I found that recursively applying patches does not work, in retrospect it was silly to think this was going to work, and in the end the simplest solution (which I was intending to use anyway before realizing recursion does not work) is that each time a new patch is created, all the existing patches will be revisited and recreated with the diff of all changed files since the tagged release matching the patch vs current state of the repo. This means there will be always only a single patch to apply, and thus final field is not needed, and neither is the creation time.

@ameshkov
Copy link
Member Author

as I found that recursively applying patches does not work, in retrospect it was silly to think this was going to work

What was the problem?

@gorhill
Copy link
Contributor

gorhill commented Oct 29, 2023

What was the problem?

  • list L1 says next patch will be at P1 in some future (through Diff-Path field)
  • time pass
  • P1 is generated, L1 hasn't changed => there is no diff block for L1 in P1
  • time pass
  • P2 is generated, L1 changes => diff block for L1 is added to P2

In the end, L1 is never updated because its patch pointer never points to P2, updates end up stalled for L1.

Possible solutions:

  • Regardless of whether there are actual changes or not, always modify the Diff-Path field to point to next future patch, this will ensure there will always be a diff block even when a list content doesn't really change, just so that the Diff-Path field can be updated
  • Re-generate P1 server-side to contain all changes leading to current state of repo

I am going with second one, since the benefit is one single request to bring all the lists up to date, instead of having to fetch all the patches in between. This also removes the issue about whether the code which fetch patches wonder if there is another more recent patch to apply: every patch always lead to the most current version of the lists.

@gorhill
Copy link
Contributor

gorhill commented Oct 29, 2023

By the way, maybe we could dispense requiring a Diff-Name field if we use some notation to add diff name to Diff-Path, for example from what I have so far, it could be something like:

! Diff-Path: ../patches/2023.10.28.1127.patch:ublock-filters

or

! Diff-Path: ../patches/2023.10.28.1127.patch#ublock-filters

@ameshkov
Copy link
Member Author

ameshkov commented Oct 29, 2023

@gorhill

Possible solutions

Yeah, I was also thinking about that. I think we'll just won't be relying on VCS for generating patches and will be just looking at the list version + the previous list file. So if there're no changes the patch will stay empty. In our case relying on VCS is quite problematic by itself since we're generating different platform-specific filter lists versions and keep only "full" unfiltered one in VCS.

By the way, maybe we could dispense requiring a Diff-Name field if we use some notation to add diff name to Diff-Path, for example from what I have so far, it could be something like

I think we don't save much on this, but make it a little bit harder to quickly figure out if this is a batch-updatable list or not.

@ameshkov
Copy link
Member Author

ameshkov commented Nov 10, 2023

@gorhill
Sorry for the late reply.

Batch updates will be mostly used by uBO filters so we should indeed to design it according to your needs. I'll update the spec accordingly.

Also, for the patch creation time, I also still think it's needed, and again, simpler if stored in the patch

Wouldn't it be better to standardize Last modified? This way it can be used even when the filter list does not support differential updates. I.e. if a list has Last modified then we assume that this is when the list was last updated. If a list does not have it, then we assume that the last modified time is the moment when it was downloaded.

ameshkov added a commit to ameshkov/diffupdates that referenced this issue Nov 10, 2023
See the discussion here: AdguardTeam/FiltersCompiler#192 (comment)

Updated the spec according to the discussion:

1. Diff-Name is replaced with a resource name specified as a part of Diff-Path
2. Optional timestamp added to the diff directive
@ameshkov
Copy link
Member Author

@Kishlay-notabot
Copy link

but the OP makes it sound like the delta update process takes place within FiltersCompiler

It's actually unrelated, just a repo to open the issue, we'll create a new one if everything is okay with the spec.

filters-diff-builder is a new tool that will build diffs. The input for this tool is two filter lists versions: the previous and the new one.

@ameshkov

Hi,
I am a complete beginner so I apologise in advance for my lack of understanding.
I read the diff-update repo in detail which you have created and I superficially understood the concept of the delta change acceptance and removal method of the old content in the filter lists.
But here you have stated that the tool has 2 inputs, the old filter list and the new filter list.
So I think that how does this make a change in the bandwidth consumed?
Because, firstly I have a doubt that where will the delta comparison of the updated-filter-list.txt and the primary-list.txt (the older one) will occur ? And I assume this will happen on the client side obviously. And if this comparison has to happen in this algorithm, we have already fetched the whole filter list from the server in order to accomplish this. So doesn't this cancel out our attempt to reduce requests ? Because already we are downloading the filter list from the server as a whole and just accepting the changes?

@105th
Copy link
Member

105th commented Jan 9, 2024

@Kishlay-notabot The comparison between the old and new filters will be calculated on the server side. When checking for filter updates, if new filters are found, a patch will be created. Subsequently, both the filters and patches are updated in the CDN. For a detailed explanation of how this process works, you can refer to the following link: https://github.com/AdguardTeam/FiltersRegistry?tab=readme-ov-file#how-to-build-filters-and-patches and this: https://github.com/AdguardTeam/FiltersRegistry/blob/master/scripts/auto_build.sh .

@Kishlay-notabot
Copy link

@105th Thanks for the links, will check them out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants