Add differential updates capabilities #192
Comments
@gorhill are you on board with this? If you are, any remarks? Would you like to change/improve the spec? |
I think that it can distribute a filter list very quickly with low traffic if the size of the compiled filter list is huge. |
@piquark6046 yeah, that's exactly the motivation for this change. |
By the way, what is a cryptographic hash function Checksum used for the |
@piquark6046 it's something very old:
|
There will be no issue with request count? |
I don't have enough expertise in information security to make a claim. |
@gwarser good question, it depends on the CDN that's used. Generally, bandwidth is more important than the number of requests unless the CDN is too greedy. On the other hand, if we want to achieve faster filter updates, I don't see how we can avoid a huge number of requests. |
@piquark6046 the current checksum is a legacy from the old ABP times. Any change to the algorithm now will break backwards compatibility, so it would require a new field. Also, tbh, MD5 is a good old standard way of calculating checksums and it's good enough to prevent list corruption. PS: it's not mandatory in the spec, just mentioned because some people may be using it. |
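For readers unfamiliar with the legacy checksum, here is a minimal sketch of how it is commonly computed, assuming the ABP-era scheme: MD5 over the list text with the checksum line removed, carriage returns stripped and consecutive newlines collapsed, then Base64-encoded without padding. The function names are illustrative and not part of any spec.

```python
import base64
import hashlib
import re

# Matches a "! Checksum: ..." header line (assumed ABP-style syntax).
CHECKSUM_RE = re.compile(r"^\s*!\s*checksum[\s\-:]+([\w+/=]+).*\n", re.I | re.M)


def normalize(text: str) -> str:
    """Strip CRs, collapse blank lines, and drop the checksum line itself."""
    text = text.replace("\r", "")
    text = re.sub(r"\n+", "\n", text)
    return CHECKSUM_RE.sub("", text)


def calculate_checksum(text: str) -> str:
    """MD5 of the normalized text, Base64-encoded with '=' padding removed."""
    digest = hashlib.md5(normalize(text).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def validate_checksum(text: str) -> bool:
    """True if the list has no checksum field or the field matches."""
    match = CHECKSUM_RE.search(text)
    if match is None:
        return True  # nothing to verify
    return match.group(1) == calculate_checksum(text)
```

This only detects accidental corruption; it offers no protection against deliberate tampering.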
The way I understand this, without a whole lot of experience with If my quarter-way wild guess on that was to be correct, the process could be carried over to ABP if not for their quasi-'feature freeze', but adapting it in uBO would require some advanced work. |
Something that requires thoughts, so I will have to think about it while I will keep following the discussion. |
It's actually unrelated, just a repo to open the issue, we'll create a new one if everything is okay with the spec.
|
Do you know if ABP implemented this in some form? https://issues.adblockplus.org/ticket/6833/ ? |
I am not aware of an ABP implementation. I wonder if there is anyone from eyeo we can tag on GH and ask? |
Seems to be implemented for MV3 list versions. Here's a list with diff updates support: https://easylist-downloads.adblockplus.org/v3/full/easylist.txt I am not fully sure about that, but it seems to be a very MV3-specific implementation that does not handle our case. Static ruleset is the same until the extension is updated so the extension simply downloads this |
Ok here is how I see your proposal from the point of view of what is currently there in uAssetsCDN:
This means having to add a
Ok. I tried For For
Since a client would not know which files changed, this means to try and fetch a diff file for every single asset. From the point of view of uAssetsCDN, I am not sure about this given that all changes are committed in batch every few hours, currently every six hours, so it might be more beneficial to have (version => file => diff) in a single file to fetch. The size of the file would grow faster though. An advantage of batch diffing is that it's beneficial in case of commits affecting many files, which is something occurring by design with uAssetsCDN.
Probably best to have the |
The spec can be extended to cover the possibility of a unified diff as well. The unified diff.json could look like this:
{
  "easylist": {
    "diffs": [....]
  },
  "ublock-unbreak": {
    "diffs": [....]
  }
}
This would require adding a new meta field
With several files there's a slightly higher risk of a race condition, i.e. the files on the CDN being changed while fetching diffs is in progress. On the other hand, this risk is still there when you're doing the full sync and fetching full lists one by one. The only way to completely avoid it is to combine everything into a single file.
The point of the proposal is to allow all list maintainers to provide diff-updatable subscriptions. If we all implement custom solutions we only solve this problem for ourselves, but leave other lists without this capability. I am all for incorporating changes to the original proposal if this helps achieve a universal solution. Regarding having paths instead of diffs, this indeed makes sense and allows saving lots of bandwidth. Initially, I did not add it to the proposal in order to keep it as simple as possible. On the other hand, taking another look at all this, it does not look too complex to me. |
Related discussion: AdguardTeam/FiltersCompiler#192
I've given more thought to this, again from how things are currently done in uBO. First an observation which I guess might apply to AdGuard too: it is not possible to apply a diff patch to list content which is expanded client-side, i.e. the So this means that diff-patching can only be done for lists whose content is not further modified client-side. This is doable in uBO for select key lists where diff-patching would be more important than avoiding to expand an Now regarding batch-commit currently being used in uAssetsCDN, I don't see the need for some central file describing patches etc. The way I see it, a simple field in the header of the file like For example, if the commit-version field is Once done, repeat the process -- as a result of applying This way the only requirement is to have a To make things even fancier, this approach could keep a short list of the last commit versions applied. You may have noticed the commit versions currently in use at uAssetsCDN are time-based, and given this the client code could infer the pace of scheduling diff-based updates according to the average time difference between the latest commits, so essentially this allows controlling server-side how often clients run diff-based updates. Also, pruning a growing list of patches in So unless there is a flaw in my thinking I am not seeing, this is the approach I want for uBO, and I consider it to be specific, because I would know exactly where to fit this in uBO's code with minimal changes and I feel having to use an external framework would just make things more difficult on my side, i.e. take longer to reach the goal and require modifying core code path in uBO, which I would rather avoid. |
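The pacing idea in the comment above (inferring how often to run diff-based updates from the spacing of recent commit versions) could be sketched roughly as follows. This assumes the commit versions can be read as Unix timestamps in seconds, which the thread does not confirm; the function name and the six-hour fallback are illustrative only.

```python
from statistics import mean


def infer_update_period(commit_times: list[int], default: float = 6 * 3600) -> float:
    """Average number of seconds between the most recent commit versions.

    `commit_times` is a short history of applied commit versions, assumed
    to be Unix timestamps. With fewer than two entries there is nothing to
    average, so a fallback period (hypothetical default) is returned.
    """
    if len(commit_times) < 2:
        return default
    ordered = sorted(commit_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)
```

A client would schedule its next diff-update roughly one such period after the last applied commit, so the server controls the pace simply by how often it publishes.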
Yeah, it does apply indeed. The way I see it a generic solution requires expanding Regarding the recursive approach with the I'll try to amend the spec to make it more like this recursive approach that you're going to use for uAssetsCDN so that if at some point you decide to support it, it will be easier to do. Just one more thing that I wanted to propose. Since these metadata fields may influence how different ad blockers process filter lists, I suggest using a platform prefix for fields that are not universally supported. Something like |
Concretely, Let's use The For example in uBO there is already metadata associated with a cached asset: last time it was read from cache, time written to cache, "age" of the resource, etc. So Then a diff-update cycle becomes a matter of iterating through the metadata of all cached assets. If an asset does not have If one does have such a value, launch a diff-update operation, i.e. fetch the diff-patch as Then parse and apply the patch -- that will require the most work to first write code to extract the specific subpatch which applies to the asset being patched, then code to apply the patch itself -- this is where a library makes sense. Keep fetched diff-patches around in case they are needed again for other assets during the current cycle in order to avoid fetching from the server again. For the case of one diff-patch per asset, then the same code above would work, except of course that As you said originally, using standard
Example diff
diff --git a/filters/filters.min.txt b/filters/filters.min.txt
index c79dca64..86639f9d 100644
--- a/filters/filters.min.txt
+++ b/filters/filters.min.txt
@@ -4 +4 @@
-! Last modified: Tue, 17 Oct 2023 19:43:41 +0000
+! Last modified: Wed, 18 Oct 2023 05:46:20 +0000
@@ -27,2 +26,0 @@ youtube.com##ytd-video-masthead-ad-advertiser-info-renderer,ytm-promoted-sparkle
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i)
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true)
@@ -64,6 +61,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\,
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i)
@@ -15844 +15836 @@ watashiwasugoidesu.com##.watas-bottom-vi
-@@||freereceivesms.com/ados.js$script,1p
+@@||freereceivesms.com^$script,1p
@@ -23286,0 +23279 @@ y2down.cc##+js(nowoif)
+@@||cdn.fuseplatform.net/publift/$3p,script,xhr,domain=quackr.io
diff --git a/filters/privacy.min.txt b/filters/privacy.min.txt
index 9f6cb1bf..f65844b3 100644
--- a/filters/privacy.min.txt
+++ b/filters/privacy.min.txt
@@ -7 +7 @@
-! Last modified: Fri, 13 Oct 2023 12:03:17 +0000
+! Last modified: Wed, 18 Oct 2023 05:46:43 +0000
@@ -514,0 +515 @@ natgeotv.com##+js(set, Visitor, {})
+www.lenovo.com##+js(aost, history.replaceState, injectedScript)
diff --git a/filters/quick-fixes.txt b/filters/quick-fixes.txt
index 22c0b3af..9912fc03 100644
--- a/filters/quick-fixes.txt
+++ b/filters/quick-fixes.txt
@@ -4 +4 @@
-! Last modified: Tue, 17 Oct 2023 20:54:25 +0000
+! Last modified: Wed, 18 Oct 2023 03:04:34 +0000
@@ -82,2 +82 @@ plagiarismchecker.co##[class][style*="display"][style*="block"]:has(a img[src^="
-! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i)
+! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/ - 03c8985d
@@ -110 +108,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /"playerAds.*?gutParams":\{"ta
-youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true)
@@ -119,6 +116,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\,
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i)
-youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i)
@@ -156,4 +147,0 @@ sankaku.app##+js(no-xhr-if, googlesyndication)
-! https://github.com/uBlockOrigin/uAssets/issues/19982
-unsplash.com#@#div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div):remove()
-unsplash.com##div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div)
-
In that example diff, a So in summary, a library which takes as input the above example diff, the name of the resource to look up in the patch (might be something which requires its own field too, i.e. Forgot to mention, using a time-based version schema also helps to preemptively skip over a pointless fetch. As you see, I currently use |
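To make the "extract the subpatch, then apply it" step concrete, here is a rough sketch that works on a zero-context git diff shaped like the example above. It is not uBO's or AdGuard's implementation; `extract_subpatch` and `apply_subpatch` are hypothetical names, and the code assumes every hunk was produced with no context lines.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def extract_subpatch(patch_text: str, resource: str) -> list[str]:
    """Return the hunk lines (from the first '@@' on) for `resource`, or []."""
    header = f"diff --git a/{resource} b/{resource}"
    collected: list[str] = []
    in_target = in_hunks = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            if in_target:
                break  # reached the next file's section
            in_target, in_hunks = line == header, False
            continue
        if not in_target:
            continue
        # Removed/added filters are prefixed with '-'/'+', so a line that
        # starts with '@@' here can only be a hunk header.
        if line.startswith("@@"):
            in_hunks = True
        if in_hunks:
            collected.append(line)
    return collected


def apply_subpatch(lines: list[str], hunk_lines: list[str]) -> list[str]:
    """Apply zero-context hunks to `lines`; raise ValueError on mismatch."""
    hunks = []  # (old_start, old_count, removed_lines, added_lines)
    for line in hunk_lines:
        m = HUNK_RE.match(line)
        if m:
            count = int(m.group(2)) if m.group(2) is not None else 1
            hunks.append((int(m.group(1)), count, [], []))
        elif line.startswith("-"):
            hunks[-1][2].append(line[1:])
        elif line.startswith("+"):
            hunks[-1][3].append(line[1:])
    result = list(lines)
    # Work backwards so earlier old-file line numbers stay valid.
    for old_start, old_count, removed, added in reversed(hunks):
        # A pure insertion (count 0) goes *after* old line `old_start`.
        index = old_start if old_count == 0 else old_start - 1
        if result[index:index + old_count] != removed:
            raise ValueError("patch does not match the cached list")
        result[index:index + old_count] = added
    return result
```

Checking the removed lines against the cached content before replacing them gives a cheap integrity test: if the local copy has drifted, the patch is rejected instead of silently producing a corrupted list.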
Good point, a single patch can indeed contain diffs for several assets there, it's on the client to figure out which list it actually corresponds to and it's doable if the client takes into account the files locations (URLs). I'll draft an updated spec taking this all into consideration. Just one thing to note, I still think that for a generic case we need to have |
Patches need to provide the time at which they were created. When repeatedly applying patches to bring a list up to date, there is a yet unknown number of patches to apply. Eventually the last patch will be fetched, but there is no way for the updater to know that this is the last patch, and this will result in a pointless fetch to the server, which will return with 404 or an empty file (as discussed elsewhere). However if the creation time of a patch is part of the file itself, then the updater can infer that if So I propose a special line, the first one in the file, to provide that information:
Where |
There is already |
Yes ok so when batch-updating, the minimum value of |
Unfortunately
At least if it's part of the spec here, we will know where to find the |
Tbh, the idea of relying on creation time does not seem reliable. When the patch is created it is known whether the patch is final or not and it can be somehow indicated in the patch file. The downside is that this approach requires removing this marker from the older patch files. I propose extending our custom |
This means revisiting all previous patches to change this field to Anyway, on my side it doesn't matter in the end as I found that recursively applying patches does not work, in retrospect it was silly to think this was going to work, and in the end the simplest solution (which I was intending to use anyway before realizing recursion does not work) is that each time a new patch is created, all the existing patches will be revisited and recreated with the diff of all changed files since the tagged release matching the patch vs current state of the repo. This means there will be always only a single patch to apply, and thus |
What was the problem? |
In the end, L1 is never updated because its patch pointer never points to P2, updates end up stalled for L1. Possible solutions:
I am going with the second one, since the benefit is one single request to bring all the lists up to date, instead of having to fetch all the patches in between. This also removes the issue of the code which fetches patches having to wonder whether there is another, more recent patch to apply: every patch always leads to the most current version of the lists. |
By the way, maybe we could dispense with requiring a
or
|
Yeah, I was also thinking about that. I think we just won't be relying on VCS for generating patches and will just be looking at the list version + the previous list file. So if there are no changes the patch will stay empty. In our case relying on VCS is quite problematic by itself since we're generating different platform-specific filter list versions and keep only the "full" unfiltered one in VCS.
I think we don't save much on this, but make it a little bit harder to quickly figure out if this is a batch-updatable list or not. |
@gorhill Batch updates will be mostly used by uBO filters so we should indeed design it according to your needs. I'll update the spec accordingly.
Wouldn't it be better to standardize |
See the discussion here: AdguardTeam/FiltersCompiler#192 (comment) Updated the spec according to the discussion: 1. Diff-Name is replaced with a resource name specified as a part of Diff-Path 2. Optional timestamp added to the diff directive
@gorhill please check out the PR: https://github.com/ameshkov/diffupdates/pull/2/files |
Hi, |
@Kishlay-notabot The comparison between the old and new filters will be calculated on the server side. When checking for filter updates, if new filters are found, a patch will be created. Subsequently, both the filters and patches are updated in the CDN. For a detailed explanation of how this process works, you can refer to the following link: https://github.com/AdguardTeam/FiltersRegistry?tab=readme-ov-file#how-to-build-filters-and-patches and this: https://github.com/AdguardTeam/FiltersRegistry/blob/master/scripts/auto_build.sh . |
@105th Thanks for the links, will check them out! |
We would like to add support for differential updates to the filter lists, but for that the compiler should be extended pretty substantially.
Having differential updates will allow ad blockers to download filter list updates much more frequently.
Here's the idea:
When it is building a filter list, the platform directory and its contents may already exist. Let's take a simple case, when we're running there's already this directory structure:
If this is the case, here's what it does:
1.txt and the new state of 1.txt. We should use the standard diff format for that, not unified, no context: https://en.wikipedia.org/wiki/Diff
The diff.json file format:
Note, that the diffs array is sorted (older versions first).
If the diff is larger than X bytes, we remove all diffs altogether. The diff.json will be empty: {} or { diffs: [] } or just an empty file.
Ideally, I would like to make this mechanism available to all filter lists, not just the ones made by AdGuard. In order to do that we should allow indicating the diff file URL in the filter list's metadata.
We need two new metadata fields for that:
Diff-URL: https://xxxxx.com/filter.diff.json -- the address of the file with filter list diffs.
Diff-Expires: 1 hour -- expiration time of the filter list when differential updates are available. Note that Expires continues to work as it did before, i.e. once in a while AdGuard will do the so-called "full sync". However, we can and should increase it when differential updates are available.
Also, I'd like us to provide a separate tool that builds diffs instead of having this all inside the filters compiler (which is very AdGuard-specific).
Here's how it should work:
new-list-path - path to the new version of the filter list. The list will be modified as a result: Diff-URL and Diff-Expires will be added (or replaced if they're already there).
prev-list-path - path to the previous version of the filter list.
diff-expires - value of Diff-Expires to be used.
diff-url - URL where the diff.json file will be published.
max-diff-size - maximum size of a diff file in bytes. If the resulting diff.json is larger than the specified value, it becomes empty, effectively disabling differential updates.
max-diff-history - maximum number of diffs stored in the diff.json file.
Changes to how filter lists updates are checked
Version, Diff-URL and Diff-Expires are specified in the filter list metadata.
Diff-Expires period ad blocker should download the diff.json and then apply the diffs.
How the diffs are applied
First of all, any error while applying a diff should disable differential updates. In this case the ad blocker should back off and wait until it's time for the full sync.
The algorithm should be the following:
Find the current version in the diffs file. If the version is missing, count it as an error and stop differential updates.
Incrementally apply every patch in the diff.json file starting from that version. While applying the patches, check that the "old version -> new version" chain is continuous and has no gaps.
Verify the resulting filter list:
Version: value is the same as the last one in the diff.json.
Checksum: field in the filter list, check that it is a correct checksum for the list contents.
When to do full sync
Expires: field.
Remarks and Q&A
max-diff-history? I think that it depends on how often the list is updated. For instance, AdGuard Base filter is updated every hour and keeping the last 30 diffs should be good enough so that people who use their computer on a daily basis could continue to receive updates every hour.
Diff-Expires: 1 hour + increase the full sync period to 10 days then in the worst case scenario we'll be downloading about 90MB per month. However, this is very unlikely since computers don't usually work 24*7 so I'd assume we can safely divide it by 2. Anyways, I suggest first implementing the filters-diff-builder and see what real numbers we will be getting and then decide on how to proceed.
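The steps listed under "How the diffs are applied" can be sketched as follows. The exact diff.json layout is not reproduced in this text, so the entry fields used here (`old_version`, `new_version`, `patch`) are hypothetical, as are the function names. Patch application itself is delegated to a caller-supplied `apply_patch` function, and the optional Checksum verification is left out for brevity.

```python
import re
from typing import Any, Callable, Optional

VERSION_RE = re.compile(r"^!\s*Version:\s*(\S+)", re.M)


class DiffUpdateError(Exception):
    """Any failure here should disable diff updates until the next full sync."""


def read_version(list_text: str) -> Optional[str]:
    match = VERSION_RE.search(list_text)
    return match.group(1) if match else None


def apply_diff_chain(
    list_text: str,
    diffs: list[dict[str, Any]],  # assumed sorted, older versions first
    apply_patch: Callable[[str, Any], str],
) -> str:
    current = read_version(list_text)
    # Step 1: find the current version in the diffs file.
    start = next(
        (i for i, entry in enumerate(diffs) if entry["old_version"] == current),
        None,
    )
    if start is None:
        raise DiffUpdateError("current version not found in diffs")
    # Step 2: apply every patch from there, checking the chain has no gaps.
    for entry in diffs[start:]:
        if entry["old_version"] != current:
            raise DiffUpdateError("gap in the old -> new version chain")
        list_text = apply_patch(list_text, entry["patch"])
        current = entry["new_version"]
    # Step 3: the patched list must carry the last version in the chain.
    if read_version(list_text) != current:
        raise DiffUpdateError("Version field does not match the last diff")
    return list_text
```

A caller would catch `DiffUpdateError`, keep the cached list untouched, and wait for the regular `Expires`-driven full sync.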