Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split out core functionality into separate module #1664

Closed
8 tasks done
mjethani opened this issue Jul 19, 2021 · 43 comments
Closed
8 tasks done

Split out core functionality into separate module #1664

mjethani opened this issue Jul 19, 2021 · 43 comments
Labels
enhancement New feature or request fixed issue has been addressed

Comments

@mjethani
Copy link

Prerequisites

I tried to reproduce the issue when...

  • uBO is the only extension
  • uBO with default lists/settings
  • using a new, unmodified browser profile

Description

I have just updated the Cliqz benchmarks with the latest versions of Adblock Plus (ghostery/adblocker@a2da894) and Brave (ghostery/adblocker@97dc7c1)

Unfortunately it's not trivial to do the same for uBlock Origin, because of the relative tight coupling in the code. In fact, it keeps getting harder. For example, uritools.js now refers to the vAPI object whereas previously it did not.

It would be ideal if the core filter parsing, matching, and serialization functionality were split out into a separate Node.js module.

A specific URL where the issue occurs

N/A

Steps to Reproduce

Try to update the benchmarks at https://github.com/cliqz-oss/adblocker/tree/master/packages/adblocker-benchmarks with the latest version of uBlock Origin.

Expected behavior

It should be as easy as updating a version number or a commit hash in the package.json file in that repo.

Actual behavior

It's too much work that involves copying and modifying a bunch of files from the uBlock Origin repo.

uBlock Origin version

1.36.2

Browser name and version

N/A

Operating System and version

N/A

@gorhill
Copy link
Member

gorhill commented Jul 19, 2021

I agree, it's something I keep thinking since a long time I should do.


Any advices/hints on how best to undertake this are welcome.

@uBlock-user uBlock-user added the something to address something to address label Jul 20, 2021
@mjethani
Copy link
Author

Maybe the first step should be to move core functionality into a new µBlockCore namespace.

The rules for the µBlockCore namespace would be like this:

  1. The code assumes only a JS environment and nothing else (neither browser, nor WebExtensions, nor Node.js); it uses only core JavaScript APIs.
  2. The code never accesses any global properties or anything in the µBlock namespace. If it needs any such API (e.g. runtime.getURL()), the platform-specific code injects it as a dependency.
  3. In background.html, all µBlockCore code is loaded before any µBlock code.

For example, the escapeRegex() function in js/utils.js could go into the µBlockCore namespace, whereas the webRequest object in js/traffic.js probably wants to stay in the µBlock namespace.

This alone will take multiple iterations.

The next step would be to make it possible to load µBlockCore as a Node.js module. There are multiple ways to do this. One approach is to write the code as a Node.js module to begin with and then use webpack to compile it into a single file to be loaded into the browser extension.

@mjethani
Copy link
Author

One approach is to write the code as a Node.js module to begin with and then use webpack to compile it into a single file to be loaded into the browser extension.

I think there might be a better way to do this using the Node.js modules API.

In any case, the first step should be to refactor the code and gradually move the µBlockCore namespace either into a separate js/core.js file or into its own js/core/ directory.

@gorhill
Copy link
Member

gorhill commented Jul 25, 2021

I've been looking at modules, and I think I will go that route, my understanding is that this would make the code compatible with Node.js.

@mjethani
Copy link
Author

It turns out the solution has nothing to do with the modules API at all but rather it's the VM API that we need to use.

For the sake of this example, let's say µBlockCore is split into two files: core/foo.js and core/bar.js.

console.log('loading foo.js');
  
if (typeof µBlockCore === 'undefined') {
  µBlockCore = {};
}
  
µBlockCore.foo = function () {
  console.log('foo!');
};
console.log('loading bar.js');

if (typeof µBlockCore === 'undefined') {
  µBlockCore = {};
}
  
µBlockCore.bar = function () {
  console.log('bar!');
};

In the browser extension, these files would be loaded like so:

<script src="core/foo.js"></script>
<script src="core/bar.js"></script>

<script>
  µBlockCore.foo();
  µBlockCore.bar();
</script>

Now the question is how to also make these files available to a Node.js program as a "package," without having to rewrite the code in these files.

Here's a Node.js program that uses the package:

let µBlockCore = require('ublock');

µBlockCore.foo();
µBlockCore.bar();

The package has been installed via GitHub using the command npm i gorhill/ublock.

For it to be a valid package, it must have a package.json file. We already know this and it's not the difficult part. In addition to the package.json file, it must also have an index.js file that exports some APIs.

Now the question is how to load core/foo.js and core/bar.js from within this index.js file as if they were being loaded in an HTML document using <script> tags.

Here's the magic:

let fs = require('fs');
let path = require('path');
let vm = require('vm');

let environment = {
  // Let µBlockCore be the current exports object so everything on it is
  // exported directly out of this module.
  µBlockCore: exports,

  // Pass the global console object to each script so it can use it.
  console
};

function loadScript(name) {
  if (!vm.isContext(environment)) {
    vm.createContext(environment);
  }

  vm.runInContext(fs.readFileSync(path.resolve(__dirname, name)),
                  environment);
}

loadScript('core/foo.js');
loadScript('core/bar.js');

We create a new VM "context" and run both core/foo.js and core/bar.js in the same context. Both scripts therefore end up sharing the same µBlockCore namespace, which in this case also happens to be the current module's exports object. This means any properties set on µBlockCore from within these scripts are automatically exported.

Since the scripts need access to console, it's also added to the context. If the scripts needed any other globals like setTimeout(), etc., we would add those too.

@gorhill if you've already decided to go down this route then I don't mind submitting a patch to set this up.

@gorhill
Copy link
Member

gorhill commented Jul 25, 2021

if you've already decided to go down this route then I don't mind submitting a patch to set this up.

I am already well advanced with using export/import and I like where it's going -- I am at the point where there is no hard dependencies to µBlock itself in static network filtering engine. I don't know much about Node packages, but my understanding from the documentation is that Node can deal with export/import, so eventually if I modularize properly everything I suppose it should become a breeze to publish as a Node package.

@mjethani
Copy link
Author

I am already well advanced with using export/import and I like where it's going

Alright, sounds good!

There are two kinds of modules in Node.js and they don't seem to get along very well, but you can cross that bridge once you get to it.

One thing you might want to keep in mind when you write the package.json file is that the version number for a package under development is 0.1.0 by convention.

@gorhill
Copy link
Member

gorhill commented Jul 26, 2021

So I've reached a point where I could create an HTML page which is able to load uBO's static network filtering engine (what is meant to be benchmarked) with module-based approach,

Example of usage

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>uBO Static Network Filtering Engine</title>
</head>
<body>
<script src="./ublock/src/lib/punycode.js"></script>
<script src="./ublock/src/lib/publicsuffixlist/publicsuffixlist.js"></script>
<script type="module">
    import {
        FilteringContext,
    } from './ublock/src/js/core/filtering-context.js';
    import {
        CompiledListReader,
        CompiledListWriter,
    } from './ublock/src/js/core/compiledlistio.js';
    import {
        compile,
    } from './ublock/src/js/core/static-filtering-compiler.js';
    import {
        staticNetFilteringEngine,
    } from './ublock/src/js/core/static-net-filtering.js';

    staticNetFilteringEngine.reset(); // Remove all filters

    const applyList = async url => {
        const list = await fetch(url)
        const raw = await list.text();
        const writer = new CompiledListWriter();
        writer.properties.set('name', url);
        const compiled = compile(raw, writer);
        const reader = new CompiledListReader(compiled);
        staticNetFilteringEngine.fromCompiled(reader);
    };

    (async ( ) => {
        // Populate filtering engine with filter lists
        await Promise.all([
            applyList('https://easylist.to/easylist/easylist.txt'),
            applyList('https://easylist.to/easylist/easyprivacy.txt'),
        ]);

        // Commit changes
        staticNetFilteringEngine.freeze({ optimize: true });

        // Reuse filtering context: it's what uBO does
        const fctxt = new FilteringContext();

        // Tests
        fctxt.setDocOriginFromURL('https://www.bloomberg.com/');
        fctxt.setURL('https://www.bloomberg.com/tophat/assets/v2.6.1/that.css');
        fctxt.setType('stylesheet');
        if ( staticNetFilteringEngine.matchRequest(fctxt) !== 0 ) {
            console.log(staticNetFilteringEngine.toLogData());
        }

        fctxt.setDocOriginFromURL('https://www.bloomberg.com/');
        fctxt.setURL('https://securepubads.g.doubleclick.net/tag/js/gpt.js');
        fctxt.setType('script');
        if ( staticNetFilteringEngine.matchRequest(fctxt) !== 0 ) {
            console.log(staticNetFilteringEngine.toLogData());
        }

        fctxt.setDocOriginFromURL('https://www.bloomberg.com/');
        fctxt.setURL('https://sourcepointcmp.bloomberg.com/ccpa.js');
        fctxt.setType('script');
        if ( staticNetFilteringEngine.matchRequest(fctxt) !== 0 ) {
            console.log(staticNetFilteringEngine.toLogData());
        }
    })();
</script>
</body>
</html>

Now there is still a long way to go regarding all this work to have cleaner dependencies, naming, etc., but for the purpose of unblocking your work with Cliqz benchmark, and keeping in mind I have zero experience with nodejs or npm, what is left to address on my side for you to be able to run this in your benchmark package?


Never mind, found a guide.

@mjethani
Copy link
Author

This sounds about right!

You might run into the problem that Node.js doesn't recognize export in a .js file. The way to resolve it is to set type: "module" in your package.json.

gorhill added a commit to gorhill/uBlock that referenced this issue Jul 27, 2021
Related issue:
- uBlockOrigin/uBlock-issues#1664

The changes are enough to fulfill the related issue.

A new platform has been added in order to allow for building
a NodeJS package. From the root of the project:

    ./tools/make-nodejs

This will create new uBlock0.nodejs directory in the
./dist/build directory, which is a valid NodeJS package.

From the root of the package, you can try:

    node test

This will instantiate a static network filtering engine,
populated by easylist and easyprivacy, which can be used
to match network requests by filling the appropriate
filtering context object.

The test.js file contains code which is typical example
of usage of the package.

Limitations: the NodeJS package can't execute the WASM
versions of the code since the WASM module requires the
use of fetch(), which is not available in NodeJS.

This is a first pass at modularizing the codebase, and
while at it a number of opportunistic small rewrites
have also been made.

This commit requires the minimum supported version for
Chromium and Firefox be raised to 61 and 60 respectively.
@gorhill
Copy link
Member

gorhill commented Jul 27, 2021

@mjethani

I don't know if it's enough to unblock you, you can now build a nodejs package by using ./tools/make-nodejs from the root of the project -- the changes are in the 1664 branch.

I didn't go with a core folder, since in the end it's just a matter of copying the couple of files needed into the package. Hopefully this will make it easy for you to instantiate uBO's engine, tell me if anything is missing.

@mjethani
Copy link
Author

mjethani commented Jul 28, 2021

Thanks, this should work.

We'll need to benchmark the serialization functionality, and for this we could create a mock storage object in the benchmarking code:

class MockStorage {
  #map;

  constructor() {
    this.#map = new Map();
  }

  async put(key, value) {
    this.#map.set(key, value);
  }

  async get(key) {
    return this.#map.get(key);
  }
}

let storage = new MockStorage();

And then we could do snfe.toSelfie(storage, 'benchmark') and snfe.fromSelfie(storage, 'benchmark') for serialization and deserialization respectively.

Please correct me if I'm wrong.

@mjethani
Copy link
Author

A program using the package shouldn't have to read effective_tld_names.dat using the file system API. The way to resolve this is to make it a JSON file. You can do this in the build script tools/make-nodejs.sh.

node -pe "JSON.stringify(fs.readFileSync('$DES/data/effective_tld_names.dat', 'utf8'))" > $DES/data/effective_tld_names.json

After this a program can load the file like require('uBO-snfe/data/effective_tld_names.json') and get the raw PSL as a string.

But you could also go a step further and not expose it at all: If there's no argument to the pslInit() function, use the built-in raw PSL.

import { createRequire } from 'module';
const require = createRequire(import.meta.url);

function pslInit(raw = require('./data/effective_tld_names.json')) {
  // ...
}

gorhill added a commit to gorhill/uBlock that referenced this issue Jul 28, 2021
@gorhill
Copy link
Member

gorhill commented Jul 28, 2021

Thanks for the feedback, I've made the suggested changes in gorhill/uBlock@e1222d1.

@mjethani
Copy link
Author

I have a temporary branch here: https://github.com/mjethani/adblocker It's called update-ublockorigin-nodejs.

Here's the sequence of commands to run the benchmarks:

git clone https://github.com/mjethani/adblocker.git
cd adblocker
git checkout update-ublockorigin-nodejs 
cd packages/adblocker-benchmarks
npm install
make ublock

… the changes are in the 1664 branch.

@gorhill once this work is done, do you intend to merge it back into master?

@mjethani
Copy link
Author

One more issue I ran into is that the Node.js package uses console.info(). Unfortunately this clutters the output. For the time being, I'm setting console.info to () => {} at the start of the program.

@gorhill
Copy link
Member

gorhill commented Jul 28, 2021

I will rework the code to avoid console usage for what is just normal functioning, stumbling into an invalid filter is an expected condition.

Yes, I do plan to merge all this into master, and there will also be a uBlock0_[version].nodejs.zip package for each release so that it can simply be downloaded.

@mjethani
Copy link
Author

… there will also be a uBlock0_[version].nodejs.zip package

If you make it a .tgz, the URL can be given directly in the package.json of the project that's using the package.

gorhill added a commit to gorhill/uBlock that referenced this issue Jul 28, 2021
Related issue:
- uBlockOrigin/uBlock-issues#1664

Modularization is a necessary step toward possibly publishing
a more complete nodejs package to allow using uBO's filtering
capabilities outside of the uBO extension.

Additionally, as per feedback, remove undue usage of console
output as per feedback:
- uBlockOrigin/uBlock-issues#1664 (comment)
@mjethani
Copy link
Author

I have a temporary branch here: https://github.com/mjethani/adblocker

Merged.

git clone https://github.com/cliqz-oss/adblocker.git
cd adblocker/packages/adblocker-benchmarks
npm install
make ublock

@gorhill
Copy link
Member

gorhill commented Jul 29, 2021

Note that in the latest iteration, make-nodejs will git-clone a temporary uAssets if it's not found at the expected location.

gorhill added a commit to gorhill/uBlock that referenced this issue Jul 29, 2021
Related issue:
- uBlockOrigin/uBlock-issues#1664

The various filtering engine benchmarking functions are best
isolated in their own file since they have specific
dependencies that should not be suffered by the filtering
engines.

Additionally, moved decomposeHostname() into uri-utils.js
as it's a hostname-related function required by many
filtering engine cores -- this allows to further reduce
or outright remove dependency on `µb`.
@mjethani
Copy link
Author

Note that in the latest iteration, make-nodejs will git-clone a temporary uAssets if it's not found at the expected location.

Thanks.

I've been experimenting with git submodules.

You could do something like this:

  1. Run git submodule add https://github.com/uBlockOrigin/uAssets ./submodules/uAssets from the main repo
  2. Update all the paths to use ./submodules/uAssets/ instead of ../uAssets/
  3. Commit the changes

From that point on someone who wants to build uBlock Origin could clone the main repo with --recurse-submodules or run git submodule update --depth 1 --init afterwards.

@mjethani
Copy link
Author

With submodules: https://github.com/mjethani/uBlock/tree/submodules

git clone -b submodules https://github.com/mjethani/ublock
cd ublock
./tools/make-nodejs.sh

@gorhill
Copy link
Member

gorhill commented Jul 29, 2021

Let's say I have work to do in uAssets repo, can it be done from within the submodule, i.e. git commit/pull/push etc. from within it?

@gorhill
Copy link
Member

gorhill commented Jul 29, 2021

Thanks, I pulled your changes, I agree it's much cleaner and this simpilfies the make scripts.

@mjethani
Copy link
Author

Let's say I have work to do in uAssets repo, can it be done from within the submodule, i.e. git commit/pull/push etc. from within it?

You should be able to do it.

In fact, you could just delete the directory and make a symlink to ../uAssets to continue development as it is.

@mjethani
Copy link
Author

Since you mentioned development, I should clarify that you'll have to update the version of uAssets in the main repo from time to time.

Workflow:

cd submodules/uAssets
git fetch --unshallow
git checkout master
edit edit edit
git commit
git push

# Back to the main repo
cd .. 

# Update uAssets version in the main repo
git commit
git push

@gorhill
Copy link
Member

gorhill commented Jul 29, 2021

I think I will just keep working in the separate uAssets repo to keep it simple.

But now if locally I want to create a build with the current state of the uAssets repo, as per doc this is what needs to be done?

git submodule update --remote uAssets

Which I did, but now it seems I have local changes in submodules/uAssets:

> git status
modified:   submodules/uAssets (new commits)

Not sure what's next.


Never mind, I just git add-ed/commit-ed, and push-ed as with any other update.

@mjethani
Copy link
Author

mjethani commented Jul 29, 2021

Never mind, I just git add-ed/commit-ed, and push-ed as with any other update.

Yes, the idea is that the main repo always refers to a specific version of uAssets. If there's a new version, you need to upgrade to it.

I'm not sure if there's a way around this extra step.

@gorhill
Copy link
Member

gorhill commented Jul 29, 2021

It's all fine, having locally up to date submodule uAssets is rarely needed, except when to test a new install of uBO, and my understanding is that the gihub action-driven build always use the latest state of uAssets as this was the case before.

@mjethani
Copy link
Author

… the gihub action-driven build always use the latest state of uAssets

If you mean .github/workflows/main.yml, it doesn't use --remote at the moment. Maybe you want to add that option.

@mjethani
Copy link
Author

@gorhill pslInit() is taking ~50 ms on my device, almost all of it in parsing the raw public suffix list. Why not do this at build time? In the parse() function you end up with treeData and charData. If that's all you really need, it could just be a JSON object too. In the browser extension you'd still have to read it as a text file and then call JSON.parse() on it; in the Node.js version, you could just call require(). In both cases it'll likely be a lot faster this way.

This is not strictly related to the issue here but more of an optimization idea.

As for why this still matters in 2021, personally I'd love to run uBlock Origin on a low-end Android device and every bit of improvement can make a difference.

@gorhill
Copy link
Member

gorhill commented Jul 30, 2021

In uBO it's parsed once and then the state is serialized using toSelfie/fromSelfie, and from then on parse() is used only when the PSL updates, i.e. every ~19 days.

I could make pslInit() returns the PSL instance which you could use to measure fromSelfie() instead of parse(), since it's what really occurs in uBO.


gorhill/uBlock@5be4d5d

@uBlock-user uBlock-user added enhancement New feature or request fixed issue has been addressed and removed something to address something to address labels Jul 31, 2021
@mjethani
Copy link
Author

@gorhill I have added a makefile: https://github.com/mjethani/uBlock/tree/makefile

With this there's no need for the benchmarking code to know about the internal paths or anything else about the Node.js package. It can simply run make install-nodejs and make uninstall-nodejs.

This will also help with development in general: the make command will build only if one of the source files has been touched. Run make chromium, make firefox, make nodejs, or just make.

@gorhill
Copy link
Member

gorhill commented Jul 31, 2021

Awesome, thank you very much -- can you submit a pull request? I assume you should be able to submit PRs since your last commit should have made you a contributor, i.e. allowed to submit PRs.

gorhill added a commit to gorhill/uBlock that referenced this issue Jul 31, 2021
Another small step toward the goal of reducing dependency
on `µb`.

Related issue:
- uBlockOrigin/uBlock-issues#1664

text-iterators module has been renamed text-utils to better
reflect its content.
@gorhill
Copy link
Member

gorhill commented Jul 31, 2021

@mjethani I added a ublockWasm target to be able to run the benchmark for uBO with wasm code paths enabled. Is this worth a PR to https://github.com/cliqz-oss/adblocker?

I was curious to find out if it made a difference, but at a glance from the console output (executing node run ublockWasm requests.json) I didn't see much difference with the pure JS code paths.

@mjethani
Copy link
Author

@gorhill does uBlock Origin use Wasm on Chromium-based browsers? If the answer is no, maybe it's best to benchmark only the plain JS version on Node.js.

I'd like to add SpiderMonkey too and in that case we might want to enable Wasm.

@gorhill
Copy link
Member

gorhill commented Jul 31, 2021

It's not used in Chromium -- as this would require 'unsafe-eval' in the manifest.

@mjethani
Copy link
Author

It's not used in Chromium -- as this would require 'unsafe-eval' in the manifest.

wasm-eval works now.

I know you said you wouldn't use it.

@mjethani
Copy link
Author

mjethani commented Aug 1, 2021

@gorhill it's possible to convert a WebAssembly module into asm.js using the wasm2js tool in Emscripten. It makes me wonder if it might be possible to load it this way and how it might perform on V8.

@mjethani
Copy link
Author

mjethani commented Aug 2, 2021

Suggestion: In package.json, set the name to "ublock-core" or "ublockorigin-core". At the moment it exposes the static net filtering engine, but in the future it could expose other "core" objects. The suggested name then would still be appropriate. If a program wants to use only the static net filtering engine, it could do so like require('ublockorigin-core/snfe'). The README.md file would eventually explain how to use it in different ways.

@mjethani
Copy link
Author

mjethani commented Aug 2, 2021

… set the name to "ublock-core" or "ublockorigin-core"

Another way is to use a scope. e.g. @ublock/core or @gorhill/ublock-core.

@gorhill
Copy link
Member

gorhill commented Aug 2, 2021

I see ABP's core on npm, is there a reason they didn't go @eyeo/adblockpluscore?

gorhill added a commit to gorhill/uBlock that referenced this issue Aug 2, 2021
@mjethani
Copy link
Author

mjethani commented Aug 2, 2021

… is there a reason they didn't go @eyeo/adblockpluscore?

I have no idea.

But I do know that npm recommends using a scope.

By the way, even if you never end up publishing any packages on npm, it would still be a good idea to grab your username(s) if you haven't done so already.

gorhill added a commit to gorhill/uBlock that referenced this issue Aug 2, 2021
Related issue:
- uBlockOrigin/uBlock-issues#1664

This change allows to add the redirect engine into the
nodejs package. The purpose of the redirect engine is to
resolve a redirect token into a path to a local resource,
to be used by the caller as wished.
@mjethani
Copy link
Author

In case anyone stumbles by, the end result here is that the core of uBlock Origin is now available as a Node.js package in the npm registry: https://www.npmjs.com/package/@gorhill/ubo-core This has greatly simplified the integration with the Cliqz benchmarks and as a bonus has made the same code used in the uBlock Origin browser extension now available to any Node.js program.

Thanks @gorhill for the effort.

gorhill added a commit to gorhill/uBlock that referenced this issue Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request fixed issue has been addressed
Projects
None yet
Development

No branches or pull requests

3 participants