Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI/CD to build and sanity check docker container and release executables #231

Closed
5 tasks done
CMCDragonkai opened this issue Aug 29, 2021 · 55 comments · Fixed by #292
Closed
5 tasks done

CI/CD to build and sanity check docker container and release executables #231

CMCDragonkai opened this issue Aug 29, 2021 · 55 comments · Fixed by #292
Assignees
Labels
epic Big issue with multiple subissues procedure Action that must be executed

Comments

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Aug 29, 2021

The final built executables coming out of vercel/pkg are not tested when uploaded as pre-release executables.

We should run a final test on those, maybe a simplified test that runs just the --version or --help page to ensure that the executable actually runs on the target environment. This can help us know whether the final packaged executables are actually working mostly.

This requires our CI/CD to expand to include all environments:

We have these stages:

  stages:
    - check
    - build
    - release

This might work as an extra stage: qa to mean "quality assurance". And it needs to use raw environments (not based on our Nix image that we use to build things).

Tasks

  1. - Investigate adding an extra stage qa that enables Linux, Windows, and Mac environment runners, that means this stage runs on an entirely different runner compared to check, build and release.
  2. - Download the executables from the release point. This tests whether our release point is up and running. If the release point is down, the test should retry after some time, as this depends on network availability.
  3. - Run a sanity check over the executables such as --help and --version to get the help page and the version page respectively.
  4. - Ensure that this job is interruptible.

Regarding docker:

  1. - Create the docker image with nix-build ./release.nix -A docker
    • test that this container works with docker run -it name-version:hash.
    • example: docker run -it matrixai_typescript-demo-lib-1.0.0:f3wf1wxvkdbbmrhxx9rhr5qw2skbrah6
    • see TypeScript-Demo-Live README.md for more details
    • the TMPDIR may be used by PK, if so, use this https://gist.github.com/CMCDragonkai/94e6550302aa7ebc175c06819927605a
    • inspect the image contents for correctness by using this https://gist.github.com/CMCDragonkai/d7a652ada27289ab59e28bc914c942f6
    • do we need a cacert and tzdata?
      • cacert may be needed in case we want to contact external websites, but this is not a thing for the PK agents hosted on testnet.polykey.io
      • tzdata shouldn't be necessary since we don't operate on any timezones, we assume OS time is UTC time
@CMCDragonkai CMCDragonkai added the procedure Action that must be executed label Aug 29, 2021
@CMCDragonkai
Copy link
Member Author

Remember Gitlab provides many CI/CD runners. But I suspect this will require running a separate runner on the last stage. I'm not sure if the configuration allows this. But we must investigate this.

Also consider what additional costs there is, and whether it is free for Open Source, and we have replicated this to the Open Source group.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Aug 29, 2021

Should also do a hash check to ensure that the expected hashes of the released executables are in fact what we expect. That means during the release stage, we should be calculating the hashes from nix-build and then publishing those as well on the release page. Note that the hashes used in the naming currently are nix hashes which are hashes of the input derivation expression, not the build outputs.

@CMCDragonkai
Copy link
Member Author

This maybe relevant too https://nixos.org/manual/nix/unstable/command-ref/new-cli/nix3-store-make-content-addressable.html

Checking the hash of the release is just about ensuring integrity but it doesn't ensure authenticity. A signature would ensure authenticity. However combining the hash with an "authentic" trusted source is sufficient to ensure authenticity and integrity. Of course end-to-end authenticity is enhanced with a proper signature applied.

To dog-food Polykey, it would make sense to do a code-sign from the Polykey organisation gestalt rather than GPG since we stopped using GPG.

@CMCDragonkai CMCDragonkai added the epic Big issue with multiple subissues label Nov 1, 2021
@CMCDragonkai
Copy link
Member Author

The CI/CD check here should actually be done on the container image build. We should also do a check against the release executable builds while we're at it.

Note that the release.nix is what has all the specifications. The current docker attribute bundles up the cacert package. For the testnet and mainnet. I believe the cacert won't be required for NAT functionality as contacting external websites is only needed in the identities domain.

If however we are going to join all of the testnet nodes as part of the Matrix AI gestalt and mainnet could be in a different gestalt, it may in fact be required. We can keep the dependency in just for now.

@CMCDragonkai CMCDragonkai changed the title CI/CD to sanity check on final built executables in all environments CI/CD to sanity check on docker container and release executables Nov 1, 2021
@CMCDragonkai CMCDragonkai changed the title CI/CD to sanity check on docker container and release executables CI/CD to build and sanity check docker container and release executables Nov 1, 2021
@CMCDragonkai
Copy link
Member Author

Pushing this Nov 11 finish from Nov 10 as an additional last sanity check should be done once #269 is done. But this can be started without #269.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 5, 2021

Make sure to check that bundled files JSON and protobuf files are being put into the ./dist directory properly. Except of course the ./package.json.

See: MatrixAI/TypeScript-Demo-Lib#34 (comment) for discussion and fix in TypeScript-Demo-Lib.

If there are future files that need to be copied over, it's also important to consider the npm postbuild hook to copy over those non-transpiled files. These stages of "building" should be documented.

@CMCDragonkai
Copy link
Member Author

Checking the build executables can be done in a number of ways automatically.

Application Emulation

The idea here is to directly execute the built artifacts in the same CI/CD container environment which currently is a "nix" container running on Linux. The advantage here is probably less resources being used, and all of this could be tested on our local development machines which are NixOS (or linux with Nix).

Disadvantage is that it's not a full test, we don't know for sure what might happen on the other platforms.

Example: https://jake-shadle.github.io/xwin/#2-specify-runner This guy uses WINE to test a cross-compiled Rust executable.

System Emulation/Virtualisation

This relies on virtualisation, such as using QEMU to run a full system. This is closer to the real deal, but more complex and expensive.

Atm, I don't believe the containers inside Gitlab CI/CD are allowed to execute QEMU so we don't automate it that way. Instead this will rely on the CI/CD system itself to provide these extra environments. Gitlab has MacOSX and Windows guests available to be used, and these can then run tests against our built executables. Note that this means the tests are not running inside the Nix container anymore. Which leads us to our next problem.

What kind of tests?

These tests that run against the final built executable are not like the tests we have currently inside our Jest testing. Jest itself is just a test runner, it uses Node executes JS at the end of the day. The JS itself right now is importing code directly from src and testing this. Even now our tests/bin tests don't actually do a child process execution. They are lighterweight by calling main function directly and mocking the STDOUT/STDERR... etc. And even if they were to actually do a proper process exec, it's again done against the src/ code, and not against the final built executables.

So for testing the final executables, the tests have to shell out out and execute the commands and then testing what is the expected side-effects. This can be done in jest by using child process exec, and all we are doing is using the JS to run child processes.

If we use a library like https://github.com/raingerber/jest-shell-matchers, this allows us to create expectations and matchers that can execute external executables.

However there are 2 challenges with this:

  1. We may need to copy some or alot of the tests in tests/bin to instead test external executables. Some of the tests can work without any change since all we do is swap out the usage of main with executing an external executable.
  2. Using this means all the platforms we're doing tests in will need to be able to run jest and node. That is the test utilities has to work in that environment.

If abandon jest entirely, then we would need a new test harness, and then platform-specificity is an issue. For example if we were to write tests in a shell language like bash, this may work easily on Linux and MacOS, but not as easily in Windows which doesn't have bash. We could create platform specific ways for every platform, Linux and Mac can use bash, while Windows uses powershell, but this just ends up duplicating our tests/bin 2-times.

We know that node and jest can work on Windows, Mac and Linux. But we won't be able to use Nix in most of these extra environments, since these are meant to be non-Nix platforms. So it's just a matter of installing npm and node on those platforms and proceeding from there. I'm sure the CI/CD scripts have some suggested way of installing Node on those platforms. Windows may have chocolatey, and Mac may have homebrew. See the listed methods here: https://nodejs.org/fa/download/package-manager/

Also Gitlab docs may have some suggested ways of doing things too.

Portability of tests/bin

So does this mean we have to write some new tests like tests/bin-external? Or could we re-use some of the tests in tests/bin?

Surely we can swap out execution of main with direct execution of the real thing right?

That may be the case, but if we are using jest-mock-process, then that won't work, since the real process is not using the Node process mocks.

However if all of the mocking used by https://github.com/EpicEric/jest-mock-process is done within a wrapper function like pkWithStdio, then it may be possible to swap that out with the real implementation in which case the side-effects are real, and STDOUT, STDERR and exit code can be captured as well. Then a parameter can be passed into tests/bin to run the same tests on the external executable.

I'm sure some tests may not work though, and this requires a careful review.

@tegefaulkes
Copy link
Contributor

I've done a little research into gitlabs CI/CD. It might be possible to test the built executables in native shell enviroments using a combination of tags, artefacts and dependencies.

We can us tags to specify a type of runner to run the job in. It seems like we can specify a runner that is a Linux bash shell or a windows Powershell this way.

Using job artefacts and dependencies we can ensure that the required built executable is copied between jobs.

I need to do some more research and testing for this but my impression is that it might be possible.

@CMCDragonkai
Copy link
Member Author

It would be best to first try it in the TS-Demo-Lib. You can create new branches and run the CI/CD manually there.

Also beware of Mac:

First thing would be non-NixOS executables and Windows which will the easiest to test.

@joshuakarp
Copy link
Contributor

The most recent CI/CD pipeline (after merging #284) has shown that the nix build job is failing, notably with similar errors to the following:

src/agent/GRPCClientAgent.ts:75:10 - error TS2742: The inferred type of 'vaultsGitInfoGet' cannot be named without a reference to '@matrixai/polykey/node_modules/@grpc/grpc-js'. This is likely not portable. A type annotation is necessary.
75   public vaultsGitInfoGet(...args) {
            ~~~~~~~~~~~~~~~~

This also relates to #200, where we hope to provide an explicit argument list on the GRPCClient function signatures (replacing ...args).

@joshuakarp
Copy link
Contributor

Changed from starting on 8th Nov to 12th Nov, based on arising complexity from async-init and CLI MR on Gitlab !213.

@tegefaulkes
Copy link
Contributor

We can add a report from jest in the CI/CD. Might be usefull.
registry.gitlab.com/matrixai/engineering/maintenance/

@tegefaulkes
Copy link
Contributor

Other things we can add to improve the speed of the pipeline is caching the nix/store and node_modules so they don't need to be downloaded for each job.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 12, 2021 via email

@tegefaulkes
Copy link
Contributor

I have most of the ci/cd config worked out now.
Currently I have a problem with js-polykey building with pkg.it seems that the fd-lock package is causing the issue with node-gyp-build. with the error.

> [email protected] install /nix/store/m7kvh5qjjgm7yfw9n6cccr7b7bc763wm-node-dependencies-_at_matrixai_slash_polykey-1.0.0/@matrixai/polykey/node_modules/fd-lock
> node-gyp-build

sh: /nix/store/m7kvh5qjjgm7yfw9n6cccr7b7bc763wm-node-dependencies-_at_matrixai_slash_polykey-1.0.0/@matrixai/polykey/node_modules/.bin/node-gyp-build: /usr/bin/env: bad interpreter: No such file or directory

We've seen this problem before and AFAIK it should be fixed. so I don't know why it's happening with fd-lock. I dont remember how it was fixed however.

I've tried including fd-lock inside of the Typescript-demo-lib and added a test for it. there it seems to build and run just fine. I've compared the nix build process between the two repos and I can't find any differences.

Do you have any ideas @CMCDragonkai

@CMCDragonkai
Copy link
Member Author

I had a look through and couldn't find any issues where we ever talked about this. But it does seem familiar. Is it workable inside a nix-shell?

@tegefaulkes
Copy link
Contributor

Running pkg inside the shell seems to work. trying to do a nix-build of just the application gives the same error.
This seems to suggest that the problem is not with the pkg but when installing the modules during the build process.

I think this has something to do with the installation process trying to build the prebuilds instead of downloading them.

@tegefaulkes
Copy link
Contributor

In fact wasn't this fixed by preventing node-gyp-build from building anything in the first place?
I think since node2nix is providing everything then there is no need to actually build the modules?
Shouldn't dontNpmInstall = true be fixinging this since it's preventing the install/build with NPM?

@CMCDragonkai
Copy link
Member Author

It is definitely possible for node-gyp-build to really build the C++ code. I reckon that's leveldown and such is done right? Does this nix-build inside TS-Demo-Lib?

@tegefaulkes
Copy link
Contributor

If I fix it to 4.2.3 it works. 4.3.0 doesn't work.

@tegefaulkes
Copy link
Contributor

But for now it seems nix-build runs fine in the CI/CD environments so I'm looking at what progress I can make there.

@tegefaulkes
Copy link
Contributor

GRPCCientClient and GRPCClientAgent may need explicit return types for streams. Currently the compiler is complaining about this with the error

src/client/GRPCClientClient.ts:272:10 - error TS2742: The inferred type of 'vaultsLog' cannot be named without a reference to '@matrixai/polykey/node_modules/@grpc/grpc-js'. This is likely not portable. A type annotation is necessary.
272   public vaultsLog(...args) {
             ~~~~~~~~~
src/client/GRPCClientClient.ts:352:10 - error TS2742: The inferred type of 'keysCertsChainGet' cannot be named without a reference to '@matrixai/polykey/node_modules/@grpc/grpc-js'. This is likely not portable. A type annotation is necessary.

@tegefaulkes
Copy link
Contributor

js-polykey is now building in ci/cd

@joshuakarp
Copy link
Contributor

That's awesome news

@tegefaulkes
Copy link
Contributor

It is building and copying over the artifacts but the tests are failing now. It's just missing some of the required files so that's an easy fix. Right now its relating to this issue MatrixAI/TypeScript-Demo-Lib#34 where the .json files are missing.

@CMCDragonkai
Copy link
Member Author

Nice!!!

@CMCDragonkai
Copy link
Member Author

What was the problem?

@tegefaulkes
Copy link
Contributor

The problem just seems to be when running nix-build locally. it just seems to work in the runners. I don't know why yet.

@CMCDragonkai
Copy link
Member Author

Try cleaning your Nix cache and see what happens. Details here: https://nixos.org/manual/nix/unstable/package-management/garbage-collection.html

Note it will have to reinstall when you do nix-shell.

You should do it when you're not in any nix-shells.

@tegefaulkes
Copy link
Contributor

Good news, the js-polykey tests are succeeding now. Only the docker and macos tests needs to be fixed now.

@tegefaulkes
Copy link
Contributor

Try cleaning your Nix cache and see what happens. Details here: https://nixos.org/manual/nix/unstable/package-management/garbage-collection.html

Note it will have to reinstall when you do nix-shell.

You should do it when you're not in any nix-shells.

Nope, that doesn't fix it.

@tegefaulkes
Copy link
Contributor

Docker test is working now.

@tegefaulkes
Copy link
Contributor

Macos test is working now. just need to work out a small fix for it.

@tegefaulkes
Copy link
Contributor

All of the CI/CD tests are passing now!
The tests are just running Polykey without any arguments and expecting it to run with an exit code of 0. as it stands now we're just testing that commander runs. We want to do at minimum a test that starts a polykey agent and stops it.

@tegefaulkes
Copy link
Contributor

I added a flag to pk agent start to do a test start and stop of polykey. I'm using this in the tests now.

@tegefaulkes
Copy link
Contributor

Ok, right now the ground work has been laid out an we have basic tests running for the following enviroments.

  • running with npm run polykey in node
  • running the application build in node
  • running the docker image in docker.
  • running the packaged linux build in a linux environment .
  • running the packaged windows build in a windows environment.
  • running the packaged macos build in a mac environment.

The tests so far just run a quick test I made in agent start that starts and stops polykey with full logging. if it exits with a code 0 we assume the test succeeded.

The docker test is just running the docker image with no flags for now so it's just printing out the help page.

I also expanded the test jobs out into multiple runners, so testing should complete quicker now.

@tegefaulkes
Copy link
Contributor

Note: It's possible to take the output from the build stage and use it in the publish stage. this should be addressed at some point.

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Nov 18, 2021

Also note, Nix-build fails when building locally but runs fine in the CI/CD. so far as I can tell it's failing due to the version of node-gyp-build. when node-gyp-build is 4.2.3 nix build patches the shebang and it runs fine. but if it's version 4.3.0 then nix doesn't patch the shebang and the build fails. As to why this is happening I am not sure and I'm running out if ideas to test.

There are comments above where I am troubleshooting it.

@joshuakarp
Copy link
Contributor

joshuakarp commented Nov 29, 2021

Adjusting end date to Wednesday December 1 (from November 19 - delayed from refactoring work in #283), to give time for review and merge.

@CMCDragonkai CMCDragonkai mentioned this issue Dec 12, 2021
12 tasks
@CMCDragonkai
Copy link
Member Author

Should incorporate the cached compiler path: #296 (comment)

@CMCDragonkai
Copy link
Member Author

The networking tests has been changed since async start was integrated. I'm removing these changes so that I can see what the original code was meant to do and fix these problems. This is why it's not a good idea to use be using beforeEach and afterEach for things that is the unit under test. @tegefaulkes.

@CMCDragonkai
Copy link
Member Author

@tegefaulkes all other tests which has this behaviour needs to be reverted back to the way it was before with the unit under test actually being in each test function.

@CMCDragonkai
Copy link
Member Author

I remember there was incorrect usage of getters in ForwardProxy as well. Getters should be used only for mostly static properties that are readable but mutable by the inside. Otherwise getProxyHost() is the correct way of doing this to indicate that this involved side-effects.

@CMCDragonkai
Copy link
Member Author

This is now all working. Successful build and QA sanity check here: https://gitlab.com/MatrixAI/open-source/js-polykey/-/pipelines/444133744

Note that due to staging, and expected test failures from nodes and related functionality, right now master is expected to have a failed pipeline. However in qa-testing when skipping testing, all of this worked.

Furthermore, QA sanity check is very basic. All it is checking is whether the program itself can run. By running the help command. That's it. This is a good first step.

Subsequently we can reuse our test-suite that we are using in tests/bin to also apply it in the other platforms. This will require some abstractions to build later. But for now this is done. In production we'll just do manual testing in the other platforms later.

@CMCDragonkai
Copy link
Member Author

In the process of working on this, we discovered issues regarding licenses and --public vs --public-packages option used by pkg. Worth understanding for future: #292 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Big issue with multiple subissues procedure Action that must be executed
Development

Successfully merging a pull request may close this issue.

3 participants