Guardian enforces dependency boundary constraints and keeps your project dependencies sane. It is mainly designed for the internal package dependencies of a large Haskell monorepo, but it can also be used with monorepos in other or multiple languages, or at the level of modules.
- Introduction - How to keep your monorepo sane
- Installation
- Usage
- GitHub Actions
- Contribution
- Copyright
Maintaining a large monorepo consisting of dozens of packages is not easy. Sometimes, just making sure all packages compile on CI is not enough - to accelerate the development cycle, it is crucial to ensure that changes to the codebase trigger only necessary rebuilds as far as possible.
But enforcing this requirement by hand is not easy. A kind of people known as programmers - especially Haskellers, you know - are inherently lazy. To make things work, they grep through the project, shout "look, there is what we want!", and add an extra dependency. Eventually they make everything depend on everything else, forming a gigantic net of dependencies whose rebuild takes a quarter of a day even if only a small portion of the code was changed.
Indeed, this was exactly the situation we developers at DeepFlow faced in 2021. Our main product is a high-performance numeric solver [1] consisting of ~70 Haskell packages, which total ~150k lines of code. It includes:
- abstraction of numeric computation backend,
- its concrete implementations,
- abstraction of plugins to extend solvers,
- its concrete implementations,
- abstraction of equation system,
- its concrete implementations,
- abstraction of communication strategies,
- its concrete implementations (e.g. MPI, standalone),
- abstract solver logic, and
- concrete solver implementations putting these all together.
What we had in 2021 was a chaotic melting pot of these players. If one changed a single line of a backend, it forced seemingly unrelated plugins to be recompiled. If one removed a single redundant instance, it miraculously triggered a rebuild in some concrete implementation of some specialised solver. In any case, a rebuild took up to six hours and ~100 GiB to complete, as we make heavy use of type-level hackery and rely on the aggressive optimisation mechanism of GHC [2].
This situation was far from optimal: it sacrificed developer experience and severely slowed down the iteration speed. Our product seemed to be dying slowly but surely.
In the middle of 2021, we decided to fight against this situation to save our project. Our rule of thumb here was the Dependency Inversion Principle from the realm of OOP, which says:
- High-level modules should not import anything from low-level modules. Both should depend on abstractions
- Abstractions should not depend on details. Details (concrete implementations) should depend on abstractions.
(excerpt from Wikipedia; renumbering is due to the author)
This is called dependency inversion because it seems inverse to the traditional usage of inheritance in OOP. We decided to apply this principle to package dependencies.
We divided the packages into several groups, called dependency domains. We drew the term and intuition from the blog post by GitHub about partitioning databases into domains.
Our packages are divided into the following groups [3]:
- infra: providing common infrastructure, such as data containers and algorithms.
- highlevel: high-level abstractions of backends, equations, and plugins.
- lowlevel: low-level and concrete implementations.
- plugin: concrete plugin implementations.
- solver: concrete solver implementations.
- tool: misc utility apps independent of solvers.
- test: test packages, mainly providing cross-package integration tests.
Every package in the monorepo must be classified into exactly one dependency domain. Furthermore, we define constraints on the dependencies between domains: each domain is given the set of domains on which its members can (transitively) depend, and the induced dependency graph between domains must be a directed acyclic graph (DAG).
The example is as follows:
flowchart TD
test[test] --> solver[solver];
test --> tool[tool];
highlevel[highlevel] --> infra[infra];
plugin[plugin] --> highlevel;
lowlevel[lowlevel] --> infra;
solver[solver] --> highlevel;
solver --> lowlevel;
solver --> plugin;
tool --> lowlevel;
tool --> plugin;
A package in a domain can depend only on packages in the same domain or in the domains downstream of it in the diagram.
Note that, in this setting, high-level (or abstract) packages can only depend on other abstraction or infrastructure packages, and only the application packages (say, solver and tool) can depend on both abstractions and implementations.
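The two requirements on the domain graph (acyclicity and transitive reachability) can be illustrated with a small Python sketch. This is not guardian's implementation; the names DEPENDS_ON, is_dag, and reachable are made up for this illustration, and the edges are copied from the diagram above. It assumes Python 3.9 or later for the standard graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

# Direct edges of the diagram above: domain -> domains it may depend on.
DEPENDS_ON = {
    "infra": [],
    "highlevel": ["infra"],
    "lowlevel": ["infra"],
    "plugin": ["highlevel"],
    "solver": ["highlevel", "lowlevel", "plugin"],
    "tool": ["lowlevel", "plugin"],
    "test": ["solver", "tool"],
}


def is_dag(graph):
    """Return True if the domain graph contains no cycle."""
    try:
        # static_order() raises CycleError once a cycle is detected.
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def reachable(graph, start):
    """All domains `start` may depend on, directly or transitively."""
    seen = set()
    stack = list(graph[start])
    while stack:
        domain = stack.pop()
        if domain not in seen:
            seen.add(domain)
            stack.extend(graph[domain])
    return seen
```

For instance, reachable(DEPENDS_ON, "plugin") contains highlevel and infra but not lowlevel, which is exactly the separation between abstractions and implementations described above.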
So far, so good. Now that we have the grand picture of the package dependency hierarchy, it is time to put it into practice.
The problem is: enforcing such an invariant by hand is almost impossible as the number of packages gets larger and larger. Even after we have sorted out where the violations occur, it takes a long time to fix them all while continuing to add new features and fix bugs. Furthermore, even if we somehow worked everything out once, our laziness could strike back, slowly violating the constraints and making everything rotten again. Uh-oh.
... Not really. Here, our brave guardian can help us!
Guardian is a tool designed for this purpose and developed at DeepFlow. We have been running guardian on our CI for almost two years.
In guardian, we declare all the dependency constraints in dependency-domains.yaml.
The above constraints are expressed as follows:
domains:
  infra:
    depends_on: []
    packages:
    - algorithms
    - geometry
    - linalgs
  highlevel:
    packages:
    - numeric-backends
    - plugin-base
    - equation-class
    depends_on: [infra]
  lowlevel:
    packages:
    - reference-backend
    - fast-backend
    - mesh-format
    depends_on: [infra]
  plugin:
    packages:
    - plugin-A
    - plugin-B
    depends_on: [highlevel]
  solver:
    packages:
    - solver-standalone
    - solver-parallel
    depends_on: [plugin, lowlevel, highlevel]
  tool:
    packages:
    - post
    - pre
    depends_on: [plugin, lowlevel]
  test:
    packages:
    - integration-tests
    - regression-tests
    depends_on: [solver, tool]
# We track dependencies in test & benchmark components as well.
components:
  tests: true
  benchmarks: false
Note that we don't have to include highlevel in the dependencies of solver - it is already introduced via the dependency on plugin.
Here, we include highlevel just for documentation.
Actually, this is not the first rule we had. As mentioned above, our codebase consists of ~70 packages summing up to ~150k lines of code. In some cases, it takes much effort to remove a dependency that violates the dependency domain boundary without sacrificing performance.
To accommodate such cases, guardian provides exception rules as a compromise.
It looks like this [4]:
...
plugin:
  packages:
  - package: plugin-A
    exception:
      depends_on:
      - lowlevel
      - package: fast-backend
  - plugin-B
  depends_on: [highlevel]
...
As shown above, exception rules are specified on a per-package basis.
An exception rule can allow dependencies on other dependency domains and/or individual packages that are otherwise banned.
An exception rule doesn't affect transitive dependencies - only the dependencies of the specified package are exempted.
In the above example, only plugin-A is allowed to depend on the exempted targets.
So, even if plugin-B depends on plugin-A, plugin-B cannot depend on the fast-backend package directly.
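The per-package, non-transitive nature of exception rules can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not guardian's actual logic: the tables DOMAIN_OF, ALLOWED, and EXCEPTIONS and the function check are hypothetical names, and ALLOWED is assumed to be already transitively closed.

```python
# Hypothetical, simplified data mirroring the configuration above.
DOMAIN_OF = {
    "plugin-A": "plugin",
    "plugin-B": "plugin",
    "plugin-base": "highlevel",
    "fast-backend": "lowlevel",
    "reference-backend": "lowlevel",
    "algorithms": "infra",
}
# Domains each domain may depend on (assumed transitively closed).
ALLOWED = {
    "plugin": {"highlevel", "infra"},
    "highlevel": {"infra"},
    "lowlevel": {"infra"},
    "infra": set(),
}
# Exceptions are attached to a single package, never to a whole domain.
EXCEPTIONS = {
    "plugin-A": {"domains": {"lowlevel"}, "packages": {"fast-backend"}},
}


def check(pkg, dep):
    """Classify the direct dependency pkg -> dep as 'ok', 'exempted', or 'violation'."""
    src, dst = DOMAIN_OF[pkg], DOMAIN_OF[dep]
    if src == dst or dst in ALLOWED[src]:
        return "ok"
    exc = EXCEPTIONS.get(pkg, {"domains": set(), "packages": set()})
    if dst in exc["domains"] or dep in exc["packages"]:
        return "exempted"
    return "violation"
```

Because the lookup is keyed by the depending package itself, check("plugin-B", "fast-backend") stays a violation even though plugin-B may depend on plugin-A.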
The point is that exception rules must be considered tentative compromises - they must be removed at some point to enforce dependency inversion at the package level and keep the entire project healthy.
To make this clear, guardian emits warnings when exception rules are used:
[info] Using configuration: dependency-domains.yaml
[info] Checking dependency of /path/to/project/" with backend Cabal
[warn] ------------------------------
[warn] * Following exceptional rules are used:
[warn] - "A2" depends on: PackageDep "B1"
[info] ------------------------------
[info] All dependency boundary is good, with some additional warning.
And when the situation allows some rules to be lifted, it also tells us as follows:
[warn] * 1 redundant exceptional dependency(s) found:
[warn] - "A2" doesn't depends on: PackageDep "B1"
So, what guardian does is:
- Check if all the packages are disjointly classified into dependency domains.
- Make sure that the dependency graph between domains forms a DAG.
- Check if all the dependency constraints are met, except for prescribed exceptions.
- Report the validation results with information about used/redundant exception rules.
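Two of these checks, the disjoint classification and the used/redundant report, can be sketched in Python. This is a simplified illustration, not guardian's code: the function names are hypothetical, only package-level exceptions are handled, and an exception is treated as "used" whenever the exempted dependency is actually present.

```python
from collections import Counter


def duplicated_packages(domains):
    """Packages listed more than once across all domains (should be empty)."""
    counts = Counter(pkg for pkgs in domains.values() for pkg in pkgs)
    return sorted(pkg for pkg, n in counts.items() if n > 1)


def classify_exceptions(exceptions, direct_deps):
    """Split package-level exceptions into (used, redundant) lists of (pkg, target)."""
    used, redundant = [], []
    for pkg, targets in sorted(exceptions.items()):
        for target in sorted(targets):
            # An exception for a dependency that no longer exists is redundant.
            present = target in direct_deps.get(pkg, set())
            (used if present else redundant).append((pkg, target))
    return used, redundant
```

With exceptions {"A2": {"B1"}}, the pair ("A2", "B1") is reported as used while A2 still depends on B1, and as redundant once that dependency has been removed, mirroring the two kinds of warnings shown above.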
One thing to note is that guardian checks dependency constraints based solely on the dependencies specified in *.cabal and/or package.yaml files.
In other words, guardian doesn't track dependencies introduced by module imports.
This is because such dependencies are already handled by the compiler.
In our case, we had two exception rules enabled at first, and it took almost half a year to finally abolish all of them. This was finally done during a major rework of the entire structure of our product. Indeed, one of the motivations for the refactoring was to completely retire the exception rules, and the overall dependency graph above was the driving force in working out the correct design of the entire package structure. In this way, the presence of exception rules in the dependency boundary constraints signals a design flaw in the package hierarchy and gives us a useful guideline for a redesign.
The benefit of dependency constraint enforcement by guardian is not limited to refactoring.
By running guardian on CI, one can always check the sanity of the entire structure when adding new features/packages to the monorepo.
When one adds new packages to the monorepo, guardian forces us to think about which domain to put them in.
If we accidentally add package dependencies that violate the constraints while adding new features, guardian will kindly tell us so.
In this way, guardian helps the product evolve healthily and keeps it from rotting.
With guardian, you can:
- divide monorepo packages into disjoint groups called dependency domains,
- define the topology of DAG of dependency domains to secure dependency boundaries,
- specify exception rules as a compromise - guardian warns about them so that you can eventually remove them.
Equipped with these features, we can:
- Keep the package hierarchy of the monorepo clean and loosely coupled.
- Be more careful when adding new packages/features, as guardian shouts at us when violations are found.
There are several possibilities in the design of the actual DAG of domains. We recommend the following rules:
- Separate abstractions and concrete implementations as far as possible.
- Ideally, applications and/or tests can depend on both.
- Keep DIP in mind: Abstractions SHOULD NOT depend on implementations. Implementation should depend on abstractions.
- Refine domains based on semantics/purposes.
- Use dependency domain constraints as a guideline for the design of the overall monorepo.
- It guides us in making out the "correct" place to put new packages/features.
- The number of exception rules is an indicator of code smell.
You can download prebuilt binaries for macOS and Linux from Release.
You can also use GitHub Action in your CI.
To build from source, we recommend using cabal-install >= 3.8:
git clone git@github.com:deepflowinc/guardian.git
cd guardian
cabal install
guardian (auto|stack|cabal) [-c|--config PATH] [DIR]
The subcommands auto, stack, and cabal specify the adapter, i.e. the build system used to compute dependencies.
- stack: uses Stack (>= 2.9) as an adapter.
- cabal: uses cabal-install (>= 3.8) as an adapter.
- auto: determines the adapter based on the directory contents and the guardian configuration. Currently, it chooses an adapter from stack or cabal.
  - If exactly one of cabal.project or stack.yaml is found, use the corresponding build system as an adapter.
  - If exactly one of the custom config sections (say, cabal: or stack:) is found in the config file, use the corresponding build system.
- custom: uses the user-specified external process as an adapter. With this adapter, you can check dependencies between arbitrary entities in any language. See the Custom adapter settings section for more detail.
Note that guardian links directly to stack and cabal-install, so you don't need those binaries.
Make sure the project configuration is compatible with the above version constraints.
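The selection rules of the auto subcommand can be read as the following Python sketch. It is an interpretation of the description above rather than guardian's code: choose_adapter is a hypothetical name, the two rules are assumed to be tried in the listed order, and returning None when neither rule applies is an assumption (the text does not say what guardian does in that case).

```python
from pathlib import Path


def choose_adapter(root, config_sections):
    """Pick 'cabal' or 'stack' for `auto`, or None if the choice is ambiguous."""
    root = Path(root)
    candidates = (("cabal", "cabal.project"), ("stack", "stack.yaml"))
    # Rule 1: exactly one of the project files exists in the project root.
    by_file = [name for name, filename in candidates if (root / filename).is_file()]
    if len(by_file) == 1:
        return by_file[0]
    # Rule 2: exactly one adapter-specific section appears in the config file.
    by_section = [name for name, _ in candidates if name in config_sections]
    if len(by_section) == 1:
        return by_section[0]
    return None
```

Here config_sections stands for the set of top-level keys found in the guardian configuration file.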
The optional argument DIR specifies the project root directory to check. If omitted, the current working directory is used.
The option --config (or -c for short) specifies the path of the guardian configuration file relative to the project root.
If omitted, dependency-domains.yaml is used.
When invoked, guardian checks in these steps:
- Define a dependency graph based on the package dependencies.
- Check if the graph forms a DAG.
- Check if the dependency domain constraints are satisfied, first ignoring exception rules.
- If a violating dependency is detected, exempt it if it is covered by any exception rule.
- If an exception rule applies, warn of its use; otherwise, report the violation as an error.
- Report results, warning about used and redundant exception rules.
The guardian configuration file consists of the following top-level sections:
Section | Description |
---|---|
domains | Required. Definition of Dependency Domains (see Domain Definition) |
wildcards | Optional. If true, package names can contain the wildcard *, which matches an arbitrary string (when enabled and you need the literal character *, use \*). Note that even if this option is set, you CANNOT use * in exception rule targets. (Default: false) |
components | Optional. Whether to track test/benchmark dependencies as well (see Component Section) |
cabal | Optional. Cabal-specific configurations (see Cabal specific settings) |
stack | Optional. Stack-specific configurations (see Stack specific settings) |
The ordering of sections is irrelevant.
See Example Configuration for a complete example.
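To make the wildcards option concrete, here is a Python sketch of one plausible matching rule. The exact semantics are assumptions not stated above: * is taken to match any string including the empty one, \* is a literal asterisk, and a pattern must match the whole package name. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a package pattern into a compiled regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\*", i):
            parts.append(re.escape("*"))  # escaped: a literal asterisk
            i += 2
        elif pattern[i] == "*":
            parts.append(".*")  # wildcard: any string, possibly empty
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, name):
    """True if the whole package name matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(name) is not None
```

Under these assumptions, the pattern plugin-* would cover plugin-A and plugin-B but not solver-plugin-A.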
Dependency domains, their members, and constraints are specified in the domains section.
The domains section must be a dictionary associating each domain label with its domain definition.
A domain label must match /[A-Za-z0-9-_]+/.
An individual domain definition object has the following fields:
Field | Type | Description |
---|---|---|
depends_on | [String] | Required. Labels of the other domains on which the domain being defined depends. |
packages | [Package] | Required. A list of packages of the domain. Each package can be a string or an object; see below for details. |
description | Maybe String | Optional. A human-readable description of the domain. Note that this field is not processed by guardian. |
A package entry occurring in the packages field can be either a string or a package object.
A single string, e.g. mypackage, is interpreted as a package object with only the package field specified, e.g. {package: "mypackage"}.
A package object has the following fields:
Field | Type | Description |
---|---|---|
package | String | Required. Package name |
exception | ExceptionRule | Optional. Package-specific exception rules |
An exception rule is an object of the form {depends_on: <an array of ExceptionItems>}.
An exception item can be either a simple string or an object of the form {package: "package-name"}.
A single string as an exception item is interpreted as a dependency on the domain whose label is the string itself.
An object {package: "package-name"} is interpreted as a dependency on the package package-name.
Example:
domains:
  A:
    depends_on: [C]
    packages:
    - A1
    - package: A2
      exception: {depends_on: [package: B1]}
  B:
    packages: [B1]
    depends_on: [C]
  C:
    packages: [C]
    depends_on: []
In the above example, packages are divided into three domains: A, B, and C.
Apart from the packages in the same domain, packages in domains A and B can depend on those in C.
An exception rule is specified for package A2 in domain A, which allows A2 to directly depend on package B1.
Even if A1 depends on A2, A1 cannot depend on B1 directly - this is how exception rules work.
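The shorthand forms of the domain definition can be summarised in a short Python sketch. It illustrates how the entries read once the YAML is parsed into plain strings and dictionaries; it is not guardian's parser, the function names are hypothetical, and requiring the label pattern to match the whole label is an assumption.

```python
import re

# Same character set as /[A-Za-z0-9-_]+/, with the hyphen moved last for clarity.
LABEL_RE = re.compile(r"[A-Za-z0-9_-]+")


def valid_label(label):
    """True if the domain label consists only of the allowed characters."""
    return LABEL_RE.fullmatch(label) is not None


def normalize_package(entry):
    """A bare string is shorthand for {'package': <string>}."""
    if isinstance(entry, str):
        return {"package": entry}
    return dict(entry)


def normalize_exception_item(item):
    """A bare string names a domain; {'package': name} names a single package."""
    if isinstance(item, str):
        return ("domain", item)
    return ("package", item["package"])
```

For example, the exception of A2 above normalises to the single item ("package", "B1"), whereas a plain string such as C would denote the whole domain C.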
Sometimes, one wants to exclude special components such as tests or benchmarks from dependency tracking.
The components section is exactly for this purpose.
It consists of the following optional fields:
Field | Type | Description |
---|---|---|
tests | Bool | Optional. If true, tracks tests (default: true). |
benchmarks | Bool | Optional. If true, tracks benchmarks (default: true). |
The components section itself can be omitted - in such cases, all the tests and benchmarks will be tracked for dependency.
Configuration specific to the cabal-install backend can be specified in the cabal top-level section.
It has the following fields:
Field | Type | Description |
---|---|---|
projectFile | FilePath | Optional. The path of the cabal project file relative to the project root (default: cabal.project). |
update | Bool or String | Optional. If true, run (the equivalent of) cabal update before dependency checking. If a non-empty string is given, it will be treated as an index-state string and passed to the update command. (default: false) |
Example:
cabal:
  projectFile: cabal-custom.project
  update: true
Example with index-state:
cabal:
  projectFile: cabal-custom.project
  update: "hackage.haskell.org,2023-02-03T00:00:00Z"
Stack-specific options are specified in the stack top-level section.
For the time being, it only has the options field, which is a possibly empty list of options to be passed to the stack command.
It can be used, for example, to specify a custom stack.yaml file as follows:
stack:
  options:
  - "--stack-yaml=stack-test.yaml"
If the stack section is omitted, the options will be treated as empty.
This is the most general adapter: it reads a dependency graph from the STDOUT of an external process. This adapter allows you to enforce dependency boundary constraints for any entity in any language, provided that there is an external program that emits a dependency graph (currently in the Dot language).
For example:
- You can write a custom shell script to emit a package dependency graph for the build system you use. For example, Bazel provides the query --output graph command.
- You can use graphmod to enforce Haskell module dependency domain constraints. See ./dependency-domains-graphmod.yaml for such an example.
The custom adapter sets the following environment variables when invoking the external process:
Variable | Description |
---|---|
GUARDIAN_ROOT_DIR | The path to the project root to check dependencies |
GUARDIAN_INCLUDE_TESTS | Set to 1 if components.tests is enabled; otherwise unset. |
GUARDIAN_INCLUDE_BENCHMARKS | Set to 1 if components.benchmarks is enabled; otherwise unset. |
Guardian will parse the standard output of the process as a Dot program and regard it as a dependency graph.
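As an illustration of this contract, a custom adapter could be as small as the following Python script. It is a hypothetical sketch: DEPS and TEST_DEPS are made-up data standing in for a real query against your build system, and the edge direction (an edge a -> b means a depends on b) is assumed to follow the convention of tools such as stack dot.

```python
#!/usr/bin/env python3
import os
import sys

# Hypothetical dependency data; a real adapter would query the build system.
DEPS = {"app": ["core"], "core": []}
TEST_DEPS = {"app-test": ["app"]}


def render_dot(deps):
    """Render a {node: [dependency, ...]} mapping as a Dot digraph."""
    lines = ["digraph deps {"]
    for node in sorted(deps):
        lines.append(f'  "{node}";')
        for dep in sorted(deps[node]):
            lines.append(f'  "{node}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)


def main(argv, env):
    # The project root comes from GUARDIAN_ROOT_DIR, else the first argument.
    root = env.get("GUARDIAN_ROOT_DIR") or (argv[1] if len(argv) > 1 else ".")
    deps = dict(DEPS)
    if env.get("GUARDIAN_INCLUDE_TESTS") == "1":
        deps.update(TEST_DEPS)
    # Diagnostics go to stderr so that stdout contains only the Dot program.
    print(f"scanning {root}", file=sys.stderr)
    print(render_dot(deps))


if __name__ == "__main__":
    main(sys.argv, os.environ)
```

Keeping diagnostics on stderr matters because only the standard output is parsed as Dot.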
You can configure the custom adapter in the custom top-level section in the configuration file.
It has the following fields:
Field | Type | Description |
---|---|---|
program | Path | Exactly one of program or shell must be specified. Specifies the path to the external program to use as an adapter. Besides the environment variables mentioned above, guardian passes the project root as the first argument. The path must point to an executable file with the right permissions. See also shell. |
shell | String | Exactly one of program or shell must be specified. You can specify a shell script instead of the path to the program. See also program. |
ignore_loop | Bool | Optional. If true, guardian ignores self-loops in the dependency graph. (default: false) |
An example setting with the shell field:
custom:
  shell: |
    stack dot --no-external --no-include-base 2>/dev/null
An example with the program field:
custom:
  program: "./decode-cabal-plan.sh"
  ignore_loop: true
with decode-cabal-plan.sh:
#!/bin/bash
cabal-plan --hide-builtin --hide-global dot 2>/dev/null \
| sed -r 's/:(test|benchmark|exe):[^\"]+//g; s/-[0-9]+(\.[0-9]+)*//g'
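For readers less familiar with sed, the two substitutions in the script can be mirrored in Python as below. This is only an explanatory sketch: normalize_plan_dot is a hypothetical name, and the sample line used in the note afterwards is made up, so real cabal-plan output may look different.

```python
import re


def normalize_plan_dot(text):
    """Mirror the two sed substitutions from decode-cabal-plan.sh."""
    # 1. Drop component suffixes such as ':test:foo-test' up to the closing quote.
    text = re.sub(r':(test|benchmark|exe):[^\\"]+', "", text)
    # 2. Drop version suffixes such as '-1.2.3' so only package names remain.
    return re.sub(r"-[0-9]+(\.[0-9]+)*", "", text)
```

A made-up edge such as "foo-1.2.3:test:foo-test" -> "bar-0.1" is reduced to "foo" -> "bar". Collapsing components into their package this way can produce edges from a package to itself, which may be why the example enables ignore_loop.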
An example calling graphmod:
custom:
  shell: graphmod --no-cluster 2>/dev/null # --no-cluster is needed to avoid subgraphs
- Only Dot format is supported.
- Custom adapter silently ignores subgraphs in a dot file.
components: # Specifies whether to track test/benchmark dependencies as well:
  tests: true
  benchmarks: false
domains:
  domain-A:
    depends_on:
    - common
    # Domain CANNOT depend on a separate package directly!
    # - package: B3 # Error!
    packages:
    - A1
    - A2
    - package: A3
      exception: # Exception rules for a particular package.
        depends_on:
        - C # domain name if a plain string
        - package: B3 # You can specify a single package name ONLY in package rule.
  domain-B:
    depends_on:
    - common
    packages:
    - B1
    - B2
    - B3
  C:
    depends_on:
    - common
    packages:
    - C1
  common:
    depends_on: [] # You MUST specify empty dependency explicitly.
    packages:
    - mybase
    - urbase
Guardian provides a GitHub Action that can be used in GitHub Workflow.
Prerequisites:
- The OS running the action must be either Linux or macOS with the following executables in PATH:
  - sha256sum
  - tar
  - curl
  - jq
- If you are using the Cabal backend and with-compiler is specified explicitly, the corresponding version of GHC must be in the PATH.
Example workflow:
check-dependecy-boundary:
  name: Checks Dependency Constraint
  runs-on: ubuntu-20.04
  continue-on-error: true
  steps:
  - uses: actions/checkout@v3
  - uses: haskell-actions/setup@v2
    with:
      ghc-version: 9.0.2 # Install needed version of ghc
  - uses: deepflowinc/guardian/action@v0.4.0.0
    name: Check with guardian
    with:
      backend: cabal # auto, cabal, or stack; auto if omitted
      version: 0.4.0.0 # latest if omitted
      ## Specify the following if the project root /= repository root
      # target: path/to/project/root
      ## If you are using non-standard name for config file
      # config: custom-dependency-domains.yaml
Please feel free to open an issue, but also please search for existing issues to check if there already is a similar one.
See CONTRIBUTING.md for more details.
(c) 2021-2023, DeepFlow Inc.
Footnotes
1. For those who are curious, please read our old slide - it is an old presentation, so some details are outdated, but the overall situation is not so different.
2. The situation, however, has improved as newer GHC releases came out and we refactored the code structure. Nowadays, it takes at most two and a half hours and only ~20GiB. The details of this change are out of the scope of this article.
3. These are not exactly what we have in reality, but a simplified form.
4. As the package fast-backend is already included in the lowlevel domain, we don't have to include fast-backend as a separate exception rule. Here, we included it for the exposition.