# GALACTUS

## Overview

Galactus is the cortex of the Interlock Network. It coordinates blockchain-activities (like setting staking yields and limits) and browser-activities (like communicating with the ThreatSlayer extension to protect users, and to reward them for their browsing activity). There is also some book-keeping involved, like storing user-configs.

Given its central role, it is designed to be robust from the very beginning. We have organized the code as a state-machine, with explicit states and events, and with assertions that validate the consistency of the program at the beginning and end of every transition, as well as on every write to any core, non-primitive object (e.g. galactus accounts). The idea is to prevent mistakes ahead of time, instead of desperately trying to find and fix them while the program actively causes damage (or -- in the best case -- refuses to run).

We have also adopted the notion of stacks and concatenative programming for encoding computations into transitions. We adopt the style of PushGP (where each major data-type has its own stack) and Harel's statecharts (where computations happen in the interstice between states -- the transitions). We implement the concatenative/stack-based notation using the very common fluent-programming style (i.e. a.b().c().d().e()).

A major advantage of combining fluent-programming with type-segregated stacks is that we get a mostly point-free style of programming (i.e. agglutinations of verbs but no nouns -- only constructor-verbs take arguments, because those are a holistic chunk of meaning -- which means that most code can be safely copy-pasted anywhere without having to rename anything), and an extremely flexible and forgiving combinatory mode of expression (i.e. in x.num(1).num(2).str('1').str('2').add().cat() we can swap the string and number operations with each other and get the same result, because each stack represents its own computational continuity -- which can provide obvious opportunities for parallelization, but that is more of an aspiration).
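The idea can be sketched in a few lines. This is a toy, not the Galactus code: the class and verb names (`Machine`, `num`, `str_`, `add`, `cat`) are illustrative, and `str_` stands in for the `str` verb mentioned above (which would shadow the Python builtin).

```python
# Toy sketch of type-segregated stacks with a fluent interface.
class Machine:
    def __init__(self):
        self.nums = []   # number stack
        self.strs = []   # string stack

    def num(self, n):    # constructor-verb: push a number
        self.nums.append(n)
        return self

    def str_(self, s):   # constructor-verb: push a string
        self.strs.append(s)
        return self

    def add(self):       # pop two numbers, push their sum
        b, a = self.nums.pop(), self.nums.pop()
        self.nums.append(a + b)
        return self

    def cat(self):       # pop two strings, push their concatenation
        b, a = self.strs.pop(), self.strs.pop()
        self.strs.append(a + b)
        return self

# Because each stack is independent, these two orderings are equivalent:
r1 = Machine().num(1).num(2).str_('1').str_('2').add().cat()
r2 = Machine().num(1).num(2).add().str_('1').str_('2').cat()
assert r1.nums == r2.nums == [3]
assert r1.strs == r2.strs == ['12']
```

Note how `add()` and `cat()` take no arguments and can be reordered freely relative to each other, since they touch disjoint stacks.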

Modeling the program as a set of unique states connected by unique transitions that are triggered by unique events gives coders the opportunity to program new features in an append-only style (i.e. we rarely have to change code in existing transitions, we just have to add new transitions and new states, while leaving everything else as-is). Bug fixes are a different matter, and require modification, but such modifications are usually append-only affairs internally, if they require the addition or removal of a verb in the fluent-call-chain. In fact, every single transition was implemented by copying the original transition begin-here -> ready, renaming it, and replacing its body, while keeping the boilerplate intact (there must be a way to reduce boilerplate with decorators, but we have been too busy to look), and just passing the (src, dst, event, trans_func) arguments to the add_transition method.
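A minimal sketch of this registration pattern, assuming only the `(src, dst, event, trans_func)` signature described above -- the `Endo` internals and the state/event names here are hypothetical:

```python
# Illustrative append-only transition registration.
class Endo:
    def __init__(self):
        self.state = "begin-here"
        self.transitions = {}  # (src, event) -> (dst, trans_func)

    def add_transition(self, src, dst, event, trans_func):
        self.transitions[(src, event)] = (dst, trans_func)

    def send(self, event):
        dst, trans_func = self.transitions[(self.state, event)]
        trans_func()           # the computation lives in the transition
        self.state = dst

endo = Endo()
# New features are appended as new states/transitions;
# existing transitions stay untouched.
endo.add_transition("begin-here", "ready", "ignite", lambda: None)
endo.add_transition("ready", "load", "public-wallet-get", lambda: None)
endo.send("ignite")
assert endo.state == "ready"
```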

Fixing the internals of verbs themselves will resemble the kinds of fixes that you are used to doing in traditional code-bases, because the verb-internals are implemented using variables. That said, most verbs lack complicated branching structures, because most branching is handled by the events and transitions in the state-machine. Some transitions contain internal branches on events (e.g. getters for objects would otherwise create dozens of unnecessary transitions, for no additional benefit). Some verbs are context-sensitive and branch extensively on context (e.g. the data_load and data_store verbs need to branch on a context-object to know which stack to push the data to, and also how to deserialize the returned data-rows).
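To make the context-branching concrete, here is a sketch of such a verb. The context values, stack names, and row shapes are assumptions for illustration, not the real `data_load`:

```python
# Illustrative context-sensitive verb: the context-object selects both
# the target stack and the deserializer for the returned data-rows.
def data_load(context, rows, stacks):
    if context == "account":
        stacks["account"].extend({"name": r[0], "email": r[1]} for r in rows)
    elif context == "site":
        stacks["site"].extend({"url": r[0], "safe": bool(r[1])} for r in rows)
    else:
        raise ValueError(f"unknown load-context: {context!r}")

stacks = {"account": [], "site": []}
data_load("account", [("alice", "a@example.com")], stacks)
data_load("site", [("https://example.com", 1)], stacks)
assert stacks["account"] == [{"name": "alice", "email": "a@example.com"}]
assert stacks["site"] == [{"url": "https://example.com", "safe": True}]
```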

The verbs are analogous to the system-calls given by an operating system to its applications -- they connect the endogenous logic of the program to the exogenous reality of the world. They present a clean separation between the spirit of the program and the manifestation of that spirit in the external world. The verbs modify the stacks, and they emit events (which the state-machine then reacts to). To modify the stacks, the verbs need to use the push/pop/peek methods, which check for access-privileges (each verb must specify ahead of time which stacks it intends to read and write from -- this catches a surprising number of bugs early on). Just like a system-call, a call into a verb passes control from the state-machine (encoded in the Endo class) to the Exo class, which is where the verbs are implemented -- these two classes communicate via the stacks, passing control back and forth between each other.
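A sketch of the access-privilege check, assuming a declare-then-use discipline; the class and method names here are hypothetical, not the actual push/pop/peek implementation:

```python
# Illustrative privilege-checked stack access: a verb declares its
# intended reads/writes up front, and any undeclared access raises.
class StackAccessError(Exception):
    pass

class Stacks:
    def __init__(self):
        self.stacks = {"num": [], "str": []}
        self.readable = set()
        self.writable = set()

    def grant(self, reads, writes):  # called before the verb runs
        self.readable, self.writable = set(reads), set(writes)

    def push(self, name, value):
        if name not in self.writable:
            raise StackAccessError(f"verb may not write stack {name!r}")
        self.stacks[name].append(value)

    def pop(self, name):
        if name not in self.readable:
            raise StackAccessError(f"verb may not read stack {name!r}")
        return self.stacks[name].pop()

s = Stacks()
s.grant(reads=["num"], writes=["num"])
s.push("num", 7)
assert s.pop("num") == 7
try:
    s.push("str", "oops")  # undeclared write is caught immediately
except StackAccessError:
    pass
```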

Unlike Harel's statecharts, which distinguish between events (missable external signals from the outside world) and conditions (boolean values derived from data internal to the program-memory), we overload the events and handle the distinction in the verb itself -- this allows us to easily create stochastified mocks inside the verbs, to exercise all transitions without having to stress over the semantic distinction between an event and a condition.
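A minimal sketch of such a stochastified mock -- the verb name, flag, and event strings are illustrative assumptions:

```python
import random

# In test mode, the verb emits a randomly chosen outcome-event so the
# fuzzer can exercise every downstream transition without a real call.
def safety_check(url, unit_test_flag=False, rng=random.Random(0)):
    if unit_test_flag:
        # stochastified mock: any outcome, no external service needed
        return rng.choice(["site-safe", "site-unsafe", "safety-timeout"])
    # in production this would query the real classifier and derive
    # the event from its response
    raise NotImplementedError

events = {safety_check("https://example.com", unit_test_flag=True)
          for _ in range(100)}
assert events <= {"site-safe", "site-unsafe", "safety-timeout"}
```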

Another deviation from Harel is that we do not have nestable states; instead, we find that in-verb branching and event-branching transitions (i.e. those that are activated by event-sets) achieve a similar effect, while keeping the topology simple (i.e. a transition from load to respond is semantically the same concept regardless of what is being loaded and what kind of response is created -- those details are best revealed by the code, since trying to infer them from event-names is basically guesswork).

Overall, these abstractions give us an infinite number of ways to organize and express the same computation, letting the developers choose whether the computation is best expressed (and easiest to understand) on the level of states and transitions, transition-branches, verb compositions, or murky verb-internals. However, expressiveness is often at odds with tractability, and so we provide facilities for specifying correctness-truths about the relations between all states, events, stacks, and classes which should prevent developers from over-expressing themselves into a dark corner of tangled logical convolutions.

We use the hypothesis framework to generate test data and to drive event/data-inputs into the state-machine (see the fuzzer class). If any assertion gets triggered, hypothesis tries to create the smallest possible input-sequence that would trigger the assertion.

Below is a state-transition-diagram representing the galactus code-base. Merged transitions are identifiable by the @ that prefixes their event-name. The event-name text is written in the direction of the transition (i.e. if the transition points downwards, then the name must be read from top to bottom).

At a glance, you can tell that only admin-events may write to the db directly, while all others must pass through a load-phase (usually to authenticate or to fetch a resource). You can also see that admins receive notifications only after retryable operations exceed their retry-limit. It is also obvious that loading, storing, and safety-checking are the only retryable operations (so far). It is apparent that we explicitly require that you provide credentials for account destruction (even if you are already logged in). To put this in quantitative terms, this diagram represents 4.5k lines of code in a single screenful.

Because we are explicit about our states and transitions, the semantics of our verbs, and the correctness constraints between all elements of the program, we can confidently engage in refactors and even total rewrites (perhaps to entirely different languages), because the notion of what makes a correct galactus is not defined in terms of any concepts that are endemic to Python or to any of the modules (e.g. fastapi, sqlalchemy, etc.) that we use.

## Cool Features

### Tests / Fuzzing

We use the RuleBasedStateMachine class from hypothesis to barrage the Endo state-machine with inputs. We do some lightweight response-validation inside the methods of the fuzzer-class, but most of the verification happens inside the Endo class, which calls the pre_verify and post_verify functions (which contain tests that get run at the beginning and end of each transition, respectively). Each test is preceded by a short comment describing what it verifies. These tests can enable/disable themselves depending on state, context, and event (among other things). Other tests can be found in the __assert__ method attached to the classes, which verifies object integrity on every modification.
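To illustrate the generate-then-shrink loop that hypothesis performs for us, here is a stdlib-only analogue. The real fuzzer uses hypothesis's RuleBasedStateMachine; the toy `run` machine, its events, and the greedy one-pass shrinker below are illustrative stand-ins, not our code:

```python
import random

EVENTS = ["ignite", "load", "store", "reset"]

def run(seq):
    """Toy stand-in for Endo: asserts loads never pile up three-deep."""
    loads = 0
    for ev in seq:
        if ev == "load":
            loads += 1
        elif ev == "reset":
            loads = 0
        assert loads < 3, "load-query-stack overflow"

def shrink(seq):
    """Greedily drop events while the assertion still fires."""
    i = 0
    while i < len(seq):
        candidate = seq[:i] + seq[i + 1:]
        try:
            run(candidate)
            i += 1            # dropping this event hides the bug: keep it
        except AssertionError:
            seq = candidate   # still fails: the event was irrelevant
    return seq

# barrage the machine with random input-sequences until one fails
failing = None
for seed in range(1000):
    rng = random.Random(seed)
    seq = [rng.choice(EVENTS) for _ in range(200)]
    try:
        run(seq)
    except AssertionError:
        failing = seq
        break

# the shrunk sequence is the smallest error-reproducing input
assert shrink(failing) == ["load", "load", "load"]
```

Hypothesis does the same thing far more cleverly: it drives rule-methods with generated data and shrinks any failing run to a minimal reproducer.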

To run the fuzzer, make sure that the `unit_test_flag` is set to `True`, and simply do `python core.py > TESTOUT 2> TESTERR`. The reduced test (smallest error-reproducing input-sequence) will be in `TESTERR`, while all the previous tests (and debug prints, if any) will be in `TESTOUT`.

The causes of most errors are obvious from the stack-trace and line-numbers, but sometimes they are not (e.g. a backoff-object failing its __assert__ test). If they are not, you can narrow down the possibilities by adding more tests to the pre_verify and post_verify functions (e.g. making sure that the backoff-object is always reset when we enter the ready state eliminates the possibility that the aforementioned assert-failure is due to drift between requests). Another good example: the tests detected that we enter the ready-state with a non-empty load-query-stack, so we created a test to verify that the stack fails to be emptied when we complete the load and move on to test-site-safety.
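The two example checks above could look something like this sketch; the function signature, state names, and stack layout are assumptions, not the real pre_verify:

```python
# Illustrative state-conditioned consistency checks in the spirit of
# pre_verify/post_verify: each test enables itself based on state.
def pre_verify(state, event, stacks, backoff_retries):
    if state == "ready":
        # the backoff-object must be reset on every ready-entry,
        # ruling out drift between requests
        assert backoff_retries == 0, "backoff not reset on ready-entry"
        # the load-query-stack must be drained before returning to ready
        assert not stacks["load-query"], "load-query-stack not emptied"

pre_verify("ready", "ignite", {"load-query": []}, 0)   # passes
try:
    pre_verify("ready", "ignite", {"load-query": ["stale"]}, 0)
except AssertionError as e:
    assert "load-query" in str(e)
```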

These tests let us express falsifiable hypotheses, and then the fuzzer-class will try to trigger a test-failure by barraging the Endo-class with inputs. We have found (and fixed) errors that were exposed only after 50 inputs were given -- something that we are unlikely to run into in the course of normal development, and something that is likely impossible to diagnose should it occur in production (short of detailed logs, and core-files with enough information to guide us towards that 50-input-sequence -- and that is assuming that the error is directly noticeable to anybody at all).

### Single Machine

### Parallel Machines

### Heatmaps

### Stochastification

## Deployment

Galactus is meant to be run on NixOS with one worker-process. We may increase the number of workers in the future, and may want to use the gunicorn backend to allow for live, zero-downtime upgrades. Using a single worker is fine, and reduces the number of error-sources significantly -- Node.js apps usually run in a single thread/worker; it is, in fact, their whole selling point. Much confusion also results -- judging by various online readings -- from running FastAPI with multiple workers without really understanding what that means. In any case, we start with 1 worker and grow to multiple workers later.

We also use SQLite for development and testing -- which is another reason to use 1 worker -- but we plan to deploy on Postgres in production (we will have to update the configuration.nix file for this).

To run the code, you have to clone this repo into a directory and run the following:

```shell
nix-shell shell.nix
uvicorn core:app --host $YOUR_IP --port 80 --workers 1
```

If you do not specify an IP, it will run on 127.0.0.1 and will only be visible to localhost-clients. If you want a different port, you have to specify it -- the uvicorn default is 8000. And the worker-count should never be increased unless we have tested and verified that multiple workers are stable.

So far, the only thing that changes with multiple workers and a shared DB is that if 2 workers try to create the same user-account they will race: 1 will create it successfully while the other will fail in a non-retryable way because of a primary-key-conflict. A single worker prevents this server-side, but users can still perceive a race client-side, if the client shows the username as available and another client completes registration before the current one does. Another possible race-condition is when the same user visits 2 sites from 2 clients and those requests arrive at galactus at the same time: with multiple workers (but not with only 1 worker), this can result (unlikely but possible) in the meters getting incremented only once instead of twice.

There are ways to prevent and mitigate this (e.g. enclosing each ready-exit and ready-entry in a serializable transaction, or making such updates diff/log/queue-oriented and having a periodic worker reduce the diff-log into a single value -- see the reference to Bloom per Alvaro below), but for now we are going with the simple solution of running one worker until it becomes a problem. When it does become a problem (and we move to multiple workers), we will use the error-log as a source/queue from which we can backfill those increments -- which means that our statistics act as a sample of all data-points, and that sample can be improved retroactively because of how sums work. Think of it as the resolution of a streaming video: it can increase or decrease dynamically as a function of load and available throughput.
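The diff/log/queue-oriented alternative mentioned above can be sketched as follows -- the function names and data shapes are toy assumptions, not our implementation:

```python
from collections import defaultdict

# Workers append increments instead of racing on read-modify-write;
# a periodic reducer folds the log into a single value per user.
log = []                      # append-only: safe for concurrent writers

def record_visit(user, n=1):
    log.append((user, n))     # each worker just appends a diff

def reduce_log():
    meters = defaultdict(int)
    for user, n in log:
        meters[user] += n     # sums are order-independent, so late or
    return dict(meters)       # backfilled entries still land correctly

record_visit("alice")
record_visit("alice")         # two "simultaneous" visits both survive
record_visit("bob")
assert reduce_log() == {"alice": 2, "bob": 1}
```

Because addition is commutative and associative, entries backfilled later from the error-log fold into the same totals, which is exactly the retroactive-sample property described above.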

## Credits/Acknowledgements/Related-Work