Convert NULLBOOL type to ATOM type #3

Merged
merged 1 commit into from
Aug 9, 2023

Contributor

@gpicron gpicron commented Mar 28, 2023

This extension is backward compatible and allows for dictionary-encoded compression in various contexts.

For instance, in SSB-DB2, we could have a dictionary for object keys, a dictionary of author pub keys, one for types, etc. That offers better compression and a faster seekPath.

@arj03
Member

arj03 commented Mar 28, 2023

I like what you are doing here with dictionaries, but I'm wondering whether this isn't a case for the extended type. Converting this into an example of how that type could be used would be great, I think.

@gpicron
Contributor Author

gpicron commented Mar 28, 2023

For the extended tag, I was planning to propose something else for less common cases. The benefit of reusing the BOOLNULL tag and converting it to an ATOM tag is that it reduces the overhead. And if you think about it, with today's use for only null, true and false, we lose a lot of bits for just 3 values. #2
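
To make the "lost bits" point concrete, here is a minimal Python sketch of how the three values are laid out today. It assumes the BIPF tag layout `(length << 3) | type` with BOOLNULL as type 6, null encoded with a zero-length body, and false/true encoded as a one-byte body of 0 or 1. The helper and variable names are illustrative only and do not come from any library.

```python
BOOLNULL = 6  # type code stored in the low 3 bits of the tag (assumed from the BIPF spec)

def tag(length: int, type_code: int) -> int:
    """Pack a body length and a type code into a one-byte tag (valid for length < 16)."""
    return (length << 3) | type_code

null_bytes = bytes([tag(0, BOOLNULL)])       # tag only, no value byte
false_bytes = bytes([tag(1, BOOLNULL), 0])   # tag plus one value byte
true_bytes = bytes([tag(1, BOOLNULL), 1])

# A one-byte body can hold 256 distinct values, but only 2 of them are used.
unused_one_byte_values = 256 - 2

print(list(null_bytes), list(false_bytes), list(true_bytes), unused_one_byte_values)
```

Under these assumptions the tag byte for a one-byte value is 14, which is the same byte that precedes the atom values in the key-dict example further down this thread.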

@gpicron
Contributor Author

gpicron commented Mar 29, 2023

Example of usage for encoding with a key dict:

The JSON equivalent is { "key1": "Foo", "key2": "Bar" }

DEBUG encoded without keyDict: 
[149, 1, 32, 107, 101, 121, 49, 24, 70, 111, 111, 32, 107, 101, 121, 50, 24, 66, 97, 114]
len : 20
DEBUG encoded using a keyDict: 
[101, 14, 3,                    24, 70, 111, 111, 14, 4,                 24, 66, 97, 114]
len : 13
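
The two byte arrays above can be reproduced with a short Python sketch. This is not the nim_bipf implementation. It assumes the BIPF tag layout `(length << 3) | type` written as a varint, type codes 0 (string), 5 (object) and 6 (atom), and a hypothetical key dictionary that maps `key1` to 3 and `key2` to 4, values read off the debug output. Only string values and atoms that fit in one byte are handled.

```python
STRING, OBJECT, ATOM = 0, 5, 6  # BIPF type codes; 6 is the former BOOLNULL

def varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def tagged(type_code: int, body: bytes) -> bytes:
    """Prefix a body with its tag: (length << 3) | type, as a varint."""
    return varint((len(body) << 3) | type_code) + body

def encode_string(text: str) -> bytes:
    return tagged(STRING, text.encode("utf-8"))

def encode_object(obj, key_dict=None) -> bytes:
    """Encode a flat {str: str} object, replacing known keys by one-byte atoms."""
    body = bytearray()
    for key, value in obj.items():
        if key_dict is not None and key in key_dict:
            body += tagged(ATOM, bytes([key_dict[key]]))  # assumption: atom id < 256
        else:
            body += encode_string(key)
        body += encode_string(value)
    return tagged(OBJECT, bytes(body))

message = {"key1": "Foo", "key2": "Bar"}
without_dict = list(encode_object(message))
with_dict = list(encode_object(message, {"key1": 3, "key2": 4}))  # hypothetical ids

print(without_dict, len(without_dict))
print(with_dict, len(with_dict))
```

Each dictionary hit replaces a 5-byte string key (1 tag byte plus 4 characters) with a 2-byte atom, which accounts for the drop from 20 to 13 bytes, including the shorter object body.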

Note: currently I have only implemented a naive KeyDict for the object field keys. In the context of SSB, larger gains in space and scan speed can be achieved using a Dict for some specific fields like Author, types, etc.

Some simulations to compare memory usage:

----------------------------------------
feed dump: tests-js/test-arj.json
feed Messages length: 9200
size for Array of JS objects : 10171730 bytes/ 9933 KB/ 10 MB
 average message size as JS Object: 1105.6228260869566  bytes
size for Array of BIPF : 6425120 bytes/ 6275 KB/ 6 MB
 average message size as BIPF: 698.3826086956522  bytes
ratio BIPF/JS: 63 %
size for BIPF Array : 6351524 bytes/ 6203 KB/ 6 MB
 average message size as BIPF Array: 690.3830434782609  bytes
ratio BIPF Array/JS: 62 %
size for BIPF Array with key dict : 5562270 bytes/ 5432 KB/ 5 MB
 average message size as BIPF Array with key dict: 604.5945652173913  bytes
size in memory of key dict : 3030 bytes/ 3 KB/ 0 MB
ratio BIPF Array with key dict/JS: 55 %
----------------------------------------
feed dump: tests-js/test-cel.json
feed Messages length: 1266
size for Array of JS objects : 1148226 bytes/ 1121 KB/ 1 MB
 average message size as JS Object: 906.9715639810427  bytes
size for Array of BIPF : 760168 bytes/ 742 KB/ 1 MB
 average message size as BIPF: 600.4486571879937  bytes
ratio BIPF/JS: 66 %
size for BIPF Array : 750044 bytes/ 732 KB/ 1 MB
 average message size as BIPF Array: 592.4518167456556  bytes
ratio BIPF Array/JS: 65 %
size for BIPF Array with key dict : 638560 bytes/ 624 KB/ 1 MB
 average message size as BIPF Array with key dict: 504.391785150079  bytes
size in memory of key dict : 4086 bytes/ 4 KB/ 0 MB
ratio BIPF Array with key dict/JS: 56 %
----------------------------------------
feed dump: tests-js/test-andre.json
feed Messages length: 5560
size for Array of JS objects : 9908370 bytes/ 9676 KB/ 9 MB
 average message size as JS Object: 1782.0809352517986  bytes
size for Array of BIPF : 5754826 bytes/ 5620 KB/ 5 MB
 average message size as BIPF: 1035.0406474820145  bytes
ratio BIPF/JS: 58 %
size for BIPF Array : 5710350 bytes/ 5577 KB/ 5 MB
 average message size as BIPF Array: 1027.0413669064749  bytes
ratio BIPF Array/JS: 58 %
size for BIPF Array with key dict : 5242743 bytes/ 5120 KB/ 5 MB
 average message size as BIPF Array with key dict: 942.9393884892087  bytes
size in memory of key dict : 4226 bytes/ 4 KB/ 0 MB
ratio BIPF Array with key dict/JS: 53 %
----------------------------------------
feed dump: tests-js/test-bench-fixture.json
feed Messages length: 80000
size for Array of JS objects : 110776138 bytes/ 108180 KB/ 106 MB
 average message size as JS Object: 1384.701725  bytes
size for Array of BIPF : 66056237 bytes/ 64508 KB/ 63 MB
 average message size as BIPF: 825.7029625  bytes
ratio BIPF/JS: 60 %
size for BIPF Array : 65416242 bytes/ 63883 KB/ 62 MB
 average message size as BIPF Array: 817.703025  bytes
ratio BIPF Array/JS: 59 %
size for BIPF Array with key dict : 59225490 bytes/ 57837 KB/ 56 MB
 average message size as BIPF Array with key dict: 740.318625  bytes
size in memory of key dict : 4260 bytes/ 4 KB/ 0 MB
ratio BIPF Array with key dict/JS: 53 %
----------------------------------------
Inserting anchors with the following rules:
 - 1 anchor when size of messages since last anchor is > 500 KB in BIPF
 - 1 anchor when duration since last anchor is > 3 months
----------------------------------------
feed anchors for  arj
anchors : 28
size for Array of anchors : 4677 bytes/ 5 KB/ 0 MB
size for Array of anchors : 3572 bytes/ 3 KB/ 0 MB
----------------------------------------
feed anchors for  cel
anchors : 26
size for Array of anchors : 4347 bytes/ 4 KB/ 0 MB
size for Array of anchors : 3320 bytes/ 3 KB/ 0 MB
----------------------------------------
feed anchors for  andre
anchors : 10
size for Array of anchors : 1705 bytes/ 2 KB/ 0 MB
size for Array of anchors : 1302 bytes/ 1 KB/ 0 MB
----------------------------------------
feed anchors for  generated-ssb-fixtures
anchors : 3
size for Array of anchors : 550 bytes/ 1 KB/ 0 MB
size for Array of anchors : 420 bytes/ 0 KB/ 0 MB
----------------------------------------
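
The two anchor insertion rules stated in the output above can be sketched as follows. This is a guess at the logic, not the simulation script itself: it assumes 1 KB = 1024 bytes (consistent with the sizes printed above), approximates "3 months" as 90 days, and assumes both counters reset whenever an anchor is inserted. What an anchor contains is not modelled. `Msg` and `anchor_positions` are hypothetical names.

```python
from dataclasses import dataclass

SIZE_LIMIT = 500 * 1024        # bytes of BIPF since the last anchor
AGE_LIMIT = 90 * 24 * 3600     # seconds; "3 months" approximated as 90 days (assumption)

@dataclass
class Msg:
    timestamp: float  # seconds since epoch
    bipf_size: int    # encoded size in bytes

def anchor_positions(messages):
    """Return the indexes of the messages after which an anchor would be inserted."""
    anchors = []
    size_since_anchor = 0
    last_anchor_time = messages[0].timestamp if messages else 0
    for index, msg in enumerate(messages):
        size_since_anchor += msg.bipf_size
        too_big = size_since_anchor > SIZE_LIMIT
        too_old = msg.timestamp - last_anchor_time > AGE_LIMIT
        if too_big or too_old:
            anchors.append(index)
            size_since_anchor = 0           # assumption: both counters reset together
            last_anchor_time = msg.timestamp
    return anchors

day = 24 * 3600
feed = [
    Msg(0, 200 * 1024),
    Msg(1, 200 * 1024),
    Msg(2, 200 * 1024),          # cumulative size crosses 500 KB here
    Msg(2 + 91 * day, 10),       # more than 90 days after the previous anchor
]
print(anchor_positions(feed))
```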

Worst case scenario: using the arj feed as a reference * 500 users
----------------------------------------
size of feed in BIPF (default) for 500 followed : 3175762000 bytes/ 3101330 KB/ 3029 MB
size of feed in BIPF (with key dict) for 500 followed : 2781135000 bytes/ 2715952 KB/ 2652 MB
size of feed in JS for 500 followed : 5085865000 bytes/ 4966665 KB/ 4850 MB
size of anchors for 500 followed : 2342500 bytes/ 2288 KB/ 2 MB
size of anchors (with Key dict) for 500 followed : 1790000 bytes/ 1748 KB/ 2 MB

If only keeping last 12 months of messages based on anchors:
 - number of messages since first anchor older than 12 months: 410
 - size of messages since first anchor in JS: 360380 bytes/ 352 KB/ 0 MB

 projection for 500 followed users:
 - size of messages since first anchor in BIPF: 118293000 bytes/ 115521 KB/ 113 MB
 - size of messages since first anchor in BIPF (with key dict): 101882500 bytes/ 99495 KB/ 97 MB
 - size of messages since first anchor in JS: 180190000 bytes/ 175967 KB/ 172 MB

If only keeping last 24 months of messages based on anchors:
 - number of messages since first anchor older than 24 months: 1182
 - size of messages since first anchor in JS: 1242128 bytes/ 1213 KB/ 1 MB

 projection for 500 followed users:
 - size of messages since first anchor in BIPF: 393342500 bytes/ 384124 KB/ 375 MB
 - size of messages since first anchor in BIPF (with key dict): 344931500 bytes/ 336847 KB/ 329 MB
 - size of messages since first anchor in JS: 621064000 bytes/ 606508 KB/ 592 MB
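
The worst-case feed projection is a straight multiplication of the arj feed sizes by 500, and can be checked with a few lines of Python. The input figures are copied from the output above; note that the "BIPF (default)" projection matches the "BIPF Array" size rather than the "Array of BIPF" size. The `human` helper is a hypothetical reconstruction of the print format, assuming 1 KB = 1024 bytes and rounding to the nearest unit.

```python
FOLLOWED = 500

# Sizes in bytes for the arj feed, copied from the simulation output.
ARJ_SIZES = {
    "BIPF (default)": 6_351_524,        # the "BIPF Array" figure
    "BIPF (with key dict)": 5_562_270,
    "JS": 10_171_730,
}

def human(n_bytes: int) -> str:
    """Format a size the way the simulation output does (assumed: rounded, 1024-based)."""
    kb = round(n_bytes / 1024)
    mb = round(n_bytes / 1024 / 1024)
    return f"{n_bytes} bytes/ {kb} KB/ {mb} MB"

projection = {name: size * FOLLOWED for name, size in ARJ_SIZES.items()}
for name, size in projection.items():
    print(f"size of feed in {name} for {FOLLOWED} followed : {human(size)}")

# Relative saving of the key dict over default BIPF for this feed.
saving = 1 - projection["BIPF (with key dict)"] / projection["BIPF (default)"]
print(f"key dict saving over default BIPF: {saving:.1%}")
```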

Some performance metrics on encoding/decoding:

  • bipf is the JS reference impl
  • nim_bipf is my lib coded in Nim, compiled to JS
  • nim_bipf_node is my lib coded in Nim, compiled to a native Node.js module
Suite: Encoding data ssb messages from arj,cel,andre
   bipf#encode/ssb messages from arj,cel,andre (#)                          0%         (59,296 rps)   (avg: 16μs)
   nim_bipf#serialize/ssb messages from arj,cel,andre                  +53.34%         (90,924 rps)   (avg: 10μs)
   nim_bipf#serializeWithKeyDict/ssb messages from arj,cel,andre       +123.5%        (132,528 rps)   (avg: 7μs)
   nim_bipf_node#serialize/ssb messages from arj,cel,andre             +21.91%         (72,287 rps)   (avg: 13μs)
   json#stringify/ssb messages from arj,cel,andre                     +304.82%        (240,043 rps)   (avg: 4μs)
-----------------------------------------------------------------------

Suite: Encoding data ssb messages from ssb-fixture
   bipf#encode/ssb messages from ssb-fixture (#)                          0%         (79,714 rps)   (avg: 12μs)
   nim_bipf#serialize/ssb messages from ssb-fixture                  +46.15%        (116,505 rps)   (avg: 8μs)
   nim_bipf#serializeWithKeyDict/ssb messages from ssb-fixture       +55.05%        (123,599 rps)   (avg: 8μs)
   nim_bipf_node#serialize/ssb messages from ssb-fixture             -11.46%         (70,579 rps)   (avg: 14μs)
   json#stringify/ssb messages from ssb-fixture                      +215.2%        (251,259 rps)   (avg: 3μs)
-----------------------------------------------------------------------

Suite: Decoding data ssb messages from arj,cel,andre
   bipf#decode/ssb messages from arj,cel,andre (#)                            0%        (189,123 rps)   (avg: 5μs)
   nim_bipf#deserialize/ssb messages from arj,cel,andre                   -1.62%        (186,057 rps)   (avg: 5μs)
   nim_bipf#deserializeWithKeyDict/ssb messages from arj,cel,andre       +36.69%        (258,515 rps)   (avg: 3μs)
   nim_bipf_node#deserialize/ssb messages from arj,cel,andre             -26.41%        (139,177 rps)   (avg: 7μs)
   json#parse(string)/ssb messages from arj,cel,andre                    +29.75%        (245,390 rps)   (avg: 4μs)
   json#parse(buffer)/ssb messages from arj,cel,andre                     +53.7%        (290,687 rps)   (avg: 3μs)
-----------------------------------------------------------------------

Suite: Decoding data ssb messages from ssb-fixture
   bipf#decode/ssb messages from ssb-fixture (#)                            0%        (236,537 rps)   (avg: 4μs)
   nim_bipf#deserialize/ssb messages from ssb-fixture                   +3.19%        (244,080 rps)   (avg: 4μs)
   nim_bipf#deserializeWithKeyDict/ssb messages from ssb-fixture       +28.95%        (305,005 rps)   (avg: 3μs)
   nim_bipf_node#deserialize/ssb messages from ssb-fixture                -32%        (160,854 rps)   (avg: 6μs)
   json#parse(string)/ssb messages from ssb-fixture                     +20.1%        (284,090 rps)   (avg: 3μs)
   json#parse(buffer)/ssb messages from ssb-fixture                     +39.6%        (330,196 rps)   (avg: 3μs)
-----------------------------------------------------------------------
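
For reading the tables above: the percentage column is the throughput relative to the row marked (#) in the same suite, and the average column appears to be the whole number of microseconds per operation. The Python sketch below recomputes a few rows from the rps figures. The function names are made up here, and the flooring in `avg_microseconds` is inferred from the printed values, not taken from the benchmark tool.

```python
import math

def relative_percent(rps: float, baseline_rps: float) -> float:
    """Throughput change versus the baseline row, in percent."""
    return round((rps / baseline_rps - 1) * 100, 2)

def avg_microseconds(rps: float) -> int:
    """Average time per operation, truncated to whole microseconds (inferred)."""
    return math.floor(1_000_000 / rps)

# Encoding suite, arj/cel/andre messages; baseline is bipf#encode.
baseline = 59_296
print(relative_percent(90_924, baseline), avg_microseconds(90_924))    # nim_bipf#serialize
print(relative_percent(132_528, baseline), avg_microseconds(132_528))  # serializeWithKeyDict
print(relative_percent(240_043, baseline), avg_microseconds(240_043))  # json#stringify

# Decoding suite, arj/cel/andre messages; baseline is bipf#decode.
print(relative_percent(186_057, 189_123))  # nim_bipf#deserialize
```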

@gpicron
Contributor Author

gpicron commented Mar 30, 2023

@arj03 @staltz @mixmix, as you are the most active on SIPs, could you review this spec change? The main next step for me would be to update the text to comply with the SIPs rules and propose moving it there.

Member

@staltz staltz left a comment

I'm a little hesitant about this change, because overloading like this feels hacky, or reduces the simplicity of BIPF. But I think it's okay to add this.

I think it will be important to highlight that when an ATOM is used for anything other than null/false/true, then it should always be application-internal semantics. So there should be a contract that application-internal ATOM values should never be sent to peers over the network. It should be considered invalid. I think it's important to mention this clearly.
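
One way such a contract could be enforced at the network boundary is sketched below in Python. It walks a BIPF buffer and rejects any ATOM whose value is not null, false or true. It assumes the tag layout `(length << 3) | type` as a varint, type codes 4 (array), 5 (object) and 6 (atom), and null/false/true encoded as an empty body, 0 and 1. `is_network_safe` is a hypothetical name, not an existing API.

```python
ARRAY, OBJECT, ATOM = 4, 5, 6  # BIPF type codes (assumed from the spec)

def read_varint(buf, pos):
    """Decode a little-endian base-128 varint; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def is_network_safe(buf, pos=0, end=None):
    """True if every ATOM between pos and end is null, false or true."""
    end = len(buf) if end is None else end
    while pos < end:
        tag, pos = read_varint(buf, pos)
        length, type_code = tag >> 3, tag & 7
        body_end = pos + length
        if type_code == ATOM:
            # length 0 is null; a one-byte body must be 0 (false) or 1 (true)
            if length > 1 or (length == 1 and buf[pos] > 1):
                return False
        elif type_code in (ARRAY, OBJECT):
            if not is_network_safe(buf, pos, body_end):
                return False
        pos = body_end  # skip over strings, buffers, numbers and checked bodies
    return True

plain = bytes([149, 1, 32, 107, 101, 121, 49, 24, 70, 111, 111,
               32, 107, 101, 121, 50, 24, 66, 97, 114])
with_key_dict = bytes([101, 14, 3, 24, 70, 111, 111, 14, 4, 24, 66, 97, 114])
print(is_network_safe(plain), is_network_safe(with_key_dict))
```

The two inputs are the byte arrays from the key-dict example earlier in this thread: the plain encoding passes, while the one using atoms 3 and 4 as keys is rejected.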

@gpicron
Contributor Author

gpicron commented Mar 30, 2023

> I'm a little hesitant about this change, because overloading like this feels hacky, or reduces the simplicity of BIPF. But I think it's okay to add this.
>
> I think it will be important to highlight that when an ATOM is used for anything other than null/false/true, then it should always be application-internal semantics. So there should be a contract that application-internal ATOM values should never be sent to peers over the network. It should be considered invalid. I think it's important to mention this clearly.

This can be used in interoperability cases too, if the meaning of atoms other than null, true and false is shared in some way.

Example 1: like in Erlang OTP

For instance, take 2 peers connected via TCP (or any transport that guarantees message ordering) that exchange messages in BIPF.

Each peer creates a sending cache and a receiving cache (let's say an array of 2048 entries).

When A sends a message to B, it encodes the message using the sending cache as a kind of dictionary for object keys.
When encoding in A: for each key, it computes a hashcode of the key and looks it up in the sending cache. If it matches, it replaces the key with Atom(index in the sending cache). Otherwise, it stores the key in the sending cache and encodes the key as a string.
When decoding in B: if the key is an atom, it replaces the atom with the corresponding string value in the receiving cache. If the key is a string, it computes its hashcode and places it in the receiving cache.

Similarly, you can have additional caches for some paths in messages where the value is repeated often during the communication between 2 peers, for instance the Author field.
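
The cache scheme described in Example 1 can be sketched in a few lines of Python. This is an illustration under assumptions, not a wire format: keys are represented as symbolic `("atom", index)` or `("string", key)` tokens rather than BIPF bytes, CRC32 is an arbitrary choice of hash that both peers can compute identically, and all class and function names are made up. A real encoding would also have to keep cache indexes clear of the atom values already used for false and true, which is left out here.

```python
import zlib

CACHE_SIZE = 2048

def slot(key: str) -> int:
    """Cache index for a key; both peers must use the same deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % CACHE_SIZE

class SendingCache:
    def __init__(self):
        self.entries = [None] * CACHE_SIZE

    def encode_key(self, key: str):
        index = slot(key)
        if self.entries[index] == key:
            return ("atom", index)      # receiver already holds this key
        self.entries[index] = key       # first use, or a colliding key was evicted
        return ("string", key)

class ReceivingCache:
    def __init__(self):
        self.entries = [None] * CACHE_SIZE

    def decode_key(self, token):
        kind, value = token
        if kind == "atom":
            return self.entries[value]
        self.entries[slot(value)] = value  # mirror the sender's cache update
        return value

sender, receiver = SendingCache(), ReceivingCache()
keys = ["author", "author", "type", "author"]
tokens = [sender.encode_key(k) for k in keys]
decoded = [receiver.decode_key(t) for t in tokens]
print(tokens)
print(decoded)
```

The scheme only works because delivery is ordered: the receiver replays exactly the cache updates the sender made, so an atom index always refers to the same string on both sides.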

Example 2: predefined schemas.

Just use message schemas that everybody knows, with numbered fields like in Protobuf; keys are atoms carrying the field number.
This is actually already the case, but informally. Such a message schema would act like a pre-shared dictionary.
In fact, you can use Protobuf as the schema language and create a BIPF serializer/deserializer quite easily.
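
As a rough illustration of the pre-shared dictionary idea, the Python sketch below maps field names to numbers and back. The schema, its field names and its numbering are entirely hypothetical; starting at 3 simply mirrors the atom values seen in the key-dict example earlier in this thread. No Protobuf tooling is involved.

```python
# Hypothetical schema that both peers know in advance: field number -> field name.
POST_SCHEMA = {3: "author", 4: "type", 5: "text"}
FIELD_NUMBER = {name: number for number, name in POST_SCHEMA.items()}

def keys_to_atoms(message: dict) -> dict:
    """Replace each field name by its schema number (to be written as an atom)."""
    return {FIELD_NUMBER[name]: value for name, value in message.items()}

def atoms_to_keys(fields: dict) -> dict:
    """Restore field names from schema numbers on the receiving side."""
    return {POST_SCHEMA[number]: value for number, value in fields.items()}

post = {"author": "@abc", "type": "post", "text": "hello"}
compact = keys_to_atoms(post)
print(compact)
print(atoms_to_keys(compact) == post)
```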

@mixmix
Member

mixmix commented Apr 2, 2023

Not my domain of expertise. Will defer to the bipf authors, UNLESS you want an outside perspective, in which case pull me back in.

@gpicron
Contributor Author

gpicron commented Apr 4, 2023

Can someone with write access merge it?

@Powersource
Contributor

author wants to merge, and someone else approved, so i'll just merge

@Powersource Powersource merged commit 71d36a9 into ssbc:master Aug 9, 2023
@gpicron gpicron mentioned this pull request Aug 11, 2023