Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NIP-45 for COUNT #144

Closed
wants to merge 2 commits into from
Closed

Conversation

staab
Copy link
Member

@staab staab commented Jan 4, 2023

This is the beginning of sort of a second-layer indexing scheme, I came across the need for it when implementing follower count. Currently I'm downloading ~4000 events in the case of jack, and throwing them away just to get the count.

@fiatjaf
Copy link
Member

fiatjaf commented Jan 4, 2023

This is interesting, but I think it might be better to make a new message, separate from REQ, to just ask for counts. It could take the same filter format and just return the count of events instead of the events themselves.

@v0l
Copy link
Member

v0l commented Jan 4, 2023

A REQC command could be trivial to implement for relays using the existing REQ filter format, a response EVENTC message can return the count for each filter:

Example:
["REQC", { filter1 }, { filter2 }]
["EVENTC", n_filter1, n_filter2, ...]

Some relays may also artificially limit all results sets to avoid abuse, counting rows can be very expensive so relays should also implement abuse prevention for this type

@eskema
Copy link
Collaborator

eskema commented Jan 4, 2023

I also think that it would be better to simply have a new command 'COUNT' that works exactly like 'REQ' but returns just the events.length:

request:

['COUNT', 'sub_id', {filter}]

response:

['COUNT', 'sub_id', 'num']

and what is the expected behaviour if you don't close the req? should the relay keep sending new counts (i.e. num++ ) or is this like a Notice single message?

@staab
Copy link
Member Author

staab commented Jan 4, 2023

I think I prefer @eskema's version as well. I don't think streaming count is useful enough for the complexity, so I would make it just a single message. I'll revise the NIP when I have a chance.

@staab
Copy link
Member Author

staab commented Jan 6, 2023

Updated the PR, let me know what you think. I'm slightly inclined to make this a little more open ended in case someone wants something other than a count, for example a unique list of pubkeys, or rounded time series data. But I suppose we can re-use the GROUP verb with a different request verb.

@eskema
Copy link
Collaborator

eskema commented Jan 6, 2023

not sure I like this approach, the groups seem unnecessary, why not send a new req for the counts you need instead of lumping it like that? or, just return a count for each filter if you want multiples in a req, but subgroups in the reqs seem complicated for no reason. why woold you lump a bunch of requests and ask separate counts?

@staab
Copy link
Member Author

staab commented Jan 6, 2023

In order to get certain groups, you would have to know what filters to use, which would require retrieving all events in the first place. For example, notes by pubkey:

["COUNT", "", {filters: {kinds: [1], since: <timestamp>}, group_by: ['pubkey']}]
["GROUP", "", {group: [<pubkey1>], count: 238}]
["GROUP", "", {group: [<pubkey1>], count: 21}]

@fiatjaf
Copy link
Member

fiatjaf commented Jan 6, 2023

Isn't this trying to replace SQL with a JSON query language?

I don't have a clear reason for this, but it doesn't seem right to treat relays as general-purpose databases.

@staab
Copy link
Member Author

staab commented Jan 6, 2023

It does cover the GROUP BY case popularized by sql, yes. Just like limit = limit, since/until = order by, offset, kind/#e/#p/author = where. Grouping is useful.

@eskema
Copy link
Collaborator

eskema commented Jan 6, 2023

I think it should simply be:

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']}]
["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

i.e, you do the grouping yourself before asking the relay, or

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']},  {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

and the response be count1, count2

@staab
Copy link
Member Author

staab commented Jan 6, 2023

@fiatjaf I see what you mean though, this opens the door to more and more complexity, while second-layer protocols or centralized services can perform this function successfully. I'm just not sure where to draw the line. Probably this side of optimizations.

@staab staab force-pushed the nip-45 branch 2 times, most recently from 897cd31 to dcc111a Compare January 11, 2023 16:12
@staab
Copy link
Member Author

staab commented Jan 11, 2023

Alright, since groups seem to be unpopular I've removed them, please take another look.

@Giszmo
Copy link
Member

Giszmo commented Jan 11, 2023

Reactions are the one thing I see that would most profit from a "count" nip but in its current form it doesn't profit at all and as the example was not brought up in the discussion, please pardon for beating this dead horse:

I think the most useful for light clients would be to stay close to the REQ syntax but optionally allow to group_by. To still allow matching queries to results, the easiest approach is to have an array per query, with those using group_by returning the fields grouped by, too.

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [{'#e': <eid1>, content: '', count: 12},
        {'#e': <eid1>, content: '+', count: 5},
        {'#e': <eid1>, content: '+ ', count: 1},
        {'#e': <eid1>, content: 'banana', count: 1},
        {'#e': <eid1>, content: '🤢', count: 3},
        {'#e': <eid1>, content: '🍆', count: 2},
        {'#e': <eid2>, content: '', count: 1}
    ],
    [],
    [{count: 5745}]]

While I think the above is preferable, it could of course also be pruned by some implicitly given information:

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [[<eid1>,
        ['', 12],
        ['+', 5],
        ['+ ', 1],
        ['banana', 1],
        ['🤢', 3],
        ['🍆', 2]],
    [[<eid2>,
        ['', 1]],
    [],
    5745]

@Giszmo
Copy link
Member

Giszmo commented Jan 11, 2023

Also while I do feel pity for my server that has to process tons of data to return some simple "12432", we have to see what is the overall most efficient way of achieving certain results and if the server doing x3 the work so the client can do just 1/5 of the work then that is acceptable for some client devs and server operators.

I assume that at scale, users will pay one way or another for resources used on the servers and I'm totally fine with a free tier that does not see likes beyond likes from their follows for example. The idea here is to standardize what's useful for some and I totally see the usefulness for some in the group_by.

@brugeman
Copy link
Contributor

Ack, I've had this implemented in wss://relay.nostr.band more than a month ago, somehow discussion got lost in the telegram group, you can try
["COUNT","1",{"#p":["84dee6e676e5bb67b4ad4e042cf70cbd8681155db535942fcc6a0533858a7240"], "kinds":[3]}]

@staab staab changed the title Add NIP-45, which defines a attribute on REQs Add NIP-45 for COUNT Jan 26, 2023
@pseudozach
Copy link
Contributor

Concept ACK
I'd like to see this functionality, will test relays with this capability and implement it in my client.

@mikedilger
Copy link
Contributor

How about adding that clients should check NIP-11 first to see if they support NIP-45 before issuing COUNT requests.

@fiatjaf
Copy link
Member

fiatjaf commented Feb 1, 2023

We should probably merge this.

@fabianfabian
Copy link

Different relays have different events and counts, so how would you get to the final count? You can't just add them up. Right now you have to get different events from multiple relays, filter the duplicates and thats the count from the client perspective.

@staab
Copy link
Member Author

staab commented Feb 2, 2023

That's a super good point, I hadn't thought of that. I do think COUNT could still be useful as an approximation (since it is anyway), you could take the max of multiple results, or select which relay to request a count from.

@staab
Copy link
Member Author

staab commented Feb 15, 2023

Aaand I've changed my mind, I've decided that COUNT is not useful. If we add relay extensions we could put something together that would be more complete.

@adamritter
Copy link

Hi,

I'm adding COUNT to my metadata / contact indexing nodes (wss://us.rbr.bio and wss://eu.rbr.bio) for follower counts, and it was easy to implement, but I have a few questions:

Why {count: 30} when ["COUNT","subname",30] would already contain the information and is simpler? For group_by extensions it makes sense to have an array, but for simple counts I would prefer just a simple number (this is just a nitpick though).

The more interesting question: I would just like to support a few types of COUNT and group_by queries. I think the best would be if I could specify the types of queries my relay supports in the relay information document (although that should be handled in another NIP).

And the main question: how should the server reply if it doesn't handle the specific COUNT query? Should the relay just reply with a NOTICE, or EOSE, or do nothing? I guess NOTICE would be the most backwards compatible, but NOTICE doesn't contain the subscription name, as it just supports 1 message.

@adamritter
Copy link

One more extra question: the NIP contains just "" as the second parameter, but I guess it's a label (maybe not subscription) that helps identifying the reply. It should be clear in the documentation.

@staab
Copy link
Member Author

staab commented Mar 13, 2023

I'm adding COUNT to my metadata / contact indexing nodes

This is exactly the place that COUNT makes sense. It's not good for direct use with multiple relays, but if you're using a relay multiplexer or indexer, COUNT can work.

To answer @adamritter's questions:

  • For simple counts I'm not using a number so that additions to the NIP can be made later without breaking backwards compatibility. One rule of thumb I've learned is always return an associative data type from APIs that aren't easily refactored.
  • I would guess NOTICE if not supported, or just ignore the request, but I don't really know.
  • The "" parameter is the subscription id, standard with REQ etc. Probably a good idea to clarify though

I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

@staab staab force-pushed the nip-45 branch 2 times, most recently from 092d8f5 to 396e52d Compare March 13, 2023 17:36
@adamritter
Copy link

I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

I'm running 2 relay servers that hold all metadata and contacts of all relays in RAM and serves them (wss://us.rbr.bio and wss://eu.rbr.bio).

I implemented it with hash maps in RAM, I don't use any query engine.

Already added follower counts support, but it's also a specific hash map just for followers:

https://iris.to/npub1dcl4zejwr8sg9h6jzl75fy4mj6g8gpdqkfczseca6lef0d5gvzxqvux5ey

I'm also planning to implement group_by authors/pubkey just for getting the list of all followers, as it's important and supported by the data structures in my server.

I was thinking of the selective supporting because most relay implementations have secondary indices that efficiently support count for certain types of queries, but it maybe not important to require those indices to be used for group_by operations (for example the reactions are not that many so group_by is easy for reactions on the server even if there is no index for it).

To tell you the truth, I'm also not sure about this selective support (I'm trying to conform to the standard as much as possible), but what's more important is that even if we do it, it shouldn't be part of this NIP, so there's no need to take a decision.

There may be millions of group by results though, so the group_by NIP should maybe specify a limit: and returning limited number of results (relays usually have a default max limit anyways).

@adamritter
Copy link

The current specification doesn't specify if multiple counts are supported or not in 1 query (I think they should be supported).

Right now there's both {count: 3} and [{count:3}] as suggestions for simple counts, I prefer {count:3}, and using arrays for group by when we extend count to support group by (which is needed)

@adamritter
Copy link

adamritter commented Mar 14, 2023

I implemented basic group_by support for getting followers on wss://us.rbr.bio:

["COUNT","hello",
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3]},
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3],"group_by":["pubkey"]}]

["COUNT","hello",{"count":50538}, 
  [{"pubkey":"82341f882b6eabcd2ba7f1ef90aad961cf074af15b9ef44a09f9d2a8fbfbe6a2","count":1},...

(returns top 1000 followers by popularity)

@staab
Copy link
Member Author

staab commented Mar 31, 2023

I've re-introduced group_by to fit @adamritter's implementation. If relays prefer not to implement it, they can raise a NOTICE or ignore the group_by key. I think we should go ahead and merge this, as there are multiple implementations in the wild, and despite the problems with merging results from a relay, COUNT can be done more reliably by either indexers or multiplexers.

@fiatjaf
Copy link
Member

fiatjaf commented Mar 31, 2023

This still looks like it is not really solving any problems, specially the group_by clause, it makes no sense to me.

@staab
Copy link
Member Author

staab commented Mar 31, 2023

Can you elaborate? Currently coracle only uses COUNT in one place, but it allows me to avoid downloading many megabytes of data to populate a single number. A couple use cases for group_by are enumerated above. Your comment that it might be burdensome to relays is valid, but it's opt-in as written, and can be useful for analytics as well as more common uses.

Copy link
Collaborator

@Semisol Semisol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follower count will not work with this accurately. Outdated lists, sending to multiple relays, ...

@staab
Copy link
Member Author

staab commented Mar 31, 2023

Follower count will not work with this accurately. Outdated lists, sending to multiple relays, ...

Yes, however it will work if asking a single relay, a multiplexer (which can do the heavy lifting on its end, saving client bandwidth), or an indexer that scrapes a total from many relays.

@fiatjaf
Copy link
Member

fiatjaf commented Apr 1, 2023

For example, now Coracle says I have 47k followers when in fact the truth is probably more like 5k.

But yes, I think COUNT is fine, even though it is not as useful as it would be on a centralized network, my problem is really with the group_by.

@staab
Copy link
Member Author

staab commented Apr 1, 2023

I can't imagine with 2 million profiles (yes, many fake), that you would only have 5k followers.

@fiatjaf
Copy link
Member

fiatjaf commented Apr 5, 2023

0e7300a merged. I feel the group_by is too messy. I may be wrong, but I would like to be proven wrong first. Then we can edit this.

@fiatjaf fiatjaf closed this Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.