Allow increased IRC message lengths #281
Replace every instance of Other than that, seems like a good addition. Variable line lengths is something I've been wanting for awhile now. |
@shawn-smith every character can be more than one octet, especially as UTF-8 is recommended. The RFC says about octets (bytes), not about unicode characters. |
Any thoughts on how the server is supposed to split the text? We (Instantbird and Thunderbird) split a user's message on the closest space that makes the message small enough to send (if there's no space, we split in the middle of a "word"). This is somewhat sub-optimal, though, depending on the language... If I recall correctly, certain spaces in French grammar are essentially the 'middle' of a word. Probably out of scope for the specification, but figured I'd ask. Also 👍 on being clear about octets vs. characters.
@DarthGandalf IRC line length goes by number of characters regardless of how many octets it takes per character.
The RFCs refer to bytes, octets and characters (which is sketchy, but we need to work with). Octets seems to be the one that is most specific and is least likely to cause confusion/etc among people. @clokep I don't really define how to split messages in this on purpose, but my thoughts essentially boil down to "somewhere that makes sense", which is difficult to codify. Some will split on spaces, some on dashes and things as well, etc. |
@DanielOaks An octet is a group of 8. You're talking specifically about characters in your specification. |
@shawn-smith octet is 8 bit = 1 byte.
@DanielOaks to make it less confusing, let's use "byte" instead of octet. These days there are no non-8-bit bytes anymore AFAIK.
Similarly to standard message handling, tags and the rest of the message have separate length values. The value of the `maxline` capability represents the maximum number of octets that the tags section, and that the rest of the message, can take up. Line length calculation is done this way in order to better integrate with methods currently used by IRC software to limit line lengths. | ||
As an example, if `maxline` is 1024 then the maximum size of a full IRC message would be 2048 bytes (1024 for the tags, 1024 for the rest of the message). |
Let's divide by 2 instead of multiplying by 2.
The name "maxline" says the "line", not "half of line", it's just too confusing.
If maxline is 2048, then tags and non-tags part should be up to 1024.
I agree that's less confusing from the IRC server/client developer point of view, but (personally) it seems more confusing from the user/IRC operator point of view.
E.g. "I see `maxline` is set to 4096, why can't I send messages that long?" And dividing also means the minimum of 1024 is no longer as easily recognized as the minimum of 512.
Yeah my thoughts are similar to @digitalcircuit's. If a user sees `maxline=2048`, I think they'd expect that they can send ~2000-long lines, which is why I think it should stay this way.
In that case please emphasize this explanation in the text.
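To make the doubling explicit, here is a minimal sketch of the arithmetic under the wording quoted above, where a single `maxline` value caps the tags section and the rest of the message separately. The function name is hypothetical and not part of the proposal.

```python
def max_full_message_bytes(maxline: int) -> int:
    """Largest possible full message, in bytes, when `maxline` applies
    separately to the tags section and to the rest of the message."""
    tags_limit = maxline  # budget for the tags section
    rest_limit = maxline  # budget for prefix, command and parameters
    return tags_limit + rest_limit


print(max_full_message_bytes(1024))  # 2048
print(max_full_message_bytes(512))   # 1024, i.e. 512 for tags + 512 for the rest
```

Under the alternative suggested in this thread (dividing instead of multiplying), the same function would return `maxline` and each half would get `maxline // 2`.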
If a client has negotiated the `maxline` capability and sends a `PRIVMSG` or a `NOTICE` message that is longer than 512 octets, the receiving server MUST split this into multiple regular (512-octet) length messages when sending it to clients that have not negotiated the `maxline` capability. | ||
Servers SHOULD split on whitespace, but may use whatever method is easiest for them to implement. Splitting does not need to occur at the exact max length of the message, and servers can instead opt to split a number of characters earlier to simplify processing. |
Lines SHOULD NOT be split in the middle of a UTF-8 character (or a surrogate pair).
Oops, by surrogate pairs I actually meant combining characters. But as the number of them in a row is potentially unlimited, detecting combining characters can be more trouble than it's worth.
This can be simplified if you convert to Unicode code points to find the split point before converting back, but might depend on an assumed character encoding.
It would be interesting to look at various clients' splitting routines to compare and see if any general guidelines can be included.
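For comparison, a rough sketch of the "split on the closest space, otherwise mid-word" approach described earlier in this thread. It works on bytes, and `limit` is assumed to be the budget left for the message text once the prefix, command, target and CRLF have been accounted for. The function name is hypothetical.

```python
def split_payload(payload: bytes, limit: int) -> list[bytes]:
    """Split `payload` into chunks of at most `limit` bytes, preferring
    to break at the last ASCII space that keeps the chunk within `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    rest = payload
    while len(rest) > limit:
        # A space at index `limit` is fine: the chunk before it is `limit` bytes.
        cut = rest.rfind(b" ", 0, limit + 1)
        if cut <= 0:
            # No usable space in the window: hard split in the middle of a "word".
            chunks.append(rest[:limit])
            rest = rest[limit:]
        else:
            chunks.append(rest[:cut])
            rest = rest[cut + 1:]  # drop the space that was split on
    if rest:
        chunks.append(rest)
    return chunks


print(split_payload(b"Lorem ipsum dolor sit amet", 11))
# [b'Lorem ipsum', b'dolor sit', b'amet']
```

This deliberately ignores multi-byte characters and formatting codes; it only illustrates the whitespace preference.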
Servers SHOULD split on whitespace, but may use whatever method is easiest for them to implement. Splitting does not need to occur at the exact max length of the message, and servers can instead opt to split a number of characters earlier to simplify processing. | ||
Servers MAY split other commands/numerics into multiple lines in a way similar to `PRIVMSG` and `NOTICE` above, if it is purely for display purposes. |
Is logging a display purpose?
Need to clarify what is display purpose.
Yeah this language is a bit sketch, I'll just remove that clarifier.
In this example, C1 has negotiated `maxline` but C2 has not.
C1 -> PRIVMSG coolfriend :Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
C2 <- :c1!test@localhost PRIVMSG coolfriend :Lorem ipsum dolor sit amet, consectetur adipiscing elit,
I think split lines also need to be marked with some tag.
Maybe a batch.
I'd consider that to be something to look at after we introduce message IDs
I don't see how IDs are relevant. If anything, the ID would just apply to the batch as a whole,
@id=foo :irc.server.net BATCH +x too-long-line
@batch=x :c1!test@localhost PRIVMSG coolfriend :Lorem ipsum dolor
@batch=x :c1!test@localhost PRIVMSG coolfriend :sit amet
:irc.server.net BATCH -x
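A small sketch of how a server might emit the batch shown above for already-split parts. The `too-long-line` batch type and the `irc.server.net` name are taken from the example in this comment and are not defined by any spec; the helper name is hypothetical, and the message ID tag is left out.

```python
def wrap_in_batch(ref: str, source: str, target: str, parts: list[str],
                  server: str = "irc.server.net",
                  batch_type: str = "too-long-line") -> list[str]:
    """Wrap already-split PRIVMSG parts in a BATCH start/end pair."""
    lines = [f":{server} BATCH +{ref} {batch_type}"]
    # Every part carries the batch reference tag so clients can group them.
    lines += [f"@batch={ref} :{source} PRIVMSG {target} :{part}" for part in parts]
    lines.append(f":{server} BATCH -{ref}")
    return lines


for line in wrap_in_batch("x", "c1!test@localhost", "coolfriend",
                          ["Lorem ipsum dolor", "sit amet"]):
    print(line)
```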
Fair, my bad. In regards to marking them at all... maybe. I'm iffy on introducing a batch here because if they don't implement longer lines, they probably won't implement the batch either and it introduces a fairly large amount of overhead on the server side for a case which could be reasonably common.
Do we really need this and would it actually be useful for clients in the real world, or would it just unnecessarily complicate sending PRIVMSG/NOTICEs?
Clients can already implement batches, and some of them do, so a client that supports batches but not longer lines is entirely possible.
If a client doesn't want to receive the lines in a batch, it's free not to request batch and will still receive the lines, as old pre-IRCv3 clients would.
Would clients that don't implement support for longer lines implement the batch type instead? @attilamolnar @jwheare @dequis @SaberUK your thoughts here for some more from both ircd and client sides? I'm not convinced it'll be used but if people really want it then can write it up.
I think the most likely situation is a client supports neither. Speaking for HexChat I am more likely to implement long lines than batch.
From the client perspective I feel like tingping, from the server perspective I don't mind providing a batch fallback when splitting lines.
I'm not really in favour of speculatively specifying transitional solutions that offer half-solutions, it muddies the overall spec.
I feel similarly about the suggestion that a client might want to request a longer than 512 but shorter than advertised limit. I'd prefer a clearer, simpler, all-or-nothing spec.
@clokep In some cases, the language/libraries might handle this automatically - e.g. Quassel uses Qt's That might be overkill for simpler/low-resource servers, though, and I'd agree with @DanielOaks on not firmly defining it in the spec. The ideal path involves all clients using the new However, it might help to also suggest looking into whatever tools already exist for your given language/framework (if any). |
@DarthGandalf The amount of bytes required for a character varies between charset and encoding. The reason RFC1459 states the max line length in characters and not octets and bytes is because it's charset/encoding agnostic. Regardless of what you use there will be 510 usable characters in an IRC line + the terminating CRLF. This should be done the same way. Using characters and not octets or bytes. |
RFC assumes that a character is 8 bits, that's why it uses such terms interchangeably. |
@shawn-smith Section 8.2 of RFC1459 specifies a buffer of 512 bytes holds 1 full message. Section 2.2 can easily be interpreted to mean that a character is 8 bits (as @DarthGandalf pointed out while I was writing this). Furthermore, every implementation I've seen has used C char arrays (or something semantically similar), with 512 usable locations, to hold a line. Specifying that the line length limit is based on encoding-agnostic characters means that the IRCd must know the encoding used (not always possible) and that the storage is variable, which does not work well in C-based IRCds. Therefore, the line length should be specified in a unit that does not vary (eg bytes) for simplicity and ease of implementation. |
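A quick illustration of why the unit matters: the same text has one length in characters and a different length on the wire depending on the encoding. The helper name is made up for this example.

```python
def wire_length(text: str, encoding: str) -> int:
    """Number of bytes `text` occupies once encoded for the wire."""
    return len(text.encode(encoding))


msg = "na\u00efve caf\u00e9"          # "naïve café"
print(len(msg))                       # 10 characters (code points)
print(wire_length(msg, "latin-1"))    # 10 bytes
print(wire_length(msg, "utf-8"))      # 12 bytes
```

A server that only sees bytes cannot tell which of these encodings the client used, which is the argument above for specifying the limit in bytes.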
What if the server advertises a line length that is larger than what the client is willing to accept? If this is enabled directly through CAP REQ, that leaves no place for negotiation. My idea of that was that the value in CAP LS is just a suggestion, and doing CAP REQ just enables a new verb to set maximum line length, which is client initiated and acked by the server. Something like this:
(verb name subject to change) Downside: this may add too much additional complexity for servers and reduce their chances to reuse messages. For other ircd devs: How does this look from your point of view? My ircd isn't real enough to care about reusing messages. |
Another option: Specify a maximum upper bound that clients must accept. Say, up to 64kb or some other arbitrary high-but-not-too-high-number. 64 One could say that clients aren't as resource constrained as servers, or, at least, can't easily take advantage of message reuse like servers often do. If the server advertises something higher than $upperbound, the client may reject it and choose to stay with 512+512 message lengths. Or it might accept it anyway, if its internal limit is higher than what the server advertises. But to be compliant with this spec that internal limit must not be lower than $upperbound |
If IRC is a protocol, which it is, clients should, if anything, use what the server sends them. Granted, there should be an upper limit that servers can set, so we aren't having 1000+ character topics etc. |
RFC1459:
RFC refers to both bytes and characters. For this spec, I think just saying bytes makes sense (afaik most IRC servers should already be doing it based on bytes) so I'll do that. |
@dequis The main reason I've done it this way is to avoid that complexity. Just with mine, having to regenerate where to split messages for every single user based on whatever line length they've accepted seems really dodgy and could be more resource-intensive than it needs to be. At least with this (taking into account max nicklen/userlen/hostlen and all), you can generally split it once and use it across all the clients that haven't accepted longer line lengths. If the client's not willing to accept the larger line length, I think they'd just not request the cap and leave it there. |
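A sketch of the "split once and reuse" idea: compute the text budget against the worst-case prefix so the same split is valid for every recipient that has not accepted longer lines. The length limits used in the example call are illustrative assumptions, not values from any particular server, and the function name is hypothetical.

```python
def shared_payload_budget(target: str, nicklen: int, userlen: int, hostlen: int,
                          command: str = "PRIVMSG", line_limit: int = 512) -> int:
    """Bytes left for message text on a legacy line, assuming the longest
    prefix any recipient could see."""
    # ":" nick "!" user "@" host " "
    worst_prefix = 1 + nicklen + 1 + userlen + 1 + hostlen + 1
    # command " " target " :"
    fixed = len(command) + 1 + len(target.encode("utf-8")) + 2
    crlf = 2
    return line_limit - worst_prefix - fixed - crlf


# Assumed limits for illustration only: NICKLEN=30, USERLEN=10, HOSTLEN=63.
print(shared_payload_budget("coolfriend", nicklen=30, userlen=10, hostlen=63))  # 383
```

Splitting against this budget wastes a few bytes for senders with short prefixes, but the resulting parts never need to be recomputed per recipient.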
If a client has negotiated the `maxline` capability and sends a `PRIVMSG` or a `NOTICE` message that is longer than 512 bytes, the receiving server MUST split this into multiple regular (512-byte) length messages when sending it to clients that have not negotiated the `maxline` capability. | ||
Servers SHOULD split on whitespace, but may use whatever method is easiest for them to implement. Splitting does not need to occur at the exact max length of the message, and servers can instead opt to split a number of characters earlier to simplify processing. Lines SHOULD NOT be split in the middle of a UTF-8 character. |
What about users who use non-UTF-8 encodings?
Would it be possible to make UTF-8 mandatory for IRCv3.3 clients so this is not an issue?
If it's not utf-8, split on byte boundaries. This is an optional "SHOULD NOT", it's a suggestion for servers to avoid introducing invalid utf-8 if possible. Most other encodings aren't multibyte, and it's okay to break those that are.
Or just say don't split multi-byte characters.
A server cannot reliably know if the client uses UTF-8 or not.
Sure you can, it's trivial if you limit your scope to knowing if an individual message is valid UTF-8 or not.
You could even do it without validating the whole message. UTF-8 has a set of properties that make it easy to detect (roughly: start byte & 0xc0 == 0xc0, continuation byte & 0xc0 == 0x80, and the number of required continuation bytes). If the bytes at the splitting point follow those properties, you split the line before the start character. If they don't, it's not valid utf-8, split at the original point. This is a pretty common thing to do.
It's harder if you have to deal with other non-utf8 multibyte encodings at the same time which is one of the reasons i wouldn't bother.
Optional SHOULD NOT to tell servers to avoid unnecessarily breaking UTF-8 messages. We like UTF-8, this sort of a line makes sense for us.
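A sketch of the local check described above: look only at the bytes around the intended split point, back off to the start byte if they form a well-formed UTF-8 sequence, and otherwise split at the original point. The function name is hypothetical, and it assumes `limit` is at least 4 so the returned index is never zero.

```python
def utf8_safe_cut(data: bytes, limit: int) -> int:
    """Return an index <= limit where `data` can be cut without splitting a
    well-formed UTF-8 sequence that straddles `limit`."""
    if limit < 4:
        raise ValueError("limit must be at least 4")
    if limit >= len(data):
        return len(data)
    if data[limit] & 0xC0 != 0x80:
        return limit  # the next byte is not a continuation byte
    # Walk back over at most two more continuation bytes to find the start byte.
    start = limit - 1
    while start >= 0 and limit - start <= 3 and data[start] & 0xC0 == 0x80:
        start -= 1
    if start < 0 or limit - start > 3:
        return limit  # too many continuation bytes: not valid UTF-8
    lead = data[start]
    if lead & 0xE0 == 0xC0:
        need = 1
    elif lead & 0xF0 == 0xE0:
        need = 2
    elif lead & 0xF8 == 0xF0:
        need = 3
    else:
        return limit  # not a UTF-8 start byte
    end = start + need + 1
    if end <= limit or end > len(data):
        return limit  # sequence does not actually straddle the split point
    if any(b & 0xC0 != 0x80 for b in data[start + 1:end]):
        return limit
    return start  # split before the start byte


print(utf8_safe_cut("aaaa\u2615".encode("utf-8"), 5))  # 4
print(utf8_safe_cut(b"hello world", 5))                # 5
```

Nothing here validates the whole message or handles other multi-byte encodings, in line with the comment above.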
### The `truncated` Tag
The `truncated` tag, when present, indicates that a message has been truncated due to the client's line length. It may be sent to any client which supports message tags, as deemed appropriate by the IRCd. |
Maybe add the number of truncated bytes/characters as a value to the tag, so clients can show something like “and X more bytes/characters”?
Hmm, the `truncated` tag itself also reduces the number of characters that are allowed in the tags section, which makes this difficult and makes for some slightly annoying math if we want it to spit out this value. I feel like just being able to say "this message is truncated" or similar for the clients should be fine.
C: CAP LS
S: CAP * LS :maxline=2048
Similarly to standard message handling, tags and the rest of the message have separate length values. The value of the `maxline` capability represents the maximum number of bytes that the tags section, and that the rest of the message, can take up. Line length calculation is done this way in order to better integrate with methods currently used by IRC software to limit line lengths. |
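On the client side, reading the advertised value out of the `CAP LS` reply could look roughly like this. It is a sketch that assumes the capability list has already been extracted from the trailing parameter; the function name and the fallback of 512 are choices made for this example.

```python
def parse_maxline(cap_list: str, default: int = 512) -> int:
    """Extract the advertised `maxline` value from a space-separated
    capability list, e.g. the trailing parameter of a CAP LS reply."""
    for token in cap_list.split():
        name, _, value = token.partition("=")
        if name == "maxline" and value.isascii() and value.isdigit():
            return int(value)
    return default  # capability absent or malformed: keep the regular limit


print(parse_maxline("maxline=2048"))       # 2048
print(parse_maxline("multi-prefix sasl"))  # 512
```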
Could you explain why the tag limit and message limit can't be specified separately? I could imagine an environment (potentially such as one which I support) where many tags are needed to provide rich client capability, but a restricted message length would be preferred.
Done, thanks very much for the feedback! This also lets it work much more nicely with the new `message-tags` changes that are coming.
A few thoughts on the basic idea of longer message lengths:
I think there should likely be some guidance about these points in the spec. |
Hmm, considering most users' messages (privmsgs/notices) are likely to be under 512 chars, could simply suggest something along the lines of this in a non-normative implementation considerations section:
Those general denial-of-service issues on the other hand... yeah that's hard to address, and becomes very interesting particularly as you look at passwords on registration/authentication. Unfortunately I don't think we can address that in the spec itself, more just something that implementers need to keep a close eye on. Maybe a similar sort of non-normative suggestion:
Thoughts? edit: Better specified section and improved language, thanks @jwheare |
SASL, at least, shouldn't be a problem since |
Thanks @djahandarie for the advice here and for bringing this issue up.
Is the problem with wanting increased line lengths or just with knowing when your line is going to be truncated? I don't think defining a limit in terms of line length alone works because the final message limit varies based on the hostname that the recipient sees, which may be different from what the sender thinks their hostname is or for privileged users that can see real hostnames. The end result is that clients may still try to guess where the text will be truncated and get it wrong. If you split messages on behalf of clients, what happens when a message with colours and other formatting characters gets split? What happens when a message with commands (in the middle of the line) intended for a bot gets split? I'd be inclined to add a |
I'll write the changes to let it interact much more nicely with |
There are still some q's to answer in the spec itself, but for peeps looking to test I've got an example implementation here. |
How will this interact with message IDs? Given a stupidly short maxline (just for the sake of the example), a maxline supporting client might see:
While a legacy client might see:
Is something like a
I could see a continuation tag make sense there (that works akin to that last example you posted). I'll throw that into the spec. |
Do lines except last one get any special tag like @there-is-continuation?
On 19 Jan 2017 at 3:17 AM, "Daniel Oaks" wrote:
> I could see a continuation tag make sense there (that works akin to that last example you posted). I'll throw that into the spec.
Yeah perhaps a @split or @continued tag might make sense. Maybe with a value indicating how many continuations are to follow? So e.g. a theoretical message split into 3:
Dunno if the tag value is necessary, would it enable any valuable use cases? |
Oops, forgot about @ mentioning usernames :/ |
Any reason why that continuation type can't be a batch tag? |
*Continuation tag |
Yes it could be I suppose.
And I guess if the client didn't enable batch, only the first message would get the tags. Although all messages could safely have e.g. an |
@lp0:
Is this a new or real world issue? Client side splitting has the same theoretical issue but does it cause problems? |
Is there a client that these continuation tags or batches would benefit? To be more specific: is there a client that will be updated to support continuation tags or batches, but not to support arbitrary line length? |
It's a good point. It does seem like quite an edge case and may not be worth the bother. |
Exactly, that's why I'll just go with whatever option is simplest and throw in an update to the proposal. Should go well |
Having completely arbitrary line lengths is risky, what else should happen if the server limit is beyond that which the client limits to? |
Yeah, since this spec explicitly doesn't cover length negotiation to simplify things, clients have to choose to accept it or reject the length offered by the server, but they might still want to reassemble split messages. Either continuation tags or batches are fine for that. |
Why would a client refuse a given message size if it can reassemble the same size afterward? |
Not necessarily the same size. For example, if the server offers 1mbyte and the client accepts up to 64kbytes, it can choose to stop reassembling incoming messages after appending that much to the original message and show them as separate messages. Most manually-written messages are going to be way smaller than the limit anyway, I'd expect almost everything to be in the range of 1-4 parts. |
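A sketch of that client behaviour: keep appending parts to the first message until the client's own cap would be exceeded, then show whatever is left as separate messages. It assumes the parts have already been grouped (by a continuation tag or a batch) and that the server dropped a single space at each split; both are assumptions, as is the function name.

```python
def reassemble(parts: list[str], max_bytes: int) -> list[str]:
    """Merge split parts into one message up to `max_bytes` (UTF-8),
    returning the merged message followed by any parts left over."""
    if not parts:
        return []
    merged = parts[0]
    i = 1
    while i < len(parts):
        candidate = merged + " " + parts[i]
        if len(candidate.encode("utf-8")) > max_bytes:
            break  # client cap reached: stop reassembling here
        merged = candidate
        i += 1
    return [merged] + parts[i:]


print(reassemble(["aaaa", "bbbb", "cccc"], 64))  # ['aaaa bbbb cccc']
print(reassemble(["aaaa", "bbbb", "cccc"], 9))   # ['aaaa bbbb', 'cccc']
```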
Hopefully this can be boosted as time goes on, particularly via ircv3/ircv3-specifications#281
I'd rather go with some sort of continuation cap that works with labels than this. The boosted tag space in the new message-tags spec helps out to a decent extent anyway, and given the complicated nature of this spec, I'd rather focus on something grounded a bit more in reality. |
One of the big issues we want to solve is that IRC lines are capped at 512 octets. This... works, but it would be very nice to allow for longer messages and things like longer topics without needing to implement dodgy hacks for every single command we want to allow longer lengths on.
This should ensure that things stay 100% backwards compatible and work correctly for clients that do not support longer lines, while allowing more up-to-date clients to negotiate the longer message length allowed by the server.
This is gonna be something that is a bit controversial, but it would be extremely useful to allow and something we've been looking at for a while.