Skip to content
This repository has been archived by the owner on Mar 12, 2020. It is now read-only.

Some user hostnames can't be parsed #34

Closed
elazar opened this issue Jun 10, 2015 · 21 comments
Closed

Some user hostnames can't be parsed #34

elazar opened this issue Jun 10, 2015 · 21 comments

Comments

@elazar
Copy link
Member

elazar commented Jun 10, 2015

Hostnames on the Rizon network appear to allow IPv6 address-like strings (i.e. hexadecimal numbers with segments delimited by colons) postfixed by :IP, which the parser presently can't handle.

2015-06-10 13:23:59 NOTICE [email protected] Parser unable to parse line: :hashworks!~hashworks@DCE7E23D:1D6D03E4:2248D1C4:IP PRIVMSG #moonbase :well, this is bad

The associated BNF notation from RFC 2812

hostname   =  shortname *( "." shortname )
shortname  =  ( letter / digit ) *( letter / digit / "-" ) 
    *( letter / digit ) ; as specified in RFC 1123 [HNAME]

From RFC 1123:

The syntax of a legal Internet host name was specified in RFC-952.

And from RFC-952:

A "name" (Net, Host, Gateway, or Domain name) is a text string up
   to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
   sign (-), and period (.)

We should probably modify the $host pattern to include a case for such hostnames.

@elazar
Copy link
Member Author

elazar commented Jun 10, 2015

Seeing a lot of Parser unable to parse line errors that look like they may also be related to hostname parsing, specifically cloaked hostnames on Freenode that includes characters like slashes.

2015-06-10 09:18:13 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpc :specifically, when I enter a URL, she doesn’t respond with the <title> []
2015-06-10 09:20:04 NOTICE [email protected] Parser unable to parse line: :RJD22!~rjd22@2001:41d0:2:ae62:: PRIVMSG #phpmentoring :I actually kind of like doctrine migrations. []
2015-06-10 09:20:25 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpc :the one I posted above… this one: http://devzone.zend.com/6178/20yearsofphp-a-timeline-of-tweets/ []
2015-06-10 09:20:46 NOTICE [email protected] Parser unable to parse line: :RJD22!~rjd22@2001:41d0:2:ae62:: PRIVMSG #phpmentoring :Actually because of the migration:diff feature. []
2015-06-10 09:21:21 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpwomen :For this channel: https://twitter.com/phpc/status/608128918751580161 []
2015-06-10 09:21:23 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpwomen ::-) []
2015-06-10 09:21:38 NOTICE [email protected] Parser unable to parse line: :sgolemon!~sgolemon@facebook/hhvm/sgolemon JOIN #phpc []
2015-06-10 09:22:53 NOTICE [email protected] Parser unable to parse line: :Evil_Otto!~chatzilla@unaffiliated/evilotto QUIT :Ping timeout: 252 seconds []
2015-06-10 09:24:53 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpc :elazar: Oh, thanks. I guess I could’ve filed one myself. :P []
2015-06-10 09:25:14 NOTICE [email protected] Parser unable to parse line: :ramsey!sid29832@gateway/web/irccloud.com/x-yjyvvvvrtuiwaqco PRIVMSG #phpc :But, tbh, I wasn’t sure if just rebooting her would fix it, which is why I mentioned it to you. []

@elazar
Copy link
Member Author

elazar commented Jun 10, 2015

Further discussion appears to point to ea3633f being the offending change that's suddenly causing a number of user hostnames to be unparseable. @Renegade334 thoughts on how to handle this?

@elazar elazar changed the title Rizon IPv6 address hostnames aren't parsed Some user hostnames can't be parsed Jun 10, 2015
@elazar elazar reopened this Jun 10, 2015
@elazar
Copy link
Member Author

elazar commented Jun 10, 2015

Merged @hashworks' PR to handle this for now by allowing colons and forward slashes in hostnames. Re-opening this issue because I'd like to have further discussion on a potentially better fix for this problem, preferably with @Renegade334's feedback since he contributed the original changes.

@Renegade334
Copy link
Contributor

I guess the question is whether or not to stay relatively true to the RFC for hostnames, with the addition of specific characters like / and :, or just become permissive and assume that anything after an @ in a mask is a hostname by definition, regardless of its pattern.

Freenode cloaks, for example, are not RFC compliant in terms of characters. However, they are the "hostname" portion of the mask by virtue of being the remainder of the mask after the @ character. We could just accept this and accept any characters.

@elazar
Copy link
Member Author

elazar commented Jun 11, 2015

I'd say we should want the parser to work with as many networks as is feasible. For the moment, @hashworks' change makes it work on Freenode and Rizon. We may want to consider opening up the $host pattern to any non-whitespace character, though I'm hesitant to make the parser quite that liberal in what it accepts.

@Renegade334
Copy link
Contributor

My personal feeling is that if the parser already has to make specific exceptions to the RFC for known use cases, then it may as well just allow any non-whitespace character. Whatever is there is clearly intended by the server to be the user's "hostname", regardless of what characters it contains.

The hostname in a user mask is arguably one of the least useful components of an IRC message. I'd personally advocate taking a permissive approach, as the entire line will otherwise fail and be ignored if a non-compliant cloak "breaks" the regex.

@elazar
Copy link
Member Author

elazar commented Jun 11, 2015

Fair points. Feel free to file an issue and/or PR and I'll be happy to merge it.

@Renegade334
Copy link
Contributor

For reference, the commit which caused these to break was a42375d at line 212. Previously, these cloaks would still have failed to be parsed as hostnames, but the whole ident@host string was slurped into the user field, and the host field ended up empty. Once the @ character was disallowed from idents, that could no longer happen.

@hashworks
Copy link
Contributor

There are 2 tests for :nick!ident@- which should result in invalid. However with $host = "\\S+"; those are valid. What do you think should we do here?
A) Require a number or letter as the first character (Why do the tests still fail with [a-zA-Z0-9]\\S+?)
or
B) Remove those 2 tests?

@elazar
Copy link
Member Author

elazar commented Jun 12, 2015

@hashworks The original purpose of that test case was to ensure that invalid messages were handled correctly by the parser. However we handle this issue, the test should be updated such that the message is considered to be invalid by the parser in order to continue fulfilling the test's intent.

The tests fail even with the pattern change because the proposed new pattern you've referenced requires that the first character of the hostname be alphanumeric, which - is not, thus the failure.

Related: do we want the parser to care about the length of the hostname, so long as it is at least 1, or whether the first character of the hostname is alphanumeric or not, so long as it's separated from any subsequent message components by whitespace?

@hashworks
Copy link
Contributor

The tests are supposed to fail, we expect invalid: foobar. However the parser actually succeeds with [a-zA-Z0-9]\\S+, which is confusing me.

Oh, I bet there are funny server operators out there who use a!a@a.

@elazar
Copy link
Member Author

elazar commented Jun 12, 2015

Hm, I don't think I'm seeing what you're seeing, or I'm not making the same change you are.

08:23:55 parser $ git diff
diff --git a/src/Parser.php b/src/Parser.php
index 26cda8e..258da83 100644
--- a/src/Parser.php
+++ b/src/Parser.php
@@ -207,7 +207,7 @@ class Parser implements ParserInterface
         $trailing = "(?: :?[^$null$crlf]*)";
         $params = "(?P<params>$trailing?|(?:$middle{0,14}$trailing))";
         $name = "[$letter$number](?:[$letter$number:\\/\\-]*[$letter$number])?";
-        $host = "$name(?:\\.(?:$name)*)*";
+        $host = "[$letter$number]\S*";
         $nick = "(?:[$letter$special][$letter$number$special-]*)";
         $user = "(?:[^ $null$crlf@]+)";
         $prefix = "(?:(?:(?P<nick>$nick)(?:!(?P<user>$user))?(?:@(?P<host>$host))?)|(?P<servername>$host))";
08:23:58 parser $ ./vendor/bin/phpunit
PHPUnit 4.7.3 by Sebastian Bergmann and contributors.

...............................................................  63 / 556 ( 11%)
............................................................... 126 / 556 ( 22%)
............................................................... 189 / 556 ( 33%)
............................................................... 252 / 556 ( 45%)
...F........................................................... 315 / 556 ( 56%)
............................................................... 378 / 556 ( 67%)
............................................................... 441 / 556 ( 79%)
............................................................... 504 / 556 ( 90%)
.............................F......................

Time: 275 ms, Memory: 6.50Mb

There were 2 failures:

1) Phergie\Irc\Tests\ParserTest::testParse with data set #255 (':nick!ident@- PRIVMSG target :message\n', array(':nick!ident@- PRIVMSG target :message\n'))
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    'invalid' => ':nick!ident@- PRIVMSG target :message\n'
+    'prefix' => ':nick!ident@-'
+    'servername' => 'nick!ident@-'
+    'command' => 'PRIVMSG'
+    'params' => Array (...)
+    'message' => ':nick!ident@- PRIVMSG target :message\n'
+    'targets' => Array (...)
 )

/Users/matt/blopdev/phergie/parser/tests/ParserTest.php:34

2) Phergie\Irc\Tests\ParserTest::testConsume with data set #255 (':nick!ident@- PRIVMSG target :message\n', array(':nick!ident@- PRIVMSG target :message\n'))
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    'invalid' => ':nick!ident@- PRIVMSG target :message\n'
+    'prefix' => ':nick!ident@-'
+    'servername' => 'nick!ident@-'
+    'command' => 'PRIVMSG'
+    'params' => Array (...)
+    'message' => ':nick!ident@- PRIVMSG target :message\n'
+    'targets' => Array (...)
 )

/Users/matt/blopdev/phergie/parser/tests/ParserTest.php:63

FAILURES!
Tests: 556, Assertions: 561, Failures: 2.
08:24:00 parser $

@hashworks
Copy link
Contributor

You're getting the same output that I'm getting. But I don't understand why it isn't 'invalid' => ':nick!ident@- PRIVMSG target :message\n' here, since the Regex shouldn't match.

@elazar
Copy link
Member Author

elazar commented Jun 12, 2015

Ah, I see why.

+    'prefix' => ':nick!ident@-'
+    'servername' => 'nick!ident@-'

Notice that prefix and servername are equal with the exception of prefix having a prefixed : in its value.

Now, look at the $prefix pattern:

$prefix = "(?:(?:(?P<nick>$nick)(?:!(?P<user>$user))?(?:@(?P<host>$host))?)|(?P<servername>$host))";

The first nick!user@host pattern isn't matching, so the matcher is falling into the alternation servername and matching that because, with our change, $host matches much more liberally.

@Renegade334
Copy link
Contributor

The $host pattern should not be changed. This is the RFC-compliant hostname pattern, which is used in more than one place.

It's only the host component of the user mask that is affected by networks' cloaking policy etc. The change should therefore be made directly to the <host> pattern within $prefix.

         $host = "$name(?:\\.(?:$name)*)*";
         $nick = "(?:[$letter$special][$letter$number$special-]*)";
         $user = "(?:[^ $null$crlf@]+)";
-        $prefix = "(?:(?:(?P<nick>$nick)(?:!(?P<user>$user))?(?:@(?P<host>$host))?)|(?P<servername>$host))";
+        $prefix = "(?:(?:(?P<nick>$nick)(?:!(?P<user>$user))?(?:@(?P<host>\S+))?)|(?P<servername>$host))";

...or whatever pattern we want to allow there.

@elazar
Copy link
Member Author

elazar commented Jun 12, 2015

Should this change include a revert of the changes made to $name in 3eb687c, 9221d5f, and cf4537b then?

@Renegade334
Copy link
Contributor

Colons should be left in, I guess, for IPv6 address notation? (RFC2812 specifies host = hostname / hostaddr.) Slashes can probably go. There's also the question of whether trailing dots should be allowed, since they're not valid hostnames but Freenode still use it to designate pseudo-servers.

@elazar
Copy link
Member Author

elazar commented Jun 12, 2015

Yeah, we will want to allow for the Freenode case of trailing dots, even though it's not standards-compliant. And allowing IPv6 notation also makes sense, so colons should probably stay. Not sure it makes sense to include slashes in $name since there's no standard-specific case for them. Including them in the host component of the user pattern in $prefix may make more sense.

@DanielOaks
Copy link

Just thought I'd leave a note here, since it's still open. I know that at least Rizon supports hostnames with colour control codes directly in them, ie dan!~l@this.\x025is.\x024my.\x026host (\x02 being the IRC colour code in this notation), via vhosts. ops can also have hostnames without any dots in them at all, such as the example above of a!a@a, or a!a@\x0213a.

@stil
Copy link
Contributor

stil commented Nov 12, 2015

Some logs from Rizon:

2015-11-12 01:57:27 NOTICE [email protected] Parser unable to parse line: :BallSac!BallSac@VIzon^O MODE #/s4s/ -bbbb *!*[email protected] *!*@kinpatsu.no.shoujo *!*[email protected] *!*@afootpluto.cf []
2015-11-12 01:57:27 NOTICE [email protected] Parser unable to parse line: :BallSac!BallSac@VIzon^O MODE #/s4s/ -bbb *!*@Rizon-E9EC724B.aishiteru.moe *!*@alice.aishiteru.moe *!*[email protected] []
2015-11-12 01:57:30 NOTICE [email protected] Parser unable to parse line: :BallSac!BallSac@VIzon^O MODE #/s4s/ -b *!*@828B14CA:531CFC51:592D31DD:IP []

It's probably color codes in hostname, like @DanielOaks said. In my opinion, it's not our job to be fully 101% compliant with RFC standards - it's servers job. Out bots should be just working with any network. My experience says, that IRC network complaint with all standards is very rare thing nowadays.

So why bother and not just accept everything minus whitespaces as hostname?

@DanielOaks
Copy link

Honestly, when it comes to IRC mask splitting, the easiest way to handle it is to just allow most everything that's not whitespace in there (so long as you can split them properly from the nick/user parts). Mostly because of the issues with colour codes in hostnames, weird characters (and just plainly wrong IP addresses/hostnames) due to either people/admins setting special vanity hostnames or the weird cloaking characters and mechanisms different networks use.

Trying to play whack-a-mole with the different things you'll run into and have to allow isn't usually worth it, imo.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants