This repository has been archived by the owner on May 11, 2021. It is now read-only.

Rule set can't handle domains with country codes #132

Closed
navidada opened this issue Jan 9, 2015 · 9 comments
Comments

navidada commented Jan 9, 2015

When using the rule "Same second-level domain", Policeman treats all domains under the same country-code suffix as the same domain. For example:
http://www.news.com.au/

Policeman will assume that the domain is 'com.au' and thus allow requests to other Australian sites such as:
foxsports.com.au
newsapi.com.au

The country code itself is irrelevant; the same thing happens under .co.uk. For example:
http://www.dailymail.co.uk will allow requests to 'we.and.co.uk', although it doesn't put it in the same category as dailymail.co.uk, like it did with news.com.au.
And I assume it will also happen with other suffixes such as .gov.uk or .org.mx and so on.

Is there a way to fix that rule?
There are a lot of permutations, so maybe the solution is for each user to enter the country codes that are relevant to them [.uk, .es, .fr], and Policeman will only make exceptions for these country codes, i.e. check whether the address 'y.z.fr' ends with one of these suffixes and, if so, allow it to send requests to 'x.y.z.fr' but not to 'a.z.fr'.
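
To make the failure concrete, here is a minimal Python sketch of what a naive "same second-level domain" comparison does. It is an assumption inferred from the symptoms described above, not Policeman's actual code, and the function names are hypothetical.

```python
def naive_second_level(host):
    """Return the last two dot-separated labels of a host name."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def naive_same_site(origin_host, destination_host):
    """Treat two hosts as the same site if their last two labels match."""
    return naive_second_level(origin_host) == naive_second_level(destination_host)


# Both hosts collapse to the shared suffix "com.au", so the naive rule
# considers two unrelated Australian sites to be the same site.
print(naive_second_level("www.news.com.au"))
print(naive_same_site("www.news.com.au", "foxsports.com.au"))
```

For ordinary names such as www.example.com the last two labels are the right key; the rule only breaks when the second-level label is itself a shared registry suffix.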

@futpib futpib added the bug label Jan 9, 2015
Owner

futpib commented Jan 9, 2015

The simplest solution I can see is to build a complete (or somewhat complete) list of such special domains.

Wikipedia has comprehensive (if not complete) lists in corresponding articles: uk mx au. But I couldn't find a list of country-level domains with such a strange naming scheme (other than these three).

@futpib futpib added enhancement and removed bug labels Jan 9, 2015
Author

navidada commented Jan 9, 2015

There are also other countries that share this naming scheme: .jp, .il, .no, .ru, etc.
I can try to make a list from Wikipedia, but I don't know how to code the rule or how to prevent the change from interfering with its original aim.

Also, one needs to think about the algorithm's complexity so it won't slow down the add-on too much if it tries to check all possibilities. That's why I thought about the option that each user will enter the country domains that are relevant to him [.jp], and then the rule will only look for 9 possibilities (.co.jp, .ac.jp, ...).

Anyway, will it help if I make this list even if I don't know coding?

@bastik-1001
Contributor

@futpib
"(...)I thought about the option that each user will enter the country domains that are relevant to him(...)"

I'd prefer if it just works. Normally co.uk isn't relevant to me, but who knows where I am redirected to? I won't be able to say in advance that I never have to visit a co.jp address.

Can this be solved by different pattern matching? Probably it can, I'm just not good at regular expressions. Something like catching a TLD that is preceded by a two-letter label, which in turn is preceded by a label of more than two characters.

Edit: Supposing there are no domains with just two characters.

Update: Too bad, there are some: https://en.wikipedia.org/wiki/Single-letter_second-level_domain

I know of at least one German one.

Owner

futpib commented Jan 9, 2015

@bastik-tor The problem is, there are sites with two-letter domains (ya.ru, vk.com) and there probably exist two-letter sites that are hosting other sites (like ya.ru could host whatever.ya.ru), so this can't be solved by clever matching on the domain-string alone.
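
A rough Python sketch of the shape-based idea shows why it falls short. The regular expression is only one possible reading of the proposal (a label of three or more characters, then a two-letter label, then the TLD), and the names are hypothetical.

```python
import re

# Hypothetical heuristic: <3+ characters>.<2 letters>.<TLD> is guessed to
# mean that the last two labels form a shared suffix such as "co.uk".
SHAPE = re.compile(r"(?:^|\.)[a-z0-9-]{3,}\.[a-z]{2}\.[a-z]{2,}$")


def looks_like_shared_suffix(host):
    """Guess from the string shape alone; no list of real suffixes is used."""
    return SHAPE.search(host.lower()) is not None


for host in ("dailymail.co.uk", "whatever.ya.ru", "news.com.au"):
    print(host, looks_like_shared_suffix(host))
```

The pattern accepts dailymail.co.uk as intended, but it also accepts whatever.ya.ru, where ya.ru is an ordinary site, and it misses news.com.au because "com" has three letters. The string shape alone cannot separate these cases.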

Again, I think we'll have to maintain a list of second-level domains that are not sites in their own right. Or leave it as it is, heh; after all, "Same second-level domain" does what it says.

@navidada It will help. Performance-wise, there are hashmaps with constant-time lookup on average. This is what "Persistent" and "Temporary" rulesets are built upon.
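
As a sketch of that list-based approach (in Python for brevity, not the add-on's own code), a hash-based set keeps the check cheap no matter how long the list grows. The suffix excerpt and the `site_key` name are illustrative assumptions.

```python
# Illustrative excerpt only; a real list would be far longer.
SHARED_SUFFIXES = frozenset({
    "co.uk", "gov.uk", "com.au", "org.mx", "co.jp", "ac.jp", "co.il",
})


def site_key(host):
    """Return the labels that identify the site a host belongs to."""
    labels = host.lower().rstrip(".").split(".")
    last_two = ".".join(labels[-2:])
    # Set membership is a hash lookup: constant time on average,
    # independent of how many suffixes the set contains.
    width = 3 if last_two in SHARED_SUFFIXES else 2
    return ".".join(labels[-width:])


print(site_key("www.news.com.au"))
print(site_key("foxsports.com.au"))
print(site_key("whatever.ya.ru"))
```

Two hosts would then count as the same site only when their keys are equal, so news.com.au and foxsports.com.au stay separate while whatever.ya.ru still maps to ya.ru.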

Owner

futpib commented Jan 11, 2015

@navidada No need for making a list anymore.

I've just been emailed with a link to this wiki page which points to what seems to be a complete list of the domains we need.

ghost commented Jan 11, 2015

@futpib Great find, nice list, but long, very long...
I wonder if it wouldn't be wiser for Policeman users to simply disable the "Same second-level domain" rule set. That means a bit of extra configuration occasionally, but it would sidestep this problem, which I guess is one of the worst kinds in coding: handling exceptions that fall outside any logical scheme. A real pain.

@navidada
Author

@futpib I also encountered this list but it seems it's not fully complete. For example check .il for Israel (and not Illinois,US):
The old Mozilla site (https://wiki.mozilla.org/TLD_List) used to mention ac.il co.il org.il net.il k12.il gov.il muni.il idf.il - Those are still relevant (https://en.wikipedia.org/wiki/.il)
I don't know why the new site is not updated. I wanted to 'submit amendments' but didn't know how to create what they refer to as 'unified diff'.

It seems that Firefox itself uses this list to handle cookies and other things. So if Firefox "understands" how to parse each web address, maybe Policeman can use that data directly from Firefox (and thus save all the computational hassle)?

kafene commented Jan 18, 2015

The addon-sdk provides getTLD, the source code of which is here

Rather than duplicate the list (especially with the new gTLDs, it's getting massive), it's probably better to just use this built-in functionality.
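
For readers unfamiliar with what such a lookup does, here is a simplified Python sketch of longest-suffix matching against a public-suffix list. It is not the add-on SDK's implementation; the tiny suffix set and all function names are assumptions for illustration, and the wildcard and exception rules of the real list are left out.

```python
# Illustrative excerpt; the real list has thousands of entries.
SUFFIXES = frozenset({"com", "uk", "co.uk", "au", "com.au", "il", "co.il", "k12.il"})


def public_suffix(host):
    """Return the longest listed suffix of host, or its last label."""
    labels = host.lower().rstrip(".").split(".")
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in SUFFIXES:
            return candidate
    # Unlisted TLDs fall back to a single-label suffix.
    return labels[-1]


def base_domain(host):
    """Return the suffix plus one more label, or None for a bare suffix."""
    labels = host.lower().rstrip(".").split(".")
    suffix_len = public_suffix(host).count(".") + 1
    if len(labels) <= suffix_len:
        return None
    return ".".join(labels[-(suffix_len + 1):])


def same_site(host_a, host_b):
    """Two hosts are the same site only if their base domains match."""
    base_a, base_b = base_domain(host_a), base_domain(host_b)
    return base_a is not None and base_a == base_b


print(base_domain("www.news.com.au"))
print(same_site("www.news.com.au", "foxsports.com.au"))
```

Because candidates are tried from longest to shortest, suffixes of any depth are handled by the same loop, which is the main advantage of reusing the browser's maintained list instead of a hand-written one.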

@futpib futpib closed this as completed in b895755 Jan 24, 2015
Owner

futpib commented Jan 24, 2015

@kafene Huge thanks.
