-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[8.x] [intersphinx] support for arbitrary title names #11932
[8.x] [intersphinx] support for arbitrary title names #11932
Conversation
# them internally to be docutils compatible. As such, | ||
# we encode the length of the name after the priority. | ||
slen = f':{len(name)}' if ' ' in name else '' | ||
entry = '%s%s %s %s:%s %s %s\n' % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative that I had in mind is to use a special character for the end of the name, but this means that we assume that this character is not used in the name (e.g., \x00
). The size of the inventory won't be much bigger but we need to check that the name does not contain such character and raise an exception if this is the case during the writing phase.
Quick update after discussing with someone:
|
technically, this one is compatible with the current version but... I'm not really sure it should land in 7.3 (when would 7.3 even release?). So I'll convert it into a draft and we'll merge it in 8.x (after dropping the intersphinx format, though it's unrelated). |
I don't think this change to the inventory semantics is necessary. The most recent regex I implemented for It successfully imports all objects from the minimal example provided by OP at #11711. After a >>> import sphobjinv as soi
>>> inv = soi.Inventory(url='https://thosse.github.io/Playground_GH_Pages/objects.inv')
>>> len(inv.objects)
18
>>> from pathlib import Path
>>> len(Path("objects.txt").read_text().splitlines()) - 4 # Minus 4 b/c the header lines aren't objects
18 That decoded inventory, for reference:
It's of course possible that my regex is still missing some key features. But I've spent a lot of time working on this reverse-engineering, and I think it's pretty solid based on the lack of tickets in the There are a few more issues in |
@bskinn Thank you for your help but there are some titles such as pattern = re.compile(r'''^
(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+?)
\s+
(?P<domain>[^\s:]+):(?P<reftype>\S+)
\s+
(?P<prio>-?\d+)
\s+?
(?P<uri>\S*)
\s+
(?P<dispname>.+?)
\r?
$''', re.M | re.X) I would have wanted to use negative lookbehinds but only .NET regex parser supports variable-width patterns.
I'd be glad to have it. It could help a lot for #12204. |
Indeed, that is a point of failure. I can take a look at your regex sometime soon, sure, but---I had an idea. What if we changed the paradigm of the format a bit further? The key hangup as far as I can tell is that the single-line object format doesn't allow for an unambiguous separator character sequence, given the possible domain (not What if an uncompressed v3 inventory were something like this:
This format would provide This format would take a lot more vertical space when decompressed, but ... so what? In nearly all uses, humans won't be reading it, and it'll be zlib-compressed anyways. Even in the few cases where someone is looking directly at it, it still seems quite readable to me. And, even though there are extra characters in these delimiters, the zlib compression should work quite effectively on them. (I also tossed an object count and and MD5 hash into the header to improve the options for validity checking.) What do you think? |
Ideally yes: it's better to change the format so that we don't have possible ambiguities. An alternative for a simpler visual (without new lines) is to encode the length of the title (and there will be no need for a regex, or a fancy separator that could in the future exist (though I doubt newlines will ever be a valid title)). I should check (but I don't have time) the exact specs of DEFLATE in stream mode but I'm pretty sure that the size shouldn't blow up (could you check the bytes size of the compressed inventories with and w/o newlines as separators and with titles with whitespaces?) |
Definitely true, including the name length as you propose is probably about the most minimal change achievable, and wouldn't dramatically change the flow of line parsing. Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%: >>> import sphobjinv as soi
>>> inv = soi.Inventory(url="https://docs.python.org/3/objects.inv")
>>> import zlib
>>> ( len_v2 := len(zlib.compress(inv.data_file(), 9)) ) # Compress the whole thing for simplicity
136182
>>> def newline_fmt(do):
... return f"{do.name}\n {do.domain}\n {do.role}\n {do.priority}\n {do.uri}\n {do.dispname}"
...
>>> file_newlines = b'\n'.join(inv.data_file().splitlines()[:4]) + b'\n' + ('\n\n'.join(newline_fmt(o) for o in inv.objects)).encode()
>>> print(file_newlines.decode()[:245])
# Sphinx inventory version 2
# Project: Python
# Version: 3.12
# The remainder of this file is compressed using zlib.
CO_FUTURE_DIVISION
c
member
1
c-api/veryhigh.html#c.$
-
METH_CLASS
c
macro
1
c-api/structures.html#c.$
-
>>> ( len_nls := len(zlib.compress(file_newlines, 9)) )
138880
>>> len_nls - len_v2
2698
>>> len_nls/len_v2
1.019811722547767 Impact is likely greater for smaller inventories, due to the smaller number of repeats of the |
Re your regex, I think you have a typo in the first line:
Is that trailing |
Ah, yes thank you, it's a typo (I did my regex this morning and I already forgot about it) |
I see. If we include the title length, I think we'll probably have less (in the end, it's as if you had a longer title with at most 3 characters (because, who in their right mind would have a title of more than 1000 characters)). |
Yes, agreed, yours is much more concise.
Yessss, goodness. I like your proposal far better than continuing work to find a more robust regex. (I rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.) |
This is clearly something that I'd agree with. As such I will close this PR and make a new one. For a more precise parsing, I think we can even encode the length of each field instead and completely get rid of any regex (this will save us a lot of work!). Note that we can even drop the spaces in the format ! (although we could want to keep them for testing purposes). So, I have two suggestions:
I'm not sure which one would be the most compact but I like the first one because tests will be easier to write (and debug). I'm not sure if doubling the number of lines would have a direct impact. However, we need to synchronize whatever decision we make with #12204 to avoid creating an unusable format. Jakob's suggested that each domain is responsible for dumping and loading their own intersphinx fragment and I suggested using a similar construction as in ELF where each domain would have a dedicated section where they would write whatever they want and they would know what to do. I think, we can introduce this line-encoded format as a "default" encoding if a domain does not have a specific way to encode its information or if it supports the default logic. |
Ok, here's what I turned up, in no particular order:
I also have an entire test module in |
Fix #11711.
Since the logic is different, this required a complete update on how intersphinx inventories are represented. Before, each line in an inventory was of the form
However, if name contains a whitespace (which is the case in the linked issue), then it becomes extremely hard to have a "good" regex. Now, we represent records as follows:
where the name's length (and the colon) is only included if the name contains a space. In particular, the inventory's size is not changed if the name has no space inside. Otherwise, we increase a bit the inventory size with a 64-bit integer for those entries.