Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added content type check for getdatalink #607

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

d-giles
Copy link
Contributor

@d-giles d-giles commented Oct 4, 2024

This PR seeks to address issue #328 and the associated discussion by checking if the access_format property is present and a datalink or a votable. Failing that, this update checks the headers of URLs in the record with the meta.ref.url UCD, if the content type is found to be a datalink or a votable, the link is provided as the datalink_url for the record.
Where no datalink is found, a DALServiceError is raised informing the user that no datalink was found for the record.

Fixes #328

Copy link

codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.

Project coverage is 81.88%. Comparing base (029980e) to head (cda2eae).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
pyvo/dal/adhoc.py 73.68% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   81.70%   81.88%   +0.18%     
==========================================
  Files          69       70       +1     
  Lines        7093     7178      +85     
==========================================
+ Hits         5795     5878      +83     
- Misses       1298     1300       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bsipocz bsipocz added this to the v1.5.3 milestone Oct 4, 2024
Copy link
Contributor

@msdemlei msdemlei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... first,

       if ('datalink' in access_format) or ('votable' in access_format):

has far too many false positives – and possibly also false negatives, because RFC 2045 defines media types as case-insensitive.

What I think ought to be done: Write a function is_datalink (or something like that) that receives a media type string and returns true if the media type compares equal to application/x-votable+xml;content=datalink according to RFC 2045 rules, that is, after parsing, case folding, and all. I'd say this should sit in pyvo.dal.adhoc (stupid name, another thing we ought to change). This function should then be used wherever there are comparisons against adhoc.DATALINK_MIME_TYPE in pyVO.

And then quite a bit of this duplicates functionality that's already in _guess_access_format and _guess_access_url. It's bad enough it that's in one place (as these kinds of heuristics always suck), but let's not have them in two places. Can this be refactored?

@d-giles
Copy link
Contributor Author

d-giles commented Oct 14, 2024

@msdemlei

has far too many false positives – and possibly also false negatives, because RFC 2045 defines media types as case-insensitive.

The access_format is converted to lowercase prior to the check, so that should, I believe, prevent false negatives with respect to case-insensitivity. As regards false positives, I agree that "votable" in access_format is incredibly broad, but I made the check as permissive as possible to replicate the original behavior of getdatlink while culling results which couldn't be parsed down the line. If it is the case that we can say for certain that every datalink will have content-type equals "application/x-votable+xml;content=datalink" then I'm happy to change the check to be much more restrictive and straightforward. Are we confident that "access_format" will be given as that exact value when the content is a datalink?

And then quite a bit of this duplicates functionality that's already in _guess_access_format and _guess_access_url. It's bad enough it that's in one place (as these kinds of heuristics always suck), but let's not have them in two places. Can this be refactored?

I can switch from self.getdataformat() and self.getdataurl() to DatalinkResults._guess_access_format(...) and DatalinkResults._guess_access_url(...), they look for a few more keywords. Though, if the access_url isn't a datalink and there's a separate column for datalinks then we need to loop through other columns anyways. Both getdata[...] and _guess_access_[...] return the first best guess and not a full list of matching conditions. I.e. in the example you gave on the original issue select top 1 dlurl from rosat.images, the datalink url is in the column dlurl and this is not returned by either of the aforementioned methods if the full record is queried. Should _guess_access_url be modified to optionally return a list of all potential urls or should the loop be confined to getdatalinkurl in your opinion?

@bsipocz bsipocz modified the milestones: v1.5.3, v1.6 Oct 15, 2024
@msdemlei
Copy link
Contributor

msdemlei commented Oct 15, 2024 via email

@zoghbi-a
Copy link
Contributor

I think this is issue may (or should) be split into two separate issues: One is handle this getdatalink error, and another to get the mime type.
For the first one, I think a similar check is already done in _iter_datalinks_from_product_rows here, so it may make sense to do a similar check in getdatalink.

@msdemlei
Copy link
Contributor

msdemlei commented Oct 22, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getdatalink() exits messily when there isn't a datalink
4 participants