Add global discovery #470
Merged
+1,485
−132
38 commits
3f93817  msdemlei  Adding inclusive keyword arguments to the STC constraints.
0ca75e4  msdemlei  Adding a discover sub-tree to hold global discovery code
d331c43  msdemlei  Fleshing out global image discovery
c4b6bdf  msdemlei  Fleshing out global image discovery.
9327e7c  msdemlei  Adding a "learnable" mocker for requests.
5e3027a  msdemlei  Making a first global discovery query work.
558b8a4  msdemlei  Making SIA1 discovery roughly work.
4f4e34e  msdemlei  In global discovery, now removing services that are served-by some other
826f051  msdemlei  Updating global discovery test data
3eac8e4  msdemlei  Minor fix in SIA1 branch of global image discovery
9f85595  msdemlei  Adding an obscore_new data model to the registry Datamodel constraint.
42e1282  msdemlei  New datamodel obscore-new to locate obscore tables.
4750280  msdemlei  Global discovery now elides duplicate images found based on their
06d7c9b  msdemlei  Adding timeouts to global discovery
e39dd87  msdemlei  Including utils.testing docs in the main doc
f6ceb40  msdemlei  Adding some documentation on global dataset discovery.
c65d073  msdemlei  You can now pass a result of registry.search to discover.image_globally.
e194f84  msdemlei  global discovery: the time constraint can now be an interval, too.
347034e  msdemlei  Adding a method for service stats to the discoverer
641d2a9  msdemlei  Adding basic cancelability to global discovery.
554d334  msdemlei  Global discovery records now record the service of origin.
0e5b7af  msdemlei  Global discovery: Downgrading obscore query to ADQL 2.0
40e9a4e  msdemlei  global discovery: get_service is now lax.
5aa6427  msdemlei  SIA1: warding against NULL-s in the bandpass definition
bd0ce9a  msdemlei  global discovery: not requiring optional SIA1 fields any more
997c549  msdemlei  Updating global discovery test cases.
afafe3e  msdemlei  Whitespace edits to by-and-large placate flake8
321c121  msdemlei  functools.cache is 3.9+; replace it
3ec9832  msdemlei  Removing LearnableRequestMocker and its infrastructure.
26d6f35  msdemlei  global discovery: Making query cancelling work.
c84bb59  msdemlei  global discovery: watchers now receive a discoverer instance, too.
30e3d26  msdemlei  global discovery: more test coverage.
5411ea8  msdemlei  Documentation fixes.
f384d0d  msdemlei  pyvo.discover now emits a prototype warning on import.
51c2d91  bsipocz   MAINT: dealing with warning at collection time
b4fed08  bsipocz   TST: unrelated test fix, rebase should resolve it
8caab1f  msdemlei  Minor changes after bsipocz' 2024-10-09 review.
1e696ac  bsipocz   DOC: We can't link the module before generating its API docs
@@ -0,0 +1,213 @@ | ||||||
.. _pyvo-discover: | ||||||
|
||||||
****************************************** | ||||||
Global Dataset Discovery (`pyvo.discover`) | ||||||
****************************************** | ||||||
|
||||||
One of the promises of the Virtual Observatory has always been that | ||||||
researchers can globally look for data sets (spectra, say). In the | ||||||
early days of the VO, this looked relatively simple: Just enumerate all | ||||||
image (in those days: SIAP) services, send the same query to them, and | ||||||
then somehow shoehorn together their responses. | ||||||
|
||||||
In reality, there are all kinds of small traps ranging from hanging | ||||||
services to the sheer number of data collections. Hitting hundreds of | ||||||
internet sites takes quite a bit of time even when everything works. In | ||||||
the meantime, the picture is further complicated by the emergence of | ||||||
additional protocols. Images, for instance, can be published through | ||||||
SIA version 1, SIA version 2, and ObsTAP in 2024 – and in particular any | ||||||
combination of these. | ||||||
|
||||||
To keep global discovery viable, several techniques can be applied: | ||||||
|
||||||
* pre-select the services by using the service footprints in space, | ||||||
time, and spectrum. | ||||||
* elide services serving the same data collections | ||||||
* filter duplicate responses before presenting them to the user. | ||||||
|
||||||
This is the topic of this sub-module. | ||||||
|
||||||
In early 2024, this is still in early development. | ||||||
|
||||||
|
||||||
Basic Usage | ||||||
=========== | ||||||
|
||||||
The basic API for dataset discovery is through functions accepting | ||||||
constraints on | ||||||
|
||||||
* space (currently, a cone, i.e., RA, Dec and a radius in degrees), | ||||||
* spectrum (currently, a point in the spectrum as a spectral quantity), | ||||||
and | ||||||
* time (an astropy.time.Time instance or a pair of them to denote an | ||||||
interval). | ||||||
|
||||||
For instance:: | ||||||
|
||||||
from pyvo import discover | ||||||
from astropy import units as u | ||||||
from astropy import time | ||||||
|
||||||
datasets, log = discover.images_globally( | ||||||
space=(273.5, -12.1, 0.1), | ||||||
spectrum=1*u.nm, | ||||||
time=(time.Time('1995-01-01'), time.Time('1995-12-31'))) | ||||||
print(datasets) | ||||||
|
||||||
The function returns a pair of lists. ``datasets`` is a list of | ||||||
`~pyvo.discover.ImageFound` instances. This is the (potentially | ||||||
long) list of datasets located. | ||||||
|
||||||
The second returned value, ``log``, is a sequence of strings noting | ||||||
which services failed and which returned how many records. In | ||||||
exploratory work, it is probably all right to discard the information, | ||||||
but for research purposes, these log lines are an important part of the | ||||||
provenance and must be retained – after all, you might have missed an | ||||||
important clue just because a service was down at the moment you ran | ||||||
your discovery; also, you might want to re-query the failing services at | ||||||
some later stage. | ||||||
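Because the log lines are part of the provenance, it can help to store them next to the results. The following is a minimal sketch that assumes only what is stated above, namely that ``log`` is a sequence of strings; the helper name ``save_discovery_log`` and the JSON layout are illustrative choices and not part of pyVO:

```python
import datetime
import json


def save_discovery_log(log_lines, path):
    """Write discovery log lines to a JSON file with a UTC timestamp.

    ``log_lines`` is any sequence of strings, e.g. the second value
    returned by ``discover.images_globally``.  This helper is a
    hypothetical convenience function, not part of pyVO.
    """
    record = {
        # When the discovery was run; matters because services come and go.
        "retrieved": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
        "log": list(log_lines),
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return record
```

After a discovery run, calling ``save_discovery_log(log, "discovery-log.json")`` would keep a record of which services failed, so they can be re-queried later.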
|
||||||
All constraints are optional, but without a space constraint, no SIA1 | ||||||
services will be queried. With spectrum and time constraints, it is | ||||||
probably wise to pass ``inclusive=True`` for the time being, as far too | ||||||
many resources do not define their coverage. | ||||||
|
||||||
The discovery function accepts a few other parameters you should know | ||||||
about. These are discussed in the following sections. | ||||||
|
||||||
|
||||||
``inclusive`` Searching | ||||||
----------------------- | ||||||
|
||||||
Unfortunately, many resources in the VO do not yet declare their | ||||||
coverage. In its default configuration, pyVO discovery will not query | ||||||
services that do not explicitly say they cover the region of interest | ||||||
and hence always skip these services (unless you manually pass them in, | ||||||
see below). To change that behaviour and try services that do not state | ||||||
their coverage, pass ``inclusive=True``. At this time, this will | ||||||
usually dramatically increase the search time. | ||||||
|
||||||
Setting ``inclusive`` to True will also include datasets that do not | ||||||
declare their temporal or spectral coverage when coming from version 1 | ||||||
SIAP services [TODO: do that in obscore, too?]. This may | ||||||
dramatically increase the number of false positives. It is probably | ||||||
wise to only try ``inclusive=True`` when desperate or when there is a | ||||||
particular necessity to not miss any potentially applicable data. | ||||||
|
||||||
|
||||||
The Watcher | ||||||
----------- | ||||||
|
||||||
Global discovery usually hits dozens of web services. To see what is | ||||||
going on, you can pass in a function accepting a discoverer instance and | ||||||
a message string as ``watcher``. The trivial implementation would be:: | ||||||
|
||||||
import datetime | ||||||
|
||||||
def watch(disco, msg): | ||||||
print(datetime.datetime.now(), msg) | ||||||
|
||||||
found, log = discover.images_globally( | ||||||
space=(3, 1, 0.2), watcher=watch) | ||||||
|
||||||
Here, ``disco`` is an ``ImageDiscoverer`` instance; this way, you can | ||||||
further inspect the state of things, e.g., by looking at the | ||||||
``already_queried`` and ``failed_services`` attributes containing the | ||||||
number of total services tried and of services that gave errors, | ||||||
respectively. Also, although that clearly goes beyond watching, you can | ||||||
call the ``reset_services()`` method. This empties the query queues and | ||||||
thus in effect stops the discovery process. | ||||||
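As a sketch of how a watcher might use these attributes to cancel a run, the following factory builds a watcher that calls ``reset_services()`` once more than a given number of services have failed. It assumes, as described above, that ``failed_services`` holds a number. ``make_cancelling_watcher`` and ``StandInDiscoverer`` are hypothetical names; the stand-in class only exists so that the example runs without network access and is not pyVO's ``ImageDiscoverer``:

```python
import datetime


def make_cancelling_watcher(max_failures):
    """Return a watcher that prints progress and stops after too many failures."""
    def watch(disco, msg):
        print(datetime.datetime.now(), msg)
        # Assumption: failed_services is a count, as the text above describes.
        if disco.failed_services > max_failures:
            # Empties the query queues, which in effect stops discovery.
            disco.reset_services()
    return watch


class StandInDiscoverer:
    """Hypothetical stand-in exposing only the attributes used above."""

    def __init__(self):
        self.already_queried = 0
        self.failed_services = 0
        self.was_reset = False

    def reset_services(self):
        self.was_reset = True
```

With a real discovery run, the result of ``make_cancelling_watcher(5)`` would be passed as the ``watcher`` argument.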
|
||||||
Setting Timeouts | ||||||
---------------- | ||||||
|
||||||
There are always some services that are broken. A particularly | ||||||
insidious sort of brokenness occurs when data centres run reverse | ||||||
proxies (many do these days) that are up and try to connect to a backend | ||||||
server intended to run the actual service. In certain configurations, | ||||||
it might take the reverse proxy literally forever to notice when a | ||||||
backend server is unreachable, and meanwhile your global discovery will | ||||||
hang, too. | ||||||
|
||||||
Therefore, pyVO global discovery will give up after ``timeout`` seconds, | ||||||
defaulting to 20 seconds. Note that large data collections *may* take | ||||||
longer than that to produce their response; but given the simple | ||||||
constraints we support so far, we would probably consider them broken in | ||||||
that case. Reducing the timeout to just a few seconds will make pyVO | ||||||
continue earlier on broken services. But that of course increases the | ||||||
risk of cutting off working services. | ||||||
|
||||||
If in doubt, have a brief look at the log lines; if a service that | ||||||
sounds promising shows a timeout, perhaps try again with a longer | ||||||
timeout or use partial matching. | ||||||
|
||||||
|
||||||
Overriding service selection | ||||||
---------------------------- | ||||||
|
||||||
You can also pass a `pyvo.registry.RegistryResults` instance to | ||||||
``services`` to override the automatic selection of services to query. | ||||||
See the discussion of overriding the service selection in Discoverers_. | ||||||
|
||||||
|
||||||
Discoverers | ||||||
=========== | ||||||
|
||||||
For finer control of the discovery process, you can directly use | ||||||
the `pyvo.discover.image.ImageDiscoverer` class. It is constructed with | ||||||
essentially the same parameters as the search function. | ||||||
|
||||||
To run the discovery, first establish which services to query. There | ||||||
are two ways to do that: | ||||||
|
||||||
* Call the ``discover_services()`` method. This is what the search | ||||||
function does; it uses your constraints as above. | ||||||
* Pass a `pyvo.registry.RegistryResults` instance to ``set_services``. | ||||||
This lets you do your own searches. The image discoverer will only | ||||||
use resources that it knows how to handle. For instance, it is safe | ||||||
to call something like:: | ||||||
|
||||||
discoverer.set_services( | ||||||
registry.search(registry.Author("Hubble, %"))) | ||||||
|
||||||
to query services that give a particular author. More realistically, | ||||||
|
||||||
:: | ||||||
|
||||||
discoverer.set_services( | ||||||
registry.search(registry.Datamodel("obscore"))) | ||||||
|
||||||
will restrict the operation to obscore services. | ||||||
|
||||||
``set_services`` will purge redundant services, which means that | ||||||
services that say they (or their data) are served by another service that | ||||||
will already be queried will not be queried. Outside of debugging, this | ||||||
is what you want, but if you really do not want this, you can pass | ||||||
``purge_redundant=False``. Note, however, that you will still get only | ||||||
one match per access URL of the dataset. | ||||||
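To illustrate what "one match per access URL" means, here is a small sketch of first-wins deduplication. This is not pyVO's actual implementation; the ``Record`` class and its ``access_url`` and ``origin_service`` fields are assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a discovered dataset record."""
    access_url: str
    origin_service: str


def first_per_access_url(records):
    """Keep only the first record seen for each access URL, in order."""
    seen = set()
    unique = []
    for rec in records:
        if rec.access_url not in seen:
            seen.add(rec.access_url)
            unique.append(rec)
    return unique
```

In this sketch, two services returning the same access URL yield a single record, attributed to whichever service answered first.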
|
||||||
Once you have set the services, call ``query_services()`` to fill the | ||||||
``results`` and ``log_messages`` attributes. It may be informative to | ||||||
watch these change from, say, a different thread. Changing their | ||||||
content has undefined results. | ||||||
|
||||||
A working example would look like this:: | ||||||
|
||||||
from pyvo import discover, registry | ||||||
from astropy.time import Time | ||||||
|
||||||
im_discoverer = discover.image.ImageDiscoverer( | ||||||
space=(274.6880, -13.7920, 0.1), | ||||||
time=(Time('1996-10-04'), Time('1996-10-10'))) | ||||||
im_discoverer.set_services( | ||||||
registry.search(keywords=["heasarc rass"])) | ||||||
im_discoverer.query_services() | ||||||
print(im_discoverer.log_messages) | ||||||
print(im_discoverer.results) | ||||||
|
||||||
|
||||||
|
||||||
Reference/API | ||||||
============= | ||||||
|
||||||
.. automodapi:: pyvo.discover |
@@ -132,6 +132,7 @@ Using ``pyvo`` | |
|
||
dal/index | ||
registry/index | ||
discover/index | ||
io/index | ||
auth/index | ||
samp | ||
|
@@ -1,16 +1,23 @@ | ||
.. _pyvo-utils: | ||
|
||
*************************** | ||
PyVO utils (``pyvo.utils``) | ||
*************************** | ||
******************************* | ||
pyVO utilities (``pyvo.utils``) | ||
******************************* | ||
|
||
|
||
This module contains utilities and base classes intended for internal use | ||
within PyVO or other dependent libraries. | ||
This subpackage collects a few packages intended to help with developing | ||
and maintaining pyVO. None of this is part of pyVO's public API, and it | ||
may change at any time. It is documented here for the convenience | ||
of the maintainers and to help users when the effects of this code | ||
become user-visible. | ||
|
||
|
||
Reference/API | ||
============= | ||
|
||
.. automodapi:: pyvo.utils.xml.elements | ||
:no-inheritance-diagram: | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
prototypes |
@@ -0,0 +1,16 @@ | ||
# Licensed under a 3-clause BSD style license - see LICENSE.rst | ||
""" | ||
Various functions dealing with global data discovery. | ||
""" | ||
|
||
import warnings | ||
from pyvo.utils.prototype import PrototypeWarning | ||
|
||
# if you remove this warning, also remove the ignore in test_imagediscovery. | ||
warnings.warn("pyvo.discover's API is still under design in pyVO 1.6 and" | ||
" may change without prior notice. Feedback to the authors is most" | ||
" welcome.", PrototypeWarning) | ||
|
||
from .image import images_globally, ImageDiscoverer, ImageFound | ||
|
||
__all__ = ['images_globally', "ImageDiscoverer", "ImageFound"] |
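Since the module warns on import, callers who have accepted the prototype status may want to silence just that warning category. The sketch below shows the general pattern with the standard ``warnings`` module. ``StandInPrototypeWarning``, ``noisy_import`` and ``quiet`` are hypothetical names used so the example runs without pyVO; real code would presumably filter on ``pyvo.utils.prototype.PrototypeWarning`` around the ``pyvo.discover`` import instead:

```python
import warnings


class StandInPrototypeWarning(Warning):
    """Hypothetical stand-in for pyvo.utils.prototype.PrototypeWarning."""


def noisy_import():
    """Mimic a module that warns when it is imported."""
    warnings.warn("API is still under design", StandInPrototypeWarning)
    return "module"


def quiet(func, category):
    """Call ``func`` while ignoring warnings of ``category`` only."""
    with warnings.catch_warnings():
        # The filter is local to this block; other warnings still show.
        warnings.simplefilter("ignore", category)
        return func()
```

The filter is restored when the ``with`` block exits, so unrelated warnings elsewhere in the program are unaffected.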
Is this expected to run long and therefore not doctested? This question stands for all the code examples in this file.
Even if they need to be skipped for a reason I would rather do it the same way we do skip modules with doctestplus, on the file level, rather than not formatting them for doctest to collect. I'm happy to fix up the file, just let me know the thought process
Yes, although the requested coverage probably will not match many services in the future, I think this is too unpredictable to work as part of anything that runs on every push. So, I'd rather skip it, yes.
Ok, so I'll make a note for the infrastructure side that we need a directive for slow doctests (we have such markers for the normal tests), and then can just run the slow ones in the weekly cron