This repository has been archived by the owner on Oct 18, 2019. It is now read-only.

Document and understand data flow leading to duplicate RA/NDBC stations #422

Open
dpsnowden opened this issue May 19, 2015 · 8 comments

Comments

@dpsnowden
Contributor

On the NDBC dataset page you can see duplicate stations from CeNCOOS and AOOS. For example, the station with WMO ID 46251 shows up as being served by both CeNCOOS and NDBC. I believe this is the correct behavior. The question is, why don't the other RAs show up this way?

Since it is only AOOS and CeNCOOS that exhibit this behavior, I'm guessing it is related to i52N configuration. What differs between i52N ID handling and ncSOS ID handling? But, since NANOOS and GCOOS aren't showing up this way, it must be more than just the software. It probably has more to do with the way @shane-axiom is handling URN station identifiers as part of the ingestion process that is not being done at the other sites.

We need to understand the data flow that led to this behavior and see what we can do to get it implemented everywhere. It may require a tool at the catalog that provides a summary of asset identifier usage for each collection (aka region). @emiliom I think you were looking into WMO ID issues, any insight? @lukecampbell thoughts?
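As a rough illustration of the kind of summary tool mentioned above, here is a minimal sketch that counts which URN authority each provider uses in its station identifiers. The function name `summarize_identifier_usage` and the flat `(provider, urn)` input are hypothetical, not part of the Catalog code; the sample URNs are the ones that come up in this thread, used purely as illustrative input.

```python
from collections import defaultdict

def summarize_identifier_usage(records):
    """Count station URN authorities (the 4th URN field) per provider.

    `records` is an iterable of (provider, station_urn) pairs. This input
    shape is an assumption made for the sketch, not the Catalog's schema.
    """
    summary = defaultdict(lambda: defaultdict(int))
    for provider, urn in records:
        parts = urn.split(":")
        # Expected shape: urn:ioos:station:<authority>:<label>
        if len(parts) >= 5 and parts[:3] == ["urn", "ioos", "station"]:
            authority = parts[3]
        else:
            authority = "unparsed"
        summary[provider][authority] += 1
    return {provider: dict(counts) for provider, counts in summary.items()}

# Illustrative input only.
records = [
    ("CeNCOOS", "urn:ioos:station:wmo:46244"),
    ("CeNCOOS", "urn:ioos:station:wmo:46094"),
    ("NANOOS", "urn:ioos:station:nanoos:osu_nh10"),
    ("NOAA-NDBC", "urn:ioos:station:wmo:46244"),
]
usage = summarize_identifier_usage(records)
```

A per-region table like this would make it easy to spot which collections advertise `wmo` URNs and which use their own namespace.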

@abirger or @carmelortiz can you track this or assign it to the appropriate person?

@abirger

abirger commented May 19, 2015

@dpsnowden, the most interesting fact here is that the CeNCOOS SOS Capabilities document does not report a station with WMO ID 46251 at all. It is just not there! So, the dual affiliation

SOS | CeNCOOS
SOS | NOAA-NDBC

indicated on the station page, is fake, and the link to the service in SOS|CeNCOOS view just returns an Internal Server Error 500 (in SOS|NOAA-NDBC view that link properly leads to the NDBC SOS page).

On the other hand, a station with WMO ID 46244 actually is reported by both CeNCOOS and NDBC, and it is indicated by showing a triple affiliation

SOS | CeNCOOS
SOS | NOAA-NDBC
SOS | CeNCOOS

where the last line provides the real link to the service, although everything else is almost identical to the first line view.

@lukecampbell, can you explain why the Catalog falsely attributed the station with WMO ID 46251 to CeNCOOS? The only reason I can imagine for that indication is that the station's and CeNCOOS' observedProperties partially overlap. However, that is a big stretch, and the indication seems wrong regardless.

@emiliom

emiliom commented May 20, 2015

@abirger's two examples are pointing out two "ghost" CeNCOOS datasets that have no valid service link/page in the IOOS Catalog. His second example, WMO ID 46244, makes that even more apparent: the dataset is shown as occurring twice under the SOS | CeNCOOS service, but only one of them has a valid service link; that one also happens to have a "Last updated" date of a year ago.

So, we really are dealing with two possibly separate issues here:

  1. The catalog is retaining ghost datasets that should be purged. We don't have enough information to know how widespread this problem is, and under what circumstances it occurs.
  2. Stations with WMO IDs are not consistently identified as being the same and listed jointly on the station dataset page.

I think the core issue @dpsnowden was bringing up is the second one, and that's the only one where I can add some, umm, insight or details. Though the 1st issue is obviously also important.

There are two relevant types of stations (datasets) that will have a WMO ID: NDBC-operated stations (buoys, C-MAN stations, etc.), and stations from other providers (RAs, etc.) made available for redistribution by NDBC and assigned a WMO ID. The two examples Derrick and Alex listed are both of the second type, as they are both CDIP wave buoys. AOOS and CeNCOOS (or maybe more to the point, Axiom) most likely get this CDIP data from NDBC, and add it to their SOS services while retaining the WMO ID in the station URN identifier. In the process, they also lose any metadata linkage to CDIP, and incorrectly assign NDBC as the "Operator Name" (e.g., WMO 46244, http://catalog.ioos.us/datasets/53469a1d8c0db36efb591be3).

But this raises the question: why doesn't that WMO ID bring up the corresponding datasets from the CDIP DAC? It may be that the CDIP web service isn't providing a WMO ID for each buoy. The NDBC station page for WMO 46244 gives us the CDIP buoy code, 168 (see its CDIP page). Armed with that code, you'll see in the CDIP dataset listing that it occurs under 6 or so datasets! And most of those actually have two manifestations: "DAP | Other" and "DAP | CDIP" (e.g., see these CDIP 168 datasets). And, BTW, the "DAP | Other" dataset appears to be another "ghost" dataset whose Catalog service link is broken or doesn't exist!

Ok, so in this case one maybe could blame CDIP for not broadcasting the WMO ID for each of its buoys (and also having multiple datasets for each buoy). Let me turn to a cleaner example of another non-NDBC-operated station with a WMO ID.

NANOOS operates the "NH-10" buoy whose data are redistributed on NDBC and has WMO ID 46094. Searching for the WMO ID on the catalog only brings up the datasets served by NDBC and CeNCOOS.

Note that that dataset page has three service links (top left), two of them from CeNCOOS, with one of those CeNCOOS datasets being yet another ghost dataset with a "Last Update" date of a year ago. Let's stick with the CeNCOOS service link that seems current. Again, CeNCOOS (Axiom) is ingesting this dataset from NDBC and redistributing it via their i52N SOS. They retain the WMO ID in the station ID URN they advertise (urn:ioos:station:wmo:46094). Like the CDIP example I discussed, they're incorrectly assigning NDBC as the "Operator Name", rather than NANOOS or the institution that directly operates the buoy (OSU).

In that NH-10 Catalog dataset page, the NANOOS entry from our i52N SOS service should be listed together with the NDBC and CeNCOOS service links, but it's not. In fact, ideally it should be listed as the "primary" source (same with the CDIP example), but that's a somewhat different topic (and one that was discussed in #376). Unlike the CDIP example, we do make the WMO ID available in the proper metadata element in the DescribeSensor response, and the Catalog is properly parsing that information. See the NANOOS NH-10 dataset page on the IOOS Catalog; it correctly shows 46094 under Station WMO ID.

NANOOS doesn't construct a station URN with the WMO ID, since that's not an IOOS DMAC requirement. Our SOS service only includes stations for which we're either the sole service provider to IOOS, or the primary one (as in NH-10). We don't redistribute data already available in IOOS services from their primary providers, such as NDBC and COOPS. All our station URNs have a nanoos namespace, so NH-10 is urn:ioos:station:nanoos:osu_nh10.

To summarize some conclusions from this long comment:

  • This is not news to Derrick, but it's worth remembering that some RAs follow the approach of CeNCOOS and AOOS in redistributing secondary data in their SOS service that's also available via IOOS-compliant services (or other public web services) from their primary providers. This approach has pros and cons, but if not done well, it can lead to a misattribution of the original operator or provider, which I think is a big deal. Other RAs (including NANOOS) choose to only distribute primary data, though for data streams also shared with NDBC, the NDBC SOS service will also have them; again, pros and cons here.
  • The Catalog occurrence of "ghost" datasets that seem to be older versions of currently active datasets seems fairly widespread. That should be examined.
  • The Catalog correctly parses the WMO ID metadata element at least from i52N SOS services; it might not be doing that from ncSOS services, or maybe ncSOS doesn't have a good mechanism to advertise WMO ID, or maybe it's just CDIP that hasn't added WMO IDs.
  • The Catalog only seems to match up datasets with the same WMO ID when this ID is in the station URN or some other core station name or label. It is not using the IOOS-compliant WMO ID metadata attribute to make that match.
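To make the last point concrete, here is a minimal sketch of matching that uses both sources of the WMO ID: the explicit metadata attribute when present, and the station URN otherwise. This is not how the Catalog is implemented; the `wmo_id` and `provider` field names and both function names are hypothetical, and only `uid` mirrors a field seen in Catalog dataset documents.

```python
import re
from collections import defaultdict

# Matches station URNs that embed the WMO ID, e.g. urn:ioos:station:wmo:46094
WMO_URN = re.compile(r"^urn:ioos:station:wmo:(\w+)$")

def wmo_id_for(dataset):
    """Prefer the explicit WMO ID attribute; fall back to the station URN."""
    if dataset.get("wmo_id"):
        return str(dataset["wmo_id"])
    match = WMO_URN.match(dataset.get("uid", ""))
    return match.group(1) if match else None

def group_by_wmo_id(datasets):
    """Map each WMO ID to the uids of every dataset that carries it."""
    groups = defaultdict(list)
    for ds in datasets:
        wmo = wmo_id_for(ds)
        if wmo is not None:
            groups[wmo].append(ds["uid"])
    return dict(groups)

# Illustrative records; "wmo_id" stands in for the parsed metadata attribute.
datasets = [
    {"uid": "urn:ioos:station:wmo:46094", "provider": "CeNCOOS"},
    {"uid": "urn:ioos:station:nanoos:osu_nh10", "provider": "NANOOS", "wmo_id": "46094"},
    {"uid": "urn:ioos:station:wmo:46244", "provider": "NOAA-NDBC"},
]
groups = group_by_wmo_id(datasets)
```

With this kind of lookup, the NANOOS NH-10 entry would land in the same group as the `wmo:46094` entry even though its URN uses the nanoos namespace.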

@lukecampbell
Member

The "ghost" datasets are the result of dangling pointers.

When a new service is identified from NGDC Geoportal, a "dataset" record is created. That dataset has a pointer back to the parent service from which it was harvested. Any new services that are identified and that contain a reference to the URL for that dataset are added to the list of service parents.

Here's the document:

{
    "_id" : ObjectId("53469a1d8c0db36efb591be3"),
    "updated" : ISODate("2015-05-15T11:49:36.150Z"),
    "uid" : "urn:ioos:station:wmo:46244",
    "created" : ISODate("2014-04-10T13:18:21.359Z"),
    "services" : [ 
        {
            "updated" : ISODate("2014-04-13T02:45:33.944Z"),
            "description" : "Humboldt Bay, North Spit, CA",
            "variables" : [ 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_period", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_period", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_water_temperature", 
                "http://mmisw.org/ont/fake/parameter/sea_surface_dominant_wave_period", 
                "http://mmisw.org/ont/fake/parameter/sea_surface_wave_mean_period"
            ],
            "messages" : [],
            "geojson" : {
                "type" : "Point",
                "coordinates" : [ 
                    -124.357, 
                    40.888
                ]
            },
            "service_type" : "SOS",
            "metadata_type" : "sensorml",
            "keywords" : [],
            "data_provider" : "CeNCOOS",
            "service_id" : ObjectId("5311d2538c0db3469f7bfecb"),
            "metadata_value" : "",
            "asset_type" : "Buoy",
            "name" : "Humboldt Bay, North Spit, CA"
        }, 
        {
            "updated" : ISODate("2015-05-06T08:24:40.947Z"),
            "description" : "Humboldt Bay, North Spit, CA",
            "variables" : [ 
                "urn:ioos:sensor:wmo:46244::summarywav1", 
                "urn:ioos:sensor:wmo:46244::watertemp1"
            ],
            "messages" : [],
            "geojson" : {
                "type" : "Point",
                "coordinates" : [ 
                    -124.357, 
                    40.888
                ]
            },
            "name" : "46244",
            "keywords" : [],
            "metadata_type" : "sensorml",
            "service_type" : "SOS",
            "data_provider" : "NOAA-NDBC",
            "metadata_value" : "",
            "asset_type" : "MOORED BUOY",
            "service_id" : ObjectId("53d49ca78c0db37ff137030c")
        }, 
        {
            "updated" : ISODate("2015-05-15T11:49:36.150Z"),
            "description" : "Humboldt Bay, North Spit, CA (46244)",
            "variables" : [ 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_period", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_swell_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_period", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_significant_height", 
                "http://mmisw.org/ont/cf/parameter/sea_surface_wind_wave_to_direction", 
                "http://mmisw.org/ont/cf/parameter/sea_water_temperature", 
                "http://mmisw.org/ont/fake/parameter/sea_surface_dominant_wave_period", 
                "http://mmisw.org/ont/fake/parameter/sea_surface_wave_mean_period"
            ],
            "metadata_type" : "sensorml",
            "keywords" : [ 
                "Humboldt Bay, North Spit, CA (46244)", 
                "NONE", 
                "urn:ioos:network:cencoos:all", 
                "urn:ioos:station:wmo:46244"
            ],
            "data_provider" : "CeNCOOS",
            "time_min" : ISODate("2014-11-01T00:15:00.163Z"),
            "asset_type" : "Buoy",
            "time_max" : ISODate("2015-05-11T23:43:00.876Z"),
            "name" : "Humboldt Bay, North Spit, CA (46244)",
            "messages" : [],
            "geojson" : {
                "type" : "Point",
                "coordinates" : [ 
                    -124.356, 
                    40.888
                ]
            },
            "service_type" : "SOS",
            "service_id" : ObjectId("53d34aed8c0db37e0b538fda"),
            "metadata_value" : ""
        }
    ],
    "active" : false
}

So the parent service(s) may have been deactivated or deleted but the dataset is still being referenced by at least one active service.

We will need to update the catalog behavior to periodically prune unreferenced services.
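A minimal sketch of that pruning step, under stated assumptions: the dataset is a plain dict shaped like the document above, service IDs are plain strings rather than ObjectId values, and the set of still-active service IDs is supplied by the caller. The function name is hypothetical, and which of the three services is the stale one is assumed here only from its older "updated" date.

```python
def prune_dangling_services(dataset, active_service_ids):
    """Return a copy of `dataset` without entries whose parent service is gone.

    An entry is kept only if its service_id is in `active_service_ids`.
    The input dataset is left unmodified.
    """
    kept = [
        entry for entry in dataset["services"]
        if entry["service_id"] in active_service_ids
    ]
    return dict(dataset, services=kept)

# Trimmed-down version of the document above (string IDs for simplicity).
dataset = {
    "uid": "urn:ioos:station:wmo:46244",
    "services": [
        {"service_id": "5311d2538c0db3469f7bfecb", "data_provider": "CeNCOOS"},
        {"service_id": "53d49ca78c0db37ff137030c", "data_provider": "NOAA-NDBC"},
        {"service_id": "53d34aed8c0db37e0b538fda", "data_provider": "CeNCOOS"},
    ],
}

# Assumption for illustration: the first CeNCOOS service is the stale parent.
active_service_ids = {"53d49ca78c0db37ff137030c", "53d34aed8c0db37e0b538fda"}
pruned = prune_dangling_services(dataset, active_service_ids)
```

Run periodically, a filter like this would drop the "ghost" entry while leaving the current NDBC and CeNCOOS links in place.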

@lukecampbell
Member

Should be fixed now
https://github.com/ioos/catalog/pull/426

@abirger

abirger commented May 26, 2015

@lukecampbell, yep, the ghost issue seems to be gone. However, the Catalog definitely does a selective matching of the WMO IDs, as @emiliom has mentioned: it only matches datasets for the WMO ID that is a part of the station's URN, but drops the ball if the WMO ID is defined in a separate identifier. It looks like a bug, since the Catalog seems to get the right WMO ID in both cases.

Regarding the wrong Operator attribution, it seems that the Catalog does not parse this attribute from NDBC at all (even though NDBC properly shows Oregon Coastal Ocean Observing System there for the station with WMO ID 46094) but takes it from the CeNCOOS station's ContactList instead (where it is attributed to NDBC).

@emiliom

emiliom commented Aug 27, 2015

I happened to be thinking about WMO ID issues, and remembered this issue. It looks like this is still a problem on the IOOS Catalog (here's @abirger's summary, in the last entry on this issue):

the Catalog definitely does a selective matching of the WMO IDs as @emiliom has mentioned - it only matches datasets for the WMO ID that is a part of the station's URN, but drops the ball if WMO ID is defined in a separate identifier. It looks like a bug, since the Catalog seems to get the right WMO ID in both cases.

Curious to hear if there are plans to fix this. Thanks.

@emiliom

emiliom commented Sep 21, 2015

Still curious about this. @lukecampbell ?

@lukecampbell
Member

I believe it's still an outstanding bug.

4 participants