7056 Skip unknown types in a dataset which cause the whole dataset to fail from being harvested #7057

Conversation
I think this is fine so I'm moving it to QA. If other developers want to take a look, please go ahead. We've never promoted using the dataverse_json format for harvesting (I believe DDI is preferred) but this should make it more feasible to use.
(I don't think it's worth sending back to the contributor over the wildcard imports but if someone feels strongly about this we certainly can.)
import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.MetadataBlockServiceBean;
import edu.harvard.iq.dataverse.TermsOfUseAndAccess;
import edu.harvard.iq.dataverse.*;
I mentioned this in a review on another pull request but we favor using the full import statement instead of wildcards.
I'll reverse the change; my IntelliJ does this automatically. Super nice feature otherwise, though.
Yeah, I've heard that about IntelliJ. Thanks for reversing the change.
@@ -566,7 +553,8 @@ public DatasetField parseField(JsonObject json, Boolean testType) throws JsonParseException

        if (type == null) {
-           throw new JsonParseException("Can't find type '" + json.getString("typeName", "") + "'");
+           logger.info("Can't find type '" + json.getString("typeName", "") + "'");
Is this going to happen a lot? If so, should it be reduced from "info" to "fine" so it doesn't clutter the logs?
@pdurbin This'll happen potentially multiple thousands of times when doing a harvesting run on a big OAI set for the first time. So I'll just change it to "fine".
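To make the behavior under discussion concrete, here is a minimal, self-contained sketch of the skip-instead-of-throw logic (illustrative only, not the actual Dataverse classes; the real change lives in parseField() as shown in the diff above, and how the merged code skips the field may differ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Illustrative stand-in for the real parser, which looks up a field type
// for each "typeName" in the incoming dataverse_json.
public class UnknownTypeSkipSketch {

    private static final Logger logger = Logger.getLogger(UnknownTypeSkipSketch.class.getName());

    static List<String> parseFields(List<String> typeNames, Map<String, String> knownTypes) {
        List<String> parsed = new ArrayList<>();
        for (String typeName : typeNames) {
            String type = knownTypes.get(typeName);
            if (type == null) {
                // Old behavior: throw a JsonParseException, which failed the whole dataset.
                // New behavior: log and skip only this field. Logged at FINE (per the
                // review) so a first harvest of a large OAI set does not flood the logs.
                logger.fine("Can't find type '" + typeName + "'");
                continue;
            }
            parsed.add(type);
        }
        return parsed;
    }
}
```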
Maybe not in the guides, but I've promoted it to reduce information loss, and most of the harvesting Harvard Dataverse does from other Dataverse installations uses dataverse_json. I would think that DDI Codebook would be preferred if all metadata could be mapped to it, but that's probably going to be less and less true with increasing support for other standards.
@JingMa87 I am still seeing the two failures.
@kcondon I can't reproduce the error. I've looked into the DatasetCreate Command, but without the actual error happening, it's like looking for a needle in a haystack. Are you on the latest version with the latest changes? What happens if you re-run the harvest, does it work the second time?
@JingMa87 I have run it again and still see the failures. Running it again just results in 0 harvested, 0 failed. Looks like failures are not harvested again. Is it possible you have some unchecked-in code on your local build env?
@kcondon I pushed all my changes to the branch already. If you delete the harvesting client, add it again, and re-run, do you still get the issue? Sometimes I get weird intermittent errors that disappear when I retry.
Does this query in the database return something for you? If so, this row is not supposed to be there and stops the field type mraCollection from being filtered out by my code change. Then it gets passed to Solr and the error happens. I've looked into adding debugging code, but the class where the error happens (CreateHarvestedDatasetCommand) is a whole big mess of polymorphism, generics and interfaces gone wrong. When I use the debugger, I can't inspect the execute() function of that Command where the error happens, so I'm resorting to reading the code. Hopefully this helps.
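The query itself is not included in the comment above; as a rough, hypothetical reconstruction of the kind of check being asked for (table and column names assumed from the standard Dataverse Postgres schema), it might look something like this:

```sql
-- Hypothetical reconstruction, not the exact query from the comment above.
-- Checks whether the client database already has the 'mraCollection' field type.
SELECT id, name FROM datasetfieldtype WHERE name = 'mraCollection';
```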
@JingMa87 Yes, that SQL query returns a record. Why do you say it should not be there? Also, wouldn't you have seen the same record in your own test? I understand about the code complexity; I was thinking about a lower-tech approach of adding debug logging, if that helps.

Ok, I think I understand what you were getting at: the purpose of the fix is so that harvesting clients that do not have this field in their database can skip a field that is present on the harvest server/source. I did not consider this when I looked at the test case. However, it should actually work normally in the case where the field is present, shouldn't it? I will test with and without it on the client so at least we have that information, and compare to the code currently on the develop branch.

A quick test using another branch shows that json format harvest also fails when the field is present on both client and server. Although this should work, I believe it is outside the scope of this ticket. Will confirm the original intent, masking the fields that are not present on the client.
What this PR does / why we need it: Makes datasets with unknown types harvestable using dataverse_json
Which issue(s) this PR closes: Closes #7056
Special notes for your reviewer: Good fix
Suggestions on how to test this: Harvest the Princeton_Authored_Datasets OAI set from the Harvard repo using dataverse_json. Without this fix you will get 267 successful harvests and 2 failures. With the fix you will harvest all 269 datasets. The two failing datasets are https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG and https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS (they can also be fetched directly, as shown below).
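For convenience, the two exports above can be inspected directly from the command line; curl is shown here, but any HTTP client works:

```sh
curl -s "https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG"
curl -s "https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS"
```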
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No
Is there a release notes update needed for this change?: I think it's an important change for harvesting from another Dataverse installation.
Additional documentation: N/A