Epic: Fix importation of bogus authors and cleanup old data #331

Open
LeadSongDog opened this issue Sep 25, 2022 · 4 comments

LeadSongDog commented Sep 25, 2022

There are tens of thousands of bogus author records with names matching “* Publishing” or “* Books”, and somewhat fewer matching “* Editions” and other-language equivalents.

Many originate from the import of low-quality records from BWB (Better World Books) or AMZ (Amazon), such as https://www.betterworldbooks.com/product/detail/9783110367737
which was imported as
https://openlibrary.org/books/OL34526350M/Quantenmechanik
where the authors include
https://openlibrary.org/authors/OL9711355A/Perseus_Books_Perseus_Books_LLC.

Many (about 30%) of these author records have no associated work record. Those are low-hanging fruit that could simply be bulk removed.
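
As a rough illustration of how that first group could be picked out, here is a minimal sketch. It assumes each author record is available as a dict with `key`, `name` and `work_count` fields; those field names, the `removable_authors` helper and the sample keys are assumptions for illustration, not a description of any existing bot.

```python
import re

# Name patterns taken from the issue text; other-language equivalents
# would need to be added by hand.
PUBLISHER_NAME = re.compile(r"\b(Publishing|Books|Editions)\b", re.IGNORECASE)


def removable_authors(author_docs):
    """Return keys of publisher-like author records with no works attached.

    Each doc is assumed to be a dict with "key", "name" and "work_count".
    A record with a missing work_count is NOT treated as empty, so that
    incomplete data never leads to a deletion.
    """
    keys = []
    for doc in author_docs:
        name = doc.get("name", "")
        if PUBLISHER_NAME.search(name) and doc.get("work_count") == 0:
            keys.append(doc["key"])
    return keys


# Made-up sample records, for illustration only.
sample = [
    {"key": "OL1A", "name": "Example Publishing", "work_count": 0},
    {"key": "OL2A", "name": "Example Books", "work_count": 3},
    {"key": "OL3A", "name": "Jane Doe", "work_count": 0},
    {"key": "OL4A", "name": "Example Editions"},  # work_count unknown
]
print(removable_authors(sample))
```

The output of such a filter would still need a human spot check before any bulk removal, since some real people have names containing these words.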

More have only work records that are misattributed to an “author” with one of these publisher names, with the “publisher” shown as “Independently Published”, “CreateSpace” or the like. For these there is often another, correct work record with a similar title showing the correct authorship. Some heuristics might help with these.
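
One possible heuristic, sketched here under stated assumptions, is to compare the title of a suspect work against other work titles and flag near matches for review. The `likely_duplicates` helper, the 0.9 threshold and the sample keys are all hypothetical; the comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def likely_duplicates(suspect_title, candidate_works, threshold=0.9):
    """Return (key, ratio) pairs for candidates whose title is a near match.

    candidate_works is assumed to be a list of dicts with "key" and "title".
    The threshold is a guess and would need tuning against real data.
    """
    target = normalize(suspect_title)
    matches = []
    for work in candidate_works:
        ratio = SequenceMatcher(None, target, normalize(work["title"])).ratio()
        if ratio >= threshold:
            matches.append((work["key"], round(ratio, 2)))
    return matches


# Made-up candidate records, for illustration only.
candidates = [
    {"key": "OL10W", "title": "Quantenmechanik "},
    {"key": "OL11W", "title": "Organic Chemistry"},
]
print(likely_duplicates("quantenmechanik", candidates))
```

A title match alone does not prove two records describe the same work, so matches like these are better treated as merge candidates than as automatic merges.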

A substantial group however are corporate authorships by publisher staff writers with no public attribution to an individual. This is particularly common in bibliographies, reference works, study notes, and textbooks.

Suggestions?

hornc (Collaborator) commented Aug 14, 2024

Junk publishers here: https://openlibrary.org/search/authors?q=Publisher&mode=everything

I'll try to get a bot task that can prune an entire edition + work + author tree based on the worst of these results.
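
A sketch of how such a prune might be planned, assuming the relevant records are already loaded in memory. The `plan_prune` helper, its data shapes and the rule of sparing works shared with another author are assumptions added for illustration, not the bot's actual behaviour.

```python
def plan_prune(author_key, works, editions):
    """Return keys to delete, children first: editions, works, then author.

    works:    dict mapping work key -> list of author keys
    editions: dict mapping edition key -> work key
    Works that also list another author are left alone (an added safety
    assumption), and the author record is kept if any such work exists.
    """
    doomed_works = [w for w, authors in works.items() if authors == [author_key]]
    shared_works = [
        w for w, authors in works.items()
        if author_key in authors and len(authors) > 1
    ]
    doomed_editions = [e for e, w in editions.items() if w in doomed_works]

    plan = doomed_editions + doomed_works
    if not shared_works:
        plan.append(author_key)
    return plan


# Made-up keys, for illustration only.
works = {"OL1W": ["OL9A"], "OL2W": ["OL9A", "OL5A"]}
editions = {"OL1M": "OL1W", "OL2M": "OL2W"}
print(plan_prune("OL9A", works, editions))
```

Producing a plan first, rather than deleting while traversing, leaves room to review the list before anything is removed.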

hornc self-assigned this Aug 14, 2024

LeadSongDog (Author) commented

Thank you. A few are real publishers with human names, sometimes shown where the author is unknown or staff, but in most cases you’ll want a chainsaw, not pruning shears.

hornc (Collaborator) commented Aug 20, 2024

This is hard going -- I have removed about 2k empty publisher-as-authors which had already been cleared out and had no editions or works assigned to them, and removed many entirely non-book publishers and their works, which number between a dozen and hundreds of junk non-book records each (sometimes repeated over and over again)...

This leaves many hundreds of junk publisher names with only one or two non-book items, interspersed with more legitimate publisher-recorded-as-author entries, where some kind of cleanup other than simply deleting junk is probably required, but I've cleared out as much as I can easily do in an automated sweep.

I'll think on what can be done with what remains; it'll probably be more of a general identification of obvious non-book items so they can be removed completely.

LeadSongDog (Author) commented

@hornc Thank you for tackling this, but the job is rather bigger than that.

By way of quantification, there are currently

4448 “authors” with “Editions” in their name: https://openlibrary.org/search/authors?q=Editions&mode=everything

23104 with “Books”: https://openlibrary.org/search/authors?q=Books*

52698 with “Publish*”: https://openlibrary.org/search/authors?q=Publish*

Of all these, about 60% show just one edition.
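
To put those figures together, here is a small worked calculation. It assumes the three searches are simply added; since one name can match more than one search (for example a name containing both “Books” and “Publishing”), the total is an upper bound rather than an exact count.

```python
# Counts quoted in the comment above.
counts = {"Editions": 4448, "Books": 23104, "Publish*": 52698}

# Upper bound: the searches may overlap, so the true total is at most this.
upper_bound = sum(counts.values())

# About 60% show just one edition; integer arithmetic avoids float rounding.
single_edition = upper_bound * 60 // 100

print(upper_bound, single_edition)
```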
