Duplicate items #208
Comments
Hi Etenil, I also have many duplicates (articles) and I'd like to clean the DB. Thanks |
Hi GLLM, I don't have a solution to clean up the DB, what I've done so far only prevents fetching duplicate items (so that'll work for future updates). My patch is essentially a dirty fix that doesn't address the main issue here. The problem is that selfoss fails to check for presence of an existing item in the DB. My thinking is that it's somehow related to the uid being too long or something. But instead of fixing that, I'm just using the title field together with the uid to discriminate the items. I hope the selfoss dudes would have addressed this properly in the meantime. With that in mind, and since you ask it so nicely, I've forked the repo and will put my patch in there today. |
That is kind of you. If, in a first step, it could enable us to wipe the dupes, it'd be great ! Thank you very much :) |
Can you post an example feed? Some feeds have wrong ids, and then selfoss cannot tell the articles apart properly. Changing this id generation is a very critical part. I use the SimplePie mechanism, which uses the id given by the feed and otherwise falls back to a content-based md5 hash. A possible solution would be a dedicated spout for RSS feeds with problematic ids. |
@SSilence just grab slashdot's RSS feed; it consistently produces duplicates so you shouldn't have too much trouble diagnosing it. |
@SSilence You can just use slashdot's feed, it consistently creates duplicates. You'll notice that the articles' uid is, well, peculiar ;-). Not sure what needs fixing, I didn't dig much into selfoss's code. Like I said, my quick and dirty fix is to use both the uid and the link to ensure the record is unique, but that's not a good thing. @GLLM I've published my changes. |
I have tested the slashdot feed and no duplicate entries occurred. But I will wait until slashdot updates its feed. I think it's possible that the guid changes on every feed update. In that case all items would be fetched again. |
I'll wait for your feedback to see if I shall consider another solution, such as manual SQL, to remove duplicates. I have too many of them. Thanks, |
Okay, I have found the problem. Slashdot's uids are longer than 255 characters. SQLite doesn't care about this and compares only the first 255 characters. MySQL never finds the existing items and returns an empty result. I now generate an md5 hash for uids longer than 255 characters. Please reopen this issue if duplicate items occur again. |
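The fix described in the comment above can be sketched roughly as follows. This is a minimal Python illustration, not the actual selfoss code (which is PHP); the function name `normalize_uid` and the constant `MAX_UID_LENGTH` are hypothetical, and the 255 limit is taken from the comment.

```python
import hashlib

# Column width mentioned in the thread; assumed, not read from the selfoss schema.
MAX_UID_LENGTH = 255


def normalize_uid(uid: str) -> str:
    """Return the uid unchanged if it fits, otherwise its MD5 hex digest.

    Hashing keeps the stored value short enough for the column, so the
    lookup compares the full stored value instead of a truncated one.
    """
    if len(uid) > MAX_UID_LENGTH:
        return hashlib.md5(uid.encode("utf-8")).hexdigest()
    return uid
```

Two long uids that share their first 255 characters but differ later map to different digests, so they no longer collide, while short uids are stored as before.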
@SSilence sorry for answering too late ... Thanks |
I still have duplicates such as the following: |
@binghuiyin: Which database backend do you use and how do you call update.php to update your feeds? The only way I'm still able to create a few duplicates in the db is when I run two instances of update.php in parallel:
|
Hourly cron job on a sqlite db... getting duplicates every day, not too many, but still getting those. |
@SSilence please re-open this Issue. See pic below: |
I subscribed to the three feeds and will test. I don't know how duplicate entries can occur, because the feeds and their uids seem to be okay. Are you all using the newest version of selfoss? I will try to find this bug. |
Something strikes me as wrong with the way feeds are handled. Indeed, according to the RSS2 specifications, the guid key is not at all required:
See http://www.rssboard.org/rss-specification#hrelementsOfLtitemgt |
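Since the guid may be absent, a reader needs some fallback identity for an item. The sketch below shows one possible fallback order in Python; the order (guid, then link, then a hash of title and content) and the name `derive_uid` are assumptions for illustration and are not taken from SimplePie or selfoss.

```python
import hashlib
from typing import Optional


def derive_uid(guid: Optional[str], link: Optional[str],
               title: str, content: str) -> str:
    """Pick an identifier for a feed item (hypothetical fallback order)."""
    if guid:
        # Prefer the id supplied by the feed when there is one.
        return guid
    if link:
        # Assumed second choice: the item's link.
        return link
    # Last resort: a content-based MD5 hash of title and content.
    return hashlib.md5((title + "\n" + content).encode("utf-8")).hexdigest()
```

Note that a content-based hash changes whenever the publisher edits the item, so this fallback trades missed items for possible duplicates.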
@SSilence I am updating to the latest available code every day ! still I have the dupes :( |
I have subscribed to the RSS feeds
And I have no duplicate items. Hmm, that's very strange. Could this be a particular sqlite driver version which is not compatible with fat free? |
@SSilence |
@SSilence |
@SSilence: Why don't you just deal with this on the database level and make UID a primary key? [Edit: That obviously still wouldn't help with feeds like the one binghuiyin linked, but this would -->] Or make title, content and link a compound key. Using keys would also be a better fix for #89. Edit: RSS implementations really are a mess. |
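To illustrate the database-level approach, here is a minimal Python/SQLite sketch. The table layout is a simplified stand-in, not the real selfoss schema, and scoping the key to `(source, uid)` rather than `uid` alone is an assumption; `make_db` and `insert_item` are hypothetical helper names.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a uniqueness constraint (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            source INTEGER NOT NULL,
            uid TEXT NOT NULL,
            title TEXT,
            link TEXT,
            UNIQUE (source, uid)
        )
        """
    )
    return conn


def insert_item(conn: sqlite3.Connection, source: int, uid: str,
                title: str, link: str) -> bool:
    """Insert an item; return False if the database rejects it as a duplicate."""
    try:
        conn.execute(
            "INSERT INTO items (source, uid, title, link) VALUES (?, ?, ?, ?)",
            (source, uid, title, link),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this (source, uid) pair already exists.
        conn.rollback()
        return False
```

With the constraint in place, the database enforces uniqueness even if the application-side existence check misses an item. As noted in the comment, this does not help when the feed itself issues a new uid for the same article.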
Yes, it seems the feed provider generates a new uid every day. It's really hard to handle these problems :( |
Was this ever fixed as I am getting the same issue? |
I am getting the same issue here too. A lot of feed generators don't follow specification and I had much more luck with using the entry's link rather than uid as a key. Just my 2 cents. |
Can't you filter them out on the application ? |
@SSilence Thank you for authoring selfoss. It helps a lot. Also, the issue of duplicates still exists. I am getting duplicates for Packetstorm News / Packetstorm Files. Perhaps you could check the "Link" and "Source" in the database to see if the item is already present? |
@bitvijays Can you locate the duplicate articles in the database? |
The results of a query such as this one would help us find the right solution to this: SELECT items.uid, items2.uid, items.title, items2.title FROM items, items AS items2 WHERE items.source = items2.source AND items.id > items2.id AND items.link = items2.link; I have a feed that provides new items with the same link, so the link alone is not good enough. Also, I'm against deduplicating across multiple feeds, because that would allow one feed to prevent items from other feeds from getting into the db. |
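For anyone who wants to try the query before running it against a real database, the following Python sketch executes it on a small in-memory SQLite table. The table layout and sample rows are made up for illustration; only the query text comes from the comment above.

```python
import sqlite3

# Query from the comment above: pairs of items in the same source sharing a link.
DUPLICATE_QUERY = """
SELECT items.uid, items2.uid, items.title, items2.title
FROM items, items AS items2
WHERE items.source = items2.source
  AND items.id > items2.id
  AND items.link = items2.link
"""


def make_sample_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with invented rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, source INTEGER, "
        "uid TEXT, title TEXT, link TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (id, source, uid, title, link) VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "uid-a", "Story", "http://example.com/1"),
            (2, 1, "uid-b", "Story", "http://example.com/1"),  # same source, same link
            (3, 2, "uid-c", "Story", "http://example.com/1"),  # other source
        ],
    )
    return conn


def find_duplicates(conn: sqlite3.Connection) -> list:
    """Return (newer uid, older uid, newer title, older title) tuples."""
    return conn.execute(DUPLICATE_QUERY).fetchall()
```

On the sample data only the two rows from source 1 are paired; the row from source 2 is left alone because the query compares items within a single source.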
I do not understand how could this happen in proper operation. On the bright side, I fixed favicons for two of your feeds 😉 |
Duplicated I had proposed something to fix this some time ago (see #597) which was using a file lock to prevent concurrent updates on the same source. Another option would be to add a constraint on the |
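The file-lock idea can be sketched as below. This is not the code from #597; it is a Python illustration that assumes a Unix-like system (it relies on `fcntl.flock`), and the lock file naming and the helper `source_lock` are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Optional


@contextmanager
def source_lock(source_id: int, lock_dir: Optional[str] = None):
    """Try to take an exclusive, non-blocking lock for one source.

    Yields True if the lock was acquired and False if another updater
    already holds it, so the caller can skip that source.
    """
    if lock_dir is None:
        lock_dir = tempfile.gettempdir()
    path = os.path.join(lock_dir, f"source-{source_id}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Someone else is updating this source right now.
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

An updater would wrap the work for each source in `with source_lock(source_id) as acquired:` and skip the source when `acquired` is false, so two parallel runs never insert items for the same source at the same time.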
@niol I think using |
The feed reader sometimes fails to detect the presence of an item within the database and thus creates multiple items.
This only happens with Slashdot's RSS feed on my server, and Slashdot uses a very long hyperlink as uid, which might be the cause of the problem.
I have been able to solve this on my server by modifying the dao and update code so that an item's presence in the database is determined by both its uid and its link. So far it has worked OK, but I don't think it's a clean fix and thus will not commit it.