
Duplicate items #208

Closed
Etenil opened this issue Mar 30, 2013 · 36 comments

@Etenil

Etenil commented Mar 30, 2013

The feed reader sometimes fails to detect the presence of an item within the database and thus creates multiple copies of it.

This only happens with Slashdot's RSS feed on my server. Slashdot uses a very long hyperlink as the uid, which might be the cause of the problem.

I was able to work around this on my server by modifying the dao and update code so that an item's presence in the database is determined by both its uid and its link. So far it has worked OK, but I don't think it's a clean fix, so I won't commit it.
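For illustration, the check described above amounts to something like the following sketch (the items table and its column names are assumed here, not taken from the actual selfoss code):

-- Sketch of the modified existence check: an item counts as already
-- stored only if both its uid and its link match.
SELECT COUNT(*) FROM items
WHERE source = :source AND uid = :uid AND link = :link;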

@GLLM
Contributor

GLLM commented Apr 3, 2013

Hi Etenil,

I also have many duplicate articles and I'd like to clean the DB.
Do you have a working solution for this? It could be of interest to many of us.

Thanks

@Etenil
Author

Etenil commented Apr 4, 2013

Hi GLLM,

I don't have a solution for cleaning up the DB; what I've done so far only prevents fetching duplicate items (so it will work for future updates).

My patch is essentially a dirty fix that doesn't address the main issue here. The problem is that selfoss fails to detect the presence of an existing item in the DB. My thinking is that it's somehow related to the uid being too long. But instead of fixing that, I'm just using the title field together with the uid to discriminate between items. I hope the selfoss dudes will have addressed this properly in the meantime.

With that in mind, and since you ask it so nicely, I've forked the repo and will put my patch in there today.

@GLLM
Contributor

GLLM commented Apr 4, 2013

That is kind of you.

If, as a first step, it could enable us to wipe the dupes, that'd be great!

Thank you very much :)
GLLM

@SSilence
Member

SSilence commented Apr 4, 2013

Can you post an example feed? Some feeds have wrong ids, and then selfoss cannot tell the articles apart properly. Changing this id generation is a very critical part. I use the SimplePie mechanism, which uses the feed-given id and then a content-based md5 hash. A possible solution would be a dedicated spout for RSS feeds with problematic ids.
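As an illustration of the content-based fallback idea (not SimplePie's actual code, which lives in PHP; MySQL syntax, and the column names are assumptions):

-- Illustration only: derive a stable fallback id from an item's own
-- fields when the feed supplies no usable id.
SELECT MD5(CONCAT(link, title)) AS fallback_uid FROM items;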

@Etenil
Author

Etenil commented Apr 4, 2013

@SSilence just grab Slashdot's RSS feed; it consistently produces duplicates, so you shouldn't have too much trouble diagnosing it.

@Etenil
Author

Etenil commented Apr 4, 2013

@SSilence You can just use Slashdot's feed; it consistently creates duplicates. You'll notice that the articles' uid is, well, peculiar ;-). I'm not sure what needs fixing; I didn't dig much into selfoss's code. Like I said, my quick-and-dirty fix is to use both the uid and the link to ensure the record is unique, but that's not a good thing.

@GLLM I've published my changes.

@SSilence
Member

SSilence commented Apr 5, 2013

I have tested the Slashdot feed and no duplicate entries occurred. But I will wait until Slashdot updates its feed. I think it's possible that the guid changes on every feed update; in that case, all items would be fetched again.

@GLLM
Contributor

GLLM commented Apr 5, 2013

I'll wait for your feedback before considering another solution, such as manual SQL, to remove the duplicates; I have too many of them (a sketch of that route follows below).
And I don't believe I'm running concurrent updates, since my cron is launched once every 60 minutes...

Thanks,
GLLM
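As a sketch of the manual-SQL route mentioned above (column names assumed from the queries shown later in this thread; back up the database first):

-- Delete duplicate rows, keeping the lowest id per (source, title, link).
-- This self-referencing form works on SQLite; MySQL disallows selecting
-- from the table being deleted from and needs a temporary table instead.
DELETE FROM items
WHERE id NOT IN (
    SELECT MIN(id) FROM items GROUP BY source, title, link
);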

@SSilence
Member

SSilence commented Apr 6, 2013

@GLLM: Do you have a few example feeds with duplicate entries?

@Etenil: That's really strange. I updated the Slashdot feed today and got no duplicates. I tested this with SQLite; I have seen that you use MySQL, so I will test again with MySQL.

@SSilence
Member

SSilence commented Apr 6, 2013

Okay, I have found the problem: Slashdot's uids have more than 255 characters. SQLite does not enforce the declared column length and still matches the full string, but MySQL truncates the stored uid to 255 characters, so the lookup with the full uid never finds the existing item and returns an empty result.

Now I generate an md5 hash for uids with more than 255 characters.
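To make the failure mode concrete (a sketch in MySQL syntax; the uid column is assumed to be declared as VARCHAR(255), and '<uid>' is a placeholder):

-- On insert, MySQL in non-strict mode silently truncates the uid to the
-- declared 255 characters, so a later existence check with the full
-- 300-character Slashdot uid matches nothing and the item is stored again.
-- Hashing long uids keeps them within the limit (done in PHP in selfoss;
-- the expression below only illustrates the idea):
SELECT IF(LENGTH('<uid>') > 255, MD5('<uid>'), '<uid>') AS normalized_uid;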

Please reopen this issue if duplicate items occur again.

@SSilence SSilence closed this as completed Apr 6, 2013
@GLLM
Contributor

GLLM commented Apr 10, 2013

@SSilence Sorry for answering so late...
I've had duplicates on (among many others):

Thanks
GLLM

@binghuiyin

I still have duplicates, such as the following:

http://avaxhome.ws/ebooks/programming_development/rss.xml

[screenshot showing the duplicate items]

@seanrand
Contributor

@binghuiyin: Which database backend do you use and how do you call update.php to update your feeds?

The only way I'm still able to create a few duplicates in the db is when I run two instances of update.php in parallel:

$ sqlite3 data/sqlite/selfoss.db
sqlite> SELECT id, source, datetime, title, count(*) FROM items GROUP BY title, datetime HAVING count(*) > 1;
id          source      datetime             title                                                  count(*)
----------  ----------  -------------------  -----------------------------------------------------  ----------
1017        16          2013-04-11 22:23:35  Google Relaxes DMCA Takedown Restrictions, Eyes Abuse  2
1022        3           2013-04-11 21:48:47  Kurdish rebels prepare for peace                       2

@binghuiyin

I added a cron job with an hourly update. It is on DreamHost.


@GLLM
Contributor

GLLM commented Apr 12, 2013

Hourly cron job on an SQLite db... getting duplicates every day; not too many, but still getting them.

@binghuiyin

@SSilence Please reopen this issue. See the screenshot below:
[screenshot showing the duplicate items]

@GLLM
Contributor

GLLM commented Apr 13, 2013

Dupes & dupes again!

It's a pain...
FYI: hourly cron with SQLite. No other manual updates, of course.
[screenshot showing the duplicate items]

@SSilence SSilence reopened this Apr 14, 2013
@SSilence
Member

I have subscribed to the three feeds and will test. I don't know how duplicate entries can occur, because the feeds and their uids seem to be okay.

Are you all using the newest version of selfoss?

I will try to find this bug.

@ghost ghost assigned SSilence Apr 14, 2013
@Etenil
Author

Etenil commented Apr 14, 2013

Something strikes me as wrong with the way feeds are handled. According to the RSS 2.0 specification, the guid element is not required at all:

All elements of an item are optional, however at least one of title or description must be present.

See http://www.rssboard.org/rss-specification#hrelementsOfLtitemgt

@GLLM
Contributor

GLLM commented Apr 14, 2013

@SSilence I am updating to the latest available code every day! Still I get the dupes :(

@SSilence
Member

I have subscribed to the RSS feeds, and I have no duplicate items. Hmm, that's very strange. Could this be a particular SQLite driver version that is not compatible with Fat-Free?

@binghuiyin

@SSilence
I am using v2.6 now, with MySQL. Here is another RSS feed for your test; I get duplicated items with it. It is a feed in Chinese.
http://feed.36kr.com/c/33346/f/566026/index.rss
[screenshot showing the duplicate items]

@binghuiyin

@SSilence
I did a search in the database by link. It returned three duplicated items.
[screenshot of the database query results]
Everything looks the same except the uid.

@seanrand
Contributor

@SSilence: Why don't you just deal with this at the database level and make the uid a primary key? [Edit: That obviously still wouldn't help with feeds like the one binghuiyin linked, but this would -->] Or make title, content, and link a compound key. Using keys would also be a better fix for #89.

Edit:
I just looked at the feed binghuiyin linked, and it looks like an issue with the feed... they are generating different GUIDs for the same item. My guess is that the GUID incorporates the current date, so every 24 hours every item gets a new GUID.
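A query along these lines (column names assumed, consistent with the queries used elsewhere in this thread) would confirm that diagnosis by listing links stored under more than one uid:

-- Find articles whose link was stored under several different uids, as
-- happens when a feed regenerates its GUIDs on every fetch.
SELECT link, COUNT(DISTINCT uid) AS uid_count
FROM items
GROUP BY link
HAVING COUNT(DISTINCT uid) > 1;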

RSS implementations really are a mess.

@SSilence
Member

Yes, it seems the feed provider generates a new uid every day. It's really hard to handle these problems :(

@ghost

ghost commented Dec 27, 2013

Was this ever fixed? I am getting the same issue.

@andreimarcu

I am getting the same issue here too.

A lot of feed generators don't follow the specification, and I've had much more luck using the entry's link rather than its uid as a key.

Just my 2 cents.

@cgelici

cgelici commented Oct 7, 2014

I'm getting it too.

[screenshot showing the duplicate items]

@ghost

ghost commented Oct 10, 2014

Can't you filter them out at the application level?

@bitvijays

@SSilence Thank you for authoring selfoss; it helps a lot. The issue of duplicates still exists: I am getting duplicates for Packetstorm News / Packetstorm Files. Perhaps you could check the link and source in the database to see whether an item is already present?

@jtojnar
Member

jtojnar commented Apr 20, 2017

@bitvijays Can you locate the duplicate articles in the database?

@niol
Collaborator

niol commented Apr 20, 2017

The results of a query such as this one would help us find the right solution to this:

SELECT items.uid, items2.uid, items.title, items2.title
FROM items, items AS items2
WHERE items.source = items2.source
  AND items.id > items2.id
  AND items.link = items2.link;

I have a feed that provides new items with the same link, so the link alone is not good. Also, I'm against deduplicating across multiple feeds, because that would allow one feed to prevent items from other feeds from getting into the db.

@bitvijays

selfoss.zip

@niol @jtojnar Here's the zip file containing the SQLite database (only 1 MB). Hopefully this provides more insight.

@jtojnar
Member

jtojnar commented Apr 20, 2017

I do not understand how this could happen in proper operation. On the bright side, I fixed the favicons for two of your feeds 😉

@niol
Collaborator

niol commented Apr 21, 2017

Duplicated uids should only happen if multiple parallel updates are running, for instance if a cron job is running while a manual update is triggered.

I proposed a fix for this some time ago (see #597) that used a file lock to prevent concurrent updates of the same source. Another option would be to add a constraint on the items table ensuring that (uid, source) is unique, and to handle the resulting INSERT error properly.
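A sketch of the constraint option (SQLite syntax; column names assumed and the inserted values are placeholders):

-- Enforce one row per (uid, source); parallel updates can then no longer
-- insert the same item twice.
CREATE UNIQUE INDEX items_uid_source ON items (uid, source);

-- With the index in place, a conflicting insert can be skipped silently
-- instead of raising an error:
INSERT OR IGNORE INTO items (uid, source, title, link, datetime)
VALUES ('example-uid', 1, 'Example title', 'http://example.com/item', '2017-04-21 12:00:00');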

@jtojnar
Member

jtojnar commented Apr 21, 2017

@niol I think using a UNIQUE constraint is preferable due to the lower number of moving parts. Additionally, instead of handling an error, an UPSERT could be used, though it is hairy on PostgreSQL < 9.5.
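On PostgreSQL >= 9.5, that UPSERT would look roughly like this (a sketch; it presumes the unique constraint on (uid, source) from the previous comment, and the values are placeholders):

-- Requires a unique constraint or index on (uid, source); PostgreSQL
-- versions before 9.5 need a writable-CTE or retry-loop workaround.
INSERT INTO items (uid, source, title, link, datetime)
VALUES ('example-uid', 1, 'Example title', 'http://example.com/item', '2017-04-21 12:00:00')
ON CONFLICT (uid, source) DO NOTHING;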
