URI hash collision #12

Open
BestPig opened this issue Feb 20, 2019 · 1 comment

BestPig commented Feb 20, 2019

There is a collision when generating the MD5 hash of a URI if the website uses the query string for its parameters.

Example: the parsed URL of a "Cochonnet" YouTube video.

>>> urlparse.urlparse(text_to_string('https://www.youtube.com/watch?v=30Nv0WY4Lg8'))
ParseResult(scheme='https', netloc='www.youtube.com', path='/watch', params='', query='v=30Nv0WY4Lg8', fragment='')
>>> 

If the URI contains //, the URI that actually gets hashed is a reconstruction: obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path
But YouTube passes the video ID in the query string (the query field above, not params), so the MD5 generated for every YouTube video is exactly the same, because the query is not taken into account.

Here are all the lines where I found the bug:

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)

to_hash_uri=urllib.quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)
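
The collision can be reproduced with a minimal sketch. It is written for Python 3 (urllib.parse) rather than the Python 2 urlparse/urllib modules quoted above, and the helper name hash_uri_without_query and the second video ID are illustrative assumptions, not part of the project.

```python
import hashlib
from urllib.parse import quote, urlparse


def hash_uri_without_query(uri):
    """Hash scheme + netloc + path only, mirroring the lines quoted above."""
    obj_uri = urlparse(uri)
    to_hash_uri = quote(obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path)
    return hashlib.md5(to_hash_uri.encode("utf-8")).hexdigest()


first = hash_uri_without_query("https://www.youtube.com/watch?v=30Nv0WY4Lg8")
# Hypothetical second video ID, used only to show the collision.
second = hash_uri_without_query("https://www.youtube.com/watch?v=AAAAAAAAAAA")

# Both URIs reduce to https://www.youtube.com/watch before hashing.
print(first == second)  # True
```

Any two URLs that differ only in their query string end up with the same digest under this scheme.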

acabrol (Contributor) commented Jun 6, 2019

Parameters in the URL are removed on purpose, to avoid duplicate news items caused by token or origin parameters that differ for each submission.
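
One possible middle ground, sketched here under stated assumptions and not taken from the project, would be to keep the query string but drop only the parameters known to vary between submissions. The helper name hash_uri_filtered and the contents of IGNORED_PARAMS are hypothetical, and the sketch uses Python 3's urllib.parse.

```python
import hashlib
from urllib.parse import parse_qsl, quote, urlencode, urlparse

# Hypothetical denylist: parameter names assumed to vary between submissions.
IGNORED_PARAMS = {"token", "origin", "utm_source", "utm_medium", "utm_campaign"}


def hash_uri_filtered(uri):
    """Hash the URI with its query, minus the denylisted parameters."""
    obj_uri = urlparse(uri)
    # Keep only parameters not on the denylist, sorted so order does not matter.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(obj_uri.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    to_hash_uri = obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path
    if kept:
        to_hash_uri += "?" + urlencode(kept)
    return hashlib.md5(quote(to_hash_uri).encode("utf-8")).hexdigest()
```

With this approach two YouTube URLs with different v values hash differently, while two submissions of the same page that differ only in a token parameter still hash the same. The trade-off is that the denylist has to be maintained as new tracking parameters appear.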
