Soup not properly parsing HTML from gtoolkit.com #7

Open
capsulecorplab opened this issue Oct 3, 2022 · 6 comments

@capsulecorplab
Contributor

Seems to work fine on pharo.org, but not gtoolkit.com

Screenshot from 2022-10-02 14-14-49

@capsulecorplab
Contributor Author

capsulecorplab commented Oct 3, 2022

I eventually got it to parse the HTML from gtoolkit.com properly, but only after manually removing duplicate DOM elements that shared the same class name from the source body.

@capsulecorplab changed the title from "Soup not properly parsing HTML page on gtoolkit.com" to "Soup not properly parsing HTML from gtoolkit.com" on Oct 3, 2022
@capsulecorplab
Contributor Author

The issue still persists in Pharo 10.

@Ducasse
Collaborator

Ducasse commented Jan 11, 2023

Can you provide an HTML sample that is not working?

@capsulecorplab
Contributor Author

Can you provide an HTML sample that is not working?

view-source:https://gtoolkit.com/

@sweagraff

I've run into a similar problem. It's because retrieveContents returns a ByteString for pharo.org but a WideString for gtoolkit.com. It appears that Soup has an issue with WideStrings.
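
For anyone who wants to check this on their own image, here is a rough playground sketch of the comparison. It assumes the pages are fetched with `asUrl retrieveContents` (the thread only names `retrieveContents`, so the exact receiver is a guess), and it needs network access. It only inspects the strings and does not call Soup at all.

```smalltalk
"Diagnostic sketch, not a fix. Assumes the pages are fetched via asUrl retrieveContents."
| pharoHtml gtHtml |
pharoHtml := 'https://pharo.org' asUrl retrieveContents.
gtHtml := 'https://gtoolkit.com' asUrl retrieveContents.

"Inspect the concrete string classes of both results (reported above as ByteString and WideString)."
{ pharoHtml class. gtHtml class } inspect.

"If the gtoolkit.com page is a WideString, list the distinct characters that do not fit in one byte."
gtHtml isWideString
	ifTrue: [ (gtHtml select: [ :each | each asInteger > 255 ]) asSet inspect ]
```

If the second inspector shows only a handful of characters (typographic quotes, dashes and the like), that would suggest the page content itself is ordinary and that the difference in string class is what matters to Soup.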

@Ducasse
Collaborator

Ducasse commented Mar 14, 2024

The Soup code is old, so it should probably be ported to modern Pharo. Pharo handles encodings and the rest well.
