Difference in Parser behaviour between 1.11.3, 1.15.4 and 1.16.1-SNAPSHOT with respect to whitespace and newline #1913

Closed
Reeniya opened this issue Mar 9, 2023 · 2 comments
Labels
not-a-bug This issue is not a bug; it is working as per spec

Comments

Reeniya commented Mar 9, 2023

Hi,
We are seeing a difference in parser behaviour between 1.11.3, 1.15.4 and 1.16.1-SNAPSHOT with respect to whitespace and newlines between the HTML tags and text. We are using the XML parser with pretty-print set to true.

For example:
Input:

						<div class="freetext"
							id="text1">
							1111
							<div align = "left">
								<b>V2</b><br />
								<br />
								Xdxux<br />
								v3yfygygyg
							</div>
						</div>

With jsoup version 1.11.3, the parser output is:

       <div class="freetext" id="text1">
         1111 
        <div align="left"> 
         <b>V2</b>
         <br /> 
         <br /> Xdxux
         <br /> v3yfygygyg 
        </div> 
       </div> 

With jsoup version 1.15.4, the parser output is:

       <div class="freetext" id="text1">
        1111 
        <div align="left"><b>V2</b><br /><br />
          Xdxux<br />
          v3yfygygyg
        </div>
       </div>

We can re-create this issue using: https://try.jsoup.org/~HLY5GwlDvfC8Fn8tiGSpgCyIFFo

I noticed that some fixes were made around whitespace and newline handling, so I tried the 1.16.1-SNAPSHOT version.

With jsoup version 1.16.1-SNAPSHOT, the parser output is:

       <div class="freetext" id="text1">
        1111 
        <div align="left">
         <b>V2</b>
         <br />
         <br />
          Xdxux
         <br />
          v3yfygygyg
        </div>
       </div>

We are a bit confused about which behaviour is correct, as we are upgrading jsoup from 1.11.3 to 1.15.4.
To illustrate the difference, I have highlighted it in the image below, captured in Notepad++, which shows how the whitespace differs between the jsoup versions.

[image: Notepad++ screenshot highlighting the whitespace differences between versions]

In 1.11.3 the <div align="left"> tag is on a line by itself, but with 1.15.4 the newline between the <div> and <b> tags is lost.

With 1.11.3 there is a whitespace character after every <br /> and after the text "v3yfygygyg", but with 1.15.4 and 1.16.1-SNAPSHOT that whitespace is gone.

With 1.11.3 there is a whitespace character after <div align="left">, which is not present with 1.15.4 or 1.16.1-SNAPSHOT.

With 1.11.3 there is a whitespace character after each </div> tag, which is not present with 1.15.4 or 1.16.1-SNAPSHOT.

@jhy We are a bit confused about what the correct parser behaviour is. Was there an issue in 1.11.3 that is now fixed, or is this a new issue? Could you please let us know what the correct behaviour should be?

Thank you.

brdeepak commented Apr 4, 2023

@jhy,
Please let us know how to move forward with this problem. Any suggestions or assistance would be appreciated.

Thank you in advance

Owner

jhy commented Apr 29, 2023

Hi,

The output of the pretty-printer is subject to change as we make improvements.

If the output of the printer causes a change to the way a browser renders the HTML, I would generally consider it a bug. E.g. see #1926. But there will be changes between releases. I believe the current output is better than the previous output, and so am inclined to keep it as-is.
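
For anyone whose tests compare pretty-printed output across jsoup versions, one possible workaround is to compare the markup with layout whitespace normalised away, so that formatting-only changes like the ones reported above do not register as differences. The following is only a rough sketch using the Java standard library; the class and method names (`PrettyPrintCompare`, `normalize`, `sameIgnoringLayout`) are hypothetical and not part of jsoup. It assumes that whitespace directly next to a tag is not significant for your use case, which is not true for all inline content.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: compares two serialized
// documents while ignoring layout-only whitespace differences.
public class PrettyPrintCompare {
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /**
     * Collapses every run of whitespace to one space, then drops the
     * space directly after '>' and directly before '<'.
     * Assumption: whitespace adjacent to tags carries no meaning for you.
     */
    public static String normalize(String markup) {
        String collapsed = WHITESPACE_RUN.matcher(markup).replaceAll(" ");
        return collapsed.replace("> ", ">").replace(" <", "<").trim();
    }

    /** True when the two strings differ only in layout whitespace. */
    public static boolean sameIgnoringLayout(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Fragments shaped like the 1.11.3 and 1.15.4 outputs in this issue.
        String older = "<div align=\"left\"> \n <b>V2</b>\n <br /> \n <br /> Xdxux\n</div> ";
        String newer = "<div align=\"left\"><b>V2</b><br /><br />\n  Xdxux\n</div>";
        System.out.println(sameIgnoringLayout(older, newer));
    }
}
```

This deliberately errs on the side of ignoring too much: it would also treat `<b>a</b> <i>b</i>` and `<b>a</b><i>b</i>` as equal, even though a browser renders them differently. If stable output matters more than readability, disabling pretty-printing with `outputSettings().prettyPrint(false)` may be another option to evaluate; that is a suggestion from outside this thread, not something the maintainer recommended here.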

@jhy jhy closed this as completed Apr 29, 2023
@jhy jhy added the not-a-bug This issue is not a bug; it is working as per spec label Apr 29, 2023