Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Highly unintuitive behavior of inSerial . stepNext . inSerial . stepNext #95

Open
bakhtiyarneyman opened this issue Oct 29, 2022 · 2 comments

Comments

@bakhtiyarneyman
Copy link

Define lists as follows:

ghci> lists = inSerial . many . stepNext

When used not from under chroot, lists is idempotent.

ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ html anySelector
Just ["<foo><baz></baz></foo>","<bar></bar>"]
ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ lists $ html anySelector
Just [["<foo><baz></baz></foo>"],["<bar></bar>"]]
ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ lists $ lists $ html anySelector
Just [[["<foo><baz></baz></foo>"]],[["<bar></bar>"]]]

Not so much when used from under chroot:

ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ html anySelector
Just ["<foo><baz></baz></foo>","<bar></bar>"]
ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ lists $ html anySelector
Just [["<baz></baz>"],[]]
ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ lists $ lists $ html anySelector
Just [[[]],[]]

This inconsistency makes it much harder to reason locally about the written code. IMHO for a good user experience, one of the two behaviors should be selected, and I think the former is better because it's less opinionated. (The second behavior could be obtained by interspersing chroots between lists calls).

I think, this can be fixed by setting ctxInChroot to False in toZipper (defined as a where binding for inSerial.)

@bakhtiyarneyman
Copy link
Author

bakhtiyarneyman commented Oct 29, 2022

FWIW I don't particularly like the fact that inSerial changes the behavior when under a chroot. I find it unintuitive and it couples the combinators together. I'd prefer to have an explicit combinator that allows to drop inside the children of a node, rather than doing it implicitly in an ad-hoc way depending on whether we are under chroot or not.

Maybe something like:

children :: ScraperT str m a -> ScraperT str m a
root :: Selector -> ScraperT str m a -> ScraperT str m a
chroot = children . root 

This change would of course alter the behavior when getting attributes of the chrooted node.

@fimad
Copy link
Owner

fimad commented Oct 30, 2022

This does seem unintuitive :/

I think the reason for the discrepancy is that inSerial conceptually wraps the HTML in an outer set of tags when not in a chroot so that for HTML like <a>1</a><b>2</b>, inSerial would visit the <a> and <b> sub-trees.

It does seem broken that stacking multiple calls to inSerial wouldn't result in descending further into the tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants