This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

Support embedded directions #21

Open
Hywan opened this issue Jan 26, 2015 · 16 comments · May be fixed by #23

Comments

@Hywan
Member

Hywan commented Jan 26, 2015

A string can contain both left-to-right and right-to-left text. We need a better algorithm to guess the current direction of a text :-).

@boast
Contributor

boast commented Jan 26, 2015

Hey there, coming from reddit :) Some suggestions for an algorithm to solve this issue:

  • Check whether the string contains LRM (U+200E) or RLM (U+200F), treating ARM (U+061C, Arabic letter mark) as an alias for RLM, since these marks exist specifically to signal the direction in which the string should be interpreted.
    • If it contains both directions, return BIDI (should add this as constant)
    • else if it only contains LRM, return LTR
    • else if it only contains RLM and / or ARM, return RTL
  • Set default assumption on the first character
  • Check whether we find any markers (LRM, LRE, LRO (maybe LRI) and RLM, RLE, RLO (maybe RLI), ARM) that would imply a direction change compared to the first character; if so, return BIDI
  • Check whether the string contains a character from the opposing direction; if so, return BIDI, and if not, return the direction we assumed from the first character.

Does this sound reasonable? As I cannot think of any sane way to detect that "私 - is a japanese letter" "should" be LTR, the user has to decide for themselves what to do with BIDI text.
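The steps above can be sketched roughly as follows. This is an illustration in Python, not the library's PHP code: the constant names, `char_direction` and `guess_direction` are hypothetical, "first character" is read here as "first strongly directional character", and falling back to LTR for text without any strong character is an assumption.

```python
import unicodedata

LTR, RTL, BIDI = "ltr", "rtl", "bidi"

# Explicit marks named in the list above.
LTR_MARKS = {"\u200e"}                      # LRM
RTL_MARKS = {"\u200f", "\u061c"}            # RLM, Arabic letter mark
# Embedding, override and isolate controls (LRE, LRO, LRI / RLE, RLO, RLI).
LTR_CONTROLS = LTR_MARKS | {"\u202a", "\u202d", "\u2066"}
RTL_CONTROLS = RTL_MARKS | {"\u202b", "\u202e", "\u2067"}


def char_direction(ch):
    """Return LTR or RTL for directional characters, None otherwise."""
    if ch in LTR_CONTROLS:
        return LTR
    if ch in RTL_CONTROLS:
        return RTL
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return LTR
    if bidi_class in ("R", "AL"):
        return RTL
    return None  # digits, punctuation, whitespace, ...


def guess_direction(text):
    # Step 1: explicit marks decide on their own.
    has_ltr_mark = any(ch in LTR_MARKS for ch in text)
    has_rtl_mark = any(ch in RTL_MARKS for ch in text)
    if has_ltr_mark and has_rtl_mark:
        return BIDI
    if has_ltr_mark:
        return LTR
    if has_rtl_mark:
        return RTL

    # Steps 2-4: assume the direction of the first directional character,
    # then report BIDI as soon as anything contradicts it.
    directions = [d for d in map(char_direction, text) if d is not None]
    if not directions:
        return LTR  # assumption: no strong character at all means LTR
    first = directions[0]
    if any(d != first for d in directions):
        return BIDI
    return first
```

With this reading, `guess_direction("hello שלום")` reports BIDI, and a string carrying only an RLM is reported as RTL regardless of its letters, which follows the first step literally.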

@Hywan
Member Author

Hywan commented Jan 26, 2015

@boast It sounds reasonable, yes. I didn't check how other implementations deal with it. Any PR :-)?

@boast
Contributor

boast commented Jan 26, 2015

As for reference implementations: https://github.com/waiting-for-dev/string-direction

Or http://en.wikipedia.org/wiki/Bi-directional_text on that topic (notice the table with the classifications). I'll work on it tonight 👍 However, I will probably need to refactor some methods into protected helper methods to make the checks more granular.

@Hywan
Member Author

Hywan commented Jan 27, 2015

@boast Thank you! :-)

@Hywan Hywan self-assigned this Jan 27, 2015
@boast
Contributor

boast commented Jan 29, 2015

I tried my best to adapt to the coding style. No tests are broken (or let's say: some tests failed on my Ubuntu dev machine before I changed anything; it seems the collator and normalizer tests are broken, especially when those extensions are not available), and I added a new test that more or less follows the spec described above.

@boast boast linked a pull request Jan 29, 2015 that will close this issue
@Hywan
Member Author

Hywan commented Mar 26, 2015

ping?

@boast
Contributor

boast commented Aug 3, 2015

Hey there, thank you for the ping. I have been busy for the past half year with my bachelor's degree in CS. ;) We should define our definitive approach to this problem together, and then I / we can work out the implementation. My knowledge about the problem comes specifically from these sources:

IMHO, we should first decide on the actual "goal" and "use case" of this method. Why and when is the information "which direction is this text going" needed? Because one can go crazy on the "strong", "weak" and "neutral" characters and contexts...

@Hywan
Member Author

Hywan commented Aug 3, 2015

So far, we use getCharDirection to decide the behavior of append, prepend and other methods. This method only checks the first character. First, we must check the last character. Second, it would be great to have a method to know whether we have bidirectional text. I don't really know yet why it would be useful, but I am sure it will be. We could also add methods to force a change of the text direction (maybe we would like to write French in reverse order 😉). And the most useful usages are:

  • Iterate over direction portions. It can be particularly useful when transforming it into HTML for instance (or PDF, text etc.).
  • Also, with the append and prepend methods for instance, we can say: $str->append('text', $str::RTL); to force appending something in the opposite direction (and thus get bidirectional text).
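Iterating over direction portions could look roughly like the sketch below. It is written in Python for illustration, not as the library's PHP API: `strong_direction` and `direction_runs` are hypothetical names, and attaching neutral characters (spaces, digits, punctuation) to the preceding run is a simplifying assumption, not what the Unicode Bidirectional Algorithm prescribes.

```python
import unicodedata


def strong_direction(ch):
    """Return "ltr" or "rtl" for strong characters, None otherwise."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def direction_runs(text, default="ltr"):
    """Split text into (direction, substring) portions.

    Neutral characters join the run that precedes them; leading neutrals
    join the first strong run, or use `default` if there is none.
    """
    runs = []  # each entry is [direction or None, list of characters]
    for ch in text:
        direction = strong_direction(ch)
        if direction is None:
            if runs:
                runs[-1][1].append(ch)
            else:
                runs.append([None, [ch]])
        elif runs and runs[-1][0] in (None, direction):
            runs[-1][0] = direction
            runs[-1][1].append(ch)
        else:
            runs.append([direction, [ch]])
    return [(d or default, "".join(chars)) for d, chars in runs]


# Example: wrap each portion in a <span dir="..."> for HTML output.
html = "".join(
    f'<span dir="{d}">{part}</span>'
    for d, part in direction_runs("abc אבג def")
)
```

Each portion carries its own direction, so a caller can wrap it for HTML as in the example, or compare two strings portion by portion.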

@Hywan
Member Author

Hywan commented Aug 3, 2015

PS: How is your bachelor's going 😉?

@Hywan
Member Author

Hywan commented Aug 3, 2015

Another use case:

  • When comparing strings, we would compare direction portions one by one, not the whole string at once. These are some usages I can think of.

@boast
Contributor

boast commented Sep 8, 2015

Hey Ivan,

thanks for asking - my bachelor's is done now, so I think I will find some time to contribute.

I will try to implement the algorithm according to the UNICODE BIDIRECTIONAL ALGORITHM. Especially the table Bidirectional Character Types looks very interesting and is exactly what we are lacking as of now ("weak" characters such as numbers and punctuation are not handled correctly by our algorithm).

@Hywan
Member Author

Hywan commented Sep 8, 2015

Excellent news!

@boast
Contributor

boast commented Oct 14, 2015

Just a short update: I wrote a small script which parses the official bidi classes from the Unicode Consortium (http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt). It generates an optimized regex (not working atm, I am missing something XD). The regex gets quite large, though some more optimizations may be possible. The script is a small console app (Bin folder) which allows easy regeneration if the spec changes.

After my regex works, I will implement the unicode bidi algorithm from http://www.unicode.org/reports/tr9/.
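A generator of this kind could be sketched as below. This is not the script mentioned above, which is not shown in this thread: the function names are hypothetical, the code is Python for illustration, and `SAMPLE` is a handful of lines in the data file's `range ; class # comment` format, not the full file.

```python
import re

# A few lines in the format of DerivedBidiClass.txt (illustrative excerpt).
SAMPLE = """\
# code point or range ; bidi class # comment
0030..0039    ; EN # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; L # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
05D0..05EA    ; R # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV
200F          ; R # Cf       RIGHT-TO-LEFT MARK
"""


def parse_bidi_classes(text):
    """Map each bidi class to a list of (start, end) code point ranges."""
    ranges = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        code_points, bidi_class = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.setdefault(bidi_class, []).append((start, end))
    return ranges


def merge_ranges(ranges):
    """Merge overlapping or adjacent ranges to keep the regex small."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_char_class(ranges):
    """Build a regex character class such as [\\U00000041-\\U0000005A]."""
    parts = []
    for start, end in merge_ranges(ranges):
        if start == end:
            parts.append(f"\\U{start:08X}")
        else:
            parts.append(f"\\U{start:08X}-\\U{end:08X}")
    return "[" + "".join(parts) + "]"


classes = parse_bidi_classes(SAMPLE)
rtl_pattern = re.compile(to_char_class(classes["R"]))
```

Merging adjacent ranges is the main size optimization here; the real data file has many more entries per class, so the resulting character classes are still long.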

@Hywan
Member Author

Hywan commented Oct 14, 2015

Why do we need such a regular expression?

@boast
Contributor

boast commented Oct 14, 2015

We need to distinguish between the different types of bidirectional characters. Especially as some characters "change" their directions depending on context (read: surrounding characters). It's quite complex at the start, but as soon as you have the groups and get the hang of it, you can exclude a lot of cases very fast.


@Hywan
Member Author

Hywan commented Oct 14, 2015

Ok :-).
