Validate languages from lyra.yml #106

WULCAN · 2024-07-03T22:26:00Z

By allowing strange language identifiers, and providing matching strange filenames, one who controls the repository could perhaps wreck some havoc in Lyra. Validation of language codes closes one avenue of such unintended manipulation.

This relates to #34 but is arguably not part of it.

Every other block comment in the repository use a consistent style.

henrycatalinismith · 2024-07-06T09:13:05Z

If one of this project's eventual goals is broader adoption beyond Zetkin I think it's worth broadening the regex here to /^[a-z]{2}([-_][a-z]{2})?$/i.

Zetkin happens to be a relatively simple case where two letter language codes are getting the job done for now. Most projects I've worked on have fallen into one of these categories:

Locale codes such as en-GB, en-US due to global scale requiring attention to be paid to dialects and other regional language differences.
Locale codes as above but due to tools/people imposing that structure upon projects early in order to future proof them against global scale.
A combination of locale codes and two letter language codes because of an early assumption that language codes would be enough, and backwards compatibility issues preventing a full migration to locale codes once the assumption was eventually proved wrong.

Personally I think it's only a matter of time before even Zetkin joins that third category. Probably happens once there's a good few big US- and UK- based orgs onboarded and then a feature gets added where the only vocab that works for it falls within the "not easily mutually intelligible" subset of en-GB/en-US vocabulary differences.

WULCAN · 2024-07-07T10:31:10Z

Yes, I just wanted to start with as narrow validation as possible, and expand it later when we discover need.

You're right though that en is a problem and we should migrate Zetkin to something more specific soon. There is something like an unambiguous definition for en but let's not rely on it: https://github.com/unicode-org/cldr/blob/main/common/supplemental/likelySubtags.xml#L359 =)

What about /^[a-z]{2}(-[A-Z]{2})?$/, it adds support for some region subtags, while keeping the validation narrow.

"-" seems to be the preferred and canonical form: https://www.unicode.org/reports/tr35/tr35.html#BCP_47_Conformance

The canonical form uses upper case for region subtags: https://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers

henrycatalinismith · 2024-07-07T14:25:31Z

I see three factors in play in this PR.

Reduction of security risks by nerfing the potential for misuse of the filenames for stuff like directory traversal or XSS attacks.
Increased deployment friction for potential new users whose translation files follow a naming convention that falls outside the limits set by this validation.
Adherence to standards about what's strictly considered valid as a language identifier.

I think the first factor is possible to max out while minimising the second, by sacrificing the third. My understanding of your proposal is to maximise the third factor by sacrificing the second. I have a few thoughts about why I think it's the wrong sacrifice.

I don't think it's relevant to the goal of improving security. Just validating on /^[a-z0-9_.-]{1,32}$/i would achieve your security goals as far as I can tell.
I don't think everyone who it affects will take the time to file an issue asking for the validation to be widened.
I don't think the assumption that translation files are always named ^[language-code].[ext]$ is a good one. I have a project of my own where some of my translation files have names like devise.en.yml. It's a framework-level naming convention that's outside of my control and which makes Lyra unviable for those strings even despite the two letter language codes.

henrycatalinismith · 2024-07-07T14:57:44Z

That last one has me wondering if some of this could also be addressed by augmenting how projects can be configured in lyra.yml. Like if you could specify e.g. prefix: config/locales/devise. and thus configure that subset of the directory contents as a separate project, you'd be able to maintain the flexibility for the filenames and validate the language codes quite strictly.

Also noticed my example regex doesn't actually block the .. directory traversal case lol.

WULCAN · 2024-07-07T15:37:07Z

Yes, but I don't actually care so much about standards adherence, I just use it as a guide when choosing a minimally variable schema. My proposal was to maximise security by sacrificing users. =)

I agree with you but we have to think about this a little more:

I'm not sure I understand the prefix idea, are all translation files in the repository prefixed with devise. or are there other prefixes too, that should be handled like other projects in the same repository?

henrycatalinismith · 2024-07-07T17:59:52Z

Yeah quite a common thing is for a project to have its translations organised into multiple sets of files. So if there's an email feature and a podcast feature for example, there might be email.en.yml, email.fr.yml, podcast.en.yml, podcast.fr.yml. Fairly unremarkable for this stuff to get spread over multiple directories too. ViewComponent translations are a great example of this.

I think I've derailed here though. It's an interesting topic and these are important considerations but on reflection I think the right move in the short term is to unblock you here and then we can follow up on any of this other stuff if/when the time comes. Making Lyra usable for projects other than Zetkin is probably a problem best left until the Zetkin use case is completely covered and this was just a nice way for me to warm up a bit and think out loud about the problem space!!

WULCAN · 2024-07-08T05:58:11Z

Maybe the distinction between repositories, projects and message identifier segments isn’t necessary. What do you think about a user just navigating all of that in a common tree?

WULCAN added 3 commits July 3, 2024 21:52

Normalize block comment

427782b

Every other block comment in the repository use a consistent style.

Validate languages from lyra.yml

63b64e0

Refactor language validation

1f8279f

henrycatalinismith approved these changes Jul 7, 2024

View reviewed changes

This was referenced Jul 7, 2024

Support an encourage region subtags for en locales #111

Open

Support projects with prefixed translation file names #112

Open

WULCAN merged commit e87c7e6 into main Jul 7, 2024
1 check passed

WULCAN deleted the issue-34/validate-language branch July 7, 2024 20:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate languages from lyra.yml #106

Validate languages from lyra.yml #106

WULCAN commented Jul 3, 2024 •

edited

Loading

henrycatalinismith commented Jul 6, 2024

WULCAN commented Jul 7, 2024 •

edited

Loading

henrycatalinismith commented Jul 7, 2024

henrycatalinismith commented Jul 7, 2024 •

edited

Loading

WULCAN commented Jul 7, 2024

henrycatalinismith commented Jul 7, 2024

WULCAN commented Jul 8, 2024

Validate languages from lyra.yml #106

Validate languages from lyra.yml #106

Conversation

WULCAN commented Jul 3, 2024 • edited Loading

henrycatalinismith commented Jul 6, 2024

WULCAN commented Jul 7, 2024 • edited Loading

henrycatalinismith commented Jul 7, 2024

henrycatalinismith commented Jul 7, 2024 • edited Loading

WULCAN commented Jul 7, 2024

henrycatalinismith commented Jul 7, 2024

WULCAN commented Jul 8, 2024

WULCAN commented Jul 3, 2024 •

edited

Loading

WULCAN commented Jul 7, 2024 •

edited

Loading

henrycatalinismith commented Jul 7, 2024 •

edited

Loading