Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate languages from lyra.yml #106

Merged
merged 3 commits into from
Jul 7, 2024
Merged

Validate languages from lyra.yml #106

merged 3 commits into from
Jul 7, 2024

Conversation

WULCAN
Copy link
Collaborator

@WULCAN WULCAN commented Jul 3, 2024

By allowing strange language identifiers, and providing matching strange filenames, one who controls the repository could perhaps wreck some havoc in Lyra. Validation of language codes closes one avenue of such unintended manipulation.

This relates to #34 but is arguably not part of it.

Every other block comment in the repository use a consistent style.
@henrycatalinismith
Copy link
Collaborator

If one of this project's eventual goals is broader adoption beyond Zetkin I think it's worth broadening the regex here to /^[a-z]{2}([-_][a-z]{2})?$/i.

Zetkin happens to be a relatively simple case where two letter language codes are getting the job done for now. Most projects I've worked on have fallen into one of these categories:

  1. Locale codes such as en-GB, en-US due to global scale requiring attention to be paid to dialects and other regional language differences.
  2. Locale codes as above but due to tools/people imposing that structure upon projects early in order to future proof them against global scale.
  3. A combination of locale codes and two letter language codes because of an early assumption that language codes would be enough, and backwards compatibility issues preventing a full migration to locale codes once the assumption was eventually proved wrong.

Personally I think it's only a matter of time before even Zetkin joins that third category. Probably happens once there's a good few big US- and UK- based orgs onboarded and then a feature gets added where the only vocab that works for it falls within the "not easily mutually intelligible" subset of en-GB/en-US vocabulary differences.

@WULCAN
Copy link
Collaborator Author

WULCAN commented Jul 7, 2024

Yes, I just wanted to start with as narrow validation as possible, and expand it later when we discover need.

You're right though that en is a problem and we should migrate Zetkin to something more specific soon. There is something like an unambiguous definition for en but let's not rely on it: https://github.com/unicode-org/cldr/blob/main/common/supplemental/likelySubtags.xml#L359 =)

What about /^[a-z]{2}(-[A-Z]{2})?$/, it adds support for some region subtags, while keeping the validation narrow.

"-" seems to be the preferred and canonical form: https://www.unicode.org/reports/tr35/tr35.html#BCP_47_Conformance

The canonical form uses upper case for region subtags: https://www.unicode.org/reports/tr35/tr35.html#Canonical_Unicode_Locale_Identifiers

@henrycatalinismith
Copy link
Collaborator

I see three factors in play in this PR.

  1. Reduction of security risks by nerfing the potential for misuse of the filenames for stuff like directory traversal or XSS attacks.
  2. Increased deployment friction for potential new users whose translation files follow a naming convention that falls outside the limits set by this validation.
  3. Adherence to standards about what's strictly considered valid as a language identifier.

I think the first factor is possible to max out while minimising the second, by sacrificing the third. My understanding of your proposal is to maximise the third factor by sacrificing the second. I have a few thoughts about why I think it's the wrong sacrifice.

  1. I don't think it's relevant to the goal of improving security. Just validating on /^[a-z0-9_.-]{1,32}$/i would achieve your security goals as far as I can tell.
  2. I don't think everyone who it affects will take the time to file an issue asking for the validation to be widened.
  3. I don't think the assumption that translation files are always named ^[language-code].[ext]$ is a good one. I have a project of my own where some of my translation files have names like devise.en.yml. It's a framework-level naming convention that's outside of my control and which makes Lyra unviable for those strings even despite the two letter language codes.

@henrycatalinismith
Copy link
Collaborator

henrycatalinismith commented Jul 7, 2024

That last one has me wondering if some of this could also be addressed by augmenting how projects can be configured in lyra.yml. Like if you could specify e.g. prefix: config/locales/devise. and thus configure that subset of the directory contents as a separate project, you'd be able to maintain the flexibility for the filenames and validate the language codes quite strictly.

Also noticed my example regex doesn't actually block the .. directory traversal case lol.

@WULCAN
Copy link
Collaborator Author

WULCAN commented Jul 7, 2024

Yes, but I don't actually care so much about standards adherence, I just use it as a guide when choosing a minimally variable schema. My proposal was to maximise security by sacrificing users. =)

I agree with you but we have to think about this a little more:

I'm not sure I understand the prefix idea, are all translation files in the repository prefixed with devise. or are there other prefixes too, that should be handled like other projects in the same repository?

@henrycatalinismith
Copy link
Collaborator

Yeah quite a common thing is for a project to have its translations organised into multiple sets of files. So if there's an email feature and a podcast feature for example, there might be email.en.yml, email.fr.yml, podcast.en.yml, podcast.fr.yml. Fairly unremarkable for this stuff to get spread over multiple directories too. ViewComponent translations are a great example of this.

I think I've derailed here though. It's an interesting topic and these are important considerations but on reflection I think the right move in the short term is to unblock you here and then we can follow up on any of this other stuff if/when the time comes. Making Lyra usable for projects other than Zetkin is probably a problem best left until the Zetkin use case is completely covered and this was just a nice way for me to warm up a bit and think out loud about the problem space!!

@WULCAN WULCAN merged commit e87c7e6 into main Jul 7, 2024
1 check passed
@WULCAN WULCAN deleted the issue-34/validate-language branch July 7, 2024 20:50
@WULCAN
Copy link
Collaborator Author

WULCAN commented Jul 8, 2024

Maybe the distinction between repositories, projects and message identifier segments isn’t necessary. What do you think about a user just navigating all of that in a common tree?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants