GTFS-Translations #180

Merged
merged 6 commits into from
Jan 9, 2020


LeoFrachet
Contributor

As explained in issue #138 (Jan 29th 2019) and then in issue #175 (Jul 28th 2019), we drafted a GTFS-Translations proposal (bit.ly/gtfs-translations), which is based on Google's old private GTFS translation extension.

Since then, and after a few modifications to the proposal (see the Google doc), Google has shifted to using it internally, deprecating their old private GTFS translation extension, as described in their documentation (here).

I'm opening a pull request with the current (2019-08-07T22:00:00-04:00) state of the Google Doc.

Google has already been consuming it for quite a while. What is still missing before we can open the vote is a producer.

@LeoFrachet LeoFrachet mentioned this pull request Aug 8, 2019
@aababilov
Contributor

+1 from Google. We have a big provider that supplies more than 100 feeds in several countries and uses the GTFS-Translations spec.

@aababilov
Contributor

Here is a feed that Google gets from our producer for the city of Lviv in Ukraine:
https://drive.google.com/open?id=1qGuy5Y-jJvGy2fHU6h4Jv_zdGWNbTnEH

@flocsy
Contributor

flocsy commented Aug 8, 2019

The default language - per dataset - is not clear to me. Shouldn't we allow a different default language per record? If, for example, a dataset covers the whole of Switzerland, what would the default language be? To me it sounds like Zürich (de) should be the default for Zurich, but Genève (fr) the default for Geneva.

@LeoFrachet
Contributor Author

@flocsy: Yes, this is the example cited in the definition of feed_lang. When the default language must vary from place to place, you can define it as mul and provide the local version in every place:

If the dataset contains values in multiple languages (e.g. in multilingual countries like Switzerland, Belgium or Canada), the norm ISO 639-2 contains the language code “mul” to describe such reality. In such case, the best practice is to provide a translation for each of the languages used in the dataset.

For example, a dataset in Switzerland will have feed_lang=mul and will contain by default stop names “Genève” for Geneva, “Zürich” for Zurich and “Biel/Bienne” for the bilingual city of Biel/Bienne. But translations will be provided, in German: “Genf”, “Zürich” and “Biel”; in French: “Genève”, “Zurich” and “Bienne”; in Italian: “Ginevra”, “Zurigo” and “Bienna”; and in English: “Geneva”, “Zurich” and “Biel/Bienne”.

If what you're suggesting is that we attach default_lang information to a subsection of the feed, such as an agency or a stop, that would be doable, but I don't see the added value of doing it.
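
To make the consumer side of this concrete, here is a minimal sketch of how a reader of such a feed might resolve a stop name for a given language and fall back to the untranslated value when no translation exists. The column names of the translations.txt excerpt and the helper names `load_translations` and `localized` are assumptions made for this illustration, not something fixed by this thread.

```python
import csv
import io

# Hypothetical translations.txt excerpt; the column names are assumed for this sketch.
TRANSLATIONS_TXT = """table_name,field_name,language,translation,field_value
stops,stop_name,de,Genf,Genève
stops,stop_name,fr,Zurich,Zürich
stops,stop_name,it,Ginevra,Genève
stops,stop_name,en,Geneva,Genève
"""


def load_translations(text):
    """Index translations by (table, field, untranslated value, language)."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["table_name"], row["field_name"], row["field_value"], row["language"])
        index[key] = row["translation"]
    return index


def localized(index, table, field, value, lang):
    """Return the translation for lang, or the untranslated value if none exists."""
    return index.get((table, field, value, lang), value)


index = load_translations(TRANSLATIONS_TXT)
print(localized(index, "stops", "stop_name", "Genève", "de"))
# No German row for Biel/Bienne in the excerpt, so the local name is kept.
print(localized(index, "stops", "stop_name", "Biel/Bienne", "de"))
```

With feed_lang=mul the fallback value is simply whatever local name the producer put in stops.txt, which is the behavior described in the quoted definition.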

@flocsy
Contributor

flocsy commented Aug 8, 2019 via email

@LeoFrachet
Contributor Author

But I would maybe like to see an optional field: "lang" in all the tables that can be translated, and it would only be used/useful if the default_lang="mul".

It would be useful for some, but sometimes a single stop name would already be mul (Biel/Bienne). If there is a producer and a consumer interested in such a feature, let us know. But I would rather keep that for another proposal later on, since it works as an extension of the current proposal.

Well it really depends on the consumer apps... but I can say that in a place where they use Latin letters I might prefer to see the local names (Geneva, Zürich), because that's the way I will most probably see or hear them, so why display them in English just because my phone's language is set to English, Hungarian, or Hebrew?

Indeed, it depends on the consumer apps. There is a lot to say about how those translated fields should be filled (e.g. should "Köln" be translated in English as "Cologne"? "Köln (Cologne)"?), but IMHO it is up to the data producer to produce them, and up to the consumer to decide how to display them.

Maybe some guidelines will be useful down the road if we see inconsistent behavior.

@LeoFrachet
Contributor Author

I'm opening the vote on this proposal. The vote will be open until next Thursday the 22nd, 23:59:59 UTC.

@flocsy
Contributor

flocsy commented Aug 15, 2019

I still would like to change the following sentence to make it clearer:

If the dataset contains values in multiple languages (e.g. in multilingual countries like Switzerland, Belgium or Canada), the norm ISO 639-2 contains the language code “mul” to describe such reality. In such case, the best practice is to provide a translation for each of the languages used in the dataset.

It's unclear IMHO what "dataset contains values in multiple languages" means. In my reading this means that there is more than one language in the DATASET. However, if the default is "en" and I provide a translation to "fr", then there is no need for "mul".

I would suggest something like:

If the default values in the dataset contain values in multiple languages (e.g. in multilingual countries like Switzerland, Belgium or Canada in stops.txt you have more than one language), the norm ISO 639-2 contains the language code “mul” to describe such reality. In such case, the best practice is to provide a translation for each of the languages used in the dataset. If all the labels in stops.txt are in one language, and there are translations in translations.txt, then "mul" is not to be use.

I'm sure the English speakers can improve it even further, I'd like it to be as explicit as possible.

@LeoFrachet
Contributor Author

LeoFrachet commented Aug 15, 2019

Thanks @flocsy for the suggested language. I'm adding a slightly altered version of your proposal:

If the untranslated values in the dataset are in multiple languages (e.g. in multilingual countries like Switzerland, Belgium or Canada the stop_name in stops.txt will be by default in different languages depending of the area), the feed_lang field should contain the language code mul defined by the norm ISO 639-2 to describe such situation. In such case, the best practice is to provide a translation for each of the languages used in the dataset. If all the untranslated values in the dataset are in the same language, then "mul" should not to be use.
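
The rule in this wording can be sketched as a small check. This is only an illustration: the function name `expected_feed_lang` is hypothetical, and it assumes the producer already knows the language of each untranslated value, which GTFS itself does not record.

```python
def expected_feed_lang(untranslated_langs):
    """Return the feed_lang implied by the languages of the untranslated values.

    untranslated_langs: iterable of language codes, one per untranslated value.
    How these per-value languages are known is an assumption of this sketch.
    """
    distinct = set(untranslated_langs)
    if not distinct:
        raise ValueError("need at least one language code")
    if len(distinct) == 1:
        # All untranslated values share one language, so mul should not be used.
        return distinct.pop()
    # Untranslated values are in several languages.
    return "mul"


print(expected_feed_lang(["en", "en", "en"]))
print(expected_feed_lang(["fr", "de", "de"]))
```

Note that translations listed in translations.txt play no role here: a feed whose untranslated values are all in "en" stays "en" even when French translations are provided.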

@LeoFrachet
Contributor Author

Since nobody has voted since I opened the vote, and since we changed the phrasing, I'm closing and reopening the vote.

The vote will be open until Thursday the 22nd, 23:59:59 UTC.

@aababilov
Contributor

+1 from Google.

@flocsy
Contributor

flocsy commented Aug 15, 2019

+1

@abyrd

abyrd commented Aug 16, 2019

It took me a while to understand the text "If the untranslated values in the dataset are in multiple languages (e.g. in multilingual countries like Switzerland, Belgium or Canada the stop_name in stops.txt will be by default in different languages depending of the area)". I think it will not be immediately apparent to many readers what this means.

The expressions "untranslated values are in multiple languages" and "depending on the area" are ambiguous. At first I thought this was describing some kind of system that reacted to the location of the reader or consumer and extracted language-specific sub-values out of multi-lingual individual records.

Here is my attempt at a rewrite (also correcting some small errors with prepositions etc.):

Datasets may contain untranslated values in multiple languages. For example, in a multilingual country like Switzerland, Belgium, or Canada the stop_name field of each stop could be in a different language, depending on the dominant language in that stop's geographic location. In such cases, the feed_lang field should contain the language code mul defined by the norm ISO 639-2. The best practice here is to provide a translation for each of the languages used in the dataset. If all the untranslated values in the dataset are in the same language, then "mul" should not be used.


Though the comments mention putting off the stop_name="Biel/Bienne" case for the future, the proposal in its current form covers both known use cases of the mul language code (single language per record, multiple languages within a single field). They both seem like valid interpretations of mul to me, so I'm happy to see both added to the spec.

@flocsy
Contributor

flocsy commented Dec 2, 2019

So is it up for vote again?
+1 (Moovit)

@aababilov
Contributor

+1 (Google)

@prhod

prhod commented Dec 5, 2019

+1 (Kisio)

@timMillet
Contributor

@flocsy and @aababilov
My bad, the vote is not up again. @prhod told me that my post on the GTFS Google Group about the vote for GTFS-Attributions pointed here by mistake. I am very sorry about that.

@timMillet
Contributor

timMillet commented Dec 9, 2019

@flocsy and @abyrd
I worked on an improved proposal for the extended description of the feed_info.feed_lang field. The goals were:

  • following the way definitions are formatted in the GTFS reference guide: general description first, then an example in italics (e.g. the descriptions of calendar_dates.exception_type or stop_times.stop_dist_traveled).
  • making the use case of multilingual datasets easier to understand, in line with all your comments, both in the general description and in the example section.

Below would be the whole field description:

Default language used for the text in this dataset. This setting helps GTFS consumers choose capitalization rules and other language-specific settings for the dataset. The file translations.txt can be used if the text needs to be translated into languages other than the default one.

The default language may be multilingual for datasets with the original text in multiple languages. In such cases, the feed_lang field should contain the language code mul defined by the norm ISO 639-2. The best practice here would be to provide, in translations.txt, a translation for each language used throughout the dataset. If all the original text in the dataset is in the same language, then mul should not be used.

Example: Consider a dataset from a multilingual country like Switzerland, with the original stops.stop_name field populated with stop names in different languages. Each stop name is written according to the dominant language in that stop’s geographic location, e.g. Genève for the French-speaking city of Geneva, Zürich for the German-speaking city of Zurich, and Biel/Bienne for the bilingual city of Biel/Bienne. The dataset feed_lang should be mul and translations would be provided in translations.txt, in German: Genf, Zürich, and Biel; in French: Genève, Zurich, and Bienne; in Italian: Ginevra, Zurigo, and Bienna; and in English: Geneva, Zurich, and Biel/Bienne.
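
On the producer side, the example above could be turned into translations.txt rows along the following lines. This is a sketch under stated assumptions: the column names in `COLUMNS` and the helper `build_translations_txt` are chosen for illustration, and only the stop names and translations come from the example.

```python
import csv
import io

# Untranslated stop names and their translations, taken from the example above.
STOP_NAMES = {
    "Genève": {"de": "Genf", "fr": "Genève", "it": "Ginevra", "en": "Geneva"},
    "Zürich": {"de": "Zürich", "fr": "Zurich", "it": "Zurigo", "en": "Zurich"},
    "Biel/Bienne": {"de": "Biel", "fr": "Bienne", "it": "Bienna", "en": "Biel/Bienne"},
}

# Column names are an assumption for this sketch.
COLUMNS = ["table_name", "field_name", "language", "translation", "field_value"]


def build_translations_txt(stop_names):
    """Emit one row per (stop name, language) pair as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for original, by_lang in stop_names.items():
        for lang, translated in by_lang.items():
            writer.writerow({
                "table_name": "stops",
                "field_name": "stop_name",
                "language": lang,
                "translation": translated,
                "field_value": original,
            })
    return out.getvalue()


translations_txt = build_translations_txt(STOP_NAMES)
print(translations_txt)
```

The sketch writes a row for every language used in the dataset, including the cases where the translation equals the original text (e.g. Zürich in German), so that each language has a complete set of names.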

Please, don’t hesitate to provide any feedback!

@flocsy
Contributor

flocsy commented Dec 10, 2019

this is clear to me!

@timMillet
Contributor

timMillet commented Dec 16, 2019

Since both a producer and a consumer have implemented translations.txt as put forward by this pull request, and a consensus has been reached on the feed_info.feed_lang description, I am re-opening the vote.

The vote will be open until Monday, December 23rd at 23:59:59 UTC.
Gavriel (@flocsy ), Alexej (@aababilov ), Pascal (@prhod ): don’t hesitate to vote if you want to.

@skinkie
Contributor

skinkie commented Dec 16, 2019

+1 Stichting OpenGeo / Bliksem Labs

@aababilov
Contributor

+1 from Google.

@gcamp
Contributor

gcamp commented Dec 16, 2019

+1 from Transit

@prhod

prhod commented Dec 17, 2019

+1 from Kisio

@nighthawk

+1 from SkedGo

@flocsy
Contributor

flocsy commented Dec 17, 2019

+1 (Moovit)

@tsherlockcraig

+1 from Trillium

@LeoFrachet
Contributor Author

The vote is closed.

We have 6 votes in favor. Zero against.

We have a producer and a consumer.

So the proposal is adopted 🎉 !
