
MatrixMrsCatb

FrancisBond edited this page Aug 18, 2008 · 17 revisions

Cathedral and the Bazaar (catb)

[#candb This] is an early essay on Open Source. It is about 800 sentences, which is small, but there are more essays if we want more data. There are several good translations (not all linked from the main page). [http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar Wikipedia] also has a number of links to different translations: C (see on the left) (AF). It is freely available, but I (FCB) checked with the author anyway as a matter of courtesy, and he was enthusiastic about us using it. There will be some cleanup work involved in getting the translations aligned (there are several versions of the essay).

It was proposed (by FrancisBond) and accepted (by everyone) at the Kyoto Summit (2008) that we use this as a multilingual shared test suite to enable us to compare parses across different grammars. This page describes the steps we are taking to prepare the translations of the essay as a corpus. As the data becomes available, we will also link it from this page.


The Cathedral and the Bazaar in different languages

||Language||Grammar||Web Version||Profile||Item||
||Catalan (ca)|| ||[http://www.danielclemente.com/apuntes/asai/recensio/catb.html ca] ?|| || ||
||Chinese (zh) traditional|| ||[http://www.linux.org.tw/CLDP/OLD/doc/Cathedral-Bazaar.html zh] (big5) 1.42|| || ||
||Chinese (zh) simplified|| ||[http://www.angeloliu.org/read-37.html zh] ?|| || ||
||English (en)||ERG||[http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ en] 1.57|| || ||
||French (fr)||Grenouille||[http://www.linux-france.org/article/these/cathedrale-bazar/cathedrale-bazar.html fr] 1.4|| || ||
||German (de)||GG||[http://gnuwin.epfl.ch/articles/de/Kathedrale/ de] 1.45|| || ||
||Greek, Modern (el)||MGRG||[http://howto.hellug.gr/howto/pub/html/cathedral-bazaar.html el] ?|| || ||
||Japanese (ja)||Jacy||[http://cruel.org/freeware/cathedral.html ja] 1.40|| || ||
||Korean (ko)||KRG||[http://wiki.kldp.org/wiki.php/DocbookSgml/Cathedral-Bazaar-TRANS ko] 1.32|| || ||
||Norwegian (no)||Norsource NTNU|| || || ||
||Portuguese (pt)||LXgram||[http://www.geocities.com/CollegePark/Union/3590/pt-cathedral-bazaar.html pt] 1.42|| || ||
||Spanish (es)||SRG||[http://es.tldp.org/Otros/catedral-bazar/cathedral-es-paper-13.html es] 1.28|| || ||
||Swedish (sv)|| ||[http://home.swipnet.se/swi/KatB-se.html sv] 1.51|| || ||
||Thai (th)|| ||[http://linux.thai.net/~thep/catb/cathedral-bazaar/index.html th] ?|| || ||

At NiCT we also have a 201-sentence aligned subset of en, ko, zh, de, pt, it, fr, which we use for MT testing. Sugita Sho used it to compare various MT systems [http://www.is.oit.ac.jp/~koda/server/~sugita/ 「機械翻訳の精度分析」 "An Analysis of Machine Translation Precision"].

Timeline

This is the timeline agreed on at the Kyoto Summit.

  • 0 Prepare and release partially filled skeletons (2008-08: FCB)

  • 1 Make profile (2008-09)

    • align and correct translations

    • translate remaining text (if the original translation is incomplete)

    • feed the corrected translations back to the translators

    • segment if necessary

    • link the profile in the table above

    • add the profile to your grammar (e.g., jacy/tsdb/skeletons/catb.ja)

    • NiCT will also link as n(n-1) parallel corpora

  • 2 Treebank profile (2009-03)

    • translate

    • treebank

    • share the treebank to allow for comparisons

    • include it with your grammar (e.g., grammar/gold/catb.ja)

  • 3 Compare treebanks at the next DELPH-IN summit (2009-??)
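The n(n-1) figure in step 1 counts ordered language pairs: each of the n languages can be the source for each of the other n-1. A minimal sketch of enumerating those pairs follows; the function name `language_pairs` and the three-language sample are illustrative assumptions, not part of any existing tool.

```python
from itertools import permutations


def language_pairs(langs):
    """Return every ordered (source, target) pair of distinct languages.

    For n languages this yields n * (n - 1) pairs, since direction matters.
    """
    return list(permutations(langs, 2))


if __name__ == "__main__":
    # Hypothetical subset of the languages in the table above.
    for source, target in language_pairs(["en", "ja", "ko"]):
        print(f"{source} -> {target}")
```

If direction does not matter for a given comparison, `itertools.combinations` would give the n(n-1)/2 unordered pairs instead.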

Formatting Guidelines

Treebanking this text leads to several interesting issues with text cleansing: italics, embedded quotations, list numbers and so forth. In this section we will discuss what we have done in non-straightforward cases.

Note that we are not treating this as a corpus for testing the robustness of our systems to raw text, but rather as a set of sentences for comparing the semantic representations across languages. Therefore, we will try to make the input text as easy to parse as possible. In our corpus all markup is removed and obvious infelicities (typos, misspellings, bad translations) should be corrected. If and when we want to look at robustness issues, we will choose a new text (possibly the next essay in this series).

For the profile, we will use the [wiki:ItsdbReference itsdb text file format], which can be automatically converted into itsdb bitext profiles.

Markup

We have removed all markup (hyperlinks, italics, paragraph boundaries, ...). These can be added in when we have more of a handle on how to deal with them.

Examples:

  • Perhaps this is not only the future of <emphasis>open-source</emphasis> software.

    • Perhaps this is not only the future of open-source software.
  • Other examples are legion, as a visit to <ulink url="http://freshmeat.net/">Freshmeat</ulink> on any given day will quickly prove.

    • Other examples are legion, as a visit to Freshmeat on any given day will quickly prove.
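The markup removal shown in these examples can be sketched roughly as below. This is only an illustration that assumes the source uses simple, well-formed DocBook-style tags; the function name `strip_markup` is hypothetical, and a real XML parser would be safer for anything more complicated.

```python
import html
import re

# Matches a single opening, closing, or empty tag such as <emphasis> or </ulink>.
TAG = re.compile(r"<[^>]+>")


def strip_markup(sentence):
    """Drop tags, decode character entities, and collapse runs of whitespace."""
    text = TAG.sub("", sentence)      # remove the tags, keep their content
    text = html.unescape(text)        # e.g. &amp; becomes &
    return " ".join(text.split())     # normalise spacing left behind by tags


if __name__ == "__main__":
    print(strip_markup(
        "Perhaps this is not only the future of "
        "<emphasis>open-source</emphasis> software."
    ))
```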

Structure

Mark headers as headers (with a preceding + in the text profile, as XP in the item file):

  • +The Cathedral and the Bazaar

Keep list item numbers in the first sentence in the list item.

  • 6. Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging.
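As a rough illustration of these two conventions, the sketch below classifies one line of the text profile. The function name `classify` and the labels it returns are assumptions made for this example; nothing here is prescribed by the itsdb format.

```python
import re

# A list item starts with its number, e.g. "6. Treating your users ..."
LIST_NUMBER = re.compile(r"^\d+\.\s+")


def classify(line):
    """Return a (kind, text) pair for one line of the text profile."""
    if line.startswith("+"):
        # Headers carry a leading "+", which is not part of the text itself.
        return ("header", line[1:].strip())
    if LIST_NUMBER.match(line):
        # The list number stays inside the sentence, as described above.
        return ("list-item", line)
    return ("sentence", line)


if __name__ == "__main__":
    print(classify("+The Cathedral and the Bazaar"))
```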

Quotations

  • If a quotation spans multiple sentences, split at the first period:
[18200] ``Somebody finds the problem,'' he says, ``and somebody else understands it.
[18300] And I'll go on record as saying that finding it is the bigger challenge.''
  • Note that this means we expect to have sentences with unbalanced punctuation.
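One way to see the effect of this rule is to count opening and closing quotation marks per item. The sketch below assumes the LaTeX-style marks used in the example above (two backticks to open, two apostrophes to close); the function name `quote_balance` is hypothetical.

```python
def quote_balance(sentence):
    """Return opening minus closing quote marks in one sentence.

    Zero means balanced; a positive value means a quotation is left open,
    a negative value means the sentence closes a quotation opened earlier.
    """
    return sentence.count("``") - sentence.count("''")


if __name__ == "__main__":
    first = ("``Somebody finds the problem,'' he says, "
             "``and somebody else understands it.")
    second = ("And I'll go on record as saying that finding it "
              "is the bigger challenge.''")
    # Each item is unbalanced on its own, but together they cancel out.
    print(quote_balance(first), quote_balance(second))
```

A single apostrophe, as in "I'll", is not counted, because only doubled marks are treated as quotation marks here.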

Typos

We should correct obvious typos in the profile, and also send them upstream to the maintainer of the essay/translation.

  • the costs of duplicated work tend to scale sub-qadratically with team size

    • the costs of duplicated work tend to scale sub-quadratically with team size

Anything that is not clearly in error should be left as is.

Sentence Numbering

  • The original English text has been numbered in intervals of 100. There are 769 sentences.
[100] +The Cathedral and the Bazaar
...
[76900] Finally, Linus Torvalds's comments were helpful and his early endorsement very encouraging.
  • Translated text should be aligned with the English.
    • If the mapping is one to one, use the same number, with the suffix a for the translation and b for the (English) source:
[100a]  伽藍とバザール
[100b]  +The Cathedral and the Bazaar
  • If the mapping is many to one (several translated sentences for one English sentence), align the first translated sentence with the English, and increment the number of each extra translated sentence by 10.
[2900a] そしてその頃まったくの偶然から、自分の理論を試してみる完璧な機会がやってきた。
[2900b] Chance handed me a perfect way to test my theory, in the form of an open-source project that I could consciously try to run in the bazaar style.

[2910a] 意識的にバザール方式で運営できるようなフリーソフトプロジェクトという形で。
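  • If the mapping is one to many (one translated sentence for several English sentences), list all the English sentences as separate source lines (b, c, ...) under the single translation.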
[4800a] そこでネットで探してみると、3つか4つ見つかった。
[4800b] So I went out on the Internet and found one.
[4900c] Actually, I found three or four.
  • I hope we have no many-to-many translations. If we do, then add all the English to the first sentence, and then have several sentences with no translations:
[1000a]
[1000b]
[1000c]

[1010a]

[1020a]

[1030a]
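The numbering scheme above lends itself to a simple consistency check. The following sketch groups lines of the form `[2900a] text` by sentence number and reports translated items that have no English source line. It assumes exactly the bracketed layout shown in the examples; `parse_aligned` and `missing_source` are hypothetical helper names, not part of itsdb.

```python
import re
from collections import defaultdict

# "[2900a] text": a number, an optional one-letter suffix, then the sentence.
LINE = re.compile(r"^\[(\d+)([a-z]?)\]\s*(.*)$")


def parse_aligned(lines):
    """Group numbered lines as {number: {suffix: text}}.

    The suffix is "a" for the translation, "b" (or later letters) for the
    English source, and "" for lines of the unsuffixed English original.
    """
    items = defaultdict(dict)
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything without a number
        number, suffix, text = match.groups()
        items[int(number)][suffix] = text
    return dict(items)


def missing_source(items):
    """Return numbers of items that have a translation (a) but no source (b)."""
    return sorted(
        number for number, parts in items.items()
        if "a" in parts and "b" not in parts
    )


if __name__ == "__main__":
    sample = [
        "[2900a] そしてその頃まったくの偶然から、自分の理論を試してみる完璧な機会がやってきた。",
        "[2900b] Chance handed me a perfect way to test my theory, in the form "
        "of an open-source project that I could consciously try to run in the bazaar style.",
        "",
        "[2910a] 意識的にバザール方式で運営できるようなフリーソフトプロジェクトという形で。",
    ]
    print(missing_source(parse_aligned(sample)))
```

Items reported by `missing_source` are not necessarily errors: in the many-to-one case the extra translated sentences are expected to have no English line of their own, so the list is best read as a set of places to review.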

Having many misaligned sentences makes cross language comparison just that much harder, ...
