Klortho edited this page Sep 7, 2014 · 1 revision

[This page will document any non-obvious decisions about the output format of the transformation process (see issue #4), along with their rationale, including:

  • The format of the main XML output file (the "XML wrapper")
  • The format of links and other structured data elements within the wiki text portion of the output file
  • How images and other media are handled, including directory and file naming conventions]

MediaWiki XML Format

The converter outputs MediaWiki XML Format, which is described on the Help:Export page of Wikipedia and specified by this XSD.

The XML markup portions of the format are really just incidental metadata. The main article content is in MediaWiki text format, contained within the /mediawiki/page/revision/text element. Here is a mockup of an article's output XML:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
           xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.6/ 
                               http://www.mediawiki.org/xml/export-0.6.xsd" 
           version="0.6" 
           xml:lang="en">
  <page>
    <title>Open Access Week</title>
    <ns>0</ns>
    <id>33545656</id>
    <revision>
      <id>481470937</id>
      <timestamp>2012-03-12T06:12:23Z</timestamp>
      <contributor>
        <username>Helpful Pixie Bot</username>
        <id>14216826</id>
      </contributor>
      <minor/>
      <comment>Fixed header External Links =&gt; External links (Build J2)</comment>
      <text xml:space="preserve" bytes="4">Hey!</text>
      <sha1/>
    </revision>
  </page>
</mediawiki>
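
To illustrate where the article content sits, here is a minimal Python sketch that reads the /mediawiki/page/revision/text element out of a trimmed copy of the mockup above. It is not part of the converter; the function name `extract_articles` and the `mw` prefix are hypothetical, and only the namespace URI comes from the mockup.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the mockup; "mw" is an arbitrary local prefix.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.6/"}

# Trimmed copy of the mockup (contributor, comment and schema attributes omitted).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/" version="0.6" xml:lang="en">
  <page>
    <title>Open Access Week</title>
    <ns>0</ns>
    <id>33545656</id>
    <revision>
      <id>481470937</id>
      <timestamp>2012-03-12T06:12:23Z</timestamp>
      <text xml:space="preserve" bytes="4">Hey!</text>
      <sha1/>
    </revision>
  </page>
</mediawiki>"""


def extract_articles(xml_string):
    """Return a list of (title, wiki text) pairs, one per <page> element."""
    root = ET.fromstring(xml_string)
    articles = []
    for page in root.findall("mw:page", NS):
        title = page.findtext("mw:title", namespaces=NS)
        # The article body lives in page/revision/text; everything else is metadata.
        text = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
        articles.append((title, text))
    return articles


print(extract_articles(SAMPLE))
```

Because the root element declares a default namespace, every step of the path has to be qualified with it; an unqualified `page/revision/text` lookup would find nothing.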

Note that mixing HTML with wiki text syntax, and then writing the result into the XML output format, involves some subtleties and pitfalls that you have to be aware of. See, for example, issue #6.
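
One such subtlety can be sketched in Python: HTML tags that are meant to survive as part of the wiki text must be escaped when they are placed inside the text element, otherwise they would be read as XML markup. This is only an illustration of general XML escaping, not a reproduction of issue #6. The helper name `make_text_element` is hypothetical, the element is built without the export namespace for brevity, and treating `bytes` as the UTF-8 byte length is an assumption (consistent with `bytes="4"` for "Hey!" in the mockup).

```python
import xml.etree.ElementTree as ET

# Standard namespace bound to the reserved "xml" prefix (used for xml:space).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_text_element(wikitext):
    """Wrap wiki text in a <text> element; the serializer escapes <, > and &."""
    elem = ET.Element("text")
    elem.set("{%s}space" % XML_NS, "preserve")
    # Assumption: "bytes" is the UTF-8 byte length of the wiki text.
    elem.set("bytes", str(len(wikitext.encode("utf-8"))))
    elem.text = wikitext
    return elem


# Wiki text that mixes an HTML tag, an ampersand and a wiki link.
wikitext = "H<sub>2</sub>O & [[Open access]]"
serialized = ET.tostring(make_text_element(wikitext), encoding="unicode")
print(serialized)

# Parsing the serialized element gives back the original wiki text unchanged.
assert ET.fromstring(serialized).text == wikitext
```

Letting an XML library do the escaping, rather than concatenating strings by hand, keeps the output well-formed whatever HTML the wiki text contains.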

MediaWiki Text Format

TBD

Media Files

TBD
