
Formatting Information — An introduction to typesetting with LATEX

Chapter 8: Compatibility

Section 8.1: Converting into LATEX

In this section…

  1. Getting LATEX out of XML

Before looking at one-way systems, see the earlier note about Pandoc.

There are several systems which will save their text in LATEX format. The best known is probably LYX, which is a wordprocessor-like interface to LATEX (not quite WYSIWYG, more What You See Is What You Mean). Both AbiWord (Linux and Windows) and KWord (Linux) have a very good Save As... LATEX output, and OpenOffice (all platforms) has a LATEX plugin, so they can be used to open Microsoft Word documents as well as their own formats, and convert them to LATEX. Several maths packages like the EuroMath editor, and the Mathematica and Maple analysis packages, can also save material in LATEX format.

In general, most other wordprocessors and DTP systems either don’t have the level of internal markup sophistication needed to create a LATEX file, or they lack a suitable filter to enable them to output what they do have. Often they are incapable of outputting any kind of structured document, because they only store what the text looks like, not why it’s there or what role it fulfils. There are two ways out of this:

  • Use the File → Save As... menu item to save the wordprocessor file as HTML, rationalise the HTML using Dave Raggett’s HTML Tidy, and convert the resulting XHTML file to LATEX with any of the standard XML tools (see below).

  • Get the files into Word or ODF format, and write a transformation in XSLT to convert the internal XML into LATEX. This is by far the most robust way to do it, but the quality of most wordprocessing files is poor when it comes to identifying which bits do what, which is what LATEX needs, so some guesswork or heuristics may be needed.

If you have large numbers of obsolete Word .doc files (too many to open and save as .docx), you can try to use a specialist conversion tool like EBT’s DynaTag (supposedly available from Enigma, if you can persuade them they have a copy to sell you; or you may still be able to get it from Red Bridge Interactive in Providence, RI). It’s old and expensive and they don’t advertise it, but for GUI-driven bulk conversion of consistently-marked Word (.doc, not .docx) files into usable XML it beats everything else hands down. Whatever system you use, though, the Word files MUST be consistent and MUST use Named Styles from a stylesheet (template), otherwise no system on earth is going to be able to guess what they mean.
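To see why Named Styles matter so much, here is a minimal Python sketch of what a converter can do when they are present. It is not how DynaTag or any other product works, and it uses the XML found inside a .docx file rather than the old binary .doc format. The sample fragment and the STYLE_TO_COMMAND mapping are invented for illustration; the only thing taken from the Word format itself is that a paragraph's style name is recorded in a w:pStyle element.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mapping from Named Styles to LATEX commands; adjust to your template.
STYLE_TO_COMMAND = {"Heading1": "\\section", "Heading2": "\\subsection"}

# Invented sample: two styled paragraphs and one with no style at all.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
  <w:p><w:r><w:t>Plain body text.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Scope</w:t></w:r></w:p>
</w:body>
""".strip()


def paragraphs_to_latex(xml_text):
    """Turn each w:p into a line of LATEX, using its Named Style if it has one."""
    lines = []
    for para in ET.fromstring(xml_text).iter(f"{{{W}}}p"):
        # Join the text runs of the paragraph.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        style = para.find("w:pPr/w:pStyle", NS)
        name = style.get(f"{{{W}}}val") if style is not None else None
        command = STYLE_TO_COMMAND.get(name)
        # Without a known style there is nothing to go on: emit plain text.
        lines.append(f"{command}{{{text}}}" if command else text)
    return lines


for line in paragraphs_to_latex(SAMPLE):
    print(line)
```

A heading that was made by hand (bold, larger type) instead of with a style would fall through to the plain-text branch, which is exactly the guesswork problem described above.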

There is of course a fourth way, suitable for large volumes only: send it off to the Pacific Rim to be scanned or retyped into XML or LATEX. There are hundreds of companies from India to Polynesia who do this at high speed and low cost with very high accuracy. It sounds crazy when the document is already in electronic form, but it’s a good example of the problem of low quality of wordprocessor markup that this solution exists at all.

You will have noticed that most of the solutions lead to one place: XML. As explained above and elsewhere, this format is the only one so far devised capable of storing sufficient information in machine-processable, publicly-accessible form to enable your document to be recreated in multiple output formats. Once your document is in XML, there is a large range of software available to turn it into other formats, including LATEX. Processors in any of the common XML processing languages like XSLT or Omnimark can easily be written to output LATEX, and this approach is extremely common.

Much of this would be simplified if wordprocessors supported native, arbitrary XML/XSLT as a standard feature, because LATEX output would become much simpler to produce, but this seems unlikely.

However, the native format for both OpenOffice and Word is now XML. Both .docx and .odt files are actually Zip files containing the XML document together with stylesheets, images, and other ancillary files. This means that for a robust transformation into LATEX, you just need to write an XSLT stylesheet to do the job: non-trivial, as the XML formats used are extremely complex, but certainly possible.
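Getting at the XML inside the Zip wrapper needs nothing more than a Zip library. The following Python sketch builds a tiny stand-in for a .docx file in memory (so it runs without a real document) and reads the main XML part back out. The helper name read_main_xml is my own; the member name word/document.xml is where Word keeps the main text, and for an OpenOffice text document you would ask for content.xml instead.

```python
import io
import zipfile

# Build a tiny stand-in for a .docx in memory so the sketch is self-contained.
# A real file holds many more parts (styles, relationships, media).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>",
    )
    archive.writestr("word/styles.xml", "<styles/>")  # placeholder content


def read_main_xml(source, member="word/document.xml"):
    """Return one XML part of a zipped wordprocessor file as text."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(member).decode("utf-8")


buffer.seek(0)
main_xml = read_main_xml(buffer)
print(main_xml)
```

With a real file you would pass its filename instead of the in-memory buffer, and hand the extracted XML to your XSLT processor.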

Among the conversion programs for related formats on CTAN is Ujwal Sathyam and Paul DuBois’s rtf2latex2e, which converts Rich Text Format (RTF) files (output by many wordprocessors) to LATEX. The package description says it has support for figures and tables; equations are read as figures; and it can handle the latest RTF versions from Microsoft Word 97/98/2000, StarOffice, and other wordprocessors. It runs on Macs, Linux, other Unix systems, and Windows.

8.1.1 Getting LATEX out of XML

Assuming you can get your document out of its wordprocessor format into XML by some method, here is a very brief example of how to turn it into LATEX.

You can of course buy any fully-fledged commercial XML editor with XSLT support, and run transformations within it. However, this is beyond the reach of many individual users, although oXygen is available at a discounted price to academic sites.

To do the job unaided you need to install three pieces of software: Java, Saxon or another XSLT processor, and the DocBook 5.0 DTD (links are correct at the time of writing). None of these has a graphical interface: they are run from the command-line.

As an example, let’s take the above paragraph, as typed or imported into AbiWord (see Figure 8.3). This is stored as a single paragraph with highlighting on the product names (italics), and the names are also links to their Internet sources, just as they are in this document. This is a convenient way to store two pieces of information in the same place.

Figure 8.3: Sample paragraph in AbiWord being converted to XML


AbiWord can export in DocBook format, which is an XML vocabulary for describing technical (computer) documents — it’s what I use for this book. AbiWord can also export LATEX, but we’re going to make our own version, working from the XML (Brownie points for the reader who can guess why I’m not just accepting the LATEX conversion output).

Although AbiWord’s default is to output an XML book document type, we’ll convert it to a LATEX article document class. In this example I’ve changed the linebreaks to keep it within the bounds of the page size of the PDF edition:

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<book> 
<!-- ================================================== --> 
<!-- This DocBook file was created by AbiWord.          --> 
<!-- AbiWord is a free, Open Source word processor.     --> 
<!-- You may obtain more information about AbiWord 
     at www.abisource.com                               --> 
<!-- ================================================== --> 
<chapter> 
  <title></title> 
  <section role="unnumbered">
    <title></title> 
    <para>You can of course buy and install a fully-fledged 
      commercial XML editor with XSLT support, and run this 
      application within it. However, this is beyond the 
      reach of many users, so to do this unaided you just 
      need to install three pieces of software: <ulink
      url="http://java.com/download/"><emphasis>Java</emphasis></ulink>,
      <ulink
      url="http://saxon.sourceforge.net"><emphasis>Saxon</emphasis></ulink>, 
      and the <ulink 
      url="http://www.docbook.org/xml/4.2/index.html">DocBook 
      4.2 DTD</ulink> (URIs are correct at the time of writing). 
      None of these has a visual interface: they are run from 
      the command-line in the same way as is possible with
      L<superscript>A</superscript>T<subscript>E</subscript>X.</para>
  </section> 
</chapter> 
</book>

The XSLT language lets us create templates for each type of element in an XML document. We did not create chapter or section titles (DocBook requires them to be present, but they don’t have to be used), so in our example there are only three element types which need handling:

  • para, for the paragraph[s];

  • ulink, for the URIs;

  • emphasis, for the italicisation.

I’m going to cheat over the superscripting and subscripting of the letters in the LATEX logo, and use my editor to replace the whole thing with the \LaTeX command. In the other three cases, we already know how LATEX deals with these, so we can write the templates accordingly.
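If you would rather not do that replacement by hand, it can be scripted. This is only a sketch of the same cheat in Python, assuming the logo appears in the markup exactly as AbiWord wrote it above (give or take whitespace); the names LOGO_PATTERN and restore_logo are my own.

```python
import re

# The markup AbiWord exported for the logo, allowing whitespace between the pieces.
LOGO_PATTERN = re.compile(
    r"L\s*<superscript>\s*A\s*</superscript>\s*T\s*<subscript>\s*E\s*</subscript>\s*X"
)


def restore_logo(docbook_text):
    """Replace the spelled-out logo markup with the \\LaTeX command."""
    # Using a function as the replacement stops re.sub from treating
    # the backslash in \LaTeX as an escape sequence.
    return LOGO_PATTERN.sub(lambda match: r"\LaTeX", docbook_text)


sample = "possible with\n      L<superscript>A</superscript>T<subscript>E</subscript>X.</para>"
print(restore_logo(sample))
```

The result is still well-formed XML, because a backslash has no special meaning there, and the command passes straight through the text output of the stylesheet.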

Writing XSLT is not hard, but requires a little learning. The output method here is text, which is LATEX’s file format (XSLT can also output HTML and other flavours of XML).

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>\documentclass{article}\usepackage{url}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="book">
    <xsl:text>\begin{document}</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>\end{document}</xsl:text>
  </xsl:template>

  <xsl:template match="para">
    <xsl:apply-templates/>
    <xsl:text>&#x0a;</xsl:text>
  </xsl:template>

  <xsl:template match="ulink">
    <xsl:apply-templates/>
    <xsl:text>\footnote{\url{</xsl:text>
    <xsl:value-of select="@url"/>
    <xsl:text>}}</xsl:text>
  </xsl:template>

  <xsl:template match="emphasis">
    <xsl:text>\emph{</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>

</xsl:stylesheet>

  1. The first template matches /, which is the document root (before the book start-tag). At this stage we output the text which will start the LATEX document, \documentclass{article} and \usepackage{url}.

    The apply-templates instruction tells the processor to carry on processing, looking for more matches. XML comments get ignored, and any elements which don’t match a template simply have their contents passed through until the next match occurs, or until plain text is encountered (and output).

  2. The book template outputs the \begin{document} command, invokes apply-templates to make it carry on processing the contents of the book element, and then at the end, outputs the \end{document} command.

  3. The para template just outputs its content, but follows it with a linebreak, using the hexadecimal character code x0A (see the ASCII chart in Table E.1).

  4. The ulink template outputs its content but follows it with a footnote using the \url command to output the value of the url attribute.

  5. The emphasis template surrounds its content with \emph{ and }.
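If XSLT is new to you, it may help to see the same five rules written as an ordinary recursive function. This Python sketch is not a replacement for the stylesheet, just a rough equivalent under my own assumptions: the shortened SAMPLE document and the function names convert and to_latex are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, shortened sample in the same shape as the AbiWord export.
SAMPLE = """<book><chapter><title></title><section role="unnumbered"><title></title>
<para>Install <ulink url="http://saxon.sourceforge.net"><emphasis>Saxon</emphasis></ulink>
and read the <emphasis>manual</emphasis>.</para></section></chapter></book>"""


def convert(element):
    """Return the LATEX text for one element and everything inside it."""
    # Rough equivalent of apply-templates: the element's own text,
    # then each child in turn, followed by the text after that child.
    inner = element.text or ""
    for child in element:
        inner += convert(child) + (child.tail or "")

    if element.tag == "book":
        return "\\begin{document}" + inner + "\\end{document}"
    if element.tag == "para":
        return inner + "\n"
    if element.tag == "ulink":
        return inner + "\\footnote{\\url{" + element.get("url") + "}}"
    if element.tag == "emphasis":
        return "\\emph{" + inner + "}"
    return inner  # no rule for this element: pass its content through


def to_latex(xml_text):
    """Write the preamble (the rule for the document root), then convert the rest."""
    root = ET.fromstring(xml_text)
    return "\\documentclass{article}\\usepackage{url}" + convert(root)


print(to_latex(SAMPLE))
```

The final return in convert does the job of XSLT's default behaviour for unmatched elements such as chapter and section, which is why those need no rule of their own.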

If you run this through Saxon, which is an XSLT processor, you can output a LATEX file which you can typeset (see Figure 8.4).

$ java -jar saxon9.jar -o para.ltx para.dbk para.xsl 
$ pdflatex para.ltx
This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian)
 restricted \write18 enabled.
entering extended mode
(./para.ltx
LaTeX2e <2009/09/24>
Babel <v3.8l> and hyphenation patterns for english, usenglishmax, 
dumylang, nohyphenation, farsi, arabic, croatian, bulgarian, 
ukrainian, russian, czech, slovak, danish, dutch, finnish, french, 
basque, ngerman, german, german-x-2009-06-19, ngerman-x-2009-06-19, 
ibycus, monogreek, greek, ancientgreek, hungarian, sanskrit, italian, 
latin, latvian, lithuanian, mongolian2a, mongolian, bokmal, nynorsk, 
romanian, irish, coptic, serbian, turkish, welsh, esperanto, 
uppersorbian, estonian, indonesian, interlingua, icelandic, kurmanji, 
slovenian, polish, portuguese, spanish, galician, catalan, swedish, 
ukenglish, pinyin, loaded.
(/usr/share/texmf-texlive/tex/latex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(/usr/share/texmf-texlive/tex/latex/base/size10.clo))
(/usr/share/texmf-texlive/tex/latex/ltxmisc/url.sty) (./para.aux) 
[1{/home/peter/.texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./para.aux))
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr10.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr6.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr7.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmti10.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmtt8.pfb>
Output written on para.pdf (1 page, 54289 bytes).
Transcript written on para.log.
$

Figure 8.4: The typeset paragraph and its generated source code

\documentclass{article}\usepackage{url}\begin{document} 
You can of course buy and install a fully-fledged commercial XML
editor with XSLT support, and run this application within it. However,
this is beyond the reach of many users, so to do this unaided you just
need to install three pieces of software:
\emph{Java}\footnote{\url{http://java.com/download/}},
\emph{Saxon}\footnote{\url{http://saxon.sourceforge.net}}, and the
DocBook 4.2
DTD\footnote{\url{http://www.docbook.org/xml/4.2/index.html}} (URIs
are correct at the time of writing). None of these has a visual
interface: they are run from the command-line in the same way as is
possible with \LaTeX.
\end{document}

This is a relatively trivial example, but it serves to show that it’s not hard to output LATEX from XML. In fact there is a set of templates already written to produce LATEX from a DocBook file at http://www.dpawson.co.uk/docbook/tools.html#d4e2905

  1. Strictly speaking, the text in the first template isn’t output at that stage: XML processors build a ‘tree’ (a hierarchy) of elements in memory, and these only get ‘serialised’ at the end of processing, into a stream of characters written to a file.