×Top 20 recent searches

Formatting information
An introduction to typesetting with LATEX

Compatibility with other systems

 
Copyright © 2010 Silmaril Consultants, version 5.5 (2011-05-30)

Chapter 10

Compatibility with other systems

As we saw in Chapter 2, LATEX uses plain-text files, so they can be read and written by any standard application that can open text files. This helps preserve your information over time, as the plain-text format cannot be obsoleted or hijacked by any manufacturer or sectoral interest, and it will always be readable on any computer, from your handheld (yes, LATEX is available for some PDAs, see Figure 10.1) to the biggest supercomputer.


Figure 10.1LATEX editing and processing on the Sharp Zaurus 5500 PDA


However, LATEX is intended as the last stage of the editorial process: formatting for print or display. If you have a requirement to re-use the text in some other environment — a database perhaps, or on the Web or a CD-ROM or DVD, or in Braille or voice output — then it should probably be edited, stored, and maintained in something neutral like the Extensible Markup Language (XML), and only converted to LATEX when a typeset copy is needed.

Although LATEX has many structured-document features in common with SGML and XML, it can still only be processed by the LATEX and pdfLATEX programs. Because its macro features make it almost infinitely redefinable, processing it requires a program which can unravel arbitrarily complex macros, and LATEX and its siblings are the only programs which can do that effectively. Like other typesetters and formatters (Quark XPress, PageMaker, FrameMaker, Microsoft Publisher, 3B2 etc.), LATEX is largely a one-way street leading to typeset printing or display formatting.

Converting LATEX to some other format therefore means you will unavoidably lose some formatting, as LATEX has features that others systems simply don't possess, so they cannot be translated — although there are several ways to minimise this loss. Similarly, converting other formats into LATEX often means editing back the stuff the other formats omit because they only store appearances, not structure.

However, there are at least two excellent systems for converting LATEX directly to HyperText Markup Language (HTML) so you can publish it on the web, as we shall see in section 10.2.

ToC10.1 Converting into LATEX

There are several systems which will save their text in LATEX format. The best known is probably LYX, which is a wordprocessor-like interface to LATEX for Windows and Unix. Both AbiWord (Linux and Windows) and Kword (Linux) have a very good Save As...LATEX output, and OpenOffice (all platforms) has a LATEX plugin, so they can be used to open Microsoft Word documents and convert to LATEX. Several maths packages like the EuroMath editor, and the Mathematica and Maple analysis packages, can also save material in LATEX format.

In general, most other wordprocessors and DTP systems either don't have the level of internal markup sophistication needed to create a LATEX file, or they lack a suitable filter to enable them to output what they do have. Often they are incapable of outputting any kind of structured document, because they only store what the text looks like, not why it's there or what role it fulfills. There are two ways out of this:

  • Use the FileSave As... menu item to save the wordprocessor file as HTML, rationalise the HTML using Dave Raggett's HTML Tidy, and convert the resulting Extensible HyperText Markup Language (XHTML) to LATEX with any of the standard XML tools (see below).

  • Use a specialist conversion tool like EBT's DynaTag (supposedly available from Enigma, if you can persuade them they have a copy to sell you; or you may still be able to get it from Red Bridge Interactive in Providence, RI). It's expensive and they don't advertise it, but for bulk conversion of consistently-marked Word files into usable XML it beats everything else hands down. The Word files must be consistent, though, and must use named styles from a stylesheet, otherwise no system on earth is going to be able to guess what it means.

There is of course a third way, suitable for large volumes only: send it off to the Pacific Rim to be scanned or retyped into XML or LATEX. There are hundreds of companies from India to Polynesia who do this at high speed and low cost with very high accuracy. It sounds crazy when the document is already in electronic form, but it's a good example of the problem of low quality of wordprocessor markup that this solution exists at all.

You will have noticed that most of the solutions lead to one place: XML. As explained above and elsewhere, this format is the only one so far devised capable of storing sufficient information in machine-processable, publicly-accessible form to enable your document to be recreated in multiple output formats. Once your document is in XML, there is a large range of software available to turn it into other formats, including LATEX. Processors in any of the common XML processing languages like the Extensible Stylesheet Language [Transformations] (XSLT) or Omnimark can easily be written to output LATEX, and this approach is extremely common.

Much of this would be simplified if wordprocessors supported native, arbitrary XML/XSLT as a standard feature, because LATEX output would become much simpler to produce, but this seems unlikely.

However, the native format for both OpenOffice and Word is now XML. Both .docx and .odf files are actually Zip files containing the XML document together with stylesheets, images, and other ancillary files. This means that for a robust transformation into LATEX, you just need to write an XSLT stylesheet to do the job — non-trivial, as the XML formats used are extremely complex, but certainly possible.

Among the conversion programs for related formats on CTAN is Ujwal Sath­yam and Paul DuBois's rtf2latex2e, which converts Rich Text Format (RTF) files (output by many wordprocessors) to LATEXε. The package description says it has support for figures and tables, equations are read as figures, and it can the handle the latest RTF versions from Microsoft Word 97/98/2000, StarOffice, and other wordprocessors. It runs on Macs, Linux, other Unix systems, and Windows.

ToC10.1.1 Getting LATEX out of XML

Assuming you can get your document out of its wordprocessor format into XML by some method, here is a very brief example of how to turn it into LATEX.

You can of course buy and install a fully-fledged commercial XML editor with XSLT support, and run this application within it. However, this is beyond the reach of many users, so to do this unaided you just need to install three pieces of software: Java, Saxon and the DocBook 4.2 DTD (URIs are correct at the time of writing). None of these has a visual interface: they are run from the command-line in the same way as is possible with LATEX.

As an example, let's take the above paragraph, as typed or imported into AbiWord (see Figure 10.1). This is stored as a single paragraph with highlighting on the product names (italics), and the names are also links to their Internet sources, just as they are in this document. This is a convenient way to store two pieces of information in the same place.


Figure 10.1: Sample paragraph in AbiWord converted to XML


AbiWord can export in DocBook format, which is an XML vocabulary for describing technical (computer) documents — it's what I use for this book. AbiWord can also export LATEX, but we're going make our own version, working from the XML (Brownie points for the reader who can guess why I'm not just accepting the LATEX conversion output).

Although AbiWord's default is to output an XML book document type, we'll convert it to a LATEX article document class. Notice that AbiWord has correctly output the expected section and title markup empty, even though it's not used. Here's the XML output (I've changed the linebreaks to keep it within the bounds of the page size of the printed edition):

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<book> 
<!-- ================================================== --> 
<!-- This DocBook file was created by AbiWord.          --> 
<!-- AbiWord is a free, Open Source word processor.     --> 
<!-- You may obtain more information about AbiWord 
     at www.abisource.com                               --> 
<!-- ================================================== --> 
<chapter> 
  <title></title> 
  <section role="unnumbered">
    <title></title> 
    <para>You can of course buy and install a fully-fledged 
      commercial XML editor with XSLT support, and run this 
      application within it. However, this is beyond the 
      reach of many users, so to do this unaided you just 
      need to install three pieces of software: <ulink
      url="http://java.com/download/"><emphasis>Java</emphasis></ulink>,
      <ulink
      url="http://saxon.sourceforge.net"><emphasis>Saxon</emphasis></ulink>, 
      and the <ulink 
      url="http://www.docbook.org/xml/4.2/index.html">DocBook 
      4.2 DTD</ulink> (URIs are correct at the time of writing). 
      None of these has a visual interface: they are run from 
      the command-line in the same way as is possible with
      L<superscript>A</superscript>T<subscript>E</subscript>X.</para>
  </section> 
</chapter> 
</book>

The XSLT language lets us create templates for each type of element in an XML document. In our example, there are only three which need handling, as we did not create chapter or section titles (DocBook requires them to be present, but they don't have to be used).

  • para, for the paragraph[s];

  • ulink, for the URIs;

  • emphasis, for the italicisation.

I'm going to cheat over the superscripting and subscripting of the letters in the LATEX logo, and use my editor to replace the whole thing with the \LaTeX command. In the other three cases, we already know how LATEX deals with these, so we can write our templates (see Figure 10.2).


Figure 10.2XSLT script to convert the paragraph

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>\documentclass{article}\usepackage{url}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="book">
    <xsl:text>\begin{document}</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>\end{document}</xsl:text>
  </xsl:template>

  <xsl:template match="para">
    <xsl:apply-templates/>
    <xsl:text>&#x000A;</xsl:text>
  </xsl:template>

  <xsl:template match="ulink">
    <xsl:apply-templates/>
    <xsl:text>\footnote{\url{</xsl:text>
    <xsl:value-of select="@url"/>
    <xsl:text>}}</xsl:text>
  </xsl:template>

  <xsl:template match="emphasis">
    <xsl:text>\emph{</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>

</xsl:stylesheet>

If you run this through Saxon, which is an XSLT processor, you can output a LATEX file which you can process (see Figure 10.3) and view (see Figure 10.4).


Figure 10.3: Running and XML file through the Saxon XSLT processor

$ java -jar saxon9.jar -o para.ltx para.dbk para.xsl 
$ pdflatex para.ltx
This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian)
 restricted \write18 enabled.
entering extended mode
(./para.ltx
LaTeX2e <2009/09/24>
Babel <v3.8l> and hyphenation patterns for english, usenglishmax, 
dumylang, nohyphenation, farsi, arabic, croatian, bulgarian, 
ukrainian, russian, czech, slovak, danish, dutch, finnish, french, 
basque, ngerman, german, german-x-2009-06-19, ngerman-x-2009-06-19, 
ibycus, monogreek, greek, ancientgreek, hungarian, sanskrit, italian, 
latin, latvian, lithuanian, mongolian2a, mongolian, bokmal, nynorsk, 
romanian, irish, coptic, serbian, turkish, welsh, esperanto, 
uppersorbian, estonian, indonesian, interlingua, icelandic, kurmanji, 
slovenian, polish, portuguese, spanish, galician, catalan, swedish, 
ukenglish, pinyin, loaded.
(/usr/share/texmf-texlive/tex/latex/base/article.cls
Document Class: article 2007/10/19 v1.4h Standard LaTeX document class
(/usr/share/texmf-texlive/tex/latex/base/size10.clo))
(/usr/share/texmf-texlive/tex/latex/ltxmisc/url.sty) (./para.aux) 
[1{/home/peter/.texmf-var/fonts/map/pdftex/updmap/pdftex.map}] (./para.aux))
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr10.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr6.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmr7.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmti10.pfb>
</usr/share/texmf-texlive/fonts/type1/public/amsfonts/cm/cmtt8.pfb>
Output written on para.pdf (1 page, 54289 bytes).
Transcript written on para.log.
$


Figure 10.4: Displaying the typeset paragraph


\documentclass{article}\usepackage{url}\begin{document} 
You can of course buy and install a fully-fledged commercial XML
editor with XSLT support, and run this application within it. However,
this is beyond the reach of many users, so to do this unaided you just
need to install three pieces of software:
\emph{Java}\footnote{\url{http://java.sun.com/j2se/1.4.2/download.html}},
\emph{Saxon}\footnote{\url{http://saxon.sourceforge.net}}, and the
DocBook 4.2
DTD\footnote{\url{http://www.docbook.org/xml/4.2/index.html}} (URIs
are correct at the time of writing). None of these has a visual
interface: they are run from the command-line in the same way as is
possible with \LaTeX.
\end{document}

Writing XSLT is not hard, but requires a little learning. The output method here is text, which is LATEX's file format (XSLT can also output HTML and other formats of XML).

  1. The first template matches ‘/’, which is the document root (before the book start-tag). At this stage we output the text which will start the LATEX document, \documentclass{article} and \usepackage{url}. The apply-templates instructions tells the processor to carry on processing, looking for more matches. XML comments get ignored, and any elements which don't match a template simply have their contents passed through until the next match occurs, or until plain text is encountered (and output).1

  2. The book template outputs the \begin{document} and the \end{document} commands, and between them the apply-templates makes it carry on processing.

  3. The para template just outputs its content, but follows it with a linebreak (using the hexadecimal character code x000A (see the ASCII chart in Table C.1).

  4. The ulink template outputs its content but follows it with a footnote using the \url command to output the value of the url attribute.

  5. The emphasis template surrounds its content with \emph{ and }.

This is a relatively trivial example, but it serves to show that it's not hard to output LATEX from XML. In fact there is a set of templates already written to produce LATEX from a DocBook file at http://www.dpawson.co.uk/docbook/tools.html#d4e2905

  1. Strictly speaking it isn't output at this stage: XML processors build a ‘tree’ (a hierarchy) of elements in memory, and they only get ‘serialised’ at the end of processing, into a stream of characters written to a file.

ToC10.2 Converting out of LATEX

This is much harder to do comprehensively. As noted earlier, the LATEX file format really requires the LATEX program itself in order to process all the packages and macros, because there is no telling what complexities authors have added themselves (what a lot of this book is about!).

Many authors and editors rely on custom-designed or homebrew converters, often written in the standard shell scripting languages (Unix shells, Perl, Python, Tcl, etc.). Although some of the packages presented here are also written in the same languages, they have some advantages and restrictions compared with private conversions:

  • Conversion done with the standard utilities (eg awk, tr, sed, grep, detex, etc.) can be faster for ad hoc translations, but it is easier to obtain consistency and a more sophisticated final product using LATEX2HTML or TEX4ht — see below — or one of the other systems available.

  • Embedding additional non-standard control sequences in LATEX source code may make it harder to edit and maintain, and will definitely make it harder to port to another system.

  • Both the above methods (and others) provide a fast and reasonable reliable way to get documents authored in LATEX onto the Web in an acceptable — if not optimal — format.

  • LATEX2HTML was written to solve the problem of getting LATEX-with-mathematics onto the Web, in the days before MathML and math-capable browsers. TEX4ht was written to turn LATEX documents into Web hypertext — mathematics or not.

ToC10.2.1 Conversion to Word

There are several programs on CTAN to do LATEX-to-Word and similar conversions, but they do not all handle everything LATEX can throw at them, and some only handle a subset of the built-in commands of default LATEX. Two in particular, however, have a good reputation, although I haven't used either of them (I tend to stay as far away from Word as possible):

  • latex2rtf by Wilfried Hennings, Fernando Dorner, and Andreas Granzer translates LATEX into RTF — the opposite of the rtf2latex2e mentioned earlier. RTF can be read by most wordprocessors, and this program preserves layout and formatting for most LATEX documents using standard built-in commands.

  • Kirill A Chikrii's TEX2Word for Microsoft Windows is a converter plug-in for Word to let it open TEX and LATEX documents. The author's company claims that ‘virtually any existing TEX/LATEX package can be supported by TEX2Word’ because it is customisable.

One easy route into wordprocessing, however, is the reverse of the procedures suggested in the preceding section: convert LATEX to HTML, which many wordprocessors read easily. The following sections cover two packages for this. Once it's in HTML, you could run it through Tidy to make it XHTML, add some embedded styling using Cascading Style Sheets (CSS), and rename the file to end in .doc, which can fool Word into opening it natively.

ToC10.2.2 LATEX2HTML

As its name suggests, this is a system to convert LATEX structured documents to HTML. Its main task is to reproduce the document structure as a set of interconnected HTML files. Despite using Perl, LATEX2HTML relies very heavily on standard Unix facilities like the NetPBM graphics package and the pipe syntax. Microsoft Windows is not well suited to this kind of composite processing, although all the required facilities are available for download in various forms and should in theory allow the package to run — but reports of problems are common.

  • The sectional structure is preserved, and navigational links are generated for the standard Next, Previous, and Up directions.

  • Links are also used for the cross-references, citations, footnotes, ToC, and lists of figures and tables.

  • Conversion is direct for common elements like lists, quotes, paragraph-breaks, type-styles, etc., where there is an obvious HTML equivalent.

  • Heavily formatted objects such as math and diagrams are converted to images.

  • There is no support for homebrew macros.

There is, however, support for arbitrary hypertext links, symbolic cross-references between ‘evolving remote documents’, conditional text, and the inclusion of raw HTML. These are extensions to LATEX, implemented as new commands and environments.

LATEX2HTML outputs a directory named after the input filename, and all the output files are put in that directory, so the output is self-contained and can be uploaded to a server as it stands.

ToC10.2.3 TEX4ht

TEX4ht operates differently from LATEX2HTML: it uses the TEX program to process the file, and handles conversion in a set of postprocessors for the common LATEX packages. It can also output to XML, including Text Encoding Initiative (TEI) and DocBook, and the OpenOffice and WordXML formats, and it can create TEXinfo format manuals.

By default, documents retain the single-file structure implied by the original, but there is again a set of additional configuration directives to make use of the features of hypertext and navigation, and to split files for ease of use. This is a most powerful system, and probably the most flexible way to do the job.

ToC10.2.4 Extraction from PostScript and PDF

If you have the full version of Adobe Acrobat (or one of several other commercial PDF products), you can open a PDF file created by pdfLATEX, select and copy all the text, and paste it into Word and some other wordprocessors, and retain some common formatting of headings, paragraphs, and lists. Both solutions still require the wordprocessor text to be edited into shape, but they preserve enough of the formatting to make it worthwhile for short documents. Otherwise, use the pdftotext program to extract everything from the PDF file as plain (paragraph-formatted) text.

ToC10.2.5 Last resort: strip the markup

At worst, the detex program on CTAN will strip a LATEX file of all markup and leave just the raw unformatted text, which can then be re-edited. There are also programs to extract the raw text from DVI and PostScript (PS) files.

ToC10.3 Going beyond LATEX

The reader will have deduced by now that while LATEX is possibly the best programmable typesetting system around, the LATEX file format is not generally usable with anything except the LATEX program. LATEX was originally written in the mid-1980s, about the same time as the Standard Generalized Markup Language (SGML), but the two projects were not connected. However, TEX and LATEX have proved such useful tools for formatting SGML and more recently XML that many users chose this route for their output, using conversions written in the languages already mentioned in section 10.2.

Unfortunately, when the rise of the Web in the early 1990s popularised SGML using the HTML, browser writers deliberately chose to encourage authors to ignore the rules of SGML. Robust auto-converted formatting therefore became almost impossible except via the browsers' low-quality print routines.

It was not until 1997, when the XML was devised, that it again became possible to provide the structural and descriptive power of SGML but without the complex and rarely-used options which had made standard SGML so difficult to program for.

XML is now becoming the principal system of markup. Because it is based on the international standard (SGML), it is not proprietary, so it has been implemented on most platforms, and there is lots of free software supporting it as well as many commercial packages. Like SGML, it is actually a meta-language to let you define your own markup, so it is much more flexible than HTML. Implementations of the companion Extensible Stylesheet Language (XSL) provide a direct route to PDF but at the expense of reinventing most of the wheels which LATEX already possesses, so the sibling XSLT can be used instead to translate to LATEX source code, as shown in the example in section 10.1.1. This is usually much faster than writing your own formatting from scratch in XSL, and it means that you can take full advantage of the packages and sophistication of LATEX. A similar system is used for the Linux Documentation Project, which uses SGML transformed by the Document Style Semantics and Specification Language (DSSSL) to TEX

The source code of this book, available online at http://www.ctan.org/tex-archive/info/beginlatex/src/ includes XSLT which does exactly this.