Formatting Information: Conversion — Converting into LATEX


Formatting Information — An introduction to typesetting with LATEX

Chapter 8: Conversion

In this section…

Conversion from wordprocessors
Bulk conversion
Getting LATEX out of XML

Turning other document formats into LATEX is generally several orders of magnitude easier than the other way round, because almost all other document-handling systems know and understand their own features.

The methods vary from wordprocessors or plugins with a menu entry for File→Save As…→LaTeX, to a custom XSLT script for a bespoke solution.

8.1.1 Conversion from wordprocessors

Several wordprocessor systems can save their text in LATEX format using the File→Save As…→LaTeX or Export As… menus. A very few actually create LATEX natively, but there are also some stand-alone converters.

Microsoft Word

Microsoft Word in particular does not have any LATEX export facility at all. Instead, you can either open the document in one of the other systems listed below and use that to export a LATEX document, or use a converter. If you prefer a commercial solution which runs as a plugin within Word, see the item ‘GrindEQ’ below or the item ‘TEX2Word’ in the next section (Converting out of LATEX).

8.1.1.1 Native LATEX

LYX

LYX is probably the best-known wordprocessor-like interface to LATEX (not quite WYSIWYG, more What You See Is What You Meant). It is already basically LATEX internally, and its LATEX export is very good, offering several flavours (LATEX, XƎTEX, LuaTEX, etc) as well as Word and Libre Office formats.

https://www.lyx.org/

Scientific Word

A writing and editing front-end interface to LATEX specifically designed for math and science papers. With LATEX as the backend, the companion product Scientific Workplace provides computational and plotting facilities.

https://sciword.co.uk/

8.1.1.2 Export via plugin

AbiWord

AbiWord can load a Word document and provides an extensive list of export formats, so it provides a good pathway for single-file conversion. Export formats include Word, HTML, XHTML, RTF, EPUB v3, DocBook, Libre Office, and others including LATEX.

http://www.abisource.com/

Libre Office

Libre Office has a LATEX plugin called Writer2LATEX, so it can be used to open Microsoft Word documents (as well as OpenDocument Text (ODT) and other formats), and export them to LATEX.

https://www.libreoffice.org/

GrindEQ

GrindEQ is a commercial plugin for Microsoft Word to allow the loading and saving of LATEX documents. It is oriented primarily towards mathematics.

https://www.grindeq.com/

Several maths packages like the EuroMath editor, and the Mathematica and Maple analysis packages, can also save material in LATEX format.

Pandoc

See the note ‘Pandoc’ above.

https://pandoc.org/

docx2tex

This system converts Microsoft Word’s .docx to LATEX (only). It runs in Java (standalone or as an XProc pipeline with XML Calabash), so it works on all platforms. It is extremely configurable, with customisable directories, and a config file that lets you map your Word styles to LATEX \begin and \end commands, just like the old PCWriTeX driver.

https://github.com/transpect/docx2tex/releases

8.1.1.4 Failing that...

If you can’t get the kind of conversion you want from the existing utilities, or you need to make your own (perhaps to embed in another system), these are some alternatives.

Using HTML

Use the File→Save As…→HTML (or export) menu from your wordprocessor to save the file as HTML, then rationalise the HTML into the XML version of HTML (XHTML) using Dave Raggett’s HTML Tidy, and convert the resulting XHTML file to LATEX with the technique shown in § 8.1.3 below.

http://tidy.sourceforge.net/
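The Tidy step can be scripted. The following is a minimal sketch, assuming the tidy command is installed and on your path; html_to_xhtml is an illustrative name, not something Tidy provides. It treats Tidy's exit status 1 (warnings only) as success, because exported wordprocessor HTML almost always triggers warnings.

```shell
#!/usr/bin/env bash
# Sketch: turn exported HTML into XHTML with HTML Tidy.
# html_to_xhtml is an illustrative name, not part of Tidy.
html_to_xhtml() {
  local src=$1 out=${1%.*}.xhtml status=0
  # -asxhtml: write well-formed XHTML
  # -numeric: use numeric entities, which an XML parser can read
  #           without the HTML entity definitions
  tidy -q -asxhtml -numeric -o "$out" "$src" || status=$?
  # Tidy exits with 1 for warnings only and 2 for errors
  if [ "$status" -ge 2 ]; then
    echo "tidy could not repair $src" >&2
    return 1
  fi
  # Report the name of the file that was written
  printf '%s\n' "$out"
}
```

Calling html_to_xhtml chapter1.html should leave chapter1.xhtml beside the original, ready for the XSLT step.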

Using PDF

The pdftotext utility that comes with TEX Live converts PDF files to plain, unformatted text, with each paragraph separated by a newline. Two-column PDFs are problematic, though, because most creators produce a format which causes text-scanners to read one line from a column followed by a line from the next column, and only then go on to the second line, which makes the text wholly unusable.

Apache provides a Java utility called pdfbox which can extract the text from a PDF document into HTML format, preserving the bold and italics, which pdftotext does not do. This can save a lot of time in post-editing before using the HTML conversion mentioned above.

https://pdfbox.apache.org/

Using RTF

Among the conversion programs for related formats on CTAN is Ujwal Sathyam and Paul DuBois’s rtf2latex2e, which converts Rich Text Format (RTF) files (output by many wordprocessors) to LATEX2ε. The package description says it has support for figures and tables; equations are read as figures; and it can handle the latest RTF versions from Microsoft Word 97/98/2000, StarOffice (so presumably OpenOffice and Libre Office), and other wordprocessors. It runs on Windows and Unix & GNU/Linux systems, including Apple Macintosh OS X.

http://rtf2latex2e.sourceforge.net/ (also available as package rtf2latex2e from CTAN)

Using Word or ODF

If you can get the files into Word or Libre Office format (which are both basically Zip files containing XML), write a transformation in XSLT to convert the internal XML directly into LATEX. This is by far the most robust way to do it but it requires that the author or editor has rigorously used Named Styles. Unfortunately, without them, the quality of most wordprocessing files is generally poor when it comes to identifying which bits do what (which is what LATEX needs to know), so some guesswork or heuristics may be needed.

At the extreme end are very simplistic systems that are incapable of outputting any kind of structured document, because they only store what the text looks like (basically, font, size, and style), rather than why it’s there or what role it fulfils. In those cases you may be able to save the file as a PDF, and use the pdfbox utility as in item ‘Using PDF’ above.

8.1.2 Bulk conversion

Converting large numbers of related documents using most of the non-graphical (command-line) utilities is often straightforward using a shell script in (eg) bash or Powershell. At the simplest level it can just be a few lines like

for f in *.docx; do
  # quote the names so spaces survive; -o names the output file
  pandoc -f docx -t latex -o "${f%.docx}.tex" "$f"
done
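For a large batch it helps to know which files failed rather than have the loop carry on silently. The following is a slightly more defensive sketch of the same idea, assuming pandoc is installed; tex_name and convert_all are illustrative names, and the -s option (a standalone document with its own preamble) is a choice you may not want if you are assembling fragments.

```shell
#!/usr/bin/env bash
# Sketch: convert every .docx in a directory to LaTeX with pandoc.
# tex_name and convert_all are illustrative names.
shopt -s nullglob   # an unmatched *.docx expands to nothing, not to itself

# Derive the output name: report.docx -> report.tex
tex_name() {
  printf '%s\n' "${1%.docx}.tex"
}

# Convert all .docx files in the given directory (default: current one).
# Succeeds only if every conversion succeeded.
convert_all() {
  local dir=${1:-.} f out failed=0
  for f in "$dir"/*.docx; do
    out=$(tex_name "$f")
    if pandoc -f docx -t latex -s -o "$out" "$f"; then
      echo "converted: $f -> $out"
    else
      echo "FAILED: $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Running convert_all drafts would process everything in a directory called drafts, listing failures on standard error so they can be dealt with by hand afterwards.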

However, if you have large numbers of obsolete Word .doc files (too many to open and save as .docx), you can try a specialist conversion tool like EBT’s DynaTag (supposedly still available from Enigma, if you can persuade them they have a copy to sell you; or you may still be able to get it from Red Bridge Interactive in Providence, RI). It’s old and expensive and they don’t advertise it, but for the Graphical User Interface (GUI)-driven bulk conversion of consistently-marked Word (.doc, not .docx) files into usable XML it beats everything else hands down. Whatever system you use, though, the Word files MUST be consistent, and MUST use Named Styles from a stylesheet (template), otherwise no system on earth is going to be able to guess what they mean.

There is of course an external way, suitable for large volumes only: send it off to the Pacific Rim to be scanned, retyped, or hand-edited into XML or LATEX. There are hundreds of companies from India to Polynesia who do this at high speed and low cost with very high accuracy. It sounds crazy when the document is already in electronic format, but the fact that this solution exists at all is a good example of the problem of low-quality wordprocessor markup.

8.1.3 Getting LATEX out of XML

You will have noticed that most of the solutions lead to one place: XML. As explained above and elsewhere, this format is the only one so far devised capable of storing sufficient information in machine-processable, publicly-accessible form to enable your document to be recreated in multiple output formats. Once your document is in XML, there is a large range of software available to turn it into other formats, including LATEX. Processors in any of the common XML processing languages like XSLT or Omnimark can easily be written to output LATEX, and this approach is extremely common.

Much of this would be simplified if wordprocessors supported native, arbitrary XML/XSLT as a standard feature, because LATEX output would become much simpler to produce, but this seems unlikely.

However, since the early 2000s the internal format of both OpenOffice (now Libre Office) and Word has been XML. Both .docx and .odt files are actually Zip files containing the XML document together with stylesheets, images, and other ancillary files. This means that for a robust transformation into LATEX, you just need to write an XSLT stylesheet to do the job. That is non-trivial, as the XML formats used are extremely complex, but certainly possible.
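You can see this for yourself from the command line. The following is a minimal sketch, assuming Info-ZIP's unzip is installed; main_part and extract_main_xml are illustrative names. It relies on the usual layout of these archives, where the body text of a Word .docx file is in word/document.xml and that of an OpenDocument text (.odt) file is in content.xml.

```shell
#!/usr/bin/env bash
# Sketch: pull the main XML document out of a wordprocessor file,
# which is really a Zip archive. Function names are illustrative.

# Print the name of the archive member that normally holds the text.
main_part() {
  case $1 in
    *.docx) echo 'word/document.xml' ;;
    *.odt)  echo 'content.xml' ;;
    *)      echo "unsupported file type: $1" >&2; return 1 ;;
  esac
}

# Write the main XML part of the given file to standard output.
extract_main_xml() {
  local part
  part=$(main_part "$1") || return 1
  # -p sends the named member to standard output instead of to disk
  unzip -p "$1" "$part"
}
```

For example, extract_main_xml thesis.docx > thesis.xml gives you a file that an XSLT processor can read directly, although a glance at it will show why the transformation is described as non-trivial.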

Assuming you can get your document out of its wordprocessor format into XML by some method, here is a very brief example of how to turn it into LATEX.

You can of course buy any fully-fledged commercial XML editor with XSLT support, and run transformations within it. However, this is beyond the reach of many individual users, although oXygen is available at a discounted price to academic sites.

To do the job unaided you need to install three pieces of software: Java, Saxon or another XSLT processor, and the DocBook 5.0 DTD (links are correct at the time of writing). None of these has a graphical interface: they are run from the command-line.

As an example, let’s take a sample paragraph, as typed or imported into AbiWord (see Figure 8.3 below). This is stored as a single paragraph with highlighting on the product names (italics), and the names are also links to their Internet sources, just as they are in this document. This is a convenient way to store two pieces of information in the same place.

Figure 8.3: Sample paragraph in AbiWord being converted to XML

AbiWord can export in DocBook format, which is an XML vocabulary for describing technical (computer) documents — it’s what I use for this book. AbiWord can also export LATEX, but we’re going to make our own version, working from the XML (Brownie points for the reader who can guess why I’m not just accepting the LATEX conversion output).

Although AbiWord’s default is to output an XML book document type, we’ll convert it to a LATEX article document class. In this example I’ve changed the linebreaks to keep it within the bounds of the page size of the PDF edition:

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<book> 
<!-- ================================================== --> 
<!-- This DocBook file was created by AbiWord.          --> 
<!-- AbiWord is a free, Open Source word processor.     --> 
<!-- You may obtain more information about AbiWord 
     at www.abisource.com                               --> 
<!-- ================================================== --> 
<chapter> 
  <title></title> 
  <section role="unnumbered">
    <title></title> 
    <para>You can of course buy and install a fully-fledged 
      commercial XML editor with XSLT support, and run this 
      application within it. However, this is beyond the 
      reach of many users, so to do this unaided you just 
      need to install three pieces of software: <ulink
      url="http://java.com/download/"><emphasis>Java</emphasis></ulink>,
      <ulink
      url="http://saxon.sourceforge.net"><emphasis>Saxon</emphasis></ulink>, 
      and the <ulink 
      url="http://www.docbook.org/xml/4.2/index.html">DocBook 
      4.2 DTD</ulink> (URIs are correct at the time of writing). 
      None of these has a visual interface: they are run from 
      the command-line in the same way as is possible with
      L<superscript>A</superscript>T<subscript>E</subscript>X.</para>
  </section> 
</chapter> 
</book>

The XSLT language lets us create templates for each type of element in an XML document. In our example, there are only three which need handling, as we did not create chapter or section titles (DocBook requires them to be present, but they don’t have to be used).

para, for the paragraph[s];
ulink, for the URIs;
emphasis, for the italicisation.

I’m going to cheat over the superscripting and subscripting of the letters in the LATEX logo, and use my editor to replace the whole thing with the \LaTeX command. In the other three cases, we already know how LATEX deals with these, so we can write the templates accordingly.
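If you would rather not do that replacement by hand, it can be scripted. The following is a minimal sketch, assuming the logo always appears on a single line exactly as AbiWord wrote it in the file above; fix_latex_logo is an illustrative name, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: replace AbiWord's superscript/subscript rendering of the
# logo with the \LaTeX command. Reads standard input, writes standard
# output. fix_latex_logo is an illustrative name.
fix_latex_logo() {
  # | is used as the sed delimiter because the pattern contains /
  # \\ in the replacement produces a single literal backslash
  sed 's|L<superscript>A</superscript>T<subscript>E</subscript>X|\\LaTeX|g'
}
```

Used as fix_latex_logo < para.dbk > para-fixed.dbk, this leaves everything else in the file untouched. A logo that has been split across two lines by the exporter would not be matched, which is one reason to check the result by eye.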

Writing XSLT is not hard, but requires a little learning. The output method here is text, which is LATEX’s file format (XSLT can also output HTML and other flavours of XML).

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>\documentclass{article}\usepackage{url}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="book">
    <xsl:text>\begin{document}</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>\end{document}</xsl:text>
  </xsl:template>

  <xsl:template match="para">
    <xsl:apply-templates/>
    <xsl:text>&#x0a;</xsl:text>
  </xsl:template>

  <xsl:template match="ulink">
    <xsl:apply-templates/>
    <xsl:text>\footnote{\url{</xsl:text>
    <xsl:value-of select="@url"/>
    <xsl:text>}}</xsl:text>
  </xsl:template>

  <xsl:template match="emphasis">
    <xsl:text>\emph{</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>

</xsl:stylesheet>

The first template matches /, which is the document root (before the book start-tag). At this stage we output the text which will start the LATEX document, \documentclass{article} and \usepackage{url}.
The apply-templates instruction tells the processor to carry on processing, looking for more matches. XML comments get ignored, and any elements which don’t match a template simply have their contents passed through until the next match occurs, or until plain text is encountered (and output).
The book template outputs the \begin{document} command, invokes apply-templates to make it carry on processing the contents of the book element, and then at the end, outputs the \end{document} command.
The para template just outputs its content, but follows it with a linebreak, using the hexadecimal character code x0A.
The ulink template outputs its content but follows it with a footnote using the \url command to output the value of the url attribute.
The emphasis template surrounds its content with \emph{ and }.

If you run this through Saxon, which is an XSLT processor, you can output a LATEX file which you can typeset (see Figure 8.4 below).

$ java -jar saxon-he-10.3.jar -s:para.dbk -xsl:para.xsl -o:para.ltx
$ xelatex para.ltx
This is XeTeX, Version 3.14159265-2.6-0.999991 (TeX Live
  2019/Debian) (preloaded format=xelatex) \write18 enabled.
entering extended mode
LaTeX2e <2020-02-02> patch level 2
L3 programming layer <2020-02-14>
(./para.tex
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
  Document Class: article 2019/12/20 v1.4l
  Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size11.clo))
(/home/peter/texmf/tex/latex/geometry/geometry.sty
(/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/share/texlive/texmf-dist/tex/generic/iftex/ifpdf.sty
(/usr/share/texlive/texmf-dist/tex/generic/iftex/iftex.sty))
(/usr/share/texlive/texmf-dist/tex/generic/iftex/ifvtex.sty)
(/usr/share/texlive/texmf-dist/tex/generic/iftex/ifxetex.sty)
(/home/peter/texmf/tex/latex/geometry/geometry.cfg))
(/usr/share/texlive/texmf-dist/tex/latex/base/textcomp.sty)
(/usr/share/texlive/texmf-dist/tex/latex/l3backend/l3backend-xdvipdfmx.def)
(./para.aux)
(/usr/share/texlive/texmf-dist/tex/latex/base/ts1cmr.fd)
*geometry* driver: auto-detecting
*geometry* detected driver: xetex
[1] (./para.aux) )
Output written on para.pdf (1 page).
Transcript written on para.log.
$

Figure 8.4: The typeset paragraph and its generated source code

\documentclass{article}\usepackage{url}\begin{document} 
You can of course buy and install a fully-fledged commercial XML
editor with XSLT support, and run this application within it. However,
this is beyond the reach of many users, so to do this unaided you just
need to install three pieces of software:
\emph{Java}\footnote{\url{http://java.com/download/}},
\emph{Saxon}\footnote{\url{http://saxon.sourceforge.net}}, and the
DocBook 4.2
DTD\footnote{\url{http://www.docbook.org/xml/4.2/index.html}} (URIs
are correct at the time of writing). None of these has a visual
interface: they are run from the command-line in the same way as is
possible with \LaTeX.
\end{document}

This is a relatively trivial example, but it serves to show that it’s not hard to output LATEX from XML — this document is produced in exactly this way.


The former OpenOffice was taken over by Apache, and is no longer regarded as a contender.
Strictly speaking it isn’t output at this stage: XML processors build a ‘tree’ (a hierarchy) of elements in memory, and they only get ‘serialised’ at the end of processing, into a stream of characters written to a file.