Formatting Information: Conversion — Converting out of LATEX


Formatting Information — An introduction to typesetting with LATEX

Chapter 8: Conversion

In this section…

Conversion to Word
Conversion to HTML
Conversion to XML
Conversion to plain text
Authoring with LATEX and XML

Converting LATEX to other formats is much harder to do comprehensively than converting into LATEX. As noted before, the LATEX file format really requires a LATEX processor in order to handle all the packages and macros, because LATEX is entirely reprogrammable, and there is therefore no telling what complexities authors have added themselves by redefining things (what a lot of this book is about!).

However, if you have stuck to the standard commands given in the LATEX book, and not defined anything extra or redefined anything else, and have used only a small set of the most common packages, most of the converters shown here will do a good job.

Many authors and editors rely on custom-designed or homebrew converters, often written in the standard shell scripting languages (Unix shell commands, Perl, Python, Tcl, Lua, etc). Some of the public packages presented here are also written in the same languages, but they have some advantages and restrictions compared with private conversions:

Conversion done with the standard utilities (eg awk, tr, sed, grep, detex, etc) can be faster for one-off or ad hoc transformations, but it is easier to obtain consistency and a more sophisticated final product using a converter written to handle a wider range of LATEX features.
Embedding homebrew non-standard control sequences in LATEX source code (a common habit of authors) may be tempting, to make it easier for the author to edit and maintain, but will always make it harder to convert to another system.
Most of the converters mentioned here provide a fast and reasonably reliable way to get LATEX documents into Word, HTML, and other forms of XML in an acceptable — if not optimal — format, even if your primary target is eventually to convert to TEI, Journal Article Tag Suite (JATS), DocBook, or some other vocabulary, because once it’s in well-formed XML of one kind, translation to another vocabulary is much easier.
Above all, it is essential to understand that no conversion will produce an error-free result except for the most trivial of documents. All other output will require post-editing to correct things the converter was unable to handle, add things it missed, change formatting it was unable to apply, and delete things that should not have appeared.
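
To illustrate the trade-off between homebrew and public converters, here is a minimal sketch in Python of the kind of private, regular-expression converter described above. The command-to-tag tables and the function name convert are hypothetical, chosen only for illustration; the sketch handles a handful of one-argument commands and nothing else.

```python
import re

# Hypothetical, deliberately tiny mapping from LaTeX commands to HTML tags.
INLINE = {"emph": "em", "textbf": "strong", "texttt": "code"}
HEADINGS = {"section": "h2", "subsection": "h3"}

# Matches \name{argument} where the argument contains no nested braces.
COMMAND = re.compile(r"\\([A-Za-z]+)\{([^{}]*)\}")


def convert(latex):
    """Replace known one-argument commands with HTML; leave the rest alone."""

    def replace(match):
        name, arg = match.group(1), match.group(2)
        tag = INLINE.get(name) or HEADINGS.get(name)
        if tag is None:
            # Unknown (perhaps author-defined) command: leave it untouched
            # so that it stays visible for post-editing.
            return match.group(0)
        return f"<{tag}>{arg}</{tag}>"

    previous = None
    # Repeat until nothing changes, so nested commands convert inside-out.
    while previous != latex:
        previous = latex
        latex = COMMAND.sub(replace, latex)
    return latex
```

Nested standard commands convert cleanly, but a single author-defined command (here the made-up \product) is enough to stop the enclosing \emph from being converted as well, which is the kind of damage that has to be repaired in post-editing.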

There is a useful discussion of some of the alternatives mentioned below in § 3 of the lwarp package documentation (Dunne, 2020, pp 71–73). If you actually want to author in a hybrid format that enables XML output (rather than converting existing native LATEX documents), see § 8.2.5 below.

8.2.1 Conversion to Word

This is the most frequently-requested conversion, and also one of the hardest to do, because there are things LATEX can do that simply cannot be represented in Word in any meaningful manner. However, Word uses XML internally, and it is also possible to convert your LATEX to XHTML and import it (see below).

There are several programs on CTAN to do LATEX-to-Word and similar conversions, but they do not all handle everything LATEX can produce, and some only handle a subset of the built-in commands of default LATEX. Four in particular, however, have a good reputation:

latex2rtf

by Wilfried Hennings, Fernando Dorner, and Andreas Granzer translates LATEX into RTF, the opposite of the rtf2latex2e mentioned earlier (RTF can be read by most wordprocessors). This Open Source program preserves layout and formatting for most LATEX documents using standard built-in commands and obsolete codepages (not Unicode), but it has little support for redefined commands and common packages, including fontspec; and because it doesn’t recognise the array package, its support for advanced column specifications in tables is limited, although it does a good job on simple tables. If you need a conversion for wordprocessors that can’t read .docx files, this is a good place to start.

http://latex2rtf.sourceforge.net/

TEX2Word

by Kirill A Chikrii for Microsoft Windows is a commercial converter plug-in for Word to let it import TEX and LATEX documents. The author’s company claims that ‘virtually any existing TEX/LATEX package can be supported by TEX2Word’ because it is customisable.

http://www.chikrii.com/products/tex2word/

Pandoc

See the note ‘Pandoc’ above.

https://pandoc.org/

TEX4HT

See item ‘TEX4HT’ below.

https://tug.org/tex4ht/
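
Of the four programs above, Pandoc is the one most easily driven from a script. The following is a minimal sketch, not a recommended workflow: the function names and the EXTENSIONS table are hypothetical, and it assumes a pandoc executable is installed and on the search path. The options used (--from, --to, --standalone, --output) are standard Pandoc options.

```python
import shutil
import subprocess
from pathlib import Path

# Pandoc writer names mapped to conventional file extensions.
EXTENSIONS = {"docx": ".docx", "html": ".html", "plain": ".txt"}


def pandoc_command(source, target_format="docx"):
    """Build the argument list for converting a LaTeX file with Pandoc."""
    output = Path(source).with_suffix(EXTENSIONS[target_format])
    return [
        "pandoc",
        "--from=latex",
        f"--to={target_format}",
        "--standalone",
        f"--output={output}",
        str(source),
    ]


def convert(source, target_format="docx"):
    """Run Pandoc, failing early if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on the PATH")
    subprocess.run(pandoc_command(source, target_format), check=True)
```

Calling convert("thesis.tex") would write thesis.docx alongside the source; as with every converter here, expect to post-edit the result.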

One easy route into wordprocessing, however, is the reverse of the procedures suggested in the preceding section: convert LATEX to HTML, which many wordprocessors can read easily, using any of the packages in § 8.2.2 below. Once it’s in HTML, run it through Tidy (see item ‘Using HTML’ above) to make it well-formed XHTML, add some embedded Cascading Style Sheets (CSS) styling to the header manually, and rename the file to end in the obsolete .doc filetype, which can fool Word into opening it natively as if it were a Word file.
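
The last two steps of that route (adding embedded CSS to the header and renaming the file) can be sketched in Python. This assumes the file has already been converted to HTML and cleaned up with Tidy, so that it has exactly one head element; the function names embed_css and save_as_doc are hypothetical.

```python
from pathlib import Path


def embed_css(xhtml, css):
    """Insert a <style> element just before the closing </head> tag."""
    head_end = xhtml.find("</head>")
    if head_end == -1:
        raise ValueError("no </head> found: run the file through Tidy first")
    style = f'<style type="text/css">\n{css}\n</style>\n'
    return xhtml[:head_end] + style + xhtml[head_end:]


def save_as_doc(xhtml_path, css):
    """Write a styled copy of the XHTML file with a .doc filetype."""
    source = Path(xhtml_path)
    target = source.with_suffix(".doc")
    styled = embed_css(source.read_text(encoding="utf-8"), css)
    target.write_text(styled, encoding="utf-8")
    return target
```

For example, save_as_doc("paper.html", "h1 { font-family: sans-serif }") would leave paper.html untouched and write paper.doc next to it.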

Circular conversion

To the best of my knowledge, there is no off-the-shelf system that can convert circularly from LATEX to Word or XML/XHTML and back again, repeatedly, without serious loss of formatting. At each conversion, some document features will unavoidably be regularised to conventions of the target format which can no longer be represented in the source format. It may be possible for a very trivial document, but not for any real-life application.
This means that corporate, technical, or academic applications which depend on features of the Word interface such as Change Recording or the Style Margin cannot use LATEX as a distributed editing format. However, after all edits have been made, the Word document can of course be converted to LATEX for final typesetting.

8.2.2 Conversion to HTML

This probably runs Word a close second in frequency of demand. Conversion to HTML — or more probably XHTML — lets you put your LATEX document on the web in a format everyone can read, but as with Word, not everything you can do in LATEX can be represented in HTML, although with CSS3 you can get close.

Pandoc

See the note ‘Pandoc’ above.

https://pandoc.org/

lwarp

This is a LATEX package for producing HTML v.5 (HTML5) output, using external utility programs for the final conversion of text and images. Strictly speaking this is an authoring package, not a conversion: you have to write your document using the commands defined in the package, rather than normal LATEX. This makes it hard to use for handling existing documents, as they will need extensive editing before they can be processed. However, the package is under active development and supports a wide range of formatting packages.

The lwarp package is included in all TEX Live-based distributions.

LATEX2HTML

LATEX2HTML’s main task is to reproduce the document structure as a set of interconnected HTML files, so it is popular for creating multi-page web sites from a single large LATEX document. It outputs a directory named after the input filename, and all the output files are put in that directory, so the result is self-contained and can be uploaded to a web server as it stands. It supports mathematics via images, and can deal with the built-in commands and a small range of packages.

https://github.com/latex2html/latex2html/

TEX2page

This converts Plain TEX, LATEX, or Texinfo documents to HTML. Complex requirements can be configured in the TEX2page extension language (Common Lisp or Scheme). The authors have tried to make running TEX2page as similar to running LATEX as possible.

https://github.com/ds26gte/tex2page

TEX4HT

(TEX-for-HyperText) is an Open Source program which converts TEX and LATEX documents to various kinds of XML and to LibreOffice format (among others), which Word can open.

It operates differently from most other converters: it uses the TEX/LATEX program itself to process the file, and handles conversion in a set of postprocessors for the common LATEX packages. It can also output to XML, including TEI, DocBook, and the LibreOffice and Word XML formats, and it can create Texinfo-format manuals.

By default, documents retain the single-file structure of the original, but there is a set of configuration directives to make use of the features of hypertext and navigation, and to split files for ease of use on the web.

https://tug.org/tex4ht/

HEVEA

This is an Open Source translator from LATEX to HTML by Luc Maranget at the Institut national de recherche en sciences et technologies du numérique [originally Institut de recherche en informatique et en automatique] (Inria) in Paris. HEVEA runs on Unix & GNU/Linux systems, supports most of LATEX2ε, including macro definitions, and outputs HTML5. It can be customised via style files using LATEX code.

http://pauillac.inria.fr/~maranget/hevea/

TTH

This is an Open Source translator running on most platforms, predominantly for converting mathematical LATEX documents into HTML. TTH works with both Plain TEX and LATEX. Instead of using images of equations, it claims to translate them to actual HTML.

The author’s link at http://hutchinson.belmont.ma.us/tth/ seems to be dead, but the package is available as tth from CTAN.

GELLMU

Generalized Extensible LATEX-Like MarkUp (GELLMU) is a LATEX-like markup for authoring structured document types which can be converted to HTML, DocBook, TEI, or GELLMU’s own LATEX-like document type ‘article’. The source language markup most closely resembles actual LATEX source markup: much of the markup vocabulary is the same as that of actual LATEX.

Documents are processed via the SGMLS Perl module and elisp routines in Emacs, and can output LATEX, classic HTML, or XHTML+MathML.

Available as package gellmu from CTAN.

plasTEX

plasTEX is a LATEX document processing framework. It comes bundled with an XHTML renderer (including multiple themes), as well as a way to simply dump the LATEX document to a generic form of XML. Other renderers can be added, including Unix & GNU/Linux man pages, DocBook, and EPUB v3.

It works by processing LATEX documents into XML Document Object Model (DOM)-like objects which can be used to generate various types of output. Many options can be set, including controlling splitting into multiple files and adding CSS files.

https://plastex.github.io/plastex/

8.2.3 Conversion to XML

Pandoc

See the note ‘Pandoc’ above.

https://pandoc.org/

LATEXML

LATEXML provides a conversion to an intermediate XML vocabulary which can be used to create industry publishing XML formats such as DocBook, TEI, and JATS, and even XHTML for EPUB v3.

LATEXML is at https://math.nist.gov/~BMiller/LaTeXML/.

Tralics

Tralics comes from the Apics and Marelle teams at Inria. It creates an XML document of its own design, representing everything it finds in the LATEX file, using an error element type for anything it cannot handle. In a way it is similar to LATEXML but using a different vocabulary, and it too has an extensive configuration mechanism to tune it for specific types or classes of document.

https://www-sop.inria.fr/marelle/tralics/

TEX4HT

See item ‘TEX4HT’ above.

https://tug.org/tex4ht/

latex2tei

This is a new converter written in Python by Marta Materni and available from Github. It converts explicitly to XML in the Text Encoding Initiative (TEI) format only, targeting the Digital Humanities community for publication and research.

https://github.com/digiflor/Latex2TEI
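
The two-stage design of LATEXML described above (LATEX to an intermediate XML file, then that XML to a publishing format) can be sketched as a pair of commands. This is an illustration under stated assumptions, not the project's prescribed workflow: it assumes the latexml and latexmlpost programs are installed, uses only their --dest option, and relies on latexmlpost choosing the output format from the destination filetype. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def latexml_pipeline(source, final_suffix=".html"):
    """Return the two commands: LaTeX to XML, then XML to the final format."""
    tex = Path(source)
    xml = tex.with_suffix(".xml")
    final = tex.with_suffix(final_suffix)
    return [
        ["latexml", f"--dest={xml}", str(tex)],
        ["latexmlpost", f"--dest={final}", str(xml)],
    ]


def run_pipeline(source):
    """Run both stages in order, stopping at the first failure."""
    for command in latexml_pipeline(source):
        subprocess.run(command, check=True)
```

Keeping the intermediate XML file is deliberate: it is the well-formed starting point from which translation to another vocabulary such as TEI or JATS becomes much easier.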

8.2.4 Conversion to plain text

When all else fails, you can always convert your document to plain, unmarked text.

Pandoc

See the note ‘Pandoc’ above.

https://pandoc.org/

Text extraction

If you have the full version of Adobe Acrobat Reader (or one of several other commercial PDF products), you can open a PDF file created by LATEX, select and copy all the text, and paste it into your wordprocessor, and it will retain some common formatting of headings, paragraphs, and lists. This still requires the text to be edited into shape, but enough of the formatting should be preserved to make it worthwhile for short documents.

Otherwise, use the Open Source pdftotext utility to extract everything from the PDF file as plain (paragraph-formatted) text, and open that in a plain-text editor.

If you need HTML, there is a Java utility from Apache, the web server project, called pdfbox (see item ‘Using PDF’ above), which can extract the text from a PDF document in HTML format, preserving the bold and italics, which can save a lot of time.

Last resort: strip the markup

At worst, the detex program on CTAN will strip a LATEX file of all markup and leave just the raw unformatted text, which can then be re-edited. There are also programs to extract the raw text from DVI and PS files.
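
The choice between the two command-line routes above can be sketched in Python as follows. It assumes the pdftotext and detex programs are installed; the function names are hypothetical, and no attempt is made to cover DVI or PS files.

```python
import subprocess
from pathlib import Path


def extraction_command(source):
    """Pick a plain-text extraction command from the filetype."""
    path = Path(source)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        # pdftotext writes its output to the named text file.
        return ["pdftotext", str(path), str(path.with_suffix(".txt"))]
    if suffix == ".tex":
        # detex writes the stripped text to standard output.
        return ["detex", str(path)]
    raise ValueError(f"no extraction route sketched for {suffix!r} files")


def extract_text(source):
    """Run the chosen command and return the extracted text."""
    command = extraction_command(source)
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    if command[0] == "detex":
        return result.stdout
    return Path(command[2]).read_text(encoding="utf-8", errors="replace")
```

Either way the result is raw text that still has to be re-edited into shape.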

8.2.5 Authoring with LATEX and XML

A number of systems have been mentioned which allow you to write your documents in a slightly different way to standard LATEX but still using the same syntax (item ‘GELLMU’ above and item ‘lwarp’ above, for example). There is also a document class called internet but reports indicate that it is no longer being developed.


There was one once, in the mid-1990s, actually made by Microsoft: SGML Author for Word. It wasn’t an editor as its name suggested, but a converter that used Named Styles to convert losslessly from SGML to Word and back, repeatedly, so that non-tech management could edit tech documents. Just as XML was taking off, they dropped it on the floor. Go figure. See https://cora.ucc.ie/bitstream/handle/10468/1690/Human-Interfaces-to-Structured-Documents.pdf#page=393 and http://xml.silmaril.ie/downloads/sgml-author-review.pdf for more details.