Your support for our advertisers helps cover the cost of hosting, research, and maintenance of this document

Formatting Information — An introduction to typesetting with LATEX

Chapter 8: Compatibility

Section 8.2: Converting out of LATEX

Before looking at one-way systems, see the earlier note about Pandoc the last para ‘However, there is one system that …’ in Chapter 8.

Converting LATEX to other formats is much harder to do comprehensively. As noted before, the LATEX file format really requires the LATEX program itself in order to process all the packages and macros, because there is no telling what complexities authors have added themselves (what a lot of this book is about!).

Many authors and editors rely on custom-designed or homebrew converters, often written in the standard shell scripting languages (Unix shells, Perl, Python, Tcl, etc). Although some of the packages presented here are also written in the same languages, they have some advantages and restrictions compared with private conversions:

  • Conversion done with the standard utilities (eg awk, tr, sed, grep, detex, etc) can be faster for ad hoc translations, but it is easier to obtain consistency and a more sophisticated final product using LATEX2HTML or TEX4ht — see below — or one of the other systems available.

  • Embedding additional non-standard control sequences in LATEX source code may make it harder to edit and maintain, and will definitely make it harder to port to another system.

  • Both the above methods (and others) provide a fast and reasonable reliable way to get documents authored in LATEX onto the Web in an acceptable — if not optimal — format.

  • LATEX2HTML was written to solve the problem of getting LATEX-with-mathematics onto the Web, in the days before MathML and math-capable browsers. TEX4ht was written to turn LATEX documents into Web hypertext — mathematics or not.

8.2.1 Conversion to Word

There are several programs on CTAN to do LATEX-to-Word and similar conversions, but they do not all handle everything LATEX can throw at them, and some only handle a subset of the built-in commands of default LATEX. Two in particular, however, have a good reputation, although I haven’t used either of them (I tend to stay as far away from Word as possible):

  • latex2rtf by Wilfried Hennings, Fernando Dorner, and Andreas Granzer translates LATEX into RTF — the opposite of the rtf2latex2e mentioned earlier. RTF can be read by most wordprocessors, and this program preserves layout and formatting for most LATEX documents using standard built-in commands.

  • Kirill A Chikrii’s TEX2Word for Microsoft Windows is a converter plug-in for Word to let it open TEX and LATEX documents. The author’s company claims that ‘virtually any existing TEX/LATEX package can be supported by TEX2Word’ because it is customisable.

One easy route into wordprocessing, however, is the reverse of the procedures suggested in the preceding section: convert LATEX to HTML, which many wordprocessors read easily. The following sections cover two packages for this. Once it’s in HTML, you could run it through Tidy to make it XHTML, add some embedded styling using Cascading Style Sheets (CSS), and rename the file to end in .doc, which can fool Word into opening it natively.

8.2.2 LATEX2HTML

As its name suggests, LATEX2HTML is a system to convert LATEX structured documents to HTML. Its main task is to reproduce the document structure as a set of interconnected HTML files. Despite using Perl, LATEX2HTML relies very heavily on standard Unix facilities like the NetPBM graphics package and the pipe syntax. Microsoft Windows is not well suited to this kind of composite processing, although all the required facilities are available for download in various forms and should in theory allow the package to run — but reports of problems are common.

  • The sectional structure is preserved, and navigational links are generated for the standard Next, Previous, and Up directions.

  • Links are also used for the cross-references, citations, footnotes, ToC, and lists of figures and tables.

  • Conversion is direct for common elements like lists, quotes, paragraph-breaks, type-styles, etc, where there is an obvious HTML equivalent.

  • Heavily formatted objects such as math and diagrams are converted to images.

  • There is no support for homebrew macros.

There is, however, support for arbitrary hypertext links, symbolic cross-references between ‘evolving remote documents’, conditional text, and the inclusion of raw HTML. These are extensions to LATEX, implemented as new commands and environments.

LATEX2HTML outputs a directory named after the input filename, and all the output files are put in that directory, so the output is self-contained and can be uploaded to a server as it stands.

8.2.3 TEX4ht

TEX4ht operates differently from LATEX2HTML: it uses the TEX program to process the file, and handles conversion in a set of postprocessors for the common LATEX packages. It can also output to XML, including Text Encoding Initiative (TEI) and DocBook, and the OpenOffice and WordXML formats, and it can create TEXinfo format manuals.

By default, documents retain the single-file structure implied by the original, but there is again a set of additional configuration directives to make use of the features of hypertext and navigation, and to split files for ease of use. This is a most powerful system, and probably the most flexible way to do the job.

8.2.4 Extraction from PostScript and PDF

If you have the full version of Adobe Acrobat Reader (or one of several other commercial PDF products), you can open a PDF file created by PDFLATEX, select and copy all the text, and paste it into Word and some other wordprocessors, and retain some common formatting of headings, paragraphs, and lists. Both solutions still require the wordprocessor text to be edited into shape, but they preserve enough of the formatting to make it worthwhile for short documents. Otherwise, use the pdftotext program to extract everything from the PDF file as plain (paragraph-formatted) text.

8.2.5 Last resort: strip the markup

At worst, the detex program on CTAN will strip a LATEX file of all markup and leave just the raw unformatted text, which can then be re-edited. There are also programs to extract the raw text from DVI and PostScript (PS) files.