Formatting Information: Conversion

Formatting Information — An introduction to typesetting with LATEX

As we saw right at the start, LATEX uses plaintext files, so they can be read and written by any standard application that can open text files. This helps preserve your information over time, as the plaintext format cannot be obsoleted or hijacked by any manufacturer or sectoral interest, and it will always be readable on any computer, from your smartphone (LATEX is available for many handhelds, from old PDAs, see Figure 8.1 below, to Android devices, see Figure 8.2 below) through all desktops and servers right up to the biggest supercomputers.

Figure 8.1: LATEX editing and processing on the Sharp Zaurus 5500 PDA (panels: editing with zedit; running LATEX; displaying the PDF)

Figure 8.2: LATEX editing and processing on the Samsung Galaxy Note 4 (panels: typesetting with pdflatex; displaying the PDF)

However, LATEX is intended as the last stage of the editorial process: formatting for print or display. If you have a requirement to re-use the text in some other environment — a database perhaps, or on the Web or other media, or in Braille or voice output — then it should probably be edited, stored, and maintained in something neutral like the Extensible Markup Language (XML), and only converted to LATEX when a typeset copy is needed.
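
As a rough illustration of that workflow, the following Python sketch turns a tiny XML fragment into LATEX. The element names (doc, title, para, emph) and their mapping to LATEX commands are invented for this example. It is a toy under those assumptions, and a real conversion would normally use a proper stylesheet rather than a hand-written script like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical element-to-LaTeX mapping, invented for this illustration:
# each tag maps to the text emitted before and after its content.
WRAPPERS = {
    "title": ("\\section{", "}\n\n"),
    "para": ("", "\n\n"),
    "emph": ("\\emph{", "}"),
}

# Characters that LaTeX treats specially in running text.
SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape LaTeX special characters, one character at a time."""
    return "".join(SPECIALS.get(ch, ch) for ch in (text or ""))


def to_latex(element):
    """Recursively convert one element and its children to LaTeX."""
    before, after = WRAPPERS.get(element.tag, ("", ""))
    parts = [before, escape(element.text)]
    for child in element:
        parts.append(to_latex(child))
        # Text that follows a child element belongs to the parent.
        parts.append(escape(child.tail))
    parts.append(after)
    return "".join(parts)


SOURCE = (
    "<doc><title>Costs &amp; benefits</title>"
    "<para>Savings of 50% are <emph>possible</emph>.</para></doc>"
)

print(to_latex(ET.fromstring(SOURCE)))
```

The XML stays the master copy, and the LATEX is regenerated from it whenever a typeset version is wanted, so nothing is ever edited in the LATEX output itself.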

Although LATEX has many structured-document features in common with SGML and XML, it can still only be processed by the LATEX programs. Because its macro features make it almost infinitely redefinable, processing it requires a program which can unravel arbitrarily complex macros, and LATEX and its siblings are the only programs which can do that effectively. Like other typesetters and formatters (Quark XPress, Adobe InDesign and PageMaker, FrameMaker, Microsoft Publisher, 3B2, etc), LATEX is largely a one-way street leading to typeset printing or display formatting.

Converting LATEX to some other format therefore means you will unavoidably lose some formatting, as LATEX has features that other systems simply don’t possess, so they cannot be translated, although there are several ways to minimise this loss or compensate for it. Similarly, converting other formats into LATEX often means editing back the stuff the other formats omit because they only store appearances, not structure.

Most converters are one-way: they convert either into LATEX or out of LATEX. There are several excellent systems for converting LATEX directly to HyperText Markup Language (HTML), so you can at least publish it on the web, as we shall see in § 8.2 below.

Pandoc

There is one system that does conversion in both directions, and includes a huge range of formats: Pandoc. This is a large library of Haskell routines for handling conversions, with a command-line front end. Supported formats include Word, LibreOffice, LATEX, DocBook, EPUB v3, InDesign, Markdown, and dozens of others, even JavaScript Object Notation (JSON).
Before trying the systems described in § 8.1 below and § 8.2 below, see if Pandoc will handle your files. It is less specialised than some of the other converters, but it does provide those additional input formats. The exception is probably converting from XML to LATEX, for which a robust Extensible Stylesheet Language Transformations (XSLT 3) script is really the only reliable solution.
https://pandoc.org/
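
To give a feel for the command-line front end, here is a minimal Python sketch that builds and runs a Pandoc command. The file names are hypothetical, and the helper functions pandoc_command and convert are invented for this example. The only Pandoc options used are -o (output file), -f (input format), -t (output format), and -s (standalone document). When -f and -t are omitted, Pandoc infers the formats from the file extensions.

```python
import shutil
import subprocess


def pandoc_command(source, target, from_format=None, to_format=None,
                   standalone=True):
    """Build the argument list for one Pandoc conversion."""
    command = ["pandoc", source, "-o", target]
    if from_format:
        command += ["-f", from_format]
    if to_format:
        command += ["-t", to_format]
    if standalone:
        # Ask for a complete document rather than a fragment.
        command.append("-s")
    return command


def convert(source, target, **options):
    """Run Pandoc if it is installed; return True on success."""
    if shutil.which("pandoc") is None:
        return False
    result = subprocess.run(pandoc_command(source, target, **options))
    return result.returncode == 0


# Hypothetical file names, shown only to illustrate the two directions.
print(pandoc_command("thesis.tex", "thesis.docx"))
print(pandoc_command("notes.md", "notes.tex",
                     from_format="markdown", to_format="latex"))
```

Calling convert("thesis.tex", "thesis.docx") would then attempt the conversion out of LATEX, and swapping the arguments to a Markdown source and a .tex target would go the other way. Expect to check the result by hand in either direction.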

Most of the utilities listed below are Open Source or free-to-use software. There are many commercial solutions as well, either the software itself or a service where they do it for you, but they are mostly aimed at large-scale business conversion, and are usually too expensive for domestic or academic single documents.

The former OpenOffice was taken over by Apache, and is no longer regarded as a contender.
Strictly speaking it isn’t output at this stage: XML processors build a ‘tree’ (a hierarchy) of elements in memory, and they only get ‘serialised’ at the end of processing, into a stream of characters written to a file.
There was one once, in the mid-1990s, actually made by Microsoft: SGML Author for Word. It wasn’t an editor as its name suggested, but a converter that used Named Styles to convert losslessly from SGML to Word and back, repeatedly, so that non-technical management could edit technical documents. Just as XML was taking off, they dropped it on the floor. Go figure. See https://cora.ucc.ie/bitstream/handle/10468/1690/Human-Interfaces-to-Structured-Documents.pdf#page=393 and http://xml.silmaril.ie/downloads/sgml-author-review.pdf for more details.