Formatting Information: Writing documents — Dimensions, hyphenation, justification, and breaking

Formatting Information — An introduction to typesetting with LATEX

Chapter 1: Writing documents

In this section…

Specifying size units
Hyphenation
Breakable and unbreakable text
Dashes
Justification
Languages

LATEX’s internal measurement system is extremely accurate. The underlying TEX engine conducts all its business in units smaller than the wavelength of visible light, so if you ask for 15mm space, that’s what you’ll get — within the limitations of your screen or printer, of course. While modern high-resolution displays use pixels smaller than you can easily see, many older screens cannot show dimensions of less than 1⁄96″ without resorting to magnification or scaling; and on printers, even at 600dpi, fine oblique lines or curves can still sometimes be seen to stagger the dots.

Figure 1.3: Some parts of a piece of metal type

Based on an original image by Victor Puebla (https://mimoriarty.wordpress.com/about/)

At the same time, many dimensions in LATEX’s preprogrammed formatting are specially set up to be flexible: so much space, plus or minus certain limits to allow the system to make its own adjustments to accommodate variations like overlong lines, unevenly-sized images, displayed equations, and non-uniform spacing around headings. This is very different from the ‘grid’ system used in many other typesetting and DTP systems.
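
As a small illustration of such a flexible dimension (a sketch only: the values are arbitrary and not taken from any particular document class), a length can be given a natural size together with the amounts by which it is allowed to stretch and shrink:

```latex
% In the Preamble: 6pt of space between paragraphs, which LaTeX may
% stretch by 2pt (more, at the cost of a poorer fit) or shrink by
% at most 1pt when it needs to make a page fit
\setlength{\parskip}{6pt plus 2pt minus 1pt}
```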

TEX uses a very sophisticated justification algorithm to achieve a smooth, even texture to normal paragraph text by justifying a whole paragraph at a time, quite unlike the line-by-line approach used in most wordprocessors and DTP systems.

Occasionally, however, you will need to hand-correct an unusual word-break or line-break, and there are facilities for doing this on individual occasions as well as automating it for use throughout a document.

1.10.1 Specifying size units

Most people in the printing and publishing industry in English-speaking cultures habitually use the traditional printers’ points, picas and ems as well as cm and mm when dealing with clients. Many older English-language speakers (and most North Americans) still use inches. In continental European and related cultures, Didot points and Ciceros (Didot picas) are also used professionally, but cm and mm are standard everywhere else: inches are largely obsolete and only used now when communicating with North American cultures.

Table 1.4: Units in LATEX

Unit	Size
	Printers’ fixed measures
pt	Anglo-American standard points (72.27 to the inch)
pc	Pica ems (12pt)
bp	Adobe’s ‘big’ points (exactly 72 to the inch)
sp	TEX’s internal ‘scaled’ points (65,536 to the pt)
dd	Didot (European standard) points (67.54 to the inch)
cc	Ciceros (European pica ems, 12dd)
	Printers’ relative measures
em	Ems of the current point size (historically the width of a letter ‘M’ but see Figure 1.4 below)
ex	x-height of the current font (height of a letter ‘x’)
	Other measures
cm	centimeters (2.54 to the inch)
mm	millimeters (25.4 to the inch)
in	inches (obsolete except in UK and parts of North America)

You can specify lengths in LATEX in any of these units, plus some others (see Table 1.4 above).

The em can cause beginners some puzzlement because it’s a relative measurement based on the ‘point size’ of the type, so 1em in 12pt type is half the size of 1em in 24pt type. The point size of type itself is also historically misleading: it refers to the depth of the metal body on which foundry type was cast in the days of metal typesetting. It does not refer to the visible height of the letters themselves when printed (see Figure 1.3 above). So the letter-size of 10pt type in one typeface can be radically different from 10pt type in another (look at Figure 1.4 below, where the widths are given for 10pt type).

An em is the height of the type-body in a specific size, so 1em of 10pt type is 10pt and 1em of 18pt type is 18pt. A 1em space is called a ‘quad’, so a 24pt quad is 24pt×24pt. LATEX has a command \quad for leaving exactly that much horizontal space. A special name is given to the 12pt em because it is so common: a ‘pica’ em (from the old name for 12pt type). A pica has become a fixed measure in its own right of exactly 12pt, and LATEX has a dimension ‘pc’ for this, so 15pc is 15×12pt long.
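
A minimal sketch of these units in use follows; the particular lengths are arbitrary and chosen only for illustration:

```latex
\documentclass{article}
\begin{document}
% Fixed units: each gap is the same whatever the font size
A\hspace{15mm}B\hspace{1in}C\hspace{6pc}D

% Relative units: a 1em gap scales with the current font size
{\small x\hspace{1em}y} {\Huge x\hspace{1em}y}

% \quad leaves 1em of horizontal space in the current font
left\quad right

% A rule 15pc (15 x 12pt = 180pt) long and 0.4pt thick
\noindent\rule{15pc}{0.4pt}
\end{document}
```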

To highlight the differences between typefaces at the same size, Figure 1.4 below shows five capital Ms in different faces, surrounded by a box exactly 1em of those sizes wide, and showing the actual width of each M when set in 10pt type. Because of the different ways in which typefaces are designed, none of them is exactly 10pt wide.

Figure 1.4: An M of type of different faces boxed at 1em

The red line is the common baseline. Surrounding letters in grey are for illustration of the actual extent of the height and depth of one em of the current type size.

If you are working with other DTP users, watch out for those who think that Adobe points (bp) are the only ones. The difference between an Adobe big-point and the standard point is only .27pt per inch, but in 10″ of text (a full page of A4) that’s 2.7pt, which is nearly 1mm, enough to be clearly visible if you’re trying to align one sample with another.
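
The arithmetic behind that figure can be checked as follows. This is a worked sketch using the conversion factors from Table 1.4, and it assumes the amsmath package for the alignment:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align*}
10\,\mathrm{in} &= 722.7\,\mathrm{pt} = 720\,\mathrm{bp}\\
722.7\,\mathrm{pt} - 720\,\mathrm{pt} &= 2.7\,\mathrm{pt}\\
2.7\,\mathrm{pt} &= \frac{2.7}{72.27}\times 25.4\,\mathrm{mm}
                  \approx 0.95\,\mathrm{mm}
\end{align*}
\end{document}
```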

1.10.2 Hyphenation

LATEX hyphenates automatically according to the language you use (see § 1.10.6 below). To specify different breakpoints for an individual word, you can insert soft-hyphens (discretionary hyphens), done with the \- command (backslash-hyphen) wherever you need them, for example:

When in Mexico, we visited Popo\-ca\-tépetl by 
helicopter.

If the word needs to be hyphenated, the best-fit of the points will be used, and the rest ignored.

To specify hyphenation points for all occurrences of a word in the document, use the \hyphenation command in your Preamble (see the sidebar ‘The Preamble’ above) with one or more words as patterns in its argument, separated by spaces; in this case using the normal hyphen to indicate permitted break-points. This will even let you break ‘helico-pter’ correctly.

\hyphenation{helico-pter Popo-ca-tépetl vol-ca-no}

If you have frequent hyphenation problems with long, unusual, or technical words, ask an expert about changing the value of \spaceskip, which controls the flexibility of the space between words. This is not something you would normally want to do without advice, as it can change the appearance of your document quite significantly.

If you are using a lot of unbreakable text (see the next section and also § 4.7.1 below) it may also cause justification problems: you can turn justification off with \raggedright.

1.10.3 Breakable and unbreakable text

Unbreakable text is the opposite of discretionary hyphenation. To force LATEX to treat a word as unbreakable, use the \mbox command:

\mbox{pneumonoultramicroscopicsilicovolcanoconiosis}

This may have undesirable results, however, if you subsequently change margins or the size of the text: pneumonoultramicroscopicsilicovolcanoconiosis, although if you’re reading this in a browser, you probably won’t see the effect properly: look at the PDF.

Another option, for recurring words, is to use the \hyphenation command as shown in § 1.10.2 above, but give the word or words with no hyphens at all, which stops them having any break-points.
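
For instance, a sketch of such a declaration (the second word is just an arbitrary illustration):

```latex
% In the Preamble: a word listed with no hyphens has no permitted
% break-points, so it is never hyphenated anywhere in the document
\hyphenation{pneumonoultramicroscopicsilicovolcanoconiosis database}
```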

To tie two words together with an unbreakable space (hard space), use a tilde (~) instead of the space (see the list in § 1.7 above). This will print as a normal space but LATEX will never break the line at that point.

A normal space between words is always a candidate for a place to break the text into lines, and the word-spacing gets evened-out between all the remaining words in the paragraph (not just the line)...with one exception: a full point (period) after a lowercase letter is treated in LATEX as the end of a sentence, and it automatically gets a little more space before the next word. You do not (indeed SHOULD NOT) type any extra space yourself between sentences.

However, after an abbreviation in mid-sentence like ‘Prof.’, the full point does not end the sentence, so we need a way to tell LATEX that this should be a normal space. The command for doing this is \␣ (backslash-space: I have made the space visible here so you can see it, but it’s just a normal space). This prevents LATEX from adding the extra sentence-space, and the space remains a normal breakpoint (if you don’t want a break there, use the tilde as described above).

For example, it would look odd to split the author’s name Prof. D.E. Knuth over a line-end. It’s a good idea to make adding the non-sentence space standard typing practice for things like people’s titles and initials followed by their surname, as in Prof.\␣D.E.~Knuth (the visible space character is shown for emphasis: you just type a normal space).
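
Typed into a document, the two kinds of space might look like this (a minimal sketch; the sentences themselves are only placeholders):

```latex
\documentclass{article}
\begin{document}
% "\ " after the abbreviation gives a normal, breakable inter-word space;
% "~" ties the initials to the surname so no line-break can occur there
The algorithm is described by Prof.\ D.E.~Knuth in his book.

% No markup is needed at a real sentence end:
% LaTeX adds the extra space by itself
It works. The next sentence starts here.
\end{document}
```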

1.10.4 Dashes

The hyphen (-) is only used for hyphenated compound words like editor-in-chief. LATEX inserts its own hyphens when it needs to break a word at the right-hand margin.

Dashes are different: they’re longer and they are used in different places. Check the sidebar ‘If you don’t have accented letters on your keyboard’ above for how to find these characters in your computer’s character-map.

Long dash

The long dash — what printers call an ‘em rule’ like this — is used to separate a short phrase from the surrounding text in a similar way to parentheses. If you’re using XƎLATEX, you can just type the long dash on your keyboard.

If you can’t find the character, type three hyphens together, like---this: LATEX will recognise this combination and replace it with a real em rule.
If you want space either side, bind the first hyphen to the preceding word with a tilde like~---␣this and use a normal space after the third hyphen (shown as a visible space here, but it’s just a normal space). This avoids the line being broken before the dash.

The difference between spaced and unspaced rules is purely æsthetic, but different cultures have different conventions (see the tip ‘Em rules vs En rules’ below). NEVER use a single hyphen for this purpose.

Em rules vs En rules

In a discussion on the TYPO-L mailing list, Yateendra Joshi observed:
[…] unspaced em dashes are standard in US publishing, whether the dashes occur in pairs enclosing parenthetical matter or come singly before the last part of a sentence. In the UK and Europe, I often see spaced en dashes when they occur in pairs but an unspaced em dash when it occurs singly.
Leila Singleton wrote:
[…] unspaced dashes are the standard for the US publishing industry, as it typically references the MLA Handbook (used by books + journals) to establish stylistic conventions. It's worth mentioning that the Associated Press Stylebook (used for newspapers and sometime magazines) instead calls for spaces. It's my understanding that an en dash in British usage is equivalent to an em dash in American usage, and that it's spaced whether it appears as a single or a pair …
Christopher R Maden wrote:
[I learned] that Jan Tschichold’s influential design for Penguin Books included spaced en-dashes instead of em-dashes, and that directive (and a few others) saw wide uptake throughout British typography.

Short dash

The short dash is used between digits like page ranges (35–47). Printers call this an ‘en rule’ and if you’re not using XƎLATEX you can get it by typing two hyphens together, as in 35--47. NEVER use a single hyphen for this purpose either.

Minus sign

If you want a minus sign, use math mode (see § 1.11 below) where you type a normal hyphen as part of a mathematical expression, so it occurs between math delimiters like \(x=y-z\) for x=y−z. DO NOT use the hyphen for a minus sign outside math mode.

There are other dashes for special purposes in the Unicode repertoire, but they are out of scope for this document.
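
The hyphen, the two rules, and the minus sign can be compared side by side in a short test document (a sketch for experimenting with; the sentences are placeholders):

```latex
\documentclass{article}
\begin{document}
An editor-in-chief uses one hyphen.            % hyphen in a compound word
See pages 35--47 for details.                  % en rule for a range
The long dash---unspaced---is an em rule.      % unspaced em rule
The long dash~--- spaced~--- is also possible. % spaced, no break before it
The result is \(x=y-z\), which may be \(-3\).  % minus sign, math mode only
\end{document}
```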

1.10.5 Justification

The default mode for typesetting in LATEX is justified (two parallel margins, with word-spacing adjusted automatically for the best optical fit). In justifying, LATEX will never add space between letters, only between words. The soul package can be used if you need letter-spacing (‘tracking’), but this is best left to the expert.

There are two commands \raggedright and \raggedleft which typeset with only one margin aligned. Ragged-right has the text ranged (aligned) on the left, and ragged-left has it aligned on the right. They MUST be used inside a group (curly-braces, for example: see the sidebar ‘Grouping’ below) to confine their action to a part of your text, otherwise all the rest of the document will be done that way. Put the command in your Preamble if you want the whole document like that. This paragraph is set ragged-right.

These modes also exist as environments, called flushleft (ragged-right) and flushright (ragged-left), which are more convenient when applying this formatting to a whole paragraph or more, like this one, set ragged-left.

\begin{flushright}
These modes also exist as environments
called flushleft and flushright which are more
convenient when applying this formatting to a
whole paragraph or more, like this one.
\end{flushright}

Ragged setting turns off hyphenation and indentation. There is a package ragged2e providing the command \RaggedRight (note the capitalisation) which retains hyphenation in ragged setting, useful when you have a lot of long words. There’s a \RaggedLeft and a \Centering, too.
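
A minimal sketch of \RaggedRight confined to a group might look like this; note that the paragraph has to end (here with \par) before the group closes, otherwise the setting is lost before the paragraph is typeset:

```latex
\documentclass{article}
\usepackage{ragged2e}   % provides \RaggedRight
\begin{document}
{\RaggedRight
This paragraph is set ragged-right, but long words such as
pneumonoultramicroscopicsilicovolcanoconiosis can still be
hyphenated at the end of a line.\par}

This paragraph is back to the default justified setting.
\end{document}
```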

To centre text, which is in effect both ragged-right and ragged-left at the same time, use the \centering command inside a group, or use the center environment.

Be careful when centering headings or other display-size material: it’s one of the rare occasions when you may need to add a premature linebreak or forced newline (the double-backslash \\) to make the lines break at sensible pauses in the meaning (Flynn, 2012). Never rely on the automated line-breaking of editors in these cases.
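
For example, a centred display heading with a hand-placed break might be typed like this (a sketch only: the font commands are an arbitrary choice, not a recommended heading style):

```latex
\documentclass{article}
\begin{document}
\begin{center}
\Large\bfseries
Dimensions, hyphenation, justification,\\ % break after a complete phrase
and breaking
\end{center}
\end{document}
```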

White-space and the double backslash

The \\ command is not the same as a paragraph break: it’s just a premature linebreak within the current paragraph. The double backslash command can have an optional argument (in square brackets) giving an amount of extra white-space to leave, if you need to, eg
not the same as a paragraph break\\[3mm]
it's just a premature linebreak
(If you need to start the new line with a square bracket for some reason, you will need to prefix it with an empty group ({}) to prevent it being interpreted as the optional argument to \\.)
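
Both cases can be tried out in a small test document like this one (a sketch; the text is a placeholder):

```latex
\documentclass{article}
\begin{document}
\begin{center}
First line of a display\\[3mm]   % line-break plus 3mm of extra space
second line\\                    % plain line-break
{}[sic] third line               % the empty group protects the bracket
\end{center}
\end{document}
```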

1.10.6 Languages

LATEX can typeset in the native manner for several dozen languages. This affects hyphenation, word-spacing, indentation, and the automatic labelling of the parts of documents displayed in headings such as Chapter, Appendix, References, etc (but not the commands used to produce them).

Most distributions of LATEX come with US English and one or more other languages installed by default, but it is easy to use the babel or polyglossia package and specify any of the supported languages or variants, for example with babel:

\usepackage[german,french,english]{babel}
...
As one writer has noted, \selectlanguage{german}``Das
berühmte Voltaire-Zitat, \emph{\foreignlanguage{french}
{il est bon de tuer de temps en temps un amiral pour
encourager les autres}}, ist ein Beispiel sarkastischer
Ironie.''\selectlanguage{english}

Make sure that the base language of the document comes last in the list. The list of supported languages is in the package documentation. The syntax is similar for polyglossia but a little more explicit:

\usepackage{polyglossia}  % requires XeLaTeX or LuaLaTeX
\setmainlanguage{english}
\setotherlanguage{german}
\setotherlanguage{french}
\begin{document}
As one writer has noted, \textlang{german}{``Das 
berühmte Voltaire-Zitat, \emph{\textfrench{il est
bon de tuer de temps en temps un amiral pour 
encourager les autres}}, ist ein Beispiel sarkastischer 
Ironie.''}

Changing the language with babel or polyglossia is a cultural shift: it changes the hyphenation patterns, the word-spacing, the way in which indentation is used, and the names of the structural units and identifiers like ‘Abstract’, ‘Chapter’, and ‘Index’, etc. For example, using French as the default, chapters will start with ‘Chapitre’.

Both packages provide scoped and unscoped commands as shown in the examples to let you tell LATEX when to switch to the language specified in the argument. If you have only a small fragment in another language (a word or two, maybe a sentence, but less than a paragraph), use the scoped command with the first argument giving the language and the second with the word or phrase. For longer passages (more than a paragraph), use the unscoped command, with just the language, and then another unscoped command to switch back to the main language afterwards.
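
With babel, the two approaches might be sketched like this (the choice of languages and the sample sentences are only for illustration):

```latex
\documentclass{article}
\usepackage[german,english]{babel}  % main language last
\begin{document}
% Scoped: the language comes first, then the fragment itself
The word \foreignlanguage{german}{Zeitgeist} is borrowed from German.

% Unscoped: switch, set the passage, then switch back
\selectlanguage{german}
Dieser Absatz wird nach deutschen Regeln getrennt und gesetzt.

\selectlanguage{english}
This paragraph is in English again.
\end{document}
```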

These packages use the hyphenation patterns provided with your version of LATEX (see the start of your document log files for a list). For other languages you need to set the hyphenation separately (outside the scope of this book).
