Special Characters

Soft hyphens (§9.3.3)

A soft hyphen (&shy;) marks a point where a word may optionally be broken. When a line breaks at a soft hyphen, a hyphen character is displayed at the end of the first line. When no line break occurs there, the soft hyphen must not be displayed.

Soft hyphens are vital for text that must be displayed on a tiny screen or in a narrow frame. Web browsers have no excuse for rendering them incorrectly, since they can be minimally compliant by ignoring them completely.

Although the &shy; character entity was technically defined in HTML 3.2, I’ll treat soft hyphens as a new feature of HTML 4.0. Until HTML 4.0 explicitly spelled out how they should work, soft hyphens had ambiguous semantics and a history of contradictory interpretations.

In addition to the soft hyphen, there is a hard hyphen (&#8208; or &#x2010;), which always renders, and a nonbreaking hyphen (&#8209; or &#x2011;), which never permits a line break.
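To see which characters these references stand for, here is a minimal Python sketch. It relies on the standard `html` and `unicodedata` modules; the helper name `describe` is only illustrative and not part of any standard.

```python
import html
import unicodedata

# Character references for the soft, hard, and nonbreaking hyphens.
REFERENCES = ["&shy;", "&#8208;", "&#x2010;", "&#8209;", "&#x2011;"]

def describe(reference: str) -> str:
    """Return 'U+XXXX NAME' for a single HTML character reference."""
    char = html.unescape(reference)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

for ref in REFERENCES:
    print(ref, "->", describe(ref))
```

The decimal and hexadecimal forms of each reference resolve to the same character, so either may be used in markup.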

Example:

For the following test, you may have to resize your screen or window so that a hyphenated word could be broken at the end of a line. (Widths of 80 columns and 640 pixels worked in local testing.)

The 1992 <cite>Guinness Book of World Records</cite> calls the 29&#8208;letter

<em>floc&shy;ci&shy;nau&shy;ci&shy;
ni&shy;hil&shy;i&shy;pil&shy;i&shy;
fi&shy;ca&shy;tion</em>

<q>the longest real word in the <cite>Oxford English Dictionary</cite></q>, dismissing the 45&#8208;letter

<em>pneu&shy;mo&shy;no&shy;ul&shy;tra&shy;
mi&shy;cro&shy;scop&shy;ic&shy;
sili&shy;co&shy;vol&shy;ca&shy;no&shy;
co&shy;ni&shy;o&shy;sis</em>

as <q>the longest made&#8209;up word in the <cite>Oxford English Dictionary</cite></q>.

Line breaks have been added to the above source for readability. The sample below consists of a single line.

Your Web browser renders it like this:

The 1992 Guinness Book of World Records calls the 29‐letter floc­ci­nau­ci­ni­hil­i­pil­i­fi­ca­tion the longest real word in the Oxford English Dictionary, dismissing the 45‐letter pneu­mo­no­ultra­mi­cro­scop­ic­sili­co­vol­ca­no­co­ni­o­sis as the longest made‑up word in the Oxford English Dictionary.
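Markup like the sample above is tedious to type by hand. The following Python sketch shows one way it might be generated from a list of syllables, and how the soft hyphens can be dropped again to recover the plain word. The function names `hyphenate` and `visible_text` are hypothetical, and the syllable list is simply copied from the sample.

```python
import html

SHY = "&shy;"  # named reference for U+00AD SOFT HYPHEN

def hyphenate(syllables: list[str]) -> str:
    """Join syllables with soft hyphen references."""
    return SHY.join(syllables)

def visible_text(markup: str) -> str:
    """Decode references, then drop soft hyphens, as an unbroken line would show."""
    return html.unescape(markup).replace("\u00ad", "")

syllables = ["floc", "ci", "nau", "ci", "ni", "hil",
             "i", "pil", "i", "fi", "ca", "tion"]
markup = hyphenate(syllables)
word = visible_text(markup)

print(markup)
print(word, len(word))
```

Removing the soft hyphens before counting letters or searching text avoids miscounting invisible characters.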

Related Mozilla bug reports: shy.

Related Konqueror bug reports: #33798, #33855.

En spaces, em spaces, and thin spaces (§24)

HTML 4.0 has named entities for three fixed‐width spaces: the en space, the em space, and the thin space. Unlike ordinary spaces, which may vary in width when text is justified, the en, em, and thin spaces should not change in width.

The fixed‐width spaces are not white space characters, so two of them in sequence should not collapse into a single space. They should not be replaced by line breaks at the end of the line, though line breaks may occur immediately after them.
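The collapsing rule can be illustrated with a short Python sketch. It assumes that only ASCII space, tab, line feed, form feed, and carriage return are collapsible, and the function name `collapse_white_space` is made up for this example. An explicit character class is used because Python's `\s` would also match the en, em, and thin spaces.

```python
import html
import re

# Assumption: only these ASCII characters count as collapsible white space.
_COLLAPSIBLE = re.compile(r"[ \t\n\f\r]+")

def collapse_white_space(markup: str) -> str:
    """Decode references, then collapse runs of ordinary white space."""
    text = html.unescape(markup)
    return _COLLAPSIBLE.sub(" ", text)

print(repr(collapse_white_space("a  \n b")))          # ordinary spaces collapse
print(repr(collapse_white_space("a&emsp;&emsp;b")))   # em spaces are preserved
```

Two em spaces in sequence survive intact, while the run of ordinary spaces and the line feed become a single space.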

Example:

<dl>
<dt>space
<dd>The width of a space varies with the display font.
<dt>&amp;thinsp;
<dd>MathML&thinsp;defines&thinsp;thin&thinsp;spaces
&thinsp;as&thinsp;spaces
&thinsp;3&frasl;18&thinsp;as&thinsp;wide&thinsp;as
&thinsp;an&thinsp;em&thinsp;space.
<dt>&amp;ensp;
<dd>En&ensp;spaces&ensp;are&ensp;&frac12;&ensp;
as&ensp;wide&ensp;as&ensp;
an&ensp;em&ensp;space.
<dt>&amp;emsp;
<dd>The&emsp;width&emsp;of&emsp;an&emsp;em&emsp;
space&emsp;is&emsp;traditionally&emsp;
equal&emsp;to&emsp;the&emsp;point&emsp;size.
</dl>

Line breaks have been added above for readability.

Your Web browser renders it like this:

space
The width of a space varies with the display font.
&thinsp;
MathML defines thin spaces as spaces 3⁄18 as wide as an em space.
&ensp;
En spaces are ½ as wide as an em space.
&emsp;
The width of an em space is traditionally equal to the point size.

Zero‐width spaces (§9.1)

Long lines usually wrap at spaces between words, but in languages without spaces between words (like Thai), sentences may appear as if they were one continuous word.

Zero‐width spaces put “invisible spaces” between words, marking where a line may wrap.[1] They can also divide a long sequence of characters into smaller units that may wrap from one line to the next.

HTML 4.0 lacks a character entity name like &zws;, so we must use a numeric reference like &#8203; or &#x200B;. (MathML uses the entity name &ZeroWidthSpace; for this character.)
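As a quick check that these references all name the same character, here is a small Python sketch. Note that Python's `html.unescape` follows the HTML5 table of named references rather than HTML 4.0's, which is why it recognizes &ZeroWidthSpace; as well. The helper `is_zwsp_reference` is hypothetical.

```python
import html

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def is_zwsp_reference(reference: str) -> bool:
    """True if the reference decodes to exactly one zero-width space."""
    return html.unescape(reference) == ZWSP

for ref in ["&#8203;", "&#x200B;", "&ZeroWidthSpace;", "&zws;"]:
    print(ref, is_zwsp_reference(ref))
```

The made-up name &zws; is not recognized, so it does not decode to a zero-width space.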

Zero‐width spaces function similarly to the proprietary <wbr> word break element in early versions of Netscape.

[1] Invisible in theory, anyway. Some Web browsers display zero‐width spaces as a visible unknown‐character glyph, which is technically not incorrect. Perhaps a future version of the standard will mandate how zero‐width spaces should be rendered, as HTML 4.0 does with soft hyphens.

Example:

The following sentence contains a very long number, in which I’ve helpfully included zero‐width spaces every 5 digits. In a visual Web browser that doesn’t treat zero‐width spaces as white space, this page will probably scroll horizontally.

&pi;=3.14159&#8203;26535&#8203;89793&#8203;
23846&#8203;26433&#8203;83279&#8203;50288&#8203;
41971&#8203;69399&#8203;37510&#8203;58209&#8203;
74944&#8203;59230&#8203;78164&#8203;06286&#8203;
20899&#8203;86280&#8203;34825&#8203;34211&#8203;
70679&hellip;

Line breaks have been added to the above source for readability. The sample below consists of a single line.

Your Web browser renders it like this:

π=3.14159​26535​89793​23846​26433​83279​50288​41971​69399​37510​58209​74944​59230​78164​06286​20899​86280​34825​34211​70679…
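The grouping used in this sample could be produced programmatically. The Python sketch below inserts a zero-width space reference after every five decimal digits; the function name `break_digits` and its `size` parameter are assumptions made for illustration.

```python
ZWSP_REF = "&#8203;"  # numeric reference for U+200B ZERO WIDTH SPACE

def break_digits(number: str, size: int = 5) -> str:
    """Insert a zero-width space reference between groups of `size` decimal digits."""
    whole, point, fraction = number.partition(".")
    groups = [fraction[i:i + size] for i in range(0, len(fraction), size)]
    return whole + point + ZWSP_REF.join(groups)

print(break_digits("3.14159265358979"))
```

A number without a decimal point is returned unchanged, since there are no fractional digits to group.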

Related Mozilla bug reports: zws.

Related Konqueror bug reports: #29575.

Joining Controls (§8.2.5)

In scripts such as Arabic, individual characters join with the ones that follow. Sometimes, however, a Web browser must be told to join characters that normally would not join, or not to join characters that normally would.

The &zwnj; entity (zero‐width non‐joiner) prevents joining where it would otherwise occur but should not. The &zwj; entity (zero‐width joiner) requests joining where it would not otherwise occur but should.

Example:

Here is an example of &zwj; and &zwnj; being used with Devanagari characters.

<p>&#2325;&#2381; + &#2340; = &#2325;&#2381;&#2340; (a glyph of kta)</p>
<p>&#2325;&#2381; + &amp;zwj; + &#2340; = &#2325;&#2381;&zwj;&#2340; (half&#8208;ka and ta)</p>
<p>&#2325;&#2381; + &amp;zwnj; + &#2340; = &#2325;&#2381;&zwnj;&#2340; (ka&#8208;halant and ta)</p>

Your Web browser renders it like this:

क् + त = क्त (a glyph of kta)

क् + &zwj; + त = क्‍त (half‐ka and ta)

क् + &zwnj; + त = क्‌त (ka‐halant and ta)
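Because the joining controls are invisible, it can help to list the characters a sequence actually contains. This Python sketch decodes each of the three Devanagari sequences and prints the Unicode name of every code point; the helper `code_point_names` is hypothetical.

```python
import html
import unicodedata

def code_point_names(markup: str) -> list[str]:
    """Decode character references and name each resulting code point."""
    return [unicodedata.name(char) for char in html.unescape(markup)]

samples = [
    "&#2325;&#2381;&#2340;",        # plain sequence
    "&#2325;&#2381;&zwj;&#2340;",   # with zero-width joiner
    "&#2325;&#2381;&zwnj;&#2340;",  # with zero-width non-joiner
]
for sample in samples:
    print(code_point_names(sample))
```

The three sequences differ only in the control character placed between the virama and the letter ta.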

Related Mozilla bug reports: #202352.

Ligatures

In English‐language text, &zwj; may be used to form ligatures.

The MacOS character set contains ligatures for “fi” and “fl”. Many fonts developed for both MacOS and Microsoft Windows contain the five main f‐ligatures.

Ligature   Entity      Character
ff         &#64256;    ﬀ
fi         &#64257;    ﬁ
fl         &#64258;    ﬂ
ffi        &#64259;    ﬃ
ffl        &#64260;    ﬄ
st         &#64262;    ﬆ

However, using these character numbers with a typeface that does not support them could result in unknown‐character glyphs. Instead, you could request these ligatures with zero‐width joiners, allowing Web browsers that cannot generate them to gracefully degrade to unjoined characters.
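If precomposed ligature characters do end up in text, a processing step can fall back to plain letters. The Python sketch below uses Unicode compatibility normalization (NFKC) for that purpose. This is a different technique from the zero-width joiner approach described here, shown only to illustrate how the ligature characters relate to their component letters; the names `LIGATURES` and `fallback` are made up for the example.

```python
import unicodedata

# Ligature characters, keyed by the letters they stand for.
LIGATURES = {
    "ff": chr(64256),
    "fi": chr(64257),
    "fl": chr(64258),
    "ffi": chr(64259),
    "ffl": chr(64260),
    "st": chr(64262),
}

def fallback(text: str) -> str:
    """Replace compatibility characters, ligatures included, with plain letters."""
    return unicodedata.normalize("NFKC", text)

for letters, char in LIGATURES.items():
    print(letters, unicodedata.name(char), fallback(char) == letters)
```

NFKC also rewrites other compatibility characters, so it is best applied to copies used for searching or indexing rather than to the displayed text.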

Zero‐width joiners may also request ligatures without official characters in Unicode. Germanic typefaces sometimes have traditional ligatures for “ch”, “ck”, and “tz”. Adobe makes some fancy OpenType fonts with ligatures for “fj”, “ffj”, “Th”, “ct”, and “sp”.

Using &zwj; to form ligatures in Latin text is controversial. Some think &zwj; should be used for ligatures only in contexts that absolutely require them, and consider it an abuse to request ligatures when unjoined letters convey the same meaning.

Example:

The following text requests the five main f‐ligatures. Your Web browser should either join the letters, or gracefully degrade by rendering them as separate letters. The zero‐width joiner itself should always be invisible.

The f&zwj;lower in the f&zwj;ile made the of&zwj;f&zwj;ice staf&zwj;f snif&zwj;f&zwj;le.

Your Web browser renders it like this:

The f‍lower in the f‍ile made the of‍fice staf‍f snif‍f‍le.
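Graceful degradation amounts to ignoring the joiners. This short Python sketch decodes the sample sentence and removes every zero-width joiner, giving the text that a renderer without ligature support would effectively display. The function name `degrade` is hypothetical.

```python
import html

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def degrade(markup: str) -> str:
    """Decode references and drop zero-width joiners, leaving unjoined letters."""
    return html.unescape(markup).replace(ZWJ, "")

sample = ("The f&zwj;lower in the f&zwj;ile made the "
          "of&zwj;f&zwj;ice staf&zwj;f snif&zwj;f&zwj;le.")
print(degrade(sample))
```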

Related Mozilla bug reports: Ligatures.