Published on Mar 26, 2017
An article about a small, small subset of the problems with using software tools whose syntax elements are ASCII-only and/or English-only.
If you’re reading this, I presume you know what ASCII is, but if not, it’s the letter-to-number mapping commonly used in American programming to display our alphabet, punctuation, and some other characters. It’s one of the earliest character encodings (it was designed to drive teleprinters, back when those were the interface used for computing) and won out over EBCDIC as the standard encoding used … in America.
This was fine when 7 bits was enough for anybody and we had neither the fonts nor the inclination to draw more symbols than could fit on a keyboard.
This is not an article about how we finally moved beyond ASCII (the short version is: the Unicode Consortium is defining a single table whose goal is to map all characters in all writing systems to numbers called code points, and the UTF-8 encoding is the One True Way to implement this; stop using other things, Windows); I just wanted to provide some backstory.
So when C, one of the ancestors of basically all modern programming, was being built, it was written with ASCII being the only serious encoding around, which was fine, at the time. It was written in America, by and for Americans, and ASCII has American right in the name. C adopted sigil conventions from its immediate ancestor B (itself a descendant of BCPL), popularizing the use of English’s many different grouping characters – ( ), [ ], { }, '', and "" – for different semantic meanings that have continued essentially unchanged in C’s descendants today.
This is good; using words such as begin/end to denote blocks instead of { and } is a recipe for disaster, and sigil grouping permits complex arrangements to be made with minimal line noise.
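To make that concrete, here is a minimal C sketch (the variable names are mine, purely illustrative) in which each ASCII sigil pair carries a distinct meaning:

```c
#include <stdio.h>

int main(void) {
    int scores[3] = {90, 85, 77};     /* { } group an initializer (and blocks) */
    for (int i = 0; i < 3; i++) {     /* ( ) group conditions and calls        */
        char grade = 'B';             /* ' ' wraps a single character          */
        printf("score %d: %d (%c)\n", /* " " wraps a string                    */
               i, scores[i], grade);  /* [ ] indexes into an array             */
    }
    return 0;
}
```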
This is all well and good, until we look at strings. I take many, many issues with C strings and will complain at length about them some other time; for now I just want to talk about how text is placed in a source file.
In C, the single quote ' (U+0027) surrounds a single character. In C’s ASCII, this is equivalent to a signed, 8-bit integer. The values 'A' and 65 are identical. Single quotes permitted working with text as numbers with no special weirdness in place. The double quote " (U+0022) surrounds “C strings”, which are actually arrays of ASCII numbers with a zero byte secretly slapped on the end. This leads to fun confusion where 'A' has the value 65 and fits in a char (8 bits wide; strictly speaking, C gives character constants the type int), whereas "A" is a pointer (as wide as your CPU) of type char* and has some value pointing into the compilation output, and at the end of the pointer is the sequence 65, 0.
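A quick sketch of that asymmetry; the sizes in the comments assume a typical 64-bit platform:

```c
#include <stdio.h>

int main(void) {
    char c = 'A';        /* a number: 65                              */
    const char *s = "A"; /* a pointer to the bytes {65, 0} in storage */

    printf("%d\n", c);           /* 65                               */
    printf("%zu\n", sizeof 'A'); /* 4: C actually types 'A' as int   */
    printf("%zu\n", sizeof c);   /* 1: a char is one byte            */
    printf("%zu\n", sizeof s);   /* 8: pointer width on a 64-bit CPU */
    printf("%s\n", s);           /* A, printing until the hidden 0   */
    return 0;
}
```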
The distinction between single and double quotes is essentially just a language quirk that one must learn as part of learning the language; I’m not here to complain about that either.
Look at the quotation marks I’m using in my normal text. Rather than using straight single quotes “'” and double quotes ‘"’, I am using the paired, fancy, curved quotation marks one sees in word processors.
If you’ve ever tried to copy code samples between a programming environment and a word processor, you’ve no doubt run into errors when the parser reached the fancy, non-ASCII, not-part-of-the-grammar, quotation marks.
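A minimal demonstration in C; the diagnostic quoted in the comment is GCC’s wording, and other compilers phrase it differently:

```c
#include <stdio.h>

int main(void) {
    printf("hello\n"); /* straight ASCII quotes: compiles fine */

    /* A word processor will often "helpfully" rewrite that line as
     *
     *     printf(“hello\n”);
     *
     * The curly “ (U+201C) is the UTF-8 byte sequence E2 80 9C, none of
     * which is in C's grammar, so GCC rejects each byte with
     *
     *     error: stray '\342' in program
     */
    return 0;
}
```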
Programming is a world-wide, supposedly modern, phenomenon, yet in many ways we’re still hobbled by past assumptions and restrictions. Only recently have languages branched out from ASCII in their grammar (the iconic example being “emoji as valid identifiers” in Swift and other Unicode-aware parsers) and process (for example, Rust mandates that all source files and strings are UTF-8), yet the grammars have not expanded similarly.
We can now declare variables such as 😭, but we can still only wrap text in ". Non-English Latin languages use accented characters, which have only recently stopped being compiler errors; non-Western scripts such as Arabic or Chinese are even less supported: consider the language Qalb, which contains no English text in its grammar and has the GitHub URL https://github.com/nasser/---, yet still uses ASCII quotes for string literals.
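The accented-identifier point can be shown in C itself (emoji identifiers remain a Swift-ism; C’s identifier rules don’t admit them). This sketch assumes GCC 10+ or a recent Clang, which accept UTF-8 identifiers in the ranges C11 permits; the name is made up:

```c
#include <stdio.h>

/* Older compilers required the \u00E9 escape here, or rejected the
 * identifier outright; modern ones take the UTF-8 bytes directly. */
static int café_count = 2;

int main(void) {
    printf("%d\n", café_count);
    return 0;
}
```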
English isn’t the only language with quotation marks that aren’t valid syntax, however. French uses « and » as its double quotes, and ‹ and › as its single quotes. There are more, and more esoteric, quotation marks that can be found on Wikipedia, but I don’t know much about them and would prefer to avoid putting my foot in my mouth over them.
This isn’t 1970. Even the English character set is more than 128 characters (the curly quotes above aren’t in ASCII), and the world has more character sets in it than just English’s.
It’s past time to move beyond ASCII as an encoding, and UTF-8 is making excellent progress at supplanting it. It’s also time to move beyond ASCII as an alphabet.
Let languages accept ‘’, “”, ‹›, and «» as quotation marks in their syntax grammars. That’s what they mean (and to make parsing easier, there’s even a Unicode character property, “Quotation_Mark”, that groups all such marks, not just these, for easy and future-proof processing, as sketched below). There will be fewer problems moving text between word and code processors. There will be fewer international quirks for people and text that aren’t a subset of English. There might even be benefits for those of us who are: imagine being able to use specific quotation marks to control string interpolation instead of the hacks currently present, such as $<> and all the other permutations that are essentially just Frankenstein abuses of unused sigils and ASCII-acceptable paired group markers.
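Here is what that lookup could look like, using ICU’s C API (u_hasBinaryProperty and UCHAR_QUOTATION_MARK are real ICU names; everything else is my own illustrative scaffolding):

```c
/* Build (assuming ICU is installed):
 *     cc quotes.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/uchar.h>

/* True for every quotation mark Unicode knows about, present and future. */
static int is_quote(UChar32 c) {
    return u_hasBinaryProperty(c, UCHAR_QUOTATION_MARK);
}

int main(void) {
    UChar32 candidates[] = {
        0x0022, /* " */ 0x0027, /* ' */
        0x201C, /* “ */ 0x2018, /* ‘ */
        0x00AB, /* « */ 0x2039, /* ‹ */
        0x0041, /* A, a control case */
    };
    for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
        printf("U+%04X is%s a quotation mark\n", (unsigned)candidates[i],
               is_quote(candidates[i]) ? "" : " not");
    }
    return 0;
}
```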
Or imagine not having the name Robert '); DROP TABLE Students; -- ruin your life because code and data mixed too freely.
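A minimal sketch of that mixing, with C string formatting standing in for any naive query builder (the table and column are invented):

```c
#include <stdio.h>

int main(void) {
    /* Data exactly as a user typed it: */
    const char *name = "Robert '); DROP TABLE Students; --";
    char query[160];

    /* The ASCII ' inside the data closes the SQL string literal, so the
     * rest of the "name" is suddenly executable SQL. */
    snprintf(query, sizeof query,
             "INSERT INTO Students (name) VALUES ('%s');", name);
    puts(query);
    /* INSERT INTO Students (name) VALUES ('Robert '); DROP TABLE Students; --'); */
    return 0;
}
```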
Written language is a big place. Let’s not restrict ourselves to one small part of it simply because of accidents of history. The world is bigger than the US. The languages available are bigger than ASCII. They shouldn’t be kept as second-class citizens in our tools just because the ancestors of computing didn’t think of them.