an algorithm for putting text in a standard form before hashing or signing
This is a spec for a simple but powerful algorithm for canonicalizing chunks of text that flow not via files but via chat, copy/paste, or other non-file-oriented channels (social media, SMS, email, etc.).
You can use an interactive form to run the algorithm on arbitrary text.
A reference implementation exists in cqt.py, with unit tests in test_cqt.py. See also ports in cqt.js, Cqt.java, cqt.go, and cqt.rs. Code is in the public domain; see LICENSE.
Cryptographic hashes and signatures are usually applied to files or data structures. However, a very important category of communication is not file-oriented. In our modern world, lots of text moves across system boundaries using mechanisms that are prone to reformatting and error due to their inherent fuzziness. We see a post on social media on our phones, copy it, and paste it into a text to a friend. She emails it to a journalist acquaintance, who moves it into a word processor that is configured to use a different locale with different autocorrect and punctuation settings. Eventually, a student cites the journalist in a paper they’re writing. Somewhere along the way, whitespace is deleted, capitalization or spelling is altered, the codepage changes, smart quotes turn into dumb quotes or two hyphens become an em dash.
In this scenario, how can we evaluate whether the final text is the same as the original?
Of course, opinions about what constitutes sameness vary. There is no objectively correct answer. However, we can create deterministic answers that are useful. They can help us decide whether minor text changes are likely to matter, and check to see whether a digital signature matches a piece of text.
The algorithm documented here is for such cases. It says that two chunks of text are the same if, when transformed by the algorithm, they produce output that matches byte-for-byte.
The full name of this algorithm is “canonical quoted text 1.14”, but it is typically abbreviated “cqt1.14”.
The name contains two numbers. The first number (“1”) versions the logic of the algorithm, and the second number (“14”) references the version of the Unicode standard that documents certain details. Version 14 was chosen because it, or something newer, is widely supported by programming libraries. The Unicode standard is fairly stable across mainstream modern languages, so the algorithm is likely to produce identical or near-identical results even if the second number varies slightly. This versioning is similar in spirit to semver.org, although the meaning of the second number differs slightly from semver's minor version.
The output of this algorithm can be piped to a digest function to produce a canonical hash of text. For example: `canonical hash = Blake3(cqt1.14(text))`.
The output of this algorithm can also be piped directly to a digital signature function to produce a signature over canonical text. For example: `signature over canonical text = EdDSA(cqt1.14(text))`.
Perhaps better (because it allows the text value to be disclosed later), a signature can also take as input a canonical hash, producing a signature over canonical hash. For example: `signature over canonical hash = EdDSA(Blake3(cqt1.14(text)))`.
This formal notation can be used in specs and machine-processable metadata. If machines are parsing such expressions, all strings in the notations MUST be compared case-insensitively, with whitespace and all punctuation except parentheses removed.
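The pipeline `hash(canonicalize(text))` can be sketched as follows. This is an illustration only: `canonical_hash` and `toy_canonicalize` are hypothetical names, the toy canonicalizer is not cqt1.14, SHA-256 stands in for Blake3 (which is not in the Python standard library), and encoding the canonical text as UTF-8 before hashing is an assumption.

```python
import hashlib
from typing import Callable


def canonical_hash(text: str, canonicalize: Callable[[str], str]) -> str:
    """Canonicalize the text, then hash its UTF-8 bytes and return a hex digest."""
    canonical = canonicalize(text)
    # Assumption: SHA-256 is a stand-in for Blake3, and UTF-8 is the byte encoding.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_canonicalize(text: str) -> str:
    """Toy stand-in for demonstration; it only unifies three kinds of double quote."""
    for quote in ('"', "\u201c", "\u201d"):
        text = text.replace(quote, "'")
    return text


a = canonical_hash('She said "hi"', toy_canonicalize)
b = canonical_hash("She said \u201chi\u201d", toy_canonicalize)
print(a == b)  # the two inputs differ only in quote style
```

Passing the canonicalizer in as a parameter keeps the sketch independent of any particular implementation; a real caller would supply the reference implementation instead of the toy function.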
Given any two input text samples and a literate, thoughtful human who knows the natural language(s) that they embody, the intent is to provide an algorithm that achieves the following goals:
We live in an imperfect world, so this algorithm makes calculated tradeoffs in the first two goals. Also, the third goal is less important than the first two and might be sacrificed in corner cases. For more on this, see Caveats.
Start with input content that has been transformed into plain text.
This is a precondition rather than a step in our algorithm. “Plain text” means that the text is ready to be interpreted as IANA media type `text/plain`: it contains no markup intended as instructions to a different formatting engine (e.g., escape sequences, HTML/XML tags, character entities…). Many programs that edit rich text already implement such transformations: when a user copies text, they place both a richly formatted and a “plain text” version of the content on the clipboard. However, intent matters; including an HTML tag in plain text is correct if the plain text is intended to be an instruction about how to construct an HTML tag, and incorrect otherwise. In other words, any required transformation depends on the initial media type.
Convert the text to Unicode, eliminating codepages as a source of difference. Represent the data in whichever encoding of Unicode (UTF-8, UTF-16, UTF-32…) is convenient; subsequent steps are described as Unicode operations rather than byte operations.
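A small illustration of why the codepage matters, assuming the input arrives as bytes with a known encoding. The byte string below is invented for the example.

```python
# The same bytes decode to different text under different codepages.
raw = b"\x93hi\x94"

as_cp1252 = raw.decode("cp1252")   # Windows-1252: curly double quotes around "hi"
as_latin1 = raw.decode("latin-1")  # ISO-8859-1: C1 control characters around "hi"

print(as_cp1252 == "\u201chi\u201d")
print(as_cp1252 == as_latin1)

# Once decoded correctly, the original byte encoding no longer matters.
print(b"caf\xe9".decode("latin-1") == b"caf\xc3\xa9".decode("utf-8"))
```

Later steps operate on the decoded string, so the decoding has to use the encoding the text was actually written in.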
Normalize the text to Unicode’s Normalization Form KC (NFKC). This replaces the halfwidth and fullwidth compatibility forms used in Chinese, Japanese, and Korean (CJK) text with their standard equivalents, breaks ligatures, decomposes fractions, standardizes variants, handles diacritics uniformly, flattens super- and subscripts, converts many digit variants (such as circled or fullwidth digits) to ASCII digits, and eliminates many other unimportant differences.
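In Python, NFKC is available through the standard library's `unicodedata.normalize`. The sketch below shows a few of the effects listed above; the helper name `nfkc` and the sample strings are chosen for illustration, and the Unicode version applied is whichever one ships with the Python build in use.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\ufb01le",            # "fi" ligature followed by "le"
    "x\u00b2",             # superscript two
    "x\u2082",             # subscript two
    "\uff21\uff22\uff23",  # fullwidth A, B, C
    "\u00bd",              # vulgar fraction one half
    "e\u0301",             # e followed by a combining acute accent
]
for sample in samples:
    print(repr(sample), "->", repr(nfkc(sample)))
```

Note that the fraction decomposes to digits around the fraction slash `U+2044` rather than the ordinary slash.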
Replace all instances of the ampersand (`&` `U+0026`), the small ampersand (`﹠` `U+FE60`), and the fullwidth ampersand (`＆` `U+FF06`) with ` and ` (the word “and” with a space before and after).
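A minimal sketch of the ampersand step, using the ASCII ampersand's codepoint `U+0026`. The names `AMPERSANDS` and `replace_ampersands` are hypothetical.

```python
# ASCII ampersand, small ampersand, fullwidth ampersand.
AMPERSANDS = ("\u0026", "\ufe60", "\uff06")


def replace_ampersands(text: str) -> str:
    """Replace each ampersand variant with the word 'and' padded by spaces."""
    for ampersand in AMPERSANDS:
        text = text.replace(ampersand, " and ")
    return text


print(replace_ampersands("R&D"))
```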
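Normalize whitespace. The characters treated as whitespace include `U+2028` Line Separator, `U+2029` Paragraph Separator, `U+200B` Zero Width Space, `U+FEFF` Zero Width Non-Breaking Space, `U+00A0` Non-Breaking Space, `U+3000` ideographic space, carriage return `U+000D` (`\r`), line feed `U+000A` (`\n`), and tab (`\t`), along with any other character that has the Unicode property `White_Space=yes`. The replacement is the ordinary space, `U+0020`.
Replace dash characters with the ASCII hyphen `-` (`U+002D`).
Convert some CJK characters (from Unicode’s CJK Symbols and Punctuation block and from the fullwidth half of the CJK Halfwidth and Fullwidth Forms block) into their ASCII equivalents: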
CJK character | Codepoint | ASCII equivalent |
---|---|---|
Ideographic comma `、` | `U+3001` | `,` (ordinary comma, `U+002C`) |
Ideographic full stop `。` | `U+3002` | `.` (ordinary full stop, `U+002E`) |
CJK fullwidth ASCII printable chars | `U+FF01` to `U+FF5E` | codepoint - 0xFEE0: ordinary `!` to `~` |
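The table above can be expressed as a translation table for `str.translate`. This is a sketch with hypothetical names (`build_cjk_table`, `cjk_to_ascii`); it maps the ideographic full stop to the ordinary full stop `U+002E`.

```python
def build_cjk_table() -> dict[int, str]:
    """Map ideographic punctuation and fullwidth ASCII forms to ASCII."""
    table = {
        0x3001: ",",  # ideographic comma -> U+002C
        0x3002: ".",  # ideographic full stop -> U+002E
    }
    # U+FF01..U+FF5E are the fullwidth forms of '!'..'~', offset by 0xFEE0.
    for codepoint in range(0xFF01, 0xFF5E + 1):
        table[codepoint] = chr(codepoint - 0xFEE0)
    return table


CJK_TO_ASCII = build_cjk_table()


def cjk_to_ascii(text: str) -> str:
    return text.translate(CJK_TO_ASCII)


print(cjk_to_ascii("\uff28\uff49\u3001\uff4f\uff4b\u3002"))
```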
Replace the horizontal ellipsis (`…` `U+2026`) with three instances of the period/full stop `.` (`U+002E`).
Replace the fraction slash (`⁄` `U+2044`) with the ordinary slash `/` (`U+002F`).
Replace various characters that are used as quotes with the least common denominator, the ASCII apostrophe `'` (`U+0027`):
Quote Char | Codepoint |
---|---|
ASCII double-quote `"` | `U+0022` |
Left smart apostrophe `‘` | `U+2018` |
Right smart apostrophe `’` | `U+2019` |
Left smart double quote `“` | `U+201C` |
Right smart double quote `”` | `U+201D` |
Left guillemet `«` | `U+00AB` |
Right guillemet `»` | `U+00BB` |
Single left-angle quote `‹` | `U+2039` |
Single right-angle quote `›` | `U+203A` |
CJK left-angle quote `〈` | `U+3008` |
CJK right-angle quote `〉` | `U+3009` |
CJK double left-angle quote `《` | `U+300A` |
CJK double right-angle quote `》` | `U+300B` |
CJK left corner bracket `「` | `U+300C` |
CJK right corner bracket `」` | `U+300D` |
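The quote table lends itself to a single `str.translate` call. In this sketch, `QUOTE_CHARS`, `QUOTE_TABLE`, and `normalize_quotes` are hypothetical names; the codepoints are the ones listed in the table.

```python
# Every character from the quote table, written as escapes for clarity.
QUOTE_CHARS = (
    "\u0022"                    # ASCII double quote
    "\u2018\u2019\u201c\u201d"  # smart single and double quotes
    "\u00ab\u00bb\u2039\u203a"  # guillemets and single angle quotes
    "\u3008\u3009\u300a\u300b"  # CJK angle quotes
    "\u300c\u300d"              # CJK corner brackets
)
QUOTE_TABLE = {ord(ch): "'" for ch in QUOTE_CHARS}


def normalize_quotes(text: str) -> str:
    """Replace every quote-like character with the ASCII apostrophe."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("\u201cHello,\u201d she said."))
```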
Undo some common autocorrect transformations in word processors by converting fancier Unicode characters to their ASCII equivalents:
Unicode character | Codepoint | ASCII equivalent |
---|---|---|
😊 | `U+1F60A` | `:-)` |
😐 | `U+1F610` | `:-\|` |
☹ | `U+2639` | `:-(` |
😃 | `U+1F603` | `:-D` |
😝 | `U+1F61D` | `:-p` |
😲 | `U+1F632` | `:-o` |
😉 | `U+1F609` | `;-)` |
❤ | `U+2764` | `<3` |
💔 | `U+1F494` | `</3` |
© | `U+00A9` | `(c)` |
® | `U+00AE` | `(R)` |
• | `U+2022` | `*` |
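A sketch of this table as a translation map, with hypothetical names (`AUTOCORRECT_TO_ASCII`, `undo_autocorrect`). `str.maketrans` accepts a dict whose keys are single characters and whose values are replacement strings of any length.

```python
AUTOCORRECT_TO_ASCII = {
    "\U0001F60A": ":-)",
    "\U0001F610": ":-|",
    "\u2639": ":-(",
    "\U0001F603": ":-D",
    "\U0001F61D": ":-p",
    "\U0001F632": ":-o",
    "\U0001F609": ";-)",
    "\u2764": "<3",
    "\U0001F494": "</3",
    "\u00a9": "(c)",
    "\u00ae": "(R)",
    "\u2022": "*",
}
_TABLE = str.maketrans(AUTOCORRECT_TO_ASCII)


def undo_autocorrect(text: str) -> str:
    """Replace each listed character with its ASCII equivalent."""
    return text.translate(_TABLE)


print(undo_autocorrect("\u00a9 2024 \U0001F60A"))
```

Some platforms follow an emoji with the variation selector `U+FE0F`; this sketch leaves such a selector in place.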
Replace some ASCII emojis with their canonical equivalent:
Non-canonical | Canonical equivalent |
---|---|
`:)` | `:-)` |
`:\|` | `:-\|` |
`:(` | `:-(` |
`:D` | `:-D` |
`:p` | `:-p` |
`:o` | `:-o` |
`;)` | `;-)` |
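A naive sketch of this step using plain substring replacement. `EMOTICONS` and `canonicalize_emoticons` are hypothetical names, and matching anywhere in the text (rather than only at word boundaries) is an assumption, since the table does not say how matches are delimited.

```python
EMOTICONS = {
    ":)": ":-)",
    ":|": ":-|",
    ":(": ":-(",
    ":D": ":-D",
    ":p": ":-p",
    ":o": ":-o",
    ";)": ";-)",
}


def canonicalize_emoticons(text: str) -> str:
    """Insert the missing hyphen 'nose' into each short emoticon."""
    for short, canonical in EMOTICONS.items():
        text = text.replace(short, canonical)
    return text


print(canonicalize_emoticons("ok :) see you ;)"))
```

Because no canonical form contains a short form as a substring, applying the function twice gives the same result as applying it once. The substring approach also rewrites sequences such as `:p` inside longer tokens, which may or may not be intended.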
This algorithm collapses some differences that are usually insignificant in written text. Note the word “usually”. The algorithm may not distinguish certain input texts having subtle distinctions. For example:
- Some distinct expressions become indistinguishable (the expression `i--` and the expression `i-` produce identical output; so do `x²`, `x₂`, and `x2`).
- `Always place a comma inside double quotes: "abc,"` is normalized to the same value as `Always place a comma inside double quotes: 'abc,'` (which contains no double quotes, despite what the text says).

This algorithm also leaves intact some differences that some audiences may wish to collapse. Notably, it does not normalize case. Also:
- Plain-text emphasis (as in `I'm *really* serious`) is untouched and does not equate to italics or bolded text.