UTF-8 RTF

For RichEdit 4.0 (Windows XP SP1), I developed a UTF-8 version of the Rich Text Format (RTF). The goal was a faster, more reliable way of handling copy/paste for RichEdit than regular RTF. RichEdit 5.0 added a binary format for this purpose (and for OneNote), and RichEdit 6.0 added a still faster internal method to speed up the build-up of the math linear format. Accordingly, starting with RichEdit 5.0 (Office 2003), the UTF-8 RTF format isn’t used for copy/paste unless the client specifically asks for it, and I hadn’t paid much attention to it.

But a UTF-8 RTF bug for the N’Ko script (the only right-to-left script that displays its digits RTL!) showed up the other day and needed some attention. In my standard RTF debugging mode, I opened the file in Notepad to see what was going on. Much to my delight, it looks sooooooo much better than usual! Here’s the text part of the “new scripts for Windows 8” file written by RichEdit for standard RTF (the N’Ko characters are in the \rtlch line):

\f0\fs22\lang1178\u-23344?\u-23337?\u-23305?\u-23320?\u-23314?\u-23313?\u-23302?\f1\lang1033\par

\f2\u-23286?\u-23275?\u-23261?\u-23115?\u-23054?\u-23077?\u-23025?\f3\par

\f4\rtlch\lang1176\u1986?\u2032?\u2025?\u2013?\u2041?\u2012?\u2000?\f5\ltrch\lang1033\par

\f6\u-10239?\u-9069?\u-10239?\u-9059?\u-10239?\u-9065?\u-10239?\u-9064?\u-10239?\u-9074?\u-10239?\u-9068?\u-10239?\u-9084?\f3\par

\f7\u-22444?\u-22411?\u-22453?\u-22449?\u-22438?\u-22433?\u-22418?\f3\par

\f8\u-10240?\u-8399?\u-10240?\u-8395?\u-10240?\u-8374?\u-10240?\u-8381?\u-10240?\u-8388?\u-10240?\u-8386?\u-10240?\u-8378?\f1\par

\f9\u-10239?\u-9195?\u-10239?\u-9182?\u-10239?\u-9207?\u-10239?\u-9139?\u-10239?\u-9187?\u-10239?\u-9163?\u-10239?\u-9177?\f1\par

\f10\u11570?\u11578?\u11586?\u11609?\u11621?\u11614?\u11580?\f1\par

You can’t tell what the characters are since they’re all represented by the RTF \uN notation. Btw, this is still a lot simpler than what Word writes. You can see the latter by saving a Word file in RTF and looking at the file in Notepad. The RichEdit file containing the RTF above is 1437 bytes and the corresponding Word file is 38190 bytes. You can see why people pass Word RTF files through WordPad to get something lighter.

Now here’s what the same text looks like when written in the UTF-8 RTF format:

\f0\fs22\lang1178 ꓐꓗꓷꓨꓮꓯꓺ\f1\lang1033\par

\f2 ꔊꔕꔣꖵꗲꗛ꘏\f3\par

\f4\rtlch\lang1176 ߂߰ߩߝ߹ߜߐ\f5\ltrch\lang1033\par

\f6 𐒓𐒝𐒗𐒘𐒎𐒔𐒄\f3\par

\f7 ꡔ꡵ꡋꡏꡚꡟꡮ\f3\par

\f8 𐌱𐌵𐍊𐍃𐌼𐌾𐍆\f1\par

\f9 𐐕𐐢𐐉𐑍𐐝𐐵𐐧\f1\par

\f10 ⴲⴺⵂⵙⵥⵞⴼ\f1\par

You can read all the new-script characters instead of looking at \uN control words! Well, maybe you don't understand the text, but at least you can see the characters. The file containing this RTF is 1003 bytes, about 70% the size of the RichEdit standard RTF file and about a fortieth the size of the Word RTF file.

The \uN notation is certainly very valuable, but it’s particularly awkward because it uses signed 16-bit decimal values. To find out what the characters are, you have to add 65536 to negative values and convert the results to hexadecimal. Furthermore, a surrogate pair is represented by two \uN control words instead of one with an unsigned integer. So you have to convert two negative 16-bit decimal numbers to hex and then convert the resulting surrogate pair to the UTF-32 form to get what’s in the Unicode Standard. Since Word writes many RTF control words with unsigned 32-bit values, there really wasn’t any reason to stick with the original signed 16-bit convention. Standard RTF writers also convert characters that can be represented in a standard Windows code page to that code page, making those characters virtually unreadable unless the code page is the Western 1252 code page. Meanwhile, UTF-8 RTF simply displays all characters outright. If you paste a UTF-8 RTF file into Word, you can see the characters and use the Alt+X hot key to examine their Unicode values.
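To make the decoding steps concrete, here is a minimal Python sketch of the arithmetic described above: add 65536 to negative values, then fold a high/low surrogate pair into a single code point. The function name `decode_u_words` is hypothetical, and the sketch assumes each \uN is followed by at most one `?` fallback character, as in the RichEdit output shown earlier. It is not a general RTF parser.

```python
import re

# Matches \uN with an optional single '?' fallback character.
# Assumption: one-character fallback, as in the samples above.
_U_WORD = re.compile(r"\\u(-?\d+)\??")

def decode_u_words(rtf_fragment: str) -> str:
    """Turn a run of \\uN control words into the characters they encode."""
    # Step 1: signed 16-bit decimal -> unsigned UTF-16 code unit.
    units = []
    for match in _U_WORD.finditer(rtf_fragment):
        value = int(match.group(1))
        if value < 0:
            value += 65536
        units.append(value)

    # Step 2: combine surrogate pairs into UTF-32 code points.
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            code_point = unit
            i += 1
        chars.append(chr(code_point))
    return "".join(chars)

if __name__ == "__main__":
    # First character of the first sample line, and the first
    # surrogate pair of the \f6 line.
    for sample in (r"\u-23344?", r"\u-10239?\u-9069?"):
        decoded = decode_u_words(sample)
        print(sample, "->", " ".join(f"U+{ord(c):04X}" for c in decoded))
```

For example, \u-23344 becomes 65536 − 23344 = 42192 = U+A4D0, and the pair \u-10239 \u-9069 becomes the code units D801 DC93, which combine to U+10493.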

Makes one think the UTF-8 RTF format is really a much better format than the original RTF format. Except that only RichEdit understands it.

Comments

  • Anonymous
    November 28, 2013
    > Except that only RichEdit understands it. Well, you just proved Notepad does so, too! But, more seriously, I also implemented UTF-8 reading capability in my RTF formula editor because it's not difficult and allows recognition of plain text strings in many places where RTF is required.

  • Anonymous
    November 28, 2013
    Very intriguing. Did you just use a \cpg65001 in the \fN entry in the font table? In RichEdit, UTF-8 uses \urtf1 instead of \rtf1 to signal UTF-8, but it might actually be more general to use the \cpg65001. Have to check to see if Word understands that...

  • Anonymous
    November 30, 2013
    The comment has been removed

  • Anonymous
    July 06, 2015
    The comment has been removed

    • Anonymous
      February 03, 2017
      Shift-JIS is marked by \fcharset128. UTF-8 RTF starts with {\urtf1 instead of {\rtf1, but \ansicpg65001 would have been a good choice too. And UTF-8 can be autorecognized. Sometimes it starts with the UTF-8 byte order mark (0xEF, 0xBB, 0xBF), and even without the BOM, the regularity of UTF-8 allows a program to recognize it reliably if it has a few non-ASCII bytes.
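
The autorecognition described in the last comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RichEdit's actual detection logic: the function name `looks_like_utf8` and the `min_non_ascii` threshold are hypothetical, and the check simply relies on the fact that text in legacy code pages rarely happens to form valid UTF-8 multibyte sequences.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def looks_like_utf8(data: bytes, min_non_ascii: int = 1) -> bool:
    """Heuristically decide whether a byte buffer is UTF-8 text.

    Hypothetical helper: a BOM is taken as conclusive; otherwise the
    buffer must contain at least `min_non_ascii` bytes >= 0x80 and
    must decode as strictly valid UTF-8.
    """
    if data.startswith(UTF8_BOM):
        return True
    non_ascii = sum(1 for byte in data if byte >= 0x80)
    if non_ascii < min_non_ascii:
        # Pure ASCII is valid UTF-8 but gives no evidence either way.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

if __name__ == "__main__":
    print(looks_like_utf8("\u07c2\u07f0".encode("utf-8")))   # N'Ko text as UTF-8
    print(looks_like_utf8("\u65e5\u672c".encode("shift_jis")))  # Shift-JIS bytes
```

Raising `min_non_ascii` makes the heuristic more conservative for short buffers, since a single valid multibyte sequence is weaker evidence than several.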