Chinese-looking characters in mails from MS Outlook users

Have you ever received mails from MS Outlook users with Chinese-looking (or, more generally, Asian-looking) characters in the message body? Recently I got an email that contained the following sequence of characters, encoded as UTF-8, next to the "normal" parts of the mail:

䐼噉氠湡㵧湥甭⁳汣獡㵳畏汴潯䵫獥慳敧效摡牥愠楬湧∽敬瑦•䥄㵒䰢剔㸢㰠但呎䘠捡㵥䌢•楓敺㈽ⴾ䈼㹒㰠㹂㱆䈯‾⁍㰠剂‾䈼匾⼼㹂ㄠ䈼㹒㰠㹂㱔䈯‾㱏剂‾㰠㹂㱓䈯‾㱎剂‾†⼼但呎‾⼼䥄㹖䐼噉㰾䐯噉

If you feed that string to the search engine of your choice, you will notice that it shows up for many people and that it is actually wrongly encoded HTML. In this article I want to explain how you can examine such strings yourself and find their original meaning.

Step 1: Try to find the encoding

In order to examine these characters, it is helpful to extract them first. One way of doing that is to save the mail body (your mail client should support that). Alternatively you can try to copy and paste them into a file, but the encoding might get changed along the way. So I saved the mail body as a text file first.
In my case the body was base64 encoded (you can tell from the Content-Transfer-Encoding mail header), so I had to decode it via:

$ base64 -d mail.txt > mail_decoded.txt

Then you can use a tool like the file command, which reports what kind of content a file holds and how it is encoded:

$ file mail_decoded.txt
mail_decoded.txt: HTML document, UTF-8 Unicode text, with CRLF line terminators

Alright, so we have an HTML mail here with UTF-8 encoded characters.
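If you would rather not decode by hand, the same step can be sketched in Python. Treat this as a sketch under assumptions: the function name decoded_body and the tiny sample message are made up for illustration, and it expects a simple mail with a text/html or text/plain body and a declared charset. Python's standard email package reads the Content-Transfer-Encoding header and undoes the base64 for us.

```python
import base64
from email import message_from_bytes
from email.policy import default


def decoded_body(raw_mail: bytes) -> str:
    """Return the text body of a mail with its transfer encoding removed."""
    msg = message_from_bytes(raw_mail, policy=default)
    # Prefer the HTML part, fall back to plain text.
    part = msg.get_body(preferencelist=("html", "plain"))
    if part is None:
        raise ValueError("no text body found")
    # get_content() undoes base64 / quoted-printable and applies the charset.
    return part.get_content()


# Hypothetical two-character mail body, base64 encoded like the real mail was.
sample = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode("\u443c\u5649".encode("utf-8"))
    + b"\r\n"
)

body = decoded_body(sample)
# Show the code points instead of the glyphs so the terminal font does not matter.
print([f"U+{ord(c):04X}" for c in body])
```

For a real mail you would read the saved file in binary mode and pass its bytes to decoded_body.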

Step 2: Remove all unnecessary characters

Now let's remove all the parts that don't belong to the string we want to examine. For this you can use the editor of your choice (like vim), just make sure it doesn't change the encoding. The file command told us that the lines are terminated with a carriage return plus a line feed. You can remove both via:

$ tr -d "\r" < mail_decoded.txt | tr -d "\n" > characters.txt

Now we only have the characters of interest in our characters.txt file.

Step 3: Check the encoding

In this step we want to find out if the string is maybe just wrongly encoded. Before we do that, we need to understand some basics.

Sometimes UTF-8 is confused with Unicode. Unicode is basically a mapping. It maps characters (like Chinese characters) to so-called Unicode code points. If you search for the first Asian-looking character with the search engine of your choice, you will find out that the code point for 䐼 is U+443C. The "U+" just tells us that this is a Unicode code point. The rest is a hex number which uniquely represents that character.

UTF-8, UTF-16 and UTF-32, on the other hand, are encodings. They describe how a Unicode code point is stored in memory. In other words, each of them is a mapping from Unicode code points to byte sequences. Depending on which encoding you use, the same character looks different in memory. An ASCII letter, for example, takes one byte in UTF-8 while it takes 2 bytes in UTF-16.
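To make the difference between a code point and its encodings concrete, here is a small Python sketch. The variable names are mine and only serve the illustration. It takes the first character of our string and prints the bytes that each encoding produces for it.

```python
ch = "\u443c"  # the first Asian-looking character from the mail

# The code point is just a number assigned by Unicode.
code_point = f"U+{ord(ch):04X}"
print(code_point)  # U+443C

# The encodings turn that one number into different byte sequences.
encodings = {
    name: ch.encode(name).hex(" ")
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
}
for name, hex_bytes in encodings.items():
    print(f"{name:10} {hex_bytes}")
# utf-8      e4 90 bc
# utf-16-le  3c 44
# utf-16-be  44 3c
# utf-32-le  3c 44 00 00

# An ASCII letter needs one byte in UTF-8 but two in UTF-16.
ascii_sizes = (len("A".encode("utf-8")), len("A".encode("utf-16-le")))
print(ascii_sizes)  # (1, 2)
```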

In order to further examine the string, it can be helpful to get these code points. We can use the iconv tool for that. Then we pipe the result to xxd in order to get a hexdump of the resulting code points.

I use "unicodelittle" here. The "little" is for "little endian" since I do that on an x86 machine (remember, depending on the CPU architecture, multi-byte values are laid out differently in memory).

$ iconv -f utf8 -t unicodelittle characters.txt | xxd
00000000: 3c44 4956 206c 616e 673d 656e 2d75 7320  <DIV lang=en-us 
00000010: 636c 6173 733d 4f75 746c 6f6f 6b4d 6573  class=OutlookMes
00000020: 7361 6765 4865 6164 6572 2061 6c69 676e  sageHeader align
00000030: 3d22 6c65 6674 2220 4449 523d 224c 5452  ="left" DIR="LTR
00000040: 223e 203c 464f 4e54 2046 6163 653d 2243  "> <FONT Face="C
00000050: 2220 5369 7a65 3d32 3e2d 3c42 523e 203c  " Size=2>-<BR> <
00000060: 423e 463c 2f42 3e20 4d20 203c 4252 3e20  B>F</B> M  <BR> 
00000070: 3c42 3e53 3c2f 423e 2031 3c42 523e 203c  <B>S</B> 1<BR> <
00000080: 423e 543c 2f42 3e20 4f3c 4252 3e20 203c  B>T</B> O<BR>  <
00000090: 423e 533c 2f42 3e20 4e3c 4252 3e20 2020  B>S</B> N<BR>   
000000a0: 3c2f 464f 4e54 3e20 3c2f 4449 563e 3c44  </FONT> </DIV><D
000000b0: 4956 3e3c 2f44 4956                      IV></DIV

What do we have here? On the left side we have offsets which tell us where we are in the hex dump. In the middle we have our Unicode code points in hex and on the right we have an ASCII representation of these hex values. The hex values start with 0x3c and 0x44. Remember these hex numbers? 0x443c was the hex number which represented our first Asian-looking character in Unicode. Since we have little endian byte ordering, the 0x3c comes first. Our second Unicode code point is then U+5649 and so on.
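The two ways of reading the dump can be reproduced in a few lines of Python. This is only a sketch: it uses the first six bytes of the hex dump above, and the variable names are made up for the example.

```python
# First six bytes of the hex dump, typed in as hex.
data = bytes.fromhex("3c44 4956 206c")

# Read every 2 bytes as one little endian number: these are the code points.
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]
code_points = [f"U+{u:04X}" for u in units]
print(code_points)  # ['U+443C', 'U+5649', 'U+6C20']

# Read every single byte as ASCII: this is what xxd shows on the right.
as_ascii = data.decode("ascii")
print(as_ascii)  # <DIV l
```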

Now the interesting stuff is on the right side. xxd tries to interpret the data it sees as ASCII and surprisingly this looks like HTML! It looks as if Outlook came across this piece of HTML (possibly when replying to an HTML mail) and misinterpreted it. Instead of assuming this is ASCII (where one byte represents one character) it looks as if it interpreted every 2 bytes as one character. But does that make sense? Why would Outlook assume that this is Unicode? Why would Unicode code points be in an email at all? Normally we would expect UTF-8, UTF-16 or UTF-32 encoded characters, but Unicode code points?

The question is: Is there maybe a 2-byte character encoding which looks similar to Unicode code points? The answer is: UTF-16. UTF-16 stores characters in either 2 or 4 bytes. All Unicode code points from U+0000 to U+FFFF (apart from the surrogate range U+D800 to U+DFFF, which is reserved for building the 4-byte form) are stored as they are. So the Unicode code point U+FFFF would be 0xFFFF in UTF-16. No bit manipulation happens here, unlike with UTF-8.
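A quick way to check this is to look at the 16-bit units that UTF-16 produces. The helper utf16_units below is a made-up name for this sketch. It encodes as big endian only so that the hex digits read in the same order as the code point.

```python
def utf16_units(ch: str) -> list[str]:
    """Return the 16-bit UTF-16 code units of a character as hex strings."""
    raw = ch.encode("utf-16-be")
    return [f"0x{int.from_bytes(raw[i:i + 2], 'big'):04X}" for i in range(0, len(raw), 2)]


# A code point up to U+FFFF is stored as it is, in one 2-byte unit.
print(utf16_units("\u443c"))      # ['0x443C']

# A code point above U+FFFF (here U+1F600) needs 4 bytes: a surrogate pair.
print(utf16_units("\U0001F600"))  # ['0xD83D', '0xDE00']
```

All characters in our mail fall into the first case, which is why the UTF-16 bytes and the code points look the same.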

In order to find out if we are right, we can just try to convert our original UTF-8 string to UTF-16. It should result in the same output as the Unicode output above.

$ iconv -f utf8 -t utf16 characters.txt | xxd
00000000: fffe 3c44 4956 206c 616e 673d 656e 2d75  ..<DIV lang=en-u
00000010: 7320 636c 6173 733d 4f75 746c 6f6f 6b4d  s class=OutlookM
00000020: 6573 7361 6765 4865 6164 6572 2061 6c69  essageHeader ali
00000030: 676e 3d22 6c65 6674 2220 4449 523d 224c  gn="left" DIR="L
00000040: 5452 223e 203c 464f 4e54 2046 6163 653d  TR"> <FONT Face=
00000050: 2243 2220 5369 7a65 3d32 3e2d 3c42 523e  "C" Size=2>-<BR>
00000060: 203c 423e 463c 2f42 3e20 4d20 203c 4252   <B>F</B> M  <BR
00000070: 3e20 3c42 3e53 3c2f 423e 2031 3c42 523e  > <B>S</B> 1<BR>
00000080: 203c 423e 543c 2f42 3e20 4f3c 4252 3e20   <B>T</B> O<BR>
00000090: 203c 423e 533c 2f42 3e20 4e3c 4252 3e20   <B>S</B> N<BR>
000000a0: 2020 3c2f 464f 4e54 3e20 3c2f 4449 563e    </FONT> </DIV>
000000b0: 3c44 4956 3e3c 2f44 4956                 <DIV></DIV

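This looks similar to what we had, but it is not the same output. Were we wrong with our guess? Not quite. Note the first two bytes, 0xFF and 0xFE. This is a Byte Order Mark (BOM): the code point U+FEFF, written at the very start of the data so that a reader can tell the byte order. Stored as FF FE it means little endian, stored as FE FF it would mean big endian. When asked for plain "utf16", iconv wrote little endian data here and put the BOM in front to say so. Everything after the BOM is identical to our earlier output, just shifted by two bytes. If we name the endianness explicitly, no BOM is written. So let's try again: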

$ iconv -f utf8 -t utf16le characters.txt | xxd
00000000: 3c44 4956 206c 616e 673d 656e 2d75 7320  <DIV lang=en-us
00000010: 636c 6173 733d 4f75 746c 6f6f 6b4d 6573  class=OutlookMes
00000020: 7361 6765 4865 6164 6572 2061 6c69 676e  sageHeader align
00000030: 3d22 6c65 6674 2220 4449 523d 224c 5452  ="left" DIR="LTR
00000040: 223e 203c 464f 4e54 2046 6163 653d 2243  "> <FONT Face="C
00000050: 2220 5369 7a65 3d32 3e2d 3c42 523e 203c  " Size=2>-<BR> <
00000060: 423e 463c 2f42 3e20 4d20 203c 4252 3e20  B>F</B> M  <BR>
00000070: 3c42 3e53 3c2f 423e 2031 3c42 523e 203c  <B>S</B> 1<BR> <
00000080: 423e 543c 2f42 3e20 4f3c 4252 3e20 203c  B>T</B> O<BR>  <
00000090: 423e 533c 2f42 3e20 4e3c 4252 3e20 2020  B>S</B> N<BR>  
000000a0: 3c2f 464f 4e54 3e20 3c2f 4449 563e 3c44  </FONT> </DIV><D
000000b0: 4956 3e3c 2f44 4956                      IV></DIV

This looks right! The output is exactly the same as the one we got when converting to Unicode code points. Every 2 bytes make up one character. xxd again interprets the output byte by byte as ASCII, so we don't see any Asian-looking characters on the right.
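The role of the Byte Order Mark from the attempt before can also be seen in Python. The following is a sketch with made-up variable names, using only the first two characters of our string. It builds the data once as little endian and once as big endian, each with the matching BOM in front, and shows that a decoder which honours the BOM gets the same text back from both.

```python
import codecs

text = "\u443c\u5649"  # the first two characters of our string

# Same text, both byte orders, each prefixed with its Byte Order Mark.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(le_with_bom.hex(" "))  # ff fe 3c 44 49 56
print(be_with_bom.hex(" "))  # fe ff 44 3c 56 49

# The plain "utf-16" decoder reads the BOM, picks the byte order and drops it.
same_text = le_with_bom.decode("utf-16") == be_with_bom.decode("utf-16") == text
print(same_text)  # True
```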

So we can conclude: Outlook came across HTML code that was plain ASCII, with every character encoded in one byte. At some point during mail processing, this HTML code was mistakenly assumed to be UTF-16 little endian rather than ASCII, so Outlook treated every two bytes as one character. Later the mail was converted from UTF-16 little endian to UTF-8, which resulted in our Asian-looking characters. Mystery solved!
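The whole mix-up, and its reversal, fits into a few lines of Python. This is a sketch of the mechanism described above, not of what Outlook does internally. The function names garble and recover are mine, and padding an odd-length input with a space is an assumption I make so that the bytes always form whole 2-byte units.

```python
def garble(html: str) -> str:
    """Misread ASCII HTML as UTF-16 little endian, two bytes per character."""
    raw = html.encode("ascii")
    if len(raw) % 2:
        raw += b" "  # assumption: pad to a whole number of 2-byte units
    return raw.decode("utf-16-le")


def recover(garbled: str) -> str:
    """Undo the mix-up: write the text as UTF-16 LE and read the bytes as ASCII."""
    return garbled.encode("utf-16-le").decode("ascii")


original = "<DIV lang=en-us "
garbled = garble(original)

# Eight Asian-looking characters come out of sixteen ASCII characters.
print([f"U+{ord(c):04X}" for c in garbled])
print(recover(garbled) == original)  # True
```

To decode a string from a real mail, pass the Asian-looking characters to recover. It raises a UnicodeEncodeError or UnicodeDecodeError if the text was not produced by this particular mix-up.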
