Have you ever received mail from MS Outlook users that had Chinese-looking (or, more generally, Asian-looking) characters in the message body? Recently I got an email containing this sequence of UTF-8 characters in the message body next to the "normal" parts of the mail:
If you feed that string to the search engine of your choice, you will notice that many people run into the same characters and that they are in fact wrongly encoded HTML. In this article I want to explain how you can examine such strings yourself and find their original meaning.
Step 1: Try to find the encoding
In order to examine these characters, it is helpful to extract them first. One way of doing that is to save the mail body (your mail client should support that). Alternatively you can just try to copy and paste them into a file, but that way the encoding might get changed. So I saved the mail body as a text file first.
In my case the mail was base64 encoded (you should be able to tell from the Content-Transfer-Encoding mail header). So I had to decode the mail via:
$ base64 -d mail.txt > mail_decoded.txt
Then you could use a tool like the file command, which gives you useful information about the content of the file and its encoding.
$ file mail_decoded.txt
mail_decoded.txt: HTML document, UTF-8 Unicode text, with CRLF line terminators
Alright so we have an HTML mail here with UTF-8 encoded characters.
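The decode-and-inspect step can also be sketched in Python. This is only an illustration under assumptions: `sample_html` and `encoded_body` are made-up stand-ins for the saved mail body, and `looks_like_utf8` is a hypothetical helper that only roughly mirrors what `file` reports.

```python
import base64

# Made-up stand-in for the base64 mail body; the real input would be read from mail.txt.
sample_html = "<p>" + chr(0x443C) + "</p>\r\n"
encoded_body = base64.b64encode(sample_html.encode("utf-8"))

# Equivalent of `base64 -d`.
decoded = base64.b64decode(encoded_body)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8(decoded))  # is the body valid UTF-8?
print(b"\r\n" in decoded)        # does it contain CRLF line terminators?
```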
Step 2: Remove all unnecessary characters
Now let's remove all the parts that don't belong to the string we want to examine. For this you can use the editor of your choice (like vim), just make sure it doesn't change the encoding. The file command told us that the lines are terminated with a carriage return plus a line feed. You can remove them via:
$ tr -d "\r" < mail_decoded.txt | tr -d "\n" > characters.txt
Now we only have the characters of interest in our characters.txt file.
Step 3: Check the encoding
In this step we want to find out if the string is maybe just wrongly encoded. Before we do that, we need to understand some basics.
Sometimes UTF-8 is confused with Unicode. Unicode is basically a mapping. It maps characters (like Chinese characters) to so-called Unicode code points. If you search for the first Asian-looking character with the search engine of your choice, you will find out that the code point for 䐼 is U+443C. The "U+" just tells us that this is a Unicode code point. The rest is a hex number which uniquely identifies that character.
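As a small illustration of this mapping, Python's built-in `ord` and `chr` convert between a character and its code point. The variable names are my own; nothing here involves an encoding yet.

```python
ch = chr(0x443C)               # the character whose code point is U+443C
code_point = ord(ch)           # character -> code point (a plain integer)
label = f"U+{code_point:04X}"  # the usual "U+" notation with hex digits

print(label)  # U+443C
```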
UTF-8, UTF-16 and UTF-32, on the other hand, are encodings. They describe how a Unicode code point is stored in memory. In other words, an encoding is a mapping from Unicode code points to byte sequences. Depending on which encoding you use, the resulting encoded character looks different in memory. The same character can take one byte in UTF-8 while it takes 2 bytes in UTF-16.
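To make the difference concrete, here is a minimal sketch that encodes the same character with several encodings and counts the bytes. `encoded_sizes` is a hypothetical helper name; the codec names are the ones Python ships with.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))                    # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(chr(0x443C)))            # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(chr(0x443C).encode("utf-8").hex(" "))  # e4 90 bc
```

The code point stays the same in every case; only the byte sequence that represents it changes.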
In order to further examine the string, it can be helpful to get these code points. We can use the iconv tool for that. Then we pipe the result to xxd in order to get a hexdump of the resulting code points.
I use "unicodelittle" here. The "little" is for "little endian" since I do that on an x86 machine (remember, depending on the CPU architecture, multi-byte values are represented differently in memory).
$ iconv -f utf8 -t unicodelittle characters.txt | xxd
00000000: 3c44 4956 206c 616e 673d 656e 2d75 7320  <DIV lang=en-us 
00000010: 636c 6173 733d 4f75 746c 6f6f 6b4d 6573  class=OutlookMes
00000020: 7361 6765 4865 6164 6572 2061 6c69 676e  sageHeader align
00000030: 3d22 6c65 6674 2220 4449 523d 224c 5452  ="left" DIR="LTR
00000040: 223e 203c 464f 4e54 2046 6163 653d 2243  "> <FONT Face="C
00000050: 2220 5369 7a65 3d32 3e2d 3c42 523e 203c  " Size=2>-<BR> <
00000060: 423e 463c 2f42 3e20 4d20 203c 4252 3e20  B>F</B> M  <BR> 
00000070: 3c42 3e53 3c2f 423e 2031 3c42 523e 203c  <B>S</B> 1<BR> <
00000080: 423e 543c 2f42 3e20 4f3c 4252 3e20 203c  B>T</B> O<BR>  <
00000090: 423e 533c 2f42 3e20 4e3c 4252 3e20 2020  B>S</B> N<BR>   
000000a0: 3c2f 464f 4e54 3e20 3c2f 4449 563e 3c44  </FONT> </DIV><D
000000b0: 4956 3e3c 2f44 4956                      IV></DIV
What do we have here? On the left side we have offsets which tell us where we are in the hex dump. In the middle we have the bytes of our Unicode code points in hex, and on the right we have an ASCII representation of these hex values. The hex values start with 0x3c and 0x44. Remember these hex numbers? 0x443c was the hex number which represented our first Asian-looking character in Unicode. Since we have little endian byte ordering, the 0x3c comes first. Our second Unicode code point is then U+5649 and so on.
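The byte-order reading can be checked with a short sketch. It takes the first four bytes of the hexdump and unpacks them as two 16-bit values, once as little endian and once as big endian. The variable names are my own.

```python
import struct

first_bytes = bytes.fromhex("3c444956")  # first four bytes of the hexdump

little = struct.unpack("<2H", first_bytes)  # two unsigned 16-bit values, little endian
big = struct.unpack(">2H", first_bytes)     # the same bytes read as big endian

print([f"U+{u:04X}" for u in little])  # ['U+443C', 'U+5649']
print([f"U+{u:04X}" for u in big])     # ['U+3C44', 'U+4956']
```

Only the little endian reading gives the code point U+443C that we looked up earlier.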
Now the interesting stuff is on the right side. xxd tries to interpret the data it sees as ASCII and surprisingly this looks like HTML! It looks as if Outlook came across this piece of HTML (possibly when replying to an HTML mail) and misinterpreted it. Instead of assuming this is ASCII (where one byte represents one character) it looks as if it interpreted every 2 bytes as one character. But does that make sense? Why would Outlook assume that this is Unicode? Why would Unicode code points be in an email at all? Normally we would expect UTF-8, UTF-16 or UTF-32 encoded characters, but Unicode code points?
The question is: Is there maybe a 2 byte character encoding which looks similar to Unicode code points? The answer is: UTF-16. UTF-16 stores characters in either 2 or 4 bytes. All Unicode code points from U+0000 to U+FFFF (apart from the range U+D800 to U+DFFF, which is reserved for surrogates) are just encoded as they are. So the Unicode code point U+FFFF would be 0xFFFF in UTF-16. No magic bit manipulation happens here, unlike with UTF-8, for example.
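A minimal sketch of this property: for a code point up to U+FFFF the single UTF-16 code unit equals the code point, while a code point above that range needs two code units (a surrogate pair). `utf16_units` is a hypothetical helper, and U+1F600 is just an arbitrary example of a code point above U+FFFF.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit UTF-16 code units of a single character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print([hex(u) for u in utf16_units(chr(0x443C))])   # ['0x443c']
print([hex(u) for u in utf16_units(chr(0x1F600))])  # ['0xd83d', '0xde00']
```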
In order to find out if we are right, we can just try to convert our original UTF-8 string to UTF-16. It should result in the same output as the Unicode output above.
$ iconv -f utf8 -t utf16 characters.txt | xxd
00000000: fffe 3c44 4956 206c 616e 673d 656e 2d75  ..<DIV lang=en-u
00000010: 7320 636c 6173 733d 4f75 746c 6f6f 6b4d  s class=OutlookM
00000020: 6573 7361 6765 4865 6164 6572 2061 6c69  essageHeader ali
00000030: 676e 3d22 6c65 6674 2220 4449 523d 224c  gn="left" DIR="L
00000040: 5452 223e 203c 464f 4e54 2046 6163 653d  TR"> <FONT Face=
00000050: 2243 2220 5369 7a65 3d32 3e2d 3c42 523e  "C" Size=2>-<BR>
00000060: 203c 423e 463c 2f42 3e20 4d20 203c 4252   <B>F</B> M  <BR
00000070: 3e20 3c42 3e53 3c2f 423e 2031 3c42 523e  > <B>S</B> 1<BR>
00000080: 203c 423e 543c 2f42 3e20 4f3c 4252 3e20   <B>T</B> O<BR> 
00000090: 203c 423e 533c 2f42 3e20 4e3c 4252 3e20   <B>S</B> N<BR> 
000000a0: 2020 3c2f 464f 4e54 3e20 3c2f 4449 563e    </FONT> </DIV>
000000b0: 3c44 4956 3e3c 2f44 4956                 <DIV></DIV
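This looks similar to what we had, but it is not the same output. Were we wrong with our guess? Not quite. Note the first two bytes, which are 0xFF 0xFE. This is called a Byte Order Mark (BOM). It is the code point U+FEFF, which can be placed at the start of UTF-16 data so that a reader can tell the byte order. The byte sequence 0xFF 0xFE means little endian, whereas big endian data would start with 0xFE 0xFF. So iconv did produce little endian UTF-16 here. It only prepended a BOM, which shifts everything after it by two bytes in the hexdump. If we state the byte order explicitly, the BOM is not written. So let's try again: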
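How a BOM relates to the byte order can be sketched in a few lines of Python. The constant `codecs.BOM_UTF16_LE` is part of the standard library; the variable names are my own. Python's generic "utf-16" encoder also writes a BOM, but in the machine's native byte order, so this sketch builds the little endian variant explicitly.

```python
import codecs

text = chr(0x443C)

without_bom = text.encode("utf-16-le")          # explicit byte order: no BOM is written
with_bom = codecs.BOM_UTF16_LE + without_bom    # little endian BOM followed by the data

print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(with_bom.hex(" "))             # ff fe 3c 44
print(without_bom.hex(" "))          # 3c 44

# A decoder for generic "utf-16" uses the BOM to pick the byte order and drops it.
print(with_bom.decode("utf-16") == text)  # True
```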
$ iconv -f utf8 -t utf16le characters.txt | xxd
00000000: 3c44 4956 206c 616e 673d 656e 2d75 7320  <DIV lang=en-us 
00000010: 636c 6173 733d 4f75 746c 6f6f 6b4d 6573  class=OutlookMes
00000020: 7361 6765 4865 6164 6572 2061 6c69 676e  sageHeader align
00000030: 3d22 6c65 6674 2220 4449 523d 224c 5452  ="left" DIR="LTR
00000040: 223e 203c 464f 4e54 2046 6163 653d 2243  "> <FONT Face="C
00000050: 2220 5369 7a65 3d32 3e2d 3c42 523e 203c  " Size=2>-<BR> <
00000060: 423e 463c 2f42 3e20 4d20 203c 4252 3e20  B>F</B> M  <BR> 
00000070: 3c42 3e53 3c2f 423e 2031 3c42 523e 203c  <B>S</B> 1<BR> <
00000080: 423e 543c 2f42 3e20 4f3c 4252 3e20 203c  B>T</B> O<BR>  <
00000090: 423e 533c 2f42 3e20 4e3c 4252 3e20 2020  B>S</B> N<BR>   
000000a0: 3c2f 464f 4e54 3e20 3c2f 4449 563e 3c44  </FONT> </DIV><D
000000b0: 4956 3e3c 2f44 4956                      IV></DIV
This looks right! The output looks exactly like the one we had when converting to Unicode code points. Every 2 bytes we have a new character. xxd again interprets the output bytewise as ASCII characters, so we don't see any Asian-looking characters on the right.
So we can conclude: Outlook came across HTML code which was ASCII. Every character of the HTML code was encoded in one byte. At some point during mail processing this HTML code was mistakenly assumed to be encoded as UTF-16 little endian rather than ASCII. So Outlook assumed that two bytes make one character. At some later point the mail was then converted from UTF-16 little endian to UTF-8, which resulted in our Asian-looking characters. Mystery solved!
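The whole mix-up and its reversal can be reproduced in a short sketch. The function names are hypothetical and the HTML fragment is a shortened, made-up sample. Its length is even on purpose, because decoding an odd number of bytes as UTF-16 fails on the last, incomplete pair.

```python
def misread_as_utf16le(ascii_html: bytes) -> str:
    """Decode ASCII bytes as if they were UTF-16 little endian (the suspected mistake)."""
    return ascii_html.decode("utf-16-le")


def recover_ascii(garbled: str) -> str:
    """Undo the mistake: re-encode as UTF-16 little endian and read the bytes as ASCII."""
    return garbled.encode("utf-16-le").decode("ascii")


html = b"<DIV><B>F</B> </DIV>"      # 20 ASCII bytes, so 10 two-byte pairs
garbled = misread_as_utf16le(html)

print(len(garbled))                 # 10
print(f"U+{ord(garbled[0]):04X}")   # U+443C
print(garbled.encode("utf-8"))      # the bytes that would end up in a UTF-8 mail body
print(recover_ascii(garbled))       # <DIV><B>F</B> </DIV>
```

Every pair of ASCII bytes becomes one character, and the first one is again U+443C, the character we started our investigation with.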