Fun with Unicode! 2007/12/01

Why can't XULRunner do this already?

Without really saying what "this" is, Mozilla wants XULRunner to be the standard play for cross-platform GUI applications and yet it has a bit of trouble interacting with 3rd party libraries. Most mature libraries (say, GraphicsMagick) still use char * or std::string to represent file paths. To interact with them, then, you have to be able to get to a char * representation of files you're trying to open.

Windows and Mac character encodings 101

As is becoming tradition, Windows makes my life difficult by using UTF-16 as its Unicode of choice. UTF-16 uses 2 bytes (usually) to represent each character, meaning it isn't binary compatible with ASCII, which uses 1 byte per character. These 2-byte characters are stored in wchar_ts which are twice as wide as chars. It isn't possible to pass a wchar_t * through normal C libraries that expect strings as char *.
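
To make the incompatibility concrete, here is a minimal sketch in portable C++. It uses std::u16string as a stand-in for a Windows wchar_t string, and the helper name c_visible_length is made up for illustration. It counts how much of a UTF-16 path a char *-based function would actually see.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// How many bytes a char *-based C function would read if it were handed
// the raw bytes of a UTF-16 string: it stops at the first zero byte.
std::size_t c_visible_length(const std::u16string & path) {
	// Copy the raw bytes, including the 2-byte terminator, into a char buffer.
	std::string bytes(reinterpret_cast<const char *>(path.c_str()),
		(path.size() + 1) * sizeof(char16_t));
	return std::strlen(bytes.c_str());
}
```

Every ASCII character in UTF-16 carries a zero byte, so for a path like u"C:\\photo.jpg" the function reports at most 1 (the byte order decides whether the zero comes first or second). A C library would treat the path as ending right there.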

Mac OS X is my hero: it uses UTF-8 internally to represent strings, which has the advantage of using single bytes (that's char for C programmers playing along) as its base unit. While a single character in UTF-8 can span up to 4 bytes, the in-memory representation of these characters will pass muster with a C compiler. When the UTF-8 bytes are passed through some C library they can arrive intact at fopen, which understands the UTF-8 bytes and acts like you'd expect.
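
A small sketch of why that works, assuming a third-party library that does nothing smarter than copy a NUL-terminated char * around. The function through_c_library is a hypothetical stand-in, not any real library's API.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a third-party C library: it only ever sees a
// NUL-terminated char * and copies it with plain strlen/strcpy.
std::string through_c_library(const char * path) {
	std::vector<char> buffer(std::strlen(path) + 1);
	std::strcpy(buffer.data(), path);
	return std::string(buffer.data());
}
```

UTF-8 never produces a zero byte except for the terminator itself, so a path such as "/tmp/caf\xC3\xA9.png" (the two bytes C3 A9 encode the accented e) comes out the other side byte-for-byte identical. The library never needs to know it was handling Unicode.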

There's an easy way?

There could be. Using the @mozilla.org/file/local;1 component it might be possible to get the path as an nsACString, Mozilla's portable representation of strings of single-byte characters. But the reference warns that the native path available from nsILocalFile is not for passing to C libraries and isn't guaranteed to be correct. So it's up to me.

No pain, no gain

Bear with me, this gets ugly. The solution that lets me take paths from JavaScript-land to C-land, open Unicode paths with a normal char * and do so on Windows and Mac OS X is stupefying (and stupid). The initial path is stored as a JavaScript string, so the boundary between JavaScript and C will be an nsAString, which represents strings as UTF-16.

The string takes quite a journey here, so let's cover the easy side first. Macs understand UTF-8, and I'm making a transformation to UTF-8 in JavaScript. This means that the only transform needed in C is a type change, not an encoding change. The UTF-8 string enters C-land as wide characters and is simply cast character-by-character down to an array of bytes.
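
Stripped of the Mozilla string classes, the Mac half boils down to something like the sketch below. It assumes the JavaScript side has already placed exactly one UTF-8 byte in each 16-bit unit; std::u16string stands in for nsAString, and narrow_units is an illustrative name, not a Mozilla function.

```cpp
#include <string>

// Narrow a "fake" UTF-16 string to bytes. Assumption: every 16-bit unit
// holds a single UTF-8 byte (0x00-0xFF), so dropping the high byte loses
// nothing. Real UTF-16 text would be mangled by this.
std::string narrow_units(const std::u16string & fake) {
	std::string bytes;
	bytes.reserve(fake.size());
	for (char16_t unit : fake) {
		bytes += static_cast<char>(unit & 0xFF);
	}
	return bytes;
}
```

Given the units 'c', 'a', 'f', 0x00C3, 0x00A9, this returns the five bytes "caf\xC3\xA9", which is the UTF-8 encoding the Mac expects. No encoding work happens in C at all, only a change of container.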

Windows is quite a bit more work, since a path in Windows that needs Unicode characters can't easily be represented as a char *. Enter the Windows API function GetShortPathName. This can convert any path to an old-school Windows path that fits in ASCII and uses that familiar "8.3" naming convention. I got the Mac half working first, so the Windows implementation is actually more complicated than it needs to be -- more on that later. Without further ado:

string * conv_path(const nsAString & fake) {

	// Fun with Windows paths
#ifdef XP_WIN

	// UTF-16 but really UTF-8 nsAString to really UTF-8 nsCString
	nsCString utf8 = NS_LossyConvertUTF16toASCII(fake);

	// UTF-8 nsCString to UTF-16 nsEmbedString
	nsEmbedString utf16 = NS_ConvertUTF8toUTF16(utf8);

	// UTF-16 nsEmbedString to wchar_t[]
	wchar_t * w_arr = new wchar_t[utf16.Length() + 1];
	if (0 == w_arr) return 0;
	wchar_t * w_arr_p = w_arr;
	PRUnichar * w_start = (PRUnichar *)utf16.BeginReading();
	const PRUnichar * w_end = (PRUnichar *)utf16.EndReading();
	while (w_start != w_end) {
		*w_arr_p++ = (wchar_t)*w_start++;
	}
	*w_arr_p = 0;

	// GetShortPathName to get guaranteed ASCII
	wchar_t s_arr[4096];
	if (0 == GetShortPathNameW(w_arr, s_arr, 4096)) {
		delete [] w_arr;
		return 0;
	}
	delete [] w_arr;

	// wchar_t[] to ASCII nsEmbedString
	nsEmbedString ascii;
	wchar_t * s_arr_p = s_arr;
	while (*s_arr_p) {
		ascii.Append((char)*s_arr_p++);
	}

	// Macs don't need any help since they understand UTF-8
#else
	nsEmbedString ascii;
	ascii.Assign(fake);
#endif

	// Convert the nsEmbedString into a std::string
	char * c_arr = new char[ascii.Length() + 1];
	if (0 == c_arr) return 0;
	char * c_arr_p = c_arr;
	PRUnichar * c_start = (PRUnichar *)ascii.BeginReading();
	const PRUnichar * c_end = (PRUnichar *)ascii.EndReading();
	while (c_start != c_end) {
		*c_arr_p++ = (char)*c_start++;
	}
	*c_arr_p = 0;
	string * str = new string(c_arr);
	delete [] c_arr;
	return str;

}

This works simply because, on either platform, the result is a char * that actually uniquely represents the desired file.

My code sucks, let me count the ways

There are of course things that could be better. First of all, after reading the code again (I wrote it Thursday and haven't looked back), I see that I should refactor a bit. By leaving the string in UTF-16 representation for passing to C-land, I can add a transform from UTF-16 to UTF-8 to the Mac version and greatly simplify the Windows version. I'll put that on my to-do list. The other place I hope to improve is in the event that GetShortPathName fails. In this case, copying the file under a new ASCII filename to a temporary directory (found using GetTempPath) would still let the file be accessed with a char * path.
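
The refactored Mac path would need a real UTF-16 to UTF-8 transform in C. A rough sketch of what that could look like in portable C++ follows. It uses std::u16string in place of nsAString, the function name utf16_to_utf8 is my own, and it has not been wired into the code above. Inside Mozilla, NS_ConvertUTF16toUTF8 would presumably do the same job.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes. Unpaired surrogates are
// replaced with U+FFFD rather than rejected (a choice made for this sketch).
std::string utf16_to_utf8(const std::u16string & in) {
	std::string out;
	for (std::size_t i = 0; i < in.size(); ++i) {
		char32_t cp = in[i];
		if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
				&& in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
			// High surrogate followed by low surrogate: combine the pair.
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
			++i;
		} else if (cp >= 0xD800 && cp <= 0xDFFF) {
			cp = 0xFFFD; // unpaired surrogate
		}
		if (cp < 0x80) {
			out += static_cast<char>(cp);
		} else if (cp < 0x800) {
			out += static_cast<char>(0xC0 | (cp >> 6));
			out += static_cast<char>(0x80 | (cp & 0x3F));
		} else if (cp < 0x10000) {
			out += static_cast<char>(0xE0 | (cp >> 12));
			out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
			out += static_cast<char>(0x80 | (cp & 0x3F));
		} else {
			out += static_cast<char>(0xF0 | (cp >> 18));
			out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
			out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
			out += static_cast<char>(0x80 | (cp & 0x3F));
		}
	}
	return out;
}
```

With this in place the JavaScript side could hand over the string untouched. The Windows branch could then pass the UTF-16 straight to GetShortPathNameW, and the Mac branch would call the transform once.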

Why did I have to do this?

It would be sweeeeeet if Mozilla integrated this code (after a healthy dose of optimization) into the toolkit as something like NS_ConvertToNativePath(nsAString) or some such function. Having this will make integrating with third-party libraries much easier.

Comments (7)

Yikes.. You are a beast for working through this.

— Mike Panchenko — 2007/12/01 10:55 am
The reason we don't do this is that GetShortPathName can fail, and has, if users have NtfsDisable8dot3NameCreation in their registry, or if they are mapping another filesystem (such as NFS) that doesn't have a concept of shortnames: see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=303598

— Benjamin Smedberg — 2007/12/01 7:11 pm
Ah, which is why I still need to back all this up with a copy to the TMP directory. I still think there might be a good reason to have this function if just to avoid unnecessary file copying. It isn't foolproof but it is a good shortcut to take when you can.

— Richard Crowley — 2007/12/02 9:36 am
what happens if you do an opendir/readdir (or whatever the windows api equivalent versions are) on a folder with unicode named files in, using the non-unicode windows api (the A functions) with 8.3 disabled? do the files appear at all? if they do, presumably the filenames given back are able to be passed to fopen, etc. hmm

— cal — 2007/12/02 1:13 pm
bah. the answer is that these files are unopenable. nice work windows :(

— cal — 2007/12/03 11:09 am
I don't know what all this means, but it makes me want to play with XUL.

— Andrew Mager — 2007/12/25 2:29 pm
well.. utf8 is so convenient, that's true, but it's also a major source of bugs in the software industry. Many developers get confused: they see a char* and assume it's ASCII. They then perform some manipulation on the string using the usual functions, the most disastrous being probably strrchr(), which will often break some Asian compounds in utf8 strings, while this virtually never happens with utf16 strings and the equivalent wcsrchr(). So, there's no black and white. People tend to say that "Apple is cool and Microsoft sucks" but they actually both have advantages. Using UTF16 in Windows was a good idea since they defined wchar_t as two bytes. This means you can easily and safely (I would actually say 'almost safely') manipulate Unicode strings natively, meaning you have them readable in your favorite debugger, thanks to the ANSI C string functions (wcscpy, etc). On the Mac, they decided for some obscure reason to define the size of wchar_t as 4 bytes. This means the wcs functions handle UTF32, which nobody likes to use, and makes 50% of the C string functions useless. Most multi-platform applications used UTF16 because it was safe and, until now, supported almost everywhere. Now, developers have to make conversions from UTF16 to UTF8 and/or UTF32 while there are no functions on Mac OS X to help. UTF16 is not supported (from what I know, and I can make mistakes) so developers use all kinds of tricks to convert their UTF16 strings to something else.. and bugs happen on the Mac only. So, was it really a judicious choice?

As we can see, there is not a 'cool Apple' and a 'bad Microsoft'. They both have pros and cons. An application that could be compiled for Windows 95 still compiles for Vista, sizes haven't changed and everything runs smoothly; while we have to run after Apple's caprices each time they change their architecture, libraries, frameworks... But on the other hand, we must admit that to rewrite code is good, so I wouldn't say that 'Apple sucks' and 'Microsoft is good' ;-)

But finally, all this comes from a lack of string management in native c/c++ libraries. We can hope that Boost extensions will make our lives better, and put an end to these too often useless discussions.

— Luc Leroy — 2008/05/07 8:40 am