Anyone who says the console can’t do Unicode isn’t as smart as they think they are

Do you play the odds?


If you are a developer, the odds are that, all things being equal, you are not nearly as smart now, before you read this blog, as you will be once you have read it….


Back in the middle of February I mentioned in The real problem(s) with all of these console “fallback” discussions that, of the many people talking about the console these days, most of them are wrong.


Solving problems that don’t exist, incorrectly impacting problems that do exist, and just generally making the situation worse overall….


But I didn’t really finish the work there; the blog was merely armchair criticisms of bugs, design flaws, mistaken assumptions spoken as fact, documentation problems, etc.


100% accurate, but not described in a way that can help you move to the next step (getting it done right, in either native or managed code).


Today’s blog is going to change all that. :-)


All of this and much more will be covered in the upcoming training on the World-Ready Console, if you are on the Windows team….


Back in March of 2008, after STL showed me how, I showed that the console could be 100% Unicode (see Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?). But there is one piece of the puzzle still missing.


I mean it is all well and good to show it in just a few lines of native code using the CRT.


But the truth is that this problem exists in managed code too (some of which actually uses the CRT, and Win32), and also in native code that has no heavy CRT dependency or doesn’t take one on.


Behind the scenes, the CRT is doing all the right work in those circumstances to e.g. call WriteConsoleW or WriteFile (depending on whether the console’s output streams are redirected or not).


So anyone trying to do the same thing in native Win32 would have to do that same work.


And although the CRT and .NET are both being developed in the same division of Microsoft, and .NET has its own internal CRT dependencies (it depends on .NET’s version of the CRT even when it ships with the OS), the managed Console class is not using this CRT functionality. And they are not doing it the hard way themselves, either.


Now calling the CRT from VB.NET or C# (or other non-C++ languages) has some interesting challenges that I am not going to get into here (if someone wants to go that way they can). I thought instead I’d just give you the code really quick so you can do it in whatever language, without the version or CRT dependencies.


Now this is C# code, this WriteLineRight sample function.


But it is pretty much Win32 code written in C#. So Win32 developers should have no trouble grokking it or what it is doing:



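using System;
using System.Runtime.InteropServices;

// CONSOLE_FONT_INFO_EX below uses a fixed-size buffer, so compile with /unsafe.
public class Test {
    public static void Main() {
        // Arabic, Cyrillic, and CJK text, followed by two newlines
        string st = "\u0627\u0628\u0629 \u043a\u043e\u0448\u043a\u0430 \u65e5\u672c\u56fd\n\n";
        WriteLineRight(st);
    }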

    internal static bool IsConsoleFontTrueType(IntPtr std) {
        CONSOLE_FONT_INFO_EX cfie = new CONSOLE_FONT_INFO_EX();
        cfie.cbSize = (uint)Marshal.SizeOf(cfie);
        if (GetCurrentConsoleFontEx(std, false, ref cfie)) {
            return ((cfie.FontFamily & TMPF_TRUETYPE) == TMPF_TRUETYPE);
        }
        return false;
    }

    public static void WriteLineRight(string st) {
        IntPtr stdout = GetStdHandle(STD_OUTPUT_HANDLE);
        if (stdout != INVALID_HANDLE_VALUE) {
            uint filetype = GetFileType(stdout);
            // FILE_TYPE_UNKNOWN plus a nonzero last error means GetFileType itself failed
            if (!((filetype == FILE_TYPE_UNKNOWN) && (Marshal.GetLastWin32Error() != ERROR_SUCCESS))) {
                bool fConsole;
                uint mode;
                uint written;
                filetype &= ~(FILE_TYPE_REMOTE);
                if (filetype == FILE_TYPE_CHAR) {
                    // A character device counts as a console only if GetConsoleMode accepts the handle
                    bool retval = GetConsoleMode(stdout, out mode);
                    if ((retval == false) && (Marshal.GetLastWin32Error() == ERROR_INVALID_HANDLE)) {
                        fConsole = false;
                    } else {
                        fConsole = true;
                    }
                } else {
                    fConsole = false;
                }

                if (fConsole) {
                    if (IsConsoleFontTrueType(stdout)) {
                        // The length is in characters (UTF-16 code units), not bytes
                        WriteConsoleW(stdout, st, st.Length, out written, IntPtr.Zero);
                    } else {
                        //
                        // Not a TrueType font, so the output may have trouble here
                        // Need to check the codepage settings
                        //
                        // TODO: Add the old style GetConsoleFallbackUICulture code here!!!
                    }
                } else {
                    //
                    // Write out a Unicode BOM to ensure proper processing by text readers
                    // (lengths here are in bytes: 2 per UTF-16 code unit)
                    //
                    WriteFile(stdout, BOM, 2, out written, IntPtr.Zero);
                    WriteFile(stdout, st, st.Length * 2, out written, IntPtr.Zero);
                }
            }
        }
    }

    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
    internal static extern bool WriteConsoleW(IntPtr hConsoleOutput,
                                              string lpBuffer,
                                              int nNumberOfCharsToWrite,
                                              out uint lpNumberOfCharsWritten,
                                              IntPtr lpReserved);

    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
    internal static extern bool WriteFile(IntPtr hFile,
                                          string lpBuffer,
                                          int nNumberOfBytesToWrite,
                                          out uint lpNumberOfBytesWritten,
                                          IntPtr lpOverlapped);

    [DllImport("kernel32.dll", ExactSpelling=true, SetLastError=true)]
    internal static extern bool GetConsoleMode(IntPtr hConsoleHandle, out uint lpMode);

    [DllImport("kernel32.dll", ExactSpelling=true)]
    internal static extern bool GetCurrentConsoleFontEx(IntPtr hConsoleOutput,
                                                        bool bMaximumWindow,
                                                        ref CONSOLE_FONT_INFO_EX lpConsoleCurrentFontEx);

    [DllImport("Kernel32.DLL", ExactSpelling=true, SetLastError=true)]
    internal static extern uint GetFileType(IntPtr hFile);

    [DllImport("Kernel32.DLL", ExactSpelling=true)]
    internal static extern IntPtr GetStdHandle(int nStdHandle);

    internal struct COORD {
        internal short X;
        internal short Y;
        internal COORD(short x, short y) {
            X = x;
            Y = y;
        }
    }

    // FaceName is WCHAR[LF_FACESIZE] in the native struct, so marshal the struct as Unicode
    [StructLayout(LayoutKind.Sequential, CharSet=CharSet.Unicode)]
    internal unsafe struct CONSOLE_FONT_INFO_EX {
        internal uint cbSize;
        internal uint nFont;
        internal COORD dwFontSize;
        internal int FontFamily;
        internal int FontWeight;
        fixed char FaceName[LF_FACESIZE];
    }

    internal const int TMPF_TRUETYPE = 0x4;
    internal const int LF_FACESIZE = 32;
    internal const string BOM = "\uFEFF";
    internal const int STD_OUTPUT_HANDLE = -11; // Handle to the standard output device.
    internal const int ERROR_INVALID_HANDLE = 6;
    internal const int ERROR_SUCCESS = 0;
    internal const uint FILE_TYPE_UNKNOWN = 0x0000;
    internal const uint FILE_TYPE_DISK = 0x0001;
    internal const uint FILE_TYPE_CHAR = 0x0002;
    internal const uint FILE_TYPE_PIPE = 0x0003;
    internal const uint FILE_TYPE_REMOTE = 0x8000;
    internal static IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
}


And there you go!



A few things to note here:



  • Your console does have to have a TrueType font selected (mine has Consolas).

    If you don’t do this, the bitmap fonts will show question marks (though the redirect case will work just fine). Note that the code detects this case and currently has a huge TODO there; this is where the old logic of checking the encoding would be done.

  • Writing a BOM on every call when output is redirected, as this sample does, is probably overkill. But in quick tests where you redirect the output to a file, it lets you immediately use the type command rather than waiting until you open it in Notepad and save it to get the BOM in. Your mileage may vary based on what you are using it for: the BOM is never needed if there is no redirection, and it is needed only once, on the first line, if there is.

  • If you mark the text that looks like three boxes, some Cyrillic, and three boxes and copy it to the clipboard and paste it somewhere, you will see it is valid Unicode:

    ابة кошка 日本国

    So much for no complex scripts in the console!

  • Ditto for the ones who said no CJK there when your default system locale isn’t the right code page for the CJK!

  • The text file you redirect to has the text 100% correct even if the font information is wrong due to it not being a TrueType font.

  • This code will work the same no matter what you do with chcp.

  • It will also work the same no matter what you do with any of the functions to set the console codepages.

  • You can extend this code to both STDIN and STDERR by making similar calls to GetStdHandle for those handles, too.

  • You can do similar work with ReadConsoleW and ReadFile to take care of the STDIN; the writing of the ReadLineRight function is left as an exercise for the student.

  • The next time someone talks about SetThreadPreferredUILanguages(MUI_CONSOLE_FILTER, NULL, NULL) or GetConsoleFallbackUICulture, send them to this blog, as they are probably wrong.

  • When I said “probably” in the previous bullet point, I was being nice. And you have two huge blogs that prove they are wrong and show how to make everything right.

  • Anyone who says the console can’t do Unicode in either native or managed code isn’t as smart as they think they are.

I know the last point because I used to say that, when I was not as smart as I am now.
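The BOM note in the list above (never needed without redirection, needed only once when there is) could be handled with something like the following. This is only a minimal sketch, not part of the sample: the RedirectedUtf16Writer class and its members are hypothetical names, and it writes to a managed Stream rather than calling WriteFile, on the assumption that the caller has already decided the output is redirected.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: emits the UTF-16LE BOM once, before the first write only.
public sealed class RedirectedUtf16Writer
{
    private readonly Stream _target;
    private bool _bomWritten;

    public RedirectedUtf16Writer(Stream target)
    {
        if (target == null) throw new ArgumentNullException("target");
        _target = target;
    }

    public void Write(string text)
    {
        if (!_bomWritten)
        {
            // Encoding.Unicode is UTF-16 little-endian; its preamble is the bytes FF FE
            byte[] bom = Encoding.Unicode.GetPreamble();
            _target.Write(bom, 0, bom.Length);
            _bomWritten = true;
        }

        // Two bytes per UTF-16 code unit, the same count the sample passes to WriteFile
        byte[] payload = Encoding.Unicode.GetBytes(text);
        _target.Write(payload, 0, payload.Length);
    }
}
```

With a writer like this wrapped around the redirected stream (for example the one returned by Console.OpenStandardOutput()), the first call produces the BOM followed by the text, and every later call produces the text alone.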


In fact, as I said way back in the beginning, the odds are in favor of the fact that you yourself were not nearly as smart before you read this blog as you are now that you have read it! :-)


And now if you will excuse me, I have to start conversations with the gazillion console applications in Windows that routinely punt bugs in console apps talking about their lack of Unicode support….

15 thoughts on “Anyone who says the console can’t do Unicode isn’t as smart as they think they are”

  1. Is the ‘W’ at the end of ReadFileW and WriteFileW a typo?  I can’t find anything about it online or in the SDK headers.

    In any case, thanks for the post.  I feel smarter already!

  2. Those are the Unicode versions of the functions I link to — but you want to call the Unicode ones, whether by compiling with UNICODE or by calling the "W" versions explicitly (the sample does the latter).

  3. If you were to select a font which has the Arabic or CJK characters in it, will it appear correctly? I already notice that the [double-width] CJK characters take up only a single column each. So much for no complex scripts or CJK in the console, indeed.

  4. Thank you so much for the console font trick! I now have a way to read Japanese console text on a Japanese machine with the system locale set to English (for compatibility reasons).  All my years of studying Japanese hadn’t increased my ability to read a series of ASCII question marks, so a true type font plus copy & paste is a very useful workaround to know.

  5. @Random832 – if such a font is available (generally they aren’t unless your system locale matches). But the redirect case works fine and the copy/paste works as well…..

    @Brendan Elliott: Great! Glad to assist. :-)

  6. Craig, Mike: there’s no ReadFileW or WriteFileW because the functions operate on binary data – therefore, not safe to convert anything. The documentation does not include the "Unicode and ANSI names" section for that reason. There is only ReadFile and WriteFile.

    Mostly functions that have A and W versions have string parameters, or structure parameters (or pointer-to-structure) where the structure contains one or more string parameters.

    ReadConsole and WriteConsole have A/W variants as they deal with string parameters even though the parameters are declared as VOID*. I’m not actually sure why this is, perhaps because the strings are not required to be null-terminated.

  7. This line

    if(! (filetype == FILE_TYPE_UNKNOWN) && (Marshal.GetLastWin32Error() != ERROR_SUCCESS)) {

    doesn’t look right. Perhaps the closing paren after UNKNOWN and an opening one before Marshal shouldn’t be there. Personally, I’d write

    if(filetype != FILE_TYPE_UNKNOWN ||

      Marshal.GetLastWin32Error() == ERROR_SUCCESS) {

  8. Actually, the check is kind of right, believe it or not — it is attempting to catch the case where it is unknown yet succeeded. Weird code behavior trying to key off weird function results….

  9. You (I mean MS) are still putting me down by not allowing complex scripts (opentype fonts) in console.

    -Pavanaja

  10. Am I missing something? In the line

    if(! (filetype == FILE_TYPE_UNKNOWN) && (Marshal.GetLastWin32Error() != ERROR_SUCCESS)) {

    say filetype is FILE_TYPE_CHAR, then "filetype == FILE_TYPE_UNKNOWN" evaluates to false, and so "! (filetype == FILE_TYPE_UNKNOWN)" evaluates to true. Since there wasn’t an error GetLastError() returns ERROR_SUCCESS and GetLastError() != ERROR_SUCCESS evaluates to false, and the whole expression evaluates to false and the function exits without writing anything?

    It seems like the condition you want to return on is if the type is unknown because there was an error. So if it’s unknown but there is no error then you still continue on with the write. I think that’s the same as just making sure there’s no error, so should that check be replaced with the following?

    if(GetLastError() != ERROR_SUCCESS) {

       return;

    }

    Also, why do we need to check that filetype is FILE_TYPE_CHAR? Isn’t it enough to just check that out is a console (are all consoles FILE_TYPE_CHAR?), and does the following do that?

    bool fConsole = (GetConsoleMode(out,&mode) || (GetLastError() != ERROR_INVALID_HANDLE));

    So could it be right to do the following?

    void WriteLineRight(std::string const &s) {

    //…

    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);

    if(out == INVALID_HANDLE_VALUE) {

       return;

    }

    // we don’t directly check the filetype of output handle, we only check if it’s a console

    DWORD mode;

    bool fConsole = (GetConsoleMode(out,&mode) || (GetLastError() != ERROR_INVALID_HANDLE));

    if(fConsole) {

       //… don’t care about non-true-type consoles

       //… convert to wchar here

       WriteConsoleW(…)

    } else {

       WriteFile(out,&s[0],s.size(),&written,NULL);

    }

    }

  11. In my opinion, you are missing something, yes. :-)

    If you look at the docs for GetFileType, it is clear that:

    You can distinguish between a “valid” return of FILE_TYPE_UNKNOWN and its return due to a calling error (for example, passing an invalid handle to GetFileType) by calling GetLastError.

    If the function worked properly and FILE_TYPE_UNKNOWN was returned, a call to GetLastError will return NO_ERROR.

    If the function returned FILE_TYPE_UNKNOWN due to an error in calling GetFileType, GetLastError will return the error code.

    This sample code is distinguishing the two cases.

  12. Okay, was something wrong with my analysis of the expression? When I ran the sample code it seemed to confirm my analysis by skipping over printing when my own version does do the printing.

    My understanding of the requirements is that there are three possible cases:

    1. file type is not unknown. therefore we know the call succeeded

    2. file type is unknown, but the call succeeded

    3. file type is unknown and the call failed

    In case one we want to continue on with printing. In case two we also want to continue on with printing. In case three there was an error, and we cannot continue with printing and must return. This reduces down to just checking for success of the call, and checking if the type is unknown or not is unneeded.

    However the sample code seems to only cause printing in a fourth, impossible case: when filetype is not unknown, but the call failed.

    I think the expression contains a typo. ! has higher precedence than && right? so ! applies only to the left side, not the whole expression. If the ! were instead applied to the entire expression then it looks like it would be correct to me.

  13. Also I’m still curious about checking for FILE_TYPE_CHAR specifically. Can consoles be anything else? Isn’t a successful call to GetConsoleMode enough to distinguish between when we need to do special things for console output vs. when we need to use WriteFile to write to the file that output’s being redirected to?
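The precedence question raised in the comments above can be checked mechanically. The sketch below is a hedged illustration only: FileTypeCheck and its three method names are hypothetical, the constants are copied from the sample, and the file type and last-error value are passed in as plain arguments so that nothing here calls Win32.

```csharp
using System;

// Hypothetical helper comparing the three forms of the GetFileType check.
public static class FileTypeCheck
{
    public const uint FILE_TYPE_UNKNOWN = 0x0000;
    public const uint FILE_TYPE_CHAR = 0x0002;
    public const int ERROR_SUCCESS = 0;
    public const int ERROR_INVALID_HANDLE = 6;

    // The form quoted in the comments: '!' binds only to the first comparison.
    public static bool QuotedForm(uint fileType, int lastError)
    {
        return !(fileType == FILE_TYPE_UNKNOWN) && (lastError != ERROR_SUCCESS);
    }

    // '!' applied to the whole conjunction: "not (unknown because of an error)".
    public static bool NegatedConjunction(uint fileType, int lastError)
    {
        return !((fileType == FILE_TYPE_UNKNOWN) && (lastError != ERROR_SUCCESS));
    }

    // The De Morgan rewrite suggested in comment 7.
    public static bool DeMorganForm(uint fileType, int lastError)
    {
        return fileType != FILE_TYPE_UNKNOWN || lastError == ERROR_SUCCESS;
    }
}
```

For a known type with no error, and for an unknown type with no error, NegatedConjunction and DeMorganForm both return true (keep going) while QuotedForm returns false. For an unknown type with an error, all three return false. QuotedForm returns true only for a known type combined with an error, which is the "impossible" fourth case described in comment 12.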
