3

Possible Duplicate:
Python, Unicode, and the Windows console

I have a folder containing a file named "01 - ナナナン塊.txt"

I open Python at the interactive prompt in the same folder as the file and attempt to walk the folder hierarchy:

    Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> for x in os.walk('.'):
    ...     print(x)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\dev\Python31\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 17-21: character maps to <undefined>

Clearly the encoding I'm using isn't able to deal with Japanese characters. Fine. But Python 3.1 is meant to be unicode all the way down, as I understand it, so I'm at a loss as to what I'm meant to do with this. Anyone have any ideas?

  • See stackoverflow.com/questions/5419/… - and ultimately, see: wiki.python.org/moin/PrintFails - I think that's what you're looking for. Commented Sep 24, 2010 at 18:44
  • Thanatos is correct - it's the print that's failing. I'm sad. I thought Python was easy to use :( Commented Sep 24, 2010 at 19:01
  • It turns out the problem has nothing to do with files. Unicode support in Python 3 on Windows is a bit patchy: print doesn't work in the console, and files are opened in non-UTF mode (this was the other method I tried before posting here), so I was seemingly without options to dump out what I was walking over. In addition to the accepted answer, I could also have jumped through the codecs.open hoop to create a file which represents the default text type in Python and looked at that. How unpythonic. Commented Sep 24, 2010 at 22:30

2 Answers

7

It seems like all answers so far are from Unix people who assume the Windows console is like a Unix terminal, which it is not.

The problem is that you can't write Unicode output to the Windows console using the normal underlying file I/O functions. The Windows API WriteConsole needs to be used. Python should probably be doing this transparently, but it isn't.
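The failure in the question's traceback can be reproduced without a console at all. This is a minimal sketch, assuming the console codepage is cp850 as the traceback shows; `encode_for_console` is a hypothetical helper name, not something Python provides. It only demonstrates a lossy fallback that avoids the exception, not real Unicode output:

```python
name = "01 - ナナナン塊.txt"


def encode_for_console(text, codepage="cp850"):
    """Encode text for a legacy codepage, escaping what it cannot represent.

    Hypothetical helper: the codepage default is an assumption taken from
    the traceback in the question.
    """
    return text.encode(codepage, errors="backslashreplace")


# Strict encoding is what print() effectively attempts, and it fails.
try:
    name.encode("cp850")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The fallback never raises, but the Japanese characters become \uXXXX escapes.
escaped = encode_for_console(name)
```

Printing `ascii(name)` gives a similar escaped form, which is enough for debugging but is not a substitute for writing the real characters.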

There's a different problem if you redirect the output to a file: Windows text files are historically in the ANSI codepage, not Unicode. You can fairly safely write UTF-8 to text files in Windows these days, but Python doesn't do that by default.
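For the file case, one way to sidestep the default codepage is to pass the encoding explicitly when opening the file. The following is a rough sketch, not part of the wrapper code in this answer; `dump_walk` and its parameters are names made up for illustration:

```python
import os


def dump_walk(root, out_path):
    """Write every file path found under root to out_path, encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    # Naming the encoding overrides the platform default (the ANSI codepage
    # on Windows), so non-ASCII filenames can be written without errors.
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                out.write(os.path.join(dirpath, name) + "\n")
```

Something like `dump_walk(".", "listing.txt")` would then produce a file that a UTF-8 aware editor can display. If the output file lives inside the folder being walked, it may show up in its own listing.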

I think it should do these things, but here's some code to make it happen. You don't have to worry about the details if you don't want to; just call ConsoleFile.wrap_standard_handles(). You do need PyWin installed to get access to the necessary APIs.

    import os, sys, io, win32api, win32console, pywintypes

    def change_file_encoding(f, encoding):
        """
        TextIOWrapper is missing a way to change the file encoding, so we have to
        do it by creating a new one.
        """
        errors = f.errors
        line_buffering = f.line_buffering
        # f.newlines is not the same as the newline parameter to TextIOWrapper.
        # newlines = f.newlines

        buf = f.detach()

        # TextIOWrapper defaults newline to \r\n on Windows, even though the underlying
        # file object is already doing that for us.  We need to explicitly say "\n" to
        # make sure we don't output \r\r\n; this is the same as the internal function
        # create_stdio.
        return io.TextIOWrapper(buf, encoding, errors, "\n", line_buffering)

    class ConsoleFile:
        class FileNotConsole(Exception): pass

        def __init__(self, handle):
            handle = win32api.GetStdHandle(handle)
            self.screen = win32console.PyConsoleScreenBufferType(handle)
            try:
                self.screen.GetConsoleMode()
            except pywintypes.error as e:
                raise ConsoleFile.FileNotConsole

        def write(self, s):
            self.screen.WriteConsole(s)

        def close(self): pass
        def flush(self): pass
        def isatty(self): return True

        @staticmethod
        def wrap_standard_handles():
            sys.stdout.flush()
            try:
                # There seems to be no binding for _get_osfhandle.
                sys.stdout = ConsoleFile(win32api.STD_OUTPUT_HANDLE)
            except ConsoleFile.FileNotConsole:
                sys.stdout = change_file_encoding(sys.stdout, "utf-8")

            sys.stderr.flush()
            try:
                sys.stderr = ConsoleFile(win32api.STD_ERROR_HANDLE)
            except ConsoleFile.FileNotConsole:
                sys.stderr = change_file_encoding(sys.stderr, "utf-8")

    ConsoleFile.wrap_standard_handles()

    print("English 漢字 Кири́ллица")

This is a little tricky: if stdout or stderr is the console, we need to output with WriteConsole; but if it's not (e.g. foo.py > file), that's not going to work, and we need to change the file's encoding to UTF-8 instead.

The opposite in either case will not work. You can't output to a regular file with WriteConsole (it's not actually a byte API, but a UTF-16 one; PyWin hides this detail), and you can't write UTF-8 to a Windows console.
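To make the byte-versus-UTF-16 distinction concrete, here is a small sketch that counts what each kind of API would receive for the same string. `utf16_units` is a hypothetical helper written for this illustration; it does not call any Windows API:

```python
def utf16_units(text):
    """Return the number of UTF-16 code units needed to represent text.

    Hypothetical helper: a UTF-16 API such as WriteConsoleW counts its
    input in these units rather than in bytes.
    """
    # Each UTF-16 code unit is two bytes in the little-endian encoding.
    return len(text.encode("utf-16-le")) // 2


sample = "漢字"
utf8_bytes = len(sample.encode("utf-8"))  # what a byte-oriented file write sees
units = utf16_units(sample)               # what a UTF-16 console write sees
```

Characters outside the Basic Multilingual Plane take two code units (a surrogate pair), so the unit count is not always the same as the number of characters.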

Also, it really should be using _get_osfhandle to get the handle to stdout and stderr, rather than assuming they're assigned to the standard handles, but that API doesn't seem to have any PyWin binding.
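Outside PyWin, the standard library's Windows-only `msvcrt` module does expose `get_osfhandle`. The sketch below shows how the handle behind `sys.stdout` might be looked up with it; `stdout_os_handle` is a made-up name, and whether the returned integer can be fed directly to the PyWin console objects used above is an assumption that has not been checked here:

```python
import sys


def stdout_os_handle():
    """Return the OS-level handle behind sys.stdout, or None off Windows.

    Hypothetical helper for illustration only.
    """
    if sys.platform != "win32":
        return None
    import msvcrt  # Windows-only standard library module

    # fileno() is the C runtime file descriptor; get_osfhandle maps it to
    # the underlying Win32 handle value.
    return msvcrt.get_osfhandle(sys.stdout.fileno())
```

If `sys.stdout` has already been replaced by an object without a real file descriptor, `fileno()` can raise, so this would need to run before any wrapping.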

6 Comments

+1 – you seem to be the first to actually understand the problem. I think the problem with WriteConsoleW vs. WriteFile is known in the Python community, but actually implementing the distinction seems to be difficult or at least unpopular.
Python is developed largely by Unix people, and spending time on the odd details of other people's platforms is never appealing, but this really is important. Major parts of Python on Windows (e.g. print) should not be limited to '95-era (actually, these date back to DOS) ANSI codepages.
Wow. This is what I need to do to display a unicode string in the standard command window in Windows. If it wasn't so sad, it would be funny. Thank you very much for doing all that hard work of implementing the output streams properly.
Fortunately Python seems to be less Linux-centric than many other OSS projects: the developers are actively working towards better Windows support and accept that Windows is an important platform and not the devil himself. If somebody submitted a patch to switch console output to WriteConsoleW it would have a high chance of being integrated.
@Tom: consider yourself lucky that Python can even cope with Unicode filenames. Try this with something like PHP or Ruby and you wouldn't even be able to open the file. It's hugely unfortunate that the MS C runtime (on which Python and other languages are built) insists on using the system default codepage for stdio byte interfaces instead of UTF-8.
-2

For hard-coded strings, you'll need to specify the encoding at the top of source files. For byte strings input from some other source, such as os.walk, you need to specify the byte string's encoding (see unutbu's answer).

5 Comments

There are no byte strings in Windows, only UTF-16 strings.
@Philipp: All Windows NT-based kernels know only UTF-16 strings. You can still invoke the ANSI versions of the Win32 APIs, such as FindFirstFileA(), to get a folder listing containing what Python calls byte strings. I assume this is what Python does, because on my Windows machine os.walk() with Python 2.6.5 returns items of class str, which are byte strings.
I'm using Python 3, which is entirely UTF-8: python.org/dev/peps/pep-3120
Strings in Python 3 are either UTF-16 or UTF-32, but not UTF-8.
@Philipp: sorry, i was responding to the source file encoding thing, should have made that clearer
