msg19977 - (view) |
Author: Ronald L. Rivest (ronrivest) |
Date: 2004-02-13 02:49 |
I am running on Windows XP 5.1 using Python version 2.3. The following simple code fails on my system.

    import os

    for dirpath, dirnames, filenames in os.walk("C:/"):
        for name in filenames:
            pathname = os.path.join(dirpath, name)
            size = os.path.getsize(pathname)
            print size, pathname

I get an error from getsize that the file given by pathname does not exist. When it breaks, the variable "name" contains two question marks, which makes me think that this is a Unicode problem. In any case, shouldn't names returned by walk be acceptable in all cases to getsize? |
|
|
msg19978 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2004-02-14 00:47 |
Though it might be, I suspect that this is not a Python bug. Whether it is a Windows design or coding bug is another matter.

> variable "name" contains two question marks, which
> makes me think that this is a Unicode problem.

Since '?' is not legal in filenames, as you seem to know, I rather believe this is the Windows substitute, in the Win function called by os.listdir and os.walk, for illegal characters in the filename. So of course getsize, which wraps os.stat(), which calls a system function, chokes on it. It could be a disk bit glitch, or a bad program writing directly to the directory block. That happened to me once, and it was difficult to get rid of.

What does Windows Explorer show when you visit that directory? Ditto for 'dir' in a Command Prompt window (Start/Accessories)? |
|
|
msg19979 - (view) |
Author: Ronald L. Rivest (ronrivest) |
Date: 2004-02-14 01:46 |
TJREEDY -- Thanks for the reply... To answer your questions:

(1) What does Windows show when I visit the directory? -- I have several files in this directory that have the same problem. It is a hard, reproducible problem, not a transient glitch. The files are mp3 files that have the name "prelude.mp3", except that the first "e" is replaced by two question marks (for Python) or by two "boxes" in Windows Explorer. I would guess that this is some funky representation of the French "e" with an "accent aigu".

(2) What does "dir" do in a Command Prompt? -- From a command prompt, I see two question marks at the problematic position.

Does Windows allow one to create filenames with characters in the filename that are illegal for Windows?

As I said in the original post, I find it very disturbing that os.walk should return a filename that os.path.exists says doesn't exist! If you can walk the directory and find the file, then os.path.exists (or, equivalently, os.path.getsize) should find it! This looks like a Python bug to me... no?

Cheers, Ron Rivest |
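The mismatch described above (a name that os.walk returns but that the os.path functions cannot find) can be checked for directly. The following is only a minimal sketch: the helper name find_unreachable is hypothetical and not part of the original report, and it uses os.path.lexists rather than os.path.exists so that broken symbolic links are not counted as missing files.

```python
import os


def find_unreachable(root):
    """Return paths that os.walk yields but os.path.lexists cannot find."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            pathname = os.path.join(dirpath, name)
            # lexists (rather than exists) so that broken symbolic links
            # are not reported as missing files.
            if not os.path.lexists(pathname):
                missing.append(pathname)
    return missing
```

On a tree where every name survives the round trip, the returned list is empty; any entries it does contain are the names worth inspecting.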
|
|
msg19980 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2004-02-16 14:55 |
This behaviour is standard behaviour of Win32, and, disturbing as it may sound, is somewhat outside Python's control. When a file is found whose name cannot be represented in the system code page (CP_ACP, the "ANSI" code page), the non-representable characters are converted to question marks. What's worse: "roughly-representable" characters are sometimes converted to look-alike characters. When such a file name is passed back to the Win32 API, it will not find the file, because the name now contains question marks.

With the "ANSI" API, there is really no solution. Instead, you should use Unicode file names, i.e. write

    for dirpath, dirnames, filenames in os.walk(u"C:/"):

Closing as "won't fix". |
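To illustrate why the converted name no longer matches the file, here is a rough sketch of the lossy conversion. Everything in it is an assumption made for illustration: the helper to_ansi_name is hypothetical, cp1252 stands in for the system "ANSI" code page, and the file name with two Cyrillic letters is invented. Python's "replace" error handler only mimics the question-mark substitution; it does not reproduce the look-alike mapping that Win32 may apply.

```python
def to_ansi_name(name, codepage="cp1252"):
    """Approximate the lossy conversion: unencodable characters become '?'."""
    return name.encode(codepage, "replace").decode(codepage)


# Hypothetical file name containing two characters outside cp1252.
original = u"pr\u0439\u0439lude.mp3"
converted = to_ansi_name(original)

print(converted)              # prints pr??lude.mp3
print(converted == original)  # prints False
```

Because the converted name differs from the name stored on disk, a later lookup by that name cannot succeed. A name whose characters all exist in the code page passes through unchanged.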
|
|
msg19981 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2004-02-16 16:42 |
Final comment: dir and explorer can display stats of files with bad names because they get both simultaneously without trying to use the bad names. The Command Prompt equivalent of listdir (or walk) followed by getsize (or stat) is 'dir /w' followed by 'dir badname', which should also give a "File not found" error message.

I believe this 'disturbing' behavior results from having filename rules that are not enforced by restricting directory disk block writes to OS functions that respect the rules.

A roundabout fix: replace 'size = ...' with something like

    try:
        size = ...
    except WhateverErrorYouGot:
        file = os.popenx('dir %s' % dirpath).read()
        # x = whichever of 1,2,3,4 works

But prefixing 'u' to the root dir looks a lot easier if it gets you what you need. |
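Both suggestions (a Unicode root directory, and catching the error instead of letting it end the walk) can be combined. This is a sketch under stated assumptions rather than a tested fix for the reported system: the generator name file_sizes is hypothetical, and OSError is assumed to be the exception that getsize raises for a name it cannot find.

```python
import os


def file_sizes(root):
    """Yield (size, pathname) pairs; size is None if the file cannot be stat'ed."""
    # A text (Unicode) root makes os.walk return text names. In Python 2 that
    # means passing u"C:/"; in Python 3 every str is already Unicode.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            pathname = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(pathname)
            except OSError:
                # Keep walking instead of aborting on the first bad name.
                size = None
            yield size, pathname
```

A caller would iterate over file_sizes(u"C:/") and treat a size of None as "listed, but not reachable by name".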
|
|
msg19982 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2004-02-16 17:57 |
This is not true: both dir and explorer use the Unicode ("wide") API (FindFirstFileW). Explorer then tries to render the file name correctly even if it is outside the code page. If there is no glyph in the font, a square box is displayed. dir.exe tries to convert the file name into the encoding of the terminal (typically CP_OEMCP), and replaces the characters it cannot convert with question marks on display.

Also, this behaviour is not caused by applications performing direct IO to the directory disk block. First, XP does not allow such IO, and second, very few applications would know how to write NTFS correctly. Instead, the problem is caused by applications which use the "wide" API to create files, which is a problem for applications that use the "narrow" API.

If Ron sees two square boxes where a single accented e should be, the application creating the file has most likely messed up the file name: Windows should be capable of representing this letter with a single character, and explorer should be capable of displaying it properly. |
|
|