Handling a UnicodeDecodeError raised while running os.walk

Problem description

I am getting the error:

'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-utf8) character in them. The files come from a Windows system (hence the utf-16 filenames), but I have copied the files over to a Linux system and am using python 2.7 (running in Linux) to traverse the directories.

I have tried passing a unicode start path to os.walk, and all the files and dirs it generates are unicode names until it comes to a non-utf8 name. Then, for some reason, it doesn't convert those names to unicode, and the code chokes on the utf-16 names. Is there any way to solve the problem short of manually finding and changing all the offending names?

If there is not a solution in python2.7, can a script be written in python3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.

UPDATE: The fact that 0x8b is still only a single-byte char (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:

>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
>>>
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'

Here's a traceback of the error I am getting:

80.         for root, dirs, files in os.walk(unicode(startpath)):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
284.         if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py" in join
71.             path += '/' + b

Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):

names = listdir(top)

The names with chars > 128 are returned as non-unicode strings.

Solution

This problem stems from two fundamental issues. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux filesystem encoding is 'utf8'. You can verify these encodings via:

import sys
sys.getdefaultencoding()     # Python's default string encoding
sys.getfilesystemencoding()  # encoding used for filenames on this OS

When the os functions that return directory contents, namely os.walk and os.listdir, are called with a unicode path on a directory holding both ascii-only and non-ascii filenames, the ascii filenames are converted to unicode automatically. The others are not. The result is therefore a list containing a mix of unicode and str objects, and it is the str objects that cause problems down the line. Since they are not ascii, python has no way of knowing what encoding to use, so they can't be decoded automatically into unicode.

Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, the call will fail if file is not ascii-encoded (the default), because python tries to decode it with the 'ascii' codec. The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
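To make the failure concrete, here is a minimal sketch that reproduces the implicit decode by hand and then shows the explicit fix. The directory path and filename are made up for illustration, and windows-1252 is only assumed as the "appropriate encoding". The snippet uses bytes literals so it behaves the same on Python 2.7 and Python 3 (where bytes plays the role of Python 2's str).

```python
import posixpath

top = u'/data/company'               # hypothetical unicode directory path
raw_name = b'report \x8b final.txt'  # hypothetical filename bytes from listdir

# Python 2 decodes raw_name with the 'ascii' codec when it is joined to a
# unicode path; performing the same decode explicitly reproduces the error.
try:
    raw_name.decode('ascii')
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the first undecodable byte

# Decoding with an explicit encoding first makes the join safe.
# windows-1252 is an assumption here, not something the bytes can tell us.
joined = posixpath.join(top, raw_name.decode('windows-1252'))
```

The point of the sketch is only that the decode must happen explicitly, with an encoding you choose, before the name meets any unicode path.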

That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you probably have the correct encoding. You can verify further by calling print on the unicode version and checking that the filename is rendered correctly.
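As a rough illustration of that check, the sketch below tries two candidate encodings on the byte string from the question. The helper name try_decode is hypothetical, not part of any library.

```python
raw_name = b'a string \x8b with non-ascii'  # the byte string from the question

def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = try_decode(raw_name, 'utf8')            # None: 0x8b is invalid UTF-8 here
as_cp1252 = try_decode(raw_name, 'windows-1252')  # 0x8b becomes U+2039
```

Keep in mind that nearly every byte sequence decodes without error as windows-1252, so a successful decode is only a hint. Printing the result and reading the filename, as described above, is the real check.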

One last wrinkle. In a Linux system with files of Windows origin, it is possible, or even probable, to have a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable one is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing. Without --notest, convmv only prints what it would rename, so you can do a dry run first.

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

def decodeName(name):
    if isinstance(name, str):  # leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            # not valid utf8, so assume windows-1252
            name = name.decode('windows-1252')
    return name

The function tries a utf8 decoding first. If that fails, it falls back to windows-1252. Use this function after an os call that returns a list of files:

for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
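For readers who want a self-contained version, the following is a sketch of the same idea. It is not the answer's exact code: decode_name and walk_decoded are hypothetical names, the code is written against bytes so it also runs on Python 3, and it assumes the start path is passed as bytes so that os.walk returns raw, undecoded names.

```python
import os

def decode_name(name):
    """Return a text filename, trying UTF-8 first and windows-1252 second."""
    if not isinstance(name, bytes):
        return name  # already text, leave it alone
    try:
        return name.decode('utf8')
    except UnicodeDecodeError:
        # 'replace' covers the few bytes (such as 0x81) that Python's
        # windows-1252 codec leaves undefined, so this branch cannot raise.
        return name.decode('windows-1252', 'replace')

def walk_decoded(top):
    """Like os.walk, but yield decoded names. `top` must be a bytes path."""
    for root, dirs, files in os.walk(top):
        # root stays as raw bytes so it can still be used to reach the files.
        text_dirs = [decode_name(d) for d in dirs]
        text_files = [decode_name(f) for f in files]
        yield root, text_dirs, text_files
```

The decoded names are best suited to display, logging, or comparison. To open a file whose on-disk name is not utf8, you still need the original bytes, which is why this sketch leaves root untouched.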

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

http://farmdev.com/talks/unicode/

I highly recommend it for anyone struggling with unicode issues.

09-07 17:35