Monday, 24 July 2017

file - Python 0xff byte



I got this error:




UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I found this solution:



>>> b"abcde".decode("utf-8")


from here:

Convert bytes to a Python string



But how do you use it if a) you don’t know where the 0xff is and/or b) you need to decode a file object? What is the correct syntax / format?



I am parsing through a directory, so I tried going through the files one at a time. (NOTE: This won't work when the project gets larger!!!)



>>> i = "b'0xff'"
>>> with open('firstfile') as f:
...     g = f.readlines()
...

>>> i in g
False
>>> 0xff in g
False
>>> '0xff' in g
False
>>> b'0xff' in g
False

>>> with open('secondfile') as f:



>>> with open('thirdfile') as f:
...     g = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte



So if this is the right file, and I can't open it with Python (I opened it in Sublime Text and found nothing), how do I decode, or encode, it?
Thanks.


Answer



You have a number of problems:




  • i = "b'0xff'" creates a str of 7 characters, not a single 0xFF byte. i = b'\xff' or i = bytes([0xff]) is the correct way to write it.


  • open defaults to text mode and decodes the file using the encoding returned by locale.getpreferredencoding(False). Open in binary mode to get raw, undecoded bytes: open('firstfile','rb').



  • g=f.readlines() returns a list of lines. i in g checks for an exact match of the content of i to the content of a whole line in the list, so it cannot find a byte that sits inside a line.


  • Use meaningful variable names!
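The first and third points can be checked without touching a file. This is only a minimal sketch: the two byte strings in lines are made-up stand-ins for what readlines() would return in binary mode, not the contents of the real files.

```python
# A str holding the text "b'0xff'" is seven characters, not one byte.
text_literal = "b'0xff'"
single_byte = b'\xff'

length_of_text = len(text_literal)
same_byte = single_byte == bytes([0xff])

# Hypothetical stand-in for the list returned by readlines() in binary mode.
lines = [b'abc\xffdef\n', b'ghi\n']

# On a list, `in` compares against whole elements (whole lines).
found_in_list = single_byte in lines

# On a bytes object, `in` searches for the byte anywhere in the data.
found_in_bytes = single_byte in b''.join(lines)
```

Here found_in_list is False even though the byte is present, which is why every check in the question returned False.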



Instead:



byte = b'\xff'
with open('firstfile','rb') as f:
    file_content = f.read()
if byte in file_content:
    ...
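Since the question mentions going through a directory without knowing where the 0xff is, the same binary-mode check could be applied to every file and extended to report the offset of the first match. The following is a sketch under assumptions: find_byte is a hypothetical helper name, and the demo files are created in a temporary directory with invented contents. It reads each file fully into memory, so very large files may need chunked reading instead.

```python
import tempfile
from pathlib import Path


def find_byte(directory, needle=b'\xff'):
    """Yield (path, offset) for each regular file in directory that contains needle."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # bytes.find returns -1 when the needle is absent.
        offset = path.read_bytes().find(needle)
        if offset != -1:
            yield path, offset


# Demo on a throwaway directory; the file contents are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'firstfile').write_bytes(b'plain ascii\n')
    Path(tmp, 'thirdfile').write_bytes(b'\xff\xfeh\x00i\x00')
    hits = [(path.name, offset) for path, offset in find_byte(tmp)]
```

Because the files are opened as raw bytes, nothing is decoded and no UnicodeDecodeError can be raised while searching.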


To decode a file, you need to know its correct encoding and provide it when you open the file:



with open('firstfile',encoding='utf8') as f:
    file_content = f.read()


If you don't know the encoding, the third-party chardet module can help you guess.
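Before reaching for a detector, it may be worth checking for a byte order mark with the standard library alone. A 0xff at position 0 is often, though not always, the first byte of a UTF-16 little-endian BOM (FF FE); that is an assumption about the file in the question, not something the traceback confirms. In this sketch, guess_encoding_from_bom is a hypothetical helper and the sample bytes are made up.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM starts with the UTF-16 one.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding_from_bom(raw, default='utf-8'):
    """Return a codec name based on a leading BOM, or default if there is none."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return default


# Made-up sample: the text "hi" encoded as UTF-16 with a little-endian BOM.
raw = b'\xff\xfeh\x00i\x00'
encoding = guess_encoding_from_bom(raw)
decoded = raw.decode(encoding)
```

The 'utf-16' and 'utf-32' codecs read the BOM to pick the byte order and drop it from the result, and 'utf-8-sig' does the same for a UTF-8 BOM. A file with no BOM falls through to the default, at which point a guesser such as chardet is still needed.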


