Opening a text file and receiving an encoding error; tried multiple methods, no hope

Question

I'm trying to open up a password database file (consists of a bunch of common passwords) and I'm getting the following error:

Attempts so far.. Code:

f = open("crackstation-human-only.txt", 'r')
for i in f:
    print(i)

Error Code:

Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
    for i in f:
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>

After doing some research, I was told to try encoding = 'utf-8', which I later discovered was basically a guess, made in the hope that the whole file would decode.

Code:

f = open("crackstation-human-only.txt", 'r', encoding = 'utf-8')
for i in f:
    print(i)

Error:

Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
    for i in f:
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte

After receiving this error message, I was advised to download a text editor such as 'Sublime Text 3', open its console, and enter the command 'Encoding()', but unfortunately it wasn't able to detect the encoding.

My professor was able to use bash to 'grep cat' the lines in the file (I honestly know very little about bash, so I'm not sure whether that detail will help anyone who knows those terms).

If anyone has any suggestions on what I can do in order to get this to work out I would greatly appreciate it.

I will post the link to the text document if anyone is interested in seeing what types of characters are within the file.

Link to the file; it's a .txt from my school's/professor's domain

UPDATE:

A classmate of mine is running elementary OS and wrote his Python program in the terminal to iterate through the file. Using the encoding 'latin-1', he was able to output more characters than I could. I'm on Windows 10, using Eclipse-atom for all my scripts.

So something in my setup may be preventing me from getting the correct output. That is only a guess based on these results.

I will be installing elementary-os and attempting all the solutions there, to see if I can get this file to work out. I'll add another update soon!
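A note on why the classmate's 'latin-1' attempt probably got further: latin-1 assigns a character to every one of the 256 byte values, so decoding with it never raises, while cp1252 (the Windows default that shows up in the first traceback) leaves a few bytes, including 0x81, undefined. The sketch below only illustrates that difference; it does not touch the real file, and the variable names are made up for the example.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding can never fail.
text = all_bytes.decode("latin-1")

# cp1252 has no character for a handful of byte values.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(hex(value))

print(len(text))
print(undefined_in_cp1252)
```

Decoding without an error is not proof that latin-1 is the file's real encoding; it only means no byte is rejected, so some characters may still come out wrong.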

You should read the file after you open it, like f.read(), otherwise you're attempting to read a file instance.
– toti08, Commented Sep 13, 2018 at 6:28
There is an encoding autodetect tool. Give it a try: pypi.org/project/chardet
– VPfB, Commented Sep 13, 2018 at 6:32
Try reading with f.read() or f.readlines(). You are actually trying to get elements from the file instance, not from the data inside the instance.
– Muhammad Haseeb, Commented Sep 13, 2018 at 7:02
Did you try a different encoding like 'latin-1'? If both charmap and utf-8 failed then you should try a different one...
– toti08, Commented Sep 13, 2018 at 7:32

gavin · Accepted Answer · 2018-09-15 06:20:03Z

I faced a similar problem a while ago, and more often than not I've found that setting

encoding = 'raw_unicode_escape'

has worked for me

For your particular case, I tried all the encodings Python 3 supports and found that these three read the file without raising an error:

raw_unicode_escape
mbcs
palmos

Try any of the above to read your file:

f = open("crackstation-human-only.txt", 'r', encoding = 'mbcs')

For more information on encodings, refer to https://docs.python.org/2.4/lib/standard-encodings.html

Hope this helps.

re: Using the link above, I made a list of encoding formats to try on your file. I hadn't saved my previous work, which was more detailed, but this code should do the same. I re-ran it just now as follows:

enc_list = ['big5big5-tw,', 'cp037IBM037,', 'cp437437,', 'cp737Greek', 'cp850850,',
            'cp855855,', 'cp857857,', 'cp861861,', 'cp863863,', 'cp865865,',
            'cp869869,', 'cp875Greek', 'cp949949,', 'cp1006Urdu', 'cp1140ibm1140Western',
            'cp1251windows-1251Bulgarian,', 'cp1253windows-1253Greek',
            'cp1255windows-1255Hebrew', 'cp1257windows-1257Baltic', 'euc_jpeucjp,',
            'euc_jisx0213eucjisx0213Japanese', 'gb2312chinese,', 'gb18030gb18030-2000Unified',
            'iso2022_jpcsiso2022jp,', 'iso2022_jp_2iso2022jp-2,', 'iso2022_jp_3iso2022jp-3,',
            'iso2022_krcsiso2022kr,', 'iso8859_2iso-8859-2,', 'iso8859_4iso-8859-4,',
            'iso8859_6iso-8859-6,', 'iso8859_8iso-8859-8,', 'iso8859_10iso-8859-10',
            'iso8859_14iso-8859-14,', 'johabcp1361,', 'koi8_uUkrainian',
            'mac_greekmacgreekGreek', 'mac_latin2maclatin2,', 'mac_turkishmacturkishTurkish',
            'shift_jiscsshiftjis,', 'shift_jisx0213shiftjisx0213,', 'utf_16_beUTF-16BEall',
            'utf_16_le', 'utf_7', 'utf_8', 'base64_codec', 'bz2_codec', 'hex_codec', 'idna',
            'mbcs', 'palmos', 'punycode', 'quopri_codec', 'raw_unicode_escape', 'rot_13',
            'string_escape', 'undefined', 'unicode_escape', 'unicode_internal', 'uu_codec',
            'zlib_codec']

for encode in enc_list:
    try:
        with open(r"crackstation-human-only.txt", encoding=encode) as f:
            temp = len(f.read())
    except:
        enc_list.remove(encode)

print(enc_list)

Run this code on your machine and you'll get a list of encodings you can try on your file. The output I received was

['cp037IBM037,', 'cp737Greek', 'cp855855,', 'cp861861,', 'cp865865,', 'cp875Greek', 'cp1006Urdu', 'cp1251windows-1251Bulgarian,', 'cp1255windows-1255Hebrew', 'euc_jpeucjp,', 'gb2312chinese,', 'iso2022_jpcsiso2022jp,', 'iso2022_jp_3iso2022jp-3,', 'iso8859_2iso-8859-2,', 'iso8859_6iso-8859-6,', 'iso8859_10iso-8859-10', 'johabcp1361,', 'mac_greekmacgreekGreek', 'mac_turkishmacturkishTurkish', 'shift_jisx0213shiftjisx0213,', 'utf_16_le', 'utf_8', 'bz2_codec', 'idna', 'mbcs', 'palmos', 'quopri_codec', 'raw_unicode_escape', 'string_escape', 'unicode_escape', 'uu_codec']
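One caveat about the loop above: it removes items from enc_list while iterating over it, so each removal makes the loop skip the next item. That is probably why the printed list still contains names such as 'cp037IBM037,' that are not valid codec names; they were never tested. A minimal sketch that avoids this by collecting the successes in a new list is given below. The function name encodings_that_decode, the sample bytes, and the short candidate list are all made up for illustration.

```python
import os
import tempfile


def encodings_that_decode(path, candidates):
    """Return the candidate encodings that decode the whole file without error."""
    working = []
    for name in candidates:
        try:
            with open(path, encoding=name) as f:
                # Read in chunks so a large file is not held in memory at once.
                while f.read(1 << 20):
                    pass
        except (LookupError, UnicodeError):
            # LookupError: unknown or non-text codec name.
            # UnicodeError: the bytes do not fit this encoding.
            continue
        working.append(name)
    return working


# Made-up sample standing in for the real file: it contains 0xE4 and 0x81,
# the two bytes named in the tracebacks from the question.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(b"caf\xe4\n\x81abc\n")
    result = encodings_that_decode(
        sample_path, ["utf_8", "cp1252", "latin_1", "not_a_codec"]
    )

print(result)
```

For the real file, the call would be encodings_that_decode("crackstation-human-only.txt", names), with names being a cleaned-up list of codec names. An encoding that passes this check only decodes without raising; it is not necessarily the encoding the file was written in.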

I tried all 3; unfortunately I'm getting the following error now:
Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
    print(f)
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 135620: character maps to <undefined>
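Note that this traceback is a UnicodeEncodeError raised inside print, not a decode error: the file was read, but the console's cp1252 encoding cannot represent the character '\x81'. A rough sketch of two ways around that follows, assuming the aim is only to get the text displayed; the sample string is made up, and sys.stdout.reconfigure needs Python 3.7 or later.

```python
import sys

# A decoded line containing U+0081, the character named in the traceback.
line = "pass\x81word"

# Simulate a cp1252 console: strict encoding fails the same way print() did.
try:
    line.encode("cp1252")
    console_can_show_it = True
except UnicodeEncodeError:
    console_can_show_it = False

# Option 1: escape whatever the console cannot represent instead of raising.
safe_for_console = line.encode("cp1252", errors="backslashreplace").decode("cp1252")
print(safe_for_console)

# Option 2 (Python 3.7+): apply the same policy to every later print() call.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(errors="backslashreplace")
print(line)
```

Writing the lines to a file opened with encoding='utf-8' instead of printing them would sidestep the console entirely.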
Would it be possible to set an error handler that tries a different encoding when the first one fails? For instance, when I get the undefined character map error from encoding = 'utf-8', try encoding = 'latin-1' or something like that within a for-loop, or is that impossible?
I used exactly the same method to find these three encodings, and I ran the script again, printing each line of the file. I've not encountered any errors, though. However, you can just catch the error and store the unreadable lines in another file, then try other encoding types on them after your script goes through the entire 683 MB file.
Can you update your code to show me exactly what you did? I'm confused about how you did this. Did you simply run the code three times, once per encoding, or were you able to have them all run within one iteration of the file? Please clarify (again, I'm extremely new to Python/programming, so if you can explain what you did I would really appreciate it).
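The idea from the comments (try one encoding, fall back to another, and set aside the lines nothing can read) could look roughly like the sketch below. It is not the code gavin actually ran; the function names, the order of encodings, and the sample bytes are assumptions made for illustration.

```python
def decode_with_fallback(raw_line, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding) or (None, None)."""
    for name in encodings:
        try:
            return raw_line.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


def split_lines(raw_lines, encodings=("utf-8", "cp1252")):
    """Separate lines that decode from lines that no listed encoding can read."""
    decoded, unreadable = [], []
    for raw in raw_lines:
        text, _ = decode_with_fallback(raw.rstrip(b"\r\n"), encodings)
        if text is None:
            unreadable.append(raw)  # keep the raw bytes for a later attempt
        else:
            decoded.append(text)
    return decoded, unreadable


# Made-up sample bytes standing in for lines of the real file.
sample = [b"password\n", b"caf\xc3\xa9\n", b"caf\xe4\n", b"\x81abc\n"]
decoded, unreadable = split_lines(sample)
print(decoded)
print(unreadable)
```

For the real file, open it with open("crackstation-human-only.txt", "rb") and pass the file object to split_lines, since iterating a binary file yields one bytes line at a time. Putting 'latin-1' in the list makes every line decode, because latin-1 never rejects a byte, so it only makes sense as the last entry.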

dummman · Accepted Answer · 2018-09-13 07:14:31Z


You do have some interesting characters in there. Even though your code does work for me, I'd suggest using a try/except block to catch the lines your system can't handle and skip them:

with open("crackstation-human-only.txt", 'r') as f:
    for i in f:
        try:
            print(i)
        except UnicodeDecodeError:
            continue

Alternatively, try using open with

the binary read mode 'rb' instead of 'r'
the errors='replace' argument, but that will not do what you want.

see the open documentation
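To make the two alternatives concrete, here is a small sketch using made-up bytes in a temporary file rather than the real password list. One detail worth knowing: in the try/except version above, the decoding happens in the for line, outside the try block, which matches the "for i in f" frame in the question's traceback, so the except clause is never reached for this error.

```python
import os
import tempfile

# Made-up bytes standing in for the password file; the second line holds 0x81.
sample = b"hunter2\n\x81secret\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "wb") as f:
        f.write(sample)

    # Binary mode: nothing is decoded, so no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        raw_lines = [line.rstrip(b"\r\n") for line in f]

    # Text mode with errors="replace": undecodable bytes become U+FFFD.
    with open(path, "r", encoding="cp1252", errors="replace") as f:
        replaced_lines = [line.rstrip("\n") for line in f]

print(raw_lines)
print(replaced_lines)
```

Binary mode keeps every byte exactly as stored, while errors='replace' silently changes the data, which is why it "will not do what you want" if the passwords have to be compared or hashed afterwards.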

edited Sep 13, 2018 at 7:14

answered Sep 13, 2018 at 6:43


5 Comments

david yeritsyan Over a year ago

The only thing is that I need to run each and every password through a hash function and compare it to a password file my professor has provided, so I can't skip any password. I'm really curious how the code was able to run fully on your machine. Some people even suggested I add (encoding = 'ignore') after 'r'. Worst case, I'll attack my professor :)

david yeritsyan Over a year ago

Just tried your code, and I'm still getting the same error I had before trying encoding = 'utf-8'.

david yeritsyan Over a year ago

Traceback (most recent call last):
  File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 2, in <module>
    for i in f:
  File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>

david yeritsyan Over a year ago

I'll be honest, I've been playing with Python for about 2 months or so and I'm still in the process of learning (please bear with me :D). My question: is it possible that I don't have a character package installed on my computer, which would be causing this, or does that have nothing to do with Python? lol

dummman Over a year ago

That is weird, since I am also using UTF-8. I have updated my answer with some other suggestions.
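On the requirement from the first comment, that every password has to go through a hash function: hash functions operate on bytes, so one option is to skip decoding altogether and hash the raw bytes of each line. The sketch below assumes SHA-256 purely as a placeholder, since the thread does not say which hash the professor's file uses; hash_candidates and the sample lines are made up as well.

```python
import hashlib


def hash_candidates(raw_lines, algorithm="sha256"):
    """Yield (raw_password, hex_digest) for every line, with no decoding step."""
    for raw in raw_lines:
        password = raw.rstrip(b"\r\n")
        yield password, hashlib.new(algorithm, password).hexdigest()


# Made-up sample bytes; the second entry contains the troublesome 0x81 byte.
sample = [b"hunter2\n", b"\x81secret\n"]
results = list(hash_candidates(sample))
for password, digest in results:
    print(password, digest)
```

With the real file, the lines would come from open("crackstation-human-only.txt", "rb"). This only gives matching digests if the professor's hashes were computed over the same bytes; if they were computed from text encoded some other way, the passwords would need to be decoded and re-encoded to match.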
