
I have some folders at the location log_files_path. All these folders contain CSVs with different names. My aim is to read all these CSVs from all the folders present at log_files_path and collate them into a single dataframe. I wrote the following code:

from os import listdir
from os.path import isfile, join
import pandas as pd

all_files = pd.DataFrame()
for region in listdir(log_files_path):
    region_log_filepath = join(log_files_path, region)
    # files stores file paths
    files = [join(region_log_filepath, file)
             for file in listdir(region_log_filepath)
             if isfile(join(region_log_filepath, file))]
    # appends data from all files to a single DF all_files
    for file in files:
        all_files = all_files.append(pd.read_csv(file, encoding='utf-8')).reset_index(drop=True)
return all_files

This gives me an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 61033: invalid start byte

On opening the CSVs, I found that some columns contain garbled values such as ƒÂ‚‚ÃÂÂÂ.

I want to ignore such characters altogether. How can I do it?

  • There is no such thing as a universal or catch-all encoding. You should try to guess the encoding, either by hand or with the chardet module. Only if you want to ignore any encoding problems can you go with the Latin-1 encoding, which will accept any possible input but will return garbage if the file actually uses a different encoding. Commented Feb 1, 2022 at 13:50
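For the Latin-1 catch-all the comment mentions, a minimal sketch (file stands in for one of the question's CSV paths):

import pandas as pd

# latin1 maps every possible byte to a character, so decoding never raises,
# but the result may be mojibake if the file's real encoding is different.
df = pd.read_csv(file, encoding="latin1")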

3 Answers


You can pass encoding_errors='ignore', but I would advise trying different encodings first.
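For example, a minimal sketch of how that slots into the question's loop (assuming pandas 1.3+, where encoding_errors was added; files is the list of paths built in the question):

import pandas as pd

# Read every CSV, silently dropping bytes that are not valid UTF-8, and
# stack the results; pd.concat also avoids the deprecated DataFrame.append.
all_files = pd.concat(
    (pd.read_csv(file, encoding='utf-8', encoding_errors='ignore') for file in files),
    ignore_index=True,
)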




tl;dr

For your situation, you probably need one of these:

pd.read_csv(file, encoding='utf-8', encoding_errors='replace')
# or
pd.read_csv(file, encoding='utf-8', encoding_errors='ignore')

Longer answer

As of pandas 1.3, pandas.read_csv() accepts an encoding_errors argument.

Possible values:

  • strict: Raise UnicodeError (or a subclass); this is the default.
  • ignore: Ignore the malformed data and continue without further notice.
  • replace: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding.
  • xmlcharrefreplace: Replace with the appropriate XML character reference (only for encoding).
  • backslashreplace: Replace with backslashed escape sequences.
  • namereplace: Replace with \N{...} escape sequences (only for encoding).
  • surrogateescape: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.
  • surrogatepass: Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
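Since these values are the standard codecs error handlers, you can watch the decoding-side ones at work directly on a raw byte string; a small illustration, where 0xa0 mirrors the byte from the question's traceback:

raw = b"log entry caf\xa0"  # 0xa0 is not valid UTF-8 at this position

print(raw.decode("utf-8", errors="ignore"))            # log entry caf
print(raw.decode("utf-8", errors="replace"))           # log entry caf�
print(raw.decode("utf-8", errors="backslashreplace"))  # log entry caf\xa0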



Referenced from @mikey's answer to UnicodeDecodeError when reading CSV file in Pandas: you can use chardet to detect the encoding of the CSV file before pandas reads it.

from pathlib import Path
import chardet
import pandas as pd

filename = "file_name.csv"
detected = chardet.detect(Path(filename).read_bytes())
# detected is something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
encoding = detected.get("encoding")
assert encoding, "Unable to detect encoding, is it a binary file?"
df = pd.read_csv(filename, encoding=encoding)
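A hedged adaptation to the question's folder layout, detecting each file's encoding separately since the CSVs may not all share one (log_files_path is the question's root directory; everything else is as above):

from os import listdir
from os.path import isfile, join
from pathlib import Path

import chardet
import pandas as pd

frames = []
for region in listdir(log_files_path):
    region_log_filepath = join(log_files_path, region)
    for name in listdir(region_log_filepath):
        path = join(region_log_filepath, name)
        if not isfile(path):
            continue
        # Detect this file's encoding; fall back to latin1, which accepts any byte
        detected = chardet.detect(Path(path).read_bytes())
        frames.append(pd.read_csv(path, encoding=detected.get("encoding") or "latin1"))

all_files = pd.concat(frames, ignore_index=True)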

