
I have some folders at the location log_files_path. All these folders contain CSVs with different names. My aim is to read all these CSVs from all the folders present at log_files_path and collate them into a single dataframe. I wrote the following code:

from os import listdir
from os.path import isfile, join
import pandas as pd

all_files = pd.DataFrame()
for region in listdir(log_files_path):
    region_log_filepath = join(log_files_path, region)
    # files stores file paths
    files = [join(region_log_filepath, file)
             for file in listdir(region_log_filepath)
             if isfile(join(region_log_filepath, file))]
    # appends data from all files to a single DF all_files
    for file in files:
        all_files = all_files.append(pd.read_csv(file, encoding='utf-8')).reset_index(drop=True)
return all_files

This gives me an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 61033: invalid start byte

On opening the CSVs, I found that some columns contain garbled values such as ƒÂ‚‚ÃÂÂÂ.

I want to ignore such characters altogether. How can I do it?

  • There is no such thing as a universal or catch-all encoding. You should try to guess the encoding, either by hand or with the chardet module. Only if you want to ignore any encoding problems can you go with the Latin-1 encoding, which will accept any possible input but will return garbage if the file actually uses a different encoding. Commented Feb 1, 2022 at 13:50
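For the Latin-1 catch-all the comment mentions, a minimal sketch (file stands in for one of the question's CSV paths):

import pandas as pd

# latin1 maps every possible byte to a character, so decoding never raises,
# but the result may be mojibake if the file's real encoding is different.
df = pd.read_csv(file, encoding="latin1")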

3 Answers


You can pass encoding_errors='ignore', but I would advise trying different encodings first.
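For example, a minimal sketch of how that slots into the question's loop (assuming pandas 1.3+, where encoding_errors was added; files is the list of paths built in the question):

import pandas as pd

# Read every CSV, silently dropping bytes that are not valid UTF-8, and
# stack the results; pd.concat also avoids the deprecated DataFrame.append.
all_files = pd.concat(
    (pd.read_csv(file, encoding='utf-8', encoding_errors='ignore') for file in files),
    ignore_index=True,
)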




tl;dr

For your situation, you probably need one of these:

pd.read_csv(file, encoding='utf-8', encoding_errors='replace')
# or
pd.read_csv(file, encoding='utf-8', encoding_errors='ignore')

Longer answer

As of pandas 1.3, pandas.read_csv() accepts an encoding_errors argument.

Possible values:

  • strict: Raise UnicodeError (or a subclass); this is the default.
  • ignore: Ignore the malformed data and continue without further notice.
  • replace: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding.
  • xmlcharrefreplace: Replace with the appropriate XML character reference (only for encoding).
  • backslashreplace: Replace with backslashed escape sequences.
  • namereplace: Replace with \N{...} escape sequences (only for encoding).
  • surrogateescape: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.
  • surrogatepass: Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
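Since these values are the standard codecs error handlers, you can watch the decoding-side ones at work directly on a raw byte string; a small illustration, where 0xa0 mirrors the byte from the question's traceback:

raw = b"log entry caf\xa0"  # 0xa0 is not valid UTF-8 at this position

print(raw.decode("utf-8", errors="ignore"))            # log entry caf
print(raw.decode("utf-8", errors="replace"))           # log entry caf�
print(raw.decode("utf-8", errors="backslashreplace"))  # log entry caf\xa0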



Referenced from @mikey's answer to UnicodeDecodeError when reading CSV file in Pandas: you can use chardet to detect the encoding of the CSV file before pandas reads it.

from pathlib import Path
import chardet
import pandas as pd

filename = "file_name.csv"
detected = chardet.detect(Path(filename).read_bytes())
# detected is something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
encoding = detected.get("encoding")
assert encoding, "Unable to detect encoding, is it a binary file?"
df = pd.read_csv(filename, encoding=encoding)
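A hedged adaptation to the question's folder layout, detecting each file's encoding separately since the CSVs may not all share one (log_files_path is the question's root directory; everything else is as above):

from os import listdir
from os.path import isfile, join
from pathlib import Path

import chardet
import pandas as pd

frames = []
for region in listdir(log_files_path):
    region_log_filepath = join(log_files_path, region)
    for name in listdir(region_log_filepath):
        path = join(region_log_filepath, name)
        if not isfile(path):
            continue
        # Detect this file's encoding; fall back to latin1, which accepts any byte
        detected = chardet.detect(Path(path).read_bytes())
        frames.append(pd.read_csv(path, encoding=detected.get("encoding") or "latin1"))

all_files = pd.concat(frames, ignore_index=True)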

