11

Is there any way in Pandas to capture the warning produced by setting error_bad_lines = False and warn_bad_lines = True? For instance the following script:

    import pandas as pd
    from StringIO import StringIO

    data = StringIO("""a,b,c
    1,2,3
    4,5,6
    6,7,8,9
    1,2,5
    3,4,5""")

    pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)

produces the warning:

Skipping line 4: expected 3 fields, saw 4 

I'd like to store this output to a string so that I can eventually write it to a log file to keep track of records that are being skipped.

I tried using the warnings module, but this does not appear to be a "warning" in the traditional sense. I'm using Python 2.7 and Pandas 0.16.

1
  • Is it possible to print the bad line? Commented Apr 3, 2020 at 6:41

2 Answers

9

I don't think this is implemented in pandas.
source1, source2

My solutions:

1. Pre- or post-processing

    import pandas as pd
    import csv

    df = pd.read_csv('data.csv', warn_bad_lines=True, error_bad_lines=False)

    # Compare the length of each row with a recommended value:
    RECOMMENDED = 3

    with open('data.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        for row in reader:
            if len(row) != RECOMMENDED:
                print("Length of row is: %r" % len(row))
                print(row)

    # Compare the length of each row with the number of columns in df:
    lencols = len(df.columns)
    print(lencols)

    with open('data.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        for row in reader:
            if len(row) != lencols:
                print("Length of row is: %r" % len(row))
                print(row)
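The pre-scan above prints the offending rows; the same idea can collect them into a string for a log file instead. The following is a minimal Python 3 sketch that uses only the standard library. The function name `find_bad_rows` and the sample data layout are assumptions made for illustration, and the message text simply imitates the one quoted in the question. Note that it flags any row whose length differs from the header, whereas pandas, as far as I know, only skips rows with too many fields.

```python
import csv
import io


def find_bad_rows(csv_file, expected_fields=None, delimiter=","):
    """Return (line_number, row) pairs whose field count differs from expected.

    If expected_fields is None, the length of the first row (the header) is used.
    """
    reader = csv.reader(csv_file, delimiter=delimiter)
    bad_rows = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if expected_fields is None:
            expected_fields = len(row)  # first non-blank row is treated as the header
        if len(row) != expected_fields:
            # reader.line_num counts the lines read from the source so far (1-based)
            bad_rows.append((reader.line_num, row))
    return bad_rows


# Same data as in the question, held in memory instead of a file.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n6,7,8,9\n1,2,5\n3,4,5\n")
bad = find_bad_rows(sample)

# Build a log-ready string, one message per bad row.
log_text = "\n".join(
    "Skipping line %d: expected 3 fields, saw %d (row: %r)" % (n, len(r), r)
    for n, r in bad
)
print(log_text)
```

Because the row itself is kept, this also covers the case where the content of the bad line needs to be logged, not only its line number.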

2. Replace sys.stdout and sys.stderr

    import pandas as pd
    import sys

    class RedirectStdStreams(object):
        def __init__(self, stdout=None, stderr=None):
            self._stdout = stdout or sys.stdout
            self._stderr = stderr or sys.stderr

        def __enter__(self):
            self.old_stdout, self.old_stderr = sys.stdout, sys.stderr
            self.old_stdout.flush(); self.old_stderr.flush()
            sys.stdout, sys.stderr = self._stdout, self._stderr

        def __exit__(self, exc_type, exc_value, traceback):
            self._stdout.flush(); self._stderr.flush()
            sys.stdout = self.old_stdout
            sys.stderr = self.old_stderr

    if __name__ == '__main__':
        # Replaces sys.stdout and sys.stderr, see http://stackoverflow.com/a/6796752/2901002
        # The file is closed automatically when the outer block ends.
        with open('log.txt', 'w') as log_file:
            with RedirectStdStreams(stdout=log_file, stderr=log_file):
                df = pd.read_csv('data.csv', warn_bad_lines=True, error_bad_lines=False)
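Instead of redirecting to a file, the stream can be replaced by a small file-like object that forwards each line to the `logging` module. The sketch below is Python 3 and standard library only; `LoggerWriter`, `ListHandler`, the logger name `csv_import`, and `noisy_parser` (a stand-in for the `pd.read_csv` call, so the example runs without pandas) are all hypothetical names introduced for illustration.

```python
import logging
import sys
from contextlib import redirect_stderr


class LoggerWriter:
    """Minimal file-like object that forwards complete lines to a logger."""

    def __init__(self, logger, level=logging.WARNING):
        self.logger = logger
        self.level = level
        self._buffer = ""

    def write(self, text):
        self._buffer += text
        # Emit one log record per complete line; keep any partial line buffered.
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line.strip():
                self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        # Emit whatever is left if the last write had no trailing newline.
        if self._buffer.strip():
            self.logger.log(self.level, self._buffer)
        self._buffer = ""


class ListHandler(logging.Handler):
    """Collects log messages in a list so they can be inspected later."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def noisy_parser():
    """Hypothetical stand-in for pd.read_csv(...) reporting a bad line on stderr."""
    sys.stderr.write("Skipping line 4: expected 3 fields, saw 4\n")


logger = logging.getLogger("csv_import")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep records away from handlers that may write to stderr
handler = ListHandler()
logger.addHandler(handler)

writer = LoggerWriter(logger)
with redirect_stderr(writer):
    noisy_parser()
writer.flush()

print(handler.messages)
```

Since `LoggerWriter` provides both `write` and `flush`, it should also be usable as the `stderr` argument of the `RedirectStdStreams` class above. In real use, a `logging.FileHandler` would take the place of `ListHandler`.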

1 Comment

Thanks! I will probably go with the second solution since I need to iterate over several files and unfortunately we're stuck with this format.
6

I can't help with versions older than Python 3, but I've had very good success with the following:

    import io
    import logging
    from contextlib import redirect_stderr

    import pandas as pd

    logger = logging.getLogger(__name__)

    # new_file_name and header_types are defined elsewhere in my script.

    # Redirect stderr to something we can report on.
    f = io.StringIO()
    with redirect_stderr(f):
        df = pd.read_csv(
            new_file_name, header=None, error_bad_lines=False,
            warn_bad_lines=True, dtype=header_types
        )

    if f.getvalue():
        logger.warning("Had parsing errors: {}".format(f.getvalue()))

I searched for this issue a number of times and kept being pointed to this question. Hope it helps someone else later on.
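Once the text has been captured with `f.getvalue()`, it can be turned into structured records, which makes it easier to track the skipped line numbers per file. This is a minimal sketch that assumes the message format quoted in the question ("Skipping line N: expected X fields, saw Y"); other pandas versions may word it differently. The helper name `parse_skipped_lines` and the second sample message are made up for illustration.

```python
import re

# Pattern based on the message shown in the question; an assumption for other versions.
BAD_LINE_RE = re.compile(r"Skipping line (\d+): expected (\d+) fields, saw (\d+)")


def parse_skipped_lines(captured):
    """Return one dict per skipped-line message found in the captured text."""
    return [
        {
            "line": int(match.group(1)),
            "expected": int(match.group(2)),
            "saw": int(match.group(3)),
        }
        for match in BAD_LINE_RE.finditer(captured)
    ]


# Sample captured text: the first message is from the question, the second is invented.
captured = (
    "Skipping line 4: expected 3 fields, saw 4\n"
    "Skipping line 9: expected 3 fields, saw 5\n"
)
skipped = parse_skipped_lines(captured)
print(skipped)
```

Using `finditer` rather than matching whole lines keeps the parser tolerant of extra characters around each message.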

2 Comments

can you define logger?
posted an edit. @scottlittle, replace logger.warning with print
