Merged
24 commits
d4c0cb7
Add doc and validation
mroeschke Dec 31, 2021
f654e39
Add whatsnew, testing, and docs
mroeschke Dec 31, 2021
6c12102
Fix whatsnew formatting
mroeschke Dec 31, 2021
1aee16c
Update doc
mroeschke Dec 31, 2021
103ae04
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Dec 31, 2021
4a853f9
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 1, 2022
d759a88
fix docstring validation
mroeschke Jan 1, 2022
dbf13e7
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 2, 2022
9b73ae4
Test is callable returns a row longer than expected length
mroeschke Jan 2, 2022
15752be
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 3, 2022
b77da02
Address comments
mroeschke Jan 3, 2022
39a83b4
Allow callable behavior returning None
mroeschke Jan 3, 2022
a5f3656
Add test for index_col inferred
mroeschke Jan 3, 2022
ae4d499
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 3, 2022
d3f9c40
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 4, 2022
8886bf8
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 5, 2022
e3b445d
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 5, 2022
013f05f
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 5, 2022
67b7e3e
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 6, 2022
743b83b
improve docs
mroeschke Jan 6, 2022
bd67152
type
mroeschke Jan 6, 2022
e04124a
Revert "improve docs"
mroeschke Jan 6, 2022
6a92f07
Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_…
mroeschke Jan 6, 2022
4817770
Add example of writing to an external list
mroeschke Jan 6, 2022
Address comments
mroeschke committed Jan 3, 2022
commit b77da025e7027af065e301af2cf5e87d224de860
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.4.0.rst
@@ -207,7 +207,7 @@ Other enhancements
- :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
- :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
- Added :meth:`DataFrameGroupBy.value_counts` (:issue:`43564`)
-- :meth:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines (:issue:`5686`)
+- :func:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines (:issue:`5686`)
- :class:`ExcelWriter` argument ``if_sheet_exists="overlay"`` option added (:issue:`40231`)
- :meth:`read_excel` now accepts a ``decimal`` argument that allow the user to specify the decimal point when parsing string columns to numeric (:issue:`14403`)
- :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, :meth:`.GroupBy.sum` now supports `Numba <http://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
1 change: 0 additions & 1 deletion pandas/io/parsers/python_parser.py
@@ -197,7 +197,6 @@ class MyDialect(csv.Dialect):
skipinitialspace = self.skipinitialspace
quoting = self.quoting
lineterminator = "\n"
-strict = not callable(self.on_bad_lines)

dia = MyDialect

7 changes: 5 additions & 2 deletions pandas/io/parsers/readers.py
@@ -355,7 +355,7 @@
.. deprecated:: 1.3.0
The ``on_bad_lines`` parameter should be used instead to specify behavior upon
encountering a bad line instead.
-on_bad_lines : str or callable, default 'error'
+on_bad_lines : {{'error', 'warn', 'skip'}} or callable, default 'error'
Specifies what to do upon encountering a bad line (a line with too many fields).
Allowed values are :

@@ -367,7 +367,10 @@

- callable, function with signature ``(bad_line: list[str]) -> list[str]``
that will process a single bad line. ``bad_line`` is a list of strings
Contributor

Am I right in thinking the output list[str] must be a certain length? If the output were the same as the input, for example, what would happen? I checked the tests, but they seem to cover only valid function cases.

Member Author

read_csv has a precedent of emitting a ParserWarning if a row has more elements than expected and then continuing to parse (it seems to drop the extra elements), so I think that if the callable returns such a row, a ParserWarning should be emitted as well.

Added a test to check this behavior.

Member

Technically it can return a list of Hashables; this should not be an issue.

We should document that the fallback behavior is a warning.

-split by the ``sep``. Only supported when ``engine="python"``
+split by the ``sep``. If the function returns a new list of strings
+with more elements than expected, a ``ParserWarning`` will be emitted
+while dropping extra elements.
+Only supported when ``engine="python"``

.. versionadded:: 1.4.0
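
As a rough illustration of the behavior this docstring describes, the sketch below passes a callable to ``on_bad_lines``. The helper name ``keep_first_two`` and the inline sample data are assumptions made for this example and are not part of the change; the data mirrors the CSV used in the tests.

```python
from io import StringIO

import pandas as pd

# Sample data: the third line has five fields, but the header declares two.
data = "a,b\n1,2\n2,3,4,5,6\n3,4\n"


def keep_first_two(bad_line):
    # Hypothetical helper for this example only.
    # bad_line arrives as a list of strings, already split by the separator;
    # keep just as many fields as the header has.
    return bad_line[:2]


# A callable is only accepted when the Python engine is selected.
df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=keep_first_two)
print(df)
```

Trimming inside the callable keeps the returned row at the expected length, so the fallback ``ParserWarning`` described above should not be needed here.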

36 changes: 31 additions & 5 deletions pandas/tests/io/parser/test_python_parser_only.py
@@ -338,7 +338,12 @@ def readline(self):
def test_on_bad_lines_callable(python_parser_only, bad_line_func):
# GH 5686
parser = python_parser_only
-bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+bad_sio = StringIO(data)
result = parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
tm.assert_frame_equal(result, expected)
@@ -347,7 +352,12 @@ def test_on_bad_lines_callable(python_parser_only, bad_line_func):
def test_on_bad_lines_callable_write_to_external_list(python_parser_only):
# GH 5686
parser = python_parser_only
-bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+bad_sio = StringIO(data)
lst = []

def bad_line_func(bad_line):
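
The diff view cuts this test off before the body of ``bad_line_func``. The pattern its name describes, recording bad lines in a list defined outside the callable, might look like the sketch below. The names ``rejected`` and ``record_and_trim`` are hypothetical, and the body is an assumption for illustration rather than the test's actual code.

```python
from io import StringIO

import pandas as pd

data = "a,b\n1,2\n2,3,4,5,6\n3,4\n"

# Hypothetical external list that collects every bad line the parser reports.
rejected = []


def record_and_trim(bad_line):
    # Keep the original fields for later inspection...
    rejected.append(bad_line)
    # ...and hand back a new row of the expected length.
    return bad_line[:2]


df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=record_and_trim)
print(rejected)
```

Because the callable returns a new list rather than modifying ``bad_line`` in place, the recorded entry still holds all of the original fields.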
@@ -366,7 +376,13 @@ def test_on_bad_lines_callable_iterator_true(python_parser_only, bad_line_func,
# GH 5686
# iterator=True has a separate code path than iterator=False
parser = python_parser_only
-bad_sio = StringIO(f"0{sep}1\nhi{sep}there\nfoo{sep}bar{sep}baz\ngood{sep}bye")
+data = f"""
+0{sep}1
+hi{sep}there
+foo{sep}bar{sep}baz
+good{sep}bye
+"""
+bad_sio = StringIO(data)
result_iter = parser.read_csv(
bad_sio, on_bad_lines=bad_line_func, chunksize=1, iterator=True, sep=sep
)
@@ -383,7 +399,12 @@ def test_on_bad_lines_callable_iterator_true(python_parser_only, bad_line_func,
def test_on_bad_lines_callable_dont_swallow_errors(python_parser_only):
# GH 5686
parser = python_parser_only
-bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+bad_sio = StringIO(data)
msg = "This function is buggy."

def bad_line_func(bad_line):
@@ -396,7 +417,12 @@ def bad_line_func(bad_line):
def test_on_bad_lines_callable_not_expected_length(python_parser_only):
# GH 5686
parser = python_parser_only
-bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+bad_sio = StringIO(data)

with tm.assert_produces_warning(ParserWarning, match="Length of header or names"):
result = parser.read_csv(bad_sio, on_bad_lines=lambda x: x)
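
Outside the test suite, the fallback that this test checks can be observed with the standard ``warnings`` module. The sketch below is a standalone approximation, assuming pandas 1.4 or later: an identity callable returns the over-long row unchanged, and the emitted warnings are captured for inspection. The variable name ``caught`` is an assumption of this example.

```python
import warnings
from io import StringIO

import pandas as pd

data = "a,b\n1,2\n2,3,4,5,6\n3,4\n"

with warnings.catch_warnings(record=True) as caught:
    # Make sure no warning is suppressed by an existing filter.
    warnings.simplefilter("always")
    # The identity callable returns the bad line with all five fields.
    df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=lambda line: line)

# Collect only the parser warnings, in case other warnings were raised too.
parser_warnings = [w for w in caught if issubclass(w.category, pd.errors.ParserWarning)]
for w in parser_warnings:
    print(w.message)
```

Per the docstring change in this commit, the extra elements are dropped while the warning is emitted, so parsing continues instead of failing.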