Skip to content

BUG: 1.3.0 column assignment via single columnnp.matrix behaviour change #42376

@ivirshup

Description

@ivirshup
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd import numpy as np from scipy import sparse X = sparse.random(100, 100, density=0.2, format="csr") df = pd.DataFrame({"a": np.arange(100)}) df["X_sum"] = X.sum(axis=1) df

Before 1.3.0, this worked fine. As of 1.3.0, displaying df fails with:

traceback
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /usr/local/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj) 700 type_pprinters=self.type_printers, 701 deferred_pprinters=self.deferred_printers) --> 702 printer.pretty(obj) 703 printer.flush() 704 return stream.getvalue() /usr/local/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj) 392 if cls is not object \ 393 and callable(cls.__dict__.get('__repr__')): --> 394 return _repr_pprint(obj, self, cycle) 395 396 return _default_pprint(obj, self, cycle) /usr/local/lib/python3.8/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle) 698 """A pprint that just redirects to the normal repr function.""" 699 # Find newlines and replace them with p.break_() --> 700 output = repr(obj) 701 lines = output.splitlines() 702 with p.group(): /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in __repr__(self) 993 else: 994 width = None --> 995 self.to_string( 996 buf=buf, 997 max_rows=max_rows, /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, min_rows, max_cols, show_dimensions, decimal, line_width, max_colwidth, encoding) 1129 decimal=decimal, 1130 ) -> 1131 return fmt.DataFrameRenderer(formatter).to_string( 1132 buf=buf, 1133 encoding=encoding, /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in to_string(self, buf, encoding, line_width) 1051 1052 string_formatter = StringFormatter(self.fmt, line_width=line_width) -> 1053 string = string_formatter.to_string() 1054 return save_to_buffer(string, buf=buf, encoding=encoding) 1055 /usr/local/lib/python3.8/site-packages/pandas/io/formats/string.py in to_string(self) 23 24 def to_string(self) -> str: ---> 25 text = self._get_string_representation() 26 if self.fmt.should_show_dimensions: 27 text = "".join([text, self.fmt.dimensions_info]) /usr/local/lib/python3.8/site-packages/pandas/io/formats/string.py in _get_string_representation(self) 38 return self._empty_info_line 39 ---> 40 strcols = self._get_strcols() 41 42 if self.line_width is None: /usr/local/lib/python3.8/site-packages/pandas/io/formats/string.py in _get_strcols(self) 29 30 def _get_strcols(self) -> list[list[str]]: ---> 31 strcols = self.fmt.get_strcols() 32 if self.fmt.is_truncated: 33 strcols = self._insert_dot_separators(strcols) /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in get_strcols(self) 538 Render a DataFrame to a list of columns (as lists of strings). 539 """ --> 540 strcols = self._get_strcols_without_index() 541 542 if self.index: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in _get_strcols_without_index(self) 802 int(self.col_space.get(c, 0)), *(self.adj.len(x) for x in cheader) 803 ) --> 804 fmt_values = self.format_col(i) 805 fmt_values = _make_fixed_width( 806 fmt_values, self.justify, minimum=header_colwidth, adj=self.adj /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_col(self, i) 816 frame = self.tr_frame 817 formatter = self._get_formatter(i) --> 818 return format_array( 819 frame.iloc[:, i]._values, 820 formatter, /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting) 1238 ) 1239 -> 1240 return fmt_obj.get_result() 1241 1242 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in get_result(self) 1269 1270 def get_result(self) -> list[str]: -> 1271 fmt_values = self._format_strings() 1272 return _make_fixed_width(fmt_values, self.justify) 1273 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in _format_strings(self) 1516 1517 def _format_strings(self) -> list[str]: -> 1518 return list(self.get_result_as_array()) 1519 1520 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in get_result_as_array(self) 1480 float_format = lambda value: self.float_format % value 1481 -> 1482 formatted_values = format_values_with(float_format) 1483 1484 if not self.fixed_width: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_values_with(float_format) 1454 values = self.values 1455 is_complex = is_complex_dtype(values) -> 1456 values = format_with_na_rep(values, formatter, na_rep) 1457 1458 if self.fixed_width: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_with_na_rep(values, formatter, na_rep) 1425 mask = isna(values) 1426 formatted = np.array( -> 1427 [ 1428 formatter(val) if not m else na_rep 1429 for val, m in zip(values.ravel(), mask.ravel()) /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in <listcomp>(.0) 1426 formatted = np.array( 1427 [ -> 1428 formatter(val) if not m else na_rep 1429 for val, m in zip(values.ravel(), mask.ravel()) 1430 ] ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /usr/local/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj) 343 method = get_real_method(obj, self.print_method) 344 if method is not None: --> 345 return method() 346 return None 347 else: /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in _repr_html_(self) 1045 decimal=".", 1046 ) -> 1047 return fmt.DataFrameRenderer(formatter).to_html(notebook=True) 1048 else: 1049 return None /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in to_html(self, buf, encoding, classes, notebook, border, table_id, render_links) 1027 render_links=render_links, 1028 ) -> 1029 string = html_formatter.to_string() 1030 return save_to_buffer(string, buf=buf, encoding=encoding) 1031 /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in to_string(self) 70 71 def to_string(self) -> str: ---> 72 lines = self.render() 73 if any(isinstance(x, str) for x in lines): 74 lines = [str(x) for x in lines] /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in render(self) 619 self.write("<div>") 620 self.write_style() --> 621 super().render() 622 self.write("</div>") 623 return self.elements /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in render(self) 76 77 def render(self) -> list[str]: ---> 78 self._write_table() 79 80 if self.should_show_dimensions: /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in _write_table(self, indent) 246 self._write_header(indent + self.indent_delta) 247 --> 248 self._write_body(indent + self.indent_delta) 249 250 self.write("</table>", indent) /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in _write_body(self, indent) 393 def _write_body(self, indent: int) -> None: 394 self.write("<tbody>", indent) --> 395 fmt_values = self._get_formatted_values() 396 397 # write values /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in _get_formatted_values(self) 583 584 def _get_formatted_values(self) -> dict[int, list[str]]: --> 585 return {i: self.fmt.format_col(i) for i in range(self.ncols)} 586 587 def _get_columns_formatted_values(self) -> list[str]: /usr/local/lib/python3.8/site-packages/pandas/io/formats/html.py in <dictcomp>(.0) 583 584 def _get_formatted_values(self) -> dict[int, list[str]]: --> 585 return {i: self.fmt.format_col(i) for i in range(self.ncols)} 586 587 def _get_columns_formatted_values(self) -> list[str]: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_col(self, i) 816 frame = self.tr_frame 817 formatter = self._get_formatter(i) --> 818 return format_array( 819 frame.iloc[:, i]._values, 820 formatter, /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting) 1238 ) 1239 -> 1240 return fmt_obj.get_result() 1241 1242 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in get_result(self) 1269 1270 def get_result(self) -> list[str]: -> 1271 fmt_values = self._format_strings() 1272 return _make_fixed_width(fmt_values, self.justify) 1273 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in _format_strings(self) 1516 1517 def _format_strings(self) -> list[str]: -> 1518 return list(self.get_result_as_array()) 1519 1520 /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in get_result_as_array(self) 1480 float_format = lambda value: self.float_format % value 1481 -> 1482 formatted_values = format_values_with(float_format) 1483 1484 if not self.fixed_width: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_values_with(float_format) 1454 values = self.values 1455 is_complex = is_complex_dtype(values) -> 1456 values = format_with_na_rep(values, formatter, na_rep) 1457 1458 if self.fixed_width: /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in format_with_na_rep(values, formatter, na_rep) 1425 mask = isna(values) 1426 formatted = np.array( -> 1427 [ 1428 formatter(val) if not m else na_rep 1429 for val, m in zip(values.ravel(), mask.ravel()) /usr/local/lib/python3.8/site-packages/pandas/io/formats/format.py in <listcomp>(.0) 1426 formatted = np.array( 1427 [ -> 1428 formatter(val) if not m else na_rep 1429 for val, m in zip(values.ravel(), mask.ravel()) 1430 ] ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I discovered this new behaviour due to our tests starting to fail. What was causing that was:

df["sum"] = X.sum(axis=1) df["log1p_sum"] = np.log1p(df["sum"])

failing with:

traceback
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-24-4ada647d56a0> in <module> 1 df["sum"] = X.sum(axis=1) ----> 2 df["log1p_sum"] = np.log1p(df["sum"]) /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in __setitem__(self, key, value) 3605 else: 3606 # set column -> 3607 self._set_item(key, value) 3608 3609 def _setitem_slice(self, key: slice, value): /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in _set_item(self, key, value) 3777 ensure homogeneity. 3778 """ -> 3779 value = self._sanitize_column(value) 3780 3781 if ( /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in _sanitize_column(self, value) 4502 4503 if is_list_like(value): -> 4504 com.require_length_match(value, self.index) 4505 return sanitize_array(value, self.index, copy=True, allow_2d=True) 4506 /usr/local/lib/python3.8/site-packages/pandas/core/common.py in require_length_match(data, index) 525 """ 526 if len(data) != len(index): --> 527 raise ValueError( 528 "Length of values " 529 f"({len(data)}) " ValueError: Length of values (1) does not match length of index (100)

Problem description

This problem is being triggered because the result of X.sum(axis=1) when X is a scipy sparse matrix is not a 1d numpy ndarray, but a np.matrix with one column. This used to be handled by pandas, but now isn't.

This is a problem because it's a behaviour change that breaks existing code. As far as I can tell from the release notes, this was not an intentional behaviour change. It does look like some things around column assignment did change, and I imagine that assigning with deprecated numpy types was not considered.

Expected Output

I would expect this to not error, and for this to pass: np.testing.assert_array_equal(df["X_sum"], np.ravel(X.sum(axis=1)))

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : f00ed8f47020034e752baf0250483053340971b0 python : 3.8.10.final.0 python-bits : 64 OS : Darwin OS-release : 20.5.0 Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.3.0 numpy : 1.21.0 pytz : 2020.1 dateutil : 2.8.1 pip : 21.1.3 setuptools : 56.0.0 Cython : 0.29.23 pytest : 6.2.4 hypothesis : None sphinx : 4.0.2 blosc : None feather : None xlsxwriter : None lxml.etree : 4.6.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.23.1 pandas_datareader: None bs4 : 4.9.3 bottleneck : None fsspec : 2021.06.0 fastparquet : 0.4.1 gcsfs : None matplotlib : 3.4.2 numexpr : 2.7.2 odfpy : None openpyxl : None pandas_gbq : None pyarrow : 4.0.1 pyxlsb : None s3fs : 0.4.2 scipy : 1.7.0 sqlalchemy : 1.3.18 tables : 3.6.1 tabulate : 0.8.7 xarray : 0.18.2 xlrd : 1.2.0 xlwt : None numba : 0.53.1 

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugCompatpandas objects compatability with Numpy or Python functionsIndexingRelated to indexing on series/frames, not to indexes themselvesRegressionFunctionality that used to work in a prior pandas version

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions