Class to read OpenDocument Tables #25427
Merged
63 commits
479e639 Class to read OpenDocument Tables
detrout 8be4b67 Remove unneeded assignments
detrout 77d9033 Rename filepath_or_stream to filepath_or_buffer
detrout 47b2ffb Use compat.string_types instead of str
detrout 0fa2ac9 Use pd as name as pandas
detrout e6e2365 Use single underscore for private functions
detrout 1bbf284 Return an unparsed sheet.
detrout d5c7ec0 Move ODFReader get_sheet exception testing code to its own function
detrout 691f1e9 Append _raises to end of function name that tests exceptions
detrout 93c2b66 Remove test docstrings that include no useful information
detrout 394c4bd Indicate likely minimum version.
detrout b149d84 Convert notes about some OpenDocument tests to comments
detrout 19587b3 Add note about new OpenDocument functionality to whatsnew
detrout 60a5bc1 Sort imports correctly
detrout 1fef008 Use str instead of compat.string_types
detrout 7148995 Remove leading underscore from ODFParser
detrout 5db1a0b Remove obsolete class (object)
detrout 83c0243 Improve docstring text
detrout 735e2b4 Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 8302fd7 Added test_odf
WillAyd d0df3bd Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 47597c9 Class naming consistency
WillAyd 9e1799a Whatsnew linting
WillAyd d5c60ab Added optional dependency load
WillAyd 39cfecf typo
WillAyd 8a9a66c Updated inheritance to use excel reader interface
WillAyd fd7663f Added ods test files
WillAyd 3bcc1b7 Updated tests
WillAyd 15e69eb convert_float handling
WillAyd 65615cd Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 9584753 Fixed missing value handling
WillAyd 9dc34f4 Fixed error handling
WillAyd 5e32f6d Fixed bool handling
WillAyd 6360c07 Skip missing file on master
WillAyd 4227268 datetime compat
WillAyd 80607b0 fixed row repeat
WillAyd 43f7160 multiindex handling
WillAyd 4da0445 Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd cbbc653 Handled horizontally merged cells
WillAyd 1227216 Converted to pytest idiom
WillAyd 696ed5d Test idiom cleanup
WillAyd 49fff9f Removed duplicative test files
WillAyd 7b08304 Raised NotImplemented for vertical merging
WillAyd 4d97d84 Table attribute access simplification
WillAyd 59cdf0b Typing and func cleanups
WillAyd 98d3ca7 lint and isort
WillAyd fb48d8d Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 6576af9 typing fixup
WillAyd 4dc1b51 Skip ods files for xlrd
WillAyd 8ce45b4 Removed one-off tests
WillAyd f9f88b0 Handled defusedxml warnings
WillAyd 3e0d758 Updated assert_warnings funcs to allow DeprecationWarnings
WillAyd ff28993 Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 7396ad6 Updated to config_init.py
WillAyd 5a440a4 Updated whatsnew
WillAyd 250a3d3 Updated io.rst
WillAyd d7e7d05 Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 93adedb Refactored to simplify
WillAyd 62a37e7 Removed unnecessary test
WillAyd 13fb76f lint fixup
WillAyd fb6c5ee mypy error
WillAyd 5c839f4 Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 4026fc1 Doc updates
WillAyd
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -32,6 +32,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like | |
| text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>` | ||
| text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>` | ||
| binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>` | ||
| binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`; | ||
| binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>` | ||
| binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>` | ||
| binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>` | ||
| | @@ -2779,9 +2780,10 @@ parse HTML tables in the top-level pandas io function ``read_html``. | |
| Excel files | ||
| ----------- | ||
| | ||
| The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and | ||
| Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python | ||
| module. The :meth:`~DataFrame.to_excel` instance method is used for | ||
| The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) | ||
| files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files | ||
| can be read using either ``xlrd`` or ``openpyxl``. | ||
| The :meth:`~DataFrame.to_excel` instance method is used for | ||
| saving a ``DataFrame`` to Excel. Generally the semantics are | ||
| similar to working with :ref:`csv<io.read_csv_table>` data. | ||
| See the :ref:`cookbook<cookbook.excel>` for some advanced strategies. | ||
| | @@ -3217,7 +3219,20 @@ The look and feel of Excel worksheets created from pandas can be modified using | |
| * ``float_format`` : Format string for floating point numbers (default ``None``). | ||
| * ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). | ||
| | ||
| .. _io.ods: | ||
| | ||
| OpenDocument Spreadsheets | ||
| ------------------------- | ||
| | ||
| The :func:`~pandas.read_excel` method can also read OpenDocument spreadsheets | ||
| using the ``odfpy`` module. The semantics and features for reading | ||
| OpenDocument spreadsheets match what can be done for `Excel files`_ using | ||
| ``engine='odf'``. | ||
| | ||
| Review comment (Contributor): can you show an example here (code is ok) | ||
| .. note:: | ||
| | ||
| Currently pandas only supports *reading* OpenDocument spreadsheets. Writing | ||
| is not implemented. | ||
| | ||
| .. _io.clipboard: | ||
| | ||
| | ||
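A minimal usage sketch for the documentation hunk above, assuming a pandas build that includes this change and an installed `odfpy`. The helper name `read_ods` and any path passed to it are hypothetical and not part of the PR; only `pd.read_excel(..., engine="odf")` comes from the diff.

```python
def read_ods(path, sheet_name=0):
    """Read one sheet of an OpenDocument spreadsheet into a DataFrame.

    Hypothetical convenience wrapper; pandas itself only exposes
    ``read_excel`` with ``engine="odf"``.
    """
    # Imported lazily so that defining the helper does not require
    # pandas or odfpy to be installed.
    import pandas as pd

    return pd.read_excel(path, sheet_name=sheet_name, engine="odf")
```

Calling `read_ods("example.ods")` (a placeholder file name) should return the first sheet. Writing is not covered, in line with the note in the hunk.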
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -164,6 +164,7 @@ Other enhancements | |
| - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module>`` is a library implementing the pandas plotting API (:issue:`14130`) | ||
| - :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`) | ||
| - :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`) | ||
| - :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. (:issue:`9070`) | ||
| ||
| | ||
| .. _whatsnew_0250.api_breaking: | ||
| | ||
| | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,176 @@ | ||
| from typing import List | ||
| | ||
| from pandas.compat._optional import import_optional_dependency | ||
| | ||
| import pandas as pd | ||
| from pandas._typing import FilePathOrBuffer, Scalar | ||
| | ||
| from pandas.io.excel._base import _BaseExcelReader | ||
| | ||
| | ||
| class _ODFReader(_BaseExcelReader): | ||
| """Read tables out of OpenDocument formatted files | ||
| | ||
| Parameters | ||
| ---------- | ||
| filepath_or_buffer: string, path to be parsed or | ||
| an open readable stream. | ||
| """ | ||
| def __init__(self, filepath_or_buffer: FilePathOrBuffer): | ||
| import_optional_dependency("odf") | ||
| super().__init__(filepath_or_buffer) | ||
| | ||
| @property | ||
| def _workbook_class(self): | ||
| from odf.opendocument import OpenDocument | ||
| return OpenDocument | ||
| | ||
| def load_workbook(self, filepath_or_buffer: FilePathOrBuffer): | ||
| from odf.opendocument import load | ||
| return load(filepath_or_buffer) | ||
| | ||
| @property | ||
| def empty_value(self) -> str: | ||
| """Property for compat with other readers.""" | ||
| return '' | ||
| | ||
| @property | ||
| def sheet_names(self) -> List[str]: | ||
| """Return a list of sheet names present in the document""" | ||
| from odf.table import Table | ||
| | ||
| tables = self.book.getElementsByType(Table) | ||
| return [t.getAttribute("name") for t in tables] | ||
| | ||
| def get_sheet_by_index(self, index: int): | ||
| from odf.table import Table | ||
| tables = self.book.getElementsByType(Table) | ||
| return tables[index] | ||
| | ||
| def get_sheet_by_name(self, name: str): | ||
| from odf.table import Table | ||
| | ||
| tables = self.book.getElementsByType(Table) | ||
| | ||
| for table in tables: | ||
| if table.getAttribute("name") == name: | ||
| return table | ||
| | ||
| raise ValueError("sheet {name} not found".format(name=name)) | ||
| | ||
| def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]: | ||
| """Parse an ODF Table into a list of lists | ||
| """ | ||
| from odf.table import CoveredTableCell, TableCell, TableRow | ||
| | ||
| covered_cell_name = CoveredTableCell().qname | ||
| table_cell_name = TableCell().qname | ||
| cell_names = {covered_cell_name, table_cell_name} | ||
| | ||
| sheet_rows = sheet.getElementsByType(TableRow) | ||
| empty_rows = 0 | ||
| max_row_len = 0 | ||
| | ||
| table = [] # type: List[List[Scalar]] | ||
| | ||
| for i, sheet_row in enumerate(sheet_rows): | ||
| sheet_cells = [x for x in sheet_row.childNodes | ||
| if x.qname in cell_names] | ||
| empty_cells = 0 | ||
| table_row = [] # type: List[Scalar] | ||
| | ||
| for j, sheet_cell in enumerate(sheet_cells): | ||
| if sheet_cell.qname == table_cell_name: | ||
| value = self._get_cell_value(sheet_cell, convert_float) | ||
| else: | ||
| value = self.empty_value | ||
| | ||
| column_repeat = self._get_column_repeat(sheet_cell) | ||
| | ||
| # Queue up empty values, writing only if content succeeds them | ||
| if value == self.empty_value: | ||
| empty_cells += column_repeat | ||
| else: | ||
| table_row.extend([self.empty_value] * empty_cells) | ||
| empty_cells = 0 | ||
| table_row.extend([value] * column_repeat) | ||
| | ||
| if max_row_len < len(table_row): | ||
| max_row_len = len(table_row) | ||
| | ||
| row_repeat = self._get_row_repeat(sheet_row) | ||
| if self._is_empty_row(sheet_row): | ||
| empty_rows += row_repeat | ||
| else: | ||
| # add blank rows to our table | ||
| table.extend([[self.empty_value]] * empty_rows) | ||
| empty_rows = 0 | ||
| for _ in range(row_repeat): | ||
| table.append(table_row) | ||
| | ||
| # Make our table square | ||
| for row in table: | ||
| if len(row) < max_row_len: | ||
| row.extend([self.empty_value] * (max_row_len - len(row))) | ||
| | ||
| return table | ||
| | ||
| def _get_row_repeat(self, row) -> int: | ||
| """Return number of times this row was repeated | ||
| Repeating an empty row appeared to be a common way | ||
| of representing sparse rows in the table. | ||
| """ | ||
| from odf.namespaces import TABLENS | ||
| | ||
| return int(row.attributes.get((TABLENS, 'number-rows-repeated'), 1)) | ||
| | ||
| def _get_column_repeat(self, cell) -> int: | ||
| from odf.namespaces import TABLENS | ||
| return int(cell.attributes.get( | ||
| (TABLENS, 'number-columns-repeated'), 1)) | ||
| | ||
| def _is_empty_row(self, row) -> bool: | ||
| """Helper function to find empty rows | ||
| """ | ||
| for column in row.childNodes: | ||
| if len(column.childNodes) > 0: | ||
| return False | ||
| | ||
| return True | ||
| | ||
| def _get_cell_value(self, cell, convert_float: bool) -> Scalar: | ||
| from odf.namespaces import OFFICENS | ||
| cell_type = cell.attributes.get((OFFICENS, 'value-type')) | ||
| if cell_type == 'boolean': | ||
| if str(cell) == "TRUE": | ||
| return True | ||
| return False | ||
| if cell_type is None: | ||
| return self.empty_value | ||
| elif cell_type == 'float': | ||
| # GH5394 | ||
| cell_value = float(cell.attributes.get((OFFICENS, 'value'))) | ||
| | ||
| if cell_value == 0. and str(cell) != cell_value: # NA handling | ||
| return str(cell) | ||
| | ||
| if convert_float: | ||
| val = int(cell_value) | ||
| if val == cell_value: | ||
| return val | ||
| return cell_value | ||
| elif cell_type == 'percentage': | ||
| cell_value = cell.attributes.get((OFFICENS, 'value')) | ||
| return float(cell_value) | ||
| elif cell_type == 'string': | ||
| return str(cell) | ||
| elif cell_type == 'currency': | ||
| cell_value = cell.attributes.get((OFFICENS, 'value')) | ||
| return float(cell_value) | ||
| elif cell_type == 'date': | ||
| cell_value = cell.attributes.get((OFFICENS, 'date-value')) | ||
| return pd.to_datetime(cell_value) | ||
| elif cell_type == 'time': | ||
| return pd.to_datetime(str(cell)).time() | ||
| else: | ||
| raise ValueError('Unrecognized type {}'.format(cell_type)) | ||
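The repeat handling in `get_sheet_data` can be sketched without `odfpy`. The version below is a simplified illustration, not the PR's code: cells are plain `(value, repeat)` tuples instead of ODF elements, a row counts as empty when it expands to nothing (the PR uses `_is_empty_row` instead), and the names `expand_row` and `expand_table` are hypothetical. It shows the same idea: empty cells and rows are queued and written only if real content follows them, so the large trailing repeats that spreadsheets often store are dropped.

```python
from typing import List, Tuple

EMPTY = ""  # mirrors the reader's empty_value

Cell = Tuple[str, int]          # (value, number-columns-repeated)
Row = Tuple[List[Cell], int]    # (cells, number-rows-repeated)


def expand_row(cells: List[Cell]) -> List[str]:
    """Expand repeated cells, dropping trailing empty ones."""
    row: List[str] = []
    pending_empty = 0
    for value, repeat in cells:
        if value == EMPTY:
            # Queue empties; they are written only if content follows.
            pending_empty += repeat
        else:
            row.extend([EMPTY] * pending_empty)
            pending_empty = 0
            row.extend([value] * repeat)
    return row


def expand_table(rows: List[Row]) -> List[List[str]]:
    """Expand repeated rows, drop trailing empty rows, pad to a rectangle."""
    table: List[List[str]] = []
    pending_empty_rows = 0
    for cells, row_repeat in rows:
        row = expand_row(cells)
        if not row:
            pending_empty_rows += row_repeat
            continue
        # Each queued blank row gets its own list object.
        table.extend([EMPTY] for _ in range(pending_empty_rows))
        pending_empty_rows = 0
        table.extend(list(row) for _ in range(row_repeat))

    width = max((len(r) for r in table), default=0)
    for r in table:
        r.extend([EMPTY] * (width - len(r)))
    return table


rows = [
    ([("a", 1), ("", 2), ("b", 1), ("", 1000)], 1),
    ([("", 1024)], 3),
    ([("x", 1)], 2),
    ([("", 1024)], 1048000),
]
result = expand_table(rows)
```

One design difference from the diff: this sketch copies each repeated row, whereas the PR appends the same `table_row` object `row_repeat` times. Copying avoids aliasing if rows are later mutated independently.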
Binary file not shown.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| import functools | ||
| | ||
| import numpy as np | ||
| import pytest | ||
| | ||
| import pandas as pd | ||
| import pandas.util.testing as tm | ||
| | ||
| pytest.importorskip("odf") | ||
| | ||
| | ||
| @pytest.fixture(autouse=True) | ||
| def cd_and_set_engine(monkeypatch, datapath): | ||
| func = functools.partial(pd.read_excel, engine="odf") | ||
| monkeypatch.setattr(pd, 'read_excel', func) | ||
| monkeypatch.chdir(datapath("io", "data")) | ||
| | ||
| | ||
| def test_read_invalid_types_raises(): | ||
| # the invalid_value_type.ods file required manual editing | ||
| # of its included content.xml file | ||
| with pytest.raises(ValueError, | ||
| match="Unrecognized type awesome_new_type"): | ||
| pd.read_excel("invalid_value_type.ods") | ||
| | ||
| | ||
| def test_read_writer_table(): | ||
| # Also test reading tables from a text OpenDocument file | ||
| # (.odt) | ||
| index = pd.Index(["Row 1", "Row 2", "Row 3"], name="Header") | ||
| expected = pd.DataFrame([ | ||
| [1, np.nan, 7], | ||
| [2, np.nan, 8], | ||
| [3, np.nan, 9], | ||
| ], index=index, columns=["Column 1", "Unnamed: 2", "Column 3"]) | ||
| | ||
| result = pd.read_excel("writertable.odt", 'Table1', index_col=0) | ||
| | ||
| tm.assert_frame_equal(result, expected) |
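The `cd_and_set_engine` fixture pins the engine by replacing `pd.read_excel` with a `functools.partial`. The sketch below shows the same pattern with the standard library only. `read_table`, `io_module`, and `pin_engine` are hypothetical stand-ins, and the manual restore at the end is what pytest's `monkeypatch` would do automatically.

```python
import functools
import types


def read_table(path, engine="default"):
    # Hypothetical stand-in for a reader function such as pd.read_excel.
    return (path, engine)


# Hypothetical stand-in for the module whose attribute gets patched.
io_module = types.SimpleNamespace(read_table=read_table)


def pin_engine(module, name, engine):
    """Replace module.<name> with a partial that presets ``engine``.

    Returns the original callable so the caller can restore it.
    """
    original = getattr(module, name)
    setattr(module, name, functools.partial(original, engine=engine))
    return original


original = pin_engine(io_module, "read_table", "odf")
pinned_result = io_module.read_table("writertable.odt")

# Undo the patch by hand.
io_module.read_table = original
restored_result = io_module.read_table("writertable.odt")
```

Because `partial` only presets a keyword argument, a test can still pass `engine=...` explicitly to override the pinned value.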
can you add a versionchanged here