Merged
49 commits
9a7b29a
[ENH] to_orc
Oct 3, 2021
d11026f
pandas.DataFrame.to_orc
Oct 3, 2021
0146ac3
Cleaning
Oct 3, 2021
0571602
Fix style & edit comments & change min dependency version to 5.0.0
chloeandmargaret Nov 21, 2021
d970b58
Fix style & add to see also
chloeandmargaret Nov 21, 2021
8b12e9f
Add ORC to documentation
chloeandmargaret Nov 21, 2021
65e6b7a
Changes according to review
chloeandmargaret Nov 22, 2021
2114616
Fix problems mentioned in comment
chloeandmargaret Nov 24, 2021
e4b40ef
Linter compliance
chloeandmargaret Nov 24, 2021
a7aa3e0
Address comments
chloeandmargaret Nov 24, 2021
1ab9b6c
Add orc test
chloeandmargaret Dec 2, 2021
96969d5
Fixes from pre-commit [automated commit]
chloeandmargaret Dec 3, 2021
2a54b8c
Fix issues according to comments
chloeandmargaret Mar 20, 2022
1caec9e
Simplify the code base after raising Arrow version to 7.0.0
chloeandmargaret Mar 21, 2022
6f0a538
Fix min arrow version in to_orc
chloeandmargaret Mar 21, 2022
ae65214
Add to_orc test in line with other formats
chloeandmargaret Mar 21, 2022
045c411
Add BytesIO support & test
chloeandmargaret Mar 22, 2022
c00ed0f
Fix some docs issues
chloeandmargaret Mar 22, 2022
fe275d7
Use keyword only arguments
chloeandmargaret Mar 25, 2022
9d3e0df
Fix bug
chloeandmargaret May 12, 2022
971f31c
Fix param issue
chloeandmargaret May 29, 2022
52b68a0
Doctest skipping due to minimal versions
chloeandmargaret May 29, 2022
76437ba
Doctest skipping due to minimal versions
chloeandmargaret May 29, 2022
c5d5852
Improve spacing in docstring & remove orc test in test_common that ha…
chloeandmargaret May 29, 2022
b5cd022
Fix docstring syntax
chloeandmargaret May 29, 2022
7ad3df9
ORC is not text
chloeandmargaret May 29, 2022
a73bb70
Fix BytesIO bug && do not require orc to be explicitly imported befor…
chloeandmargaret May 29, 2022
20aefe7
ORC writer does not work for categorical columns yet
chloeandmargaret May 29, 2022
e7e81fe
Appease mypy
chloeandmargaret May 29, 2022
6b659f7
Appease mypy
chloeandmargaret May 29, 2022
18e5429
Edit according to reviews
chloeandmargaret May 30, 2022
21cba6e
Fix path bug in test_orc
chloeandmargaret May 30, 2022
c7bf39f
Fix testdata tuple bug in test_orc
chloeandmargaret May 30, 2022
e43c6dd
Fix docstrings for check compliance
chloeandmargaret May 30, 2022
afa0a8a
read_orc does not have engine as a param
chloeandmargaret May 30, 2022
cd585e6
Fix sphinx warnings
chloeandmargaret May 30, 2022
b509c3c
Improve docs & rerun tests
chloeandmargaret May 30, 2022
1001002
Force retrigger
chloeandmargaret May 30, 2022
55cab6e
Fix test_orc according to review
chloeandmargaret Jun 7, 2022
89283e0
Rename some variables and func
chloeandmargaret Jun 7, 2022
989468a
Update pandas/core/frame.py
chloeandmargaret Jun 7, 2022
a7fca36
Fix issues according to review
chloeandmargaret Jun 12, 2022
7fc338c
Forced reruns
chloeandmargaret Jun 12, 2022
91d1556
Fix issues according to review
chloeandmargaret Jun 13, 2022
a28c5a8
Reraise Pyarrow TypeError as NotImplementedError
chloeandmargaret Jun 13, 2022
162e5bb
Fix bugs
chloeandmargaret Jun 13, 2022
b230583
Fix expected error msg in orc tests
chloeandmargaret Jun 13, 2022
e16edab
Avoid deprecated functions
chloeandmargaret Jun 13, 2022
e4770b8
Replace {} with None in arg
chloeandmargaret Jun 13, 2022
Add BytesIO support & test
chloeandmargaret committed Jun 7, 2022
commit 045c411d8640a002e2463c1df1b0ced498ca3bd9
17 changes: 0 additions & 17 deletions doc/source/user_guide/scale.rst
@@ -18,23 +18,6 @@ tool for all situations. If you're working with very large datasets and a tool
 like PostgreSQL fits your needs, then you should probably be using that.
 Assuming you want or need the expressiveness and power of pandas, let's carry on.

-.. ipython:: python
-
-   import pandas as pd
-   import numpy as np
-
-.. ipython:: python
-   :suppress:
-
-   from pandas._testing import _make_timeseries
-
-   # Make a random in-memory dataset
-   ts = _make_timeseries(freq="30S", seed=0)
-   ts.to_csv("timeseries.csv")
-   ts.to_orc("timeseries.orc")
-   ts.to_parquet("timeseries.parquet")
-
-
 Load less data
 --------------

11 changes: 10 additions & 1 deletion pandas/io/orc.py
@@ -1,6 +1,7 @@
 """ orc compat """
 from __future__ import annotations

+import io
 from typing import (
     TYPE_CHECKING,
     Literal,
@@ -100,6 +101,14 @@ def to_orc(
         raise ValueError("engine must be 'pyarrow'")
     engine = import_optional_dependency(engine, min_version="7.0.0")

+    path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
     engine.orc.write_table(
-        engine.Table.from_pandas(df, preserve_index=index), path, **kwargs
+        engine.Table.from_pandas(df, preserve_index=index), path_or_buf, **kwargs
Member

Does write_table support os.PathLike, (fsspec) URLs, and strings indicating compression? If not, it might be more consistent across the to_* methods to have something like this:

    was_none = path is None
    if was_none:
        path = BytesIO()
    with get_handle(path, "wb") as handle:
        engine.orc.write_table(..., handle.handle)
Contributor Author

It does for everything, with the possible exception of fsspec URLs, which still need to be tested.

Here is the API doc for the function:

https://arrow.apache.org/docs/python/generated/pyarrow.orc.write_table.html

Contributor Author

I don't think we necessarily have fsspec yet, so let's use the approach you mentioned.

     )
+
+    if path is None:
+        assert isinstance(path_or_buf, io.BytesIO)
+        return path_or_buf.getvalue()
+    else:
+        return None
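
The handle-based pattern suggested in the review thread above can be sketched with the standard library alone. This is only an illustration of the control flow, not the code that was merged: `open_binary_handle` and `write_payload` are hypothetical stand-ins for pandas' `get_handle` and `engine.orc.write_table`, and `to_bytes_or_path` is an invented name.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_binary_handle(target):
    """Hypothetical stand-in for pandas' get_handle(path, "wb")."""
    if isinstance(target, (str, os.PathLike)):
        # A path was given: open it, and close it when the block exits.
        with open(target, "wb") as fh:
            yield fh
    else:
        # A buffer was given: the caller owns it, so leave it open.
        yield target


def write_payload(payload, handle):
    """Hypothetical stand-in for engine.orc.write_table(table, handle)."""
    handle.write(payload)


def to_bytes_or_path(payload, path=None):
    """Write to ``path``; with no path, write to memory and return the bytes."""
    was_none = path is None
    if was_none:
        path = io.BytesIO()
    with open_binary_handle(path) as handle:
        write_payload(payload, handle)
    return path.getvalue() if was_none else None
```

Routing every target through one handle-opening helper keeps the writer call identical for paths and buffers; only the return value depends on whether a path was supplied.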

25 changes: 24 additions & 1 deletion pandas/tests/io/test_orc.py
@@ -226,7 +226,7 @@ def test_orc_reader_snappy_compressed(dirpath):
     tm.assert_equal(expected, got)


-def test_orc_roundtrip(dirpath):
+def test_orc_roundtrip_file(dirpath):
     # GH44554
     # PyArrow gained ORC write support with the current argument order
     pytest.importorskip("pyarrow", minversion="7.0.0")
Contributor

Put the decorator on the function itself.

Contributor Author

Hmm, it looks like that may not do what it is meant to:
pytest-dev/pytest#9542
pytest-dev/pytest#9548

We don't really want to skip all the other ORC tests for pyarrow versions below 7.0.0, since the ORC reader is about 2 years older than the writer.

Contributor

Like this:

pandas/tests/io/test_parquet.py: @td.skip_if_no("pyarrow", min_version="1.0.0")

We have this as a standard pattern.

Contributor Author

@jreback Thanks a lot! Fixed!

@@ -248,3 +248,26 @@ def test_orc_roundtrip(dirpath):
     got = read_orc(outputfile)

     tm.assert_equal(expected, got)
+
+
+def test_orc_roundtrip_bytesio():
+    # GH44554
+    # PyArrow gained ORC write support with the current argument order
+    pytest.importorskip("pyarrow", minversion="7.0.0")
Contributor

Same here.

Contributor Author

Please see comment above.

+    data = {
+        "boolean1": np.array([False, True], dtype="bool"),
+        "byte1": np.array([1, 100], dtype="int8"),
+        "short1": np.array([1024, 2048], dtype="int16"),
+        "int1": np.array([65536, 65536], dtype="int32"),
+        "long1": np.array([9223372036854775807, 9223372036854775807], dtype="int64"),
+        "float1": np.array([1.0, 2.0], dtype="float32"),
+        "double1": np.array([-15.0, -5.0], dtype="float64"),
+        "bytes1": np.array([b"\x00\x01\x02\x03\x04", b""], dtype="object"),
+        "string1": np.array(["hi", "bye"], dtype="object"),
+    }
+    expected = pd.DataFrame.from_dict(data)
+
+    bytesio = expected.to_orc()
+    got = read_orc(bytesio)
+
+    tm.assert_equal(expected, got)