
Conversation

@rosnfeld (Contributor) commented Mar 4, 2014

closes #6543

This is my first PR so I may have done a few things wrong. If I should have added my tests to a different file/location, please let me know.

I should point out that this code is also tested by many other sections of the pandas codebase, on my first pass I hadn't captured the "Timestamp - datetime" case and all sorts of things blew up. Better now. :)
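
A minimal sketch of the cases this PR covers (GH6543); the timestamps and deltas here are illustrative, not taken from the PR's actual tests:

```python
import pandas as pd
from datetime import datetime, timedelta

ts = pd.Timestamp('2014-03-04 10:00:00')

# Timestamp - datetime -> a timedelta-like result
delta = ts - datetime(2014, 3, 3, 10, 0, 0)

# Timestamp - timedelta -> an earlier Timestamp
earlier = ts - timedelta(days=1)
```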

@jreback jreback added this to the 0.14.0 milestone Mar 4, 2014
@jreback (Contributor) commented Mar 4, 2014

Not exactly sure why Travis doesn't run your branch (it's supposed to)... but this fails under numpy < 1.7:

https://travis-ci.org/jreback/pandas/jobs/20093901

You need to put in a conditional so this doesn't run under numpy < 1.7 (search for _np_version_under1p7) and skip that portion of the test (it's the broken timedelta support that causes the issue).

@immerrr (Contributor) commented Mar 5, 2014

@jreback this reminds me: I was looking at the BlockManager implementation the other day and it seemed to me that it was too busy micromanaging type coercions & casts. I thought that pandas could benefit from wrapping all the nasty numpy interaction details in a single module that would provide a pdarray subclass of np.ndarray with all the necessary casts and workarounds already applied (i.e. boxing/unboxing values, proper datetime parsing, type inference). It could also provide basic building blocks for more advanced algorithms, e.g. pandas-dtype-aware concatenation.

This seems an obvious application of single responsibility principle, did anyone consider/start doing this before? Were there any pitfalls that prevented that?

@immerrr (Contributor) commented Mar 5, 2014

I stand corrected: single responsibility principle is more about OOP, I meant separation of concerns.

@jreback (Contributor) commented Mar 5, 2014

@immerrr and how would this be different than the Block types? that's really their main purpose.

more interesting would be to cythonize BlockManager (but a bit of a project!)

@immerrr (Contributor) commented Mar 5, 2014

This is not really about Blocks, but rather about handling np.datetime64-, np.timedelta64- and np.bool_-related workarounds in a centralized manner. For a person outside the project, all that magic is impossible to follow, because the workarounds are scattered all over the code base.

This would require proxying all objects and function imports that may relate to numpy via some pdnumpy package. This is quite a project, as you say, but it would demarcate an area inside pandas within which everything behaves exactly as expected. That would make it possible to abstract away those gory details and focus on the actual value added by pandas. Do you agree?
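
A toy sketch of what such a wrapper might look like. The pdarray/pdnumpy names come from the comments above; the implementation is purely hypothetical, and nothing like this shipped in pandas:

```python
import numpy as np

class pdarray(np.ndarray):
    """Hypothetical ndarray subclass centralizing pandas' dtype workarounds."""

    def __new__(cls, data):
        arr = np.asarray(data)
        # one example of a centralized workaround: pandas always views
        # datetime64 data at nanosecond resolution
        if arr.dtype.kind == 'M':
            arr = arr.astype('datetime64[ns]')
        return arr.view(cls)

# a coarse-resolution datetime input is normalized on construction
a = pdarray(np.array(['2014-03-05'], dtype='datetime64[D]'))
```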

@jreback (Contributor) commented Mar 5, 2014

@immerrr makes sense. Though one could argue that these are actually conventions that pandas uses to deal with numpy. Not bugs per se, but more 'workarounds' to treat things in a nicer manner (e.g. taking views on datetime64[ns] and such). IMHO this would only be useful as a layer above numpy, and I am not sure of its utility outside pandas (my 2c).

Centralizing this would be a very nice feature as they are pretty scattered. I am all for elegant code, so I would be +1 on this.

There was some code in core/array.py to proxy out ndarray in general (it should probably be removed as it's only used in a very limited way and doesn't really do anything; side issue). Series no longer being a subclass of ndarray mostly 'fixed' the need for this (as I don't think pandas can ever really divorce itself from numpy).

@rosnfeld (Contributor, Author) commented Mar 5, 2014

I am new to Travis, hopefully I got it right this time - looks like the same build works after my latest change?

https://travis-ci.org/rosnfeld/pandas/jobs/20135641

@jreback (Contributor) commented Mar 5, 2014

yep looks good

pls rebase and squash to a single commit, see here: https://github.com/pydata/pandas/wiki

(I can do it for you when I merge, but it's nice to try to do it yourself; it makes for a clean workflow)

essentially:

git rebase -i origin/master

then squash, and

git push myfork mybranch -f

@rosnfeld (Contributor, Author) commented Mar 5, 2014

Ok, I hope I got it right. The command you posted has me rebasing onto my own fork, but not incorporating the latest upstream commits, right? So you may need to do a tiny bit of work to merge release.rst ...?

@jreback (Contributor) commented Mar 5, 2014

you can pull in the master changes by doing

git pull origin master

or do

git pull origin master --rebase

@rosnfeld (Contributor, Author) commented Mar 5, 2014

I think you mean "upstream/master" for me to incorporate new items from others? "origin/master" is my fork and doesn't have anything new.

Sadly while the change (using upstream/master) looks good to my eyes, Travis seems to be upset. I don't understand why.

@jreback (Contributor) commented Mar 5, 2014

Yes, upstream/master... rebase and force push again.

immerrr and others added 9 commits March 5, 2014 16:15
Preserve .names in df.set_index(df.index)
Check that df.set_index(df.index) doesn't convert a MultiIndex to an Index
Handle general case of df.set_index([df.index,...])
Cleanup
Add to release notes
Add equality checks
Fix issue on 2.6
Add example to whatsnew
…s (GH6335)
BUG: Changes types used in packing structs
Corrected incorrect data type conversion between pandas and Stata
Remove unnecessary, potentially precision degrading cast to Series when writing data
Added function to cast columns from NumPy data types to Stata data types
Corrected tests for correct Stata datatypes
Fixed formatting in comparison after casting
Added docstring for new function and warning class
BUG: Fixes and tests for extreme values in all data types
The extreme values of float and double (Stata; pandas equivalents: float32 and float64) were not correct. This resulted in incorrect truncation. The handling of missing values has been improved and code to convert missing values in any format has been added. The improvement differentiated between valid ranges for data and missing values. Additional issues were found when handling missing Dates, where missing Dates (NaT) were converted to non-missing dates when written. A test has been added for extreme numeric values as well as missing values.
Fixed legacy date issue with format 114 files
Added test for 114 files
Added format 114 (Stata 9/10/11) data file
Add test for Stata data with file format 114
Added additional data files for testing alternative Stata file formats
Added expected result to test
Renamed Stata data files to include file format
Types used for integer conversion were always half the size they should be. This produced a bug when exporting data tables with long integer data (np.int64).
Added test for integer conversion bug
Added test for incorrect integer conversion from int16, int32 and int64
Added additional data files for testing alternative Stata file formats
Added expected result to test
Renamed Stata data files to include file format
Disabled the big endian skips
BUG/TST: fix handling of slice.stop < -len; obj.iloc[:-len(obj)] should be empty
BUG/TST: fix exceptions raised by Series.iloc when slice.start > len
CLN: remove unused _check_slice_bound function and raise_on_error params
@rosnfeld (Contributor, Author) commented Mar 5, 2014

Ok, I am confused now. Not sure why a second rebase did anything, though it did, and now this PR shows everyone's commits instead of just my own (probably because my branch is now "ahead" of my origin/master). I guess I would have expected this result after my first rebase. Meanwhile, Travis is still upset.

@jreback (Contributor) commented Mar 5, 2014

merged via baa4c1d

@rosnfeld git takes some getting used to... but great job on this PR!

thanks!

@jreback jreback closed this Mar 5, 2014
@rosnfeld rosnfeld deleted the subtract_timedelta branch March 7, 2014 00:19

Labels

Bug · Datetime (Datetime data dtype) · Timedelta (Timedelta data type)

6 participants