
Conversation

@erikcs erikcs commented Dec 1, 2016

Please see #14778 for details.

I make wide_to_long a bit faster (avoiding the slow regex search over long columns by first converting to Categorical, avoiding melting all dataframes with all the id variables, and deferring the attempt to convert the "time" variable to int until the end), and clean up the docstring.
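Not the PR's actual code, but a minimal sketch of the Categorical trick mentioned above: run the regex once per unique category instead of once per row, then map the result back through the codes.

```python
import re

import numpy as np
import pandas as pd

# A long object column with only a few distinct values.
col = pd.Series(['A1970', 'B1980'] * 50_000)

# Slow path: regex over every element.
slow = col.str.match('^A')

# Faster path: categorize first, match only the unique categories,
# then broadcast the per-category result back via the integer codes.
cat = pd.Categorical(col)
matched = np.array([bool(re.match('^A', c)) for c in cat.categories])
fast = pd.Series(matched[cat.codes])

assert slow.equals(fast)
```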

@erikcs erikcs closed this Dec 1, 2016
jreback commented Dec 1, 2016

FYI, you can simply push to this PR as you update.

@erikcs erikcs reopened this Dec 1, 2016
@sinhrks sinhrks added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Docs labels Dec 1, 2016
new = df[id_vars].set_index(i).join(mstubs)

try:
new.index.set_levels(new.index.levels[-1].astype(int), level=-1,
Review comment:

when / why does this raise? can you provide a comment

@erikcs erikcs Dec 1, 2016


This is just the same int conversion attempt done in the original code here, since this "time" column may contain strings that cannot necessarily be converted to integers. As in the original, the index is set to [i, j], which is why this operation is done on the index at the end. I will add a comment; sorry for the confusion.
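The pattern under discussion, as a self-contained sketch (data made up for illustration, not the PR's exact code): attempt the int conversion on the last index level, and fall back silently when the suffixes are not numeric.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], ['2010', '2011']],
                                 names=['id', 'year'])
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

# Try to convert the last ("time") level to int; leave it alone if the
# suffixes are not numeric (e.g. 'one', 'two').
try:
    df.index = df.index.set_levels(df.index.levels[-1].astype(int),
                                   level=-1)
except ValueError:
    pass

assert list(df.index.levels[-1]) == [2010, 2011]
```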

@jreback jreback added the Performance Memory or execution speed performance label Dec 1, 2016
erikcs commented Dec 1, 2016

asv added:

[  0.00%] · For pandas commit hash 4014f118:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big    130.66ms
[ 50.00%] · For pandas commit hash 06f26b51:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big    2.05s

    before     after       ratio
  [06f26b51] [4014f118]
-    2.05s    130.66ms     0.06  reshape.wide_to_long_big.time_wide_to_long_big
codecov-io commented Dec 2, 2016

Current coverage is 85.27% (diff: 96.15%)

Merging #14779 into master will increase coverage by <.01%

@@            master   #14779   diff @@
==========================================
  Files          144      144
  Lines        50981    50989     +8
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
+ Hits         43470    43481    +11
+ Misses        7511     7508     -3
  Partials         0        0

Powered by Codecov. Last update cb2d6eb...df1edf8

new = df[id_vars].set_index(i).join(mstubs)

# The index of the new dataframe is [i, j], if the j column is a time
# variable, try to convert this to integer.
Review comment:

not sure I understand what you are doing. can you show the index before / after

Reply (erikcs):

Jeff, here is an example:

In [8]: N = 3
   ...: df = pd.DataFrame({'A 2010': np.random.rand(N),
   ...:                    'A 2011': np.random.rand(N),
   ...:                    'B 2010': np.random.rand(N),
   ...:                    'B 2011': np.random.rand(N),
   ...:                    'X': np.random.randint(N, size=N)})
   ...: df['id'] = df.index
   ...: df
Out[8]:
     A 2010    A 2011    B 2010    B 2011  X  id
0  0.731823  0.790627  0.236080  0.727762  1   0
1  0.820396  0.474342  0.614218  0.363226  0   1
2  0.463291  0.210859  0.332595  0.061011  0   2

Before the try/except:

In [9]: before = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
   ...: before.index
Out[9]:
MultiIndex(levels=[[0, 1, 2], [u' 2010', u' 2011']],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

After:

In [10]: after = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: after.index
Out[10]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

which is the same as on the master branch:

In [11]: master = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: master.index
Out[11]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

Why the original author wrapped the int conversion in a try:

In [13]: df2 = pd.DataFrame({'A one': np.random.rand(N),
    ...:                     'A two': np.random.rand(N),
    ...:                     'B one': np.random.rand(N),
    ...:                     'B two': np.random.rand(N),
    ...:                     'X': np.random.randint(N, size=N)})
    ...: df2
Out[13]:
      A one     A two     B one     B two  X
0  0.315281  0.684260  0.397193  0.531613  1
1  0.156044  0.749942  0.923540  0.383348  0
2  0.577983  0.507933  0.226466  0.937341  0

In long format:

In [15]: df2['id'] = df2.index
    ...: pd.wide_to_long(df2, ['A', 'B'], i='id', j='year')
Out[15]:
         X         A         B
id year
0  one   1  0.315281  0.397193
1  one   0  0.156044  0.923540
2  one   0  0.577983  0.226466
0  two   1  0.684260  0.531613
1  two   0  0.749942  0.383348
2  two   0  0.507933  0.937341
Review comment:

I don't like the auto coercing of the strings -> ints. This is not very idiomatic and unexpected. I would leave the columns as strings.

@erikcs erikcs Dec 4, 2016


Fixed, but regarding a character that separates the stub name from the variable part:

In [7]: df = pd.DataFrame({'A.2010': np.random.rand(N),
   ...:                    'A.2011': np.random.rand(N),
   ...:                    'B.2010': np.random.rand(N),
   ...:                    'B.2011': np.random.rand(N),
   ...:                    'X': np.random.randint(N, size=N)})
   ...: df
Out[7]:
     A.2010    A.2011    B.2010    B.2011  X
0  0.873404  0.467946  0.569808  0.358077  1
1  0.780154  0.554582  0.668437  0.810530  1
2  0.884003  0.555784  0.246305  0.038423  2

In [8]: df['id'] = df.index
   ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year')
Out[8]:
         X        A.        B.
id year
0  2010  1  0.873404  0.569808
1  2010  1  0.780154  0.668437
2  2010  2  0.884003  0.246305
0  2011  1  0.467946  0.358077
1  2011  1  0.554582  0.810530
2  2011  2  0.555784  0.038423

A user might expect the new separating character (.) to be stripped, like reshape in R does.

jreback commented Dec 4, 2016

needs a whatsnew entry (0.20.0)

jreback commented Dec 4, 2016

you could add an argument to specify the split (or make it take a regex)

yes it should get stripped

Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of ``wide_to_long`` (:issue:`14779`)
Review comment:

use ``pd.wide_to_long()`` here



-def wide_to_long(df, stubnames, i, j):
+def wide_to_long(df, stubnames, i, j, sep=""):
Review comment:

maybe make it sep='\s+' whitespace?

Reply (erikcs):

hmm strange, rstrip doesn't seem to recognise that?

In [13]: 'A (quarterly) '.rstrip('\s+')
Out[13]: 'A (quarterly) '

In [14]: 'A (quarterly) '.rstrip(" ")
Out[14]: 'A (quarterly)'
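The reason for the surprise above: str.rstrip treats its argument as a *set of characters* to strip, not as a regex, so '\s+' means "strip any trailing backslash, 's', or '+'". A small illustration (plain Python, no pandas involved):

```python
import re

s = 'A (quarterly) '

# rstrip with an argument strips a character set, not a regex: none of
# the characters {'\\', 's', '+'} end the string, so nothing is removed.
assert s.rstrip(r'\s+') == 'A (quarterly) '

# To strip trailing whitespace by regex, use re.sub instead:
assert re.sub(r'\s+$', '', s) == 'A (quarterly)'

# (plain rstrip() with no argument also strips whitespace)
assert s.rstrip() == 'A (quarterly)'
```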
The name of the subobservation variable.
stubend : str
Regex to match for the end of the stubs.
sep : str, optional
Review comment:

specify what the default is

exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]]
long_frame = wide_to_long(df, ['A', 'B'], 'id', 'year')
tm.assert_frame_equal(long_frame, exp_frame)

Review comment:

can you add some tests with sep (and maybe some that have an invalid sep)?

Reply (erikcs):

What input were you thinking about?

If a nonsense separator is passed, nothing is stripped:

In [15]: df = pd.DataFrame({'A.2010': np.random.rand(3),
    ...:                    'A.2011': np.random.rand(3),
    ...:                    'B.2010': np.random.rand(3),
    ...:                    'B.2011': np.random.rand(3),
    ...:                    'X': np.random.randint(3, size=3)})
    ...: df['id'] = df.index
    ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep="nope")
Out[15]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798

In [16]: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep=",,")
Out[16]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798
jreback commented Dec 4, 2016

jreback commented Dec 5, 2016

that's a regex

erikcs commented Dec 5, 2016

Sorry, I didn't get that? And melt looks pretty rock solid.

jreback commented Dec 5, 2016

@Nuffe I was referring to wide_to_long in the docs (e.g. is the example ok).

you are splitting on the sep character, but it could also be a regex, so doing something like

In [7]: re.split('\s+', 'A 2010')
Out[7]: ['A', '2010']

In [8]: re.split('\s+', 'A  2010')
Out[8]: ['A', '2010']

is probably reasonable

erikcs commented Dec 5, 2016

To keep things simple I propose we break the API of wide_to_long and change the signature to adhere more to how R's reshape does this.

wide_to_long(df, varying, i, j, sep=' ')

I.e. the user passes the names of the time varying column as a list varying, and these (x) are expected (checked) to adhere to the following rules (more or less what reshape assumes):

- based on sep, x is split into exactly two strings
- sep (a single character) is constrained to be non-alphanumeric: \s, ., ;, etc.
- the first part is the 'stubname', the second part is the 'time' part

In [3]: df = pd.DataFrame({"A 1970": {0: "a", 1: "b", 2: "c"},
   ...:                    "A 1980": {0: "d", 1: "e", 2: "f"},
   ...:                    "B 1970": {0: 2.5, 1: 1.2, 2: .7},
   ...:                    "B 1980": {0: 3.2, 1: 1.3, 2: .1},
   ...:                    "X": dict(zip(range(3), np.random.randn(3)))})
   ...: df['id'] = df.index
   ...: df
Out[3]:
  A 1970 A 1980  B 1970  B 1980         X  id
0      a      d     2.5     3.2  0.136953   0
1      b      e     1.2     1.3 -1.238109   1
2      c      f     0.7     0.1  1.249809   2

In [4]: varying = ['A 1970', 'A 1980', 'B 1970', 'B 1980']
   ...: pd.wide_to_long(df, varying, i='id', j='year', sep=' ')
Out[4]:
                X  A    B
id year
0  1970  0.136953  a  2.5
1  1970 -1.238109  b  1.2
2  1970  1.249809  c  0.7
0  1980  0.136953  d  3.2
1  1980 -1.238109  e  1.3
2  1980  1.249809  f  0.1

The user can easily construct the varying list with a regex, a doc example can show this.

If the existing columns do not adhere to the above specification, they need to be changed to a suitable format first. A doc example can show how this is easily done with a regex with a backreference.

What do you think?

jreback commented Dec 5, 2016

the varying should be a list of tuples
space separated elements in lists are not pythonic

but otherwise looks ok

can this be backward compat?

erikcs commented Dec 5, 2016

I didn't understand the first comment: varying is just the names of all the columns that should be varying. For a sample dataframe it could be df.iloc[:, 4:11].columns.tolist(). That the names happen to be space separated is just how the data columns ended up looking in messy real-world data.

And I do not think this can be made backward compatible, because the varying argument would be different; stubnames would now be computed inside the function.

The old doc example, where there is no single-character separator (e.g. varying = ['A1970', 'A1980', 'B1970', 'B1980']), will only work by first converting the column names to the allowed format with

df.columns.str.replace('([A-B])', '\\1.')
Index([u'A.1970', u'A.1980', u'B.1970', u'B.1980', u'X', u'id'], dtype='object')

then calling wide_to_long with varying = ['A.1970', 'A.1980', 'B.1970', 'B.1980']

I do not know if this really is considered too unwieldy...? R's reshape interface is perhaps not the most user friendly, but with plenty of doc examples its flexibility should maybe be appreciated?

The original function author seems to have tried to mimic Stata's reshape, which essentially takes only the stubnames as arguments. The problem is that in Stata column names are highly constrained (they cannot, for example, contain whitespace or non-alphanumeric characters), while in pandas they can be any utf8 string, which makes this much harder to generalize.

So if we want to preserve the original author's intention, where the user only supplies stubnames as in Stata, we need to impose some strict assumptions on the column names passed: the only (varying) column names we can have are of the type PrefixPostfix, where Prefix and Postfix are alphanumeric. These are the only ones Stata's reshape needs to consider, and what the original wide_to_long implicitly assumes. We can also handle PrefixSepPostfix, where Sep is a single separating character.

Perhaps it is just better to make this implicit assumption explicit and keep the "Stata-like" interface? And make it robust to this specification (PrefixSepPostfix), because the master branch function breaks with plenty of variations of it.

(Sorry for the messiness here, but I ended up spending some time familiarizing myself with R's less user-friendly approach to this problem, and with Stata's more user-friendly but less flexible approach.)

erikcs commented Dec 5, 2016

So here is an attempt to make the original interface more robust. These two examples fail on the master branch, but should produce the correct result, which is:

In [12]: df = pd.DataFrame({'A11': ['a11', 'a22', 'a33'],
    ...:                    'A12': ['a21', 'a22', 'a23'],
    ...:                    'B11': ['b11', 'b12', 'b13'],
    ...:                    'B12': ['b21', 'b22', 'b23'],
    ...:                    'BB11': [1, 2, 3],
    ...:                    'BB12': [4, 5, 6],
    ...:                    'BBBX': [91, 92, 93],
    ...:                    'BBBZ': [91, 92, 93]})
    ...: df['id'] = df.index
    ...: df
Out[12]:
   A11  A12  B11  B12  BB11  BB12  BBBX  BBBZ  id
0  a11  a21  b11  b21     1     4    91    91   0
1  a22  a22  b12  b22     2     5    92    92   1
2  a33  a23  b13  b23     3     6    93    93   2

In [13]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year')
Out[13]:
         BBBX  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6

In [14]: df = pd.DataFrame({'A(quarterly)2011': ['a11', 'a22', 'a33'],
    ...:                    'A(quarterly)2012': ['a21', 'a22', 'a23'],
    ...:                    'B(quarterly)2011': ['b11', 'b12', 'b13'],
    ...:                    'B(quarterly)2012': ['b21', 'b22', 'b23'],
    ...:                    'BB(quarterly)2011': [1, 2, 3],
    ...:                    'BB(quarterly)2012': [4, 5, 6],
    ...:                    'BBBX': [91, 92, 93],
    ...:                    'BBBZ': [91, 92, 93]})
    ...: df['id'] = df.index
    ...: df
Out[14]:
  A(quarterly)2011 A(quarterly)2012 B(quarterly)2011 B(quarterly)2012  \
0              a11              a21              b11              b21
1              a22              a22              b12              b22
2              a33              a23              b13              b23

   BB(quarterly)2011  BB(quarterly)2012  BBBX  BBBZ  id
0                  1                  4    91    91   0
1                  2                  5    92    92   1
2                  3                  6    93    93   2

In [15]: pd.wide_to_long(df, ['A(quarterly)', 'B(quarterly)', 'BB(quarterly)'],
    ...:                 i='id', j='year')
Out[15]:
         BBBX  BBBZ A(quarterly) B(quarterly)  BB(quarterly)
id year
0  2011    91    91          a11          b11              1
1  2011    92    92          a22          b12              2
2  2011    93    93          a33          b13              3
0  2012    91    91          a21          b21              4
1  2012    92    92          a22          b22              5
2  2012    93    93          a23          b23              6

The first one fails because the regex confuses the same substrings in the id_vars and value_vars, the second one because of the parentheses.

Assuming a Prefix(Optional Sep)Postfix structure on the "time" variables, I tried to make it robust:

In [16]: df = pd.DataFrame({'A11': ['a11', 'a22', 'a33'],
    ...:                    'A12': ['a21', 'a22', 'a23'],
    ...:                    'B11': ['b11', 'b12', 'b13'],
    ...:                    'B12': ['b21', 'b22', 'b23'],
    ...:                    'BB11': [1, 2, 3],
    ...:                    'BB12': [4, 5, 6],
    ...:                    'Acat': [91, 92, 93],
    ...:                    'BBBZ': [91, 92, 93]})
    ...: df['id'] = df.index
    ...: df
Out[16]:
   A11  A12  Acat  B11  B12  BB11  BB12  BBBZ  id
0  a11  a21    91  b11  b21     1     4    91   0
1  a22  a22    92  b12  b22     2     5    92   1
2  a33  a23    93  b13  b23     3     6    93   2

raises a ValueError: ('Ambiguous names: ', ['A11', 'A12', 'Acat']) .

While the following works

In [18]: df = pd.DataFrame({'A-11': ['a11', 'a22', 'a33'],
    ...:                    'A-12': ['a21', 'a22', 'a23'],
    ...:                    'B-11': ['b11', 'b12', 'b13'],
    ...:                    'B-12': ['b21', 'b22', 'b23'],
    ...:                    'BB-11': [1, 2, 3],
    ...:                    'BB-12': [4, 5, 6],
    ...:                    'Acat': [91, 92, 93],
    ...:                    'BBBZ': [91, 92, 93]})
    ...: df['id'] = df.index
    ...: df
Out[18]:
  A-11 A-12  Acat B-11 B-12  BB-11  BB-12  BBBZ  id
0  a11  a21    91  b11  b21      1      4    91   0
1  a22  a22    92  b12  b22      2      5    92   1
2  a33  a23    93  b13  b23      3      6    93   2

In [19]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year', sep='-')
Out[19]:
         Acat  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6
jreback commented Dec 5, 2016

cc @jseabold any thoughts here?

@Nuffe ideally we want to make this back-compat (you can do introspection in the code to figure out what you are passed and such), and have as simple an API as possible.

erikcs commented Dec 6, 2016

I have maintained the user-friendly (and evidently Stata-inspired) interface, stated what structure this function assumes on the column names, and tried to fix mistakes that arise with various "pathological" inputs, for example stubnames that share a substring (which I discovered when I tried different examples in Stata and compared them to wide_to_long).

in a typical case.
"""
# For robustness, escape every user-input string we use in a regex
import re
Review comment:

can be imported at the top of the file

# For ex. AA2011, AA2012, AAkitten have inconsistent postfix
for k, vars in enumerate(value_vars):
stripped = map(lambda x: x.replace(stubs[k], ""), vars)
is_digit = [s.isdigit() for s in stripped]
Review comment:

you have tests for this?

@erikcs erikcs Dec 7, 2016


Considering the comment below on not using a regex to find the id_vars: perhaps just formulate a consistency check, and use warnings to warn the user if, for example, an inferred value_var has mixed types?

For example: at the end, check whether the new dataframe's 'j' index contains both ints and strings, and warn about it? If the stubnames supplied are ['AA2011', 'AA2012'] and df contains a column named Acat, then the new dataframe's j column will have levels 2011, 2012, cat. Likewise, if stubnames contains ['CatOne', 'CatTwo'] and df has a column named Cat3000, the new j index will have levels One, Two, 3000.

The only way to disambiguate the first case is to take an optional stubendtype parameter denoting that the stubends are numbers. The second case is not possible to disambiguate (I tried in Stata).

# two resulting value_vars lists
if len(value_vars_flattened + id_vars) != len(df.columns):
value_vars_augmented = map(lambda x: get_var_names(
df, "^{0}".format(re.escape(x))), stubnames)
Review comment:

this looks fragile. I would just raise here

@erikcs erikcs Dec 6, 2016


Or instead of doing a search for the id_vars in the first place, would it not be simpler just to do:

id_vars = set(df.columns.tolist()).difference(value_vars_flattened)?

(then do some consistency checks)
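The set-difference suggestion above, sketched with hypothetical column names (not the PR's code):

```python
import pandas as pd

# Hypothetical wide frame: four value columns plus two id columns.
df = pd.DataFrame(columns=['A1970', 'A1980', 'B1970', 'B1980', 'X', 'id'])
value_vars_flattened = ['A1970', 'A1980', 'B1970', 'B1980']

# Everything that is not a matched value column is an id column,
# so no regex search over the id columns is needed at all.
id_vars = sorted(set(df.columns).difference(value_vars_flattened))
assert id_vars == ['X', 'id']
```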

# This regex is needed to avoid multiple "greedy" matches with stubs
# that have overlapping substrings
# (for example A2011, A2012 are separate from AA2011, AA2012)
value_vars = list(map(lambda x: get_var_names(
Review comment:

ideally you would just look for a match of a letter followed by a non-letter (or vice versa), I think that is more robust.

@erikcs erikcs Dec 6, 2016


But in the case of string stems, the three groups here will not be captured:

Aone, Atwo, Bone, Btwo, BBone, BBtwo

A negative lookahead ^B(?!B) could be more robust? I.e. the regex would be "^{0}(?!{1})".format(re.escape(x), x[-1]). That one would capture the three groups here and ignore, for example, BBBrating.
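A sketch of the negative-lookahead idea with the example stubs above (get_var_names here is a stand-in for illustration, not the PR's helper):

```python
import re

columns = ['Aone', 'Atwo', 'Bone', 'Btwo', 'BBone', 'BBtwo', 'BBBrating']

def get_var_names(cols, stub):
    # Negative lookahead: match columns starting with `stub` whose next
    # character is NOT the stub's last character, so stub 'B' does not
    # swallow 'BBone', and 'BB' does not swallow 'BBBrating'.
    pattern = '^{0}(?!{1})'.format(re.escape(stub), re.escape(stub[-1]))
    return [c for c in cols if re.match(pattern, c)]

assert get_var_names(columns, 'B') == ['Bone', 'Btwo']
assert get_var_names(columns, 'BB') == ['BBone', 'BBtwo']
```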

erikcs commented Dec 8, 2016

I found another "Stata-like" use case wide_to_long doesn't handle, described at this page: if the dataframe does not have a single id column that identifies the wide variables, Stata users can supply the necessary column names, e.g. ['id1', 'id2']. Currently the user would have to generate a new column that identifies the ['id1', 'id2'] combination and pass that as id.

(I am going to add the option of supplying a list of 'id' variables; it will require another short rewrite, since I have to move from join to merge to handle the new multilevel index.)
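A rough sketch of what a compound id key looks like with melt/set_index (hypothetical data; not the PR's implementation):

```python
import pandas as pd

# Hypothetical frame where ('id1', 'id2') together identify a row.
df = pd.DataFrame({'id1': ['a', 'a', 'b'],
                   'id2': [1, 2, 1],
                   'A1970': [0.1, 0.2, 0.3],
                   'A1980': [0.4, 0.5, 0.6]})

melted = df.melt(id_vars=['id1', 'id2'],
                 value_vars=['A1970', 'A1980'],
                 var_name='year', value_name='A')
# Strip the stub 'A' from the suffix, leaving the year part.
melted['year'] = melted['year'].str.replace('A', '')
long = melted.set_index(['id1', 'id2', 'year'])
assert long.index.nlevels == 3
```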

erikcs commented Dec 9, 2016

Sometimes AppVeyor/Travis fails with unrelated tests (like test_bar_log_subplots right now). Any hint on what I should do here? I've looked at the failing cases (unrelated plot methods), and there doesn't seem to be any mutated state, e.g. a changed random number seed. (Tests pass on my OS X laptop in both Python 2 and Python 3 virtual envs.) Thanks

jreback commented Dec 10, 2016

can you rebase? problem with AppVeyor which I just fixed

Speed up by avoiding big copies, and regex on categorical column
Add functionality to deal with "pathological" input
Add docstring examples and more test cases
in the wide format, to be stripped from the names in the long format.
For example, if your column names are A-suffix1, A-suffix2, you
can strip the hyphen by specifying `sep`='-'
numeric_suffix : bool, default True
Review comment:

I would rather call this suffix='\d+', IOW use a regex to match this, no?

Reply (erikcs):

Yes, that makes more sense

Going from long back to wide just takes some creative use of `unstack`
>>> w = l.reset_index().set_index(['famid', 'birth', 'age']).unstack()
>>> w.columns = [name + suffix for name, suffix in w.columns.tolist()]
Review comment:

use this:

In [28]: Index(w.columns).str.join('')
Out[28]: Index(['ht1', 'ht2'], dtype='object')
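A self-contained version of that long-to-wide round trip (data invented for illustration; a list comprehension is used to flatten the columns since the suffixes here are ints, whereas Index.str.join('') expects string tuples):

```python
import pandas as pd

# Long-format data: one row per (famid, birth, age) combination.
long = pd.DataFrame({'famid': [1, 1, 2, 2],
                     'birth': [1, 1, 1, 1],
                     'age':   [1, 2, 1, 2],
                     'ht':    [2.8, 3.4, 2.9, 3.8]})

# Unstack the innermost level back into columns.
w = long.set_index(['famid', 'birth', 'age']).unstack()

# w.columns is now a MultiIndex of (name, suffix) tuples like ('ht', 1);
# flatten it back into 'ht1', 'ht2' style wide names.
w.columns = [name + str(suffix) for name, suffix in w.columns]
assert list(w.columns) == ['ht1', 'ht2']
```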
if any(map(lambda s: s in df.columns.tolist(), stubnames)):
raise ValueError("stubname can't be identical to a column name")

if not isinstance(stubnames, list):
Review comment:

we usually use is_list_like, IOW, you can pass a non-string iterable, can you update the doc-string as well

self.assertEqual(stubs, ['inc', 'edu'])

def test_separating_character(self):
np.random.seed(123)
Review comment:

can you add this issue number as a comment

@jreback jreback added this to the 0.20.0 milestone Dec 11, 2016
jreback commented Dec 11, 2016

lgtm. some minor comments.

@jorisvandenbossche

Use is_list_like
Add GH ticket #
erikcs commented Dec 11, 2016

@jreback another minor issue: Sphinx doesn't print the \ in the docstring's suffix section. The only way I managed to get it printed was to make the entire docstring a raw string literal and escape the backslash: '\\d+'

@jorisvandenbossche

> set the entire docstring to a raw string literal

You can indeed do that, but normally then the escaping should not be needed
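For reference, a tiny illustration of the raw-docstring point (plain Python; whether Sphinx additionally needs the double escape is a separate rendering question):

```python
def f():
    r"""Hypothetical docstring.

    suffix : str, default '\d+'
        A regular expression matching the wide-format suffixes.
    """

# With the r-prefix, the backslash survives into the docstring as-is:
assert r'\d+' in f.__doc__
```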

@jorisvandenbossche jorisvandenbossche left a comment


I didn't follow the full discussion above, but there was some talk about backwards compatibility. What is the conclusion on that? Is the last version back compat or are there changes in behaviour?


def setup(self):
vars = 'ABCD'
nyrs = 20
Review comment:

Can you fix up the indentation here?

idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))

self.df = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)],
axis=1)
Review comment:

I think you can also do something like DataFrame(np.random.randn(N, nidvars + len(yrvars)), columns=list(range(nidvars)) + yrvars) to make it a bit simpler

erikcs commented Dec 11, 2016

@jorisvandenbossche Yes, this version is back compat. The PR got a bit lengthy because I did more than I anticipated (it was originally a simple PR for a quick speed improvement, but I discovered afterwards that there were several use cases the original function couldn't handle).

@jreback jreback closed this in 86233e1 Dec 13, 2016
jreback commented Dec 13, 2016

thanks @Nuffe very nice PR, and you were very responsive!

if you want to tackle other issues would be much appreciated!

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
closes pandas-dev#14778

Please see #14778 for details. Make wide_to_long a bit faster (avoid slow regex search on long columns by first converting to Categorical, avoid melting all dataframes with all the id variables, and wait with trying to convert the "time" variable to `int` until last), and clear up the docstring.

Author: nuffe <erik.cfr@gmail.com>

Closes pandas-dev#14779 from nuffe/wide2longfix and squashes the following commits:

df1edf8 [nuffe] asv_bench: fix indentation and simplify
dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them)
295d1e6 [nuffe] Use pd.Index in doc example
1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex
54c5920 [nuffe] Specify the suffix with a regex
5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (pandas-dev#14779)

Labels

Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode

5 participants