Python equivalent of R c() function, for dataframe column indices?

Question

I would like to select specific columns from a pandas dataframe by column index.

In particular, I would like to select the columns whose indices are generated by c(12:26,69:85,96:99,134:928,933:935,940:967) in R. How can I do that in Python?

I am thinking of something like the following, but of course Python does not have a function called c()...

input2 = input2.iloc[:,c(12:26,69:85,96:99,134:928,933:935,940:967)]

From TFM (pandas.pydata.org/pandas-docs/stable/indexing.html): You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner.
– hrbrmstr, Commented Sep 7, 2015 at 17:48
Thanks @hrbrmstr for your prompt response! I have read the help page you linked, but I still do not know how to solve my problem. I do not know how to build the list of column indices quickly: in R I can use c(12:26,69:85,96:99,134:928,933:935,940:967), but I do not know how to do that in Python. Thanks!
– user5309995, Commented Sep 7, 2015 at 17:56
list(range(12, 26) + range(69, 85) + range(96, 99) + range(134, 928) + range(933, 935) + range(940, 967)) — hrbrmstr
– hrbrmstr, Commented Sep 7, 2015 at 18:30
Do you only want the equivalent of c() for (numerical) dataframe column indices, or also for concatenating (string) column names ('labels' in pandas terminology)? df.loc[:, ['a','b','c']] can handle both, whereas numpy.r_ only works on numerical indices, not string labels.
– smci, Commented Nov 15, 2019 at 1:11

smci · Accepted Answer · 2019-11-15 23:58:31Z

7

The equivalent is numpy's r_. It combines integer slices into a single index array, without calling range for each of them:

np.r_[2:4, 7:11, 21:25]
Out: array([ 2,  3,  7,  8,  9, 10, 21, 22, 23, 24])

df = pd.DataFrame(np.random.randn(1000))
df.iloc[np.r_[2:4, 7:11, 21:25]]
Out:
           0
2   2.720383
3   0.656391
7  -0.581855
8   0.047612
9   1.416250
10  0.206395
21 -1.519904
22  0.681153
23 -1.208401
24 -0.358545
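The example above selects rows, while the question asks about columns. A minimal sketch of the column case follows, using a made-up 10-column frame (the column names c0..c9 are illustrative, not from the question). Keep in mind that R's 12:26 is 1-based and includes 26, so if those numbers are R column positions, the matching 0-based slice for iloc would be 11:26.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 columns named c0..c9 (illustrative data, not from the question).
df = pd.DataFrame(np.arange(30).reshape(3, 10),
                  columns=[f"c{i}" for i in range(10)])

# np.r_ expands the slices into one integer array of positions.
cols = np.r_[1:3, 5:8]

# ":" keeps every row; the array picks columns by 0-based position.
subset = df.iloc[:, cols]
print(list(subset.columns))
```

The same pattern should carry over to the question's frame, e.g. input2.iloc[:, np.r_[...]] with the slices adjusted as described.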

edited Nov 15, 2019 at 23:58

smci

answered Feb 13, 2017 at 21:42

user2285236

11 Comments

Hendy Over a year ago

Wow. Surprised this isn't voted more. While other answers might be more pythonic, it surprised me coming from R that python was sooooo verbose. This is the true analog to c(), though I wonder why the dunder... does that imply it's a quasi-private method?

user2285236 Over a year ago

@Hendy Python is a general-purpose language so many of the things that R offers out of the box (let's say vector things) are provided by third party libraries in Python (such as numpy and pandas). I guess that's the reason for verbosity.

smci Over a year ago

Not really. df.loc[:, ['a','b','c']] can handle both, whereas numpy.r_ only works on numerical indices, not string labels. I have never needed numpy.r_, and I've never seen it used in pandas code either.

user2285236 Over a year ago

@smci You cannot pass non contiguous slices to loc or iloc without a helper like np.r_. That's the whole point of the question.

smci Over a year ago

@ayhan: yes you can, you just use list notation on the expanded slices e.g. df.iloc[[1, 3, 8, 9, 10], [1, 3]]. I've never seen numpy.r_ used in pandas.

tshynik · Accepted Answer · 2017-02-13 21:56:14Z

Putting @hrbrmstr's comment into an answer, because it solved my issue and I want to make it clear that this question is resolved. In addition, please note that range(a, b) gives the numbers a, a+1, ..., b-2, b-1 and doesn't include b, so each upper bound taken from R has to be increased by one.

R's combine function

c(4,12:26,69:85,96:99,134:928,933:935)

is translated into Python as

[4] + list(range(12,27)) + list(range(69,86)) + list(range(96,100)) + list(range(134,929)) + list(range(933,936))
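To avoid adding one to every upper bound by hand, the translation can be wrapped in a small helper. This is only a sketch: r_c is a hypothetical name, and the convention that a tuple (a, b) stands for R's a:b is an assumption made for this example.

```python
from itertools import chain

def r_c(*parts):
    """Hypothetical helper: ints are kept as-is, (a, b) tuples expand like R's a:b (b included)."""
    expanded = (
        range(p[0], p[1] + 1) if isinstance(p, tuple) else [p]
        for p in parts
    )
    return list(chain.from_iterable(expanded))

# Same values as the hand-written list above.
idx = r_c(4, (12, 26), (69, 85), (96, 99), (134, 928), (933, 935))
print(idx[:5], len(idx))
```

If the numbers are R column positions, each would also need to be shifted down by one before being passed to iloc, since pandas counts positions from 0.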

miraculixx · Accepted Answer · 2019-11-25 12:15:57Z

To answer the actual question,

Python equivalent of R c() function, for dataframe column indices?

I'm using this definition of c()

c = lambda v: v.split(',') if ":" not in v else eval(f'np.r_[{v}]')

Then we can do things like:

df = pd.DataFrame({'x': np.random.randn(1000), 'y': np.random.randn(1000)})
# row selection
df.iloc[c('2:4,7:11,21:25')]
# columns by name
df[c('x,y')]
# columns by range
df.T[c('12:15,17:25,500:750')]

That's pretty much as close as it gets in terms of R-like syntax.
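Because the lambda above passes its argument to eval, it should only be given trusted strings. A possible eval-free variant is sketched below. c_safe is a hypothetical name, and it only understands plain integers and start:stop pairs (no steps or negative bounds), which is narrower than what np.r_ accepts.

```python
import numpy as np

def c_safe(spec):
    """Hypothetical eval-free variant of c(): 'a:b,c:d' gives positions, 'x,y' gives labels."""
    parts = [p.strip() for p in spec.split(",")]
    if not any(":" in p for p in parts):
        return parts  # no slices at all: treat the pieces as column labels
    pieces = []
    for p in parts:
        if ":" in p:
            start, stop = (int(x) for x in p.split(":"))
            pieces.append(np.arange(start, stop))  # stop excluded, like np.r_
        else:
            pieces.append(np.array([int(p)]))
    return np.concatenate(pieces)

print(c_safe("2:4,7:11,21:25"))
print(c_safe("x,y"))
```

It can be dropped into the same calls as c(), for example df.iloc[c_safe('2:4,7:11,21:25')].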

To the curious mind

Note that there is a performance penalty in using c() as defined above vs. calling np.r_ directly. To paraphrase Knuth, let's not optimize prematurely ;-)

%timeit np.r_[2:4, 7:11, 21:25]
27.3 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit c("2:4, 7:11, 21:25")
53.7 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
