5

I would like to select specific columns from a pandas dataframe by column index.

In particular, I would like to select the columns whose indices are generated by c(12:26,69:85,96:99,134:928,933:935,940:967) in R. How can I do that in Python?

I am thinking of something like the following, but of course Python does not have a function called c()...

input2 = input2.iloc[:,c(12:26,69:85,96:99,134:928,933:935,940:967)] 
4
  • 6
    From TFM - pandas.pydata.org/pandas-docs/stable/indexing.html - You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner. Commented Sep 7, 2015 at 17:48
  • Thanks @hrbrmstr for your prompt responses! I have read the help file in the link you posted, but still do not know how to solve my problem...I do not know how to create the list of column index fast, like in R I can use c(12:26,69:85,96:99,134:928,933:935,940:967), but I do not know how to do that in Python. Thanks! Commented Sep 7, 2015 at 17:56
  • 4
    list(range(12, 26) + range(69, 85) + range(96, 99) + range(134, 928) + range(933, 935) + range(940, 967)) Commented Sep 7, 2015 at 18:30
  • Do you only want the equivalent of c() for (numerical) dataframe column indices, or also for concatenating (string) column names ('labels' in Pandas terminology)? pandas.loc[:, ['a','b','c']] can handle both, whereas numpy.r_ only works on numerical indices, not string labels Commented Nov 15, 2019 at 1:11

3 Answers 3

7

The equivalent is numpy's r_. It combines integer slices without needing to call ranges for each of them:

import numpy as np
import pandas as pd

np.r_[2:4, 7:11, 21:25]
Out: array([ 2,  3,  7,  8,  9, 10, 21, 22, 23, 24])

df = pd.DataFrame(np.random.randn(1000))
df.iloc[np.r_[2:4, 7:11, 21:25]]
Out:
           0
2   2.720383
3   0.656391
7  -0.581855
8   0.047612
9   1.416250
10  0.206395
21 -1.519904
22  0.681153
23 -1.208401
24 -0.358545
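
For the column selection asked about in the question, the same idea goes in the second position of iloc. The following is only a minimal sketch on a made-up 30-column frame (the names demo, cols and subset are invented for illustration). Keep in mind that R ranges are 1-based and include both ends, while np.r_ slices are 0-based and exclude the stop, so an R range a:b corresponds to np.r_[a-1:b].

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row, 30-column frame; labels "c1".."c30" mirror R's 1-based positions.
demo = pd.DataFrame(
    np.arange(90).reshape(3, 30),
    columns=[f"c{i}" for i in range(1, 31)],
)

# R: demo[, c(2:4, 10:12, 28:30)]
# Python: subtract 1 from each start, keep each stop (the stop is exclusive).
cols = np.r_[1:4, 9:12, 27:30]
subset = demo.iloc[:, cols]

print(list(subset.columns))
```

With the ranges from the question, the same pattern would read input2.iloc[:, np.r_[11:26, 68:85, 95:99, 133:928, 932:935, 939:967]], assuming the R indices were meant as 1-based column positions.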

11 Comments

Wow. Surprised this isn't voted more. While other answers might be more pythonic, it surprised me coming from R that python was sooooo verbose. This is the true analog to c(), though I wonder why the dunder... does that imply it's a quasi-private method?
@Hendy Python is a general-purpose language so many of the things that R offers out of the box (let's say vector things) are provided by third party libraries in Python (such as numpy and pandas). I guess that's the reason for verbosity.
Not really. pandas.loc[:, ['a','b','c']] can handle both, whereas numpy.r_ only works on numerical indices, not string labels. I have never needed numpy.r_, and I've never seen it used in pandas code either.
@smci You cannot pass non contiguous slices to loc or iloc without a helper like np.r_. That's the whole point of the question.
@ayhan: yes you can, you just use list notation on the expanded slices e.g. df.iloc[[1, 3, 8, 9, 10], [1, 3]]. I've never seen numpy.r_ used in pandas.
5

Putting @hrbrmstr's comment into an answer, because it solved my issue and I want to make it clear that this question is resolved. In addition, please note that range(a,b) gives the numbers (a, a+1, ..., b-2, b-1) and does not include b.

R's combine function

c(4,12:26,69:85,96:99,134:928,933:935) 

is translated into Python as

[4] + list(range(12,27)) + list(range(69,86)) + list(range(96,100)) + list(range(134,929)) + list(range(933,936)) 
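
To avoid adjusting every stop value by hand, the translation can be wrapped in a small helper. This is only a sketch: r_c and its zero_based flag are names made up here, not part of pandas or NumPy. It takes single integers and inclusive (start, stop) pairs, as R does, and can optionally shift everything by one, since R positions are 1-based while iloc positions are 0-based.

```python
from itertools import chain


def r_c(*parts, zero_based=False):
    """Hypothetical helper: expand ints and inclusive (start, stop) pairs like R's c()."""
    expanded = chain.from_iterable(
        # An inclusive pair (a, b) becomes range(a, b + 1); a bare int stays as is.
        range(p[0], p[1] + 1) if isinstance(p, tuple) else [p]
        for p in parts
    )
    # Optionally convert R's 1-based positions to 0-based positions for iloc.
    offset = 1 if zero_based else 0
    return [i - offset for i in expanded]


# Same values as the hand-written list above.
idx = r_c(4, (12, 26), (69, 85), (96, 99), (134, 928), (933, 935))

# Shifted by one, ready for something like df.iloc[:, positions].
positions = r_c(4, (12, 26), zero_based=True)
```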


1

To answer the actual question,

Python equivalent of R c() function, for dataframe column indices?

I'm using this definition of c()

c = lambda v: v.split(',') if ":" not in v else eval(f'np.r_[{v}]') 

Then we can do things like:

df = pd.DataFrame({'x': np.random.randn(1000), 'y': np.random.randn(1000)})

# row selection
df.iloc[c('2:4,7:11,21:25')]

# columns by name
df[c('x,y')]

# columns by range
df.T[c('12:15,17:25,500:750')]

That's pretty much as close as it gets in terms of R-like syntax.
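
Since the definition above relies on eval, it should only be given strings you trust. If that is a concern, a variant that parses the string itself might look like the sketch below. The name c_safe is made up here, and it only handles plain 'a:b' pieces and single integers (no steps, no open-ended slices), with the same half-open semantics as np.r_.

```python
import numpy as np


def c_safe(spec):
    """Hypothetical eval-free variant of c(): 'a:b' pieces become half-open integer ranges."""
    # No colon anywhere: treat the string as a comma-separated list of column names.
    if ":" not in spec:
        return [name.strip() for name in spec.split(",")]

    pieces = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            start, stop = (int(x) for x in part.split(":"))
            pieces.append(np.arange(start, stop))  # stop is excluded, as in np.r_
        else:
            pieces.append(np.array([int(part)]))
    return np.concatenate(pieces)


rows = c_safe('2:4,7:11,21:25')
names = c_safe('x,y')
```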

To the curious mind

Note that there is a performance penalty in using c() as defined above vs. np.r_. To paraphrase Knuth, let's not optimize prematurely ;-)

%timeit np.r_[2:4, 7:11, 21:25]
27.3 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit c("2:4, 7:11, 21:25")
53.7 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
