28

Calling .shape on my Dask dataframe gives me the following error:

AttributeError: 'DataFrame' object has no attribute 'shape'

How should I get the shape instead?

7 Answers

37

You can get the number of columns directly

len(df.columns) # this is fast 

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df) # this requires a full scan of the data 

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

4 Comments

len(df) loads all of the records; in my case, computing len(df) for a table of 144M rows took more than a few minutes (Windows 10, 16 GB RAM, Intel i7). Is there any other way?
It probably has to load all of the data to find out the length. No, there is no other way. You could consider using something like a database, which tracks this sort of information in metadata.
I've been using df.index.size.compute(), which is faster than running len(df), but my data is stored in columnar parquet, so it depends on what your underlying data architecture is.
Just want to second what @user108569 said about not using len(dataframe) with columnar parquet. I timed the two methods he suggested on my parquet data (a tiny subset of the giant C4 dataset, 10,000 samples). I get this output: "Time to get len(data): 1.17129" and "Time to get data.index.size.compute(): 0.00381". So the difference is ~300x.
28

With shape you can do the following

a = df.shape
a[0].compute(), a[1]

This will show the shape just as it is shown with pandas

7

I know this is quite an old question, but I had the same issue and found an outside-the-box solution that I want to record here.

Given your data, I suspect it was originally saved in a CSV-like file. In my situation, I simply count the lines of that file (minus one for the header line). Inspired by this answer here, this is the solution I'm using:

import dask.dataframe as dd
from itertools import takewhile, repeat

def rawincount(filename):
    # Count newline bytes by reading the file in 1 MiB unbuffered chunks.
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

filename = 'myHugeDataframe.csv'
df = dd.read_csv(filename)
# Rows = line count minus the header line; the column count comes from metadata.
df_shape = (rawincount(filename) - 1, len(df.columns))
print(f"Shape: {df_shape}")
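
One caveat with counting newline bytes: if the last line of the file has no trailing newline, the count comes out one short, and quoted fields that contain embedded newlines would be overcounted. Here is a hedged sketch that handles the first case using only the standard library. The function name count_rows, its parameters, and the demo file are hypothetical, and it still assumes no newlines inside quoted fields:

```python
import os
import tempfile

def count_rows(filename, has_header=True, chunk_size=1024 * 1024):
    """Count data rows in a text file by counting newline bytes."""
    newlines = 0
    last_byte = b""
    with open(filename, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            newlines += buf.count(b"\n")
            last_byte = buf[-1:]
    # A non-empty file that does not end with a newline has one more line
    # than it has newline bytes.
    lines = newlines + (1 if last_byte and last_byte != b"\n" else 0)
    return max(lines - (1 if has_header else 0), 0)

# Demo on a throwaway file: a header plus three data rows,
# written once without and once with a trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n3,4\n5,6")
    rows_no_trailing = count_rows(path)
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n3,4\n5,6\n")
    rows_trailing = count_rows(path)

print(rows_no_trailing, rows_trailing)
```

Both calls should report the same number of data rows, so the result can be paired with len(df.columns) in the same way as above.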

Hope this could help someone else as well.

2 Comments

This approach is very fast and takes advantage of distributed processing in dask.
Thank you! This is faster than the other possible solution of loading a single column and obtaining its length.
3
print('(',len(df),',',len(df.columns),')') 

1

To get the shape, we can try it this way:

dask_dataframe.describe().compute()

The "count" row of the result gives the number of non-null values in each numeric column, which equals the number of rows when a column has no missing values.

len(dask_dataframe.columns)

This gives the number of columns in the dataframe.

0

For a dask dataframe named df:

df.compute().shape 

returns a tuple (number of rows, number of columns). Note that compute() loads the whole dataframe into memory as a pandas DataFrame first.

-2

You can get the number of columns with the code below.

import dask.dataframe as dd
dd1 = dd.read_csv("filename.txt")
dd1.info()  # info is a method and prints its summary when called
# Output
# <class 'dask.dataframe.core.DataFrame'>
# Columns: 6 entries, CountryName to Value
# dtypes: object(4), float64(1), int64(1)

2 Comments

In pandas, shape outputs both the number of rows and the number of columns. I don't think showing only the number of columns answers the OP's question.
What does "Columns: 6 entries" mean in the output? I am using dask, FYI.
