How should I get the shape of a dask dataframe?

Question

Calling .shape on a Dask dataframe gives me the following error:

AttributeError: 'DataFrame' object has no attribute 'shape'

How should I get the shape instead?

MRocklin · Accepted Answer · 2018-05-15 17:12:47Z

37

You can get the number of columns directly

len(df.columns) # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df) # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

answered May 15, 2018 at 17:12

MRocklin


4 Comments

Rebin Over a year ago

len(df) loads all of the records, and in my case finding len(df) for a table of 144M rows took more than a few minutes (wind10,ram16,intel7). Is there any other way?

MRocklin Over a year ago

It probably has to load all of the data to find out the length. No, there is no other way. You could consider using something like a database, which tracks this sort of information in metadata.

user108569 Over a year ago

I've been doing df.index.size.compute(), which is faster than running len(df), but my data is stored in columnar Parquet, so it depends on what your underlying data architecture is.

d-gg Jan 4 at 2:41

Just want to second what @user108569 said about not using len(dataframe) with columnar Parquet. I timed the two methods he suggested on my Parquet data (a tiny subset of the giant C4 dataset) and got this output:

Time to get len(data): 1.17129 # 10,000 samples
Time to get data.index.size.compute(): 0.00381 # 10,000 samples

So the difference is ~300x.

tinashe matambo · Accepted Answer · 2024-04-08 07:05:46Z

With shape you can do the following

a = df.shape
a[0].compute(), a[1]

This will show the shape just as it is shown with pandas

ti7 · Accepted Answer · 2020-08-25 14:58:08Z

I know this is quite an old question, but I had the same issue and found an out-of-the-box solution that I want to record here.

Given your data, I suspect it was originally saved in a CSV-like file. In my case, I simply count the lines of that file (minus one for the header line). Inspired by this answer here, this is the solution I'm using:

import dask.dataframe as dd
from itertools import takewhile, repeat

def rawincount(filename):
    # Count newline bytes by reading the file in 1 MiB chunks
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

filename = 'myHugeDataframe.csv'
df = dd.read_csv(filename)
# Rows = line count minus the header line
df_shape = (rawincount(filename) - 1, len(df.columns))
print(f"Shape: {df_shape}")

Hope this could help someone else as well.
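The line-counting idea can be tried without Dask at all. The sketch below is a standard-library variant under a few assumptions: the helper name count_newlines and the toy CSV are hypothetical, the file ends with a trailing newline, and no quoted field contains an embedded newline (otherwise the count is off).

```python
import os
import tempfile
from itertools import repeat, takewhile

def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a file, reading it in fixed-size chunks."""
    with open(path, "rb") as f:
        # Keep reading chunks until read() returns an empty bytes object
        chunks = takewhile(bool, (f.read(chunk_size) for _ in repeat(None)))
        return sum(chunk.count(b"\n") for chunk in chunks)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    # newline="" keeps "\n" from being translated on any platform
    with open(csv_path, "w", newline="") as f:
        f.write("a,b\n1,2\n3,4\n5,6\n")
    n_rows = count_newlines(csv_path) - 1  # subtract the header line

print(n_rows)
```

Because this only counts bytes, it never parses the CSV, which is why it can be much faster than reading the data.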

This approach is very fast and takes advantage of distributed processing in Dask.
Thank you! This is faster than the other possible solution of loading a single column and obtaining its length.

Omid Erfanmanesh · Accepted Answer · 2020-09-02 21:37:10Z

3

print('(',len(df),',',len(df.columns),')')

answered Sep 2, 2020 at 21:37

Omid Erfanmanesh


Jyothish Arumugam · Accepted Answer · 2018-11-17 10:36:48Z

To get the shape we can try this way:

 dask_dataframe.describe().compute()

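The "count" row of the result gives the number of non-null values in each numeric column, which equals the number of rows only when the column has no missing values.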

 len(dask_dataframe.columns)

this will give the number of columns in the dataframe

L Tyrone · Accepted Answer · 2024-05-07 02:50:00Z

0

For a dask dataframe named df:

df.compute().shape

returns a tuple:
(number of rows, number of columns)

edited May 7, 2024 at 2:50

L Tyrone


answered May 5, 2024 at 5:14

user24874457


sameer_nubia · Accepted Answer · 2021-04-12 10:01:36Z

-2

You can get the number of columns with the code below.

import dask.dataframe as dd

dd1 = dd.read_csv("filename.txt")
dd1.info()  # call the method; it prints the summary itself

# Output
<class 'dask.dataframe.core.DataFrame'>
Columns: 6 entries, CountryName to Value
dtypes: object(4), float64(1), int64(1)

answered Apr 12, 2021 at 10:01

sameer_nubia


2 Comments

Pan Over a year ago

In pandas, shape outputs both the number of rows and the number of columns. I don't think showing only the number of columns answers OP's question.

sameer_nubia Over a year ago

"Columns: 6 entries": what is this in the output, then? And I am using Dask, FYI.
