Calling .shape gives me the following error:
AttributeError: 'DataFrame' object has no attribute 'shape'
How should I get the shape instead?
You can get the number of columns directly:
    len(df.columns)  # this is fast
You can also call len on the dataframe itself, though beware that this will trigger a computation:
    len(df)  # this requires a full scan of the data
Dask.dataframe doesn't know how many records are in your data without first reading through all of it.
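To put the two calls together, here is a minimal sketch of a shape helper. The name `lazy_shape` is hypothetical and not part of Dask. It only assumes the object supports `len()` and has a `.columns` attribute, which is true of both Dask and pandas dataframes.

```python
def lazy_shape(df):
    """Return (n_rows, n_cols) for a dataframe-like object.

    Hypothetical helper, not a Dask API. On a Dask dataframe the column
    count comes from metadata, while len(df) triggers a computation.
    """
    n_cols = len(df.columns)  # cheap: read from metadata
    n_rows = len(df)          # expensive on Dask: scans the data
    return n_rows, n_cols
```

With a Dask dataframe `df` you would call `lazy_shape(df)` once and reuse the result, since every call repeats the scan.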
You can use
    df.index.size.compute()
which is faster than running len(df) ... but my data is stored in columnar parquet, so it depends on what your underlying data architecture is.
Regarding len(dataframe) on columnar parquet: I timed the two methods suggested above for loading my parquet data (a tiny subset of the giant C4 dataset). I get this output:
    Time to get len(data): 1.17129  # 10,000 samples
    Time to get data.index.size.compute(): 0.00381  # 10,000 samples
So the difference is ~300x.
Well, I know this is quite an old question, but I had the same issue and found an out-of-the-box solution that I just want to record here.
Looking at your data, I'm guessing it was originally saved in a CSV-like file. In my situation, I just count the lines of that file (minus one for the header line). Inspired by this answer here, this is the solution I'm using:
    import dask.dataframe as dd
    from itertools import takewhile, repeat

    def rawincount(filename):
        # Count newline bytes by reading the file in 1 MiB unbuffered chunks.
        with open(filename, 'rb') as f:
            bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
            return sum(buf.count(b'\n') for buf in bufgen)

    filename = 'myHugeDataframe.csv'
    df = dd.read_csv(filename)
    # Rows = lines minus the header line; columns come from metadata.
    df_shape = (rawincount(filename) - 1, len(df.columns))
    print(f"Shape: {df_shape}")
Hope this helps someone else as well.
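One caveat: the raw newline count assumes every record sits on exactly one line and that the file ends with a newline. If quoted fields may contain line breaks, a slower but safer variant is to let the standard csv module parse the records. This is only a sketch under those assumptions. The name `count_csv_rows` and its `has_header` and `encoding` parameters are made up for illustration.

```python
import csv

def count_csv_rows(filename, has_header=True, encoding="utf-8"):
    """Count data records in a CSV file, honoring quoted multi-line fields.

    Hypothetical helper. Blank lines are skipped, and one record is
    subtracted for the header when has_header is True.
    """
    # newline="" lets the csv module handle line endings inside quotes.
    with open(filename, newline="", encoding=encoding) as f:
        n_records = sum(1 for row in csv.reader(f) if row)
    if has_header and n_records > 0:
        return n_records - 1
    return n_records
```

The row count from `count_csv_rows(filename)` can then be paired with `len(df.columns)` in the same way as above. It parses every field, so expect it to be noticeably slower than counting bytes on a huge file.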
You can get the number of columns with the code below.
    import dask.dataframe as dd
    dd1 = dd.read_csv("filename.txt")
    dd1.info()  # info is a method, so call it; it prints the summary itself
    # Output:
    # <class 'dask.dataframe.core.DataFrame'>
    # Columns: 6 entries, CountryName to Value
    # dtypes: object(4), float64(1), int64(1)