I wrote a PySpark function that emulates pandas.DataFrame.info():
```python
from collections import Counter


def spark_info(df, abbreviate_columns=True, include_nested_types=False, count=None):
    """Similar to pandas.DataFrame.info, which produces output like:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 201100 entries, 0 to 201099
    Columns: 151 entries, first_col to last_col
    dtypes: float64(20), int64(6), object(50)
    memory usage: 231.7+ MB
    """
    classinfo = "<class 'pyspark.sql.dataframe.DataFrame'>"
    _cnt = count if count else df.count()
    numrows = f"Total Rows: {_cnt}"
    _cols = (
        ', to '.join([df.columns[0], df.columns[-1]])
        if abbreviate_columns
        else ', '.join(df.columns))
    columns = f"{len(df.columns)} entries: {_cols}"
    _typs = [
        col.dataType for col in df.schema
        if include_nested_types or (
            'ArrayType' not in str(col.dataType)
            and 'StructType' not in str(col.dataType)
            and 'MapType' not in str(col.dataType))
    ]
    dtypes = ', '.join(
        f"{str(typ)}({cnt})" for typ, cnt in Counter(_typs).items())
    mem = 'memory usage: ? bytes'
    return '\n'.join([classinfo, numrows, columns, dtypes, mem])
```
I wasn't sure how to estimate the size of a PySpark DataFrame. It depends on the full Spark execution plan and configuration, but maybe try this answer for ideas.
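For reference, one idea that comes up for this (possibly what that answer describes; I'm not certain) is to read the size estimate out of Catalyst's optimized plan statistics via the underlying JVM objects. Treat this as a fragile sketch: it leans on Spark internals (`_jdf`, `queryExecution`), the call chain differs across Spark versions (e.g. `stats(conf)` on Spark 2.2), and the number is an estimate rather than the true in-memory footprint.

```python
def estimate_size_bytes(df):
    """Best-effort size estimate from the optimized logical plan's statistics.

    Relies on Spark internals via py4j, so verify the call chain against
    your Spark version before trusting the number.
    """
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    return int(stats.sizeInBytes().toString())
```

If something like that works in your environment, the placeholder `mem = 'memory usage: ? bytes'` line above could be replaced with the estimate.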
Note that not all dtype summaries are included: nested types (ArrayType, StructType, MapType) are excluded by default. Also, df.count() is computed, which can take a while, unless you compute it up front and pass it in via the count argument.
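For example, a quick sketch of both options together (nested_df here is just an illustrative DataFrame, not from the original post):

```python
# Illustrative only: a DataFrame with a nested ArrayType column.
nested_df = spark.createDataFrame([(1, [1, 2]), (2, [3])], ['id', 'vals'])
n = nested_df.count()  # pay the count cost once, up front
# Reuse the precomputed count and opt in to nested types so the
# ArrayType column appears in the dtype summary.
print(spark_info(nested_df, count=n, include_nested_types=True))
```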
Suggested usage:
```python
>>> df = spark.createDataFrame([(1, 'a', 2), (2, 'b', 3)], ['id', 'letter', 'num'])
>>> print(spark_info(df, count=2))
<class 'pyspark.sql.dataframe.DataFrame'>
Total Rows: 2
3 entries: id, to num
LongType(2), StringType(1)
memory usage: ? bytes
```