14

Is there an equivalent method to pandas info() method in PySpark?

I am trying to gain basic statistics about a dataframe in PySpark, such as:

- Number of columns and rows
- Number of nulls
- Size of dataframe

The info() method in pandas provides all these statistics.

5 Answers

8

There is also the summary method to get the row count and some other descriptive statistics. It is similar to the describe method already mentioned.

From the PySpark manual:

df.summary().show()

+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    25%|                 2| null|
|    50%|                 2| null|
|    75%|                 5| null|
|    max|                 5|  Bob|
+-------+------------------+-----+

or

df.select("age", "name").summary("count").show()

+-------+---+----+
|summary|age|name|
+-------+---+----+
|  count|  2|   2|
+-------+---+----+
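For the specific numbers the question asks about (row and column counts), two direct calls are enough. A minimal sketch, assuming an existing DataFrame df:

num_rows = df.count()        # runs a Spark job over the data
num_cols = len(df.columns)   # metadata only, no job is triggered
print(f"{num_rows} rows, {num_cols} columns")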

1 Comment

This is the pandas describe() equivalent, not the info() equivalent. For info() you just need to do df.printSchema()
5

To figure out type information about a data frame, you could try df.schema:

spark.read.csv('matchCount.csv', header=True).schema

StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))

For summary stats you could also have a look at the describe method in the documentation.
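As the comments below point out, printSchema() prints the same information as a more readable tree. A sketch of what that would look like for the schema above:

spark.read.csv('matchCount.csv', header=True).printSchema()
# root
#  |-- categ: string (nullable = true)
#  |-- minv: string (nullable = true)
#  |-- maxv: string (nullable = true)
#  |-- counts: string (nullable = true)
#  |-- cutoff: string (nullable = true)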

2 Comments

printSchema() will give you an easier to read version of the same info.
Please change the answer to use printSchema() :)
4

I could not find a good answer, so I use this slightly cheating approach:

dataFrame.toPandas().info() 
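If the whole DataFrame may not fit on the driver (see the memory caveat in the comments below), a hedged variant is to collect only a bounded sample first; the row count and memory figures then describe the sample, not the full dataset:

# Sketch only: the 10_000-row cap is an arbitrary illustration, not part of the original answer.
dataFrame.limit(10_000).toPandas().info()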

4 Comments

Have you tried this for a large data set? I'd expect that dataFrame.toPandas() will only work when the entire Spark dataframe can fit into memory on a single machine.
Yes, you are probably right. Keep an eye on Koalas for the distributed implementation of the pandas API - koalas.readthedocs.io/en/latest. If anyone has a better way of providing as much in-depth info as .info() I would like to know, as it's one of the few pandas methods I miss.
Many thanks for the tip re: koalas. I certainly will be watching that space.
With this answer you're just seeing the default pandas mapped types instead of the native spark types.
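Following up on the Koalas comment above: Koalas was later merged into Spark itself as the pandas-on-Spark API (pyspark.pandas, Spark 3.2+), which exposes an info() method directly. A sketch, assuming a Spark 3.2+ session:

# Sketch, assuming Spark >= 3.2 where the pandas-on-Spark API ships with PySpark.
psdf = dataFrame.pandas_api()  # returns a pyspark.pandas.DataFrame
psdf.info()                    # pandas-style info(), computed without collecting all rows to the driver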
3

Check this answer to get a count of the null and non-null values.

from pyspark.sql.functions import isnan, when, count, col
import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# |      1|         1|null|
# |      1|         2| 5.0|
# |      1|         3| NaN|
# |      1|         4|null|
# |      1|         5|10.0|
# |      1|         6| NaN|
# |      1|         6| NaN|
# +-------+----------+----+

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  3|
# +-------+----------+---+

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  5|
# +-------+----------+---+

df.describe().show()
# +-------+-------+------------------+---+
# |summary|session|        timestamp1|id2|
# +-------+-------+------------------+---+
# |  count|      7|                 7|  5|
# |   mean|    1.0| 3.857142857142857|NaN|
# | stddev|    0.0|1.9518001458970662|NaN|
# |    min|      1|                 1|5.0|
# |    max|      1|                 6|NaN|
# +-------+-------+------------------+---+

There is no equivalent to pandas.DataFrame.info() that I know of. printSchema() is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() I often look at the null values.

1 Comment

isnan does not support dtypes other than numeric types. The following is more general: df.select([F.sum(F.col(c).isNull().cast(T.IntegerType())).alias(c) for c in df.columns]).show()
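The snippet in the comment above assumes the conventional F and T import aliases; a self-contained version of the same idea:

from pyspark.sql import functions as F, types as T

# Count nulls per column; unlike isnan(), isNull() also works for strings, dates, etc.
df.select([
    F.sum(F.col(c).isNull().cast(T.IntegerType())).alias(c)
    for c in df.columns
]).show()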
3

I wrote a PySpark function that emulates pandas.DataFrame.info():

from collections import Counter


def spark_info(df, abbreviate_columns=True, include_nested_types=False, count=None):
    """Similar to Pandas.DataFrame.info which produces output like:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 201100 entries, 0 to 201099
    Columns: 151 entries, first_col to last_col
    dtypes: float64(20), int64(6), object(50)
    memory usage: 231.7+ MB
    """
    classinfo = "<class 'pyspark.sql.dataframe.DataFrame'>"
    _cnt = count if count else df.count()
    numrows = f"Total Rows: {str(_cnt)}"
    _cols = (
        ', to '.join([df.columns[0], df.columns[-1]])
        if abbreviate_columns
        else ', '.join(df.columns))
    columns = f"{len(df.columns)} entries: {_cols}"
    _typs = [
        col.dataType for col in df.schema
        if include_nested_types or (
            'ArrayType' not in str(col.dataType) and
            'StructType' not in str(col.dataType) and
            'MapType' not in str(col.dataType))
    ]
    dtypes = ', '.join(
        f"{str(typ)}({cnt})" for typ, cnt in Counter(_typs).items())
    mem = 'memory usage: ? bytes'
    return '\n'.join([classinfo, numrows, columns, dtypes, mem])

I wasn't sure about estimating the size of a PySpark dataframe. This depends on the full Spark execution plan and configuration, but maybe try this answer for ideas.
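One commonly cited (but unofficial) way to get a size estimate is to read the optimizer's statistics through the JVM query plan. This goes through private attributes (df._jdf), so treat it as a version-dependent sketch rather than a stable API:

# Sketch: reads Spark's internal plan statistics via py4j (private API; known to work on recent Spark 3.x).
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_in_bytes)  # a Scala BigInt proxy; an estimate, not an exact on-disk size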

Note that not all dtype summaries are included; by default, nested types are excluded. Also, df.count() is calculated, which can take a while, unless you calculate it first and pass it in via count=.

Suggested usage:

>>> df = spark.createDataFrame(((1, 'a', 2), (2, 'b', 3)), ['id', 'letter', 'num'])
>>> print(spark_info(df, count=2))
<class 'pyspark.sql.dataframe.DataFrame'>
Total Rows: 2
3 entries: id, to num
LongType(2), StringType(1)
memory usage: ? bytes

Comments
