14

Is there an equivalent method to pandas info() method in PySpark?

I am trying to gain basic statistics about a dataframe in PySpark, such as:

- Number of columns and rows
- Number of nulls
- Size of dataframe

The info() method in pandas provides all these statistics.

5 Answers

8

There is also the summary method to get the row count and some other descriptive statistics. It is similar to the describe method already mentioned.

From the PySpark manual:

df.summary().show()

+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    25%|                 2| null|
|    50%|                 2| null|
|    75%|                 5| null|
|    max|                 5|  Bob|
+-------+------------------+-----+

or

df.select("age", "name").summary("count").show()

+-------+---+----+
|summary|age|name|
+-------+---+----+
|  count|  2|   2|
+-------+---+----+
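For the specific numbers the question asks about (row and column counts), two direct calls are enough. A minimal sketch, assuming an existing DataFrame df:

num_rows = df.count()        # runs a Spark job over the data
num_cols = len(df.columns)   # metadata only, no job is triggered
print(f"{num_rows} rows, {num_cols} columns")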

1 Comment

This is the pandas describe() equivalent, not the info() equivalent. For info() you just need to do df.printSchema()
5

To figure out type information about a data frame, you could try df.schema:

spark.read.csv('matchCount.csv', header=True).schema

StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))

For summary stats you could also have a look at the describe method in the documentation.
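As the comments below point out, printSchema() prints the same information as a more readable tree. A sketch of what that would look like for the schema above:

spark.read.csv('matchCount.csv', header=True).printSchema()
# root
#  |-- categ: string (nullable = true)
#  |-- minv: string (nullable = true)
#  |-- maxv: string (nullable = true)
#  |-- counts: string (nullable = true)
#  |-- cutoff: string (nullable = true)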

2 Comments

printSchema() will give you an easier to read version of the same info.
Please change the answer to use printSchema() :)
4

I could not find a good answer, so I use this slightly cheating approach:

dataFrame.toPandas().info() 
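If the whole DataFrame may not fit on the driver (see the memory caveat in the comments below), a hedged variant is to collect only a bounded sample first; the row count and memory figures then describe the sample, not the full dataset:

# Sketch only: the 10_000-row cap is an arbitrary illustration, not part of the original answer.
dataFrame.limit(10_000).toPandas().info()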

4 Comments

Have you tried this for a large data set? I'd expect that dataFrame.toPandas() will only work when the entire Spark dataframe can fit into memory on a single machine.
Yes, you are probably right. Keep an eye on Koalas for the distributed implementation of the pandas API - koalas.readthedocs.io/en/latest. If anyone has a better way of providing as much in-depth info as .info() I would like to know, as it's one of the few pandas methods I miss.
Many thanks for the tip re: koalas. I certainly will be watching that space.
With this answer you're just seeing the default pandas mapped types instead of the native spark types.
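Following up on the Koalas comment above: Koalas was later merged into Spark itself as the pandas-on-Spark API (pyspark.pandas, Spark 3.2+), which exposes an info() method directly. A sketch, assuming a Spark 3.2+ session:

# Sketch, assuming Spark >= 3.2 where the pandas-on-Spark API ships with PySpark.
psdf = dataFrame.pandas_api()  # returns a pyspark.pandas.DataFrame
psdf.info()                    # pandas-style info(), computed without collecting all rows to the driver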
3

Check this answer to get a count of the null and non-null values.

from pyspark.sql.functions import isnan, when, count, col
import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# |      1|         1|null|
# |      1|         2| 5.0|
# |      1|         3| NaN|
# |      1|         4|null|
# |      1|         5|10.0|
# |      1|         6| NaN|
# |      1|         6| NaN|
# +-------+----------+----+

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  3|
# +-------+----------+---+

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  5|
# +-------+----------+---+

df.describe().show()
# +-------+-------+------------------+---+
# |summary|session|        timestamp1|id2|
# +-------+-------+------------------+---+
# |  count|      7|                 7|  5|
# |   mean|    1.0| 3.857142857142857|NaN|
# | stddev|    0.0|1.9518001458970662|NaN|
# |    min|      1|                 1|5.0|
# |    max|      1|                 6|NaN|
# +-------+-------+------------------+---+

There is no equivalent to pandas.DataFrame.info() that I know of. printSchema() is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() I often look at the null values.

1 Comment

isnan does not support dtypes other than numeric types. The following is more general: df.select([F.sum(F.col(c).isNull().cast(T.IntegerType())).alias(c) for c in df.columns]).show()
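The snippet in the comment above assumes the conventional F and T import aliases; a self-contained version of the same idea:

from pyspark.sql import functions as F, types as T

# Count nulls per column; unlike isnan(), isNull() also works for strings, dates, etc.
df.select([
    F.sum(F.col(c).isNull().cast(T.IntegerType())).alias(c)
    for c in df.columns
]).show()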
3

I wrote a PySpark function that emulates pandas.DataFrame.info():

from collections import Counter


def spark_info(df, abbreviate_columns=True, include_nested_types=False, count=None):
    """Similar to Pandas.DataFrame.info which produces output like:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 201100 entries, 0 to 201099
    Columns: 151 entries, first_col to last_col
    dtypes: float64(20), int64(6), object(50)
    memory usage: 231.7+ MB
    """
    classinfo = "<class 'pyspark.sql.dataframe.DataFrame'>"
    _cnt = count if count else df.count()
    numrows = f"Total Rows: {str(_cnt)}"
    _cols = (
        ', to '.join([df.columns[0], df.columns[-1]])
        if abbreviate_columns
        else ', '.join(df.columns))
    columns = f"{len(df.columns)} entries: {_cols}"
    _typs = [
        col.dataType for col in df.schema
        if include_nested_types or (
            'ArrayType' not in str(col.dataType) and
            'StructType' not in str(col.dataType) and
            'MapType' not in str(col.dataType))
    ]
    dtypes = ', '.join(
        f"{str(typ)}({cnt})" for typ, cnt in Counter(_typs).items())
    mem = 'memory usage: ? bytes'
    return '\n'.join([classinfo, numrows, columns, dtypes, mem])

I wasn't sure about estimating the size of a PySpark dataframe. This depends on the full Spark execution plan and configuration, but maybe try this answer for ideas.
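One commonly cited (but unofficial) way to get a size estimate is to read the optimizer's statistics through the JVM query plan. This goes through private attributes (df._jdf), so treat it as a version-dependent sketch rather than a stable API:

# Sketch: reads Spark's internal plan statistics via py4j (private API; known to work on recent Spark 3.x).
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_in_bytes)  # a Scala BigInt proxy; an estimate, not an exact on-disk size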

Note that not all dtype summaries are included; by default, nested types are excluded. Also, df.count() is calculated, which can take a while, unless you calculate it first and pass it in via count=.

Suggested usage:

>>> df = spark.createDataFrame(((1, 'a', 2), (2, 'b', 3)), ['id', 'letter', 'num'])
>>> print(spark_info(df, count=2))
<class 'pyspark.sql.dataframe.DataFrame'>
Total Rows: 2
3 entries: id, to num
LongType(2), StringType(1)
memory usage: ? bytes

Comments
