Functions

General Functions

pyjanitor's general-purpose data cleaning functions.

add_columns

add_column(df, column_name, value, fill_remaining=False)

Add a column to the dataframe.

Intended to be the method-chaining alternative to:

df[column_name] = value 

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

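Since this method is slated for deprecation, the same result is available in plain pandas; a minimal sketch of the pd.DataFrame.assign equivalent (no janitor import needed):

import pandas as pd

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})

# Equivalent of df.add_column(column_name="c", value=1), without mutating df.
out = df.assign(c=1)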
Examples:

Add a column of constant values to the dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=1)
   a  b  c
0  0  a  1
1  1  b  1
2  2  c  1

Add a column of different values to the dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=list("efg"))
   a  b  c
0  0  a  e
1  1  b  f
2  2  c  g

Add a column using an iterator.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=range(4, 7))
   a  b  c
0  0  a  4
1  1  b  5
2  2  c  6

Parameters:

df (DataFrame): A pandas DataFrame. Required.
column_name (str): Name of the new column. Should be a string, in order for the column name to be compatible with the Feather binary format (this is a useful thing to have). Required.
value (Union[List[Any], Tuple[Any], Any]): Either a single value, or a list/tuple of values. Required.
fill_remaining (bool): If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. Default: False.

Raises:

ValueError: If attempting to add a column that already exists.
ValueError: If value has more elements than the number of rows in the DataFrame.
ValueError: If attempting to add an iterable of values with a length not equal to the number of DataFrame rows.
ValueError: If value has a length of 0.

Returns:

DataFrame: A pandas DataFrame with an added column.

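A minimal sketch of the fill_remaining recycling described above; the shorter list is repeated R-style until it covers every row (output not shown):

import pandas as pd
import janitor  # noqa: F401  # registers the add_column method

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})

# value has 2 elements for 3 rows; with fill_remaining=True it recycles to [1, 2, 1].
out = df.add_column(column_name="c", value=[1, 2], fill_remaining=True)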
Source code in janitor/functions/add_columns.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.assign` instead." ) ) @deprecated_alias(col_name="column_name") def add_column( df: pd.DataFrame, column_name: str, value: Union[List[Any], Tuple[Any], Any], fill_remaining: bool = False, ) -> pd.DataFrame:  """Add a column to the dataframe.  Intended to be the method-chaining alternative to:  ```python  df[column_name] = value  ```  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.assign` instead.  Examples:  Add a column of constant values to the dataframe.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})  >>> df.add_column(column_name="c", value=1)  a b c  0 0 a 1  1 1 b 1  2 2 c 1  Add a column of different values to the dataframe.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})  >>> df.add_column(column_name="c", value=list("efg"))  a b c  0 0 a e  1 1 b f  2 2 c g  Add a column using an iterator.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})  >>> df.add_column(column_name="c", value=range(4, 7))  a b c  0 0 a 4  1 1 b 5  2 2 c 6  Args:  df: A pandas DataFrame.  column_name: Name of the new column. Should be a string, in order  for the column name to be compatible with the Feather binary  format (this is a useful thing to have).  value: Either a single value, or a list/tuple of values.  fill_remaining: If value is a tuple or list that is smaller than  the number of rows in the DataFrame, repeat the list or tuple  (R-style) to the end of the DataFrame.  Raises:  ValueError: If attempting to add a column that already exists.  ValueError: If `value` has more elements that number of  rows in the DataFrame.  ValueError: If attempting to add an iterable of values with  a length not equal to the number of DataFrame rows.  ValueError: If `value` has length of `0`.  Returns:  A pandas DataFrame with an added column.  """ check("column_name", column_name, [str]) if column_name in df.columns: raise ValueError(f"Attempted to add column that already exists: {column_name}.") nrows = len(df) if hasattr(value, "__len__") and not isinstance(value, (str, bytes, bytearray)): len_value = len(value) # if `value` is a list, ndarray, etc. if len_value > nrows: raise ValueError( "`value` has more elements than number of rows " f"in your `DataFrame`. vals: {len_value}, " f"df: {nrows}" ) if len_value != nrows and not fill_remaining: raise ValueError( "Attempted to add iterable of values with length" " not equal to number of DataFrame rows" ) if not len_value: raise ValueError("`value` has to be an iterable of minimum length 1") elif fill_remaining: # relevant if a scalar val was passed, yet fill_remaining == True len_value = 1 value = [value] df = df.copy() if fill_remaining: times_to_loop = int(np.ceil(nrows / len_value)) fill_values = list(value) * times_to_loop df[column_name] = fill_values[:nrows] else: df[column_name] = value return df 

add_columns(df, fill_remaining=False, **kwargs)

Add multiple columns to the dataframe.

This method does not mutate the original DataFrame.

Method to augment add_column with the ability to add multiple columns in one go. This replaces the need for multiple add_column calls.

Usage is through supplying kwargs where the key is the column name and the values correspond to the values of the new DataFrame column.

Values passed can be scalar or iterable (list, ndarray, etc.).

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

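As with add_column, the deprecation note points to plain pandas; a hedged sketch of the pd.DataFrame.assign equivalent of the example below:

import pandas as pd

df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})

# Equivalent of df.add_columns(x=4, y=list("def")): assign also accepts
# multiple column/value pairs as keyword arguments.
out = df.assign(x=4, y=list("def"))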
Examples:

Inserting two more columns into a dataframe.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_columns(x=4, y=list("def"))
   a  b  x  y
0  0  a  4  d
1  1  b  4  e
2  2  c  4  f

Parameters:

df (DataFrame): A pandas DataFrame. Required.
fill_remaining (bool): If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. (Passed to add_column.) Default: False.
**kwargs (Any): Column, value pairs which are looped through in add_column calls. Default: {}.

Returns:

DataFrame: A pandas DataFrame with added columns.

Source code in janitor/functions/add_columns.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.assign` instead." ) ) def add_columns( df: pd.DataFrame, fill_remaining: bool = False, **kwargs: Any, ) -> pd.DataFrame:  """Add multiple columns to the dataframe.  This method does not mutate the original DataFrame.  Method to augment  [`add_column`][janitor.functions.add_columns.add_column]  with ability to add multiple columns in  one go. This replaces the need for multiple  [`add_column`][janitor.functions.add_columns.add_column] calls.  Usage is through supplying kwargs where the key is the col name and the  values correspond to the values of the new DataFrame column.  Values passed can be scalar or iterable (list, ndarray, etc.)  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.assign` instead.  Examples:  Inserting two more columns into a dataframe.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})  >>> df.add_columns(x=4, y=list("def"))  a b x y  0 0 a 4 d  1 1 b 4 e  2 2 c 4 f  Args:  df: A pandas DataFrame.  fill_remaining: If value is a tuple or list that is smaller than  the number of rows in the DataFrame, repeat the list or tuple  (R-style) to the end of the DataFrame. (Passed to  [`add_column`][janitor.functions.add_columns.add_column])  **kwargs: Column, value pairs which are looped through in  [`add_column`][janitor.functions.add_columns.add_column] calls.  Returns:  A pandas DataFrame with added columns.  """ # Note: error checking can pretty much be handled in `add_column` for col_name, values in kwargs.items(): df = df.add_column(col_name, values, fill_remaining=fill_remaining) return df 

alias

Implementation of the alias function.

alias(series, alias=None)

Return a Series with a new name. Accepts either a scalar or a callable.

Examples:

>>> import pandas as pd
>>> import janitor
>>> s = pd.Series([1, 2, 3], name="series")
>>> s
0    1
1    2
2    3
Name: series, dtype: int64
>>> s.alias("series_new")
0    1
1    2
2    3
Name: series_new, dtype: int64
>>> s.alias(str.upper)
0    1
1    2
2    3
Name: SERIES, dtype: int64

Parameters:

series (Series): A pandas Series. Required.
alias (Any): Scalar or callable to create a new name for the pandas Series. Default: None.

Returns:

Series: A new pandas Series.

Source code in janitor/functions/alias.py
@pf.register_series_method
def alias(series: pd.Series, alias: Any = None) -> pd.Series:
    """Return a Series with a new name. Accepts either a scalar or a callable.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> s = pd.Series([1, 2, 3], name="series")
        >>> s
        0    1
        1    2
        2    3
        Name: series, dtype: int64
        >>> s.alias("series_new")
        0    1
        1    2
        2    3
        Name: series_new, dtype: int64
        >>> s.alias(str.upper)
        0    1
        1    2
        2    3
        Name: SERIES, dtype: int64

    Args:
        series: A pandas Series.
        alias: scalar or callable to create a new name for the pandas Series.

    Returns:
        A new pandas Series.
    """
    series = series[:]
    if alias is None:
        return series
    if callable(alias):
        alias = alias(series.name)
    series.name = alias
    return series

also

Implementation source for chainable function also.

also(df, func, *args, **kwargs)

Run a function with side effects.

This function allows you to run an arbitrary function in the pyjanitor method chain. Doing so will let you do things like save the dataframe to disk midway while continuing to modify the dataframe afterwards.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = (
...     pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
...     .query("a > 1")
...     .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
...     .rename_column(old_column_name="a", new_column_name="a_new")
...     .also(lambda df: df.to_csv("midpoint.csv"))
...     .also(lambda df: print(f"Columns: {df.columns}"))
... )
DataFrame shape is: (2, 2)
Columns: Index(['a_new', 'b'], dtype='object')

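Because extra positional and keyword arguments are forwarded to func, a named helper can be used instead of a lambda; a minimal sketch (snapshot and the file name are illustrative, not part of the API):

import pandas as pd
import janitor  # noqa: F401  # registers the .also method


def snapshot(df, path, sep=","):
    # Side effect only: write out the current state; any return value is
    # ignored by `also`, and `also` passes a copy so df is not mutated.
    df.to_csv(path, sep=sep)


df = pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
out = df.also(snapshot, "snapshot.csv", sep=";")  # args/kwargs forwarded to snapshot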
Parameters:

df (DataFrame): A pandas DataFrame. Required.
func (Callable): A function you would like to run in the method chain. It should take one DataFrame object as a parameter and have no return. If there is a return, it will be ignored. Required.
*args (Any): Optional arguments for func. Default: ().
**kwargs (Any): Optional keyword arguments for func. Default: {}.

Returns:

DataFrame: The input pandas DataFrame, unmodified.

Source code in janitor/functions/also.py
@pf.register_dataframe_method
def also(df: pd.DataFrame, func: Callable, *args: Any, **kwargs: Any) -> pd.DataFrame:
    """Run a function with side effects.

    This function allows you to run an arbitrary function
    in the `pyjanitor` method chain.
    Doing so will let you do things like save the dataframe to disk midway
    while continuing to modify the dataframe afterwards.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = (
        ...     pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
        ...     .query("a > 1")
        ...     .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
        ...     .rename_column(old_column_name="a", new_column_name="a_new")
        ...     .also(lambda df: df.to_csv("midpoint.csv"))
        ...     .also(lambda df: print(f"Columns: {df.columns}"))
        ... )
        DataFrame shape is: (2, 2)
        Columns: Index(['a_new', 'b'], dtype='object')

    Args:
        df: A pandas DataFrame.
        func: A function you would like to run in the method chain.
            It should take one DataFrame object as a parameter and have no return.
            If there is a return, it will be ignored.
        *args: Optional arguments for `func`.
        **kwargs: Optional keyword arguments for `func`.

    Returns:
        The input pandas DataFrame, unmodified.
    """  # noqa: E501
    func(df.copy(), *args, **kwargs)
    return df

bin_numeric

Implementation source for bin_numeric.

bin_numeric(df, from_column_name, to_column_name, bins=5, **kwargs)

Generate a new column that labels bins for a specified numeric column.

This method does not mutate the original DataFrame.

A wrapper around the pandas cut() function to bin data of one column, generating a new column with the results.

Examples:

Binning a numeric column with specific bin edges.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
>>> df.bin_numeric(
...     from_column_name="a",
...     to_column_name="a_binned",
...     bins=[0, 5, 11, 15],
... )
    a  a_binned
0   3    (0, 5]
1   6   (5, 11]
2   9   (5, 11]
3  12  (11, 15]
4  15  (11, 15]

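Since extra keyword arguments are passed straight through to pd.cut, bin labels can be supplied as well; a hedged sketch (the label names are arbitrary, output not shown):

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})

# `labels` is forwarded to pd.cut, so each bin gets a readable name instead of an interval.
out = df.bin_numeric(
    from_column_name="a",
    to_column_name="a_binned",
    bins=[0, 5, 11, 15],
    labels=["low", "mid", "high"],
)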
Parameters:

df (DataFrame): A pandas DataFrame. Required.
from_column_name (str): The column whose data you want binned. Required.
to_column_name (str): The new column to be created with the binned data. Required.
bins (Optional[Union[int, ScalarSequence, IntervalIndex]]): The binning strategy to be utilized. Read the pd.cut documentation for more details. Default: 5.
**kwargs (Any): Additional kwargs to pass to pd.cut, except retbins. Default: {}.

Raises:

ValueError: If retbins is passed in as a kwarg.

Returns:

DataFrame: A pandas DataFrame.

Source code in janitor/functions/bin_numeric.py
@pf.register_dataframe_method @deprecated_alias( from_column="from_column_name", to_column="to_column_name", num_bins="bins", ) def bin_numeric( df: pd.DataFrame, from_column_name: str, to_column_name: str, bins: Optional[Union[int, ScalarSequence, pd.IntervalIndex]] = 5, **kwargs: Any, ) -> pd.DataFrame:  """Generate a new column that labels bins for a specified numeric column.  This method does not mutate the original DataFrame.  A wrapper around the pandas [`cut()`][pd_cut_docs] function to bin data of  one column, generating a new column with the results.  [pd_cut_docs]: https://pandas.pydata.org/docs/reference/api/pandas.cut.html  Examples:  Binning a numeric column with specific bin edges.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})  >>> df.bin_numeric(  ... from_column_name="a",  ... to_column_name="a_binned",  ... bins=[0, 5, 11, 15],  ... )  a a_binned  0 3 (0, 5]  1 6 (5, 11]  2 9 (5, 11]  3 12 (11, 15]  4 15 (11, 15]  Args:  df: A pandas DataFrame.  from_column_name: The column whose data you want binned.  to_column_name: The new column to be created with the binned data.  bins: The binning strategy to be utilized. Read the `pd.cut`  documentation for more details.  **kwargs: Additional kwargs to pass to `pd.cut`, except `retbins`.  Raises:  ValueError: If `retbins` is passed in as a kwarg.  Returns:  A pandas DataFrame.  """ if "retbins" in kwargs: raise ValueError("`retbins` is not an acceptable keyword argument.") check("from_column_name", from_column_name, [str]) check("to_column_name", to_column_name, [str]) check_column(df, from_column_name) df = df.assign( **{ to_column_name: pd.cut(df[from_column_name], bins=bins, **kwargs), } ) return df 

case_when

Implementation source for case_when.

case_when(df, *args, default=None, column_name)

Create a column based on a condition or multiple conditions.

Similar to SQL's CASE WHEN and dplyr's case_when, with inspiration from pydatatable's if_else function.

If your scenario requires direct replacement of values, pandas' replace method or map method should be better suited and more efficient; if the conditions check if a value is within a range of values, pandas' cut or qcut should be more efficient; np.where/np.select are also performant options.

This function relies on pd.Series.mask method.

When multiple conditions are satisfied, the first one is used.

The variadic *args parameter takes arguments of the form condition0, value0, condition1, value1, ..., default. If condition0 evaluates to True, then value0 is assigned to column_name; if condition1 evaluates to True, then value1 is assigned to column_name, and so on. If none of the conditions evaluate to True, default is assigned to column_name.

This function can be likened to SQL's case_when:

CASE WHEN condition0 THEN value0
     WHEN condition1 THEN value1
     --- more conditions
     ELSE default
END AS column_name

compared to python's if-elif-else:

if condition0:
    value0
elif condition1:
    value1
# more elifs
else:
    default

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "a": [0, 0, 1, 2, "hi"],
...         "b": [0, 3, 4, 5, "bye"],
...         "c": [6, 7, 8, 9, "wait"],
...     }
... )
>>> df
    a    b     c
0   0    0     6
1   0    3     7
2   1    4     8
3   2    5     9
4  hi  bye  wait
>>> df.case_when(
...     ((df.a == 0) & (df.b != 0)) | (df.c == "wait"),
...     df.a,
...     (df.b == 0) & (df.a == 0),
...     "x",
...     default=df.c,
...     column_name="value",
... )
    a    b     c value
0   0    0     6     x
1   0    3     7     0
2   1    4     8     8
3   2    5     9     9
4  hi  bye  wait    hi

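Conditions can also be given as strings (evaluated with df.eval) or callables, as described under *args below; a minimal sketch on a small numeric frame (output not shown):

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

out = df.case_when(
    "a < 2",               # string condition, computed via df.eval
    "small",
    lambda d: d.b >= 30,   # callable condition, evaluates to a boolean array
    "large",
    default="medium",
    column_name="label",
)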
Version Changed

  • 0.24.0
    • Added default parameter.

Parameters:

df (DataFrame): A pandas DataFrame. Required.
*args (Any): Variable argument of conditions and expected values. Takes the form condition0, value0, condition1, value1, ... . condition can be a 1-D boolean array, a callable, or a string. If condition is a callable, it should evaluate to a 1-D boolean array; the array should have the same length as the DataFrame. If it is a string, it is computed on the dataframe, via df.eval, and should return a 1-D boolean array. value can be a scalar, a 1-D array, or a callable. If value is a callable, it should evaluate to a 1-D array. A 1-D array should have the same length as the DataFrame. Default: ().
default (Any): The element inserted in the output when all conditions evaluate to False. Can be a scalar, a 1-D array, or a callable. If callable, it should evaluate to a 1-D array. The 1-D array should be the same length as the DataFrame. Default: None.
column_name (str): Name of the column to assign results to. A new column is created if it does not already exist in the DataFrame. Required.

Raises:

ValueError: If a condition/value pair fails to evaluate.

Returns:

DataFrame: A pandas DataFrame.

Source code in janitor/functions/case_when.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.Series.case_when` instead." ) ) def case_when( df: pd.DataFrame, *args: Any, default: Any = None, column_name: str ) -> pd.DataFrame:  """Create a column based on a condition or multiple conditions.  Similar to SQL and dplyr's case_when  with inspiration from `pydatatable` if_else function.  If your scenario requires direct replacement of values,  pandas' `replace` method or `map` method should be better  suited and more efficient; if the conditions check  if a value is within a range of values, pandas' `cut` or `qcut`  should be more efficient; `np.where/np.select` are also  performant options.  This function relies on `pd.Series.mask` method.  When multiple conditions are satisfied, the first one is used.  The variable `*args` parameters takes arguments of the form :  `condition0`, `value0`, `condition1`, `value1`, ..., `default`.  If `condition0` evaluates to `True`, then assign `value0` to  `column_name`, if `condition1` evaluates to `True`, then  assign `value1` to `column_name`, and so on. If none of the  conditions evaluate to `True`, assign `default` to  `column_name`.  This function can be likened to SQL's `case_when`:  ```sql  CASE WHEN condition0 THEN value0  WHEN condition1 THEN value1  --- more conditions  ELSE default  END AS column_name  ```  compared to python's `if-elif-else`:  ```python  if condition0:  value0  elif condition1:  value1  # more elifs  else:  default  ```  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [0, 0, 1, 2, "hi"],  ... "b": [0, 3, 4, 5, "bye"],  ... "c": [6, 7, 8, 9, "wait"],  ... }  ... )  >>> df  a b c  0 0 0 6  1 0 3 7  2 1 4 8  3 2 5 9  4 hi bye wait  >>> df.case_when(  ... ((df.a == 0) & (df.b != 0)) | (df.c == "wait"),  ... df.a,  ... (df.b == 0) & (df.a == 0),  ... "x",  ... default=df.c,  ... column_name="value",  ... )  a b c value  0 0 0 6 x  1 0 3 7 0  2 1 4 8 8  3 2 5 9 9  4 hi bye wait hi  !!! abstract "Version Changed"  - 0.24.0  - Added `default` parameter.  Args:  df: A pandas DataFrame.  *args: Variable argument of conditions and expected values.  Takes the form  `condition0`, `value0`, `condition1`, `value1`, ... .  `condition` can be a 1-D boolean array, a callable, or a string.  If `condition` is a callable, it should evaluate  to a 1-D boolean array. The array should have the same length  as the DataFrame. If it is a string, it is computed on the dataframe,  via `df.eval`, and should return a 1-D boolean array.  `result` can be a scalar, a 1-D array, or a callable.  If `result` is a callable, it should evaluate to a 1-D array.  For a 1-D array, it should have the same length as the DataFrame.  default: This is the element inserted in the output  when all conditions evaluate to False.  Can be scalar, 1-D array or callable.  If callable, it should evaluate to a 1-D array.  The 1-D array should be the same length as the DataFrame.  column_name: Name of column to assign results to. A new column  is created if it does not already exist in the DataFrame.  Raises:  ValueError: If condition/value fails to evaluate.  Returns:  A pandas DataFrame.  """ # noqa: E501 # Preliminary checks on the case_when function. # The bare minimum checks are done; the remaining checks # are done within `pd.Series.mask`. 
check("column_name", column_name, [str]) len_args = len(args) if len_args < 2: raise ValueError("At least two arguments are required for the `args` parameter") if len_args % 2: if default is None: warnings.warn( "The last argument in the variable arguments " "has been assigned as the default. " "Note however that this will be deprecated " "in a future release; use an even number " "of boolean conditions and values, " "and pass the default argument to the `default` " "parameter instead.", DeprecationWarning, stacklevel=find_stack_level(), ) *args, default = args else: raise ValueError( "The number of conditions and values do not match. " f"There are {len_args - len_args // 2} conditions " f"and {len_args // 2} values." ) booleans = [] replacements = [] for index, value in enumerate(args): if index % 2: if callable(value): value = apply_if_callable(value, df) replacements.append(value) else: if callable(value): value = apply_if_callable(value, df) elif isinstance(value, str): value = df.eval(value) booleans.append(value) if callable(default): default = apply_if_callable(default, df) if is_scalar(default): default = pd.Series([default]).repeat(len(df)) if not hasattr(default, "shape"): default = pd.Series([*default]) if isinstance(default, pd.Index): arr_ndim = default.nlevels else: arr_ndim = default.ndim if arr_ndim != 1: raise ValueError( "The argument for the `default` parameter " "should either be a 1-D array, a scalar, " "or a callable that can evaluate to a 1-D array." ) if not isinstance(default, pd.Series): default = pd.Series(default) default.index = df.index # actual computation # ensures value assignment is on a first come basis booleans = booleans[::-1] replacements = replacements[::-1] for index, (condition, value) in enumerate(zip(booleans, replacements)): try: default = default.mask(condition, value) # error `feedoff` idea from SO # https://stackoverflow.com/a/46091127/7175713 except Exception as error: raise ValueError( f"condition{index} and value{index} failed to evaluate. " f"Original error message: {error}" ) from error return df.assign(**{column_name: default}) 

change_index_dtype

Implementation of the change_index_dtype function.

change_index_dtype(df, dtype, axis='index')

Cast an index to a specified dtype.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> rng = np.random.default_rng(seed=0)
>>> np.random.seed(0)
>>> tuples = list(
...     zip(
...         *[
...             ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...             [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
...         ]
...     )
... )
>>> idx = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=["A", "B"])
>>> df
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome = df.change_index_dtype(dtype=str)
>>> outcome
                     A         B
first second
bar   1.0     1.764052  0.400157
      2.0     0.978738  2.240893
baz   1.0     1.867558 -0.977278
      2.0     0.950088 -0.151357
foo   1.0    -0.103219  0.410599
      2.0     0.144044  1.454274
qux   1.0     0.761038  0.121675
      2.0     0.443863  0.333674
>>> outcome.index.dtypes
first     object
second    object
dtype: object
>>> outcome = df.change_index_dtype(dtype={"second": int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     object
second     int64
dtype: object
>>> outcome = df.change_index_dtype(dtype={0: "category", 1: int})
>>> outcome
                     A         B
first second
bar   1       1.764052  0.400157
      2       0.978738  2.240893
baz   1       1.867558 -0.977278
      2       0.950088 -0.151357
foo   1      -0.103219  0.410599
      2       0.144044  1.454274
qux   1       0.761038  0.121675
      2       0.443863  0.333674
>>> outcome.index.dtypes
first     category
second       int64
dtype: object

Parameters:

df (DataFrame): A pandas DataFrame. Required.
dtype (Union[str, dict]): Use a str or dtype to cast the entire Index to the same type. Alternatively, use a dictionary to change the MultiIndex to new dtypes. Required.
axis (str): Determines which axis to change the dtype(s) on. Should be either 'index' or 'columns'. Default: 'index'.

Returns:

DataFrame: A pandas DataFrame with a new Index.

Source code in janitor/functions/change_index_dtype.py
@pf.register_dataframe_method def change_index_dtype( df: pd.DataFrame, dtype: Union[str, dict], axis: str = "index" ) -> pd.DataFrame:  """Cast an index to a specified dtype ``dtype``.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import numpy as np  >>> import janitor  >>> rng = np.random.default_rng(seed=0)  >>> np.random.seed(0)  >>> tuples = list(  ... zip(  ... *[  ... ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],  ... [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  ... ]  ... )  ... )  >>> idx = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])  >>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=["A", "B"])  >>> df  A B  first second  bar 1.0 1.764052 0.400157  2.0 0.978738 2.240893  baz 1.0 1.867558 -0.977278  2.0 0.950088 -0.151357  foo 1.0 -0.103219 0.410599  2.0 0.144044 1.454274  qux 1.0 0.761038 0.121675  2.0 0.443863 0.333674  >>> outcome = df.change_index_dtype(dtype=str)  >>> outcome  A B  first second  bar 1.0 1.764052 0.400157  2.0 0.978738 2.240893  baz 1.0 1.867558 -0.977278  2.0 0.950088 -0.151357  foo 1.0 -0.103219 0.410599  2.0 0.144044 1.454274  qux 1.0 0.761038 0.121675  2.0 0.443863 0.333674  >>> outcome.index.dtypes  first object  second object  dtype: object  >>> outcome = df.change_index_dtype(dtype={"second": int})  >>> outcome  A B  first second  bar 1 1.764052 0.400157  2 0.978738 2.240893  baz 1 1.867558 -0.977278  2 0.950088 -0.151357  foo 1 -0.103219 0.410599  2 0.144044 1.454274  qux 1 0.761038 0.121675  2 0.443863 0.333674  >>> outcome.index.dtypes  first object  second int64  dtype: object  >>> outcome = df.change_index_dtype(dtype={0: "category", 1: int})  >>> outcome  A B  first second  bar 1 1.764052 0.400157  2 0.978738 2.240893  baz 1 1.867558 -0.977278  2 0.950088 -0.151357  foo 1 -0.103219 0.410599  2 0.144044 1.454274  qux 1 0.761038 0.121675  2 0.443863 0.333674  >>> outcome.index.dtypes  first category  second int64  dtype: object  Args:  df: A pandas DataFrame.  dtype : Use a str or dtype to cast the entire Index  to the same type.  Alternatively, use a dictionary to change the MultiIndex  to new dtypes.  axis: Determines which axis to change the dtype(s).  Should be either 'index' or 'columns'.  Returns:  A pandas DataFrame with new Index.  """ # noqa: E501 check("axis", axis, [str]) if axis not in {"index", "columns"}: raise ValueError("axis should be either index or columns.") df = df[:] current_index = getattr(df, axis) if not isinstance(current_index, pd.MultiIndex): if isinstance(dtype, dict): raise TypeError( "Changing the dtype via a dictionary " "is not supported for a single index." ) current_index = current_index.astype(dtype) setattr(df, axis, current_index) return df if not isinstance(dtype, dict): dtype = {level_number: dtype for level_number in range(current_index.nlevels)} all_str = all(isinstance(level, str) for level in dtype) all_int = all(isinstance(level, int) for level in dtype) if not all_str | all_int: raise TypeError( "The levels in the dictionary should be either all strings or all integers." 
) dtype = { current_index._get_level_number(label): _dtype for label, _dtype in dtype.items() } new_levels = [] codes = current_index.codes levels = current_index.levels for level_number in range(current_index.nlevels): _index = levels[level_number] if level_number in dtype: _dtype = dtype[level_number] _index = _index.astype(_dtype) new_levels.append(_index) current_index = pd.MultiIndex( levels=new_levels, codes=codes, names=current_index.names, copy=False, verify_integrity=False, ) setattr(df, axis, current_index) return df 

change_type

change_type(df, column_name, dtype, ignore_exception=False)

Change the type of a column.

This method does not mutate the original DataFrame.

Exceptions that are raised can be ignored. For example, if you have a mixed-dtype column that contains non-integer strings and integers, and you want to coerce everything to integers, you can optionally ignore the non-integer strings and replace them with NaN or keep the original value.

Intended to be the method-chaining alternative to:

df[col] = df[col].astype(dtype) 

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.astype instead.

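A hedged sketch of the plain-pandas equivalents suggested by the note: pd.DataFrame.astype covers the default behaviour, while pd.to_numeric with errors="coerce" roughly approximates ignore_exception="fillna" for numeric targets:

import pandas as pd

df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})

# Equivalent of df.change_type("col1", dtype=str).
out = df.astype({"col1": str})

# Rough equivalent of ignore_exception="fillna" when coercing to a numeric
# dtype: unparseable values become NaN instead of raising.
out["col2"] = pd.to_numeric(out["col2"], errors="coerce")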
Examples:

Change the type of a column.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df
   col1  col2
0     0     m
1     1     5
2     2  True
>>> df.change_type(
...     "col1",
...     dtype=str,
... ).change_type(
...     "col2",
...     dtype=float,
...     ignore_exception="fillna",
... )
  col1  col2
0    0   NaN
1    1   5.0
2    2   1.0

Change the type of multiple columns. To change the type of all columns, please use DataFrame.astype instead.

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df.change_type(["col1", "col2"], str)
  col1  col2
0    0     m
1    1     5
2    2  True

Parameters:

df (DataFrame): A pandas DataFrame. Required.
column_name (Hashable | list[Hashable] | Index): The column(s) in the dataframe. Required.
dtype (type): The datatype to convert to. Should be one of the standard Python types, or a numpy datatype. Required.
ignore_exception (bool): One of {False, "fillna", "keep_values"}. Default: False.

Raises:

ValueError: If an unknown option is provided for ignore_exception.

Returns:

DataFrame: A pandas DataFrame with changed column types.

Source code in janitor/functions/change_type.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.astype` instead." ) ) @deprecated_alias(column="column_name") def change_type( df: pd.DataFrame, column_name: Hashable | list[Hashable] | pd.Index, dtype: type, ignore_exception: bool = False, ) -> pd.DataFrame:  """Change the type of a column.  This method does not mutate the original DataFrame.  Exceptions that are raised can be ignored. For example, if one has a mixed  dtype column that has non-integer strings and integers, and you want to  coerce everything to integers, you can optionally ignore the non-integer  strings and replace them with `NaN` or keep the original value.  Intended to be the method-chaining alternative to:  ```python  df[col] = df[col].astype(dtype)  ```  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.astype` instead.  Examples:  Change the type of a column.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})  >>> df  col1 col2  0 0 m  1 1 5  2 2 True  >>> df.change_type(  ... "col1",  ... dtype=str,  ... ).change_type(  ... "col2",  ... dtype=float,  ... ignore_exception="fillna",  ... )  col1 col2  0 0 NaN  1 1 5.0  2 2 1.0  Change the type of multiple columns. To change the type of all columns,  please use `DataFrame.astype` instead.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})  >>> df.change_type(["col1", "col2"], str)  col1 col2  0 0 m  1 1 5  2 2 True  Args:  df: A pandas DataFrame.  column_name: The column(s) in the dataframe.  dtype: The datatype to convert to. Should be one of the standard  Python types, or a numpy datatype.  ignore_exception: One of `{False, "fillna", "keep_values"}`.  Raises:  ValueError: If unknown option provided for `ignore_exception`.  Returns:  A pandas DataFrame with changed column types.  """ # noqa: E501 df = df.copy() # avoid mutating the original DataFrame if not ignore_exception: df[column_name] = df[column_name].astype(dtype) elif ignore_exception == "keep_values": df[column_name] = df[column_name].astype(dtype, errors="ignore") elif ignore_exception == "fillna": if isinstance(column_name, Hashable): column_name = [column_name] df[column_name] = df[column_name].map(_convert, dtype=dtype) else: raise ValueError("Unknown option for ignore_exception") return df 

clean_names

Functions for cleaning columns/index names and/or column values.

clean_names(df, axis='columns', column_names=None, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=True, preserve_original_labels=True, enforce_string=True, truncate_limit=None)

Clean column/index names. It can also be applied to column values.

Takes all column names, converts them to lowercase, then replaces all spaces with underscores.

By default, column names are converted to string types. This can be switched off by passing in enforce_string=False.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {"Aloha": range(3), "Bell Chart": range(3), "Animals@#$%^": range(3)}
... )
>>> df
   Aloha  Bell Chart  Animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names()
   aloha  bell_chart  animals@#$%^
0      0           0             0
1      1           1             1
2      2           2             2
>>> df.clean_names(remove_special=True)
   aloha  bell_chart  animals
0      0           0        0
1      1           1        1
2      2           2        2

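The axis/column_names combination described below can also clean the values inside a column rather than the labels; a minimal sketch (output not shown, as it depends on the options chosen):

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"Raw Name": ["Mary Jane", "José Álvarez"]})

# Clean the *values* of one column: set axis=None and point column_names
# at the column whose contents should be cleaned.
out = df.clean_names(axis=None, column_names="Raw Name")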
Version Changed

  • 0.26.0
    • Added axis and column_names parameters.

Parameters:

df (DataFrame): The pandas DataFrame object. Required.
axis (str): Whether to clean the labels on the index or columns. If None, applies to a defined column or columns in column_names. Default: 'columns'.
column_names (str | list): Clean the values in a column. axis should be None. Column selection is possible using the select syntax. Default: None.
strip_underscores (str | bool): Removes the outer underscores from all column names/values. Default None keeps outer underscores. Values can be either 'left', 'right' or 'both', or the respective shorthands 'l', 'r' and True.
case_type (str): Whether to make columns lower or uppercase. Current case may be preserved with 'preserve', while snake case conversion (from CamelCase or camelCase only) can be turned on using "snake". Default 'lower' makes all characters lowercase.
remove_special (bool): Remove special characters from columns. Only letters, numbers and underscores are preserved. Default: False.
strip_accents (bool): Whether or not to remove accents from column names/values. Default: True.
preserve_original_labels (bool): Preserve original names. These are later retrievable using df.original_labels. Applies if axis is not None. Default: True.
enforce_string (bool): Whether or not to convert all column names/values to string type. Defaults to True, but can be turned off. Columns with more than one level will not be converted by default.
truncate_limit (int): Truncates formatted column names/values to the specified length. Default None does not truncate.

Raises:

ValueError: If axis=None and column_names=None.

Returns:

DataFrame: A pandas DataFrame.

Source code in janitor/functions/clean_names.py
@pf.register_dataframe_method @deprecated_alias(preserve_original_columns="preserve_original_labels") def clean_names( df: pd.DataFrame, axis: str = "columns", column_names: str | list = None, strip_underscores: str | bool = None, case_type: str = "lower", remove_special: bool = False, strip_accents: bool = True, preserve_original_labels: bool = True, enforce_string: bool = True, truncate_limit: int = None, ) -> pd.DataFrame:  """Clean column/index names. It can also be applied to column values.  Takes all column names, converts them to lowercase,  then replaces all spaces with underscores.  By default, column names are converted to string types.  This can be switched off by passing in `enforce_string=False`.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {"Aloha": range(3), "Bell Chart": range(3), "Animals@#$%^": range(3)}  ... )  >>> df  Aloha Bell Chart Animals@#$%^  0 0 0 0  1 1 1 1  2 2 2 2  >>> df.clean_names()  aloha bell_chart animals@#$%^  0 0 0 0  1 1 1 1  2 2 2 2  >>> df.clean_names(remove_special=True)  aloha bell_chart animals  0 0 0 0  1 1 1 1  2 2 2 2  !!! summary "Version Changed"  - 0.26.0  - Added `axis` and `column_names` parameters.  Args:  df: The pandas DataFrame object.  axis: Whether to clean the labels on the index or columns.  If `None`, applies to a defined column  or columns in `column_names`.  column_names: Clean the values in a column.  `axis` should be `None`.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  strip_underscores: Removes the outer underscores from all  column names/values. Default None keeps outer underscores.  Values can be either 'left', 'right' or 'both'  or the respective shorthand 'l',  'r' and True.  case_type: Whether to make columns lower or uppercase.  Current case may be preserved with 'preserve',  while snake case conversion (from CamelCase or camelCase only)  can be turned on using "snake".  Default 'lower' makes all characters lowercase.  remove_special: Remove special characters from columns.  Only letters, numbers and underscores are preserved.  strip_accents: Whether or not to remove accents from  columns names/values.  preserve_original_labels: Preserve original names.  This is later retrievable using `df.original_labels`.  Applies if `axis` is not None.  enforce_string: Whether or not to convert all  column names/values to string type.  Defaults to True, but can be turned off.  Columns with >1 levels will not be converted by default.  truncate_limit: Truncates formatted column names/values  to the specified length.  Default None does not truncate.  Raises:  ValueError: If `axis=None` and `column_names=None`.  Returns:  A pandas DataFrame.  """ if not axis and not column_names: raise ValueError( "Kindly provide an argument to `column_names`, if axis is None." 
) if axis is None: column_names = get_index_labels(arg=column_names, df=df, axis="columns") if is_scalar(column_names): column_names = [column_names] df = df.copy() for column_name in column_names: df[column_name] = _clean_names( obj=df[column_name], enforce_string=enforce_string, case_type=case_type, remove_special=remove_special, strip_accents=strip_accents, strip_underscores=strip_underscores, truncate_limit=truncate_limit, ) return df assert axis in {"index", "columns"} df = df[:] target_axis = getattr(df, axis) if isinstance(target_axis, pd.MultiIndex): target_axis = [ target_axis.get_level_values(number) for number in range(target_axis.nlevels) ] target_axis = [ _clean_names( obj=obj, enforce_string=enforce_string, case_type=case_type, remove_special=remove_special, strip_accents=strip_accents, strip_underscores=strip_underscores, truncate_limit=truncate_limit, ) for obj in target_axis ] else: target_axis = _clean_names( obj=target_axis, enforce_string=enforce_string, case_type=case_type, remove_special=remove_special, strip_accents=strip_accents, strip_underscores=strip_underscores, truncate_limit=truncate_limit, ) # Store the original column names, if enabled by user if preserve_original_labels: df.__dict__["original_labels"] = getattr(df, axis) setattr(df, axis, target_axis) return df 

coalesce

Function for performing coalesce.

coalesce(df, *column_names, target_column_name=None, default_value=None)

Coalesce two or more columns of data in order of column names provided.

Given the variable arguments of column names, coalesce finds and returns the first non-missing value from these columns, for every row in the input dataframe. If all the column values are null for a particular row, then the default_value will be filled in.

If target_column_name is not provided, then the first column is coalesced.

This method does not mutate the original DataFrame.

The select syntax can be used in column_names.

Examples:

Use coalesce with 3 columns, "a", "b" and "c".

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "a": [np.nan, 1, np.nan],
...         "b": [2, 3, np.nan],
...         "c": [4, np.nan, np.nan],
...     }
... )
>>> df.coalesce("a", "b", "c")
     a    b    c
0  2.0  2.0  4.0
1  1.0  3.0  NaN
2  NaN  NaN  NaN

Provide a target_column_name.

>>> df.coalesce("a", "b", "c", target_column_name="new_col")
     a    b    c  new_col
0  NaN  2.0  4.0      2.0
1  1.0  3.0  NaN      1.0
2  NaN  NaN  NaN      NaN

Provide a default value.

>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "a": [1, np.nan, np.nan],
...         "b": [2, 3, np.nan],
...     }
... )
>>> df.coalesce(
...     "a",
...     "b",
...     target_column_name="new_col",
...     default_value=-1,
... )
     a    b  new_col
0  1.0  2.0      1.0
1  NaN  3.0      3.0
2  NaN  NaN     -1.0

This is more syntactic diabetes! For R users, this should look similar to dplyr's coalesce function; for Python users, the interface should be more intuitive than the pandas.Series.combine_first method.

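For reference, a hedged sketch of how the same first-non-missing logic looks in plain pandas, using bfill across columns or chained combine_first:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan],
        "b": [2, 3, np.nan],
        "c": [4, np.nan, np.nan],
    }
)

# First non-missing value per row across a, b, c (same idea as df.coalesce("a", "b", "c")).
first_non_missing = df[["a", "b", "c"]].bfill(axis=1).iloc[:, 0]

# Equivalent chained form mentioned in the text above.
also_first = df["a"].combine_first(df["b"]).combine_first(df["c"])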
Parameters:

df (DataFrame): A pandas DataFrame. Required.
column_names (Any): A list of column names. Default: ().
target_column_name (Optional[str]): The new column name after combining. If None, then the first column in column_names is updated, with the Null values replaced. Default: None.
default_value (Optional[Union[int, float, str]]): A scalar to replace any remaining nulls after coalescing. Default: None.

Raises:

ValueError: If the length of column_names is less than 2.

Returns:

DataFrame: A pandas DataFrame with coalesced columns.

Source code in janitor/functions/coalesce.py
@pf.register_dataframe_method @deprecated_alias(columns="column_names", new_column_name="target_column_name") def coalesce( df: pd.DataFrame, *column_names: Any, target_column_name: Optional[str] = None, default_value: Optional[Union[int, float, str]] = None, ) -> pd.DataFrame:  """Coalesce two or more columns of data in order of column names provided.  Given the variable arguments of column names,  `coalesce` finds and returns the first non-missing value  from these columns, for every row in the input dataframe.  If all the column values are null for a particular row,  then the `default_value` will be filled in.  If `target_column_name` is not provided,  then the first column is coalesced.  This method does not mutate the original DataFrame.  The [`select`][janitor.functions.select.select] syntax  can be used in `column_names`.  Examples:  Use `coalesce` with 3 columns, "a", "b" and "c".  >>> import pandas as pd  >>> import numpy as np  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [np.nan, 1, np.nan],  ... "b": [2, 3, np.nan],  ... "c": [4, np.nan, np.nan],  ... }  ... )  >>> df.coalesce("a", "b", "c")  a b c  0 2.0 2.0 4.0  1 1.0 3.0 NaN  2 NaN NaN NaN  Provide a target_column_name.  >>> df.coalesce("a", "b", "c", target_column_name="new_col")  a b c new_col  0 NaN 2.0 4.0 2.0  1 1.0 3.0 NaN 1.0  2 NaN NaN NaN NaN  Provide a default value.  >>> import pandas as pd  >>> import numpy as np  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [1, np.nan, np.nan],  ... "b": [2, 3, np.nan],  ... }  ... )  >>> df.coalesce(  ... "a",  ... "b",  ... target_column_name="new_col",  ... default_value=-1,  ... )  a b new_col  0 1.0 2.0 1.0  1 NaN 3.0 3.0  2 NaN NaN -1.0  This is more syntactic diabetes! For R users, this should look familiar to  `dplyr`'s `coalesce` function; for Python users, the interface  should be more intuitive than the `pandas.Series.combine_first`  method.  Args:  df: A pandas DataFrame.  column_names: A list of column names.  target_column_name: The new column name after combining.  If `None`, then the first column in `column_names` is updated,  with the Null values replaced.  default_value: A scalar to replace any remaining nulls  after coalescing.  Raises:  ValueError: If length of `column_names` is less than 2.  Returns:  A pandas DataFrame with coalesced columns.  """ if not column_names: return df indexers = _select_index([*column_names], df, axis="columns") if len(indexers) < 2: raise ValueError("The number of columns to coalesce should be a minimum of 2.") if target_column_name: check("target_column_name", target_column_name, [str]) if default_value: check("default_value", default_value, [int, float, str]) df = df.copy() outcome = df.iloc[:, indexers[0]] for num in range(1, len(indexers)): position = indexers[num] replacement = df.iloc[:, position] outcome = outcome.fillna(replacement) if outcome.hasnans and (default_value is not None): outcome = outcome.fillna(default_value) if target_column_name is None: df.iloc[:, indexers[0]] = outcome else: df[target_column_name] = outcome return df 

collapse_levels

Implementation of the collapse_levels function.

collapse_levels(df, sep=None, glue=None, axis='columns')

Flatten multi-level index/column dataframe to a single level.

This method does not mutate the original DataFrame.

Given a DataFrame containing a multi-level index or columns, flatten it to a single level by string-joining the labels in each level.

After a groupby / aggregate operation where .agg() is passed a list of multiple aggregation functions, a multi-level DataFrame is returned with the name of the function applied in the second level.

It is sometimes convenient for later indexing to flatten out this multi-level configuration back into a single level. This function does this through a simple string-joining of all the names across different levels in a single column.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "class": ["bird", "bird", "bird", "mammal", "mammal"],
...         "max_speed": [389, 389, 24, 80, 21],
...         "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],
...     }
... )
>>> df
    class  max_speed    type
0    bird        389  falcon
1    bird        389  falcon
2    bird         24  parrot
3  mammal         80    Lion
4  mammal         21  Monkey
>>> grouped_df = df.groupby("class")[["max_speed"]].agg(["mean", "median"])
>>> grouped_df
         max_speed
              mean median
class
bird    267.333333  389.0
mammal   50.500000   50.5
>>> grouped_df.collapse_levels(sep="_")
        max_speed_mean  max_speed_median
class
bird        267.333333             389.0
mammal       50.500000              50.5

Before applying .collapse_levels, the .agg operation returns a multi-level column DataFrame whose columns are (level 1, level 2):

[("max_speed", "mean"), ("max_speed", "median")] 

.collapse_levels then flattens the column MultiIndex into a single level index with names:

["max_speed_mean", "max_speed_median"] 

For more control, a glue specification can be passed, where the names of the levels are used to control the output of the flattened index:

>>> (
...     grouped_df.rename_axis(
...         columns=["column_name", "agg_name"]
...     ).collapse_levels(glue="{agg_name}_{column_name}")
... )
        mean_max_speed  median_max_speed
class
bird        267.333333             389.0
mammal       50.500000              50.5

Note that for glue to work, the keyword arguments in the glue specification should be the names of the levels in the MultiIndex.

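For comparison, a minimal sketch of doing the same sep-style flattening with plain pandas (assuming every level label is already a string):

import pandas as pd

df = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal"],
        "max_speed": [389, 24, 80],
    }
)
grouped = df.groupby("class")[["max_speed"]].agg(["mean", "median"])

# Join the MultiIndex column labels with "_", mirroring collapse_levels(sep="_").
grouped.columns = ["_".join(levels) for levels in grouped.columns]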
Version Changed

  • 0.27.0
    • Added glue and axis parameters.

Parameters:

df (DataFrame): A pandas DataFrame. Required.
sep (str): String separator used to join the column level names. Default: None.
glue (str): A specification on how the column levels should be combined. It allows for a more granular composition, and serves as an alternative to sep. Default: None.
axis (str): Determines whether to collapse the levels on the index or columns. Default: 'columns'.

Returns:

DataFrame: A pandas DataFrame with a single-level column index.

Source code in janitor/functions/collapse_levels.py
@pf.register_dataframe_method def collapse_levels( df: pd.DataFrame, sep: str = None, glue: str = None, axis: str = "columns", ) -> pd.DataFrame:  """Flatten multi-level index/column dataframe to a single level.  This method does not mutate the original DataFrame.  Given a DataFrame containing multi-level index/columns, flatten to single-level  by string-joining the labels in each level.  After a `groupby` / `aggregate` operation where `.agg()` is passed a  list of multiple aggregation functions, a multi-level DataFrame is  returned with the name of the function applied in the second level.  It is sometimes convenient for later indexing to flatten out this  multi-level configuration back into a single level. This function does  this through a simple string-joining of all the names across different  levels in a single column.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "class": ["bird", "bird", "bird", "mammal", "mammal"],  ... "max_speed": [389, 389, 24, 80, 21],  ... "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],  ... }  ... )  >>> df  class max_speed type  0 bird 389 falcon  1 bird 389 falcon  2 bird 24 parrot  3 mammal 80 Lion  4 mammal 21 Monkey  >>> grouped_df = df.groupby("class")[["max_speed"]].agg(["mean", "median"])  >>> grouped_df # doctest: +NORMALIZE_WHITESPACE  max_speed  mean median  class  bird 267.333333 389.0  mammal 50.500000 50.5  >>> grouped_df.collapse_levels(sep="_") # doctest: +NORMALIZE_WHITESPACE  max_speed_mean max_speed_median  class  bird 267.333333 389.0  mammal 50.500000 50.5  Before applying `.collapse_levels`, the `.agg` operation returns a  multi-level column DataFrame whose columns are `(level 1, level 2)`:  ```python  [("max_speed", "mean"), ("max_speed", "median")]  ```  `.collapse_levels` then flattens the column MultiIndex into a single  level index with names:  ```python  ["max_speed_mean", "max_speed_median"]  ```  For more control, a `glue` specification can be passed,  where the names of the levels are used to control the output of the  flattened index:  >>> (  ... grouped_df.rename_axis(  ... columns=["column_name", "agg_name"]  ... ).collapse_levels(glue="{agg_name}_{column_name}")  ... )  mean_max_speed median_max_speed  class  bird 267.333333 389.0  mammal 50.500000 50.5  Note that for `glue` to work, the keyword arguments  in the glue specification  should be the names of the levels in the MultiIndex.  !!! abstract "Version Changed"  - 0.27.0  - Added `glue` and `axis` parameters.  Args:  df: A pandas DataFrame.  sep: String separator used to join the column level names.  glue: A specification on how the column levels should be combined.  It allows for a more granular composition,  and serves as an alternative to `sep`.  axis: Determines whether to collapse the  levels on the index or columns.  Returns:  A pandas DataFrame with single-level column index.  
""" # noqa: E501 if (sep is not None) and (glue is not None): raise ValueError("Only one of sep or glue should be provided.") if sep is not None: check("sep", sep, [str]) if glue is not None: check("glue", glue, [str]) check("axis", axis, [str]) if axis not in {"index", "columns"}: raise ValueError("axis argument should be either 'index' or 'columns'.") if not isinstance(getattr(df, axis), pd.MultiIndex): return df # TODO: Pyarrow offers faster string computations # future work should take this into consideration, # which would require a different route from python's string.join # since work is only on the columns # it is safe, and more efficient to slice/view the dataframe # plus Pandas creates a new Index altogether # as such, the original dataframe is not modified df = df[:] new_index = getattr(df, axis) if glue is not None: new_index = [dict(zip(new_index.names, entry)) for entry in new_index] new_index = [glue.format_map(mapping) for mapping in new_index] setattr(df, axis, new_index) return df sep = "_" if sep is None else sep levels = [level for level in new_index.levels] all_strings = all(map(is_string_dtype, levels)) if all_strings: no_empty_string = all((entry != "").all() for entry in levels) if no_empty_string: new_index = new_index.map(sep.join) setattr(df, axis, new_index) return df new_index = (map(str, entry) for entry in new_index) new_index = [ # faster to use a list comprehension within string.join # compared to a generator # https://stackoverflow.com/a/37782238 sep.join([entry for entry in word if entry]) for word in new_index ] setattr(df, axis, new_index) return df 

complete

complete(df, *columns, sort=False, by=None, fill_value=None, explicit=True)

Complete a data frame with missing combinations of data.

It is modeled after tidyr's complete function. In a way, it is the inverse of pd.dropna, as it exposes implicitly missing rows.

The variable columns parameter can be a column name, a list of column names, or a pandas Index, Series, or DataFrame. If a pandas Index, Series, or DataFrame is passed, it should have a name or names that exist in df.

A callable can also be passed - the callable should evaluate to a pandas Index, Series, or DataFrame, and the names of the pandas object should exist in df.

A dictionary can also be passed - the values of the dictionary should either be a 1D array or a callable that evaluates to a 1D array, while the keys of the dictionary should exist in df.

Users should ensure that the pandas object is unique and/or sorted - no checks are done to ensure uniqueness or sortedness.

If by is present, the DataFrame is completed per group. by should be a column name, or a list of column names.

Examples:

>>> import pandas as pd >>> import janitor >>> import numpy as np >>> df = pd.DataFrame( ...  { ...  "Year": [1999, 2000, 2004, 1999, 2004], ...  "Taxon": [ ...  "Saccharina", ...  "Saccharina", ...  "Saccharina", ...  "Agarum", ...  "Agarum", ...  ], ...  "Abundance": [4, 5, 2, 1, 8], ...  } ... ) >>> df  Year Taxon Abundance 0 1999 Saccharina 4 1 2000 Saccharina 5 2 2004 Saccharina 2 3 1999 Agarum 1 4 2004 Agarum 8 

Expose missing pairings of Year and Taxon:

>>> df.complete("Year", "Taxon", sort=True)  Year Taxon Abundance 0 1999 Agarum 1.0 1 1999 Saccharina 4.0 2 2000 Agarum NaN 3 2000 Saccharina 5.0 4 2004 Agarum 8.0 5 2004 Saccharina 2.0 

Expose missing years from 1999 to 2004:

>>> index = pd.Index(range(1999, 2005), name="Year") >>> df.complete(index, "Taxon", sort=True)  Year Taxon Abundance 0 1999 Agarum 1.0 1 1999 Saccharina 4.0 2 2000 Agarum NaN 3 2000 Saccharina 5.0 4 2001 Agarum NaN 5 2001 Saccharina NaN 6 2002 Agarum NaN 7 2002 Saccharina NaN 8 2003 Agarum NaN 9 2003 Saccharina NaN 10 2004 Agarum 8.0 11 2004 Saccharina 2.0 

A dictionary can be used as well:

>>> dictionary = {"Year": range(1999, 2005)} >>> df.complete(dictionary, "Taxon", sort=True)  Year Taxon Abundance 0 1999 Agarum 1.0 1 1999 Saccharina 4.0 2 2000 Agarum NaN 3 2000 Saccharina 5.0 4 2001 Agarum NaN 5 2001 Saccharina NaN 6 2002 Agarum NaN 7 2002 Saccharina NaN 8 2003 Agarum NaN 9 2003 Saccharina NaN 10 2004 Agarum 8.0 11 2004 Saccharina 2.0 

Fill missing values:

>>> df = pd.DataFrame( ...  dict( ...  group=(1, 2, 1, 2), ...  item_id=(1, 2, 2, 3), ...  item_name=("a", "a", "b", "b"), ...  value1=(1, np.nan, 3, 4), ...  value2=range(4, 8), ...  ) ... ) >>> df  group item_id item_name value1 value2 0 1 1 a 1.0 4 1 2 2 a NaN 5 2 1 2 b 3.0 6 3 2 3 b 4.0 7 
>>> df.complete( ...  "group", ...  ["item_id", "item_name"], ...  fill_value={"value1": 0, "value2": 99}, ...  sort=True, ... )  group item_id item_name value1 value2 0 1 1 a 1.0 4.0 1 1 2 a 0.0 99.0 2 1 2 b 3.0 6.0 3 1 3 b 0.0 99.0 4 2 1 a 0.0 99.0 5 2 2 a 0.0 5.0 6 2 2 b 0.0 99.0 7 2 3 b 4.0 7.0 

Limit the fill to only implicit missing values by setting explicit to False:

>>> df.complete( ...  "group", ...  ["item_id", "item_name"], ...  fill_value={"value1": 0, "value2": 99}, ...  explicit=False, ...  sort=True, ... )  group item_id item_name value1 value2 0 1 1 a 1.0 4.0 1 1 2 a 0.0 99.0 2 1 2 b 3.0 6.0 3 1 3 b 0.0 99.0 4 2 1 a 0.0 99.0 5 2 2 a NaN 5.0 6 2 2 b 0.0 99.0 7 2 3 b 4.0 7.0 

Expose missing rows per group, using a callable:

>>> df = pd.DataFrame( ...  { ...  "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"], ...  "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013], ...  "value": [1, 3, 1, 2, 3, 2, 5], ...  } ... ) >>> df  state year value 0 CA 2010 1 1 CA 2013 3 2 HI 2010 1 3 HI 2012 2 4 HI 2016 3 5 NY 2009 2 6 NY 2013 5 
>>> def new_year_values(df): ...  return pd.RangeIndex( ...  start=df.year.min(), stop=df.year.max() + 1, name="year" ...  ) >>> df.complete(new_year_values, by="state", sort=True)  state year value 0 CA 2010 1.0 1 CA 2011 NaN 2 CA 2012 NaN 3 CA 2013 3.0 4 HI 2010 1.0 5 HI 2011 NaN 6 HI 2012 2.0 7 HI 2013 NaN 8 HI 2014 NaN 9 HI 2015 NaN 10 HI 2016 3.0 11 NY 2009 2.0 12 NY 2010 NaN 13 NY 2011 NaN 14 NY 2012 NaN 15 NY 2013 5.0 

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*columns Any

This refers to the columns to be completed. It could be a column name, a list of column names, or a pandas Index, Series, or DataFrame.

It can also be a callable that gets evaluated to a pandas Index, Series, or DataFrame.

It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array, while the keys of the dictionary should exist in df.

()
sort bool

Sort DataFrame based on *columns.

False
by str | list

Label or list of labels to group by. The implicitly missing rows are exposed per group.

None
fill_value dict | Any

Scalar value to use instead of NaN for missing combinations. A dictionary, mapping columns names to a scalar value is also accepted.

None
explicit bool

Determines if only implicitly missing values should be filled (False), or all nulls existing in the dataframe (True). explicit is applicable only if fill_value is not None.

True

Returns:

Type Description
DataFrame

A pandas DataFrame with explicit missing rows, if any.

Source code in janitor/functions/complete.py
@pf.register_dataframe_method def complete( df: pd.DataFrame, *columns: Any, sort: bool = False, by: str | list = None, fill_value: dict | Any = None, explicit: bool = True, ) -> pd.DataFrame:  """  Complete a data frame with missing combinations of data.  It is modeled after tidyr's `complete` function.  In a way, it is the inverse of `pd.dropna`, as it exposes  implicitly missing rows.  The variable `columns` parameter can be a column name,  a list of column names,  or a pandas Index, Series, or DataFrame.  If a pandas Index, Series, or DataFrame is passed, it should  have a name or names that exist in `df`.  A callable can also be passed - the callable should evaluate  to a pandas Index, Series, or DataFrame,  and the names of the pandas object should exist in `df`.  A dictionary can also be passed -  the values of the dictionary should be  either be a 1D array  or a callable that evaluates to a  1D array,  while the keys of the dictionary  should exist in `df`.  User should ensure that the pandas object is unique and/or sorted  - no checks are done to ensure uniqueness and/or sortedness.  If `by` is present, the DataFrame is *completed* per group.  `by` should be a column name, or a list of column names.  Examples:  >>> import pandas as pd  >>> import janitor  >>> import numpy as np  >>> df = pd.DataFrame(  ... {  ... "Year": [1999, 2000, 2004, 1999, 2004],  ... "Taxon": [  ... "Saccharina",  ... "Saccharina",  ... "Saccharina",  ... "Agarum",  ... "Agarum",  ... ],  ... "Abundance": [4, 5, 2, 1, 8],  ... }  ... )  >>> df  Year Taxon Abundance  0 1999 Saccharina 4  1 2000 Saccharina 5  2 2004 Saccharina 2  3 1999 Agarum 1  4 2004 Agarum 8  Expose missing pairings of `Year` and `Taxon`:  >>> df.complete("Year", "Taxon", sort=True)  Year Taxon Abundance  0 1999 Agarum 1.0  1 1999 Saccharina 4.0  2 2000 Agarum NaN  3 2000 Saccharina 5.0  4 2004 Agarum 8.0  5 2004 Saccharina 2.0  Expose missing years from 1999 to 2004:  >>> index = pd.Index(range(1999, 2005), name="Year")  >>> df.complete(index, "Taxon", sort=True)  Year Taxon Abundance  0 1999 Agarum 1.0  1 1999 Saccharina 4.0  2 2000 Agarum NaN  3 2000 Saccharina 5.0  4 2001 Agarum NaN  5 2001 Saccharina NaN  6 2002 Agarum NaN  7 2002 Saccharina NaN  8 2003 Agarum NaN  9 2003 Saccharina NaN  10 2004 Agarum 8.0  11 2004 Saccharina 2.0  A dictionary can be used as well:  >>> dictionary = {"Year": range(1999, 2005)}  >>> df.complete(dictionary, "Taxon", sort=True)  Year Taxon Abundance  0 1999 Agarum 1.0  1 1999 Saccharina 4.0  2 2000 Agarum NaN  3 2000 Saccharina 5.0  4 2001 Agarum NaN  5 2001 Saccharina NaN  6 2002 Agarum NaN  7 2002 Saccharina NaN  8 2003 Agarum NaN  9 2003 Saccharina NaN  10 2004 Agarum 8.0  11 2004 Saccharina 2.0  Fill missing values:  >>> df = pd.DataFrame(  ... dict(  ... group=(1, 2, 1, 2),  ... item_id=(1, 2, 2, 3),  ... item_name=("a", "a", "b", "b"),  ... value1=(1, np.nan, 3, 4),  ... value2=range(4, 8),  ... )  ... )  >>> df  group item_id item_name value1 value2  0 1 1 a 1.0 4  1 2 2 a NaN 5  2 1 2 b 3.0 6  3 2 3 b 4.0 7  >>> df.complete(  ... "group",  ... ["item_id", "item_name"],  ... fill_value={"value1": 0, "value2": 99},  ... sort=True,  ... )  group item_id item_name value1 value2  0 1 1 a 1.0 4.0  1 1 2 a 0.0 99.0  2 1 2 b 3.0 6.0  3 1 3 b 0.0 99.0  4 2 1 a 0.0 99.0  5 2 2 a 0.0 5.0  6 2 2 b 0.0 99.0  7 2 3 b 4.0 7.0  Limit the fill to only implicit missing values  by setting explicit to `False`:  >>> df.complete(  ... "group",  ... ["item_id", "item_name"],  ... 
fill_value={"value1": 0, "value2": 99},  ... explicit=False,  ... sort=True,  ... )  group item_id item_name value1 value2  0 1 1 a 1.0 4.0  1 1 2 a 0.0 99.0  2 1 2 b 3.0 6.0  3 1 3 b 0.0 99.0  4 2 1 a 0.0 99.0  5 2 2 a NaN 5.0  6 2 2 b 0.0 99.0  7 2 3 b 4.0 7.0  Expose missing rows per group, using a callable:  >>> df = pd.DataFrame(  ... {  ... "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],  ... "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],  ... "value": [1, 3, 1, 2, 3, 2, 5],  ... }  ... )  >>> df  state year value  0 CA 2010 1  1 CA 2013 3  2 HI 2010 1  3 HI 2012 2  4 HI 2016 3  5 NY 2009 2  6 NY 2013 5  >>> def new_year_values(df):  ... return pd.RangeIndex(  ... start=df.year.min(), stop=df.year.max() + 1, name="year"  ... )  >>> df.complete(new_year_values, by="state", sort=True)  state year value  0 CA 2010 1.0  1 CA 2011 NaN  2 CA 2012 NaN  3 CA 2013 3.0  4 HI 2010 1.0  5 HI 2011 NaN  6 HI 2012 2.0  7 HI 2013 NaN  8 HI 2014 NaN  9 HI 2015 NaN  10 HI 2016 3.0  11 NY 2009 2.0  12 NY 2010 NaN  13 NY 2011 NaN  14 NY 2012 NaN  15 NY 2013 5.0  Args:  df: A pandas DataFrame.  *columns: This refers to the columns to be completed.  It could be a column name,  a list of column names,  or a pandas Index, Series, or DataFrame.  It can also be a callable that gets evaluated  to a pandas Index, Series, or DataFrame.  It can also be a dictionary,  where the values are either a 1D array  or a callable that evaluates to a  1D array,  while the keys of the dictionary  should exist in `df`.  sort: Sort DataFrame based on *columns.  by: Label or list of labels to group by.  The explicit missing rows are returned per group.  fill_value: Scalar value to use instead of NaN  for missing combinations. A dictionary, mapping columns names  to a scalar value is also accepted.  explicit: Determines if only implicitly missing values  should be filled (`False`), or all nulls existing in the dataframe  (`True`). `explicit` is applicable only  if `fill_value` is not `None`.  Returns:  A pandas DataFrame with explicit missing rows, if any.  """ # noqa: E501 if not columns: return df return _computations_complete(df, columns, sort, by, fill_value, explicit) 

concatenate_columns

concatenate_columns(df, column_names, new_column_name, sep='-', ignore_empty=True)

Concatenates the set of columns into a single column.

Used to quickly generate an index based on a group of columns.

This method mutates the original DataFrame.

Examples:

Concatenate two columns row-wise.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")}) >>> df  a b 0 1 x 1 3 y 2 5 z >>> df.concatenate_columns( ...  column_names=["a", "b"], ...  new_column_name="m", ... )  a b m 0 1 x 1-x 1 3 y 3-y 2 5 z 5-z 
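
The separator is configurable through sep. A minimal sketch, reusing the same toy frame, that joins with an underscore instead of the default hyphen (the output noted in the comment follows from the documented behaviour, not a captured doctest):

```python
import pandas as pd
import janitor  # noqa: F401  (registers the DataFrame methods)

df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})

# Join the two columns with "_" instead of the default "-".
out = df.concatenate_columns(
    column_names=["a", "b"],
    new_column_name="m",
    sep="_",
)
# out["m"] holds "1_x", "3_y", "5_z"
```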

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names List[Hashable]

A list of columns to concatenate together.

required
new_column_name Hashable

The name of the new column.

required
sep str

The separator between each column's data.

'-'
ignore_empty bool

Ignore null values if they exist.

True

Raises:

Type Description
JanitorError

If fewer than two columns are provided within column_names.

Returns:

Type Description
DataFrame

A pandas DataFrame with concatenated columns.

Source code in janitor/functions/concatenate_columns.py
@pf.register_dataframe_method @deprecated_alias(columns="column_names") def concatenate_columns( df: pd.DataFrame, column_names: List[Hashable], new_column_name: Hashable, sep: str = "-", ignore_empty: bool = True, ) -> pd.DataFrame:  """Concatenates the set of columns into a single column.  Used to quickly generate an index based on a group of columns.  This method mutates the original DataFrame.  Examples:  Concatenate two columns row-wise.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})  >>> df  a b  0 1 x  1 3 y  2 5 z  >>> df.concatenate_columns(  ... column_names=["a", "b"],  ... new_column_name="m",  ... )  a b m  0 1 x 1-x  1 3 y 3-y  2 5 z 5-z  Args:  df: A pandas DataFrame.  column_names: A list of columns to concatenate together.  new_column_name: The name of the new column.  sep: The separator between each column's data.  ignore_empty: Ignore null values if exists.  Raises:  JanitorError: If at least two columns are not provided  within `column_names`.  Returns:  A pandas DataFrame with concatenated columns.  """ if len(column_names) < 2: raise JanitorError("At least two columns must be specified") df[new_column_name] = df[column_names].astype(str).fillna("").agg(sep.join, axis=1) if ignore_empty: def remove_empty_string(x):  """Ignore empty/null string values from the concatenated output.""" return sep.join(x for x in x.split(sep) if x) df[new_column_name] = df[new_column_name].transform(remove_empty_string) return df 

conditional_join

conditional_join(df, right, *conditions, how='inner', df_columns=slice(None), right_columns=slice(None), keep='all', use_numba=False, indicator=False, force=False)

The conditional_join function operates similarly to pd.merge, but supports joins on inequality operators, or a combination of equi and non-equi joins.

Joins solely on equality are not supported.

If the join is solely on equality, the pd.merge function covers that; if you are interested in nearest joins, asof joins, or rolling joins, then pd.merge_asof covers that. There is also pandas' IntervalIndex, which is efficient for range joins, especially if the intervals do not overlap.

Column selection in df_columns and right_columns is possible using the select syntax.

Performance might be improved by setting use_numba to True. This can be handy for equi joins that have lots of duplicated keys, as well as for non-equi joins where there are more than two join conditions, or where there is significant overlap in the range join columns. This assumes that numba is installed.

A noticeable performance improvement can be observed for range joins if both join columns from the right dataframe are monotonically increasing.

This function returns rows, if any, where values from df meet the condition(s) for values from right. The conditions are passed in as a variable argument of tuples, where the tuple is of the form (left_on, right_on, op); left_on is the column label from df, right_on is the column label from right, while op is the operator.

For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

In some scenarios there might be performance gains if the less-than join, the greater-than join, or the range condition is executed before the equi join - pass force=True to force this.

The operator can be any of ==, !=, <=, <, >=, >.

There is no optimisation for the != operator.

The join is done only on the columns (not on the index).

For non-equi joins, only numeric, timedelta and date columns are supported.

inner, left, right and outer joins are supported.

If the columns from df and right have nothing in common, a single index column is returned; else, a MultiIndex column is returned.

Examples:

>>> import pandas as pd >>> import janitor >>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]}) >>> df2 = pd.DataFrame( ...  { ...  "value_2A": [0, 3, 7, 12, 0, 2, 3, 1], ...  "value_2B": [1, 5, 9, 15, 1, 4, 6, 3], ...  } ... ) >>> df1  value_1 0 2 1 5 2 7 3 1 4 3 5 4 >>> df2  value_2A value_2B 0 0 1 1 3 5 2 7 9 3 12 15 4 0 1 5 2 4 6 3 6 7 1 3 
>>> df1.conditional_join( ...  df2, ("value_1", "value_2A", ">"), ("value_1", "value_2B", "<") ... )  value_1 value_2A value_2B 0 2 1 3 1 5 3 6 2 3 2 4 3 4 3 5 4 4 3 6 

Select specific columns, after the join:

>>> df1.conditional_join( ...  df2, ...  ("value_1", "value_2A", ">"), ...  ("value_1", "value_2B", "<"), ...  right_columns="value_2B", ...  how="left", ... )  value_1 value_2B 0 2 3.0 1 5 6.0 2 3 4.0 3 4 5.0 4 4 6.0 5 7 NaN 6 1 NaN 

Rename columns, before the join:

>>> ( ...  df1.rename(columns={"value_1": "left_column"}).conditional_join( ...  df2, ...  ("left_column", "value_2A", ">"), ...  ("left_column", "value_2B", "<"), ...  right_columns="value_2B", ...  how="outer", ...  ) ... )  left_column value_2B 0 2.0 3.0 1 5.0 6.0 2 3.0 4.0 3 4.0 5.0 4 4.0 6.0 5 7.0 NaN 6 1.0 NaN 7 NaN 1.0 8 NaN 9.0 9 NaN 15.0 10 NaN 1.0 

Get the first match:

>>> df1.conditional_join( ...  df2, ...  ("value_1", "value_2A", ">"), ...  ("value_1", "value_2B", "<"), ...  keep="first", ... )  value_1 value_2A value_2B 0 2 1 3 1 5 3 6 2 3 2 4 3 4 3 5 

Get the last match:

>>> df1.conditional_join( ...  df2, ...  ("value_1", "value_2A", ">"), ...  ("value_1", "value_2B", "<"), ...  keep="last", ... )  value_1 value_2A value_2B 0 2 1 3 1 5 3 6 2 3 2 4 3 4 3 6 

Add an indicator column:

>>> df1.conditional_join( ...  df2, ...  ("value_1", "value_2A", ">"), ...  ("value_1", "value_2B", "<"), ...  how="outer", ...  indicator=True, ... )  value_1 value_2A value_2B _merge 0 2.0 1.0 3.0 both 1 5.0 3.0 6.0 both 2 3.0 2.0 4.0 both 3 4.0 3.0 5.0 both 4 4.0 3.0 6.0 both 5 7.0 NaN NaN left_only 6 1.0 NaN NaN left_only 7 NaN 0.0 1.0 right_only 8 NaN 7.0 9.0 right_only 9 NaN 12.0 15.0 right_only 10 NaN 0.0 1.0 right_only 
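
None of the examples above combines an equi condition with a non-equi condition, so here is a minimal sketch with hypothetical frames (left, right, and the column names are illustrative only; the matching row noted in the comment follows from the stated conditions):

```python
import pandas as pd
import janitor  # noqa: F401

left = pd.DataFrame({"id": [1, 1, 2], "value": [2, 5, 7]})
right = pd.DataFrame({"id_r": [1, 2, 2], "lower": [3, 1, 6]})

# One equi condition plus one non-equi condition, combined with logical AND:
out = left.conditional_join(
    right,
    ("id", "id_r", "=="),
    ("value", "lower", "<"),
)
# Only pairings where id == id_r AND value < lower survive;
# with this data that is the single row (id=1, value=2, id_r=1, lower=3).
```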

Version Changed

  • 0.24.0
    • Added df_columns, right_columns, keep and use_numba parameters.
  • 0.24.1
    • Added indicator parameter.
  • 0.25.0
    • col class supported.
    • Outer join supported. sort_by_appearance deprecated.
    • Numba support for equi join
  • 0.27.0
    • Added support for timedelta dtype.
  • 0.28.0
    • col class deprecated.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
right Union[DataFrame, Series]

Named Series or DataFrame to join to.

required
conditions Any

Variable argument of tuple(s) of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, while op is the operator. The col class is also supported. The operator can be any of ==, !=, <=, <, >=, >. For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

()
how Literal['inner', 'left', 'right', 'outer']

Indicates the type of join to be performed. It can be one of inner, left, right or outer.

'inner'
df_columns Optional[Any]

Columns to select from df in the final output dataframe. Column selection is based on the select syntax.

slice(None)
right_columns Optional[Any]

Columns to select from right in the final output dataframe. Column selection is based on the select syntax.

slice(None)
use_numba bool

Use numba, if installed, to accelerate the computation.

False
keep Literal['first', 'last', 'all']

Choose whether to return the first match, last match or all matches.

'all'
indicator Optional[Union[bool, str]]

If True, adds a column to the output DataFrame called _merge with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of left_only for observations whose merge key only appears in the left DataFrame, right_only for observations whose merge key only appears in the right DataFrame, and both if the observation’s merge key is found in both DataFrames.

False
force bool

If True, force the non-equi join conditions to execute before the equi join.

False

Returns:

Type Description
DataFrame

A pandas DataFrame of the two merged Pandas objects.

Source code in janitor/functions/conditional_join.py
@pf.register_dataframe_method def conditional_join( df: pd.DataFrame, right: Union[pd.DataFrame, pd.Series], *conditions: Any, how: Literal["inner", "left", "right", "outer"] = "inner", df_columns: Optional[Any] = slice(None), right_columns: Optional[Any] = slice(None), keep: Literal["first", "last", "all"] = "all", use_numba: bool = False, indicator: Optional[Union[bool, str]] = False, force: bool = False, ) -> pd.DataFrame:  """The conditional_join function operates similarly to `pd.merge`,  but supports joins on inequality operators,  or a combination of equi and non-equi joins.  Joins solely on equality are not supported.  If the join is solely on equality, `pd.merge` function  covers that; if you are interested in nearest joins, asof joins,  or rolling joins, then `pd.merge_asof` covers that.  There is also pandas' IntervalIndex, which is efficient for range joins,  especially if the intervals do not overlap.  Column selection in `df_columns` and `right_columns` is possible using the  [`select`][janitor.functions.select.select] syntax.  Performance might be improved by setting `use_numba` to `True` -  this can be handy for equi joins that have lots of duplicated keys.  This can also be handy for non-equi joins, where there are more than  two join conditions,  or there is significant overlap in the range join columns.  This assumes that `numba` is installed.  Noticeable performance can be observed for range joins,  if both join columns from the right dataframe  are monotonically increasing.  This function returns rows, if any, where values from `df` meet the  condition(s) for values from `right`. The conditions are passed in  as a variable argument of tuples, where the tuple is of  the form `(left_on, right_on, op)`; `left_on` is the column  label from `df`, `right_on` is the column label from `right`,  while `op` is the operator.  For multiple conditions, the and(`&`)  operator is used to combine the results of the individual conditions.  In some scenarios there might be performance gains if the less than join,  or the greater than join condition, or the range condition  is executed before the equi join - pass `force=True` to force this.  The operator can be any of `==`, `!=`, `<=`, `<`, `>=`, `>`.  There is no optimisation for the `!=` operator.  The join is done only on the columns.  For non-equi joins, only numeric, timedelta and date columns are supported.  `inner`, `left`, `right` and `outer` joins are supported.  If the columns from `df` and `right` have nothing in common,  a single index column is returned; else, a MultiIndex column  is returned.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})  >>> df2 = pd.DataFrame(  ... {  ... "value_2A": [0, 3, 7, 12, 0, 2, 3, 1],  ... "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],  ... }  ... )  >>> df1  value_1  0 2  1 5  2 7  3 1  4 3  5 4  >>> df2  value_2A value_2B  0 0 1  1 3 5  2 7 9  3 12 15  4 0 1  5 2 4  6 3 6  7 1 3  >>> df1.conditional_join(  ... df2, ("value_1", "value_2A", ">"), ("value_1", "value_2B", "<")  ... )  value_1 value_2A value_2B  0 2 1 3  1 5 3 6  2 3 2 4  3 4 3 5  4 4 3 6  Select specific columns, after the join:  >>> df1.conditional_join(  ... df2,  ... ("value_1", "value_2A", ">"),  ... ("value_1", "value_2B", "<"),  ... right_columns="value_2B",  ... how="left",  ... )  value_1 value_2B  0 2 3.0  1 5 6.0  2 3 4.0  3 4 5.0  4 4 6.0  5 7 NaN  6 1 NaN  Rename columns, before the join:  >>> (  ... 
df1.rename(columns={"value_1": "left_column"}).conditional_join(  ... df2,  ... ("left_column", "value_2A", ">"),  ... ("left_column", "value_2B", "<"),  ... right_columns="value_2B",  ... how="outer",  ... )  ... )  left_column value_2B  0 2.0 3.0  1 5.0 6.0  2 3.0 4.0  3 4.0 5.0  4 4.0 6.0  5 7.0 NaN  6 1.0 NaN  7 NaN 1.0  8 NaN 9.0  9 NaN 15.0  10 NaN 1.0  Get the first match:  >>> df1.conditional_join(  ... df2,  ... ("value_1", "value_2A", ">"),  ... ("value_1", "value_2B", "<"),  ... keep="first",  ... )  value_1 value_2A value_2B  0 2 1 3  1 5 3 6  2 3 2 4  3 4 3 5  Get the last match:  >>> df1.conditional_join(  ... df2,  ... ("value_1", "value_2A", ">"),  ... ("value_1", "value_2B", "<"),  ... keep="last",  ... )  value_1 value_2A value_2B  0 2 1 3  1 5 3 6  2 3 2 4  3 4 3 6  Add an indicator column:  >>> df1.conditional_join(  ... df2,  ... ("value_1", "value_2A", ">"),  ... ("value_1", "value_2B", "<"),  ... how="outer",  ... indicator=True,  ... )  value_1 value_2A value_2B _merge  0 2.0 1.0 3.0 both  1 5.0 3.0 6.0 both  2 3.0 2.0 4.0 both  3 4.0 3.0 5.0 both  4 4.0 3.0 6.0 both  5 7.0 NaN NaN left_only  6 1.0 NaN NaN left_only  7 NaN 0.0 1.0 right_only  8 NaN 7.0 9.0 right_only  9 NaN 12.0 15.0 right_only  10 NaN 0.0 1.0 right_only  !!! abstract "Version Changed"  - 0.24.0  - Added `df_columns`, `right_columns`, `keep` and `use_numba` parameters.  - 0.24.1  - Added `indicator` parameter.  - 0.25.0  - `col` class supported.  - Outer join supported. `sort_by_appearance` deprecated.  - Numba support for equi join  - 0.27.0  - Added support for timedelta dtype.  - 0.28.0  - `col` class deprecated.  Args:  df: A pandas DataFrame.  right: Named Series or DataFrame to join to.  conditions: Variable argument of tuple(s) of the form  `(left_on, right_on, op)`, where `left_on` is the column  label from `df`, `right_on` is the column label from `right`,  while `op` is the operator.  The `col` class is also supported. The operator can be any of  `==`, `!=`, `<=`, `<`, `>=`, `>`. For multiple conditions,  the and(`&`) operator is used to combine the results  of the individual conditions.  how: Indicates the type of join to be performed.  It can be one of `inner`, `left`, `right` or `outer`.  df_columns: Columns to select from `df` in the final output dataframe.  Column selection is based on the  [`select`][janitor.functions.select.select] syntax.  right_columns: Columns to select from `right` in the final output dataframe.  Column selection is based on the  [`select`][janitor.functions.select.select] syntax.  use_numba: Use numba, if installed, to accelerate the computation.  keep: Choose whether to return the first match, last match or all matches.  indicator: If `True`, adds a column to the output DataFrame  called `_merge` with information on the source of each row.  The column can be given a different name by providing a string argument.  The column will have a Categorical type with the value of `left_only`  for observations whose merge key only appears in the left DataFrame,  `right_only` for observations whose merge key  only appears in the right DataFrame, and `both` if the observation’s  merge key is found in both DataFrames.  force: If `True`, force the non-equi join conditions to execute before the equi join.  Returns:  A pandas DataFrame of the two merged Pandas objects.  
""" # noqa: E501 return _conditional_join_compute( df=df, right=right, conditions=conditions, how=how, df_columns=df_columns, right_columns=right_columns, keep=keep, use_numba=use_numba, indicator=indicator, force=force, ) 

get_join_indices(df, right, conditions, keep='all', use_numba=False, force=False)

Convenience function to return the matching indices from an inner join.

New in version 0.27.0

Version Changed

  • 0.29.0
    • Add support for ragged array indices.
  • 0.32.0
    • ragged array indices deprecated.
    • return indices as a dictionary
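
There is no usage example above, so here is a minimal sketch reusing the df1/df2 frames from the conditional_join examples. The import path follows the source location shown below; the key names of the returned dictionary are not spelled out here, so inspect the mapping on your installed version:

```python
import pandas as pd
from janitor.functions.conditional_join import get_join_indices

df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
df2 = pd.DataFrame(
    {
        "value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
        "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],
    }
)

# Compute only the matching row positions of the inner join,
# without materialising the joined frame:
indices = get_join_indices(
    df1,
    df2,
    [("value_1", "value_2A", ">"), ("value_1", "value_2B", "<")],
)
# `indices` maps each frame to an array of positional indices:
for key, positions in indices.items():
    print(key, positions)
```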

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
right Union[DataFrame, Series]

Named Series or DataFrame to join to.

required
conditions list[tuple[str]]

List of arguments of tuple(s) of the form (left_on, right_on, op), where left_on is the column label from df, right_on is the column label from right, while op is the operator. The operator can be any of ==, !=, <=, <, >=, >. For multiple conditions, the and(&) operator is used to combine the results of the individual conditions.

required
use_numba bool

Use numba, if installed, to accelerate the computation.

False
keep Literal['first', 'last', 'all']

Choose whether to return the first match, last match or all matches.

'all'
force bool

If True, force the non-equi join conditions to execute before the equi join.

False

Returns:

Type Description
dict

A dictionary of indices for the rows in the dataframes that match.

Source code in janitor/functions/conditional_join.py
@deprecated_kwargs("return_ragged_arrays") def get_join_indices( df: pd.DataFrame, right: Union[pd.DataFrame, pd.Series], conditions: list[tuple[str]], keep: Literal["first", "last", "all"] = "all", use_numba: bool = False, force: bool = False, ) -> dict:  """Convenience function to return the matching indices from an inner join.  !!! info "New in version 0.27.0"  !!! abstract "Version Changed"  - 0.29.0  - Add support for ragged array indices.  - 0.32.0  - ragged array indices deprecated.  - return indices as a dictionary  Args:  df: A pandas DataFrame.  right: Named Series or DataFrame to join to.  conditions: List of arguments of tuple(s) of the form  `(left_on, right_on, op)`, where `left_on` is the column  label from `df`, `right_on` is the column label from `right`,  while `op` is the operator.  The operator can be any of  `==`, `!=`, `<=`, `<`, `>=`, `>`. For multiple conditions,  the and(`&`) operator is used to combine the results  of the individual conditions.  use_numba: Use numba, if installed, to accelerate the computation.  keep: Choose whether to return the first match, last match or all matches.  force: If `True`, force the non-equi join conditions  to execute before the equi join.  Returns:  A dictionary of indices for the rows in the dataframes that match.  """ return _conditional_join_compute( df=df, right=right, conditions=conditions, how="inner", df_columns=None, right_columns=None, keep=keep, use_numba=use_numba, indicator=False, force=force, return_matching_indices=True, ) 

convert_date

convert_excel_date(df, column_names)

Convert Excel's serial date format into Python datetime format.

This method does not mutate the original DataFrame.

Implementation is based on Stack Overflow.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"date": [39690, 39690, 37118]}) >>> df  date 0 39690 1 39690 2 37118 >>> df.convert_excel_date("date")  date 0 2008-08-30 1 2008-08-30 2 2001-08-15 
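
As the source below shows, this is a thin wrapper around pd.to_datetime with Excel's 1899-12-30 origin. A minimal sketch of the plain-pandas equivalent:

```python
import pandas as pd

df = pd.DataFrame({"date": [39690, 39690, 37118]})

# Excel serial dates count days from an 1899-12-30 origin
# (the offset accounts for Excel's fictitious 1900 leap day).
df = df.assign(date=pd.to_datetime(df["date"], unit="D", origin="1899-12-30"))
# 39690 -> 2008-08-30, 37118 -> 2001-08-15
```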

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[Hashable, list]

A column name, or a list of column names.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method @deprecated_alias(column="column_names") def convert_excel_date( df: pd.DataFrame, column_names: Union[Hashable, list] ) -> pd.DataFrame:  """Convert Excel's serial date format into Python datetime format.  This method does not mutate the original DataFrame.  Implementation is based on  [Stack Overflow](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas).  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"date": [39690, 39690, 37118]})  >>> df  date  0 39690  1 39690  2 37118  >>> df.convert_excel_date("date")  date  0 2008-08-30  1 2008-08-30  2 2001-08-15  Args:  df: A pandas DataFrame.  column_names: A column name, or a list of column names.  Returns:  A pandas DataFrame with corrected dates.  """ # noqa: E501 if not isinstance(column_names, list): column_names = [column_names] # https://stackoverflow.com/a/65460255/7175713 dictionary = { column_name: pd.to_datetime(df[column_name], unit="D", origin="1899-12-30") for column_name in column_names } return df.assign(**dictionary) 

convert_matlab_date(df, column_names)

Convert Matlab's serial date number into Python datetime format.

Implementation is based on Stack Overflow.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]}) >>> df  date 0 737125.000000 1 737124.815863 2 737124.498500 3 737124.000000 >>> df.convert_matlab_date("date")  date 0 2018-03-06 00:00:00.000000000 1 2018-03-05 19:34:50.563199671 2 2018-03-05 11:57:50.399998876 3 2018-03-05 00:00:00.000000000 
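
As the source below shows, the conversion subtracts 719529 days, MATLAB's serial date number for 1970-01-01, before delegating to pd.to_datetime. A minimal sketch of the plain-pandas equivalent:

```python
import pandas as pd

df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})

# MATLAB serial date numbers count days from year 0;
# 719529 is the serial number of the Unix epoch (1970-01-01).
df = df.assign(date=pd.to_datetime(df["date"] - 719529, unit="D"))
```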

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[Hashable, list]

A column name, or a list of column names.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method @deprecated_alias(column="column_names") def convert_matlab_date( df: pd.DataFrame, column_names: Union[Hashable, list] ) -> pd.DataFrame:  """Convert Matlab's serial date number into Python datetime format.  Implementation is based on  [Stack Overflow](https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python).  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})  >>> df  date  0 737125.000000  1 737124.815863  2 737124.498500  3 737124.000000  >>> df.convert_matlab_date("date")  date  0 2018-03-06 00:00:00.000000000  1 2018-03-05 19:34:50.563199671  2 2018-03-05 11:57:50.399998876  3 2018-03-05 00:00:00.000000000  Args:  df: A pandas DataFrame.  column_names: A column name, or a list of column names.  Returns:  A pandas DataFrame with corrected dates.  """ # noqa: E501 # https://stackoverflow.com/a/49135037/7175713 if not isinstance(column_names, list): column_names = [column_names] dictionary = { column_name: pd.to_datetime(df[column_name] - 719529, unit="D") for column_name in column_names } return df.assign(**dictionary) 

convert_unix_date(df, column_name)

Convert unix epoch time into Python datetime format.

Note that this ignores the local timezone and converts all timestamps to naive datetimes based on UTC!

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.to_datetime instead.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]}) >>> df  date 0 1651510462 1 53394822 2 1126233195 >>> df.convert_unix_date("date")  date 0 2022-05-02 16:54:22 1 1971-09-10 23:53:42 2 2005-09-09 02:33:15 
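
Because this function is slated for deprecation, the recommended replacement is the pd.to_datetime call that the source below wraps. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})

# Seconds since the Unix epoch, converted to naive (UTC-based) datetimes:
df = df.assign(date=pd.to_datetime(df["date"], unit="s"))
# If the values are actually milliseconds, use unit="ms" instead.
```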

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

A column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with corrected dates.

Source code in janitor/functions/convert_date.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.to_datetime` instead." ) ) @deprecated_alias(column="column_name") def convert_unix_date(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:  """Convert unix epoch time into Python datetime format.  Note that this ignores local tz and convert all timestamps to naive  datetime based on UTC!  This method mutates the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.to_datetime` instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})  >>> df  date  0 1651510462  1 53394822  2 1126233195  >>> df.convert_unix_date("date")  date  0 2022-05-02 16:54:22  1 1971-09-10 23:53:42  2 2005-09-09 02:33:15  Args:  df: A pandas DataFrame.  column_name: A column name.  Returns:  A pandas DataFrame with corrected dates.  """ try: df[column_name] = pd.to_datetime(df[column_name], unit="s") except OutOfBoundsDatetime: # Indicates time is in milliseconds. df[column_name] = pd.to_datetime(df[column_name], unit="ms") return df 

count_cumulative_unique

Implementation of count_cumulative_unique.

count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True)

Generates a running total of cumulative unique values in a given column.

A new column will be created containing a running count of unique values in the specified column. If case_sensitive is True, then letter case matters (i.e., a != A); otherwise, case is ignored.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "letters": list("aabABb"), ...  "numbers": range(4, 10), ...  } ... ) >>> df  letters numbers 0 a 4 1 a 5 2 b 6 3 A 7 4 B 8 5 b 9 >>> df.count_cumulative_unique( ...  column_name="letters", ...  dest_column_name="letters_unique_count", ... )  letters numbers letters_unique_count 0 a 4 1 1 a 5 1 2 b 6 2 3 A 7 3 4 B 8 4 5 b 9 4 

Cumulative counts, ignoring casing.

>>> df.count_cumulative_unique( ...  column_name="letters", ...  dest_column_name="letters_unique_count", ...  case_sensitive=False, ... )  letters numbers letters_unique_count 0 a 4 1 1 a 5 1 2 b 6 2 3 A 7 2 4 B 8 2 5 b 9 2 
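
The counting relies on a groupby/cumcount trick, as the source below shows. A minimal sketch of the same idea in plain pandas, assuming a string column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"letters": list("aabABb"), "numbers": range(4, 10)})

s = df["letters"].str.lower()  # drop .str.lower() for a case-sensitive count
# A value's first appearance gets cumcount 0; cumulatively counting those
# zeros yields the running number of unique values seen so far.
first_seen = s.groupby(s, sort=False).cumcount().to_numpy() == 0
df = df.assign(letters_unique_count=np.cumsum(first_seen))
# -> 1, 1, 2, 2, 2, 2 (case-insensitive)
```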

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Name of the column containing values from which a running count of unique values will be created.

required
dest_column_name str

The name of the new column containing the cumulative count of unique values that will be created.

required
case_sensitive bool

Whether uppercase and lowercase letters are treated as distinct. If False, case is ignored; this is only valid with string-like columns.

True

Raises:

Type Description
TypeError

If case_sensitive is False when counting a non-string column_name.

Returns:

Type Description
DataFrame

A pandas DataFrame with a new column containing a cumulative count of unique values from another column.

Source code in janitor/functions/count_cumulative_unique.py
@pf.register_dataframe_method def count_cumulative_unique( df: pd.DataFrame, column_name: Hashable, dest_column_name: str, case_sensitive: bool = True, ) -> pd.DataFrame:  """Generates a running total of cumulative unique values in a given column.  A new column will be created containing a running  count of unique values in the specified column.  If `case_sensitive` is `True`, then the case of  any letters will matter (i.e., `a != A`);  otherwise, the case of any letters will not matter.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "letters": list("aabABb"),  ... "numbers": range(4, 10),  ... }  ... )  >>> df  letters numbers  0 a 4  1 a 5  2 b 6  3 A 7  4 B 8  5 b 9  >>> df.count_cumulative_unique(  ... column_name="letters",  ... dest_column_name="letters_unique_count",  ... )  letters numbers letters_unique_count  0 a 4 1  1 a 5 1  2 b 6 2  3 A 7 3  4 B 8 4  5 b 9 4  Cumulative counts, ignoring casing.  >>> df.count_cumulative_unique(  ... column_name="letters",  ... dest_column_name="letters_unique_count",  ... case_sensitive=False,  ... )  letters numbers letters_unique_count  0 a 4 1  1 a 5 1  2 b 6 2  3 A 7 2  4 B 8 2  5 b 9 2  Args:  df: A pandas DataFrame.  column_name: Name of the column containing values from which a  running count of unique values will be created.  dest_column_name: The name of the new column containing the  cumulative count of unique values that will be created.  case_sensitive: Whether or not uppercase and lowercase letters  will be considered equal. Only valid with string-like columns.  Raises:  TypeError: If `case_sensitive` is False when counting a non-string  `column_name`.  Returns:  A pandas DataFrame with a new column containing a cumulative  count of unique values from another column.  """ check_column(df, column_name) check_column(df, dest_column_name, present=False) counter = df[column_name] if not case_sensitive: try: # Make it so that the the same uppercase and lowercase # letter are treated as one unique value counter = counter.str.lower() except (AttributeError, TypeError) as e: # AttributeError is raised by pandas when .str is used on # non-string types, e.g. int. # TypeError is raised by pandas when .str.lower is used on a # forbidden string type, e.g. bytes. raise TypeError( "case_sensitive=False can only be used with a string-like " f"type. Column {column_name} is {counter.dtype} type." ) from e counter = counter.groupby(counter, sort=False).cumcount().to_numpy(copy=False) counter = np.cumsum(counter == 0) return df.assign(**{dest_column_name: counter}) 

currency_column_to_numeric

currency_column_to_numeric(df, column_name, cleaning_style=None, cast_non_numeric=None, fill_all_non_numeric=None, remove_non_numeric=False)

Convert currency column to numeric.

This method does not mutate the original DataFrame.

This method allows one to take a column containing currency values, inadvertently imported as a string, and cast it as a float. This is usually the case when reading CSV files that were modified in Excel. Empty strings (i.e. '') are retained as NaN values.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a_col": [" 24.56", "-", "(12.12)", "1,000,000"], ...  "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"], ...  } ... ) >>> df  a_col d_col 0 24.56 1 - foo 2 (12.12) 1.23 dollars 3 1,000,000 -1,000 yen 

The default cleaning style.

>>> df.currency_column_to_numeric("d_col")  a_col d_col 0 24.56 NaN 1 - NaN 2 (12.12) 1.23 3 1,000,000 -1000.00 

The accounting cleaning style.

>>> df.currency_column_to_numeric( ...  "a_col", cleaning_style="accounting" ... )  a_col d_col 0 24.56 1 0.00 foo 2 -12.12 1.23 dollars 3 1000000.00 -1,000 yen 

Valid cleaning styles are:

  • None: Default cleaning is applied. Empty strings are always retained as NaN. Numbers, -, . are extracted and the resulting string is cast to a float.
  • 'accounting': Replaces numbers in parentheses with negatives, removes commas.
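
The cast_non_numeric and fill_all_non_numeric arguments described below only take effect in the default cleaning style. A minimal sketch with a hypothetical column; the values in the comments follow the documented behaviour rather than a captured output:

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"amount": ["1,000", "REORDER", "N/A"]})

out = df.currency_column_to_numeric(
    "amount",
    cast_non_numeric={"REORDER": 0},  # map this specific string to 0
    fill_all_non_numeric=-1,          # any other non-numeric string becomes -1
)
# Per the documented behaviour: "1,000" -> 1000.0, "REORDER" -> 0.0, "N/A" -> -1.0
```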

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame.

required
column_name str

The column containing currency values to modify.

required
cleaning_style Optional[str]

What style of cleaning to perform.

None
cast_non_numeric Optional[dict]

A dict of how to coerce certain strings to numeric type. For example, if there are values of 'REORDER' in the DataFrame, {'REORDER': 0} will cast all instances of 'REORDER' to 0. Only takes effect in the default cleaning style.

None
fill_all_non_numeric Optional[Union[float, int]]

Similar to cast_non_numeric, but fills all strings to the same value. For example, fill_all_non_numeric=1 will make everything that doesn't coerce to a currency 1. Only takes effect in the default cleaning style.

None
remove_non_numeric bool

If set to True, rows of df that contain non-numeric values in the column_name column will be removed. Only takes effect in the default cleaning style.

False

Raises:

Type Description
ValueError

If cleaning_style is not one of the accepted styles.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/currency_column_to_numeric.py
@pf.register_dataframe_method @deprecated_alias(col_name="column_name", type="cleaning_style") def currency_column_to_numeric( df: pd.DataFrame, column_name: str, cleaning_style: Optional[str] = None, cast_non_numeric: Optional[dict] = None, fill_all_non_numeric: Optional[Union[float, int]] = None, remove_non_numeric: bool = False, ) -> pd.DataFrame:  """Convert currency column to numeric.  This method does not mutate the original DataFrame.  This method allows one to take a column containing currency values,  inadvertently imported as a string, and cast it as a float. This is  usually the case when reading CSV files that were modified in Excel.  Empty strings (i.e. `''`) are retained as `NaN` values.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a_col": [" 24.56", "-", "(12.12)", "1,000,000"],  ... "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"],  ... }  ... )  >>> df # doctest: +NORMALIZE_WHITESPACE  a_col d_col  0 24.56  1 - foo  2 (12.12) 1.23 dollars  3 1,000,000 -1,000 yen  The default cleaning style.  >>> df.currency_column_to_numeric("d_col")  a_col d_col  0 24.56 NaN  1 - NaN  2 (12.12) 1.23  3 1,000,000 -1000.00  The accounting cleaning style.  >>> df.currency_column_to_numeric(  ... "a_col", cleaning_style="accounting"  ... ) # doctest: +NORMALIZE_WHITESPACE  a_col d_col  0 24.56  1 0.00 foo  2 -12.12 1.23 dollars  3 1000000.00 -1,000 yen  Valid cleaning styles are:  - `None`: Default cleaning is applied. Empty strings are always retained as  `NaN`. Numbers, `-`, `.` are extracted and the resulting string  is cast to a float.  - `'accounting'`: Replaces numbers in parentheses with negatives, removes commas.  Args:  df: The pandas DataFrame.  column_name: The column containing currency values to modify.  cleaning_style: What style of cleaning to perform.  cast_non_numeric: A dict of how to coerce certain strings to numeric  type. For example, if there are values of 'REORDER' in the DataFrame,  `{'REORDER': 0}` will cast all instances of 'REORDER' to 0.  Only takes effect in the default cleaning style.  fill_all_non_numeric: Similar to `cast_non_numeric`, but fills all  strings to the same value. For example, `fill_all_non_numeric=1`, will  make everything that doesn't coerce to a currency `1`.  Only takes effect in the default cleaning style.  remove_non_numeric: If set to True, rows of `df` that contain  non-numeric values in the `column_name` column will be removed.  Only takes effect in the default cleaning style.  Raises:  ValueError: If `cleaning_style` is not one of the accepted styles.  Returns:  A pandas DataFrame.  """ # noqa: E501 check("column_name", column_name, [str]) check_column(df, column_name) column_series = df[column_name] if cleaning_style == "accounting": outcome = ( df[column_name] .str.strip() .str.replace(",", "", regex=False) .str.replace(")", "", regex=False) .str.replace("(", "-", regex=False) .replace({"-": 0.0}) .astype(float) ) return df.assign(**{column_name: outcome}) if cleaning_style is not None: raise ValueError( "`cleaning_style` is expected to be one of ('accounting', None). " f"Got {cleaning_style!r} instead." 
) if cast_non_numeric: check("cast_non_numeric", cast_non_numeric, [dict]) _make_cc_patrial = partial( _currency_column_to_numeric, cast_non_numeric=cast_non_numeric, ) column_series = column_series.apply(_make_cc_patrial) if remove_non_numeric: df = df.loc[column_series != "", :] # _replace_empty_string_with_none is applied here after the check on # remove_non_numeric since "" is our indicator that a string was coerced # in the original column column_series = _replace_empty_string_with_none(column_series) if fill_all_non_numeric is not None: check("fill_all_non_numeric", fill_all_non_numeric, [int, float]) column_series = column_series.fillna(fill_all_non_numeric) column_series = _replace_original_empty_string_with_none(column_series) df = df.assign(**{column_name: pd.to_numeric(column_series)}) return df 

deconcatenate_column

Implementation of deconcatenating columns.

deconcatenate_column(df, column_name, sep=None, new_column_names=None, autoname=None, preserve_position=False)

De-concatenates a single column into multiple columns.

The column to de-concatenate can be either a collection (list, tuple, ...) which can be separated out with pd.Series.tolist(), or a string to slice based on sep.

To determine this behaviour automatically, the first element in the column specified is inspected.

If it is a string, then sep must be specified. Otherwise, the function assumes that it is an iterable type (e.g. list or tuple), and will attempt to deconcatenate by splitting the list.

Given a column with string values, this is the inverse of the concatenate_columns function.

Used to quickly split columns out of a single column.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]}) >>> df  m 0 1-x 1 2-y 2 3-z >>> df.deconcatenate_column("m", sep="-", autoname="col")  m col1 col2 0 1-x 1 x 1 2-y 2 y 2 3-z 3 z 

The keyword argument preserve_position takes a True or False boolean that controls whether the new_column_names will take the original position of the to-be-deconcatenated column_name:

  • When preserve_position=False (default), df.columns change from [..., column_name, ...] to [..., column_name, ..., new_column_names]. In other words, the deconcatenated new columns are appended to the right of the original dataframe and the original column_name is NOT dropped.
  • When preserve_position=True, df.columns changes from [..., column_name, ...] to [..., new_column_names, ...]. In other words, the deconcatenated new columns will REPLACE the original column_name at its original position, and column_name itself is dropped.

The keyword argument autoname accepts a base string and then automatically creates numbered column names based off the base string. For example, if col is passed in as the argument to autoname, and 4 columns are created, then the resulting columns will be named col1, col2, col3, col4. Numbering is always 1-indexed, not 0-indexed, in order to make the column names human-friendly.

This method does not mutate the original DataFrame.
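
A minimal sketch combining new_column_names with preserve_position, reusing the toy frame from the example above; the resulting column layout in the comment follows the description of preserve_position given earlier:

```python
import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})

out = df.deconcatenate_column(
    "m",
    sep="-",
    new_column_names=["num", "letter"],
    preserve_position=True,
)
# With preserve_position=True the new columns replace "m" at its original
# position, so out.columns is ["num", "letter"] and "m" itself is dropped.
# The split values are strings: ("1", "x"), ("2", "y"), ("3", "z").
```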

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to split.

required
sep Optional[str]

The separator delimiting the column's data.

None
new_column_names Optional[Union[List[str], Tuple[str]]]

A list of new column names post-splitting.

None
autoname str

A base name for automatically naming the new columns. Takes precedence over new_column_names if both are provided.

None
preserve_position bool

Boolean for whether or not to preserve the original position of the column upon de-concatenation.

False

Raises:

Type Description
ValueError

If column_name is not present in the DataFrame.

ValueError

If sep is not provided and the column values are of type str.

ValueError

If neither new_column_names nor autoname is supplied.

JanitorError

If an incorrect number of names is provided within new_column_names.

Returns:

Type Description
DataFrame

A pandas DataFrame with a deconcatenated column.

Source code in janitor/functions/deconcatenate_column.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") def deconcatenate_column( df: pd.DataFrame, column_name: Hashable, sep: Optional[str] = None, new_column_names: Optional[Union[List[str], Tuple[str]]] = None, autoname: str = None, preserve_position: bool = False, ) -> pd.DataFrame:  """De-concatenates a single column into multiple columns.  The column to de-concatenate can be either a collection (list, tuple, ...)  which can be separated out with `pd.Series.tolist()`,  or a string to slice based on `sep`.  To determine this behaviour automatically,  the first element in the column specified is inspected.  If it is a string, then `sep` must be specified.  Else, the function assumes that it is an iterable type  (e.g. `list` or `tuple`),  and will attempt to deconcatenate by splitting the list.  Given a column with string values, this is the inverse of the  [`concatenate_columns`][janitor.functions.concatenate_columns.concatenate_columns]  function.  Used to quickly split columns out of a single column.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})  >>> df  m  0 1-x  1 2-y  2 3-z  >>> df.deconcatenate_column("m", sep="-", autoname="col")  m col1 col2  0 1-x 1 x  1 2-y 2 y  2 3-z 3 z  The keyword argument `preserve_position`  takes `True` or `False` boolean  that controls whether the `new_column_names`  will take the original position  of the to-be-deconcatenated `column_name`:  - When `preserve_position=False` (default), `df.columns` change from  `[..., column_name, ...]` to `[..., column_name, ..., new_column_names]`.  In other words, the deconcatenated new columns are appended to the right  of the original dataframe and the original `column_name` is NOT dropped.  - When `preserve_position=True`, `df.column` change from  `[..., column_name, ...]` to `[..., new_column_names, ...]`.  In other words, the deconcatenated new column will REPLACE the original  `column_name` at its original position, and `column_name` itself  is dropped.  The keyword argument `autoname` accepts a base string  and then automatically creates numbered column names  based off the base string.  For example, if `col` is passed in as the argument to `autoname`,  and 4 columns are created, then the resulting columns will be named  `col1, col2, col3, col4`.  Numbering is always 1-indexed, not 0-indexed,  in order to make the column names human-friendly.  This method does not mutate the original DataFrame.  Args:  df: A pandas DataFrame.  column_name: The column to split.  sep: The separator delimiting the column's data.  new_column_names: A list of new column names post-splitting.  autoname: A base name for automatically naming the new columns.  Takes precedence over `new_column_names` if both are provided.  preserve_position: Boolean for whether or not to preserve original  position of the column upon de-concatenation.  Raises:  ValueError: If `column_name` is not present in the DataFrame.  ValueError: If `sep` is not provided and the column values  are of type `str`.  ValueError: If either `new_column_names` or `autoname`  is not supplied.  JanitorError: If incorrect number of names is provided  within `new_column_names`.  Returns:  A pandas DataFrame with a deconcatenated column.  
""" # noqa: E501 if column_name not in df.columns: raise ValueError(f"column name {column_name} not present in DataFrame") if isinstance(df[column_name].iloc[0], str): if sep is None: raise ValueError( "`sep` must be specified if the column values are of type `str`." ) df_deconcat = df[column_name].str.split(sep, expand=True) else: df_deconcat = pd.DataFrame( df[column_name].to_list(), columns=new_column_names, index=df.index ) if new_column_names is None and autoname is None: raise ValueError("One of `new_column_names` or `autoname` must be supplied.") if autoname: new_column_names = [ f"{autoname}{i}" for i in range(1, df_deconcat.shape[1] + 1) ] if not len(new_column_names) == df_deconcat.shape[1]: raise JanitorError( f"You need to provide {len(df_deconcat.shape[1])} names " "to `new_column_names`" ) df_deconcat.columns = new_column_names df_new = pd.concat([df, df_deconcat], axis=1) if preserve_position: df_original = df.copy() cols = list(df_original.columns) index_original = cols.index(column_name) for i, col_new in enumerate(new_column_names): cols.insert(index_original + i, col_new) df_new = df_new.select(cols, axis="columns").drop(columns=column_name) return df_new 

drop_constant_columns

Implementation of drop_constant_columns.

drop_constant_columns(df)

Finds and drops the constant columns from a Pandas DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> data_dict = { ...  "a": [1, 1, 1], ...  "b": [1, 2, 3], ...  "c": [1, 1, 1], ...  "d": ["rabbit", "leopard", "lion"], ...  "e": ["Cambridge", "Shanghai", "Basel"], ... } >>> df = pd.DataFrame(data_dict) >>> df  a b c d e 0 1 1 1 rabbit Cambridge 1 1 2 1 leopard Shanghai 2 1 3 1 lion Basel >>> df.drop_constant_columns()  b d e 0 1 rabbit Cambridge 1 2 leopard Shanghai 2 3 lion Basel 
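
If you prefer plain pandas, the call above boils down to the one-liner used in the implementation (shown in the source further below):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": [1, 1, 1]})

# Keep only the columns whose number of unique values is not exactly 1.
non_constant = df.loc[:, df.nunique().ne(1)]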

Parameters:

Name Type Description Default
df DataFrame

Input Pandas DataFrame

required

Returns:

Type Description
DataFrame

The Pandas DataFrame with the constant columns dropped.

Source code in janitor/functions/drop_constant_columns.py
@pf.register_dataframe_method def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:  """Finds and drops the constant columns from a Pandas DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> data_dict = {  ... "a": [1, 1, 1],  ... "b": [1, 2, 3],  ... "c": [1, 1, 1],  ... "d": ["rabbit", "leopard", "lion"],  ... "e": ["Cambridge", "Shanghai", "Basel"],  ... }  >>> df = pd.DataFrame(data_dict)  >>> df  a b c d e  0 1 1 1 rabbit Cambridge  1 1 2 1 leopard Shanghai  2 1 3 1 lion Basel  >>> df.drop_constant_columns()  b d e  0 1 rabbit Cambridge  1 2 leopard Shanghai  2 3 lion Basel  Args:  df: Input Pandas DataFrame  Returns:  The Pandas DataFrame with the constant columns dropped.  """ return df.loc[:, df.nunique().ne(1)] 

drop_duplicate_columns

Implementation for drop_duplicate_columns.

drop_duplicate_columns(df, column_name, nth_index=0)

Remove a duplicated column specified by column_name.

Specifying nth_index=0 will remove the first of the duplicated columns, nth_index=1 will remove the second, and so on.

The corresponding tidyverse R idiom is: select(-<column_name>_<nth_index + 1>)

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": range(2, 5), ...  "b": range(3, 6), ...  "A": range(4, 7), ...  "a*": range(6, 9), ...  } ... ).clean_names(remove_special=True) >>> df  a b a a 0 2 3 4 6 1 3 4 5 7 2 4 5 6 8 >>> df.drop_duplicate_columns(column_name="a", nth_index=1)  a b a 0 2 3 6 1 3 4 7 2 4 5 8 

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame

required
column_name Hashable

Name of duplicated columns.

required
nth_index int

Among the duplicated columns, select the nth column to drop.

0

Returns:

Type Description
DataFrame

A pandas DataFrame

Source code in janitor/functions/drop_duplicate_columns.py
@pf.register_dataframe_method def drop_duplicate_columns( df: pd.DataFrame, column_name: Hashable, nth_index: int = 0 ) -> pd.DataFrame:  """Remove a duplicated column specified by `column_name`.  Specifying `nth_index=0` will remove the first column,  `nth_index=1` will remove the second column,  and so on and so forth.  The corresponding tidyverse R's library is:  `select(-<column_name>_<nth_index + 1>)`  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": range(2, 5),  ... "b": range(3, 6),  ... "A": range(4, 7),  ... "a*": range(6, 9),  ... }  ... ).clean_names(remove_special=True)  >>> df  a b a a  0 2 3 4 6  1 3 4 5 7  2 4 5 6 8  >>> df.drop_duplicate_columns(column_name="a", nth_index=1)  a b a  0 2 3 6  1 3 4 7  2 4 5 8  Args:  df: A pandas DataFrame  column_name: Name of duplicated columns.  nth_index: Among the duplicated columns,  select the nth column to drop.  Returns:  A pandas DataFrame  """ col_indexes = [ col_idx for col_idx, col_name in enumerate(df.columns) if col_name == column_name ] # Select the column to remove based on nth_index. removed_col_idx = col_indexes[nth_index] # Filter out columns except for the one to be removed. filtered_cols = [c_i for c_i, _ in enumerate(df.columns) if c_i != removed_col_idx] return df.iloc[:, filtered_cols] 

dropnotnull

Implementation source for dropnotnull.

dropnotnull(df, column_name)

Drop rows that do not have null values in the given column.

This method does not mutate the original DataFrame.

Examples:

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1.0, np.NaN, 3.0], "b": [None, "y", "z"]}) >>> df  a b 0 1.0 None 1 NaN y 2 3.0 z >>> df.dropnotnull("a")  a b 1 NaN y >>> df.dropnotnull("b")  a b 0 1.0 None 
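
For comparison, the plain-pandas equivalent of the calls above is a simple boolean mask, mirroring the one-line implementation shown in the source below:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [None, "y", "z"]})

# Keep only the rows where column "a" is missing.
only_null_a = df[df["a"].isna()]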

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column name to drop rows from.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with dropped rows.

Source code in janitor/functions/dropnotnull.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") def dropnotnull(df: pd.DataFrame, column_name: Hashable) -> pd.DataFrame:  """Drop rows that do *not* have null values in the given column.  This method does not mutate the original DataFrame.  Examples:  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1.0, np.NaN, 3.0], "b": [None, "y", "z"]})  >>> df  a b  0 1.0 None  1 NaN y  2 3.0 z  >>> df.dropnotnull("a")  a b  1 NaN y  >>> df.dropnotnull("b")  a b  0 1.0 None  Args:  df: A pandas DataFrame.  column_name: The column name to drop rows from.  Returns:  A pandas DataFrame with dropped rows.  """ return df[pd.isna(df[column_name])] 

encode_categorical

encode_categorical(df, column_names=None, **kwargs)

Encode the specified columns with Pandas' category dtype.

It is syntactic sugar around pd.Categorical.

This method does not mutate the original DataFrame.

Simply pass a string, or a sequence of column names to column_names; alternatively, you can pass kwargs, where the keys are the column names and the values can either be None, sort, appearance or a 1-D array-like object.

  • None: column is cast to an unordered categorical.
  • sort: column is cast to an ordered categorical, with the order defined by the sort-order of the categories.
  • appearance: column is cast to an ordered categorical, with the order defined by the order of appearance in the original column.
  • 1d-array-like object: column is cast to an ordered categorical, with the categories and order as specified in the input array.

column_names and kwargs parameters cannot be used at the same time.

Examples:

Using column_names

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "foo": ["b", "b", "a", "c", "b"], ...  "bar": range(4, 9), ...  } ... ) >>> df  foo bar 0 b 4 1 b 5 2 a 6 3 c 7 4 b 8 >>> df.dtypes foo object bar int64 dtype: object >>> enc_df = df.encode_categorical(column_names="foo") >>> enc_df.dtypes foo category bar int64 dtype: object >>> enc_df["foo"].cat.categories Index(['a', 'b', 'c'], dtype='object') >>> enc_df["foo"].cat.ordered False 

Using kwargs to specify an ordered categorical.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "foo": ["b", "b", "a", "c", "b"], ...  "bar": range(4, 9), ...  } ... ) >>> df.dtypes foo object bar int64 dtype: object >>> enc_df = df.encode_categorical(foo="appearance") >>> enc_df.dtypes foo category bar int64 dtype: object >>> enc_df["foo"].cat.categories Index(['b', 'a', 'c'], dtype='object') >>> enc_df["foo"].cat.ordered True 
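
The fourth option listed above, passing an explicit 1-D array-like of categories, is not covered by these examples; here is a minimal sketch, with the order ['c', 'b', 'a'] invented purely for illustration:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"foo": ["b", "b", "a", "c", "b"], "bar": range(4, 9)})

# Cast "foo" to an ordered categorical with an explicitly supplied category order.
enc_df = df.encode_categorical(foo=["c", "b", "a"])
# enc_df["foo"].cat.ordered should be True, with categories in the given order.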

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame object.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names.

None
**kwargs Any

A mapping from column name to either None, 'sort' or 'appearance', or a 1-D array. This is useful in creating categorical columns that are ordered, or if the user needs to explicitly specify the categories.

{}

Raises:

Type Description
ValueError

If both column_names and kwargs are provided.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/encode_categorical.py
@pf.register_dataframe_method @deprecated_alias(columns="column_names") def encode_categorical( df: pd.DataFrame, column_names: Union[str, Iterable[str], Hashable] = None, **kwargs: Any, ) -> pd.DataFrame:  """Encode the specified columns with Pandas' [category dtype][cat].  [cat]: http://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html  It is syntactic sugar around `pd.Categorical`.  This method does not mutate the original DataFrame.  Simply pass a string, or a sequence of column names to `column_names`;  alternatively, you can pass kwargs, where the keys are the column names  and the values can either be None, `sort`, `appearance`  or a 1-D array-like object.  - None: column is cast to an unordered categorical.  - `sort`: column is cast to an ordered categorical,  with the order defined by the sort-order of the categories.  - `appearance`: column is cast to an ordered categorical,  with the order defined by the order of appearance  in the original column.  - 1d-array-like object: column is cast to an ordered categorical,  with the categories and order as specified  in the input array.  `column_names` and `kwargs` parameters cannot be used at the same time.  Examples:  Using `column_names`  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "foo": ["b", "b", "a", "c", "b"],  ... "bar": range(4, 9),  ... }  ... )  >>> df  foo bar  0 b 4  1 b 5  2 a 6  3 c 7  4 b 8  >>> df.dtypes  foo object  bar int64  dtype: object  >>> enc_df = df.encode_categorical(column_names="foo")  >>> enc_df.dtypes  foo category  bar int64  dtype: object  >>> enc_df["foo"].cat.categories  Index(['a', 'b', 'c'], dtype='object')  >>> enc_df["foo"].cat.ordered  False  Using `kwargs` to specify an ordered categorical.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "foo": ["b", "b", "a", "c", "b"],  ... "bar": range(4, 9),  ... }  ... )  >>> df.dtypes  foo object  bar int64  dtype: object  >>> enc_df = df.encode_categorical(foo="appearance")  >>> enc_df.dtypes  foo category  bar int64  dtype: object  >>> enc_df["foo"].cat.categories  Index(['b', 'a', 'c'], dtype='object')  >>> enc_df["foo"].cat.ordered  True  Args:  df: A pandas DataFrame object.  column_names: A column name or an iterable (list or tuple)  of column names.  **kwargs: A mapping from column name to either `None`,  `'sort'` or `'appearance'`, or a 1-D array. This is useful  in creating categorical columns that are ordered, or  if the user needs to explicitly specify the categories.  Raises:  ValueError: If both `column_names` and `kwargs` are provided.  Returns:  A pandas DataFrame.  """ # noqa: E501 if all((column_names, kwargs)): raise ValueError("Only one of `column_names` or `kwargs` can be provided.") # column_names deal with only category dtype (unordered) # kwargs takes care of scenarios where user wants an ordered category # or user supplies specific categories to create the categorical if column_names is not None: column_names = get_index_labels([column_names], df, axis="columns") dtypes = {col: "category" for col in column_names} return df.astype(dtypes) return _computations_as_categorical(df, **kwargs) 

expand_column

Implementation for expand_column.

expand_column(df, column_name, sep='|', concat=True)

Expand a categorical column with multiple labels into dummy-coded columns.

Super sugary syntax that wraps pandas.Series.str.get_dummies.

This method does not mutate the original DataFrame.

Examples:

Functional usage syntax:

>>> import pandas as pd >>> df = pd.DataFrame( ...  { ...  "col1": ["A, B", "B, C, D", "E, F", "A, E, F"], ...  "col2": [1, 2, 3, 4], ...  } ... ) >>> df = expand_column( ...  df, ...  column_name="col1", ...  sep=", ", # note space in sep ... ) >>> df  col1 col2 A B C D E F 0 A, B 1 1 1 0 0 0 0 1 B, C, D 2 0 1 1 1 0 0 2 E, F 3 0 0 0 0 1 1 3 A, E, F 4 1 0 0 0 1 1 

Method chaining syntax:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "col1": ["A, B", "B, C, D", "E, F", "A, E, F"], ...  "col2": [1, 2, 3, 4], ...  } ... ).expand_column(column_name="col1", sep=", ") >>> df  col1 col2 A B C D E F 0 A, B 1 1 1 0 0 0 0 1 B, C, D 2 0 1 1 1 0 0 2 E, F 3 0 0 0 0 1 1 3 A, E, F 4 1 0 0 0 1 1 
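
Since concat=False is not shown above, here is a minimal sketch of returning the dummy-coded columns on their own:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"col1": ["A, B", "B, C, D"], "col2": [1, 2]})

# Return only the dummy-coded columns, without joining them back onto df.
dummies = df.expand_column(column_name="col1", sep=", ", concat=False)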

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Which column to expand.

required
sep str

The delimiter; same as pandas.Series.str.get_dummies's sep.

'|'
concat bool

Whether to return the expanded column concatenated to the original dataframe (concat=True), or to return it standalone (concat=False).

True

Returns:

Type Description
DataFrame

A pandas DataFrame with an expanded column.

Source code in janitor/functions/expand_column.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") def expand_column( df: pd.DataFrame, column_name: Hashable, sep: str = "|", concat: bool = True, ) -> pd.DataFrame:  """Expand a categorical column with multiple labels into dummy-coded columns.  Super sugary syntax that wraps `pandas.Series.str.get_dummies`.  This method does not mutate the original DataFrame.  Examples:  Functional usage syntax:  >>> import pandas as pd  >>> df = pd.DataFrame(  ... {  ... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],  ... "col2": [1, 2, 3, 4],  ... }  ... )  >>> df = expand_column(  ... df,  ... column_name="col1",  ... sep=", ", # note space in sep  ... )  >>> df  col1 col2 A B C D E F  0 A, B 1 1 1 0 0 0 0  1 B, C, D 2 0 1 1 1 0 0  2 E, F 3 0 0 0 0 1 1  3 A, E, F 4 1 0 0 0 1 1  Method chaining syntax:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],  ... "col2": [1, 2, 3, 4],  ... }  ... ).expand_column(column_name="col1", sep=", ")  >>> df  col1 col2 A B C D E F  0 A, B 1 1 1 0 0 0 0  1 B, C, D 2 0 1 1 1 0 0  2 E, F 3 0 0 0 0 1 1  3 A, E, F 4 1 0 0 0 1 1  Args:  df: A pandas DataFrame.  column_name: Which column to expand.  sep: The delimiter, same to  `pandas.Series.str.get_dummies`'s `sep`.  concat: Whether to return the expanded column concatenated to  the original dataframe (`concat=True`), or to return it standalone  (`concat=False`).  Returns:  A pandas DataFrame with an expanded column.  """ # noqa: E501 expanded_df = df[column_name].str.get_dummies(sep=sep) if concat: return df.join(expanded_df) return expanded_df 

expand_grid

Implementation source for expand_grid.

cartesian_product(*inputs, sort=False)

Creates a DataFrame from a cartesian combination of all inputs.

Inspiration is from tidyr's expand_grid() function.

Each input argument should be a pandas Index/Series/DataFrame, or a dictionary whose values are 1D arrays.

Examples:

>>> import pandas as pd >>> import janitor as jn >>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]}) >>> data = pd.Series([1, 2, 3], name="z") >>> jn.cartesian_product(df, data)  x y z 0 1 2 1 1 1 2 2 2 1 2 3 3 2 1 1 4 2 1 2 5 2 1 3 

cartesian_product also works with non-pandas objects:

>>> data = {"x": [1, 2, 3], "y": [1, 2]} >>> cartesian_product(data)  x y 0 1 1 1 1 2 2 2 1 3 2 2 4 3 1 5 3 2 
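
The sort flag is not exercised above; a small hedged sketch, based solely on the parameter description that sort=True sorts the output DataFrame:

import pandas as pd
import janitor as jn

idx = pd.Index([3, 1, 2], name="x")
data = {"y": ["b", "a"]}

# With sort=True the resulting DataFrame is sorted.
out = jn.cartesian_product(idx, data, sort=True)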

Parameters:

Name Type Description Default
*inputs tuple

Variable arguments. Each argument should be a pandas Index/Series/DataFrame, or a dictionary whose values are 1D arrays.

()
sort bool

If True, sort the output DataFrame.

False

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/expand_grid.py
def cartesian_product(*inputs: tuple, sort: bool = False) -> pd.DataFrame:  """Creates a DataFrame from a cartesian combination of all inputs.  Inspiration is from tidyr's expand_grid() function.  The input argument should be a pandas Index/Series/DataFrame,  or a dictionary - the values of the dictionary should be  a 1D array.  Examples:  >>> import pandas as pd  >>> import janitor as jn  >>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})  >>> data = pd.Series([1, 2, 3], name="z")  >>> jn.cartesian_product(df, data)  x y z  0 1 2 1  1 1 2 2  2 1 2 3  3 2 1 1  4 2 1 2  5 2 1 3  `cartesian_product` also works with non-pandas objects:  >>> data = {"x": [1, 2, 3], "y": [1, 2]}  >>> cartesian_product(data)  x y  0 1 1  1 1 2  2 2 1  3 2 2  4 3 1  5 3 2  Args:  *inputs: Variable arguments. The arguments should be  a pandas Index/Series/DataFrame, or a dictionary,  where the values in the dictionary is a 1D array.  sort: If True, sort the output DataFrame.  Returns:  A pandas DataFrame.  """ contents = [] for entry in inputs: if isinstance(entry, dict): for label, value in entry.items(): arr = pd.Series(value, name=label) contents.append(arr) else: contents.append(entry) outcome = _compute_cartesian_product(inputs=contents, sort=sort) # the values in the outcome dictionary are copies, # based on numpy indexing semantics; # as such, it is safe to pass copy=False return pd.DataFrame(data=outcome, copy=False) 

expand(df, *columns, sort=False, by=None)

Creates a DataFrame from a cartesian combination of all inputs.

Inspiration is from tidyr's expand() function.

expand() is often useful with pd.merge to convert implicit missing values to explicit missing values - similar to complete.

It can also be used to figure out which combinations are missing (e.g identify gaps in your DataFrame).

The variadic columns parameter can be a column name, a list of column names, a pandas Index/Series/DataFrame, or a callable which, when applied to the DataFrame, evaluates to a pandas Index/Series/DataFrame.

A dictionary can also be passed to the variadic columns parameter - the values of the dictionary should either be a 1D array or a callable that evaluates to a 1D array. The array values should be unique; no check is done to verify this.

If by is present, the DataFrame is expanded per group. by should be a column name, or a list of column names.

Examples:

>>> import pandas as pd >>> import janitor >>> data = [ ...  {"type": "apple", "year": 2010, "size": "XS"}, ...  {"type": "orange", "year": 2010, "size": "S"}, ...  {"type": "apple", "year": 2012, "size": "M"}, ...  {"type": "orange", "year": 2010, "size": "S"}, ...  {"type": "orange", "year": 2011, "size": "S"}, ...  {"type": "orange", "year": 2012, "size": "M"}, ... ] >>> df = pd.DataFrame(data) >>> df  type year size 0 apple 2010 XS 1 orange 2010 S 2 apple 2012 M 3 orange 2010 S 4 orange 2011 S 5 orange 2012 M 

Get unique observations:

>>> df.expand("type")  type 0 apple 1 orange >>> df.expand("size")  size 0 XS 1 S 2 M >>> df.expand("type", "size")  type size 0 apple XS 1 apple S 2 apple M 3 orange XS 4 orange S 5 orange M >>> df.expand("type", "size", "year")  type size year 0 apple XS 2010 1 apple XS 2012 2 apple XS 2011 3 apple S 2010 4 apple S 2012 5 apple S 2011 6 apple M 2010 7 apple M 2012 8 apple M 2011 9 orange XS 2010 10 orange XS 2012 11 orange XS 2011 12 orange S 2010 13 orange S 2012 14 orange S 2011 15 orange M 2010 16 orange M 2012 17 orange M 2011 

Get observations that only occur in the data:

>>> df.expand(["type", "size"])  type size 0 apple XS 1 orange S 2 apple M 3 orange M >>> df.expand(["type", "size", "year"])  type size year 0 apple XS 2010 1 orange S 2010 2 apple M 2012 3 orange S 2011 4 orange M 2012 

Expand the DataFrame to include new observations:

>>> df.expand("type", "size", {"new_year": range(2010, 2014)})  type size new_year 0 apple XS 2010 1 apple XS 2011 2 apple XS 2012 3 apple XS 2013 4 apple S 2010 5 apple S 2011 6 apple S 2012 7 apple S 2013 8 apple M 2010 9 apple M 2011 10 apple M 2012 11 apple M 2013 12 orange XS 2010 13 orange XS 2011 14 orange XS 2012 15 orange XS 2013 16 orange S 2010 17 orange S 2011 18 orange S 2012 19 orange S 2013 20 orange M 2010 21 orange M 2011 22 orange M 2012 23 orange M 2013 

Filter for missing observations:

>>> combo = df.expand("type", "size", "year") >>> anti_join = df.merge(combo, how="right", indicator=True) >>> anti_join.query("_merge=='right_only'").drop(columns="_merge")  type year size 1 apple 2012 XS 2 apple 2011 XS 3 apple 2010 S 4 apple 2012 S 5 apple 2011 S 6 apple 2010 M 8 apple 2011 M 9 orange 2010 XS 10 orange 2012 XS 11 orange 2011 XS 14 orange 2012 S 16 orange 2010 M 18 orange 2011 M 

Expand within each group, using by:

>>> df.expand("year", "size", by="type")  year size type apple 2010 XS apple 2010 M apple 2012 XS apple 2012 M orange 2010 S orange 2010 M orange 2011 S orange 2011 M orange 2012 S orange 2012 M 
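
As noted above, the columns specification is not limited to column labels; a pandas Series (or Index/DataFrame) can be passed directly. A minimal sketch, assuming the Series name becomes the resulting column label (as with the named Series in cartesian_product):

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"type": ["apple", "orange"], "year": [2010, 2012]})

new_years = pd.Series(range(2013, 2015), name="new_year")
# Combine the unique values of "type" with an external Series of years.
out = df.expand("type", new_years)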

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
columns tuple

Specification of columns to expand. It could be column labels, a list/tuple of column labels, or a pandas Index/Series/DataFrame.

It can also be a callable; the callable will be applied to the entire DataFrame. The callable should return a pandas Series/Index/DataFrame.

It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array. The array values should be unique; no check is done to verify this.

()
sort bool

If True, sort the DataFrame.

False
by str | list

Label or list of labels to group by.

None

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/expand_grid.py
@pf.register_dataframe_method def expand( df: pd.DataFrame, *columns: tuple, sort: bool = False, by: str | list = None, ) -> pd.DataFrame:  """  Creates a DataFrame from a cartesian combination of all inputs.  Inspiration is from tidyr's expand() function.  expand() is often useful with  [pd.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)  to convert implicit  missing values to explicit missing values - similar to  [`complete`][janitor.functions.complete.complete].  It can also be used to figure out which combinations are missing  (e.g identify gaps in your DataFrame).  The variable `columns` parameter can be a column name,  a list of column names, a pandas Index/Series/DataFrame,  or a callable, which when applied to the DataFrame,  evaluates to a pandas Index/Series/DataFrame.  A dictionary can also be passed  to the variable `columns` parameter -  the values of the dictionary should be  either be a 1D array  or a callable that evaluates to a  1D array. The array should be unique;  no check is done to verify this.  If `by` is present, the DataFrame is *expanded* per group.  `by` should be a column name, or a list of column names.  Examples:  >>> import pandas as pd  >>> import janitor  >>> data = [  ... {"type": "apple", "year": 2010, "size": "XS"},  ... {"type": "orange", "year": 2010, "size": "S"},  ... {"type": "apple", "year": 2012, "size": "M"},  ... {"type": "orange", "year": 2010, "size": "S"},  ... {"type": "orange", "year": 2011, "size": "S"},  ... {"type": "orange", "year": 2012, "size": "M"},  ... ]  >>> df = pd.DataFrame(data)  >>> df  type year size  0 apple 2010 XS  1 orange 2010 S  2 apple 2012 M  3 orange 2010 S  4 orange 2011 S  5 orange 2012 M  Get unique observations:  >>> df.expand("type")  type  0 apple  1 orange  >>> df.expand("size")  size  0 XS  1 S  2 M  >>> df.expand("type", "size")  type size  0 apple XS  1 apple S  2 apple M  3 orange XS  4 orange S  5 orange M  >>> df.expand("type", "size", "year")  type size year  0 apple XS 2010  1 apple XS 2012  2 apple XS 2011  3 apple S 2010  4 apple S 2012  5 apple S 2011  6 apple M 2010  7 apple M 2012  8 apple M 2011  9 orange XS 2010  10 orange XS 2012  11 orange XS 2011  12 orange S 2010  13 orange S 2012  14 orange S 2011  15 orange M 2010  16 orange M 2012  17 orange M 2011  Get observations that only occur in the data:  >>> df.expand(["type", "size"])  type size  0 apple XS  1 orange S  2 apple M  3 orange M  >>> df.expand(["type", "size", "year"])  type size year  0 apple XS 2010  1 orange S 2010  2 apple M 2012  3 orange S 2011  4 orange M 2012  Expand the DataFrame to include new observations:  >>> df.expand("type", "size", {"new_year": range(2010, 2014)})  type size new_year  0 apple XS 2010  1 apple XS 2011  2 apple XS 2012  3 apple XS 2013  4 apple S 2010  5 apple S 2011  6 apple S 2012  7 apple S 2013  8 apple M 2010  9 apple M 2011  10 apple M 2012  11 apple M 2013  12 orange XS 2010  13 orange XS 2011  14 orange XS 2012  15 orange XS 2013  16 orange S 2010  17 orange S 2011  18 orange S 2012  19 orange S 2013  20 orange M 2010  21 orange M 2011  22 orange M 2012  23 orange M 2013  Filter for missing observations:  >>> combo = df.expand("type", "size", "year")  >>> anti_join = df.merge(combo, how="right", indicator=True)  >>> anti_join.query("_merge=='right_only'").drop(columns="_merge")  type year size  1 apple 2012 XS  2 apple 2011 XS  3 apple 2010 S  4 apple 2012 S  5 apple 2011 S  6 apple 2010 M  8 apple 2011 M  9 orange 2010 XS  10 orange 2012 XS  11 orange 2011 XS  
14 orange 2012 S  16 orange 2010 M  18 orange 2011 M  Expand within each group, using `by`:  >>> df.expand("year", "size", by="type")  year size  type  apple 2010 XS  apple 2010 M  apple 2012 XS  apple 2012 M  orange 2010 S  orange 2010 M  orange 2011 S  orange 2011 M  orange 2012 S  orange 2012 M  Args:  df: A pandas DataFrame.  columns: Specification of columns to expand.  It could be column labels,  a list/tuple of column labels,  or a pandas Index/Series/DataFrame.  It can also be a callable;  the callable will be applied to the  entire DataFrame. The callable should  return a pandas Series/Index/DataFrame.  It can also be a dictionary,  where the values are either a 1D array  or a callable that evaluates to a  1D array.  The array should be unique;  no check is done to verify this.  sort: If True, sort the DataFrame.  by: Label or list of labels to group by.  Returns:  A pandas DataFrame.  """ # noqa: E501 if by is None: contents = _build_pandas_objects_for_expand(df=df, columns=columns) return cartesian_product(*contents, sort=sort) if not is_scalar(by) and not isinstance(by, list): raise TypeError( "The argument to the by parameter " "should be a scalar or a list; " f"instead got {type(by).__name__}" ) check_column(df, column_names=by, present=True) grouped = df.groupby(by=by, sort=False, dropna=False, observed=True) index = grouped._grouper.result_index dictionary = defaultdict(list) lengths = [] for _, frame in grouped: objects = _build_pandas_objects_for_expand(df=frame, columns=columns) objects = _compute_cartesian_product(inputs=objects, sort=False) length = objects[next(iter(objects))].size lengths.append(length) for k, v in objects.items(): dictionary[k].append(v) dictionary = {key: concat_compat(value) for key, value in dictionary.items()} index = index.repeat(lengths) out = pd.DataFrame(data=dictionary, index=index, copy=False) if sort: headers = out.columns.tolist() return out.sort_values(headers) return out 

expand_grid(df=None, df_key=None, *, others=None)

Creates a DataFrame from a cartesian combination of all inputs.

Note

This function will be deprecated in a 1.x release; use cartesian_product instead.

It is not restricted to a pandas DataFrame; it can work with any list-like structure that is 1 or 2 dimensional.

If method-chaining to a DataFrame, a string argument to df_key parameter must be provided.

Data types are preserved in this function, including pandas' extension array dtypes.

The output will always be a DataFrame, usually with a MultiIndex column, with the keys of the others dictionary serving as the top level columns.

If a pandas Series/DataFrame is passed, and has a labeled index, or a MultiIndex index, the index is discarded; the final DataFrame will have a RangeIndex.

The MultiIndexed DataFrame can be flattened using pyjanitor's collapse_levels method; the user can also decide to drop any of the levels, via pandas' droplevel method.

Examples:

>>> import pandas as pd >>> import janitor as jn >>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]}) >>> data = {"z": [1, 2, 3]} >>> df.expand_grid(df_key="df", others=data)  df z  x y 0 0 1 2 1 1 1 2 2 2 1 2 3 3 2 1 1 4 2 1 2 5 2 1 3 

expand_grid works with non-pandas objects:

>>> data = {"x": [1, 2, 3], "y": [1, 2]} >>> jn.expand_grid(others=data)  x y  0 0 0 1 1 1 1 2 2 2 1 3 2 2 4 3 1 5 3 2 
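
As mentioned above, the MultiIndex columns from the first example can be flattened afterwards; a sketch of both routes follows (the exact flattened labels depend on df_key and the dictionary keys):

import pandas as pd
import janitor as jn  # noqa: F401

df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
out = df.expand_grid(df_key="df", others={"z": [1, 2, 3]})

flat = out.collapse_levels()                      # pyjanitor's flattening helper
top_dropped = out.droplevel(level=0, axis="columns")  # or drop a level via pandas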

Parameters:

Name Type Description Default
df Optional[DataFrame]

A pandas DataFrame.

None
df_key Optional[str]

Name of key for the dataframe. It becomes part of the column names of the dataframe.

None
others Optional[dict]

A dictionary that contains the data to be combined with the dataframe. If no dataframe exists, all inputs in others will be combined to create a DataFrame.

None

Raises:

Type Description
KeyError

If there is a DataFrame and df_key is not provided.

Returns:

Type Description
Union[DataFrame, None]

A pandas DataFrame of the cartesian product. If df is not provided and others is not provided, None is returned.

Source code in janitor/functions/expand_grid.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `janitor.cartesian_product` instead." ) ) def expand_grid( df: Optional[pd.DataFrame] = None, df_key: Optional[str] = None, *, others: Optional[dict] = None, ) -> Union[pd.DataFrame, None]:  """  Creates a DataFrame from a cartesian combination of all inputs.  !!!note  This function will be deprecated in a 1.x release;  use [`cartesian_product`][janitor.functions.expand_grid.cartesian_product]  instead.  It is not restricted to a pandas DataFrame;  it can work with any list-like structure  that is 1 or 2 dimensional.  If method-chaining to a DataFrame, a string argument  to `df_key` parameter must be provided.  Data types are preserved in this function,  including pandas' extension array dtypes.  The output will always be a DataFrame, usually with a MultiIndex column,  with the keys of the `others` dictionary serving as the top level columns.  If a pandas Series/DataFrame is passed, and has a labeled index, or  a MultiIndex index, the index is discarded; the final DataFrame  will have a RangeIndex.  The MultiIndexed DataFrame can be flattened using pyjanitor's  [`collapse_levels`][janitor.functions.collapse_levels.collapse_levels]  method; the user can also decide to drop any of the levels, via pandas'  `droplevel` method.  Examples:  >>> import pandas as pd  >>> import janitor as jn  >>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})  >>> data = {"z": [1, 2, 3]}  >>> df.expand_grid(df_key="df", others=data)  df z  x y 0  0 1 2 1  1 1 2 2  2 1 2 3  3 2 1 1  4 2 1 2  5 2 1 3  `expand_grid` works with non-pandas objects:  >>> data = {"x": [1, 2, 3], "y": [1, 2]}  >>> jn.expand_grid(others=data)  x y  0 0  0 1 1  1 1 2  2 2 1  3 2 2  4 3 1  5 3 2  Args:  df: A pandas DataFrame.  df_key: Name of key for the dataframe.  It becomes part of the column names of the dataframe.  others: A dictionary that contains the data  to be combined with the dataframe.  If no dataframe exists, all inputs  in `others` will be combined to create a DataFrame.  Raises:  KeyError: If there is a DataFrame and `df_key` is not provided.  Returns:  A pandas DataFrame of the cartesian product.  If `df` is not provided, and `others` is not provided,  None is returned.  """ # noqa: E501 if df is not None: check("df", df, [pd.DataFrame]) if not df_key: raise KeyError( "Using `expand_grid` as part of a " "DataFrame method chain requires that " "a string argument be provided for " "the `df_key` parameter. " ) check("df_key", df_key, [str]) if not others and (df is not None): return df if not others: return None check("others", others, [dict]) for key in others: check("key", key, [str]) if df is not None: others = {**{df_key: df}, **others} others = _computations_expand_grid(others) return pd.DataFrame(others, copy=False) 

explode_index

Implementation of the explode_index function.

explode_index(df, names_sep=None, names_pattern=None, axis='columns', level_names=None)

Explode a single index DataFrame into a MultiIndex DataFrame.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "max_speed_mean": [267.3333333333333, 50.5], ...  "max_speed_median": [389.0, 50.5], ...  } ... ) >>> df  max_speed_mean max_speed_median 0 267.333333 389.0 1 50.500000 50.5 >>> df.explode_index( ...  names_sep="_", axis="columns" ... )  max  speed  mean median 0 267.333333 389.0 1 50.500000 50.5 >>> df.explode_index( ...  names_pattern=r"(.+speed)_(.+)", axis="columns" ... )  max_speed  mean median 0 267.333333 389.0 1 50.500000 50.5 >>> df.explode_index( ...  names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)", ...  axis="columns", ... ) measurement max_speed aggregation mean median 0 267.333333 389.0 1 50.500000 50.5 >>> df.explode_index( ...  names_sep="_", ...  axis="columns", ...  level_names=["min or max", "measurement", "aggregation"], ... ) min or max max measurement speed aggregation mean median 0 267.333333 389.0 1 50.500000 50.5 
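
The examples above only explode the column axis; based on the axis parameter, the same splitting can be applied to the row index. A hedged sketch, with the row labels invented for illustration:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame(
    {"value": [267.3, 50.5]},
    index=["max_mean", "max_median"],
)

# Split the string row labels on "_" to build a MultiIndex on the index axis.
out = df.explode_index(names_sep="_", axis="index")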

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
names_sep Union[str, None]

string or compiled regex used to split the column/index into levels.

None
names_pattern Union[str, None]

regex to extract new levels from the column/index.

None
axis str

Either 'index' or 'columns'. Determines which axis to explode.

'columns'
level_names list

names of the levels in the MultiIndex.

None

Returns:

Type Description
DataFrame

A pandas DataFrame with a MultiIndex.

Source code in janitor/functions/explode_index.py
@pf.register_dataframe_method def explode_index( df: pd.DataFrame, names_sep: Union[str, None] = None, names_pattern: Union[str, None] = None, axis: str = "columns", level_names: list = None, ) -> pd.DataFrame:  """Explode a single index DataFrame into a MultiIndex DataFrame.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "max_speed_mean": [267.3333333333333, 50.5],  ... "max_speed_median": [389.0, 50.5],  ... }  ... )  >>> df  max_speed_mean max_speed_median  0 267.333333 389.0  1 50.500000 50.5  >>> df.explode_index(  ... names_sep="_", axis="columns"  ... ) # doctest: +NORMALIZE_WHITESPACE  max  speed  mean median  0 267.333333 389.0  1 50.500000 50.5  >>> df.explode_index(  ... names_pattern=r"(.+speed)_(.+)", axis="columns"  ... ) # doctest: +NORMALIZE_WHITESPACE  max_speed  mean median  0 267.333333 389.0  1 50.500000 50.5  >>> df.explode_index(  ... names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)",  ... axis="columns",  ... ) # doctest: +NORMALIZE_WHITESPACE  measurement max_speed  aggregation mean median  0 267.333333 389.0  1 50.500000 50.5  >>> df.explode_index(  ... names_sep="_",  ... axis="columns",  ... level_names=["min or max", "measurement", "aggregation"],  ... ) # doctest: +NORMALIZE_WHITESPACE  min or max max  measurement speed  aggregation mean median  0 267.333333 389.0  1 50.500000 50.5  Args:  df: A pandas DataFrame.  names_sep: string or compiled regex used to split the column/index into levels.  names_pattern: regex to extract new levels from the column/index.  axis: 'index/columns'. Determines which axis to explode.  level_names: names of the levels in the MultiIndex.  Returns:  A pandas DataFrame with a MultiIndex.  """ # noqa: E501 check("axis", axis, [str]) if axis not in {"index", "columns"}: raise ValueError("axis should be either index or columns.") if (names_sep is None) and (names_pattern is None): raise ValueError("Provide argument for either names_sep or names_pattern.") if (names_sep is not None) and (names_pattern is not None): raise ValueError( "Provide argument for either names_sep or names_pattern, not both." ) if names_sep is not None: check("names_sep", names_sep, [str]) if names_pattern is not None: check("names_pattern", names_pattern, [str]) if level_names is not None: check("level_names", level_names, [list]) new_index = getattr(df, axis) if isinstance(new_index, pd.MultiIndex): return df # avoid a copy - Index is immutable; a slice is safe to use. df = df[:] if names_sep: new_index = new_index.str.split(names_sep, expand=True) else: named_groups = re.compile(names_pattern).groupindex if named_groups and not level_names: level_names = list(named_groups) new_index = new_index.str.extract(names_pattern) new_index = [arr.array for _, arr in new_index.items()] new_index = pd.MultiIndex.from_arrays(new_index) if level_names: new_index.names = level_names setattr(df, axis, new_index) return df 

factorize_columns

Implementation of the factorize_columns function

factorize_columns(df, column_names, suffix='_enc', **kwargs)

Converts labels into numerical data.

This method will create a new column with the string _enc appended to the original column's name. This can be overridden with the suffix parameter.

Internally, this method uses pandas' factorize method, and also accepts an optional suffix and keyword arguments. An empty string as the suffix will overwrite the existing column.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "foo": ["b", "b", "a", "c", "b"], ...  "bar": range(4, 9), ...  } ... ) >>> df  foo bar 0 b 4 1 b 5 2 a 6 3 c 7 4 b 8 >>> df.factorize_columns(column_names="foo")  foo bar foo_enc 0 b 4 0 1 b 5 0 2 a 6 1 3 c 7 2 4 b 8 0 
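
To overwrite the original column instead of adding foo_enc, pass an empty suffix, as described above; a minimal sketch:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"foo": ["b", "b", "a", "c", "b"], "bar": range(4, 9)})

# An empty suffix replaces "foo" with its factorized integer codes.
out = df.factorize_columns(column_names="foo", suffix="")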

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names.

required
suffix str

Suffix to be used for the new column. An empty string suffix means the existing column will be overwritten.

'_enc'
**kwargs Any

Keyword arguments. Accepts any of the keyword arguments that the pandas factorize method takes, such as sort, na_sentinel, size_hint.

{}

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/factorize_columns.py
@pf.register_dataframe_method def factorize_columns( df: pd.DataFrame, column_names: Union[str, Iterable[str], Hashable], suffix: str = "_enc", **kwargs: Any, ) -> pd.DataFrame:  """Converts labels into numerical data.  This method will create a new column with the string `_enc` appended  after the original column's name.  This can be overridden with the suffix parameter.  Internally, this method uses pandas `factorize` method.  It takes in an optional suffix and keyword arguments also.  An empty string as suffix will override the existing column.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "foo": ["b", "b", "a", "c", "b"],  ... "bar": range(4, 9),  ... }  ... )  >>> df  foo bar  0 b 4  1 b 5  2 a 6  3 c 7  4 b 8  >>> df.factorize_columns(column_names="foo")  foo bar foo_enc  0 b 4 0  1 b 5 0  2 a 6 1  3 c 7 2  4 b 8 0  Args:  df: The pandas DataFrame object.  column_names: A column name or an iterable (list or tuple) of  column names.  suffix: Suffix to be used for the new column.  An empty string suffix means, it will override the existing column.  **kwargs: Keyword arguments. It takes any of the keyword arguments,  which the pandas factorize method takes like `sort`, `na_sentinel`,  `size_hint`.  Returns:  A pandas DataFrame.  """ df = _factorize(df.copy(), column_names, suffix, **kwargs) return df 

fill

fill_direction(df, **kwargs)

Provide a method-chainable function for filling missing values in selected columns.

It is a wrapper for pd.Series.ffill and pd.Series.bfill, and pairs the column name with one of up, down, updown, and downup.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.

Examples:

>>> import pandas as pd >>> import janitor as jn >>> df = pd.DataFrame( ...  { ...  "col1": [1, 2, 3, 4], ...  "col2": [None, 5, 6, 7], ...  "col3": [8, 9, 10, None], ...  "col4": [None, None, 11, None], ...  "col5": [None, 12, 13, None], ...  } ... ) >>> df  col1 col2 col3 col4 col5 0 1 NaN 8.0 NaN NaN 1 2 5.0 9.0 NaN 12.0 2 3 6.0 10.0 11.0 13.0 3 4 7.0 NaN NaN NaN >>> df.fill_direction(col2="up", col3="down", col4="downup", col5="updown")  col1 col2 col3 col4 col5 0 1 5.0 8.0 11.0 12.0 1 2 5.0 9.0 11.0 12.0 2 3 6.0 10.0 11.0 13.0 3 4 7.0 10.0 11.0 13.0 
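
Given the deprecation note, a rough plain-pandas replacement for part of the example above chains bfill/ffill inside assign ("up" corresponds to a backfill, "down" to a forward fill); this is a sketch of the suggested alternative, not the function itself:

import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": [None, 5, 6, 7],
        "col3": [8, 9, 10, None],
    }
)

# Equivalent of fill_direction(col2="up", col3="down").
out = df.assign(col2=df["col2"].bfill(), col3=df["col3"].ffill())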

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
**kwargs Any

Key-value pairs of columns and directions. Directions can be one of down, up, updown (fill up then down), or downup (fill down then up).

{}

Raises:

Type Description
ValueError

If direction supplied is not one of down, up, updown, or downup.

Returns:

Type Description
DataFrame

A pandas DataFrame with modified column(s).

Source code in janitor/functions/fill.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.assign` instead." ) ) def fill_direction(df: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:  """Provide a method-chainable function for filling missing values  in selected columns.  It is a wrapper for `pd.Series.ffill` and `pd.Series.bfill`,  and pairs the column name with one of `up`, `down`, `updown`,  and `downup`.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.assign` instead.  Examples:  >>> import pandas as pd  >>> import janitor as jn  >>> df = pd.DataFrame(  ... {  ... "col1": [1, 2, 3, 4],  ... "col2": [None, 5, 6, 7],  ... "col3": [8, 9, 10, None],  ... "col4": [None, None, 11, None],  ... "col5": [None, 12, 13, None],  ... }  ... )  >>> df  col1 col2 col3 col4 col5  0 1 NaN 8.0 NaN NaN  1 2 5.0 9.0 NaN 12.0  2 3 6.0 10.0 11.0 13.0  3 4 7.0 NaN NaN NaN  >>> df.fill_direction(col2="up", col3="down", col4="downup", col5="updown")  col1 col2 col3 col4 col5  0 1 5.0 8.0 11.0 12.0  1 2 5.0 9.0 11.0 12.0  2 3 6.0 10.0 11.0 13.0  3 4 7.0 10.0 11.0 13.0  Args:  df: A pandas DataFrame.  **kwargs: Key - value pairs of columns and directions.  Directions can be either `down`, `up`, `updown`  (fill up then down) and `downup` (fill down then up).  Raises:  ValueError: If direction supplied is not one of `down`, `up`,  `updown`, or `downup`.  Returns:  A pandas DataFrame with modified column(s).  """ # noqa: E501 if not kwargs: return df fill_types = {fill.name for fill in _FILLTYPE} for column_name, fill_type in kwargs.items(): check("column_name", column_name, [str]) check("fill_type", fill_type, [str]) if fill_type.upper() not in fill_types: raise ValueError("fill_type should be one of up, down, updown, or downup.") check_column(df, kwargs) new_values = {} for column_name, fill_type in kwargs.items(): direction = _FILLTYPE[f"{fill_type.upper()}"].value if len(direction) == 1: direction = methodcaller(direction[0]) output = direction(df[column_name]) else: direction = [methodcaller(entry) for entry in direction] output = _chain_func(df[column_name], *direction) new_values[column_name] = output return df.assign(**new_values) 

fill_empty(df, column_names, value)

Fill NaN values in specified columns with a given value.

Super sugary syntax that wraps pandas.DataFrame.fillna.

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.impute instead.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  {"col1": [1, 2, 3], "col2": [None, 4, None], "col3": [None, 5, 6]} ... ) >>> df  col1 col2 col3 0 1 NaN NaN 1 2 4.0 5.0 2 3 NaN 6.0 >>> df.fill_empty(column_names="col2", value=0)  col1 col2 col3 0 1 0.0 NaN 1 2 4.0 5.0 2 3 0.0 6.0 >>> df.fill_empty(column_names=["col2", "col3"], value=0)  col1 col2 col3 0 1 0.0 0.0 1 2 4.0 5.0 2 3 0.0 6.0 
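
Since fill_empty is described as sugar around pandas.DataFrame.fillna, the last call above can also be written in plain pandas (jn.impute is the suggested replacement going forward):

import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [None, 4, None], "col3": [None, 5, 6]}
)

# Fill NaN only in the selected columns, leaving everything else untouched.
out = df.fillna({"col2": 0, "col3": 0})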

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names. If a single column name is passed in, then only that column will be filled; if a list or tuple is passed in, then those columns will all be filled with the same value.

required
value Any

The value that replaces the NaN values.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with NaN values filled.

Source code in janitor/functions/fill.py
@pf.register_dataframe_method @refactored_function( message="This function will be deprecated in a 1.x release. " "Kindly use `jn.impute` instead." ) @deprecated_alias(columns="column_names") def fill_empty( df: pd.DataFrame, column_names: Union[str, Iterable[str], Hashable], value: Any, ) -> pd.DataFrame:  """Fill `NaN` values in specified columns with a given value.  Super sugary syntax that wraps `pandas.DataFrame.fillna`.  This method mutates the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use [`jn.impute`][janitor.functions.impute.impute] instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {"col1": [1, 2, 3], "col2": [None, 4, None], "col3": [None, 5, 6]}  ... )  >>> df  col1 col2 col3  0 1 NaN NaN  1 2 4.0 5.0  2 3 NaN 6.0  >>> df.fill_empty(column_names="col2", value=0)  col1 col2 col3  0 1 0.0 NaN  1 2 4.0 5.0  2 3 0.0 6.0  >>> df.fill_empty(column_names=["col2", "col3"], value=0)  col1 col2 col3  0 1 0.0 0.0  1 2 4.0 5.0  2 3 0.0 6.0  Args:  df: A pandas DataFrame.  column_names: A column name or an iterable (list  or tuple) of column names. If a single column name is passed in,  then only that column will be filled; if a list or tuple is passed  in, then those columns will all be filled with the same value.  value: The value that replaces the `NaN` values.  Returns:  A pandas DataFrame with `NaN` values filled.  """ check_column(df, column_names) return _fill_empty(df, column_names, value=value) 

filter

filter_column_isin(df, column_name, iterable, complement=False)

Filter a dataframe for values in a column that exist in the given iterable.

This method does not mutate the original DataFrame.

Assumes exact matching; fuzzy matching not implemented.

Examples:

Filter the dataframe to retain rows for which names are exactly James or John.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")}) >>> df  names foo 0 Jane x 1 Jeremy y 2 John z >>> df.filter_column_isin(column_name="names", iterable=["James", "John"])  names foo 2 John z 

This is the method-chaining alternative to:

df = df[df["names"].isin(["James", "John"])] 

If complement=True, then we will only get rows for which the names are neither James nor John.
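
A quick sketch of the complement=True case and its plain-pandas counterpart:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})

# Keep only the rows whose names are NOT in the iterable.
out = df.filter_column_isin(
    column_name="names", iterable=["James", "John"], complement=True
)
# Plain-pandas equivalent:
out_pd = df[~df["names"].isin(["James", "John"])]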

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column on which to filter.

required
iterable Iterable

An iterable. Could be a list, tuple, another pandas Series.

required
complement bool

Whether to return the complement of the selection or not.

False

Raises:

Type Description
ValueError

If iterable does not have a length of 1 or greater.

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") def filter_column_isin( df: pd.DataFrame, column_name: Hashable, iterable: Iterable, complement: bool = False, ) -> pd.DataFrame:  """Filter a dataframe for values in a column that exist in the given iterable.  This method does not mutate the original DataFrame.  Assumes exact matching; fuzzy matching not implemented.  Examples:  Filter the dataframe to retain rows for which `names`  are exactly `James` or `John`.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})  >>> df  names foo  0 Jane x  1 Jeremy y  2 John z  >>> df.filter_column_isin(column_name="names", iterable=["James", "John"])  names foo  2 John z  This is the method-chaining alternative to:  ```python  df = df[df["names"].isin(["James", "John"])]  ```  If `complement=True`, then we will only get rows for which the names  are neither `James` nor `John`.  Args:  df: A pandas DataFrame.  column_name: The column on which to filter.  iterable: An iterable. Could be a list, tuple, another pandas  Series.  complement: Whether to return the complement of the selection or  not.  Raises:  ValueError: If `iterable` does not have a length of `1`  or greater.  Returns:  A filtered pandas DataFrame.  """ # noqa: E501 if len(iterable) == 0: raise ValueError( "`iterable` kwarg must be given an iterable of length 1 or greater." ) criteria = df[column_name].isin(iterable) if complement: return df[~criteria] return df[criteria] 

filter_date(df, column_name, start_date=None, end_date=None, years=None, months=None, days=None, column_date_options=None, format=None)

Filter a date-based column based on certain criteria.

This method does not mutate the original DataFrame.

Dates may be finicky and this function builds on top of the magic from the pandas to_datetime function that is able to parse dates well.

Additional options to parse the date type of your column may be found at the official pandas documentation.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": range(5, 9), ...  "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"], ...  } ... ) >>> df  a dt 0 5 2021-11-12 1 6 2021-12-15 2 7 2022-01-03 3 8 2022-01-09 >>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")  a dt 1 6 2021-12-15 2 7 2022-01-03 >>> df.filter_date("dt", years=[2021], months=[12])  a dt 1 6 2021-12-15 

Note

This method will cast your column to a Timestamp!

Note

This only affects the format of the start_date and end_date parameters. If there's an issue with the format of the DataFrame being parsed, you would pass {'format': your_format} to column_date_options.
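
A minimal sketch of both options, assuming the df from the example above (the non-ISO date string below is made up for illustration): format applies to the start_date and end_date strings, while column_date_options is forwarded to pd.to_datetime for the column itself.

df.filter_date("dt", start_date="01/12/2021", format="%d/%m/%Y")
df.filter_date("dt", years=[2021], column_date_options={"format": "%Y-%m-%d"})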

Parameters:

Name Type Description Default
df DataFrame

The dataframe to filter on.

required
column_name Hashable

The date column on which to filter.

required
start_date Optional[date]

The beginning date to use to filter the DataFrame.

None
end_date Optional[date]

The end date to use to filter the DataFrame.

None
years Optional[List]

The years to use to filter the DataFrame.

None
months Optional[List]

The months to use to filter the DataFrame.

None
days Optional[List]

The days to use to filter the DataFrame.

None
column_date_options Optional[Dict]

Special options to use when parsing the date column in the original DataFrame. The options may be found at the official Pandas documentation.

None
format Optional[str]

If you're using a format for start_date or end_date that is not recognized natively by pandas' to_datetime function, you may supply the format yourself. Python date and time formats are documented at http://strftime.org/.

None

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method @deprecated_alias(column="column_name", start="start_date", end="end_date") def filter_date( df: pd.DataFrame, column_name: Hashable, start_date: Optional[dt.date] = None, end_date: Optional[dt.date] = None, years: Optional[List] = None, months: Optional[List] = None, days: Optional[List] = None, column_date_options: Optional[Dict] = None, format: Optional[str] = None, # skipcq: PYL-W0622 ) -> pd.DataFrame:  """Filter a date-based column based on certain criteria.  This method does not mutate the original DataFrame.  Dates may be finicky and this function builds on top of the *magic* from  the pandas `to_datetime` function that is able to parse dates well.  Additional options to parse the date type of your column may be found at  the official pandas [documentation][datetime].  [datetime]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": range(5, 9),  ... "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],  ... }  ... )  >>> df  a dt  0 5 2021-11-12  1 6 2021-12-15  2 7 2022-01-03  3 8 2022-01-09  >>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")  a dt  1 6 2021-12-15  2 7 2022-01-03  >>> df.filter_date("dt", years=[2021], months=[12])  a dt  1 6 2021-12-15  !!!note  This method will cast your column to a Timestamp!  !!!note  This only affects the format of the `start_date` and `end_date`  parameters. If there's an issue with the format of the DataFrame being  parsed, you would pass `{'format': your_format}` to `column_date_options`.  Args:  df: The dataframe to filter on.  column_name: The column which to apply the fraction transformation.  start_date: The beginning date to use to filter the DataFrame.  end_date: The end date to use to filter the DataFrame.  years: The years to use to filter the DataFrame.  months: The months to use to filter the DataFrame.  days: The days to use to filter the DataFrame.  column_date_options: Special options to use when parsing the date  column in the original DataFrame. The options may be found at the  official Pandas documentation.  format: If you're using a format for `start_date` or `end_date`  that is not recognized natively by pandas' `to_datetime` function, you  may supply the format yourself. Python date and time formats may be  found [here](http://strftime.org/).  Returns:  A filtered pandas DataFrame.  """ # noqa: E501 def _date_filter_conditions(conditions):  """Taken from: https://stackoverflow.com/a/13616382.""" return reduce(np.logical_and, conditions) if column_date_options is None: column_date_options = {} df[column_name] = pd.to_datetime(df[column_name], **column_date_options) _filter_list = [] if start_date: start_date = pd.to_datetime(start_date, format=format) _filter_list.append(df[column_name] >= start_date) if end_date: end_date = pd.to_datetime(end_date, format=format) _filter_list.append(df[column_name] <= end_date) if years: _filter_list.append(df[column_name].dt.year.isin(years)) if months: _filter_list.append(df[column_name].dt.month.isin(months)) if days: _filter_list.append(df[column_name].dt.day.isin(days)) if start_date and end_date and start_date > end_date: warnings.warn( f"Your start date of {start_date} is after your end date of " f"{end_date}. Is this intended?" ) return df.loc[_date_filter_conditions(_filter_list), :] 

filter_on(df, criteria, complement=False)

Return a dataframe filtered on a particular criteria.

This method does not mutate the original DataFrame.

This is super-sugary syntax that wraps the pandas .query() API, enabling users to use strings to quickly specify filters for filtering their dataframe. The intent is that filter_on as a verb better matches the intent of a pandas user than the verb query.

This is intended to be the method-chaining equivalent of the following:

df = df[df["score"] < 3] 

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.query instead.

Examples:

Filter students who failed an exam (scored less than 50).

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "student_id": ["S1", "S2", "S3"], ...  "score": [40, 60, 85], ...  } ... ) >>> df  student_id score 0 S1 40 1 S2 60 2 S3 85 >>> df.filter_on("score < 50", complement=False)  student_id score 0 S1 40 

Credit to Brant Peterson for the name.
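
As a further sketch (not in the original docstring), complement=True retains the students who passed, and the plain-pandas equivalent recommended by the deprecation note is a query call:

df.filter_on("score < 50", complement=True)  # rows where the criteria is False
df.query("not (score < 50)")                 # pandas-native equivalent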

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
criteria str

A filtering criterion that returns an array or Series of booleans, on which pandas can filter.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the criteria is False are retained instead.

False

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.query` instead." ) ) def filter_on( df: pd.DataFrame, criteria: str, complement: bool = False, ) -> pd.DataFrame:  """Return a dataframe filtered on a particular criteria.  This method does not mutate the original DataFrame.  This is super-sugary syntax that wraps the pandas `.query()` API, enabling  users to use strings to quickly specify filters for filtering their  dataframe. The intent is that `filter_on` as a verb better matches the  intent of a pandas user than the verb `query`.  This is intended to be the method-chaining equivalent of the following:  ```python  df = df[df["score"] < 3]  ```  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.query` instead.  Examples:  Filter students who failed an exam (scored less than 50).  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "student_id": ["S1", "S2", "S3"],  ... "score": [40, 60, 85],  ... }  ... )  >>> df  student_id score  0 S1 40  1 S2 60  2 S3 85  >>> df.filter_on("score < 50", complement=False)  student_id score  0 S1 40  Credit to Brant Peterson for the name.  Args:  df: A pandas DataFrame.  criteria: A filtering criteria that returns an array or Series of  booleans, on which pandas can filter on.  complement: Whether to return the complement of the filter or not.  If set to True, then the rows for which the criteria is False are  retained instead.  Returns:  A filtered pandas DataFrame.  """ warnings.warn( "This function will be deprecated in a 1.x release. " "Kindly use `pd.DataFrame.query` instead.", DeprecationWarning, stacklevel=find_stack_level(), ) if complement: return df.query(f"not ({criteria})") return df.query(criteria) 

filter_string(df, column_name, search_string, complement=False, case=True, flags=0, na=None, regex=True)

Filter a string-based column according to whether it contains a substring.

This is super sugary syntax that builds on top of pandas.Series.str.contains. It is meant to be the method-chaining equivalent of the following:

df = df[df[column_name].str.contains(search_string)]

This method does not mutate the original DataFrame.

Examples:

Retain rows whose column values contain a particular substring.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]}) >>> df  a b 0 3 bear 1 4 peeL 2 5 sail >>> df.filter_string(column_name="b", search_string="ee")  a b 1 4 peeL >>> df.filter_string(column_name="b", search_string="L", case=False)  a b 1 4 peeL 2 5 sail 

Filter names that do not contain '.' (with regex mode disabled).

>>> import pandas as pd >>> import janitor >>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame() >>> df  Name 0 JoseChen 1 Brian.Salvi >>> df.filter_string( ...  column_name="Name", search_string=".", regex=False, complement=True ... )  Name 0 JoseChen 
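
The flags parameter is forwarded to pandas.Series.str.contains, so standard re flags apply. A minimal sketch, assuming the first df above; it retains both "peeL" and "sail" regardless of case:

import re
df.filter_string(column_name="b", search_string="L", flags=re.IGNORECASE)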

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to filter. The column should contain strings.

required
search_string str

A regex pattern or a (sub-)string to search.

required
complement bool

Whether to return the complement of the filter or not. If set to True, then the rows for which the string search fails are retained instead.

False
case bool

If True, case sensitive.

True
flags int

Flags to pass through to the re module, e.g. re.IGNORECASE.

0
na Any

Fill value for missing values. The default depends on dtype of the array. For object-dtype, numpy.nan is used. For StringDtype, pandas.NA is used.

None
regex bool

If True, assumes search_string is a regular expression. If False, treats the search_string as a literal string.

True

Returns:

Type Description
DataFrame

A filtered pandas DataFrame.

Source code in janitor/functions/filter.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") def filter_string( df: pd.DataFrame, column_name: Hashable, search_string: str, complement: bool = False, case: bool = True, flags: int = 0, na: Any = None, regex: bool = True, ) -> pd.DataFrame:  """Filter a string-based column according to whether it contains a substring.  This is super sugary syntax that builds on top of `pandas.Series.str.contains`.  It is meant to be the method-chaining equivalent of the following:  ```python  df = df[df[column_name].str.contains(search_string)]]  ```  This method does not mutate the original DataFrame.  Examples:  Retain rows whose column values contain a particular substring.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})  >>> df  a b  0 3 bear  1 4 peeL  2 5 sail  >>> df.filter_string(column_name="b", search_string="ee")  a b  1 4 peeL  >>> df.filter_string(column_name="b", search_string="L", case=False)  a b  1 4 peeL  2 5 sail  Filter names does not contain `'.'` (disable regex mode).  >>> import pandas as pd  >>> import janitor  >>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()  >>> df  Name  0 JoseChen  1 Brian.Salvi  >>> df.filter_string(  ... column_name="Name", search_string=".", regex=False, complement=True  ... )  Name  0 JoseChen  Args:  df: A pandas DataFrame.  column_name: The column to filter. The column should contain strings.  search_string: A regex pattern or a (sub-)string to search.  complement: Whether to return the complement of the filter or not. If  set to True, then the rows for which the string search fails are retained  instead.  case: If True, case sensitive.  flags: Flags to pass through to the re module, e.g. re.IGNORECASE.  na: Fill value for missing values. The default depends on dtype of  the array. For object-dtype, `numpy.nan` is used. For `StringDtype`,  `pandas.NA` is used.  regex: If True, assumes `search_string` is a regular expression. If False,  treats the `search_string` as a literal string.  Returns:  A filtered pandas DataFrame.  """ # noqa: E501 criteria = df[column_name].str.contains( pat=search_string, case=case, flags=flags, na=na, regex=regex, ) if complement: return df[~criteria] return df[criteria] 

find_replace

Implementation for find_replace.

find_replace(df, match='exact', **mappings)

Perform a find-and-replace action on provided columns.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.replace instead.

Depending on use case, users can choose either exact, full-value matching, or regular-expression-based fuzzy matching (hence allowing substring matching in the latter case). For strings, the matching is always case sensitive.

Examples:

For instance, given a DataFrame containing orders at a coffee shop:

>>> df = pd.DataFrame( ...  { ...  "customer": ["Mary", "Tom", "Lila"], ...  "order": ["ice coffee", "lemonade", "regular coffee"], ...  } ... ) >>> df  customer order 0 Mary ice coffee 1 Tom lemonade 2 Lila regular coffee 

Our task is to replace the values ice coffee and regular coffee in the order column with latte.

Example 1 - exact matching (functional usage):

>>> df = find_replace( ...  df, ...  match="exact", ...  order={"ice coffee": "latte", "regular coffee": "latte"}, ... ) >>> df  customer order 0 Mary latte 1 Tom lemonade 2 Lila latte 

Example 1 - exact matching (method chaining):

>>> df = df.find_replace( ...  match="exact", ...  order={"ice coffee": "latte", "regular coffee": "latte"}, ... ) >>> df  customer order 0 Mary latte 1 Tom lemonade 2 Lila latte 

Example 2 - Regular-expression-based matching (functional usage):

>>> df = find_replace( ...  df, ...  match="regex", ...  order={"coffee$": "latte"}, ... ) >>> df  customer order 0 Mary latte 1 Tom lemonade 2 Lila latte 

Example 2 - Regular-expression-based matching (method chaining usage):

>>> df = df.find_replace( ...  match="regex", ...  order={"coffee$": "latte"}, ... ) >>> df  customer order 0 Mary latte 1 Tom lemonade 2 Lila latte 

To perform a find and replace on the entire DataFrame, pandas' df.replace() function provides the appropriate functionality. You can find more detail in the replace docs.
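
For reference, a hedged migration sketch of the coffee-shop example using pd.DataFrame.replace with a nested (column-keyed) mapping, which is the replacement the deprecation note points to:

df = df.replace({"order": {"ice coffee": "latte", "regular coffee": "latte"}})
df = df.replace({"order": {"coffee$": "latte"}}, regex=True)  # regex-based variant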

This function only works with column names that have no spaces or punctuation in them. For example, a column name item_name would work with find_replace, because it is a contiguous string that can be parsed correctly, but item name would not be parsed correctly by the Python interpreter.

If you have column names that might not be compatible, we recommend calling clean_names() as the first method call. If, for whatever reason, that is not possible, then _find_replace is available as a function that you can use in a pandas pipe call.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
match str

Whether to perform an exact match or a regex match. Valid values are "exact" or "regex".

'exact'
**mappings Any

keyword arguments corresponding to column names that have dictionaries passed in indicating what to find (keys) and what to replace with (values).

{}

Returns:

Type Description
DataFrame

A pandas DataFrame with replaced values.

Source code in janitor/functions/find_replace.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.replace` instead." ) ) def find_replace( df: pd.DataFrame, match: str = "exact", **mappings: Any ) -> pd.DataFrame:  """Perform a find-and-replace action on provided columns.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.replace` instead.  Depending on use case, users can choose either exact, full-value matching,  or regular-expression-based fuzzy matching  (hence allowing substring matching in the latter case).  For strings, the matching is always case sensitive.  Examples:  For instance, given a DataFrame containing orders at a coffee shop:  >>> df = pd.DataFrame(  ... {  ... "customer": ["Mary", "Tom", "Lila"],  ... "order": ["ice coffee", "lemonade", "regular coffee"],  ... }  ... )  >>> df  customer order  0 Mary ice coffee  1 Tom lemonade  2 Lila regular coffee  Our task is to replace values `ice coffee` and `regular coffee`  of the `order` column into `latte`.  Example 1 - exact matching (functional usage):  >>> df = find_replace(  ... df,  ... match="exact",  ... order={"ice coffee": "latte", "regular coffee": "latte"},  ... )  >>> df  customer order  0 Mary latte  1 Tom lemonade  2 Lila latte  Example 1 - exact matching (method chaining):  >>> df = df.find_replace(  ... match="exact",  ... order={"ice coffee": "latte", "regular coffee": "latte"},  ... )  >>> df  customer order  0 Mary latte  1 Tom lemonade  2 Lila latte  Example 2 - Regular-expression-based matching (functional usage):  >>> df = find_replace(  ... df,  ... match="regex",  ... order={"coffee$": "latte"},  ... )  >>> df  customer order  0 Mary latte  1 Tom lemonade  2 Lila latte  Example 2 - Regular-expression-based matching (method chaining usage):  >>> df = df.find_replace(  ... match="regex",  ... order={"coffee$": "latte"},  ... )  >>> df  customer order  0 Mary latte  1 Tom lemonade  2 Lila latte  To perform a find and replace on the entire DataFrame,  pandas' `df.replace()` function provides the appropriate functionality.  You can find more detail on the [replace] docs.  [replace]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html  This function only works with column names that have no spaces  or punctuation in them.  For example, a column name `item_name` would work with `find_replace`,  because it is a contiguous string that can be parsed correctly,  but `item name` would not be parsed correctly by the Python interpreter.  If you have column names that might not be compatible,  we recommend calling on [`clean_names()`][janitor.functions.clean_names.clean_names]  as the first method call. If, for whatever reason, that is not possible,  then `_find_replace` is available as a function  that you can do a pandas [pipe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html) call on.  Args:  df: A pandas DataFrame.  match: Whether or not to perform an exact match or not.  Valid values are "exact" or "regex".  **mappings: keyword arguments corresponding to column names  that have dictionaries passed in indicating what to find (keys)  and what to replace with (values).  Returns:  A pandas DataFrame with replaced values.  """ # noqa: E501 for column_name, mapper in mappings.items(): df = _find_replace(df, column_name, mapper, match=match) return df 

flag_nulls

Implementation source for flag_nulls.

flag_nulls(df, column_name='null_flag', columns=None)

Creates a new column to indicate whether you have null values in a given row.

If the columns parameter is not set, looks across the entire DataFrame, otherwise will look only in the columns you set.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": ["w", "x", None, "z"], ...  "b": [5, None, 7, 8], ...  } ... ) >>> df.flag_nulls()  a b null_flag 0 w 5.0 0 1 x NaN 1 2 None 7.0 1 3 z 8.0 0 >>> df.flag_nulls(columns="b")  a b null_flag 0 w 5.0 0 1 x NaN 1 2 None 7.0 0 3 z 8.0 0 
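
A small additional sketch (the output column name below, any_nulls, is arbitrary): restrict the check to a list of columns and choose the flag column's name.

df.flag_nulls(columns=["a", "b"], column_name="any_nulls")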

Parameters:

Name Type Description Default
df DataFrame

Input pandas DataFrame.

required
column_name Optional[Hashable]

Name for the output column.

'null_flag'
columns Optional[Union[str, Iterable[str], Hashable]]

List of columns to look at for finding null values. If you only want to look at one column, you can simply give its name. If set to None (default), all DataFrame columns are used.

None

Raises:

Type Description
ValueError

If column_name is already present in the DataFrame.

ValueError

If any column within columns is not present in the DataFrame.

Returns:

Type Description
DataFrame

Input dataframe with the null flag column.

Source code in janitor/functions/flag_nulls.py
@pf.register_dataframe_method def flag_nulls( df: pd.DataFrame, column_name: Optional[Hashable] = "null_flag", columns: Optional[Union[str, Iterable[str], Hashable]] = None, ) -> pd.DataFrame:  """Creates a new column to indicate whether you have null values in a given  row.  If the columns parameter is not set, looks across the entire  DataFrame, otherwise will look only in the columns you set.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": ["w", "x", None, "z"],  ... "b": [5, None, 7, 8],  ... }  ... )  >>> df.flag_nulls()  a b null_flag  0 w 5.0 0  1 x NaN 1  2 None 7.0 1  3 z 8.0 0  >>> df.flag_nulls(columns="b")  a b null_flag  0 w 5.0 0  1 x NaN 1  2 None 7.0 0  3 z 8.0 0  Args:  df: Input pandas DataFrame.  column_name: Name for the output column.  columns: List of columns to look at for finding null values. If you  only want to look at one column, you can simply give its name.  If set to None (default), all DataFrame columns are used.  Raises:  ValueError: If `column_name` is already present in the  DataFrame.  ValueError: If any column within `columns` is not present in  the DataFrame.  Returns:  Input dataframe with the null flag column.  <!--  # noqa: DAR402  -->  """ # Sort out columns input if isinstance(columns, str): columns = [columns] elif columns is None: columns = df.columns elif not isinstance(columns, Iterable): # catches other hashable types columns = [columns] # Input sanitation checks check_column(df, columns) check_column(df, [column_name], present=False) # This algorithm works best for n_rows >> n_cols. See issue #501 null_array = np.zeros(len(df)) for col in columns: null_array = np.logical_or(null_array, pd.isna(df[col])) df = df.copy() df[column_name] = null_array.astype(int) return df 

get_dupes

Implementation of the get_dupes function

get_dupes(df, column_names=None)

Return all duplicate rows.

This method does not mutate the original DataFrame.

Examples:

Method chaining syntax:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "item": ["shoe", "shoe", "bag", "shoe", "bag"], ...  "quantity": [100, 100, 75, 200, 75], ...  } ... ) >>> df  item quantity 0 shoe 100 1 shoe 100 2 bag 75 3 shoe 200 4 bag 75 >>> df.get_dupes()  item quantity 0 shoe 100 1 shoe 100 2 bag 75 4 bag 75 

Optional column_names usage:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "item": ["shoe", "shoe", "bag", "shoe", "bag"], ...  "quantity": [100, 100, 75, 200, 75], ...  } ... ) >>> df  item quantity 0 shoe 100 1 shoe 100 2 bag 75 3 shoe 200 4 bag 75 >>> df.get_dupes(column_names=["item"])  item quantity 0 shoe 100 1 shoe 100 2 bag 75 3 shoe 200 4 bag 75 >>> df.get_dupes(column_names=["quantity"])  item quantity 0 shoe 100 1 shoe 100 2 bag 75 4 bag 75 

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
column_names Optional[Union[str, Iterable[str], Hashable]]

A column name or an iterable (list or tuple) of column names. Following pandas API, this only considers certain columns for identifying duplicates. Defaults to using all columns.

None

Returns:

Type Description
DataFrame

The duplicate rows, as a pandas DataFrame.

Source code in janitor/functions/get_dupes.py
@pf.register_dataframe_method @deprecated_alias(columns="column_names") def get_dupes( df: pd.DataFrame, column_names: Optional[Union[str, Iterable[str], Hashable]] = None, ) -> pd.DataFrame:  """  Return all duplicate rows.  This method does not mutate the original DataFrame.  Examples:  Method chaining syntax:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "item": ["shoe", "shoe", "bag", "shoe", "bag"],  ... "quantity": [100, 100, 75, 200, 75],  ... }  ... )  >>> df  item quantity  0 shoe 100  1 shoe 100  2 bag 75  3 shoe 200  4 bag 75  >>> df.get_dupes()  item quantity  0 shoe 100  1 shoe 100  2 bag 75  4 bag 75  Optional `column_names` usage:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "item": ["shoe", "shoe", "bag", "shoe", "bag"],  ... "quantity": [100, 100, 75, 200, 75],  ... }  ... )  >>> df  item quantity  0 shoe 100  1 shoe 100  2 bag 75  3 shoe 200  4 bag 75  >>> df.get_dupes(column_names=["item"])  item quantity  0 shoe 100  1 shoe 100  2 bag 75  3 shoe 200  4 bag 75  >>> df.get_dupes(column_names=["quantity"])  item quantity  0 shoe 100  1 shoe 100  2 bag 75  4 bag 75  Args:  df: The pandas DataFrame object.  column_names: A column name or an iterable  (list or tuple) of column names. Following pandas API, this only  considers certain columns for identifying duplicates. Defaults  to using all columns.  Returns:  The duplicate rows, as a pandas DataFrame.  """ return df.loc[df.duplicated(subset=column_names, keep=False)] 

groupby_agg

Implementation source for groupby_agg.

groupby_agg(df, by, new_column_name, agg_column_name, agg, dropna=True)

Shortcut for assigning a groupby-transform to a new column.

This method does not mutate the original DataFrame.

Intended to be the method-chaining equivalent of:

df = df.assign(...=df.groupby(...)[...].transform(...)) 

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

Basic usage.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "item": ["shoe", "shoe", "bag", "shoe", "bag"], ...  "quantity": [100, 120, 75, 200, 25], ...  } ... ) >>> df.groupby_agg( ...  by="item", ...  agg="mean", ...  agg_column_name="quantity", ...  new_column_name="avg_quantity", ... )  item quantity avg_quantity 0 shoe 100 140.0 1 shoe 120 140.0 2 bag 75 50.0 3 shoe 200 140.0 4 bag 25 50.0 

Set dropna=False to compute the aggregation, treating the null values in the by column as an isolated "group".

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "x": ["a", "a", None, "b"], ...  "y": [9, 9, 9, 9], ...  } ... ) >>> df.groupby_agg( ...  by="x", ...  agg="count", ...  agg_column_name="y", ...  new_column_name="y_count", ...  dropna=False, ... )  x y y_count 0 a 9 2 1 a 9 2 2 None 9 1 3 b 9 1 
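
Per the equivalence stated above, a minimal plain-pandas sketch of the first example (a groupby-transform assigned to a new column), which is the pattern the deprecation note recommends migrating to:

df = df.assign(avg_quantity=df.groupby("item")["quantity"].transform("mean"))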

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
by Union[List, Callable, str]

Column(s) to groupby on, will be passed into DataFrame.groupby.

required
new_column_name str

Name of the aggregation output column.

required
agg_column_name str

Name of the column to aggregate over.

required
agg Union[Callable, str]

How to aggregate.

required
dropna bool

Whether or not to include null values, if present in the by column(s). Default is True (null values in by are assigned NaN in the new column).

True

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/groupby_agg.py
@pf.register_dataframe_method @deprecated_alias(new_column="new_column_name", agg_column="agg_column_name") @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `janitor.transform_column` instead." ) ) def groupby_agg( df: pd.DataFrame, by: Union[List, Callable, str], new_column_name: str, agg_column_name: str, agg: Union[Callable, str], dropna: bool = True, ) -> pd.DataFrame:  """Shortcut for assigning a groupby-transform to a new column.  This method does not mutate the original DataFrame.  Intended to be the method-chaining equivalent of:  ```python  df = df.assign(...=df.groupby(...)[...].transform(...))  ```  !!!note  This function will be deprecated in a 1.x release.  Please use  [`jn.transform_column`][janitor.functions.transform_columns.transform_column]  instead.  Examples:  Basic usage.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "item": ["shoe", "shoe", "bag", "shoe", "bag"],  ... "quantity": [100, 120, 75, 200, 25],  ... }  ... )  >>> df.groupby_agg(  ... by="item",  ... agg="mean",  ... agg_column_name="quantity",  ... new_column_name="avg_quantity",  ... )  item quantity avg_quantity  0 shoe 100 140.0  1 shoe 120 140.0  2 bag 75 50.0  3 shoe 200 140.0  4 bag 25 50.0  Set `dropna=False` to compute the aggregation, treating the null  values in the `by` column as an isolated "group".  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "x": ["a", "a", None, "b"],  ... "y": [9, 9, 9, 9],  ... }  ... )  >>> df.groupby_agg(  ... by="x",  ... agg="count",  ... agg_column_name="y",  ... new_column_name="y_count",  ... dropna=False,  ... )  x y y_count  0 a 9 2  1 a 9 2  2 None 9 1  3 b 9 1  Args:  df: A pandas DataFrame.  by: Column(s) to groupby on, will be passed into `DataFrame.groupby`.  new_column_name: Name of the aggregation output column.  agg_column_name: Name of the column to aggregate over.  agg: How to aggregate.  dropna: Whether or not to include null values, if present in the  `by` column(s). Default is True (null values in `by` are assigned NaN in  the new column).  Returns:  A pandas DataFrame.  """ # noqa: E501 return df.assign( **{ new_column_name: df.groupby(by, dropna=dropna)[agg_column_name].transform( agg ), } ) 

groupby_topk

Implementation of the groupby_topk function

groupby_topk(df, by, column, k, dropna=True, ascending=True, ignore_index=True)

Return top k rows from a groupby of a set of columns.

Returns a DataFrame that has the top k values per column, grouped by by. Under the hood it uses nlargest/nsmallest for numeric columns, which avoids sorting the entire dataframe and is usually more performant; for non-numeric columns, pd.sort_values is used. No sorting is applied to the by column(s); their order is maintained in the final output.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "age": [20, 23, 22, 43, 21], ...  "id": [1, 4, 6, 2, 5], ...  "result": ["pass", "pass", "fail", "pass", "fail"], ...  } ... ) >>> df  age id result 0 20 1 pass 1 23 4 pass 2 22 6 fail 3 43 2 pass 4 21 5 fail 

Ascending top 3:

>>> df.groupby_topk(by="result", column="age", k=3)  age id result 0 20 1 pass 1 23 4 pass 2 43 2 pass 3 21 5 fail 4 22 6 fail 

Descending top 2:

>>> df.groupby_topk( ...  by="result", column="age", k=2, ascending=False, ignore_index=False ... )  age id result 3 43 2 pass 1 23 4 pass 2 22 6 fail 4 21 5 fail 

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
by Union[list, Hashable]

Column name(s) to group input DataFrame df by.

required
column Hashable

Name of the column that determines k rows to return.

required
k int

Number of top rows to return for each group.

required
dropna bool

If True, and NA values exist in by, the NA values are not used in the groupby computation to get the relevant k rows. If False, and NA values exist in by, then the NA values are used in the groupby computation to get the relevant k rows.

True
ascending bool

If True, the smallest top k rows, determined by column, are returned; if False, the largest top k rows, determined by column, are returned.

True
ignore_index bool

If True, the original index is ignored. If False, the original index for the top k rows is retained.

True

Raises:

Type Description
ValueError

If k is less than 1.

Returns:

Type Description
DataFrame

A pandas DataFrame with top k rows per column, grouped by by.

Source code in janitor/functions/groupby_topk.py
@pf.register_dataframe_method @deprecated_alias(groupby_column_name="by", sort_column_name="column") def groupby_topk( df: pd.DataFrame, by: Union[list, Hashable], column: Hashable, k: int, dropna: bool = True, ascending: bool = True, ignore_index: bool = True, ) -> pd.DataFrame:  """Return top `k` rows from a groupby of a set of columns.  Returns a DataFrame that has the top `k` values per `column`,  grouped by `by`. Under the hood it uses `nlargest/nsmallest`,  for numeric columns, which avoids sorting the entire dataframe,  and is usually more performant. For non-numeric columns, `pd.sort_values`  is used.  No sorting is done to the `by` column(s); the order is maintained  in the final output.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "age": [20, 23, 22, 43, 21],  ... "id": [1, 4, 6, 2, 5],  ... "result": ["pass", "pass", "fail", "pass", "fail"],  ... }  ... )  >>> df  age id result  0 20 1 pass  1 23 4 pass  2 22 6 fail  3 43 2 pass  4 21 5 fail  Ascending top 3:  >>> df.groupby_topk(by="result", column="age", k=3)  age id result  0 20 1 pass  1 23 4 pass  2 43 2 pass  3 21 5 fail  4 22 6 fail  Descending top 2:  >>> df.groupby_topk(  ... by="result", column="age", k=2, ascending=False, ignore_index=False  ... )  age id result  3 43 2 pass  1 23 4 pass  2 22 6 fail  4 21 5 fail  Args:  df: A pandas DataFrame.  by: Column name(s) to group input DataFrame `df` by.  column: Name of the column that determines `k` rows  to return.  k: Number of top rows to return for each group.  dropna: If `True`, and `NA` values exist in `by`, the `NA`  values are not used in the groupby computation to get the relevant  `k` rows. If `False`, and `NA` values exist in `by`, then the `NA`  values are used in the groupby computation to get the relevant  `k` rows.  ascending: If `True`, the smallest top `k` rows,  determined by `column` are returned; if `False, the largest top `k`  rows, determined by `column` are returned.  ignore_index: If `True`, the original index is ignored.  If `False`, the original index for the top `k` rows is retained.  Raises:  ValueError: If `k` is less than 1.  Returns:  A pandas DataFrame with top `k` rows per `column`, grouped by `by`.  """ # noqa: E501 if isinstance(by, Hashable): by = [by] check("by", by, [Hashable, list]) check_column(df, [column]) check_column(df, by) if k < 1: raise ValueError( "Numbers of rows per group to be returned must be greater than 0." ) indices = df.groupby(by=by, dropna=dropna, sort=False, observed=True) indices = indices[column] try: if ascending: indices = indices.nsmallest(n=k) else: indices = indices.nlargest(n=k) except TypeError: indices = indices.apply(lambda d: d.sort_values(ascending=ascending).head(k)) indices = indices.index.get_level_values(-1) if ignore_index: return df.loc[indices].reset_index(drop=True) return df.loc[indices] 

impute

Implementation of impute function

impute(df, column_names, value=None, statistic_column_name=None)

Method-chainable imputation of values in a column.

This method does not mutate the original DataFrame.

Underneath the hood, this function calls the .fillna() method available to every pandas.Series object.

Either one of value or statistic_column_name should be provided.

If value is provided, then all null values in the selected column will take on the value provided.

If statistic_column_name is provided, then all null values in the selected column(s) will take on the summary statistic value of other non-null values.

Column selection in column_names is possible using the select syntax.

Currently supported statistics include:

  • mean (also aliased by average)
  • median
  • mode
  • minimum (also aliased by min)
  • maximum (also aliased by max)

Examples:

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": [1, 2, 3], ...  "sales": np.nan, ...  "score": [np.nan, 3, 2], ...  } ... ) >>> df  a sales score 0 1 NaN NaN 1 2 NaN 3.0 2 3 NaN 2.0 

Imputing null values with 0 (using the value parameter):

>>> df.impute(column_names="sales", value=0.0)  a sales score 0 1 0.0 NaN 1 2 0.0 3.0 2 3 0.0 2.0 

Imputing null values with median (using the statistic_column_name parameter):

>>> df.impute(column_names="score", statistic_column_name="median")  a sales score 0 1 NaN 2.5 1 2 NaN 3.0 2 3 NaN 2.0 
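
Two further sketches (not in the original docstring examples): imputing several columns with one value, and using the mode statistic.

df.impute(column_names=["sales", "score"], value=0.0)
df.impute(column_names="score", statistic_column_name="mode")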

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Any

The name of the column(s) on which to impute values.

required
value Optional[Any]

The value used for imputation, passed into .fillna method of the underlying pandas Series.

None
statistic_column_name Optional[str]

The column statistic to impute.

None

Raises:

Type Description
ValueError

If both value and statistic_column_name are provided.

KeyError

If statistic_column_name is not one of mean, average, median, mode, minimum, min, maximum, or max.

Returns:

Type Description
DataFrame

An imputed pandas DataFrame.

Source code in janitor/functions/impute.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") @deprecated_alias(column_name="column_names") @deprecated_alias(statistic="statistic_column_name") def impute( df: pd.DataFrame, column_names: Any, value: Optional[Any] = None, statistic_column_name: Optional[str] = None, ) -> pd.DataFrame:  """Method-chainable imputation of values in a column.  This method does not mutate the original DataFrame.  Underneath the hood, this function calls the `.fillna()` method available  to every `pandas.Series` object.  Either one of `value` or `statistic_column_name` should be provided.  If `value` is provided, then all null values in the selected column will  take on the value provided.  If `statistic_column_name` is provided, then all null values in the  selected column(s) will take on the summary statistic value  of other non-null values.  Column selection in `column_names` is possible using the  [`select`][janitor.functions.select.select] syntax.  Currently supported statistics include:  - `mean` (also aliased by `average`)  - `median`  - `mode`  - `minimum` (also aliased by `min`)  - `maximum` (also aliased by `max`)  Examples:  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [1, 2, 3],  ... "sales": np.nan,  ... "score": [np.nan, 3, 2],  ... }  ... )  >>> df  a sales score  0 1 NaN NaN  1 2 NaN 3.0  2 3 NaN 2.0  Imputing null values with 0 (using the `value` parameter):  >>> df.impute(column_names="sales", value=0.0)  a sales score  0 1 0.0 NaN  1 2 0.0 3.0  2 3 0.0 2.0  Imputing null values with median (using the `statistic_column_name`  parameter):  >>> df.impute(column_names="score", statistic_column_name="median")  a sales score  0 1 NaN 2.5  1 2 NaN 3.0  2 3 NaN 2.0  Args:  df: A pandas DataFrame.  column_names: The name of the column(s) on which to impute values.  value: The value used for imputation, passed into `.fillna` method  of the underlying pandas Series.  statistic_column_name: The column statistic to impute.  Raises:  ValueError: If both `value` and `statistic_column_name` are  provided.  KeyError: If `statistic_column_name` is not one of `mean`,  `average`, `median`, `mode`, `minimum`, `min`, `maximum`, or  `max`.  Returns:  An imputed pandas DataFrame.  """ # Firstly, we check that only one of `value` or `statistic` are provided. if (value is None) and (statistic_column_name is None): raise ValueError("Kindly specify a value or a statistic_column_name") if value is not None and statistic_column_name is not None: raise ValueError( "Only one of `value` or `statistic_column_name` should be provided." ) column_names = get_index_labels([column_names], df, axis="columns") if value is not None: value = dict(product(column_names, [value])) else: # If statistic is provided, then we compute # the relevant summary statistic # from the other data. funcs = { "mean": "mean", "average": "mean", # aliased "median": "median", "mode": "mode", "minimum": "min", "min": "min", # aliased "maximum": "max", "max": "max", # aliased } # Check that the statistic keyword argument is one of the approved. if statistic_column_name not in funcs: raise KeyError(f"`statistic_column_name` must be one of {funcs.keys()}.") value = dict(product(column_names, [funcs[statistic_column_name]])) value = df.agg(value) # special treatment for mode if statistic_column_name == "mode": value = {key: val.at[0] for key, val in value.items()} return df.fillna(value=value) 

jitter

Implementation of the jitter function.

jitter(df, column_name, dest_column_name, scale, clip=None, random_state=None)

Adds Gaussian noise (jitter) to the values of a column.

A new column will be created containing the values of the original column with Gaussian noise added. For each value in the column, a Gaussian distribution is created having a location (mean) equal to the value and a scale (standard deviation) equal to scale. A random value is then sampled from this distribution, which is the jittered value. If a tuple is supplied for clip, then any values of the new column less than clip[0] will be set to clip[0], and any values greater than clip[1] will be set to clip[1]. Additionally, if a numeric value is supplied for random_state, this value will be used to set the random seed used for sampling. NaN values are ignored in this method.

This method mutates the original DataFrame.

Examples:

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]}) >>> df  a 0 3.0 1 4.0 2 5.0 3 NaN >>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)  a a_jit 0 3.0 3.496714 1 4.0 3.861736 2 5.0 5.647689 3 NaN NaN 
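
A minimal sketch of the clip option, assuming the same df: jittered values below 3.5 are set to 3.5 and values above 5.5 are set to 5.5.

df.jitter("a", dest_column_name="a_jit", scale=1, clip=(3.5, 5.5), random_state=42)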

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Name of the column containing values to add Gaussian jitter to.

required
dest_column_name str

The name of the new column containing the jittered values that will be created.

required
scale number

A positive value multiplied by the original column value to determine the scale (standard deviation) of the Gaussian distribution to sample from. (A value of zero results in no jittering.)

required
clip Optional[Iterable[number]]

An iterable of two values (minimum and maximum) to clip the jittered values to; defaults to None.

None
random_state Optional[number]

An integer or 1-d array value used to set the random seed; defaults to None.

None

Raises:

Type Description
TypeError

If column_name is not numeric.

ValueError

If scale is not a numerical value greater than 0.

ValueError

If clip is not an iterable of length 2.

ValueError

If clip[0] is greater than clip[1].

Returns:

Type Description
DataFrame

A pandas DataFrame with a new column containing Gaussian-jittered values from another column.

Source code in janitor/functions/jitter.py
@pf.register_dataframe_method def jitter( df: pd.DataFrame, column_name: Hashable, dest_column_name: str, scale: np.number, clip: Optional[Iterable[np.number]] = None, random_state: Optional[np.number] = None, ) -> pd.DataFrame:  """Adds Gaussian noise (jitter) to the values of a column.  A new column will be created containing the values of the original column  with Gaussian noise added.  For each value in the column, a Gaussian distribution is created  having a location (mean) equal to the value  and a scale (standard deviation) equal to `scale`.  A random value is then sampled from this distribution,  which is the jittered value.  If a tuple is supplied for `clip`,  then any values of the new column less than `clip[0]`  will be set to `clip[0]`,  and any values greater than `clip[1]` will be set to `clip[1]`.  Additionally, if a numeric value is supplied for `random_state`,  this value will be used to set the random seed used for sampling.  NaN values are ignored in this method.  This method mutates the original DataFrame.  Examples:  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]})  >>> df  a  0 3.0  1 4.0  2 5.0  3 NaN  >>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)  a a_jit  0 3.0 3.496714  1 4.0 3.861736  2 5.0 5.647689  3 NaN NaN  Args:  df: A pandas DataFrame.  column_name: Name of the column containing  values to add Gaussian jitter to.  dest_column_name: The name of the new column containing the  jittered values that will be created.  scale: A positive value multiplied by the original  column value to determine the scale (standard deviation) of the  Gaussian distribution to sample from. (A value of zero results in  no jittering.)  clip: An iterable of two values (minimum and maximum) to clip  the jittered values to, default to None.  random_state: An integer or 1-d array value used to set the random  seed, default to None.  Raises:  TypeError: If `column_name` is not numeric.  ValueError: If `scale` is not a numerical value  greater than `0`.  ValueError: If `clip` is not an iterable of length `2`.  ValueError: If `clip[0]` is greater than `clip[1]`.  Returns:  A pandas DataFrame with a new column containing  Gaussian-jittered values from another column.  """ # Check types check("scale", scale, [int, float]) # Check that `column_name` is a numeric column if not np.issubdtype(df[column_name].dtype, np.number): raise TypeError(f"{column_name} must be a numeric column.") if scale <= 0: raise ValueError("`scale` must be a numeric value greater than 0.") values = df[column_name] if random_state is not None: np.random.seed(random_state) result = np.random.normal(loc=values, scale=scale) if clip: # Ensure `clip` has length 2 if len(clip) != 2: raise ValueError("`clip` must be an iterable of length 2.") # Ensure the values in `clip` are ordered as min, max if clip[1] < clip[0]: raise ValueError("`clip[0]` must be less than or equal to `clip[1]`.") result = np.clip(result, *clip) df[dest_column_name] = result return df 

join_apply

Implementation of the join_apply function

join_apply(df, func, new_column_name)

Join the result of applying a function across dataframe rows.

This method does not mutate the original DataFrame.

This is a convenience function that allows us to apply arbitrary functions that take any combination of information from any of the columns. The only requirement is that the function signature takes in a row from the DataFrame.

Examples:

Sum the result of two columns into a new column.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]}) >>> df  a b 0 1 2 1 2 3 2 3 4 >>> df.join_apply( ...  func=lambda x: 2 * x["a"] + x["b"], ...  new_column_name="2a+b", ... )  a b 2a+b 0 1 2 4 1 2 3 7 2 3 4 10 

Incorporating conditionals in func.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]}) >>> df  a b 0 1 20 1 2 30 2 3 40 >>> def take_a_if_even(x): ...  if x["a"] % 2 == 0: ...  return x["a"] ...  else: ...  return x["b"] >>> df.join_apply(take_a_if_even, "a_if_even")  a b a_if_even 0 1 20 20 1 2 30 2 2 3 40 40 

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
func Callable

A function that is applied elementwise across all rows of the DataFrame.

required
new_column_name str

Name of the resulting column.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with new column appended.

Source code in janitor/functions/join_apply.py
@pf.register_dataframe_method def join_apply( df: pd.DataFrame, func: Callable, new_column_name: str, ) -> pd.DataFrame:  """Join the result of applying a function across dataframe rows.  This method does not mutate the original DataFrame.  This is a convenience function that allows us to apply arbitrary functions  that take any combination of information from any of the columns. The only  requirement is that the function signature takes in a row from the  DataFrame.  Examples:  Sum the result of two columns into a new column.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})  >>> df  a b  0 1 2  1 2 3  2 3 4  >>> df.join_apply(  ... func=lambda x: 2 * x["a"] + x["b"],  ... new_column_name="2a+b",  ... )  a b 2a+b  0 1 2 4  1 2 3 7  2 3 4 10  Incorporating conditionals in `func`.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]})  >>> df  a b  0 1 20  1 2 30  2 3 40  >>> def take_a_if_even(x):  ... if x["a"] % 2 == 0:  ... return x["a"]  ... else:  ... return x["b"]  >>> df.join_apply(take_a_if_even, "a_if_even")  a b a_if_even  0 1 20 20  1 2 30 2  2 3 40 40  Args:  df: A pandas DataFrame.  func: A function that is applied elementwise across all rows of the  DataFrame.  new_column_name: Name of the resulting column.  Returns:  A pandas DataFrame with new column appended.  """ # noqa: E501 df = df.copy().join(df.apply(func, axis=1).rename(new_column_name)) return df 

label_encode

Implementation of label_encode function

label_encode(df, column_names)

Convert labels into numerical data.

This method will create a new column with the string _enc appended after the original column's name. Consider this to be syntactic sugar. This function uses the factorize pandas function under the hood.

This method behaves differently from encode_categorical. This method creates a new column of numeric data. encode_categorical replaces the dtype of the original column with a categorical dtype.

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use factorize_columns instead.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "foo": ["b", "b", "a", "c", "b"], ...  "bar": range(4, 9), ...  } ... ) >>> df  foo bar 0 b 4 1 b 5 2 a 6 3 c 7 4 b 8 >>> df.label_encode(column_names="foo")  foo bar foo_enc 0 b 4 0 1 b 5 0 2 a 6 1 3 c 7 2 4 b 8 0 
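
Per the deprecation note, a hedged migration sketch using factorize_columns, assuming it accepts the same column_names argument and produces a similarly suffixed foo_enc column:

df = df.factorize_columns(column_names="foo")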

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
column_names Union[str, Iterable[str], Hashable]

A column name or an iterable (list or tuple) of column names.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/label_encode.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `janitor.factorize_columns` instead." ) ) @deprecated_alias(columns="column_names") def label_encode( df: pd.DataFrame, column_names: Union[str, Iterable[str], Hashable], ) -> pd.DataFrame:  """Convert labels into numerical data.  This method will create a new column with the string `_enc` appended  after the original column's name.  Consider this to be syntactic sugar.  This function uses the `factorize` pandas function under the hood.  This method behaves differently from  [`encode_categorical`][janitor.functions.encode_categorical.encode_categorical].  This method creates a new column of numeric data.  [`encode_categorical`][janitor.functions.encode_categorical.encode_categorical]  replaces the dtype of the original column with a *categorical* dtype.  This method mutates the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use [`factorize_columns`][janitor.functions.factorize_columns.factorize_columns]  instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "foo": ["b", "b", "a", "c", "b"],  ... "bar": range(4, 9),  ... }  ... )  >>> df  foo bar  0 b 4  1 b 5  2 a 6  3 c 7  4 b 8  >>> df.label_encode(column_names="foo")  foo bar foo_enc  0 b 4 0  1 b 5 0  2 a 6 1  3 c 7 2  4 b 8 0  Args:  df: The pandas DataFrame object.  column_names: A column name or an iterable (list  or tuple) of column names.  Returns:  A pandas DataFrame.  """ # noqa: E501 warnings.warn( "`label_encode` will be deprecated in a 1.x release. " "Please use `factorize_columns` instead." ) df = _factorize(df, column_names, "_enc") return df 

limit_column_characters

Implementation of limit_column_characters.

limit_column_characters(df, column_length, col_separator='_')

Truncate column sizes to a specific length.

This method mutates the original DataFrame.

All column names are truncated to the given length. Columns whose names become duplicates after truncation get the separator character and a numeric index appended, except for the first occurrence of each distinct name.

Examples:

>>> import pandas as pd >>> import janitor >>> data_dict = { ...  "really_long_name": [9, 8, 7], ...  "another_really_long_name": [2, 4, 6], ...  "another_really_longer_name": list("xyz"), ...  "this_is_getting_out_of_hand": list("pqr"), ... } >>> df = pd.DataFrame(data_dict) >>> df  really_long_name another_really_long_name another_really_longer_name this_is_getting_out_of_hand 0 9 2 x p 1 8 4 y q 2 7 6 z r >>> df.limit_column_characters(7)  really_ another another_1 this_is 0 9 2 x p 1 8 4 y q 2 7 6 z r 
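A minimal plain-pandas sketch of the truncate-then-number logic described above (an approximation of the behaviour, not the library's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "really_long_name": [9, 8, 7],
    "another_really_long_name": [2, 4, 6],
    "another_really_longer_name": list("xyz"),
    "this_is_getting_out_of_hand": list("pqr"),
})

truncated = [name[:7] for name in df.columns]
seen, final = {}, []
for name in truncated:
    count = seen.get(name, 0)
    # the first occurrence keeps the bare name; later duplicates get "_<n>"
    final.append(name if count == 0 else f"{name}_{count}")
    seen[name] = count + 1
df.columns = final  # really_, another, another_1, this_is
```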

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_length int

Character length to which to truncate all column names. The column separator value and the number appended for duplicate column names do not count toward this length. Therefore, if all columns are truncated to 10 characters, the first distinct column will be 10 characters and the remaining will be 12 characters (assuming a column separator of one character).

required
col_separator str

The separator to use for counting distinct column values, for example, '_' or '.'. Supply an empty string (i.e. '') to remove the separator.

'_'

Returns:

Type Description
DataFrame

A pandas DataFrame with truncated column lengths.

Source code in janitor/functions/limit_column_characters.py
@pf.register_dataframe_method def limit_column_characters( df: pd.DataFrame, column_length: int, col_separator: str = "_", ) -> pd.DataFrame:  """Truncate column sizes to a specific length.  This method mutates the original DataFrame.  Method chaining will truncate all columns to a given length and append  a given separator character with the index of duplicate columns, except  for the first distinct column name.  Examples:  >>> import pandas as pd  >>> import janitor  >>> data_dict = {  ... "really_long_name": [9, 8, 7],  ... "another_really_long_name": [2, 4, 6],  ... "another_really_longer_name": list("xyz"),  ... "this_is_getting_out_of_hand": list("pqr"),  ... }  >>> df = pd.DataFrame(data_dict)  >>> df # doctest: +SKIP  really_long_name another_really_long_name another_really_longer_name this_is_getting_out_of_hand  0 9 2 x p  1 8 4 y q  2 7 6 z r  >>> df.limit_column_characters(7)  really_ another another_1 this_is  0 9 2 x p  1 8 4 y q  2 7 6 z r  Args:  df: A pandas DataFrame.  column_length: Character length for which to truncate all columns.  The column separator value and number for duplicate column name does  not contribute. Therefore, if all columns are truncated to 10  characters, the first distinct column will be 10 characters and the  remaining will be 12 characters (assuming a column separator of one  character).  col_separator: The separator to use for counting distinct column  values, for example, `'_'` or `'.'`.  Supply an empty string (i.e. `''`) to remove the separator.  Returns:  A pandas DataFrame with truncated column lengths.  """ # noqa: E501 check("column_length", column_length, [int]) check("col_separator", col_separator, [str]) col_names = df.columns col_names = [col_name[:column_length] for col_name in col_names] col_name_set = set(col_names) col_name_count = {} # If no columns are duplicates, we can skip the loops below. if len(col_name_set) == len(col_names): df.columns = col_names return df for col_name_to_check in col_name_set: count = 0 for idx, col_name in enumerate(col_names): if col_name_to_check == col_name: col_name_count[idx] = count count += 1 final_col_names = [] for idx, col_name in enumerate(col_names): if col_name_count[idx] > 0: col_name_to_append = col_name + col_separator + str(col_name_count[idx]) final_col_names.append(col_name_to_append) else: final_col_names.append(col_name) df.columns = final_col_names return df 

min_max_scale

min_max_scale(df, feature_range=(0, 1), column_name=None, jointly=False)

Scales DataFrame to between a minimum and maximum value.

One can optionally set a new target minimum and maximum value using the feature_range keyword argument.

If column_name is specified, only those column(s) are scaled; otherwise, the entire dataframe is scaled. If jointly is True, the selected columns (or the entire dataframe) are treated as a single block and scaled jointly; otherwise, each column is scaled separately.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1]}) >>> df.min_max_scale()  a b 0 0.0 0.0 1 1.0 1.0 >>> df.min_max_scale(jointly=True)  a b 0 0.5 0.0 1 1.0 0.5 

Setting custom minimum and maximum.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1]}) >>> df.min_max_scale(feature_range=(0, 100))  a b 0 0.0 0.0 1 100.0 100.0 >>> df.min_max_scale(feature_range=(0, 100), jointly=True)  a b 0 50.0 0.0 1 100.0 50.0 

Apply min-max to the selected columns.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1], "c": [1, 0]}) >>> df.min_max_scale( ...  feature_range=(0, 100), ...  column_name=["a", "c"], ... )  a b c 0 0.0 0 100.0 1 100.0 1 0.0 >>> df.min_max_scale( ...  feature_range=(0, 100), ...  column_name=["a", "c"], ...  jointly=True, ... )  a b c 0 50.0 0 50.0 1 100.0 1 0.0 >>> df.min_max_scale(feature_range=(0, 100), column_name="a")  a b c 0 0.0 0 1 1 100.0 1 0 

The aforementioned example might be applied to something like scaling the isoelectric points of amino acids. While technically they range from approx 3-10, we can also think of them on the pH scale which ranges from 1 to 14. Hence, 3 gets scaled not to 0 but approx. 0.15 instead, while 10 gets scaled to approx. 0.69 instead.
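For reference, the per-column transformation is the standard min-max formula; a minimal plain-pandas sketch, assuming each column is scaled independently:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0, 1]})
new_min, new_max = 0, 100  # i.e. feature_range=(0, 100)

# rescale each column from [col.min(), col.max()] into [new_min, new_max]
scaled = (df - df.min()) / (df.max() - df.min()) * (new_max - new_min) + new_min

# for a joint scaling, a single global minimum and maximum would be used instead,
# e.g. df.min().min() and df.max().max()
```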

Version Changed

  • 0.24.0
    • Deleted old_min, old_max, new_min, and new_max options.
    • Added feature_range, and jointly options.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
feature_range tuple[int | float, int | float]

Desired range of transformed data.

(0, 1)
column_name str | int | list[str | int] | Index

The column on which to perform scaling.

None
jointly bool

If True, scale the selected columns jointly (as a single block) rather than column by column.

False

Raises:

Type Description
ValueError

If feature_range isn't a tuple or list.

ValueError

If the length of feature_range isn't equal to two.

ValueError

If the elements of feature_range aren't numeric.

ValueError

If feature_range[1] <= feature_range[0].

Returns:

Type Description
DataFrame

A pandas DataFrame with scaled data.

Source code in janitor/functions/min_max_scale.py
@pf.register_dataframe_method @deprecated_kwargs( "old_min", "old_max", "new_min", "new_max", message=( "The keyword argument {argument!r} of {func_name!r} is deprecated. " "Please use 'feature_range' instead." ), ) @deprecated_alias(col_name="column_name") def min_max_scale( df: pd.DataFrame, feature_range: tuple[int | float, int | float] = (0, 1), column_name: str | int | list[str | int] | pd.Index = None, jointly: bool = False, ) -> pd.DataFrame:  """Scales DataFrame to between a minimum and maximum value.  One can optionally set a new target **minimum** and **maximum** value  using the `feature_range` keyword argument.  If `column_name` is specified, then only that column(s) of data is scaled.  Otherwise, the entire dataframe is scaled.  If `jointly` is `True`, the `column_names` provided entire dataframe will  be regnozied as the one to jointly scale. Otherwise, each column of data  will be scaled separately.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1]})  >>> df.min_max_scale()  a b  0 0.0 0.0  1 1.0 1.0  >>> df.min_max_scale(jointly=True)  a b  0 0.5 0.0  1 1.0 0.5  Setting custom minimum and maximum.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1]})  >>> df.min_max_scale(feature_range=(0, 100))  a b  0 0.0 0.0  1 100.0 100.0  >>> df.min_max_scale(feature_range=(0, 100), jointly=True)  a b  0 50.0 0.0  1 100.0 50.0  Apply min-max to the selected columns.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [1, 2], "b": [0, 1], "c": [1, 0]})  >>> df.min_max_scale(  ... feature_range=(0, 100),  ... column_name=["a", "c"],  ... )  a b c  0 0.0 0 100.0  1 100.0 1 0.0  >>> df.min_max_scale(  ... feature_range=(0, 100),  ... column_name=["a", "c"],  ... jointly=True,  ... )  a b c  0 50.0 0 50.0  1 100.0 1 0.0  >>> df.min_max_scale(feature_range=(0, 100), column_name="a")  a b c  0 0.0 0 1  1 100.0 1 0  The aforementioned example might be applied to something like scaling the  isoelectric points of amino acids. While technically they range from  approx 3-10, we can also think of them on the pH scale which ranges from  1 to 14. Hence, 3 gets scaled not to 0 but approx. 0.15 instead, while 10  gets scaled to approx. 0.69 instead.  !!! summary "Version Changed"  - 0.24.0  - Deleted `old_min`, `old_max`, `new_min`, and `new_max` options.  - Added `feature_range`, and `jointly` options.  Args:  df: A pandas DataFrame.  feature_range: Desired range of transformed data.  column_name: The column on which to perform scaling.  jointly: Scale the entire data if True.  Raises:  ValueError: If `feature_range` isn't tuple type.  ValueError: If the length of `feature_range` isn't equal to two.  ValueError: If the element of `feature_range` isn't number type.  ValueError: If `feature_range[1]` <= `feature_range[0]`.  Returns:  A pandas DataFrame with scaled data.  """ # noqa: E501 if not ( isinstance(feature_range, (tuple, list)) and len(feature_range) == 2 and all((isinstance(i, (int, float))) for i in feature_range) and feature_range[1] > feature_range[0] ): raise ValueError( "`feature_range` should be a range type contains number element, " "the first element must be greater than the second one" ) if column_name is not None: df = df.copy() # Avoid to change the original DataFrame. 
old_feature_range = df[column_name].pipe(_min_max_value, jointly) df[column_name] = df[column_name].pipe( _apply_min_max, *old_feature_range, *feature_range, ) else: old_feature_range = df.pipe(_min_max_value, jointly) df = df.pipe( _apply_min_max, *old_feature_range, *feature_range, ) return df 

move

Implementation of move.

move(df, source, target=None, position='before', axis=0)

Changes rows or columns positions in the dataframe.

It uses the select syntax, making it easy to move blocks of rows or columns at once.

This operation does not reset the index of the dataframe. The user must do so explicitly.

The dataframe must have unique column names or indices.

Examples:

Move a row:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")}) >>> df  a b 0 2 w 1 4 x 2 6 y 3 8 z >>> df.move(source=0, target=3, position="before", axis=0)  a b 1 4 x 2 6 y 0 2 w 3 8 z 

Move a column:

>>> import pandas as pd >>> import janitor >>> data = [{"a": 1, "b": 1, "c": 1, "d": "a", "e": "a", "f": "a"}] >>> df = pd.DataFrame(data) >>> df  a b c d e f 0 1 1 1 a a a >>> df.move(source="a", target="c", position="after", axis=1)  b c a d e f 0 1 1 1 a a a >>> df.move(source="f", target="b", position="before", axis=1)  a f b c d e 0 1 a 1 1 a a >>> df.move(source="a", target=None, position="after", axis=1)  b c d e f a 0 1 1 a a a 1 

Move columns:

>>> from pandas.api.types import is_numeric_dtype, is_string_dtype >>> df.move(source=is_string_dtype, target=None, position="before", axis=1)  d e f a b c 0 a a a 1 1 1 >>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)  d e f a b c 0 a a a 1 1 1 >>> df.move( ...  source=["d", "f"], target=is_numeric_dtype, position="before", axis=1 ... )  d f a b c e 0 a a 1 1 1 a 
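For a single column, the same reordering can be sketched with plain pandas indexing (an illustration only, not the library's implementation):

```python
import pandas as pd

df = pd.DataFrame([{"a": 1, "b": 1, "c": 1, "d": "a", "e": "a", "f": "a"}])

# move column "a" so it sits immediately after "c"
cols = [c for c in df.columns if c != "a"]
cols.insert(cols.index("c") + 1, "a")
df = df.loc[:, cols]  # b c a d e f
```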

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
source Any

Columns or rows to move.

required
target Any

Columns or rows to move adjacent to. If None and position == 'before', source is moved to the beginning; if position == 'after', source is moved to the end.

None
position str

Specifies the destination of the columns/rows. Values can be either before or after; defaults to before.

'before'
axis int

Axis along which the function is applied. 0 to move along the index, 1 to move along the columns.

0

Raises:

Type Description
ValueError

If axis is not 0 or 1.

ValueError

If position is not before or after.

Returns:

Type Description
DataFrame

The dataframe with the Series moved.

Source code in janitor/functions/move.py
@pf.register_dataframe_method def move( df: pd.DataFrame, source: Any, target: Any = None, position: str = "before", axis: int = 0, ) -> pd.DataFrame:  """Changes rows or columns positions in the dataframe.  It uses the  [`select`][janitor.functions.select.select] syntax,  making it easy to move blocks of rows or columns at once.  This operation does not reset the index of the dataframe. User must  explicitly do so.  The dataframe must have unique column names or indices.  Examples:  Move a row:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})  >>> df  a b  0 2 w  1 4 x  2 6 y  3 8 z  >>> df.move(source=0, target=3, position="before", axis=0)  a b  1 4 x  2 6 y  0 2 w  3 8 z  Move a column:  >>> import pandas as pd  >>> import janitor  >>> data = [{"a": 1, "b": 1, "c": 1, "d": "a", "e": "a", "f": "a"}]  >>> df = pd.DataFrame(data)  >>> df  a b c d e f  0 1 1 1 a a a  >>> df.move(source="a", target="c", position="after", axis=1)  b c a d e f  0 1 1 1 a a a  >>> df.move(source="f", target="b", position="before", axis=1)  a f b c d e  0 1 a 1 1 a a  >>> df.move(source="a", target=None, position="after", axis=1)  b c d e f a  0 1 1 a a a 1  Move columns:  >>> from pandas.api.types import is_numeric_dtype, is_string_dtype  >>> df.move(source=is_string_dtype, target=None, position="before", axis=1)  d e f a b c  0 a a a 1 1 1  >>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)  d e f a b c  0 a a a 1 1 1  >>> df.move(  ... source=["d", "f"], target=is_numeric_dtype, position="before", axis=1  ... )  d f a b c e  0 a a 1 1 1 a  Args:  df: The pandas DataFrame object.  source: Columns or rows to move.  target: Columns or rows to move adjacent to.  If `None` and `position == 'before'`, `source`  is moved to the beginning; if `position == 'after'`,  `source` is moved to the end.  position: Specifies the destination of the columns/rows.  Values can be either `before` or `after`; defaults to `before`.  axis: Axis along which the function is applied. 0 to move along  the index, 1 to move along the columns.  Raises:  ValueError: If `axis` is not `0` or `1`.  ValueError: If `position` is not `before` or `after`.  Returns:  The dataframe with the Series moved.  """ # noqa: E501 if axis not in [0, 1]: raise ValueError(f"Invalid axis '{axis}'. Can only be 0 or 1.") if position not in ["before", "after"]: raise ValueError( f"Invalid position '{position}'. Can only be 'before' or 'after'." ) mapping = {0: "index", 1: "columns"} names = getattr(df, mapping[axis]) assert names.is_unique index = np.arange(names.size) source = _select_index([source], df, mapping[axis]) source = _index_converter(source, index) if target is None: if position == "after": target = np.array([names.size]) else: target = np.array([0]) else: target = _select_index([target], df, mapping[axis]) target = _index_converter(target, index) index = np.delete(index, source) if position == "before": position = index.searchsorted(target[0]) else: position = index.searchsorted(target[-1]) + 1 start = index[:position] end = index[position:] position = np.concatenate([start, source, end]) return df.iloc(axis=axis)[position] 

mutate

Implementation of mutate.

mutate(df, *args)

New in version 0.31.0

Note

Before reaching for mutate, try pd.DataFrame.assign.

mutate creates new columns that are functions of existing columns. It can also modify columns (if the name is the same as an existing column).

The argument provided to args should be either a dictionary, a callable or a tuple; however, anything can be passed, as long as it can be aligned with the original DataFrame.

  • dictionary argument: If the argument is a dictionary, the value in the {key:value} pairing should be either a string, a callable or a tuple.

    • If the value in the dictionary is a string or a callable, the key of the dictionary should be an existing column name.

    Note

    • If the value is a string, the string should be a pandas string function, e.g "sum", "mean", etc.
    • If the value of the dictionary is a tuple, it should be of length 2, and of the form (column_name, mutation_func), where column_name should exist in the DataFrame, and mutation_func should be either a string or a callable. The key in the dictionary can be a new column name.

    Note

    • If mutation_func is a string, the string should be a pandas string function, e.g "sum", "mean", etc.
  • tuple argument: If the argument is a tuple, it should be of length 2, and of the form (column_name, mutation_func), where column_name should exist in the DataFrame, and mutation_func should be either a string or a callable.

    Note

    • if mutation_func is a string, the string should be a pandas string function, e.g "sum", "mean", etc.

    Note

    • column_name can be anything supported by the select syntax; as such multiple columns can be processed here - they will be processed individually.
  • callable argument: If the argument is a callable, the callable is applied on the DataFrame. The result from the callable should be a pandas Series or DataFrame.

Mutation does not occur on the original DataFrame.

Examples:

>>> import pandas as pd >>> import numpy as np >>> import janitor >>> df = pd.DataFrame( ...  { ...  "col1": [5, 10, 15], ...  "col2": [3, 6, 9], ...  "col3": [10, 100, 1_000], ...  } ... ) 

Transformation via a dictionary:

>>> df.mutate({"col4": ("col1", np.log10), "col1": np.log10})  col1 col2 col3 col4 0 0.698970 3 10 0.698970 1 1.000000 6 100 1.000000 2 1.176091 9 1000 1.176091 

Transformation via a tuple:

>>> df.mutate(("col1", np.log10))  col1 col2 col3 0 0.698970 3 10 1 1.000000 6 100 2 1.176091 9 1000 >>> df.mutate(("col*", np.log10))  col1 col2 col3 0 0.698970 0.477121 1.0 1 1.000000 0.778151 2.0 2 1.176091 0.954243 3.0 

Transformation via a callable:

>>> df.mutate(lambda df: df.sum(axis=1).rename("total"))  col1 col2 col3 total 0 5 3 10 18 1 10 6 100 116 2 15 9 1000 1024 
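For comparison, the dictionary and callable forms above can be approximated with pd.DataFrame.assign, as the note earlier suggests; a minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"col1": [5, 10, 15], "col2": [3, 6, 9], "col3": [10, 100, 1_000]})

# roughly {"col4": ("col1", np.log10), "col1": np.log10}
out = df.assign(col4=np.log10(df["col1"]), col1=np.log10(df["col1"]))

# roughly lambda df: df.sum(axis=1).rename("total")
out = df.assign(total=df.sum(axis=1))
```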

Parameters:

Name Type Description Default
df DataFrame | DataFrameGroupBy

A pandas DataFrame or GroupBy object.

required
args tuple[dict | tuple]

Either a dictionary or a tuple.

()

Raises:

Type Description
ValueError

If a tuple is passed and the length is not 2.

Returns:

Type Description
DataFrame | DataFrameGroupBy

A pandas DataFrame, Series, or GroupBy object.

Source code in janitor/functions/mutate.py
@pf.register_groupby_method @pf.register_dataframe_method def mutate( df: pd.DataFrame | DataFrameGroupBy, *args: tuple[dict | tuple], ) -> pd.DataFrame | DataFrameGroupBy:  """  !!! info "New in version 0.31.0"  !!!note  Before reaching for `mutate`, try `pd.DataFrame.assign`.  mutate creates new columns that are functions of existing columns.  It can also modify columns (if the name is the same as an existing column).  The argument provided to *args* should be either  a dictionary, a callable or a tuple; however,  anything can be passed, as long as it can  be aligned with the original DataFrame.  - **dictionary argument**:  If the argument is a dictionary,  the value in the `{key:value}` pairing  should be either a string, a callable or a tuple.  - If the value in the dictionary  is a string or a callable,  the key of the dictionary  should be an existing column name.  !!!note  - If the value is a string,  the string should be a pandas string function,  e.g "sum", "mean", etc.  - If the value of the dictionary is a tuple,  it should be of length 2, and of the form  `(column_name, mutation_func)`,  where `column_name` should exist in the DataFrame,  and `mutation_func` should be either a string or a callable.  The key in the dictionary can be a new column name.  !!!note  - If `mutation_func` is a string,  the string should be a pandas string function,  e.g "sum", "mean", etc.  - **tuple argument**:  If the argument is a tuple, it should be of length 2,  and of the form  `(column_name, mutation_func)`,  where `column_name` should exist in the DataFrame,  and `mutation_func` should be either a string or a callable.  !!!note  - if `mutation_func` is a string,  the string should be a pandas string function,  e.g "sum", "mean", etc.  !!!note  - `column_name` can be anything supported by the  [`select`][janitor.functions.select.select] syntax;  as such multiple columns can be processed here -  they will be processed individually.  - **callable argument**:  If the argument is a callable, the callable is applied  on the DataFrame.  The result from the callable should be a pandas Series  or DataFrame.  Mutation does not occur on the original DataFrame.  Examples:  >>> import pandas as pd  >>> import numpy as np  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "col1": [5, 10, 15],  ... "col2": [3, 6, 9],  ... "col3": [10, 100, 1_000],  ... }  ... )  Transformation via a dictionary:  >>> df.mutate({"col4": ("col1", np.log10), "col1": np.log10})  col1 col2 col3 col4  0 0.698970 3 10 0.698970  1 1.000000 6 100 1.000000  2 1.176091 9 1000 1.176091  Transformation via a tuple:  >>> df.mutate(("col1", np.log10))  col1 col2 col3  0 0.698970 3 10  1 1.000000 6 100  2 1.176091 9 1000  >>> df.mutate(("col*", np.log10))  col1 col2 col3  0 0.698970 0.477121 1.0  1 1.000000 0.778151 2.0  2 1.176091 0.954243 3.0  Transformation via a callable:  >>> df.mutate(lambda df: df.sum(axis=1).rename("total"))  col1 col2 col3 total  0 5 3 10 18  1 10 6 100 116  2 15 9 1000 1024  Args:  df: A pandas DataFrame or GroupBy object.  args: Either a dictionary or a tuple.  Raises:  ValueError: If a tuple is passed and the length is not 2.  Returns:  A pandas DataFrame, Series, or GroupBy object.  """ # noqa: E501 if isinstance(df, DataFrameGroupBy): df = copy.copy(df) df_ = df.obj.copy(deep=None) df.obj = df_ for arg in args: df_ = _mutator(arg, df=df_, by=df) return df df = df.copy(deep=None) for arg in args: df = _mutator(arg, df=df, by=None) return df 

ungroup(df)

New in version 0.32.0

Ungroups a GroupBy object into a DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> data = { ...  "avg_jump": [3, 4, 1, 2, 3, 4], ...  "avg_run": [3, 4, 1, 3, 2, 4], ...  "combine_id": [100200, 100200, 101200, 101200, 102201, 103202], ... } >>> df = pd.DataFrame(data) >>> df  avg_jump avg_run combine_id 0 3 3 100200 1 4 4 100200 2 1 1 101200 3 2 3 101200 4 3 2 102201 5 4 4 103202 >>> df.groupby("combine_id").mutate("mean").ungroup()  avg_jump avg_run combine_id 0 3.5 3.5 100200 1 3.5 3.5 100200 2 1.5 2.0 101200 3 1.5 2.0 101200 4 3.0 2.0 102201 5 4.0 4.0 103202 
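The grouped mutate("mean") example above can be approximated in plain pandas with a group-wise transform; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "avg_jump": [3, 4, 1, 2, 3, 4],
    "avg_run": [3, 4, 1, 3, 2, 4],
    "combine_id": [100200, 100200, 101200, 101200, 102201, 103202],
})

out = df.copy()
# group-wise means broadcast back to the original shape
out[["avg_jump", "avg_run"]] = (
    df.groupby("combine_id")[["avg_jump", "avg_run"]].transform("mean")
)
```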

Parameters:

Name Type Description Default
df DataFrameGroupBy

A pandas GroupBy object.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/mutate.py
@pf.register_groupby_method def ungroup( df: DataFrameGroupBy, ) -> pd.DataFrame:  """  !!! info "New in version 0.32.0"  Ungroups a GroupBy object into a DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> data = {  ... "avg_jump": [3, 4, 1, 2, 3, 4],  ... "avg_run": [3, 4, 1, 3, 2, 4],  ... "combine_id": [100200, 100200, 101200, 101200, 102201, 103202],  ... }  >>> df = pd.DataFrame(data)  >>> df  avg_jump avg_run combine_id  0 3 3 100200  1 4 4 100200  2 1 1 101200  3 2 3 101200  4 3 2 102201  5 4 4 103202  >>> df.groupby("combine_id").mutate("mean").ungroup()  avg_jump avg_run combine_id  0 3.5 3.5 100200  1 3.5 3.5 100200  2 1.5 2.0 101200  3 1.5 2.0 101200  4 3.0 2.0 102201  5 4.0 4.0 103202  Args:  df: A pandas GroupBy object.  Returns:  A pandas DataFrame.  """ return df.obj 

pivot

pivot_longer(df, index=None, column_names=None, names_to=None, values_to='value', column_level=None, names_sep=None, names_pattern=None, names_transform=None, dropna=False, sort_by_appearance=False, ignore_index=True)

Unpivots a DataFrame from wide to long format.

This method does not mutate the original DataFrame.

It is modeled after the pivot_longer function in R's tidyr package, and also takes inspiration from R's data.table package.

This function is useful to massage a DataFrame into a format where one or more columns are considered measured variables, and all other columns are considered as identifier variables.

All measured variables are unpivoted (and typically duplicated) along the row axis.

Column selection in index and column_names is possible using the select syntax.

For more granular control on the unpivoting, have a look at pivot_longer_spec.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "Sepal.Length": [5.1, 5.9], ...  "Sepal.Width": [3.5, 3.0], ...  "Petal.Length": [1.4, 5.1], ...  "Petal.Width": [0.2, 1.8], ...  "Species": ["setosa", "virginica"], ...  } ... ) >>> df  Sepal.Length Sepal.Width Petal.Length Petal.Width Species 0 5.1 3.5 1.4 0.2 setosa 1 5.9 3.0 5.1 1.8 virginica 

Replicate pandas' melt:

>>> df.pivot_longer(index="Species")  Species variable value 0 setosa Sepal.Length 5.1 1 virginica Sepal.Length 5.9 2 setosa Sepal.Width 3.5 3 virginica Sepal.Width 3.0 4 setosa Petal.Length 1.4 5 virginica Petal.Length 5.1 6 setosa Petal.Width 0.2 7 virginica Petal.Width 1.8 
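For this basic case, a plain pd.melt call gives the same shape; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Sepal.Length": [5.1, 5.9],
    "Sepal.Width": [3.5, 3.0],
    "Petal.Length": [1.4, 5.1],
    "Petal.Width": [0.2, 1.8],
    "Species": ["setosa", "virginica"],
})

# every non-id column is stacked into "variable"/"value" pairs
long = df.melt(id_vars="Species")
```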

Convenient, flexible column selection in the index via the select syntax:

>>> from pandas.api.types import is_string_dtype >>> df.pivot_longer(index=is_string_dtype)  Species variable value 0 setosa Sepal.Length 5.1 1 virginica Sepal.Length 5.9 2 setosa Sepal.Width 3.5 3 virginica Sepal.Width 3.0 4 setosa Petal.Length 1.4 5 virginica Petal.Length 5.1 6 setosa Petal.Width 0.2 7 virginica Petal.Width 1.8 

Split the column labels into individual columns:

>>> df.pivot_longer( ...  index="Species", ...  names_to=("part", "dimension"), ...  names_sep=".", ...  sort_by_appearance=True, ... )  Species part dimension value 0 setosa Sepal Length 5.1 1 setosa Sepal Width 3.5 2 setosa Petal Length 1.4 3 setosa Petal Width 0.2 4 virginica Sepal Length 5.9 5 virginica Sepal Width 3.0 6 virginica Petal Length 5.1 7 virginica Petal Width 1.8 

Retain parts of the column names as headers:

>>> df.pivot_longer( ...  index="Species", ...  names_to=("part", ".value"), ...  names_sep=".", ...  sort_by_appearance=True, ... )  Species part Length Width 0 setosa Sepal 5.1 3.5 1 setosa Petal 1.4 0.2 2 virginica Sepal 5.9 3.0 3 virginica Petal 5.1 1.8 

Split the column labels based on regex:

>>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]}) >>> df  id new_sp_m5564 newrel_f65 0 1 2 3 >>> df.pivot_longer( ...  index="id", ...  names_to=("diagnosis", "gender", "age"), ...  names_pattern=r"new_?(.+)_(.)(\d+)", ... )  id diagnosis gender age value 0 1 sp m 5564 2 1 1 rel f 65 3 
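Conceptually, this is a melt followed by a regex extraction on the former column labels; a minimal plain-pandas sketch using the same pattern:

```python
import pandas as pd

df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})

long = df.melt(id_vars="id")
# split the former column labels with the same matching groups
long[["diagnosis", "gender", "age"]] = long["variable"].str.extract(r"new_?(.+)_(.)(\d+)")
long = long.drop(columns="variable")
```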

Split the column labels for the above dataframe using named groups in names_pattern:

>>> df.pivot_longer( ...  index="id", ...  names_pattern=r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)", ... )  id diagnosis gender age value 0 1 sp m 5564 2 1 1 rel f 65 3 

Convert the dtypes of specific columns with names_transform:

>>> result = df.pivot_longer( ...  index="id", ...  names_to=("diagnosis", "gender", "age"), ...  names_pattern=r"new_?(.+)_(.)(\d+)", ...  names_transform={"gender": "category", "age": "int"}, ... ) >>> result.dtypes id int64 diagnosis object gender category age int64 value int64 dtype: object 

Use multiple .value to reshape the dataframe:

>>> df = pd.DataFrame( ...  [ ...  { ...  "x_1_mean": 10, ...  "x_2_mean": 20, ...  "y_1_mean": 30, ...  "y_2_mean": 40, ...  "unit": 50, ...  } ...  ] ... ) >>> df  x_1_mean x_2_mean y_1_mean y_2_mean unit 0 10 20 30 40 50 >>> df.pivot_longer( ...  index="unit", ...  names_to=(".value", "time", ".value"), ...  names_pattern=r"(x|y)_([0-9])(_mean)", ... )  unit time x_mean y_mean 0 50 1 10 30 1 50 2 20 40 

Replicate the above with named groups in names_pattern - use _ instead of .value:

>>> df.pivot_longer( ...  index="unit", ...  names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)", ... )  unit time x_mean y_mean 0 50 1 10 30 1 50 2 20 40 

Convenient, flexible column selection in the column_names via the select syntax:

>>> df.pivot_longer( ...  column_names="*mean", ...  names_to=(".value", "time", ".value"), ...  names_pattern=r"(x|y)_([0-9])(_mean)", ... )  unit time x_mean y_mean 0 50 1 10 30 1 50 2 20 40 
>>> df.pivot_longer( ...  column_names=slice("x_1_mean", "y_2_mean"), ...  names_to=(".value", "time", ".value"), ...  names_pattern=r"(x|y)_([0-9])(_mean)", ... )  unit time x_mean y_mean 0 50 1 10 30 1 50 2 20 40 

Reshape the dataframe by passing a sequence to names_pattern:

>>> df = pd.DataFrame( ...  { ...  "hr1": [514, 573], ...  "hr2": [545, 526], ...  "team": ["Red Sox", "Yankees"], ...  "year1": [2007, 2007], ...  "year2": [2008, 2008], ...  } ... ) >>> df  hr1 hr2 team year1 year2 0 514 545 Red Sox 2007 2008 1 573 526 Yankees 2007 2008 >>> df.pivot_longer( ...  index="team", names_to=["year", "hr"], names_pattern=["year", "hr"] ... )  team hr year 0 Red Sox 514 2007 1 Yankees 573 2007 2 Red Sox 545 2008 3 Yankees 526 2008 

Reshape the above dataframe by passing a dictionary to names_pattern:

>>> df.pivot_longer(index="team", names_pattern={"year": "year", "hr": "hr"})  team hr year 0 Red Sox 514 2007 1 Yankees 573 2007 2 Red Sox 545 2008 3 Yankees 526 2008 

Multiple values_to:

>>> df = pd.DataFrame( ...  { ...  "City": ["Houston", "Austin", "Hoover"], ...  "State": ["Texas", "Texas", "Alabama"], ...  "Name": ["Aria", "Penelope", "Niko"], ...  "Mango": [4, 10, 90], ...  "Orange": [10, 8, 14], ...  "Watermelon": [40, 99, 43], ...  "Gin": [16, 200, 34], ...  "Vodka": [20, 33, 18], ...  }, ... ) >>> df  City State Name Mango Orange Watermelon Gin Vodka 0 Houston Texas Aria 4 10 40 16 20 1 Austin Texas Penelope 10 8 99 200 33 2 Hoover Alabama Niko 90 14 43 34 18 >>> df.pivot_longer( ...  index=["City", "State"], ...  column_names=slice("Mango", "Vodka"), ...  names_to=("Fruit", "Drink"), ...  values_to=("Pounds", "Ounces"), ...  names_pattern=["M|O|W", "G|V"], ... )  City State Fruit Drink Pounds Ounces 0 Houston Texas Mango Gin 4 16.0 1 Austin Texas Mango Gin 10 200.0 2 Hoover Alabama Mango Gin 90 34.0 3 Houston Texas Orange Vodka 10 20.0 4 Austin Texas Orange Vodka 8 33.0 5 Hoover Alabama Orange Vodka 14 18.0 6 Houston Texas Watermelon None 40 NaN 7 Austin Texas Watermelon None 99 NaN 8 Hoover Alabama Watermelon None 43 NaN 

Replicate the above transformation with a nested dictionary passed to names_pattern - the outer keys in the names_pattern dictionary are passed to names_to, while the inner keys are passed to values_to:

>>> df.pivot_longer( ...  index=["City", "State"], ...  column_names=slice("Mango", "Vodka"), ...  names_pattern={ ...  "Fruit": {"Pounds": "M|O|W"}, ...  "Drink": {"Ounces": "G|V"}, ...  }, ... )  City State Fruit Drink Pounds Ounces 0 Houston Texas Mango Gin 4 16.0 1 Austin Texas Mango Gin 10 200.0 2 Hoover Alabama Mango Gin 90 34.0 3 Houston Texas Orange Vodka 10 20.0 4 Austin Texas Orange Vodka 8 33.0 5 Hoover Alabama Orange Vodka 14 18.0 6 Houston Texas Watermelon None 40 NaN 7 Austin Texas Watermelon None 99 NaN 8 Hoover Alabama Watermelon None 43 NaN 

Version Changed

  • 0.24.0
    • Added dropna parameter.
  • 0.24.1
    • names_pattern can accept a dictionary.
    • named groups supported in names_pattern.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Any

Name(s) of columns to use as identifier variables. Should be either a single column name, or a list/tuple of column names. index should be a list of tuples if the columns are a MultiIndex. Column selection is possible using the select syntax.

None
column_names Any

Name(s) of columns to unpivot. Should be either a single column name or a list/tuple of column names. column_names should be a list of tuples if the columns are a MultiIndex. Column selection is possible using the select syntax.

None
names_to list | tuple | str

Name of the new column, as a string, that will contain what were previously the column names in column_names. The default is variable if no value is provided. It can also be a list/tuple of strings that will serve as new column names, if names_sep or names_pattern is provided. If .value is in names_to, new column names will be extracted from part of the existing column names, and this overrides values_to.

None
values_to str

Name of new column as a string that will contain what were previously the values of the columns in column_names. values_to can also be a list/tuple and requires that names_pattern is also a list/tuple.

'value'
column_level int | str

If columns are a MultiIndex, then use this level to unpivot the DataFrame. Provided for compatibility with pandas' melt, and applies only if neither names_sep nor names_pattern is provided.

None
names_sep str | Pattern

Determines how the column name is broken up, if names_to contains multiple values. It takes the same specification as pandas' str.split method, and can be a string or regular expression. names_sep does not work with MultiIndex columns.

None
names_pattern list | tuple | str | Pattern

Determines how the column name is broken up. It can be a regular expression containing matching groups. Under the hood it is processed with pandas' str.extract function. If it is a single regex, the number of groups must match the length of names_to. Named groups are supported if names_to is None; _ is used instead of .value as a placeholder in named groups, and _ can be overloaded for multiple .value calls - _, __, ___, ... names_pattern can also be a list/tuple of regular expressions, or a list/tuple of strings that will be treated as regular expressions; under the hood these are processed with pandas' str.contains function. For a list/tuple of regular expressions, names_to must also be a list/tuple and the lengths of both arguments must match. names_pattern can also be a dictionary, where the keys are the new column names and the values are regular expressions (or strings that will be evaluated as regular expressions). Alternatively, a nested dictionary can be used, where the sub key(s) are associated with values_to. Please have a look at the examples for usage. names_pattern does not work with MultiIndex columns.

None
names_transform str | Callable | dict

Use this option to change the types of columns that have been transformed to rows. This does not apply to the values columns. Accepts any argument that is acceptable by pd.astype.

None
dropna bool

Determines whether or not to drop nulls from the values columns. Default is False.

False
sort_by_appearance bool

Boolean value that determines the final look of the DataFrame. If True, the unpivoted DataFrame will be stacked in order of first appearance.

False
ignore_index bool

If True, the original index is ignored. If False, the original index is retained and the index labels will be repeated as necessary.

True

Returns:

Type Description
DataFrame

A pandas DataFrame that has been unpivoted from wide to long format.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method def pivot_longer( df: pd.DataFrame, index: Any = None, column_names: Any = None, names_to: list | tuple | str = None, values_to: str = "value", column_level: int | str = None, names_sep: str | Pattern = None, names_pattern: list | tuple | str | Pattern = None, names_transform: str | Callable | dict = None, dropna: bool = False, sort_by_appearance: bool = False, ignore_index: bool = True, ) -> pd.DataFrame:  """Unpivots a DataFrame from *wide* to *long* format.  This method does not mutate the original DataFrame.  It is modeled after the `pivot_longer` function in R's tidyr package,  and also takes inspiration from R's data.table package.  This function is useful to massage a DataFrame into a format where  one or more columns are considered measured variables, and all other  columns are considered as identifier variables.  All measured variables are *unpivoted* (and typically duplicated) along the  row axis.  Column selection in `index` and `column_names` is possible using the  [`select`][janitor.functions.select.select] syntax.  For more granular control on the unpivoting, have a look at  [`pivot_longer_spec`][janitor.functions.pivot.pivot_longer_spec].  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "Sepal.Length": [5.1, 5.9],  ... "Sepal.Width": [3.5, 3.0],  ... "Petal.Length": [1.4, 5.1],  ... "Petal.Width": [0.2, 1.8],  ... "Species": ["setosa", "virginica"],  ... }  ... )  >>> df  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  0 5.1 3.5 1.4 0.2 setosa  1 5.9 3.0 5.1 1.8 virginica  Replicate pandas' melt:  >>> df.pivot_longer(index="Species")  Species variable value  0 setosa Sepal.Length 5.1  1 virginica Sepal.Length 5.9  2 setosa Sepal.Width 3.5  3 virginica Sepal.Width 3.0  4 setosa Petal.Length 1.4  5 virginica Petal.Length 5.1  6 setosa Petal.Width 0.2  7 virginica Petal.Width 1.8  Convenient, flexible column selection in the `index` via the  [`select`][janitor.functions.select.select] syntax:  >>> from pandas.api.types import is_string_dtype  >>> df.pivot_longer(index=is_string_dtype)  Species variable value  0 setosa Sepal.Length 5.1  1 virginica Sepal.Length 5.9  2 setosa Sepal.Width 3.5  3 virginica Sepal.Width 3.0  4 setosa Petal.Length 1.4  5 virginica Petal.Length 5.1  6 setosa Petal.Width 0.2  7 virginica Petal.Width 1.8  Split the column labels into individual columns:  >>> df.pivot_longer(  ... index="Species",  ... names_to=("part", "dimension"),  ... names_sep=".",  ... sort_by_appearance=True,  ... )  Species part dimension value  0 setosa Sepal Length 5.1  1 setosa Sepal Width 3.5  2 setosa Petal Length 1.4  3 setosa Petal Width 0.2  4 virginica Sepal Length 5.9  5 virginica Sepal Width 3.0  6 virginica Petal Length 5.1  7 virginica Petal Width 1.8  Retain parts of the column names as headers:  >>> df.pivot_longer(  ... index="Species",  ... names_to=("part", ".value"),  ... names_sep=".",  ... sort_by_appearance=True,  ... )  Species part Length Width  0 setosa Sepal 5.1 3.5  1 setosa Petal 1.4 0.2  2 virginica Sepal 5.9 3.0  3 virginica Petal 5.1 1.8  Split the column labels based on regex:  >>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})  >>> df  id new_sp_m5564 newrel_f65  0 1 2 3  >>> df.pivot_longer(  ... index="id",  ... names_to=("diagnosis", "gender", "age"),  ... names_pattern=r"new_?(.+)_(.)(\\d+)",  ... 
)  id diagnosis gender age value  0 1 sp m 5564 2  1 1 rel f 65 3  Split the column labels for the above dataframe using named groups in `names_pattern`:  >>> df.pivot_longer(  ... index="id",  ... names_pattern=r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\\d+)",  ... )  id diagnosis gender age value  0 1 sp m 5564 2  1 1 rel f 65 3  Convert the dtypes of specific columns with `names_transform`:  >>> result = df.pivot_longer(  ... index="id",  ... names_to=("diagnosis", "gender", "age"),  ... names_pattern=r"new_?(.+)_(.)(\\d+)",  ... names_transform={"gender": "category", "age": "int"},  ... )  >>> result.dtypes  id int64  diagnosis object  gender category  age int64  value int64  dtype: object  Use multiple `.value` to reshape the dataframe:  >>> df = pd.DataFrame(  ... [  ... {  ... "x_1_mean": 10,  ... "x_2_mean": 20,  ... "y_1_mean": 30,  ... "y_2_mean": 40,  ... "unit": 50,  ... }  ... ]  ... )  >>> df  x_1_mean x_2_mean y_1_mean y_2_mean unit  0 10 20 30 40 50  >>> df.pivot_longer(  ... index="unit",  ... names_to=(".value", "time", ".value"),  ... names_pattern=r"(x|y)_([0-9])(_mean)",  ... )  unit time x_mean y_mean  0 50 1 10 30  1 50 2 20 40  Replicate the above with named groups in `names_pattern` - use `_` instead of `.value`:  >>> df.pivot_longer(  ... index="unit",  ... names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",  ... )  unit time x_mean y_mean  0 50 1 10 30  1 50 2 20 40  Convenient, flexible column selection in the `column_names` via  the [`select`][janitor.functions.select.select] syntax:  >>> df.pivot_longer(  ... column_names="*mean",  ... names_to=(".value", "time", ".value"),  ... names_pattern=r"(x|y)_([0-9])(_mean)",  ... )  unit time x_mean y_mean  0 50 1 10 30  1 50 2 20 40  >>> df.pivot_longer(  ... column_names=slice("x_1_mean", "y_2_mean"),  ... names_to=(".value", "time", ".value"),  ... names_pattern=r"(x|y)_([0-9])(_mean)",  ... )  unit time x_mean y_mean  0 50 1 10 30  1 50 2 20 40  Reshape the dataframe by passing a sequence to `names_pattern`:  >>> df = pd.DataFrame(  ... {  ... "hr1": [514, 573],  ... "hr2": [545, 526],  ... "team": ["Red Sox", "Yankees"],  ... "year1": [2007, 2007],  ... "year2": [2008, 2008],  ... }  ... )  >>> df  hr1 hr2 team year1 year2  0 514 545 Red Sox 2007 2008  1 573 526 Yankees 2007 2008  >>> df.pivot_longer(  ... index="team", names_to=["year", "hr"], names_pattern=["year", "hr"]  ... )  team hr year  0 Red Sox 514 2007  1 Yankees 573 2007  2 Red Sox 545 2008  3 Yankees 526 2008  Reshape the above dataframe by passing a dictionary to `names_pattern`:  >>> df.pivot_longer(index="team", names_pattern={"year": "year", "hr": "hr"})  team hr year  0 Red Sox 514 2007  1 Yankees 573 2007  2 Red Sox 545 2008  3 Yankees 526 2008  Multiple values_to:  >>> df = pd.DataFrame(  ... {  ... "City": ["Houston", "Austin", "Hoover"],  ... "State": ["Texas", "Texas", "Alabama"],  ... "Name": ["Aria", "Penelope", "Niko"],  ... "Mango": [4, 10, 90],  ... "Orange": [10, 8, 14],  ... "Watermelon": [40, 99, 43],  ... "Gin": [16, 200, 34],  ... "Vodka": [20, 33, 18],  ... },  ... )  >>> df  City State Name Mango Orange Watermelon Gin Vodka  0 Houston Texas Aria 4 10 40 16 20  1 Austin Texas Penelope 10 8 99 200 33  2 Hoover Alabama Niko 90 14 43 34 18  >>> df.pivot_longer(  ... index=["City", "State"],  ... column_names=slice("Mango", "Vodka"),  ... names_to=("Fruit", "Drink"),  ... values_to=("Pounds", "Ounces"),  ... names_pattern=["M|O|W", "G|V"],  ... 
)  City State Fruit Drink Pounds Ounces  0 Houston Texas Mango Gin 4 16.0  1 Austin Texas Mango Gin 10 200.0  2 Hoover Alabama Mango Gin 90 34.0  3 Houston Texas Orange Vodka 10 20.0  4 Austin Texas Orange Vodka 8 33.0  5 Hoover Alabama Orange Vodka 14 18.0  6 Houston Texas Watermelon None 40 NaN  7 Austin Texas Watermelon None 99 NaN  8 Hoover Alabama Watermelon None 43 NaN  Replicate the above transformation with a nested dictionary passed to `names_pattern`  - the outer keys in the `names_pattern` dictionary are passed to `names_to`,  while the inner keys are passed to `values_to`:  >>> df.pivot_longer(  ... index=["City", "State"],  ... column_names=slice("Mango", "Vodka"),  ... names_pattern={  ... "Fruit": {"Pounds": "M|O|W"},  ... "Drink": {"Ounces": "G|V"},  ... },  ... )  City State Fruit Drink Pounds Ounces  0 Houston Texas Mango Gin 4 16.0  1 Austin Texas Mango Gin 10 200.0  2 Hoover Alabama Mango Gin 90 34.0  3 Houston Texas Orange Vodka 10 20.0  4 Austin Texas Orange Vodka 8 33.0  5 Hoover Alabama Orange Vodka 14 18.0  6 Houston Texas Watermelon None 40 NaN  7 Austin Texas Watermelon None 99 NaN  8 Hoover Alabama Watermelon None 43 NaN  !!! abstract "Version Changed"  - 0.24.0  - Added `dropna` parameter.  - 0.24.1  - `names_pattern` can accept a dictionary.  - named groups supported in `names_pattern`.  Args:  df: A pandas DataFrame.  index: Name(s) of columns to use as identifier variables.  Should be either a single column name, or a list/tuple of  column names.  `index` should be a list of tuples if the columns are a MultiIndex.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  column_names: Name(s) of columns to unpivot. Should be either  a single column name or a list/tuple of column names.  `column_names` should be a list of tuples  if the columns are a MultiIndex.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  names_to: Name of new column as a string that will contain  what were previously the column names in `column_names`.  The default is `variable` if no value is provided. It can  also be a list/tuple of strings that will serve as new column  names, if `name_sep` or `names_pattern` is provided.  If `.value` is in `names_to`, new column names will be extracted  from part of the existing column names and overrides `values_to`.  values_to: Name of new column as a string that will contain what  were previously the values of the columns in `column_names`.  values_to can also be a list/tuple  and requires that names_pattern is also a list/tuple.  column_level: If columns are a MultiIndex, then use this level to  unpivot the DataFrame. Provided for compatibility with pandas' melt,  and applies only if neither `names_sep` nor `names_pattern` is  provided.  names_sep: Determines how the column name is broken up, if  `names_to` contains multiple values. It takes the same  specification as pandas' `str.split` method, and can be a string  or regular expression. `names_sep` does not work with MultiIndex  columns.  names_pattern: Determines how the column name is broken up.  It can be a regular expression containing matching groups.  Under the hood it is processed with pandas' `str.extract` function.  If it is a single regex, the number of groups must match  the length of `names_to`.  Named groups are supported, if `names_to` is none. `_` is used  instead of `.value` as a placeholder in named groups.  `_` can be overloaded for multiple `.value`  calls - `_`, `__`, `___`, ...  
`names_pattern` can also be a list/tuple of regular expressions  It can also be a list/tuple of strings;  the strings will be treated as regular expressions.  Under the hood it is processed with pandas' `str.contains` function.  For a list/tuple of regular expressions,  `names_to` must also be a list/tuple and the lengths of both  arguments must match.  `names_pattern` can also be a dictionary, where the keys are  the new column names, while the values can be a regular expression  or a string which will be evaluated as a regular expression.  Alternatively, a nested dictionary can be used, where the sub  key(s) are associated with `values_to`. Please have a look  at the examples for usage.  `names_pattern` does not work with MultiIndex columns.  names_transform: Use this option to change the types of columns that  have been transformed to rows. This does not applies to the values' columns.  Accepts any argument that is acceptable by `pd.astype`.  dropna: Determines whether or not to drop nulls  from the values columns. Default is `False`.  sort_by_appearance: Boolean value that determines  the final look of the DataFrame. If `True`, the unpivoted DataFrame  will be stacked in order of first appearance.  ignore_index: If `True`,  the original index is ignored. If `False`, the original index  is retained and the index labels will be repeated as necessary.  Returns:  A pandas DataFrame that has been unpivoted from wide to long  format.  """ # noqa: E501 # this code builds on the wonderful work of @benjaminjack’s PR # https://github.com/benjaminjack/pyjanitor/commit/e3df817903c20dd21634461c8a92aec137963ed0 return _computations_pivot_longer( df=df, index=index, column_names=column_names, column_level=column_level, names_to=names_to, values_to=values_to, names_sep=names_sep, names_pattern=names_pattern, names_transform=names_transform, dropna=dropna, sort_by_appearance=sort_by_appearance, ignore_index=ignore_index, ) 

pivot_longer_spec(df, spec, sort_by_appearance=False, ignore_index=True, dropna=False, df_columns_is_unique=True)

A declarative interface to pivot a DataFrame from wide to long form, where you describe how the data will be unpivoted, using a DataFrame.

This gives you, the user, more control over unpivoting, where you create a “spec” data frame that describes exactly how data stored in the column names becomes variables.

It can come in handy for situations where pivot_longer seems inadequate for the transformation.

New in version 0.28.0

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "Sepal.Length": [5.1, 5.9], ...  "Sepal.Width": [3.5, 3.0], ...  "Petal.Length": [1.4, 5.1], ...  "Petal.Width": [0.2, 1.8], ...  "Species": ["setosa", "virginica"], ...  } ... ) >>> df  Sepal.Length Sepal.Width Petal.Length Petal.Width Species 0 5.1 3.5 1.4 0.2 setosa 1 5.9 3.0 5.1 1.8 virginica >>> spec = { ...  ".name": ["Sepal.Length", "Petal.Length", "Sepal.Width", "Petal.Width"], ...  ".value": ["Length", "Length", "Width", "Width"], ...  "part": ["Sepal", "Petal", "Sepal", "Petal"], ... } >>> spec = pd.DataFrame(spec) >>> spec  .name .value part 0 Sepal.Length Length Sepal 1 Petal.Length Length Petal 2 Sepal.Width Width Sepal 3 Petal.Width Width Petal >>> pivot_longer_spec(df=df, spec=spec)  Species part Length Width 0 setosa Sepal 5.1 3.5 1 virginica Sepal 5.9 3.0 2 setosa Petal 1.4 0.2 3 virginica Petal 5.1 1.8 
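Rather than writing the spec by hand, it can often be derived from the column labels themselves; a minimal sketch, assuming the labels split cleanly on the literal '.':

```python
import pandas as pd

measure_cols = ["Sepal.Length", "Petal.Length", "Sepal.Width", "Petal.Width"]
spec = pd.DataFrame({".name": measure_cols})

# derive ".value" and "part" from the labels; ".name" and ".value" stay first
parts = spec[".name"].str.split(".", expand=True)
spec[".value"] = parts[1]  # Length / Width
spec["part"] = parts[0]    # Sepal / Petal
```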

Parameters:

Name Type Description Default
df DataFrame

The source DataFrame to unpivot.

required
spec DataFrame

A specification DataFrame. At a minimum, the spec DataFrame must have '.name' and '.value' columns. The '.name' column should contain the columns in the source DataFrame that will be transformed to long form. The '.value' column gives the name of the column(s) that the values in the source DataFrame will go into. Additional columns in spec should be named to match columns in the long format of the dataset and contain values corresponding to columns pivoted from the wide format. Note that these additional columns should not already exist in the source DataFrame.

required
sort_by_appearance bool

Boolean value that determines the final look of the DataFrame. If True, the unpivoted DataFrame will be stacked in order of first appearance.

False
ignore_index bool

If True, the original index is ignored. If False, the original index is retained and the index labels will be repeated as necessary.

True
dropna bool

Determines whether or not to drop nulls from the values columns. Default is False.

False
df_columns_is_unique bool

Boolean value to indicate whether the source DataFrame's columns are unique. Default is True.

True

Raises:

Type Description
KeyError

If '.name' or '.value' is missing from the spec's columns.

ValueError

If the spec's columns are not unique, or the labels in spec['.name'] are not unique.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/pivot.py
def pivot_longer_spec( df: pd.DataFrame, spec: pd.DataFrame, sort_by_appearance: bool = False, ignore_index: bool = True, dropna: bool = False, df_columns_is_unique: bool = True, ) -> pd.DataFrame:  """A declarative interface to pivot a DataFrame from wide to long form,  where you describe how the data will be unpivoted,  using a DataFrame.  This gives you, the user,  more control over unpivoting, where you create a “spec”  data frame that describes exactly how data stored  in the column names becomes variables.  It can come in handy for situations where  [`pivot_longer`][janitor.functions.pivot.pivot_longer]  seems inadequate for the transformation.  !!! info "New in version 0.28.0"  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "Sepal.Length": [5.1, 5.9],  ... "Sepal.Width": [3.5, 3.0],  ... "Petal.Length": [1.4, 5.1],  ... "Petal.Width": [0.2, 1.8],  ... "Species": ["setosa", "virginica"],  ... }  ... )  >>> df  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  0 5.1 3.5 1.4 0.2 setosa  1 5.9 3.0 5.1 1.8 virginica  >>> spec = {  ... ".name": ["Sepal.Length", "Petal.Length", "Sepal.Width", "Petal.Width"],  ... ".value": ["Length", "Length", "Width", "Width"],  ... "part": ["Sepal", "Petal", "Sepal", "Petal"],  ... }  >>> spec = pd.DataFrame(spec)  >>> spec  .name .value part  0 Sepal.Length Length Sepal  1 Petal.Length Length Petal  2 Sepal.Width Width Sepal  3 Petal.Width Width Petal  >>> pivot_longer_spec(df=df, spec=spec)  Species part Length Width  0 setosa Sepal 5.1 3.5  1 virginica Sepal 5.9 3.0  2 setosa Petal 1.4 0.2  3 virginica Petal 5.1 1.8  Args:  df: The source DataFrame to unpivot.  spec: A specification DataFrame.  At a minimum, the spec DataFrame  must have a '.name' and a '.value' columns.  The '.name' column should contain the  columns in the source DataFrame that will be  transformed to long form.  The '.value' column gives the name of the column(s)  that the values in the source DataFrame will go into.  Additional columns in spec should be named to match columns  in the long format of the dataset and contain values  corresponding to columns pivoted from the wide format.  Note that these additional columns should not already exist  in the source DataFrame.  sort_by_appearance: Boolean value that determines  the final look of the DataFrame. If `True`, the unpivoted DataFrame  will be stacked in order of first appearance.  ignore_index: If `True`,  the original index is ignored. If `False`, the original index  is retained and the index labels will be repeated as necessary.  dropna: Determines whether or not to drop nulls  from the values columns. Default is `False`.  df_columns_is_unique: Boolean value to indicate if the source  DataFrame's columns is unique. Default is `True`.  Raises:  KeyError: If '.name' or '.value' is missing from the spec's columns.  ValueError: If the spec's columns is not unique,  or the labels in spec['.name'] is not unique.  Returns:  A pandas DataFrame.  """ check("spec", spec, [pd.DataFrame]) if not spec.columns.is_unique: raise ValueError("Kindly ensure the spec's columns is unique.") if ".name" not in spec.columns: raise KeyError("Kindly ensure the spec DataFrame has a `.name` column.") if ".value" not in spec.columns: raise KeyError("Kindly ensure the spec DataFrame has a `.value` column.") if spec.columns.tolist()[:2] != [".name", ".value"]: raise ValueError( "The first two columns of the spec DataFrame " "should be '.name' and '.value', " "with '.name' coming before '.value'." 
) if not spec[".name"].is_unique: raise ValueError("The labels in the `.name` column should be unique.") exclude = df.columns.intersection(spec.columns) if not exclude.empty: raise ValueError( f"Labels {(*exclude,)} in the spec DataFrame already exist " "as column labels in the source DataFrame. " "Kindly ensure the spec DataFrame's columns " "are not present in the source DataFrame." ) check("dropna", dropna, [bool]) check("sort_by_appearance", sort_by_appearance, [bool]) check("ignore_index", ignore_index, [bool]) check("df_columns_is_unique", df_columns_is_unique, [bool]) index = df.columns.difference(spec[".name"], sort=False) index = {name: df[name]._values for name in index} df = df.loc[:, spec[".name"]] if not df_columns_is_unique: spec = pd.DataFrame({".name": df.columns}).merge(spec, on=".name", how="inner") others = [label for label in spec if label not in {".name", ".value"}] return _pivot_longer_dot_value( df=df, spec=spec.drop(columns=".name"), index=index, others=others, sort_by_appearance=sort_by_appearance, ignore_index=ignore_index, dropna=dropna, ) 

pivot_wider(df, names_from, index=None, values_from=None, flatten_levels=True, names_sep='_', names_glue=None, reset_index=True, names_expand=False, index_expand=False)

Reshapes data from long to wide form.

The number of columns is increased, while the number of rows is decreased. It is the inverse of the pivot_longer method, and is a wrapper around the pd.DataFrame.pivot method.

This method does not mutate the original DataFrame.

Column selection in index, names_from and values_from is possible using the select syntax.

A ValueError is raised if the combination of the index and names_from is not unique.

By default, values from values_from are always at the top level if the columns are not flattened. If flattened, the values from values_from are usually at the start of each label in the columns.
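To make the wrapping concrete, here is a rough plain-pandas sketch of the behaviour described above, using the same data as the examples below. This is illustrative only, not pyjanitor's exact implementation:

>>> import pandas as pd
>>> data = [
...     {"dep": 5.5, "step": 1, "a": 20, "b": 30},
...     {"dep": 5.5, "step": 2, "a": 25, "b": 37},
...     {"dep": 6.1, "step": 1, "a": 22, "b": 19},
...     {"dep": 6.1, "step": 2, "a": 18, "b": 29},
... ]
>>> wide = pd.DataFrame(data).pivot(index="dep", columns="step", values=["a", "b"])
>>> wide.columns.tolist()  # values_from labels sit at the top level
[('a', 1), ('a', 2), ('b', 1), ('b', 2)]
>>> wide.columns = [f"{value}_{step}" for value, step in wide.columns]  # flatten with names_sep
>>> wide.reset_index().columns.tolist()
['dep', 'a_1', 'a_2', 'b_1', 'b_2']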

Examples:

>>> import pandas as pd >>> import janitor >>> df = [ ...  {"dep": 5.5, "step": 1, "a": 20, "b": 30}, ...  {"dep": 5.5, "step": 2, "a": 25, "b": 37}, ...  {"dep": 6.1, "step": 1, "a": 22, "b": 19}, ...  {"dep": 6.1, "step": 2, "a": 18, "b": 29}, ... ] >>> df = pd.DataFrame(df) >>> df  dep step a b 0 5.5 1 20 30 1 5.5 2 25 37 2 6.1 1 22 19 3 6.1 2 18 29 

Pivot and flatten columns:

>>> df.pivot_wider( ...  index="dep", ...  names_from="step", ... )  dep a_1 a_2 b_1 b_2 0 5.5 20 25 30 37 1 6.1 22 18 19 29 

Modify columns with names_sep:

>>> df.pivot_wider( ...  index="dep", ...  names_from="step", ...  names_sep="", ... )  dep a1 a2 b1 b2 0 5.5 20 25 30 37 1 6.1 22 18 19 29 

Modify columns with names_glue:

>>> df.pivot_wider( ...  index="dep", ...  names_from="step", ...  names_glue="{_value}_step{step}", ... )  dep a_step1 a_step2 b_step1 b_step2 0 5.5 20 25 30 37 1 6.1 22 18 19 29 
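names_glue builds each flattened label with Python's str.format_map, as the parameter description notes. A minimal plain-Python sketch of that composition (the literal mapping below is only an illustration, not pyjanitor's internal data structure):

>>> template = "{_value}_step{step}"
>>> [
...     template.format_map({"_value": value, "step": step})
...     for value in ("a", "b")
...     for step in (1, 2)
... ]
['a_step1', 'a_step2', 'b_step1', 'b_step2']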

Expand columns to expose implicit missing values - this applies only to categorical columns:

>>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") >>> daily = pd.DataFrame( ...  { ...  "day": pd.Categorical( ...  values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays ...  ), ...  "value": (2, 3, 1, 5), ...  }, ...  index=[0, 0, 0, 0], ... ) >>> daily  day value 0 Tue 2 0 Thu 3 0 Fri 1 0 Mon 5 >>> daily.pivot_wider(names_from="day", values_from="value")  Tue Thu Fri Mon 0 2 3 1 5 >>> ( ...  daily.pivot_wider( ...  names_from="day", values_from="value", names_expand=True ...  ) ... )  Mon Tue Wed Thu Fri Sat Sun 0 5 2 NaN 3 1 NaN NaN 

Expand the index to expose implicit missing values - this applies only to categorical columns:

>>> daily = daily.assign(letter=list("ABBA")) >>> daily  day value letter 0 Tue 2 A 0 Thu 3 B 0 Fri 1 B 0 Mon 5 A >>> daily.pivot_wider( ...  index="day", names_from="letter", values_from="value" ... )  day A B 0 Tue 2.0 NaN 1 Thu NaN 3.0 2 Fri NaN 1.0 3 Mon 5.0 NaN >>> ( ...  daily.pivot_wider( ...  index="day", ...  names_from="letter", ...  values_from="value", ...  index_expand=True, ...  ) ... )  day A B 0 Mon 5.0 NaN 1 Tue 2.0 NaN 2 Wed NaN NaN 3 Thu NaN 3.0 4 Fri NaN 1.0 5 Sat NaN NaN 6 Sun NaN NaN 

Version Changed

  • 0.24.0
    • Added reset_index, names_expand and index_expand parameters.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
index Any

Name(s) of columns to use as identifier variables. It should be either a single column name, or a list of column names. If index is not provided, the DataFrame's index is used. Column selection is possible using the select syntax.

None
names_from Any

Name(s) of column(s) to use to make the new DataFrame's columns. Should be either a single column name, or a list of column names. Column selection is possible using the select syntax.

required
values_from Any

Name(s) of column(s) that will be used for populating the new DataFrame's values. If values_from is not specified, all remaining columns will be used. Column selection is possible using the select syntax.

None
flatten_levels bool

If False, the DataFrame stays as a MultiIndex.

True
names_sep str

If names_from or values_from contain multiple variables, this will be used to join the values into a single string to use as a column name. Default is _. Applicable only if flatten_levels is True.

'_'
names_glue str

A string to control the output of the flattened columns. It offers flexibility in creating custom column names, and uses python's str.format_map under the hood. Simply create the string template, using the column labels in names_from, and special _value as a placeholder for values_from. Applicable only if flatten_levels is True.

None
reset_index bool

Determines whether to restore index as a column/columns. Applicable only if index is provided, and flatten_levels is True.

True
names_expand bool

Expand columns to show all the categories. Applies only if names_from is a categorical column.

False
index_expand bool

Expand the index to show all the categories. Applies only if index is a categorical column.

False

Returns:

Type Description
DataFrame

A pandas DataFrame that has been pivoted from long to wide form.

Source code in janitor/functions/pivot.py
@pf.register_dataframe_method def pivot_wider( df: pd.DataFrame, names_from: Any, index: Any = None, values_from: Any = None, flatten_levels: bool = True, names_sep: str = "_", names_glue: str = None, reset_index: bool = True, names_expand: bool = False, index_expand: bool = False, ) -> pd.DataFrame:  """Reshapes data from *long* to *wide* form.  The number of columns are increased, while decreasing  the number of rows. It is the inverse of the  [`pivot_longer`][janitor.functions.pivot.pivot_longer]  method, and is a wrapper around `pd.DataFrame.pivot` method.  This method does not mutate the original DataFrame.  Column selection in `index`, `names_from` and `values_from`  is possible using the  [`select`][janitor.functions.select.select] syntax.  A ValueError is raised if the combination  of the `index` and `names_from` is not unique.  By default, values from `values_from` are always  at the top level if the columns are not flattened.  If flattened, the values from `values_from` are usually  at the start of each label in the columns.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = [  ... {"dep": 5.5, "step": 1, "a": 20, "b": 30},  ... {"dep": 5.5, "step": 2, "a": 25, "b": 37},  ... {"dep": 6.1, "step": 1, "a": 22, "b": 19},  ... {"dep": 6.1, "step": 2, "a": 18, "b": 29},  ... ]  >>> df = pd.DataFrame(df)  >>> df  dep step a b  0 5.5 1 20 30  1 5.5 2 25 37  2 6.1 1 22 19  3 6.1 2 18 29  Pivot and flatten columns:  >>> df.pivot_wider( # doctest: +SKIP  ... index="dep",  ... names_from="step",  ... )  dep a_1 a_2 b_1 b_2  0 5.5 20 25 30 37  1 6.1 22 18 19 29  Modify columns with `names_sep`:  >>> df.pivot_wider( # doctest: +SKIP  ... index="dep",  ... names_from="step",  ... names_sep="",  ... )  dep a1 a2 b1 b2  0 5.5 20 25 30 37  1 6.1 22 18 19 29  Modify columns with `names_glue`:  >>> df.pivot_wider( # doctest: +SKIP  ... index="dep",  ... names_from="step",  ... names_glue="{_value}_step{step}",  ... )  dep a_step1 a_step2 b_step1 b_step2  0 5.5 20 25 30 37  1 6.1 22 18 19 29  Expand columns to expose implicit missing values  - this applies only to categorical columns:  >>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")  >>> daily = pd.DataFrame(  ... {  ... "day": pd.Categorical(  ... values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays  ... ),  ... "value": (2, 3, 1, 5),  ... },  ... index=[0, 0, 0, 0],  ... )  >>> daily  day value  0 Tue 2  0 Thu 3  0 Fri 1  0 Mon 5  >>> daily.pivot_wider(names_from="day", values_from="value") # doctest: +SKIP  Tue Thu Fri Mon  0 2 3 1 5  >>> (  ... daily.pivot_wider( # doctest: +SKIP  ... names_from="day", values_from="value", names_expand=True  ... )  ... )  Mon Tue Wed Thu Fri Sat Sun  0 5 2 NaN 3 1 NaN NaN  Expand the index to expose implicit missing values  - this applies only to categorical columns:  >>> daily = daily.assign(letter=list("ABBA"))  >>> daily  day value letter  0 Tue 2 A  0 Thu 3 B  0 Fri 1 B  0 Mon 5 A  >>> daily.pivot_wider(  ... index="day", names_from="letter", values_from="value"  ... ) # doctest: +SKIP  day A B  0 Tue 2.0 NaN  1 Thu NaN 3.0  2 Fri NaN 1.0  3 Mon 5.0 NaN  >>> (  ... daily.pivot_wider( # doctest: +SKIP  ... index="day",  ... names_from="letter",  ... values_from="value",  ... index_expand=True,  ... )  ... )  day A B  0 Mon 5.0 NaN  1 Tue 2.0 NaN  2 Wed NaN NaN  3 Thu NaN 3.0  4 Fri NaN 1.0  5 Sat NaN NaN  6 Sun NaN NaN  !!! abstract "Version Changed"  - 0.24.0  - Added `reset_index`, `names_expand` and `index_expand` parameters.  Args:  df: A pandas DataFrame.  
index: Name(s) of columns to use as identifier variables.  It should be either a single column name, or a list of column names.  If `index` is not provided, the DataFrame's index is used.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  names_from: Name(s) of column(s) to use to make the new  DataFrame's columns. Should be either a single column name,  or a list of column names.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  values_from: Name(s) of column(s) that will be used for populating  the new DataFrame's values.  If `values_from` is not specified, all remaining columns  will be used.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  flatten_levels: If `False`, the DataFrame stays as a MultiIndex.  names_sep: If `names_from` or `values_from` contain multiple  variables, this will be used to join the values into a single string  to use as a column name. Default is `_`.  Applicable only if `flatten_levels` is `True`.  names_glue: A string to control the output of the flattened columns.  It offers flexibility in creating custom column names,  and uses python's `str.format_map` under the hood.  Simply create the string template,  using the column labels in `names_from`,  and special `_value` as a placeholder for `values_from`.  Applicable only if `flatten_levels` is `True`.  reset_index: Determines whether to restore `index`  as a column/columns. Applicable only if `index` is provided,  and `flatten_levels` is `True`.  names_expand: Expand columns to show all the categories.  Applies only if `names_from` is a categorical column.  index_expand: Expand the index to show all the categories.  Applies only if `index` is a categorical column.  Returns:  A pandas DataFrame that has been unpivoted from long to wide form.  """ # noqa: E501 # no need for an explicit copy --> df = df.copy() # `pd.pivot` creates one return _computations_pivot_wider( df, index, names_from, values_from, flatten_levels, names_sep, names_glue, reset_index, names_expand, index_expand, ) 

pivot_wider_spec(df, spec, index=None, reset_index=True)

A declarative interface to pivot a DataFrame from long to wide form, where you describe how the data will be pivoted, using a DataFrame.

This gives you, the user, more control over pivoting, where you create a “spec” data frame that describes exactly how data stored in the column names becomes variables.

It can come in handy for situations where pd.DataFrame.pivot seems inadequate for the transformation.

New in version 0.31.0

Examples:

>>> import pandas as pd >>> from janitor import pivot_wider_spec >>> df = pd.DataFrame( ...  [ ...  {"famid": 1, "birth": 1, "age": 1, "ht": 2.8}, ...  {"famid": 1, "birth": 1, "age": 2, "ht": 3.4}, ...  {"famid": 1, "birth": 2, "age": 1, "ht": 2.9}, ...  {"famid": 1, "birth": 2, "age": 2, "ht": 3.8}, ...  {"famid": 1, "birth": 3, "age": 1, "ht": 2.2}, ...  {"famid": 1, "birth": 3, "age": 2, "ht": 2.9}, ...  {"famid": 2, "birth": 1, "age": 1, "ht": 2.0}, ...  {"famid": 2, "birth": 1, "age": 2, "ht": 3.2}, ...  {"famid": 2, "birth": 2, "age": 1, "ht": 1.8}, ...  {"famid": 2, "birth": 2, "age": 2, "ht": 2.8}, ...  {"famid": 2, "birth": 3, "age": 1, "ht": 1.9}, ...  {"famid": 2, "birth": 3, "age": 2, "ht": 2.4}, ...  {"famid": 3, "birth": 1, "age": 1, "ht": 2.2}, ...  {"famid": 3, "birth": 1, "age": 2, "ht": 3.3}, ...  {"famid": 3, "birth": 2, "age": 1, "ht": 2.3}, ...  {"famid": 3, "birth": 2, "age": 2, "ht": 3.4}, ...  {"famid": 3, "birth": 3, "age": 1, "ht": 2.1}, ...  {"famid": 3, "birth": 3, "age": 2, "ht": 2.9}, ...  ] ... ) >>> df  famid birth age ht 0 1 1 1 2.8 1 1 1 2 3.4 2 1 2 1 2.9 3 1 2 2 3.8 4 1 3 1 2.2 5 1 3 2 2.9 6 2 1 1 2.0 7 2 1 2 3.2 8 2 2 1 1.8 9 2 2 2 2.8 10 2 3 1 1.9 11 2 3 2 2.4 12 3 1 1 2.2 13 3 1 2 3.3 14 3 2 1 2.3 15 3 2 2 3.4 16 3 3 1 2.1 17 3 3 2 2.9 >>> spec = {".name": ["ht1", "ht2"], ".value": ["ht", "ht"], "age": [1, 2]} >>> spec = pd.DataFrame(spec) >>> spec  .name .value age 0 ht1 ht 1 1 ht2 ht 2 >>> pivot_wider_spec(df=df, spec=spec, index=["famid", "birth"])  famid birth ht1 ht2 0 1 1 2.8 3.4 1 1 2 2.9 3.8 2 1 3 2.2 2.9 3 2 1 2.0 3.2 4 2 2 1.8 2.8 5 2 3 1.9 2.4 6 3 1 2.2 3.3 7 3 2 2.3 3.4 8 3 3 2.1 2.9 
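For comparison, a hedged plain-pandas sketch of what the spec above describes: pivot ht by age, then relabel the resulting columns with the spec's .name entries. This is an illustration, not pivot_wider_spec's implementation:

>>> out = df.pivot(index=["famid", "birth"], columns="age", values="ht")
>>> out.columns = spec[".name"].tolist()  # assumes the age levels (1, 2) line up with the spec rows
>>> out.reset_index().columns.tolist()
['famid', 'birth', 'ht1', 'ht2']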

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
spec DataFrame

A specification DataFrame. At a minimum, the spec DataFrame must have '.name' and '.value' columns. The '.name' column should contain the names of the columns in the output DataFrame. The '.value' column should contain the name of the column(s) in the source DataFrame that will serve as the values. Additional columns in spec serve as the columns to be pivoted to wide form. Note that these additional columns should already exist in the source DataFrame.

required
index list | tuple | str | Pattern

Name(s) of columns to use as identifier variables. It should be either a single column name, or a list of column names. If index is not provided, the DataFrame's index is used. Column selection is possible using the select syntax.

None
reset_index bool

Determines whether to reset the index. Applicable only if index is provided.

True

Returns:

Type Description
DataFrame

A pandas DataFrame that has been pivoted from long to wide form.

Source code in janitor/functions/pivot.py
def pivot_wider_spec( df: pd.DataFrame, spec: pd.DataFrame, index: list | tuple | str | Pattern = None, reset_index: bool = True, ) -> pd.DataFrame:  """A declarative interface to pivot a DataFrame from long to wide form,  where you describe how the data will be pivoted,  using a DataFrame.  This gives you, the user,  more control over pivoting, where you create a “spec”  data frame that describes exactly how data stored  in the column names becomes variables.  It can come in handy for situations where  `pd.DataFrame.pivot`  seems inadequate for the transformation.  !!! info "New in version 0.31.0"  Examples:  >>> import pandas as pd  >>> from janitor import pivot_wider_spec  >>> df = pd.DataFrame(  ... [  ... {"famid": 1, "birth": 1, "age": 1, "ht": 2.8},  ... {"famid": 1, "birth": 1, "age": 2, "ht": 3.4},  ... {"famid": 1, "birth": 2, "age": 1, "ht": 2.9},  ... {"famid": 1, "birth": 2, "age": 2, "ht": 3.8},  ... {"famid": 1, "birth": 3, "age": 1, "ht": 2.2},  ... {"famid": 1, "birth": 3, "age": 2, "ht": 2.9},  ... {"famid": 2, "birth": 1, "age": 1, "ht": 2.0},  ... {"famid": 2, "birth": 1, "age": 2, "ht": 3.2},  ... {"famid": 2, "birth": 2, "age": 1, "ht": 1.8},  ... {"famid": 2, "birth": 2, "age": 2, "ht": 2.8},  ... {"famid": 2, "birth": 3, "age": 1, "ht": 1.9},  ... {"famid": 2, "birth": 3, "age": 2, "ht": 2.4},  ... {"famid": 3, "birth": 1, "age": 1, "ht": 2.2},  ... {"famid": 3, "birth": 1, "age": 2, "ht": 3.3},  ... {"famid": 3, "birth": 2, "age": 1, "ht": 2.3},  ... {"famid": 3, "birth": 2, "age": 2, "ht": 3.4},  ... {"famid": 3, "birth": 3, "age": 1, "ht": 2.1},  ... {"famid": 3, "birth": 3, "age": 2, "ht": 2.9},  ... ]  ... )  >>> df  famid birth age ht  0 1 1 1 2.8  1 1 1 2 3.4  2 1 2 1 2.9  3 1 2 2 3.8  4 1 3 1 2.2  5 1 3 2 2.9  6 2 1 1 2.0  7 2 1 2 3.2  8 2 2 1 1.8  9 2 2 2 2.8  10 2 3 1 1.9  11 2 3 2 2.4  12 3 1 1 2.2  13 3 1 2 3.3  14 3 2 1 2.3  15 3 2 2 3.4  16 3 3 1 2.1  17 3 3 2 2.9  >>> spec = {".name": ["ht1", "ht2"], ".value": ["ht", "ht"], "age": [1, 2]}  >>> spec = pd.DataFrame(spec)  >>> spec  .name .value age  0 ht1 ht 1  1 ht2 ht 2  >>> pivot_wider_spec(df=df, spec=spec, index=["famid", "birth"])  famid birth ht1 ht2  0 1 1 2.8 3.4  1 1 2 2.9 3.8  2 1 3 2.2 2.9  3 2 1 2.0 3.2  4 2 2 1.8 2.8  5 2 3 1.9 2.4  6 3 1 2.2 3.3  7 3 2 2.3 3.4  8 3 3 2.1 2.9  Args:  df: A pandas DataFrame.  spec: A specification DataFrame.  At a minimum, the spec DataFrame  must have a '.name' and a '.value' columns.  The '.name' column should contain the  the names of the columns in the output DataFrame.  The '.value' column should contain the name of the column(s)  in the source DataFrame that will be serve as the values.  Additional columns in spec will serves as the columns  to be flipped to wide form.  Note that these additional columns should already exist  in the source DataFrame.  index: Name(s) of columns to use as identifier variables.  It should be either a single column name, or a list of column names.  If `index` is not provided, the DataFrame's index is used.  Column selection is possible using the  [`select`][janitor.functions.select.select] syntax.  reset_index: Determines whether to reset the `index`.  Applicable only if `index` is provided.  Returns:  A pandas DataFrame that has been unpivoted from long to wide form.  
""" # noqa: E501 check("spec", spec, [pd.DataFrame]) check("reset_index", reset_index, [bool]) if not spec.columns.is_unique: raise ValueError("Kindly ensure the spec's columns is unique.") if ".name" not in spec.columns: raise KeyError("Kindly ensure the spec DataFrame has a `.name` column.") if ".value" not in spec.columns: raise KeyError("Kindly ensure the spec DataFrame has a `.value` column.") if spec.columns.tolist()[:2] != [".name", ".value"]: raise ValueError( "The first two columns of the spec DataFrame " "should be '.name' and '.value', " "with '.name' coming before '.value'." ) if spec.columns.size == 2: raise ValueError( "Kindly provide the column(s) to use to make new frame’s columns" ) columns = spec.columns[2:] values = spec[".value"].unique() if index is not None: index = _select_index([index], df, axis="columns") index = df.columns[index].tolist() df = df.pivot(index=index, columns=columns, values=values) _index = spec.columns[1:].tolist() spec = spec.set_index(_index).squeeze() df = df.reindex(columns=spec.index) df.columns = df.columns.map(spec) if reset_index and index: return df.reset_index() return df 

process_text

Implementation source for process_text.

process_text(df, column_name, string_function, **kwargs)

Apply a Pandas string method to an existing column.

This function aims to make string cleaning easy while chaining, by simply passing the string method name, along with any keyword arguments, to the function.

This modifies an existing column; it does not create a new column. New columns can be created via pyjanitor's transform_columns.

A list of all the string methods in Pandas can be found in the pandas text-methods summary: https://pandas.pydata.org/docs/user_guide/text.html#method-summary

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

>>> import pandas as pd >>> import janitor >>> import re >>> df = pd.DataFrame( ...  {"text": ["Ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]} ... ) >>> df  text code 0 Ragnar 1 1 sammywemmy 2 2 ginger 3 >>> df.process_text(column_name="text", string_function="lower")  text code 0 ragnar 1 1 sammywemmy 2 2 ginger 3 

For string methods with parameters, simply pass the keyword arguments:

>>> df.process_text( ...  column_name="text", ...  string_function="extract", ...  pat=r"(ag)", ...  expand=False, ...  flags=re.IGNORECASE, ... )  text code 0 ag 1 1 NaN 2 2 NaN 3 
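Since the note above recommends transform_column, here is a hedged sketch of an equivalent call for the lower-casing example, assuming transform_column applies the given function elementwise to the named column and overwrites it:

>>> # roughly equivalent to df.process_text(column_name="text", string_function="lower")
>>> df.transform_column(column_name="text", function=str.lower)  # doctest: +SKIP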

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name str

String column to be operated on.

required
string_function str

pandas string method to be applied.

required
**kwargs Any

Keyword arguments for parameters of the string_function.

{}

Raises:

Type Description
KeyError

If string_function is not a Pandas string method.

ValueError

If the text function returns a DataFrame, instead of a Series.

Returns:

Type Description
DataFrame

A pandas DataFrame with modified column.

Source code in janitor/functions/process_text.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `jn.transform_columns` instead." ) ) @deprecated_alias(column="column_name") def process_text( df: pd.DataFrame, column_name: str, string_function: str, **kwargs: Any, ) -> pd.DataFrame:  """Apply a Pandas string method to an existing column.  This function aims to make string cleaning easy, while chaining,  by simply passing the string method name,  along with keyword arguments, if any, to the function.  This modifies an existing column; it does not create a new column;  new columns can be created via pyjanitor's  [`transform_columns`][janitor.functions.transform_columns.transform_columns].  A list of all the string methods in Pandas can be accessed [here](https://pandas.pydata.org/docs/user_guide/text.html#method-summary).  !!!note  This function will be deprecated in a 1.x release.  Please use [`jn.transform_column`][janitor.functions.transform_columns.transform_column]  instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> import re  >>> df = pd.DataFrame(  ... {"text": ["Ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]}  ... )  >>> df  text code  0 Ragnar 1  1 sammywemmy 2  2 ginger 3  >>> df.process_text(column_name="text", string_function="lower")  text code  0 ragnar 1  1 sammywemmy 2  2 ginger 3  For string methods with parameters, simply pass the keyword arguments:  >>> df.process_text(  ... column_name="text",  ... string_function="extract",  ... pat=r"(ag)",  ... expand=False,  ... flags=re.IGNORECASE,  ... )  text code  0 ag 1  1 NaN 2  2 NaN 3  Args:  df: A pandas DataFrame.  column_name: String column to be operated on.  string_function: pandas string method to be applied.  **kwargs: Keyword arguments for parameters of the `string_function`.  Raises:  KeyError: If `string_function` is not a Pandas string method.  ValueError: If the text function returns a DataFrame, instead of a Series.  Returns:  A pandas DataFrame with modified column.  """ # noqa: E501 check("column_name", column_name, [str]) check("string_function", string_function, [str]) check_column(df, [column_name]) pandas_string_methods = [ func.__name__ for _, func in inspect.getmembers(pd.Series.str, inspect.isfunction) if not func.__name__.startswith("_") ] if string_function not in pandas_string_methods: raise KeyError(f"{string_function} is not a Pandas string method.") result = getattr(df[column_name].str, string_function)(**kwargs) if isinstance(result, pd.DataFrame): raise ValueError( "The outcome of the processed text is a DataFrame, " "which is not supported in `process_text`." ) return df.assign(**{column_name: result}) 

remove_columns

Implementation of remove_columns.

remove_columns(df, column_names)

Remove the set of columns specified in column_names.

This method does not mutate the original DataFrame.

Intended to be the method-chaining alternative to del df[col].

Note

This function will be deprecated in a 1.x release. Kindly use pd.DataFrame.drop instead.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]}) >>> df  a b c 0 2 1 7 1 4 3 8 2 6 5 9 >>> df.remove_columns(column_names=["a", "c"])  b 0 1 1 3 2 5 
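As the note above suggests, the same result comes straight from pandas:

>>> df.drop(columns=["a", "c"])
   b
0  1
1  3
2  5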

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[str, Iterable[str], Hashable]

The columns to remove.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/remove_columns.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.drop` instead." ) ) @deprecated_alias(columns="column_names") def remove_columns( df: pd.DataFrame, column_names: Union[str, Iterable[str], Hashable], ) -> pd.DataFrame:  """Remove the set of columns specified in `column_names`.  This method does not mutate the original DataFrame.  Intended to be the method-chaining alternative to `del df[col]`.  !!!note  This function will be deprecated in a 1.x release.  Kindly use `pd.DataFrame.drop` instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})  >>> df  a b c  0 2 1 7  1 4 3 8  2 6 5 9  >>> df.remove_columns(column_names=["a", "c"])  b  0 1  1 3  2 5  Args:  df: A pandas DataFrame.  column_names: The columns to remove.  Returns:  A pandas DataFrame.  """ return df.drop(columns=column_names) 

remove_empty

Implementation of remove_empty.

remove_empty(df, reset_index=True)

Drop all rows and columns that are completely null.

This method does not mutate the original DataFrame.

Implementation is inspired by a StackOverflow answer.

Examples:

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": [1, np.nan, 2], ...  "b": [3, np.nan, 4], ...  "c": [np.nan, np.nan, np.nan], ...  } ... ) >>> df  a b c 0 1.0 3.0 NaN 1 NaN NaN NaN 2 2.0 4.0 NaN >>> df.remove_empty()  a b 0 1.0 3.0 1 2.0 4.0 
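For reference, the implementation shown below reduces to this plain-pandas equivalent (plus the optional index reset):

>>> mask = df.isna()
>>> df.loc[~mask.all(axis=1), ~mask.all(axis=0)].reset_index(drop=True)
     a    b
0  1.0  3.0
1  2.0  4.0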

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
reset_index bool

Determines if the index is reset.

True

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/remove_empty.py
@pf.register_dataframe_method def remove_empty(df: pd.DataFrame, reset_index: bool = True) -> pd.DataFrame:  """Drop all rows and columns that are completely null.  This method does not mutate the original DataFrame.  Implementation is inspired from [StackOverflow][so].  [so]: https://stackoverflow.com/questions/38884538/python-pandas-find-all-rows-where-all-values-are-nan  Examples:  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [1, np.nan, 2],  ... "b": [3, np.nan, 4],  ... "c": [np.nan, np.nan, np.nan],  ... }  ... )  >>> df  a b c  0 1.0 3.0 NaN  1 NaN NaN NaN  2 2.0 4.0 NaN  >>> df.remove_empty()  a b  0 1.0 3.0  1 2.0 4.0  Args:  df: The pandas DataFrame object.  reset_index: Determines if the index is reset.  Returns:  A pandas DataFrame.  """ # noqa: E501 outcome = df.isna() outcome = df.loc[~outcome.all(axis=1), ~outcome.all(axis=0)] if reset_index: return outcome.reset_index(drop=True) return outcome 

rename_columns

rename_column(df, old_column_name, new_column_name)

Rename a column.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.rename instead.

This is just syntactic sugar/a convenience function for renaming one column at a time. If you need to rename multiple columns, use the pandas.DataFrame.rename method instead.

Examples:

Change the name of column 'a' to 'a_new'.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")}) >>> df.rename_column(old_column_name="a", new_column_name="a_new")  a_new b 0 0 a 1 1 b 2 2 c 

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
old_column_name str

The old column name.

required
new_column_name str

The new column name.

required

Returns:

Type Description
DataFrame

A pandas DataFrame with renamed columns.

Source code in janitor/functions/rename_columns.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.rename` instead." ) ) @deprecated_alias(old="old_column_name", new="new_column_name") def rename_column( df: pd.DataFrame, old_column_name: str, new_column_name: str, ) -> pd.DataFrame:  """Rename a column in place.  This method does not mutate the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.rename` instead.  This is just syntactic sugar/a convenience function for renaming one column at a time.  If you are convinced that there are multiple columns in need of changing,  then use the `pandas.DataFrame.rename` method.  Examples:  Change the name of column 'a' to 'a_new'.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})  >>> df.rename_column(old_column_name="a", new_column_name="a_new")  a_new b  0 0 a  1 1 b  2 2 c  Args:  df: The pandas DataFrame object.  old_column_name: The old column name.  new_column_name: The new column name.  Returns:  A pandas DataFrame with renamed columns.  """ # noqa: E501 check_column(df, [old_column_name]) return df.rename(columns={old_column_name: new_column_name}) 

rename_columns(df, new_column_names=None, function=None)

Rename columns.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.rename instead.

One of new_column_names or function is a required parameter. If both are provided, then new_column_names takes priority and function is never executed.

Examples:

Rename columns using a dictionary which maps old names to new names.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")}) >>> df  a b 0 0 x 1 1 y 2 2 z >>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})  a_new b_new 0 0 x 1 1 y 2 2 z 

Rename columns using a generic callable.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")}) >>> df.rename_columns(function=str.upper)  A B 0 0 x 1 1 y 2 2 z 
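As the deprecation note recommends, both forms map directly onto pd.DataFrame.rename:

>>> df.rename(columns={"a": "a_new", "b": "b_new"}).columns.tolist()
['a_new', 'b_new']
>>> df.rename(columns=str.upper).columns.tolist()
['A', 'B']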

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
new_column_names Union[Dict, None]

A dictionary of old and new column names.

None
function Callable

A function which should be applied to all the columns.

None

Raises:

Type Description
ValueError

If both new_column_names and function are None.

Returns:

Type Description
DataFrame

A pandas DataFrame with renamed columns.

Source code in janitor/functions/rename_columns.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `pd.DataFrame.rename` instead." ) ) def rename_columns( df: pd.DataFrame, new_column_names: Union[Dict, None] = None, function: Callable = None, ) -> pd.DataFrame:  """Rename columns.  This method does not mutate the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.rename` instead.  One of the `new_column_names` or `function` are a required parameter.  If both are provided, then `new_column_names` takes priority and `function`  is never executed.  Examples:  Rename columns using a dictionary which maps old names to new names.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})  >>> df  a b  0 0 x  1 1 y  2 2 z  >>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})  a_new b_new  0 0 x  1 1 y  2 2 z  Rename columns using a generic callable.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})  >>> df.rename_columns(function=str.upper)  A B  0 0 x  1 1 y  2 2 z  Args:  df: The pandas DataFrame object.  new_column_names: A dictionary of old and new column names.  function: A function which should be applied to all the columns.  Raises:  ValueError: If both `new_column_names` and `function` are None.  Returns:  A pandas DataFrame with renamed columns.  """ # noqa: E501 if new_column_names is None and function is None: raise ValueError("One of new_column_names or function must be provided") if new_column_names is not None: check_column(df, new_column_names) return df.rename(columns=new_column_names) return df.rename(mapper=function, axis="columns") 

reorder_columns

Implementation source for reorder_columns.

reorder_columns(df, column_order)

Reorder DataFrame columns by specifying the desired order as a list of column names.

Columns not specified retain their relative order and are placed after the columns specified in column_order.

All columns specified within the column_order list must be present within df.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]}) >>> df  col1 col2 col3 0 1 2 3 1 1 2 3 2 1 2 3 >>> df.reorder_columns(["col3", "col1"])  col3 col1 col2 0 3 1 2 1 3 1 2 2 3 1 2 

Notice that the column order of df is now col3, col1, col2.

Internally, this function uses DataFrame.reindex with copy=False to avoid unnecessary data duplication.
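In plain pandas, the reordering described above amounts to the following (a sketch mirroring the implementation shown below):

>>> order = ["col3", "col1"]
>>> rest = [col for col in df.columns if col not in order]
>>> df.reindex(columns=order + rest).columns.tolist()
['col3', 'col1', 'col2']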

Parameters:

Name Type Description Default
df DataFrame

DataFrame to reorder

required
column_order Union[Iterable[str], Index, Hashable]

A list of column names or Pandas Index specifying their order in the returned DataFrame.

required

Raises:

Type Description
IndexError

If a column within column_order is not found within the DataFrame.

Returns:

Type Description
DataFrame

A pandas DataFrame with reordered columns.

Source code in janitor/functions/reorder_columns.py
@pf.register_dataframe_method def reorder_columns( df: pd.DataFrame, column_order: Union[Iterable[str], pd.Index, Hashable] ) -> pd.DataFrame:  """Reorder DataFrame columns by specifying desired order as list of col names.  Columns not specified retain their order and follow after the columns specified  in `column_order`.  All columns specified within the `column_order` list must be present within `df`.  This method does not mutate the original DataFrame.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})  >>> df  col1 col2 col3  0 1 2 3  1 1 2 3  2 1 2 3  >>> df.reorder_columns(["col3", "col1"])  col3 col1 col2  0 3 1 2  1 3 1 2  2 3 1 2  Notice that the column order of `df` is now `col3`, `col1`, `col2`.  Internally, this function uses `DataFrame.reindex` with `copy=False`  to avoid unnecessary data duplication.  Args:  df: `DataFrame` to reorder  column_order: A list of column names or Pandas `Index`  specifying their order in the returned `DataFrame`.  Raises:  IndexError: If a column within `column_order` is not found  within the DataFrame.  Returns:  A pandas DataFrame with reordered columns.  """ # noqa: E501 check("column_order", column_order, [list, tuple, pd.Index]) if any(col not in df.columns for col in column_order): raise IndexError( "One or more columns in `column_order` were not found in the DataFrame." ) # if column_order is a Pandas index, needs conversion to list: column_order = list(column_order) return df.reindex( columns=(column_order + [col for col in df.columns if col not in column_order]), copy=False, ) 

round_to_fraction

Implementation of round_to_fraction.

round_to_fraction(df, column_name, denominator, digits=np.inf)

Round all values in a column to a fraction.

This method mutates the original DataFrame.

Taken from the R janitor package.

Also, optionally round to a specified number of digits.

Examples:

Round numeric column to the nearest 1/4 value.

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a1": [1.263, 2.499, np.nan], ...  "a2": ["x", "y", "z"], ...  } ... ) >>> df  a1 a2 0 1.263 x 1 2.499 y 2 NaN z >>> df.round_to_fraction("a1", denominator=4)  a1 a2 0 1.25 x 1 2.50 y 2 NaN z 
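The nearest-1/4 result above follows from the arithmetic in the implementation shown below: multiply by the denominator, round to the nearest whole number, then divide back.

>>> round(1.263 * 4, 0) / 4
1.25
>>> round(2.499 * 4, 0) / 4
2.5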

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Name of column to round to fraction.

required
denominator float

The denominator of the fraction for rounding. Must be a positive number.

required
digits float

The number of digits for rounding after rounding to the fraction. Default is np.inf (i.e. no subsequent rounding).

inf

Raises:

Type Description
ValueError

If denominator is not a positive number.

Returns:

Type Description
DataFrame

A pandas DataFrame with a column's values rounded.

Source code in janitor/functions/round_to_fraction.py
@pf.register_dataframe_method @deprecated_alias(col_name="column_name") def round_to_fraction( df: pd.DataFrame, column_name: Hashable, denominator: float, digits: float = np.inf, ) -> pd.DataFrame:  """Round all values in a column to a fraction.  This method mutates the original DataFrame.  Taken from [the R package](https://github.com/sfirke/janitor/issues/235).  Also, optionally round to a specified number of digits.  Examples:  Round numeric column to the nearest 1/4 value.  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a1": [1.263, 2.499, np.nan],  ... "a2": ["x", "y", "z"],  ... }  ... )  >>> df  a1 a2  0 1.263 x  1 2.499 y  2 NaN z  >>> df.round_to_fraction("a1", denominator=4)  a1 a2  0 1.25 x  1 2.50 y  2 NaN z  Args:  df: A pandas DataFrame.  column_name: Name of column to round to fraction.  denominator: The denominator of the fraction for rounding. Must be  a positive number.  digits: The number of digits for rounding after rounding to the  fraction. Default is np.inf (i.e. no subsequent rounding).  Raises:  ValueError: If `denominator` is not a positive number.  Returns:  A pandas DataFrame with a column's values rounded.  """ check_column(df, column_name) check("denominator", denominator, [float, int]) check("digits", digits, [float, int]) if denominator <= 0: raise ValueError("denominator is expected to be a positive number.") df[column_name] = round(df[column_name] * denominator, 0) / denominator if not np.isinf(digits): df[column_name] = round(df[column_name], digits) return df 

row_to_names

Implementation of the row_to_names function.

row_to_names(df, row_numbers=0, remove_rows=False, remove_rows_above=False, reset_index=False)

Elevates a row, or rows, to be the column names of a DataFrame.

This method does not mutate the original DataFrame.

Contains options to remove the elevated row(s) from the DataFrame, as well as the rows above the selected row.

Examples:

Replace column names with the first row and reset the index.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": ["nums", 6, 9], ...  "b": ["chars", "x", "y"], ...  } ... ) >>> df  a b 0 nums chars 1 6 x 2 9 y >>> df.row_to_names(0, remove_rows=True, reset_index=True)  nums chars 0 6 x 1 9 y >>> df.row_to_names([0, 1], remove_rows=True, reset_index=True)  nums chars  6 x 0 9 y 

Remove rows above the elevated row and the elevated row itself.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": ["bla1", "nums", 6, 9], ...  "b": ["bla2", "chars", "x", "y"], ...  } ... ) >>> df  a b 0 bla1 bla2 1 nums chars 2 6 x 3 9 y >>> df.row_to_names( ...  1, remove_rows=True, remove_rows_above=True, reset_index=True ... )  nums chars 0 6 x 1 9 y 
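Conceptually, elevating a single row is the familiar pandas pattern sketched here. This is an illustration of the idea only, not row_to_names' implementation, which also handles lists and slices of rows:

>>> out = df.iloc[2:].copy()           # keep the rows below the header row
>>> out.columns = df.iloc[1].tolist()  # row at position 1 becomes the column names
>>> out.columns.tolist()
['nums', 'chars']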

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
row_numbers int | list | slice

Position of the row(s) containing the variable names. It can be an integer, a list or a slice. Defaults to 0 (first row).

0
remove_rows bool

Whether the row(s) should be removed from the DataFrame.

False
remove_rows_above bool

Whether the row(s) above the selected row should be removed from the DataFrame.

False
reset_index bool

Whether the index should be reset on the returning DataFrame.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with set column names.

Source code in janitor/functions/row_to_names.py
@pf.register_dataframe_method @deprecated_alias(row_number="row_numbers", remove_row="remove_rows") def row_to_names( df: pd.DataFrame, row_numbers: int | list | slice = 0, remove_rows: bool = False, remove_rows_above: bool = False, reset_index: bool = False, ) -> pd.DataFrame:  """Elevates a row, or rows, to be the column names of a DataFrame.  This method does not mutate the original DataFrame.  Contains options to remove the elevated row from the DataFrame along with  removing the rows above the selected row.  Examples:  Replace column names with the first row and reset the index.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": ["nums", 6, 9],  ... "b": ["chars", "x", "y"],  ... }  ... )  >>> df  a b  0 nums chars  1 6 x  2 9 y  >>> df.row_to_names(0, remove_rows=True, reset_index=True)  nums chars  0 6 x  1 9 y  >>> df.row_to_names([0, 1], remove_rows=True, reset_index=True)  nums chars  6 x  0 9 y  Remove rows above the elevated row and the elevated row itself.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": ["bla1", "nums", 6, 9],  ... "b": ["bla2", "chars", "x", "y"],  ... }  ... )  >>> df  a b  0 bla1 bla2  1 nums chars  2 6 x  3 9 y  >>> df.row_to_names(  ... 1, remove_rows=True, remove_rows_above=True, reset_index=True  ... )  nums chars  0 6 x  1 9 y  Args:  df: A pandas DataFrame.  row_numbers: Position of the row(s) containing the variable names.  It can be an integer, a list or a slice.  Defaults to 0 (first row).  remove_rows: Whether the row(s) should be removed from the DataFrame.  remove_rows_above: Whether the row(s) above the selected row should  be removed from the DataFrame.  reset_index: Whether the index should be reset on the returning DataFrame.  Returns:  A pandas DataFrame with set column names.  """ # noqa: E501 return _row_to_names( row_numbers, df=df, remove_rows=remove_rows, remove_rows_above=remove_rows_above, reset_index=reset_index, ) 

select

DropLabel dataclass

Helper class for removing labels within the select syntax.

label can be any of the types supported in the select, select_rows and select_columns functions. An array of integers not matching the labels is returned.

New in version 0.24.0

Parameters:

Name Type Description Default
label Any

Label(s) to be dropped from the index.

required
Source code in janitor/functions/select.py
@dataclass class DropLabel:  """Helper class for removing labels within the `select` syntax.  `label` can be any of the types supported in the `select`,  `select_rows` and `select_columns` functions.  An array of integers not matching the labels is returned.  !!! info "New in version 0.24.0"  Args:  label: Label(s) to be dropped from the index.  """ label: Any 

get_columns(group, label)

Helper function for selecting columns on a grouped object, using the select syntax.

New in version 0.25.0

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Parameters:

Name Type Description Default
group DataFrameGroupBy | SeriesGroupBy

A Pandas GroupBy object.

required
label Any

column(s) to select.

required

Returns:

Type Description
DataFrameGroupBy | SeriesGroupBy

A pandas groupby object.

Source code in janitor/functions/select.py
@refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `jn.select` instead." ) ) def get_columns( group: DataFrameGroupBy | SeriesGroupBy, label: Any ) -> DataFrameGroupBy | SeriesGroupBy:  """  Helper function for selecting columns on a grouped object,  using the  [`select`][janitor.functions.select.select] syntax.  !!! info "New in version 0.25.0"  !!!note  This function will be deprecated in a 1.x release.  Please use `jn.select` instead.  Args:  group: A Pandas GroupBy object.  label: column(s) to select.  Returns:  A pandas groupby object.  """ return _get_columns_on_a_grouped_object(group=group, label=label) 

get_index_labels(arg, df, axis)

Convenience function to get the actual labels from a column or index.

New in version 0.25.0

Parameters:

Name Type Description Default
arg Any

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

required
df DataFrame

The pandas DataFrame object.

required
axis Literal['index', 'columns']

Should be either index or columns.

required

Returns:

Type Description
Index

A pandas Index.
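A small, hedged usage sketch; the DataFrame here is invented for illustration, and the glob string follows the select syntax described above:

>>> import pandas as pd
>>> from janitor.functions.select import get_index_labels
>>> df = pd.DataFrame({"max_speed": [1, 4], "shield": [2, 5]})
>>> get_index_labels("max_*", df, axis="columns")
Index(['max_speed'], dtype='object')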

Source code in janitor/functions/select.py
def get_index_labels( arg: Any, df: pd.DataFrame, axis: Literal["index", "columns"] ) -> pd.Index:  """Convenience function to get actual labels from column/index  !!! info "New in version 0.25.0"  Args:  arg: Valid inputs include: an exact column name to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  df: The pandas DataFrame object.  axis: Should be either `index` or `columns`.  Returns:  A pandas Index.  """ assert axis in {"index", "columns"} index = getattr(df, axis) return index[_select_index(arg, df, axis)] 

select(df, *args, index=None, columns=None, axis='columns', invert=False)

Method-chainable selection of rows and/or columns.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame or Series.

If the pandas object is a Series, selection is possible only on the index.

Selection can be inverted with the DropLabel class.

Optional ability to invert selection of index/columns available as well.

New in version 0.24.0

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is to use the .loc or .iloc methods, as they are generally more performant. select is primarily for convenience.

Version Changed

  • 0.26.0
    • Added variable args, invert and axis parameters.
    • rows keyword deprecated in favour of index.
  • 0.31.0
    • Add support for pd.Series.
  • 0.32.0
    • Add support for DataFrameGroupBy.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  [[1, 2], [4, 5], [7, 8]], ...  index=["cobra", "viper", "sidewinder"], ...  columns=["max_speed", "shield"], ... ) >>> df  max_speed shield cobra 1 2 viper 4 5 sidewinder 7 8 >>> df.select(index="cobra", columns="shield")  shield cobra 2 

Labels can be dropped with the DropLabel class:

>>> df.select(index=DropLabel("cobra"))  max_speed shield viper 4 5 sidewinder 7 8 

More examples can be found in the select_columns section.
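For comparison with the note above about .loc being preferred, the first select example is equivalent to:

>>> df.loc[["cobra"], ["shield"]]
       shield
cobra       2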

Parameters:

Name Type Description Default
df DataFrame | Series | DataFrameGroupBy

A pandas DataFrame or a GroupBy object.

required
*args tuple

Valid inputs include: an exact index name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
index Any

Valid inputs include: an exact label to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

None
columns Any

Valid inputs include: an exact label to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

None
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the rows/columns provided.

False
axis str

Whether the selection should be on the index ('index') or the columns ('columns'). Applicable only to the variable args parameter.

'columns'

Raises:

Type Description
ValueError

If args and index/columns are provided.

Returns:

Type Description
DataFrame | Series | DataFrameGroupBy

A pandas DataFrame or Series with the specified rows and/or columns selected.

Source code in janitor/functions/select.py
@pf.register_groupby_method @pf.register_dataframe_method @pf.register_series_method @deprecated_alias(rows="index") def select( df: pd.DataFrame | pd.Series | DataFrameGroupBy, *args: tuple, index: Any = None, columns: Any = None, axis: str = "columns", invert: bool = False, ) -> pd.DataFrame | pd.Series | DataFrameGroupBy:  """Method-chainable selection of rows and/or columns.  It accepts a string, shell-like glob strings `(*string*)`,  regex, slice, array-like object, or a list of the previous options.  Selection on a MultiIndex on a level, or multiple levels,  is possible with a dictionary.  This method does not mutate the original DataFrame or Series.  If the pandas object is a Series, selection is possible only on the index.  Selection can be inverted with the `DropLabel` class.  Optional ability to invert selection of index/columns available as well.  !!! info "New in version 0.24.0"  !!!note  The preferred option when selecting columns or rows in a Pandas DataFrame  is with `.loc` or `.iloc` methods, as they are generally performant.  `select` is primarily for convenience.  !!! abstract "Version Changed"  - 0.26.0  - Added variable `args`, `invert` and `axis` parameters.  - `rows` keyword deprecated in favour of `index`.  - 0.31.0  - Add support for pd.Series.  - 0.32.0  - Add support for DataFrameGroupBy.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... [[1, 2], [4, 5], [7, 8]],  ... index=["cobra", "viper", "sidewinder"],  ... columns=["max_speed", "shield"],  ... )  >>> df  max_speed shield  cobra 1 2  viper 4 5  sidewinder 7 8  >>> df.select(index="cobra", columns="shield")  shield  cobra 2  Labels can be dropped with the `DropLabel` class:  >>> df.select(index=DropLabel("cobra"))  max_speed shield  viper 4 5  sidewinder 7 8  More examples can be found in the  [`select_columns`][janitor.functions.select.select_columns] section.  Args:  df: A pandas DataFrame or a GroupBy object.  *args: Valid inputs include: an exact index name to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  index: Valid inputs include: an exact label to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  columns: Valid inputs include: an exact label to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  invert: Whether or not to invert the selection.  This will result in the selection  of the complement of the rows/columns provided.  axis: Whether the selection should be on the index('index'),  or columns('columns').  Applicable only for the variable args parameter.  Raises:  ValueError: If args and index/columns are provided.  Returns:  A pandas DataFrame or Series with the specified rows and/or columns selected.  
""" # noqa: E501 if args and isinstance(df, DataFrameGroupBy): return _get_columns_on_a_grouped_object(group=df, label=list(args)) if isinstance(df, DataFrameGroupBy): return _get_columns_on_a_grouped_object(group=df, label=[columns]) if args: check("invert", invert, [bool]) if (index is not None) or (columns is not None): raise ValueError( "Either provide variable args with the axis parameter, " "or provide arguments to the index and/or columns parameters." ) if axis == "index": return _select(df, rows=list(args), columns=columns, invert=invert) if axis == "columns": return _select(df, columns=list(args), rows=index, invert=invert) raise ValueError("axis should be either 'index' or 'columns'.") return _select(df, rows=index, columns=columns, invert=invert) 

select_columns(df, *args, invert=False)

Method-chainable selection of columns.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

Optional ability to invert selection of columns available as well.

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods. select_columns is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd >>> import janitor >>> from numpy import nan >>> pd.set_option("display.max_columns", None) >>> pd.set_option("display.expand_frame_repr", False) >>> pd.set_option("max_colwidth", None) >>> data = { ...  "name": [ ...  "Cheetah", ...  "Owl monkey", ...  "Mountain beaver", ...  "Greater short-tailed shrew", ...  "Cow", ...  ], ...  "genus": ["Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos"], ...  "vore": ["carni", "omni", "herbi", "omni", "herbi"], ...  "order": [ ...  "Carnivora", ...  "Primates", ...  "Rodentia", ...  "Soricomorpha", ...  "Artiodactyla", ...  ], ...  "conservation": ["lc", nan, "nt", "lc", "domesticated"], ...  "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0], ...  "sleep_rem": [nan, 1.8, 2.4, 2.3, 0.7], ...  "sleep_cycle": [nan, nan, nan, 0.133333333, 0.666666667], ...  "awake": [11.9, 7.0, 9.6, 9.1, 20.0], ...  "brainwt": [nan, 0.0155, nan, 0.00029, 0.423], ...  "bodywt": [50.0, 0.48, 1.35, 0.019, 600.0], ... } >>> df = pd.DataFrame(data) >>> df  name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt 0 Cheetah Acinonyx carni Carnivora lc 12.1 NaN NaN 11.9 NaN 50.000 1 Owl monkey Aotus omni Primates NaN 17.0 1.8 NaN 7.0 0.01550 0.480 2 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NaN 9.6 NaN 1.350 3 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133333 9.1 0.00029 0.019 4 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.666667 20.0 0.42300 600.000 

Explicit label selection:

>>> df.select_columns("name", "order")  name order 0 Cheetah Carnivora 1 Owl monkey Primates 2 Mountain beaver Rodentia 3 Greater short-tailed shrew Soricomorpha 4 Cow Artiodactyla 

Selection via globbing:

>>> df.select_columns("sleep*", "*wt")  sleep_total sleep_rem sleep_cycle brainwt bodywt 0 12.1 NaN NaN NaN 50.000 1 17.0 1.8 NaN 0.01550 0.480 2 14.4 2.4 NaN NaN 1.350 3 14.9 2.3 0.133333 0.00029 0.019 4 4.0 0.7 0.666667 0.42300 600.000 

Selection via regex:

>>> import re >>> df.select_columns(re.compile(r"o.+er"))  order conservation 0 Carnivora lc 1 Primates NaN 2 Rodentia nt 3 Soricomorpha lc 4 Artiodactyla domesticated 

Selection via slicing:

>>> df.select_columns( ...  slice("name", "order"), slice("sleep_total", "sleep_cycle") ... )  name genus vore order sleep_total sleep_rem sleep_cycle 0 Cheetah Acinonyx carni Carnivora 12.1 NaN NaN 1 Owl monkey Aotus omni Primates 17.0 1.8 NaN 2 Mountain beaver Aplodontia herbi Rodentia 14.4 2.4 NaN 3 Greater short-tailed shrew Blarina omni Soricomorpha 14.9 2.3 0.133333 4 Cow Bos herbi Artiodactyla 4.0 0.7 0.666667 

Selection via callable:

>>> from pandas.api.types import is_numeric_dtype >>> df.select_columns(is_numeric_dtype)  sleep_total sleep_rem sleep_cycle awake brainwt bodywt 0 12.1 NaN NaN 11.9 NaN 50.000 1 17.0 1.8 NaN 7.0 0.01550 0.480 2 14.4 2.4 NaN 9.6 NaN 1.350 3 14.9 2.3 0.133333 9.1 0.00029 0.019 4 4.0 0.7 0.666667 20.0 0.42300 600.000 >>> df.select_columns(lambda f: f.isna().any())  conservation sleep_rem sleep_cycle brainwt 0 lc NaN NaN NaN 1 NaN 1.8 NaN 0.01550 2 nt 2.4 NaN NaN 3 lc 2.3 0.133333 0.00029 4 domesticated 0.7 0.666667 0.42300 

Exclude columns with the invert parameter:

>>> df.select_columns(is_numeric_dtype, invert=True)  name genus vore order conservation 0 Cheetah Acinonyx carni Carnivora lc 1 Owl monkey Aotus omni Primates NaN 2 Mountain beaver Aplodontia herbi Rodentia nt 3 Greater short-tailed shrew Blarina omni Soricomorpha lc 4 Cow Bos herbi Artiodactyla domesticated 

Exclude columns with the DropLabel class:

>>> from janitor import DropLabel >>> df.select_columns(DropLabel(slice("name", "awake")))  brainwt bodywt 0 NaN 50.000 1 0.01550 0.480 2 NaN 1.350 3 0.00029 0.019 4 0.42300 600.000 

Selection on MultiIndex columns:

>>> d = { ...  "num_legs": [4, 4, 2, 2], ...  "num_wings": [0, 0, 2, 2], ...  "class": ["mammal", "mammal", "mammal", "bird"], ...  "animal": ["cat", "dog", "bat", "penguin"], ...  "locomotion": ["walks", "walks", "flies", "walks"], ... } >>> df = pd.DataFrame(data=d) >>> df = df.set_index(["class", "animal", "locomotion"]).T >>> df class mammal bird animal cat dog bat penguin locomotion walks walks flies walks num_legs 4 4 2 2 num_wings 0 0 2 2 

Selection with a scalar:

>>> df.select_columns("mammal") class mammal animal cat dog bat locomotion walks walks flies num_legs 4 4 2 num_wings 0 0 2 

Selection with a tuple:

>>> df.select_columns(("mammal", "bat")) class mammal animal bat locomotion flies num_legs 2 num_wings 2 

Selection within a level is possible with a dictionary, where the key is either a level name or number:

>>> df.select_columns({"animal": "cat"}) class mammal animal cat locomotion walks num_legs 4 num_wings 0 >>> df.select_columns({1: ["bat", "cat"]}) class mammal animal bat cat locomotion flies walks num_legs 2 4 num_wings 2 0 

Selection on multiple levels:

>>> df.select_columns({"class": "mammal", "locomotion": "flies"}) class mammal animal bat locomotion flies num_legs 2 num_wings 2 

Selection with a regex on a level:

>>> df.select_columns({"animal": re.compile(".+t$")}) class mammal animal cat bat locomotion walks flies num_legs 4 2 num_wings 0 2 

Selection with a callable on a level:

>>> df.select_columns({"animal": lambda f: f.str.endswith("t")}) class mammal animal cat bat locomotion walks flies num_legs 4 2 num_wings 0 2 

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args Any

Valid inputs include: an exact column name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the columns provided.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified columns selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `jn.select` instead." ) ) def select_columns( df: pd.DataFrame, *args: Any, invert: bool = False, ) -> pd.DataFrame:  """Method-chainable selection of columns.  It accepts a string, shell-like glob strings `(*string*)`,  regex, slice, array-like object, or a list of the previous options.  Selection on a MultiIndex on a level, or multiple levels,  is possible with a dictionary.  This method does not mutate the original DataFrame.  Optional ability to invert selection of columns available as well.  !!!note  The preferred option when selecting columns or rows in a Pandas DataFrame  is with `.loc` or `.iloc` methods.  `select_columns` is primarily for convenience.  !!!note  This function will be deprecated in a 1.x release.  Please use `jn.select` instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> from numpy import nan  >>> pd.set_option("display.max_columns", None)  >>> pd.set_option("display.expand_frame_repr", False)  >>> pd.set_option("max_colwidth", None)  >>> data = {  ... "name": [  ... "Cheetah",  ... "Owl monkey",  ... "Mountain beaver",  ... "Greater short-tailed shrew",  ... "Cow",  ... ],  ... "genus": ["Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos"],  ... "vore": ["carni", "omni", "herbi", "omni", "herbi"],  ... "order": [  ... "Carnivora",  ... "Primates",  ... "Rodentia",  ... "Soricomorpha",  ... "Artiodactyla",  ... ],  ... "conservation": ["lc", nan, "nt", "lc", "domesticated"],  ... "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0],  ... "sleep_rem": [nan, 1.8, 2.4, 2.3, 0.7],  ... "sleep_cycle": [nan, nan, nan, 0.133333333, 0.666666667],  ... "awake": [11.9, 7.0, 9.6, 9.1, 20.0],  ... "brainwt": [nan, 0.0155, nan, 0.00029, 0.423],  ... "bodywt": [50.0, 0.48, 1.35, 0.019, 600.0],  ... }  >>> df = pd.DataFrame(data)  >>> df  name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt  0 Cheetah Acinonyx carni Carnivora lc 12.1 NaN NaN 11.9 NaN 50.000  1 Owl monkey Aotus omni Primates NaN 17.0 1.8 NaN 7.0 0.01550 0.480  2 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NaN 9.6 NaN 1.350  3 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133333 9.1 0.00029 0.019  4 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.666667 20.0 0.42300 600.000  Explicit label selection:  >>> df.select_columns("name", "order")  name order  0 Cheetah Carnivora  1 Owl monkey Primates  2 Mountain beaver Rodentia  3 Greater short-tailed shrew Soricomorpha  4 Cow Artiodactyla  Selection via globbing:  >>> df.select_columns("sleep*", "*wt")  sleep_total sleep_rem sleep_cycle brainwt bodywt  0 12.1 NaN NaN NaN 50.000  1 17.0 1.8 NaN 0.01550 0.480  2 14.4 2.4 NaN NaN 1.350  3 14.9 2.3 0.133333 0.00029 0.019  4 4.0 0.7 0.666667 0.42300 600.000  Selection via regex:  >>> import re  >>> df.select_columns(re.compile(r"o.+er"))  order conservation  0 Carnivora lc  1 Primates NaN  2 Rodentia nt  3 Soricomorpha lc  4 Artiodactyla domesticated  Selection via slicing:  >>> df.select_columns(  ... slice("name", "order"), slice("sleep_total", "sleep_cycle")  ... 
)  name genus vore order sleep_total sleep_rem sleep_cycle  0 Cheetah Acinonyx carni Carnivora 12.1 NaN NaN  1 Owl monkey Aotus omni Primates 17.0 1.8 NaN  2 Mountain beaver Aplodontia herbi Rodentia 14.4 2.4 NaN  3 Greater short-tailed shrew Blarina omni Soricomorpha 14.9 2.3 0.133333  4 Cow Bos herbi Artiodactyla 4.0 0.7 0.666667  Selection via callable:  >>> from pandas.api.types import is_numeric_dtype  >>> df.select_columns(is_numeric_dtype)  sleep_total sleep_rem sleep_cycle awake brainwt bodywt  0 12.1 NaN NaN 11.9 NaN 50.000  1 17.0 1.8 NaN 7.0 0.01550 0.480  2 14.4 2.4 NaN 9.6 NaN 1.350  3 14.9 2.3 0.133333 9.1 0.00029 0.019  4 4.0 0.7 0.666667 20.0 0.42300 600.000  >>> df.select_columns(lambda f: f.isna().any())  conservation sleep_rem sleep_cycle brainwt  0 lc NaN NaN NaN  1 NaN 1.8 NaN 0.01550  2 nt 2.4 NaN NaN  3 lc 2.3 0.133333 0.00029  4 domesticated 0.7 0.666667 0.42300  Exclude columns with the `invert` parameter:  >>> df.select_columns(is_numeric_dtype, invert=True)  name genus vore order conservation  0 Cheetah Acinonyx carni Carnivora lc  1 Owl monkey Aotus omni Primates NaN  2 Mountain beaver Aplodontia herbi Rodentia nt  3 Greater short-tailed shrew Blarina omni Soricomorpha lc  4 Cow Bos herbi Artiodactyla domesticated  Exclude columns with the `DropLabel` class:  >>> from janitor import DropLabel  >>> df.select_columns(DropLabel(slice("name", "awake")))  brainwt bodywt  0 NaN 50.000  1 0.01550 0.480  2 NaN 1.350  3 0.00029 0.019  4 0.42300 600.000  Selection on MultiIndex columns:  >>> d = {  ... "num_legs": [4, 4, 2, 2],  ... "num_wings": [0, 0, 2, 2],  ... "class": ["mammal", "mammal", "mammal", "bird"],  ... "animal": ["cat", "dog", "bat", "penguin"],  ... "locomotion": ["walks", "walks", "flies", "walks"],  ... }  >>> df = pd.DataFrame(data=d)  >>> df = df.set_index(["class", "animal", "locomotion"]).T  >>> df  class mammal bird  animal cat dog bat penguin  locomotion walks walks flies walks  num_legs 4 4 2 2  num_wings 0 0 2 2  Selection with a scalar:  >>> df.select_columns("mammal")  class mammal  animal cat dog bat  locomotion walks walks flies  num_legs 4 4 2  num_wings 0 0 2  Selection with a tuple:  >>> df.select_columns(("mammal", "bat"))  class mammal  animal bat  locomotion flies  num_legs 2  num_wings 2  Selection within a level is possible with a dictionary,  where the key is either a level name or number:  >>> df.select_columns({"animal": "cat"})  class mammal  animal cat  locomotion walks  num_legs 4  num_wings 0  >>> df.select_columns({1: ["bat", "cat"]})  class mammal  animal bat cat  locomotion flies walks  num_legs 2 4  num_wings 2 0  Selection on multiple levels:  >>> df.select_columns({"class": "mammal", "locomotion": "flies"})  class mammal  animal bat  locomotion flies  num_legs 2  num_wings 2  Selection with a regex on a level:  >>> df.select_columns({"animal": re.compile(".+t$")})  class mammal  animal cat bat  locomotion walks flies  num_legs 4 2  num_wings 0 2  Selection with a callable on a level:  >>> df.select_columns({"animal": lambda f: f.str.endswith("t")})  class mammal  animal cat bat  locomotion walks flies  num_legs 4 2  num_wings 0 2  Args:  df: A pandas DataFrame.  *args: Valid inputs include: an exact column name to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  invert: Whether or not to invert the selection. 
 This will result in the selection  of the complement of the columns provided.  Returns:  A pandas DataFrame with the specified columns selected.  """ # noqa: E501 return _select(df, columns=list(args), invert=invert) 

select_rows(df, *args, invert=False)

Method-chainable selection of rows.

It accepts a string, shell-like glob strings (*string*), regex, slice, array-like object, or a list of the previous options.

Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.

This method does not mutate the original DataFrame.

Optional ability to invert selection of rows available as well.

New in version 0.24.0

Note

The preferred option when selecting columns or rows in a Pandas DataFrame is with .loc or .iloc methods, as they are generally performant. select_rows is primarily for convenience.

Note

This function will be deprecated in a 1.x release. Please use jn.select instead.

Examples:

>>> import pandas as pd >>> import janitor >>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]} >>> df = pd.DataFrame.from_dict(df, orient="index") >>> df  0 1 col1 1 2 foo 3 4 col2 5 6 >>> df.select_rows("col*")  0 1 col1 1 2 col2 5 6 

More examples can be found in the select_columns section.

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
*args Any

Valid inputs include: an exact index name to look for, a shell-style glob string (e.g. *_thing_*), a regular expression, a callable, or variable arguments of all the aforementioned. A sequence of booleans is also acceptable. A dictionary can be used for selection on a MultiIndex on different levels.

()
invert bool

Whether or not to invert the selection. This will result in the selection of the complement of the rows provided.

False

Returns:

Type Description
DataFrame

A pandas DataFrame with the specified rows selected.

Source code in janitor/functions/select.py
@pf.register_dataframe_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `jn.select` instead." ) ) def select_rows( df: pd.DataFrame, *args: Any, invert: bool = False, ) -> pd.DataFrame:  """Method-chainable selection of rows.  It accepts a string, shell-like glob strings `(*string*)`,  regex, slice, array-like object, or a list of the previous options.  Selection on a MultiIndex on a level, or multiple levels,  is possible with a dictionary.  This method does not mutate the original DataFrame.  Optional ability to invert selection of rows available as well.  !!! info "New in version 0.24.0"  !!!note  The preferred option when selecting columns or rows in a Pandas DataFrame  is with `.loc` or `.iloc` methods, as they are generally performant.  `select_rows` is primarily for convenience.  !!!note  This function will be deprecated in a 1.x release.  Please use `jn.select` instead.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}  >>> df = pd.DataFrame.from_dict(df, orient="index")  >>> df  0 1  col1 1 2  foo 3 4  col2 5 6  >>> df.select_rows("col*")  0 1  col1 1 2  col2 5 6  More examples can be found in the  [`select_columns`][janitor.functions.select.select_columns] section.  Args:  df: A pandas DataFrame.  *args: Valid inputs include: an exact index name to look for,  a shell-style glob string (e.g. `*_thing_*`),  a regular expression,  a callable,  or variable arguments of all the aforementioned.  A sequence of booleans is also acceptable.  A dictionary can be used for selection  on a MultiIndex on different levels.  invert: Whether or not to invert the selection.  This will result in the selection  of the complement of the rows provided.  Returns:  A pandas DataFrame with the specified rows selected.  """ # noqa: E501 return _select(df, rows=list(args), invert=invert) 

shuffle

Implementation of shuffle functions.

shuffle(df, random_state=None, reset_index=True)

Shuffle the rows of the DataFrame.

This method does not mutate the original DataFrame.

Super-sugary syntax! Under the hood, we use df.sample(frac=1), with the option to set the random state.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "col1": range(5), ...  "col2": list("abcde"), ...  } ... ) >>> df  col1 col2 0 0 a 1 1 b 2 2 c 3 3 d 4 4 e >>> df.shuffle(random_state=42)  col1 col2 0 1 b 1 4 e 2 2 c 3 0 a 4 3 d 
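
Since shuffle is documented as sugar over df.sample(frac=1), the plain-pandas equivalent is a one-liner; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"col1": range(5), "col2": list("abcde")})

# Sample every row in random order, then restore a default RangeIndex.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)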

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
random_state Any

If provided, set a seed for the random number generator. Passed to pd.DataFrame.sample().

None
reset_index bool

If True, reset the dataframe index to the default RangeIndex.

True

Returns:

Type Description
DataFrame

A shuffled pandas DataFrame.

Source code in janitor/functions/shuffle.py
@pf.register_dataframe_method def shuffle( df: pd.DataFrame, random_state: Any = None, reset_index: bool = True ) -> pd.DataFrame:  """Shuffle the rows of the DataFrame.  This method does not mutate the original DataFrame.  Super-sugary syntax! Underneath the hood, we use `df.sample(frac=1)`,  with the option to set the random state.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "col1": range(5),  ... "col2": list("abcde"),  ... }  ... )  >>> df  col1 col2  0 0 a  1 1 b  2 2 c  3 3 d  4 4 e  >>> df.shuffle(random_state=42)  col1 col2  0 1 b  1 4 e  2 2 c  3 0 a  4 3 d  Args:  df: A pandas DataFrame.  random_state: If provided, set a seed for the random number  generator. Passed to `pd.DataFrame.sample()`.  reset_index: If True, reset the dataframe index to the default  RangeIndex.  Returns:  A shuffled pandas DataFrame.  """ result = df.sample(frac=1, random_state=random_state) if reset_index: result = result.reset_index(drop=True) return result 

sort_column_value_order

Implementation of the sort_column_value_order function.

sort_column_value_order(df, column, column_value_order, columns=None)

This function adds precedence to certain values in a specified column, then sorts based on that column and any other specified columns.

Examples:

>>> import pandas as pd >>> import janitor >>> import numpy as np >>> company_sales = { ...  "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"], ...  "Company1": [150.0, 200.0, 200.0, 300.0, 400.0], ...  "Company2": [180.0, 250.0, 250.0, np.nan, 500.0], ...  "Company3": [400.0, 500.0, 500.0, 600.0, 675.0], ... } >>> df = pd.DataFrame.from_dict(company_sales) >>> df  SalesMonth Company1 Company2 Company3 0 Jan 150.0 180.0 400.0 1 Feb 200.0 250.0 500.0 2 Feb 200.0 250.0 500.0 3 Mar 300.0 NaN 600.0 4 April 400.0 500.0 675.0 >>> df.sort_column_value_order( ...  "SalesMonth", {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4} ... )  SalesMonth Company1 Company2 Company3 4 April 400.0 500.0 675.0 3 Mar 300.0 NaN 600.0 1 Feb 200.0 250.0 500.0 2 Feb 200.0 250.0 500.0 0 Jan 150.0 180.0 400.0 
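
For comparison, a minimal plain-pandas sketch of the same precedence-based sort, mirroring the map-then-sort approach shown in the source below (the helper column name _order is illustrative):

import numpy as np
import pandas as pd

company_sales = {
    "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],
    "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],
    "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],
    "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],
}
df = pd.DataFrame(company_sales)
precedence = {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}

# Map each value to its precedence, sort on the helper column, then drop it.
result = (
    df.assign(_order=df["SalesMonth"].map(precedence))
    .sort_values("_order")
    .drop(columns="_order")
)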

Parameters:

Name Type Description Default
df DataFrame

pandas DataFrame that we are manipulating

required
column str

The name of the column (a string) to sort by.

required
column_value_order dict

Dictionary of values that will represent precedence of the values in the specified column

required
columns str

A list of additional columns that we can sort by

None

Raises:

Type Description
ValueError

If chosen Column Name is not in Dataframe, or if column_value_order dictionary is empty.

Returns:

Type Description
DataFrame

A sorted pandas DataFrame.

Source code in janitor/functions/sort_column_value_order.py
@pf.register_dataframe_method def sort_column_value_order( df: pd.DataFrame, column: str, column_value_order: dict, columns: str = None, ) -> pd.DataFrame:  """This function adds precedence to certain values in a specified column,  then sorts based on that column and any other specified columns.  Examples:  >>> import pandas as pd  >>> import janitor  >>> import numpy as np  >>> company_sales = {  ... "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],  ... "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],  ... "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],  ... "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],  ... }  >>> df = pd.DataFrame.from_dict(company_sales)  >>> df  SalesMonth Company1 Company2 Company3  0 Jan 150.0 180.0 400.0  1 Feb 200.0 250.0 500.0  2 Feb 200.0 250.0 500.0  3 Mar 300.0 NaN 600.0  4 April 400.0 500.0 675.0  >>> df.sort_column_value_order(  ... "SalesMonth", {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}  ... )  SalesMonth Company1 Company2 Company3  4 April 400.0 500.0 675.0  3 Mar 300.0 NaN 600.0  1 Feb 200.0 250.0 500.0  2 Feb 200.0 250.0 500.0  0 Jan 150.0 180.0 400.0  Args:  df: pandas DataFrame that we are manipulating  column: This is a column name as a string we are using to specify  which column to sort by  column_value_order: Dictionary of values that will  represent precedence of the values in the specified column  columns: A list of additional columns that we can sort by  Raises:  ValueError: If chosen Column Name is not in  Dataframe, or if `column_value_order` dictionary is empty.  Returns:  A sorted pandas DataFrame.  """ # Validation checks check_column(df, column, present=True) check("column_value_order", column_value_order, [dict]) if not column_value_order: raise ValueError("column_value_order dictionary cannot be empty") df = df.assign(cond_order=df[column].map(column_value_order)) sort_by = ["cond_order"] if columns is not None: sort_by = ["cond_order"] + columns df = df.sort_values(sort_by).remove_columns("cond_order") return df 

sort_naturally

Implementation of the sort_naturally function.

sort_naturally(df, column_name, **natsorted_kwargs)

Sort a DataFrame by a column using natural sorting.

Natural sorting is distinct from the default lexicographical sorting provided by pandas. For example, given the following list of items:

["A1", "A11", "A3", "A2", "A10"] 

Lexicographical sorting would give us:

["A1", "A10", "A11", "A2", "A3"] 

By contrast, "natural" sorting would give us:

["A1", "A2", "A3", "A10", "A11"] 

This function thus provides natural sorting on a single column of a dataframe.

To accomplish this, we do a natural sort on the unique values that are present in the dataframe. Then, we reconstitute the entire dataframe in the naturally sorted order.

Natural sorting is provided by the Python package natsort.

All keyword arguments to natsort should be provided after the column name to sort by is provided. They are passed through to the natsorted function.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "Well": ["A21", "A3", "A21", "B2", "B51", "B12"], ...  "Value": [1, 2, 13, 3, 4, 7], ...  } ... ) >>> df  Well Value 0 A21 1 1 A3 2 2 A21 13 3 B2 3 4 B51 4 5 B12 7 >>> df.sort_naturally("Well")  Well Value 1 A3 2 0 A21 1 2 A21 13 3 B2 3 5 B12 7 4 B51 4 
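
Keyword arguments are forwarded to natsort's natsorted; for example, reverse=True (a standard natsorted keyword) yields a descending natural sort. A short sketch:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame(
    {"Well": ["A21", "A3", "A21", "B2", "B51", "B12"], "Value": [1, 2, 13, 3, 4, 7]}
)

# Descending natural sort; other natsorted keywords can be passed the same way.
df.sort_naturally("Well", reverse=True)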

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name str

The column on which natural sorting should take place.

required
**natsorted_kwargs Any

Keyword arguments to be passed to natsort's natsorted function.

{}

Returns:

Type Description
DataFrame

A sorted pandas DataFrame.

Source code in janitor/functions/sort_naturally.py
@pf.register_dataframe_method def sort_naturally( df: pd.DataFrame, column_name: str, **natsorted_kwargs: Any ) -> pd.DataFrame:  """Sort a DataFrame by a column using *natural* sorting.  Natural sorting is distinct from  the default lexiographical sorting provided by `pandas`.  For example, given the following list of items:  ```python  ["A1", "A11", "A3", "A2", "A10"]  ```  Lexicographical sorting would give us:  ```python  ["A1", "A10", "A11", "A2", "A3"]  ```  By contrast, "natural" sorting would give us:  ```python  ["A1", "A2", "A3", "A10", "A11"]  ```  This function thus provides *natural* sorting  on a single column of a dataframe.  To accomplish this, we do a natural sort  on the unique values that are present in the dataframe.  Then, we reconstitute the entire dataframe  in the naturally sorted order.  Natural sorting is provided by the Python package  [natsort](https://natsort.readthedocs.io/en/master/index.html).  All keyword arguments to `natsort` should be provided  after the column name to sort by is provided.  They are passed through to the `natsorted` function.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "Well": ["A21", "A3", "A21", "B2", "B51", "B12"],  ... "Value": [1, 2, 13, 3, 4, 7],  ... }  ... )  >>> df  Well Value  0 A21 1  1 A3 2  2 A21 13  3 B2 3  4 B51 4  5 B12 7  >>> df.sort_naturally("Well")  Well Value  1 A3 2  0 A21 1  2 A21 13  3 B2 3  5 B12 7  4 B51 4  Args:  df: A pandas DataFrame.  column_name: The column on which natural sorting should take place.  **natsorted_kwargs: Keyword arguments to be passed  to natsort's `natsorted` function.  Returns:  A sorted pandas DataFrame.  """ new_order = index_natsorted(df[column_name], **natsorted_kwargs) return df.iloc[new_order, :] 

summarise

Implementation of summarise.

summarise(df, *args)

New in version 0.31.0

Note

Before reaching for summarise, try pd.DataFrame.agg.

summarise creates a new dataframe; it returns one row for each combination of grouping columns. If there are no grouping variables, the output will have a single row summarising all observations in the input.

The argument provided to args should be either a dictionary, a callable or a tuple; however, anything can be passed, as long as it fits within pandas' aggregation semantics.

  • dictionary argument: If the argument is a dictionary, the value in the {key:value} pairing should be either a string, a callable, or a tuple.

    • If the value in the dictionary is a string or a callable, the key of the dictionary should be an existing column name.

    The function is applied on the df[column_name] series.

    Note

    • If the value is a string, the string should be a pandas string function, e.g. "sum", "mean", etc.
    • If the value of the dictionary is a tuple, it should be of length 2, and of the form (column_name, aggfunc), where column_name should exist in the DataFrame, and aggfunc should be either a string or a callable.

    This option allows for custom renaming of the aggregation output, where the key in the dictionary can be a new column name.

  • tuple argument: If the argument is a tuple, it should be of length 2, and of the form (column_name, aggfunc), where column_name should exist in the DataFrame, and aggfunc should be either a string or a callable.

    Note

    • If aggfunc is a string, the string should be a pandas string function, e.g. "sum", "mean", etc.

    Note

    • column_name can be anything supported by the select syntax; as such multiple columns can be processed here - they will be processed individually.
  • callable argument: If the argument is a callable, the callable is applied on the DataFrame or GroupBy object. The result from the callable should be a pandas Series or DataFrame.

Aggregated columns cannot be reused in summarise.

Arguments supported in pd.DataFrame.groupby can also be passed to by via a dictionary.

Examples:

>>> import pandas as pd >>> import janitor >>> data = { ...  "avg_jump": [3, 4, 1, 2, 3, 4], ...  "avg_run": [3, 4, 1, 3, 2, 4], ...  "combine_id": [100200, 100200, 101200, 101200, 102201, 103202], ... } >>> df = pd.DataFrame(data) >>> df  avg_jump avg_run combine_id 0 3 3 100200 1 4 4 100200 2 1 1 101200 3 2 3 101200 4 3 2 102201 5 4 4 103202 

Aggregation on a DataFrame via a callable:

>>> df.summarise(lambda df: df.select("avg*").mean().rename("mean"))  mean avg_jump 2.833333 avg_run 2.833333 

Aggregation on a DataFrame via a tuple:

>>> df.summarise(("avg_*", "mean"))  avg_jump avg_run mean 2.833333 2.833333 

Aggregation on a DataFrame via a dictionary:

>>> df.summarise({"avg_jump": "mean"})  avg_jump mean 2.833333 
>>> df.summarise({"avg_run_2": ("avg_run", "mean")})  avg_run_2 mean 2.833333 
>>> grouped = df.groupby("combine_id") 

Aggregation on a grouped object via a callable:

>>> grouped.summarise(lambda df: df.sum())  avg_jump avg_run combine_id 100200 7 7 101200 3 4 102201 3 2 103202 4 4 

Aggregation on a grouped object via a tuple:

>>> grouped.summarise(("avg_run", "mean"))  avg_run combine_id 100200 3.5 101200 2.0 102201 2.0 103202 4.0 

Aggregation on a grouped object via a dictionary:

>>> grouped.summarise({"avg_run": "mean"})  avg_run combine_id 100200 3.5 101200 2.0 102201 2.0 103202 4.0 >>> grouped.summarise({"avg_run_2": ("avg_run", "mean")})  avg_run_2 combine_id 100200 3.5 101200 2.0 102201 2.0 103202 4.0 
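
Because summarise accepts variable arguments, several aggregations can in principle be combined in one call, with the results concatenated column-wise (as the pd.concat in the source below suggests). A sketch assuming that behaviour, where the output column names follow the dictionary keys:

# Two renamed aggregations in a single call (illustrative only).
grouped.summarise(
    {"avg_run_mean": ("avg_run", "mean")},
    {"avg_jump_sum": ("avg_jump", "sum")},
)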

Parameters:

Name Type Description Default
df DataFrame | DataFrameGroupBy

A pandas DataFrame or DataFrameGroupBy object.

required
args tuple[dict | tuple]

Either a dictionary or a tuple.

()

Raises:

Type Description
ValueError

If a tuple is passed and the length is not 2.

Returns:

Type Description
DataFrame

A pandas DataFrame with aggregated columns.

Source code in janitor/functions/summarise.py
@pf.register_groupby_method @pf.register_dataframe_method def summarise( df: pd.DataFrame | DataFrameGroupBy, *args: tuple[dict | tuple], ) -> pd.DataFrame:  """  !!! info "New in version 0.31.0"  !!!note  Before reaching for `summarise`, try `pd.DataFrame.agg`.  summarise creates a new dataframe;  it returns one row for each combination of grouping columns.  If there are no grouping variables,  the output will have a single row  summarising all observations in the input.  The argument provided to *args* should be either  a dictionary, a callable or a tuple; however,  anything can be passed, as long as it fits  within pandas' aggregation semantics.  - **dictionary argument**:  If the argument is a dictionary,  the value in the `{key:value}` pairing  should be either a string, a callable, or a tuple.  - If the value in the dictionary  is a string or a callable,  the key of the dictionary  should be an existing column name.  The function is applied on the `df[column_name]` series.  !!!note  - If the value is a string,  the string should be a pandas string function,  e.g "sum", "mean", etc.  - If the value of the dictionary is a tuple,  it should be of length 2, and of the form  `(column_name, aggfunc)`,  where `column_name` should exist in the DataFrame,  and `aggfunc` should be either a string or a callable.  This option allows for custom renaming of the aggregation output,  where the key in the dictionary can be a new column name.  - **tuple argument**:  If the argument is a tuple, it should be of length 2,  and of the form  `(column_name, aggfunc)`,  where column_name should exist in the DataFrame,  and `aggfunc` should be either a string or a callable.  !!!note  - if `aggfunc` is a string,  the string should be a pandas string function,  e.g "sum", "mean", etc.  !!!note  - `column_name` can be anything supported by the  [`select`][janitor.functions.select.select] syntax;  as such multiple columns can be processed here -  they will be processed individually.  - **callable argument**:  If the argument is a callable, the callable is applied  on the DataFrame or GroupBy object.  The result from the callable should be a pandas Series  or DataFrame.  Aggregated columns cannot be reused in `summarise`.  Arguments supported in `pd.DataFrame.groupby`  can also be passed to `by` via a dictionary.  Examples:  >>> import pandas as pd  >>> import janitor  >>> data = {  ... "avg_jump": [3, 4, 1, 2, 3, 4],  ... "avg_run": [3, 4, 1, 3, 2, 4],  ... "combine_id": [100200, 100200, 101200, 101200, 102201, 103202],  ... 
}  >>> df = pd.DataFrame(data)  >>> df  avg_jump avg_run combine_id  0 3 3 100200  1 4 4 100200  2 1 1 101200  3 2 3 101200  4 3 2 102201  5 4 4 103202  Aggregation on a DataFrame via a callable:  >>> df.summarise(lambda df: df.select("avg*").mean().rename("mean"))  mean  avg_jump 2.833333  avg_run 2.833333  Aggregation on a DataFrame via a tuple:  >>> df.summarise(("avg_*", "mean"))  avg_jump avg_run  mean 2.833333 2.833333  Aggregation on a DataFrame via a dictionary:  >>> df.summarise({"avg_jump": "mean"})  avg_jump  mean 2.833333  >>> df.summarise({"avg_run_2": ("avg_run", "mean")})  avg_run_2  mean 2.833333  >>> grouped = df.groupby("combine_id")  Aggregation on a grouped object via a callable:  >>> grouped.summarise(lambda df: df.sum())  avg_jump avg_run  combine_id  100200 7 7  101200 3 4  102201 3 2  103202 4 4  Aggregation on a grouped object via a tuple:  >>> grouped.summarise(("avg_run", "mean"))  avg_run  combine_id  100200 3.5  101200 2.0  102201 2.0  103202 4.0  Aggregation on a grouped object via a dictionary:  >>> grouped.summarise({"avg_run": "mean"})  avg_run  combine_id  100200 3.5  101200 2.0  102201 2.0  103202 4.0  >>> grouped.summarise({"avg_run_2": ("avg_run", "mean")})  avg_run_2  combine_id  100200 3.5  101200 2.0  102201 2.0  103202 4.0  Args:  df: A pandas DataFrame or DataFrameGroupBy object.  args: Either a dictionary or a tuple.  Raises:  ValueError: If a tuple is passed and the length is not 2.  Returns:  A pandas DataFrame with aggregated columns.  """ # noqa: E501 if isinstance(df, DataFrameGroupBy): by = df df = df.obj else: by = None contents = [] for arg in args: aggregate = _aggfunc(arg, df=df, by=by) contents.extend(aggregate) counts = 0 for entry in contents: if isinstance(entry, pd.DataFrame): length = entry.columns.nlevels elif isinstance(entry.name, tuple): length = len(entry.name) else: length = 1 counts = max(counts, length) contents_ = [] for entry in contents: if isinstance(entry, pd.DataFrame): length_ = entry.columns.nlevels length = counts - length_ if length: patch = [""] * length columns = [entry.columns.get_level_values(n) for n in range(length_)] columns.append(patch) names = [*entry.columns.names] names.extend([None] * length) columns = pd.MultiIndex.from_arrays(columns, names=names) entry.columns = columns elif is_scalar(entry.name): length = counts - 1 if length: patch = [""] * length name = (entry.name, *patch) entry.name = name elif isinstance(entry.name, tuple): length = counts - len(entry.name) if length: patch = [""] * length name = (*entry.name, *patch) entry.name = name contents_.append(entry) return pd.concat(contents_, axis=1, copy=False, sort=False) 

take_first

Implementation of take_first function.

take_first(df, subset, by, ascending=True)

Take the first row within each group specified by subset.

Examples:

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]}) >>> df  a b 0 x 0 1 x 1 2 y 2 3 y 3 >>> df.take_first(subset="a", by="b")  a b 0 x 0 2 y 2 
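
As the source below shows, this is a sort followed by drop_duplicates; a plain-pandas sketch of the same idiom:

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})

# Sort by "b", then keep the first row within each "a" group.
df.sort_values(by="b", ascending=True).drop_duplicates(subset="a", keep="first")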

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
subset Union[Hashable, Iterable[Hashable]]

Column(s) defining the group.

required
by Hashable

Column to sort by.

required
ascending bool

Whether or not to sort in ascending order, bool.

True

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/take_first.py
@pf.register_dataframe_method def take_first( df: pd.DataFrame, subset: Union[Hashable, Iterable[Hashable]], by: Hashable, ascending: bool = True, ) -> pd.DataFrame:  """Take the first row within each group specified by `subset`.  Examples:  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})  >>> df  a b  0 x 0  1 x 1  2 y 2  3 y 3  >>> df.take_first(subset="a", by="b")  a b  0 x 0  2 y 2  Args:  df: A pandas DataFrame.  subset: Column(s) defining the group.  by: Column to sort by.  ascending: Whether or not to sort in ascending order, `bool`.  Returns:  A pandas DataFrame.  """ result = df.sort_values(by=by, ascending=ascending).drop_duplicates( subset=subset, keep="first" ) return result 

then

Implementation source for then.

then(df, func)

Add an arbitrary function to run in the pyjanitor method chain.

This method does not mutate the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use pd.DataFrame.pipe instead.

Examples:

A trivial example using a lambda func.

>>> import pandas as pd >>> import janitor >>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]}).then(lambda df: df * 2))  a b 0 2 14 1 4 16 2 6 18 
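
The suggested replacement, pd.DataFrame.pipe, expresses the same chain:

import pandas as pd

# Same transformation with pandas' own pipe.
pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]}).pipe(lambda df: df * 2)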

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
func Callable

A function you would like to run in the method chain. It should take one parameter and return one parameter, each being the DataFrame object. After that, do whatever you want in the middle. Go crazy.

required

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/then.py
@pf.register_dataframe_method @refactored_function( message="This function will be deprecated in a 1.x release. " "Kindly use `pd.DataFrame.pipe` instead." ) def then(df: pd.DataFrame, func: Callable) -> pd.DataFrame:  """Add an arbitrary function to run in the `pyjanitor` method chain.  This method does not mutate the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use `pd.DataFrame.pipe` instead.  Examples:  A trivial example using a lambda `func`.  >>> import pandas as pd  >>> import janitor  >>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]}).then(lambda df: df * 2))  a b  0 2 14  1 4 16  2 6 18  Args:  df: A pandas DataFrame.  func: A function you would like to run in the method chain.  It should take one parameter and return one parameter, each being  the DataFrame object. After that, do whatever you want in the  middle. Go crazy.  Returns:  A pandas DataFrame.  """ df = func(df) return df 

to_datetime

Implementation source for to_datetime.

to_datetime(df, column_name, **kwargs)

Convert column to a datetime type, in-place.

Intended to be the method-chaining equivalent of:

df[column_name] = pd.to_datetime(df[column_name], **kwargs) 

This method mutates the original DataFrame.

Note

This function will be deprecated in a 1.x release. Please use jn.transform_column instead.

Examples:

Converting a string column to datetime type with custom format.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame({"date": ["20200101", "20200202", "20200303"]}) >>> df  date 0 20200101 1 20200202 2 20200303 >>> df.to_datetime("date", format="%Y%m%d")  date 0 2020-01-01 1 2020-02-02 2 2020-03-03 

Read the pandas documentation for to_datetime for more information.
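
A non-mutating sketch of the suggested replacement, using transform_column with elementwise=False so that the whole Series is handed to pd.to_datetime in one call:

import pandas as pd
import janitor  # noqa: F401

df = pd.DataFrame({"date": ["20200101", "20200202", "20200303"]})

# Column-wise transformation: the Series is passed to pd.to_datetime once.
df.transform_column(
    "date",
    lambda s: pd.to_datetime(s, format="%Y%m%d"),
    elementwise=False,
)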

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

Column name.

required
**kwargs Any

Provide any kwargs that pd.to_datetime can take.

{}

Returns:

Type Description
DataFrame

A pandas DataFrame with updated datetime data.

Source code in janitor/functions/to_datetime.py
@pf.register_dataframe_method @deprecated_alias(column="column_name") @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `jn.transform_columns` instead." ) ) def to_datetime(df: pd.DataFrame, column_name: Hashable, **kwargs: Any) -> pd.DataFrame:  """Convert column to a datetime type, in-place.  Intended to be the method-chaining equivalent of:  ```python  df[column_name] = pd.to_datetime(df[column_name], **kwargs)  ```  This method mutates the original DataFrame.  !!!note  This function will be deprecated in a 1.x release.  Please use [`jn.transform_column`][janitor.functions.transform_columns.transform_column]  instead.  Examples:  Converting a string column to datetime type with custom format.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame({"date": ["20200101", "20200202", "20200303"]})  >>> df  date  0 20200101  1 20200202  2 20200303  >>> df.to_datetime("date", format="%Y%m%d")  date  0 2020-01-01  1 2020-02-02  2 2020-03-03  Read the pandas documentation for [`to_datetime`][pd_docs] for more information.  [pd_docs]: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html  Args:  df: A pandas DataFrame.  column_name: Column name.  **kwargs: Provide any kwargs that `pd.to_datetime` can take.  Returns:  A pandas DataFrame with updated datetime data.  """ # noqa: E501 df[column_name] = pd.to_datetime(df[column_name], **kwargs) return df 

toset

Implementation of the toset function.

toset(series)

Return a set of the values.

Note

This function will be deprecated in a 1.x release. Please use set(df[column]) instead.

These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).

Examples:

>>> import pandas as pd >>> import janitor >>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"]) >>> s a 1 b 2 c 3 d 5 e 5 dtype: int64 >>> s.toset() {1, 2, 3, 5} 

Parameters:

Name Type Description Default
series Series

A pandas series.

required

Returns:

Type Description
Set

A set of values.

Source code in janitor/functions/toset.py
@pf.register_series_method @refactored_function( message=( "This function will be deprecated in a 1.x release. " "Please use `set(df[column])` instead." ) ) def toset(series: pd.Series) -> Set:  """Return a set of the values.  !!!note  This function will be deprecated in a 1.x release.  Please use `set(df[column])` instead.  These are each a scalar type, which is a Python scalar  (for str, int, float) or a pandas scalar  (for Timestamp/Timedelta/Interval/Period)  Examples:  >>> import pandas as pd  >>> import janitor  >>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"])  >>> s  a 1  b 2  c 3  d 5  e 5  dtype: int64  >>> s.toset()  {1, 2, 3, 5}  Args:  series: A pandas series.  Returns:  A set of values.  """ return set(series.tolist()) 

transform_columns

transform_column(df, column_name, function, dest_column_name=None, elementwise=True)

Transform the given column using the provided function.

Meant to be the method-chaining equivalent of:

df[dest_column_name] = df[column_name].apply(function) 

Functions can be applied in one of two ways:

  • Element-wise (default; elementwise=True). Then, the individual column elements will be passed in as the first argument of function.
  • Column-wise (elementwise=False). Then, function is expected to take in a pandas Series and return a sequence that is of identical length to the original.

If dest_column_name is provided, then the transformation result is stored in that column. Otherwise, the transformed result is stored under the name of the original column.

This method does not mutate the original DataFrame.

Examples:

Transform a column in-place with an element-wise function.

>>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "a": [2, 3, 4], ...  "b": ["area", "pyjanitor", "grapefruit"], ...  } ... ) >>> df  a b 0 2 area 1 3 pyjanitor 2 4 grapefruit >>> df.transform_column( ...  column_name="a", ...  function=lambda x: x**2 - 1, ... )  a b 0 3 area 1 8 pyjanitor 2 15 grapefruit 

Examples:

Transform a column in-place with a column-wise function.

>>> df.transform_column( ...  column_name="b", ...  function=lambda srs: srs.str[:5], ...  elementwise=False, ... )  a b 0 2 area 1 3 pyjan 2 4 grape 
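
To keep the original column and write the result elsewhere, pass dest_column_name; continuing the dataframe above (the new column name is illustrative):

# Store the transformed values in a new column instead of overwriting "a".
df.transform_column(
    column_name="a",
    function=lambda x: x**2 - 1,
    dest_column_name="a_transformed",
)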

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_name Hashable

The column to transform.

required
function Callable

A function to apply on the column.

required
dest_column_name Optional[str]

The column name to store the transformation result in. Defaults to None, which will result in the original column name being overwritten. If a name is provided here, then a new column with the transformed values will be created.

None
elementwise bool

Whether to apply the function elementwise or not. If elementwise is True, then the function's first argument should be the data type of each datum in the column of data, and should return a transformed datum. If elementwise is False, then the function should expect a pandas Series passed into it, and return a pandas Series.

True

Returns:

Type Description
DataFrame

A pandas DataFrame with a transformed column.

Source code in janitor/functions/transform_columns.py
@pf.register_dataframe_method @deprecated_alias(col_name="column_name", dest_col_name="dest_column_name") def transform_column( df: pd.DataFrame, column_name: Hashable, function: Callable, dest_column_name: Optional[str] = None, elementwise: bool = True, ) -> pd.DataFrame:  """Transform the given column using the provided function.  Meant to be the method-chaining equivalent of:  ```python  df[dest_column_name] = df[column_name].apply(function)  ```  Functions can be applied in one of two ways:  - **Element-wise** (default; `elementwise=True`). Then, the individual  column elements will be passed in as the first argument of `function`.  - **Column-wise** (`elementwise=False`). Then, `function` is expected to  take in a pandas Series and return a sequence that is of identical length  to the original.  If `dest_column_name` is provided, then the transformation result is stored  in that column. Otherwise, the transformed result is stored under the name  of the original column.  This method does not mutate the original DataFrame.  Examples:  Transform a column in-place with an element-wise function.  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "a": [2, 3, 4],  ... "b": ["area", "pyjanitor", "grapefruit"],  ... }  ... )  >>> df  a b  0 2 area  1 3 pyjanitor  2 4 grapefruit  >>> df.transform_column(  ... column_name="a",  ... function=lambda x: x**2 - 1,  ... )  a b  0 3 area  1 8 pyjanitor  2 15 grapefruit  Examples:  Transform a column in-place with an column-wise function.  >>> df.transform_column(  ... column_name="b",  ... function=lambda srs: srs.str[:5],  ... elementwise=False,  ... )  a b  0 2 area  1 3 pyjan  2 4 grape  Args:  df: A pandas DataFrame.  column_name: The column to transform.  function: A function to apply on the column.  dest_column_name: The column name to store the transformation result  in. Defaults to None, which will result in the original column  name being overwritten. If a name is provided here, then a new  column with the transformed values will be created.  elementwise: Whether to apply the function elementwise or not.  If `elementwise` is True, then the function's first argument  should be the data type of each datum in the column of data,  and should return a transformed datum.  If `elementwise` is False, then the function's should expect  a pandas Series passed into it, and return a pandas Series.  Returns:  A pandas DataFrame with a transformed column.  """ check_column(df, column_name) if dest_column_name is None: dest_column_name = column_name elif dest_column_name != column_name: # If `dest_column_name` is provided and equals `column_name`, then we # assume that the user's intent is to perform an in-place # transformation (Same behaviour as when `dest_column_name` = None). # Otherwise we throw an error if `dest_column_name` already exists in # df. check_column(df, dest_column_name, present=False) result = _get_transform_column_result( df[column_name], function, elementwise, ) return df.assign(**{dest_column_name: result}) 

transform_columns(df, column_names, function, suffix=None, elementwise=True, new_column_names=None)

Transform multiple columns through the same transformation.

This method does not mutate the original DataFrame.

Super syntactic sugar! Essentially wraps transform_column and calls it repeatedly over all column names provided.

User can optionally supply either a suffix to create a new set of columns with the specified suffix, or provide a dictionary mapping each original column name in column_names to its corresponding new column name. Note that all column names must be strings.

Examples:

log10 transform a list of columns, replacing original columns.

>>> import numpy as np >>> import pandas as pd >>> import janitor >>> df = pd.DataFrame( ...  { ...  "col1": [5, 10, 15], ...  "col2": [3, 6, 9], ...  "col3": [10, 100, 1_000], ...  } ... ) >>> df  col1 col2 col3 0 5 3 10 1 10 6 100 2 15 9 1000 >>> df.transform_columns(["col1", "col2", "col3"], np.log10)  col1 col2 col3 0 0.698970 0.477121 1.0 1 1.000000 0.778151 2.0 2 1.176091 0.954243 3.0 

Using the suffix parameter to create new columns.

>>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")  col1 col2 col3 col1_log col3_log 0 5 3 10 0.698970 1.0 1 10 6 100 1.000000 2.0 2 15 9 1000 1.176091 3.0 

Using the new_column_names parameter to create new columns.

>>> df.transform_columns( ...  ["col1", "col3"], ...  np.log10, ...  new_column_names={"col1": "transform1"}, ... )  col1 col2 col3 transform1 0 5 3 1.0 0.698970 1 10 6 2.0 1.000000 2 15 9 3.0 1.176091 
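
When the function is vectorised, elementwise=False applies it to each selected column as a whole Series; a sketch continuing the dataframe above, with an illustrative scaling function:

# Column-wise application: each selected column is passed in as a Series.
df.transform_columns(
    ["col1", "col3"],
    lambda s: s / s.max(),
    elementwise=False,
    suffix="_scaled",
)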

Parameters:

Name Type Description Default
df DataFrame

A pandas DataFrame.

required
column_names Union[List[str], Tuple[str]]

An iterable of columns to transform.

required
function Callable

A function to apply on each column.

required
suffix Optional[str]

Suffix to use when creating new columns to hold the transformed values.

None
elementwise bool

Passed on to transform_column; whether or not to apply the transformation function elementwise (True) or columnwise (False).

True
new_column_names Optional[Dict[str, str]]

An explicit mapping of old column names in column_names to new column names. If any column specified in column_names is not a key in this dictionary, the transformation will happen in-place for that column.

None

Raises:

Type Description
ValueError

If both suffix and new_column_names are specified.

Returns:

Type Description
DataFrame

A pandas DataFrame with transformed columns.

Source code in janitor/functions/transform_columns.py
@pf.register_dataframe_method @deprecated_alias(columns="column_names", new_names="new_column_names") def transform_columns( df: pd.DataFrame, column_names: Union[List[str], Tuple[str]], function: Callable, suffix: Optional[str] = None, elementwise: bool = True, new_column_names: Optional[Dict[str, str]] = None, ) -> pd.DataFrame:  """Transform multiple columns through the same transformation.  This method does not mutate the original DataFrame.  Super syntactic sugar!  Essentially wraps [`transform_column`][janitor.functions.transform_columns.transform_column]  and calls it repeatedly over all column names provided.  User can optionally supply either a suffix to create a new set of columns  with the specified suffix, or provide a dictionary mapping each original  column name in `column_names` to its corresponding new column name.  Note that all column names must be strings.  Examples:  log10 transform a list of columns, replacing original columns.  >>> import numpy as np  >>> import pandas as pd  >>> import janitor  >>> df = pd.DataFrame(  ... {  ... "col1": [5, 10, 15],  ... "col2": [3, 6, 9],  ... "col3": [10, 100, 1_000],  ... }  ... )  >>> df  col1 col2 col3  0 5 3 10  1 10 6 100  2 15 9 1000  >>> df.transform_columns(["col1", "col2", "col3"], np.log10)  col1 col2 col3  0 0.698970 0.477121 1.0  1 1.000000 0.778151 2.0  2 1.176091 0.954243 3.0  Using the `suffix` parameter to create new columns.  >>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")  col1 col2 col3 col1_log col3_log  0 5 3 10 0.698970 1.0  1 10 6 100 1.000000 2.0  2 15 9 1000 1.176091 3.0  Using the `new_column_names` parameter to create new columns.  >>> df.transform_columns(  ... ["col1", "col3"],  ... np.log10,  ... new_column_names={"col1": "transform1"},  ... )  col1 col2 col3 transform1  0 5 3 1.0 0.698970  1 10 6 2.0 1.000000  2 15 9 3.0 1.176091  Args:  df: A pandas DataFrame.  column_names: An iterable of columns to transform.  function: A function to apply on each column.  suffix: Suffix to use when creating new columns to hold  the transformed values.  elementwise: Passed on to [`transform_column`][janitor.functions.transform_columns.transform_column]; whether or not  to apply the transformation function elementwise (True)  or columnwise (False).  new_column_names: An explicit mapping of old column names in  `column_names` to new column names. If any column specified in  `column_names` is not a key in this dictionary, the transformation  will happen in-place for that column.  Raises:  ValueError: If both `suffix` and `new_column_names` are specified.  Returns:  A pandas DataFrame with transformed columns.  """ # noqa: E501 check("column_names", column_names, [list, tuple]) check_column(df, column_names) if suffix is not None and new_column_names is not None: raise ValueError( "Only one of `suffix` or `new_column_names` should be specified." ) if suffix: check("suffix", suffix, [str]) dest_column_names = {col: col + suffix for col in column_names} elif new_column_names: check("new_column_names", new_column_names, [dict]) dest_column_names = { col: new_column_names.get(col, col) for col in column_names } else: dest_column_names = dict(zip(column_names, column_names)) results = {} for old_col, new_col in dest_column_names.items(): if old_col != new_col: check_column(df, new_col, present=False) results[new_col] = _get_transform_column_result( df[old_col], function, elementwise=elementwise, ) return df.assign(**results) 

truncate_datetime

Implementation of the truncate_datetime family of functions.

truncate_datetime_dataframe(df, datepart)

Truncate times down to a user-specified precision of year, month, day, hour, minute, or second.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
...     {
...         "foo": ["xxxx", "yyyy", "zzzz"],
...         "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
...     }
... )
>>> df
    foo                  dt
0  xxxx 2020-03-11 00:00:00
1  yyyy 2020-03-11 15:00:00
2  zzzz 2020-03-12 06:00:00
>>> df.truncate_datetime_dataframe("day")
    foo         dt
0  xxxx 2020-03-11
1  yyyy 2020-03-11
2  zzzz 2020-03-12

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame on which to truncate datetime.

required
datepart str

Truncation precision: one of YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND. (The string is automatically upper-cased.)

required

Raises:

Type Description
ValueError

If an invalid datepart precision is passed in.

Returns:

Type Description
DataFrame

A pandas DataFrame with all valid datetimes truncated down to the specified precision.

Source code in janitor/functions/truncate_datetime.py
@pf.register_dataframe_method
def truncate_datetime_dataframe(
    df: pd.DataFrame,
    datepart: str,
) -> pd.DataFrame:
    """Truncate times down to a user-specified precision of
    year, month, day, hour, minute, or second.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import pandas as pd
        >>> import janitor
        >>> df = pd.DataFrame(
        ...     {
        ...         "foo": ["xxxx", "yyyy", "zzzz"],
        ...         "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
        ...     }
        ... )
        >>> df
            foo                  dt
        0  xxxx 2020-03-11 00:00:00
        1  yyyy 2020-03-11 15:00:00
        2  zzzz 2020-03-12 06:00:00
        >>> df.truncate_datetime_dataframe("day")
            foo         dt
        0  xxxx 2020-03-11
        1  yyyy 2020-03-11
        2  zzzz 2020-03-12

    Args:
        df: The pandas DataFrame on which to truncate datetime.
        datepart: Truncation precision, YEAR, MONTH, DAY,
            HOUR, MINUTE, SECOND. (String is automagically
            capitalized)

    Raises:
        ValueError: If an invalid `datepart` precision is passed in.

    Returns:
        A pandas DataFrame with all valid datetimes truncated down
        to the specified precision.
    """
    # idea from Stack Overflow
    # https://stackoverflow.com/a/28783971/7175713
    # https://numpy.org/doc/stable/reference/arrays.datetime.html
    ACCEPTABLE_DATEPARTS = {
        "YEAR": "datetime64[Y]",
        "MONTH": "datetime64[M]",
        "DAY": "datetime64[D]",
        "HOUR": "datetime64[h]",
        "MINUTE": "datetime64[m]",
        "SECOND": "datetime64[s]",
    }
    datepart = datepart.upper()
    if datepart not in ACCEPTABLE_DATEPARTS:
        raise ValueError(
            "Received an invalid `datepart` precision. "
            f"Please enter any one of {ACCEPTABLE_DATEPARTS}."
        )

    dictionary = {}

    for label, series in df.items():
        if is_datetime64_any_dtype(series):
            dtype = ACCEPTABLE_DATEPARTS[datepart]
            # TODO: add branch for pyarrow arrays
            series = np.array(series._values, dtype=dtype)
        dictionary[label] = series

    return pd.DataFrame(dictionary)
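For comparison, sub-day truncation can also be done in plain pandas with `Series.dt.floor`. The sketch below is not the library's implementation and only covers fixed frequencies (day, hour, minute, second); YEAR and MONTH, which `truncate_datetime_dataframe` supports, would need a different approach such as `dt.to_period("M").dt.to_timestamp()`.

import pandas as pd

df = pd.DataFrame(
    {
        "foo": ["xxxx", "yyyy", "zzzz"],
        "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
    }
)

# Plain-pandas sketch: floor every datetime column to day precision.
truncated = df.copy()
for col in truncated.columns:
    if pd.api.types.is_datetime64_any_dtype(truncated[col]):
        truncated[col] = truncated[col].dt.floor("D")
print(truncated)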

update_where

Function for updating values based on other column values.

update_where(df, conditions, target_column_name, target_val)

Add multiple conditions to update a column in the dataframe.

This method does not mutate the original DataFrame.

Examples:

>>> import pandas as pd
>>> import janitor
>>> data = {
...     "a": [1, 2, 3, 4],
...     "b": [5, 6, 7, 8],
...     "c": [0, 0, 0, 0],
... }
>>> df = pd.DataFrame(data)
>>> df
   a  b  c
0  1  5  0
1  2  6  0
2  3  7  0
3  4  8  0
>>> df.update_where(
...     conditions=(df.a > 2) & (df.b < 8),
...     target_column_name="c",
...     target_val=10,
... )
   a  b   c
0  1  5   0
1  2  6   0
2  3  7  10
3  4  8   0
>>> df.update_where(  # supports pandas *query* style string expressions
...     conditions="a > 2 and b < 8", target_column_name="c", target_val=10
... )
   a  b   c
0  1  5   0
1  2  6   0
2  3  7  10
3  4  8   0

Parameters:

Name Type Description Default
df DataFrame

The pandas DataFrame object.

required
conditions Any

Conditions used to update a target column and target value.

required
target_column_name Hashable

Column to be updated. If column does not exist in DataFrame, a new column will be created; note that entries that do not get set in the new column will be null.

required
target_val Any

Value to be updated.

required

Raises:

Type Description
ValueError

If conditions does not evaluate to a boolean array-like data structure.

Returns:

Type Description
DataFrame

A pandas DataFrame.

Source code in janitor/functions/update_where.py
@pf.register_dataframe_method
@deprecated_alias(target_col="target_column_name")
def update_where(
    df: pd.DataFrame,
    conditions: Any,
    target_column_name: Hashable,
    target_val: Any,
) -> pd.DataFrame:
    """Add multiple conditions to update a column in the dataframe.

    This method does not mutate the original DataFrame.

    Examples:
        >>> import janitor
        >>> data = {
        ...     "a": [1, 2, 3, 4],
        ...     "b": [5, 6, 7, 8],
        ...     "c": [0, 0, 0, 0],
        ... }
        >>> df = pd.DataFrame(data)
        >>> df
           a  b  c
        0  1  5  0
        1  2  6  0
        2  3  7  0
        3  4  8  0
        >>> df.update_where(
        ...     conditions=(df.a > 2) & (df.b < 8),
        ...     target_column_name="c",
        ...     target_val=10,
        ... )
           a  b   c
        0  1  5   0
        1  2  6   0
        2  3  7  10
        3  4  8   0
        >>> df.update_where(  # supports pandas *query* style string expressions
        ...     conditions="a > 2 and b < 8", target_column_name="c", target_val=10
        ... )
           a  b   c
        0  1  5   0
        1  2  6   0
        2  3  7  10
        3  4  8   0

    Args:
        df: The pandas DataFrame object.
        conditions: Conditions used to update a target column
            and target value.
        target_column_name: Column to be updated. If column does not exist
            in DataFrame, a new column will be created; note that entries
            that do not get set in the new column will be null.
        target_val: Value to be updated.

    Raises:
        ValueError: If `conditions` does not return a boolean array-like
            data structure.

    Returns:
        A pandas DataFrame.
    """
    df = df.copy()

    # use query mode if a string expression is passed
    if isinstance(conditions, str):
        conditions = df.eval(conditions)

    if not is_bool_dtype(conditions):
        raise ValueError(
            """
            Kindly ensure that `conditions` passed
            evaluates to a Boolean dtype.
            """
        )

    df.loc[conditions, target_column_name] = target_val

    return df
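As the source shows, the update is essentially a masked assignment on a copy. A bare-pandas sketch of the first example (not the library implementation, just an equivalent for comparison):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [0, 0, 0, 0]})

# Bare-pandas equivalent of
# df.update_where((df.a > 2) & (df.b < 8), "c", 10):
out = df.copy()
out.loc[(out.a > 2) & (out.b < 8), "c"] = 10
print(out)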

utils

Utility functions for all of the functions submodule.

patterns(regex_pattern)

This function converts a string into a compiled regular expression.

It can be used to select columns in the index or column_names arguments of the pivot_longer function.

Warning

This function is deprecated. Kindly use re.compile instead.

Parameters:

Name Type Description Default
regex_pattern Union[str, Pattern]

String to be converted to compiled regular expression.

required

Returns:

Type Description
Pattern

A compiled regular expression from the provided regex_pattern.

Source code in janitor/functions/utils.py
def patterns(regex_pattern: Union[str, Pattern]) -> Pattern:
    """This function converts a string into a compiled regular expression.

    It can be used to select columns in the index or columns_names
    arguments of `pivot_longer` function.

    !!!warning

        This function is deprecated. Kindly use `re.compile` instead.

    Args:
        regex_pattern: String to be converted to compiled regular
            expression.

    Returns:
        A compile regular expression from provided `regex_pattern`.
    """
    warnings.warn(
        "This function is deprecated. Kindly use `re.compile` instead.",
        DeprecationWarning,
        stacklevel=find_stack_level(),
    )
    check("regular expression", regex_pattern, [str, Pattern])
    return re.compile(regex_pattern)
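Since patterns now simply wraps re.compile, migrating is a one-line change. The sketch below assumes, per the deprecation notice, that pivot_longer accepts a compiled regular expression wherever patterns(...) used to be passed; the dataframe and column names are invented for illustration.

import re
import pandas as pd
import janitor  # noqa: F401  (registers the pivot_longer accessor)

df = pd.DataFrame(
    {"id": [1, 2], "score_2021": [10, 20], "score_2022": [30, 40]}
)

# Before (deprecated): column_names=janitor.patterns(r"score_.+")
# After: pass a compiled regex directly (assumed supported by pivot_longer).
long_df = df.pivot_longer(
    index="id",
    column_names=re.compile(r"score_.+"),
    names_to="year",
    values_to="score",
)
print(long_df)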

unionize_dataframe_categories(*dataframes, column_names=None)

Given a group of dataframes that contain some categorical columns, for each categorical column present, find all the possible categories across all the dataframes that have that column. Update each dataframe's corresponding column with a new categorical object that contains the original data but carries labels for all the possible categories from every dataframe. This is useful when concatenating a list of dataframes that all have the same categorical columns into one dataframe.

If, for a given categorical column, not every input dataframe contains at least one instance of every possible category, pandas will change the output dtype of that column from category to object, losing the dramatic speed and memory gains of the categorical format.

Examples:

Usage example for concatenation of categorical column-containing dataframes:

Instead of:

concatenated_df = pd.concat([df1, df2, df3], ignore_index=True) 

which can result in a category -> object conversion, use:

unionized_dataframes = unionize_dataframe_categories(df1, df2, df3) concatenated_df = pd.concat(unionized_dataframes, ignore_index=True) 
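A small self-contained demonstration of the problem and the fix (a sketch: the column values are invented, and unionize_dataframe_categories is assumed to be importable from the top-level janitor namespace):

import pandas as pd
from janitor import unionize_dataframe_categories  # assumed top-level export

df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "apple"])})
df2 = pd.DataFrame({"fruit": pd.Categorical(["banana", "pear"])})

# Plain concat: the category sets differ, so pandas falls back to object.
naive = pd.concat([df1, df2], ignore_index=True)
print(naive["fruit"].dtype)  # object

# Unionize the categories first, then concat: the dtype stays categorical.
unioned = unionize_dataframe_categories(df1, df2)
fixed = pd.concat(unioned, ignore_index=True)
print(fixed["fruit"].dtype)  # category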

Parameters:

Name Type Description Default
*dataframes Any

The dataframes you wish to unionize the categorical objects for.

()
column_names Optional[Iterable[CategoricalDtype]]

If supplied, only unionize this subset of columns.

None

Raises:

Type Description
TypeError

If any of the inputs are not pandas DataFrames.

Returns:

Type Description
List[DataFrame]

A list of the category-unioned dataframes in the same order they were provided.

Source code in janitor/functions/utils.py
def unionize_dataframe_categories(
    *dataframes: Any,
    column_names: Optional[Iterable[pd.CategoricalDtype]] = None,
) -> List[pd.DataFrame]:
    """
    Given a group of dataframes which contain some categorical columns, for
    each categorical column present, find all the possible categories across
    all the dataframes which have that column.

    Update each dataframes' corresponding column with a new categorical object
    that contains the original data
    but has labels for all the possible categories from all dataframes.
    This is useful when concatenating a list of dataframes which all have the
    same categorical columns into one dataframe.

    If, for a given categorical column, all input dataframes do not have at
    least one instance of all the possible categories,
    Pandas will change the output dtype of that column from `category` to
    `object`, losing out on dramatic speed gains you get from the former
    format.

    Examples:
        Usage example for concatenation of categorical column-containing
        dataframes:

        Instead of:

        ```python
        concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)
        ```

        which in your case has resulted in `category` -> `object` conversion,
        use:

        ```python
        unionized_dataframes = unionize_dataframe_categories(df1, df2, df2)
        concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
        ```

    Args:
        *dataframes: The dataframes you wish to unionize the categorical
            objects for.
        column_names: If supplied, only unionize this subset of columns.

    Raises:
        TypeError: If any of the inputs are not pandas DataFrames.

    Returns:
        A list of the category-unioned dataframes in the same order they
        were provided.
    """
    if any(not isinstance(df, pd.DataFrame) for df in dataframes):
        raise TypeError("Inputs must all be dataframes.")

    if column_names is None:
        # Find all columns across all dataframes that are categorical
        column_names = set()

        for dataframe in dataframes:
            column_names = column_names.union(
                [
                    column_name
                    for column_name in dataframe.columns
                    if isinstance(
                        dataframe[column_name].dtype, pd.CategoricalDtype
                    )
                ]
            )
    else:
        column_names = [column_names]

    # For each categorical column, find all possible values across the DFs
    category_unions = {
        column_name: union_categoricals(
            [df[column_name] for df in dataframes if column_name in df.columns]
        )
        for column_name in column_names
    }

    # Make a shallow copy of all DFs and modify the categorical columns
    # such that they can encode the union of all possible categories for each.
    refactored_dfs = []

    for df in dataframes:
        df = df.copy(deep=False)

        for column_name, categorical in category_unions.items():
            if column_name in df.columns:
                df[column_name] = pd.Categorical(
                    df[column_name], categories=categorical.categories
                )

        refactored_dfs.append(df)

    return refactored_dfs