Get values from dataframe with MultiIndex index containing NaNs

Question

I cannot access the values of an index position that has a nan in it and wonder how I could solve this. (In my project this index has a very special meaning and I really need to keep it, otherwise I would need to make some dirty manual modifications: "there is always a solution" even if it is a very bad one).

    df
    Out
    temp_playlist  objId
    0              o1            [0, 6]
                   o2            [1, 4]
                   o3            [2, 5]
                   o4        [8, 9, 12]
                   o5          [10, 13]
                   o6          [11, 14]
                   NaN           [3, 7]
    Name: x, dtype: object

    df.index
    Out
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               names=['temp_playlist', 'objId'])

Now I want to access the [3, 7] values with df.loc[(0, np.nan)], but I get KeyError: (0, nan).

Just to put it in perspective: [df.loc[idx] for idx in df.index if not pd.isna(idx[1])] works properly because I am skipping the problematic index.

What am I missing and how could I solve this?

(Windows 10, python 3.8.5, pandas 1.3.1, numpy 1.20.3, reported to pandas here)

I tried to create the index manually as pd.MultiIndex.from_arrays([[0, 0, 0, 0, 0, 0, 0], ['o1', 'o2', 'o3', 'o4', 'o5', 'o6', None]], names=('temp_playlist', 'objId')) and the None gets converted into np.nan. The result is the same exact index as posted in the question.
– deponovo, Commented Sep 29, 2021 at 10:10
Yes, also with KeyError, but this one is clear, but that gave me an idea. Now this idea is a "bad solution", but it would be: df.index = [str(idx) for idx in df.index]; df.loc['(0, nan)']. I am not going to post this "solution" as an answer I am not going to accept ;)
– deponovo, Commented Sep 29, 2021 at 10:24

ogdenkev · Accepted Answer · 2021-09-29 12:11:01Z

Update

I am able to reproduce your error after grouping and aggregating a data frame.

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...     "temp_playlist": [0] * 15,
    ...     "objId": ['o1'] * 2 + ['o2'] * 2 + ['o3'] * 2 + ['o4'] * 3 + ['o5'] * 2 + ['o6'] * 2 + [pd.NA] * 2,
    ...     "vals": [0, 6, 1, 4, 2, 5, 8, 9, 12, 10, 13, 11, 14, 3, 7]
    ... })
    >>> df = data.groupby(["temp_playlist", "objId"], dropna=False).agg(list)
    >>> df.loc[(0, pd.NA)]
    Traceback (most recent call last):
      File "/home/ec2-user/miniconda3/envs/so-pandas-nan-index/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: <NA>

Passing in an explicit MultiIndex works, though.

    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)], names=["temp_playlist", "objId"])]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
    >>> df.loc[pd.MultiIndex.from_tuples([(0, pd.NA)])]
             vals
    0 NaN  [3, 7]

So does passing a list that contains a single tuple. Note that using [[]] returns a DataFrame.

    >>> df.loc[[(0, pd.NA)]]
                           vals
    temp_playlist objId
    0             NaN    [3, 7]

As does DataFrame.reindex (see also the user guide on reindexing).

    >>> df.reindex([(0, pd.NA)])
                           vals
    temp_playlist objId
    0             NaN    [3, 7]
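Another option that sidesteps label lookup entirely is a boolean mask built from the index levels. This is not one of the approaches tested above, only a minimal sketch under the assumption that the levels are named temp_playlist and objId as in the question; the small frame here is made-up illustration data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a shortened version of the index from the question.
index = pd.MultiIndex.from_tuples(
    [(0, "o1"), (0, "o2"), (0, np.nan)],
    names=["temp_playlist", "objId"],
)
df = pd.DataFrame({"vals": [[0, 6], [1, 4], [3, 7]]}, index=index)

# Boolean arrays, one entry per row, computed from each index level.
missing_obj = df.index.get_level_values("objId").isna()
same_playlist = df.index.get_level_values("temp_playlist") == 0

# Select the rows where objId is missing within playlist 0.
rows = df[missing_obj & same_playlist]
nan_vals = rows["vals"].iloc[0]
print(nan_vals)
```

Because the mask never hashes the missing value as a key, it does not depend on how .loc resolves NaN-like labels. The result is a DataFrame, so a single cell still has to be pulled out with .iloc.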

Original Attempt to Reproduce Error

I am not able to reproduce your error. You can see below that using df.loc[(0, np.nan)] works.

    Python 3.8.5 (default, Sep 4 2020, 07:30:14)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pandas as pd
    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'), (0, 'o2'), (0, 'o3'), (0, 'o4'), (0, 'o5'), (0, 'o6'), (0, np.nan)])
    >>> print(nan_index)
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> rng = np.random.default_rng(42)
    >>> vals = [rng.choice(20, 2) for i in range(nan_index.shape[0])]
    >>> print(vals)
    [array([ 1, 15]), array([13, 8]), array([ 8, 17]), array([ 1, 13]), array([4, 1]), array([10, 19]), array([14, 15])]
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> print(df)
               vals
    0 o1    [1, 15]
      o2    [13, 8]
      o3    [8, 17]
      o4    [1, 13]
      o5     [4, 1]
      o6   [10, 19]
      NaN  [14, 15]
    >>> print(df.loc[(0, 'o1')])
    vals    [1, 15]
    Name: (0, o1), dtype: object
    >>> print(df.loc[(0, np.nan)])
    vals    [14, 15]
    Name: (0, nan), dtype: object
    >>> print(pd.__version__)
    1.3.1

Then I noticed that your index was printed as (0, nan) but mine was (0, np.nan). The difference was that I used np.nan and I suspect yours is pd.NA.

    >>> nan_index = pd.MultiIndex.from_tuples([(0, 'o1'), (0, 'o2'), (0, 'o3'), (0, 'o4'), (0, 'o5'), (0, 'o6'), (0, pd.NA)])
    >>> nan_index
    MultiIndex([(0, 'o1'),
                (0, 'o2'),
                (0, 'o3'),
                (0, 'o4'),
                (0, 'o5'),
                (0, 'o6'),
                (0,  nan)],
               )
    >>> df = pd.DataFrame({"vals": vals}, index=nan_index)
    >>> df
               vals
    0 o1    [1, 15]
      o2    [13, 8]
      o3    [8, 17]
      o4    [1, 13]
      o5     [4, 1]
      o6   [10, 19]
      NaN  [14, 15]

However, that did not resolve the difference. I was still able to use df.loc[(0, np.nan)].

    >>> df.loc[(0, pd.NA)]
    vals    [14, 15]
    Name: (0, nan), dtype: object
    >>> df.loc[(0, np.nan)]
    vals    [14, 15]
    Name: (0, nan), dtype: object

Moreover, I was also able to use df.loc[(0, None)].

    >>> df.loc[(0, None)]
    vals    [14, 15]
    Name: (0, nan), dtype: object

Just to confirm, np.nan, pd.NA, and None are all different objects. Pandas must treat them the same when used with DataFrame.loc.

    >>> pd.NA is np.nan
    False
    >>> pd.NA is None
    False
    >>> np.nan is None
    False
    >>> type(pd.NA)
    <class 'pandas._libs.missing.NAType'>
    >>> type(np.nan)
    <class 'float'>
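One way to see why the three values can end up interchangeable is that pd.isna reports all of them as missing, even though they are distinct objects. The sketch below only demonstrates that public function; whether DataFrame.loc relies on the same check internally is an assumption and is not confirmed here.

```python
import numpy as np
import pandas as pd

# The three "missing" markers discussed above.
candidates = [np.nan, pd.NA, None]

# pd.isna treats each of them as a missing value.
is_missing = [bool(pd.isna(value)) for value in candidates]

# They are nevertheless three different objects.
distinct = (pd.NA is not np.nan) and (pd.NA is not None) and (np.nan is not None)

print(is_missing, distinct)
```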

[...] Passing in an explicit MultiIndex works, though [...] very interesting. Thanks, this seems like a better solution than mine and the one from @jezrael.
Still, it looks like a bug. I mean, why would df.loc[[(0, <any pd compatible NaN variation>)]] then work and actually return a list of the data at that index (e.g. [[3, 7]])?

jezrael · Accepted Answer · 2021-09-29 11:00:12Z

One idea is to replace NaN in the index with the string 'NA':

    import numpy as np
    import pandas as pd

    i = pd.MultiIndex.from_tuples([(0, 'o1'), (0, 'o2'), (0, 'o3'), (0, 'o4'),
                                   (0, 'o5'), (0, 'o6'), (0, np.nan)])
    df = pd.DataFrame({'a': 0}, index=i)
    # Replace missing labels in the second index level with the string 'NA'
    df = df.rename(lambda x: 'NA' if pd.isna(x) else x, level=1)
    print(df)
          a
    0 o1  0
      o2  0
      o3  0
      o4  0
      o5  0
      o6  0
      NA  0

    df.loc[(0, 'NA')]

Thanks for the answer, but this is somewhat similar to my own. I will wait a couple of days; if nobody has another solution, I will report this as a bug to the pandas git.

deponovo · Accepted Answer · 2021-09-29 10:50:49Z

One "bad solution", which does not really solve the underlying issue but does provide a working workaround, is to convert the index entries to strings (the str constructor is capable of amazing results here).

    df.index = [str(idx) for idx in df.index]

    df
    Out
    (0, 'o1')        [0, 6]
    (0, 'o2')        [1, 4]
    (0, 'o3')        [2, 5]
    (0, 'o4')    [8, 9, 12]
    (0, 'o5')      [10, 13]
    (0, 'o6')      [11, 14]
    (0, nan)         [3, 7]
    Name: x, dtype: object

    df.index
    Out
    Index(['(0, 'o1')', '(0, 'o2')', '(0, 'o3')', '(0, 'o4')', '(0, 'o5')',
           '(0, 'o6')', '(0, nan)'],
          dtype='object')

    df.loc['(0, nan)']  # or df.loc[str((0, np.nan))]
