Pandas dataframe concatenation

Question

I have two dataframes. The first has only two columns and N rows, where N is in the hundreds to thousands. Each column holds a molecule's name, so it is a dataframe of pairs of molecules.

The second dataframe has 1600 columns and M rows, with M < N. Each column holds one descriptor of a molecule, so each molecule has 1600 descriptors.

Given these two dataframes, I want to create a third dataframe that has 3200 columns (1600*2) and N rows. For each pair of molecules, I want the 1600 descriptors of the first molecule, followed (concatenated) by the 1600 descriptors of the second molecule.

So, I will have a new dataframe with 3200 descriptors for each pair of molecules.

Is there a `pandas` way to combine columns from different `DataFrames`? My MWE only works for my little example.

I have an MWE; however, when I try using it on the real dataframes, I get this error (diclofenac is the name of the molecule, the equivalent of a, b, or c in the MWE):

    Traceback (most recent call last):
      File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'diclofenac'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "ml_script.py", line 232, in <module>
        matrix.append(pd.concat([cof_df.loc[row['cof1']], cof_df[row['cof2']]], axis=0))
      File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
        indexer = self.columns.get_loc(key)
      File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
        raise KeyError(key) from err
    KeyError: 'diclofenac'

Here is the MWE

    import numpy as np
    import pandas as pd

    # Dataframe with each molecule's descriptors (real and binaries allowed)
    df1 = pd.DataFrame([['a',1,True,3,4], ['b',55,False,76,87],['c',9,True,11,12]], columns=["name", "d1", "d2", "d3", "d4"])
    df1 = df1.set_index("name")

    # dataframe of pairs of molecules
    df2 = pd.DataFrame({'cof1':['a', 'a','c','b'], 'cof2':['c','b','a','c']})

    matrix = []
    for index, rows in df2.iterrows():
        matrix.append(pd.concat([df1.loc[rows['cof1']], df1.loc[rows['cof2']]], axis=0))

    matrix = np.asarray(matrix)
    df3 = pd.DataFrame(matrix)

What I don't get is that it successfully prints `df1.loc[rows['cof1']]` to the screen, so it has no issue with the key in that call.

Please provide a minimal reproducible example that actually reproduces the error. For a problem that we can't see, the best we can do is guess. For specifics, see How to make good reproducible pandas examples.
– wjandrea, Commented Jun 7, 2022 at 15:55
FWIW though, you can do the same thing more easily with `df3 = pd.concat([df1.loc[df2[col]].reset_index(drop=True) for col in df2], axis=1)`
– wjandrea, Commented Jun 7, 2022 at 15:57
@wjandrea that does help. I get `KeyError: '[nan] not in index'` when I do this. Is there a typical reason for this, such as a strange name?
– Charlie Crown, Commented Jun 7, 2022 at 17:14
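
For reference, the one-liner from the comment above can be run against the MWE data as follows. This is only a sketch: the toy values are copied from the question, and the `add_suffix` call is an addition (not part of the original comment) so that the two descriptor blocks do not end up with identical column names.

```python
import pandas as pd

# Toy data copied from the question's MWE.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")
df2 = pd.DataFrame({"cof1": ["a", "a", "c", "b"], "cof2": ["c", "b", "a", "c"]})

# For each pair column, look up all descriptor rows in one .loc call,
# drop the molecule-name index so rows align by position, and tag the
# columns with the pair column they came from (added for illustration).
parts = [
    df1.loc[df2[col]].reset_index(drop=True).add_suffix(f"_{col}")
    for col in df2
]

# Place the two descriptor blocks side by side: one row per pair.
df3 = pd.concat(parts, axis=1)
print(df3.shape)
```

Because `.loc` is called once per pair column rather than once per row, this avoids the `iterrows()` loop and keeps each descriptor's original dtype. It still raises `KeyError` if any name in `df2` is missing from the index of `df1`.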

wjandrea · Accepted Answer · 2022-06-07 16:00:39Z

I wish I could comment and not write an answer here but I will try to help.

Your example code seems to work perfectly, so based on the error I can only recommend that you search both dataframes for the name behind `KeyError: 'diclofenac'` and check whether either copy contains a blank space or a capital letter, which would raise that particular error.

In your example script, you can reproduce this error if you change the molecule name in `df1` from `a` to `A`, or make the same change in any molecule pair in `df2`, so the problem may also lie in the molecule names of `df2`.
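
One more thing that may be worth checking, judging only from the traceback in the question: the failing line reads `cof_df[row['cof2']]` without `.loc`, and the last pandas frame is `__getitem__` calling `self.columns.get_loc(key)`. Plain brackets look a key up among the column labels, not the row index, which would explain why the `.loc` call on the same name prints fine. A small sketch of both ways to hit the `KeyError`, using made-up data:

```python
import pandas as pd

# Hypothetical descriptor table indexed by molecule name.
df1 = pd.DataFrame(
    {"d1": [1, 55], "d2": [3, 76]},
    index=pd.Index(["a", "b"], name="name"),
)

# .loc looks the key up in the row index, so this succeeds.
row_a = df1.loc["a"]

# Plain brackets look the key up in the column labels instead,
# so a valid row name still raises KeyError.
try:
    df1["a"]
    bracket_error = None
except KeyError as err:
    bracket_error = err

# A name that differs only by case also misses the row index.
try:
    df1.loc["A"]
    case_error = None
except KeyError as err:
    case_error = err

print(repr(bracket_error), repr(case_error))
```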

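If you know your data is correct but may contain capital letters or stray whitespace, try applying `.strip()` and `.lower()` to every molecule name (before calling `set_index`) and assign the result back:

    df1['name'] = df1['name'].apply(lambda x: x.strip().lower())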
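
If the name column might hold entries that are not strings, such as missing values stored as `NaN` (which is a float), calling `x.strip()` on them raises `AttributeError`. A minimal sketch of a more tolerant variant uses the `.str` accessor, which leaves missing values as they are. The sample names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical name column with stray whitespace, capitals, and a missing value.
names = pd.Series([" Diclofenac", "aspirin ", np.nan], name="name")

# The .str methods skip missing values instead of raising AttributeError,
# so the NaN entry stays NaN and can be inspected afterwards.
cleaned = names.str.strip().str.lower()
print(cleaned.tolist())
```

This only hides the exception; any `NaN` left in the cleaned column still has to be tracked down, since it cannot match a molecule name.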

`File "ml_script.py", line 218, in fix_mol_name return x.strip().lower() AttributeError: 'float' object has no attribute 'strip'` You are probably onto something with `strip()` though. I was already using `lower()`. The features contain floats and the molecule names contain integers, but I don't know why it is complaining about floats.
Yeah, if the field is not a string, those functions won't work. Are the molecule pairs string based or float based? You can also cast them with `str()` if the molecule names can contain numbers, but that may cause trouble if they are stored as floats, since it could generate strings with '.0' at the end that you would need to clean too.
They are all strings, although some strings contain integers. I added `print(x)` before the return statement, and apparently there was a 'nan', but it is not in the list, and if I print the list of names, there is no nan.
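
Both the `'float' object has no attribute 'strip'` message and the earlier `KeyError: '[nan] not in index'` would be consistent with a missing value somewhere in the pair table rather than in the list of names, although that is a guess without seeing the real data. A sketch of how one might locate such rows, and any pair names with no descriptor row, using hypothetical stand-in frames:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the descriptor table and the pair table.
df1 = pd.DataFrame({"d1": [1, 55, 9]}, index=pd.Index(["a", "b", "c"], name="name"))
df2 = pd.DataFrame({"cof1": ["a", np.nan, "c"], "cof2": ["c", "b", "x"]})

# Rows of the pair table where either name is missing (NaN).
nan_rows = df2[df2.isna().any(axis=1)]

# Every distinct, non-missing name used in the pair table.
names_used = pd.concat([df2[col] for col in df2]).dropna().unique()

# Names used in pairs that have no row in the descriptor table.
unknown = [name for name in names_used if name not in df1.index]

print(nan_rows.index.tolist(), unknown)
```

Running the same checks on the real frames should show whether the failing lookups come from empty cells, from names that differ in spelling, or from both.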
