I have two dataframes. The first has only two columns and N rows, where N ranges from hundreds to thousands. Each column holds a molecule's name, so it is a dataframe of pairs of molecules.
The second dataframe has 1600 columns and M rows, with M < N. Each row is a molecule and each column is one descriptor, so each molecule has 1600 descriptors.
Given these two dataframes, I want to create a third dataframe that has 3200 columns (1600*2) and N rows. For each pair of molecules, I want the 1600 descriptors of the first molecule, followed (concatenated) by the 1600 descriptors of the second molecule.
So, I will have a new dataframe with 3200 descriptors for each pair of molecules.
Is there a pandas way to combine columns from different DataFrames? My MWE only works for my little example.
I have an MWE; however, when I try using it on the real dataframes, I get this error (diclofenac is the name of a molecule, the equivalent of a, b, or c in the MWE):
Traceback (most recent call last):
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'diclofenac'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ml_script.py", line 232, in <module>
    matrix.append(pd.concat([cof_df.loc[row['cof1']], cof_df[row['cof2']]], axis=0))
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'diclofenac'

Here is the MWE:
import numpy as np
import pandas as pd

# Dataframe with each molecule's descriptors (real and binary values allowed)
df1 = pd.DataFrame([['a', 1, True, 3, 4], ['b', 55, False, 76, 87], ['c', 9, True, 11, 12]],
                   columns=["name", "d1", "d2", "d3", "d4"])
df1 = df1.set_index("name")

# Dataframe of pairs of molecules
df2 = pd.DataFrame({'cof1': ['a', 'a', 'c', 'b'], 'cof2': ['c', 'b', 'a', 'c']})

# For each pair, concatenate the descriptors of both molecules into one row
matrix = []
for index, rows in df2.iterrows():
    matrix.append(pd.concat([df1.loc[rows['cof1']], df1.loc[rows['cof2']]], axis=0))
matrix = np.asarray(matrix)
df3 = pd.DataFrame(matrix)

What I don't understand is that df1.loc[rows['cof1']] prints to the screen successfully, so the key itself causes no problem in that call.
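One thing worth checking, going by the traceback: the failing line in ml_script.py uses cof_df.loc[row['cof1']] for the first molecule but cof_df[row['cof2']] (no .loc) for the second, and the final frames pass through __getitem__ and self.columns.get_loc. Plain df[key] looks a label up among the columns, while df.loc[key] looks it up in the index. The sketch below reproduces that difference on the MWE's df1; it assumes the real cof_df is also indexed by molecule name, which would explain why printing the .loc call works.

```python
import pandas as pd

# Same toy descriptor table as in the MWE, indexed by molecule name.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

# .loc looks the label up in the index (rows), so this succeeds.
row_lookup = df1.loc["a"]

# Plain [] looks the label up in the columns, so a molecule name fails.
try:
    df1["a"]
    column_lookup_failed = False
except KeyError as err:
    column_lookup_failed = True
    print("KeyError:", err)
```

If that is the cause, adding the missing .loc to the second lookup would make both calls consistent with the MWE.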
df3 = pd.concat([df1.loc[df2[col]].reset_index(drop=True) for col in df2], axis=1)

KeyError: '[nan] not in index' when I do this. Is there a typical reason for this, such as a strange name?
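The message '[nan] not in index' suggests that at least one cell in the pairs dataframe is missing (NaN) rather than oddly named, although a name that differs from the index only by whitespace or case would fail the lookup in a similar way. A minimal sketch of how the offending rows could be located before doing the lookup is given below. The pairs table here is hypothetical (one NaN and one made-up unknown name, "zzz"), and the "_1"/"_2" column suffixes are an illustrative choice, not part of the original code.

```python
import numpy as np
import pandas as pd

# Same toy descriptor table as in the MWE, indexed by molecule name.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

# Hypothetical pairs table: row 2 has a missing name, row 3 an unknown one.
df2 = pd.DataFrame({"cof1": ["a", "a", np.nan, "b"],
                    "cof2": ["c", "b", "a", "zzz"]})

# Flag rows where either name is NaN or absent from df1's index.
known_names = df1.index.tolist()
bad = ~df2.isin(known_names).all(axis=1)
print(df2[bad])  # the pairs that would raise a KeyError

# Build the combined frame from the remaining pairs only.
clean = df2[~bad]
left = df1.loc[clean["cof1"].tolist()].reset_index(drop=True).add_suffix("_1")
right = df1.loc[clean["cof2"].tolist()].reset_index(drop=True).add_suffix("_2")
df3 = pd.concat([left, right], axis=1)
```

Printing df2[bad] on the real data should show whether the problem is empty cells or names that do not match the descriptor index exactly.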