2

I have two data frames, firstDF and secondDF, which look like:

firstDF:
col1 col2 id
A    1    2
B    5    3
C    6    4

secondDF:
col1 col2 id
A    1    2
E    15   5
F    16   6

Resultant DF:
col1 col2 id
A    1    2
B    5    3
C    6    4
E    15   5
F    16   6

The resultant data frame must contain all the rows from the two data frames. In case there are rows which have the same id, each such row must appear in the resultant data frame only once.

I tried using the rbind function, but it returns all the rows from both data frames, duplicates included. I tried using the merge function on id (x.id = y.id), but the resulting data frame had multiple suffixed columns, namely col1.x, col1.y, col2.x, col2.y and so on.

4 Answers

5

You can do this with merge().

merge(df1, df2, by = c("col1", "col2", "id"), all.x = TRUE, all.y = TRUE)

This merges by all common variables, keeping all records in either data frame. Alternatively you can omit the by= argument and R will automatically use all common variables.

As @thelatemail mentioned in a comment, rather than individually specifying all.x=TRUE and all.y=TRUE, you can use all=TRUE.
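For reference, here is a minimal sketch of that merge on the question's sample data. It assumes both data frames have exactly the same column names, so omitting by= merges on all of them. The data frames are rebuilt by hand, and stringsAsFactors = FALSE is my own addition so the columns are plain character vectors on older R versions too.

```r
# Sample data copied from the question (rebuilt by hand for this sketch)
df1 <- data.frame(col1 = c("A", "B", "C"),
                  col2 = c(1, 5, 6),
                  id   = c(2, 3, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "E", "F"),
                  col2 = c(1, 15, 16),
                  id   = c(2, 5, 6),
                  stringsAsFactors = FALSE)

# No by= argument: merge() uses every column name the two frames share.
# all = TRUE keeps rows that appear in only one of the two frames.
result <- merge(df1, df2, all = TRUE)

print(result)
print(nrow(result))  # 5: the shared row (A, 1, 2) appears only once
```

Because every column is a merge key here, no .x/.y suffixed columns are created. Those only show up for columns that exist in both frames but are not part of by.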


7 Comments

I will have to write all column names? I have about 20 columns!!
@user1692342: Do all 20 appear in both data frames? I believe the default behavior if you omit a by= argument is to use all common variables. Maybe try that and see what happens. You'll still want all.x and all.y though.
The merge worked; however, the rows which have the same "id" are repeated.
@user1692342: That's expected if there are duplicate id values in either data frame, otherwise it doesn't make sense since you're merging on id. You can subset out the duplicated id values if you want. Use subset(df, !duplicated(id)).
@AlexA. - all=TRUE is shorthand for specifying both all.x=TRUE and all.y=TRUE.
1

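You can try the sqldf library. I'm not sure which kind of join you need; note that a plain join (an inner join) returns only the rows whose id appears in both data frames. It would go something like this: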

library(sqldf)
Result = sqldf("select a.col1, a.col2, a.id from firstDF as a join secondDF as b on a.id = b.id")

Or

X = rbind(firstDF, secondDF)

Then filter out duplicates using the unique function.


0

Using sqldf:

library(sqldf)
sqldf("select * from firstDF union select * from secondDF")

Note that union automatically removes duplicates.
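To make the difference visible, here is a small sketch that runs both union and union all on the question's sample data. It assumes the sqldf package is installed; the requireNamespace() guard and the names unionDF and unionAllDF are my own additions, not part of the answer above.

```r
# Sample data copied from the question (rebuilt by hand for this sketch)
firstDF <- data.frame(col1 = c("A", "B", "C"),
                      col2 = c(1, 5, 6),
                      id   = c(2, 3, 4),
                      stringsAsFactors = FALSE)
secondDF <- data.frame(col1 = c("A", "E", "F"),
                       col2 = c(1, 15, 16),
                       id   = c(2, 5, 6),
                       stringsAsFactors = FALSE)

# Assumption: sqldf is installed; skip quietly if it is not
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # union drops rows that are identical in every column
  unionDF <- sqldf("select * from firstDF union select * from secondDF")

  # union all keeps every row, duplicates included
  unionAllDF <- sqldf("select * from firstDF union all select * from secondDF")

  print(nrow(unionDF))     # 5
  print(nrow(unionAllDF))  # 6
}
```

Keep in mind that union compares whole rows, so two rows with the same id but different values in another column would both be kept.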


0

This may not be the most performant answer, but a quick and easy way to do it -- assuming that any duplicate rows are in fact exact duplicates (i.e., for any row in df1 where col_1 = X, if there exists a row in df2 where col_1 = X, all other columns are also identical between those two rows) -- would be to rbind them and get the unique results:

> df1
  col_1 col_2 id
1     A     1  2
2     B     5  3
3     C     6  4
> df2
  col_1 col_2 id
1     A     1  2
2     E    15  5
3     F    16  6
> unique(rbind(df1, df2))
  col_1 col_2 id
1     A     1  2
2     B     5  3
3     C     6  4
5     E    15  5
6     F    16  6
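If the exact-duplicate assumption does not hold, unique() will keep both versions of a row. A rough sketch of one way around that is to deduplicate on id alone with duplicated(), which keeps the first occurrence of each id. The sample data below is hypothetical: I changed col_2 for id 2 in df2 to 100 just to illustrate the case, and the names combined and byId are mine.

```r
# Hypothetical data: id 2 appears in both frames with a different col_2
df1 <- data.frame(col_1 = c("A", "B", "C"),
                  col_2 = c(1, 5, 6),
                  id    = c(2, 3, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col_1 = c("A", "E", "F"),
                  col_2 = c(100, 15, 16),
                  id    = c(2, 5, 6),
                  stringsAsFactors = FALSE)

combined <- rbind(df1, df2)

# unique() compares whole rows, so both id == 2 rows survive
print(nrow(unique(combined)))  # 6

# duplicated() on the id column flags every repeat after the first,
# so this keeps the df1 version of id 2
byId <- combined[!duplicated(combined$id), ]
rownames(byId) <- NULL  # optional: renumber rows 1..n

print(byId)
```

Which version of a repeated id survives depends purely on row order, so put the data frame you trust more first in the rbind() call.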

