How can I sum a column of multiple datasets that do not have the same structure?

Question

I have several files (20) with the same column structure but with different rows structure. All are composed of two columns, the first are factors and the second are integers. I want to sum the column of integers for the factors that are repeated and the new ones that are simply added. How could I combine and sum the ones already repeated?

I've thought about combining cbind and tapply but I really do not know how to carry this out.

A simple example of the structure of the files:

Shop Clients Shop Clients Shop Clients A 9 D 8 A 5 B 7 A 4 R 4 C 4 F 3 C 3 D 2 B 1 B 2

I expect the output:

Shop Clients A 18 B 10 C 7 D 10 F 3 R 4

I read the different files in a loop, creating a dataset for each of them, so that the dataset shows City1$Shop and City1$Clients for example. This case is only for 20 files but I would like to know how to work with more (for example 100). How could I solve this problem by reading the datasets in this way?

f<-function(x){ read.delim2("p01.txt",header=T,sep="\t",stringsAsFactors = FALSE) } for(i in x){ total<-f(i) #Here I suppose I would combine and sum the datasets }

So with the edit it now seems that you do not have one file with multiple columns named "Shop" and "Clients". The example above are in fact several files? If so the posted solutions probably do not work. — Rui Barradas
– Rui Barradas, Commented May 20, 2019 at 18:19
Also, is there a pattern in the filenames? If this is the case then they can be read in with fnames <- list.files(pattern = <something>) followed by a lapply loop. See here and here. — Rui Barradas
– Rui Barradas, Commented May 20, 2019 at 18:20
I thought at first that I had explained it in a clear way, but when I saw the answers I realized that I had to edit it and show in detail how I read the data. It's my first question and I'm learning to ask questions in a clear way. Sorry! I had thought about creating a "total" dataframe in which to add and sum each dataframe that I am reading, or is it better all at once? I'm stuck with this problem. — Cris_91
– Cris_91, Commented May 20, 2019 at 18:30
Do all filenames start with a "p" followed by one or more digits and have a .txt extension? — Rui Barradas
– Rui Barradas, Commented May 20, 2019 at 18:31
Yeah, I read them in this way ` x<-list.files(pattern=".txt") ` I have the problem to put them together and sum the columns — Cris_91
– Cris_91, Commented May 20, 2019 at 18:34

Rui Barradas · Accepted Answer · 2019-05-20 18:38:46Z

A possibility is to use aggregate after reshaping the data to a long format.

inx <- grep("Shop", names(df1)) long <- do.call(rbind, lapply(inx, function(i) df1[i:(i + 1)])) aggregate(Clients ~ Shop, long, sum) # Shop Clients #1 A 18 #2 B 10 #3 C 7 #4 D 10 #5 F 3 #6 R 4

Edit.
After the edit to the question, I believe the following does what is asked. I will once again use aggregate.

fnames <- list.files(pattern = "\\.txt") df_list <- lapply(fnames, read.table, header = TRUE) df_all <- do.call(rbind, df_list) aggregate(Clients ~ Shop, data = df_all, sum)

It works! I was complicating the problem with a loop, it is much simpler than I thought. Thank you very much for the help and patience!
What if I have multiple headers not just two as in( Clients ~ Shop)??
@AmalNasir Can you give an example? It is possible to aggregate by several columns, say, Clients ~ Shop + Month.
@RuiBarradas Yes, I posted my question here, stackoverflow.com/questions/67692461/…

tmfmnk · Accepted Answer · 2019-05-20 17:29:04Z

One tidyverse possibility could be:

df %>% select_at(vars(contains("Shop"))) %>% gather(var1, val1) %>% bind_cols(df %>% select_at(vars(contains("Client"))) %>% gather(var2, val2)) %>% group_by(Shop = val1) %>% summarise(Clients = sum(val2)) Shop Clients <chr> <int> 1 A 18 2 B 10 3 C 7 4 D 10 5 F 3 6 R 4

The same with base R:

long_df <- data.frame(Shop = stack(df[, grepl("Shop", names(df))])[, 1], Clients = stack(df[, grepl("Client", names(df))])[, 1]) aggregate(Clients ~ Shop, long_df, sum) Shop Clients 1 A 18 2 B 10 3 C 7 4 D 10 5 F 3 6 R 4

akrun · Accepted Answer · 2019-05-20 17:29:08Z

We can melt the data into 'long' format by specifying the measure columns with patterns of column names as 'Shop' 'Clients', then grouped by 'Shop' get the sum of 'Clients'

library(data.table) melt(setDT(df1), measure = patterns("^Shop", "^Clients"), value.name = c("Shop", "Clients"))[, .(Clients = sum(Clients)), by = Shop] # Shop Clients #1: A 18 #2: B 10 #3: C 7 #4: D 10 #5: F 3 #6: R 4

Or using tidyverse

library(tidyverse) map_dfc(list(Shop = "Shop", Clients = "Clients"), ~ df1 %>% select(matches(.x)) %>% unlist) %>% group_by(Shop) %>% summarise(Clients = sum(Clients)) # A tibble: 6 x 2 # Shop Clients # <chr> <int> #1 A 18 #2 B 10 #3 C 7 #4 D 10 #5 F 3 #6 R 4

Or with rowsum from base R

i1 <- grepl("^Shop", names(df1)) rowsum(unlist(df1[!i1]), group = unlist(df1[i1]))

data

df1 <- structure(list(Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L), Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L, 4L, 3L, 1L), Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2), class = "data.frame", row.names = c(NA, -4L))

Collectives™ on Stack Overflow

How can I sum a column of multiple datasets that do not have the same structure?

3 Answers 3

4 Comments

Comments

data

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

data

Comments

Linked

Related