I have 20 datasets, and some of them have introductory text in the first few rows. Not every dataset has an introduction, and the number of introductory rows differs between datasets, so skipping a fixed number of rows (skip_rows) is not useful. Is it possible to detect keywords and start reading from the row that contains them?

Sample dataset:

dataset 1:

balabala balabala...
A header Another header
First row
Second row

dataset 2:

A header Another header
First row
Second row

dataset 3:

| balabala | balabala...    |
| balabala | balabala...    |
| -------- | -------------- |
| A header | Another header |
| First    | row            |
| Second   | row            |

etc...

What I want:

dataset 1:

A header Another header
First row
Second row

dataset 2:

A header Another header
First row
Second row

dataset 3:

A header Another header
First row
Second row

etc...

1 Answer

You may try:

    library(dplyr)
    library(janitor)

    # Example data: df1 and df3 have introduction rows, df2 does not
    df1 <- read.table(text = "balabala balabala...
    'A header' 'Another header'
    First row
    Second row", header = TRUE)

    df2 <- read.table(text = "'A header' 'Another header'
    First row
    Second row", header = TRUE, check.names = FALSE)

    df3 <- read.table(text = "balabala balabala...
    balabala balabala...
    'A header' 'Another header'
    First row
    Second row", header = TRUE)

    # The keywords that identify the real header row
    header_vector <- c('A header', 'Another header')

    ftn <- function(df) {
      if (all(names(df) == header_vector)) {
        # The header is already in place, nothing to trim
        df
      } else {
        # Flag the row that matches the header, keep it and everything after it,
        # then promote that row to column names
        df$key <- apply(df, 1, function(x) all(x == header_vector))
        df %>%
          mutate(key = cumsum(key)) %>%
          filter(key >= 1) %>%
          select(-key) %>%
          janitor::row_to_names(row_number = 1)
      }
    }

    ftn(df1)
    #   A header Another header
    # 2    First            row
    # 3   Second            row

    ftn(df2)
    #   A header Another header
    # 1    First            row
    # 2   Second            row

    ftn(df3)
    #   A header Another header
    # 2    First            row
    # 3   Second            row
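The `cumsum(key)` step is the core of this approach, so here is a minimal base R sketch of that step on its own. The data frame `rows` and the names `is_header`, `from_header` and `kept` are made up for illustration; only `header_vector` comes from the answer.

```r
# Illustrative input: one introduction row, then the real header, then data
rows <- data.frame(
  V1 = c("balabala", "A header", "First", "Second"),
  V2 = c("balabala...", "Another header", "row", "row"),
  stringsAsFactors = FALSE
)
header_vector <- c("A header", "Another header")

# TRUE only for the row whose cells all match the header keywords
is_header <- unname(apply(rows, 1, function(x) all(x == header_vector)))

# cumsum() is 0 before the header row and at least 1 from the header row onward
running <- cumsum(is_header)   # 0 1 1 1
from_header <- running >= 1

# Keep the header row and everything below it
kept <- rows[from_header, ]
```

Because the running sum never drops back to 0, every row after the first match is kept, however many introduction rows come before it.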
2 Comments

Since I have more than 100 datasets, I may not be able to open them one by one by hand. Also, each dataset is very large (>2 GB), so read.table is not suitable for opening them...
@doraemon read.table is only there to get your data into my workspace. Put your datasets in a list and try lapply(your_list, ftn).
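For many large files, one possible variation is to locate the header before loading the data: read only the first few lines of each file, find the line containing the keyword, and start reading there. The sketch below is base R and is not the answerer's method. `read_from_header`, `max_scan` and the temporary files are hypothetical, and it assumes the header appears within the first `max_scan` lines. `read.table` stands in for whatever reader suits the file size; `data.table::fread`, for instance, accepts a string for its `skip` argument and starts at the first line containing it.

```r
# Hypothetical helper: scan the top of a file for the header keyword,
# then read the table starting at that line.
read_from_header <- function(path, keyword, max_scan = 100) {
  top <- readLines(path, n = max_scan)      # only the first lines are loaded
  hit <- grep(keyword, top, fixed = TRUE)   # line numbers containing the keyword
  if (length(hit) == 0) {
    stop("keyword not found in the first ", max_scan, " lines of ", path)
  }
  # Skip everything above the header line; the header becomes the column names
  read.table(path, skip = hit[1] - 1, header = TRUE, check.names = FALSE)
}

# Two small stand-in files: one with an introduction row, one without
with_intro <- tempfile(fileext = ".txt")
no_intro <- tempfile(fileext = ".txt")
writeLines(c("balabala balabala...",
             "'A header' 'Another header'",
             "First row",
             "Second row"), with_intro)
writeLines(c("'A header' 'Another header'",
             "First row",
             "Second row"), no_intro)

# Apply the same helper to every file path
paths <- c(with_intro, no_intro)
tables <- lapply(paths, read_from_header, keyword = "A header")
```

Only the first `max_scan` lines are held in memory during the search, so the cost of finding the header does not grow with file size.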