Is it possible to skip a paragraph using arrow::open_dataset in R?

Question

I have 20 datasets, and some of them have an introduction in the first few rows. Not every dataset has an introduction, and the introductions differ in length from one dataset to another, so a fixed skip_rows value may not be useful. Is it possible to detect the keywords and start reading from the row that contains them?

Sample dataset:

dataset 1:

balabala	balabala...
A header	Another header
First	row
Second	row

dataset 2:

A header	Another header
First	row
Second	row

dataset 3:

balabala	balabala...
balabala	balabala...
A header	Another header
First	row
Second	row

etc...

What I want:

dataset 1:

A header	Another header
First	row
Second	row

dataset 2:

A header	Another header
First	row
Second	row

dataset 3:

A header	Another header
First	row
Second	row

etc...

Kra.P · Accepted Answer · 2023-02-07 07:46:44Z

You may try

library(dplyr)
library(janitor)

# Sample data only: read.table just gets the examples into the workspace.
df1 <- read.table(text = "balabala balabala...
'A header' 'Another header'
First row
Second row", header = TRUE)

df2 <- read.table(text = "'A header' 'Another header'
First row
Second row", header = TRUE, check.names = FALSE)

df3 <- read.table(text = "balabala balabala...
balabala balabala...
'A header' 'Another header'
First row
Second row", header = TRUE)

header_vector <- c('A header', 'Another header')

ftn <- function(df){
  if (all(names(df) == header_vector)) {
    # The column names already match the header: nothing to skip.
    df
  } else {
    # Flag the row that holds the real header, keep it and everything
    # after it, then promote that row to the column names.
    df$key = apply(df, 1, function(x) {all(x == header_vector)})
    df %>%
      mutate(key = cumsum(key)) %>%
      filter(key >= 1) %>%
      select(-key) %>%
      janitor::row_to_names(row_number = 1)
  }
}

ftn(df1)
#>   A header Another header
#> 2    First            row
#> 3   Second            row

ftn(df2)
#>   A header Another header
#> 1    First            row
#> 2   Second            row

ftn(df3)
#>   A header Another header
#> 2    First            row
#> 3   Second            row

Since I have more than 100 datasets, I may not be able to open them one by one by hand. Also, each dataset is very large (>2 GB), so read.table is not suitable for opening them...
@doraemon read.table is only there to get your sample data into my workspace. Put your datasets in a list and try lapply(your_list, ftn).
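
To address the file-size concern raised in the comments, one possible variation is to find the header row before loading anything. It reads only the first lines of each file and counts how many lines come before the header. The sketch below is not part of the accepted answer: count_intro_lines and max_lines are hypothetical names, the files are assumed to be tab-separated with the header line matching header_vector exactly, and the temporary files only stand in for the real datasets.

```r
# Hypothetical helper: count the lines that precede the header row.
# Only the first `max_lines` lines are read, so the file size does not matter.
count_intro_lines <- function(path, header_vector, sep = "\t", max_lines = 100) {
  first_lines <- readLines(path, n = max_lines, warn = FALSE)
  header_line <- paste(header_vector, collapse = sep)
  hit <- which(first_lines == header_line)
  if (length(hit) == 0) {
    stop("Header not found in the first ", max_lines, " lines of ", path)
  }
  hit[1] - 1L
}

# Demo on two small temporary files that mimic dataset 1 and dataset 2.
header_vector <- c("A header", "Another header")
with_intro <- tempfile(fileext = ".tsv")
without_intro <- tempfile(fileext = ".tsv")
writeLines(c("balabala\tbalabala...", "A header\tAnother header",
             "First\trow", "Second\trow"), with_intro)
writeLines(c("A header\tAnother header", "First\trow", "Second\trow"),
           without_intro)

paths <- c(with_intro, without_intro)
skips <- vapply(paths, count_intro_lines, integer(1),
                header_vector = header_vector)

# Base R reader, used here only to keep the demo free of extra packages.
dfs <- Map(function(p, s) {
  read.delim(p, skip = s, check.names = FALSE)
}, paths, skips)
```

The per-file count could then be handed to whichever reader is actually used, for example the skip argument of arrow::read_tsv_arrow. I have not tested this with arrow::open_dataset. Since a single call applies one set of read options to every file it is given, files with different introduction lengths would probably have to be opened separately, each with its own skip value.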
