3

Could anyone help me out with proper import settings for an excel file with a following kind of structure (for time series analysis):

label1 varName label2 random text label3 random text2 2015-01-01 01:00 85 2015-01-01 02:00 26 2015-01-01 03:00 15 2015-01-01 04:00 13 2015-01-01 05:00 22 2015-01-01 06:00 21 2015-01-01 07:00 13 2015-01-01 08:00 22 2015-01-01 09:00 20 2015-01-01 10:00 31 2015-01-01 11:00 36 2015-01-01 12:00 33 2015-01-01 13:00 33 2015-01-01 14:00 33 

label and varName are rows to keep. Rows 2 and 3 should be deleted.

Database starts at Jan 1st 2015 1am and ends at 31dec 2015 11pm. For most days I have a value for each hour. There are some NAs inside values but nrow=8760

I'm still learning how to do time series in R, but I'd imagine that it would be easier to handle summaries (say day by day means) if R would split %Y-%m-%d %H:%M column into two separate ones.

Simple import with RStudio default readxl library fails as rows 2 and 3 get imported and date is translated into a funny format: 42005.041666666664

enter image description here

Normally I would deal with this by hand (in excel) and import a clean txt to R. Problem is I need to process 61 similar files (for different years and different variables). I'm sure there is a way to automate this task, but after 6hours of searching, testing and reading I'm basically in the same spot as this morning.

I'd appreciate any kind of hint or help. Thank you

7
  • Have you at least picked a library of choice for importing your excel files? There is an endless list of libraries to help with this. I will recommend this one. Commented Jan 13, 2017 at 15:13
  • Yes, I worked with readxl as this what RStudio uses by default. Still, I'm not able to do what I described: keep proper date/time format, drop rows 2:3, separate date from time. And all of this in 61 similar excel files Commented Jan 13, 2017 at 15:24
  • 2
    if all your files have the same column names, you should do: read_excel(path = file_name, col_names = c('label1','varname'), col_types = c('date', 'numeric'), skip = 3). If the files don't have the same column names, then you can get the columns in each file and then skip 3 rows. Commented Jan 13, 2017 at 15:34
  • Thanks for this hint. Problem is, files have different number of columns and they don't have same names (not always). That's why I was thinking about Keeping row 1 as column name and skipping rows 2 and 3. Commented Jan 13, 2017 at 15:40
  • 2
    Then you should try using xlsx. The only problem with this package is that it relies on a working installation of the rJava and xlsxjars packages, which are really hard to set up properly. But if you have it installed, then there is very much you can do with its read.xlsx function. Commented Jan 13, 2017 at 15:53

2 Answers 2

2

You can convert the numbers back to a datetime format, such as POSIXct.

library("datetimeutils") convert_date(42005.041666666664, type = "excel", fraction = TRUE) ## [1] "2015-01-01 01:00:00 CET" 
Sign up to request clarification or add additional context in comments.

1 Comment

I was struggling with readxl, lubridate and other stuff. This one saved me.
1

Exel often uses different cell format to store date time values.

Example 1:

Cell value visible to user: 1/28/2019 7:00:00 AM Exel stores value in general format: 43493.2916666667 43493 - number of days after 1/1/1900 0.2916666667 - % of the day [7*60/(24*60)] 

Example 2:

Cell value visible to user: 1/28/2019 7:23:33 AM Exel stores value in general format: 43493.3080208333 0.3080208333 - % of the day [(7*60*60+23*60+33)/(24*60*60)] 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.