I'm trying to extract date information from a string. The string may look like:

  1. 5 months and 17 hours
  2. 1 month and 19 days
  3. 3 months and 1 day
  4. 2 years 1 month and 2 days
  5. 1 year 1 month and 1 days and 1 hour

And I'd like to extract:

  1. y=0 m=5 d=0 h=17
  2. y=0 m=1 d=19 h=0
  3. y=0 m=3 d=1 h=0
  4. y=2 m=1 d=2 h=0
  5. y=1 m=1 d=1 h=1

I started working something out like this:

    publishedWhen = '1 year 1 month and 1 days and 1 hour'
    y,m,d,h = 0,0,0,0
    if 'day ' in publishedWhen:
        d = int(publishedWhen.split(' day ')[0])
    if 'days ' in publishedWhen:
        d = int(publishedWhen.split(' days ')[0])
    if 'days ' not in publishedWhen and 'day ' not in publishedWhen:
        d = 0
    if 'month ' in publishedWhen:
        m = int(publishedWhen.split(' month ')[0])
        d = int(publishedWhen.replace(publishedWhen.split(' month ')[0] + ' month ','').replace('and','').replace('days','').replace('day',''))
    if 'months ' in publishedWhen:
        m = int(publishedWhen.split(' months ')[0])

However, I know that this code is bug-ridden (some cases are probably not handled) and that a regex would probably produce something much cleaner and more effective. Is this true? Which regex would help me extract all this information?

1 Answer

You don't have to use regular expressions; instead, you could look into the very rich library of third-party packages at the Python Package Index.

For instance, you could combine dateparser, for parsing human-readable dates, with dateutil, for its relativedelta object:

    from datetime import datetime

    import dateparser
    from dateutil.relativedelta import relativedelta

    # Fixed reference point; each parsed date is measured relative to it.
    BASE_DATE = datetime(2018, 1, 1)


    def get_relative_date(date_string):
        parsed_date = dateparser.parse(date_string, settings={"RELATIVE_BASE": BASE_DATE})
        return relativedelta(parsed_date, BASE_DATE)


    date_strings = [
        "5 months and 17 hours",
        "1 month and 19 days",
        "3 months and 1 day",
        "2 years 1 month and 2 days",
        "1 year 1 month and 1 days and 1 hour"
    ]

    for date_string in date_strings:
        delta = get_relative_date(date_string)
        # abs() because the sign depends on which side of BASE_DATE the parsed date falls.
        print(f"y={abs(delta.years)} m={abs(delta.months)} d={abs(delta.days)} h={abs(delta.hours)}")

Prints:

    y=0 m=5 d=0 h=17
    y=0 m=1 d=19 h=0
    y=0 m=3 d=1 h=0
    y=2 m=1 d=2 h=0
    y=1 m=1 d=1 h=1

I don't particularly like having to compute the delta against some base date, and I'm pretty sure there is a package that could parse directly into the delta object. Open to any suggestions.
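
If the base date is the main annoyance, one alternative is the regex route the question asks about, using only the standard library. This is a minimal sketch, not a tested replacement for the packages above: `parse_duration`, `UNIT_KEYS` and `PATTERN` are hypothetical names made up for illustration, and it assumes the input only ever contains whole numbers followed by year(s), month(s), day(s) or hour(s).

```python
import re

# Hypothetical mapping from unit word to the one-letter key used in the question.
UNIT_KEYS = {"year": "y", "month": "m", "day": "d", "hour": "h"}

# A whole number followed by a unit name, with an optional plural "s".
PATTERN = re.compile(r"(\d+)\s*(year|month|day|hour)s?\b")


def parse_duration(text):
    """Return a dict of y, m, d, h counts; units that are not mentioned stay at 0."""
    result = {"y": 0, "m": 0, "d": 0, "h": 0}
    for amount, unit in PATTERN.findall(text.lower()):
        result[UNIT_KEYS[unit]] += int(amount)
    return result


date_strings = [
    "5 months and 17 hours",
    "1 month and 19 days",
    "3 months and 1 day",
    "2 years 1 month and 2 days",
    "1 year 1 month and 1 days and 1 hour",
]

for date_string in date_strings:
    parts = parse_duration(date_string)
    print("y={y} m={m} d={d} h={h}".format(**parts))
```

Because the pattern matches each number-and-unit pair independently, the order of the units and the "and" separators do not matter, and "1 days" is accepted just like "1 day". If a delta object is still wanted, the four counts can be passed to `relativedelta(years=..., months=..., days=..., hours=...)` directly.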
