4

I am trying to write a regex to identify some dates.

The string I am working on is:

    String = (
        'these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000'
        ' these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012'
        ' these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave.'
    )

The regex looks like this:

    re.findall(
        r'('
        r'[\b, ]'
        r'([1-9]|0[1-9]|[12][0-9]|3[01])'
        r'[-/.\s+]'
        r'(1[1-2]|0[1-9]|[1-9]|Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sept|September|Oct|October|Nov|November|Dec|December)'
        r'(?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?'
        r'[^\da-zA-Z])',
        String)

The output I get is:

    [(' 11-2-', '11', '2', ''),
     (' 24-3-1695-', '24', '3', '1695'),
     (' 4-02-2011,', '4', '02', '2011'),
     (' 12/12/1990,', '12', '12', '1990'),
     (' 31-11-1690,', '31', '11', '1690'),
     (' 11 July 1990,', '11', 'July', '1990'),
     (' 7 Oct 2012 ', '7', 'Oct', '2012'),
     (' 12 December ', '12', 'December', ''),
     (' 5 July 2001,', '5', 'July', '2001')]

Problems:

  1. The first two outputs are wrong. They appear because of the optional expression ((?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?), which I added to handle cases like "12 December". How do I get rid of them?

  2. The case "June 2000" is not handled by the expression.
    Can I add something to the expression that handles this case without affecting the others?

  • stackoverflow.com/questions/33143433/… isn't this the same? Commented Oct 15, 2015 at 9:57
  • A little bit different. I could have commented on the previous post with the additional problems, but I thought it would be good to have a new question. Commented Oct 15, 2015 at 10:00

3 Answers

2

I would avoid trying to parse your dates with a regular expression alone. As you have found, it starts out OK, but edge cases quickly become hard to catch, for example invalid dates such as 31/09/2018.

A safer approach is to let Python's datetime decide if a date is valid or not. You can then easily specify valid date ranges and allowed date formats.
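As a minimal sketch of that idea (the helper name `is_valid_date` and its default format are only illustrative, not part of the script in this answer), `datetime.strptime` raises `ValueError` for a date that does not exist, so validity checking reduces to a try/except:

```python
from datetime import datetime

def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if `text` parses as a real calendar date under `fmt`."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised both for text that does not fit the format
        # and for impossible dates such as 31 September.
        return False

print(is_valid_date("31/09/2018"))  # September has only 30 days
print(is_valid_date("12/12/1990"))
```

A regex would need an explicit days-per-month table (and leap-year rules) to reject the first value; here the standard library does that work.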

The script below works by using a regular expression to extract all words and number groups. It then takes three parts at a time and tries each of the allowed date formats. If datetime succeeds in parsing a given format, the result is tested to ensure it falls within your allowed date range. If it is valid, the matching parts are skipped over to avoid a second match on a partial date.

If the date found does not contain a year, a default_year is assumed:

    from itertools import tee
    from datetime import datetime
    import re

    valid_from = datetime(1920, 1, 1)
    valid_to = datetime(2030, 1, 1)
    default_year = 2018

    # Allowed formats, one list entry per word in the candidate date
    dt_formats = [
        ['%d', '%m', '%Y'],
        ['%d', '%b', '%Y'],
        ['%d', '%B', '%Y'],
        ['%d', '%b'],
        ['%d', '%B'],
        ['%b', '%d'],
        ['%B', '%d'],
        ['%b', '%Y'],
        ['%B', '%Y'],
    ]

    text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000 these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012 these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

    # Build overlapping triples of consecutive words
    t1, t2, t3 = tee(re.findall(r'\b\w+\b', text), 3)
    next(t2, None)
    next(t3, None)
    next(t3, None)
    triples = zip(t1, t2, t3)

    for triple in triples:
        for dt_format in dt_formats:
            try:
                dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

                if '%Y' not in dt_format:
                    dt = dt.replace(year=default_year)

                if valid_from <= dt <= valid_to:
                    print(dt.strftime('%d-%m-%Y'))

                    # Skip the words already consumed by this date; the None
                    # default avoids StopIteration when the text ends in a date
                    for skip in range(1, len(dt_format)):
                        next(triples, None)
                break
            except ValueError:
                pass

For the text you have given, this would display:

    04-02-2011
    12-12-1990
    11-07-1990
    07-10-2012
    12-12-2018
    01-06-2000
    05-07-2001
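The three-at-a-time windowing used by the script can be seen in isolation in the following sketch. The helper name `sliding_triples` and the short sample sentence are only illustrative:

```python
import re
from itertools import tee

def sliding_triples(tokens):
    """Yield overlapping (a, b, c) windows over consecutive tokens."""
    t1, t2, t3 = tee(tokens, 3)
    next(t2, None)   # second iterator starts one token ahead
    next(t3, None)   # third iterator starts two tokens ahead
    next(t3, None)
    return zip(t1, t2, t3)

words = re.findall(r'\b\w+\b', "by 5 July 2001, he will")
triples = list(sliding_triples(words))
for triple in triples:
    print(triple)
```

Because `zip` stops when the shortest iterator runs out, the last window starts three words before the end of the text. A two-word date such as "June 2000" sitting at the very end of the input would therefore never be tried as the start of a window.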

1 Comment

this is a great answer, I have an edited version in a new answer that returns the original string and index of each match stackoverflow.com/a/71321576/5125264
1

@Martin Evans's answer was great, but I also wanted to return the location of each match within the string:

    >>> text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000 these are dates 4-02-2011, 12/12/1990, 31-11-1690, 11 July 1990, 7 Oct 2012 these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""
    >>> find_dates(text)
    [('2011-02-04', 90, 99, '4-02-2011'),
     ('1990-12-12', 101, 111, '12/12/1990'),
     ('1990-07-11', 126, 138, '11 July 1990'),
     ('2012-10-07', 140, 150, '7 Oct 2012'),
     ('2022-12-12', 177, 192, '12 December six'),
     ('2000-06-01', 212, 224, 'June 2000 he'),
     ('2001-07-05', 234, 245, '5 July 2001')]

I have wrapped it up in a function that uses finditer instead of findall:

    from itertools import tee
    from datetime import datetime
    import re

    def find_dates(
        text,
        valid_from = datetime(1920, 1, 1),
        valid_to = datetime(2030, 1, 1),
        default_year = datetime.now().year,
        dt_formats = [
            ['%d', '%m', '%Y'],
            ['%d', '%b', '%Y'],
            ['%d', '%B', '%Y'],
            ['%d', '%b'],
            ['%d', '%B'],
            ['%b', '%d'],
            ['%B', '%d'],
            ['%b', '%Y'],
            ['%B', '%Y'],
        ],
    ):
        # store your matches here
        dates = []

        t1, t2, t3 = tee(list(re.finditer(r'\b\w+\b', text)), 3)
        next(t2, None)
        next(t3, None)
        next(t3, None)
        triples = zip(t1, t2, t3)

        for triple in triples:
            # get start and end index of each triple
            start = triple[0].start()
            end = triple[-1].end()

            # convert matches to a list of three strings
            triple = [text[t.start():t.end()] for t in triple]

            for dt_format in dt_formats:
                try:
                    dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

                    if '%Y' not in dt_format:
                        dt = dt.replace(year=default_year)

                    if valid_from <= dt <= valid_to:
                        dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))

                        # skip the words already consumed; the None default
                        # avoids StopIteration when the text ends in a date
                        for skip in range(1, len(dt_format)):
                            next(triples, None)
                    break
                except ValueError:
                    pass

        return dates

There is still a bug, though, as you can see in ('2000-06-01', 212, 224, 'June 2000 he'): the reported span always covers all three words of the triple, even when the date only uses two of them. A better approach may be to do something with dateutil.parser.parse, as in https://stackoverflow.com/a/33051237/5125264
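One possible way to address the span problem is to take the end index from the last word the matching format actually used, rather than from the third word of the triple. The sketch below is a hypothetical variant (the names `find_dates_exact` and `DT_FORMATS` are made up for illustration). It walks the matches by index instead of using tee/zip, and it assumes the search should stop at the first format that parses:

```python
import re
from datetime import datetime

# Same formats as in the function above, longest first.
DT_FORMATS = [
    ['%d', '%m', '%Y'], ['%d', '%b', '%Y'], ['%d', '%B', '%Y'],
    ['%d', '%b'], ['%d', '%B'], ['%b', '%d'], ['%B', '%d'],
    ['%b', '%Y'], ['%B', '%Y'],
]

def find_dates_exact(text, valid_from=datetime(1920, 1, 1),
                     valid_to=datetime(2030, 1, 1), default_year=None):
    """Hypothetical variant of find_dates that reports only the words used."""
    if default_year is None:
        default_year = datetime.now().year  # resolved at call time
    words = list(re.finditer(r'\b\w+\b', text))
    dates = []
    i = 0
    while i < len(words):
        step = 1
        for fmt in DT_FORMATS:
            parts = words[i:i + len(fmt)]
            if len(parts) < len(fmt):
                continue  # not enough words left for this format
            try:
                dt = datetime.strptime(' '.join(m.group() for m in parts),
                                       ' '.join(fmt))
                if '%Y' not in fmt:
                    dt = dt.replace(year=default_year)
            except ValueError:
                continue
            if valid_from <= dt <= valid_to:
                # Span runs from the first to the last word this format used
                start, end = parts[0].start(), parts[-1].end()
                dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                step = len(fmt)  # skip the words this date consumed
            break  # assumption: stop at the first format that parses
        i += step
    return dates
```

Run on the text from the question, this variant should report '12 December' and 'June 2000' as the matched strings instead of '12 December six' and 'June 2000 he', while the three-word dates keep the same spans as before.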

0

Use this: r'\d{,2}-[A-Za-z]{,9}-\d{,4}'

    import re

    re.match(r'\d{,2}\-[A-Za-z]{,9}\-\d{,4}', 'Your Date')

This matches dates in formats such as '14-Jun-2021' and '4-september-20'.

1 Comment

But it also matches 69-PaNcAkEs-321
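One way to deal with false positives like the one in that comment is to use a slightly tighter pattern only as a pre-filter and let `datetime.strptime` decide whether the candidate is a real date. This is a minimal sketch, assuming English month names (the default C locale); the names `CANDIDATE`, `FORMATS`, and `parse_hyphen_date` are hypothetical:

```python
import re
from datetime import datetime

# Pre-filter: 1-2 digit day, 3-9 letter month, 2- or 4-digit year.
CANDIDATE = re.compile(r'\d{1,2}-[A-Za-z]{3,9}-(?:\d{4}|\d{2})')

# Abbreviated or full month name, 4- or 2-digit year.
FORMATS = ('%d-%b-%Y', '%d-%B-%Y', '%d-%b-%y', '%d-%B-%y')

def parse_hyphen_date(s):
    """Return a date for strings like '14-Jun-2021', or None if invalid."""
    if not CANDIDATE.fullmatch(s):
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue  # wrong format or impossible date; try the next one
    return None

for sample in ('14-Jun-2021', '4-september-20', '69-PaNcAkEs-321', '31-Sep-2021'):
    print(sample, '->', parse_hyphen_date(sample))
```

The regex alone would still accept something like '31-Sep-2021'; it is the `strptime` step that rejects it.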
