
I'm using Python to find dates in strings like:

string01='los mantenimientos acontecieron en los dias 3,06,8 ,9, 15 y 29 de diciembre de 2018.Por cada mantenimiento fué cobrado $1,300.00 códigos de mantenimiento: (3)A34,(2)C54,(1)D65' 

('the maintenance sessions were on December 3,06,8 ,9, 15 and 29 of 2018')

I'm first trying to use a regex to find and split out just the dates (not the currency amounts), and then transform them into the expected result.

expected result: ['3/12/2018','06/12/2018','08/12/2018','09/12/2018','15/12/2018','29/12/2018']

string02='los mantenimientos sucedieron en: 2,04,05,8,9,10,11,14,15,22,24, y 27 de junio de 2018.Valor de cada uno de los mantenimiento: $1,300.00, códigos de mantenimiento: (1)A35,(6)C54,(5)D65' 

('the maintenance sessions happened on June 2,04,05,8,9,10,11,14,15,22,24, and 27 of 2018')

expected result: ['02/06/2018','04/06/2018','05/06/2018','08/06/2018','09/06/2018','10/06/2018','11/06/2018','14/06/2018','15/06/2018','22/06/2018','24/06/2018','27/06/2018']

What I've tried so far:

dias=re.compile(r"((\s?[0-3]?[0-9]\s?\,?\s?){1,9}[0-3][0-9]|\sy\s[0-3][0-9]\sde\s(?:diciembre|junio)\sde\s[2][0][0-2][0-9])")
dias_found=re.findall(dias,string01)

but I'm getting tuples and duplicated values:

[(' 3,06,8,9, 15', '9, '), (' y 29 de diciembre de 2018', '')] 

It should be ['3','06','8','9','15','29 de diciembre de 2018']

Any help will be greatly appreciated.

Thanks in advance.

  • Honestly, trying to parse human-readable language is fraught with enough difficulty that relying on it for anything critical is probably a bad idea -- it'd be better to get your ops team to share their schedule in iCal format or something else built to be parsed programmatically; that way, if they use slightly different wording next time and it's read incorrectly, that's their problem and not yours. Commented Jul 5, 2019 at 20:49
  • @abdusco thanks a lot, it's Spanish actually. Commented Jul 5, 2019 at 20:57

1 Answer


You can use the re module together with string manipulation to extract the dates easily:

import re

if __name__ == "__main__":
    texts = [
        'en los dias 3,06,8 ,9, 15 y 29 de diciembre de 2018.Por c',
        'n en: 2,04,05,8,9,10,11,14,15,22,24, y 27 de junio de 2018.Valor de',
    ]
    # select from the beginning of date-like text till the end of year
    pattern = r'\s*((\d+[\sy\,]*)+[\D\s]+20\d{2})'
    month_names = ['diciembre', 'junio']  # add others
    month_pattern = re.compile(f'({"|".join(month_names)})', flags=re.IGNORECASE)

    all_dates = []
    for item in texts:
        match = re.search(pattern, item)
        if not match:
            continue
        date_region: str = match.group(1)
        # find year
        year = re.search(r'(20\d{2})', date_region).group(1)
        # find month
        month_match = re.search(month_pattern, date_region)
        month = month_match.group(1)
        # remove everything from the month name onwards
        date_region = date_region[: month_match.start()]
        # find all numbers, we're assuming they represent day of the month
        days = re.findall(r'(\d+)', date_region)
        found_dates = [f'{d}/{month}/{year}' for d in days]
        all_dates.append(found_dates)

    print(all_dates)

I don't know the month names in Portuguese (edit: it was Spanish), but replacing them with numbers is a trivial task. Output:

[['3/diciembre/2018', '06/diciembre/2018', '8/diciembre/2018', '9/diciembre/2018', '15/diciembre/2018', '29/diciembre/2018'], ['2/junio/2018', '04/junio/2018', '05/junio/2018', '8/junio/2018', '9/junio/2018', '10/junio/2018', '11/junio/2018', '14/junio/2018', '15/junio/2018', '22/junio/2018', '24/junio/2018', '27/junio/2018']] 
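
To get from this output to the dd/mm/yyyy format asked for in the question, the month name still has to be swapped for its number. The following is only a sketch of one way to do that, not the answer's original code: the `MONTHS` table, the `DATE_REGION` pattern and the `extract_dates` helper are names made up for illustration, and it assumes the text always follows the "<days> de <month> de <year>" shape with a year in the 2000s. It also zero-pads every day, so the first date comes out as '03/12/2018' rather than the '3/12/2018' shown in the question.

```python
import re

# Assumption: the standard Spanish month names, mapped to month numbers.
MONTHS = {
    'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4,
    'mayo': 5, 'junio': 6, 'julio': 7, 'agosto': 8,
    'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12,
}

# Group 1: the list of days (1-2 digits each, separated by commas, spaces or "y").
# Group 2: the month name. Group 3: the year.
# Non-capturing groups (?:...) keep the result down to exactly these three groups.
DATE_REGION = re.compile(
    r'(\d{1,2}(?:[\s,y]+\d{1,2})*)[\s,y]*de\s+('
    + '|'.join(MONTHS)
    + r')\s+de\s+(20\d{2})',
    flags=re.IGNORECASE,
)


def extract_dates(text):
    """Return zero-padded dd/mm/yyyy strings for the first day list in text."""
    match = DATE_REGION.search(text)
    if not match:
        return []
    days, month, year = match.groups()
    month_number = MONTHS[month.lower()]
    return [f'{int(day):02d}/{month_number:02d}/{year}'
            for day in re.findall(r'\d+', days)]


if __name__ == '__main__':
    sample = ('los mantenimientos acontecieron en los dias 3,06,8 ,9, 15 y 29 '
              'de diciembre de 2018.Por cada mantenimiento fué cobrado $1,300.00')
    print(extract_dates(sample))
```

Because the day list must be followed directly by "de <month> de <year>", the currency amount ($1,300.00) and the maintenance codes are never picked up as days.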