\$\begingroup\$

I have a text file (linked) containing an HTML table; the table is a bank statement. I'd like to parse it into a pandas DataFrame. Is there a more graceful way to do it? I started learning Python recently, so there is a good chance you can give me some good advice.

    from bs4 import BeautifulSoup
    import pandas as pd

    with open("sber2.txt", "r", encoding = "UTF8") as f:
        context = f.read()

    soup = BeautifulSoup(context, 'html.parser')
    rows_dates = soup.find_all(attrs = {'data-bind':'momentDateText: date'})
    rows_category = soup.find_all(attrs = {'data-bind' : 'text: categoryName'})
    rows_comment = soup.find_all(attrs = {'data-bind' : 'text: comment'})
    rows_money = soup.find_all(attrs = {'data-bind' : 'currencyText: nationalAmount'})

    dic = {
        "dates" : [],
        "category" : [],
        "comment": [],
        "money" : []
    }

    i = 0
    while i < len(rows_dates):
        dic["dates"].append(rows_dates[i].text)
        dic["category"].append(rows_category[i].text)
        dic["comment"].append(rows_comment[i].text)
        dic["money"].append(rows_money[i].text)
        '''
        print(
            rows_dates[i].text, rows_category[i].text,
            rows_comment[i].text, rows_money[i].text)
        '''
        i += 1

    df = pd.DataFrame(dic)
    df.info()
    print(df.head())

Output:

    RangeIndex: 18 entries, 0 to 17
    Data columns (total 4 columns):
    category    18 non-null object
    comment     18 non-null object
    dates       18 non-null object
    money       18 non-null object
    dtypes: object(4)
    memory usage: 656.0+ bytes
           category                       comment       dates    money
    0  Supermarkets    PYATEROCHKA 1168 SAMARA RU  28.12.2017  -456,85
    1  Supermarkets             KARUSEL SAMARA RU  26.12.2017  -710,78
    2  Supermarkets    PYATEROCHKA 1168 SAMARA RU  24.12.2017  -800,24
    3  Supermarkets  AUCHAN SAMARA IKEA SAMARA RU  19.12.2017  -154,38
    4  Supermarkets    PYATEROCHKA 9481 SAMARA RU  16.12.2017  -188,80
\$\endgroup\$

1 Answer

\$\begingroup\$

zip() with a list comprehension to the rescue:

    rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
    rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
    rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
    rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})

    # Build one dict per statement row by walking the four result lists in parallel.
    data = [
        {
            "dates": date.get_text(),
            "category": category.get_text(),
            "comment": comment.get_text(),
            "money": money.get_text()
        }
        for date, category, comment, money in zip(rows_dates, rows_category, rows_comment, rows_money)
    ]

    # A list of dicts maps directly onto DataFrame rows; the keys become the column names.
    df = pd.DataFrame(data)
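One behavioural difference from the index-based loop is worth keeping in mind: zip() stops at the shortest input, so if one of the find_all() calls returned fewer elements, the extra rows would be dropped silently instead of raising an IndexError. A minimal sketch of that behaviour, using made-up plain lists as stand-ins for the extracted texts (the sample values and variable names are illustrative only):

```python
# Hypothetical stand-ins for the texts extracted from the page;
# the comment list is deliberately one item short.
dates = ["28.12.2017", "26.12.2017", "24.12.2017"]
comments = ["PYATEROCHKA 1168 SAMARA RU", "KARUSEL SAMARA RU"]

# Plain zip() truncates to the shortest input without complaint.
rows = [{"dates": d, "comment": c} for d, c in zip(dates, comments)]
print(len(rows))  # 2

# A cheap guard: all lists should have the same length before zipping.
same_length = len({len(dates), len(comments)}) == 1
print(same_length)  # False
```

On Python 3.10 or newer, passing strict=True to zip() is an alternative to the manual check; it raises a ValueError when the inputs differ in length.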

Or you can do it a bit differently: zip the lists of texts and specify the DataFrame headers via the columns argument:

    # Extract the text of each matching element up front.
    rows_dates = [item.get_text()
                  for item in soup.find_all(attrs={'data-bind': 'momentDateText: date'})]
    rows_category = [item.get_text()
                     for item in soup.find_all(attrs={'data-bind': 'text: categoryName'})]
    rows_comment = [item.get_text()
                    for item in soup.find_all(attrs={'data-bind': 'text: comment'})]
    rows_money = [item.get_text()
                  for item in soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})]

    # Each tuple produced by zip() is one row; the column names are supplied separately.
    data = list(zip(rows_dates, rows_category, rows_comment, rows_money))
    df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
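To try this variant without the original file, here is a self-contained sketch. The two-row HTML sample is made up and only mimics the data-bind attributes used above, so the real statement markup may well differ; the texts() helper is a hypothetical name introduced here to avoid repeating the find_all() call.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample: only the data-bind values are taken from the code above.
html = """
<table>
  <tr>
    <td data-bind="momentDateText: date">28.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">PYATEROCHKA 1168 SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-456,85</td>
  </tr>
  <tr>
    <td data-bind="momentDateText: date">26.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">KARUSEL SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-710,78</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(binding):
    """Return the stripped text of every element whose data-bind equals `binding`."""
    return [tag.get_text(strip=True)
            for tag in soup.find_all(attrs={"data-bind": binding})]


# One tuple per statement row, in the same order as the column names below.
data = list(zip(
    texts("momentDateText: date"),
    texts("text: categoryName"),
    texts("text: comment"),
    texts("currencyText: nationalAmount"),
))

df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
print(df)
```

All four columns still hold strings at this point, just as in the df.info() output in the question.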
\$\endgroup\$
