Parsing HTML table into Pandas DataFrame

Question

I have a text file containing an HTML table. The table is a bank statement, and I'd like to parse it into a pandas DataFrame. Is there a more graceful way to do it? I've only recently started learning Python, so there's a good chance you can give me some good advice.

    from bs4 import BeautifulSoup
    import pandas as pd

    with open("sber2.txt", "r", encoding="UTF8") as f:
        context = f.read()

    soup = BeautifulSoup(context, 'html.parser')

    rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
    rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
    rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
    rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})

    dic = {
        "dates": [],
        "category": [],
        "comment": [],
        "money": []
    }

    i = 0
    while i < len(rows_dates):
        dic["dates"].append(rows_dates[i].text)
        dic["category"].append(rows_category[i].text)
        dic["comment"].append(rows_comment[i].text)
        dic["money"].append(rows_money[i].text)
        '''
        print(
            rows_dates[i].text,
            rows_category[i].text,
            rows_comment[i].text,
            rows_money[i].text)
        '''
        i += 1

    df = pd.DataFrame(dic)
    df.info()
    print(df.head())

Output:

    RangeIndex: 18 entries, 0 to 17
    Data columns (total 4 columns):
    category    18 non-null object
    comment     18 non-null object
    dates       18 non-null object
    money       18 non-null object
    dtypes: object(4)
    memory usage: 656.0+ bytes
           category                       comment       dates    money
    0  Supermarkets    PYATEROCHKA 1168 SAMARA RU  28.12.2017  -456,85
    1  Supermarkets             KARUSEL SAMARA RU  26.12.2017  -710,78
    2  Supermarkets    PYATEROCHKA 1168 SAMARA RU  24.12.2017  -800,24
    3  Supermarkets  AUCHAN SAMARA IKEA SAMARA RU  19.12.2017  -154,38
    4  Supermarkets    PYATEROCHKA 9481 SAMARA RU  16.12.2017  -188,80

alecxe · Accepted Answer · 2017-12-31 17:56:01Z

zip() with a list comprehension to the rescue:

    rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
    rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
    rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
    rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})

    # One dict per table row; zip() walks the four result lists in lockstep.
    data = [
        {
            "dates": date.get_text(),
            "category": category.get_text(),
            "comment": comment.get_text(),
            "money": money.get_text()
        }
        for date, category, comment, money in zip(rows_dates, rows_category, rows_comment, rows_money)
    ]

    # A list of dicts maps directly onto DataFrame rows.
    df = pd.DataFrame(data)
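One behavior worth keeping in mind with this approach: zip() stops at the shortest input, so if one of the find_all() results comes back short (say, a row without a comment), the extra rows are dropped silently, whereas the index-based loop in the question would raise an IndexError. Here is a minimal standard-library sketch of that difference. The sample values and the helper name zip_rows_checked are made up for illustration and are not part of the original code.

```python
# Hypothetical sample values standing in for the scraped texts.
dates = ["28.12.2017", "26.12.2017", "24.12.2017"]
categories = ["Supermarkets", "Supermarkets", "Supermarkets"]
comments = ["PYATEROCHKA 1168 SAMARA RU", "KARUSEL SAMARA RU"]  # one entry short on purpose
amounts = ["-456,85", "-710,78", "-800,24"]

# Same pattern as in the answer: one dict per row, built with zip().
rows = [
    {"dates": d, "category": c, "comment": m, "money": a}
    for d, c, m, a in zip(dates, categories, comments, amounts)
]
# zip() stopped at the shortest list, so the third row is missing.
print(len(rows))


def zip_rows_checked(*columns):
    """Zip parallel columns into row tuples, refusing mismatched lengths.

    Illustrative helper (not from the original answer).
    """
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return list(zip(*columns))


try:
    zip_rows_checked(dates, categories, comments, amounts)
except ValueError as error:
    print("refused:", error)
```

If every row of the statement is guaranteed to contain all four fields, the plain zip() version is fine as it is; the check only matters when a field may be absent.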

Or you can do it a bit differently, zipping the lists of texts and specifying the DataFrame headers via the columns argument:

    rows_dates = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'momentDateText: date'})]
    rows_category = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: categoryName'})]
    rows_comment = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: comment'})]
    rows_money = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})]

    # Each tuple produced by zip() becomes one DataFrame row.
    data = list(zip(rows_dates, rows_category, rows_comment, rows_money))

    # Pass the headers explicitly; without columns= the frame would get 0..3 as column names.
    df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
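Both variants repeat the find_all() call four times. As a further sketch, the column names and their data-bind values can live in one mapping, with a dict comprehension building the columns. This is a variation on the snippets above, not the method from the original answer, and it uses no zip(). The inline SAMPLE_HTML is an assumed stand-in for the real statement file, whose markup is not shown in the question; only the data-bind values are taken from it. BINDINGS and extract_column are illustrative names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Assumed markup: the real file's structure is unknown, only the data-bind values are from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td data-bind="momentDateText: date">28.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">PYATEROCHKA 1168 SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-456,85</td>
  </tr>
  <tr>
    <td data-bind="momentDateText: date">26.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">KARUSEL SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-710,78</td>
  </tr>
</table>
"""

# DataFrame column name -> data-bind attribute value to search for.
BINDINGS = {
    "dates": "momentDateText: date",
    "category": "text: categoryName",
    "comment": "text: comment",
    "money": "currencyText: nationalAmount",
}


def extract_column(soup, binding):
    """Return the stripped text of every element carrying the given data-bind value."""
    return [tag.get_text(strip=True) for tag in soup.find_all(attrs={"data-bind": binding})]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One list of texts per column; pandas raises ValueError if the lists differ in length.
columns = {name: extract_column(soup, binding) for name, binding in BINDINGS.items()}
df = pd.DataFrame(columns)
print(df)
```

Because the DataFrame is built from a dict of lists, a missing cell surfaces as a ValueError about mismatched lengths instead of a silently shortened table.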
