I'm trying to scrape two tables using BeautifulSoup and running into a brick wall. Website: https://bgp.he.net/country/US. I'm trying to grab the header row from the table, but for some reason I can't get it to parse into a list so I can manipulate it. I would then like to grab the data from each column and output it all to a JSON file.

Example:

for row in soup.find_all("tr"): #Append to list(?) 

Then delete unwanted entries?

I want to be able to output this to the JSON file and display it like this.

ASN #: Country: "US", "Name": XXX, "Routes V4", "XXXX", "Routes V6", "XXX"

  • Are you getting <Response [200]>? It seems I'm getting <Response [404]>. Commented Jan 12, 2019 at 15:58
  • No, I am able to successfully print the HTML code. I can easily grab the code and use print(soup.prettify()). Commented Jan 12, 2019 at 16:07
  • Ah, OK. Can you add that code to your question above? Commented Jan 12, 2019 at 16:10
  • Never mind, I found my mistake: I had a typo in the URL. I'll have a solution in a moment. Commented Jan 12, 2019 at 16:15

2 Answers

If you get a response code other than 200, set a User-Agent in the headers; without one, I get 403 Forbidden.

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; the site may answer 403 without one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}
html = requests.get('https://bgp.........', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup)

data = []
for row in soup.find_all("tr")[1:]:  # start from the second row to skip the header
    cells = row.find_all('td')
    if len(cells) < 6:  # skip rows that lack the expected data cells
        continue
    data.append({
        'ASN': cells[0].text,
        'Country': 'US',
        "Name": cells[1].text,
        "Routes V4": cells[3].text,
        "Routes V6": cells[5].text
    })
print(data)

results:

[
    {'ASN': 'AS6939', 'Country': 'US', 'Name': 'Hurricane Electric LLC', 'Routes V4': '127,337', 'Routes V6': '28,227'},
    {'ASN': 'AS174', 'Country': 'US', 'Name': 'Cogent Communications', 'Routes V4': '118,159', 'Routes V6': '8,814'}
]
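The records above keep the route counts as strings with thousands separators, while the question asks for output keyed by ASN. The following is only a minimal sketch of one way to reshape them, assuming the list has exactly the shape shown in the results and that every count cell is non-empty. `to_int` and `key_by_asn` are hypothetical helper names introduced for illustration; they are not part of the answer.

```python
import json

# Sample records copied from the results above.
data = [
    {'ASN': 'AS6939', 'Country': 'US', 'Name': 'Hurricane Electric LLC',
     'Routes V4': '127,337', 'Routes V6': '28,227'},
    {'ASN': 'AS174', 'Country': 'US', 'Name': 'Cogent Communications',
     'Routes V4': '118,159', 'Routes V6': '8,814'},
]


def to_int(text):
    """Convert a count such as '127,337' to 127337 (hypothetical helper)."""
    return int(text.replace(',', '').strip())


def key_by_asn(records):
    """Map each ASN to the rest of its record (hypothetical helper)."""
    result = {}
    for record in records:
        # Copy every field except the ASN, which becomes the key.
        entry = {k: v for k, v in record.items() if k != 'ASN'}
        entry['Routes V4'] = to_int(entry['Routes V4'])
        entry['Routes V6'] = to_int(entry['Routes V6'])
        result[record['ASN']] = entry
    return result


by_asn = key_by_asn(data)
print(json.dumps(by_asn, indent=2))
```

Keying by ASN assumes each ASN appears only once per page; if that does not hold, later rows would overwrite earlier ones.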

To get the country name and code:

country = soup.select_one('h2 img').get('title')  # United States
country_code = 'https://bgp.he.net/country/US'.split('/')[-1]  # US
5 Comments

So hard-code the "ASN" and "Country" entries instead of attempting to go header row: body text, header row: body text. THANK YOU. Going to give this a try.
Because they are fixed strings, I wrote them directly to make the code faster, but of course you can improve it.
I wonder if, instead of hardcoding US, I could parse the current URL and take its last two characters. That would make this usable on many of their pages.
Yup, you can certainly do that in a loop: make the suffix dynamic and iterate through the different values (e.g. bgp.he.net/country/ES for Spain).
@JordanNewman I edited the answer to get the country.
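The idea discussed in these comments, deriving the country code from the URL and iterating over several country pages, could look roughly like the sketch below. It makes no network requests. `country_code_from_url` and `country_urls` are hypothetical helper names, and the only URL pattern assumed is the `bgp.he.net/country/<code>` form mentioned above.

```python
from urllib.parse import urlparse

BASE_URL = 'https://bgp.he.net/country/'


def country_code_from_url(url):
    """Return the last path segment of the URL, e.g. 'US' (hypothetical helper)."""
    path = urlparse(url).path
    # rstrip handles a trailing slash such as '/country/US/'.
    return path.rstrip('/').split('/')[-1]


def country_urls(codes):
    """Build one page URL per country code (hypothetical helper)."""
    return [BASE_URL + code for code in codes]


for url in country_urls(['US', 'ES']):
    print(country_code_from_url(url), url)
```

Taking the last path segment rather than the last two characters keeps the helper working if the URL carries a trailing slash or a query string.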

A slightly different approach from the BeautifulSoup answer, to give you options.

I like BeautifulSoup for parsing until I see <table> tags. Then I usually switch to pandas, since it can get the table in one line, and I can manipulate the DataFrame as needed.

You can then convert the DataFrame to JSON (I actually learned this from an ewwink solution a few weeks back :-) )

import pandas as pd
import requests
import json

url = 'https://bgp.he.net/country/US'

session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en"}

response = session.get(url, headers=headers)

tables = pd.read_html(response.text)
table = tables[0]
table['Country'] = url.split('/')[-1]

jsonObject = table.to_dict(orient='records')

# if you need as string to write to json file
jsonObject_string = json.dumps(jsonObject)

Output:

[{'ASN': 'AS6939', 'Name': 'Hurricane Electric LLC', 'Adjacencies v4': 7216, 'Routes v4': 127337, 'Adjacencies v6': 4460, 'Routes v6': 28227, 'Country': 'US'},
 {'ASN': 'AS174', 'Name': 'Cogent Communications', 'Adjacencies v4': 5692, 'Routes v4': 118159, 'Adjacencies v6': 1914, 'Routes v6': 8814, 'Country': 'US'}...
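Since the question asks for a JSON file, here is a minimal sketch of writing records like the ones above to disk and reading them back, using only the standard library. The sample records are copied from the output above; the file name `asn_us.json` and the temporary-directory location are assumptions made for illustration.

```python
import json
import os
import tempfile

# Sample records copied from the output above.
records = [
    {'ASN': 'AS6939', 'Name': 'Hurricane Electric LLC', 'Adjacencies v4': 7216,
     'Routes v4': 127337, 'Adjacencies v6': 4460, 'Routes v6': 28227, 'Country': 'US'},
    {'ASN': 'AS174', 'Name': 'Cogent Communications', 'Adjacencies v4': 5692,
     'Routes v4': 118159, 'Adjacencies v6': 1914, 'Routes v6': 8814, 'Country': 'US'},
]

# Hypothetical output path; change it to wherever the file should live.
out_path = os.path.join(tempfile.gettempdir(), 'asn_us.json')

# Write the list of dicts as a JSON array, indented for readability.
with open(out_path, 'w', encoding='utf-8') as fh:
    json.dump(records, fh, indent=2)

# Read the file back to confirm it round-trips.
with open(out_path, encoding='utf-8') as fh:
    loaded = json.load(fh)

print(len(loaded), 'records written to', out_path)
```

If the data is still in a DataFrame, pandas also offers `DataFrame.to_json(orient='records')`, which may save the intermediate list of dicts.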
