
I want to make a simple program that extracts URLs from a site, then dumps them to a .txt file.

The code below works just fine, but when I try to dump the links to a file I get errors.

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr='C:\Users\Admin\Desktop\extracted.txt'

for link in soup.find_all('a'):
    print(link.get('href'))

I tried:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))

It dumps some links, not all, and they are all on one line (I get TypeError: expected a string or other character buffer object).
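A likely cause, shown as a minimal sketch: some <a> tags have no href attribute, so link.get('href') returns None, and passing None to f.write raises exactly that TypeError.

# minimal reproduction, assuming an <a> tag without an href attribute
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a name="top">no href here</a>', 'html.parser')
link = soup.find('a')
print(link.get('href'))   # prints None

with open('out.txt', 'w') as f:
    f.write(link.get('href'))   # TypeError: write() expects a string, not None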

The result in .txt should look like:

/teams/customers
/teams/use-cases
/questions
/teams
/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
The full code:

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr='C:\Users\Admin\Desktop\crawler\extracted.txt'

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))
  • You have f.write but I don't see where you create f. Also, write will put them all on one line; you are responsible for formatting. Just add a \n each time you call it. Commented Aug 28, 2019 at 12:38
  • My bad... I added it as advised. Commented Aug 28, 2019 at 12:40
  • You probably need to check if link.get('href') is None in case href is not defined, to avoid the TypeError. Commented Aug 28, 2019 at 12:41

3 Answers


Try this:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        link_text = link.get('href')
        if link_text is not None:
            print(link_text)
            f.write(link_text + '\n')

1 Comment

This is a good use case for the new walrus operator in 3.8. if link_text := link.get('href'):
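That would look something like this (a sketch; requires Python 3.8+, and note that := also skips empty href strings, not just None):

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        # assignment expression: bind and test in one step
        if link_text := link.get('href'):
            print(link_text)
            f.write(link_text + '\n')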
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr = r'C:\Users\Admin\Desktop\extracted.txt'  # raw string so backslashes aren't treated as escapes

# collect the non-empty hrefs first
links = []
for link in soup.find_all('a'):
    print(link.get('href'))
    if link.get('href'):
        links.append(link.get('href'))

# then write them out, one per line
with open(cr, 'w') as f:
    for link in links:
        print(link)
        f.write(link + '\n')



So... as Simon Fink suggested, it works. However, I found another way:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        try:
            f.write(link.get('href')+'\n')
        except:
            continue

But I think the method presented by Simon Fink is better. Many thanks.

2 Comments

Catching every exception with a bare continue might not be a good idea; it would be better to catch only the exception you're expecting (in this case a TypeError), so that any other exceptions are still raised correctly.
I will do as you suggest. Thanks
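For reference, a minimal sketch of the narrower handler suggested above, catching only the TypeError raised when link.get('href') returns None:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        try:
            f.write(link.get('href') + '\n')
        except TypeError:
            # raised when the <a> tag has no href and get() returns None
            continue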
