
I want to make a simple program that extracts URLs from a site, then dumps them to a .txt file.

The code below works just fine, but when I try to dump the links to a file I get errors.

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr='C:\Users\Admin\Desktop\extracted.txt'

for link in soup.find_all('a'):
    print(link.get('href'))

I tried:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))

It dumps some links, not all, and they are all on one line (I get TypeError: expected a string or other character buffer object).
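A likely cause, shown as a minimal sketch: some <a> tags have no href attribute, so link.get('href') returns None, and passing None to f.write raises exactly that TypeError.

# minimal reproduction, assuming an <a> tag without an href attribute
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a name="top">no href here</a>', 'html.parser')
link = soup.find('a')
print(link.get('href'))   # prints None

with open('out.txt', 'w') as f:
    f.write(link.get('href'))   # TypeError: write() expects a string, not None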

The result in .txt should look like:

/teams/customers
/teams/use-cases
/questions
/teams
/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
The full code:

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr='C:\Users\Admin\Desktop\crawler\extracted.txt'

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))
  • You have f.write but I don't see where you create f. Also, write will put them all on one line; you are responsible for formatting. Just add a \n each time you call it. Commented Aug 28, 2019 at 12:38
  • My bad... I added it as advised. Commented Aug 28, 2019 at 12:40
  • You probably need to check if link.get('href') is None in case href is not defined, to avoid the TypeError. Commented Aug 28, 2019 at 12:41

3 Answers


Try this:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        link_text = link.get('href')
        if link_text is not None:
            print(link_text)
            f.write(link_text + '\n')

1 Comment

This is a good use case for the new walrus operator in 3.8. if link_text := link.get('href'):
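That would look something like this (a sketch; requires Python 3.8+, and note that := also skips empty href strings, not just None):

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        # assignment expression: bind and test in one step
        if link_text := link.get('href'):
            print(link_text)
            f.write(link_text + '\n')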
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr = r'C:\Users\Admin\Desktop\extracted.txt'  # raw string so backslashes aren't treated as escapes

# collect the non-empty hrefs first
links = []
for link in soup.find_all('a'):
    print(link.get('href'))
    if link.get('href'):
        links.append(link.get('href'))

# then write them out, one per line
with open(cr, 'w') as f:
    for link in links:
        print(link)
        f.write(link + '\n')



So... as Simon Fink suggested, it works. However, I found another way:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        try:
            f.write(link.get('href')+'\n')
        except:
            continue

But I think the method presented by Simon Fink is better. Many thanks.

2 Comments

Catching every exception with a bare continue might not be a good idea; it would be better to catch only the exception you're expecting (in this case a TypeError), so that any other exceptions are still raised correctly.
I will do as you suggest. Thanks
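For reference, a minimal sketch of the narrower handler suggested above, catching only the TypeError raised when link.get('href') returns None:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        try:
            f.write(link.get('href') + '\n')
        except TypeError:
            # raised when the <a> tag has no href and get() returns None
            continue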
