I want to make a simple program that extracts the URLs from a site and then dumps them to a .txt file.
The code below works just fine, but when I try to dump the links to a file I get errors.
    from bs4 import BeautifulSoup, SoupStrainer
    import requests

    url = "https://stackoverflow.com"
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data)
    cr = 'C:\Users\Admin\Desktop\extracted.txt'

    for link in soup.find_all('a'):
        print(link.get('href'))

I tried with:

    with open(cr, 'w') as f:
        for link in soup.find_all('a'):
            print(link.get('href'))
            f.write(link.get('href'))

It dumps some links, not all, and they all end up on one line (I get TypeError: expected a string or other character buffer object).
The result in the .txt file should look like this:
    /teams/customers
    /teams/use-cases
    /questions
    /teams
    /enterprise
    https://www.stackoverflowbusiness.com/talent
    https://www.stackoverflowbusiness.com/advertising
    https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
    https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
    https://stackoverflow.com
    https://stackoverflow.com
    https://stackoverflow.com/help
    https://chat.stackoverflow.com
    https://meta.stackoverflow.com
    https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
    https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
    https://stackexchange.com/sites
    https://stackoverflow.blog
    https://stackoverflow.com/legal/cookie-policy
    https://stackoverflow.com/legal/privacy-policy
    https://stackoverflow.com/legal/terms-of-service/public

Here is my full code:

    from bs4 import BeautifulSoup, SoupStrainer
    import requests

    url = "https://stackoverflow.com"
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data)
    cr = 'C:\Users\Admin\Desktop\crawler\extracted.txt'

    with open(cr, 'w') as f:
        for link in soup.find_all('a'):
            print(link.get('href'))
            f.write(link.get('href'))
You call `f.write`, but in your first snippet I don't see where you create `f`. Also, `f.write` will put them all on one line: you are responsible for the formatting, so just add a `\n` each time you call it. Finally, `link.get('href')` is `None` when `href` is not defined, so check for that to avoid the TypeError.
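A minimal sketch of how the loop could look with those fixes applied, assuming the same URL and the same `cr` path from the question (the `r` prefix on the path is just an extra precaution so the backslashes are not read as escape sequences):

    from bs4 import BeautifulSoup
    import requests

    url = "https://stackoverflow.com"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # raw string so the backslashes in the Windows path are not treated as escapes
    cr = r'C:\Users\Admin\Desktop\extracted.txt'

    with open(cr, 'w') as f:
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is None:          # <a> tags without an href would otherwise cause the TypeError
                continue
            print(href)
            f.write(href + '\n')      # write() adds no newline, so append one explicitly

With the `None` check the TypeError goes away, and the explicit `\n` puts each link on its own line in extracted.txt.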