
I am scraping all the words from the Merriam-Webster website.

I want to scrape every letter from a to z, including all the numbered pages within each letter, and save the words to a text file. The problem I'm having is that I only get the first result from the table instead of all of them. I know that this is a large amount of text (around 500k), but I'm doing it to educate myself.

CODE:

    import requests
    from bs4 import BeautifulSoup as bs

    URL = 'https://www.merriam-webster.com/browse/dictionary/a/'

    page = 1
    # for page in range(1, 75):
    req = requests.get(URL + str(page))
    soup = bs(req.text, 'html.parser')

    containers = soup.find('div', attrs={'class', 'entries'})
    table = containers.find_all('ul')

    for entries in table:
        links = entries.find_all('a')
        name = links[0].text
        print(name)

What I want is to get all the entries from this table, but instead I only get the first entry.

I'm kind of stuck here, so any help would be appreciated. Thanks.

https://www.merriam-webster.com/browse/medical/a-z
https://www.merriam-webster.com/browse/legal/a-z
https://www.merriam-webster.com/browse/dictionary/a-z
https://www.merriam-webster.com/browse/thesaurus/a-z
As the answer below says, you need another for loop: an outer loop over the letters a-z and an inner loop over the page numbers. To get the number of pages, find the a tag for the last page and read its data-page attribute: <a aria-label="Last" data-page="75" ... Commented Oct 21, 2020 at 19:33
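
The last-page lookup described in this comment can be tried offline. The sketch below is only an illustration: it uses the standard-library `HTMLParser` instead of BeautifulSoup so that it runs without extra installs, the sample HTML is made up from the two attributes quoted above, and `LastPageFinder` and `find_last_page` are hypothetical names.

```python
from html.parser import HTMLParser


class LastPageFinder(HTMLParser):
    """Records data-page from the first <a> tag whose aria-label is "Last"."""

    def __init__(self):
        super().__init__()
        self.last_page = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'a' and attr_map.get('aria-label') == 'Last' and self.last_page is None:
            value = attr_map.get('data-page')
            # Ignore a missing or empty data-page instead of raising an error.
            if value and value.isdigit():
                self.last_page = int(value)


def find_last_page(html):
    """Return the last page number, or None if no usable "Last" link is found."""
    finder = LastPageFinder()
    finder.feed(html)
    return finder.last_page


# Made-up pagination fragment modelled on the attributes quoted in the comment.
sample = '<ul><li><a aria-label="Last" data-page="75" href="#">Last</a></li></ul>'
print(find_last_page(sample))
```

With a real page, the returned number could serve as the upper bound of the inner page loop, for example `range(1, last_page + 1)`.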

2 Answers


To get all entries, you can use this example:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.merriam-webster.com/browse/dictionary/a/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    for a in soup.select('.entries a'):
        print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

Prints:

    (a) heaven on earth            https://www.merriam-webster.com/dictionary/%28a%29%20heaven%20on%20earth
    (a) method in/to one's madness https://www.merriam-webster.com/dictionary/%28a%29%20method%20in%2Fto%20one%27s%20madness
    (a) penny for your thoughts    https://www.merriam-webster.com/dictionary/%28a%29%20penny%20for%20your%20thoughts
    (a) quarter after              https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20after
    (a) quarter of                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of
    (a) quarter past               https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20past
    (a) quarter to                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to
    (all) by one's lonesome        https://www.merriam-webster.com/dictionary/%28all%29%20by%20one%27s%20lonesome
    (all) choked up                https://www.merriam-webster.com/dictionary/%28all%29%20choked%20up
    (all) for the best             https://www.merriam-webster.com/dictionary/%28all%29%20for%20the%20best
    (all) in good time             https://www.merriam-webster.com/dictionary/%28all%29%20in%20good%20time

    ...and so on.
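
The links in this output are percent-encoded. As a small offline illustration, assuming the `href` values are site-relative paths like the one below (taken from the first output line), the standard library can build the absolute URL and decode the word. `urljoin` is a swapped-in alternative to the string concatenation used in the answer, not what the answer itself does.

```python
from urllib.parse import unquote, urljoin

page_url = 'https://www.merriam-webster.com/browse/dictionary/a/'
# Relative link as implied by the first line of the output above.
href = '/dictionary/%28a%29%20heaven%20on%20earth'

# urljoin resolves the site-relative path against the page it was found on.
absolute = urljoin(page_url, href)

# The last path segment is the percent-encoded entry; unquote decodes it.
word = unquote(href.rsplit('/', 1)[-1])

print(absolute)
print(word)
```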

To scrape multiple pages:

    url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'

    for page in range(1, 76):
        soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

EDIT: To get all pages from A to Z:

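    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'

    # Outer loop: letters a-z. Inner loop: page numbers within each letter.
    for char in range(ord('a'), ord('z')+1):
        page = 1
        while True:
            soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
            for a in soup.select('.entries a'):
                print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

            # This relies on the "Last" link having an empty data-page on the final page.
            # A missing link is also treated as the end, to avoid a TypeError on None.
            last_link = soup.select_one('[aria-label="Last"]')
            if last_link is None or last_link.get('data-page', '') == '':
                break

            page += 1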

EDIT 2: To save to file:

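    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'

    # Explicit UTF-8 so entries with non-ASCII characters are written on any platform.
    with open('data.txt', 'w', encoding='utf-8') as f_out:
        for char in range(ord('a'), ord('z')+1):
            page = 1
            while True:
                soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
                for a in soup.select('.entries a'):
                    # Show progress on screen and write a tab-separated line to the file.
                    print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
                    print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)

                # This relies on the "Last" link having an empty data-page on the final page.
                # A missing link is also treated as the end, to avoid a TypeError on None.
                last_link = soup.select_one('[aria-label="Last"]')
                if last_link is None or last_link.get('data-page', '') == '':
                    break

                page += 1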
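
To try the file-writing step on its own, without hitting the site, something like the following sketch could be used. `save_entries` and `load_entries` are hypothetical helper names, the sample pairs are copied from the output shown earlier, and `encoding='utf-8'` is an assumption added here so that non-ASCII entries are written the same way on every platform.

```python
import os
import tempfile


def save_entries(entries, path):
    """Write (word, link) pairs to path, one tab-separated pair per line."""
    with open(path, 'w', encoding='utf-8') as f_out:
        for word, link in entries:
            print('{}\t{}'.format(word, link), file=f_out)


def load_entries(path):
    """Read the tab-separated file back into a list of (word, link) tuples."""
    with open(path, encoding='utf-8') as f_in:
        return [tuple(line.rstrip('\n').split('\t')) for line in f_in]


sample = [
    ('(a) quarter of', 'https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of'),
    ('(a) quarter to', 'https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to'),
]

# Write to a throwaway directory so the sketch does not touch a real data.txt.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
save_entries(sample, path)
print(load_entries(path) == sample)
```

A tab is a reasonable separator here because the entries contain spaces, so the file can be split back into word and link without ambiguity.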

5 Comments

This is only for page 'a', but I want this for all pages a-z. Could you kindly show me that as well?
Thanks, but as a beginner I don't understand some of the commands, so if you could add some explanation, that would help others as well. I also want to save this to a text file; how do I do that?
@Mujtaba See my Edit 2 for how to save to a file.
Thanks, sir, it's really helpful. I'm now adding the other categories as well, but I'm getting some errors. Could you please add them to the code too? They are: thesaurus, medical, legal.
Please look at the comment above when you have time.
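
For the other categories asked about in these comments, one possible starting point is to generate the browse URLs first and check them before scraping. This is only a sketch: it assumes that the thesaurus, medical and legal sections use the same `/browse/<section>/<letter>/<page>` pattern as the dictionary section, which is not confirmed anywhere in this thread, and `first_page_urls` is a hypothetical helper name.

```python
from itertools import product
from string import ascii_lowercase

# Assumed pattern: only the dictionary section is confirmed by the answer above.
BASE = 'https://www.merriam-webster.com/browse/{section}/{letter}/{page}'
SECTIONS = ('dictionary', 'thesaurus', 'medical', 'legal')


def first_page_urls(sections=SECTIONS, letters=ascii_lowercase):
    """Yield the page-1 URL for every section and letter combination."""
    for section, letter in product(sections, letters):
        yield BASE.format(section=section, letter=letter, page=1)


urls = list(first_page_urls())
print(len(urls))
print(urls[0])
```

Each generated URL could then be fed to the same page loop as in Edit 2, with the section name added as an extra placeholder in the URL template.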

I think you need another loop:

    for entries in table:
        links = entries.find_all('a')
        for name in links:
            print(name.text)
