
I am working on a project where I am crawling thousands of websites to extract text data; the end use case is natural language processing.

EDIT: Since I am crawling hundreds of thousands of websites I cannot tailor scraping code to each one, which means I cannot search for specific element IDs. The solution I am looking for is a general one.

I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the page, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but along the sides, top and bottom there may be links or text about other subjects, promotions or other content.

With the .get_text() function, it returns all the text on the page in one go. The problem is that it combines everything (the relevant parts with the irrelevant ones). Is there another function similar to .get_text() that returns all the text but as a list, where every list item is a specific section of the text, so that it is known where new subjects start and end?
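
To illustrate, here is roughly where I am now and the shape of output I am after (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# What I do now: everything comes back as one undifferentiated string
response = requests.get('https://example.com')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
full_text = soup.get_text()

# What I am after is roughly a list, one entry per section of the page,
# so the main article can be told apart from side links, promotions, etc.:
# sections = ['nav links ...', 'main article text ...', 'footer / promo text ...']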

As a bonus, is there a way to identify the main body of text on a web page?

  • Maybe you could try to use regex to get the links you'd need. Commented May 18, 2020 at 4:04
  • @MustardTiger, have you tried using find_all, which lets you search elements by tag and attributes, and then calling text? Commented May 18, 2020 at 4:33

2 Answers


Below are snippets that you could use to query the data in the way you want using BeautifulSoup4 and Python 3:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Print the first child of the body
print(soup.body.contents[0])

# Print the first div found on the html page
print(soup.find('div'))

# Print all divs on the html page as a list
print(soup.find_all('div'))

# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))

# Print all html elements (as a list) that match a CSS selector
print(soup.select('.required-css-selector'))

# Print the value of an attribute
print(soup.find(id='someid').get("attribute-name"))

# You can also break one large query into multiple queries
parent = soup.find(id='someid')

# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
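
If you need the text broken up by section rather than as one big string (as asked in the question's edit), one rough approach, assuming each top-level block tag under the body counts as a "section", is to call get_text() once per element:

# Rough sketch: treat each direct child of <body> as one "section".
# Which tags actually delimit sections will vary from site to site.
sections = [
    el.get_text(separator=' ', strip=True)
    for el in soup.body.find_all(['article', 'section', 'div', 'p'], recursive=False)
    if el.get_text(strip=True)
]
print(sections)  # list of strings, one per section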

For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenge implementing this or if your requirement is something else.
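
For reference, a minimal Scrapy spider for this kind of extraction might look roughly like the following (the URL and output field names are just placeholders):

import scrapy

class TextSpider(scrapy.Spider):
    name = 'text_spider'
    start_urls = ['https://yoursite/page']  # placeholder, same as above

    def parse(self, response):
        # '::text' extracts the text nodes from the matched elements
        paragraphs = response.css('p::text').getall()
        yield {'url': response.url, 'paragraphs': paragraphs}

You can run it with scrapy runspider spider.py -o output.json without setting up a full project.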


1 Comment

Hi, I made an edit to the question to make things clearer.

The closest real-life example of what you're looking for could be the Reader View aka Reading Mode found in Firefox, Safari and others.

Here's one question on that topic on Stack Overflow: How does Firefox reader view operate

Firefox is said to be relying on github.com/mozilla/readability, which they thankfully open-sourced.
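
There is also a Python port, readability-lxml, with a similar API. A rough sketch of using it to pull out the main body of a page, assuming that package is installed (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup
from readability import Document  # pip install readability-lxml

response = requests.get('https://yoursite/page')  # placeholder URL
doc = Document(response.text)

print(doc.title())         # the detected article title
main_html = doc.summary()  # HTML of what readability considers the main content

# Strip the remaining tags to get plain text for NLP
main_text = BeautifulSoup(main_html, 'html.parser').get_text(separator='\n', strip=True)
print(main_text)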

