
Is the question below a good fit for https://softwareengineering.stackexchange.com/? If not, does it belong on Stack Overflow? And if not there either, on which Stack Exchange site can I ask it?

Question title: Python - scraping news articles on a daily basis from sites that do not have any feed

Question body: I can use the Python Beautiful Soup module to extract news items from a site's feed URL. But suppose the site has no feed and I need to extract news articles from it on a daily basis, as if it had a feed.

Edit 1: The site https://www.jugantor.com/ has no feed; even by googling, I did not find one. With the following code snippet, I tried to extract the links from the site. The result shows links such as 'http://epaper.jugantor.com', but the news items appearing on the site are not included in the extracted links.

My code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re


def getLinks(url):
    # Send a browser-like User-Agent header with the request.
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()

    soup = BeautifulSoup(content, "html.parser")
    links = []
    # Keep only anchors whose href starts with "http://".
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links


print(getLinks("https://www.jugantor.com/"))

Obviously this does not serve the intended purpose. I need all the news article links of 'https://www.jugantor.com/' on a daily basis, as if I had acquired them from a feed.
How can I do that? Is there a Python module, an algorithm, or something similar that would help?
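One possible explanation for the missing links, offered as a guess rather than a verified diagnosis of the live site: the pattern `^http://` in the snippet above keeps only absolute `http://` URLs, so any `https://` or relative `href` is dropped. The sketch below resolves every `href` against the page URL and keeps links on the same host. It swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra dependencies. The names `extract_site_links` and `SAMPLE_HTML`, and the sample markup itself, are hypothetical and not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_site_links(html, base_url):
    """Return absolute same-host links from html, in page order, without duplicates."""
    collector = _AnchorCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # turns relative hrefs into absolute URLs
        parsed = urlparse(absolute)
        same_site = parsed.scheme in ("http", "https") and parsed.netloc == host
        if same_site and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical markup for illustration only; not copied from the real site.
SAMPLE_HTML = """
<a href="/national/12345/example-story">Story</a>
<a href="https://www.jugantor.com/sports/67890/another-story">Story 2</a>
<a href="http://epaper.jugantor.com">E-paper</a>
<a href="javascript:void(0)">Menu</a>
"""

print(extract_site_links(SAMPLE_HTML, "https://www.jugantor.com/"))
```

With this sample, the relative and `https://` links are kept, while the e-paper subdomain and the `javascript:` pseudo-link are filtered out. Telling article pages apart from section or navigation pages would still need a site-specific rule, which this sketch does not attempt.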

Addendum: The question has since been asked on Stack Overflow and received an answer.

1 Answer


I am pretty sure that if you ask the question this way, it will be closed, either as "too broad" or because it is read as a request for programming help, which is off-topic for this site.

In general, asking for an algorithm that does such news scraping could be on-topic. However, you should try to

  • tell us in more detail what you already tried, and why it did not suit your needs (your mention of "Beautiful Soup" is a good start, but since this module is not dedicated to news feeds, why exactly couldn't you use it?)

  • tell us which of the solutions you found by googling "python web scraping" you already tried, and why those did not work either

  • try to keep it language-agnostic (that does not mean you cannot mention that you used Python for your first experiments). Asking for language-specific tools or modules is asking for a third-party resource, which is 100% off-topic here.

Askers are expected here to do some research on their own before they post a question. One possible way to do this might be to write a scraper prototype first and, when you hit algorithmic roadblocks, ask about them. That would probably lead to a much more focussed question.
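As a rough idea of what the feed-like part of such a prototype might look like, here is a minimal sketch: it remembers which links earlier runs have already seen and reports only the new ones, which is roughly what a feed reader does. The function name `new_links_since_last_run` and the JSON state file are assumptions made for illustration; fetching the page, extracting the links, and scheduling the daily run (cron or similar) are left out.

```python
import json
import tempfile
from pathlib import Path


def new_links_since_last_run(current_links, state_file):
    """Return links not seen in earlier runs, then record them in state_file."""
    path = Path(state_file)
    if path.exists():
        seen = set(json.loads(path.read_text(encoding="utf-8")))
    else:
        seen = set()
    # dict.fromkeys drops duplicates while keeping the page order.
    fresh = [link for link in dict.fromkeys(current_links) if link not in seen]
    path.write_text(json.dumps(sorted(seen.union(fresh))), encoding="utf-8")
    return fresh


# Two simulated daily runs against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "seen_links.json"
    print(new_links_since_last_run(["/a", "/b"], state))  # first run: ['/a', '/b']
    print(new_links_since_last_run(["/b", "/c"], state))  # second run: ['/c']
```

A prototype along these lines tends to surface the concrete questions worth asking, for example how to tell article links from navigation links, or how to handle articles whose URLs change.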

  • I added 'EDIT1:' and removed 'Q2'. Is it now worth a place at softwareengineering.stackexchange.com? Commented Feb 18, 2018 at 10:11
  • @IstiaqueAhmed: no, but now it looks like a good fit for stackoverflow.com. That is the place where you can ask for help on coding problems, and providing such a code snippet is encouraged there. Commented Feb 18, 2018 at 21:38
  • Is this question ( stackoverflow.com/questions/48788611/… ) a fit for softwareengineering.stackexchange.com? It is an architecture-related question. Commented Feb 19, 2018 at 9:24
  • @IstiaqueAhmed: questions which are considered "too broad" on Stack Overflow are likely to be considered too broad here as well by our community. See this older meta question: softwareengineering.meta.stackexchange.com/questions/6961/… Commented Feb 19, 2018 at 9:44
  • SO already has a question similar to what I asked in the OP here: stackoverflow.com/questions/29147449/…. But I do not think any of its answers is appropriate. Should I still ask my question on SO? Commented Feb 19, 2018 at 9:46
  • @IstiaqueAhmed: why don't you just try and see how well it works? You should then refer to that older question and explain why you think its answers do not suit you well. Another alternative is to place a bounty on that older question; it looks like you have enough rep for this on SO. Commented Feb 19, 2018 at 9:54
