0 votes
0 answers
77 views

I'm building a bit of code that takes a specific Tumblr blog, iteratively scans post numbers in sequence, and checks whether a page exists. If it does, it prints that full URL to ...
Kyle Campbell
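The excerpt above is truncated, but the described task (probe sequential post numbers and print the URLs that exist) can be sketched with the standard library alone. The blog address and `/post/<number>` path pattern below are placeholders, not taken from the question; Tumblr's actual URL scheme should be checked against the target blog.

```python
from urllib.parse import urljoin
import urllib.error
import urllib.request

BASE = "https://example-blog.tumblr.com"  # placeholder blog, not from the question

def post_url(post_id: int) -> str:
    """Build the candidate URL for a given post number."""
    return urljoin(BASE, f"/post/{post_id}")

def page_exists(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers with a 2xx/3xx status. Some servers reject
    HEAD requests, so a fallback to GET may be needed in practice."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.URLError:  # covers HTTPError and connection failures
        return False

def scan_posts(start: int, stop: int):
    """Yield the full URL of every post number in [start, stop) that exists."""
    for pid in range(start, stop):
        url = post_url(pid)
        if page_exists(url):
            yield url
```

Usage would look like `for url in scan_posts(1, 100): print(url)`; adding a small delay between probes is polite to the server.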
0 votes
0 answers
40 views

I'm working on a crawler in Python that takes an e-commerce sitemap and classifies each URL into a page type, for example: home, product, product_category, product_tag, brand, post, static_page ...
Tasos Paraskevakis
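A common way to approach the classification step is an ordered list of path-pattern rules with a fallback type. The path patterns below are assumptions for illustration (the question does not show the store's URL scheme); a real classifier needs rules matched to the actual site.

```python
import re
from urllib.parse import urlparse

# Hypothetical path rules, checked in order; these paths are assumptions,
# not taken from the question.
RULES = [
    (re.compile(r"^/$"), "home"),
    (re.compile(r"^/product-category/"), "product_category"),
    (re.compile(r"^/product-tag/"), "product_tag"),
    (re.compile(r"^/product/"), "product"),
    (re.compile(r"^/brand/"), "brand"),
    (re.compile(r"^/blog/"), "post"),
]

def classify(url: str) -> str:
    """Map a URL to a page type; anything unmatched falls back to static_page."""
    path = urlparse(url).path or "/"
    for pattern, page_type in RULES:
        if pattern.match(path):
            return page_type
    return "static_page"
```

Keeping the rules as data (rather than an if/elif chain) makes it easy to load per-site rule sets for different stores.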
0 votes
0 answers
45 views

I have a React page that gets re-routed, for crawlers, to an SEO backend page in Node.js + Express. I want to make it work with Reddit's crawler to get embedded videos, which it doesn't. When I post ...
Andi Giga
  • 4,252
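The routing decision in setups like this usually hinges on user-agent sniffing. A minimal sketch of that check is below (in Python for consistency with the other examples, though the question concerns Express); the token list is an assumption — Reddit's link crawler is believed to identify itself with a user agent containing "redditbot", but this should be verified against your own access logs.

```python
# Assumed crawler user-agent substrings; verify against real access logs.
CRAWLER_TOKENS = ("redditbot", "facebookexternalhit", "twitterbot",
                  "googlebot", "bingbot")

def is_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match against known crawler tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)
```

When the check matches, the server would serve the pre-rendered SEO page (with the video's OpenGraph tags) instead of the React bundle.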
0 votes
1 answer
209 views

I'm currently using Playwright in Node.js to capture screenshots of webpages, but I'm exploring Firecrawl and wondering if it can handle screenshots directly. Here is what my firecrawl looks like with ...
James
  • 11
-2 votes
1 answer
117 views

I am web-scraping WHO pages using the following code: pacman::p_load(rvest, httr, stringr, purrr) download_first_pdf_from_handle <- function(handle_id) { ...
flâneur
  • 321
1 vote
1 answer
260 views

I set up a self-hosted Firecrawl instance and I want to crawl my internal intranet site (e.g. https://intranet.xxx.gov.tr/). I can access the site directly both from the host machine and from inside ...
birdalugur
0 votes
1 answer
64 views

My WordPress site's post pages return intermittent HTTP 406 "Not Acceptable" errors, but only for site-analysis/SEO tools (e.g., SEMrush, Ahrefs, GTmetrix). When accessed directly by ...
Zaheer Ahmad Safeer
0 votes
2 answers
75 views

from langchain_community.document_loaders import SitemapLoader def crawl(self): print("Starting crawler...") sitemap_url = "https://gringo.co.il/sitemap.xml" ...
Gulzar
  • 28.8k
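When a `SitemapLoader` run misbehaves, a useful debugging step is to parse the sitemap directly and see which `<loc>` entries it would crawl. This stdlib sketch does only that extraction step (it is not LangChain's implementation, just a way to inspect the same input):

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Comparing this list with what the loader actually fetched quickly shows whether the problem is the sitemap contents or the loader's filtering.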
0 votes
0 answers
182 views

Here's a brief overview of what I want to achieve: extract raw HTMLs and save them; use Crawl4AI to produce a 'cleaner' and smaller HTML that has a lot of information, including what I will eventually ...
Leksa99
  • 115
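The "cleaner, smaller HTML" step can be approximated without any crawling framework: strip `<script>`/`<style>` subtrees and comments while copying everything else through. This is a hedged stdlib sketch of that idea, not Crawl4AI's cleaning pipeline:

```python
from html.parser import HTMLParser

class HtmlSlimmer(HTMLParser):
    """Copy HTML through, dropping <script>/<style> subtrees and comments."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skipping = 0  # nesting depth inside skipped subtrees

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        elif not self.skipping:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skipping:
            self.parts.append(f"&#{name};")

def slim_html(raw: str) -> str:
    """Return the page with script/style subtrees and comments removed."""
    parser = HtmlSlimmer()
    parser.feed(raw)
    return "".join(parser.parts)
```

Saving both the raw and the slimmed HTML, as the question describes, keeps a fallback if the cleaning turns out to drop something needed later.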
0 votes
0 answers
86 views

I'm running Facebook ads and today I woke up to see my server CPU at 100%. I couldn't even use my website. I did some research and found out it was a Facebook crawler sending excessive requests. I tried ...
Shami Asad
0 votes
0 answers
84 views

I'm performing data crawling on a webpage using Selenium. This is my code: from selenium import webdriver from selenium.webdriver.chrome.service import Service from selenium.webdriver.chrome.options ...
midmash36
0 votes
1 answer
116 views

I'm working on a web crawler and trying to understand how IP substitution works. From what I have read, the DNS hostname should be resolved to its IP address (one of many) and used instead of the ...
DK3Z
  • 123
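The two halves of IP substitution — resolve the hostname to one of its addresses, then connect to that address while still sending the original `Host` header so name-based virtual hosting routes correctly — can be sketched with the standard library (function names here are illustrative):

```python
import http.client
import socket

def resolve_ips(hostname: str) -> list[str]:
    """Return the unique IPv4 addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

def fetch_status_via_ip(ip: str, hostname: str, path: str = "/") -> int:
    """Plain-HTTP request sent to a specific IP while keeping the original
    Host header. (HTTPS additionally needs SNI set to the hostname,
    which http.client does not handle for you at this level.)"""
    conn = http.client.HTTPConnection(ip, 80, timeout=10)
    try:
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().status
    finally:
        conn.close()
```

Resolving once and reusing the address per-connection is what lets a crawler spread load across a host's multiple A records.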
2 votes
1 answer
926 views

So I was testing this code (https://leonardo467.gumroad.com/l/cstsu), which uses crawl4ai, but it seems that the library has been updated or something, because if you run it (with an API, so I use free ...
ray
  • 21
0 votes
0 answers
34 views

I want to use the data in my code with a Transfermarkt scraper for my own purposes. I get all the desired data except Current Club; I can't get the club name. I tried all the ...
Perseus
  • 29
0 votes
0 answers
158 views

I'm trying to scrape code blocks from multiple tabs in a documentation page using Crawl4AI. While I'm able to extract Markdown content, the code blocks inside tabbed sections are not being captured. ...
harsha bajaj
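Tabbed documentation UIs usually keep every panel in the served HTML and merely hide the inactive ones, so a renderer-based extraction sees only the visible tab while a raw-HTML parse captures all of them. Assuming the panels are present in the markup (rather than fetched on click), a stdlib sketch of pulling every `<pre>`/`<code>` block, hidden or not:

```python
from html.parser import HTMLParser

class CodeBlockExtractor(HTMLParser):
    """Collect the text inside every <pre>/<code> element, visible or not."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self.depth = 0   # nesting depth inside <pre>/<code>
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            if self.depth == 0:
                self.buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.buf))

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)

def extract_code_blocks(html: str) -> list[str]:
    """Return the text of every code block in document order."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    return parser.blocks
```

If the blocks still don't appear, the tabs are likely loaded by JavaScript on click, and only a rendering crawler that activates each tab will see them.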
