9,636 questions
0 votes
0 answers
77 views
URL Targeted web crawler [closed]
I'm building a bit of code that takes a specific Tumblr page, scans it sequentially by post number, and checks whether each page exists. If it does, it will print that full URL to ...
0 votes
0 answers
40 views
How can I classify sitemap URLs (WooCommerce / OpenCart / custom) [closed]
I'm working on a crawler in Python that takes an e-commerce sitemap and classifies each URL into a page type, for example: home, product, product_category, product_tag, brand, post, static_page ...
0 votes
0 answers
45 views
How to Get Reddit Crawler to Use my Video Preview?
I have a React page that gets re-routed for crawlers to an SEO backend page in Node.js + Express. I want it to work with Reddit's crawler so that it picks up embedded videos, which it currently doesn't. When I post ...
0 votes
1 answer
209 views
How can I use Firecrawl to crawl and take a screenshot of a webpage instead of using Playwright in Node.js?
I'm currently using Playwright in Node.js to capture screenshots of webpages, but I'm exploring Firecrawl and wondering if it can handle screenshots directly. Here is what my firecrawl looks like with ...
-2 votes
1 answer
117 views
Webscrape links to download files based on word in page HTML
I am webscraping WHO pages using the following code: pacman::p_load(rvest, httr, stringr, purrr) download_first_pdf_from_handle <- function(handle_id) { ...
1 vote
1 answer
260 views
Firecrawl self-hosted crawler throws Connection violated security rules error
I set up a self-hosted Firecrawl instance and I want to crawl my internal intranet site (e.g. https://intranet.xxx.gov.tr/). I can access the site directly both from the host machine and from inside ...
0 votes
1 answer
64 views
Intermittent 406 Errors on Post, Pages Detected by Site Analyzers, Not Direct Browser Access
My WordPress site's post pages return intermittent HTTP 406 "Not Acceptable" errors, but ONLY for site analysis/SEO tools (e.g., SEMrush, Ahrefs, GTmetrix). When accessed directly by ...
0 votes
2 answers
75 views
SitemapLoader(sitemap_url).load() hangs
from langchain_community.document_loaders import SitemapLoader def crawl(self): print("Starting crawler...") sitemap_url = "https://gringo.co.il/sitemap.xml" ...
0 votes
0 answers
182 views
Crawl4AI token threshold not applied to raw html in arun
Here’s a brief overview of what I want to achieve: extract raw HTML and save it; use Crawl4AI to produce a ‘cleaner’ and smaller HTML that has a lot of information, including what I will eventually ...
0 votes
0 answers
86 views
Facebook Crawler Floods CPU usage to 100%
I'm running Facebook ads, and today I woke up to see my server CPU at 100%. I couldn't even use my website. I did some research and found out it was a Facebook crawler sending excessive requests. I tried ...
0 votes
0 answers
84 views
Adding user agent in chrome options in selenium
I'm crawling data from a webpage using Selenium. This is my code: from selenium import webdriver from selenium.webdriver.chrome.service import Service from selenium.webdriver.chrome.options ...
0 votes
1 answer
116 views
Substitute host name with its IP address in HTTPS requests
I'm working on a web crawler and I'm trying to understand how IP substitution works. From what I have read, a DNS hostname should be resolved to its IP address (one of many) and used instead of the ...
2 votes
1 answer
926 views
How to use LLMConfig in crawl4ai?
So I was testing this code (https://leonardo467.gumroad.com/l/cstsu) that uses crawl4ai, but it seems that the library has been updated or something, because if you run it (with an API, so I use free ...
0 votes
0 answers
34 views
Transfermarkt Scraper cannot get Club name
I want to use the data from Transfermarkt Scraper in my code for my own purposes. I get all the desired data except Current Club; I can't get the club name. I tried all the ...
0 votes
0 answers
158 views
How to Extract Code Blocks from Different Tabs in a Code Documentation Using Crawl4AI (or any other tool)?
I'm trying to scrape code blocks from multiple tabs in a documentation page using Crawl4AI. While I'm able to extract Markdown content, the code blocks inside tabbed sections are not being captured. ...