This script crawls a website and collects links from its webpages that match a specified regex pattern. It is useful for extracting links from websites for purposes such as data scraping or analysis.
Before running the script, make sure you have the following installed:
- Python 3.x
- argparse library
- requests library
- re module
- os module
- sys module
- base64 module
- urllib.parse module
- bs4 (BeautifulSoup) library
- shutil module
You can install the required dependencies using pip:
```
pip install requests bs4
```

(The remaining modules listed above, including argparse, are part of the Python 3 standard library and need no installation.)

To use the script, follow these steps:
1. Clone or download the script file to your local machine.

2. Open a terminal or command prompt.

3. Navigate to the directory where the script is located.

4. Run the following command:

   ```
   python link_crawler.py -u <url> -p <pattern> [-d] [-c]
   ```
   Replace `<url>` with the URL of the website you want to crawl, and `<pattern>` with the regex pattern the links should match.

   Optional flags:

   - `-d` or `--domain`: Include the website domain in internal links. By default, the domain name is stripped from internal links before the pattern is matched (see the sketch after these steps).
   - `-c` or `--clear-directory`: Clear the output directory if it already exists for this command. By default, if the command is run again with a duplicate pattern and domain, the search is not performed.
5. The script will start crawling the website, collecting links from its webpages, and display the results:

   - If links matching the regex pattern are found, the script saves them to a `links.txt` file in the corresponding directory.
   - If no links are found, the script displays a message saying so.
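For reference, here is a minimal sketch of the default domain-stripping behavior controlled by `-d`/`--domain`, assuming the script normalizes internal links with `urllib.parse` before matching. The function names and exact normalization rules below are illustrative, not the script's actual internals:

```python
# Illustrative sketch only: link_crawler.py's real internals may differ.
import re
from urllib.parse import urlparse

def normalize_link(link, base_url, keep_domain=False):
    """Strip the site's own domain from internal links unless -d/--domain is set."""
    parsed = urlparse(link)
    if not keep_domain and parsed.netloc == urlparse(base_url).netloc:
        # Internal link: keep only the path (and query string, if any).
        return parsed.path + ("?" + parsed.query if parsed.query else "")
    return link

def link_matches(link, pattern, base_url, keep_domain):
    """Apply the user's regex pattern after normalization."""
    return re.search(pattern, normalize_link(link, base_url, keep_domain)) is not None

# Example: normalize_link("https://example.com/docs/a.html", "https://example.com")
# returns "/docs/a.html", so a pattern like "^/docs/.*" matches internal links.
```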
Note: The script crawls webpages within the specified website by following links found in HTML tags such as `<a>`, `<link>`, `<script>`, `<base>`, `<form>`, and any other tag that can contain a link. It reads the `href`, `src`, and `data-src` attributes of these tags to extract the links.

Note: The script also picks up links that appear anywhere in the page source, even outside tag attributes.
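The extraction step described in these notes can be pictured with a short sketch, assuming the script uses `requests` and `BeautifulSoup` as its dependency list suggests; the catch-all regex for bare URLs is an assumption, not the script's actual expression:

```python
# Sketch of the extraction described in the notes; details are assumptions.
import re
import requests
from bs4 import BeautifulSoup

BARE_URL_RE = re.compile(r"https?://[^\s\"'<>]+")  # rough stand-in pattern

def extract_links(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    # 1. Read href/src/data-src from every tag that carries one.
    for tag in soup.find_all(True):
        for attr in ("href", "src", "data-src"):
            if tag.get(attr):
                links.add(tag[attr])
    # 2. Also pick up bare URLs anywhere in the page source,
    #    even outside tag attributes.
    links.update(BARE_URL_RE.findall(html))
    return links
```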
Here are a few examples of how you can use the script:
1. Crawl a website and collect all links from its webpages:

   ```
   python link_crawler.py -u https://example.com -p ".*"
   ```

   This will crawl the `example.com` website, collect all links from its webpages, and save them to `links.txt` in the `data/<host>/<pattern>/` directory (a sketch of this layout follows the examples).

2. Crawl a website and collect only specific links matching a pattern:

   ```
   python link_crawler.py -u https://example.com -p "https://example.com/downloads/.*"
   ```

   This will crawl the `example.com` website and collect only the links that match the pattern `https://example.com/downloads/.*`.

3. Crawl a website while keeping the domain in internal links:

   ```
   python link_crawler.py -u https://example.com -p ".*" -d
   ```

   This will crawl the `example.com` website, collect all links from its webpages with the domain included in internal links, and save them to `links.txt`.

4. Clear the directory and crawl the website to collect fresh links:

   ```
   python link_crawler.py -u https://example.com -p ".*" -c
   ```

   This will clear the existing directory (if any) for the specified command and crawl the `example.com` website to collect fresh links.
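To make the `data/<host>/<pattern>/` layout from the first example concrete, here is a hedged sketch of how the output path might be assembled. How the script actually encodes the pattern into a directory name is not documented in this README (the listed `base64` dependency hints it may base64-encode it), so the sanitization below is purely an assumption:

```python
# Assumption-laden sketch of the output layout; not the script's real code.
import os
from urllib.parse import urlparse

def output_file(url, pattern):
    host = urlparse(url).netloc            # e.g. "example.com"
    safe = pattern.replace("/", "_")       # crude stand-in for real sanitization
    directory = os.path.join("data", host, safe)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, "links.txt")

# Example: output_file("https://example.com", ".*") -> "data/example.com/.*/links.txt"
```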
This script is licensed under the MIT License. Feel free to modify and use it according to your needs.