Asked 1 Jul 2021

Kaden posted

Obtaining a list of links (the actual URL) from an entire website

iptracker answered May 1 '23

To get a list of links (the actual URLs) from an entire website using Python, you can use the requests and beautifulsoup4 libraries.
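
If you don't have these libraries installed yet, you can get both from PyPI with pip install requests beautifulsoup4.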

Here's some example code you can use:

import requests
from bs4 import BeautifulSoup

def get_links(url):
    # Download the page; a timeout keeps the crawler from hanging on slow hosts
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    # Collect the href of every anchor tag on the page
    for link in soup.find_all('a'):
        href = link.get('href')
        # Keep only absolute URLs; "http" also matches "https"
        if href and href.startswith('http'):
            links.add(href)
    return links

def get_all_links(url):
    all_links = set()
    to_visit = {url}
    while to_visit:
        url = to_visit.pop()
        all_links.add(url)  # mark this URL as visited before fetching it
        links = get_links(url)
        # Queue only the links we have not seen yet, so pages that
        # link back to each other don't cause an infinite loop
        to_visit |= links - all_links
    return all_links

You can call the get_all_links() function with the starting URL as the argument, and it will return a set of all the links (the actual URLs) that it finds on the website.
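
For example (example.com is just a placeholder for whatever site you want to crawl):

all_links = get_all_links('https://example.com/')
for link in sorted(all_links):
    print(link)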

Note that this code only collects absolute URLs that start with "http" (which also matches "https"), so relative links such as "/about" are skipped entirely. Also note that it does not restrict itself to the starting domain: any absolute link it finds, including links to external sites, will be queued and crawled, so on a real website the crawl can wander far beyond the site you started with. A common refinement is to resolve relative links with urllib.parse.urljoin and to keep only URLs whose hostname matches the starting URL's, as sketched below.
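
Here is a minimal sketch of that same-domain refinement; the helper name get_site_links and the urljoin/urlparse handling are my own additions, not part of the code above:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_site_links(url, domain):
    # Hypothetical variant of get_links() that resolves relative hrefs
    # and keeps only URLs whose hostname matches the given domain
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    for link in soup.find_all('a'):
        href = link.get('href')
        if not href:
            continue
        absolute = urljoin(url, href)  # resolve relative links against the page URL
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links

You would pass urlparse(start_url).netloc as the domain argument and swap this function in for get_links() inside get_all_links(), threading the domain through as an extra parameter.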