iptracker
answered May 1 '23 00:00
To get a list of links (the actual URLs) from an entire website using Python, you can use the requests and beautifulsoup4 libraries (installable with pip install requests beautifulsoup4).
Here's some example code that you can use:
import requests
from bs4 import BeautifulSoup

def get_links(url):
    # Fetch the page and parse its HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    # Collect every absolute URL found in an <a> tag
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and href.startswith('http'):
            links.add(href)
    return links
def get_all_links(url):
    all_links = set()
    visited = set()
    to_visit = {url}
    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        links = get_links(url)
        all_links |= links
        # Queue only links that have not been crawled yet
        to_visit |= links - visited
    return all_links
You can call the get_all_links() function with the starting URL as the argument, and it will return a set of all the links (the actual URLs) that it finds on the website.
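For example, a quick way to try it out (using https://example.com here purely as a placeholder starting URL) might be:

if __name__ == '__main__':
    links = get_all_links('https://example.com')
    for link in sorted(links):
        print(link)
    print(f'Found {len(links)} links in total')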
Note that the href.startswith('http') condition only collects absolute links, and it matches both http:// and https:// URLs; relative links such as /about are skipped. If you also want to follow relative links, you can resolve them against the page URL with urllib.parse.urljoin() before adding them. Also note that this code does not restrict the crawl to the starting domain, so it will follow links to external sites and can keep going for a very long time on a real website. If you only want links within the same domain as the starting URL, add a domain check before queueing a link, as shown in the sketch below.
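A minimal sketch of such a domain filter, reusing the get_links() function above in a hypothetical get_all_links_same_domain() variant, could look like this:

from urllib.parse import urlparse

def get_all_links_same_domain(start_url):
    # Only follow links whose hostname matches the starting URL's hostname
    domain = urlparse(start_url).netloc
    all_links = set()
    visited = set()
    to_visit = {start_url}
    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        links = get_links(url)
        all_links |= links
        # Queue only unvisited links on the same domain
        to_visit |= {link for link in links - visited
                     if urlparse(link).netloc == domain}
    return all_links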