Today we are going to learn how to find and extract website URLs from a string in Python, using Python's regular expression module, re. Given a string, we want to check whether it contains a URL and, if it does, extract it and print it.
First, we need a way to recognize a URL. For that we will use a regular expression that describes the combinations of characters that can make up a URL: an http:// or https:// scheme, a www. prefix, or a bare domain followed by a slash, and then the rest of the address.
Here is the regular expression we will use:
# Regular expression to find a URL in a string in Python
r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
Next, we scan our string with this regular expression to check for URLs. To do that, we will use the findall() function from Python's re module, which returns every non-overlapping match in the string.
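One detail of findall() is worth seeing in isolation: when a pattern contains more than one capture group, findall() returns a tuple of groups for each match rather than a plain string. The sketch below illustrates this with a deliberately simplified pattern (simple_pattern is a hypothetical stand-in, not the full expression shown earlier), and shows why index 0 of each tuple holds the whole URL.

```python
import re

# Hypothetical, simplified pattern with two capture groups:
# group 1 is the whole URL, group 2 is only the scheme.
simple_pattern = r"((https?)://[^\s,]+)"

text = "docs at https://devenum.com and http://google.com today"

# With more than one group, findall() yields one tuple per match.
matches = re.findall(simple_pattern, text)
print(matches)

# Keep only group 1 (index 0 of each tuple), i.e. the full URL.
urls = [groups[0] for groups in matches]
print(urls)
```

The same indexing idea applies to any pattern whose outermost group wraps the entire match.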
Now let us write the program.
Program to find the URL from an input string
We import the regular expression module and define a function that holds the matching logic.
Code Example
# How to extract URLs from a string in Python
import re

def URLsearch(stringinput):
    # Regular expression describing what a URL looks like
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    # Find every URL in the passed string; each match is a tuple of groups
    urlsrc = re.findall(regularex, stringinput)
    # Return the full URL (the first group) of each match
    return [url[0] for url in urlsrc]

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'
# Using the function defined above
print("Urls found:", URLsearch(textcontent))
Output:
Urls found: ['https://devenum.com', 'https://google.com,http://devenum.com']
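In the output above, the last two URLs come back as a single item because they are separated only by a comma, and a comma is a legal character inside a URL. If your input is known to use commas as separators, one possible follow-up step is to split each match wherever a comma is immediately followed by a new scheme. This is a minimal sketch under that assumption; split_joined_urls is a hypothetical helper, and it would wrongly split a URL that carries another full URL after a comma in its query string.

```python
import re

def split_joined_urls(matches):
    """Split matches wherever a comma is immediately followed by a new scheme."""
    urls = []
    for match in matches:
        # The lookahead keeps "http://" or "https://" attached to the next URL.
        urls.extend(re.split(r",(?=https?://)", match))
    return urls

found = ['https://devenum.com', 'https://google.com,http://devenum.com']
print(split_joined_urls(found))
```

Matches that contain no such comma pass through unchanged, so the helper can be applied to the whole result list.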
Find URL in string of HTML format
In this example we search for URLs inside HTML <p> and <a> tags, using the regular expression defined above.
import re

def URLsearch(stringinput):
    # Regular expression describing what a URL looks like
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    # Find every URL in the passed string; each match is a tuple of groups
    urlsrc = re.findall(regularex, stringinput)
    # Return the full URL (the first group) of each match
    return [url[0] for url in urlsrc]

textcontent = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'
# Using the function defined above
print("Urls found:", URLsearch(textcontent))
Output:
Urls found: ['https://www.google.com', 'https://www.devenum.com', 'http://www.devenum.com']
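When the input is known to be HTML, an alternative to pattern matching over the raw markup is to read the href attributes directly with the standard library's html.parser module. Note that this swaps in a different technique from the regular expression used above; HrefCollector is a hypothetical class name, and the sketch only looks at <a> tags.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html_text = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'

collector = HrefCollector()
collector.feed(html_text)
print(collector.hrefs)
```

Because the parser hands back attribute values without their surrounding quotes, no extra cleanup of the collected links is needed in this case.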