How to Extract a URL from a String in Python


Today we are going to learn how to find and extract website URLs from a string in Python using the built-in regular expression module, re. Given a string, we check whether it contains any URLs and, if it does, extract and print them.

First, we need a way to recognize a URL. For that we use a regular expression that covers the common forms a URL takes in running text: one that starts with http:// or https://, one that starts with www., or a bare domain name followed by a slash. The rest of the pattern allows balanced parentheses inside the URL and keeps trailing punctuation out of the match.

#regular expression to find URL in string in python

r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
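If all we need is a yes/no answer about whether a string contains a URL, a minimal sketch could use re.search() with the same pattern, written here with its literal parentheses and brackets escaped. The names URL_PATTERN and contains_url are hypothetical and chosen only for this illustration.

```python
import re

# The article's pattern, split across lines for readability.
# Literal parentheses and brackets are escaped with backslashes.
URL_PATTERN = re.compile(
    r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"
    r"(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
)


def contains_url(text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return URL_PATTERN.search(text) is not None


print(contains_url("visit https://devenum.com today"))  # True
print(contains_url("no links in this sentence"))        # False
```

Compiling the pattern once with re.compile() avoids repeating the long literal wherever it is needed.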

We then scan our string with this regular expression using the findall() function from Python's re module. findall() returns every non-overlapping match it finds, so an empty result means the string contains no URL.
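One detail of findall() is worth seeing in isolation: when a pattern contains more than one capturing group, findall() returns a list of tuples, one tuple per match, rather than a list of strings. The short sketch below uses a deliberately tiny pattern (an assumption made for illustration, not the full URL pattern) to show why the first element of each tuple is the one we want.

```python
import re

# A small pattern with two capturing groups, used only for illustration.
# Group 1 is the whole URL, group 2 is just the scheme.
pattern = r"((https?)://\S+)"

text = "see http://example.com and https://example.org"

# With more than one group, findall() returns one tuple per match.
matches = re.findall(pattern, text)
print(matches)
# [('http://example.com', 'http'), ('https://example.org', 'https')]

# Element 0 of each tuple is the outermost group, i.e. the complete URL.
urls = [m[0] for m in matches]
print(urls)
# ['http://example.com', 'https://example.org']
```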

Let us begin the code.

Program to find URLs in an input string

We import the regular expression module and define a function that does the matching.

Code Example

#How to Extract URL from a string in Python?

import re

def URLsearch(stringinput):
    # regular expression that describes what a URL can look like
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

    # find every URL in the passed string; findall() returns one tuple of groups per match
    urlsrc = re.findall(regularex, stringinput)

    # the first group of each tuple is the complete URL
    return [url[0] for url in urlsrc]

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'

# using the function defined above
print("Urls found: ", URLsearch(textcontent))

Output:

Urls found:  ['https://devenum.com', 'https://google.com,http://devenum.com']
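In the output above, the two URLs separated only by a comma come back as a single match, because the pattern accepts a comma as a character inside a URL. If the input is known to join URLs this way, one possible post-processing step is to split each match at a comma that is immediately followed by a new http:// or https:// scheme. This is a sketch under that assumption; the helper name split_joined_urls is hypothetical.

```python
import re


def split_joined_urls(candidates):
    """Split matches such as 'https://a.com,http://b.com' into separate URLs.

    A comma is treated as a separator only when it is directly followed by
    http:// or https://, so commas inside a query string are left alone.
    """
    urls = []
    for candidate in candidates:
        urls.extend(re.split(r",(?=https?://)", candidate))
    return urls


found = ['https://devenum.com', 'https://google.com,http://devenum.com']
print(split_joined_urls(found))
# ['https://devenum.com', 'https://google.com', 'http://devenum.com']
```

The lookahead (?=https?://) checks what follows the comma without consuming it, so each URL keeps its scheme after the split.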

Find URLs in a string of HTML

In this code example we search for the URLs inside the <a> tags of an HTML paragraph, using the same regular expression as above.

import re

def URLsearch(stringinput):
    # regular expression that describes what a URL can look like
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

    # find every URL in the passed string; findall() returns one tuple of groups per match
    urlsrc = re.findall(regularex, stringinput)

    # the first group of each tuple is the complete URL, without the closing quote of the href attribute
    return [url[0] for url in urlsrc]

textcontent = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'

# using the function defined above
print("Urls found: ", URLsearch(textcontent))

Output:

Urls found:  ['https://www.google.com', 'https://www.devenum.com', 'http://www.devenum.com']
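When the input is known to be HTML and only the link targets matter, an alternative to the regular expression is to let an HTML parser read the href attributes directly. The sketch below swaps in Python's standard html.parser module for that purpose; the class name HrefCollector is hypothetical, and this is a different technique from the regex approach used in this article.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


html_text = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'

collector = HrefCollector()
collector.feed(html_text)
collector.close()
print(collector.hrefs)
# ['https://www.google.com', 'https://www.devenum.com', 'http://www.devenum.com']
```

Unlike the regular expression, this only finds URLs that appear as link targets; a URL written as plain text inside the paragraph would not be collected.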