Parsing and Processing URL using Python Regex

Last Updated : 4 Nov, 2025

Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (regex) in Python. **For example:**

**Input:** https://www.geeksforgeeks.org/courses
**Output:**
Protocol: https
Hostname: geeksforgeeks.org

Let’s explore different methods to parse and process a URL in Python using Regex.

Using re.findall() to Extract Protocol and Hostname

The re.findall() method returns all non-overlapping matches of a pattern as a list. It scans the entire string from left to right and collects every substring that matches. When the pattern contains a capture group, the list holds the captured text rather than the whole match.

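```python
import re

s = 'https://www.geeksforgeeks.org/'

# Protocol: the word characters immediately before "://"
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname: word characters, hyphens, or dots after the literal "://www."
h = re.findall(r'://www\.([\w\-.]+)', s)
print(h)
```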

Output

```
['https']
['geeksforgeeks.org']
```

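**Explanation:**

**(\w+)://** captures the protocol, that is, the word characters immediately before **://**.
**://www\.([\w\-.]+)** skips the literal www. prefix and captures the hostname, which may contain letters, digits, underscores, dots, or hyphens.
re.findall() returns the captured parts as a list. For a URL without the www. prefix, the second pattern finds nothing and returns an empty list.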
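Because the hostname pattern above requires a literal www. prefix, it returns nothing for a URL such as https://docs.python.org/. The following is a minimal sketch of one way to make the prefix optional, using a non-capturing group. The helper name protocol_and_host and the second test URL are assumptions made for illustration and are not part of the original example.

```python
import re

# Hypothetical helper (not from the article): returns (protocol, hostname),
# or None when the string contains no "scheme://host" part.
def protocol_and_host(url):
    # (?:www\.)? matches an optional "www." prefix without capturing it,
    # so the hostname group never includes that prefix.
    match = re.search(r'(\w+)://(?:www\.)?([\w\-.]+)', url)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(protocol_and_host('https://www.geeksforgeeks.org/courses'))
# ('https', 'geeksforgeeks.org')
print(protocol_and_host('https://docs.python.org/3/'))
# ('https', 'docs.python.org')
```

re.search() is used here instead of re.findall() because only the first match is needed and a match object makes the missing-URL case explicit.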

Extracting the Hostname and an Optional Port Number

When a URL may contain a port number, we can extend the regex with an optional group, written with the '?' quantifier. The port is then captured when it is present, and the pattern still matches when it is absent.

```python
import re

s = 'file://localhost:4040/abc_file'

# Protocol
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname only
h = re.findall(r'://([\w\-.]+)', s)
print(h)

# Hostname followed by an optional ":port" suffix
hp = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(hp)
```

Output

```
['file']
['localhost']
[('localhost', ':4040', '4040')]
```

**Explanation:**

**(\w+)://** captures the protocol (file) and **://([\w\-.]+)** captures the hostname (localhost).
**(:(\d+))?** captures the port number after a colon, if it exists.
**?** makes the port group optional, so the pattern works for URLs with or without a port. When the port is absent, re.findall() returns empty strings for both port groups.
Here, the tuple represents (hostname, :port_with_colon, port_number).
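The three-element tuple above includes the colon-prefixed port only because the outer port group is a capturing group. As a sketch of an alternative, the outer group can be written as a non-capturing group, (?: ... ), so that each tuple holds just the hostname and the port digits. The function name host_and_port and the second URL are assumptions made for illustration.

```python
import re

# Hypothetical helper (not from the article): returns a list of
# (hostname, port) tuples; port is '' when the URL has no port.
def host_and_port(url):
    # (?::(\d+))? is an optional, non-capturing wrapper around ":digits";
    # only the digits inside it are captured.
    return re.findall(r'://([\w\-.]+)(?::(\d+))?', url)

print(host_and_port('file://localhost:4040/abc_file'))
# [('localhost', '4040')]
print(host_and_port('http://www.example.com/index.html'))
# [('www.example.com', '')]
```

Note that the captured port is a string. Convert it with int() if a number is needed, after checking that it is not empty.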

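Extracting Protocol, Domain, File Name, and File Extension

A single pattern with several capture groups can extract the protocol, domain, file name, and file extension together. It is useful for structured URLs where each part follows a predictable pattern.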

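```python
import re

s = 'http://www.example.com/index.html'

# Groups: protocol, domain, file name, file extension
# The dot before the extension is escaped so it matches a literal "."
res = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(res)
```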

Output

```
[('http', 'www.example.com', 'index', 'html')]
```

**Explanation:**

**(\w+)://** captures the protocol (http) and **([\w\-.]+)** captures the domain (www.example.com).
The first **(\w+)** after the slash captures the file name (index), and the **(\w+)** after the dot captures the file extension (html).
re.findall() returns a list of tuples, one per match, each containing these four groups.
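The patterns above can be combined into one expression that returns the protocol, hostname, optional port, and path together. The following is a minimal sketch using named groups. The names URL_PATTERN and parse_url are hypothetical, and the pattern is deliberately simplified: it assumes a scheme://host[:port][/path] shape and ignores query strings, fragments, and user credentials.

```python
import re

# Hypothetical combined pattern (an illustration, not a full URL grammar).
URL_PATTERN = re.compile(
    r'(?P<protocol>\w+)://'      # scheme before "://"
    r'(?P<host>[\w\-.]+)'        # hostname: word chars, hyphens, dots
    r'(?::(?P<port>\d+))?'       # optional ":port", digits only captured
    r'(?P<path>/[^?#\s]*)?'      # optional path, up to "?", "#", or whitespace
)

def parse_url(url):
    """Return a dict of URL parts, or None if the string does not match."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    # Groups that did not take part in the match are reported as None.
    return match.groupdict()

info = parse_url('file://localhost:4040/abc_file')
print(info['protocol'], info['host'], info['port'], info['path'])
# file localhost 4040 /abc_file

info = parse_url('https://www.geeksforgeeks.org/courses')
print(info['host'], info['port'], info['path'])
# www.geeksforgeeks.org None /courses
```

Named groups make the result self-describing, so the caller does not need to remember the position of each part in a tuple.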