Parsing and Processing URL using Python Regex

Last Updated : 4 Nov, 2025

Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (Regex) in Python. **For example:**

**Input:** https://www.geeksforgeeks.org/courses
**Output:**
Protocol: https
Hostname: geeksforgeeks.org

Let’s explore different methods to parse and process a URL in Python using Regex.

Using re.findall() to Extract Protocol and Hostname

The re.findall() method scans the entire string and returns all non-overlapping matches of the given pattern as a list. When the pattern contains a capturing group, the list holds the captured text rather than the whole match.

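```python
import re

s = 'https://www.geeksforgeeks.org/'

# Protocol: the word characters immediately before '://'
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname: word characters, hyphens, and dots after the literal 'www.'
h = re.findall(r'://www\.([\w\-.]+)', s)
print(h)
```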

Output

['https']
['geeksforgeeks.org']

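**Explanation:** The first pattern captures the run of word characters immediately before '://', which is the protocol. The second pattern matches the 'www.' prefix and then captures the word characters, hyphens, and dots that follow, so the hostname is returned without 'www.' and the match stops at the first '/'. Because each pattern has a single capturing group, re.findall() returns a list of the captured strings.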
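The hostname pattern in this section relies on a literal 'www.' prefix, so it finds nothing in URLs that lack one. As a minimal sketch, assuming the same result is wanted for hosts with and without that prefix, the prefix can be placed in an optional non-capturing group. The helper name host_without_www and the sample URLs are illustrative and not part of the original example.

```python
import re

# Hypothetical helper (not from the original example): returns the hostname
# with a leading 'www.' removed, or None when the string contains no '://'.
def host_without_www(url):
    # (?:www\.)? optionally consumes 'www.' without adding a capture group
    m = re.search(r'://(?:www\.)?([\w\-.]+)', url)
    return m.group(1) if m else None

print(host_without_www('https://www.geeksforgeeks.org/courses'))  # geeksforgeeks.org
print(host_without_www('https://docs.python.org/3/'))             # docs.python.org
print(host_without_www('not a url'))                              # None
```

re.search() is used here instead of re.findall() because only the first match is needed, and a failed match is easy to detect as None.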

Extracting an Optional Port Number

When URLs contain an optional port number, we can extend the regex to capture it using the '?' quantifier. This ensures that the port number is included only if present.

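```python
import re

s = 'file://localhost:4040/abc_file'

# Protocol: the word characters immediately before '://'
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname only: ':' is not in the character class, so the match stops before the port
h = re.findall(r'://([\w\-.]+)', s)
print(h)

# Hostname followed by an optional ':port' group
hp = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(hp)
```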

Output

['file']
['localhost']
[('localhost', ':4040', '4040')]

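**Explanation:** The hostname character class does not include ':', so the hostname match stops before the port. In the third pattern, the group followed by '?' makes the ':port' part optional. With more than one capturing group, re.findall() returns a list of tuples: here the hostname, the outer group ':4040', and the inner group '4040'. If the URL has no port, the last two entries are empty strings.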
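To show the "only if present" behaviour on both kinds of input, here is a small sketch that runs one pattern against a URL with a port and a URL without one. It assumes a non-capturing group for the ':' so that only the digits are captured. The helper name host_and_port and the second sample URL are illustrative.

```python
import re

# Hypothetical helper: returns (host, port), where port is an int or None,
# or None when the string contains no '://'.
def host_and_port(url):
    # (?::(\d+))? is an optional, non-capturing ':port'; only the digits are captured
    m = re.search(r'://([\w\-.]+)(?::(\d+))?', url)
    if m is None:
        return None
    host, port = m.group(1), m.group(2)
    # With a match object, a group that did not participate is None
    return (host, int(port) if port is not None else None)

print(host_and_port('file://localhost:4040/abc_file'))       # ('localhost', 4040)
print(host_and_port('https://www.example.com/index.html'))   # ('www.example.com', None)
```

Note the difference from re.findall(): an optional group that did not match appears as an empty string in the tuples returned by re.findall(), but as None on a match object.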

Extracting Multiple Components with One Pattern

A single pattern with several capturing groups can extract the protocol, hostname, file name, and file extension together. It’s useful for structured URLs where each part follows a predictable pattern.

```python
import re

s = 'http://www.example.com/index.html'

# Groups: protocol, hostname, file name, file extension
res = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(res)
```

Output

[('http', 'www.example.com', 'index', 'html')]

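**Explanation:** The pattern has four capturing groups, so re.findall() returns a list with one tuple per match: the protocol before '://', the hostname up to the first '/', the file name, and the extension after the dot. The pattern expects a single file name with an extension directly after the hostname.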
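The pieces from the sections above (protocol, hostname, optional port, path) can be combined into one pattern. The following is a minimal sketch, assuming named groups are preferred over positional tuples. The group names, the parse_url helper, and the sample URL are choices made for this illustration, and the pattern covers only the simple URL shapes used in this article.

```python
import re

# Illustrative pattern; the group names are assumptions chosen for this sketch.
URL_RE = re.compile(
    r'(?P<protocol>\w+)://'     # scheme before '://'
    r'(?P<host>[\w\-.]+)'       # hostname: word characters, hyphens, dots
    r'(?::(?P<port>\d+))?'      # optional ':port', digits only are captured
    r'(?P<path>/[^?#\s]*)?'     # optional path, up to '?', '#', or whitespace
)

# Hypothetical helper: returns a dict of the named groups, or None on no match.
def parse_url(url):
    m = URL_RE.match(url)
    return m.groupdict() if m else None

parts = parse_url('http://www.example.com:8080/docs/index.html')
print(parts['protocol'])  # http
print(parts['host'])      # www.example.com
print(parts['port'])      # 8080
print(parts['path'])      # /docs/index.html
```

Named groups keep the result readable when some parts are optional: a missing port or path shows up as None under its own key instead of shifting positions in a tuple.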