Web scraping from Wikipedia using Python

Web scraping is the process of automatically extracting data from websites. It enables programmers to collect structured information from web pages without manual effort. In this article, we demonstrate how to scrape data from the Wikipedia homepage using Python and discuss the techniques and libraries commonly used for efficient web scraping.

Process of Web Scraping

Python Libraries Required

The following Python libraries are commonly used to fetch, parse, and automate web pages during the web scraping process:

  1. Requests: The foundation of web scraping in Python. It lets you send HTTP requests such as GET and POST to download web pages quickly and reliably.
  2. lxml: A high-performance library for parsing HTML and XML. It is extremely fast and well suited to large-scale scraping tasks.
  3. BeautifulSoup: A beginner-friendly HTML parser that creates a parse tree for easy data extraction. It works seamlessly with content fetched using Requests.
  4. Selenium: Automates a real browser, enabling scraping of dynamic, JavaScript-loaded websites. It is slower than other tools and less suitable for large-scale scraping.
  5. Scrapy: A full-featured, asynchronous web scraping framework built for speed and scalability. It is ideal for large crawling projects, providing pipelines and advanced data handling capabilities.

Python Web Scraping on Wikipedia

The following diagram shows the basic workflow followed when scraping data from Wikipedia using Python:

Steps in Web Scraping

The workflow shows how a web page is loaded, parsed, and processed to extract useful information, which is then converted into a structured form.
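The load, parse, extract, and structure stages can be sketched end to end on a small inline page. This is a minimal illustration only: it swaps in Python's standard-library `html.parser` and `csv` modules so that it runs before any third-party library is installed, and the sample HTML, the `LinkCollector` class, and the `html_to_csv` function are hypothetical names introduced here.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="/wiki/Python">Python</a></li>
    <li><a href="/wiki/HTML">HTML</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the text and href of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.rows.append({"title": title, "href": self._href})
            self._href = None


def html_to_csv(html):
    """Parse the HTML, extract the links, and return them as CSV text."""
    parser = LinkCollector()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "href"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


print(html_to_csv(SAMPLE_HTML))
```

The steps that follow replace the hand-written parser with Requests and BeautifulSoup, which handle real pages with far less code.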

Step 1: Setting Up Python for Web Scraping

Before scraping a website, we must set up our Python environment.

**Requirements:** Python 3 and pip.

Using virtualenv, we can install packages inside an isolated environment without affecting global system installations.

Install Required Libraries

pip install virtualenv
python -m pip install selenium
python -m pip install requests
python -m pip install urllib3
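After installing, it can help to confirm that the libraries are importable from the active environment. The sketch below is one way to do that using only the standard library; the helper name `missing_modules` is hypothetical and not part of any of the installed packages.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Import names (not pip package names) for the libraries installed above.
required = ["selenium", "requests", "urllib3"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If a module is reported missing, check that the virtual environment is activated and that pip belongs to the same Python interpreter.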

Step 2: Fetching Web Pages Using Requests

**Requirements:** the requests library, installed in Step 1.

This code sends a request to Wikipedia and downloads the HTML of the page.

```python
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"
headers = {"User-Agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
print(page.status_code)  # 200 means the request succeeded
print(page.content)      # raw HTML of the page, as bytes
```

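**Output:** the HTTP status code (200 when the request succeeds), followed by the raw HTML of the page as bytes.

**Explanation:**

  1. requests.get() sends an HTTP GET request to the URL with the given headers.
  2. page.status_code holds the HTTP status code of the response.
  3. page.content holds the response body, here the page's HTML, as bytes.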
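In practice a request can hang or return an error page, so a slightly more defensive fetch is often useful. The following is a minimal sketch, not the article's method: the function name `fetch_html`, the optional `session` parameter, the 10-second default timeout, and the User-Agent string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the decoded HTML of `url`, raising on HTTP error statuses."""
    # Placeholder User-Agent; replace it with one that identifies your script.
    headers = {"User-Agent": "example-scraper/0.1 (contact: you@example.com)"}
    # Use the given session if there is one, else the module-level functions.
    client = session if session is not None else requests
    response = client.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

A call such as `fetch_html("https://en.wikipedia.org/wiki/Main_Page")` returns the page as a string. Passing a `requests.Session()` as `session` reuses the connection across several requests.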

Step 3: Parsing HTML Using BeautifulSoup

Install BeautifulSoup with the following command:

pip install bs4

This code converts the downloaded HTML into a structured format that Python can read.

```python
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"
headers = {"User-Agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
# Build a parse tree from the downloaded HTML using Python's built-in parser.
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())  # indented view of the parsed HTML
```

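**Output:** the HTML of the page, re-indented so that nested tags are easy to read.

**Explanation:**

  1. BeautifulSoup(page.content, "html.parser") parses the downloaded HTML into a tree of Python objects.
  2. soup.prettify() returns the parsed document as an indented string.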
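Once the HTML is parsed, elements in the tree can be reached by tag name and their attributes read like dictionary entries. The sketch below shows this on a small inline document so that it runs without a network connection; the sample HTML and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Welcome</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string     # text inside <title>
heading = soup.find("h1")          # first <h1> element in the document
heading_id = heading["id"]         # attributes are read like dict keys
heading_text = heading.get_text()  # text content without the tags

print(page_title, heading_id, heading_text)
```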

Step 4: Extracting Data with BeautifulSoup

This shows how to use BeautifulSoup to parse a webpage and extract all <p> tags, then retrieve the text content of the first paragraph.

```python
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"
headers = {"User-Agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

paragraphs = soup.find_all("p")  # list of every <p> element
print(paragraphs)
print("\n\n")
print(paragraphs[0].get_text())  # text of the first paragraph, tags removed
```

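**Output:** the list of all <p> elements on the page, followed by the plain text of the first one.

**Explanation:**

  1. soup.find_all("p") returns a list of every <p> element in the document.
  2. Indexing with [0] selects the first paragraph, and get_text() returns its text with the HTML tags removed.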
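On real pages the first <p> element is sometimes empty or contains only whitespace, so it can be useful to filter paragraphs before picking one. The following is a minimal sketch under that assumption; the helper `paragraph_texts` and the inline sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the first paragraph contains only whitespace.
html = """
<div>
  <p>   </p>
  <p>Python is a <b>programming</b> language.</p>
  <p>It is widely used for web scraping.</p>
</div>
"""


def paragraph_texts(markup):
    """Return the text of every non-empty <p> element in `markup`."""
    soup = BeautifulSoup(markup, "html.parser")
    # strip=True trims each text fragment; " " joins fragments split by tags.
    texts = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return [text for text in texts if text]


for text in paragraph_texts(html):
    print(text)
```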

Step 5: Exploring Page Structure Using Chrome DevTools

Before scraping, analyzing the structure of the webpage is essential.

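How to Inspect Elements:

  1. Open the page in Chrome.
  2. Right-click the element you want to extract and choose Inspect, or press Ctrl+Shift+I (Cmd+Option+I on macOS).
  3. In the Elements panel, note the tag name, id, and class of the highlighted element.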

This helps us locate the exact tag or class to extract.

This code extracts a specific section of the Wikipedia page using its id and class.

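```python
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Main_Page"
headers = {"User-Agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# find() returns None if the page has no element with this id.
container = soup.find(id="mp-left")
items = container.find_all(class_="mp-h2") if container is not None else []
if items:
    print(items[0].prettify())  # first element with class "mp-h2"
else:
    print("No matching elements found; check the id and class in DevTools.")
```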
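The same id-and-class lookup can also be written with CSS selectors, which mirror what DevTools displays. The sketch below uses BeautifulSoup's `select` and `select_one` on inline markup; the HTML only mimics the id and class names used above and is not Wikipedia's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the id/class layout used in the example.
html = """
<div id="mp-left">
  <h2 class="mp-h2">Featured article</h2>
  <h2 class="mp-h2">Did you know</h2>
</div>
<div id="mp-right">
  <h2 class="mp-h2">In the news</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#mp-left .mp-h2" matches elements with class mp-h2 inside the element
# whose id is mp-left.
headings = [h.get_text(strip=True) for h in soup.select("#mp-left .mp-h2")]

# select_one returns None when nothing matches, so check before using it.
missing = soup.select_one("#does-not-exist")

print(headings)
print(missing is None)
```

Selectors keep the lookup in one expression, while chained `find` and `find_all` calls make it easier to check each step for a missing element.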