Web Scraping using lxml and XPath in Python

Prerequisites: Introduction to Web Scraping

In this article, we discuss how to use the lxml Python library to scrape data from a webpage. lxml is built on top of libxml2, an XML parsing library written in C. Compared with other Python tools commonly used for web scraping, such as BeautifulSoup and Selenium, lxml has an advantage in performance: it reads and writes even large XML files quickly, which makes data processing easier and much faster.

We will use the lxml library for web scraping and the requests library for making HTTP requests in Python. Both can be installed from the command line with pip, the package installer for Python (pip install lxml requests).

Getting data from an element on a webpage with lxml requires the use of XPath expressions.

Using XPath

XPath works very much like paths in a traditional file system.

Diagram of a File System

To access file 1,

C:/File1

Similarly, to access file 2,

C:/Documents/User1/File2

Now consider a simple web page,

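```html
<html>
<head>
   <title>My page</title>
</head>
<body>
   <h2>Welcome to my page</h2>
   <a href="www.example.com">page</a>
   <p>This is the first paragraph</p>
   <h2>Hello World</h2>
</body>
</html>
```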

This can be represented as an XML Tree as follows,

XML Tree of the Webpage

For getting the text inside the <p> tag,

XPath : html/body/p/text()

Result : This is the first paragraph

For getting the value of the href attribute of the anchor (<a>) tag,

XPath : html/body/a/@href

Result: www.example.com

For getting the value inside the second <h2> tag,

XPath : html/body/h2[2]/text()

Result: Hello World
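The three expressions above can be checked with lxml itself. The following is a minimal sketch, assuming the sample page has the markup implied by the results above; the name SAMPLE_PAGE and the exact tag layout are illustrative assumptions. The paths are written with a leading slash so that they are evaluated from the document root.

```python
from lxml import html

# Assumed markup for the sample page, reconstructed from the
# XPath results listed above (SAMPLE_PAGE is a hypothetical name).
SAMPLE_PAGE = """
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my page</h2>
    <a href="www.example.com">page</a>
    <p>This is the first paragraph</p>
    <h2>Hello World</h2>
  </body>
</html>
"""

# Parse the markup into an element tree
tree = html.fromstring(SAMPLE_PAGE)

# Text inside the <p> tag
paragraph_text = tree.xpath('/html/body/p/text()')

# Value of the href attribute of the <a> tag
link_target = tree.xpath('/html/body/a/@href')

# Text inside the second <h2> tag (XPath positions start at 1)
second_heading = tree.xpath('/html/body/h2[2]/text()')

print(paragraph_text)
print(link_target)
print(second_heading)
```

Note that xpath() always returns a list, even when only one node matches, so a single value has to be taken out by index.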

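To find the XPath for a particular element on a page, open the page in a browser such as Chrome, right-click the element and choose Inspect, then right-click the highlighted node in the developer tools and choose Copy, followed by Copy XPath.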

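Using lxml

Step-by-step Approach

Request the page with requests.get, parse the response with html.fromstring, then query the resulting tree with its xpath method to get the required elements.

Example 1:

Below is a program based on the above approach; it collects the buyer names from a sample page.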

```python
# Import required modules
from lxml import html
import requests

# Request the page
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')

# Parse the page
# (page.content, which is bytes, is used rather than
# page.text so that lxml can work out the encoding
# of the document itself.)
tree = html.fromstring(page.content)

# Get elements using XPath
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)
```

Output: a list containing the text of every <div> element whose title attribute is buyer-name.
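Since the result depends on the live page, the same query can also be tried offline. The following is a minimal sketch, assuming a small hand-written snippet; the names in SNIPPET are invented for illustration and are not taken from the real page.

```python
from lxml import html

# Hypothetical markup: the names below are made up for this sketch
# and do not come from the live page.
SNIPPET = b"""
<div>
  <div title="buyer-name">Alice Example</div>
  <div title="buyer-name">Bob Example</div>
  <div title="seller-name">Not A Buyer</div>
</div>
"""

# Parse the bytes into an element tree
tree = html.fromstring(SNIPPET)

# Select the text of every <div> whose title attribute is exactly "buyer-name"
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)
```

The attribute test in square brackets filters out the third <div>, whose title is different, so only the two buyer entries are returned.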

Example 2:

Another example, this time for an e-commerce test website.

```python
# Import required modules
from lxml import html
import requests

# Request the page
page = requests.get('https://webscraper.io/test-sites/e-commerce/allinone')

# Parse the page
tree = html.fromstring(page.content)

# Get elements using XPath
prices = tree.xpath(
    '//div[@class="col-sm-4 col-lg-4 col-md-4"]/div/div[1]/h4[1]/text()')
print(prices)
```
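The XPath in this example is positional, so it only matches markup with a particular shape. The following is a minimal sketch of that shape, assuming a hand-written snippet: the inner class names (row, thumbnail, caption, ratings), the product names and the prices are all invented for illustration and are not taken from the live site.

```python
from lxml import html

# Hypothetical markup that mirrors the nesting the XPath expects:
# column div -> div -> first div -> first h4.
SNIPPET = """
<div class="row">
  <div class="col-sm-4 col-lg-4 col-md-4">
    <div class="thumbnail">
      <div class="caption">
        <h4>$100.00</h4>
        <h4>Sample laptop</h4>
      </div>
      <div class="ratings"><p>3 reviews</p></div>
    </div>
  </div>
  <div class="col-sm-4 col-lg-4 col-md-4">
    <div class="thumbnail">
      <div class="caption">
        <h4>$250.50</h4>
        <h4>Sample tablet</h4>
      </div>
    </div>
  </div>
</div>
"""

tree = html.fromstring(SNIPPET)

# Same expression as in Example 2: the first <h4> inside the first
# nested <div> of each product column.
prices = tree.xpath(
    '//div[@class="col-sm-4 col-lg-4 col-md-4"]/div/div[1]/h4[1]/text()')
print(prices)

# @class="..." compares the whole attribute string, so the same
# classes written in a different order do not match.
reordered = tree.xpath(
    '//div[@class="col-lg-4 col-sm-4 col-md-4"]/div/div[1]/h4[1]/text()')
print(reordered)
```

Because the match depends on both the exact class string and the nesting depth, an empty list from the live page usually means the site's layout has changed and the expression needs to be updated.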