XML parsing in Python

Last Updated : 28 Jun, 2022

This article focuses on how one can parse a given XML file and extract some useful data out of it in a structured way.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data, and to be both human- and machine-readable. That is why the design goals of XML emphasize simplicity, generality, and usability across the Internet.
The XML file to be parsed in this tutorial is actually an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information like blog entries, news headlines, audio, and video. An RSS feed is XML-formatted plain text.

Python module used: This article uses the built-in xml module in Python for parsing XML, and focuses on the ElementTree XML API (xml.etree.ElementTree) of this module.
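Before diving in, here is a minimal, self-contained sketch (using a made-up inline XML string, not the feed from this article) of how ElementTree turns XML text into a tree of elements:

import xml.etree.ElementTree as ET

# a tiny hypothetical XML document, inlined for illustration
xml_data = '''<catalog>
    <book id="1"><title>Python 101</title></book>
    <book id="2"><title>XML Basics</title></book>
</catalog>'''

# fromstring() parses the text and returns the root element directly
root = ET.fromstring(xml_data)
print(root.tag)  # catalog

# findall() returns all matching sub-elements
for book in root.findall('book'):
    # every element exposes .tag, .attrib (a dict) and .text
    print(book.attrib['id'], book.find('title').text)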

Implementation:

import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of the RSS feed fetched in this article (Hindustan Times top news)
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from the given url
    resp = requests.get(url)

    # saving the xml file locally
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)

    # get root element
    root = tree.getroot()

    # create empty list for news items
    newsitems = []

    # iterate over news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate over child elements of item
        for child in item:
            # special check for the namespaced media:content tag
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # note: on Python 3, child.text is already a str;
                # encode() turns it into bytes
                news[child.tag] = child.text.encode('utf8')

        # append news dictionary to news items list
        newsitems.append(news)

    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for the csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

    # writing to the csv file (newline='' avoids blank rows on Windows)
    with open(filename, 'w', newline='') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss feed from the web into a local xml file
    loadRSS()

    # parse the xml file into a list of news dictionaries
    newsitems = parseXML('topnewsfeed.xml')

    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == "__main__":
    main()

The above code will: download the RSS feed and save it locally as topnewsfeed.xml; parse the XML file into a list of dictionaries, one per news item; and save the news items into a CSV file named topnews.csv.

Let us try to understand the code in pieces:

Here, we first create an HTTP response object by sending an HTTP request to the URL of the RSS feed. The content of the response now contains the XML file data, which we save as topnewsfeed.xml in our local directory.
For more insight into how the requests module works, follow this article:
GET and POST requests using Python
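As a side note (this is not part of the original script), requests makes it easy to fail loudly on a bad response. A minimal variant of loadRSS() with basic error handling might look like this, where loadRSS_safe is a hypothetical name:

import requests

def loadRSS_safe(url, filename):
    # hypothetical hardened variant: time out after 10 seconds
    # instead of hanging indefinitely
    resp = requests.get(url, timeout=10)

    # raise an HTTPError for 4xx/5xx responses
    resp.raise_for_status()

    # save the feed contents as a local xml file
    with open(filename, 'wb') as f:
        f.write(resp.content)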

Now, we know that we are iterating through item elements, where each item element contains one news story. So, we create an empty news dictionary in which we will store all the data available about that news item. To iterate through each child element of an element, we simply loop over it, like this:
for child in item:
Now, notice a sample item element from the feed: it contains child tags such as title, description, link, guid and pubDate, plus a namespaced media:content tag that carries the URL of the story image.
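To make that concrete, here is a small self-contained sketch (with a made-up, heavily simplified item element) showing what iterating over an item's children yields:

import xml.etree.ElementTree as ET

# hypothetical item element, simplified from a typical RSS feed
item = ET.fromstring(
    '<item>'
    '<title>Sample headline</title>'
    '<link>http://example.com/story</link>'
    '<pubDate>Thu, 12 Jan 2017 12:33:04 GMT</pubDate>'
    '</item>'
)

for child in item:
    # each child exposes its tag name and its text content
    print(child.tag, '->', child.text)

# prints:
# title -> Sample headline
# link -> http://example.com/story
# pubDate -> Thu, 12 Jan 2017 12:33:04 GMT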
We will have to handle namespace tags separately, as they get expanded to their full URI form when parsed. So, we do something like this:
if child.tag == '{http://search.yahoo.com/mrss/}content':
news['media'] = child.attrib['url']
child.attrib is a dictionary of all the attributes of an element. Here, we are interested in the url attribute of the media:content namespace tag.
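Equivalently, instead of hard-coding the expanded tag, ElementTree lets you pass a prefix-to-URI mapping to find(). A self-contained sketch (again with a made-up item element):

import xml.etree.ElementTree as ET

# hypothetical item carrying a namespaced media:content child
item = ET.fromstring(
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    '<media:content url="http://example.com/image.jpg"/>'
    '</item>'
)

# map a prefix of our choosing to the namespace URI
ns = {'media': 'http://search.yahoo.com/mrss/'}
content = item.find('media:content', ns)
if content is not None:
    print(content.attrib['url'])  # http://example.com/image.jpg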
Now, for all other children, we simply do:
news[child.tag] = child.text.encode('utf8')
child.tag contains the name of the child element, and child.text stores all the text inside that child element. (Note that on Python 3, child.text is already a str, so encode('utf8') turns it into bytes; you can store child.text directly if you prefer.) So, finally, a sample item element is converted to a dictionary and looks like this:
{'description': 'Ignis has a tough competition already, from Hyun.... ,
'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/... ,
'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ',
'title': 'Maruti Ignis launches on Jan 13: Five cars that threa..... }
Then, we simply append this dictionary to the list newsitems.
Finally, this list is returned.

So now, here is how our formatted data looks:
[screenshot: topnews.csv opened as a table]

As you can see, the hierarchical XML file data has been converted to a simple CSV file, so that all news stories are stored in the form of a table. This also makes it easier to load the data into a database or extend it further.
Alternatively, one can use the JSON-like list of dictionaries directly in an application. This is a good alternative for extracting data from websites that do not provide a public API but do provide RSS feeds.
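For example, a sketch of serializing the parsed items with the standard json module could look like this (assuming newsitems is the list returned by parseXML(); the bytes values produced by encode('utf8') are decoded back to str first, since json cannot serialize bytes):

import json

newsitems = parseXML('topnewsfeed.xml')

# decode any bytes values to str so json can serialize them
cleaned = [
    {k: v.decode('utf8') if isinstance(v, bytes) else v for k, v in news.items()}
    for news in newsitems
]

with open('topnews.json', 'w') as f:
    json.dump(cleaned, f, indent=2)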

All the code and files used in this article can be found here.
