How to scrape data from a website


"Hey ChatGPT, write me a pun about web scraping."

"Why did the web scraper get kicked out of school? It kept skipping classes!" Get it? Like an HTML class.

Not bad, ChatGPT. It only took about 570 gigabytes of data scraped from the public internet and years of development to come up with that one.

ChatGPT, the large language model that powers it and similar AI systems were all trained on data collected through large-scale web scraping. This has caused many -- including authors and social media platforms -- to reexamine data rights and ownership, as AI models use data that those creators and platforms provided for free.

Although AI is the new context for it, web scraping is actually an old practice -- about as old as the web itself. Just like the web, scraping has grown significantly, both in scale and sophistication. Web scraping is the technique that has powered search engines since their inception.

Before Google was around, the Internet Archive scraped the web to archive it and continues to do so. As of 2023, the Wayback Machine -- the Internet Archive's record of the web -- has archived more than 833 billion webpages. Scraping has been, is and will continue to be a cornerstone of the internet for the foreseeable future.

What is web scraping?

Web scraping is also known as web harvesting and web data harvesting. It refers to the process of programmatically reading and analyzing content on the internet. There are three main steps to web scraping: requesting a webpage's content, parsing that content to extract the data of interest and outputting the data in a usable format.

Web scraping usually refers to extracting, parsing and outputting data from HTML code. Webpages typically comprise a combination of HTML, CSS and JavaScript code. A browser makes these elements human-readable. By right-clicking and inspecting a page, a user can see which on-page elements in the browser correspond to which lines of HTML code. This can be helpful in knowing what to scrape.

Although HTML is the code being parsed in web scraping, any type of data can be harvested.

Web scraping use cases

There are many common use cases for web scraping across industries, including search engine indexing, price and product monitoring, market research, news aggregation and gathering training data for AI models.

Is scraping the web illegal?

Scraping the web for publicly available data is legal within certain guidelines. Web scraping is not the same thing as stealing data. There is no U.S. federal law against web scraping, and many legitimate businesses rely on it to make money. Some platforms -- such as Reddit -- provide access to an application programming interface (API) that makes the platform easier to scrape. Until recently, Reddit offered this service for free.

However, there are cases when web scraping is illegal, such as when data is taken without permission. If web scraping violates a website's terms of service, the Computer Fraud and Abuse Act, data protection laws such as the General Data Protection Regulation or certain copyright laws, it can lead to legal action.

Many social platforms seek to limit access to their data to discourage scrapers. LinkedIn is one platform that actively discourages it.

One example of a conflict that arose from web scraping is the case of hiQ Labs v. LinkedIn. In 2017, data analytics company hiQ Labs sued LinkedIn after receiving a cease-and-desist letter demanding that it stop scraping LinkedIn's public profiles. The court granted hiQ a preliminary injunction, allowing it to continue scraping, emphasizing the public nature of the data and the potential anticompetitive effects of LinkedIn's actions.

LinkedIn uses deep learning to detect bot-like behavior and inform members that get flagged on how they can correct their behavior.

Some other ways that platforms discourage or prevent scraping include rate limiting, CAPTCHAs, blocking suspicious IP addresses, requiring users to log in before viewing content and publishing a robots.txt file.

Many websites have a robots.txt file. It's a plain text file, directed at web crawlers, that specifies how the site is allowed to be crawled and what's against the rules. It provides information such as allowed content, disallowed content, sitemaps to make crawling easier for search engines and crawl delays to tell bots how long to wait between consecutive requests.

The robots.txt file is typically located at the root of the website -- Bloomberg's file, at bloomberg.com/robots.txt, is a real-world example.
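A scraper can check these rules programmatically before crawling. The following is a minimal sketch using Python's built-in urllib.robotparser; the example URLs and the wildcard user agent are placeholders:

from urllib.robotparser import RobotFileParser

# Load and parse the robots.txt file served from the site root.
parser = RobotFileParser("https://www.bloomberg.com/robots.txt")
parser.read()

# Ask whether a generic crawler ("*") is allowed to fetch a given page.
url = "https://www.bloomberg.com/news/"
if parser.can_fetch("*", url):
    print(f"robots.txt allows crawling {url}")
else:
    print(f"robots.txt disallows crawling {url}")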

Ways to scrape a website

There are many ways to scrape a website, with varying levels of coding ability required.

No-code approaches include manually copying and pasting page content and using point-and-click scraping tools or browser extensions.

Approaches that involve some coding include using spreadsheet import functions or calling a platform's API, where one is available.

Methods that require more coding knowledge include writing custom scrapers with libraries such as requests and Beautiful Soup, as in the Python example that follows.

How to build a web scraper in Python

Following is an example of a simple scraper. This scraper extracts links to definition articles listed on the WhatIs.com homepage using the Python libraries requests and Beautiful Soup.

Step 1. Access WhatIs.com through code

This step shows how content gets scraped into the coding environment using the requests library. The code in Figure 1 pulls in the WhatIs.com source code and prints its first 1,000 characters. It isn't a necessary precursor to the next step, but it shows how the Python libraries imported at the top of the code pull data into the IDE.


Figure 1. The first lines of source code are pulled into the IDE.

Requests is a popular open source library for making HTTP requests. The response.text attribute holds the source code returned from the website.
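The code in the screenshot isn't reproduced as text, so the following is a minimal sketch of the same step; the homepage URL and variable names are assumptions and may differ from the original:

import requests
from bs4 import BeautifulSoup  # imported here for the parsing steps that follow

# Request the WhatIs.com homepage and keep the HTTP response.
response = requests.get("https://www.techtarget.com/whatis/")

# response.text holds the page's HTML source; print the first 1,000 characters.
print(response.text[:1000])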

Step 2. Extract URLs

HTML links are represented with the following format:

<a href="URL">Clickable Text or Content</a>

The WhatIs.com source code is full of them.


Figure 2. WhatIs.com source code, featuring links to different glossary pages for each letter of the alphabet.

The code in Figure 3 returns a list of every link on the WhatIs.com homepage. The code does this by finding every instance of the <a> tag -- which indicates a link -- then printing the URL in that tag's href attribute.
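Since Figure 3 is a screenshot, here is a sketch of the same logic using Beautiful Soup; the parser choice and variable names are assumptions:

import requests
from bs4 import BeautifulSoup

# Fetch the homepage and parse its HTML.
response = requests.get("https://www.techtarget.com/whatis/")
soup = BeautifulSoup(response.text, "html.parser")

# Find every <a> tag and print the URL stored in its href attribute.
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)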


Figure 3. Every link from the source code is pulled into the IDE.

As pictured, the scraper extracts every hyperlink from the page, including TechTarget's privacy page and contact page. The goal is to extract only definition URLs.

Step 3. Extract definition URLs

Look for a pattern in the URLs of articles that the scraper can filter for and extract. Definitions all have a similar URL structure -- they have /definition in the URL. The following code ensures that all URLs with /definition will be printed.
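As a sketch of that filter, continuing from the soup object in the earlier sketch, it might look like this:

# Print only the links whose URL contains "/definition".
for link in soup.find_all("a"):
    href = link.get("href")
    if href and "/definition" in href:
        print(href)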


Figure 4. Only links containing the string 'definition' are pulled into the IDE.

Step 4. Refine the output

The previous step mostly worked. However, it still prints a list of links to the WhatIs.com glossary, which all contain the word "definitions." To remove the glossary entries and only display definition links, add the following line to the for loop:

if href and "/definition" in href and "/definitions" not in href:
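In context, the refined loop from the earlier sketches might look like the following; the definition_links list is an addition used here to collect results for the export step below:

# Collect definition links, skipping the alphabetical glossary pages,
# whose URLs contain "/definitions" rather than "/definition".
definition_links = []
for link in soup.find_all("a"):
    href = link.get("href")
    if href and "/definition" in href and "/definitions" not in href:
        definition_links.append(href)
        print(href)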


Figure 5. The word 'definitions' is excluded, removing the glossary links from the output.

Now all the links displayed will be to TechTarget definitions, and the glossary will be excluded.

To export these links from the coding environment, use the pandas library to turn the output into a data frame, then save it to a CSV file named output.csv. The code will look like Figure 6 with those lines appended.
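As a sketch, assuming the links were collected into the definition_links list from the previous step, the export might look like this; the column name is arbitrary:

import pandas as pd

# Turn the collected links into a one-column data frame.
df = pd.DataFrame(definition_links, columns=["definition_url"])

# Write the data frame to output.csv in the working directory, omitting the index column.
df.to_csv("output.csv", index=False)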


Figure 6. Exporting to a CSV file using the pandas library. The CSV file appears off screen, in the file directory.