Scrape Content from Dynamic Websites

Many websites load content with JavaScript after the page opens, so the data may not appear in the initial HTML. Because requests and BeautifulSoup work only with that static HTML, they cannot see the dynamically loaded content. Selenium fills the gap by driving a real browser that loads the full page and runs its JavaScript. BeautifulSoup can then extract the required data from the rendered HTML.
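To make the limitation concrete, here is a minimal sketch that uses only Python's standard library. The HTML string, the job-list id and the /jobs/engineer link are made up for illustration; they stand in for what a plain HTTP fetch might return from a page that builds its links in JavaScript. A static parser treats the script body as text and never runs it, so it finds no links.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML, as a plain HTTP client would receive it.
# The link only exists after a browser executes the script.
STATIC_HTML = """
<html><body>
  <div id="job-list"></div>
  <script>
    document.getElementById("job-list").innerHTML =
      '<a class="soft-link" href="/jobs/engineer">Engineer</a>';
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


collector = LinkCollector()
collector.feed(STATIC_HTML)
print(collector.links)  # prints [] because the script was never executed
```

BeautifulSoup with the "html.parser" backend sees the same static markup, which is why a browser-driven tool is needed for pages like this.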

This project demonstrates how to scrape dynamically loaded job profile links from the "Top Jobs by Designation" page on Naukri Gulf using Selenium and BeautifulSoup. It uses webdriver-manager to manage ChromeDriver automatically, which avoids a manual installation.

Install Selenium

Before using Selenium, we need to install it in our Python environment. We'll also install webdriver-manager, which automatically manages the browser driver (such as ChromeDriver), so you don't need to download it and set its path manually, and beautifulsoup4, which provides the bs4 module used for parsing.

Run the following commands in a terminal (in a notebook, prefix each with ! or use %pip):

pip install selenium
pip install webdriver-manager
pip install beautifulsoup4
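As a quick sanity check after installing, the sketch below reports which of the modules imported later in this walkthrough cannot be found. The helper name missing_modules is hypothetical. Note that two import names differ from their pip package names.

```python
from importlib.util import find_spec

# Import names differ from the pip package names in two cases:
# beautifulsoup4 -> bs4, webdriver-manager -> webdriver_manager.
REQUIRED_MODULES = ["selenium", "webdriver_manager", "bs4"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```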

Now, let's break down the scraping process step by step.

Step 1: Importing Required Libraries

To start, all the necessary Python libraries must be imported. These include Selenium components for browser automation, BeautifulSoup for parsing HTML content and webdriver-manager to automatically handle the browser driver setup.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

Explanation:

selenium: Automates browser actions.
Service, Options: Configure the Chrome browser and driver.
By, WebDriverWait, expected_conditions: Used to wait for elements to load dynamically.
ChromeDriverManager: Automatically downloads the correct driver.
BeautifulSoup: Parses HTML content.

Step 2: Set Up Chrome Options

In this step, Chrome browser options are configured. These settings allow the browser to run in headless mode (without a visible window), disable GPU usage and use a custom user-agent string to mimic a real browser environment.

chrome_options = Options()
chrome_options.add_argument("--headless")     # run without a visible window
chrome_options.add_argument("--disable-gpu")  # skip GPU acceleration
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.6778.265 Safari/537.36"
)
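If the same settings are needed in several scripts, one option is to build the list of switches in a small helper. This is only a sketch: the function names chrome_arguments and build_options, the ExampleScraper/1.0 user agent and the window size are assumptions for illustration, not part of the original project. The --window-size switch is added because headless Chrome otherwise starts with a small default viewport, which can affect responsive layouts.

```python
def chrome_arguments(headless=True, user_agent=None, window_size=None):
    """Build a list of Chrome command-line switches for a scraping session."""
    args = []
    if headless:
        args += ["--headless", "--disable-gpu"]
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    if window_size:
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    return args


def build_options(arguments):
    """Turn a list of switches into a Selenium Options object."""
    # Imported here so chrome_arguments() can be used and tested
    # even on a machine where Selenium is not installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return options


print(chrome_arguments(user_agent="ExampleScraper/1.0", window_size=(1920, 1080)))
```

With Selenium installed, build_options(chrome_arguments(...)) returns an object that can be passed as options when creating the driver.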

Step 3: Initialize WebDriver with webdriver-manager

This step initializes the Chrome WebDriver using the webdriver-manager package, which automatically downloads and configures the correct version of ChromeDriver and so avoids manual setup. The driver is launched with the previously defined Chrome options.

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

Step 4: Open the Target Webpage

In this step, the browser is directed to the target URL. Selenium opens the webpage, allowing dynamic content to load for further processing.

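url = "https://www.naukrigulf.com/top-jobs-by-designation"
driver.get(url)

driver.get() navigates to the Naukri Gulf page and returns once the initial page load finishes. Links that JavaScript adds afterwards may not be in the page yet, which is why the next step waits for them explicitly.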

Step 5: Wait for Dynamic Content to Load

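Before extracting data, the script must wait until the job links, which are added by JavaScript, are present in the page. This step uses an explicit wait: WebDriverWait checks repeatedly, for up to 30 seconds, until an element with the class soft-link exists, and raises a TimeoutException if none appears in that time.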

WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "soft-link"))
)
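Conceptually, an explicit wait is a polling loop: call a condition, return its result as soon as it is truthy, otherwise sleep briefly and try again until a deadline passes. The sketch below imitates that idea in plain Python so it can run without a browser. It is not Selenium's implementation, and the names wait_until, WaitTimeout and links_present are hypothetical. The 0.5 second default mirrors WebDriverWait's default polling interval.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate content that only "appears" on the third check.
attempts = {"count": 0}


def links_present():
    attempts["count"] += 1
    return ["Engineer"] if attempts["count"] >= 3 else []


print(wait_until(links_present, timeout=5, poll_interval=0.01))  # prints ['Engineer']
```

Because the loop returns as soon as the condition holds, a generous timeout such as 30 seconds costs nothing when the page is fast; it only bounds how long a slow or broken page can stall the script.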

Step 6: Get and Parse the Page Source

Once the job links are present, the script grabs the complete rendered HTML from the browser. BeautifulSoup then parses this HTML so we can easily search it and extract specific elements.

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

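Step 7: Extract the Job Profiles

Now that the HTML is parsed, the script searches for all job profile links by their class attribute and prints the first 10 job titles from the list. Passing the full string 'soft-link darker' to class_ matches only anchors whose class attribute is exactly that value.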

job_profiles_section = soup.find_all('a', class_='soft-link darker')

print("Top Job Profiles:")
for i, job in enumerate(job_profiles_section[:10], start=1):
    print(f"{i}. {job.text.strip()}")
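The class lookup can be tried offline on a small sample. The markup and paths below are made up to imitate the rendered page, so the real structure may differ. The sketch contrasts the exact-string match used above with a CSS selector that accepts the two classes in any order, and also collects each link's absolute URL with urljoin, since the goal is job profile links and not only titles.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.naukrigulf.com/top-jobs-by-designation"

# Made-up markup that imitates the rendered page; the real structure may differ.
SAMPLE_HTML = """
<ul>
  <li><a class="soft-link darker" href="/accountant-jobs"> Accountant </a></li>
  <li><a class="darker soft-link" href="/engineer-jobs">Engineer</a></li>
  <li><a class="soft-link" href="/other">Other</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: only the first link qualifies.
exact = soup.find_all("a", class_="soft-link darker")

# CSS selector: both classes are required, but their order does not matter.
both = soup.select("a.soft-link.darker")

# Pair each title with an absolute URL resolved against the page address.
jobs = [(a.get_text(strip=True), urljoin(BASE_URL, a["href"])) for a in both]

print(len(exact), len(both))
for title, link in jobs:
    print(title, link)
```

If the site ever reorders or extends the class list, the selector form keeps matching while the exact-string form silently returns fewer results, so it may be the more robust choice.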

Step 8: Close the WebDriver

After the scraping is complete, it's important to close the browser properly. This step shuts down the WebDriver to free up system resources.

driver.quit()
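If an earlier step raises an exception (for example, the wait times out), a trailing driver.quit() is never reached and the browser process keeps running. One way to guard against that is a small context manager that always calls quit(). The sketch below is an assumption-laden illustration: managed_driver is a hypothetical helper, and FakeDriver is a stand-in so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body of the with-block raises


class FakeDriver:
    """Stand-in for a real WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)  # prints True: quit() ran despite the error
```

With a real browser, the factory would be something like lambda: webdriver.Chrome(service=service, options=chrome_options), and Steps 4 to 7 would go inside the with-block.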