How to Do Data Scraping Through Selenium

Data scraping, also called web scraping, is the process of extracting information from a website and saving it to a file on your computer. Automating this is much faster than copying the data by hand. In this article, we learn how to scrape data from a website using Python and Selenium.

What is Selenium?

Selenium is an open-source browser automation tool, so it can be downloaded and used free of charge. It is best known as a functional testing tool, and it can also be combined with non-functional testing tools. It is one of the most popular tools for automation testing, which is the process of converting manual test cases into test scripts that drive the browser. Because Selenium controls a real browser, a short Python script can use it to read data from a page as well, which makes it a convenient tool for scraping.

In this article, we go step by step through scraping a table from Wikipedia with the Selenium WebDriver, putting the scraped data into a pandas data frame, and saving that data as a CSV file on the local computer. We use the Mozilla Firefox web driver (geckodriver) for the automation.

Step 1. Install the required modules:

pip install selenium pandas

Step 2. Import the required modules and create the web driver object

First, download the Firefox web driver (geckodriver) and extract it on your system, then pass the path of its executable to the web driver object. The data we are going to scrape comes from the Wikipedia page linked below:

https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_badminton

We are going to scrape the medal leaders table on this page.
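# Python program to demonstrate scraping with Selenium
import pandas as pd

# import webdriver, the By locator class and the Firefox Service class
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# create webdriver object; the geckodriver path is passed through a Service
# (the older executable_path argument is deprecated in Selenium 4)
service = Service(executable_path="C:\\Users\\siddh\\Downloads\\geckodriver-v0.31.0-win64\\geckodriver.exe")
driver = webdriver.Firefox(service=service)

# open the Wikipedia page
driver.get("https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_badminton")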
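The script above never closes the browser, so Firefox stays open if a later step fails. Below is a minimal sketch of one way to guarantee cleanup, assuming Selenium 4. The helper name firefox_session and its geckodriver_path and headless parameters are hypothetical and not part of the original script.

```python
from contextlib import contextmanager


@contextmanager
def firefox_session(geckodriver_path=None, headless=True):
    """Yield a Firefox driver and always quit it afterwards (illustrative helper)."""
    # Imported inside the function so the helper can be defined
    # even on a machine where selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    if headless:
        options.add_argument("-headless")  # run Firefox without a visible window

    kwargs = {"options": options}
    if geckodriver_path:
        # Only pass a Service when an explicit geckodriver path is given
        kwargs["service"] = Service(executable_path=geckodriver_path)

    driver = webdriver.Firefox(**kwargs)
    try:
        yield driver
    finally:
        driver.quit()  # closes the browser even if scraping raised an error


# Possible usage (not executed here):
# with firefox_session() as driver:
#     driver.get("https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_badminton")
```

With this pattern the scraping steps go inside the with block, and the browser is closed when the block ends.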

Step 3. Extract all elements from the table:

For clarity, we extract each column one at a time with the help of XPath. XPath is one of the locator strategies Selenium offers for finding web elements. To learn more about locators, see the link below:

https://www.selenium.dev/documentation/webdriver/elements/finders/

Now we have to find the length of the table, that is, how many rows of players it contains. To do that, inspect the table by pressing F12 to open the browser's developer tools, and copy the XPath of each element from there. We also create empty lists to hold the extracted data.

Select the XPath for the medalist column
# Extracting the length of table :
total_element = len(driver.find_elements(By.XPATH, 
           "/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr"))
print(total_element)

# creating empty lists to save the data:
medalist = []
nations = []
olympic = []
gold = []
silver = []
bronze = []
total = []


# for extracting medalist names
# (XPath row indices start at 1, so the range starts at 1 as well)

for i in range(1, total_element + 1):
    w = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[1]/a')
    for element in w:
        medalist.append(element.text)

# for extracting nations

for i in range(1, total_element + 1):
    n = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[2]/a')
    for element in n:
        nations.append(element.text)

# for extracting which Olympics

for i in range(1, total_element + 1):
    o = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[3]')
    for element in o:
        olympic.append(element.text)
        
# for extracting gold medals
# (use the measured table length instead of a hard-coded row count)

for i in range(1, total_element + 1):
    g = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[4]')
    for element in g:
        gold.append(element.text)

# for extracting silver medals

for i in range(1, total_element + 1):
    s = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[5]')
    for element in s:
        silver.append(element.text)

# for extracting bronze medals

for i in range(1, total_element + 1):
    b = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[6]')
    for element in b:
        bronze.append(element.text)

# for extracting the total number of medals

for i in range(1, total_element + 1):
    t = driver.find_elements(By.XPATH, 
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' + 
        str(i) + ']/td[7]')
    for element in t:
        total.append(element.text)
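The loops above each fill a separate list, so a row that lacks one cell (for example, a name without a link) would shift that column relative to the others. A possible alternative, sketched here under assumptions, reads the table row by row and keeps the cells of one row together. The names TABLE_ROWS, COLUMNS and scrape_rows are introduced for illustration only. Unlike the loops above, this sketch keeps just the first match per cell and skips any row that is missing a column.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH

# Row XPath copied from the loops above
TABLE_ROWS = "/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr"

# Column name -> XPath relative to one table row
COLUMNS = {
    "Medalist": "./td[1]/a",
    "Nation": "./td[2]/a",
    "Olympic": "./td[3]",
    "Gold": "./td[4]",
    "Silver": "./td[5]",
    "Bronze": "./td[6]",
    "Total": "./td[7]",
}


def scrape_rows(driver):
    """Return one dict per table row that contains every column."""
    records = []
    for row in driver.find_elements(XPATH, TABLE_ROWS):
        cells = {}
        for name, relative_xpath in COLUMNS.items():
            found = row.find_elements(XPATH, relative_xpath)
            if not found:
                break  # header or incomplete row: skip it entirely
            cells[name] = found[0].text
        else:
            # runs only when no column was missing
            records.append(cells)
    return records
```

A list of dicts like this can be passed directly to pd.DataFrame(records), with the dict keys becoming the column names.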

Step 4. Put the extracted data into a DataFrame

Now we create a data frame from the data extracted in the previous step, using the Python pandas library.

df=pd.DataFrame(list(zip(medalist,nations,olympic,gold,silver,bronze,total)),
               columns =['Medalist', 'Nation', 'Olympic','Gold','Silver','Bronze','Total'])

print(df)
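One thing to keep in mind with zip() is that it stops at the shortest list, so if one column came back with fewer values than the others, the extra rows are dropped without any warning. A small sketch of a check that could be run before building the data frame follows. The function name check_aligned and the sample values are hypothetical placeholders, not real scraped data.

```python
def check_aligned(**columns):
    """Raise ValueError if the given lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# Made-up placeholder values, not real scraped data
medalist_demo = ["Player A", "Player B", "Player C"]
gold_demo = ["2", "1"]  # one value short

# zip() silently drops "Player C" because gold_demo is shorter
rows = list(zip(medalist_demo, gold_demo))
print(len(rows))

try:
    check_aligned(Medalist=medalist_demo, Gold=gold_demo)
except ValueError as err:
    print(err)
```

In the script above, the same check could be called with the seven scraped lists just before pd.DataFrame is created.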

Step 5. Export the data frame to a CSV file in the working directory

The next step is to create a CSV file from this data frame. To do that, we simply export the DataFrame to a CSV file using df.to_csv().

df.to_csv('bdo_medalist.csv')
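By default, to_csv() also writes the data frame's row index as an unnamed first column. If that column is not wanted in the file, index=False leaves it out. The short sketch below shows the difference on a tiny data frame. The name df_demo and its values are made-up placeholders, and the CSV text is returned as a string here instead of being written to disk.

```python
import pandas as pd

# Made-up placeholder rows, not real scraped data
df_demo = pd.DataFrame({"Medalist": ["Player A", "Player B"], "Gold": ["2", "1"]})

# With no path argument, to_csv() returns the CSV text as a string
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)

# Compare the header lines of the two variants
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

For the scraped table, the equivalent call would be df.to_csv('bdo_medalist.csv', index=False).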

Now you can open this CSV file and view the data in Excel. This is how you can scrape data using Selenium and a few Python libraries. We will look more into data scraping and saving in the coming articles.



Siddhartha Sharma

Aspiring data analyst currently working as a freelance tech blogger. Proficient in Python, SQL, advanced Excel, Power BI, and Metabase.
