BeautifulSoup and Python: Writing your first web scraping code

In our last post, we covered an introduction to web scraping, why we need it, and the overall scraping process. In this post, we will look at the Python libraries used for scraping and write our first scraping code.

We are going to scrape the Premier League table for the 2019-20 season and store the data in a pandas DataFrame. Let us get started.

We will first import the required libraries –

import requests
from bs4 import BeautifulSoup
import pandas as pd

The requests library will be used to perform GET requests to the URL we want to scrape. It fetches the HTML behind the webpage.

Making use of the requests library to get the HTML code

url = ""
page = requests.get(url)
print(page)               # Response object
print(page.status_code)   # Status code
print(page.text)          # HTML code

Enter the URL from which you want to get the data. The variable page is a Response object. We can check the status of our request by printing page.status_code (200 means a successful response). page.text contains the actual HTML code that we are interested in.

Create a BeautifulSoup object named soup. Its prettify() method returns the HTML code in an indented, more readable format.

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())   # indented, easier-to-read HTML

The next task is to identify the tags and their class names in which our required data is present. 

You can make use of the browser devtools to identify the tags and the class names of the relevant data. Once you have identified them, you can use the find() and find_all() methods to locate the relevant HTML for your requirement. In our case, the tag is a table tag and the class is standing-table__table.
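You can try this lookup on a small inline snippet first. The markup below is a simplified, made-up stand-in for the real page, with the same class name:

```python
from bs4 import BeautifulSoup

# A simplified, made-up stand-in for the real page's markup.
html = """
<table class="standing-table__table">
  <tbody>
    <tr><td class="standing-table__cell">1</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of every matching tag; find() returns the first.
tables = soup.find_all('table', class_='standing-table__table')
print(len(tables))        # 1
print(tables[0].td.text)  # 1
```

The class_ keyword (with the trailing underscore) is how BeautifulSoup accepts a class filter, since class is a reserved word in Python.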

In the code below, the league variable holds the table element, and league_table holds the tbody section(s) that contain the rows.

league = soup.find('table', class_ = 'standing-table__table')
league_table = league.find_all('tbody')

We can use Python's indexing, where index n gives the (n+1)th matching tag. The .text attribute returns the text the tag contains, and the string method .strip() removes the extra whitespace before and after that text.
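As a quick illustration of indexing, .text, and .strip() — the row below uses made-up cell values and a shortened class name:

```python
from bs4 import BeautifulSoup

# A made-up table row with padded cell text.
row_html = '<tr><td class="c"> 1 </td><td class="c"> Liverpool </td></tr>'
row = BeautifulSoup(row_html, 'html.parser')

cells = row.find_all('td', class_='c')
print(repr(cells[0].text))          # ' 1 '  (raw text keeps the spaces)
print(repr(cells[1].text.strip()))  # 'Liverpool'
```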

The rest of the code, shown below, is self-explanatory.

league_2020 = []

for league_teams in league_table:
    rows = league_teams.find_all('tr')
    for row in rows:
        # Fetch the row's cells once instead of repeating find_all() per column.
        cells = row.find_all('td', class_ = 'standing-table__cell')
        rank = cells[0].text.strip()
        team_name = row.find('td', class_ = 'standing-table__cell standing-table__cell--name').text.strip()
        PL = cells[2].text.strip()
        W = cells[3].text.strip()
        D = cells[4].text.strip()
        L = cells[5].text.strip()
        F = cells[6].text.strip()
        A = cells[7].text.strip()
        GD = cells[8].text.strip()
        team_points = cells[9].text.strip()

        league_dict = {
            '#': rank,
            'Name': team_name,
            'PL': PL,
            'W': W,
            'D': D,
            'L': L,
            'F': F,
            'A': A,
            'GD': GD,
            'Points': team_points}

        # Without this append, league_2020 stays empty and so does the dataframe.
        league_2020.append(league_dict)

df = pd.DataFrame(league_2020)

We have now created a dataframe of the league table.
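The scraped values are all strings, so a common follow-up step is converting the numeric columns before doing any analysis. A minimal sketch on made-up rows, shaped like the dictionaries built above:

```python
import pandas as pd

# Made-up rows in the same shape as the scraped dictionaries.
league_2020 = [
    {'#': '1', 'Name': 'Liverpool', 'PL': '38', 'W': '32', 'D': '3',
     'L': '3', 'F': '85', 'A': '33', 'GD': '52', 'Points': '99'},
    {'#': '2', 'Name': 'Man City', 'PL': '38', 'W': '26', 'D': '3',
     'L': '9', 'F': '102', 'A': '35', 'GD': '67', 'Points': '81'},
]

df = pd.DataFrame(league_2020)

# Every column except the team name can be treated as an integer.
numeric_cols = [c for c in df.columns if c != 'Name']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df['Points'].max())  # 99
```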

This was an introductory post on writing your first web scraping code. The BeautifulSoup library has very good documentation, where you can find additional methods and attributes that may prove handy during your scraping process.
