BeautifulSoup and Python: Writing your first web scraping code

In our last post, we covered the introduction to web scraping, why we need it, and the process of web scraping. In this post, we will look at the Python library used for scraping and write our first scraping code.

We are going to scrape the Premier League table for the 2019-20 season and store the data in a pandas dataframe. Let us get started.

We will first import the required libraries –

import requests
from bs4 import BeautifulSoup
import pandas as pd

The requests library will be used to perform GET requests to the URL from which we have to scrape the data. It will help us get the HTML behind the webpage.

Making use of the requests library to get the HTML code

url = "https://www.skysports.com/premier-league-table/2019"
page = requests.get(url)
print(page)               # Response object
print(page.status_code)   # Status code
print(page.text)          # HTML code

Enter the URL from which you want to get the data. The variable page holds the response object. We can check the status of our request by printing page.status_code (200 indicates a successful response). The page.text attribute contains the actual HTML code that we are interested in.

Create a BeautifulSoup() object named soup. The prettify() method of this object returns the HTML code in a structured format that is easier to read.

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

The next task is to identify the tags and their class names in which our required data is present. 

You can make use of the browser devtools to identify these tags and the class names of the relevant data that you require. Once you have this ready with you, you can use the find_all() method to find all the relevant HTML code for your requirement. In our case, the tag is a table tag and the class is standing-table__table.

In the code below, the league variable contains only the HTML content of the table, and league_table holds the table body (tbody) that we will loop over.

league = soup.find('table', class_ = 'standing-table__table')
league_table = league.find_all('tbody')

We can use the indexing logic of python where index n will give you (n+1)th tag value. The .text attribute will give the text value that the tag contains. We have used the .strip() method of string to eliminate the additional spaces before and after the text content. 

The rest of the code, shown below, loops over the table rows, collects each team's values into a dictionary, and appends it to a list.

league_2020 = []

for league_teams in league_table:
    rows = league_teams.find_all('tr')
    for row in rows:
        cells = row.find_all('td', class_='standing-table__cell')
        rank = cells[0].text.strip()
        team_name = row.find('td', class_='standing-table__cell standing-table__cell--name').text.strip()
        PL = cells[2].text.strip()
        W = cells[3].text.strip()
        D = cells[4].text.strip()
        L = cells[5].text.strip()
        F = cells[6].text.strip()
        A = cells[7].text.strip()
        GD = cells[8].text.strip()
        team_points = cells[9].text.strip()

        league_dict = {
            '#': rank,
            'Name': team_name,
            'PL': PL,
            'W': W,
            'D': D,
            'L': L,
            'F': F,
            'A': A,
            'GD': GD,
            'Points': team_points}
        league_2020.append(league_dict)

df = pd.DataFrame(league_2020)
print(df)

We have now created a dataframe of the league table.
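If you also want to persist the scraped table, the dataframe can be written to a CSV file (the filename below is just an example) –

# save the league table to a CSV file (example filename)
df.to_csv('premier_league_2019_20.csv', index=False)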

This was an introductory post about how we can write our first web scraping code. The BeautifulSoup library has very good documentation where you can visit and find out additional methods and attributes which can prove to be handy during your scraping process.

What is Web Scraping? Everything you need to know before writing your first scraping code

In this post, we are going to cover an introduction to web scraping. The aim of this post is to answer questions like: What is web scraping? When should you use it? What does the web scraping process look like, and which Python web scraping library can you use?

What is Web Scraping?

The process of web scraping involves fetching or downloading a web page, extracting data from it, and saving it to a DB or spreadsheet. It is an automated process of retrieving structured data from a website. 

Why Web Scraping?

Consider that you need to collect data from a website. There are three different ways to collect Structured Data from a website.

  1. Download .csv file from a website
    You can download a .csv file from a website, but this option is not always available; in many cases, no attachments or downloads are provided.
  2. Copy and Paste to a CSV or Excel File
    This is the simplest approach, but it is tedious rather than smart. It is time-consuming when there are many records to look at, and with multiple web pages or multiple websites the task becomes impractical, or even impossible, with this technique.
  3. API (Application Programming Interface)
    APIs are simpler and a better choice than the other two options, but they are often paid and can be costly. They may not provide the data in the preferred structure, and requesting data through an API also takes time.

To overcome the above shortcomings, we can replace these traditional methods with web scraping. Web-scraping is used in several domains. Let us understand the web scraping use cases in a much better way by looking at the following examples –

  • Price Comparison Engines
    Get prices of an item from different websites and present them in a structured manner for comparisons.
  • News Aggregation
    Aggregating top news articles from multiple websites to one site.
  • Sentiment Analysis
    Collect tweets or hashtags to analyze people’s sentiments on particular issues such as politics or products.
  • Hotel Industry
    Scraping hotel prices and images and comparing them for a better choice for customers or the target audience.

What is BeautifulSoup Library?

  1. BeautifulSoup transforms HTML text/script into a tree format for easy readability.
  2. This will make it easy to search for what we want inside the HTML/CSS contents.
  3. You can search for classes, headers, tables, paragraphs, etc., which are presented in a readable structure by BeautifulSoup.
  4. BeautifulSoup will help with isolating the titles, tags, and links from the HTML documents so that we can extract the content that we want (text/image).

Web Scraping Process

  1. Obtain the web URL and download the web page.
  2. Use the Python requests library to obtain the source code.
  3. Parse the downloaded data with an HTML parser to get the data in a readable and structured format.
  4. Use the BeautifulSoup library to extract the data that we need.
  5. Store/save the data as a pandas dataframe, CSV, JSON, SQL DB, etc.
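Putting these steps together, here is a minimal sketch of the whole process. The URL, tag name, and class name below are placeholders that you would replace with the details of your own target page –

import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Obtain the web URL and download the web page (placeholder URL)
url = "https://example.com/some-table-page"
page = requests.get(url)

# 2. Parse the downloaded HTML
soup = BeautifulSoup(page.text, 'html.parser')

# 3. Extract the data that we need (placeholder tag and class name)
rows = soup.find_all('tr', class_='data-row')
records = [[cell.text.strip() for cell in row.find_all('td')] for row in rows]

# 4. Store/save the data
df = pd.DataFrame(records)
df.to_csv('scraped_data.csv', index=False)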

So, this is all that you need to know for now. In the next post, we will cover how we can write our first scraping code using the BeautifulSoup library and create a pandas dataframe out of it.

What is the Confusion Matrix in Machine Learning? What is Type 1 and Type 2 Error?

Why do we need the confusion matrix? Well, if you don’t know, then let me put it plainly for now. It is one of the techniques for measuring the performance of your model. By creating a confusion matrix, you can calculate recall, precision, F-measure, and accuracy as well. It is really simple to create a confusion matrix once you are done with your predictions.

What is a confusion matrix?

A confusion matrix is one of the evaluation techniques for machine learning models in which you compare the results of all the predicted and actual values.

Let us consider a binary target variable consisting of 0s and 1s, where 1 represents the positive (True) case and 0 represents the negative (False) case.

Consider an example with the following counts of actual (y) and predicted (ŷ) values —

  1. There are 5 instances where the actual value (y) is 1 and the predicted value (ŷ) is also 1. This is called a True Positive case, where True means the predicted value matches the actual value and Positive means the predicted value is positive (1).
  2. There are 4 instances where the actual value (y) is 0 and the predicted value (ŷ) is also 0. This is called a True Negative case, where True means the predicted value matches the actual value and Negative means the predicted value is negative (0).
  3. There are 3 instances where the actual value (y) is 0 and the predicted value (ŷ) is 1. This is called a False Positive case, where False means the values are different (0 & 1) and Positive means the predicted value is positive (1).
  4. There are 2 instances where the actual value (y) is 1 and the predicted value (ŷ) is 0. This is called a False Negative case, where False means the values are different (1 & 0) and Negative means the predicted value is negative (0).

In a confusion matrix, the values on the diagonal (True Positives and True Negatives) are correctly identified by the model, and the off-diagonal values (False Positives and False Negatives) are wrongly identified by the model.
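As a quick sanity check, here is a short sketch using scikit-learn's confusion_matrix function. The y_true and y_pred arrays below are made up so that they reproduce the counts described above –

from sklearn.metrics import confusion_matrix

# toy arrays constructed to match the counts above:
# 5 true positives, 4 true negatives, 3 false positives, 2 false negatives
y_true = [1]*5 + [0]*4 + [0]*3 + [1]*2
y_pred = [1]*5 + [0]*4 + [1]*3 + [0]*2

# rows are actual values (0, 1), columns are predicted values (0, 1):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))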

Type 1 Error and Type 2 Error

  • Type 1 Error arises when the predicted value is positive while it is actually negative (False Positive).
    e.g., your device predicts that it will rain today, but in reality it did not rain.
  • Type 2 Error arises when the predicted value is negative while it is actually positive (False Negative).
    e.g., your device predicts that it will not rain today, but in reality it did rain.

Note – In the above two examples for type 1 and type 2 error, we have considered raining to be a positive case.

Complete summary of the confusion matrix

In the below example of a covid19 test, the four cases can be summarized in the following way —

— True Positive —

The covid test is positive and the patient is suffering from covid19.

— True Negative —

The covid test is negative and the patient is not suffering from covid19.

— False Positive —

The covid test is positive but the patient is not suffering from covid19.

— False Negative —

The covid test is negative but the patient is suffering from covid19.

Of the above four cases, the fourth case, i.e. False Negative (Type 2 error), is dangerous as it can cost the patient's life due to an error in the test. So, generally, False Negative cases are considered more dangerous than False Positive cases, but in a few applications like software testing, it is the False Positive (Type 1 error) cases that one tries to minimize.

In the next post, we will cover how we can calculate accuracy, precision, recall, and F-measure from a confusion matrix, and what these values signify.

What are the different types of Clustering Algorithms? Its Applications and Usage

What is Clustering ?

Clustering is an unsupervised technique in which unlabeled data points are grouped together based on their similarities. These groups are mutually exclusive.

Clustering Algorithms

  • Partitioned-based Clustering
    1. K-Means
    2. K-Median
    3. Fuzzy C-means
  • Hierarchical Clustering
    4. Agglomerative
    5. Divisive
  • Density-based Clustering
    6. DBSCAN
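Of the algorithms listed above, K-Means is the most commonly used one. Here is a minimal scikit-learn sketch on made-up two-dimensional data –

import numpy as np
from sklearn.cluster import KMeans

# toy 2-D data with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# fit K-Means with k = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # coordinates of the two centroids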

Why Clustering ?

  1. Exploratory Data Analysis (EDA)
  2. Summary Generation
  3. Outlier Detection
  4. Finding Duplicates
  5. Pre-processing Step

Applications of Clustering

  • Retail Marketing
    1. Identify buying patterns of customers
    2. Recommending new books or movies to new customers
  • Banking
    3. Fraud detection in credit card use
    4. Identifying clusters of customers
  • Insurance
    5. Fraud detection in claims analysis
    6. Insurance risk of customers
  • Publication
    7. Auto-categorizing news based on their content
    8. Recommending similar news articles
  • Medicine
    9. Characterizing patient behavior
  • Biology
    10. Clustering genetic markers to identify family ties

3 Metrics to evaluate the accuracy of a KNN Model

After building a model, it is also important to decide which metrics are most suitable for it. For simple linear regression, where we have just one dependent and one independent variable, the correlation between them can give a fair idea of how well the model fits. The same is not the case with multiple linear regression, where a simple correlation can no longer justify the accuracy of the model and other evaluation metrics are needed. The following are the three metrics that one can use to find the accuracy of a KNN (K-Nearest Neighbors) model.

  1. Jaccard Index
  2. F1 – Score
  3. Log Loss

Let us discuss each of them one by one in detail.

Jaccard Index

The Jaccard index is also known as the Jaccard similarity coefficient. If y is the set of actual labels and ŷ is the set of predicted labels, then the Jaccard index is defined as the size of the intersection divided by the size of the union of the two labeled sets –

J(y, ŷ) = |y ∩ ŷ| / |y ∪ ŷ| = |y ∩ ŷ| / (|y| + |ŷ| − |y ∩ ŷ|)

Consider that you have a total of 50 observations, out of which your model predicts 41 correctly. Then the Jaccard index is given by –

J(y, ŷ) = 41 / (50 + 50 − 41) ≈ 0.69

A Jaccard index of 0.69 means that the model predicts on the test set with an accuracy of 69%. The Jaccard index ranges from 0 to 1, where an index value of 1 implies maximum accuracy.
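As a quick check of this arithmetic in Python (the numbers simply restate the 41-out-of-50 example above) –

# 50 observations in total, 41 of them predicted correctly
intersection = 41
union = 50 + 50 - 41              # |y| + |y_hat| - |intersection|

jaccard_index = intersection / union
print(round(jaccard_index, 2))    # 0.69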

F1 – Score

F1-score is also known as F-measure or F-score. It is the harmonic mean of precision and recall, so a good F1-score indicates that the model has good values for both recall and precision. Since this metric uses the harmonic mean rather than the arithmetic mean, it is not skewed by a single extreme value: a very low precision or recall pulls the whole score down. It is given by –

F1 = 2 × (precision × recall) / (precision + recall)

  • Precision
    Out of all the classes we predicted as positive, how many are actually positive.
  • Recall
    Out of all the actual positive classes, how many we predicted correctly.
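For illustration, here is a minimal scikit-learn sketch of precision, recall, and the F1-score; the toy labels below are made up –

from sklearn.metrics import precision_score, recall_score, f1_score

# toy actual and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall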

Log Loss

Log loss, or logarithmic loss, measures the performance of a classifier whose predicted output is a probability value between 0 and 1.

For example, suppose the model predicts a probability of 0.21 for an observation whose actual label is 1. This is a poor prediction and will result in a high log loss. We can calculate the log loss for each row using the log loss equation, which measures how far the predicted probability (ŷ) is from the actual label (y) –

log loss = −( y × ln(ŷ) + (1 − y) × ln(1 − ŷ) )

We then average this log loss across all the rows of the test set.

The log loss is always greater than or equal to 0: an ideal classifier has a log loss close to 0, and the value grows as the predictions get worse. So the classifier with the lower log loss has better accuracy.
Note that the Jaccard index and the F1-score metrics can also be used for multi-class classifiers.
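Continuing the single-prediction example above (actual label 1, predicted probability 0.21), here is a minimal scikit-learn sketch –

from sklearn.metrics import log_loss

# actual label is 1, predicted probability of the positive class is 0.21
y_true = [1]
y_prob = [0.21]

# labels must be passed explicitly because y_true contains only one class
print(log_loss(y_true, y_prob, labels=[0, 1]))   # -ln(0.21), roughly 1.56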

Difference between Label Encoding and One-Hot Encoding | Pre-processing | Ordinal vs Nominal Data

No matter what programming language you use to write your code logic, machines understand the binary language of 1s and 0s. Similarly, it is easier for machines to deal with IP addresses than hostnames while on the contrary, humans prefer to deal with hostnames. The encoding logic in machine learning is more or less based on this philosophy. Encoding is a major pre-processing step while building machine learning models. In this post, we are going to talk about the two encoding techniques, namely, one-hot encoding and label encoding for nominal and ordinal categorical data. Before we begin, let us discuss in short about the two categorical data types.

  1. Ordinal Data
    As the name suggests, ordinal data are the type of categorical data that can be put into an order. For example, the colors of a rainbow, the different sizes of a shirt (S, M, L, XL, XXL), the rating of a product (5, 4, 3, 2, 1), etc.
  2. Nominal Data
    The type of categorical data that does not follow any inherent order is termed nominal data. For example, gender (male and female), colors, etc.

What is Label Encoding?

Label encoding is a technique used for ordinal data. In label encoding, the labels of the ordinal data (easy, medium, difficult) are converted to numeric values (0, 1, 2) so that it becomes easier for the machines to interpret. Machine Learning models work better with numeric data than string variables. 

  • Consider a dataframe df with size column having values in [‘S’, ‘M’, ‘L’, ‘XL’]

Example –

# import label encoder
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# sample dataframe with an ordinal 'size' column
df = pd.DataFrame({'size': ['S', 'M', 'L', 'XL', 'M', 'S']})

# label_encoder object knows how to understand word labels
label_encoder = LabelEncoder()

# encode labels in column 'size' (fit_transform returns the numeric codes)
# note: LabelEncoder assigns codes in alphabetical order of the labels
df['size'] = label_encoder.fit_transform(df['size'])

# get unique numeric labels produced by the label encoder
print(df['size'].unique())

# get the original label corresponding to numeric label 0
print(label_encoder.inverse_transform([0]))

What is One Hot Encoding?

The One Hot Encoding technique is used for nominal data. In one hot encoding, each label is converted into an attribute, and that attribute is given the value 0 (False) or 1 (True). For example, consider a gender column having values Male or M and Female or F. After one-hot encoding, it is converted into two separate attributes (columns), Male and Female. For rows in the Male category, the Male column is given the value 1 (True) and the Female column the value 0 (False). For rows in the Female category, the Male column is given the value 0 (False) and the Female column the value 1 (True).

  • Consider a dataframe df with gender column having values in [‘Male’, ‘Female’]

Example –

# import one hot encoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# sample dataframe with a nominal 'Gender' column
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# creating the one hot encoder
one_hot_encoder = OneHotEncoder()

# encode the 'Gender' column (OneHotEncoder expects a 2-D input)
onehot_encoded = one_hot_encoder.fit_transform(df[['Gender']]).toarray()

# columns are created in alphabetical order of the categories: ['Female', 'Male']
print(one_hot_encoder.categories_)
print(onehot_encoded)

# you can then concatenate onehot_encoded to the original dataframe
# and drop the original 'Gender' column

# get the original category corresponding to an encoded row
print(one_hot_encoder.inverse_transform([[1, 0]]))   # ['Female']

For the gender example, we really need only one of the two columns (either Male or Female) instead of having both at the same time, since either column alone is enough to identify the gender.
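One common way to keep just a single column is to drop the first dummy column. Here is a small sketch using pandas get_dummies, an alternative to OneHotEncoder –

import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# drop_first=True keeps a single column (here 'Gender_Male'),
# which is enough to identify the gender on its own
dummies = pd.get_dummies(df['Gender'], prefix='Gender', drop_first=True)
print(dummies)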

There are several other pre-processing techniques that we are going to cover in the upcoming posts. Stay updated.

Types of Machine Learning Systems

Machine Learning Systems can be broadly classified into 3 categories. Let us discuss them in detail.

Category 1

Whether or not they are trained with human supervision

This category is sub-divided into –

  1. Supervised Learning
    In a supervised learning method, the training data comes with labels. A model is trained on these labeled examples to predict the output (the target variable) for new data.
  2. Unsupervised Learning
    In an unsupervised learning method, the training data is unlabeled. The machine tries to learn on its own.
  3. Semi-Supervised Learning
    In a semi-supervised method, as the name suggests, some part of the data is labeled while most of the data is unlabeled.
  4. Reinforcement Learning
    In reinforcement learning, the machine is rewarded for the good actions it takes. Over time, the machine learns which action to perform in a given situation on the basis of the rewards earned.

Category 2

Whether or not they can learn incrementally on the fly

The category is sub-divided into –

  1. Batch learning
    In batch learning, the model is trained on the entire dataset at once. To incorporate a new data point, the entire model needs to be trained again on all the available data.
  2. Online learning
    In online learning, the model learns incrementally by feeding the data in mini-batches. Unlike batch learning, the model only needs to learn from the newer data points, not retrain on everything.
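As an illustration of online learning, here is a small sketch using scikit-learn's SGDClassifier, which supports incremental learning through its partial_fit method; the data below is made up –

import numpy as np
from sklearn.linear_model import SGDClassifier

# made-up data arriving in two mini-batches
X_batch1, y_batch1 = np.array([[0.0], [1.0], [2.0]]), np.array([0, 0, 1])
X_batch2, y_batch2 = np.array([[3.0], [4.0]]), np.array([1, 1])

model = SGDClassifier()

# the first call must list all the classes the model will ever see
model.partial_fit(X_batch1, y_batch1, classes=np.array([0, 1]))

# later mini-batches are learned incrementally, without retraining from scratch
model.partial_fit(X_batch2, y_batch2)

print(model.predict(np.array([[3.5]])))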

Category 3

Whether they memorize the data points or instead detect patterns in the training data

The category is sub-divided into –

  1. Instance-based learning
    In instance-based learning, the model learns the data points themselves. It memorizes the instances of all the available data, which can lead to poor generalization on the test set.
  2. Model-based learning
    In model-based learning, a model is created by using a training set. If modeled correctly, the system can produce near accurate predictions over the test set.

These are the broad categories in which the machine learning systems are classified. Using each of the sub-categories depends on the kind of data and predictions we are dealing with, the number of computation resources available, and the use cases.

Properties of OLS estimators and the fitted regression model

The simple linear regression model is denoted by y = 𝑏0 + 𝑏1x + 𝑒, where 𝑏0 is the intercept, 𝑏1 is the slope, and 𝑒 is the random error term.

  • Ordinary Least Squares is the most common method to estimate the parameters in a linear regression model regardless of the form of distribution of the error 𝑒.
  • Least squares stand for the minimum square error or 𝑆𝑆𝐸 (𝑆𝑢𝑚 𝑜𝑓 𝑆𝑞𝑢𝑎𝑟𝑒𝑑 𝐸𝑟𝑟𝑜𝑟). A lower error results in a better explanatory power of the regression model.
  • Also, least-squares produce the best linear unbiased estimators of 𝑏0 and 𝑏1.

Properties of least square estimators and the fitted regression model

  1. The sum of the residuals in any regression model that contains an intercept 𝑏0 is always zero, that is – Σ eᵢ = 0.
  2. The sum of the observed values yᵢ equals the sum of the fitted values ŷᵢ, that is – Σ yᵢ = Σ ŷᵢ.
  3. The least squares regression line always passes through the centroid (x̄, ȳ) of the data.
  4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero, that is – Σ xᵢeᵢ = 0.
  5. The sum of the residuals weighted by the corresponding fitted value always equals zero, that is – Σ ŷᵢeᵢ = 0.
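As an illustration, the following sketch fits a least squares line with numpy on made-up data and checks a few of these properties numerically –

import numpy as np

# made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# fit y = b0 + b1*x by ordinary least squares
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x
e = y - y_hat                       # residuals

print(np.sum(e))                    # property 1: approximately 0
print(np.sum(y), np.sum(y_hat))     # property 2: the two sums are equal
print(np.sum(x * e))                # property 4: approximately 0
print(np.sum(y_hat * e))            # property 5: approximately 0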

Data Analyst vs Data Engineer vs Data Scientist

Do you feel like the companies want a Super-Human when you read their job description?

It is very important to know about the profile that you are applying for. The role names differ from company to company; for that reason, one should not go by the role name alone but instead insist on getting information about the responsibilities and the kind of projects that the company works on.

Usually, while going through job descriptions, one can see that job profiles are not classified consistently everywhere. Companies tend to mix multiple profiles into one, and one might not understand what the company actually requires. On occasion, I have even seen frontend technologies mixed into a Data Science profile, or a Big Data skillset listed as a requirement for a Data Analyst.

In this post, I will try to cover, in very short, what these three profiles, namely Data Scientists, Data Engineers, and Data Analysts, have in common, where their skill sets differ, and what the right match can be if you have planned to enter this domain.

Data Analyst

The primary focus of a Data Analyst is to retrieve data and perform some kind of analysis. The analysis mainly refers to analyzing past trends in the data and predicting, based on those findings, what the data might look like in the future. They do not care much about feature extraction or modeling and work mainly with structured data. They are responsible for creating visualizations and graphs in an informative way and presenting them to the concerned audience. They are more or less similar to Business Analysts; it's just that a Business Analyst has more knowledge specific to the business domain they work in.

Also known as – Business Analysts

Skills required – Statistics, Domain Knowledge, Communication, Python Visualisation Libraries, SQL, Tableau, MS Excel

Data Engineer

The primary focus of a Data Engineer is to write code and clean up data as per the requirements of a Data Scientist. They typically deal with huge amounts of data, termed Big Data, which is either in semi-structured or unstructured form. They know how data can be stored and retrieved efficiently in storage stacks, which helps in faster processing of the data.

Also known as – Database Administrators or Data Architects

Skills required – Mathematics, Big Data, Hadoop, Spark, Hive, Pig, Python, NoSQL

Data Scientist

Coming to the "sexiest job of the 21st century", as quoted by Harvard Business Review, this is the hottest debate out in the market. Labeling everything as a Data Scientist role has created major confusion between the different roles. The tasks associated with a Data Scientist mainly comprise EDA (Exploratory Data Analysis), feature extraction, finding the right Machine Learning algorithm to model the data, and improving accuracy by testing the model, followed by fine-tuning whenever needed. Usually, a Data Scientist knows about the job of a Data Analyst and might use visualizations to present their own findings and model performance, but they might or might not know about Big Data and the tasks related to a Data Engineer.

Also known as – Data Managers or Statisticians 

Skills required – Mathematics, Statistics, Communication, Machine Learning, Python/R, SQL

To summarise this comparison, I would like to put the roles that one must ideally look for in a sequential manner. 

Data Analyst or Business Analyst → Data Engineer → Data Scientist or ML Engineer → AI Architect

Measures of relationship between variable | Correlation and Co-variance coefficient

While performing EDA (Exploratory Data Analysis) the most crucial step is to find the relationship between two or more variables to understand how one behaves when the other variable tends to change. This helps us to figure out the significance of each independent variable on the target and thus, create a model with a reduced number of parameters (only the most important or significant ones).

Covariance

Covariance provides insight into how two variables are related to one another. More precisely, covariance refers to the measure of how two random variables in a data set will change together. A positive covariance means that the two variables at hand are positively related, and they move in the same direction. A negative covariance means that the variables are inversely related, or that they move in opposite directions.

Correlation Coefficient

When the correlation coefficient is zero, there is no identifiable linear relationship between the variables: if one variable moves, it is impossible to make predictions about the movement of the other. A correlation coefficient of −1 means that the variables are perfectly negatively (inversely) correlated: if one variable increases, the other decreases in the same proportion, and the variables move in opposite directions. A coefficient between −1 and 0 indicates an imperfect negative correlation, and the closer it gets to −1, the stronger the negative relationship. Similarly, a coefficient of +1 indicates a perfect positive correlation, with values between 0 and +1 indicating an imperfect positive correlation.
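For example, with pandas you can compute both measures directly on a dataframe; the data below is made up –

import pandas as pd

# made-up data: two positively related variables
df = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5],
    'exam_score':    [52, 58, 65, 71, 80],
})

print(df.cov())    # covariance matrix
print(df.corr())   # Pearson correlation coefficients (between -1 and +1)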

Note: ρ(x, y) = ρ(y, x). Correlation does not imply causation. What does the statement "Correlation does not imply causation" mean? We will cover it in a separate blog.