No matter what programming language you use to write your code, machines ultimately understand only the binary language of 1s and 0s. Similarly, machines find it easier to deal with IP addresses than hostnames, whereas humans prefer hostnames. Encoding in machine learning is based on more or less the same philosophy, and it is a major pre-processing step when building machine learning models. In this post, we are going to talk about two encoding techniques, one-hot encoding and label encoding, for nominal and ordinal categorical data respectively. Before we begin, let us briefly discuss the two categorical data types.
- Ordinal Data: As the name suggests, ordinal data is categorical data whose values can be put into a meaningful order. For example, the colors of a rainbow, the different sizes of a shirt (S, M, L, XL, XXL), the rating of a product (5, 4, 3, 2, 1), etc.
- Nominal Data: Categorical data whose values do not follow any inherent order is termed nominal data. For example, gender (male and female), colors, etc.
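One way to make the distinction concrete is the pandas categorical dtype, which lets you state whether the categories are ordered. The sketch below is illustrative only: the sample values and the variable names `sizes` and `genders` are assumptions, not data used elsewhere in this post.

```python
import pandas as pd

# Ordinal: shirt sizes have a meaningful order, which we declare explicitly.
sizes = pd.Categorical(
    ["M", "S", "XL", "L"],
    categories=["S", "M", "L", "XL"],
    ordered=True,
)

# Nominal: genders are distinct labels with no order between them.
genders = pd.Categorical(["Male", "Female", "Female"], ordered=False)

# Ordered categoricals support min/max; codes give each value's
# position in the declared category order.
print(sizes.min(), sizes.max())
print(sizes.codes.tolist())
print(genders.ordered)
```

Asking for the minimum or maximum of an unordered categorical such as `genders` is not meaningful, which is the practical difference between the two data types.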
What is Label Encoding?
Label encoding is a technique used for ordinal data. In label encoding, the labels of the ordinal data (for example easy, medium, difficult) are converted to numeric values (0, 1, 2) so that machines can interpret them more easily. Most machine learning models expect numeric input and cannot work directly with string values.
- Consider a dataframe df with a size column having values in ['S', 'M', 'L', 'XL']
# import label encoder
from sklearn.preprocessing import LabelEncoder

# label_encoder object knows how to understand word labels
label_encoder = LabelEncoder()

# encode labels in column size (fit_transform returns the integer codes;
# fit alone would return the encoder object itself)
df['size'] = label_encoder.fit_transform(df['size'])

# get unique data labels produced by label encoder class
df['size'].unique()

# get the original label for a numeric label (expects an array-like, not a scalar)
label_encoder.inverse_transform([0])
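A caveat worth checking in your own environment: scikit-learn's LabelEncoder assigns integers according to the sorted order of the labels, so for string sizes the codes follow alphabetical order rather than S < M < L < XL. The sketch below uses a small made-up dataframe to contrast that with an explicit mapping that preserves the intended order. The names `size_order` and `size_encoded` are hypothetical and introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample data for illustration.
df = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# LabelEncoder sorts the labels, so the codes are alphabetical.
label_encoder = LabelEncoder()
alphabetical_codes = label_encoder.fit_transform(df["size"])
print(label_encoder.classes_.tolist())  # sorted labels; index = assigned code
print(alphabetical_codes.tolist())

# Explicit mapping (hypothetical helper) that keeps the real size order.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
df["size_encoded"] = df["size"].map(size_order)
print(df["size_encoded"].tolist())
```

If the order of the categories matters to the model, an explicit mapping like this is one option; scikit-learn's OrdinalEncoder with its `categories` parameter is another.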
What is One Hot Encoding?
One-hot encoding is used for nominal data. In one-hot encoding, each label is converted into its own attribute (column), and that attribute takes the value 0 (False) or 1 (True). For example, consider a gender column having values Male (or M) and Female (or F). After one-hot encoding, it is converted into two separate attributes (columns), Male and Female. For rows in the Male category, the Male column is given the value 1 (True) and the Female column the value 0 (False). For rows in the Female category, the Male column is given the value 0 (False) and the Female column the value 1 (True).
- Consider a dataframe df with a Gender column having values in ['Male', 'Female']
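# import one hot encoder (pandas is needed to rebuild the dataframe)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# creating one hot encoder
one_hot_encoder = OneHotEncoder()

# encode labels in column Gender (the encoder expects 2-D input and
# returns a sparse matrix, so convert it to a dense array)
onehot_encoded = one_hot_encoder.fit_transform(df[['Gender']]).toarray()

# concatenate onehot_encoded to the original dataframe
encoded_df = pd.DataFrame(onehot_encoded, columns=one_hot_encoder.categories_[0], index=df.index)
df = pd.concat([df, encoded_df], axis=1)

# drop the column Gender from original dataframe
df = df.drop(columns=['Gender'])

# get the original label back from the first one-hot encoded row
one_hot_encoder.inverse_transform(onehot_encoded[:1])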
For the gender example, we can keep just one column (either Male or Female) instead of both, because the two columns are complementary: either one is enough to identify the gender.
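Dropping the redundant column can be sketched with pandas' `get_dummies` and its `drop_first` option. The dataframe below is made up for illustration, and this is offered as an alternative to the OneHotEncoder approach, not as the method this post relies on.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df["Gender"], dtype=int)

# drop_first=True drops the first category (here 'Female', the first
# in sorted order), leaving one column that still identifies the gender.
reduced = pd.get_dummies(df["Gender"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `reduced`, a 1 in the Male column means Male and a 0 means Female, so no information is lost. OneHotEncoder offers a comparable `drop` parameter (for example `drop='first'`) if you prefer to stay within scikit-learn.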
There are several other pre-processing techniques that we are going to cover in the upcoming posts. Stay updated.