02 Building the Machine Learning Classifier

This tutorial is about building a simple spam classifier using logistic regression and the scikit-learn library. I recommend creating a folder named 'ml', creating a virtual environment, activating it, installing Jupyter Notebook, and starting it:

mkdir ml
cd ml
python -m venv env
.\env\Scripts\activate        # Windows
# source env/bin/activate     # Linux/macOS
pip install notebook pandas scikit-learn
jupyter notebook

Let's break down the code step-by-step:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

These imports include tools for data handling (pandas), splitting datasets (train_test_split), converting text into vectors (CountVectorizer), our classification model (LogisticRegression), and a tool for saving/loading models (joblib).

df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head())

columns_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df.drop(columns=columns_to_drop, inplace=True)

Here, we load the CSV file 'spam.csv' into a DataFrame df and print the first few rows. The 'latin-1' encoding is used because this dataset contains special characters that are not valid UTF-8.

The dataset contains some columns that are not relevant for our purpose, so we drop them.
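Before modelling, it is also worth checking the class balance; a quick sketch using the dataset's own 'v1' label column:

# Count how many messages carry each label; this dataset has
# far more 'ham' (legitimate) messages than 'spam'.
print(df['v1'].value_counts())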

X = df['v2']
y = df['v1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, we separate the messages (column 'v2') from their labels (column 'v1', either 'ham' or 'spam'). We then split them into training and testing sets, holding out 20% of the data for testing. Fixing random_state makes the split, and therefore our results, reproducible.
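Since spam datasets are usually imbalanced, you may want the split to preserve the ham/spam ratio in both halves. train_test_split supports this through its stratify parameter; an optional variation:

# Optional: stratify=y keeps the ham/spam proportions the same
# in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)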

vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

Machine learning models don't understand raw text, so we convert the text into a numerical format using CountVectorizer, which produces a matrix of token counts.

CountVectorizer():

  • Purpose: This is used to convert a collection of text data into a matrix of token counts.
  • How it works: It tokenizes the text data and gives an integer ID to each token. Then, it counts the occurrences of each of these tokens.

fit_transform(X_train):

  • fit: Learns the vocabulary of X_train, i.e., all unique words.
  • transform: Converts each text into a row of a matrix in which each column corresponds to a unique word in the vocabulary; the value in each cell is the count of that word in that text (see the toy example below).
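
To make this concrete, here is a toy example with a made-up three-message corpus (the names toy_corpus and toy_vec are ours, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["free prize now", "call me now", "free free prize"]
toy_vec = CountVectorizer()
toy_matrix = toy_vec.fit_transform(toy_corpus)

print(toy_vec.get_feature_names_out())  # ['call' 'free' 'me' 'now' 'prize']
print(toy_matrix.toarray())
# [[0 1 0 1 1]
#  [1 0 1 1 0]
#  [0 2 0 0 1]]

Each row is one message and each column one vocabulary word; this is exactly the structure our classifier trains on.
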
model = LogisticRegression(solver='liblinear')
model.fit(X_train_transformed, y_train)

We instantiate and train a logistic regression model. The 'liblinear' solver is generally a good choice for small datasets and binary classification, making it a good fit for our use case.
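A nice property of logistic regression is interpretability: every vocabulary word gets a coefficient, and since model.classes_ is sorted alphabetically (['ham', 'spam']), large positive coefficients push a message toward 'spam'. A short sketch for inspecting the most spam-indicative words:

import numpy as np

# Pair each vocabulary word with its learned coefficient and show
# the ten words whose coefficients most strongly indicate spam.
feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]
top_spam_idx = np.argsort(coefs)[-10:]
print(feature_names[top_spam_idx])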

X_test_transformed = vectorizer.transform(X_test)
accuracy = model.score(X_test_transformed, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Once the model is trained, we evaluate its performance on the held-out test data. Accuracy is the fraction of test messages the model labels correctly.
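Because the dataset is imbalanced (mostly ham), accuracy alone can be flattering: a model that always predicts 'ham' would already look quite accurate. It is worth also checking per-class precision and recall, for example:

from sklearn.metrics import classification_report

# Per-class precision and recall give a fuller picture than
# overall accuracy on an imbalanced dataset.
y_pred = model.predict(X_test_transformed)
print(classification_report(y_test, y_pred))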

example_text = pd.Series("Hello, Are you enjoying learning fastapi?")
example_text_transformed = vectorizer.transform(example_text)
prediction = model.predict(example_text_transformed)
print(prediction)

To see our model in action, we provide it with an example text and check its prediction. This example illustrates how you'd use the model in a real-world scenario.
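If you want a confidence score rather than a hard label, logistic regression also exposes class probabilities:

# predict_proba returns one probability per class, in the order
# given by model.classes_ (here alphabetical: 'ham', then 'spam').
probabilities = model.predict_proba(example_text_transformed)
print(model.classes_)
print(probabilities)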

After training, we save the model and the vectorizer to disk using joblib. This is crucial for real-world applications, as it avoids retraining the model every time we want to use it.
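A minimal sketch of that step (the filenames are our own choice; any paths work):

import joblib

# Persist the trained model and the fitted vectorizer so a web app
# (e.g. a FastAPI endpoint) can load them without retraining.
joblib.dump(model, 'spam_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Later, load them back the same way:
model = joblib.load('spam_model.pkl')
vectorizer = joblib.load('vectorizer.pkl')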
