Text Classification in Python: A Practical Guide with Code

December 31, 2023 Pedro Martins

Text classification is a core natural language processing (NLP) task: assigning a piece of text to one of several predefined classes. Python’s libraries, scikit-learn in particular, make the task accessible. This guide walks through a basic text classification example, complete with Python code.

1. Data Collection

Our first step is to gather a dataset. For simplicity, we’ll use the 20 Newsgroups dataset, which scikit-learn can download for us, restricted to two categories:

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian'], shuffle=True, random_state=42)
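Before going further, it helps to check how many documents each class contains, since a strongly imbalanced dataset changes how accuracy should be read. The sketch below is a minimal helper for that; `class_distribution` and the sample values are hypothetical stand-ins, and with the real dataset you would pass `data.target` and `data.target_names` instead.

```python
from collections import Counter

def class_distribution(targets, target_names):
    """Map each class name to the number of documents carrying that label."""
    # int() makes the lookup work for plain ints and NumPy integers alike
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(index, 0) for index, name in enumerate(target_names)}

# Hypothetical stand-in values, not real dataset statistics
sample_targets = [0, 1, 1, 0, 1]
sample_names = ['alt.atheism', 'soc.religion.christian']

distribution = class_distribution(sample_targets, sample_names)
print(distribution)
```

If one class dominates, the per-class metrics in the evaluation step are more informative than accuracy alone.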

2. Data Preprocessing

Text data often needs cleaning. Here’s a simple way to preprocess text data:

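import re

def preprocess_text(text):
    # Replace non-word characters (punctuation, symbols) with spaces
    text = re.sub(r'\W', ' ', str(text))
    # Remove digits
    text = re.sub(r'\d+', ' ', text)
    # Remove single letters standing alone between spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Remove a single letter at the start of the text
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase and trim leading/trailing spaces
    return text.lower().strip()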

processed_data = [preprocess_text(text) for text in data.data]
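When a cleaning function chains several regular expressions, it can be hard to tell which step changed what. The sketch below records the intermediate string after each step so you can inspect them. The `STEPS` list and the `trace_cleaning` helper are illustrative assumptions covering a simplified subset of the cleaning, not a required part of the workflow.

```python
import re

# Each step is (description, pattern, replacement). These patterns are
# illustrative assumptions, not the only reasonable choices.
STEPS = [
    ("non-word characters -> space", r"\W", " "),
    ("digits -> space", r"\d+", " "),
    ("collapse whitespace", r"\s+", " "),
]

def trace_cleaning(text):
    """Apply each step in order and record the intermediate strings."""
    history = []
    for description, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        history.append((description, text))
    history.append(("lowercase and trim", text.lower().strip()))
    return history

sample = "Re: Hello, World!!  Call 555-1234 (now)."
for description, result in trace_cleaning(sample):
    print(f"{description:30} {result!r}")
```

Running this on a few real documents is a quick way to catch a pattern that removes more (or less) than intended.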

3. Feature Extraction

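We’ll use TF-IDF for feature extraction. The vectorizer below keeps at most 5,000 terms, ignores terms that appear in fewer than 5 documents or in more than 70% of them, and drops common English stop words: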

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.7, stop_words='english')
X = vectorizer.fit_transform(processed_data)
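To make the TF-IDF weights less of a black box, here is a small pure-Python sketch of the calculation. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is my understanding of scikit-learn's documented defaults. Tokenization is simplified to whitespace splitting, and the `tfidf` function and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Toy corpus (hypothetical)
docs = ["god exists", "god faith church", "faith church church"]
vectors = tfidf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vector.items()})
```

In the first document, "exists" gets a higher weight than "god" because it appears in fewer documents, which is the behavior TF-IDF is designed to produce.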

4. Model Selection

We’ll use a multinomial Naive Bayes classifier, trained on 80% of the data and tested on the remaining 20%:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, data.target, test_size=0.2, random_state=42)

# Training the model
model = MultinomialNB()
model.fit(X_train, y_train)
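One caveat about the steps above: the vectorizer is fitted on the whole corpus before the split, so its vocabulary and IDF weights are influenced by the test documents. A common way to avoid this is to split the raw texts first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that fitting only ever sees the training data. The sketch below shows the idea on a tiny made-up corpus. The texts and labels are hypothetical, and in the guide's setting you would pass `processed_data` and `data.target` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus (hypothetical), 1 = religious, 0 = atheism
texts = [
    "church prayer faith", "faith bible church", "prayer bible gospel",
    "gospel church faith", "atheism reason evidence", "evidence logic reason",
    "logic atheism skeptic", "skeptic reason evidence",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, keeping the class ratio in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only,
# then reuses that fitted vocabulary when predicting
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_texts, y_train)
predictions = pipeline.predict(test_texts)
print(list(predictions))
```

The fitted pipeline also accepts new raw strings directly, for example `pipeline.predict(["church faith prayer"])`, which keeps preprocessing and prediction in one object.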

5. Model Evaluation

We evaluate the model on the held-out test set, reporting overall accuracy along with per-class precision, recall, and F1-score:

# Predicting the Test set results
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
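The classification report lists precision, recall, and F1-score per class. To show what those numbers mean, the sketch below computes them by hand for one class. The `precision_recall_f1` helper and the demo labels are hypothetical and exist only for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted positive, how much was correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: two true positives, one false positive, one false negative
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo, positive=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision drops when the model labels too many documents as the class, while recall drops when it misses documents that belong to it.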

6. Conclusion

This guide gives you a basic understanding and practical implementation of text classification in Python. You can explore different preprocessing techniques, feature extraction methods, and machine learning models to improve performance. The world of NLP is vast, and text classification is just the beginning!

Feel free to experiment with different aspects of the code or reach out for more in-depth examples and explanations!
