Text Classification in Python: A Practical Guide with Code
Text classification is a core technique in natural language processing (NLP) that assigns text to predefined classes. Python libraries such as scikit-learn make the task accessible. This guide walks through a basic example with code: loading a dataset, cleaning the text, extracting TF-IDF features, training a Naive Bayes classifier, and evaluating the result.
1. Data Collection
Our first step is to gather a dataset. For simplicity, we’ll use the 20 Newsgroups dataset, which scikit-learn downloads and caches on first use. We restrict it to two categories, so this becomes a binary classification problem.
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian'], shuffle=True, random_state=42)
2. Data Preprocessing
Text data often needs cleaning. Here’s a simple way to preprocess text data:
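import re

def preprocess_text(text):
    # Replace non-word characters (punctuation, symbols) with a space
    text = re.sub(r'\W', ' ', str(text))
    # Replace digits with a space
    text = re.sub(r'\d', ' ', text)
    # Remove single letters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Remove a single letter at the start of the text
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase and trim leading/trailing whitespace
    return text.lower().strip()

processed_data = [preprocess_text(text) for text in data.data]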
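To see what this kind of cleaning does to a single string, the sketch below applies a similar sequence of substitutions to a made-up sentence and records the result after each step. The sample sentence, the `steps` list, and the `trace_cleaning` helper are illustrative assumptions, not part of the guide's pipeline.

```python
import re

# Hypothetical sample string, not taken from the dataset.
sample = "Hello, World!! This is   test #42."

# Each step is (description, pattern, replacement).
steps = [
    ("non-word characters -> space", r"\W", " "),
    ("digits -> space", r"\d", " "),
    ("collapse whitespace", r"\s+", " "),
]

def trace_cleaning(text):
    """Apply each substitution in order and record the intermediate strings."""
    history = []
    for description, pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
        history.append((description, text))
    # Lowercase and trim at the end, as in the cleaning function.
    return text.lower().strip(), history

cleaned, history = trace_cleaning(sample)
for description, intermediate in history:
    print(f"{description}: {intermediate!r}")
print("final:", repr(cleaned))
```

Printing the intermediate strings with `repr` makes the extra spaces visible, which helps when checking that a pattern removes what you expect and nothing more.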
3. Feature Extraction
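We’ll use TF-IDF for feature extraction. The vectorizer below keeps at most 5,000 terms, ignores terms that appear in fewer than 5 documents or in more than 70% of documents, and drops common English stop words: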
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.7, stop_words='english')
X = vectorizer.fit_transform(processed_data)
4. Model Selection
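We’ll use a multinomial Naive Bayes classifier, a common baseline for text represented as word counts or TF-IDF weights. We hold out 20% of the documents for testing: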
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, data.target, test_size=0.2, random_state=42)
# Training the model
model = MultinomialNB()
model.fit(X_train, y_train)
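Note that the vectorizer above was fitted on all documents before the split, so document-frequency statistics from the test documents influence the features. One common alternative, offered here as an assumption rather than as part of the original workflow, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The toy texts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy texts and labels, invented for illustration (1 = sports, 0 = cooking).
texts = [
    "the team won the match", "a late goal decided the game",
    "the striker scored twice", "fans cheered the home team",
    "simmer the sauce slowly", "bake the bread until golden",
    "chop the onions finely", "season the soup with salt",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test documents stay unseen.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains Naive Bayes.
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(texts_train, labels_train)

# At prediction time the pipeline reuses the training vocabulary.
predictions = toy_pipeline.predict(texts_test)
print(predictions)
```

With this arrangement a single `fit` call handles both steps, and the same object can be passed to cross-validation utilities without refitting the vectorizer by hand.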
5. Model Evaluation
We evaluate the model on the held-out test set using accuracy and a per-class report of precision, recall, and F1-score:
# Predicting the Test set results
y_pred = model.predict(X_test)
# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
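Accuracy alone does not show which class is being misclassified. A confusion matrix can help here; the sketch below uses hypothetical label lists rather than the model's real output, so the numbers say nothing about the classifier above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a two-class problem.
true_labels = [0, 0, 0, 1, 1, 1, 1, 0]
pred_labels = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(true_labels, pred_labels)
print(matrix)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
accuracy = matrix.trace() / matrix.sum()
print(accuracy, accuracy_score(true_labels, pred_labels))
```

On the real data, the same call would be `confusion_matrix(y_test, y_pred)`.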
6. Conclusion
This guide gives you a basic understanding and practical implementation of text classification in Python. You can explore different preprocessing techniques, feature extraction methods, and machine learning models to improve performance. The world of NLP is vast, and text classification is just the beginning!
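As one possible way to compare models, the sketch below scores two classifiers with cross-validation on a toy corpus. The corpus, the choice of logistic regression as the second model, and the `candidates` dictionary are assumptions made for illustration; scores on eight invented sentences are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration (1 = weather, 0 = finance).
texts = [
    "heavy rain is expected tonight", "sunny skies and a light breeze",
    "snow will fall in the hills", "a storm brings strong wind and rain",
    "the bank raised interest rates", "stock prices fell after the report",
    "investors bought more bank stock", "interest on the loan rose again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each candidate is a full pipeline, so TF-IDF is refitted inside every fold.
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, pipeline in candidates.items():
    # With an integer cv and a classifier, scikit-learn uses stratified folds.
    fold_scores = cross_val_score(pipeline, texts, labels, cv=4)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```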
Feel free to experiment with different aspects of the code or reach out for more in-depth examples and explanations!