What exactly is text classification?
Text classification is like teaching a computer to sort written text into different categories — imagine organizing emails into “spam” or “inbox.” The Hugging Face SetFit framework is a tool that makes this teaching process simpler and more efficient. It allows us to train computers to understand and classify text using only a small amount of example data.
This means we can quickly build models that help computers grasp human language nuances, even when we don’t have much data. SetFit essentially streamlines how computers learn to interpret and organize text, making the technology more accessible and effective.
What is vectorization?
Before we train our text classification model, let’s understand a key concept called vectorization. It might sound technical, but it’s quite simple.
Think of vectorization as translating words into numbers so that computers can understand them. Computers don’t comprehend language like we do — they need numbers to process information.
Example:
Words as Numbers: Imagine each word is assigned a unique number or a set of numbers, much like giving every house on a street its address. This way, the computer knows exactly where to find each word.
Creating a Word Map: Imagine a map on which similar words are located close to each other. For example, “happy” and “joyful” might be neighbors on this map, while “happy” and “sad” are farther apart.
Understanding Relationships: By mapping words this way, computers can understand relationships between words. They can see that “king” and “queen” are related, just like houses in the same neighborhood.
Vectorization is all about helping computers “read” text by converting words into a numerical format they can process. It’s a crucial step in text classification, allowing machines to sort and make sense of written information, just like we do — only with numbers.
Understanding this concept gives us insight into how technologies like search engines, voice assistants, and spam filters work. They all rely on vectorization to interpret and manage the vast amounts of text they handle every day.
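To make the "word map" idea concrete, here is a minimal sketch. The three-number vectors are hand-made for illustration and do not come from a real model (real embeddings have hundreds of dimensions); the point is only to show how closeness between words can be measured with cosine similarity.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative values, not from a real model).
toy_vectors = {
    "happy":  [0.9, 0.8, 0.1],
    "joyful": [0.85, 0.75, 0.2],
    "sad":    [-0.8, -0.7, 0.1],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_happy_joyful = cosine_similarity(toy_vectors["happy"], toy_vectors["joyful"])
sim_happy_sad = cosine_similarity(toy_vectors["happy"], toy_vectors["sad"])

print(f"happy vs joyful: {sim_happy_joyful:.3f}")
print(f"happy vs sad:    {sim_happy_sad:.3f}")
```

A score near 1 means two words point in almost the same direction on the map, while a low or negative score means they sit far apart.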
What is Fine-tuning?
Imagine you have a smartphone with general settings that work for everyone. But to make it truly yours, you adjust the settings — like setting your preferred language, choosing wallpaper, or arranging apps the way you like. Fine-tuning in machine learning is quite similar. We take a model that’s already learned general language patterns and tweak it slightly so it performs better on our specific task.
Starting with a Pre-Trained Model:
Think of this as a student who has completed a general education. They have a broad knowledge of various subjects.
Introducing Specific Training Data:
We provide the model with examples related to our particular task. For instance, if we’re building a model to detect positive or negative movie reviews, we’d give labeled examples of such reviews.
Adjusting the Model — Fine-Tuning:
The model uses these examples to adjust its understanding, much like our student taking specialized courses to become an expert in a specific field.
Result:
A model that’s adept at performing our specific task with higher accuracy.
Fine-tuning is like giving our model a focused training session on what matters most to us. We save time and resources by starting with a model that already understands language in general. Then, by fine-tuning it with specific examples, we make it an expert in our desired task.
Understanding fine-tuning helps us appreciate how modern technology can be adapted quickly and efficiently to meet various needs, making our interactions with machines more seamless and effective.
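The idea of "start from existing weights, then nudge them with a few labeled examples" can be sketched with a tiny toy classifier. Everything here is an assumption made for illustration: the two features (counts of positive and negative words), the starting weights, and the learning rate are hypothetical, and real fine-tuning adjusts millions of parameters in a language model rather than two numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability that the example is positive."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def fine_tune(weights, bias, examples, lr=0.5, steps=200):
    """Plain gradient descent on log loss, starting from the given weights."""
    weights = list(weights)
    for _ in range(steps):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Hypothetical "pre-trained" starting point that knows nothing about our task yet.
pretrained_weights, pretrained_bias = [0.1, 0.1], 0.0

# Hypothetical features: [number of positive words, number of negative words]
examples = [([2, 0], 1), ([0, 2], 0), ([1, 0], 1), ([0, 1], 0)]

tuned_weights, tuned_bias = fine_tune(pretrained_weights, pretrained_bias, examples)
before = predict(pretrained_weights, pretrained_bias, [2, 0])
after = predict(tuned_weights, tuned_bias, [2, 0])
print(f"P(positive) before: {before:.2f}, after: {after:.2f}")
```

The starting weights give an almost undecided answer, and a handful of labeled examples is enough to push the toy model toward confident predictions on this task.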
What is SetFit?
Now that you understand text classification, vectorization, and fine-tuning, we come to the main topic of this blog.
SetFit, which stands for Sentence Transformer Fine-Tuning, is designed to streamline the process of adapting pre-trained language models to specific text classification tasks. Here’s how it helps:
Less Data Needed:
Traditional fine-tuning often requires a large amount of labeled data. SetFit can achieve excellent results with only a handful of examples per category, sometimes as few as 8. This is great when you don’t have a lot of data to work with.
User-Friendly Approach:
SetFit simplifies the technical steps involved in fine-tuning. You don’t need to be an expert in machine learning to get good results.
Two-Step Training:
Step 1:
The model learns to understand the nuances of your specific data through a technique called contrastive learning, where it figures out how different pieces of text are similar or different.
Step 2:
It then learns to classify text into your desired categories based on this understanding.
Quick Results:
Because it requires less data and simplifies the training steps, SetFit allows models to be fine-tuned more quickly than traditional methods.
Runs on Standard Computers:
You don’t need powerful hardware or special equipment. SetFit is designed to work efficiently on regular computers.
Quality Outcomes:
Despite the simplicity and speed, SetFit models still perform very well, often matching the accuracy of models trained with more complex methods.
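Step 1 of the two-step training above works on pairs of sentences. The sketch below shows one simple way such pairs could be built from a few labeled examples: sentences with the same label get a target similarity of 1.0, and sentences with different labels get 0.0. The function name `make_contrastive_pairs` is hypothetical, and the real library samples pairs rather than enumerating all of them, so treat this as a simplification.

```python
from itertools import combinations

def make_contrastive_pairs(sentences, labels):
    """Pair up every two sentences; target is 1.0 if labels match, else 0.0."""
    pairs = []
    for (s1, l1), (s2, l2) in combinations(zip(sentences, labels), 2):
        pairs.append((s1, s2, 1.0 if l1 == l2 else 0.0))
    return pairs

sentences = [
    "I absolutely loved this movie!",
    "The film was terrible.",
    "An enjoyable experience.",
    "It was boring and too long.",
]
labels = [1, 0, 1, 0]

pairs = make_contrastive_pairs(sentences, labels)
for first, second, target in pairs:
    print(f"{target:.1f}  {first!r} <-> {second!r}")
```

Four labeled sentences already yield six training pairs, which is part of why this approach can get by with so few examples.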
Think of SetFit as using a cake mix instead of baking from scratch:
Traditional Baking (Fine-Tuning): You gather all the ingredients, measure everything precisely, and follow complex instructions. It’s time-consuming and requires baking skills.
Cake Mix (SetFit): Most of the work is already done for you. You just add a couple of ingredients, mix, and bake. You still get a delicious cake without the hassle.
SetFit takes the complexity out of fine-tuning language models for text classification. Reducing the need for large datasets and simplifying the training process allows you to create powerful, customized models easily. Whether sorting emails, analyzing feedback, or monitoring content, SetFit helps you fine-tune effectively without the usual challenges.
Understanding how SetFit simplifies fine-tuning gives you the tools to harness advanced AI technology in a practical and accessible way. It’s like having a friendly guide that helps you navigate the world of machine learning without getting bogged down in technical details.
Now let’s do some coding 。◕‿‿◕。 🗲
First, let’s install the required library:
pip install setfit
We’ll import the necessary modules from the setfit library and other helpful libraries.
from setfit import SetFitModel, SetFitTrainer
from sklearn.metrics import accuracy_score
We’ll define our sample sentences and their corresponding labels.
# Sample sentences
sentences = [
    "I absolutely loved this movie! The plot was thrilling.",
    "The film was terrible and a complete waste of time.",
    "An enjoyable experience with outstanding performances.",
    "I didn't like the movie; it was boring and too long."
]
# Labels: 1 for Positive, 0 for Negative
labels = [1, 0, 1, 0]
We start with a pre-trained model that hasn’t been fine-tuned for our specific task.
# Load a pre-trained SetFit model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
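Let’s see how the model performs before any fine-tuning. One caveat: at this stage the classification head has not been trained, so depending on your SetFit version this call may raise a not-fitted error rather than return labels. Treat the output below as an illustration of an untrained classifier.
# Get predictions before fine-tuning
preds_before = model.predict(sentences)
print("Predictions before fine-tuning:")
for sentence, pred in zip(sentences, preds_before):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f'Sentence: "{sentence}"\nPredicted Sentiment: {sentiment}\n')

Predictions before fine-tuning: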
Sentence: “I absolutely loved this movie! The plot was thrilling.”
Predicted Sentiment: Negative
Sentence: “The film was terrible and a complete waste of time.”
Predicted Sentiment: Positive
Sentence: “An enjoyable experience with outstanding performances.”
Predicted Sentiment: Negative
Sentence: “I didn’t like the movie; it was boring and too long.”
Predicted Sentiment: Positive
Now, we’ll fine-tune the model using our small dataset.
# Prepare the training data as a Hugging Face Dataset with "text" and "label" columns
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
train_data = Dataset.from_dict({"text": sentences, "label": labels})
# Initialize the trainer
# (SetFitTrainer is the pre-1.0 API; newer SetFit releases use Trainer with TrainingArguments)
trainer = SetFitTrainer(
    model=model,                      # The pre-trained SetFit model we are fine-tuning
    train_dataset=train_data,         # The training data (sentences and labels) used for fine-tuning
    eval_dataset=None,                # Optional evaluation data to assess performance
    loss_class=CosineSimilarityLoss,  # Loss class for the contrastive step (compares sentence-pair similarity)
    metric="accuracy",                # Metric used when evaluating the model
    batch_size=8,                     # Number of samples processed before updating the model
    num_iterations=20,                # Number of pair-generation iterations for contrastive learning
)
# Fine-tune the model
trainer.train()
Let’s see how the model performs after fine-tuning.
# Get predictions after fine-tuning
preds_after = model.predict(sentences)
print("Predictions after fine-tuning:")
for sentence, pred in zip(sentences, preds_after):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f'Sentence: "{sentence}"\nPredicted Sentiment: {sentiment}\n')

Predictions after fine-tuning:
Sentence: “I absolutely loved this movie! The plot was thrilling.”
Predicted Sentiment: Positive
Sentence: “The film was terrible and a complete waste of time.”
Predicted Sentiment: Negative
Sentence: “An enjoyable experience with outstanding performances.”
Predicted Sentiment: Positive
Sentence: “I didn’t like the movie; it was boring and too long.”
Predicted Sentiment: Negative
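After fine-tuning, the model predicts the sentiment of all four sentences correctly. Keep in mind that these are the same sentences it was trained on, so this shows that the model has learned the examples, not how accurate it is on unseen text.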
We can also look at the probabilities the model assigns to each class.
# Get probabilities before fine-tuning (run this part before calling trainer.train())
probs_before = model.predict_proba(sentences)
print("Probabilities before fine-tuning:")
for sentence, prob in zip(sentences, probs_before):
    print(f'Sentence: "{sentence}"\nProbability (Negative, Positive): {prob}\n')
# Get probabilities after fine-tuning (run this part after trainer.train())
probs_after = model.predict_proba(sentences)
print("Probabilities after fine-tuning:")
for sentence, prob in zip(sentences, probs_after):
    print(f'Sentence: "{sentence}"\nProbability (Negative, Positive): {prob}\n')

Probabilities before fine-tuning:
Sentence: “I absolutely loved this movie! The plot was thrilling.”
Probability (Negative, Positive): [0.6, 0.4]
Sentence: “The film was terrible and a complete waste of time.”
Probability (Negative, Positive): [0.4, 0.6]
…
Probabilities after fine-tuning:
Sentence: “I absolutely loved this movie! The plot was thrilling.”
Probability (Negative, Positive): [0.1, 0.9]
Sentence: “The film was terrible and a complete waste of time.”
Probability (Negative, Positive): [0.95, 0.05]
…
The probabilities after fine-tuning show higher confidence in the correct class.
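If you want to turn a probability pair into a readable label plus a confidence score, a small helper like the one below can do it. The name `label_with_confidence` is hypothetical, and the helper assumes plain Python lists in (Negative, Positive) order; if your model returns tensors, converting them first (for example with `.tolist()`) is an assumption you would need to check for your setup. The sample values are the illustrative numbers shown above.

```python
def label_with_confidence(probabilities, class_names=("Negative", "Positive")):
    """Return the most likely class name and its probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]

# Illustrative probability pairs in (Negative, Positive) order
example_probs = [[0.6, 0.4], [0.1, 0.9], [0.95, 0.05]]

for probs in example_probs:
    name, confidence = label_with_confidence(probs)
    print(f"{probs} -> {name} ({confidence:.0%} confident)")
```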
By walking through this example with actual code, we’ve showcased how SetFit simplifies the fine-tuning process, making it straightforward to adapt pre-trained models to your specific text classification tasks.
Bonus: let’s look at knowledge distillation, a powerful technique in machine learning, and see how it can be applied to text classification tasks.
What Is Knowledge Distillation?
Knowledge distillation is like transferring wisdom from a teacher to a student. In machine learning:
Teacher Model: A large, complex model that has been trained on a vast amount of data and has learned intricate patterns.
Student Model: A smaller, simpler model that we want to train to perform almost as well as the teacher.
Goal: To create a lightweight model (student) that mimics the performance of a heavyweight model (teacher) but is more efficient and faster, making it suitable for deployment on devices with limited resources like smartphones or embedded systems.
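One common way for a student to mimic a teacher is to match the teacher's "softened" probabilities rather than only its final answer. The sketch below uses made-up logits for a single example (both sets of numbers are hypothetical) to show how dividing by a temperature flattens the teacher's distribution, and how a KL-divergence loss measures the gap between student and teacher.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature gives a flatter result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (q must have no zero entries)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0]  # hypothetical teacher scores (Negative, Positive)
student_logits = [2.0, 1.5]  # hypothetical student scores
temperature = 2.0

hard = softmax(teacher_logits)               # temperature 1: sharp distribution
soft = softmax(teacher_logits, temperature)  # temperature 2: softer distribution
student_soft = softmax(student_logits, temperature)

# Scaling by temperature squared is a common convention in distillation
loss = kl_divergence(soft, student_soft) * temperature ** 2

print("teacher (T=1):", [round(p, 3) for p in hard])
print("teacher (T=2):", [round(p, 3) for p in soft])
print("distillation loss:", round(loss, 4))
```

The softened targets carry more information about how the teacher ranks the classes, which gives the student more to learn from than a plain 0 or 1 label.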
We’ll use Python and the Hugging Face transformers library.
pip install transformers datasets torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
We’ll use the IMDb movie reviews dataset.
# Load the IMDb dataset
dataset = load_dataset('imdb')
# Use a subset for faster training (optional)
train_dataset = dataset['train'].shuffle(seed=42).select(range(2000))
test_dataset = dataset['test'].shuffle(seed=42).select(range(500))
We’ll use a large pre-trained model as the teacher, such as BERT.
teacher_model_name = 'bert-base-uncased'
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_name, num_labels=2)
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
Fine-tune the teacher model on the training data.
# Tokenize the training data
def tokenize(batch):
    # Pad every review to the same length so batches can be stacked into tensors
    return teacher_tokenizer(batch['text'], padding='max_length', truncation=True)
train_dataset = train_dataset.map(tokenize, batched=True)
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
# DataLoader
from torch.utils.data import DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
# Optimizer and Scheduler
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import get_scheduler
optimizer = AdamW(teacher_model.parameters(), lr=5e-5)
num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name='linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
# Training Loop
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
teacher_model.to(device)
teacher_model.train()
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Pass the targets explicitly: the model's forward() expects the keyword "labels"
        outputs = teacher_model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['label'],
        )
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
We’ll use a smaller model for the student, such as DistilBERT.
student_model_name = 'distilbert-base-uncased'
student_model = AutoModelForSequenceClassification.from_pretrained(student_model_name, num_labels=2)
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name)
student_model.to(device)
student_model.train()
Tokenize the data using the student tokenizer.
# Tokenize with student tokenizer
def tokenize_student(batch):
    return student_tokenizer(batch['text'], padding='max_length', truncation=True)
# Re-select the same raw subset (same seed, same order) so the 'text' column is available
raw_train_dataset = dataset['train'].shuffle(seed=42).select(range(2000))
train_dataset_student = raw_train_dataset.map(tokenize_student, batched=True)
train_dataset_student.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
train_dataloader_student = DataLoader(train_dataset_student, batch_size=8)
Train the student model using the teacher’s outputs.
# Loss Function
loss_fn = torch.nn.KLDivLoss(reduction='batchmean')
# Training Loop for Distillation
temperature = 2.0
optimizer_student = AdamW(student_model.parameters(), lr=5e-5)
num_training_steps_student = num_epochs * len(train_dataloader_student)
lr_scheduler_student = get_scheduler(
    name='linear', optimizer=optimizer_student, num_warmup_steps=0, num_training_steps=num_training_steps_student
)
progress_bar_student = tqdm(range(num_training_steps_student))
# Unshuffled teacher loader, so each teacher batch lines up with the matching student batch
teacher_dataloader_distill = DataLoader(train_dataset, batch_size=8)
teacher_model.eval()  # the teacher only provides targets, so switch off dropout
for epoch in range(num_epochs):
    for batch_teacher, batch_student in zip(teacher_dataloader_distill, train_dataloader_student):
        # Move batches to device
        batch_teacher = {k: v.to(device) for k, v in batch_teacher.items()}
        batch_student = {k: v.to(device) for k, v in batch_student.items()}
        # Get teacher's predictions
        with torch.no_grad():
            teacher_outputs = teacher_model(
                input_ids=batch_teacher['input_ids'], attention_mask=batch_teacher['attention_mask']
            )
            teacher_logits = teacher_outputs.logits / temperature
            teacher_probs = torch.nn.functional.softmax(teacher_logits, dim=-1)
        # Get student's predictions
        student_outputs = student_model(
            input_ids=batch_student['input_ids'], attention_mask=batch_student['attention_mask']
        )
        student_logits = student_outputs.logits / temperature
        student_log_probs = torch.nn.functional.log_softmax(student_logits, dim=-1)
        # Compute distillation loss
        loss = loss_fn(student_log_probs, teacher_probs) * (temperature ** 2)
        # Backpropagation
        loss.backward()
        optimizer_student.step()
        lr_scheduler_student.step()
        optimizer_student.zero_grad()
        progress_bar_student.update(1)
# Prepare test data
test_dataset = test_dataset.map(tokenize_student, batched=True)
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataloader = DataLoader(test_dataset, batch_size=8)
# Evaluation Loop
student_model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Pass only the inputs; the label column is used below for scoring
        outputs = student_model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
        predictions = torch.argmax(outputs.logits, dim=-1)
        correct += (predictions == batch['label']).sum().item()
        total += batch['label'].size(0)
accuracy = correct / total
print(f'Student Model Accuracy: {accuracy * 100:.2f}%')
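You should get results comparable to the basic SetFit training, but with reduced prediction time, since the student model is smaller than the teacher. Exact numbers will depend on your data subset and hardware.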
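Since prediction time is the main selling point here, it is worth measuring it yourself. The helper below is a minimal sketch: `average_latency` is a hypothetical name, and the two stand-in "models" only simulate different costs with `time.sleep`. In practice you would pass in functions that wrap your own teacher and student models.

```python
import time

def average_latency(predict_fn, inputs, repeats=5):
    """Return the mean wall-clock seconds per call of predict_fn(inputs)."""
    predict_fn(inputs)  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(inputs)
    return (time.perf_counter() - start) / repeats

# Stand-in "models": hypothetical functions that only simulate different costs.
def slow_model(texts):
    time.sleep(0.02)
    return [0] * len(texts)

def fast_model(texts):
    time.sleep(0.005)
    return [0] * len(texts)

texts = ["great film", "dull plot"]
slow = average_latency(slow_model, texts)
fast = average_latency(fast_model, texts)
print(f"slow model: {slow * 1000:.1f} ms per call")
print(f"fast model: {fast * 1000:.1f} ms per call")
```

Timing both models on the same inputs and the same machine keeps the comparison fair.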
Conclusion:
We’ve journeyed through the essential concepts of text classification, starting with how computers interpret text through vectorization. We explored how fine-tuning pre-trained models allows us to tailor these tools to our specific needs without starting from scratch. The introduction of the SetFit framework showcased a user-friendly and efficient way to fine-tune models with minimal data, making advanced text classification accessible to everyone, even those without extensive machine-learning expertise. By walking through practical code examples, we demonstrated how SetFit simplifies the process, enabling quick adaptation of models to predict sentiments in text.
We also delved into the concept of knowledge distillation, illustrating how it helps create smaller, faster models that retain most of the performance of larger, more complex ones. This technique is valuable for deploying models on devices with limited resources, improving efficiency with little loss in accuracy. By combining SetFit’s simplicity with the efficiency of knowledge distillation, we can harness powerful AI technologies to build practical, real-world applications. These tools not only make text classification more effective but also more accessible, paving the way for innovative solutions in various industries.
Text Classification Made Easy with SetFit… was originally published in Coinmonks on Medium, where people are continuing the conversation by highlighting and responding to this story.