Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM

In this article, you will learn how to build an end-to-end sentiment analysis pipeline using Scikit-LLM and open-source large language models served through the Groq API.

Topics we will cover include:

How Scikit-LLM bridges classical scikit-learn pipelines with modern large language model API calls.
How to set up Scikit-LLM with a Groq backend and prepare the IMDB Movie Reviews dataset for inference.
How to build, run, and evaluate a zero-shot sentiment classification pipeline using scikit-learn-compatible syntax.

Introduction

Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text — for instance, TF-IDF frequencies or token embeddings — to feed into classical models such as logistic regression, ensembles, or support vector machines.

With the rise of large language models (LLMs), the rules of the game have somewhat changed: it is now possible to leverage zero-shot or few-shot reasoning on existing, pre-trained models for language tasks as part of a machine learning framework. Scikit-LLM is a Python library that addresses this: it bridges the gap between classical machine learning and modern LLM API calls. In this article, we will use Scikit-LLM alongside Groq backend models to build an end-to-end pipeline for sentiment analysis (a domain-specific form of text classification), achieving reasonably fast inference results with open-source models. From preprocessing to inference, we will use a large, realistically-sized dataset — the IMDB movie reviews dataset.

Prerequisites, Setup, and Obtaining the Dataset

To run the code in this tutorial, you'll need the Scikit-LLM library installed (for example, via pip install scikit-llm).

Once installed, the first step is to configure the API credentials. In other words, we need to connect Scikit-LLM to an endpoint, namely an LLM API provider like Groq. Register on Groq and generate an API key at https://console.groq.com/keys, then paste it into the code below:

from skllm.config import SKLLMConfig

# 1. Point Scikit-LLM to Groq's OpenAI-compatible endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-GOES-HERE")

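Scikit-LLM's GPT backend targets the OpenAI API by default. The set_gpt_url function overrides that base URL, so requests go to Groq's OpenAI-compatible endpoint at https://api.groq.com/openai/v1 instead. The custom_url:: prefix in the model name used later in this tutorial tells Scikit-LLM to use this custom endpoint.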
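
Pasting a key directly into a script makes it easy to leak by accident. As a minimal sketch, the key could instead be read from an environment variable. The variable name GROQ_API_KEY and the helper load_api_key are assumptions made for this illustration and are not part of Scikit-LLM:

```python
import os


def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Hypothetical usage with the configuration call shown above:
#   SKLLMConfig.set_openai_key(load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, in the middle of a prediction run.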

The next stage of the process is importing the IMDB Movie Reviews dataset — which has about 50K instances — and preparing it for the sentiment analysis pipeline we will build. Instances consist of a text review labeled with a sentiment, which can be positive or negative (this is a binary classification problem, solvable with models like logistic regression, for instance).

For convenience, we read the dataset from a publicly available GitHub repository version in CSV format:

import pandas as pd
from sklearn.model_selection import train_test_split

# Fetching a large, realistically sized dataset (IMDB Movie Reviews, 50,000 rows)
# We will read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
print("Downloading dataset...")
df = pd.read_csv(url)

print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free-tier API, sending 50,000 requests
# will likely trigger quota limits. Thus, we will use 500 rows for demonstrating our pipeline execution.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: that's perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]  # Labels are 'positive' or 'negative'

# Splitting into training (for initializing zero-shot labels) and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Note that we sampled only 500 rows for demonstration purposes: classifying all 50,000 reviews would take a long time and would likely exceed a free-tier API quota. You can freely change the sample size, n=500, to suit your needs.
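
To get a feel for how the sample size translates into API usage, here is a rough back-of-the-envelope sketch. It assumes one request per test review and a fixed per-minute request cap. The cap of 50 used below is a made-up figure for illustration, not Groq's actual limit, and both helper names are hypothetical:

```python
import math


def estimate_requests(n_rows: int, test_size: float) -> int:
    """Rows in the test split, assuming one API request per test review."""
    # train_test_split rounds a fractional test size up to a whole row count.
    return math.ceil(n_rows * test_size)


def estimate_minutes(n_requests: int, requests_per_minute: int) -> float:
    """Lower bound on wall-clock minutes under a fixed per-minute request cap."""
    return n_requests / requests_per_minute


n_test = estimate_requests(500, 0.2)
print(n_test)                        # 100
print(estimate_minutes(n_test, 50))  # 2.0
```

Re-running the same arithmetic with a larger n shows quickly whether a bigger sample is realistic for your quota.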

Building the Sentiment Analysis Pipeline

Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model setup or training, inference, and evaluation. For a predictive, text-based scenario like ours, preprocessing typically entails cleaning and normalizing the text. Scikit-learn provides an elegant class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:

from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and stripping whitespace."""
    series = pd.Series(texts).astype(str)
    # Remove HTML tags like <br />
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    # Remove extra spaces
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()

# Wrapping the cleaning function to enable its use inside a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)

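
To see what the two regular expressions do without downloading any data, the same cleaning steps can be tried on a single string using only the standard library. This is a minimal sketch: the helper name clean_one and the sample review are made up for illustration.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # any HTML tag, e.g. <br />
SPACE_PATTERN = re.compile(r"\s+")    # one or more whitespace characters


def clean_one(text: str) -> str:
    """Apply the same tag removal and whitespace cleanup to one string."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return SPACE_PATTERN.sub(" ", without_tags.strip())


sample = "Great movie!<br /><br />Loved   it. "
print(clean_one(sample))  # Great movie! Loved it.
```

Tags are replaced with a space rather than an empty string so that words on either side of a <br /> do not get glued together. The whitespace pass then collapses any doubled spaces this leaves behind.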

Now we put together this preprocessing object with a model instance to create the Pipeline. Once defined, this pipeline orchestrates the whole process of preparing the data and passing it to the model at both training and inference stages — even though we use the term “training”, no actual weight-based training will occur, as we are utilizing a pre-trained model from Groq for zero-shot classification. Fitting the model only involves passing it the classification labels to use.

from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # Updated to use Groq's active Llama 3.1 8B model
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For Zero-Shot classification, fit() doesn't train the LLM.
# It simply registers the unique labels present in 'y_train' (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)

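
The idea that fitting only registers the labels can be illustrated with a toy stand-in. The sketch below is not Scikit-LLM's implementation: LabelOnlyClassifier and keyword_rule are hypothetical names, and the keyword rule merely takes the place of the LLM call so that the example runs offline.

```python
from typing import Callable, Iterable, List


class LabelOnlyClassifier:
    """Toy stand-in: fit() stores the label set, predict() defers to a callable."""

    def __init__(self, decide: Callable[[str, List[str]], str]):
        self.decide = decide  # hypothetical replacement for the LLM call

    def fit(self, X: Iterable[str], y: Iterable[str]) -> "LabelOnlyClassifier":
        # No weights are learned; only the distinct labels are recorded.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X: Iterable[str]) -> List[str]:
        # Each text is classified independently against the stored labels.
        return [self.decide(text, self.classes_) for text in X]


def keyword_rule(text: str, labels: List[str]) -> str:
    """Crude rule used only so the sketch runs without an API."""
    choice = "negative" if "boring" in text.lower() else "positive"
    return choice if choice in labels else labels[0]


clf = LabelOnlyClassifier(keyword_rule).fit(
    ["loved it", "so boring"], ["positive", "negative"]
)
print(clf.classes_)                       # ['negative', 'positive']
print(clf.predict(["A boring sequel."]))  # ['negative']
```

The training texts are ignored entirely here, which is the sense in which a zero-shot "fit" differs from training a classical model.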

Once we have run the pipeline to “fit” the model, we use it once more for inference. Both steps use familiar scikit-learn syntax. Besides evaluating the model pipeline’s performance, we also display a few example predictions:

from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")
# Run predictions through the pipeline
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on the realistic data
print("\n--- Classification Report ---")
print(classification_report(y_test, predictions))

# Display a few side-by-side examples
print("\n--- Sample Predictions ---")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate review for display purposes
    short_review = review[:100] + "..."
    print(f"Review: {short_review}")
    print(f"Actual: {actual} | Predicted: {predicted}\n")

Here’s the detailed output — execution of the above code may take a few minutes to complete:

--- Classification Report ---
              precision    recall  f1-score   support

    negative       0.95      0.97      0.96        60
    positive       0.95      0.93      0.94        40

    accuracy                           0.95       100
   macro avg       0.95      0.95      0.95       100
weighted avg       0.95      0.95      0.95       100

--- Sample Predictions ---
Review: I saw mommy...well, she wasn't exactly kissing Santa Clause; he has his hand on her thigh and wicked...
Actual: negative | Predicted: negative

Review: This entry is certainly interesting for series fans (like myself), but yet it is mostly incomprehens...
Actual: negative | Predicted: negative

Review: Ingrid Bergman (Cleo Dulaine) has never been so beautiful. Gary Cooper as "Cleent" so perfectly cast...
Actual: positive | Predicted: positive

Our pipeline is doing a solid job at classifying sentiment in reviews. Well done!
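
To make the report's numbers more concrete, the per-class metrics can be recomputed by hand. The counts below are not printed in the output above. They are inferred from the rounded report (supports of 60 and 40, five errors in total) and should be read as a reconstruction. The helper name prf is hypothetical:

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts, named as <actual label>_as_<predicted label>.
neg_as_neg, neg_as_pos = 58, 2
pos_as_neg, pos_as_pos = 3, 37

negative = prf(tp=neg_as_neg, fp=pos_as_neg, fn=neg_as_pos)
positive = prf(tp=pos_as_pos, fp=neg_as_pos, fn=pos_as_neg)
accuracy = (neg_as_neg + pos_as_pos) / 100

print([f"{v:.2f}" for v in negative])  # ['0.95', '0.97', '0.96']
print([f"{v:.2f}" for v in positive])  # ['0.95', '0.93', '0.94']
print(f"{accuracy:.2f}")               # 0.95
```

With only 100 test reviews, each misclassification moves accuracy by a full percentage point, so a larger test set would give a steadier estimate.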

Wrapping Up

This article walked you through defining an end-to-end pipeline for sentiment classification using Scikit-LLM and freely available, pre-trained LLMs from API endpoints like Groq. This is a versatile approach to using classic scikit-learn syntax in novel, LLM-driven machine learning applications.
