Using the Neuralk NICLClassifier
The NICLClassifier is the simplest way to use Neuralk’s In-Context
Learning model for classification. It offers the usual scikit-learn classifier
interface so it can easily be inserted into any machine-learning pipeline.
Note
For this example to run, the environment variable API_KEY must be set
with your Neuralk API key.
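As an illustrative addition (not part of the original example), a pre-flight check fails early with a clearer message when the key is missing:

import os

# Illustrative check: stop immediately if the Neuralk API key is not set.
if os.environ.get("API_KEY") is None:
    raise RuntimeError(
        "Set the API_KEY environment variable to your Neuralk API key."
    )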
Simple example on toy data
We start by using the NICLClassifier on simple data that needs no preprocessing.
Generate simple data:
import os
import warnings
import skrub
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from neuralk import NICLClassifier, datasets
# Use plain-text dataframe displays instead of interactive table reports.
skrub.set_config(use_table_report=False)
# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(75, 20) y_train.shape=(75,) X_test.shape=(25, 20) y_test.shape=(25,)
Now we apply Neuralk’s classifier.
# Note: nothing actually happens during fit() -- in-context learning models are
# pretrained but require no fitting on our specific dataset.
classifier = NICLClassifier(api_key=API_KEY).fit(X_train, y_train)
predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Accuracy: 0.44
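Because the classifier follows the scikit-learn interface, the usual conveniences should also apply. For instance, assuming the standard score method is available (an assumption based on the interface description, not something this example demonstrates), the same accuracy can be computed directly:

# Equivalent to accuracy_score(y_test, classifier.predict(X_test)),
# assuming the usual scikit-learn score() method is provided.
print(classifier.score(X_test, y_test))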
Working with non-numeric data
The Neuralk NICLClassifier is a raw classifier that does not perform any preprocessing. To handle complex datasets, we need to encode non-numeric data and possibly reduce the feature dimension. The example below shows a simple pipeline that yields good results for most datasets.
The example dataset contains the descriptions and sale price of houses. The prediction target is the sale price (binned to transform it into a classification task).
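The loading and binning code is not shown in this excerpt; the neuralk datasets module imported above likely provides the data. As a minimal sketch of what such a step could look like, here is an assumed version using the Ames house-prices dataset from OpenML and a quantile binning of the target (the dataset name and binning parameters are illustrative assumptions, not the loader actually used). A quantile binning step of this kind is what triggers the scikit-learn FutureWarning captured below:

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical loading step: house descriptions and sale prices from OpenML.
houses = fetch_openml("house_prices", as_frame=True)
X = houses.data
# Bin the continuous sale price into five quantile-based classes so the
# task becomes classification; five bins is an arbitrary choice here.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y = binner.fit_transform(houses.target.to_frame()).ravel()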
/home/runner/work/neuralk/neuralk/.venv/lib/python3.11/site-packages/sklearn/preprocessing/_discretization.py:304: FutureWarning: The current default behavior, quantile_method='linear', will be changed to quantile_method='averaged_inverted_cdf' in scikit-learn version 1.9 to naturally support sample weight equivalence properties by default. Pass quantile_method='averaged_inverted_cdf' explicitly to silence this warning.
warnings.warn(
The dataset contains many columns of different types. The basic Neuralk classification service accepts only numeric data. Moreover, it is better to send data that is not too high-dimensional; otherwise, the model is forced to subsample the context, which can degrade performance.
To meet these requirements, we build a simple pipeline that converts the input to a numeric array with skrub.TableVectorizer, then scales the features, imputes missing values, and reduces the dimension with Principal Component Analysis (PCA).
Note
Here we perform dimensionality reduction to control the number of columns. If the dataset also has a very large number of rows, the model subsamples the training data so that it fits in memory and a prediction can be made. This happens by default; you do not need to do anything to activate it. However, if you can select a better training subsample than the default random one, you may want to do the subsampling yourself, as sketched below. For this advanced usage, see this example.
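For instance, a class-balanced random subsample can be drawn with standard scikit-learn tooling before fitting. A minimal sketch (the 1000-row cap is an arbitrary illustration, not a Neuralk limit):

from sklearn.model_selection import train_test_split

# Keep at most 1000 training rows while preserving the class balance.
max_rows = 1000
if len(X) > max_rows:
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=max_rows, stratify=y, random_state=0
    )
else:
    X_sub, y_sub = X, y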
We start by defining the pipeline:
classifier = make_pipeline(
    skrub.TableVectorizer(),  # encode dates, categories and text as numeric columns
    skrub.SquashingScaler(),  # robust scaling that softly squashes outliers
    SimpleImputer(),  # fill remaining missing values (mean by default)
    PCA(n_components=40),  # reduce to 40 components to keep the context small
    NICLClassifier(api_key=API_KEY),
)
And now we can evaluate its performance.
# Silence spurious warning from scikit-learn while preprocessing some categorical columns.
warnings.filterwarnings("ignore", message="Found unknown categories.*during transform")
cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.20136519, 0.1996587 , 0.1996587 , 0.20136519, 0.20136519])
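To summarize the folds with a single number, average the per-fold scores; cross_validate returns a dictionary of NumPy arrays, so this is standard usage:

# Mean accuracy across the cross-validation folds.
print(cv_results["test_score"].mean())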
For comparison, we can run the same experiment after replacing the in-context learner with gradient boosting:
classifier = make_pipeline(
    skrub.TableVectorizer(),
    skrub.SquashingScaler(),
    SimpleImputer(),
    # PCA(n_components=40),  # PCA makes results worse for the gradient boosting
    HistGradientBoostingClassifier(),
)
cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.70989761, 0.7116041 , 0.70819113, 0.69624573, 0.72525597])
Total running time of the script: (1 minute 7.396 seconds)