Using the Neuralk NICLClassifier

The NICLClassifier is the simplest way to use Neuralk’s In-Context Learning model for classification. It offers the usual scikit-learn classifier interface, so it can easily be inserted into any machine-learning pipeline.

Note

For this example to run, the environment variable API_KEY must be set with your Neuralk API key.
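For instance, you can export the key in your shell before launching Python or, for a quick local test, set it from within the session (the value below is a placeholder, not a real key):

import os

# Placeholder value for illustration only; use your actual Neuralk API key.
os.environ.setdefault("API_KEY", "your-neuralk-api-key")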

Simple example on toy data

We start by using the NICLClassifier on simple data that needs no preprocessing.

Generate simple data:

import os
import warnings

import skrub
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline

from neuralk import NICLClassifier, datasets

skrub.set_config(use_table_report=False)

# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(75, 20) y_train.shape=(75,) X_test.shape=(25, 20) y_test.shape=(25,)

Now we apply Neuralk’s classifier.

# Note: nothing actually happens during fit() -- in-context learning models are
# pretrained but require no fitting on our specific dataset.
classifier = NICLClassifier(api_key=API_KEY).fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Accuracy: 0.44
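If the classifier also exposes predict_proba, as scikit-learn-compatible classifiers often do (an assumption here, not shown elsewhere in this example), class probabilities can be inspected in the same way:

# Assumes NICLClassifier implements predict_proba (not verified in this example).
proba = classifier.predict_proba(X_test)
print(proba[:5])  # one row per sample, one column per class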

Working with non-numeric data

The Neuralk NICLClassifier is a raw classifier that does not perform any preprocessing. To handle complex datasets, we need to encode non-numeric data and possibly reduce the feature dimension. The example below shows a simple pipeline that yields good results for most datasets.

The example dataset contains the descriptions and sale price of houses. The prediction target is the sale price (binned to transform it into a classification task).

X, y = datasets.housing()

X.assign(Sale_Price=y).iloc[:, :4].head()
MS_SubClass MS_Zoning Lot_Frontage Lot_Area
0 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 141 31770
1 One_Story_1946_and_Newer_All_Styles Residential_High_Density 80 11622
2 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 81 14267
3 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 93 11160
4 Two_Story_1946_and_Newer Residential_Low_Density 74 13830


As we can see above, the dataset contains many columns of different types. The basic Neuralk classification service only accepts numeric data. Moreover, it is better to send data that is not too high-dimensional; otherwise the model is forced to subsample the context, which can degrade performance.
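To get a sense of the dimensionality involved, we can vectorize the table and inspect the resulting shape (a quick optional check, reusing the skrub.TableVectorizer that also appears in the pipeline below):

# Quick check (optional): how many numeric columns does vectorization produce?
X_num = skrub.TableVectorizer().fit_transform(X)
print(X_num.shape)  # (n_rows, n_encoded_columns); motivates the PCA step below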

To meet those requirements, we build a simple pipeline that transforms the input into a numeric array with the skrub.TableVectorizer, then scales features, imputes missing values, and reduces the dimension with a Principal Component Analysis (PCA).

Note

Here we perform dimensionality reduction to control the number of columns. If the dataset also has a very large number of rows, the model will subsample the training data so that it can fit in memory and make a prediction. This happens by default and you do not need to do anything to activate it. However, if you have a way to select a better training subsample than the default random one, you may want to do the subsampling yourself. For this advanced usage, see this example.
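As a rough illustration, a manual subsample could be drawn with a stratified split before fitting (a minimal sketch; the subset size of 1000 rows is an arbitrary placeholder, and X_train, y_train stand for your full training data):

# Minimal sketch: stratified random subsample (placeholder size of 1000 rows)
# that preserves class proportions, fitted instead of the full training set.
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=1000, stratify=y_train
)
classifier = NICLClassifier(api_key=API_KEY).fit(X_sub, y_sub)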

We start by defining the pipeline:
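classifier = make_pipeline(
    skrub.TableVectorizer(),  # encode non-numeric columns as numeric features
    skrub.SquashingScaler(),  # scale features robustly
    SimpleImputer(),          # fill in missing values
    PCA(40),                  # reduce dimension to keep the context small
    NICLClassifier(api_key=API_KEY),
)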

And now we can evaluate its performance.

# Silence spurious warning from scikit-learn while preprocessing some categorical columns.
warnings.filterwarnings("ignore", message="Found unknown categories.*during transform")

cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.20136519, 0.1996587 , 0.1996587 , 0.20136519, 0.20136519])

For comparison, we can run the same experiment after replacing the in-context learner with gradient boosting:

classifier = make_pipeline(
    skrub.TableVectorizer(),
    skrub.SquashingScaler(),
    SimpleImputer(),
    # PCA(40), # The PCA makes results worse for the gradient boosting
    HistGradientBoostingClassifier(),
)

cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.70989761, 0.7116041 , 0.70819113, 0.69624573, 0.72525597])

Total running time of the script: (1 minute 7.396 seconds)
