Using the Neuralk NICLClassifier

The NICLClassifier is the simplest way to use Neuralk’s In-Context Learning model for classification. It offers the usual scikit-learn classifier interface, so it can easily be inserted into any machine-learning pipeline.

Note

For this example to run, the environment variable API_KEY must be set with your Neuralk API key.
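For instance, you can export the key in your shell before launching Python or, for a quick local test, set it from within the session (the value below is a placeholder, not a real key):

import os

# Placeholder value for illustration only; use your actual Neuralk API key.
os.environ.setdefault("API_KEY", "your-neuralk-api-key")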

Simple example on toy data

We start by using the NICLClassifier on simple data that needs no preprocessing.

Generate simple data:

import os
import warnings

import skrub
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline

from neuralk import NICLClassifier, datasets

skrub.set_config(use_table_report=False)

# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(75, 20) y_train.shape=(75,) X_test.shape=(25, 20) y_test.shape=(25,)

Now we apply Neuralk’s classifier.

# Note: nothing actually happens during fit() -- in-context learning models are
# pretrained but require no fitting on our specific dataset.
classifier = NICLClassifier(api_key=API_KEY).fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Accuracy: 0.44
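If the classifier also exposes predict_proba, as scikit-learn-compatible classifiers often do (an assumption here, not shown elsewhere in this example), class probabilities can be inspected in the same way:

# Assumes NICLClassifier implements predict_proba (not verified in this example).
proba = classifier.predict_proba(X_test)
print(proba[:5])  # one row per sample, one column per class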

Working with non-numeric data

The Neuralk NICLClassifier is a raw classifier that does not perform any preprocessing. To handle complex datasets, we need to encode non-numeric data and possibly reduce the feature dimension. The example below shows a simple pipeline that yields good results for most datasets.

The example dataset contains the descriptions and sale price of houses. The prediction target is the sale price (binned to transform it into a classification task).

X, y = datasets.housing()

X.assign(Sale_Price=y).iloc[:, :4].head()
MS_SubClass MS_Zoning Lot_Frontage Lot_Area
0 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 141 31770
1 One_Story_1946_and_Newer_All_Styles Residential_High_Density 80 11622
2 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 81 14267
3 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 93 11160
4 Two_Story_1946_and_Newer Residential_Low_Density 74 13830


As we can see above, the dataset contains many columns of different types. The basic Neuralk classification service only accepts numeric data. Moreover, it is better to send data that is not too high-dimensional; otherwise the model is forced to subsample the context, which can degrade performance.
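To get a sense of the dimensionality involved, we can vectorize the table and inspect the resulting shape (a quick optional check, reusing the skrub.TableVectorizer that also appears in the pipeline below):

# Quick check (optional): how many numeric columns does vectorization produce?
X_num = skrub.TableVectorizer().fit_transform(X)
print(X_num.shape)  # (n_rows, n_encoded_columns); motivates the PCA step below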

To meet those requirements, we build a simple pipeline that transforms the input into a numeric array with the skrub.TableVectorizer, then scales features, imputes missing values, and reduces the dimension with a Principal Component Analysis (PCA).

Note

Here we perform dimensionality reduction to control the number of columns. If the dataset also has a very large number of rows, the model will subsample the training data so that it can fit in memory and make a prediction. This happens by default and you do not need to do anything to activate it. However, if you have a way to select a better training subsample than the default random one, you may want to do the subsampling yourself. For this advanced usage, see this example.
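As a rough illustration, a manual subsample could be drawn with a stratified split before fitting (a minimal sketch; the subset size of 1000 rows is an arbitrary placeholder, and X_train, y_train stand for your full training data):

# Minimal sketch: stratified random subsample (placeholder size of 1000 rows)
# that preserves class proportions, fitted instead of the full training set.
X_sub, _, y_sub, _ = train_test_split(
    X_train, y_train, train_size=1000, stratify=y_train
)
classifier = NICLClassifier(api_key=API_KEY).fit(X_sub, y_sub)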

We start by defining the pipeline:
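classifier = make_pipeline(
    skrub.TableVectorizer(),  # encode non-numeric columns as numeric features
    skrub.SquashingScaler(),  # scale features robustly
    SimpleImputer(),          # fill in missing values
    PCA(40),                  # reduce dimension to keep the context small
    NICLClassifier(api_key=API_KEY),
)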

And now we can evaluate its performance.

# Silence spurious warning from scikit-learn while preprocessing some categorical columns.
warnings.filterwarnings("ignore", message="Found unknown categories.*during transform")

cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.20136519, 0.1996587 , 0.1996587 , 0.20136519, 0.20136519])

For comparison, we can run the same experiment after replacing the in-context learner with gradient boosting:

classifier = make_pipeline(
    skrub.TableVectorizer(),
    skrub.SquashingScaler(),
    SimpleImputer(),
    # PCA(40), # The PCA makes results worse for the gradient boosting
    HistGradientBoostingClassifier(),
)

cv_results = cross_validate(classifier, X, y, error_score="raise", scoring="accuracy")
cv_results["test_score"]
array([0.70989761, 0.7116041 , 0.70819113, 0.69624573, 0.72525597])

Total running time of the script: (1 minute 7.396 seconds)
