Controlling the sampling of the Classifier’s context

This script demonstrates how to use sampling context methods with NICL for classification.

Note

This illustrates an advanced use case. For a simpler classification example that does not demonstrate manual control of the sampled context, see this example.

Sampling Context Methods

When working with very large datasets, inference can become computationally expensive and time-consuming. To manage this, it is often advisable to apply row sampling: selecting a representative subset of the data to provide as context while preserving the model's ability to generalise. In our example, we illustrate this using random sampling. As with other preprocessing steps, it is recommended to experiment with different sampling strategies and proportions to determine what best fits the data characteristics and available computational resources.

Note

For this example to run, the environment variable API_KEY must be set with your Neuralk API key.

We start by generating an example dataset.

import os

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from neuralk import NICLClassifier

# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]

X, y = make_classification(
    n_samples=1_000_000, n_features=10, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(990000, 10) y_train.shape=(990000,) X_test.shape=(10000, 10) y_test.shape=(10000,)

As the dataset is quite large, it is not feasible to feed the whole training set as context to the Neuralk model when making a prediction. If we send the whole dataset anyway, a portion of it will be sampled automatically.

Often, we have information about which rows are most useful to keep in the context. For example, there may be natural groupings in our data (by date, geographical location, or other criteria), and we may want to keep examples that are similar to the ones for which we need a prediction, or to ensure some diversity across those groups.
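As an illustration of the group-aware case, a stratified draw that preserves each group's share of the context can be sketched with scikit-learn's train_test_split. The data and group labels below are synthetic placeholders, not part of this example's dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 100,000 rows and a hypothetical group label per row
# (in practice this could be a date bucket, a region, or the class label itself).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
groups = rng.integers(0, 4, size=100_000)

# Keep 10% of the rows while preserving each group's share of the data.
X_context, _, groups_context, _ = train_test_split(
    X, groups, train_size=0.1, stratify=groups, random_state=0
)
print(f"Context size: {X_context.shape[0]} rows")
```

The stratify argument guarantees that each group appears in the sampled context in (almost exactly) the same proportion as in the full dataset.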

Here, to keep the example simple, we sample the context at random.
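The sampling code itself is not shown in this rendering; a minimal sketch using NumPy's random number generator follows. The variable names sampled_X_train and sampled_y_train match those used in the fit call below, and the sample size of 10,000 is an assumption chosen to reproduce the printed ratio (10,000 / 990,000 ≈ 1.01%). The first two assignments are stand-ins for the training split created above, included only so the snippet runs on its own; in the tutorial flow, X_train and y_train already exist:

```python
import numpy as np

# Stand-ins for the training split created above (990,000 rows, 10 features).
X_train = np.zeros((990_000, 10))
y_train = np.zeros(990_000, dtype=int)

# Draw 10,000 rows without replacement to use as context.
rng = np.random.default_rng(0)
context_size = 10_000
idx = rng.choice(X_train.shape[0], size=context_size, replace=False)
sampled_X_train, sampled_y_train = X_train[idx], y_train[idx]
print(f"Sampling ratio: {context_size / X_train.shape[0]:.2%}")
```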

Sampling ratio: 1.01%

Now we fit the classifier.

classifier = NICLClassifier(api_key=API_KEY)

# Fit classifier (nothing happens here as we are using a pre-trained model).
classifier.fit(sampled_X_train, sampled_y_train)
NICLClassifier(api_key='nk_live_...')


And we can make predictions for the test set.

predictions = classifier.predict(X_test)

Finally, we measure the accuracy.

acc = accuracy_score(y_test, predictions)
print(f"Accuracy: {acc:.3f}")
Accuracy: 0.338

Total running time of the script: (0 minutes 3.288 seconds)
