Controlling the sampling of the Classifier’s context
This script demonstrates how to control the sampling of the context used by NICL for classification.
Note
This illustrates an advanced use case. For a simpler classification example that does not demonstrate manual control of the sampled context, see this example.
Sampling Context Methods
When working with very large datasets, inference can become computationally expensive and time-consuming. To manage this, it is often advisable to apply row sampling: selecting a representative subset of the data to provide as context while preserving the model's generalisation capability. In our example, we illustrate this using random sampling. As with other preprocessing steps, it is recommended to experiment with different sampling strategies and proportions to determine what best fits the data characteristics and the available computational resources (a stratified alternative is sketched after the random sampling below).
Note
For this example to run, the environment variable API_KEY must be set with your Neuralk API key.
We start by generating an example dataset.
import os
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from neuralk import NICLClassifier
# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]
X, y = make_classification(
n_samples=1_000_000, n_features=10, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(990000, 10) y_train.shape=(990000,) X_test.shape=(10000, 10) y_test.shape=(10000,)
As the dataset is quite large, it is not feasible to feed the whole training set as context to the Neuralk model when making a prediction. Therefore, if we send the whole dataset, only a portion of it will be sampled and used as context.
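In code, that simpler path would be a sketch like the following (clf_full is just an illustrative name; the service-side sampling is then outside our control):
# Simpler path (sketch only): send the full training set and let the
# Neuralk service sample the context itself, as described above.
clf_full = NICLClassifier(api_key=API_KEY)
clf_full.fit(X_train, y_train)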
Often, we have information on which rows are more interesting to keep in the context. For example, there may be sensible groupings of our data (by date, geographical location, or other criteria), and we may want to keep examples that are similar to the ones for which we need a prediction, or to ensure some diversity across those groups, as in the sketch below.
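For illustration, diversity across such groups could be enforced by sampling a fixed number of rows per group. This is a sketch only: groups is a hypothetical array of group labels, and per_group_context is not part of the neuralk API.
def per_group_context(X, y, groups, n_per_group, rng):
    # Keep up to n_per_group rows from every group so that each group
    # stays represented in the sampled context ("groups" is hypothetical).
    keep = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        size = min(n_per_group, members.size)
        keep.append(rng.choice(members, size=size, replace=False))
    idx = np.concatenate(keep)
    return X[idx], y[idx]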
Here, to keep the example simple, we just sample the context randomly.
rng = np.random.default_rng()
sample_indices = rng.choice(np.arange(X_train.shape[0]), size=10_000, replace=False)
sampled_X_train, sampled_y_train = X_train[sample_indices], y_train[sample_indices]
print(f"Sampling ratio: {sampled_X_train.shape[0] / X_train.shape[0]:.2%}")
Sampling ratio: 1.01%
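As suggested earlier, strategies other than purely random sampling are worth trying. For instance, a stratified variant could preserve the class proportions in the context. The sketch below reuses train_test_split (imported above); stratified_context is a hypothetical helper, not part of the neuralk API.
def stratified_context(X, y, n_context=10_000, seed=0):
    # train_test_split with stratify=y keeps the class proportions of y in
    # both splits; we keep only the first split as the sampled context.
    X_ctx, _, y_ctx, _ = train_test_split(
        X, y, train_size=n_context, stratify=y, random_state=seed
    )
    return X_ctx, y_ctx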
Now we fit the classifier.
classifier = NICLClassifier(api_key=API_KEY)
# Fit the classifier (no training happens here, as the model is pre-trained).
classifier.fit(sampled_X_train, sampled_y_train)
And we can make predictions for the test set.
predictions = classifier.predict(X_test)
Finally, we measure the accuracy.
acc = accuracy_score(y_test, predictions)
print(f"Accuracy: {acc:.3f}")
Accuracy: 0.338
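To choose the sampling proportion empirically, one could compare accuracies across several context sizes. A minimal sketch, reusing the objects defined above (the sizes are illustrative, and each iteration calls the Neuralk service):
for size in (1_000, 5_000, 10_000, 20_000):
    # Draw a fresh random context of the given size and evaluate it.
    idx = rng.choice(X_train.shape[0], size=size, replace=False)
    clf = NICLClassifier(api_key=API_KEY)
    clf.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"context size {size:>6}: accuracy {acc:.3f}")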
Total running time of the script: (0 minutes 3.288 seconds)