Controlling the sampling of the Classifier’s context
This script demonstrates how to control the sampling of the context used by NICL for classification.
Note
This illustrates an advanced use case. For a simpler classification example that does not demonstrate manual control of the sampled context, see this example.
Sampling Context Methods
When working with very large datasets, inference can become computationally expensive and time-consuming. To manage this, it is often advisable to apply row sampling: selecting a representative subset of the data to provide as context while preserving the model's generalisation capability. In our example, we illustrate this using random sampling. As with other preprocessing steps, it is recommended to experiment with different sampling strategies and proportions to determine what best fits the data characteristics and the available computational resources (a stratified alternative is sketched after the random sampling below).
Note
For this example to run, the environment variable API_KEY must be set with your Neuralk API key.
We start by generating an example dataset.
import os
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from neuralk import NICLClassifier
# API key for Neuralk cloud service
API_KEY = os.environ["API_KEY"]
X, y = make_classification(
n_samples=1_000_000, n_features=10, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(990000, 10) y_train.shape=(990000,) X_test.shape=(10000, 10) y_test.shape=(10000,)
As the dataset is quite large, it is not feasible to feed the whole training set as context to the Neuralk model when making a prediction. Therefore, if we send the whole dataset, only a portion of it will be sampled and used as context.
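In code, that simpler path would be a sketch like the following (clf_full is just an illustrative name; the service-side sampling is then outside our control):
# Simpler path (sketch only): send the full training set and let the
# Neuralk service sample the context itself, as described above.
clf_full = NICLClassifier(api_key=API_KEY)
clf_full.fit(X_train, y_train)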
Often, we have information on which rows are more interesting to keep in the context. For example, there may be sensible groupings of our data (by date, geographical location, or other criteria), and we may want to keep examples that are similar to the ones for which we need a prediction, or to ensure some diversity across those groups, as in the sketch below.
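For illustration, diversity across such groups could be enforced by sampling a fixed number of rows per group. This is a sketch only: groups is a hypothetical array of group labels, and per_group_context is not part of the neuralk API.
def per_group_context(X, y, groups, n_per_group, rng):
    # Keep up to n_per_group rows from every group so that each group
    # stays represented in the sampled context ("groups" is hypothetical).
    keep = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        size = min(n_per_group, members.size)
        keep.append(rng.choice(members, size=size, replace=False))
    idx = np.concatenate(keep)
    return X[idx], y[idx]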
Here, to keep the example simple, we just sample the context randomly.
rng = np.random.default_rng()
sample_indices = rng.choice(np.arange(X_train.shape[0]), size=10_000, replace=False)
sampled_X_train, sampled_y_train = X_train[sample_indices], y_train[sample_indices]
print(f"Sampling ratio: {sampled_X_train.shape[0] / X_train.shape[0]:.2%}")
Sampling ratio: 1.01%
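As suggested earlier, strategies other than purely random sampling are worth trying. For instance, a stratified variant could preserve the class proportions in the context. The sketch below reuses train_test_split (imported above); stratified_context is a hypothetical helper, not part of the neuralk API.
def stratified_context(X, y, n_context=10_000, seed=0):
    # train_test_split with stratify=y keeps the class proportions of y in
    # both splits; we keep only the first split as the sampled context.
    X_ctx, _, y_ctx, _ = train_test_split(
        X, y, train_size=n_context, stratify=y, random_state=seed
    )
    return X_ctx, y_ctx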
Now we fit the classifier.
classifier = NICLClassifier(api_key=API_KEY)
# Fit the classifier (no training happens here, as the model is pre-trained).
classifier.fit(sampled_X_train, sampled_y_train)
And we can make predictions for the test set.
predictions = classifier.predict(X_test)
Finally, we measure the accuracy.
acc = accuracy_score(y_test, predictions)
print(f"Accuracy: {acc:.3f}")
Accuracy: 0.338
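To choose the sampling proportion empirically, one could compare accuracies across several context sizes. A minimal sketch, reusing the objects defined above (the sizes are illustrative, and each iteration calls the Neuralk service):
for size in (1_000, 5_000, 10_000, 20_000):
    # Draw a fresh random context of the given size and evaluate it.
    idx = rng.choice(X_train.shape[0], size=size, replace=False)
    clf = NICLClassifier(api_key=API_KEY)
    clf.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"context size {size:>6}: accuracy {acc:.3f}")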
Total running time of the script: (0 minutes 3.288 seconds)