Built-in selection of the most informative context¶
When using in-context learning, the quality of the results is directly tied to the quality of the examples provided as context. Group-Wise Processing is a mechanism for optimizing this context: it ensures the model sees the most representative examples without exceeding GPU capacity.
These prompting strategies are available for both cloud and on-premise deployments using the NICLClassifier.
Overview¶
When training data exceeds GPU memory capacity, the system automatically splits data into groups using a “prompter” strategy. The system supports two modes:
Automatic Mode (default): System automatically calculates optimal number of groups based on available memory
Manual Mode: You specify the prompter strategy and configuration
Automatic Mode (Default)¶
By default, the system handles everything automatically:
import numpy as np
from neuralk import NICLClassifier
# Prepare your data
X_train = np.random.randn(1000000, 50).astype(np.float32)
y_train = np.random.randint(0, 2, 1000000)
X_test = np.random.randn(10000, 50).astype(np.float32)
# Cloud API - system automatically handles grouping
clf = NICLClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# Or with on-premise server
clf = NICLClassifier(host="http://localhost:8000")
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The system will handle memory constraints automatically, splitting the data into groups only when it exceeds capacity.
Manual Mode¶
You can override the automatic behavior by using the strategy parameter.
# Cloud API
clf = NICLClassifier(
    strategy="random",
    n_groups=10,
)

# Or on-premise
clf = NICLClassifier(
    host="http://localhost:8000",
    strategy="random",
    n_groups=10,
)
The n_groups parameter specifies the target number of groups to split your data into. Each group will be processed separately to fit within GPU memory capacity.
Available Strategies¶
Feature Strategy (Custom column selection)¶
Groups samples based on specific features/columns from your data. This strategy allows you to use domain-specific features that are most relevant for creating meaningful groups.
How it works:
The Feature Strategy uses the values from specified columns to determine group membership. Samples with similar values in the selected grouping columns will be assigned to the same group. This is useful when you have domain knowledge about which features are most important for creating coherent groups.
# Cloud API
clf = NICLClassifier(
    strategy="feature",
    column_names=["age", "income", "score", "rating"],  # All column names in order
    selected_features=["income", "score"],  # Columns to use for grouping
    n_groups=10,  # Optional
)
clf.fit(X_train, y_train)

# Or on-premise
clf = NICLClassifier(
    host="http://localhost:8000",
    strategy="feature",
    column_names=["age", "income", "score", "rating"],
    selected_features=["income", "score"],
    n_groups=10,
)
clf.fit(X_train, y_train)
Parameters:
column_names: List of all column names, in order, used to map numpy array columns
selected_features: List of column names to use for grouping
n_groups: (Optional) Target number of groups. If not specified, the system determines an optimal number.
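For intuition, the grouping idea can be sketched outside the library. The snippet below is a plain NumPy approximation, not NICLClassifier's actual implementation (`feature_groups` is a hypothetical helper): it buckets rows by quantiles of the selected columns, so rows with similar values in those columns land in the same group.

```python
import numpy as np

def feature_groups(X, column_names, selected_features, n_groups):
    """Illustrative sketch of feature-based grouping: bucket rows into
    n_groups by quantiles of the selected columns."""
    idx = [column_names.index(c) for c in selected_features]
    # Combine the selected columns into a single score per row.
    score = X[:, idx].sum(axis=1)
    # Quantile edges split the score range into n_groups buckets.
    edges = np.quantile(score, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.searchsorted(edges, score)  # group id per row, 0..n_groups-1

X = np.random.randn(1000, 4).astype(np.float32)
groups = feature_groups(X, ["age", "income", "score", "rating"],
                        ["income", "score"], n_groups=10)
```

Quantile bucketing keeps group sizes roughly balanced while preserving the "similar values go together" property described above.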
Random Strategy (Best for guaranteed even splits)¶
Randomly assigns samples to groups:
# Cloud API
clf = NICLClassifier(
    strategy="random",
    n_groups=15,
)

# Or on-premise
clf = NICLClassifier(
    host="http://localhost:8000",
    strategy="random",
    n_groups=15,
)
Correlation Strategy (Data-driven but may create uneven groups)¶
Automatically selects features based on their correlation with the target variable and groups samples using quantile-based distribution.
How it works:
The Correlation Strategy uses correlation-based feature selection to identify the most informative features for grouping:
Feature Selection: The system automatically identifies features that are most correlated with the target variable. This data-driven approach selects features that are most predictive of the outcome.
Quantile-based Grouping: Samples are grouped based on quantile distribution of the selected correlated features. This means samples with similar values in the most predictive features are grouped together, creating clusters that respect the natural structure of your data.
Automatic Process: Unlike the Feature Strategy where you manually specify which columns to use, the Correlation Strategy automatically determines which features are most relevant based on statistical correlation with the target.
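The two steps above can be sketched in plain NumPy. This is a conceptual illustration, not NICLClassifier's internals (`correlation_groups` is a hypothetical helper): it ranks columns by absolute Pearson correlation with the target, then assigns quantile-based group ids from the top columns.

```python
import numpy as np

def correlation_groups(X, y, n_groups, n_features=2):
    """Sketch: select the columns most correlated with y, then assign
    quantile-based group ids from their combined value."""
    # Step 1: absolute Pearson correlation of each column with the target.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top = np.argsort(corr)[-n_features:]  # most informative columns
    # Step 2: quantile-based grouping on the selected columns.
    score = X[:, top].sum(axis=1)
    edges = np.quantile(score, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.searchsorted(edges, score)

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = (X[:, 3] + 0.1 * rng.normal(size=500) > 0).astype(int)  # column 3 drives y
groups = correlation_groups(X, y, n_groups=10)
```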
Usage:
# Cloud API
clf = NICLClassifier(
    strategy="correlation",
    n_groups=10,
)

# Or on-premise
clf = NICLClassifier(
    host="http://localhost:8000",
    strategy="correlation",
    n_groups=10,
)
How it differs from Feature Strategy:
Feature Strategy: You manually specify which columns to use for grouping based on domain knowledge. You have full control but need to know which features are relevant.
Correlation Strategy: The system automatically selects features based on their correlation with the target variable. No manual feature selection needed, but you have less control over which features drive grouping.
Feature Strategy: Groups samples with similar values in your manually selected features.
Correlation Strategy: Groups samples based on quantile distribution of automatically selected correlated features, which may create more natural clusters but can result in uneven group sizes.
When to use:
Use Correlation Strategy when you want data-driven feature selection and don’t have strong domain knowledge about which features are most relevant.
Use Feature Strategy when you have domain expertise and want explicit control over which features drive grouping.
Precomputed Groups Strategy (Use existing group assignments)¶
Uses pre-existing group IDs from your data. This strategy is ideal when you have already computed optimal group assignments externally and want to use them directly.
How it works:
The system reads group IDs from a specified column in your data and uses those precomputed assignments directly, without computing groups itself.
# Cloud API
clf = NICLClassifier(
    strategy="precomputed_groups",
    column_names=["feature1", "feature2", "group_id"],  # All column names in order
    selected_features=["group_id"],  # Must be exactly 1 feature: the group_id column
)

# Or on-premise
clf = NICLClassifier(
    host="http://localhost:8000",
    strategy="precomputed_groups",
    column_names=["feature1", "feature2", "group_id"],
    selected_features=["group_id"],
)
Requirements:
strategy must be "precomputed_groups"
selected_features must contain exactly one feature name: the group_id column
column_names must include the group_id column name so that numpy array columns map correctly
The system validates that selected_features contains exactly one feature (the group_id column), builds a training dataframe to access that column, and then uses the precomputed assignments directly.
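The validation described above can be sketched in plain Python. This is an illustration of the rules, not the library's actual code (`read_precomputed_groups` is a hypothetical helper):

```python
import numpy as np

def read_precomputed_groups(X, column_names, selected_features):
    """Sketch of the validation rules: exactly one selected feature,
    which must name a column present in column_names."""
    if len(selected_features) != 1:
        raise ValueError("selected_features must contain exactly one "
                         "feature: the group_id column")
    col = selected_features[0]
    if col not in column_names:
        raise ValueError(f"{col!r} not found in column_names")
    # Use the precomputed assignments directly, without computing groups.
    return X[:, column_names.index(col)].astype(int)

X = np.array([[0.1, 0.2, 0], [0.3, 0.4, 1], [0.5, 0.6, 0]])
ids = read_precomputed_groups(X, ["feature1", "feature2", "group_id"],
                              ["group_id"])
```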
Automatic Mitigation¶
If a group exceeds capacity, the system automatically applies stratified sampling to reduce it while preserving class distribution.
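Stratified downsampling can be sketched as follows. This is illustrative only (`stratified_downsample` is a hypothetical helper, and the real system may sample differently): each class is subsampled by the same fraction, so the class distribution of the reduced group roughly matches the original.

```python
import numpy as np

def stratified_downsample(indices, labels, capacity, seed=0):
    """Sketch: shrink a group to about `capacity` rows while keeping
    the per-class proportions roughly intact."""
    rng = np.random.default_rng(seed)
    frac = capacity / len(indices)
    keep = []
    for cls in np.unique(labels):
        cls_idx = indices[labels == cls]
        n_keep = max(1, round(len(cls_idx) * frac))  # same fraction per class
        keep.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.concatenate(keep)

labels = np.array([0] * 900 + [1] * 100)   # 90/10 class split
indices = np.arange(1000)
sample = stratified_downsample(indices, labels, capacity=200)
```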
Best Practices¶
When to Use Manual Mode¶
✅ Use Manual Mode when:
You understand your data structure and want specific grouping
You have domain knowledge about natural clusters in your data
You need reproducible grouping for experiments
You want to compare different prompter strategies
❌ Use Automatic Mode when:
You’re unsure about optimal configuration
You want the system to handle memory constraints automatically
You prioritize simplicity over control
Your data characteristics may change between requests
Setting n_groups¶
Too few groups: May exceed capacity, system will raise a warning.
Too many groups: Slower training, more overhead, less data per model
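As a rough sizing rule, the smallest n_groups whose even splits fit within a per-group capacity can be computed directly. The capacity figure below is illustrative, not a documented limit:

```python
import math

n_samples = 1_000_000   # rows in the training set
capacity = 120_000      # assumed per-group GPU capacity (illustrative)

# Smallest n_groups whose even splits fit within capacity.
n_groups = math.ceil(n_samples / capacity)
per_group = math.ceil(n_samples / n_groups)  # rows per group after an even split
```

Picking the smallest workable n_groups avoids the overhead of many tiny groups while staying under capacity.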
Monitoring¶
Response Metadata¶
Important: Metadata is only returned for manual mode (when prompter_config is provided). Automatic mode does not include metadata in the response.
Manual Mode Response:¶
{
  "processing_mode": "group_wise",
  "strategy": "feature",
  "mode": "manual",
  "n_groups": 8,
  "max_group_size": 12500,
  "avg_group_size": 12500.0,
  "group_sizes_train": {"0": 12500, "1": 12500, "2": 12500, ...},
  "group_sizes_test": {"0": 1250, "1": 1250, "2": 1250, ...}
}
Manual Mode Response (Warning - Groups Exceed Capacity):¶
{
  "processing_mode": "group_wise",
  "strategy": "correlation",
  "mode": "manual",
  "n_groups": 5,
  "max_group_size": 450000,
  "avg_group_size": 200000,
  "group_sizes_train": {"0": 50000, "1": 450000, "2": 30000, ...},
  "group_sizes_test": {"0": 500, "1": 4500, "2": 300, ...},
  "capacity_warning": "CAPACITY WARNING: ..."
}
Automatic Mode Response:¶
{}
Note: Automatic mode returns an empty metadata object. All grouping is handled internally.
Key Fields (Manual Mode Only):
processing_mode: Always "group_wise" when data exceeded capacity
strategy: Prompter strategy used (e.g., "feature", "random")
mode: Always "manual" (automatic mode returns no metadata)
n_groups: Actual number of groups created
max_group_size: Size of largest training group (in rows)
avg_group_size: Average training group size (in rows)
group_sizes_train: Dict mapping group_id → training sample count
group_sizes_test: Dict mapping group_id → test sample count
capacity_warning: (Optional) Warning message if groups exceed capacity
Automatic Mode: Returns empty metadata {}; all grouping is handled internally.
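When consuming the response programmatically, a small helper can distinguish the cases above. This is a sketch assuming the JSON has been parsed into a Python dict (`check_group_metadata` is a hypothetical helper, not part of the library):

```python
def check_group_metadata(meta):
    """Sketch: summarize the grouping metadata shown above."""
    if not meta:  # automatic mode returns {}
        return "automatic mode: no grouping metadata"
    if "capacity_warning" in meta:
        return f"warning for strategy {meta['strategy']}: groups too large"
    return f"{meta['n_groups']} groups, largest {meta['max_group_size']} rows"

msg = check_group_metadata({"processing_mode": "group_wise",
                            "strategy": "feature", "mode": "manual",
                            "n_groups": 8, "max_group_size": 12500})
```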
Troubleshooting¶
Groups Exceed Capacity¶
Problem: Logs show Largest group (X rows) exceeds capacity (Y rows)
Solutions:
Increase n_groups in the manual config
Switch to strategy="random" for even splits
Use automatic mode (omit the prompter config)
Unexpected Group Count¶
Problem: Requested 20 groups but only 3 were created
Cause: Data-driven strategies (correlation) may create fewer groups if data naturally clusters into fewer patterns
Solution: Use strategy="random" for exact group counts
Poor Prediction Quality¶
Problem: Predictions worse than expected
Possible causes:
Groups too small (a higher n_groups leaves less data per model)
Groups ignore natural clusters (try strategy="correlation")
Heavy stratified sampling (groups too large, so data is being downsampled)
Solution: Experiment with different strategies and monitor group sizes in logs