Built-in selection of the most informative context

When using in-context learning, the quality of the results is directly tied to the quality of the context built from your training data. Group-Wise Processing is a mechanism for optimizing this context: it ensures the model sees the most representative examples without exceeding GPU memory capacity.

Overview

When training data exceeds GPU memory capacity, the system automatically splits data into groups using a “prompter” strategy. The system supports two modes:

  1. Automatic Mode (default): System automatically calculates optimal number of groups based on available memory
  2. Manual Mode: You specify the prompter strategy and configuration

Automatic Mode (Default)

By default, the system handles everything automatically:

import numpy as np
# OnPremiseClassifier is assumed to be importable from your client package

# Prepare your data
X_train = np.random.randn(1000000, 50).astype(np.float32)
y_train = np.random.randint(0, 2, 1000000)
X_test = np.random.randn(10000, 50).astype(np.float32)

# Send request - system automatically handles grouping
clf = OnPremiseClassifier(host=<...>)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

The system will:

  • Automatically calculate the optimal number of groups based on available memory
  • Split the data into groups and handle memory constraints for you

Manual Mode

You can override the automatic behavior by using the strategy parameter.

clf = OnPremiseClassifier(
    host=<...>,
    strategy="random",
    n_groups=10,
)

The n_groups parameter specifies the target number of groups to split your data into. Each group will be processed separately to fit within GPU memory capacity.

Available Strategies

Feature Strategy (Custom column selection)

Groups samples based on specific features/columns from your data. This strategy allows you to use domain-specific features that are most relevant for creating meaningful groups.

How it works:

The Feature Strategy uses the values from specified columns to determine group membership. Samples with similar values in the selected grouping columns will be assigned to the same group. This is useful when you have domain knowledge about which features are most important for creating coherent groups.

clf = OnPremiseClassifier(
    host=<...>,
    strategy="feature",
    column_names=["age", "income", "score", "rating"],  # All column names in order
    selected_features=["income", "score"],  # Columns to use for grouping
    n_groups=10,  # Optional
)
clf.fit(X_train, y_train)

Parameters:

  • column_names: List of all column names in your data, in order
  • selected_features: List of column names to use for grouping
  • n_groups: (Optional) Target number of groups. If not specified, the system will determine an optimal number.

Random Strategy (Best for guaranteed even splits)

Randomly assigns samples to groups:

clf = OnPremiseClassifier(
    host=<...>,
    strategy="random",
    n_groups=15,
)

Correlation Strategy (Data-driven but may create uneven groups)

Automatically selects features based on their correlation with the target variable and groups samples using quantile-based distribution.

How it works:

The Correlation Strategy uses correlation-based feature selection to identify the most informative features for grouping:

  1. Feature Selection: The system automatically identifies features that are most correlated with the target variable. This data-driven approach selects features that are most predictive of the outcome.
  2. Quantile-based Grouping: Samples are grouped based on quantile distribution of the selected correlated features. This means samples with similar values in the most predictive features are grouped together, creating clusters that respect the natural structure of your data.
  3. Automatic Process: Unlike the Feature Strategy where you manually specify which columns to use, the Correlation Strategy automatically determines which features are most relevant based on statistical correlation with the target.
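To make this concrete, here is a minimal client-side sketch of the idea, using synthetic data and a single selected feature. It only illustrates correlation-based selection followed by quantile binning; the server's actual grouping logic may differ.

# Conceptual sketch only -- not the server-side implementation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)).astype(np.float32)
y = (X[:, 2] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# 1) Rank features by absolute correlation with the target
corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top_feature = int(np.argmax(corrs))

# 2) Assign samples to groups by quantile bins of the most correlated feature
n_groups = 4
cut_points = np.quantile(X[:, top_feature], np.linspace(0, 1, n_groups + 1)[1:-1])
group_ids = np.digitize(X[:, top_feature], cut_points)  # values in 0 .. n_groups-1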

Usage:

clf = OnPremiseClassifier(
    host=<...>,
    strategy="correlation",
    n_groups=10,
)

How it differs from Feature Strategy:

  • Feature Strategy: You manually specify which columns to use for grouping based on domain knowledge. You have full control but need to know which features are relevant.
  • Correlation Strategy: The system automatically selects features based on their correlation with the target variable. No manual feature selection needed, but you have less control over which features drive grouping.
  • Feature Strategy: Groups samples with similar values in your manually selected features.
  • Correlation Strategy: Groups samples based on quantile distribution of automatically selected correlated features, which may create more natural clusters but can result in uneven group sizes.

When to use:

  • Use Correlation Strategy when you want data-driven feature selection and don’t have strong domain knowledge about which features are most relevant.
  • Use Feature Strategy when you have domain expertise and want explicit control over which features drive grouping.

Precomputed Groups Strategy (Use existing group assignments)

Uses pre-existing group IDs from your data. This strategy is ideal when you have already computed optimal group assignments externally and want to use them directly.

How it works:

The system validates that selected_features contains exactly one feature (the group_id column), reads the group IDs from that column of your data, and uses those precomputed assignments directly without computing groups itself.

clf = OnPremiseClassifier(
    host=<...>,
    strategy="precomputed_groups",
    column_names=["feature1", "feature2", "group_id"],  # All column names in order
    selected_features=["group_id"],  # Must be exactly 1 feature: the group_id column
)

Requirements:

  • strategy must be "precomputed_groups"
  • selected_features must contain exactly one feature name, which is the group_id column
  • column_names must include the group_id column name to map numpy array columns correctly

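For illustration, the group_id column can be supplied by appending it to your feature matrix so that it lines up with column_names. The sizes and values below are placeholders; the group assignments would normally come from your own external computation.

# Sketch: append externally computed group IDs as the last column of the training matrix.
import numpy as np

X_features = np.random.randn(100000, 2).astype(np.float32)                 # "feature1", "feature2"
group_id = np.random.randint(0, 8, size=(100000, 1)).astype(np.float32)    # placeholder for externally computed assignments
X_train = np.hstack([X_features, group_id])                                 # column order now matches column_names above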

Automatic Mitigation

If a group exceeds capacity, the system automatically applies stratified sampling to reduce it while preserving class distribution.
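As a mental model, the reduction can be pictured as stratified downsampling of the oversized group. The sketch below uses scikit-learn's train_test_split with the stratify option; it is an approximation of the behavior, not the server's actual implementation.

# Sketch: reduce an oversized group to `capacity` rows while preserving class proportions.
from sklearn.model_selection import train_test_split

def stratified_downsample(X_group, y_group, capacity):
    if len(X_group) <= capacity:
        return X_group, y_group
    X_small, _, y_small, _ = train_test_split(
        X_group, y_group,
        train_size=capacity,
        stratify=y_group,
        random_state=0,
    )
    return X_small, y_small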

Best Practices

When to Use Manual Mode

Use Manual Mode when:

  • You understand your data structure and want specific grouping
  • You have domain knowledge about natural clusters in your data
  • You need reproducible grouping for experiments
  • You want to compare different prompter strategies

Use Automatic Mode when:

  • You’re unsure about optimal configuration
  • You want the system to handle memory constraints automatically
  • You prioritize simplicity over control
  • Your data characteristics may change between requests

Setting n_groups

Too few groups: Groups may exceed capacity, and the system will raise a capacity warning.

Too many groups: Slower training, more overhead, and less data per model.
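If a capacity warning has already told you the approximate per-group row capacity, a rough starting point is the training size divided by that capacity. The capacity value below is illustrative only; use the value reported for your deployment.

# Rough heuristic for an initial n_groups; refine based on results and warnings.
import math

capacity = 368_550  # illustrative; use the capacity reported in your capacity warning
n_groups = max(1, math.ceil(len(X_train) / capacity))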

Monitoring

Response Metadata

Important: Metadata is only returned for manual mode (when prompter_config is provided). Automatic mode does not include metadata in the response.

Manual Mode Response:

{
  "processing_mode": "group_wise",
  "strategy": "feature",
  "mode": "manual",
  "n_groups": 8,
  "max_group_size": 12500,
  "avg_group_size": 12500.0,
  "group_sizes_train": {"0": 12500, "1": 12500, "2": 12500, ...},
  "group_sizes_test": {"0": 1250, "1": 1250, "2": 1250, ...}
}

Manual Mode Response (Warning - Groups Exceed Capacity):

{
  "processing_mode": "group_wise",
  "strategy": "correlation",
  "mode": "manual",
  "n_groups": 5,
  "max_group_size": 450000,
  "avg_group_size": 200000,
  "group_sizes_train": {"0": 50000, "1": 450000, "2": 30000, ...},
  "group_sizes_test": {"0": 500, "1": 4500, "2": 300, ...},
  "capacity_warning": "CAPACITY WARNING: Largest group (450000 rows) exceeds capacity (368550 rows). Your configuration (strategy=correlation, n_groups=5) created uneven groups. Stratified sampling will be applied. Consider: 1) Increase n_groups to 7, 2) Use another strategy or 'random' for even splits."
}

Automatic Mode Response:

{}

Note: Automatic mode returns an empty metadata object. All grouping is handled internally.

Key Fields (Manual Mode Only):

  • processing_mode: Always “group_wise” when data exceeded capacity
  • strategy: Prompter strategy used (e.g., “feature”, “random”)
  • mode: Always “manual” (automatic mode returns no metadata)
  • n_groups: Actual number of groups created
  • max_group_size: Size of largest training group (in rows)
  • avg_group_size: Average training group size (in rows)
  • group_sizes_train: Dict mapping group_id → training sample count
  • group_sizes_test: Dict mapping group_id → test sample count
  • capacity_warning: (Optional) Warning message if groups exceed capacity

Automatic Mode: Returns empty metadata {} - all grouping handled internally
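If you capture the manual-mode metadata as a Python dict, the reported sizes are straightforward to sanity-check. The values below are copied (and truncated) from the example above; the variable name metadata is just an example.

# Sketch: sanity-check reported group sizes from manual-mode metadata.
metadata = {
    "n_groups": 8,
    "group_sizes_train": {"0": 12500, "1": 12500, "2": 12500},  # truncated example
}

sizes = list(metadata["group_sizes_train"].values())
print("groups reported:", metadata["n_groups"])
print("largest group:", max(sizes), "rows; average:", sum(sizes) / len(sizes), "rows")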

Troubleshooting

Groups Exceed Capacity

Problem: Logs show Largest group (X rows) exceeds capacity (Y rows)

Solutions:

  1. Increase n_groups in manual config
  2. Switch to "random" for even splits
  3. Use automatic mode (omit prompter_config)

Unexpected Group Count

Problem: Requested 20 groups but only 3 were created

Cause: Data-driven strategies (correlation) may create fewer groups if data naturally clusters into fewer patterns

Solution: Use strategy="random" for exact group counts

Poor Prediction Quality

Problem: Predictions worse than expected

Possible causes:

  1. Groups too small (a high n_groups leaves less data per model)
  2. Groups ignore natural clusters (try strategy="correlation")
  3. Heavy stratified sampling (groups too large, data being downsampled)

Solution: Experiment with different strategies and monitor group sizes in logs
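One way to run this experiment is to hold out a validation split and compare strategies directly. The sketch below assumes scikit-learn is available, reuses only parameters shown earlier, and keeps the same host placeholder convention as the examples above.

# Sketch: compare prompter strategies on a held-out validation split.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

for strategy in ["random", "correlation"]:
    clf = OnPremiseClassifier(host=<...>, strategy=strategy, n_groups=10)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_val, clf.predict(X_val))
    print(strategy, "validation accuracy:", acc)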