alien.selection

This module implements a number of batch selection strategies, and provides a framework for implementing others. Each strategy is implemented as a subclass of SampleSelector. The docs for SampleSelector give a good introduction to the interface.

A selector object needs at least two things:

A Model

A pool of unlabeled samples to select from. This can be a Dataset, or any sufficiently array-like object. Alternatively, this can be a SampleGenerator.

Additionally, some selectors also need

The labeled samples the model was trained on.

Given these, and various hyperparameters and auxiliary data, you can call on the selector to select a batch of candidates for labeling, chosen from the unlabeled samples.

Example

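from alien.selection import CovarianceSelector

# deep_model (a trained model) and unlabeled_pool (the candidate samples)
# are assumed to be defined already.
selector = CovarianceSelector(
    model=deep_model,
    samples=unlabeled_pool,
    batch_size=10,
)

# Select a batch of 10 candidates for labeling
batch = selector.select()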

ALIEN has the following selection strategies implemented.

Active learning strategies - These are aimed at improving model performance as quickly as possible.
Optimization strategies - These are aimed at finding the highest scorers as quickly as possible.
- ThompsonSelector
Baselines
- RandomSelector
- TimestampSelector

You can write your own strategies! ALIEN’s Selector architecture is designed to make it as simple as possible for you to implement other selection strategies. Documentation coming soon!

SampleSelector

class alien.selection.SampleSelector(model=None, batch_size=1, samples=None, num_samples=None, labelled_samples=None, X_key='X', prior=None, prior_scale=1, return_indices=False, verbose=1)[source]

Abstract base class for selection strategies

Parameters:

model – An instance of models.RegressionModel. Will be used to determine prediction covariances for proposed batches.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features)—note that alien.data.Dataset serves this purpose—or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
labelled_samples – Some selection strategies need to know the previously-labelled samples.
batch_size – Size of the batch to select.
prior –
Specifies a “prior probability” for each sample. Each selector may use this prior as it sees fit, but generally, samples with low prior are de-emphasized in the selection process. This is a convenient way of introducing factors other than uncertainty into the ranking. prior may be an array of numbers (of size num_samples), or a function (applied to the samples), or one of the following:

'prediction': calculates a prior from the quantile of the predicted performance (not the uncertainties). prior_scale sets the power this quantile is raised to.

Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter –
Reduces the incoming sample pool before applying batch selection. If a is the single-sample acquisition function, then prefilter = True selects a subset of the provided samples that maximizes a.

If 0 < prefilter < 1, takes this fraction of the sample pool. If prefilter >= 1, takes this many samples.

Some of the selectors are limited in how many samples they can consider for the final, batch-selection problem. For example, CovarianceSelector computes and stores the size-N^2 covariance matrix for the whole sample pool; therefore, because of memory constraints it should work with at most around 10,000 samples.

In such cases, there is often a cheaper prefiltering operation available. E.g., CovarianceSelector prefilters using only the variance, rather than the full covariance.

A practical strategy in such cases is to take a sample pool about 5 times as big as the selector can handle for the final computation, then narrow down to only the top 20% individual scores before batch selection. Narrowing to much less than 20% risks reducing diversity too much and changing what would ultimately be the selected batch.
random_seed – A random seed for deterministic behaviour.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
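
The prefilter sizing rule described above can be sketched in plain Python. This only illustrates the documented semantics and is not ALIEN's implementation: the helper name prefilter_indices, the rounding of fractional sizes, and the toy scores are all assumptions.

```python
def prefilter_indices(scores, prefilter):
    """Return indices of the samples kept by a prefilter, best score first.

    Hypothetical helper that mirrors the documented rule:
      0 < prefilter < 1  -> keep that fraction of the pool
      prefilter >= 1     -> keep that many samples
    """
    n = len(scores)
    if prefilter <= 0:
        raise ValueError("prefilter must be positive")
    if prefilter < 1:
        # Assumption: round down, but always keep at least one sample
        k = max(1, int(n * prefilter))
    else:
        k = min(n, int(prefilter))
    # Rank by single-sample score, highest first
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy single-sample scores (e.g. std_dev * prior) for a pool of 10 samples
scores = [0.2, 0.9, 0.1, 0.7, 0.5, 0.3, 0.8, 0.4, 0.6, 0.05]
top_fraction = prefilter_indices(scores, 0.2)  # keep the top 20% of the pool
top_count = prefilter_indices(scores, 3)       # keep the top 3 samples
```

With a pool five times larger than the selector can handle, prefilter = 0.2 in this sketch brings it back down to the size the final batch computation can manage.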

select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)[source]

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).
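
As a rough illustration of the num_samples rule described in the parameter list, the sketch below draws from a plain Python iterator standing in for a SampleGenerator. The helper draw_pool is hypothetical and is not part of ALIEN; the real SampleGenerator interface may differ.

```python
import math
from itertools import islice


def draw_pool(generator, num_samples=math.inf):
    """Draw up to num_samples items, or fewer if the source runs out.

    Hypothetical stand-in for the documented behaviour: the default of
    infinity means "take everything the generator can provide".
    """
    if num_samples == math.inf:
        return list(generator)
    return list(islice(generator, int(num_samples)))


# A finite source that can only provide 5 samples
pool_capped_by_source = draw_pool(iter(range(5)), num_samples=8)
# The same source, capped by num_samples instead
pool_capped_by_request = draw_pool(iter(range(5)), num_samples=3)
# Default: draw as many samples as are available
pool_all = draw_pool(iter(range(5)))
```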

CovarianceSelector

class alien.selection.CovarianceSelector(model=None, samples=None, num_samples=inf, batch_size=1, normalize=False, normalize_epsilon=0.001, regularization=0.05, similarity=0, similarity_scale='auto', prior=1, prior_scale=1, prefilter=None, random_seed=None, fast_opt=True, n_rounds=10, **kwargs)[source]

Bases: UncertaintySelector

Batch selector which looks for batches with large total covariance, i.e., large joint entropy.
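
For a Gaussian, joint entropy is, up to constants, half the log-determinant of the batch covariance, so a batch with large joint entropy is one whose covariance submatrix has a large determinant. The sketch below shows one simple way to look for such a batch: greedy forward selection on the determinant. The greedy loop, the helper names and the toy matrix are assumptions made for illustration; the optimizer ALIEN actually uses may differ.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def greedy_logdet_batch(cov, batch_size):
    """Greedily add the sample that maximizes det of the batch covariance."""
    chosen = []
    remaining = list(range(len(cov)))
    while len(chosen) < batch_size and remaining:
        def batch_det(i):
            idx = chosen + [i]
            return det([[cov[r][c] for c in idx] for r in idx])
        best = max(remaining, key=batch_det)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Samples 0 and 1 are uncertain but strongly correlated (redundant);
# sample 2 is less uncertain but independent of both.
cov = [
    [1.10, 0.95, 0.00],
    [0.95, 1.00, 0.00],
    [0.00, 0.00, 0.60],
]
batch = greedy_logdet_batch(cov, batch_size=2)
```

In this toy case the second pick is sample 2 rather than the higher-variance sample 1, because sample 1 is nearly redundant with sample 0. That is the behaviour that separates a covariance-based batch from a plain top-variance ranking.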

Parameters:

model – An instance of models.CovarianceRegressor. Will be used to determine prediction covariances for proposed batches.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features)—note that data.Dataset serves this purpose—or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the samples parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
batch_size – Size of the batch to select.
regularization –
The diagonal of the covariance matrix will be multiplied by (1 + regularization), after being computed by the model. This ensures that the covariance matrix is positive definite (as long as all the covariances are positive). Defaults to 0.05.

This parameter is particularly important if the covariance is computed from an ensemble of models, and the ensemble is not very large: for a given batch of N samples, and a model ensemble size of M, the distribution of predictions will consist of M points in an N-dimensional space. If M < N, the covariance of the batch predictions is sure to have determinant 0, and no comparisons can be made (without regularization). Even if M >= N, a relatively small ensemble size can produce numerical instability in the covariances, which regularization smooths out.
normalize –
If True, scales the (co)variances by the inverse-square-length of the embedding vector (retrieved by a call to model.embedding), modulo a small constant. This prevents the algorithm from just seeking out those inputs which give large embedding vectors. Defaults to False.

If the model has not implemented an embedding method, setting normalize = True will raise a NotImplementedError when you call select(). LaplaceApproxRegressor and subclasses have implemented embedding(), as have some others.
normalize_epsilon –
In the normalization step described above, variances are divided by |embedding_length|² + ε, where

ε = normalize_epsilon * MEAN_SQUARE(all embedding lengths)

Defaults to 1e-3. Should be related to measurement error.
similarity –
The effective covariance matrix (before regularization) will be (1 - similarity) * covariance (computed from the model) + similarity * S, where S is a “synthetic” covariance computed from a similarity matrix, as follows.

First, a similarity matrix is computed: each feature dimension in the data, X, is normalized to have variance 1. Then, a Euclidean distance matrix is computed for the whole dataset (divided by sqrt(N), where N is the number of feature dimensions). This distance matrix is then passed into a decaying exponential, with 1/e-life equal to the parameter similarity_scale. This gives a positive-definite similarity matrix, with ones on the diagonal.

Second, the similarity matrix is interpreted as a correlation matrix. Variances are taken from the model (i.e., copied from the diagonal of the covariance matrix). Together, these data determine a covariance matrix, which is combined with the model covariance in the given proportion.

Defaults to 0.
similarity_scale – This tunes the correlation matrix in the similarity computation above. The pairwise Euclidean distances in normalized feature space are passed into an exponential with 1/e-life equal to similarity_scale. So, a smaller value for similarity_scale will give smaller off-diagonal entries in the correlation/covariance matrix. If similarity_scale is set to 'auto', then a scale is chosen to match the mean squares of the similarities and the correlations (from the model).
prior –
Specifies a “prior probability” for each sample. The covariance matrix (after the application of similarity) will be multiplied by the priors such that, if C_ij is a covariance and p_i, p_j are the corresponding priors, C’_ij = C_ij p_i p_j. This is a convenient way of introducing factors other than covariance into the ranking. prior may be an array of numbers (of size num_samples), or a function (applied to the samples), or one of the following:

'prediction': calculates a prior from the quantile of the predicted performance (not the uncertainties). prior_scale sets the power this quantile is raised to.

Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the covariance. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. Selects the subset of the samples which have the highest std_dev * prior score. If 0 < prefilter < 1, takes this fraction of the sample pool. If prefilter >= 1, takes this many samples. Since batch selection computes and stores the size-N^2 covariance matrix for the whole sample pool, it should work with at most around 10,000 samples. Prefiltering can work with much larger pools, since it only needs to compute N standard deviations. Therefore, a practical strategy is to take a sample pool about 5 times as big as you can handle covariances for, then narrow down to only the top 20% individual scores before batch selection. Narrowing to much less than 20% risks changing what will ultimately be the optimal batch.
random_seed – A random seed for deterministic behaviour.
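
The role of regularization for small ensembles can be checked on a toy case. The sketch below uses M = 2 ensemble members and N = 3 samples, computes the sample covariance of their predictions with exact fractions, and compares determinants before and after scaling the diagonal by (1 + regularization). All helper names and numbers are made up for illustration and are not ALIEN code.

```python
from fractions import Fraction


def ensemble_covariance(preds):
    """Sample covariance; preds[m][i] is member m's prediction for sample i."""
    m, n = len(preds), len(preds[0])
    means = [sum(p[i] for p in preds) / m for i in range(n)]
    return [
        [sum((p[i] - means[i]) * (p[j] - means[j]) for p in preds) / (m - 1)
         for j in range(n)]
        for i in range(n)
    ]


def regularize(cov, regularization):
    """Multiply the diagonal by (1 + regularization), as described above."""
    n = len(cov)
    return [
        [cov[i][j] * (1 + regularization) if i == j else cov[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def det(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


# Two ensemble members (M = 2) predicting three samples (N = 3)
preds = [
    [Fraction(1), Fraction(2), Fraction(4)],
    [Fraction(3), Fraction(1), Fraction(5)],
]
cov = ensemble_covariance(preds)
raw_det = det(cov)                                # singular: M < N
reg_det = det(regularize(cov, Fraction(1, 20)))   # regularization = 0.05
```

Here raw_det is exactly zero, so batches could not be compared by determinant, while reg_det is strictly positive.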

get_prefilter(X=None, k=None, prior=1, score=None, return_indices=True)

model_predict(X, return_std_dev=False)

prediction_prior(samples)
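
One way to picture the 'prediction' prior and its effect on the covariance is sketched below. The quantile definition (rank divided by pool size, ties broken by position) and the helper names quantile_prior and apply_prior are assumptions made for illustration; ALIEN's prediction_prior may compute the quantile differently.

```python
def quantile_prior(predictions, prior_scale=1):
    """Quantile of each prediction within the pool, raised to prior_scale.

    Assumption: quantile = rank / n, with rank 1 for the lowest prediction.
    """
    n = len(predictions)
    order = sorted(range(n), key=lambda i: predictions[i])
    prior = [0.0] * n
    for rank, i in enumerate(order, start=1):
        prior[i] = (rank / n) ** prior_scale
    return prior


def apply_prior(cov, prior):
    """Scale the covariance as C'_ij = C_ij * p_i * p_j."""
    n = len(cov)
    return [[cov[i][j] * prior[i] * prior[j] for j in range(n)]
            for i in range(n)]


predictions = [0.3, 1.2, 0.7, 2.0]          # toy predicted performance
prior = quantile_prior(predictions)          # prior_scale = 1
sharp_prior = quantile_prior(predictions, prior_scale=2)

cov = [
    [4.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
]
scaled = apply_prior(cov, prior)
```

In this sketch a larger prior_scale pushes the priors of low-scoring samples closer to zero, which shrinks their rows and columns in the scaled covariance and makes them less likely to be selected.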

select(*args, **kwargs)

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).

BAITSelector

class alien.selection.BAITSelector(model=None, samples=None, num_samples=None, gamma=1, oversample=2, random_seed=None, **kwargs)[source]

Bases: SampleSelector

Batch selector following the BAIT strategy. See https://arxiv.org/abs/2106.09675. This strategy optimizes the trace of the Fisher matrix between the outputs and the last layer of parameters. This is a measure of the mutual information between the unknown labels and the parameters.

BAIT optimizes the trace of the Fisher matrix for the “batch” consisting of all previously labelled samples plus the unlabelled candidate samples. This means that BAITSelector needs to know the previously labelled samples. They can be passed into either __init__() or select(), as labelled_samples. (This class will try to determine whether labelled_samples needs to be unpacked into separate X and y columns; only the X column is needed.)

There are two hyperparameters, gamma and oversample, described below.

Parameters:

model – An instance of models.LinearizableRegressor, or a model which implements the embedding method.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features)—note that alien.data.Dataset serves this purpose—or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to self.select(…), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
labelled_samples – The samples which have already been labelled (or are in the process of being labelled). This class will try to determine whether labelled_samples needs to be unpacked into separate X and y columns; only the X column is needed.
batch_size – Size of the batch to select.
random_seed – A random seed for deterministic behaviour.
gamma – The ‘regularization’ parameter in the BAIT algorithm. A larger value corresponds to narrower priors. Defaults to 1, which works well enough.
oversample – The factor by which to oversample in the greedy acquisition step. BAIT will greedily draw a batch of oversample * batch_size samples, then greedily remove all but batch_size of them. Defaults to 2, which is empirically good.
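
The oversample step can be pictured as a forward pass followed by a backward pass. The sketch below shows that control flow for a generic set score to be maximized. The toy score is only a stand-in: BAIT's real objective is based on Fisher information, and the function forward_backward_greedy is hypothetical, not ALIEN's implementation.

```python
def forward_backward_greedy(score, n_candidates, batch_size, oversample=2):
    """Greedy forward selection followed by greedy backward pruning.

    `score` maps a list of candidate indices to a number to maximize.
    """
    chosen = []
    remaining = list(range(n_candidates))
    target = min(n_candidates, int(oversample * batch_size))

    # Forward pass: grow the set to oversample * batch_size samples
    while len(chosen) < target:
        best = max(remaining, key=lambda i: score(chosen + [i]))
        chosen.append(best)
        remaining.remove(best)

    # Backward pass: repeatedly drop the sample whose removal hurts least
    while len(chosen) > batch_size:
        drop = max(chosen, key=lambda i: score([j for j in chosen if j != i]))
        chosen.remove(drop)
    return chosen


# Toy data: each candidate has a value and a group; picking two candidates
# from the same group is penalized, which rewards diverse batches.
values = [5.0, 4.5, 4.0, 3.0, 2.0, 1.0]
groups = ["a", "a", "b", "b", "c", "c"]


def toy_score(indices):
    total = sum(values[i] for i in indices)
    same_group_pairs = sum(
        1
        for pos, i in enumerate(indices)
        for j in indices[pos + 1:]
        if groups[i] == groups[j]
    )
    return total - 3.0 * same_group_pairs


batch = forward_backward_greedy(toy_score, len(values), batch_size=2)
```

The backward pass gives the procedure a chance to undo early greedy picks that look worse once the rest of the oversampled set is known.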

model_predict(X, return_std_dev=False)

prediction_prior(samples)

select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).

KmeansSelector

class alien.selection.KmeansSelector(model=None, batch_size=1, samples=None, num_samples=None, labelled_samples=None, X_key='X', prior=None, prior_scale=1, return_indices=False, verbose=1)[source]

Bases: SampleSelector

Selector based on K-Means algorithm.

model_predict(X, return_std_dev=False)

prediction_prior(samples)

select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).

RandomSelector

class alien.selection.RandomSelector(model=None, random_seed=None, **kwargs)[source]

Bases: SampleSelector

Select samples at random.

model_predict(X, return_std_dev=False)

prediction_prior(samples)

select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).

TimestampSelector

class alien.selection.TimestampSelector(model=None, random_seed=None, timestamp_key='t', timestamps=None, **kwargs)[source]

Bases: SampleSelector

model_predict(X, return_std_dev=False)

prediction_prior(samples)

select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)

Selects a batch from the provided samples, and returns it (or its indices).

All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.

However, some parameters here are not in the constructor: fixed_samples and fixed_prior.

Parameters:

batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose) or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the ‘samples’ parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples. I.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.

Returns:

The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).