alien.selection
This module implements a number of batch selection strategies, and provides a framework for implementing others. Each strategy is implemented as a subclass of SampleSelector. The docs for SampleSelector give a good introduction to the interface.
A selector object needs at least two things:
- A Model.
- A pool of unlabeled samples to select from. This can be a Dataset, or any sufficiently array-like object. Alternatively, this can be a SampleGenerator.
Additionally, some selectors also need:
- The labeled samples the model was trained on.
Given these, and various hyperparameters and auxiliary data, you can call on the selector to select a batch of candidates for labeling, chosen from the unlabeled samples.
Example
from alien.selection import CovarianceSelector

# deep_model is a trained model; unlabeled_pool holds the candidate samples.
selector = CovarianceSelector(
    model=deep_model,
    samples=unlabeled_pool,
    batch_size=10,
)
batch = selector.select()
ALIEN has the following selection strategies implemented.
Active learning strategies - These are aimed at improving model performance as quickly as possible.
Optimization strategies - These are aimed at finding the highest scorers as quickly as possible.
ThompsonSelector
Baselines
You can write your own strategies! ALIEN’s Selector architecture is designed to make it as simple as possible for you to implement other selection strategies. Documentation coming soon!
SampleSelector
- class alien.selection.SampleSelector(model=None, batch_size=1, samples=None, num_samples=None, labelled_samples=None, X_key='X', prior=None, prior_scale=1, return_indices=False, verbose=1)
Abstract base class for selection strategies
- Parameters:
model – An instance of models.RegressionModel. Will be used to determine prediction covariances for proposed batches.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose), or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the samples parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
labelled_samples – Some selection strategies need to know the previously-labelled samples.
batch_size – Size of the batch to select.
prior – Specifies a “prior probability” for each sample. Each selector may use this prior as it sees fit, but generally, samples with low prior are de-emphasized in the selection process. This is a convenient way of introducing factors other than uncertainty into the ranking. prior may be an array of numbers (of size num_samples), or a function (applied to the samples), or one of the following:
- 'prediction': calculates a prior from the quantile of the predicted performance (not the uncertainties). prior_scale sets the power this quantile is raised to.
Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If a is the single-sample acquisition function, then prefilter = True selects a subset of the provided samples maximizing a. If 0 < prefilter < 1, takes this fraction of the sample pool. If prefilter >= 1, takes this many samples.
Some of the selectors are limited in how many samples they can consider for the final, batch-selection problem. For example, CovarianceSelector computes and stores the size-N^2 covariance matrix for the whole sample pool; therefore, because of memory constraints, it should work with at most around 10,000 samples.
In such cases, there is often a cheaper prefiltering operation available. E.g., CovarianceSelector prefilters only with the variance, rather than the full covariance.
A practical strategy in such cases is to take a sample pool about 5 times as big as the selector can handle for the final computation, then narrow down to only the top 20% individual scores before batch selection. Narrowing to much less than 20% risks reducing diversity too much and changing what would ultimately be the selected batch.
random_seed – A random seed for deterministic behaviour.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
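The prior and prefilter rules described above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not ALIEN's implementation: the helper names quantile_prior, resolve_prefilter_size and prefilter_top_k are hypothetical, the quantile is assumed to be a simple empirical rank, and only numeric prefilter values are handled.

```python
import numpy as np


def quantile_prior(predictions, prior_scale=1.0):
    """Hypothetical helper: empirical quantile of each prediction, raised to prior_scale."""
    predictions = np.asarray(predictions, dtype=float)
    # Rank 0 is the lowest prediction; ties are broken by position.
    ranks = np.argsort(np.argsort(predictions))
    quantiles = (ranks + 1) / len(predictions)  # values in (0, 1]
    return quantiles ** prior_scale


def resolve_prefilter_size(prefilter, pool_size):
    """Hypothetical helper: turn a numeric prefilter setting into a sample count."""
    if 0 < prefilter < 1:
        # Interpreted as a fraction of the pool.
        return max(1, int(round(prefilter * pool_size)))
    # Interpreted as an absolute count, capped at the pool size.
    return min(int(prefilter), pool_size)


def prefilter_top_k(scores, k):
    """Indices of the k highest single-sample scores, best first."""
    scores = np.asarray(scores, dtype=float)
    return np.argsort(-scores, kind="stable")[:k]


rng = np.random.default_rng(0)
std_dev = rng.uniform(0.1, 1.0, size=50)   # stand-in for model uncertainties
predictions = rng.normal(size=50)           # stand-in for model predictions

prior = quantile_prior(predictions, prior_scale=2.0)
k = resolve_prefilter_size(0.2, pool_size=50)   # 20% of a 50-sample pool
kept = prefilter_top_k(std_dev * prior, k)      # indices passed on to batch selection
```

Here the single-sample score is taken to be uncertainty times prior; a selector is free to use a different acquisition function for this step.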
- select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).
All of the arguments to select() are optional. If you have provided samples and other necessary parameters to the constructor already, then you may omit them here.
However, some parameters here are not in the constructor: fixed_samples and fixed_prior.
- Parameters:
batch_size (int, optional) – The size of the batch to select.
samples (ArrayLike, optional) – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose), or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the samples parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
prior – A “prior probability” for each sample. May be an array of numbers, a function, or the string 'prediction'. A more detailed explanation is above, in the class definition. Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the samples. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. If 0 < prefilter < 1, we use this fraction of the sample pool. If prefilter >= 1, we use this many samples. A more detailed explanation is above, in the class definition.
return_indices – If True, select() will return the indices of the selection (from within the given sample pool). If False, select() will return the actual selected samples. Defaults to False.
X_key – The key used to extract the X values from samples, i.e., X = samples[X_key]. This is only in effect if you pass an explicit value to X_key, or if samples is a DictDatabase with key 'X'. By default, X = samples.
fixed_samples (ArrayLike, optional) – This parameter is for passing in those samples which have been previously selected for labeling, but which haven’t been labeled yet. (E.g., you’ve previously sent off a batch to the laboratory pipeline for testing, but you need to select the next batch for the pipeline before the results are in.) Some selection strategies (e.g., CovarianceSelector and BAITSelector) will use this information to avoid redundancy between the newly selected batch and the fixed_samples.
fixed_prior – If you provide an explicit (i.e., array-like) prior for samples, then you must also provide a prior for fixed_samples.
- Returns:
The selected batch, either as a sub-array of samples, or as an array of indices into samples (if return_indices is set to True).
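The relationship between X_key, return_indices and the returned batch can be illustrated with a small NumPy sketch. The helpers extract_X and package_selection are hypothetical and exist only for this illustration, and a plain dict stands in for a dictionary-style sample container; none of this is taken from ALIEN's source.

```python
import numpy as np


def extract_X(samples, X_key=None):
    """Hypothetical helper mirroring the X_key rule described above."""
    if X_key is not None:
        return samples[X_key]          # an explicit key always wins
    if isinstance(samples, dict) and "X" in samples:
        return samples["X"]            # dict-like pool with the default 'X' key
    return samples                     # by default, X = samples


def package_selection(samples, indices, return_indices=False):
    """Return either the chosen indices or the corresponding rows."""
    indices = np.asarray(indices)
    return indices if return_indices else samples[indices]


pool = {"X": np.arange(20.0).reshape(10, 2), "y": np.arange(10.0)}
X = extract_X(pool)
chosen = [7, 2, 5]                     # stand-in for a selector's choice

as_indices = package_selection(X, chosen, return_indices=True)
as_samples = package_selection(X, chosen, return_indices=False)
```

Indexing the pool with as_indices reproduces as_samples, which is why asking for indices is convenient when the pool also carries metadata you want to look up afterwards.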
CovarianceSelector
- class alien.selection.CovarianceSelector(model=None, samples=None, num_samples=inf, batch_size=1, normalize=False, normalize_epsilon=0.001, regularization=0.05, similarity=0, similarity_scale='auto', prior=1, prior_scale=1, prefilter=None, random_seed=None, fast_opt=True, n_rounds=10, **kwargs)
Bases: UncertaintySelector
Batch selector which looks for batches with large total covariance, i.e., large joint entropy.
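For jointly Gaussian predictions with batch covariance Σ, the joint entropy is ½ log det(2πe Σ), so "large joint entropy" amounts to a large log-determinant of the batch covariance. The sketch below shows one simple way to search for such a batch, a greedy log-determinant search. It is an illustration only: the function names are hypothetical, and the optimizer ALIEN actually uses may differ.

```python
import numpy as np


def batch_log_det(cov, batch):
    """Log-determinant of the covariance restricted to the batch indices."""
    sub = cov[np.ix_(batch, batch)]
    sign, log_det = np.linalg.slogdet(sub)
    return log_det if sign > 0 else -np.inf


def greedy_log_det_batch(cov, batch_size):
    """Greedily grow a batch, adding the sample that most increases log det."""
    chosen = []
    for _ in range(batch_size):
        candidates = [i for i in range(len(cov)) if i not in chosen]
        gains = [batch_log_det(cov, chosen + [i]) for i in candidates]
        chosen.append(candidates[int(np.argmax(gains))])
    return chosen


# Samples 0 and 1 are highly uncertain but almost perfectly correlated;
# sample 2 is less uncertain but independent of the others.
cov = np.array([
    [1.00, 0.99, 0.00],
    [0.99, 1.00, 0.00],
    [0.00, 0.00, 0.50],
])

batch = greedy_log_det_batch(cov, batch_size=2)
top_variance = sorted(np.argsort(-np.diag(cov))[:2].tolist())
```

Ranking by variance alone picks the two redundant samples, while the log-determinant criterion pairs one of them with the independent sample.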
- Parameters:
model – An instance of models.CovarianceRegressor. Will be used to determine prediction covariances for proposed batches.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that data.Dataset serves this purpose), or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the samples parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
batch_size – Size of the batch to select.
regularization – The diagonal of the covariance matrix will be multiplied by (1 + regularization), after being computed by the model. This ensures that the covariance matrix is positive definite (as long as all the variances are positive). Defaults to 0.05.
This parameter is particularly important if the covariance is computed from an ensemble of models, and the ensemble is not very large: for a given batch of N samples, and a model ensemble size of M, the distribution of predictions will consist of M points in an N-dimensional space. If M < N, the covariance of the batch predictions is sure to have determinant 0, and no comparisons can be made (without regularization). Even if M >= N, a relatively small ensemble size can produce numerical instability in the covariances, which regularization smooths out.
normalize – If True, scales the (co)variances by the inverse-square-length of the embedding vector (retrieved by a call to model.embedding), modulo a small constant. This prevents the algorithm from just seeking out those inputs which give large embedding vectors. Defaults to False.
If the model has not implemented an embedding method, setting normalize = True will raise a NotImplementedError when you call select(). LaplaceApproxRegressor and subclasses have implemented embedding(), as have some others.
normalize_epsilon – In the normalization step described above, variances are divided by |embedding_length|² + ε, where
ε = normalize_epsilon * MEAN_SQUARE(all embedding lengths)
Defaults to 1e-3. Should be related to measurement error.
similarity – The effective covariance matrix (before regularization) will be (1 - similarity) * covariance (computed from the model) + similarity * S, where S is a “synthetic” covariance computed from a similarity matrix, as follows:
First, a similarity matrix is computed: each feature dimension in the data, X, is normalized to have variance 1. Then a Euclidean distance matrix is computed for the whole dataset (divided by sqrt(N), where N is the number of feature dimensions). Then this distance matrix is passed into a decaying exponential, with 1/e-life equal to the parameter similarity_scale. This gives a positive-definite similarity matrix, with ones on the diagonal.
Second, the similarity matrix is interpreted as a correlation matrix. Variances are taken from the model (i.e., copied from the diagonal of the covariance matrix). Together, these data determine a covariance matrix, which will be combined with the model covariance in the given proportion.
Defaults to 0.
similarity_scale – This tunes the correlation matrix in the similarity computation above. The pairwise Euclidean distances in normalized feature space are passed into an exponential with 1/e-life equal to similarity_scale. So, a smaller value for similarity_scale will give smaller off-diagonal entries in the correlation/covariance matrix. If similarity_scale is set to 'auto', then a scale is chosen to match the mean squares of the similarities and the correlations (from the model).
prior – Specifies a “prior probability” for each sample. The covariance matrix (after the application of similarity) will be multiplied by the priors such that, if C_ij is a covariance and p_i, p_j are the corresponding priors, C’_ij = C_ij p_i p_j. This is a convenient way of introducing factors other than covariance into the ranking. prior may be an array of numbers (of size num_samples), or a function (applied to the samples), or one of the following:
- 'prediction': calculates a prior from the quantile of the predicted performance (not the uncertainties). prior_scale sets the power this quantile is raised to.
Defaults to the constant value 1.
prior_scale – The prior will be raised to this power before applying it to the covariance. Defaults to 1.
prefilter – Reduces the incoming sample pool before applying batch selection. Selects the subset of the samples which have the highest std_dev * prior score. If 0 < prefilter < 1, takes this fraction of the sample pool. If prefilter >= 1, takes this many samples. Since batch selection computes and stores the size-N^2 covariance matrix for the whole sample pool, it should work with at most around 10,000 samples. Prefiltering can work with much larger pools, since it only needs to compute N standard deviations. Therefore, a practical strategy is to take a sample pool about 5 times as big as you can handle covariances for, then narrow down to only the top 20% individual scores before batch selection. Narrowing to much less than 20% risks changing what will ultimately be the optimal batch.
random_seed – A random seed for deterministic behaviour.
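The regularization, similarity and prior parameters above all act on the covariance matrix before batch selection. The following NumPy sketch strings those steps together under stated assumptions: the function names are hypothetical, the covariance comes from a toy ensemble, similarity_scale is a fixed number (the 'auto' setting is not reproduced), and the normalize step is omitted. It is meant to make the formulas concrete, not to mirror ALIEN's code.

```python
import numpy as np


def ensemble_covariance(preds):
    """Covariance between samples, from ensemble predictions of shape (M, N)."""
    return np.cov(preds, rowvar=False)


def similarity_covariance(X, variances, similarity_scale):
    """Synthetic covariance: exponential-decay correlations with model variances."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)               # unit variance per feature
    diff = Xn[:, None, :] - Xn[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) / X.shape[1])   # distance / sqrt(#features)
    corr = np.exp(-dist / similarity_scale)                 # ones on the diagonal
    std = np.sqrt(variances)
    return corr * np.outer(std, std)


def effective_covariance(cov, X, prior, similarity=0.0, similarity_scale=1.0,
                         regularization=0.05):
    """Blend in similarity, apply priors, then inflate the diagonal."""
    S = similarity_covariance(X, np.diag(cov), similarity_scale)
    mixed = (1 - similarity) * cov + similarity * S
    mixed = mixed * np.outer(prior, prior)                  # C'_ij = C_ij p_i p_j
    mixed[np.diag_indices_from(mixed)] *= 1 + regularization
    return mixed


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # N = 5 samples, 3 features
preds = rng.normal(size=(3, 5))      # ensemble of M = 3 models, so M < N

cov = ensemble_covariance(preds)     # rank-deficient: determinant is (numerically) 0
eff = effective_covariance(cov, X, prior=np.ones(5), similarity=0.2)
min_eig_regularized = np.linalg.eigvalsh(
    effective_covariance(cov, X, prior=np.ones(5))
).min()
```

With only three ensemble members and five samples the raw covariance is singular; inflating the diagonal makes every eigenvalue strictly positive, so log-determinants of candidate batches can be compared.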
- get_prefilter(X=None, k=None, prior=1, score=None, return_indices=True)
- model_predict(X, return_std_dev=False)
- prediction_prior(samples)
- select(*args, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).
BAITSelector
- class alien.selection.BAITSelector(model=None, samples=None, num_samples=None, gamma=1, oversample=2, random_seed=None, **kwargs)
Bases: SampleSelector
Batch selector following the BAIT strategy. See https://arxiv.org/abs/2106.09675. This strategy optimizes the trace of the Fisher matrix between the outputs and the last layer of parameters. This is a measure of the mutual information between the unknown labels and the parameters.
BAIT optimizes the trace of the Fisher for the “batch” consisting of all previously labelled samples plus the unlabelled candidate samples. This means that BAITSelector needs to know the previously labelled samples. They can be passed into either __init__() or select(), as labelled_samples. (This class will try to determine whether labelled_samples needs to be unpacked into separate X and y columns; only the X column is needed.)
There are two hyperparameters, gamma and oversample, described below.
- Parameters:
model – An instance of models.LinearizableRegressor, or a model which implements the embedding method.
samples – The sample pool to select from. Can be a numpy-style addressable array (with first dimension indexing samples, and other dimensions indexing features; note that alien.data.Dataset serves this purpose), or an instance of sample_generation.SampleGenerator, in which case the num_samples parameter is in effect.
num_samples – If a SampleGenerator has been provided via the samples parameter, then at the start of a call to select(), num_samples samples will be drawn from the SampleGenerator, or as many samples as the SampleGenerator can provide, whichever is less. Defaults to Inf, i.e., draws as many samples as available.
labelled_samples – The samples which have already been labelled (or are in the process of being labelled). This class will try to determine whether labelled_samples needs to be unpacked into separate X and y columns; only the X column is needed.
batch_size – Size of the batch to select.
random_seed – A random seed for deterministic behaviour.
gamma – The ‘regularization’ parameter in the BAIT algorithm. A larger value corresponds to narrower priors. Defaults to 1, which works well enough.
oversample – The factor by which to oversample in the greedy acquisition step. BAIT will greedily draw a batch of oversample * batch_size samples, then greedily remove all but batch_size of them. Defaults to 2, which is empirically good.
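The roles of gamma and oversample can be seen in a simplified sketch of the forward-greedy and backward-pruning passes. Everything here is an assumption made for illustration: samples are represented by embedding vectors, the per-sample Fisher matrix is taken to be the outer product of the embedding with itself (as for a linear last layer with squared loss), gamma is used as a ridge term, the pool Fisher is averaged over the candidates only, and the names bait_objective and bait_select are hypothetical. Matrix inverses are recomputed naively rather than updated incrementally.

```python
import numpy as np


def bait_objective(M, fisher_pool):
    """tr(M^{-1} F): smaller means the chosen set is more informative about the pool."""
    return np.trace(np.linalg.solve(M, fisher_pool))


def bait_select(emb_pool, emb_labelled, batch_size, gamma=1.0, oversample=2):
    """Simplified forward-greedy / backward-pruning selection on embeddings."""
    n, d = emb_pool.shape
    fisher_pool = emb_pool.T @ emb_pool / n
    # Information already available: ridge term plus the labelled samples.
    M = gamma * np.eye(d) + emb_labelled.T @ emb_labelled
    chosen = []

    # Forward pass: greedily add oversample * batch_size samples.
    n_forward = min(n, int(oversample * batch_size))
    for _ in range(n_forward):
        candidates = [i for i in range(n) if i not in chosen]
        scores = [bait_objective(M + np.outer(emb_pool[i], emb_pool[i]), fisher_pool)
                  for i in candidates]
        best = candidates[int(np.argmin(scores))]
        chosen.append(best)
        M += np.outer(emb_pool[best], emb_pool[best])

    # Backward pass: greedily drop the least useful samples until batch_size remain.
    while len(chosen) > batch_size:
        scores = [bait_objective(M - np.outer(emb_pool[i], emb_pool[i]), fisher_pool)
                  for i in chosen]
        worst = chosen[int(np.argmin(scores))]
        chosen.remove(worst)
        M -= np.outer(emb_pool[worst], emb_pool[worst])
    return chosen


rng = np.random.default_rng(0)
emb_pool = rng.normal(size=(30, 4))       # stand-in embeddings of unlabelled candidates
emb_labelled = rng.normal(size=(5, 4))    # stand-in embeddings of labelled samples
batch = bait_select(emb_pool, emb_labelled, batch_size=5)
```

Oversampling and then pruning gives the greedy search a chance to undo early choices that later additions made redundant.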
- model_predict(X, return_std_dev=False)
- prediction_prior(samples)
- select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).
KmeansSelector
- class alien.selection.KmeansSelector(model=None, batch_size=1, samples=None, num_samples=None, labelled_samples=None, X_key='X', prior=None, prior_scale=1, return_indices=False, verbose=1)
Bases: SampleSelector
Selector based on K-Means algorithm.
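The description above does not say how the clusters are turned into a batch. One common approach, shown here purely as an assumption, is to cluster the pool into batch_size groups and pick the sample nearest to each centroid, which spreads the batch across the feature space. The function names are hypothetical, and the deterministic farthest-point initialization is a choice made to keep the sketch reproducible.

```python
import numpy as np


def farthest_point_init(X, k):
    """Deterministic initial centroids: start at sample 0, then add the farthest point."""
    chosen = [0]
    min_dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].astype(float)


def kmeans_batch(X, batch_size, n_iter=20):
    """Lloyd's algorithm, then the index of the sample closest to each centroid."""
    centroids = farthest_point_init(X, batch_size)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(batch_size):
            members = X[labels == k]
            if len(members) > 0:      # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=0)


# Three well-separated groups of four samples each.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.concatenate([c + offsets for c in centers])

batch = kmeans_batch(X, batch_size=3)   # one representative index per group
```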
- model_predict(X, return_std_dev=False)
- prediction_prior(samples)
- select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).
RandomSelector
- class alien.selection.RandomSelector(model=None, random_seed=None, **kwargs)
Bases: SampleSelector
Select samples at random.
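A random baseline of this kind can be sketched in a few lines of NumPy. The function random_batch is a hypothetical stand-in, not the class's implementation; it assumes uniform sampling without replacement and shows how a seed makes the draw repeatable.

```python
import numpy as np


def random_batch(num_samples, batch_size, random_seed=None):
    """Draw batch_size distinct indices uniformly from range(num_samples)."""
    rng = np.random.default_rng(random_seed)
    return rng.choice(num_samples, size=batch_size, replace=False)


first = random_batch(100, 10, random_seed=42)
second = random_batch(100, 10, random_seed=42)   # same seed, same batch
```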
- model_predict(X, return_std_dev=False)
- prediction_prior(samples)
- select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).
TimestampSelector
- class alien.selection.TimestampSelector(model=None, random_seed=None, timestamp_key='t', timestamps=None, **kwargs)
Bases: SampleSelector
- model_predict(X, return_std_dev=False)
- prediction_prior(samples)
- select(batch_size=None, samples=None, num_samples=None, prior=None, X_key=None, fixed_samples=None, fixed_prior=None, return_indices=None, tail_call=None, **kwargs)
Selects a batch from the provided samples, and returns it (or its indices).