alien.data

Submodules

alien.data.dataset module

Module with dataset (sub-)classes for storing data.

class alien.data.dataset.Dataset(*args, **kwargs)[source]

Bases: object

Abstract interface to a readable dataset.

abstract find(value, first=True)[source]: Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise returns an iterable of the indices of all occurrences.
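The contract of find can be illustrated with a minimal sketch over a plain Python sequence. The function find_in_sequence below is a hypothetical stand-in written for this illustration, not part of the library, and it assumes plain == comparison between samples.

```python
from typing import Any, List, Optional, Sequence, Union


def find_in_sequence(
    data: Sequence[Any], value: Any, first: bool = True
) -> Union[Optional[int], List[int]]:
    """Hypothetical stand-in mirroring the documented contract of `find`."""
    # Lazily enumerate the positions whose item equals the target value.
    matches = (i for i, item in enumerate(data) if item == value)
    if first:
        # Index of the first occurrence, or None if the value is absent.
        return next(matches, None)
    # Otherwise, the indices of all occurrences.
    return list(matches)


samples = ["a", "b", "a", "c"]
first_hit = find_in_sequence(samples, "a")
all_hits = find_in_sequence(samples, "a", first=False)
missing = find_in_sequence(samples, "z")
```

Concrete subclasses may compare samples differently (for example, element-wise on arrays), so this sketch only captures the return-value convention.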

static from_data(*args, **kwargs)[source]: Returns a Dataset built from the given data and other args. Arguments and functionality are exactly like TeachableDataset.from_data. In fact, at present, this method just calls TeachableDataset.from_data.

property X: Return features.

property y: Return targets.

check_Xy()[source]

abstract property shape: Abstract method for returning shape.

property ndim: Returns the number of dimensions (int).

property batch_shape

property feature_shape

reshape(*shape, index=None, bdim=None)[source]

class alien.data.dataset.TeachableDataset(*args, **kwargs)[source]

Bases: Dataset

Abstract interface to a teachable dataset.

abstract append(x: Any)[source]: Appends a single sample to the end of the dataset.

extend(X: _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes])[source]: Appends a batch of samples to the end of the dataset.

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.
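As a rough illustration of how the shuffle and random_seed parameters interact, the sketch below maps the documented shuffle values to an index permutation. The helper initial_permutation is hypothetical, it uses Python's random module as a stand-in for Numpy's shuffle, and rejecting other values with ValueError is an assumption rather than documented behaviour.

```python
import random
from typing import Any, List, Optional


def initial_permutation(
    n: int, shuffle: Any, random_seed: Optional[int] = None
) -> Optional[List[int]]:
    """Hypothetical helper: map the documented `shuffle` values to a permutation."""
    if not shuffle:
        # Anything evaluating to False: no shuffle wrapper is applied.
        return None
    indices = list(range(n))
    if shuffle == "identity":
        # The initial shuffle is the identity permutation.
        return indices
    if shuffle == "random":
        # random.Random(None) seeds itself from OS entropy, analogous to the
        # documented behaviour when random_seed is None.
        random.Random(random_seed).shuffle(indices)
        return indices
    # Assumption: any other value is treated as an error in this sketch.
    raise ValueError(f"unsupported shuffle value: {shuffle!r}")


no_shuffle = initial_permutation(5, False)
identity = initial_permutation(5, "identity")
seeded_a = initial_permutation(5, "random", random_seed=0)
seeded_b = initial_permutation(5, "random", random_seed=0)
```

Passing the same random_seed twice yields the same permutation, which is the usual reason to set it.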

static from_deepchem(data)[source]

get_shuffle(shuffle='random', random_seed=None)[source]

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property X: Return features.

property batch_shape

check_Xy()

property feature_shape

abstract find(value, first=True): Finds instances of value in this dataset. If first is True, returns the index of the first occurence (or None if not found), otherwise returns an iterable of indices of all occurences.

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

abstract property shape: Abstract method for returning shape.

property y: Return targets.

class alien.data.dataset.TeachableWrapperDataset(*args, **kwargs)[source]

Bases: TeachableDataset

Wraps another dataset-like object. Functions as an abstract base class for wrapping specific data types. Also functions concretely as the default wrapper for MutableSequences, other TeachableDatasets, and anything else which exposes a suitable interface.

append(x)[source]: Appends a single sample to the end of the dataset.

extend(X)[source]: Appends a batch of samples to the end of the dataset.

find(value: Any, first: bool = True)[source]: Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise returns an iterable of the indices of all occurrences.

reshape_features(*shape, index=None)[source]

reshape_batch(*shape, index=None)[source]

property shape: Abstract method for returning shape.

property X: Return features.

property batch_shape

check_Xy()

property feature_shape

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

property y: Return targets.

class alien.data.dataset.ShuffledDataset(*args, **kwargs)[source]

Bases: TeachableWrapperDataset

Presents a shuffle of an existing dataset (or MutableSequence).

Added data goes at the end and isn’t shuffled (until reshuffle() is called).

Parameters:

data – the existing dataset to wrap
shuffle – determines the initial shuffle state: ‘random’ or ‘identity’
random_seed – random seed to pass to the numpy shuffle algorithm. If None, get a source of randomness from the OS.

reshuffle()[source]: Reshuffles self with self.rng.

extend_shuffle()[source]: Extend self.shuffle with [len(self.shuffle), …, len(self.data)].
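The behaviour described above (new data lands at the end, unshuffled, until reshuffle() is called) can be sketched with a small stand-in class. ShuffleView is hypothetical and is not the library's implementation; it uses Python's random module in place of Numpy, and it assumes that extend_shuffle appends the indices len(self.shuffle) up to, but not including, len(self.data).

```python
import random
from typing import Any, List, Optional


class ShuffleView:
    """Hypothetical stand-in for the documented ShuffledDataset behaviour."""

    def __init__(
        self, data: List[Any], shuffle: str = "random", random_seed: Optional[int] = None
    ) -> None:
        self.data = data
        self.rng = random.Random(random_seed)  # stand-in for the Numpy generator
        self.shuffle = list(range(len(data)))  # 'identity' keeps this order
        if shuffle == "random":
            self.rng.shuffle(self.shuffle)

    def extend_shuffle(self) -> None:
        # Assumption: cover the newly added rows in their natural order.
        self.shuffle.extend(range(len(self.shuffle), len(self.data)))

    def append(self, x: Any) -> None:
        # New samples go at the end and are not shuffled yet.
        self.data.append(x)
        self.extend_shuffle()

    def reshuffle(self) -> None:
        # Draw a fresh permutation over all rows, including appended ones.
        self.extend_shuffle()
        self.rng.shuffle(self.shuffle)

    def __getitem__(self, i: int) -> Any:
        return self.data[self.shuffle[i]]

    def __len__(self) -> int:
        return len(self.shuffle)


view = ShuffleView([10, 20, 30, 40], shuffle="random", random_seed=0)
view.append(50)
last_before_reshuffle = view[4]  # the appended sample is still last
view.reshuffle()
```

Keeping the permutation separate from the wrapped data means the underlying dataset is never reordered; only the index mapping changes.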

find(value: Any, first: bool = True)[source]

Return index(es) of value in self.

Parameters:

value (Any) – value to look for
first (bool, optional) – whether to return first instance of value or all of them. Defaults to True.

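Returns:

The index of the first occurrence of value (or None if not found) if first is True; otherwise an iterable of the indices of all occurrences.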

property X: Return features.

property y: Return targets.

append(x): Appends a single sample to the end of the dataset.

property batch_shape

check_Xy()

extend(X): Appends a batch of samples to the end of the dataset.

property feature_shape

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property shape: Abstract method for returning shape.

alien.data.dataset.compute_bdim(old_shape, old_bdim, new_shape)[source]

class alien.data.dataset.ArrayDataset(*args, **kwargs)[source]

Bases: TeachableWrapperDataset

Abstract base class for datasets based on numpy, pytorch, or other similarly-interfaced arrays.

append(x)[source]: Appends a single sample to the end of the dataset.

find(value, first=True)[source]: Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise returns an iterable of the indices of all occurrences.

reshape(*shape, index=None, bdim=None)[source]

property X: Return features.

property batch_shape

check_Xy()

extend(X): Appends a batch of samples to the end of the dataset.

property feature_shape

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property shape: Abstract method for returning shape.

property y: Return targets.

class alien.data.dataset.NumpyDataset(*args, **kwargs)[source]

Bases: ArrayDataset

Dataset with Numpy array as data.

extend(X)[source]: Appends a batch of samples to the end of the dataset.

property X: Return features.

append(x): Appends a single sample to the end of the dataset.

property batch_shape

check_Xy()

property feature_shape

find(value, first=True): Finds instances of value in this dataset. If first is True, returns the index of the first occurence (or None if not found), otherwise returns an iterable of indices of all occurences.

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property shape: Abstract method for returning shape.

property y: Return targets.

class alien.data.dataset.TorchDataset(*args, **kwargs)[source]

Bases: ArrayDataset

Dataset with torch.tensor as data.

extend(X)[source]: Appends a batch of samples to the end of the dataset.

property X: Return features.

append(x): Appends a single sample to the end of the dataset.

property batch_shape

check_Xy()

property feature_shape

find(value, first=True): Finds instances of value in this dataset. If first is True, returns the index of the first occurence (or None if not found), otherwise returns an iterable of indices of all occurences.

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property shape: Abstract method for returning shape.

property y: Return targets.

class alien.data.dataset.DictDataset(*args, **kwargs)[source]

Bases: TeachableWrapperDataset

Contains a dictionary whose values are datasets.

For indexing purposes, the first self.bdim axes (i.e., the batch dimensions) index into the first axes of the constituent datasets, whereas the dictionary key “dimension” occurs right after the batch dimensions. Since there is usually exactly one batch dimension, this means you can index like

>>> dataset[:20, 'X']

which will return the first 20 rows of the 'X' constituent dataset, whereas

>>> dataset[:20]

will take the first 20 rows of each constituent dataset, and package them into a new DictDataset with the same keys.
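The indexing convention can be sketched with a small stand-in that stores plain lists per key. DictOfColumns is a hypothetical class written for this illustration, not the library's DictDataset, and it assumes exactly one batch dimension and slice-based batch indexing.

```python
from typing import Any, Dict, List


class DictOfColumns:
    """Hypothetical stand-in illustrating the documented indexing convention."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns

    def __getitem__(self, index: Any) -> Any:
        if isinstance(index, tuple):
            # (batch index, key): the key "dimension" comes right after
            # the single batch dimension.
            rows, key = index
            return self.columns[key][rows]
        # Batch index only: apply it to every constituent and keep the keys.
        # This sketch assumes a slice; an integer index would need extra care.
        return DictOfColumns({k: v[index] for k, v in self.columns.items()})


dataset = DictOfColumns({"X": list(range(100)), "y": [2 * i for i in range(100)]})
first_x = dataset[:20, "X"]   # first 20 rows of the 'X' constituent
first_rows = dataset[:20]     # new container with the same keys, 20 rows each
```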

append(x)[source]: Appends a single sample to the end of the dataset.

extend(X)[source]: Appends a batch of samples to the end of the dataset.

reshape(*shape, index=None, bdim=None)[source]

find(value, first=True)[source]: Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise returns an iterable of the indices of all occurrences.

property X: Return features.

property y: Return targets.

property shape: Abstract method for returning shape.

property ndim: Returns the number of dimensions (int).

property batch_shape

check_Xy()

property feature_shape

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

class alien.data.dataset.TupleDataset(*args, **kwargs)[source]

Bases: TeachableWrapperDataset

Dataset with Tuple as self.data.

append(x)[source]: Appends a single sample to the end of the dataset.

extend(X)[source]: Appends a batch of samples to the end of the dataset.

reshape(*shape, index=None, bdim=None)[source]

find(value, first=True)[source]: Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise returns an iterable of the indices of all occurrences.

property tuple: Getter for self.data.

property shape: Abstract method for returning shape.

property X: Return features.

property batch_shape

check_Xy()

property feature_shape

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property y: Return targets.

alien.data.deepchem module

Deepchem Dataset

class alien.data.deepchem.DeepChemDataset(*args, **kwargs)[source]

Bases: DictDataset

DeepChem dataset

Some common featurizers:

Keras GraphConvModels use the ConvMolFeaturizer, which may be
abbreviated to 'convmol' in the featurizer argument here.

Pytorch GCNModels use the MolGraphConvFeaturizer, which may be
abbreviated to 'molgraph' here.

static get_featurizer(f, **kwargs)[source]

static from_csv(file, X='X', y=None, featurizer=None, **kwargs)[source]

Loads a DeepChem dataset from a csv file.

Parameters:

X (str) – Name of the column holding the X (feature) data. Defaults to 'X'.
y (str) – Name of the column holding the y (target) data. Defaults to None.
featurizer – Specifies the DeepChem featurizer to use, if any. featurizer may be a DeepChem featurizer class, or a featurizer instance, or a string contained in the classname of a featurizer. (E.g., 'convmol' matches the DeepChem ConvMolFeaturizer.)
**kwargs – These are passed to the featurizer constructor.

Returns:

An alien.data.DeepChemDataset
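The string form of the featurizer argument is described as "a string contained in the classname of a featurizer". A minimal sketch of that kind of matching follows. The function match_featurizer_name is hypothetical, and the case-insensitive comparison and first-match-wins rule are assumptions inferred from the 'convmol' example, not confirmed library behaviour.

```python
from typing import List, Optional


def match_featurizer_name(query: str, class_names: List[str]) -> Optional[str]:
    """Return the first class name containing `query`, ignoring case (assumed)."""
    needle = query.lower()
    for name in class_names:
        if needle in name.lower():
            return name
    # No class name contains the query string.
    return None


# Class names of a few DeepChem featurizers, listed here only as test data.
known = ["ConvMolFeaturizer", "MolGraphConvFeaturizer", "CircularFingerprint"]
convmol = match_featurizer_name("convmol", known)
molgraph = match_featurizer_name("molgraph", known)
unknown = match_featurizer_name("nope", known)
```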

static from_df(df, X='X', y=None, ids='ids', weights=None, featurizer=None, **kwargs)[source]

Returns a DeepChemDataset built from a Pandas DataFrame.

Parameters:

df – The dataframe to convert
X – The name of the feature column. Defaults to “X”.
y – The name of the y/label column, or a list of names for multi-prediction. By default, no y values are extracted.
ids – The name of the ids column. By default, looks for a column named ‘ids’, and if none is found, uses the dataframe index.
weights – The name of the weights column. If none is given, uses 1.0 for all weights.
featurizer – Specifies the DeepChem featurizer to use, if any. featurizer may be a DeepChem featurizer class, or a featurizer instance, or a string contained in the classname of a featurizer. (E.g., 'convmol' matches the DeepChem ConvMolFeaturizer.)
**kwargs – Any additional keyword args will become columns in the dataset; for example, the keyword arg t='timestamp' creates a column with key t and values taken from df['timestamp'].
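The column-selection rules described for from_df (ids falling back to the index, weights defaulting to 1.0, extra keyword args becoming columns) can be sketched without Pandas. In the sketch below, select_columns is a hypothetical function, a dict of lists plus an explicit index list stands in for the DataFrame, the output key "w" for weights is an assumption, and only a single y column name is handled.

```python
from typing import Any, Dict, List, Optional


def select_columns(
    table: Dict[str, List[Any]],
    index: List[Any],
    X: str = "X",
    y: Optional[str] = None,
    ids: str = "ids",
    weights: Optional[str] = None,
    **extra: str,
) -> Dict[str, List[Any]]:
    """Hypothetical sketch of the documented column-selection rules."""
    out: Dict[str, List[Any]] = {"X": list(table[X])}
    if y is not None:
        # By default no y values are extracted.
        out["y"] = list(table[y])
    # Fall back to the row index when the ids column is absent.
    out["ids"] = list(table[ids]) if ids in table else list(index)
    # Use 1.0 for all weights when no weights column is named.
    # The key "w" is an assumption made for this sketch.
    out["w"] = list(table[weights]) if weights is not None else [1.0] * len(index)
    # Extra keyword args map a new key to a source column name.
    for key, column in extra.items():
        out[key] = list(table[column])
    return out


table = {"smiles": ["CCO", "CCN"], "logp": [0.1, 0.2], "timestamp": [5, 6]}
cols = select_columns(table, index=[0, 1], X="smiles", y="logp", t="timestamp")
```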

property X: Return features.

append(x): Appends a single sample to the end of the dataset.

property batch_shape

check_Xy()

extend(X): Appends a batch of samples to the end of the dataset.

property feature_shape

find(value, first=True): Finds instances of value in this dataset. If first is True, returns the index of the first occurence (or None if not found), otherwise returns an iterable of indices of all occurences.

static from_data: Creates a TeachableDataset with the given data.

Parameters:

data –
the initial data of the dataset. Can be:
- another TeachableDataset
- a Python mutable sequence (e.g., a list) or anything that implements the interface
- a Numpy array
- a Pytorch tensor
- a dictionary or tuple whose values are one of the above types
- a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
- anything evaluating to False
- ‘identity’ (initial shuffle is the identity)
- ‘random’ (initial shuffle is random)
random_seed – a random seed to pass to Numpy’s shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface, will still be wrapped; if False, such data will be returned as-is, with no new object created.

static from_deepchem(data)

get_shuffle(shuffle='random', random_seed=None)

Return a shuffled version of self

Parameters:

shuffle (str, optional) – The initial shuffle - 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.

Returns:

A shuffled version of self

Return type:

ShuffledDataset

property ndim: Returns the number of dimensions (int).

reshape(*shape, index=None, bdim=None)

reshape_batch(*shape, index=None)

reshape_features(*shape, index=None)

property shape: Abstract method for returning shape.

property y: Return targets.

alien.data.deepchem.as_DCDataset(data)[source]: Convert data to a DeepChem dataset.
