alien.data
Submodules
alien.data.dataset module
Module with dataset (sub-)classes for storing data.
- class alien.data.dataset.Dataset(*args, **kwargs)[source]
Bases:
object
Abstract interface to a readable dataset.
- abstract find(value, first=True)[source]
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
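The find contract described above can be illustrated on a plain Python sequence. This is only a sketch of the documented behaviour, not the library's implementation; the free function find and the sample data are assumptions made for illustration.

```python
def find(data, value, first=True):
    """Sketch of the documented find() contract on a plain sequence."""
    # Lazily enumerate the indices whose item equals the searched value.
    matches = (i for i, item in enumerate(data) if item == value)
    if first:
        # Index of the first match, or None when nothing matches.
        return next(matches, None)
    # All matching indices, in order.
    return list(matches)


samples = ["a", "b", "a", "c"]
first_hit = find(samples, "a")
all_hits = find(samples, "a", first=False)
missing = find(samples, "z")
```

With first=True a missing value yields None rather than raising, which is the behaviour the docstring describes.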
- static from_data(*args, **kwargs)[source]
Returns a Dataset built from the given data and other arguments. Arguments and functionality are exactly those of TeachableDataset.from_data; at present, this method simply calls TeachableDataset.from_data.
- property X
Return features.
- property y
Return targets.
- abstract property shape
Abstract method for returning shape.
- property ndim
Number of dimensions.
- Type:
int
- property batch_shape
- property feature_shape
- class alien.data.dataset.TeachableDataset(*args, **kwargs)[source]
Bases:
Dataset
Abstract interface to a teachable dataset.
- extend(X: _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes])[source]
Appends a batch of samples to the end of the dataset.
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)[source]
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
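The effect of the recursive parameter can be sketched as follows. SketchWrapper and wrap are hypothetical stand-ins written for this illustration, not part of alien, and the suitability check is an assumption about what "already expose the needed interface" means.

```python
from collections.abc import MutableSequence


class SketchWrapper:
    """Hypothetical stand-in for a wrapper dataset; not part of alien."""

    def __init__(self, data):
        self.data = data


def wrap(data, recursive=True):
    """Wrap data, or return it unchanged when it is already suitable."""
    # Assumption: mutable sequences and existing wrappers count as suitable.
    already_suitable = isinstance(data, (SketchWrapper, MutableSequence))
    if already_suitable and not recursive:
        return data  # returned as-is, no new object created
    return SketchWrapper(data)


rows = [1, 2, 3]
wrapped = wrap(rows)                 # recursive=True: always a new wrapper
as_is = wrap(rows, recursive=False)  # recursive=False: the same list object
```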
- get_shuffle(shuffle='random', random_seed=None)[source]
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property X
Return features.
- property batch_shape
- check_Xy()
- property feature_shape
- abstract find(value, first=True)
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- abstract property shape
Abstract method for returning shape.
- property y
Return targets.
- class alien.data.dataset.TeachableWrapperDataset(*args, **kwargs)[source]
Bases:
TeachableDataset
Wraps another dataset-like object. Functions as an abstract base class for wrapping specific data types. Also functions concretely as the default wrapper for MutableSequences, other TeachableDatasets, and anything else which exposes a suitable interface.
- find(value: Any, first: bool = True)[source]
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- property shape
Abstract method for returning shape.
- property X
Return features.
- property batch_shape
- check_Xy()
- property feature_shape
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- property y
Return targets.
- class alien.data.dataset.ShuffledDataset(*args, **kwargs)[source]
Bases:
TeachableWrapperDataset
Presents a shuffle of an existing dataset (or MutableSequence)
Added data goes at the end and isn’t shuffled (until reshuffle() is called).
- Parameters:
data – the existing dataset to wrap
shuffle – determines the initial shuffle state: ‘random’ or ‘identity’
random_seed – random seed to pass to the numpy shuffle algorithm. If None, get a source of randomness from the OS.
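One way to picture the behaviour described above is a permutation of indices layered over the wrapped data. The sketch below is hypothetical: ShuffledView is not the library class, and it uses Python's random module rather than Numpy's shuffle so that it stays self-contained.

```python
import random


class ShuffledView:
    """Hypothetical sketch of a shuffled view over a mutable sequence."""

    def __init__(self, data, shuffle="random", random_seed=None):
        self.data = data
        # Assumption: stdlib RNG stands in for Numpy's shuffle algorithm.
        self._rng = random.Random(random_seed)
        self.indices = list(range(len(data)))
        if shuffle == "random":
            self._rng.shuffle(self.indices)
        elif shuffle != "identity":
            raise ValueError("shuffle must be 'random' or 'identity'")

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Look up the underlying position through the permutation.
        return self.data[self.indices[i]]

    def append(self, x):
        # New samples go at the end and stay unshuffled for now.
        self.data.append(x)
        self.indices.append(len(self.data) - 1)

    def reshuffle(self):
        # Re-permute everything, including samples added since the last shuffle.
        self._rng.shuffle(self.indices)


view = ShuffledView(list(range(10)), shuffle="random", random_seed=0)
view.append(99)
last_item = view[len(view) - 1]  # the appended sample is still last
```

Calling view.reshuffle() afterwards mixes the appended sample in with the rest, while the underlying list keeps its original order.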
- find(value: Any, first: bool = True)[source]
Return index(es) of value in self.
- Parameters:
value (Any) – value to look for
first (bool, optional) – whether to return first instance of value or all of them. Defaults to True.
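- Returns:
The index of the first occurrence of value (or None if not found) if first is True; otherwise, an iterable of indices of all occurrences.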
- property X
Return features.
- property y
Return targets.
- append(x)
Appends a single sample to the end of the dataset.
- property batch_shape
- check_Xy()
- extend(X)
Appends a batch of samples to the end of the dataset.
- property feature_shape
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property shape
Abstract method for returning shape.
- class alien.data.dataset.ArrayDataset(*args, **kwargs)[source]
Bases:
TeachableWrapperDataset
Abstract base class for datasets based on numpy, pytorch, or other similarly-interfaced arrays.
- find(value, first=True)[source]
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- property X
Return features.
- property batch_shape
- check_Xy()
- extend(X)
Appends a batch of samples to the end of the dataset.
- property feature_shape
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property shape
Abstract method for returning shape.
- property y
Return targets.
- class alien.data.dataset.NumpyDataset(*args, **kwargs)[source]
Bases:
ArrayDataset
Dataset with Numpy array as data.
- property X
Return features.
- append(x)
Appends a single sample to the end of the dataset.
- property batch_shape
- check_Xy()
- property feature_shape
- find(value, first=True)
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property shape
Abstract method for returning shape.
- property y
Return targets.
- class alien.data.dataset.TorchDataset(*args, **kwargs)[source]
Bases:
ArrayDataset
Dataset with torch.tensor as data.
- property X
Return features.
- append(x)
Appends a single sample to the end of the dataset.
- property batch_shape
- check_Xy()
- property feature_shape
- find(value, first=True)
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property shape
Abstract method for returning shape.
- property y
Return targets.
- class alien.data.dataset.DictDataset(*args, **kwargs)[source]
Bases:
TeachableWrapperDataset
Contains a dictionary whose values are datasets.
For indexing purposes, the first self.bdim axes (i.e., the batch dimensions) index into the first axes of the constituent datasets, whereas the dictionary key "dimension" occurs right after the batch dimensions. Since there is usually exactly one batch dimension, this means you can index like
>>> dataset[:20, 'X']
which will return the first 20 rows of the 'X' constituent dataset, whereas
>>> dataset[:20]
will take the first 20 rows of each constituent dataset, and package them into a new DictDataset with the same keys.
- find(value, first=True)[source]
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
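The DictDataset indexing convention described in the class summary (batch axis first, dictionary key second) can be mimicked with a small stand-in. DictOfColumns is hypothetical and assumes a single batch dimension, list-valued columns, and slice indices; it is a sketch of the convention, not the library's code.

```python
class DictOfColumns:
    """Hypothetical sketch: a dict of equal-length lists, one batch dimension."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getitem__(self, index):
        if isinstance(index, tuple):
            # (rows, key): slice the batch axis of one constituent column.
            rows, key = index
            return self.columns[key][rows]
        # rows only: slice every column and keep the same keys.
        return DictOfColumns({k: v[index] for k, v in self.columns.items()})


ds = DictOfColumns({"X": list(range(100)), "y": [2 * i for i in range(100)]})
x_head = ds[:20, "X"]  # first 20 rows of the 'X' column
head = ds[:20]         # new DictOfColumns holding 20 rows of each column
```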
- property X
Return features.
- property y
Return targets.
- property shape
Abstract method for returning shape.
- property ndim
Number of dimensions.
- Type:
int
- property batch_shape
- check_Xy()
- property feature_shape
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- class alien.data.dataset.TupleDataset(*args, **kwargs)[source]
Bases:
TeachableWrapperDataset
Dataset with Tuple as self.data.
- find(value, first=True)[source]
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- property tuple
Getter for self.data.
- property shape
Abstract method for returning shape.
- property X
Return features.
- property batch_shape
- check_Xy()
- property feature_shape
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property y
Return targets.
alien.data.deepchem module
Deepchem Dataset
- class alien.data.deepchem.DeepChemDataset(*args, **kwargs)[source]
Bases:
DictDataset
DeepChem dataset
Some common featurizers:
- Keras GraphConvModels use the ConvMolFeaturizer, which may be abbreviated to 'convmol' in the featurizer argument here.
- Pytorch GCNModels use the MolGraphConvFeaturizer, which may be abbreviated to 'molgraph' here.
- static from_csv(file, X='X', y=None, featurizer=None, **kwargs)[source]
Loads a DeepChem dataset from a csv file.
- Parameters:
X (str) – Column name for the X data
y (str) – Column name for the y data
featurizer – Specifies the DeepChem featurizer to use, if any. featurizer may be a DeepChem featurizer class, a featurizer instance, or a string contained in the class name of a featurizer. (E.g., 'convmol' matches the DeepChem ConvMolFeaturizer.)
**kwargs – These are passed to the featurizer constructor.
- Returns:
An alien.data.DeepChemDataset
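The three accepted forms of the featurizer argument (class, instance, or a string contained in the class name) could be resolved along the following lines. This is a sketch only: the two classes are empty stand-ins named after the DeepChem featurizers, resolve_featurizer and KNOWN_FEATURIZERS are hypothetical, and case-insensitive matching is an assumption inferred from the 'convmol' example.

```python
class ConvMolFeaturizer:
    """Stand-in for the DeepChem class of the same name (illustration only)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class MolGraphConvFeaturizer:
    """Stand-in for the DeepChem class of the same name (illustration only)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


KNOWN_FEATURIZERS = [ConvMolFeaturizer, MolGraphConvFeaturizer]


def resolve_featurizer(spec, **kwargs):
    """Turn a class, instance, or name fragment into a featurizer instance."""
    if spec is None:
        return None
    if isinstance(spec, str):
        # Assumption: a case-insensitive substring match on the class name.
        matches = [c for c in KNOWN_FEATURIZERS
                   if spec.lower() in c.__name__.lower()]
        if not matches:
            raise ValueError(f"no featurizer matches {spec!r}")
        return matches[0](**kwargs)
    if isinstance(spec, type):
        # A class: construct it, forwarding the extra keyword arguments.
        return spec(**kwargs)
    # Already an instance: use it unchanged.
    return spec


by_name = resolve_featurizer("convmol")
by_class = resolve_featurizer(MolGraphConvFeaturizer, use_edges=True)
```

Here use_edges is just an arbitrary keyword used to show how extra arguments would reach the constructor.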
- static from_df(df, X='X', y=None, ids='ids', weights=None, featurizer=None, **kwargs)[source]
Returns a DeepChemDataset built from a Pandas DataFrame.
- Parameters:
df – The dataframe to convert
X – The name of the feature column. Defaults to “X”.
y – The name of the y/label column, or a list of names for multi-prediction. By default, no y values are extracted.
ids – The name of the ids column. By default, looks for a column named ‘ids’, and if none is found, uses the dataframe index.
weights – The name of the weights column. If none is given, uses 1.0 for all weights.
featurizer – Specifies the DeepChem featurizer to use, if any. featurizer may be a DeepChem featurizer class, a featurizer instance, or a string contained in the class name of a featurizer. (E.g., 'convmol' matches the DeepChem ConvMolFeaturizer.)
**kwargs – Any additional keyword args will become columns in the dataset; for example, keyword arg t='timestamp' creates a column with key t and values taken from df['timestamp'].
- property X
Return features.
- append(x)
Appends a single sample to the end of the dataset.
- property batch_shape
- check_Xy()
- extend(X)
Appends a batch of samples to the end of the dataset.
- property feature_shape
- find(value, first=True)
Finds instances of value in this dataset. If first is True, returns the index of the first occurrence (or None if not found); otherwise, returns an iterable of indices of all occurrences.
- static from_data(data=None, shuffle: bool | str | None = False, random_seed: int | _SupportsArray[dtype] | _NestedSequence[_SupportsArray[dtype]] | bool | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | SeedSequence | BitGenerator | Generator | None = None, recursive: bool = True, convert_sequences: bool = True, **kwargs)
Creates a TeachableDataset with given data.
- Parameters:
data – the initial data of the dataset. Can be:
* another TeachableDataset
* a Python mutable sequence (e.g., a list) or anything that implements the interface
* a Numpy array
* a Pytorch tensor
* a dictionary or tuple whose values are one of the above types
* a Pandas DataFrame
shuffle – if this evaluates to True, data will be wrapped in a shuffle, exposing the ShuffledDataset interface. Can be:
* anything evaluating to False
* 'identity' (initial shuffle is the identity)
* 'random' (initial shuffle is random)
random_seed – a random seed to pass to Numpy's shuffle algorithm. If None (the default), Numpy gets entropy from the OS.
recursive – if True, data like MutableSequences or TeachableDatasets that already expose the needed interface will still be wrapped; if False, such data will be returned as-is, with no new object created.
- static from_deepchem(data)
- get_shuffle(shuffle='random', random_seed=None)
Return a shuffled version of self
- Parameters:
shuffle (str, optional) – The initial shuffle: 'identity' or 'random'. Defaults to 'random'.
random_seed (int, optional) – A random seed for the shuffle. Defaults to None.
- Returns:
A shuffled version of self
- Return type:
- property ndim
Number of dimensions.
- Type:
int
- reshape(*shape, index=None, bdim=None)
- reshape_batch(*shape, index=None)
- reshape_features(*shape, index=None)
- property shape
Abstract method for returning shape.
- property y
Return targets.