# Sequential

## DesignSpace
The `DesignSpace` class is intended to store the entire design space. As the
sequential learning loop is iterated, it can be continuously updated with
newly found labels.

There are two key components required for this object:

- `design_space_structures`: all systems to be considered, as `ase.Atoms`
  objects in a `list`
- `design_space_labels`: a `numpy` array of the same length as the above list
  with the corresponding labels. If a label is not yet known, set it to
  `numpy.nan`

**N.B.** The list of design space structures must be in the same order as the
labels given in the design space labels array.
```python
>>> import numpy as np
>>> from autocat.surface import generate_surface_structures
>>> from autocat.utils import flatten_structures_dict
>>> from autocat.learning.sequential import DesignSpace
>>> surf_dict = generate_surface_structures(["Pt", "Pd", "Cu", "Ni"])
>>> surf_structs = flatten_structures_dict(surf_dict)
>>> labels = np.array([0.95395024, 0.63504885, np.nan, 0.08320879, np.nan,
...                    0.32423194, 0.55570785, np.nan, np.nan, np.nan,
...                    0.18884186, np.nan])
>>> acds = DesignSpace(surf_structs, labels)
>>> acds
+-------------------------+--------------------------+
|                         |       DesignSpace        |
+-------------------------+--------------------------+
| total # of systems      |            12            |
| # of unlabelled systems |            6             |
| unique species present  | ['Pt', 'Pd', 'Cu', 'Ni'] |
| maximum label           |        0.95395024        |
| minimum label           |        0.08320879        |
+-------------------------+--------------------------+
>>> len(acds)
12
>>> acds.design_space_structures
[Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...),
 Atoms(...)]
>>> acds.design_space_labels
array([0.95395024, 0.63504885,        nan, 0.08320879,        nan,
       0.32423194, 0.55570785,        nan,        nan,        nan,
       0.18884186,        nan])
```
## CandidateSelector

The `CandidateSelector` object stores information about the methodology for
candidate selection, and can apply it to choose candidates from a design
space. Key properties specified within this object include:

- Acquisition function to be used for calculating scores. Currently supported
  functions:
    - maximum likelihood of improvement (MLI)
    - maximum uncertainty (MU)
    - random
- Number of candidates that should be proposed on each iteration
- Target window that the candidate should ideally fall within (only
  applicable to MLI)
- Whether to weight each system's score by its HHI and/or segregation
  energies
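To build intuition for the MLI acquisition function, it can be sketched in a
few lines of plain Python: under a Gaussian posterior with mean equal to the
prediction and standard deviation equal to the uncertainty, the score is the
probability mass falling inside the target window. This is an illustrative
assumption about the scoring, not AutoCat's exact implementation, and the
`mli_scores` helper below is hypothetical.

```python
import math
import numpy as np

def norm_cdf(x):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mli_scores(predictions, uncertainties, target_window):
    """Illustrative MLI acquisition: probability that each system's true
    label falls inside the target window, assuming a Gaussian posterior
    with mean = prediction and std = uncertainty."""
    low, high = target_window
    return np.array([
        norm_cdf((high - p) / u) - norm_cdf((low - p) / u)
        for p, u in zip(predictions, uncertainties)
    ])

preds = np.array([0.27, 0.50, 0.29])
uncs = np.array([0.05, 0.05, 0.20])
scores = mli_scores(preds, uncs, (0.25, 0.30))
```

Under this sketch, a prediction near the window's center with low uncertainty
scores highest, while predictions far from the window or with very broad
uncertainty score lower.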
For example, let's define a `CandidateSelector` that chooses 3 systems based
on MLI with a target window between 0.25 and 0.3, and weights the scores by
HHI values.
```python
>>> from autocat.learning.sequential import CandidateSelector
>>> candidate_selector = CandidateSelector(
...     acquisition_function="MLI",
...     num_candidates_to_pick=3,
...     target_window=(0.25, 0.3),
...     include_hhi=True,
... )
>>> candidate_selector
+----------------------------------+--------------------+
|                                  | Candidate Selector |
+----------------------------------+--------------------+
| acquisition function             |        MLI         |
| # of candidates to pick          |         3          |
| target window                    |    [0.25 0.3 ]     |
| include hhi?                     |        True        |
| hhi type                         |     production     |
| include segregation energies?    |       False        |
| segregation energies data source |     raban1999      |
+----------------------------------+--------------------+
```
The `choose_candidate` method applies these options to calculate the scores
and propose the desired number of candidate systems to evaluate. A
`DesignSpace` must be supplied, along with predictions and/or uncertainties
as required by the chosen acquisition function.

Using the `DesignSpace` above, and making up some prediction and uncertainty
values (in practice these should come from your own trained `Predictor`!), we
can see how this works.
```python
>>> predictions = np.array([0.95395024, 0.63504885, 0.46160089, 0.08320879, 0.81524182,
...                         0.32423194, 0.55570785, 0.75537232, 0.21824507, 0.89147292,
...                         0.18884186, 0.47473003])
>>> uncertainties = np.array([0.01035017, 0.01171273, 0.00688497, 0.00514248, 0.01254998,
...                           0.01047033, 0.01268476, 0.01017691, 0.01436907, 0.00878836,
...                           0.00786345, 0.01341667])
>>> parent_idx, max_scores, aq_scores = candidate_selector.choose_candidate(
...     design_space=acds,
...     predictions=predictions,
...     uncertainties=uncertainties
... )
```
`parent_idx` contains the indices of the proposed candidate systems in the
given `DesignSpace`, `max_scores` contains the scores attributed to these
identified candidates, and `aq_scores` contains the scores for all systems.

**N.B.** If there are `np.nan` labels within the `DesignSpace`, by default
the candidates will be chosen exclusively from these unlabelled systems.
Otherwise, in the case of a fully labelled `DesignSpace`, the default is to
consider all systems. These defaults may be overridden via the `allowed_idx`
parameter.
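The default candidate pool described above can be sketched with a NumPy
boolean mask. This is an assumption about the behavior (not AutoCat's actual
internals), shown here with made-up labels.

```python
import numpy as np

labels = np.array([0.95, np.nan, 0.63, np.nan])

# Default pool: only unlabelled systems if any labels are np.nan,
# otherwise every system in the design space.
if np.isnan(labels).any():
    allowed_mask = np.isnan(labels)                 # boolean mask over the design space
else:
    allowed_mask = np.ones(len(labels), dtype=bool)

candidate_pool = np.where(allowed_mask)[0]          # indices eligible for selection
```

Passing your own mask or index set as `allowed_idx` replaces this default, so
you can, e.g., restrict selection to a chemically motivated subset even when
unlabelled systems remain.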
## SequentialLearner

The `SequentialLearner` object stores information regarding the latest
iteration of the sequential learning loop, including:

- A `Predictor` (and its kwargs for both the regressor and featurizer)
- A `CandidateSelector` for choosing the candidate systems
- Iteration number
- Latest `DesignSpace`
- Candidate system(s) identified for the next loop
- Histories for predictions, uncertainties, and training indices

This object can be thought of as a central hub for the sequential learning
workflow, with an external driver (either automated or manual) triggering
iteration. The first `iterate` call trains the model and identifies
candidate(s) to start the loop.
```python
>>> import numpy as np
>>> from ase import Atoms
>>> from dscribe.descriptors import SOAP
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import RBF
>>> from autocat.surface import generate_surface_structures
>>> from autocat.utils import flatten_structures_dict
>>> from autocat.adsorption import place_adsorbate
>>> from autocat.learning.featurizers import Featurizer
>>> from autocat.learning.predictors import Predictor
>>> from autocat.learning.sequential import CandidateSelector
>>> from autocat.learning.sequential import DesignSpace
>>> from autocat.learning.sequential import SequentialLearner
>>> # make the DesignSpace
>>> subs_dict = generate_surface_structures(["Pt", "Pd", "Cu", "Ni"])
>>> subs = flatten_structures_dict(subs_dict)
>>> ads_structs = [place_adsorbate(s, Atoms("Li")) for s in subs]
>>> labels = np.array([0.95395024, 0.63504885, np.nan, 0.08320879, np.nan,
...                    0.32423194, 0.55570785, np.nan, np.nan, np.nan,
...                    0.18884186, np.nan])
>>> acds = DesignSpace(ads_structs, labels)
>>> # specify the featurization details
>>> featurizer = Featurizer(
...     featurizer_class=SOAP,
...     design_space_structures=acds.design_space_structures,
...     kwargs={"rcut": 5.0, "lmax": 6, "nmax": 6}
... )
>>> # define the predictor
>>> kernel = RBF()
>>> regressor = GaussianProcessRegressor(kernel=kernel)
>>> predictor = Predictor(
...     regressor=regressor,
...     featurizer=featurizer
... )
>>> # choose how candidates will be selected on each loop
>>> candidate_selector = CandidateSelector(
...     acquisition_function="MLI",
...     target_window=(0.1, 0.2),
...     include_hhi=True,
...     hhi_type="reserves",
...     include_segregation_energies=False
... )
>>> # set up the sequential learner
>>> acsl = SequentialLearner(
...     design_space=acds,
...     predictor=predictor,
...     candidate_selector=candidate_selector,
... )
>>> acsl.iteration_count
0
>>> acsl.iterate()
>>> acsl.iteration_count
1
>>> acsl
+----------------------------------+--------------------+
|                                  | Sequential Learner |
+----------------------------------+--------------------+
| iteration count                  |         1          |
| next candidate system structures |     ['Cu36Li']     |
| next candidate system indices    |        [7]         |
+----------------------------------+--------------------+
+----------------------------------+--------------------+
|                                  | Candidate Selector |
+----------------------------------+--------------------+
| acquisition function             |        MLI         |
| # of candidates to pick          |         1          |
| target window                    |     [0.1 0.2]      |
| include hhi?                     |        True        |
| hhi type                         |      reserves      |
| include segregation energies?    |       False        |
| segregation energies data source |     raban1999      |
+----------------------------------+--------------------+
+-------------------------+--------------------------------+
|                         |          DesignSpace           |
+-------------------------+--------------------------------+
| total # of systems      |               12               |
| # of unlabelled systems |               6                |
| unique species present  | ['Li', 'Pt', 'Pd', 'Cu', 'Ni'] |
| maximum label           |           0.95395024           |
| minimum label           |           0.08320879           |
+-------------------------+--------------------------------+
+-----------+------------------------------------------------------------------+
|           |                            Predictor                             |
+-----------+------------------------------------------------------------------+
| regressor | <class 'sklearn.gaussian_process._gpr.GaussianProcessRegressor'> |
| is fit?   |                               True                               |
+-----------+------------------------------------------------------------------+
+-----------------------------------+-------------------------------------+
|                                   |             Featurizer              |
+-----------------------------------+-------------------------------------+
| class                             |    dscribe.descriptors.soap.SOAP    |
| kwargs                            | {'rcut': 5.0, 'lmax': 6, 'nmax': 6} |
| species list                      |    ['Li', 'Ni', 'Pt', 'Pd', 'Cu']   |
| maximum structure size            |                 37                  |
| preset                            |                None                 |
| design space structures provided? |                True                 |
+-----------------------------------+-------------------------------------+
```
## Simulated Sequential Learning

If you already have a fully explored design space and want to simulate
exploration over it, the `simulated_sequential_learning` function may be
used.

Internally, this function acts as a driver on a `SequentialLearner` object,
and can be viewed as an example of how a driver can be set up for an
exploratory simulated sequential learning loop. As inputs it requires all
parameters needed to instantiate a `SequentialLearner`, and it returns the
object after it has been iterated. For further analysis of the search,
histories of the predictions, uncertainties, and training indices for each
iteration are kept.
```python
>>> import numpy as np
>>> from dscribe.descriptors import SineMatrix
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import RBF
>>> from autocat.surface import generate_surface_structures
>>> from autocat.utils import flatten_structures_dict
>>> from autocat.learning.featurizers import Featurizer
>>> from autocat.learning.predictors import Predictor
>>> from autocat.learning.sequential import CandidateSelector
>>> from autocat.learning.sequential import DesignSpace
>>> from autocat.learning.sequential import simulated_sequential_learning
>>> surf_dict = generate_surface_structures(["Pt", "Pd", "Cu", "Ni"])
>>> surf_structs = flatten_structures_dict(surf_dict)
>>> labels = np.array([0.95395024, 0.63504885, 0.4567, 0.08320879, 0.87779,
...                    0.32423194, 0.55570785, 0.325, 0.43616, 0.321632,
...                    0.18884186, 0.1114])
>>> acds = DesignSpace(surf_structs, labels)
>>> # specify the featurization details
>>> featurizer = Featurizer(
...     featurizer_class=SineMatrix,
...     design_space_structures=acds.design_space_structures,
... )
>>> # define the predictor
>>> kernel = RBF()
>>> regressor = GaussianProcessRegressor(kernel=kernel)
>>> predictor = Predictor(
...     regressor=regressor,
...     featurizer=featurizer
... )
>>> # choose how candidates will be selected on each loop
>>> candidate_selector = CandidateSelector(
...     acquisition_function="MLI",
...     target_window=(0.1, 0.2),
...     include_hhi=True,
...     hhi_type="reserves",
...     include_segregation_energies=False
... )
>>> # conduct the simulated sequential learning loop
>>> sim_sl = simulated_sequential_learning(
...     full_design_space=acds,
...     predictor=predictor,
...     candidate_selector=candidate_selector,
...     init_training_size=5,
...     number_of_sl_loops=3,
... )
Sequential Learning Iteration #1
Sequential Learning Iteration #2
Sequential Learning Iteration #3
```
Additionally, simulated searches are typically most useful when repeated to
obtain statistics that are less dependent on the initialization of the design
space. For this purpose there is the
`multiple_simulated_sequential_learning_runs` function. This returns a list
of `SequentialLearner` objects, one for each individual run. Optionally, this
function can also distribute the runs across parallel processes via the
`number_of_parallel_jobs` parameter.
```python
>>> import numpy as np
>>> from matminer.featurizers.composition import ElementProperty
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import RBF
>>> from autocat.surface import generate_surface_structures
>>> from autocat.utils import flatten_structures_dict
>>> from autocat.learning.featurizers import Featurizer
>>> from autocat.learning.predictors import Predictor
>>> from autocat.learning.sequential import CandidateSelector
>>> from autocat.learning.sequential import DesignSpace
>>> from autocat.learning.sequential import multiple_simulated_sequential_learning_runs
>>> surf_dict = generate_surface_structures(["Pt", "Pd", "Cu", "Ni"])
>>> surf_structs = flatten_structures_dict(surf_dict)
>>> labels = np.array([0.95395024, 0.63504885, 0.4567, 0.08320879, 0.87779,
...                    0.32423194, 0.55570785, 0.325, 0.43616, 0.321632,
...                    0.18884186, 0.1114])
>>> acds = DesignSpace(surf_structs, labels)
>>> # specify the featurization details
>>> featurizer = Featurizer(
...     featurizer_class=ElementProperty,
...     preset="matminer",
...     design_space_structures=acds.design_space_structures,
... )
>>> # define the predictor
>>> kernel = RBF()
>>> regressor = GaussianProcessRegressor(kernel=kernel)
>>> predictor = Predictor(
...     regressor=regressor,
...     featurizer=featurizer
... )
>>> # choose how candidates will be selected on each loop
>>> candidate_selector = CandidateSelector(
...     acquisition_function="MLI",
...     target_window=(0.1, 0.2),
...     include_hhi=True,
...     hhi_type="reserves",
...     include_segregation_energies=False
... )
>>> # conduct the multiple simulated sequential learning runs
>>> multi_sim_sl = multiple_simulated_sequential_learning_runs(
...     full_design_space=acds,
...     predictor=predictor,
...     candidate_selector=candidate_selector,
...     init_training_size=5,
...     number_of_sl_loops=2,
...     number_of_runs=3,
... )
Sequential Learning Iteration #1
Sequential Learning Iteration #2
Sequential Learning Iteration #1
Sequential Learning Iteration #2
Sequential Learning Iteration #1
Sequential Learning Iteration #2
>>> len(multi_sim_sl)
3
```
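Once the runs are collected, their per-iteration histories can be aggregated
into run-averaged statistics. The sketch below uses made-up "best label found
so far" trajectories (hypothetical data, not output of the example above) to
show the kind of aggregation that reduces dependence on any single
initialization.

```python
import numpy as np

# Hypothetical best-so-far trajectories for three independent simulated
# runs: one row per run, one column per sequential learning iteration.
run_histories = np.array([
    [0.95, 0.95, 0.96],
    [0.80, 0.91, 0.95],
    [0.88, 0.94, 0.94],
])

# Mean and spread of the best-so-far value at each iteration across runs.
mean_trajectory = run_histories.mean(axis=0)
std_trajectory = run_histories.std(axis=0)
```

Plotting `mean_trajectory` with a `std_trajectory` band is a common way to
compare acquisition strategies over repeated simulated searches.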