Tutorial¶
Manipulate Predicted Routes¶
If you just want to start manipulating your predicted routes right away, this page will
help you get started.
Our process_routes() function wraps the main functionalities of our package, providing
a simple interface for some useful applications that works both with files (I/O stream) and with in-memory objects.
So, you used IBMRXN or Askcos (or both!) to predict some routes for a molecule and you have the results in json files. To start manipulating the routes, let’s read the json files and transform the predictions into a list of graph objects.
First of all, let’s import the process_routes() function
from linchemin.interfaces.workflows import process_routes
This function is really all you need to start working with your predictions. It is called by
specifying the path of the json files and the name of the CASP tool(s) that generated them as a dictionary,
and the optional arguments output_format and out_data_model.
The former sets the format of the output file in which the routes will be written (default is json,
but csv is also available) and the latter the data model
of the output routes (default is bipartite, with both molecule and reaction nodes, but monopartite reactions and
monopartite molecules are also available).
In the example below we transform the routes generated by two different CASP tools into monopartite,
reaction-only SynGraph objects, i.e., into instances of the
MonopartiteReacSynGraph class, and we write them into a json file.
output = process_routes({'ibmrxn_file.json': 'ibmrxn', # path of input file generated by IBMRXN
'askcos_file.json': 'askcosv1'}, # path of input file generated by AskcosV1
out_data_model='monopartite_reactions')# selected output data model
The function creates a json file called ‘routes.json’ containing the predicted routes
as dictionaries of nodes and edges, as well as the WorkflowOutput
object, whose attribute routes_list
stores the routes as a list of MonopartiteReacSynGraph instances.
The SynGraph format, and its subclasses are the
working formats for many of the functionalities implemented in LinChemIn and they also ensure
that the chemical information in your graphs is correct.
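To make the three data models more concrete, here is a minimal sketch that uses plain Python containers. It is not LinChemIn's internal representation: the toy route, the reaction labels R1 and R2 and the helper names are all hypothetical, and the sketch only illustrates which nodes and edges each data model keeps.

```python
# Toy route with two reactions, written as label -> (reactants, product).
# Labels and molecules are hypothetical placeholders, not LinChemIn objects.
reactions = {
    "R1": (["A", "B"], "C"),
    "R2": (["C", "D"], "E"),
}


def bipartite_edges(reactions):
    """Molecule -> reaction and reaction -> molecule edges (both node types)."""
    edges = []
    for rxn, (reactants, product) in reactions.items():
        edges.extend((mol, rxn) for mol in reactants)
        edges.append((rxn, product))
    return edges


def monopartite_reaction_edges(reactions):
    """Reaction -> reaction edges: the product of one feeds the other."""
    return [
        (r1, r2)
        for r1, (_, product) in reactions.items()
        for r2, (reactants, _) in reactions.items()
        if product in reactants
    ]


def monopartite_molecule_edges(reactions):
    """Reactant -> product edges: reaction nodes are dropped."""
    return [
        (mol, product)
        for reactants, product in reactions.values()
        for mol in reactants
    ]


print(bipartite_edges(reactions))
print(monopartite_reaction_edges(reactions))
print(monopartite_molecule_edges(reactions))
```

The monopartite reaction view is the most compact of the three here: the whole toy route collapses to a single edge between R1 and R2.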
In case you are interested in selecting a different data model
or you just want to find out which are the available options,
you can import the get_workflow_options() function and type:
from linchemin.interfaces.workflows import get_workflow_options
get_workflow_options(verbose=True)
This will print to your screen all the available options and the
default values of the process_routes function. The information will also be returned
as dictionary. In case you do not want to have the information printed on the screen, just set verbose
to False (or simply do not specify it).
The basic functionality of the process_routes function, i.e., reading and writing the routes,
wraps the translate functionality exposed by the facade() function,
which, in turn, wraps the translator() function of the translate module.
The latter hosts all the functions and classes to handle the translation of graph objects between formats.
In case you want a better understanding of the
functions behind the facade or just need more freedom for converting your routes, you can check them
and the helper function facade_helper() out.
Reactions atom mapping¶
In the routes predicted by CASP tools, it is very likely that the roles of the molecules in the involved reactions are incorrect, with reagents appearing as reactants. To improve the chemical information in your routes, you might want to perform atom mapping on the involved reactions, and LinChemIn has a full machinery to do that!
Be aware, however, that the atom-to-atom mapping tool is part of our separate library LinChemIn_Services
and that you will need to install it in order to access the atom mapping functionality.
But don’t worry, LinChemIn_Services is freely available and it can be found
at https://github.com/syngenta/linchemin_services together with the documentation for its installation and
usage. You can also check out the Atom-to-Atom Mapping page for some additional information
(including some details about the module responsible for the mapping functionality and the usage
of the mapping information, if you are interested).
Once you have the atom mapping machinery up and running, you can easily access it through the usual
process_routes() function: you just need to set to True the mapping
argument. The rxnmapper tool will be called to map all the routes in your set.
In the code snippet below, we read the routes predicted by IBMRXN, we map all the involved reactions and we write the mapped routes into a json file. The atom-to-atom mapping is performed with ‘rxnmapper’.
output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
                        output_format='json', # format of the output file 'routes.json'
                        mapping=True) # the mapping functionality is activated
In this case, the routes written in the output file will have the atom-to-atom mapping information
and they can also be accessed from the routes_list attribute of the output object.
This option wraps the atom_mapping functionality exposed
by facade(), which in turn is based on the
perform_atom_mapping(). You can check them out if
more details are needed.
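To see why atom mapping helps with the reactant/reagent roles mentioned at the start of this section, consider the simplified heuristic sketched below: a molecule on the left-hand side whose atoms carry no map number found in the product contributes no atoms to it, so it can be treated as a reagent. This is only an illustration under stated assumptions (a `reactants>>products` reaction SMILES with a single `>>` separator, and a hypothetical helper name `split_roles`); it is not the logic implemented in LinChemIn or LinChemIn_Services.

```python
import re

# Matches the atom-map number in bracket atoms such as "[CH3:1]".
MAP_NUMBER = re.compile(r":(\d+)\]")


def split_roles(mapped_rxn):
    """Split left-hand molecules into reactants and reagents.

    A molecule counts as a reactant if at least one of its mapped atoms
    also appears (by map number) in the products; otherwise it is a reagent.
    """
    left, _, right = mapped_rxn.partition(">>")
    product_maps = set(MAP_NUMBER.findall(right))
    reactants, reagents = [], []
    for smiles in left.split("."):
        maps = set(MAP_NUMBER.findall(smiles))
        if maps & product_maps:
            reactants.append(smiles)
        else:
            reagents.append(smiles)
    return reactants, reagents


# Toy esterification: the unmapped acid catalyst was listed among the reactants.
rxn = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6].OS(=O)(=O)O"
    ">>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
)
reactants, reagents = split_roles(rxn)
print(reactants)
print(reagents)
```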
Compute routes descriptors¶
The process_routes function gives access to all the main functionalities of LinChemIn. The only
thing you need to do is to specify which ones you want to perform through the functionalities
argument.
Let’s say you are interested in computing the number of steps
and the number of branches for each of the predicted routes. This can be done
by passing to the functionalities argument the string compute_descriptors, together with the
input file, as we did before. To determine which descriptors should be computed, we can pass their names
to the descriptors argument; if the latter is not specified, all the implemented descriptors
will be computed.
output = process_routes({'az_file.json': 'az', # path of input file generated by AZ
'askcosv2_file.json': 'askcosv2'}, # path of input file generated by AskcosV2
output_format='csv', # format of the output file 'routes.csv'
functionalities=['compute_descriptors'],# the functionalities to be activated
descriptors=['nr_steps', # descriptors to be computed: nr steps
'nr_branches']) # descriptors to be computed: nr branches
The function once again writes the routes to a file (here we select to have them in a csv
file, instead of the default json file) and stores the corresponding SynGraph objects
in the routes_list attribute of the
output object. Moreover, a csv file called ‘descriptors.csv’ is also created which contains
a dataframe with the route ids and the computed descriptors. The dataframe is also stored in the descriptors
attribute of the output object.
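Once the descriptors are available, a typical next step is to filter or rank the routes by them. The sketch below does this with the standard library only. The sample text stands in for the contents of ‘descriptors.csv’, and the column names (`route_id`, `nr_steps`, `nr_branches`) as well as the route ids are assumptions made for illustration; check the header of your own file before reusing it.

```python
import csv
import io

# Stand-in for the contents of 'descriptors.csv'; the column names and the
# values are made up for this example.
sample = (
    "route_id,nr_steps,nr_branches\n"
    "route_a,5,1\n"
    "route_b,3,0\n"
    "route_c,4,2\n"
)


def shortest_routes(handle, max_steps):
    """Return the rows with at most `max_steps` steps, shortest first."""
    reader = csv.DictReader(handle)
    rows = [row for row in reader if int(row["nr_steps"]) <= max_steps]
    return sorted(rows, key=lambda row: int(row["nr_steps"]))


selected = shortest_routes(io.StringIO(sample), max_steps=4)
ids = [row["route_id"] for row in selected]
print(ids)
```

To run it on a real file, replace `io.StringIO(sample)` with an open file handle, for example `open('descriptors.csv', newline='')`.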
If you need a refresher on which descriptors are available,
just resort again to our get_workflow_options() function.
The compute_descriptors option wraps the routes_descriptors functionality exposed
by facade(), which in turn is based on the
descriptor_calculator() function. You can check them out if
more details are needed.
Route similarity¶
You might be interested in assessing how similar to one another the predicted routes are.
The distance_matrix functionality of the process_routes function comes in handy and its usage is once
again very simple: it is sufficient to pass the distance_matrix string to the functionalities
argument.
output = process_routes({'az_file.json': 'az'}, # path of input file generated by AZ
functionalities=['distance_matrix']) # the functionalities to be activated
Here, the distance matrix is returned as a pandas dataframe stored in the distance_matrix
attribute of the output object and it is also written to the ‘distance_matrix.csv’ file.
The distance matrix is computed with the Graph Edit Distance (GED) algorithm implemented in NetworkX;
for computational efficiency reasons, we recommend working with
MonopartiteReacSynGraph objects.
By typing the above code, you are using the default parameters that we pre-defined for the calculations of the
distance matrix; in particular, we selected a GED algorithm and a set of parameters determining methods
and algorithms for computing reaction
and molecular fingerprints and similarity. To
find out which are the default values and the optional parameters that you can select,
let’s call our get_workflow_options() function.
In case you are not happy with our defaults, you can freely choose different values for
some or all parameters by specifying them in the ged_params argument, as shown in the code below.
from rdkit.Chem import rdChemReactions  # provides the FingerprintType enum used below
output = process_routes({'az_file.json': 'az'}, # path of input file generated by AZ
                        functionalities=['distance_matrix'], # the functionalities to be activated
                        ged_method='nx_ged', # the algorithm to be used for the GED calculations
                        ged_params={ # a dictionary specifying the parameters for molecular/reaction similarity
                            'reaction_fp': 'structure_fp', # reaction fingerprints type
                            'reaction_fp_params': {'fpSize': 1024, # reaction fingerprints size
                                                   'fpType': rdChemReactions.FingerprintType.MorganFP}, # molecular fingerprints
                            'reaction_similarity_name': 'dice'}) # similarity algorithm
This functionality wraps the distance_matrix function of facade(),
which in turn is based on the compute_distance_matrix() function:
you can check them out for more details.
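For readers unfamiliar with the ‘dice’ similarity option, the sketch below shows how a Dice score between two fingerprints can be turned into a symmetric distance matrix. It is a deliberate simplification: each route is reduced to one hypothetical set of "on" fingerprint bits, whereas LinChemIn uses such similarities inside a graph edit distance over whole routes. The function names and the toy fingerprints are assumptions made for this example.

```python
from itertools import combinations


def dice_similarity(bits_a, bits_b):
    """Dice coefficient between two sets of 'on' fingerprint bits."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0.0  # convention chosen for this sketch when both are empty
    return 2 * len(bits_a & bits_b) / total


def dice_distance_matrix(fingerprints):
    """Symmetric matrix of (1 - Dice similarity), with a zero diagonal."""
    ids = list(fingerprints)
    matrix = {i: {j: 0.0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):
        distance = 1.0 - dice_similarity(fingerprints[i], fingerprints[j])
        matrix[i][j] = matrix[j][i] = distance
    return matrix


# Hypothetical fingerprints: one set of on-bits per route.
fingerprints = {
    "route_1": {1, 2, 3, 4},
    "route_2": {3, 4, 5, 6},
    "route_3": {7, 8},
}
matrix = dice_distance_matrix(fingerprints)
for route_id, row in matrix.items():
    print(route_id, row)
```

Only the upper triangle is computed and then mirrored, which is the usual way to halve the number of pairwise comparisons for a symmetric measure.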
Clustering¶
One of the best uses of the distance matrix is to cluster our routes based on it. With the
functionality clustering we do not need to even bother computing the matrix: everything is automatically done
under the hood! Let’s just pass the usual path to the input file(s) and let’s specify the desired
functionalities:
output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
functionalities=['clustering']) # the functionalities to be activated
The above code is really all you need: the clustering attribute of the output object
stores the outcome of the clustering algorithm, while the clustered_descriptors attribute
holds a dataframe with the route ids, the relative clustering labels and a couple of descriptors. The
latter dataframe is also written into the newly created ‘cluster_metrics.csv’ file.
By default, the Agglomerative Clustering algorithm is used if there are fewer than 15 routes and
Hdbscan otherwise. However, you can also specify the clustering_method argument and directly select
which algorithm to use.
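The default selection rule described above can be written down in a few lines. This is a sketch of the rule as stated in the text, not the library's code: the function name is hypothetical, and while 'agglomerative_cluster' is the method name used on this page, the string 'hdbscan' is an assumption (get_workflow_options() reports the actual names).

```python
def default_clustering_method(n_routes, threshold=15):
    """Pick the clustering algorithm from the number of routes.

    Fewer than `threshold` routes -> agglomerative clustering,
    otherwise -> Hdbscan ('hdbscan' is an assumed identifier).
    """
    if n_routes < threshold:
        return "agglomerative_cluster"
    return "hdbscan"


for n in (5, 14, 15, 200):
    print(n, default_clustering_method(n))
```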
Also in this case, there are many parameters for which we chose a default value,
and again, you can find what they are and how you can change them by invoking the
get_workflow_options() function.
In the code below, you can see how to set different values for the parameters: the ged_params
argument is the same that we used in the previous sections to specify the GED parameters (we need to compute
the distance matrix, after all!); in addition, you can specify the linkage argument in case you
are using the Agglomerative Clustering algorithm or the min_cluster_size if you are using Hdbscan.
output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
functionalities=['clustering'], # the functionalities to be activated
ged_method='nx_ged', # the algorithm to be used for the GED calculations
clustering_method='agglomerative_cluster', # the algorithm to be used for clustering
ged_params={ # a dictionary specifying the parameters for molecular/reaction similarity
'reaction_fp': 'difference_fp', # reaction fingerprints type
'reaction_fp_params': {'fpSize': 1024}, # reaction fingerprints size
'reaction_similarity_name': 'dice'}, # similarity algorithm
linkage='average') # option parameter for the clustering algorithm
The clustering functionality wraps the clusterer() function
of the factory in the clustering modules.
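As a reminder of what the linkage option controls in agglomerative clustering, the sketch below computes the 'average' linkage between two clusters: the mean of all pairwise distances between their members. The distance values, the route ids and the helper name are invented for illustration; in practice the clustering library computes this internally from the distance matrix.

```python
def average_linkage(matrix, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(matrix[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical symmetric distance matrix for three routes.
matrix = {
    "r1": {"r1": 0.0, "r2": 0.25, "r3": 1.0},
    "r2": {"r1": 0.25, "r2": 0.0, "r3": 0.5},
    "r3": {"r1": 1.0, "r2": 0.5, "r3": 0.0},
}

# Distance between the cluster {r1, r2} and the singleton {r3}.
d = average_linkage(matrix, ["r1", "r2"], ["r3"])
print(d)
```

Other linkage choices only change how the pairwise distances are aggregated, for instance taking the minimum ('single') or the maximum ('complete') instead of the mean.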
The code shown in the examples above for the clustering functionality does not return the distance
matrix, although it is actually computed under the hood. However, if you want to also have the distance
matrix among the outputs, you can select the clustering_and_d_matrix functionality, which
returns all the outputs generated by the distance_matrix and the clustering functionalities.
Full analysis¶
Of course, if you want a full analysis of your routes, you can decide to activate all
the functionalities in the same call to the process_routes function.
output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
functionalities=[ # the functionalities to be activated
'compute_descriptors', # calculation of routes descriptors
'clustering_and_d_matrix', # calculation of distance matrix and clustering
'merging']) # merging of the routes to obtain a "tree"
The above code will generate all the above mentioned attributes of the output
object, as well as all the files.
Performance improvements¶
To improve computational performance, parallel computing can be activated
from the process_routes function; it is then automatically applied to all the suitable functionalities
among the selected ones.
output = process_routes({'az_file.json': 'az'}, # path of input file generated by AZ
                        functionalities=['distance_matrix'], # the functionalities to be activated
                        parallelization=True, # parallelization is activated
                        n_cpu=8) # nr of CPUs to be used
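Pairwise calculations such as the distance matrix are natural candidates for parallelization, because each pair can be processed independently. The sketch below illustrates the idea with the standard library. It is not how LinChemIn distributes the work: a thread pool is used here only to keep the example self-contained (CPU-bound work would normally use processes), and the toy distance, the route sets and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(pair):
    """Number of elements that are in one set but not in the other."""
    set_a, set_b = pair
    return len(set_a ^ set_b)


def pairwise_distances(items, n_workers=1):
    """Distances for all unordered pairs, optionally spread over workers."""
    pairs = list(combinations(items, 2))
    if n_workers <= 1:
        return [toy_distance(pair) for pair in pairs]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the order of `pairs` in its results.
        return list(pool.map(toy_distance, pairs))


# Hypothetical routes, each reduced to a set of reaction ids.
routes = [frozenset({1, 2}), frozenset({2, 3}), frozenset({4})]
serial = pairwise_distances(routes)
parallel = pairwise_distances(routes, n_workers=4)
print(serial, parallel)
```

Both calls return the same list, so switching the number of workers affects only the run time, not the results.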