Tutorial

Manipulate Predicted Routes

If you just want to start manipulating your predicted routes right away, this page will help you get started. Our process_routes() function wraps the main functionalities of our package and provides a simple interface to several useful applications. It works both through file I/O and in memory: results are written to files and also returned as Python objects.

So, you used IBMRXN or Askcos (or both!) to predict some routes for a molecule and you have the results in json files. To start manipulating the routes, let’s read the json files and transform the predictions into a list of graph objects.

First of all, let’s import the process_routes() function:

from linchemin.interfaces.workflows import process_routes

This function is really all you need to start working with your predictions. It is called with a dictionary that maps the path of each json file to the name of the CASP tool that generated it, plus the optional arguments output_format and out_data_model. The former sets the format of the output file in which the routes will be written (default is json, but csv is also available) and the latter sets the data model of the output routes (default is bipartite, with both molecule and reaction nodes, but monopartite reactions and monopartite molecules are also available). In the example below we transform the routes generated by two different CASP tools into SynGraph objects containing reaction nodes only, i.e., into instances of the MonopartiteReacSynGraph class, and we write them into a json file.

output = process_routes({'ibmrxn_file.json': 'ibmrxn',          # path of input file generated by IBMRXN
                         'askcos_file.json': 'askcosv1'},       # path of input file generated by AskcosV1
                         out_data_model='monopartite_reactions')# selected output data model

The function creates a json file called ‘routes.json’ containing the predicted routes as dictionaries of nodes and edges, and it returns a WorkflowOutput object, whose attribute routes_list stores the routes as a list of MonopartiteReacSynGraph instances.

The SynGraph format and its subclasses are the working formats for many of the functionalities implemented in LinChemIn, and they also ensure that the chemical information in your graphs is correct.
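To make the difference between the data models concrete, here is a minimal sketch that works on plain Python tuples rather than on SynGraph objects. It is not LinChemIn's implementation: the node labels, the edge-list representation and the two helper functions are assumptions made only for illustration.

```python
# Toy bipartite route with made-up labels: molecule -> reaction -> molecule.
bipartite_edges = [
    ("A", "R1"), ("B", "R1"), ("R1", "C"),  # A + B -> C
    ("C", "R2"), ("D", "R2"), ("R2", "E"),  # C + D -> E
]
reaction_nodes = {"R1", "R2"}


def to_monopartite_reactions(edges, reactions):
    """Link reaction r1 to reaction r2 when a product of r1 is a reactant of r2."""
    produced_by = {}  # molecule -> set of reactions producing it
    consumed_by = {}  # molecule -> set of reactions consuming it
    for src, dst in edges:
        if src in reactions:
            produced_by.setdefault(dst, set()).add(src)
        else:
            consumed_by.setdefault(src, set()).add(dst)
    return sorted(
        (r1, r2)
        for molecule, producers in produced_by.items()
        for r1 in producers
        for r2 in consumed_by.get(molecule, ())
    )


def to_monopartite_molecules(edges, reactions):
    """Link every reactant of a reaction directly to every product of it."""
    reactants, products = {}, {}
    for src, dst in edges:
        if dst in reactions:
            reactants.setdefault(dst, []).append(src)
        else:
            products.setdefault(src, []).append(dst)
    return sorted(
        (reactant, product)
        for reaction, mols in reactants.items()
        for reactant in mols
        for product in products.get(reaction, [])
    )


reaction_view = to_monopartite_reactions(bipartite_edges, reaction_nodes)
molecule_view = to_monopartite_molecules(bipartite_edges, reaction_nodes)
```

The bipartite view keeps both kinds of nodes, the monopartite-reactions view keeps only the reaction-to-reaction links, and the monopartite-molecules view keeps only the reactant-to-product links.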

If you are interested in selecting a different data model, or you just want to find out which options are available, you can import the get_workflow_options() function and type:

from linchemin.interfaces.workflows import get_workflow_options
get_workflow_options(verbose=True)

This will print to your screen all the available options and the default values of the process_routes function. The information will also be returned as a dictionary. If you do not want the information printed on the screen, just set verbose to False (or simply do not specify it).

The basic functionality of the process_routes function, i.e., reading and writing the routes, wraps the translate functionality exposed by the facade() function, which, in turn, wraps the translator() function of the translate module. The latter hosts all the functions and classes that handle the translation of graph objects between formats. If you want a better understanding of the functions behind the facade, or just need more freedom for converting your routes, you can check them out together with the helper function facade_helper().

Reactions atom mapping

In the routes predicted by CASP tools, it is very likely that the roles of the Molecules in the involved reactions are incorrect, with reagents appearing as reactants. In order to improve the chemical information in your route you might want to perform the atom mapping on the involved reactions and LinChemIn has a full machinery to do that!

Be aware, however, that the atom-to-atom mapping tool is part of our separate library LinChemIn_Services and that you will need to install it in order to access the atom mapping functionality. But don’t worry, LinChemIn_Services is freely available and it can be found at https://github.com/syngenta/linchemin_services together with the documentation for its installation and usage. You can also check out the Atom-to-Atom Mapping page for some additional information (including some details about the module responsible for the mapping functionality and the usage of the mapping information, if you are interested).

Once you have the atom mapping machinery up and running, you can easily access it through the usual process_routes() function: you just need to set the mapping argument to True. The rxnmapper tool will be called to map all the routes in your set.

In the code snippet below, we read the routes predicted by IBMRXN, we map all the involved reactions and we write the mapped routes into a json file. The atom-to-atom mapping is performed with ‘rxnmapper’.

output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
                         output_format='json',          # format of the output file 'routes.json'
                         mapping=True)                  # the mapping functionality is activated

In this case, the routes written in the output file will have the atom-to-atom mapping information and they can also be accessed from the routes_list attribute of the output object.

This option wraps the atom_mapping functionality exposed by facade(), which in turn is based on the perform_atom_mapping() function. You can check them out if more details are needed.

Compute routes descriptors

The process_routes function gives access to all the main functionalities of LinChemIn. The only thing you need to do is to specify which ones you want to perform through the functionalities argument.

Let’s say you are interested in computing the number of steps and the number of branches for each of the predicted routes. This can be done by passing the string compute_descriptors to the functionalities argument, together with the input file, as we did before. To determine which descriptors should be computed, we can pass their names to the descriptors argument; if the latter is not specified, all the implemented descriptors will be computed.

output = process_routes({'az_file.json': 'az',                  # path of input file generated by AZ
                         'askcosv2_file.json': 'askcosv2'},     # path of input file generated by AskcosV2
                         output_format='csv',                   # format of the output file 'routes.csv'
                         functionalities=['compute_descriptors'],# the functionalities to be activated
                         descriptors=['nr_steps',               # descriptors to be computed: nr steps
                                      'nr_branches'])          # descriptors to be computed: nr branches

The function once again writes the routes to a file (here we select to have them in a csv file, instead of the default json file) and stores the corresponding SynGraph objects in the routes_list attribute of the output object. Moreover, a csv file called ‘descriptors.csv’ is also created which contains a dataframe with the route ids and the computed descriptors. The dataframe is also stored in the descriptors attribute of the output object.
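Once ‘descriptors.csv’ exists, a few lines of standard-library Python are enough to rank the routes by one descriptor. The sketch below is only illustrative: the column name route_id and the file content are made-up stand-ins, so check the header of your own file before reusing it.

```python
import csv
import os
import tempfile


def rank_routes(csv_path, column, id_column="route_id"):
    """Return (route id, value) pairs sorted by ascending descriptor value.

    `id_column` is an assumed column name, not one confirmed by LinChemIn.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    pairs = [(row[id_column], float(row[column])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up stand-in for a 'descriptors.csv' file, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "descriptors.csv")
    with open(path, "w", newline="") as handle:
        handle.write("route_id,nr_steps,nr_branches\n")
        handle.write("route_a,5,1\n")
        handle.write("route_b,3,0\n")
    ranking = rank_routes(path, "nr_steps")
```

The same ranking can be obtained directly from the dataframe stored in the descriptors attribute of the output object, without going through the file.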

If you need a refresher on which descriptors are available, just resort again to our get_workflow_options() function.

The compute_descriptors option wraps the routes_descriptors functionality exposed by facade(), which in turn is based on the descriptor_calculator() function. You can check them out if more details are needed.

Route similarity

You might be interested in assessing how similar to one another the predicted routes are. The distance_matrix functionality of the process_routes function comes in handy here, and the usage is once again very simple: it is sufficient to pass the distance_matrix string to the functionalities argument.

output = process_routes({'az_file.json': 'az'},               # path of input file generated by AZ
                         functionalities=['distance_matrix']) # the functionalities to be activated

Here, the distance matrix is returned as a pandas dataframe stored in the distance_matrix attribute of the output object and it is also written to the ‘distance_matrix.csv’ file. The distance matrix is computed with the Graph Edit Distance (GED) algorithm implemented in NetworkX; for reasons of computational efficiency, we recommend working with MonopartiteReacSynGraph objects.

By typing the above code, you are using the default parameters that we pre-defined for the calculations of the distance matrix; in particular, we selected a GED algorithm and a set of parameters determining methods and algorithms for computing reaction and molecular fingerprints and similarity. To find out which are the default values and the optional parameters that you can select, let’s call our get_workflow_options() function.

In case you are not happy with our defaults, you can freely choose different values for some or all parameters by specifying them in the ged_params argument, as shown in the code below.

from rdkit.Chem import rdChemReactions

output = process_routes({'az_file.json': 'az'},                # path of input file generated by AZ
                         functionalities=['distance_matrix'],  # the functionalities to be activated
                         ged_method='nx_ged',                  # the algorithm to be used for the GED calculations
                         ged_params={                          # a dictionary specifying the parameters for molecular/reaction similarity
                             'reaction_fp': 'structure_fp',           # reaction fingerprints type
                             'reaction_fp_params': {'fpSize': 1024,   # reaction fingerprints size
                                                    'fpType': rdChemReactions.FingerprintType.MorganFP},  # molecular fingerprints
                             'reaction_similarity_name': 'dice'})     # similarity algorithm

This functionality wraps the distance_matrix function of facade(), which in turn is based on the compute_distance_matrix() function: you can check them out for more details.
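For readers unfamiliar with the ‘dice’ option, the sketch below computes the Dice similarity between two bit fingerprints represented as sets of "on" bit indices. It is a plain-Python illustration of the standard formula, not the code path that LinChemIn or RDKit runs; the function name and the example bit sets are made up.

```python
def dice_similarity(bits_a, bits_b):
    """Dice similarity: 2 * |A & B| / (|A| + |B|) for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    return 2 * len(a & b) / (len(a) + len(b))


# Two made-up fingerprints sharing two of their three on-bits.
fp_1 = {1, 2, 3}
fp_2 = {2, 3, 4}
similarity = dice_similarity(fp_1, fp_2)
```

Identical fingerprints give a similarity of 1 and fingerprints with no shared bits give 0, so routes made of very similar reactions end up close to each other in the distance matrix.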

Clustering

One of the best uses we can make of the distance matrix is to cluster our routes based on it. With the clustering functionality we do not even need to bother computing the matrix: everything is done automatically under the hood! Let’s just pass the usual path to the input file(s) and specify the desired functionalities:

output = process_routes({'ibmrxn_file.json': 'ibmrxn'},     # path of input file generated by IBMRXN
                         functionalities=['clustering'])    # the functionalities to be activated

The above code is really all you need: the clustering attribute of the output object stores the outcome of the clustering algorithm, while the clustered_descriptors attribute holds a dataframe with the route ids, the corresponding cluster labels and a couple of descriptors. The latter dataframe is also written into the newly created ‘cluster_metrics.csv’ file.

By default, the Agglomerative Clustering algorithm is used if there are fewer than 15 routes and Hdbscan otherwise. However, you can also specify the clustering_method argument and directly select which algorithm to use.
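The default rule described above can be written down in a couple of lines. This is only a sketch of the rule as stated in this tutorial: the helper name is hypothetical, and the string 'hdbscan' is an assumed option name (only 'agglomerative_cluster' appears in the examples on this page), so check get_workflow_options() for the exact values.

```python
def default_clustering_method(n_routes, threshold=15):
    """Pick the algorithm following the rule 'fewer than 15 routes'."""
    if n_routes < threshold:
        return "agglomerative_cluster"
    return "hdbscan"  # assumed option name, see the note above


small_set_choice = default_clustering_method(14)
large_set_choice = default_clustering_method(15)
```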

Also in this case, there are many parameters for which we chose a default value, and again, you can find what they are and how you can change them by invoking the get_workflow_options() function.

In the code below, you can see how to set different values for the parameters: the ged_params argument is the same that we used in the previous sections to specify the GED parameters (we need to compute the distance matrix, after all!); in addition, you can specify the linkage argument in case you are using the Agglomerative Clustering algorithm or the min_cluster_size if you are using Hdbscan.

output = process_routes({'ibmrxn_file.json': 'ibmrxn'},     # path of input file generated by IBMRXN
                         functionalities=['clustering'],    # the functionalities to be activated
                         ged_method='nx_ged',               # the algorithm to be used for the GED calculations
                         clustering_method='agglomerative_cluster',  # the algorithm to be used for clustering
                         ged_params={                       # a dictionary specifying the parameters for molecular/reaction similarity
                             'reaction_fp': 'difference_fp',          # reaction fingerprints type
                             'reaction_fp_params': {'fpSize': 1024},  # reaction fingerprints size
                             'reaction_similarity_name': 'dice'},     # similarity algorithm
                         linkage='average')                 # optional parameter for the clustering algorithm
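If you are wondering what an 'average' linkage means, the following pure-Python sketch clusters a tiny, made-up distance matrix by repeatedly merging the two clusters with the smallest average pairwise distance. It is a teaching aid under simplifying assumptions (a fixed number of clusters, nested lists instead of a dataframe) and not the clustering code used by LinChemIn.

```python
from itertools import combinations


def average_linkage(dist, n_clusters):
    """Merge the closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            mean = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or mean < best[0]:
                best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


# Made-up symmetric distance matrix for four routes.
toy_matrix = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
toy_labels = average_linkage(toy_matrix, n_clusters=2)
```

Routes 0 and 1 end up in one cluster and routes 2 and 3 in the other, because each pair is much closer internally than to the other pair.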

The clustering functionality wraps the clusterer() function of the factory in the clustering module.

The process_routes calls shown above for the clustering functionality do not return the distance matrix, although it is actually computed under the hood. However, if you also want the distance matrix among the outputs, you can select the clustering_and_d_matrix functionality, which returns all the outputs generated by the distance_matrix and the clustering functionalities.

Full analysis

Of course, if you want a full analysis of your routes, you can decide to activate all the functionalities in the same call to the process_routes function.

output = process_routes({'ibmrxn_file.json': 'ibmrxn'}, # path of input file generated by IBMRXN
                         functionalities=[              # the functionalities to be activated
                            'compute_descriptors',      # calculation of routes descriptors
                            'clustering_and_d_matrix',  # calculation of distance matrix and clustering
                            'merging'])                 # merging of the routes to obtain a "tree"

The above code will generate all the above-mentioned attributes of the output object, as well as all the files.

Performance improvements

To improve computational performance, it is possible to activate parallel computing from the process_routes function; it will automatically be applied to all the suitable functionalities among the selected ones.

output = process_routes({'az_file.json': 'az'},                # path of input file generated by AZ
                         functionalities=['distance_matrix'],  # the functionalities to be activated
                         parallelization=True,                 # parallelization is activated
                         n_cpu=8)                              # nr of CPUs to be used
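A distance matrix is a natural candidate for parallel execution because every pair of routes can be compared independently. The sketch below shows this structure with a thread pool from the standard library and a toy distance function; it says nothing about how LinChemIn schedules its work internally, and for CPU-bound distances a process pool would be the more realistic choice. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pairwise_matrix(items, distance, n_workers=4):
    """Fill a symmetric matrix by evaluating each pair of items concurrently."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    pairs = list(combinations(range(n), 2))  # n * (n - 1) / 2 independent tasks
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        values = pool.map(lambda pair: distance(items[pair[0]], items[pair[1]]), pairs)
        for (i, j), value in zip(pairs, values):
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy stand-ins: numbers instead of routes, absolute difference as distance.
toy_items = [1, 4, 6]
toy_distances = pairwise_matrix(toy_items, lambda a, b: abs(a - b))
```

Since the number of pairs grows quadratically with the number of routes, spreading the pairs over several workers pays off mostly for large route sets.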