De-novo Chemical Reaction Generation by Means of Temporal Convolutional Neural Networks

Andrei Buin,∗,† Hung Yi Chiang,† S. Andrew Gadsden,∗,‡ and Faraz A. Alderson†

†Department of Mechanical Engineering, University of Guelph, 50 Stone Rd E, Guelph, ON, N1G 2W1
‡Department of Mechanical Engineering, McMaster University, 1280 Main Street West, Hamilton, Ontario, Canada L8S 4L7

E-mail: phquanta@gmail.com; gadsden@mcmaster.ca

Abstract

We present here a combination of two networks, Recurrent Neural Networks (RNN) and Temporal Convolutional Neural Networks (TCN), for de novo reaction generation using the novel reaction-SMILES-like representation of reactions (CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks are known for their autoregressive properties and are frequently used in language modelling, with direct application to SMILES generation. The relatively novel TCNs possess similar properties with a wide receptive field while obeying the causality required for natural language processing (NLP). The combination of both latent representations, expressed through TCN and RNN, results in an overall better performance compared to RNN alone. Additionally, it is shown that different fine-tuning protocols have a profound impact on the generative scope of the model when applied to a dataset of interest via transfer learning.

Introduction

With advances in Deep Learning (DL) generative methods, it is becoming more common to utilize DL's generative properties in a variety of applications. One such application is retrosynthetic planning, where, given the products of a reaction, one tries to predict the reacting precursors that resulted in those products. There are works1,2 that already use DL methods to guide retrosynthetic planning. While it is a great tool for chemists, it still lacks truly generative power when trying to generate novel reactions with unseen reaction centers and precursors. Part of the problem is the lengthy language models describing chemical reactions in textual form, such as SMARTS/SMIRKS atom-mapped reaction representations. Only recently,3 with the introduction of the Condensed Graph of Reaction (CGR),4 has complex reaction information (reactants/products, bond formation/breaking) been successfully encoded into a simple textual representation. In CGR, both reactants and products are combined into one single graph with bond creation and bond breaking incorporated, then expressed via SMILES-like strings. We will refer to CGR as CGRSmiles throughout the paper. One can tackle the task of generating SMILES-like strings via Recurrent Neural Networks (RNN)5 with Long Short-Term Memory (LSTM) cells used to avoid the problem of vanishing/exploding gradients. On the other hand, a similar approach, but with the use of Convolutional Neural Networks (CNN), namely Temporal Convolutional Neural Networks (TCN)6 which use causal and dilated convolutions, has been used mostly in classification (prediction) applications. These applications include text classification,7 state-of-charge battery estimation,8 time series forecasting,9–11 and protein-binding prediction.12 By themselves, TCNs have a generative power which comes from causal convolutions, and in this sense can be thought of as an alternative to RNNs. Despite this, there are virtually no applications of novel DL architectures such as TCN in generative applied chemistry.
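To illustrate the representation, a CGR can be composed from an atom-mapped reaction SMILES with the CGRtools package21 used later in this work. The sketch below is our own minimal example, assuming CGRtools' top-level `smiles` parser and `ReactionContainer.compose()`; the toy hydration reaction is purely illustrative.

```python
from CGRtools import smiles  # CGRtools parser for (reaction) SMILES

# A hypothetical atom-mapped reaction SMILES: ethylene hydration
reaction = smiles('[CH2:1]=[CH2:2].[OH2:3]>>[CH3:1][CH2:2][OH:3]')

# Compose reactants and products into one Condensed Graph of Reaction;
# dynamic bonds encode bond creation/breaking in a single graph
cgr = reaction.compose()

# The CGR is rendered as a SMILES-like string (CGRSmiles in this paper)
print(str(cgr))
```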
By utilizing the TCN's generative power, we show that a combination of RNN-LSTM with TCN results in a better generative model compared to the pure RNN models used as baselines. In addition to this, SMILES generation previously and usually implied that the model learns SMILES independently of context; context was only introduced by fine-tuning on a dataset of interest on which the model was subsequently refined. We show that in-context SMILES generation exhibits more diverse structural motifs, based on Tanimoto similarity scores, compared to the pure RNN-LSTM model without context. Another contribution of this paper is the utilization of a novel transfer learning protocol, of a kind widely used in image/video applications13 and suitable for low-shot learning. With a traditional transfer learning protocol, where all the weights are "re-learned" on a particular dataset, the model seems to "memorize" that dataset while applying the grammar rules learned in initial training. As a result, it will try to generate reactions with the particular reaction templates (reactants + products) seen in the fine-tuning dataset. We have observed this phenomenon in other3 research, as well as our own. One can alleviate this problem by introducing an exhaustive reaction dataset for a particular problem, but this solution would not work in low-shot learning. With a variant of weight freezing, we show that our fine-tuned model, trained on our own dataset, significantly outperforms models fine-tuned with an all-weights optimization transfer learning approach.

Computational details

The original work of Gupta et al.14 has been utilized and modified for the RNN in parts of the code. As for the TCN, we used the custom implementation of Remy.15 All of the code was written in the Keras16 framework with a TensorFlow17 backend. A custom vocabulary for CGRSmiles was incorporated with one-hot encodings. RNNs with 2 to 3 layers of stacked LSTM cells were used as outlined in the results section, with 512 hidden units for each RNN (LSTM) layer. For the TCN network, 1 residual block was used without normalization, with a dilation vector of d = [1, 2, 4, 8, 16, 32]. A kernel size of 2 was used in the 1D convolutional layers, and 256 convolutional filters were used in the TCN residual block. For regularization, we used a dropout of 0.5 for each of the LSTM and TCN layers. A softmax layer was used as the final layer for classification with categorical cross-entropy loss. No batch normalization was used in the TCN residual block. For Seq2Seq fingerprints (length of 768), an RNN of 3 stacked layers of 256 GRU cells with an attention mechanism was used.18 Additionally, the Seq2Seq fingerprints were processed with Principal Component Analysis (PCA) for dimensionality reduction; we kept 50 dimensions from PCA before feeding them into a t-distributed stochastic neighbor embedding (tSNE)19 analyzer. All SMILES manipulations, such as computing Tanimoto similarity scores and performing validity checks, were done using RDKit.20 For CGRSmiles generation, reaction center acquisition, and to/from reaction SMILES conversions, along with aromaticity and valence checks, the CGRtools package21 was used. For the BiLSTM model,22 the entire USPTO dataset was cast as a classification problem by dividing the entire corpus into strings of 80 characters with a sliding window of 3, with the 81st character used as the ground truth for the classifier.
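For concreteness, the proposed TCN + RNN(LSTM) model can be sketched in Keras with the keras-tcn package.15 This is a minimal sketch under stated assumptions: fusing the two latent representations by concatenation is our reading of the proposed architecture, and names such as `vocab_size` and `max_len` are placeholders rather than values from the released code.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tcn import TCN  # keras-tcn implementation of Remy

vocab_size, max_len = 66, 156  # hypothetical vocabulary size / sequence length

inp = layers.Input(shape=(max_len, vocab_size))  # one-hot encoded CGRSmiles

# RNN branch: stacked LSTM layers, 512 hidden units, dropout 0.5
h = layers.LSTM(512, return_sequences=True, dropout=0.5)(inp)
h_rnn = layers.LSTM(512, return_sequences=True, dropout=0.5)(h)

# TCN branch: 1 residual block, 256 filters, kernel size 2,
# dilations [1, 2, 4, 8, 16, 32], no batch normalization
h_tcn = TCN(nb_filters=256, kernel_size=2, nb_stacks=1,
            dilations=[1, 2, 4, 8, 16, 32], dropout_rate=0.5,
            return_sequences=True, use_batch_norm=False)(inp)

# Fuse both latent representations (concatenation assumed) and classify
merged = layers.Concatenate()([h_rnn, h_tcn])
out = layers.Dense(vocab_size, activation='softmax')(merged)

model = keras.Model(inp, out)
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy')
```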
We have tried longer character strings and shorter sliding windows, but this resulted in lengthy training times.

Architectural design

Language Model

When doing generative modeling in the context of Natural Language Processing (NLP), it is crucial to choose a proper language model. An RNN(LSTM) usually implies unidirectional context from past to future, whereas a bidirectional LSTM (BiLSTM) sees the entire context and therefore becomes inappropriate for this NLP task without adaptation. However, there are methods that use the BiLSTM22 language model with certain adaptations. In one case, the entire dataset is represented as a single corpus, with the model trained to predict the next character given an N-length sequence in a sliding window manner. The problem with this approach is that CGRSmiles have relatively long sequence lengths; datasets created from the corpus of the original CGRSmiles reaction strings in the form of $(X, y)$, where $X = x_1, \ldots, x_t$ and the label $y = x_{t+1}$, become prohibitively large. As a result, training time also scales proportionally. Another bidirectional adaptation involves interleaving bidirectional sampling on the left and right of the sequence, starting from the center character (BiMODAL).23 Additional augmentation of the adapted dataset, in this case, helps increase the accuracy of the model, again at the expense of increased computational time. In general, the problem with bidirectional language models, without proper adaptation, lies in the fact that the model has seen the whole context. If one, on the other hand, tries to use a vanilla BiLSTM with RNN-like training (where the target sequence is the original input sequence shifted 1 position to the right), the model will not learn to generate novel reactions, but will instead just shift the input sequence to the right by 1 position. These non-standard adaptations might also require dynamic rather than static graph computation, due to graph modifications during runtime.23 As a result, we adopt a standard language model suited for generating one token at a time given the left context, along with a refined fine-tuning protocol.

Model Training Protocol

We used the Adam optimization algorithm24 for training our models, with cross-entropy loss as the optimization objective. For Seq2Seq and RNN, the original dataset was split 80% for training and 20% for test. The learning rate was set to $10^{-3}$. Models were trained for 50 epochs on the original dataset and for 10 epochs in the case of fine tuning. The batch size was 64 in the case of the original dataset, and 1 in the case of fine tuning.

Fine Tuning Protocol

A variety of fine-tuning protocols were used. The original fine-tuning protocol allowed all weights to be adjusted under the transfer learning approach. Another protocol involved freezing all layers except the last softmax layer. Additionally, we tried a decaying-learning-rate protocol across layers,25 using different learning rates and different numbers of training epochs for different layers.

Sampling protocol

Usually sampling involves the softmax function
$$P(y_i) = \frac{\exp(y_i)}{\sum_j \exp(y_j)},$$
which gives more syntactically correct, but less diverse, structures/reactions compared to the temperature-controlled softmax function
$$P(y_i) = \frac{\exp(y_i/T)}{\sum_j \exp(y_j/T)},$$
which yields more diversity in the generated structures/reactions but a smaller number of valid CGRSmiles strings. This sampling protocol has been described elsewhere.14,23 CGRSmiles strings were sampled at a sampling temperature of T = 0.7.
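A minimal sketch of temperature-controlled sampling over the model's per-step softmax output; the small floor added before the logarithm is our own numerical-stability assumption, not part of the original protocol.

```python
import numpy as np

def sample_next_token(probs, temperature=0.7):
    """Sample a next-token index from the model's softmax output `probs`,
    rescaled by a sampling temperature T (T = 1 recovers plain softmax)."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12)
    scaled = logits / temperature
    p = np.exp(scaled - scaled.max())  # shift by max for numerical stability
    p /= p.sum()                       # renormalize to a distribution
    return np.random.choice(len(p), p=p)
```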
For each analysis task, 30,000 CGRSmiles were generated.

Datasets

For the larger corpus we used the training split of Jin's26 USPTO atom-mapped dataset, derived from Lowe's grant dataset.27 For the fine-tuning dataset, we used web-scraped reactions involving hydrogen peroxide (the H2O2 dataset) from PubChem,28 which were later preprocessed via Atom Mapper29 for proper atom mapping. Jin's dataset and the H2O2 dataset were processed via the CGRtools library21 to obtain CGRSmiles strings. A maximum string length of 156 characters was considered for both datasets. After the aromaticity and valence checks, and after applying the 156-character maximum length for CGRSmiles strings, the larger dataset was reduced to 216,308 reactions. For the smaller, fine-tuning dataset, we acquired 166 atom-mapped reactions. It should be noted that most reactions in the smaller dataset are oxidation reactions (80%), meaning they contain O=O as part of the reactants.

Reaction Center and In-Context SMILES Analysis

The most crucial part of any chemical reaction is the reaction center, i.e., the atoms directly involved in bond creation/breaking. To analyze how novel the model's output is, one needs a means of categorizing novelty, i.e., reaction centers. Fortunately, the CGRtools library encodes each substructural motif with a hash function whose value is a unique key. This value is used to categorize known reaction centers within a dataset; additionally, the hash value is used to detect novel reaction centers and to compare known and unknown reaction centers. We categorize reactions not by bare known reaction centers, but by reaction centers extended with their first closest neighbors, because the original dataset was not curated based on the presence of certain types of reaction centers, and all were considered. For in-context SMILES generation, each CGRSmiles string was converted back to the reaction SMILES representation, and the SMILES of the product and reactant parts were extracted for subsequent analysis using Tanimoto similarity scores.

Results and discussion

Figure 1 shows the Deep Learning architectures used. Baselines 1-3 are homogeneous Deep Learning architectures using either TCN or LSTM layers. The proposed architecture in Figure 1(d) is, on the other hand, a combination of both LSTM and TCN; this architecture has the ability to learn from two latent representations. Figure 1(e) shows a residual block, which is the core of the TCN. Basically, a TCN is a stack of residual blocks, each of which consists of dilated convolutional layers combined with weight normalization and dropout layers, as shown in Figure 2. In our case, no weight normalization was used and a dilation vector of d = [1, 2, 4, 8, 16, 32] was utilized. A 1x1 convolution is used in the case of a depth mismatch between the input and the output of the last dropout layer.

Figure 1: Baseline and proposed architectures: (a) Baseline 1, (b) Baseline 2, (c) Baseline 3, (d) proposed architecture, (e) internals of the TCN layer.

One could utilize causal convolutions alone. However, by introducing dilated convolutions, one greatly enhances the receptive field (i.e., the past history) that the TCN can look into. In other words, the last convolutional layer can see much further into the past when compared to plain causal convolutions, as seen in Figure 2.
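This gain can be checked numerically; the sketch below is our own illustration (not code from the paper), counting the past positions visible to the top layer of a causal stack.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal 1D convolutions: each layer
    adds (kernel_size - 1) * dilation positions of past context."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

plain = receptive_field(2, [1] * 6)                 # plain causal stack: 7
dilated = receptive_field(2, [1, 2, 4, 8, 16, 32])  # dilated TCN stack: 64
print(dilated / plain)  # ~9.1, the roughly 9-fold advantage cited below
```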
The receptive field, compared to a vanilla convolutional operation (d = 1), grows as
$$\frac{F_{\mathrm{TCN}}}{F_{\mathrm{conv}}} \propto \frac{2^n}{n+1},$$
where n is the number of equivalent convolutional and residual layers of the TCN with exponentially growing dilation (2, 4, 8, ...). Clearly, the TCN has the advantage. In the example of Figure 2 this advantage is a factor of 2; in our case, the actual advantage is roughly 9-fold.

Figure 2: TCN with residual blocks, and the receptive field of the original convolution (light violet shading) and the dilated convolution (light green shading).

Next, we considered how well the models perform when generating novel reactions based on the initial training. From Table 1, one can see that all of the models give a relatively high number of unique and valid CGRSmiles strings. However, both TCN and TCN+RNN(LSTM) gave significantly higher numbers of novel reaction centers. For comparison, we also compared our results with the BiLSTM language model,22 which has 2 BiLSTM layers with 128 hidden units each. The results are shown in Table 1. One can see that the number of valid CGRSmiles strings is significantly lower for the BiLSTM model, while the number of reaction centers is higher compared to the TCN and TCN+RNN models. One aspect to note is that the training time required for the BiLSTM is significantly higher than for both the TCN and TCN+RNN models. A good compromise is achieved via the combination TCN+RNN, where the number of valid CGRSmiles strings was the highest among all models and the number of novel reaction centers was relatively high, albeit not the highest.

Table 1: Generative properties of a variety of models.

Model            Valid(%)  Unique(%)  N(RC)
Baseline1        93.42     98.84        877
Baseline2        94.40     99.3         873
TCN              85.52     98.2        1943
TCN+RNN(LSTM)    94.71     98.66       1239
BiLSTM(80,3)     78.52     99.87       2606
Dataset          N/A       N/A        12308

The next step was to explore a variety of fine-tuning protocols along with in-context SMILES generation. Table 2 shows that if one allows the model to freely adapt to a smaller dataset with all weights being adjustable (all unfrozen, AU in Table 2), "memorization", or overfitting of the model on the novel dataset, will occur. On the other hand, keeping only the last layer unfrozen (LL in Table 2), the model is capable of transferring its knowledge from previous learning more efficiently. Other fine-tuning protocols were tried as well, one of which is shown in Table 2 (P1, or Protocol 1) and provides results lying between the two extremes of the other protocols. Interestingly enough, the model with the LL transfer learning approach was able to sample CGRSmiles with Na, Pt, and Se ions in them. These ions were not part of the smaller dataset, but the model had seen some examples of such reactions in the larger dataset and applied that knowledge during the fine-tuning phase. We also computed tSNE plots of the generated CGRSmiles, as shown in Figure 3. One can see that the results are in agreement with Table 2, as expected, with the Last Layer (LL) protocol, unfrozen upon transfer learning, giving the highest number of different reaction centers. This indicates the possibility of few-shot learning with only a few samples from the smaller dataset.

Table 2: Fine-tuning properties of a variety of fine-tuning protocols for TCN+RNN. AU - all unfrozen. P1 - Protocol 1, where layers [[1,2], 4, 5] were trained for [2, 5, 10] epochs with learning rates [$10^{-6}$, $10^{-5}$, $5 \times 10^{-4}$], in sequential order starting from the last layer. LL - only the last (softmax) layer was unfrozen, while all other layers were frozen.

Model          Valid(%)  Unique(%)  N(RC)
AU             98.65      9.98        63
P1             94.95     21.12        97
LL             91.92     60.40       288
H2O2 Dataset   N/A       N/A          64
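In Keras terms, the LL protocol reduces to a few lines of layer freezing. The sketch below is a minimal illustration assuming `model` is the trained TCN+RNN from above, with the softmax Dense layer last, and `x_small`/`y_small` as placeholders for the encoded fine-tuning data.

```python
from tensorflow import keras

# "LL" fine-tuning: freeze everything except the final softmax layer
for layer in model.layers:
    layer.trainable = False
model.layers[-1].trainable = True  # last Dense(softmax) layer stays trainable

# Recompile so the trainability changes take effect, then fine-tune
# (batch size 1 and 10 epochs, as in the training protocol above)
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy')
model.fit(x_small, y_small, batch_size=1, epochs=10)
```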
Figure 3: tSNE plots of generated CGRSmiles for the AU and LL fine-tuning protocols, compared to the fine-tuning dataset (axes tSNE1 vs. tSNE2).

To explore the SMILES of the reactants and products that participated in a given reaction, we converted CGRSmiles back to reaction SMIRKS consisting of the participating compounds. In this case, however, the information about atom mapping is lost during the conversion and only reactants and products are preserved. Figure 4(a) shows a typical example of such a conversion, in which each reactant/product is highlighted with a different color. By analyzing these participating chemical formulations, one can look into the context that created those SMILES strings. In other words, the SMILES extracted from the SMIRKS representation are not mere SMILES, but rather SMILES with context in them, since each SMILES string in this case is tied to a reaction. In a broader sense, it means that some reactants and products are more common than others in certain reactions. Figure 4(b) and Figure 4(c) show differences in in-context SMILES similarity scores for RNN and TCN+RNN. One can see that TCN+RNN gives more diverse SMILES scaffolds compared to RNN, as reflected by a smaller mean Tanimoto score. Please note that the two were compared at different fine-tuning protocols: one at AU and the other at LL. However, Figure 4(c) shows that the fine-tuning protocol has little effect on the in-context generated SMILES strings, and as a result the shift towards lower Tanimoto scores can be attributed to the architectural choice.

Figure 4: (a) Mapping between the CGRSmiles and SMIRKS representations; note that atom mapping is lost during the conversion. (b) In-context Tanimoto scores (normalized distributions with Gaussian fits) for RNN (AU) and TCN+RNN (LL). (c) Dependence of the Tanimoto score for TCN+RNN on the fine-tuning protocol (LL, P1, AU).

Finally, to explore the generative capabilities, we studied some of the reactions generated by the models. Out of all the fine-tuning protocols, LL gave the most reactions with novel reaction centers: on the order of 800 reactions with novel RC, whereas each of the other protocols gave only on the order of 100. Figure 5(a) shows an example of an unseen reaction, not present in the dataset, with a novel RC. This is glycidol hydrolysis, with no mistakes.30,31 Another reaction with a novel RC is shown in Figure 5(b); the closest reaction to this one is diketene hydrolysis.32,33 Interestingly enough, the model is able to learn how to open the ring, albeit with some errors such as the wrong placement of the CH2 group and O. Another reaction, but this time with a known reaction center, is the oxidation of the cyclohexanol derivative shown in Figure 5(c).
This is a feasible oxidation pathway for the cyclohexanol derivative, as the closest reaction is the oxidation of cyclohexanol34 via a similar pathway. In our case, the oxidation was done in the presence of water, whereas the cited work34 uses tert-butyl hydroperoxide (TBHP) as the oxidizing agent. This can be attributed to the fact that most reactions in the fine-tuning dataset are oxidation reactions (80%, as mentioned previously), meaning that they contain O=O in the reactants. For the generated CGRSmiles, the share of oxidation reactions is 92%, while the rest of the reactions contain mainly H2O and H2O2 as precursors. In addition, the collected dataset involving H2O2 did not contain reaction metadata such as catalysts, oxidizing agents, etc.; only plain SMIRKS reaction representations were collected from open sources. Furthermore, reaction 5(c) has invalid stoichiometry. Most errors of this type involve a wrong number of implicit hydrogens. We have observed unbalanced reactions in line with other work,3 mostly as an imbalance of implicit hydrogens; this has been attributed3 to the USPTO reaction database being imbalanced in the first place. Additionally, a small portion of errors (2.5% of all generated SMILES), such as copying the reactant part directly into the product part, was also observed. The reaction shown in Figure 5(d) is a reaction with a novel RC: an initial pathway for gold reduction by 2-pyrrolidinone,35 with the wrong placement of the O-OH group. However, one should keep in mind that the original work36 cited by Li et al.35 uses Nuclear Magnetic Resonance (NMR) shifts of 13C to determine the structure of the intermediate compound, and there the N atom carries a methyl group. This is in contrast to the work of Li et al.35 and our work, where there is no methyl group attached to N, and the resulting shifts do not necessarily correspond to the NH group. In other words, the placement of the -OOH group does not necessarily have to be on a carbon atom, as no rigorous NMR analysis was done in the case of Li et al.35

Figure 5: A variety of generated atom-mapped reactions with novel and known reaction centers.

Conclusion

This work presents a step forward towards unsupervised de-novo reaction generation. The contribution of this work is threefold. First, it explores an alternative TCN Deep Learning architecture in comparison with RNN by itself. Second, it shows that this approach allows for context-aware SMILES generation. Lastly, it shows that fine-tuning protocols contribute significantly to domain adaptation in chemical space, which in turn enables few-shot learning on smaller datasets upon transfer learning. The model with the best fine-tuning protocol was able to discover reactions which it had not seen but which were present in previously published work. This points to the possibility of gaining reaction insight before the synthesis stage.

Data and Software Availability

The CGRSmiles dataset and the collected hydrogen peroxide dataset, along with the generated CGRSmiles and Python scripts, are available at: https://github.com/phquanta/CGRSmiles.git

Acknowledgement

References

(1) Schreck, J. S.; Coley, C. W.; Bishop, K. J. M. Learning Retrosynthetic Planning through Simulated Experience. ACS Central Science 2019, 5, 970–981.

(2) Zheng, S.; Rao, J.; Zhang, Z.; Xu, J.; Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. Journal of Chemical Information and Modeling 2019, 60, 47–55.
(3) Bort, W.; Baskin, I. I.; Gimadiev, T.; Mukanov, A.; Nugmanov, R.; Sidorov, P.; Marcou, G.; Horvath, D.; Klimchuk, O.; Madzhidov, T., et al. Discovery of novel chemical reactions by deep generative recurrent neural network. Scientific Reports 2021, 11, 1–15.

(4) Hoonakker, F.; Lachiche, N.; Varnek, A.; Wagner, A. A representation to apply usual data mining techniques to chemical reactions—illustration on the rate constant of SN2 reactions in water. International Journal on Artificial Intelligence Tools 2011, 20, 253–270.

(5) Segler, M. H.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science 2018, 4, 120–131.

(6) Bai, S.; Kolter, J. Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 2018.

(7) Conneau, A.; Schwenk, H.; Barrault, L.; Lecun, Y. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781 2016, 2, 1.

(8) Song, X.; Yang, F.; Wang, D.; Tsui, K.-L. Combined CNN-LSTM network for state-of-charge estimation of lithium-ion batteries. IEEE Access 2019, 7, 88894–88902.

(9) Wan, R.; Mei, S.; Wang, J.; Liu, M.; Yang, F. Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting. Electronics 2019, 8, 876.

(10) Zhao, W.; Gao, Y.; Ji, T.; Wan, X.; Ye, F.; Bai, G. Deep temporal convolutional networks for short-term traffic flow forecasting. IEEE Access 2019, 7, 114496–114507.

(11) Liu, Y.; Dong, H.; Wang, X.; Han, S. Time series prediction based on temporal convolutional network. 2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS). 2019; pp 300–305.

(12) Cui, Y.; Dong, Q.; Hong, D.; Wang, X. Predicting protein-ligand binding residues with deep convolutional neural networks. BMC Bioinformatics 2019, 20, 1–12.

(13) Qi, H.; Brown, M.; Lowe, D. G. Low-shot learning with imprinted weights. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018; pp 5822–5830.

(14) Gupta, A.; Müller, A. T.; Huisman, B. J.; Fuchs, J. A.; Schneider, P.; Schneider, G. Generative recurrent networks for de novo drug design. Molecular Informatics 2018, 37, 1700111.

(15) Remy, P. Temporal Convolutional Networks for Keras. https://github.com/philipperemy/keras-tcn, 2020.

(16) Chollet, F., et al. Keras. https://github.com/fchollet/keras, 2015.

(17) Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015; https://www.tensorflow.org/, software available from tensorflow.org.

(18) Xu, Z.; Wang, S.; Zhu, F.; Huang, J. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2017; pp 285–294.

(19) Maaten, L. v. d.; Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 2008, 9, 2579–2605.

(20) Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org.

(21) Nugmanov, R. I.; Mukhametgaleev, R. N.; Akhmetshin, T.; Gimadiev, T. R.; Afonina, V. A.; Madzhidov, T. I.; Varnek, A. CGRtools: Python Library for Molecule, Reaction, and Condensed Graph of Reaction Processing. Journal of Chemical Information and Modeling 2019, 59, 2516–2521.
(22) van Deursen, R.; Ertl, P.; Tetko, I. V.; Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. Journal of Cheminformatics 2020, 12, 1–14.

(23) Grisoni, F.; Moret, M.; Lingwood, R.; Schneider, G. Bidirectional molecule generation with recurrent neural networks. Journal of Chemical Information and Modeling 2020, 60, 1175–1183.

(24) Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.

(25) Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 2018.

(26) Jin, W.; Coley, C.; Barzilay, R.; Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. Advances in Neural Information Processing Systems. 2017; pp 2607–2616.

(27) Lowe, D. Chemical reactions from US patents (1976-Sep2016). 2018.

(28) Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B., et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Research 2019, 47, D1102–D1109.

(29) Jaworski, W.; Szymkuć, S.; Mikulak-Klucznik, B.; Piecuch, K.; Klucznik, T.; Kaźmierowski, M.; Rydzewski, J.; Gambin, A.; Grzybowski, B. A. Automatic mapping of atoms across both simple and complex chemical reactions. Nature Communications 2019, 10, 1–11.

(30) Saito, A.; Shirasawa, T.; Tanahashi, S.; Uno, M.; Tatsumi, N.; Kitsuki, T. An efficient synthesis of glyceryl ethers: catalyst-free hydrolysis of glycidyl ethers in water media. Green Chemistry 2009, 11, 753–755.

(31) Wang, Z.; Cui, Y.-T.; Xu, Z.-B.; Qu, J. Hot water-promoted ring-opening of epoxides and aziridines by water and other nucleophiles. The Journal of Organic Chemistry 2008, 73, 2270–2274.

(32) Clemens, R. J. Diketene. Chemical Reviews 1986, 86, 241–318.

(33) Gómez-Bombarelli, R.; González-Pérez, M.; Pérez-Prior, M. T.; Manso, J. A.; Calle, E.; Casado, J. Kinetic study of the neutral and base hydrolysis of diketene. Journal of Physical Organic Chemistry 2009, 22, 438–442.

(34) Bhaumik, C.; Stein, D.; Vincendeau, S.; Poli, R.; Manoury, É. Oxidation of alcohols by TBHP in the presence of sub-stoichiometric amounts of MnO2. Comptes Rendus Chimie 2016, 19, 566–570.

(35) Li, C. C.; Chen, L. B.; Li, Q. H.; Wang, T. H. Seed-free, aqueous synthesis of gold nanowires. CrystEngComm 2012, 14, 7549–7551.

(36) Drago, R. S.; Riley, R. Oxidation of N-alkyl amides to novel hydroperoxides by dioxygen. Journal of the American Chemical Society 1990, 112, 215–218.