Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals

Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, using molecular descriptors based exclusively on connectivity. The project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database of >111 000 structures, development of new molecular descriptors, and training/validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds with mean absolute errors (MAE) of up to 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of the estimations (reducing the MAE by >30%).

■ INTRODUCTION

The energies of the highest occupied and lowest unoccupied molecular orbitals (HOMO and LUMO) calculated by quantum chemistry methods are currently of high importance for the discovery of new materials, namely for estimating optoelectronic properties and filtering databases of candidate organic molecules. 1 The demand for ultrathin, lightweight, and flexible electronic devices has led to the exploration of organic materials with potentially unique combinations of electronic, chemical, and mechanical properties. For example, organic materials have vital applications in organic light-emitting diodes (OLEDs), 2 organic photovoltaic devices (OPVs), 3 and organic thin-film transistors (OTFTs).
4 In OLEDs, a current of electrons flows through the device as electrons are injected into the LUMO of the layer at the cathode and withdrawn from the HOMO at the anode; radiation is emitted when electrons relax from the LUMO to the HOMO, and the frequency of the radiation depends on the HOMO−LUMO gap. 5 In organic solar cells, light absorption is interpreted in terms of electron excitations from the HOMO to the LUMO, and charge transport is achieved by electron transfers between the frontier orbitals of donors and acceptors. 6 The efficiency of OPVs depends on the HOMO−LUMO gap of the polymer donor, and the energy difference ΔE between the LUMOs of the donor and acceptor polymers must be optimized. A minimal ΔE, suggested to be 0.3 eV, is required to separate the energies of the excited state of the donor and the acceptor. 7,8 The effects of an electric field on the HOMO, the LUMO, and the HOMO−LUMO gap were suggested as determining parameters for the suitability of organic materials as a conducting channel in OTFTs. 9

Analogous to the well-established protocols for virtual screening in drug discovery projects, usually based on simulations of biomolecular docking, chemical similarities, pharmacophore searching, or QSAR models, various approaches have recently emerged for the virtual screening of new materials based on high-throughput quantum chemistry calculations. The Harvard Clean Energy Project has screened 2 million organic compounds using DFT calculations, including the energies of frontier orbitals, for the discovery of high-efficiency organic photovoltaic materials. 10 Ramprasad and co-workers generated and screened a virtual database of polymers with DFT calculations to efficiently identify advanced polymer dielectrics for capacitive energy storage applications 11 and trained kernel ridge regressions for on-demand prediction of the bandgap (Egap) and dielectric constants.
12 Using genetic algorithms and semiempirical calculations, O'Boyle et al. searched a space of synthetically accessible conjugated organic

jcim.6b00340 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
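
The frontier-orbital criteria described in the Introduction can be illustrated with a minimal Python sketch. It converts a HOMO−LUMO gap into a photon wavelength through the Planck relation and keeps donor/acceptor pairs whose LUMO offset ΔE reaches the suggested minimum of 0.3 eV. The `Molecule` class, the function names, and all orbital energies are hypothetical placeholders and are not taken from this work. Treating the emitted photon energy as equal to the HOMO−LUMO gap is a simplifying assumption that ignores exciton binding and relaxation effects.

```python
from dataclasses import dataclass

# h*c expressed in eV*nm, so that wavelength (nm) = HC_EV_NM / energy (eV).
HC_EV_NM = 1239.84

# Minimal donor/acceptor LUMO offset suggested in the text (eV).
MIN_LUMO_OFFSET_EV = 0.3


@dataclass(frozen=True)
class Molecule:
    """Hypothetical container for frontier orbital energies (eV)."""
    name: str
    homo_ev: float
    lumo_ev: float

    @property
    def gap_ev(self) -> float:
        # HOMO-LUMO gap as the difference of the two orbital energies.
        return self.lumo_ev - self.homo_ev


def gap_to_wavelength_nm(gap_ev: float) -> float:
    """Wavelength of a photon whose energy equals the gap (assumption)."""
    if gap_ev <= 0:
        raise ValueError("gap must be positive")
    return HC_EV_NM / gap_ev


def lumo_offset_ev(donor: Molecule, acceptor: Molecule) -> float:
    """Delta E between the donor LUMO and the acceptor LUMO."""
    return donor.lumo_ev - acceptor.lumo_ev


def viable_pairs(donors, acceptors, min_offset_ev=MIN_LUMO_OFFSET_EV):
    """Keep donor/acceptor pairs whose LUMO offset meets the threshold."""
    return [
        (d, a)
        for d in donors
        for a in acceptors
        if lumo_offset_ev(d, a) >= min_offset_ev
    ]


# Illustrative, made-up orbital energies (not computed values).
donors = [Molecule("D1", homo_ev=-5.0, lumo_ev=-3.0)]
acceptors = [
    Molecule("A1", homo_ev=-6.0, lumo_ev=-3.5),
    Molecule("A2", homo_ev=-5.8, lumo_ev=-3.1),
]

for d, a in viable_pairs(donors, acceptors):
    print(d.name, a.name, round(lumo_offset_ev(d, a), 2), "eV offset")
print("D1 gap wavelength (nm):", round(gap_to_wavelength_nm(donors[0].gap_ev), 1))
```

In a screening workflow of the kind discussed above, the orbital energies would come from quantum chemical calculations or from a trained model rather than from hand-entered values, and the threshold would be tuned to the device type.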