Knowing the pKa of a compound gives insight into properties relevant to many industries, in particular the pharmaceutical industry during drug development. In light of this, we have used the theory of Quantum Chemical Topology (QCT) to provide ab initio descriptors that accurately predict pKa values for 228 carboxylic acids. This Quantum Topological Molecular Similarity (QTMS) study compared five increasingly expensive levels of theory and concluded that HF/6-31G(d) and B3LYP/6-311+G(2d,p) provided an accurate representation of the compounds studied. We created global and subset models for the carboxylic acids using Partial Least Squares (PLS), Support Vector Machines (SVM), and Radial Basis Function Neural Networks (RBFNN). The models were extensively validated using 4-, 7-, and 10-fold cross-validation, with the validation sets selected by systematic and random sampling. HF/6-31G(d) in conjunction with SVM provided the best statistics when taking into account the large increase in CPU time required to optimize the geometries at the B3LYP/6-311+G(2d,p) level. The SVM models provided an average q² value of 0.886 and an RMSE value of 0.293 for all the carboxylic acids, a q² of 0.825 and RMSE of 0.378 for the ortho-substituted acids, a q² of 0.923 and RMSE of 0.112 for the para- and meta-substituted acids, and a q² of 0.906 and RMSE of 0.268 for the aliphatic acids. Our method compares favorably to ACD/Laboratories, VCCLAB, SPARC, and ChemAxon's pKa prediction software based on the RMSE calculated by the leave-one-out method.
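
The validation scheme described in the abstract (k-fold cross-validation with systematic or random selection of the held-out sets, summarized by q² and RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the one-descriptor least-squares fit is a stand-in for the PLS, SVM, and RBFNN models, the data are synthetic, all function names are hypothetical, and q² is taken here as 1 - PRESS/SS_tot about the overall mean, which is one common definition and may differ from the one used in the study.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (assumed stand-in for PLS/SVM/RBFNN)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def systematic_folds(n, k):
    """Systematic sampling: fold i holds every k-th sample starting at i."""
    return [list(range(i, n, k)) for i in range(k)]


def random_folds(n, k, seed=0):
    """Random sampling: shuffle the indices, then deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, folds):
    """Return (q2, rmse) from predictions made on held-out folds only."""
    n = len(ys)
    predictions = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        slope, intercept = fit_line(train_x, train_y)
        for i in held_out:
            predictions[i] = slope * xs[i] + intercept
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot, math.sqrt(press / n)


if __name__ == "__main__":
    # Synthetic descriptor/pKa pairs, for illustration only.
    rng = random.Random(1)
    descriptor = [0.1 * i for i in range(40)]
    pka = [4.8 - 0.6 * x + rng.gauss(0.0, 0.1) for x in descriptor]
    for k in (4, 7, 10):
        for name, make_folds in (("systematic", systematic_folds),
                                 ("random", random_folds)):
            q2, rmse = cross_validate(descriptor, pka, make_folds(len(pka), k))
            print(f"{k:2d}-fold {name:10s} q2={q2:.3f} RMSE={rmse:.3f}")
```

Leave-one-out validation, used in the abstract for the comparison with other software, corresponds to the special case where the number of folds equals the number of compounds.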

Original publication

Journal article

J Chem Inf Model

Publication Date

Pages

1914 - 1924

Keywords

Carboxylic Acids, Computer Simulation, Hydrogen-Ion Concentration, Least-Squares Analysis, Models, Chemical, Molecular Structure, Protons, Software