A central problem of machine learning is the following. Given data of the form (y_i, f(y_i) + ϵ_i)_{i = 1}^M, where the y_i are drawn randomly from an unknown (marginal) distribution μ* and the ϵ_i are random noise variables drawn from another unknown distribution, find an approximation to the unknown function f and estimate the error in terms of M. The approximation is typically accomplished by neural, RBF, or kernel networks, where the number of nonlinear units is determined on the basis of an estimate of the degree of approximation, but the actual approximation is computed using an optimization algorithm. Although this paradigm has clearly been extremely successful, we point out a number of perceived theoretical shortcomings, a perception reinforced by some recent observations about deep learning. We describe our efforts to overcome these shortcomings and to develop a more direct and elegant approach based on the principles of approximation theory and harmonic analysis.
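To make the problem statement concrete, the following is a minimal sketch of the conventional paradigm described above: noisy samples (y_i, f(y_i) + ϵ_i) are drawn, a small Gaussian RBF network is fitted by least squares (the optimization step), and the uniform error is measured for a given M. Everything specific here is an assumption made for illustration only: the target f(y) = sin(πy), the uniform distribution on [-1, 1] standing in for μ*, the Gaussian noise, the number of units, and all function names. It is not the approach developed in this work.

```python
import numpy as np


def target(y):
    # Hypothetical stand-in for the unknown function f.
    return np.sin(np.pi * y)


def sample_data(M, noise_std=0.1, rng=None):
    # Assumption: mu* is uniform on [-1, 1] and the noise is Gaussian.
    rng = np.random.default_rng() if rng is None else rng
    y = rng.uniform(-1.0, 1.0, size=M)
    eps = rng.normal(0.0, noise_std, size=M)
    return y, target(y) + eps


def rbf_features(y, centers, width):
    # One Gaussian unit per center, evaluated at every point of y.
    diff = y[:, None] - centers[None, :]
    return np.exp(-((diff / width) ** 2))


def fit_rbf_network(y, z, n_units=12):
    # Fixed, equally spaced centers; only the outer coefficients are
    # found by optimization (here, linear least squares).
    centers = np.linspace(-1.0, 1.0, n_units)
    width = 2.0 / (n_units - 1)
    Phi = rbf_features(y, centers, width)
    coeffs, *_ = np.linalg.lstsq(Phi, z, rcond=None)

    def approx(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return rbf_features(t, centers, width) @ coeffs

    return approx


def sup_error(M, seed=0):
    # Uniform error of the fitted network on a fine grid, for sample size M.
    rng = np.random.default_rng(seed)
    y, z = sample_data(M, rng=rng)
    approx = fit_rbf_network(y, z)
    grid = np.linspace(-1.0, 1.0, 401)
    return float(np.max(np.abs(approx(grid) - target(grid))))


if __name__ == "__main__":
    for M in (50, 200, 2000):
        print(M, sup_error(M))
```

Running the script prints the observed uniform error for a few values of M. In this sketch the number of units is fixed by hand, whereas in the paradigm described above it would be chosen from an estimate of the degree of approximation.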