Fully connected networks are roughly described by two structural parameters: a depth L and a width n. It is well known that, with some important caveats on the scale at initialization, in the limit of infinite n at fixed L, neural networks at the start of training are a free (i.e. Gaussian) field, and network optimization reduces to kernel regression with the so-called neural tangent kernel (NTK). This is a striking and insightful simplification of infinitely overparameterized networks. However, in this particular infinite width limit neural networks cannot learn data-dependent features, an ability that is perhaps their most important empirical property. To understand feature learning one must therefore study networks at finite width. In this talk I will do just that. I will report on recent joint work with Dan Roberts and Sho Yaida (done at a physics level of rigor) and on some more mathematical ongoing work which allows one to compute, perturbatively in 1/n and recursively in L, all correlation functions of the neural network function (and its derivatives) at initialization. An important upshot is the emergence of L/n, instead of simply L, as the effective network depth. This cut-off parameter provably measures both the extent of feature learning and the distance at initialization to the large n free theory.
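As a rough numerical illustration of why L/n, rather than L alone, controls the deviation from the Gaussian theory, the sketch below samples random networks at initialization and measures how much the normalized squared norm of the last hidden layer fluctuates. Everything specific here is an assumption made for illustration and is not taken from the work described above: the ReLU nonlinearity, zero biases, weight variance 2/n, the fixed all-ones input, and the function names. Under those assumptions each layer multiplies the second moment of the normalized squared norm by exactly (1 + 5/n), so the relative variance after L layers is (1 + 5/n)^L - 1, which is approximately exp(5L/n) - 1 and vanishes only when L/n tends to 0.

```python
import numpy as np


def relative_variance_of_norm(width, depth, trials=4000, batch=500, seed=0):
    """Monte Carlo estimate of Var[q] / E[q]^2, where q = |x_L|^2 / width is
    the normalized squared norm of the last hidden layer of a random ReLU
    network (assumed setup: no biases, weight variance 2/width) evaluated
    on one fixed input."""
    rng = np.random.default_rng(seed)
    x0 = np.ones(width)  # fixed input, chosen so that q = 1 at layer 0
    samples = []
    for start in range(0, trials, batch):
        b = min(batch, trials - start)
        x = np.tile(x0, (b, 1))  # shape (b, width): one copy per random network
        for _ in range(depth):
            # Independent weight matrix for every trial in the batch.
            w = rng.normal(0.0, np.sqrt(2.0 / width), size=(b, width, width))
            x = np.maximum(np.einsum("bij,bj->bi", w, x), 0.0)  # ReLU layer
        samples.append((x ** 2).mean(axis=1))
    q = np.concatenate(samples)
    return q.var() / q.mean() ** 2


def predicted_relative_variance(width, depth):
    """Exact value under the assumptions above: each layer leaves E[q]
    unchanged and multiplies E[q^2] by (1 + 5/width)."""
    return (1.0 + 5.0 / width) ** depth - 1.0


if __name__ == "__main__":
    # Two networks with different depth and width but the same ratio L/n = 1/16.
    for width, depth in [(32, 2), (64, 4)]:
        estimate = relative_variance_of_norm(width, depth)
        exact = predicted_relative_variance(width, depth)
        print(f"n={width:3d} L={depth:2d} L/n={depth / width:.4f} "
              f"simulated={estimate:.3f} predicted={exact:.3f}")
```

Doubling both the depth and the width leaves the predicted fluctuation nearly unchanged, whereas doubling the depth at fixed width roughly doubles it when L/n is small. This is only a single observable at initialization; it says nothing by itself about feature learning during training.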