This page is dedicated to start discussions about the article "Scrambled Objects for Least-Squares Regression".
Feel free to post any comment, suggestion, question, correction, or extension... I will enjoy discussing this with you.
- Abstract:
"We consider least-squares regression using a randomly generated subspace G_P \subset F of finite dimension P, where F is a function space of infinite dimension, e.g. L2([0, 1]^d). G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H_s([0, 1]^d). As a result, given N data, the least-squares estimate \hat g built from P scrambled wavelets has excess risk ||f^* − \hat g||2 P = O(||f^*||2_{H_s([0,1]^d)}(logN)/P + P(logN)/N) for target functions f^* \in H_s([0, 1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \tilde O (2dN^3/2 logN + N^2), where d is the dimension of the input space."
- Discussion:
Similar methods based on randomly generated features seem to have been used in practice for years, beginning with the perceptron. A natural question, then, is what light this work sheds on such practical ideas.
The proof given here is not fully satisfactory, as it produces large, suboptimal constants. The main reason is that we used techniques from "A Distribution-Free Theory of Nonparametric Regression" to control the concentration of the quadratic process that appears; using more recent techniques would certainly yield tighter bounds.
- Future work:
When the initial set of features comes from Carleman operators, there is a deep connection with RKHS theory. We would therefore like to use proof techniques from kernel methods to derive alternative bounds with improved constants, and to further improve our understanding of this construction.