November 3 2011
09:03
This page is dedicated to starting discussions about the article "Sparse Recovery with Brownian Sensing". Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
" We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of nonzero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of wellchosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix, for which good recovery properties are proven, independently on the number of sampling points N , even when the features are arbitrarily nonorthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate â of the parameter such that α − â_2 = O( η_2/ sqrt(N)), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a onedimensional curve selected according to the features. We report numerical experiments illustrating our method."
Find a systematic way to choose an appropriate curve to sample along.
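To make the sensing step concrete, here is a minimal numpy sketch of the idea described in the abstract, under toy assumptions of ours (Gaussian-bump features, a noiseless target, a Riemann-sum approximation of the stochastic integrals, and plain least squares standing in for the sparse-recovery step); it is not the paper's algorithm, only an illustration of how integrating features against Brownian increments yields a Gaussian sensing matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: f(t) = sum_k alpha_k * phi_k(t) on [0, 1], with alpha sparse.
K, n, M = 8, 500, 40          # features, grid points, Brownian motions
t = np.linspace(0.0, 1.0, n)
dt = t[1] - t[0]

# Non-orthogonal features: overlapping Gaussian bumps (our choice).
centers = np.linspace(0.1, 0.9, K)
Phi = np.exp(-((t[:, None] - centers[None, :]) ** 2) / 0.02)  # n x K

alpha = np.zeros(K)
alpha[[1, 5]] = [2.0, -1.0]   # 2-sparse parameter
f_vals = Phi @ alpha

# Brownian sensing: for M independent Brownian motions B_m, approximate
# the stochastic integrals A[m, k] = int phi_k dB_m and y[m] = int f dB_m
# by Riemann sums against the increments of B_m.
dB = rng.normal(scale=np.sqrt(dt), size=(n, M))
A = dB.T @ Phi                # M x K Gaussian sensing matrix
y = dB.T @ f_vals             # M sensing measurements (noiseless here)

# Noiseless toy: least squares stands in for the l1 recovery step.
alpha_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.max(np.abs(alpha_hat - alpha)))
```

Since y = A α exactly in this noiseless toy, least squares recovers α; the interesting regime of the paper (M below K, noisy observations, l1 recovery) needs the Gaussian structure of A that the Brownian integrals provide.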
Published by Odalric-Ambrym Maillard in Discussing articles
June 3 2011
13:28
This page is dedicated to starting discussions about the article "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider a KullbackLeiblerbased algorithm for the stochastic multiarmed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of \cite{Burnetas96}. Our contribution is to provide a finitetime analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finitetime analyses (like UCBtype algorithms). "
We provided a finite-time analysis of the (asymptotically optimal) Kinf-strategy in the case of finitely supported distributions. One could think that the extension to the case of general distributions is straightforward. However, this extension appears somewhat difficult (at least when using the current definition of Kinf) for the following reasons: (1) adapting the proof would require a non-asymptotic extension of Sanov's theorem; (2) some properties needed in the proof do not seem to be always satisfied for general distributions.
Extend this setting to general exponential families; for instance, histogram-based approximations of general distributions could be considered.
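As a small illustration of the KL-based optimism underlying this family of strategies, here is a hedged sketch for the Bernoulli special case (a KL-UCB-style index, a simpler relative of the Kinf-strategy; the exploration budget log(t) and the bisection tolerance are our choices): the index is the largest mean that remains statistically plausible given the empirical mean.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t):
    """Largest q >= p_hat with n_pulls * kl(p_hat, q) <= log(t), by bisection."""
    budget = math.log(t) / n_pulls
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) > budget:
            hi = mid
        else:
            lo = mid
    return lo

idx = kl_ucb_index(0.5, 10, 100)
print(idx)   # an optimistic upper confidence bound on the arm's mean
```

The index shrinks towards the empirical mean as the arm is pulled more, which is what drives the logarithmic regret of these strategies.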
Published by Odalric-Ambrym Maillard in Discussing articles
June 3 2011
13:22
This page is dedicated to starting discussions about the article "Adaptive bandits: Towards the best history-dependent strategy". Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider multiarmed bandit games with possibly adaptive opponents.
We introduce models \Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios:
(1) The opponent is constrained, i.e.~he provides rewards that are stochastic functions of equivalence classes defined by some model \theta^*\in\Theta . The regret is measured with respect to (w.r.t.) the best historydependent strategy.
(2) The opponent is arbitrary and we measure the regret w.r.t.~the best strategy among all mappings from classes to actions (i.e.~the best historyclassbased strategy) for the best model in \Theta .
This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations.
When \Theta=\{\theta\} , i.e.~only one model is considered, we derive \textit{tractable} algorithms achieving a \textit{tight} regret (at time T) bounded by \tilde O(\sqrt{TAC}) , where C is the number of classes of \theta . Now, when many models are available, all known algorithms achieving a nice regret O(\sqrt{T}) are unfortunately \textit{not tractable} and scale poorly with the number of models \Theta . Our contribution here is to provide {\em tractable} algorithms with regret bounded by T^{2/3}C^{1/3}\log(\Theta)^{1/2} ."
We do not know whether, in the case of a pool of models \Phi_\Theta, there exist tractable algorithms with \Phi_\Theta-regret better than T^{2/3} with logarithmic dependency on |\Theta|.
Extend this setting to Reinforcement Learning.
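As a toy illustration of learning scenario (1) with a single known model, here is a hedged numpy sketch (the parity-of-round model, the ε-greedy learner, and all constants are our choices, not the paper's algorithm): the constrained opponent draws rewards as a stochastic function of the history class, and the player simply runs one independent learner per equivalence class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Single model theta: the history class is the parity of the round (C = 2),
# and the constrained opponent draws rewards Bernoulli(p[class][arm]).
p = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # the best arm differs across classes
A, T, eps = 2, 4000, 0.1

counts = {c: np.zeros(A) for c in p}
sums = {c: np.zeros(A) for c in p}

for t in range(T):
    c = t % 2                                     # equivalence class of history
    if rng.random() < eps or counts[c].sum() < A:
        a = int(rng.integers(A))                  # explore
    else:
        a = int(np.argmax(sums[c] / np.maximum(counts[c], 1)))  # exploit
    r = float(rng.random() < p[c][a])
    counts[c][a] += 1
    sums[c][a] += r

best = {c: int(np.argmax(counts[c])) for c in p}
print(best)   # the most-pulled arm in each class
```

Playing the best history-dependent strategy here reduces to solving C independent bandit problems, which is exactly the structure the \tilde O(\sqrt{TAC}) bound exploits.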
Published by Odalric-Ambrym Maillard in Discussing articles
January 16 2011
11:23
This page is dedicated to starting discussions about the article "Finite-Sample Analysis of Bellman Residual Minimization".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual dened on a set of n states drawn i.i.d. from a distribution \mu, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual in \munorm with a rate of order O(1/\sqrt{n}). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions."
We have a poorer estimation rate of O(1/\sqrt{n}) instead of O(1/n), and it is an open question whether an improved rate for Bellman residuals can be obtained. The main reason is explained in Remark 1, p. 8.
Improve the rate of convergence by means of improved statistical techniques.
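To see what is being minimized, here is a hedged model-based sketch of quadratic Bellman residual minimization in a linear space (a toy 3-state chain of our choosing, with exact model rather than samples, and tabular features so that the space contains the true value function): the residual is linear in the weights, so its minimization is a plain least-squares problem.

```python
import numpy as np

# Toy 3-state Markov chain for a fixed policy, with exact model (P, r).
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])

# Linear value approximation V(s) = phi(s) . w.  Tabular features here,
# so the function space contains the true value function.
Phi = np.eye(3)

# Bellman residual: B(w) = Phi w - (r + gamma * P Phi w)
#                        = (Phi - gamma * P @ Phi) w - r.
# Minimizing ||B(w)||^2 is a plain least-squares problem in w.
M = Phi - gamma * P @ Phi
w = np.linalg.lstsq(M, r, rcond=None)[0]
V_hat = Phi @ w

V_true = np.linalg.solve(np.eye(3) - gamma * P, r)
print(np.max(np.abs(V_hat - V_true)))
```

In the paper's sampled setting, M and r are replaced by empirical quantities built from the generative model, and the O(1/\sqrt{n}) rate controls how fast the empirical residual approaches the true one.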
Published by Odalric-Ambrym Maillard in Discussing articles
January 16 2011
11:13
This page is dedicated to starting discussions about the article "Scrambled Objects for Least-Squares Regression".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider leastsquares regression using a randomly generated subspace G_P \subset F of finite dimension P, where F is a function space of infinite dimension, e.g. L2([0, 1]^d). G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multiresolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H_s([0, 1]^d). As a result, given N data, the leastsquares estimate \hat g built from P scrambled wavelets has excess risk f^* − \hat g2 P = O(f^*2_{H_s([0,1]^d)}(logN)/P + P(logN)/N) for target functions f^* \in H_s([0, 1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \tilde O (2dN^3/2 logN + N^2), where d is the dimension of the input space."
Similar methods based on random generation of features seem to have been used in practice for years, beginning with the perceptron. Thus a natural question is what light this work sheds on such practical ideas.
The proof here is not satisfactory, as it produces large, suboptimal constants. The main reason is that we used techniques from "A Distribution-Free Theory of Nonparametric Regression" to control the concentration of the quadratic process that appears; using more recent techniques would certainly yield tighter bounds.
In the case where the initial set of features comes from Carleman operators, there is a deep connection with RKHS theory. Thus we would like to use proof techniques from kernel methods to derive alternative bounds with improved constants and to further improve our understanding of this concept.
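The random-feature construction itself fits in a few lines. Here is a hedged numpy sketch under toy assumptions of ours (a small Fourier dictionary standing in for the multiresolution wavelet dictionary, P = F so the random subspace coincides with the span of the dictionary, and a target inside that span):

```python
import numpy as np

rng = np.random.default_rng(2)

# Base dictionary on [0, 1]: a few Fourier features, used here as a
# stand-in for the multiresolution wavelet dictionary of the paper.
N, F, P = 200, 5, 5
x = rng.uniform(size=N)

def dictionary(x):
    # constant + cos/sin pairs at frequencies 1 and 2 (F = 5 columns)
    return np.column_stack([
        np.ones_like(x),
        np.cos(2 * np.pi * x), np.sin(2 * np.pi * x),
        np.cos(4 * np.pi * x), np.sin(4 * np.pi * x),
    ])

# "Scrambled" features: P random Gaussian combinations of the F base
# features; for P = F the random subspace G_P coincides with the span
# of the dictionary almost surely.
W = rng.normal(size=(F, P))
G = dictionary(x) @ W                  # N x P random-feature design

y = np.sin(2 * np.pi * x)              # target lies in the span
coef = np.linalg.lstsq(G, y, rcond=None)[0]
mse = np.mean((G @ coef - y) ** 2)
print(mse)
```

The interesting regime of the paper is of course P much smaller than the (infinite) dictionary, where the (\log N)/P approximation term in the excess-risk bound comes into play.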
Published by Odalric-Ambrym Maillard in Discussing articles
January 16 2011
11:06
This page is dedicated to starting discussions about the article "LSTD with Random Projections".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider the problem of reinforcement learning in highdimensional spaces when the number of features is bigger than the number of samples. In particular, we study the leastsquares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting leastsquares policy iteration (LSPI) algorithm."
It would be interesting to build an equivalent analysis using alternative methods such as l1- and l2-regularized algorithms.
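For concreteness, here is a hedged model-based sketch of LSTD after a random projection, on a toy chain MDP of our choosing (tabular "high-dimensional" features, exact model instead of samples, and d = D so the projection preserves the span and the LSTD fixed point equals the true value function):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy chain MDP (fixed policy): exact model, tabular "high-dimensional"
# features Phi, projected to d dimensions by a random Gaussian matrix.
gamma = 0.9
P = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.7, 0.3],
              [0.3, 0.0, 0.7]])
r = np.array([0.0, 0.0, 1.0])
D = 3
Phi = np.eye(D)

d = 3                               # here d = D, so the span is preserved
A_proj = rng.normal(size=(d, D)) / np.sqrt(d)
Psi = Phi @ A_proj.T                # projected features, one row per state

# Model-based LSTD in the projected space:
#   solve  Psi^T (Psi - gamma * P Psi) w = Psi^T r.
M = Psi.T @ (Psi - gamma * P @ Psi)
b = Psi.T @ r
w = np.linalg.solve(M, b)
V_hat = Psi @ w

V_true = np.linalg.solve(np.eye(D) - gamma * P, r)
print(np.max(np.abs(V_hat - V_true)))
```

The regime analyzed in the paper is d much smaller than D, where the random projection introduces a controlled excess approximation error in exchange for a much smaller estimation problem.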
Published by Odalric-Ambrym Maillard in Discussing articles
July 11 2010
20:40
This page is dedicated to starting discussions about the article "Online Learning in Adversarial Lipschitz Environments".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We provide a class of algorithms with cumulative regret upper bounded by O(\sqrt{dT ln(\lambda)}) where d is the dimension of the search space, T the time horizon, and \lambda the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of nonlinear regressors/classifiers, such as neural networks."
It would be interesting to analyse general PMC methods and get a full understanding of the speed of convergence towards the distribution targeted by the method. This involves, for instance, further developing the nice results of the books (Feynman-Kac Formulae, Pierre Del Moral, 2004) and (Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, N. Gordon, 2001).
Develop a deeper analysis of PMC methods that can be applied to this paper or to others.
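As a minimal, hedged illustration of the full-information side of this setting (the discretization of the search space, the alternating adversary, and the learning rate are all our choices, and the particle methods of the paper are replaced by an explicit grid), the following numpy sketch runs exponentially weighted averages against 1-Lipschitz reward functions on [0, 1]:

```python
import numpy as np

# Exponentially weighted averages over a uniform grid of [0, 1]: a
# simplified full-information version of playing against Lipschitz rewards.
K, T = 50, 500
grid = np.linspace(0.0, 1.0, K)
eta = np.sqrt(8 * np.log(K) / T)     # standard EWA learning rate

log_w = np.zeros(K)
gain = 0.0
cum = np.zeros(K)
for t in range(T):
    # The adversary picks a 1-Lipschitz reward function on [0, 1].
    peak = 0.3 if t % 2 == 0 else 0.7
    rewards = 1.0 - np.abs(grid - peak)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    gain += p @ rewards              # expected reward of the forecaster
    cum += rewards
    log_w += eta * rewards

regret = cum.max() - gain
print(regret)
```

Particle methods enter precisely when maintaining a weight per grid point becomes infeasible in higher dimension, which is where the open questions above about PMC convergence matter.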
Published by Odalric-Ambrym Maillard in Discussing articles
December 10 2009
23:12
This page is dedicated to starting discussions about the article "Compressed Least Squares Regression".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the highdimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to LeastSquares (LS) regression and discuss the excess risk and numerical complexity of the resulting ``Compressed Least Squares Regression'' (CLSR) in terms of N, K, and M. When we choose M=O(\sqrt{K}), we show that CLSR has estimation error of order O(\log K / \sqrt{K}) and provides an interesting alternative to usual penalization techniques where one first selects a random subspace of lower dimension and then performs an empirical risk minimizer in this subspace."
The approximation term appearing in the bound is crucial for the method to be useful, and it seems to appear in many works dealing with regression. It would be nice to make its role in previous work more explicit, to find possibly different interpretations, and to use approximation theory to determine under which assumptions it can be controlled.
Future work: possible extensions include:
Compressed kernel regression.
Compressed density estimation.
New interpretation of compressed learning.
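The compression step described in the abstract can be sketched in a few lines; this is a hedged toy of ours (a sparse synthetic target, an in-sample error comparison, and no penalization), not the paper's full procedure:

```python
import numpy as np

rng = np.random.default_rng(4)

# Compressed least squares: project N-dimensional features to M << N
# dimensions with a Gaussian random matrix, then run ordinary least
# squares in the compressed domain.
K_data, N_feat = 300, 100            # K data points, N features
M = int(np.sqrt(K_data))             # M = O(sqrt(K)) as in the abstract

X = rng.normal(size=(K_data, N_feat))          # initial-domain design
beta = np.zeros(N_feat)
beta[:3] = [1.0, -2.0, 0.5]
y = X @ beta + 0.1 * rng.normal(size=K_data)

A = rng.normal(size=(M, N_feat)) / np.sqrt(M)  # random projection
Xc = X @ A.T                                   # compressed design (K x M)
w = np.linalg.lstsq(Xc, y, rcond=None)[0]

# Compare in-sample errors of compressed vs. full least squares.
mse_compressed = np.mean((Xc @ w - y) ** 2)
w_full = np.linalg.lstsq(X, y, rcond=None)[0]
mse_full = np.mean((X @ w_full - y) ** 2)
print(mse_compressed, mse_full)
```

The compressed fit is necessarily worse in-sample (its predictions live in a restricted subspace of the initial domain), which is the approximation error; the point of the analysis is that the estimation error it saves can dominate when N is large relative to K.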
Published by Odalric-Ambrym Maillard in Discussing articles
November 15 2009
21:18
This page is dedicated to starting discussions about the article "Complexity versus Agreement for Many Views".
Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
"The paper considers the problem of semisupervised multiview classification, where each view corresponds to a Reproducing Kernel Hilbert Space. An algorithm based on coregularization methods with extra penalty terms reflecting smoothness and general agreement properties is proposed. We first provide explicit tight control on the Rademacher
(L1 ) complexity of the corresponding class of learners for arbitrary many views, then give the asymptotic behavior of the bounds when the coregularization term increases, making explicit the relation between consistency of the views and reduction of the search space. Third we provide an illustration through simulations on toy examples. With many views,
a parameter selection procedure is based on the stability approach with clustering and localization arguments. The last new result is an explicit bound on the L2 diameter of the class of functions."
It would be interesting to compare the selection method proposed here, which uses a data-driven penalty, with standard model selection methods (P. Massart, S. Arlot, etc.).
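To make the co-regularization idea concrete, here is a hedged two-view linear sketch (linear features instead of RKHSs, two views instead of many, and all data and penalty weights are our choices): a joint least-squares objective with an agreement penalty μ that pulls the two views' predictions together.

```python
import numpy as np

rng = np.random.default_rng(5)

def coreg_fit(X1, X2, y, lam, mu):
    """Joint least squares over two views with an agreement penalty:
       min ||X1 w1 - y||^2 + ||X2 w2 - y||^2
           + lam (||w1||^2 + ||w2||^2) + mu ||X1 w1 - X2 w2||^2."""
    d1, d2 = X1.shape[1], X2.shape[1]
    M = np.zeros((d1 + d2, d1 + d2))
    M[:d1, :d1] = (1 + mu) * X1.T @ X1 + lam * np.eye(d1)
    M[d1:, d1:] = (1 + mu) * X2.T @ X2 + lam * np.eye(d2)
    M[:d1, d1:] = -mu * X1.T @ X2
    M[d1:, :d1] = -mu * X2.T @ X1
    b = np.concatenate([X1.T @ y, X2.T @ y])
    w = np.linalg.solve(M, b)
    return w[:d1], w[d1:]

n, d = 100, 4
X1 = rng.normal(size=(n, d))
X2 = X1 + 0.5 * rng.normal(size=(n, d))   # a noisy second view
y = X1 @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

def disagreement(mu):
    w1, w2 = coreg_fit(X1, X2, y, lam=1e-3, mu=mu)
    return np.mean((X1 @ w1 - X2 @ w2) ** 2)

print(disagreement(0.0), disagreement(10.0))
```

Increasing μ shrinks the disagreement between the views, which is the mechanism behind the paper's observation that agreement reduces the effective search space (and hence the complexity term).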
Published by Odalric-Ambrym Maillard in Discussing articles