November 3, 2011, 09:03

Sparse Recovery with Brownian Sensing

This page is dedicated to starting discussions about the article "Sparse Recovery with Brownian Sensing". Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

" We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix, for which good recovery properties are proven, independently on the number of sampling points N , even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate â of the parameter such that ||α − â||_2 = O( ||η||_2/ sqrt(N)), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method."

• Future work:

Find a systematic way to choose an appropriate curve to sample along.

June 3, 2011, 13:28

A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences

This page is dedicated to starting discussions about the article "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms). "

• Discussion:

We provided a finite-time analysis of the (asymptotically optimal) Kinf-strategy in the case of finitely supported distributions. One might think that the extension to the case of general distributions is straightforward. However, this extension appears somewhat difficult (at least when using the current definition of Kinf), for the following reasons: (1) adapting the proof would require a non-asymptotic extension of Sanov's theorem; (2) some properties needed in the proof do not seem to be satisfied for general distributions.

• Future work:

Extend this setting to general exponential families; for instance, histogram-based approximations of general distributions could be considered.

June 3, 2011, 13:22

Adaptive bandits: Towards the best history-dependent strategy

This page is dedicated to starting discussions about the article "Adaptive bandits: Towards the best history-dependent strategy". Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider multi-armed bandit games with possibly adaptive opponents.
We introduce models  \Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios:
(1) The opponent is constrained, i.e.~he provides rewards that are stochastic functions of equivalence classes defined by some model  \theta^*\in\Theta . The regret is measured with respect to (w.r.t.) the best history-dependent strategy.
(2) The opponent is arbitrary and we measure the regret w.r.t.~the best strategy among all mappings from classes to actions (i.e.~the best history-class-based strategy) for the best model in  \Theta .
This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations.

When  \Theta=\{\theta\} , i.e.~only one model is considered, we derive \textit{tractable} algorithms achieving a \textit{tight} regret (at time T) bounded by  \tilde O(\sqrt{TAC}) , where  C is the number of classes of  \theta . Now, when many models are available, all known algorithms achieving a nice regret  O(\sqrt{T}) are unfortunately \textit{not tractable} and scale poorly with the number of models  |\Theta| . Our contribution here is to provide {\em tractable} algorithms with regret bounded by  T^{2/3}C^{1/3}\log(|\Theta|)^{1/2} ."
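For intuition on the single-model case Θ = {θ}, one natural tractable scheme is to run an independent bandit algorithm inside each equivalence class of the history. The sketch below uses a hypothetical memory-1 class map and plain UCB instances; the actual algorithms and their tuning in the paper may differ:

```python
import math
import random
from collections import defaultdict

random.seed(0)
A = 3                                   # number of actions

def history_class(history):
    """Hypothetical model theta: the class is the last action played
    (a memory-1 equivalence relation on histories)."""
    return history[-1] if history else -1

counts = defaultdict(lambda: [0] * A)   # per-class pull counts
sums = defaultdict(lambda: [0.0] * A)   # per-class reward sums

def ucb_action(c, t):
    # Play each action once in class c, then use a UCB index.
    for a in range(A):
        if counts[c][a] == 0:
            return a
    return max(range(A), key=lambda a: sums[c][a] / counts[c][a]
               + math.sqrt(2 * math.log(t) / counts[c][a]))

history = []
for t in range(1, 2001):
    c = history_class(history)
    a = ucb_action(c, t)
    # Hypothetical constrained opponent: reward depends only on (class, action).
    r = 1.0 if random.random() < 0.3 + 0.2 * (a == (c + 1) % A) else 0.0
    counts[c][a] += 1
    sums[c][a] += r
    history.append(a)
```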

• Discussion:

We do not know whether, in the case of a pool of models \Phi_\Theta, there exist tractable algorithms with \Phi_\Theta-regret better than T^{2/3} with logarithmic dependency on |\Theta|.

• Future work:

Extend this setting to Reinforcement Learning.

January 16, 2011, 11:23

Finite-Sample Analysis of Bellman Residual Minimization

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual dened on a set of n states drawn i.i.d. from a distribution \mu, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual in \mu-norm with a rate of order O(1/\sqrt{n}). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions."

• Discussion:

We obtain a poorer estimation rate of O(1/\sqrt{n}) instead of O(1/n), and it is an open question whether an improved rate for Bellman residuals can be obtained. The main reason is explained in Remark 1, p. 8.

• Future work:

Improve the rate of convergence by means of improved statistical techniques.

January 16, 2011, 11:13

Scrambled Objects for Least-Squares Regression

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider least-squares regression using a randomly generated subspace G_P \subset F of finite dimension P,  where F is a function space of infinite dimension, e.g. L2([0, 1]^d). G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H_s([0, 1]^d). As a result, given N data, the least-squares estimate \hat g built from P scrambled wavelets has excess risk ||f^* − \hat g||2 P = O(||f^*||2_{H_s([0,1]^d)}(logN)/P + P(logN)/N) for target functions f^* \in H_s([0, 1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \tilde O (2dN^3/2 logN + N^2), where d is the dimension of the input space."

• Discussion:

It appears that similar methods using random generation of features have been used in practice for years, beginning with the perceptron. A natural question is thus what light this work sheds on such practical ideas.

The proof here is not satisfactory, as it produces large, suboptimal constants. The main reason is that we used techniques from "A Distribution-Free Theory of Nonparametric Regression" to control the concentration of the quadratic process that appears; using more recent techniques would certainly yield tighter bounds.

• Future work:

In the case where the initial set of features comes from Carleman operators, there is a deep connection with RKHS. We would thus like to use proof techniques from kernel methods to derive alternative bounds with improved constants and to further improve our understanding of this connection.

January 16, 2011, 11:06

LSTD with Random Projections

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm."

• Discussion:

It would be interesting to build an equivalent analysis using alternative methods, such as l1- and l2-regularized algorithms.

July 11, 2010, 20:40

Online Learning in Adversarial Lipschitz Environments

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We  provide a class of algorithms with cumulative regret upper bounded by O(\sqrt{dT ln(\lambda)}) where d is the dimension of the search space, T the time horizon, and \lambda the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of non-linear regressors/classifiers, such as neural networks."

• Discussion:

It would be interesting to analyse general PMC methods and gain a full understanding of their speed of convergence towards the distribution targeted by the method. This involves, for instance, further developing the nice results of the books Feynman-Kac Formulae (Pierre Del Moral, 2004) and Sequential Monte Carlo Methods in Practice (A. Doucet, N. de Freitas, N. Gordon, 2001).

• Future work:

Developing a deeper analysis of PMC methods, applicable to this paper or to others.

December 10, 2009, 23:12

Compressed Least Squares Regression

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting Compressed Least Squares Regression'' (CLSR) in terms of N, K, and M. When we choose M=O(\sqrt{K}),  we show that CLSR has estimation error of order O(\log K / \sqrt{K}) and provides an interesting alternative to usual penalization techniques where one first selects a random subspace of lower dimension and then performs an empirical risk minimizer in this subspace."

• Discussion:

The term appearing in the bound is crucial for the method to be useful, and it seems to appear in many works dealing with regression. It would thus be nice to make it more explicit in previous work, find possibly different interpretations of it, and of course use approximation theory to determine under which assumptions it can be controlled.

• Future work: Possible extensions include

Compressed kernel regression.
Compressed density estimation.
New interpretation of compressed learning.

November 15, 2009, 21:18

Complexity versus Agreement for Many Views

Feel free to post any comment, suggestion, question, correction, or extension; I will enjoy discussing it with you.

• Abstract:

"The paper considers the problem of semi-supervised multi-view classification, where each view corresponds to a Reproducing Kernel Hilbert Space. An algorithm based on co-regularization methods with extra penalty terms reflecting smoothness and general agreement properties is proposed. We first provide explicit tight control on the Rademacher
(L1 ) complexity of the corresponding class of learners for arbitrary many views, then give the asymptotic behavior of the bounds when the co-regularization term increases, making explicit the relation between consistency of the views and reduction of the search space. Third we provide an illustration through simulations on toy examples. With many views,
a parameter selection procedure is based on the stability approach with clustering and localization arguments. The last new result is an explicit bound on the L2 -diameter of the class of functions.
"

• Future work:

It would be interesting to compare the selection method proposed here, which uses a data-driven penalty, with standard model selection methods (P. Massart, S. Arlot, etc.).


Overview

Senior Researcher

The Technion
Faculty of Electrical Engineering,
Fishbach Bldg, Rm. 462
32000 Haifa, ISRAEL
Tel: +972 (0)48293638