August 6 2014, 16:51
Rémi Bardenet, Odalric-Ambrym Maillard.
In Bernoulli Journal, 2014.
Abstract: 
Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures, few concentration results are known for the setting of sampling without replacement from a finite population. Until now, the best general concentration inequality has been a Hoeffding inequality due to Serfling (1974). In this paper, we first improve on the fundamental result of Serfling (1974), and further extend it to obtain a Bernstein concentration bound for sampling without replacement. We then derive an empirical version of our bound that does not require the variance to be known to the user. 
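To make the role of the without-replacement correction concrete, here is a minimal sketch (not the paper's Bernstein bound, only the Hoeffding-Serfling baseline it improves on; the function name and the [a, b] parametrization are ours):

```python
import math

def serfling_hoeffding_width(n, N, delta, a=0.0, b=1.0):
    """(1 - delta) confidence half-width for the mean of n draws without
    replacement from a population of size N with values in [a, b],
    using Serfling's (1974) factor 1 - (n - 1)/N."""
    rho = 1.0 - (n - 1) / N  # shrinks as the sample exhausts the population
    return (b - a) * math.sqrt(rho * math.log(1.0 / delta) / (2.0 * n))

# The correction tightens the interval well beyond the with-replacement rate:
for n in (10, 100, 900):
    print(n, round(serfling_hoeffding_width(n, N=1000, delta=0.05), 4))
```

With replacement the width would only decay as 1/sqrt(n); the factor rho makes it much smaller once n is a sizeable fraction of N.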
You can download the paper from the Bernoulli website (here) or from the HAL online open repository* (here).

* The HAL open-access online archive seeks to make research results available to the widest audience, independently of major publishers, and cooperates with other large international archives such as arXiv.
Published by Odalric-Ambrym Maillard in Discussing articles
June 16 2014, 14:39
Odalric-Ambrym Maillard, Shie Mannor.
In International Conference on Machine Learning, 2014.
Abstract: 
We consider a multi-armed bandit problem where the reward distributions are indexed by two sets, one for arms and one for types, and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types belong to the same underlying cluster, but the type's identity is unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret guarantees. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.
You can download the paper from the JMLR website (here) or from the HAL online open repository (here).
You can download the Java code used to generate the experiments here.

October 1 2013, 13:30
Odalric-Ambrym Maillard.
In Algorithmic Learning Theory, 2013.
Abstract: 
We study a variant of the standard stochastic multi-armed bandit problem in which one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we study the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high-probability bound on its regret.
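For readers unfamiliar with coherent risk measures, conditional value-at-risk (CVaR) is a standard example of the kind of criterion targeted here. The sketch below only illustrates the empirical criterion on a batch of samples, not the paper's bandit index:

```python
def empirical_cvar(samples, alpha):
    """Empirical conditional value-at-risk at level alpha: the mean of the
    worst alpha-fraction of observed rewards (lower tail)."""
    xs = sorted(samples)
    k = max(1, int(alpha * len(xs)))  # number of tail samples kept
    return sum(xs[:k]) / k

rewards = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 1.0]
print(empirical_cvar(rewards, 0.2))  # average of the two worst rewards
```

A risk-averse learner would rank arms by such a tail criterion rather than by the plain empirical mean.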
You can download the paper from the ALT website (here) or from the HAL online open repository (here).
Bibtex: 
@incollection{Maillard2013,
  year = {2013},
  isbn = {978-3-642-40934-9},
  booktitle = {Algorithmic Learning Theory},
  volume = {8139},
  series = {Lecture Notes in Computer Science},
  editor = {Jain, Sanjay and Munos, Rémi and Stephan, Frank and Zeugmann, Thomas},
  title = {Robust Risk-Averse Stochastic Multi-armed Bandits},
  publisher = {Springer Berlin Heidelberg},
  author = {Maillard, Odalric-Ambrym},
  pages = {218--233}
}

September 1 2013, 16:35
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.
Abstract: 
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families, and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state of the art.
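As a rough sketch of the index, here is the Bernoulli instance of kl-UCB computed by bisection (we use the plain log t exploration term; the paper's analysis works with a slightly refined threshold):

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp away from {0, 1} to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, iters=50):
    """kl-UCB index for a Bernoulli arm: the largest q >= mean such that
    pulls * kl(mean, q) <= log(t), found by bisection."""
    target = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= target:
            lo = mid  # mid still satisfies the constraint: move up
        else:
            hi = mid
    return lo
```

At each round the policy pulls the arm with the largest index; the kl-based confidence region is what makes the bound match the Lai-Robbins rate for Bernoulli rewards.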
You can download the paper from the Annals of Statistics (here) or from the HAL online open repository (here).
Bibtex: 
@article{CaGaMaMuSt2013,
  AUTHOR = {Olivier Capp\'{e} and Aur\'{e}lien Garivier and Odalric-Ambrym Maillard and R\'{e}mi Munos and Gilles Stoltz},
  TITLE = {Kullback-Leibler upper confidence bounds for optimal sequential allocation},
  JOURNAL = {Ann. Statist.},
  FJOURNAL = {Annals of Statistics},
  YEAR = {2013},
  VOLUME = {41},
  NUMBER = {3},
  PAGES = {1516--1541},
  ISSN = {0090-5364},
  DOI = {10.1214/13-AOS1119}
}

March 14 2013, 10:57
Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko, Ronald Ortner.
In International Conference on Artificial Intelligence and Statistics, 2013.
Abstract: 
We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, the BLB algorithm has recently been proposed, which achieves regret of order T^{2/3}, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the BLB regret bound suffers from an additive term that can be exponential in the diameter of the MDP involved, since the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.
You can download the paper from the JMLR website (here) or from the HAL online open repository (soon).
Bibtex: 
@InProceedings{Nguyen13,
  author = {Nguyen, P. and Maillard, O. and Ryabko, D. and Ortner, R.},
  title = {Competing with an Infinite Set of Models in Reinforcement Learning},
  booktitle = {AISTATS},
  series = {JMLR W\&CP 31},
  address = {Arizona, USA},
  year = {2013},
  pages = {463--471}
}
Related publications: 
Optimal regret bounds for selecting the state representation in reinforcement learning. Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko. In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.
Selecting the State-Representation in Reinforcement Learning. Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos. In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS '11, pages 2627–2635, 2011.

February 28 2013, 19:30
Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.
Abstract: 
We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order O(T^{2/3}) with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after T time steps is O(\sqrt{T}), with all constants reasonably small. This is optimal in T since O(\sqrt{T}) is the optimal regret in the setting of learning in a (single discrete) MDP. 
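To see why the improvement from T^{2/3} to \sqrt{T} matters, compare the two growth rates at a few horizons (pure arithmetic, constants ignored):

```python
# Leading regret terms at a few horizons T, constants ignored.
for T in (10**3, 10**5, 10**7):
    print(f"T={T}: T^(2/3)={T ** (2 / 3):,.0f}  sqrt(T)={T ** 0.5:,.0f}")
```

At T = 10^7 the gap is already a factor of about 150, which is why matching the \sqrt{T} rate of single-MDP learning is the natural target.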
You can download the paper from the ICML website (here), and a corrected version from the HAL online open repository (here) or from arXiv (here) (the correction is minor and changes only a constant 2 into 2\sqrt{2}). See also a talk presenting this work here.
Bibtex: 
@inproceedings{MaillardNguyenOrtnerRyabko13,
  author = {Maillard, O. and Nguyen, P. and Ortner, R. and Ryabko, D.},
  title = {Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning},
  booktitle = {International conference on Machine Learning},
  series = {JMLR W\&CP 28(1)},
  address = {Atlanta, USA},
  year = {2013},
  pages = {543--551}
}

February 28 2013, 19:22
By Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS '12, 2012.
Abstract: 
This paper aims to take a step towards making the term ''intrinsic motivation'' from reinforcement learning theoretically well-founded, focusing on curiosity-driven learning. To that end, we consider the setting where, given a fixed partition \P of a continuous space \X and an unknown process \nu defined on \X, we are asked to sequentially decide which cell of the partition to select, as well as where to sample \nu in that cell, in order to minimize a loss function inspired by previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst-case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case where each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used to solve it. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C), is provided together with a finite-time regret analysis.
You can download the paper from the NIPS website (here) or from the HAL online open repository (here).
Bibtex: 
@inproceedings{Maillard12,
  title = {Hierarchical Optimistic Region Selection driven by Curiosity},
  author = {Odalric-Ambrym Maillard},
  booktitle = {Advances in Neural Information Processing Systems 25},
  editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
  pages = {1457--1465},
  year = {2012},
  url = {http://books.nips.cc/papers/files/nips25/NIPS2012_0697.pdf}
}

February 28 2013, 19:05
By Alexandra Carpentier and Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS '12, 2012.
Abstract: 
In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case where the arms are the cells of a finite partition P of a continuous sampling space X \subset \Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P, and P is fixed beforehand) in order to keep the local quadratic approximation error on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when P is chosen to be of minimax-optimal order on the class of \alpha-Hölder functions.
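A toy sketch of the allocation principle behind this line of work: greedily sample the cell with the largest variance-to-count ratio, so the per-cell errors sigma_k^2 / T_k stay balanced. Here the standard deviations are given; the actual algorithm replaces them by upper confidence bounds:

```python
def allocate(cell_stds, budget):
    """Greedy allocation: repeatedly sample the cell with the largest
    estimated variance-to-count ratio sigma_k^2 / T_k, which keeps the
    per-cell quadratic errors roughly equal."""
    counts = [1] * len(cell_stds)  # one initial sample per cell
    for _ in range(budget - len(cell_stds)):
        scores = [s * s / n for s, n in zip(cell_stds, counts)]
        counts[scores.index(max(scores))] += 1
    return counts

print(allocate([1.0, 2.0], 30))  # counts end up roughly proportional to variance
```

Equalizing sigma_k^2 / T_k means the final counts are proportional to the variances, which is exactly the budget split that makes all cells equally well estimated.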
You can download the paper from the NIPS website (here) or from the HAL online open repository (here).
Bibtex: 
@inproceedings{CarpentierMaillard12,
  title = {Online allocation and homogeneous partitioning for piecewise constant mean-approximation},
  author = {Carpentier, Alexandra and Maillard, Odalric-Ambrym},
  booktitle = {Advances in Neural Information Processing Systems 25},
  editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
  pages = {1970--1978},
  year = {2012}
}

February 28 2013, 17:44
By Odalric-Ambrym Maillard and Rémi Munos.
In Journal of Machine Learning Research, 2012, vol. 13, pp. 2735-2772.
Abstract: 
We investigate a method for regression that makes use of a randomly generated subspace G_P (of finite dimension P) of a given large (possibly infinite-dimensional) function space F, for example L_{2}([0,1]^d). G_P is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random-projection method shares with the theory of RKHS and Gaussian objects, prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in G_P rather than in F, and derive excess risk bounds for a specific regression algorithm (least-squares regression in G_P). This paper stresses the motivation to study such methods; the analysis is thus kept simple for purposes of explanation and leaves room for future developments.
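A minimal numerical sketch of the idea, using plain Gaussian projections of a finite-dimensional linear model as a stand-in for the random features built from basis functions of F (dimensions and noise level are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n points with d-dimensional features (stand-ins for
# evaluations of basis functions of F), and a noisy linear target.
n, d, P = 200, 500, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# P random features: Gaussian linear combinations of the d coordinates,
# then ordinary least squares in the small space G_P.
A = rng.standard_normal((d, P)) / np.sqrt(P)
Z = X @ A                                    # n x P projected design
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred = Z @ theta
```

The point of the analysis is to control what is lost by regressing in the P-dimensional random space G_P instead of the full space F, trading approximation error for a much cheaper and better-conditioned least-squares problem.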
You can download the paper from the JMLR website (here) or from the HAL online open repository (here).
Bibtex: 
@article{MaillardMunos12,
  title = {{Linear Regression with Random Projections}},
  author = {Maillard, Odalric-Ambrym and Munos, R{\'e}mi},
  affiliation = {SEQUEL - INRIA Lille - Nord Europe},
  pages = {2735--2772},
  journal = {Journal of Machine Learning Research},
  volume = {13},
  year = {2012},
  pdf = {http://hal.archives-ouvertes.fr/hal-00771487/PDF/JMLR\_random\_proj\_2012.pdf},
  url = {http://hal.archives-ouvertes.fr/hal-00771487}
}
Related publications: 
Scrambled Objects for Least-Squares Regression. Odalric-Ambrym Maillard, Rémi Munos. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Proceedings of the 23rd conference on advances in Neural Information Processing Systems, NIPS '10, pages 1549–1557, 2010.
Compressed Least Squares Regression. Odalric-Ambrym Maillard, Rémi Munos. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Proceedings of the 22nd conference on advances in Neural Information Processing Systems, NIPS '09, pages 1213–1221, 2009.

December 19 2011, 22:49
This page is dedicated to starting discussions about the article "Selecting the State-Representation in Reinforcement Learning". Feel free to post any comment, suggestion, question, correction, extension... I will enjoy discussing this with you.
" The problem of selecting the right staterepresentation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T^{2/3} where T is the horizon time."
A natural question is whether the rate T^{2/3} is optimal in this setting. We believe it is not.