AI Seminar: Estimating Long-term Rewards by Off-policy Reinforcement Learning

Lihong Li, Senior Principal Scientist, Amazon

A core problem in reinforcement learning (RL) is estimating the long-term reward of a given policy. In many real-world applications, such as healthcare, robotics, and dialogue systems, running a new policy on users or robots can be costly or risky. This creates the need for off-policy, or counterfactual, estimation: estimating the long-term reward of a given policy from data previously collected by another policy (e.g., the one currently deployed). This talk will describe some recent advances on this problem, for which many standard estimators suffer from variance that grows exponentially with the horizon (known as "the curse of horizon"). Our approach is based on a dual linear program formulation of the long-term reward and can be extended to estimate confidence intervals.
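
To make the "curse of horizon" concrete, here is a minimal sketch of how the variance of a cumulative importance weight can blow up as the horizon grows. It assumes the standard estimator in question is trajectory-wise importance sampling, and it uses a stateless two-action toy problem. The policies TARGET and BEHAVIOR and all function names are illustrative assumptions, not taken from the talk, and the sketch does not implement the dual linear program approach the abstract describes.

```python
import random

# Hypothetical toy policies over two actions (assumed for illustration only).
TARGET = [0.9, 0.1]    # policy whose long-term reward we want to estimate
BEHAVIOR = [0.5, 0.5]  # policy that collected the logged data


def per_step_ratio_moments(target, behavior):
    """Return (mean, second moment) of target[a] / behavior[a] when a ~ behavior."""
    mean = sum(mu * (pi / mu) for pi, mu in zip(target, behavior))
    second = sum(mu * (pi / mu) ** 2 for pi, mu in zip(target, behavior))
    return mean, second


def weight_variance(target, behavior, horizon):
    """Exact variance of the product of `horizon` independent per-step ratios."""
    mean, second = per_step_ratio_moments(target, behavior)
    return second ** horizon - mean ** (2 * horizon)


def sample_weight(target, behavior, horizon, rng):
    """Draw one trajectory from the behavior policy and return its cumulative weight."""
    weight = 1.0
    for _ in range(horizon):
        action = rng.choices(range(len(behavior)), weights=behavior)[0]
        weight *= target[action] / behavior[action]
    return weight


if __name__ == "__main__":
    rng = random.Random(0)
    for horizon in (1, 5, 10, 20):
        exact = weight_variance(TARGET, BEHAVIOR, horizon)
        one_draw = sample_weight(TARGET, BEHAVIOR, horizon, rng)
        print(f"horizon={horizon:2d}  exact variance={exact:.2f}  one sampled weight={one_draw:.4f}")
```

In this toy setting the per-step ratio has mean 1 and second moment 1.64, so the variance of the cumulative weight is 1.64^T - 1 for horizon T: the weights stay unbiased on average while their spread grows geometrically.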

Speaker Bio
Lihong Li is a Senior Principal Scientist at Amazon. He obtained a PhD in Computer Science from Rutgers University. He then held research positions at Yahoo!, Microsoft, and Google before joining Amazon. His main research interests are in reinforcement learning, including contextual bandits, and related problems in AI. His work is often inspired by applications in recommendation, advertising, Web search, and conversational systems.

Wednesday, November 10, 2021 at 1:00pm to 2:00pm

Virtual Event
Event Type

Lecture or Presentation

Event Topic


Electrical Engineering and Computer Science
Contact Name

Prasad Tadepalli
