Abstract
This dissertation describes computational experiments comparing the performance of a range of reinforcement-learning algorithms. The experiments are designed to focus on aspects of the credit-assignment problem having to do with determining when the behavior that deserves credit occurred. The issues of knowledge representation involved in developing new features or refining existing ones are not addressed. The algorithms considered include some from learning automata theory, mathematical learning theory, early "cybernetic" approaches to learning, Samuel's checker-playing program, Michie and Chambers's "Boxes" system, and a number of new algorithms. The tasks were selected so as to involve, first in isolation and then in combination, the issues of misleading generalizations, delayed reinforcement, unbalanced reinforcement, and secondary reinforcement. The tasks range from simple, abstract "two-armed bandit" tasks to a physically realistic pole-balancing task. The results indicate several areas where the algorithms presented here perform substantially better than those previously studied. An unbalanced distribution of reinforcement, misleading generalizations, and delayed reinforcement can greatly retard learning and in some cases even make it counterproductive. Performance can be substantially improved in the presence of these common problems through the use of mechanisms of reinforcement comparison and secondary reinforcement. We present a new algorithm similar to the "learning-by-generalization" algorithm used for altering the static evaluation function in Samuel's checker-playing program. Simulation experiments indicate that the new algorithm performs better than a version of Samuel's algorithm suitably modified for reinforcement learning tasks. Theoretical analysis in terms of an "ideal reinforcement signal" sheds light on the relationship between these two algorithms and other temporal credit-assignment algorithms.
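To make the reinforcement-comparison mechanism mentioned above concrete: the agent maintains a running prediction of reinforcement and credits an action by how far the actual reward departs from that prediction, rather than by the raw reward itself. The following is a minimal sketch in Python applying this idea to a two-armed Bernoulli bandit; the step sizes, payoff probabilities, and function names are illustrative assumptions, not values or code from the dissertation.

```python
import math
import random

# Minimal reinforcement-comparison sketch on a two-armed Bernoulli bandit.
# alpha, beta, and the payoff probabilities are illustrative assumptions.

def softmax(prefs):
    """Convert action preferences into selection probabilities."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run(steps=10000, alpha=0.1, beta=0.1, payoff=(0.2, 0.8), seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]   # action preferences
    baseline = 0.0       # running prediction of reinforcement
    total_reward = 0.0
    for _ in range(steps):
        probs = softmax(prefs)
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if rng.random() < payoff[a] else 0.0
        # Reinforcement comparison: credit the chosen action by how much
        # the reward exceeds the baseline, then update the baseline.
        prefs[a] += alpha * (r - baseline)
        baseline += beta * (r - baseline)
        total_reward += r
    return total_reward / steps

if __name__ == "__main__":
    print(f"mean reward: {run():.3f}")  # approaches 0.8 as the better arm dominates
```

Because preferences are driven by r - baseline rather than r alone, a uniformly high or uniformly low reinforcement level shifts the baseline instead of inflating every preference, which is the intuition behind the improvement on unbalanced-reinforcement tasks that the abstract reports.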