Reinforcement Learning
Reinforcement Learning is a powerful area of machine learning that deals with how agents ought to take actions in an environment to maximize some notion of cumulative reward. The agent learns to behave by performing actions and observing the rewards that result from those actions.
Key Terms and Vocabulary in Reinforcement Learning:
1. **Agent**: The entity that is learning to interact with the environment in reinforcement learning. It takes actions based on its policy and receives rewards based on those actions.
2. **Environment**: The external system with which the agent interacts. The agent learns by receiving feedback in the form of rewards from the environment based on its actions.
3. **State**: A state represents a specific configuration or snapshot of the environment at a given moment. The agent selects its actions based on the current state.
4. **Action**: An action is a decision taken by the agent in a particular state that leads to a change in the state of the environment. The agent learns to choose actions that maximize its cumulative reward.
5. **Reward**: A scalar feedback signal from the environment that informs the agent about the immediate success or failure of its action. The goal of the agent is to maximize the cumulative reward over time.
6. **Policy**: A policy is a strategy or a set of rules that the agent uses to select actions in different states. It maps states to actions and defines the agent's behavior.
7. **Value Function**: The value function estimates how good it is for the agent to be in a particular state. It predicts the expected cumulative reward the agent will receive starting from that state.
8. **Q-Value**: The Q-value of a state-action pair is the expected cumulative reward the agent will receive starting from that state, taking the action, and then following a specific policy.
9. **Exploration**: Exploration is the process of trying out new actions to discover their effects and to improve the agent's policy. Balancing exploration and exploitation is crucial in reinforcement learning.
10. **Exploitation**: Exploitation is the process of choosing actions that are known to yield high rewards based on the agent's current knowledge. It involves using the learned information to make decisions.
11. **Episode**: An episode is a single run of the agent interacting with the environment from the initial state to a terminal state. It is a sequence of states, actions, and rewards.
12. **Discount Factor (Gamma)**: The discount factor is a value between 0 and 1 that determines the importance of future rewards in the agent's decision-making process. It discounts future rewards relative to immediate rewards.
13. **Markov Decision Process (MDP)**: MDP is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It consists of states, actions, transition probabilities, rewards, and a discount factor.
14. **Bellman Equation**: The Bellman equation is a fundamental equation in reinforcement learning that decomposes the value of a state into the immediate reward plus the discounted value of the next state. It is used to update the value function iteratively; its standard form is written out after this list.
15. **Policy Iteration**: Policy iteration is an iterative algorithm in reinforcement learning that alternates between policy evaluation and policy improvement to find an optimal policy.
16. **Value Iteration**: Value iteration is an iterative algorithm in reinforcement learning that updates the value function by taking the maximum value over all possible actions in each state.
17. **Temporal Difference (TD) Learning**: TD learning is a method in reinforcement learning that updates the value function based on the difference between the estimated value of a state and a better estimate obtained one step later.
18. **Q-Learning**: Q-learning is a model-free reinforcement learning algorithm that learns the Q-value function without requiring a model of the environment. It uses a Bellman-style update to adjust Q-values; a minimal tabular sketch appears after this list.
19. **Deep Q-Network (DQN)**: DQN is an algorithm that combines deep neural networks with Q-learning to handle high-dimensional state spaces in reinforcement learning tasks.
20. **Policy Gradient**: Policy gradient methods directly optimize the policy function by computing the gradient of the expected return with respect to the policy parameters; the classic REINFORCE estimator of this gradient is shown after this list.
21. **Actor-Critic**: Actor-critic methods combine policy-based and value-based approaches by having two components: an actor (policy) that selects actions and a critic (value function) that evaluates the actions.
22. **Exploration-Exploitation Dilemma**: The exploration-exploitation dilemma refers to the trade-off between exploring new actions to learn more about the environment and exploiting known actions to maximize rewards.
23. **On-Policy Learning**: On-policy learning methods update the policy that is used to make decisions while interacting with the environment.
24. **Off-Policy Learning**: Off-policy learning methods learn a policy different from the one used to generate the data, allowing for more flexibility in exploration.
25. **SARSA**: SARSA is an on-policy temporal difference algorithm that updates Q-values using the tuple that gives it its name: state, action, reward, next state, and next action.
26. **Monte Carlo Methods**: Monte Carlo methods estimate value functions by averaging the returns of complete episodes, without requiring a model of the environment's dynamics.
27. **Function Approximation**: Function approximation methods use parameterized functions to estimate value functions or policies in reinforcement learning, enabling the handling of large state spaces.
28. **Generalization**: Generalization refers to the ability of a reinforcement learning agent to apply its learned knowledge to new, unseen situations in the environment.
29. **Curse of Dimensionality**: The curse of dimensionality refers to the exponential increase in the size of the state space with the number of dimensions, making it challenging to explore and learn in high-dimensional spaces.
30. **Sparse Rewards**: Sparse rewards are rewards that are only given at certain points in the environment, making it difficult for the agent to learn the optimal policy.
31. **Continuous Action Space**: Continuous action spaces involve actions that are represented by real numbers rather than discrete choices, posing challenges for exploration and optimization.
32. **Multi-Armed Bandit Problem**: The multi-armed bandit problem is a simplified, stateless version of the reinforcement learning problem in which the agent must repeatedly decide which of several slot-machine arms to pull to maximize cumulative reward.
33. **Function Approximation in RL**: Function approximation techniques in reinforcement learning use parameterized functions to represent value functions or policies, allowing for efficient learning in large state spaces.
34. **Deadly Triad**: The deadly triad refers to the combination of off-policy learning, function approximation, and bootstrapping, which can lead to instability and divergence in reinforcement learning algorithms.
35. **Policy Search**: Policy search methods directly search for an optimal policy by exploring the policy space, often using optimization techniques like evolutionary algorithms or gradient descent.
36. **Policy Evaluation**: Policy evaluation is the process of estimating the value function under a given policy, which is essential for policy improvement in reinforcement learning.
37. **Temporal Difference Error**: The temporal difference error is the difference between the current value estimate of a state and a bootstrapped target, namely the observed reward plus the discounted value of the next state; it is used to update the value function.
38. **Softmax Policy**: A softmax policy is a probabilistic policy that selects each action with probability proportional to the exponential of its estimated value, so higher-valued actions are favored while every action retains some chance of being explored; a short sketch appears after this list.
39. **Replay Buffer**: A replay buffer is a memory structure used in deep reinforcement learning to store experiences and replay them during training, improving sample efficiency and stability; a minimal implementation is sketched after this list.
40. **Target Network**: A target network is a periodically updated copy of the main network used in deep reinforcement learning to compute target Q-values, keeping the learning targets stable between updates.
41. **Policy Distillation**: Policy distillation is a method in reinforcement learning where a complex policy learned by a neural network is transferred to a simpler policy that can be executed more efficiently.
42. **AlphaGo**: AlphaGo is a deep reinforcement learning system developed by DeepMind that defeated world champion Go player Lee Sedol in 2016, showcasing the power of reinforcement learning in complex games.
43. **OpenAI Gym**: OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms, providing a wide range of environments and benchmarks for research and experimentation; a minimal episode loop is sketched after this list.
44. **Deep Reinforcement Learning**: Deep reinforcement learning combines deep learning techniques with reinforcement learning to handle complex state spaces and high-dimensional inputs.
45. **Policy-Based Methods**: Policy-based methods directly optimize the agent's policy to maximize expected rewards, without explicitly estimating value functions.
46. **Value-Based Methods**: Value-based methods estimate value functions to guide the agent's decisions and actions towards maximizing cumulative rewards in reinforcement learning tasks.
47. **Tabular Methods**: Tabular methods represent value functions or policies using tables, which are suitable for small state spaces but impractical for large or continuous spaces.
48. **Stochastic Environment**: A stochastic environment is one where the outcomes of actions are probabilistic, leading to uncertainty in the agent's interactions and requiring it to plan over expected outcomes.
49. **Deterministic Environment**: A deterministic environment is one where the outcomes of actions are predictable and do not involve randomness, simplifying the agent's decision-making process.
50. **Memoryless Property**: The memoryless (Markov) property states that the next state and reward depend only on the current state and action, not on the entire history of states and actions, so the agent's decisions can be based on the current state alone.
51. **Asynchronous Methods**: Asynchronous methods in reinforcement learning update the agent's policy or value function in parallel, allowing for faster learning and exploration in large-scale environments.
52. **Batch Reinforcement Learning**: Batch reinforcement learning is a setting where the agent learns from a fixed dataset of experiences, rather than interacting with the environment in real-time.
53. **Deep Deterministic Policy Gradient (DDPG)**: DDPG is an actor-critic algorithm that combines ideas from deep Q-learning with deterministic policy gradients to handle continuous action spaces in reinforcement learning.
54. **Trust Region Policy Optimization (TRPO)**: TRPO is a policy optimization method that constrains the policy update to ensure that the new policy does not deviate too far from the old policy, improving stability.
55. **Proximal Policy Optimization (PPO)**: PPO is a policy gradient method that simplifies TRPO by using a clipped objective function to update the policy, achieving similar performance with less computation.
56. **Distributed Reinforcement Learning**: Distributed reinforcement learning involves training multiple agents in parallel across different machines or GPUs, enabling faster learning and scalability in large environments.
57. **Meta Reinforcement Learning**: Meta reinforcement learning is a higher-level learning process where the agent learns how to learn and adapt its learning strategies across different tasks or environments.
58. **Imitation Learning**: Imitation learning is a method where the agent learns from demonstrations provided by an expert, mimicking the expert's behavior to achieve better performance.
59. **Inverse Reinforcement Learning**: Inverse reinforcement learning is a technique where the agent infers the underlying reward function of an expert by observing their behavior, allowing for better policy learning.
60. **Multi-Agent Reinforcement Learning**: Multi-agent reinforcement learning involves multiple agents learning to interact and cooperate or compete with each other in a shared environment, leading to complex behaviors and strategies.
61. **Robust Reinforcement Learning**: Robust reinforcement learning focuses on developing algorithms that are resilient to uncertainties, disturbances, and adversarial attacks in the environment.
62. **Transfer Learning in RL**: Transfer learning in reinforcement learning involves leveraging knowledge or policies learned in one task to accelerate learning or improve performance in a related task.
63. **Regularization in RL**: Regularization techniques in reinforcement learning prevent overfitting by adding constraints or penalties to the agent's policy or value function during training.
64. **Model-Based Reinforcement Learning**: Model-based reinforcement learning methods learn an explicit model of the environment dynamics to aid in planning and decision-making, improving sample efficiency.
65. **Model-Free Reinforcement Learning**: Model-free reinforcement learning methods directly learn the policy or value function from interactions with the environment, without explicitly modeling the environment dynamics.
66. **Bayesian Reinforcement Learning**: Bayesian reinforcement learning integrates Bayesian inference with reinforcement learning to handle uncertainty and make robust decisions in stochastic environments.
67. **Sparse Rewards Problem**: The sparse rewards problem refers to the challenge of learning in environments where rewards are rare or sparse, making it difficult for the agent to learn an optimal policy.
68. **Recurrent Neural Networks (RNNs)**: RNNs are a type of neural network architecture that can capture sequential dependencies in data, making them suitable for handling temporal aspects in reinforcement learning tasks.
69. **Long Short-Term Memory (LSTM)**: LSTMs are a variant of RNNs that can store and retrieve long-term dependencies in sequential data, enabling better learning and generalization in reinforcement learning.
70. **Attention Mechanism**: Attention mechanisms allow neural networks to focus on relevant parts of the input data, enabling better performance in tasks that require selective processing, such as sequence prediction in reinforcement learning.
71. **Self-Attention**: Self-attention mechanisms enable neural networks to relate different positions in a sequence, capturing long-range dependencies and improving performance on sequence-modeling tasks, including those that arise in reinforcement learning.
72. **Exploration Strategies**: Exploration strategies in reinforcement learning dictate how an agent selects actions to balance between exploring new options and exploiting known ones, affecting the learning efficiency and performance.
73. **Intrinsic Motivation**: Intrinsic motivation methods provide additional rewards or incentives to encourage exploration and learning in reinforcement learning tasks, beyond the external rewards from the environment.
74. **Curriculum Learning**: Curriculum learning is a training strategy that gradually increases the complexity of tasks or environments to guide the agent's learning process, improving sample efficiency and generalization.
75. **Distributed Representation**: Distributed representation encodes information across multiple neurons or variables, enabling efficient learning and generalization in reinforcement learning tasks with high-dimensional inputs.
76. **Hindsight Experience Replay**: Hindsight experience replay is a technique that reuses failed experiences by relabeling them with the goals that were actually achieved, improving sample efficiency and exploration in goal-conditioned reinforcement learning.
77. **Safe Reinforcement Learning**: Safe reinforcement learning focuses on developing algorithms that ensure the agent's behavior remains within predefined safety constraints to prevent harmful actions or accidents.
78. **Multi-Task Reinforcement Learning**: Multi-task reinforcement learning involves learning multiple tasks simultaneously or sequentially, leveraging shared knowledge to improve learning efficiency and adaptability.
79. **Hierarchical Reinforcement Learning**: Hierarchical reinforcement learning decomposes complex tasks into subtasks with separate policies, enabling more efficient learning and decision-making in reinforcement learning.
80. **Catastrophic Forgetting**: Catastrophic forgetting refers to the phenomenon where a reinforcement learning agent forgets previously learned knowledge when adapting to new tasks or environments, hindering performance.
81. **Curiosity-Driven Exploration**: Curiosity-driven exploration methods incentivize the agent to explore by rewarding novelty or learning progress, promoting diverse behaviors and improving exploration in reinforcement learning.
82. **Dopamine Signal**: The dopamine signal is a neural signal in the brain associated with reward prediction; its firing pattern closely parallels the temporal difference error, linking neuroscience to algorithms like TD learning and Q-learning.
83. **Counterfactual Reasoning**: Counterfactual reasoning allows the agent to imagine alternative actions or scenarios and evaluate their consequences, improving decision-making and policy learning in reinforcement learning.
84. **Contextual Bandits**: Contextual bandits extend the multi-armed bandit problem by introducing context or features that affect the rewards of each action, requiring the agent to learn a policy based on context.
85. **Self-Supervised Learning**: Self-supervised learning methods generate supervisory signals from the input data itself, enabling pre-training or feature learning in reinforcement learning tasks without external supervision.
86. **Adversarial Reinforcement Learning**: Adversarial reinforcement learning involves training an agent to compete or collaborate with an adversary, leading to robust strategies and behaviors in competitive environments.
87. **Gaussian Process**: A Gaussian process is a distribution over functions that can represent uncertainty and generalization in reinforcement learning tasks, enabling efficient exploration and learning in continuous spaces.
88. **Multi-Objective Reinforcement Learning**: Multi-objective reinforcement learning considers multiple conflicting objectives or goals in the agent's decision-making process, leading to a trade-off between different criteria.
89. **Evolution Strategies**: Evolution strategies are optimization algorithms inspired by natural evolution that evolve a population of policies over generations to find optimal solutions in reinforcement learning tasks.
90. **Learning Rate**: The learning rate is a hyperparameter in optimization algorithms that controls the step size during parameter updates, affecting the convergence speed and stability of reinforcement learning algorithms.
91. **Batch Size**: The batch size is the number of experiences sampled from the replay buffer or dataset in each training iteration, influencing the learning efficiency and generalization of reinforcement learning algorithms.
92. **Exploration Rate**: The exploration rate (often denoted epsilon) determines the probability of choosing random actions over greedy actions in the agent's policy, affecting the balance between exploration and exploitation in reinforcement learning.
93. **Optimality**: Optimality in reinforcement learning refers to the state where the agent has learned the optimal policy that maximizes cumulative rewards in the environment, achieving the best possible performance.
94. **Convergence**: Convergence in reinforcement learning indicates that the agent's policy or value function has reached a stable state where further updates do not significantly improve performance, signaling the end of training.
95. **Generalization**: Generalization in reinforcement learning refers to the agent's ability to apply its learned knowledge to new, unseen situations or environments, achieving robust performance beyond the training data.
96. **Overfitting**: Overfitting occurs when a reinforcement learning model memorizes its training experience rather than generalizing to unseen situations, leading to poor performance and lack of adaptability.
97. **Underfitting**: Underfitting happens when a reinforcement learning model is too simple to capture the underlying patterns in the data, resulting in suboptimal performance and learning inefficiency.
98. **Bias-Variance Trade-Off**: The bias-variance trade-off balances a model's bias (systematic error) against its variance (sensitivity to fluctuations in the training data) to achieve the best generalization on unseen data.
99. **Reward Shaping**: Reward shaping is a technique in reinforcement learning that modifies the reward signal to guide the agent toward desirable behaviors, improving learning efficiency and convergence speed; a standard potential-based formulation appears after this list.
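The sketches and formulas below make a few of the terms above concrete. First, the Bellman equation for the state-value function under a policy $\pi$, in its standard textbook form, with transition probabilities $P$, rewards $R$, and the discount factor $\gamma$ defined earlier:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
```

The value of a state is the expected immediate reward plus the discounted value of wherever the agent lands next; value iteration replaces the sum over actions with a max in order to target the optimal policy directly.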
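Next, a minimal tabular Q-learning sketch with an epsilon-greedy policy. The environment interface used here (`reset()` returning a state, `step(action)` returning a `(next_state, reward, done)` triple) is a hypothetical stand-in, not a specific library's API:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] defaults to 0.0
    for _ in range(episodes):
        state = env.reset()          # hypothetical env interface
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # TD error: bootstrapped target minus current estimate.
            best_next = max(Q[(next_state, a)] for a in actions)
            td_error = reward + gamma * best_next * (not done) - Q[(state, action)]
            Q[(state, action)] += alpha * td_error
            state = next_state
    return Q
```

The `td_error` line is exactly the temporal difference error defined above. SARSA differs in one place: instead of maximizing over next actions, it plugs in the Q-value of the next action the policy actually takes, which is what makes it on-policy.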
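For policy gradient methods, the classic REINFORCE estimator expresses the gradient of the expected return $J(\theta)$ with respect to the policy parameters $\theta$ as an expectation over trajectories:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, G_t \right]
```

where $G_t$ is the return from time step $t$ onward. Actor-critic methods replace $G_t$ with a learned critic's estimate to reduce variance.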
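A softmax policy fits in a few lines. This sketch samples an action in proportion to $\exp(Q/T)$, where the temperature `T` is an assumed hyperparameter controlling how greedy the policy is:

```python
import math
import random

def softmax_action(q_values, temperature=1.0):
    """Sample an action index with probability proportional to exp(Q / T)."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(q_values)), weights=probs)[0]
```

As the temperature approaches zero the policy becomes greedy; as it grows, action selection approaches uniform random exploration.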
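A replay buffer needs little more than a bounded queue and uniform sampling. The class below is a minimal sketch, not a production implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store; old experiences are evicted FIFO."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions, stabilizing training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```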
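A typical episode loop looks like the following. This sketch assumes the newer Gymnasium fork of OpenAI Gym, whose `reset` returns `(observation, info)` and whose `step` returns five values; older Gym releases use a slightly different signature:

```python
import gymnasium as gym  # assumes the Gymnasium fork; classic gym's API differs slightly

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to show the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"Episode return: {total_reward}")
```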
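Finally, the standard potential-based form of reward shaping adds the discounted change of a potential function $\Phi$ over states to the environment reward:

```latex
r'(s, a, s') = r(s, a, s') + \gamma \, \Phi(s') - \Phi(s)
```

A classical result shows that shaping of this form leaves the optimal policy unchanged, so $\Phi$ can safely encode domain hints such as distance to a goal.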
In conclusion, understanding the key terms and vocabulary of reinforcement learning is essential for mastering this complex field of machine learning. By grasping concepts such as agent, environment, state, action, reward, policy, value function, exploration, and exploitation, along with the algorithms and methods built on them, practitioners can develop effective reinforcement learning models for a wide range of applications. Practical applications, examples, and challenges then deepen that knowledge, paving the way for innovative solutions in diverse domains.
Key takeaways
- Reinforcement Learning is a powerful area of machine learning that deals with how agents ought to take actions in an environment to maximize some notion of cumulative reward.
- **Agent**: The entity that is learning to interact with the environment in reinforcement learning.
- **Environment**: The agent learns by receiving feedback in the form of rewards from the environment based on its actions.
- **State**: A specific configuration or snapshot of the environment that the agent can be in.
- **Action**: An action is a decision taken by the agent in a particular state that leads to a change in the state of the environment.
- **Reward**: A scalar feedback signal from the environment that informs the agent about the immediate success or failure of its action.
- **Policy**: A policy is a strategy or a set of rules that the agent uses to select actions in different states.