# Reading Circle (Reinforcement Learning for Image Processing)

Explanation of Reinforcement Learning for Image Processing

July 10, 2020

## Transcript

2. ### Today’s Topic Reinforcement Learning Algorithms −SARSA −Q-Learning −DQN −REINFORCE −Actor-Critic

Reinforcement Learning for Image Processing −[Cao+, CVPR17] −A2-RL [Li+, CVPR18] −PixelRL [Furuta+, AAAI19] −Adversarial RL [Ganin+, ICML18]
3. ### What’s Reinforcement Learning An area of machine learning concerned with

how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. [Diagram: the agent–environment loop — the agent observes a state and a reward from the environment and returns an action.] The quantity to maximize is the return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.
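The return above can be illustrated with a short sketch; the function name and the truncation to a finite episode are my own additions, not from the slides:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return R_t = sum_k gamma^k * r_{t+k+1},
    accumulated backwards over a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.5*0 + 0.25*1 = 1.25
print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))
```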
4. ### Policy and Value Policy $\pi$ −the model of the agent's action

selection State Value Function $V^\pi(s)$ −the expected return starting from state $s$ under $\pi$ −"how good" it is to be in the given state Action Value Function $Q^\pi(s, a)$ −the action-value of the pair $(s, a)$ under $\pi$
5. ### TD error In value-based RL the Q-function is usually used.

$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \delta_t$, where $\delta_t = y_t - Q^\pi(s_t, a_t)$, $\alpha$ is the learning rate, and $y_t$ is the TD target. The TD error $\delta_t$ is the difference between the evaluation value before and after the agent acted.
6. ### TD error The update rule $Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \delta_t$ stays the same; the algorithms differ in how the TD target $y_t$ is defined. − SARSA − Q-Learning
7. ### Policy of Value-based RL greedy policy −always choose the action

with the highest Q-value ε-greedy policy −select a random action with probability ε and the action with the highest Q-value with probability 1−ε
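An ε-greedy policy is a few lines of code; this sketch assumes a small discrete action space with Q-values stored in a list (both assumptions mine):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon,
    otherwise the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```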
8. ### SARSA $y_t = r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1})$

$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha (y_t - Q^\pi(s_t, a_t))$ Update the Q-value based on the current policy: the target uses the action $a_{t+1}$ that the ε-greedy behaviour policy actually chose in $s_{t+1}$, so SARSA is on-policy.
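As a minimal sketch of one SARSA step (assuming a tabular Q kept as a dict of state → list of action values; the representation is my choice, not the slides'):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses the action a_next
    actually selected by the behaviour policy in s_next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
```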
9. ### Q-Learning $y_t = r_{t+1} + \gamma \max_{a'} Q^\pi(s_{t+1}, a')$

$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha (y_t - Q^\pi(s_t, a_t))$ Update the Q-value assuming the agent chose the action $a'$ that maximizes the Q-value in $s_{t+1}$, even though the behaviour policy is ε-greedy, so Q-Learning is off-policy.
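The corresponding Q-Learning step differs only in the TD target (same hypothetical dict-of-lists Q-table as above):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update: the target maximizes over next actions,
    regardless of which action the behaviour policy will actually take."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```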
10. ### Problem of Q-Learning The Q-value is maintained in a Q-table −requires manual separation

into discrete values −slow learning because the relationships between values are ignored −the curse of dimensionality → approximate the Q-function by a NN
11. ### DQN $y_t = r_{t+1} + \gamma \max_{a'} Q^\pi(s_{t+1}, a')$

$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha (y_t - Q^\pi(s_t, a_t))$ Approximate the Q-function by a CNN, with three stabilization techniques: −target network −experience replay −reward clipping Target network: fix the parameters of the Q-function used in $\max_{a'} Q^\pi(s_{t+1}, a')$ for a certain period.
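The target-network idea reduces to a periodic parameter copy; a minimal sketch with parameters as plain lists (the function name and interface are hypothetical):

```python
def maybe_sync_target(step, sync_every, online_params, target_params):
    """Refresh the frozen target copy from the online parameters only
    once every sync_every steps; between refreshes the TD target is
    computed from the stale, fixed copy."""
    if step % sync_every == 0:
        target_params[:] = list(online_params)  # copy values, don't alias
```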
12. ### DQN Experience replay: store past state transitions and their rewards in a replay buffer and apply mini-batch learning by sampling from the replay buffer.
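A replay buffer as described can be sketched in a few lines; the class name, capacity, and transition tuple layout are my assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions up to a fixed capacity (oldest evicted first);
    sample uniform mini-batches for learning."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```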
13. ### Policy-Gradient Method Approximate the policy itself by a NN and learn the parameters $\theta$ to maximize an objective function $J(\pi_\theta)$:

$\nabla_\theta J(\pi_\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi^{\pi_\theta}(s_t, a_t)]$ for a parameterized policy $\pi_\theta$.
14. ### Policy-Gradient Method There are some algorithms for different choices of the weighting term $\Psi^{\pi_\theta}(s_t, a_t)$ in the gradient, which measures how good the policy's action was: −REINFORCE: the return $G_t$ −Actor-Critic: the advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$
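For a softmax policy the log-probability gradient has a closed form, so a single-sample REINFORCE estimate can be sketched without a deep-learning framework; the one-step (bandit-like) setting and function names are my simplifications:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(theta, action, ret):
    """Single-sample estimate of grad_theta log pi_theta(action) * return.
    For a softmax policy: d/d theta_i log pi(a) = 1[i == a] - pi_i."""
    pi = softmax(theta)
    return [((1.0 if i == action else 0.0) - p) * ret for i, p in enumerate(pi)]
```

Ascending this gradient raises the probability of actions that led to high returns, which is the core of REINFORCE.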

16. ### Today’s Topic Reinforcement Learning Algorithms −SARSA −Q-Learning −DQN −REINFORCE −Actor-Critic

Reinforcement Learning for Image Processing −[Cao+, CVPR17] −A2-RL [Li+, CVPR18] −PixelRL [Furuta+, AAAI19] −Adversarial RL [Ganin+, ICML18]

18. ### [Cao+, CVPR17] Attention-Aware Face Hallucination algorithm: REINFORCE state:

facial image action: select a location to enhance reward: MSE loss

21. ### A2-RL [Li+, CVPR18] Aesthetics Aware Image Cropping algorithm: A3C

state: current image & original image action: select a cropping window reward: aesthetics score

23. ### PixelRL [Furuta+, AAAI19] Pixel-wise multi-agent reinforcement learning - Employ a fully

convolutional network (FCN) - pixel-wise agents share the parameters - pixel-wise action: change the pixel value - pixel-wise reward: how much the action reduced the difference of the pixel value from the target
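The pixel-wise reward can be sketched as the decrease in squared error to the target image; treating images as flat lists of floats, and the exact reward form, are my reading of the slide rather than a verbatim implementation:

```python
def pixelwise_reward(prev, curr, target):
    """Per-pixel reward: decrease in squared error to the target,
    r_i = (target_i - prev_i)^2 - (target_i - curr_i)^2.
    Positive when the pixel agent's action moved its pixel closer."""
    return [(t - p) ** 2 - (t - c) ** 2 for p, c, t in zip(prev, curr, target)]
```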


31. ### Conclusion Reinforcement Learning Algorithms −SARSA −Q-Learning −DQN −REINFORCE −Actor-Critic

Reinforcement Learning for Image Processing −[Cao+, CVPR17] −A2-RL [Li+, CVPR18] −PixelRL [Furuta+, AAAI19] −Adversarial RL [Ganin+, ICML18] Very recently, deep RL has been used for image processing. There are many more RL algorithms…