how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
[Figure: the agent-environment loop, in which the agent receives a state and a reward from the environment and sends back an action]
Maximize the return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
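As a minimal sketch (not from the slides), the return above can be computed backward over a finite episode using the recursion $R_t = r_{t+1} + \gamma R_{t+1}$, which is equivalent to the sum:

```python
def discounted_return(rewards, gamma=0.99):
    """Return R_0 for an episode with rewards [r_1, r_2, ..., r_T]."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret  # backward recursion: R_t = r_{t+1} + gamma * R_{t+1}
    return ret

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```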
State Value Function $V^\pi(s)$
−the expected return starting from state $s$ under $\pi$
−"how good" it is to be in the given state
Action Value Function $Q^\pi(s, a)$
−the action-value of the pair $(s, a)$ under $\pi$
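Written out with the return defined above, the standard definitions behind these two phrases are:

$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$
$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$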
to discrete values
−slow learning, because the relationships between values are ignored
−the curse of dimensionality
→ approximate the Q-function by a NN
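A minimal sketch of this idea (the fully connected architecture, sizes, and names here are assumptions; the method discussed next uses a CNN): a network over the raw state outputs one Q-value per discrete action, so similar states share parameters instead of being learned independently per table cell.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximate Q(s, .) with a small fully connected network."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action: Q(s, a)
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=2)
print(q(torch.zeros(1, 4)))  # Q-values of both actions for one state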
$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a') - Q^\pi(s_t, a_t)\right)$
approximate the Q-function by a CNN
−target network: fix the parameters $\theta'$ of the Q-function used in $\max_{a'} Q_{\theta'}(s_{t+1}, a')$ for a certain period
−experience replay
−reward clipping
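A minimal sketch of the target-network idea (the names q_net, target_net, and the period sync_every are assumptions for illustration): the TD target is computed with a frozen copy of the online network, and the copy is re-synchronized only periodically.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # online Q-network, updated every step
target_net = copy.deepcopy(q_net)  # frozen copy, used only inside the TD target
sync_every = 1000  # assumed period between parameter copies

def td_target(reward, next_state, gamma=0.99):
    """r_{t+1} + gamma * max_a' Q_{theta'}(s_{t+1}, a'), with no gradient through theta'."""
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(dim=1).values

def maybe_sync(step):
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())  # refresh the frozen parameters
```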
−experience replay: store past state transitions and their rewards in a replay buffer and apply mini-batch learning by sampling from the buffer
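A minimal replay-buffer sketch (capacity and batch size are assumed values): sampling mini-batches uniformly from stored transitions breaks the correlation between consecutive updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform random mini-batch
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```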
$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a)\right]$ for a parameterized policy $\pi_\theta$, where $J(\theta)$ measures how good the policy is.
There are some algorithms that differ in how $Q^{\pi_\theta}(s, a)$ is estimated.
−REINFORCE: the sampled return $R_t$
−Actor-Critic: the advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
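A minimal REINFORCE sketch (the architecture and hyperparameters are assumptions): the expectation above is estimated from one episode by weighting each $\log \pi_\theta(a_t|s_t)$ with its return $R_t$; minimizing the negated sum ascends the policy gradient.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """states: (T, 4) float tensor, actions: (T,) long tensor, returns: (T,) float tensor of R_t."""
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]  # log pi_theta(a_t | s_t)
    loss = -(chosen * returns).sum()  # minimizing this ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```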
There are many more RL algorithms…
Reinforcement Learning for Image Processing
Very recently, deep RL has been used for image processing:
−[Cao+, CVPR17]
−A2-RL [Li+, CVPR18]
−PixelRL [Furuta+, AAAI19]
−Adversarial RL [Ganin+, ICML18]