Preliminaries of Q-Learning
Q-learning is basically a model-free reinforcement learning technique (Busoniu et al., 2010; Masoumzadeh et al., 2009) defined over a set of states S, a set of actions A, and a reward function R(S, A). In each state s ∈ S, the agent (Hsu et al., 2008; Zhou et al., 2007) takes an action a ∈ A. Upon taking the action, the agent receives a reward R(s, a) and reaches a new state s′. Q-learning (Cho et al., 2007; Pandey et al., 2010), which has been developed in several stages (Chen et al., 2009), is explained briefly in the following section.
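As a minimal illustration of these ingredients, the sketch below encodes a toy state set S, action set A, reward function R(s, a), and deterministic transition as plain Python structures. The chain-of-states layout, the action names, and the reward values are assumptions made purely for illustration; they are not taken from the cited works.

```python
# Toy ingredients of a Q-learning problem: states S, actions A,
# an immediate reward R(s, a), and a deterministic transition s -> s'.
# All names and numeric values below are illustrative assumptions.

S = ["s0", "s1", "s2", "s3"]   # set of states
A = ["left", "right"]          # set of actions available in every state

def R(s, a):
    # Immediate reward: moving right from s2 (toward the goal s3) pays 100.
    return 100.0 if (s == "s2" and a == "right") else 0.0

def next_state(s, a):
    # Next state s' reached by taking action a in state s (a simple chain).
    i = S.index(s)
    return S[max(0, i - 1)] if a == "left" else S[min(len(S) - 1, i + 1)]

# One interaction step: in state s the agent takes action a,
# receives reward R(s, a), and reaches the new state s'.
s, a = "s2", "right"
r, s_prime = R(s, a), next_state(s, a)
print(s, a, r, s_prime)   # s2 right 100.0 s3
```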
Classical Q-Learning (CQL)
In classical Q-learning, every possible state of an agent and its possible actions in a given state are deterministically known. In other words, for a given agent, let s_0, s_1, s_2, ..., s_n be the possible states, and let each state have m possible actions a_1, a_2, ..., a_m. At a particular state-action pair (s_i, a_j), the specific reward that the agent achieves is known as the immediate reward r(s_i, a_j)
(shown in Figure 1). The agent selects its next state from its current state using a policy that attempts to maximize the cumulative reward the agent could obtain over the subsequent state transitions from its next state (Dean et al., 1993; Bellman, 1957; Watkins et al., 1992). For example, let the agent be in state s_i and suppose it has to select the next best state. Then the Q-value at state s_i due to action a_j is given in (1).

$$Q(s_i, a_j) = r(s_i, a_j) + \gamma \max_{a'} Q\big(\delta(s_i, a_j), a'\big) \qquad (1)$$

Figure 1. State-action pair with reward
where δ(s_i, a_j) denotes the next state due to selection of action a_j at state s_i. Let the next state selected be s_k, so that δ(s_i, a_j) = s_k. Consequently, selecting the action a_j that maximizes Q(s_i, a_j) is an interesting problem. One main drawback of the above Q-learning is that the Q-value at state s_k must be known for every possible action a′. As a result, at every step the agent accesses memory to obtain the Q-values of all possible actions at a particular state in order to determine the most appropriate next state, which consumes more time in selecting the next state. Since only the action a′ for which Q(s_k, a′) is maximum needs to be evaluated, we can remodel the Q-learning equation by identifying the action a′ that drives the agent closer to the goal.
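To make equation (1) and the drawback just described concrete, the sketch below computes tabular Q-values with the standard deterministic Q-learning relation, reusing the same toy states, actions, reward, and transition as in the earlier sketch; the discount factor γ = 0.8, the number of sweeps, and the stored values are assumptions for illustration only. Note how the max over a′ forces a read of the stored Q-value of every possible action at the next state, which is exactly the repeated memory access highlighted above.

```python
GAMMA = 0.8                             # assumed discount factor (illustration only)
S = ["s0", "s1", "s2", "s3"]            # toy state set, as in the sketch above
A = ["left", "right"]                   # toy action set

def R(s, a):                            # immediate reward r(s, a)
    return 100.0 if (s == "s2" and a == "right") else 0.0

def next_state(s, a):                   # deterministic transition delta(s, a)
    i = S.index(s)
    return S[max(0, i - 1)] if a == "left" else S[min(len(S) - 1, i + 1)]

# Q-table stored per (state, action) pair.
Q = {(s, a): 0.0 for s in S for a in A}

def q_value(s_i, a_j):
    """Equation (1): Q(s_i, a_j) = r(s_i, a_j) + gamma * max_{a'} Q(delta(s_i, a_j), a')."""
    s_k = next_state(s_i, a_j)          # delta(s_i, a_j): the next state
    # The max below reads the stored Q-value of *every* action at s_k --
    # the repeated table lookup that the text identifies as a drawback of CQL.
    return R(s_i, a_j) + GAMMA * max(Q[(s_k, a_prime)] for a_prime in A)

# Repeated sweeps converge here because rewards and transitions are deterministic.
for _ in range(50):
    for s in S:
        for a in A:
            Q[(s, a)] = q_value(s, a)

# Greedy selection of the next state from s_i: the action with maximal Q(s_i, a_j).
s_i = "s1"
a_j = max(A, key=lambda a: Q[(s_i, a)])
print(a_j, round(Q[(s_i, a_j)], 2))
```

The greedy selection at the end mirrors the policy described in the text: from its current state the agent moves to the neighbouring state whose state-action pair promises the largest cumulative reward.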