The idea of Q-learning is to approximate the Q-function by using the current optimal action value. The Q-learning update is very similar to the update done in SARSA, with the exception that it takes the maximum state-action value:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

$\alpha$ is the usual learning rate and $\gamma$ is the discount factor.
While the SARSA update is done by following the behavior policy (such as an $\epsilon$-greedy policy), the Q-learning update is done with respect to the greedy target policy that results from taking the maximum action value. If this concept is not clear yet, take a look at figure 4.7. While in SARSA we had figure 4.3, where both actions $a_t$ and $a_{t+1}$ come from the same policy, in Q-learning the next action is chosen based on the maximum state-action value. Because the Q-learning update no longer depends on the behavior policy (which is used only for sampling from the environment), Q-learning is an off-policy algorithm.
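The off-policy update above can be sketched as a single tabular step. This is a minimal illustration, not the full training loop: the table size, `alpha`, `gamma`, and the sample transition are all assumptions chosen for the example.

```python
import numpy as np

# Illustrative hyperparameters and table size (assumptions, not from the text).
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next, done):
    # Off-policy target: bootstrap from the greedy (maximum) action value
    # in the next state, regardless of which action the behavior policy
    # will actually execute there. SARSA would instead use Q[s_next, a_next]
    # for the a_next actually sampled from the behavior policy.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: from state 0, taking action 1, reward 1.0, landing in state 2.
q_learning_update(Q, 0, 1, 1.0, 2, False)
print(Q[0, 1])  # 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

Note that the behavior policy (for example $\epsilon$-greedy over `Q`) appears only when sampling transitions from the environment; the update itself always looks at the greedy action, which is what makes the algorithm off-policy.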