Dezfouli, Lingawi and Balleine (2014) advocate hierarchical reinforcement learning (hierarchical RL) as a framework for understanding important features of animal action learning.
Hierarchical RL and model-free RL are both capable of coping with complex environments where outcomes may be delayed until a sequence of actions is completed. In these situations simple model-based (goal-directed) RL does not scale. The key difference between hierarchical and model-free RL is that in model-free RL actions are evaluated at each step, whereas in hierarchical RL they are evaluated at the end of an action sequence.
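The contrast between step-by-step and sequence-level evaluation can be caricatured in code. This is a minimal sketch, not the authors' model: the function names, the learning rate, and the omission of discounting and bootstrapping are all my simplifications.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (illustrative value, not from the paper)

def model_free_update(q, transitions):
    """Closed-loop, step-by-step credit assignment: each component
    action is evaluated against the outcome it immediately produced.
    (Discounting and bootstrapping omitted for brevity.)"""
    for state, action, reward in transitions:
        old = q[(state, action)]
        q[(state, action)] = old + ALPHA * (reward - old)
    return q

def hierarchical_update(q, start_state, sequence, total_reward):
    """Open-loop, sequence-level credit assignment: the whole chunk is
    one unit of selection, evaluated only by the reward observed at its
    end; intermediate outcomes play no role."""
    key = (start_state, tuple(sequence))
    old = q[key]
    q[key] = old + ALPHA * (total_reward - old)
    return q

# A two-step episode: the intermediate outcome after 'a1' is zero,
# but the sequence as a whole ends in reward.
q_mf = defaultdict(float)
model_free_update(q_mf, [("s0", "a1", 0.0), ("s1", "a2", 1.0)])

q_hrl = defaultdict(float)
hierarchical_update(q_hrl, "s0", ["a1", "a2"], total_reward=1.0)
```

Under the model-free rule the first action earns no credit (its immediate outcome was zero), while under the hierarchical rule the chunk as a whole is strengthened by the terminal reward.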
The authors note two features of the development of habits: the concatenation of actions, such that sequences become units of selection, which is predicted by hierarchical RL; and the insensitivity of actions to the devaluation of their outcomes, which is predicted by model-free RL. Here they report experiments, and draw on prior modelling work, to show that hierarchical RL can itself produce insensitivity to outcome devaluation. This brings the two features of habit learning under a common mechanism, and renders a purely model-free RL account of action learning redundant. Instead, model-free RL is subsumed within a hierarchical RL controller, which is involved in the early learning of action components but later devolves oversight (hence the insensitivity to devaluation).
Hierarchical RL leads to two kinds of action errors, planning errors and action slips (for which they distinguish two types).
Planning errors result from ballistic control, meaning that intervening changes in outcome do not affect the action sequence.
Action slips are also due to ‘open-loop control’, i.e. a lack of outcome evaluation for component actions. The first kind occurs when ballistic control means a sequence is completed despite a reward being delivered mid-sequence (rendering completion of the sequence irrelevant; see refs 30 and 31 in the original). The second kind is the ‘capture error’ or ‘strong habit intrusion’, where a well-rehearsed completion of a sequence runs off from initial action(s) that were intended as part of a different sequence.
I don’t see a fundamental difference between the first type of action slip and the planning error, but that may be my failing.
They note that model-free RL does not predict the specific timing of errors (hierarchical RL predicts errors due to devaluation in the middle of sequences, and habitual intrusions at the joins between sequences; see Botvinick & Bylsma, 2005), and does not predict action slips (as Dezfouli et al. define them).
They use a two-stage decision task in humans to show insensitivity to intermediate outcomes within a sequence.
Quoting Botvinick & Weinstein (2014)’s description of the result, because their own is less clear:
“they observed that when subjects began a trial with the same action that they had used to begin the previous trial, in cases where that previous trial had ended with a reward, subjects were prone to follow up with the same second-step action as well, regardless of the outcome of the first action. And when this occurred, the second action was executed with a brief reaction time, compared to trials where a different second-step action was selected.”
The first action, because it was part of a successful sequence, was reinforced (more likely to be chosen, and executed more quickly), even on occasions when the intermediate outcome – the one that resulted from that first action – was not successful.
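The behavioural signature quoted above can be caricatured as a simple decision rule. This is my illustrative sketch, not the authors' model: the ‘replay the last rewarded chunk’ rule and all names here are assumptions.

```python
def choose_second_action(prev_trial_rewarded, prev_sequence,
                         first_action, planned_second_action):
    """If the previous trial's sequence was rewarded and its first
    action has just been repeated, the chunk runs off ballistically:
    its second action is emitted quickly, without re-evaluating the
    intermediate outcome. Otherwise the agent re-plans (slowly)."""
    if prev_trial_rewarded and first_action == prev_sequence[0]:
        return prev_sequence[1], "fast"   # open-loop chunk completion
    return planned_second_action, "slow"  # deliberative choice

# Previous trial: sequence ('a1', 'a2') ended in reward. Repeating
# 'a1' now triggers 'a2' regardless of the first-stage outcome.
action, speed = choose_second_action(True, ("a1", "a2"), "a1", "b2")
```

The point of the caricature is that reaction time and choice repetition are both consequences of the same chunking mechanism, not two separate effects.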
Rats tested in extinction recover goal-directed control over their actions (as indicated by outcome devaluation having the predicted effect). This is predicted by a normative analysis where habits should only exist when their time/effort saving benefits outweigh the costs.
The authors note that this is “consistent with a report showing that the pattern of neuronal activity, within dorso-lateral striatum that marks the beginning and end of the action sequences during training, is diminished when the reward is removed during extinction”.
They review evidence for a common locus (the striatum of the basal ganglia) and common mechanism (dopamine signals) for action valuation and sequence learning. Including:
“evidence suggests that the administration of a dopamine antagonist disrupts the chunking of movements into well-integrated sequences in capuchin monkeys, which can be reversed by co-administration of a dopamine agonist. In addition, motor chunking appears not to occur in Parkinson’s patients due to a loss of dopaminergic activity in the sensorimotor putamen, which can be restored in patients on L-DOPA.”
My memory of this literature is that the evidence on chunking in Parkinson’s is far from convincing or consistent, so I might take these two results with a pinch of salt.
Their conclusion: “This hierarchical view suggests that the development of action sequences and the insensitivity of actions to changes in outcome value are essentially two sides of the same coin, explaining why these two aspects of automatic behaviour involve a shared neural structure.”
Botvinick, M. M., & Bylsma, L. M. (2005). Distraction and action slips in an everyday task: Evidence for a dynamic representation of task context. Psychonomic Bulletin & Review, 12(6), 1011-1017.
Botvinick, M., & Weinstein, A. (2014). Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1655), 20130480.
Dezfouli, A., Lingawi, N. W., & Balleine, B. W. (2014). Habits as action sequences: hierarchical action control and changes in outcome value. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1655), 20130482.