I have no explanation for my own vile ambitions. Confronted with your pus, I could not stop to examine my direction, whether or not I was aimed at a star. As I limped down the street every window broadcast a command: Change! Purify! Experiment! Cauterize! Reverse! Burn! Preserve! Teach! Believe me, Edith, I had to act, and act fast. That was my nature. Call me Dr. Frankenstein with a deadline. I seemed to wake up in the middle of a car accident, limbs strewn everywhere, detached voices screaming for comfort, severed fingers pointing homeward, all the debris withering like sliced cheese out of Cellophane – and all I had in the wrecked world was a needle and thread, so I got down on my knees, I pulled pieces out of the mess and I started to stitch them together. I had an idea of what a man should look like, but it kept changing. I couldn’t devote a lifetime to discovering the ideal physique. All I heard was pain, all I saw was mutilation. My needle going so madly, sometimes I found I’d run the thread right through my own flesh and I was joined to one of my own grotesque creations – I’d rip us apart – and then I heard my own voice howling with the others, and I knew that I was also truly part of the disaster. But I also realized that I was not the only one on my knees sewing frantically. There were others like me, making the same monstrous mistakes, driven by the same impure urgency, stitching themselves into the ruined heap, painfully extracting themselves
F., in Leonard Cohen’s Beautiful Losers (1966), p.175
The exploration-exploitation trade-off is a fundamental dilemma whenever you learn about the world by trying things out. The dilemma is between choosing what you know and getting something close to what you expect (‘exploitation’) and choosing something you aren’t sure about and possibly learning more (‘exploration’). For example, suppose you are in a restaurant and you look at the menu:
Fish and Chips
Assuming for the sake of example that you’re not very good with Sri Lankan food, you’ve now got a choice. You can ‘exploit’ – go with the fish and chips, which will probably be alright – or you can ‘explore’ – try something you haven’t had before and see what you get. Obviously which you decide to do will depend on many things: how hungry you are, how good the restaurant reviews are, how adventurous you are, how often you reckon you’ll be coming back, etc. What’s important is that the study of the best way to make these kinds of choices – called reinforcement learning – has shown that optimal learning requires you to sometimes make some bad choices. This means that sometimes you have to choose to avoid the action you think will be most rewarding, and take an action which you think will be less rewarding. The rationale is that these ‘sub-optimal’ actions are necessary for your long term benefit – you need to go off track sometimes to learn more about the environment. The exploration-exploitation dilemma is really a trade-off: enjoy more now vs learn more now and enjoy later. You can’t avoid it; all you can do is position yourself somewhere along the spectrum.
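To make the trade-off concrete, here is a minimal sketch of one simple strategy from reinforcement learning, usually called epsilon-greedy: most of the time you pick the option you currently rate highest, but with a small probability (epsilon) you pick at random. This is only an illustration, not anything taken from our paper. The two "dishes", their payoffs of 0.6 and 0.9, and the function name `epsilon_greedy` are all invented for the example.

```python
import random

def epsilon_greedy(true_means, epsilon, n_trials, noise=0.0, seed=0):
    """Run a toy epsilon-greedy learner on a set of options ('arms').

    true_means: hypothetical average payoff of each option (made-up numbers).
    epsilon:    probability of exploring (choosing at random) on each trial.
    noise:      standard deviation of random payoff noise (0.0 = none).
    Returns (counts, estimates, total_reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each option was chosen
    estimates = [0.0] * n_arms   # running estimate of each option's payoff
    total = 0.0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any option at random
        else:
            # exploit: pick the option with the highest current estimate
            # (ties go to the first option in the list)
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = true_means[arm] + noise * rng.gauss(0.0, 1.0)
        counts[arm] += 1
        # incremental update of the running average for the chosen option
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, estimates, total

if __name__ == "__main__":
    menu = [0.6, 0.9]  # assumed payoffs: fish and chips, the unfamiliar dish
    print(epsilon_greedy(menu, epsilon=0.0, n_trials=1000))  # never explores
    print(epsilon_greedy(menu, epsilon=0.1, n_trials=1000))  # explores 10% of the time
```

With epsilon set to zero the learner tries the first dish, finds it acceptable, and never orders anything else, so it never finds out that the other dish is better. Any epsilon above zero costs the occasional worse meal while the first dish still looks best, but gives the learner a chance to discover the better option.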
Because the trade-off is fundamental, we would expect to be able to see it in all learning domains, not just restaurant food choices. In work just published, we’ve been using a new task to look at how actions are learnt. We asked people to use a joystick to explore the space of all possible movements, giving them a signal when they made a particular target movement. This task – which we’re pretty keen on – gives us a lens to look at the relation between how people explore the possible movements they can make and which particular movements they learn to rely on to generate predictable outcomes (which we call ‘actions’).
Using data gathered from this task, it is possible to see the exploration-exploitation trade-off in action. With each target, people get 10 attempts to try to identify the right movement to make. Obviously some successful movements will be more efficient than others, because it is possible to hit the target after going all “round the houses” first, adding lots of extraneous movements and taking longer than needed. If you had a success like this you could repeat it exactly (‘exploit’), or try to cut out some of the extraneous movement and risk missing the target (‘explore’). Obviously this refinement of action through trial and error is of critical interest to anyone who cares about how we learn skilled movements.
I calculated an average performance score for the first 50% and second 50% of attempts (basically a measure of distance travelled before hitting the target – so lower scores mean better performance). I also calculated how variable these performance scores were in the first 50% and second 50%. Normally we would expect people who perform best in the first half of a test to perform best in the second half (depressingly, people who start out ahead usually stay there!). But this analysis showed up something interesting: a strong correlation between variability in the first half and performance in the second half. You can see this in the graph.
This shows that people who are most inconsistent when they start to learn perform best towards the end of learning. Usually inconsistency is a bad sign, so it is somewhat surprising that it predicts better performance later on. The obvious interpretation is in terms of the exploration-exploitation trade-off. The inconsistent people are trying out more things at the beginning, learning more about what works and what doesn’t. This provides them with the foundation to perform well later on. This pattern holds when comparing across individuals, but it also holds for comparing across trials (so for the same individual, their later performance is better for targets on which they are most inconsistent early in learning).
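For anyone who wants to try this kind of split-half analysis on their own data, here is a rough sketch of the across-individuals version. It is not the analysis code from the paper: the function names are my own for this example, I have assumed the sample standard deviation as the measure of variability and a Pearson correlation as the measure of association, and the numbers in `toy_scores` are made up purely so the code has something to run on.

```python
from statistics import mean, stdev

def split_halves(scores):
    """Split a list of per-attempt scores into first and second halves."""
    mid = len(scores) // 2
    return scores[:mid], scores[mid:]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def early_variability_vs_late_performance(scores_by_person):
    """Correlate first-half variability with second-half mean score."""
    early_sd, late_mean = [], []
    for scores in scores_by_person.values():
        first, second = split_halves(scores)
        early_sd.append(stdev(first))    # how inconsistent early on
        late_mean.append(mean(second))   # how good later (lower = better)
    return pearson_r(early_sd, late_mean)

# Made-up illustrative scores (NOT data from the study): 10 attempts each,
# where a lower score means less distance travelled before hitting the target.
toy_scores = {
    "person_a": [10, 10, 10, 11, 10, 9, 9, 9, 9, 9],
    "person_b": [14, 8, 12, 6, 10, 6, 6, 5, 6, 5],
    "person_c": [12, 9, 11, 8, 10, 7, 8, 7, 7, 8],
}

if __name__ == "__main__":
    print(early_variability_vs_late_performance(toy_scores))
```

Because lower scores mean better performance, the pattern described above (more early inconsistency, better later performance) would show up here as a negative correlation.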
It’s a curious fact that although psychologists have thoroughly investigated how actions are valued (i.e. how you figure out how good or bad a thing is to do), and how actions are trained (i.e. shaped and refined over time), the same effort has not gone into investigating how a behaviour is first identified and stored as a part of our repertoire. We hope this task provides a useful tool for opening up this area for investigation.
As well as the basic description of the task, the paper also contains a section outlining how the form of learning that the task makes available for inspection is different from the forms of learning made available by other ‘action learning’ tasks (such as, for example, operant conditioning tasks). In addition to serving an under-investigated area of learning research, the task also has a number of practical benefits. It is scalable in difficulty, suitable for repeated measures designs (meaning you can do it again and again – it isn’t something you learn once and then can’t be tested on any more) as well as being adaptable for different species (meaning you can test humans and non-human animals on the task).