1 Part 1: Perceptual Decision-Making with Confidence
Metacognition, the ability to reflect on one’s own cognitive processes, is commonly studied through confidence ratings in decision-making tasks. In this document we highlight empirical findings and experimental considerations relevant to perceptual decision making with confidence ratings.
Three main aspects characterize the relationship between confidence ratings and choices.
Confidence ratings reflect a subjective probability of being correct or a form of self-consistency that incorporates response biases rather than purely objective performance (Caziot and Mamassian 2021; Sánchez-Fuenzalida et al. 2025; Mihali et al. 2023; Navajas et al. 2017).
Confidence judgments tend to underestimate first-order accuracy (i.e., 90% accuracy might produce only 80% confidence) (Shekhar and Rahnev 2021).
Metacognition and error-monitoring are functionally related systems: being confident that one made an error is a form of error-monitoring (Yeung and Summerfield 2012; Stephen M. Fleming and Daw 2017; Öztel and Balcı 2024).
1.1 Models of perceptual decision making and metacognition
Computational models of metacognition typically fall into two categories: static models based on Signal Detection Theory (SDT) and dynamic models based on evidence accumulation (EA).
Static models describe choice and confidence as originating from a static process of acquiring a fixed sample of evidence on each trial. Confidence ratings are then assumed to be computed from this evidence alone or in conjunction with additional independent information.
Dynamic models additionally account for response times by assuming that evidence accumulates over time. When a decision boundary is reached (enough evidence has been collected), a choice is made. Confidence ratings are then thought to be generated by additional evidence accumulation in the post-decisional time window.
Both approaches can be understood as (approximately) optimal solutions to partially observable Markov decision processes under different assumptions. In most perceptual decision making tasks these two types of models are not mutually exclusive.
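The contrast between the two model families can be made concrete with a toy simulation. The sketch below is illustrative only (unit noise, symmetric bounds, and confidence read out as the magnitude of the decision variable are simplifying assumptions, not choices taken from a specific published model): the static observer draws one evidence sample per trial, while the dynamic observer accumulates evidence to a bound and therefore also predicts a response time.

```python
import numpy as np

rng = np.random.default_rng(1)

def static_trial(s, sigma=1.0):
    """Static (SDT-style) observer: one evidence sample per trial.
    The choice is the sign of the sample and confidence its magnitude."""
    x = rng.normal(s, sigma)
    return np.sign(x), abs(x)

def dynamic_trial(s, bound=1.5, dt=0.01, sigma=1.0, max_steps=10_000):
    """Dynamic (EA-style) observer: evidence accumulates until a bound
    is reached, so the model additionally predicts a response time."""
    x = 0.0
    for step in range(1, max_steps + 1):
        x += s * dt + sigma * np.sqrt(dt) * rng.normal()
        if abs(x) >= bound:
            break
    return np.sign(x), abs(x), step * dt

def accuracy(trial_fn, s, n=2_000):
    """Fraction of trials on which the choice matches the stimulus sign."""
    return np.mean([trial_fn(s)[0] == np.sign(s) for _ in range(n)])
```

Both observers reproduce the most basic empirical pattern: accuracy rises with stimulus intensity `s`.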
1.1.1 Patterns of confidence ratings
Regardless of theoretical framework, a good metacognition model must be able to capture several empirical patterns observed in the literature.
Performance increases with stimulus intensity: Higher stimulus intensity yields higher accuracy.
Confidence curves shift in tandem with first-order choice curves: As the psychometric function governing choice probability shifts away from unbiased (threshold at 0), so do the confidence curves. The lowest confidence ratings for correct trials occur at the stimulus value corresponding to the threshold (Caziot and Mamassian 2021; Sánchez-Fuenzalida et al. 2025; Mihali et al. 2023).
Confidence increases with stimulus intensity: Higher stimulus intensities yield higher confidence ratings (conditional on accuracy, see below) (Hangya et al. 2016; Sanders et al. 2016).
Confidence varies with accuracy: The relationship between confidence and stimulus intensity depends on response correctness, with correct trials showing on average higher confidence than incorrect trials. (Hangya et al. 2016; Sanders et al. 2016).
“Folded-X” pattern: Confidence sometimes increases with stimulus intensity for correct responses but decreases (or plateaus) for errors. This pattern is considered a signature of metacognition (Hangya et al. 2016; Sanders et al. 2016).
“Double-increase” pattern: Confidence increases with stimulus intensity for both correct and incorrect responses.
Recent work suggests that these two confidence patterns could arise from different experimental manipulations (Xue et al. 2026; Fung et al. 2025):
Folded-X: Emerges when manipulating decision-relevant stimulus properties (e.g., coherence in RDM).
Double-increase: Occurs when manipulating auxiliary difficulty markers (e.g., stimulus visibility).
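Both patterns can be produced by a toy SDT-style simulation. In the sketch below, evidence is a single normal sample, confidence is its magnitude, and the double-increase case is generated by adding a hypothetical additive visibility cue to confidence; this particular read-out is an illustrative assumption, not a model taken from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(s_levels, n=20_000, visibility_cue=0.0):
    """Evidence x ~ N(s, 1); choice = sign(x); confidence = |x| plus an
    optional additive cue that scales with the manipulated variable.
    Returns (mean confidence on correct, on error) for each level."""
    out = []
    for s in s_levels:
        x = rng.normal(s, 1.0, size=n)
        correct = np.sign(x) == np.sign(s)
        conf = np.abs(x) + visibility_cue * s
        out.append((conf[correct].mean(), conf[~correct].mean()))
    return out

levels = [0.25, 0.5, 1.0, 1.5]
folded = simulate(levels)                      # decision-relevant manipulation
double = simulate(levels, visibility_cue=1.0)  # auxiliary-cue manipulation
```

In the first case, errors at high intensities occur only when noise strongly contradicts the signal, so mean |evidence| on error trials falls with intensity (folded-X); in the second, the cue inflates confidence on correct and incorrect trials alike (double-increase).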
These studies investigating the folded-X and double-increase patterns use a confidence scale that ranges from guessing to certain. This severely limits the information carried by the incorrect side of the confidence curve, as we argue in the next section. Furthermore, if participants cannot express that they made an error, one might argue that the observed pattern does not constitute error-monitoring, even if confidence ratings for incorrect trials decrease as stimulus intensity increases.
Confidence patterns may reflect error-monitoring processes, but this can only be observed when the confidence scale allows participants to explicitly indicate that they believe they made a mistake. When confidence is rated on a scale ranging from guessing to confident, participants lack a clear way to report that they believe their response was incorrect. Furthermore, confidence ratings should be administered after the binary choice rather than simultaneously with it. Only when the choice precedes the confidence judgment can participants first commit to a response and then evaluate whether they may have erred. If choice and confidence are reported simultaneously, the interpretation of a folded-X pattern becomes ambiguous, as it could reflect participants strategically pairing a low-confidence response with an incorrect answer rather than detecting an error after committing to a choice.
To summarize: only a full confidence scale ranging from incorrect to correct, combined with a sequential design in which choices are followed by confidence ratings, allows subjects to demonstrate error-monitoring and thereby produce a meaningful folded-X pattern. In such a pattern, confidence ratings for incorrect trials fall below the midpoint of the scale (which expresses “guessing”). Only then can the folded-X reflect functioning error-monitoring (confidence decreases for errors), whereas a double-increase pattern would suggest no error-monitoring. Below we define and describe in more detail how experimental designs and measurement scales influence the types of data and interactions one might observe.
1.2 Confidence scales, stimulus granularity, and measurement considerations
Here we outline several methodological considerations and argue that some of them inevitably determine the types of empirical data one can observe.
Confidence measurement: Confidence is collected either discretely or continuously. Some models of metacognition (e.g., M-ratio) require discrete data, leading many experiments to collect or discretize continuous confidence ratings (Maniscalco and Lau 2012; Stephen M. Fleming 2017).
Stimulus intensity: The main stimulus axis that determines trial difficulty. It can be discrete or continuous.
Confidence scale construction: The particular anchors of a confidence scale.
Half-scale confidence. A half-scale ranges from “guessing” to “confident”. This prevents participants from expressing certainty about errors: they cannot indicate that they are sure they made a mistake. Half-scales generally produce “half-folded-X” patterns (a flat or near-flat incorrect branch) rather than a full folded-X.
Full-scale confidence. A full scale ranges from “confidently incorrect” to “confidently correct”, enabling participants to express error-monitoring. This scale can produce a full folded-X pattern.
Simultaneous choice–confidence. When choice and confidence are reported together, a double-increase pattern should be expected: expressing certainty about an error would require participants to deliberately choose the answer they believe is wrong.
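The consequence of scale construction can be shown directly: starting from hypothetical signed (full-scale) confidence, censoring the below-midpoint half of the scale, as a half-scale implicitly does, erases the error-monitoring signal on incorrect trials. The generative story below (a second, partially correlated rating sample passed through tanh) is purely illustrative and not drawn from any cited model.

```python
import numpy as np

rng = np.random.default_rng(3)

def full_scale_trials(s, n=20_000):
    """Illustrative generative story: a decision sample x drives the
    choice; a second, correlated rating sample x2 drives a signed
    confidence c = tanh(x2) * choice, where c = -1 means "sure I
    erred" and c = +1 means "sure I'm correct" (0 = guessing)."""
    x = rng.normal(s, 1.0, size=n)
    choice = np.sign(x)
    x2 = 0.5 * x + 0.5 * rng.normal(s, 1.0, size=n)
    c = np.tanh(x2) * choice
    correct = choice == np.sign(s)
    return c, correct

c, correct = full_scale_trials(1.0)
half = np.clip(c, 0.0, None)  # half-scale: everything below "guessing" is censored
```

On the full scale, mean confidence on error trials falls below the midpoint; after folding to the half-scale that information is gone, leaving the flat incorrect branch described above.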
These confidence patterns, which arise only under the aforementioned experimental constraints, primarily reflect how incorrect trials are distributed across stimulus intensities. Their interpretation depends critically on the expressiveness of the confidence scale. Error-monitoring processes can only be observed when the scale allows participants to indicate that they believe they made a mistake. When such responses are possible, incorrect choices can be accompanied by low confidence (“I am confident I made a mistake”) that reflects an awareness of being wrong. For these reasons, tasks that include many levels of stimulus intensity together with a sufficiently expressive confidence scale provide a parsimonious and statistically powerful framework for investigating metacognition. In particular, the confidence scale must allow participants to report that they believe their response was incorrect (i.e., a full confidence scale).
1.3 Error-monitoring
Computational models of metacognition have attempted to explain error-monitoring in different ways. A central question for these models is why a rational agent would make a mistake that it is subsequently aware of.
Across both modeling approaches, the main answer has been to assume that additional information is available for the confidence judgment beyond what was used for the binary choice; if the two were identical, genuine error-monitoring could not occur (Stephen M. Fleming and Daw 2017).
Stephen M. Fleming and Daw (2017) formalized this idea in the second-order model. In this framework, evidence for the action and evidence for the confidence rating are drawn from a multivariate normal distribution with correlation \(\rho\) governing the degree of information sharing. If \(\rho\) = 1, the confidence judgment is based on the same information as the action (and no error-monitoring is possible; i.e., a first-order model). If \(\rho\) = 0, the confidence judgment is based on completely independent information.
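A minimal sketch of this idea is given below. It simplifies the full model inversion of Fleming and Daw (2017): the rater applies the logistic posterior for symmetric \(\pm s\) sources directly to the rating sample, and the joint distribution is constructed explicitly so that \(\rho = 1\) yields exact equality of the two samples.

```python
import numpy as np

rng = np.random.default_rng(4)

def second_order_trials(s, rho, n=50_000):
    """Action evidence x_act and rating evidence x_conf are jointly
    normal with correlation rho. Confidence is the (simplified)
    posterior probability that the chosen action was correct."""
    x_act = rng.normal(s, 1.0, size=n)
    noise = rng.normal(0.0, 1.0, size=n)
    x_conf = s + rho * (x_act - s) + np.sqrt(1.0 - rho**2) * noise
    choice = np.sign(x_act)
    p_pos = 1.0 / (1.0 + np.exp(-2.0 * s * x_conf))  # P(source = +s | x_conf)
    conf = np.where(choice > 0, p_pos, 1.0 - p_pos)   # P(chosen action correct)
    correct = choice == np.sign(s)
    return conf, correct

conf1, corr1 = second_order_trials(1.0, rho=1.0)  # first-order limit
conf0, corr0 = second_order_trials(1.0, rho=0.5)  # partially shared information
```

With \(\rho = 1\) confidence in the chosen action can never fall below 0.5 (no error-monitoring), whereas with \(\rho < 1\) a substantial fraction of error trials receives below-chance confidence.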
A similar idea has been proposed in dynamic models of perceptual decision making. In these models, evidence for the initial choice arises from a noisy accumulation process that continues until an evidence bound is hit and a choice is made. It is then further assumed that evidence keeps accumulating after the choice, in a post-decisional evidence accumulation window, which gives rise to a confidence rating (Pleskac and Busemeyer 2010). This theory of post-decisional evidence accumulation (PDEA) is one of the leading theoretical accounts of the generative process underlying both choices and confidence ratings that can explain error-monitoring (Desender et al. 2021). Below we provide a selective review of studies investigating PDEA. We challenge the idea that additional independent information is necessary for the computation of confidence to produce the patterns of confidence described above, and hence error-monitoring. We end by arguing that the rater merely needs access to the action produced by the actor in order to explain the empirical patterns.
1.4 Post-Decisional Evidence Accumulation
Several approaches have been used to determine how PDEA works. In general, PDEA has been shown to increase metacognitive efficiency and to have distinct electrophysiological signatures (Desender et al. 2021).
PDEA unfolds within a remarkably short temporal window. Intracranial recordings in epilepsy patients performing mouse-tracking tasks reveal that post-decisional confidence processing occurs within 300–500 ms of choice initiation, with the pre-supplementary motor area associated with confidence and changes of mind (Goueytes et al. 2025; Resulaj et al. 2009). Neural recordings in rhesus monkeys have further demonstrated that anatomical regions implicated in choice formation are also the regions associated with confidence generation (Kiani and Shadlen 2009). This brief time course and overlapping neural circuitry suggest that PDEA, rather than reflecting slow deliberative processes, operates as a relatively automatic extension of the initial decision dynamics.
PDEA appears to be conditional on task demands and resource constraints. Under speed-stress instructions (choice response times below 500 ms), confidence response times are much longer and subjects show enhanced error-monitoring, particularly for easy trials, compared with accuracy stress. Under accuracy stress, when subjects have sufficient time for choice formation, choices are generally produced much more slowly and confidence judgments are made after a roughly constant time, independent of difficulty and choice response time (Baranski and Petrusic 1998). Consistent with this pattern, subjects take significantly longer to respond when they know a confidence judgment is required than on choice-only trials, suggesting that when not time-pressed, evidence accumulation for choice and confidence may occur pre-decisionally, i.e., in a parallel or single evidence-accumulation process (Petrusic and Baranski 2003; Baranski and Petrusic 2001).
Another way to investigate PDEA is to provide additional stimulus exposure after the participant has committed to a choice. Stephen M. Fleming et al. (2018) showed that 300 ms of additional stimulus viewing after a speed-stressed choice (300 ms) can increase confidence efficiency (the degree to which confidence ratings can be used to disentangle correct from incorrect choices), with the strongest effect being a decrease in confidence ratings for incorrect trials.
Yu et al. (2015) found that manipulating the interval between choice and confidence produced changes in confidence ratings, in particular decreased confidence for errors. Additional exposure to the stimulus during this interval did not change the quality (efficiency) of the confidence ratings. This finding has been interpreted as indicating that internal evidence may be sufficient for post-decisional updating and that additional exposure to the stimulus is not necessary.
In most of the highlighted studies a random dot motion stimulus is used as the perceptual decision making task. In such tasks, participants’ performance asymptotes with exposure times of only 300–600 ms, showing no improvement with longer presentations (Tsetsos et al. 2015). Crucially, this pattern holds even when stimulus-response mappings are manipulated, indicating that optimal decisions require only brief stimulus exposure. However, when confidence ratings are required, participants tend to wait significantly longer before making an initial response if not pressed for time.
Taken together, these findings suggest that when time permits, most confidence-relevant processing may occur pre-decisionally. The most convincing evidence for PDEA comes from conditions in which participants are time-constrained in the primary decision. In such cases, PDEA appears to increase metacognitive efficiency primarily by decreasing confidence ratings for incorrect choices.
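The PDEA mechanism reviewed above can be sketched in a few lines: accumulate noisy evidence to a bound for the choice, continue accumulating for a fixed post-decisional window, and read confidence from the final evidence signed by the chosen direction. Parameter values here are arbitrary illustrations, not estimates from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(5)

def pdea_trial(drift, bound=1.0, dt=0.01, sigma=1.0, post_time=0.3):
    """Accumulate to a bound (choice), then keep accumulating for a
    fixed post-decisional window. Confidence is the final evidence
    signed by the chosen direction, so it turns negative whenever
    post-decisional evidence contradicts the committed choice."""
    x = 0.0
    while abs(x) < bound:
        x += drift * dt + sigma * np.sqrt(dt) * rng.normal()
    choice = np.sign(x)
    for _ in range(int(post_time / dt)):
        x += drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return choice, choice * x

trials = [pdea_trial(1.0) for _ in range(5_000)]
correct = np.array([ch > 0 for ch, _ in trials])   # drift > 0, so +1 is correct
conf = np.array([q for _, q in trials])
```

Because accumulation continues after commitment, signed confidence on a subset of error trials crosses below zero: the model detects its own errors, matching the empirical decrease in error confidence described above.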
1.5 An Alternative Account: Motor Control and Choice Variability
While the PDEA framework offers a mechanistically parsimonious explanation by using the same accumulation process for both choice and confidence, we propose a more general account that requires neither additional independent information nor extra post-decisional accumulation. The short temporal window of PDEA (~300–500 ms) coincides with the timescales of motor control processes, so what is characterized as PDEA may instead reflect motor awareness. This account is further supported by the observation that under speed stress subjects produce more errors, and that these errors partly arise from reduced coordination between the central nervous system and the effectors due to the time constraints. Motor control theory, particularly comparator models involving efference copy mechanisms, provides a framework for understanding these errors and how agents can become aware of them and subsequently correct them. When an agent initiates an action, an efference copy is generated; after the action is committed, it is compared against the expected sensory consequences via a forward model. Any mismatch between predicted and actual motor output signals an error, and the magnitude of this prediction error indicates the degree of confidence in the choice. Subjects can therefore become aware that their executed action differs from their intended choice, enabling confidence ratings below guessing, without requiring PDEA about the stimulus itself.
From this formulation we propose that error-monitoring can arise from choice variability, i.e., stochasticity in the mapping from evidence to choice. Most perceptual decision-making models assume deterministic choices given the acquired evidence, but a stochastic choice process allows agents to produce errors of which they can subsequently become aware. Ordinarily, evidence variability and choice variability would be indistinguishable from binary responses alone; however, as will be shown below, multivariate modeling of choices and confidence ratings allows these two noise sources to be distinguished.
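The core of this proposal can be sketched as follows. The same evidence sample feeds both stages: the choice is a stochastic function of the evidence (here a softmax rule standing in for motor/selection noise, an illustrative assumption), and confidence is the posterior probability that the executed action was correct, computed from that evidence alone, with no second, independent sample.

```python
import numpy as np

rng = np.random.default_rng(6)

def choice_variability_trials(s, beta=2.0, n=50_000):
    """Evidence x ~ N(s, 1) drives a *stochastic* choice rule; the
    rater then scores the executed action against the same x."""
    x = rng.normal(s, 1.0, size=n)
    p_choose_pos = 1.0 / (1.0 + np.exp(-beta * x))   # stochastic choice rule
    choice = np.where(rng.random(n) < p_choose_pos, 1.0, -1.0)
    p_pos = 1.0 / (1.0 + np.exp(-2.0 * s * x))       # belief that source is +s
    conf = np.where(choice > 0, p_pos, 1.0 - p_pos)  # confidence in executed action
    correct = choice == np.sign(s)
    return conf, correct

conf, correct = choice_variability_trials(1.0)
```

Errors caused by choice noise despite supportive evidence are immediately detectable: a large share of error trials carries confidence below 0.5, so error-monitoring emerges with access only to the evidence and the executed action.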
In the following section we derive a model incorporating choice variability that can produce the empirical patterns of confidence and choices described earlier. This model can be expressed as a specific instantiation of Fleming and Daw’s second-order framework (Stephen M. Fleming and Daw 2017), but one that allows error-monitoring without requiring independent information.