3 Part 3: Design, Methods, and Hypotheses
3.1 Experimental study
This study examines motion discrimination and metacognition in a Random Dot Motion (RDM) task. Specifically, we ask whether the time between a first-order choice and a confidence judgement (post-decisional time) influences metacognitive uncertainty. We operationalize this as whether confidence ratings track the probability of being correct more closely when participants are required to wait longer before giving their confidence ratings than when they can respond immediately.
Dynamic models of metacognition assume that evidence is accumulated until a response boundary is reached and a binary decision is initiated. After the binary choice, additional evidence is assumed to be accumulated post-decisionally to inform the confidence rating. If this post-decisional evidence accumulation is informative about the correctness of the choice, increasing the post-decisional time window should improve the quality of the confidence ratings; this improvement would manifest as confidence ratings more closely approximating the objective probability that the choice was correct. However, if the evidence used for the confidence rating is already fully computed around the time of the decision (i.e., pre-decisionally), one would expect no difference, or even the opposite effect: confidence ratings becoming less consistent with the probability of being correct would indicate that some information leakage is taking place, perhaps due to memory constraints. In this study, we manipulate the post-decisional time window to examine its effects on the quality of confidence ratings. Previous studies of post-decisional evidence accumulation have generally found that such accumulation is easiest to detect in speeded decision-making tasks, particularly on easy trials, where increased post-decisional time can decrease confidence ratings for incorrect choices.
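To make the competing predictions concrete, the following minimal simulation (illustrative parameter values only, not the preregistered model) accumulates noisy evidence to a bound, continues accumulating for a post-decisional window, and reads confidence out of the final evidence. If the extra evidence is informative, confidence separates correct from error trials more sharply as the window grows.

```python
import numpy as np

# Minimal sketch of pre- and post-decisional evidence accumulation
# (illustrative parameter values only; not the preregistered model).
rng = np.random.default_rng(1)

def trial(drift, bound=1.0, dt=0.01, post_steps=0):
    # Accumulate noisy evidence until a response bound is reached.
    x = 0.0
    while abs(x) < bound:
        x += drift * dt + np.sqrt(dt) * rng.normal()
    choice = 1 if x > 0 else -1          # first-order decision
    for _ in range(post_steps):          # post-decisional accumulation
        x += drift * dt + np.sqrt(dt) * rng.normal()
    confidence = choice * x              # evidence for the chosen option
    correct = choice == (1 if drift > 0 else -1)
    return correct, confidence

for post_steps in (0, 200):              # no delay vs. 2 s at dt = 0.01
    trials = [trial(drift=0.5, post_steps=post_steps) for _ in range(5000)]
    conf = np.array([c for _, c in trials])
    correct = np.array([ok for ok, _ in trials])
    # With informative post-decisional evidence, confidence separates
    # correct from error trials more as the window grows.
    print(post_steps, conf[correct].mean() - conf[~correct].mean())
```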
In this study, we use a non-speeded perceptual decision-making task with individually calibrated difficulty levels ranging from very easy to very hard. We test whether post-decisional evidence accumulation can unfold over a longer time period even in non-speeded tasks. Because previous literature on this question is inconclusive, we adopt a Bayesian sequential-sampling framework. This approach allows us to quantify evidence for the null hypothesis (that post-decisional time does not change the quality of confidence ratings) and to collect only the number of participants necessary to reach a decision threshold.
3.1.1 Computational model
To test whether the post-decisional response window changes how informative confidence ratings are about choice correctness, we constructed a hierarchical Bayesian model grounded in signal detection theory and motor control theory. The model assumes that agents exhibit variability both in the encoding of the physical stimulus, as in traditional signal detection theory, and in the selection of a choice given the internal evidence. Consistent with motor control theories, the model assumes that an efference copy is generated before the efferent signal is sent to the acting limb. This efference copy is corrupted by additional noise, consistent with other generative metacognition models, before being combined with the executed action to compute confidence in the chosen response. The model accounts for error monitoring through the inclusion of choice variability. For the full derivation and simulations, see the previous page. Here it suffices to note that the metacognitive uncertainty parameter \(\sigma_m\) governs the relationship between confidence ratings and choice correctness: as \(\sigma_m\) increases, confidence becomes less informative about whether the choice was correct.
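As a toy illustration of this last point (all distributions and parameter values below are placeholders; the preregistered model is specified in the Stan code), the sketch adds metacognitive noise \(\sigma_m\) to an efference copy of the evidence and shows the confidence–accuracy relationship weakening as \(\sigma_m\) grows.

```python
import numpy as np

# Toy illustration of the metacognitive uncertainty parameter sigma_m
# (all distributions and values are placeholders; see the Stan model).
rng = np.random.default_rng(2)
n, d = 10_000, 1.0                            # trials, stimulus strength
x = rng.normal(d, 1.0, n)                     # noisy encoding of the stimulus
choice = (x + rng.normal(0.0, 0.5, n)) > 0    # choice with response variability
correct = choice                              # 'up' is correct since d > 0

for sigma_m in (0.1, 1.0, 3.0):
    x_m = x + rng.normal(0.0, sigma_m, n)     # noisy efference copy
    conf = np.where(choice, x_m, -x_m)        # evidence for the chosen action
    r = np.corrcoef(conf, correct.astype(float))[0, 1]
    print(f"sigma_m={sigma_m}: confidence-accuracy correlation = {r:.2f}")
```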
3.1.2 Hypotheses
Hypothesis 1: Increasing the post-decisional time window will lead to no improvement in the relationship between confidence and accuracy.
- Test specification:
This hypothesis is tested by allowing the metacognitive uncertainty parameter \(\sigma_m\) to vary linearly on the log scale with the post-decisional time window. Because this is the main parameter of interest, we will sequentially test whether the Bayes factor for this parameter has reached either of our termination thresholds, 30 or 1/30.
Hypothesis 2: Increasing the post-decisional time window will lead to no difference in confidence bias.
- Test specification:
This hypothesis is tested by allowing the \(\mu_{\beta_c}\) parameter to vary by post-decisional time window. The primary test evaluates whether the group-level parameter differs from 0 (i.e., whether the 95% highest density interval excludes 0). We also aim to report the Bayes factor to quantify the strength of the evidence relative to the null hypothesis of no difference. This test will be conducted once data collection has terminated.
3.1.3 Priors and Sequential Sampling
To limit the number of participants needed to obtain sufficient evidence, we will use a sequential-sampling design. Recruitment will be terminated when the Bayes factor described above reaches either 30 or 1/30, or when 50 non-excluded participants have been collected. We will begin examining the Bayes factor in increments of 5 participants once 15 non-excluded participants have been reached (see exclusion criteria). This minimum of 15 is set to ensure stability of the model estimates; the increment of 5 reflects our expected daily testing rate. The Bayes factor used in this study is the Savage-Dickey density ratio evaluated at 0 (Wagenmakers et al. 2010).
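For reference, a minimal sketch of the Savage-Dickey computation, assuming posterior draws for the effect of interest and a kernel density estimate at 0 (the standard-normal prior and the stand-in draws below are placeholders for the preregistered quantities):

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Sketch of the Savage-Dickey density ratio evaluated at 0.
def savage_dickey_bf01(posterior_draws, prior_density_at_zero):
    # BF01 = posterior density at 0 divided by prior density at 0;
    # BF10 is its reciprocal.
    posterior_density_at_zero = gaussian_kde(posterior_draws)(0.0)[0]
    return posterior_density_at_zero / prior_density_at_zero

draws = np.random.default_rng(3).normal(0.4, 0.15, 4000)  # stand-in posterior
bf01 = savage_dickey_bf01(draws, norm.pdf(0.0))           # standard-normal prior
stop_sampling = bf01 >= 30 or bf01 <= 1 / 30              # termination rule above
```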
Below we define the initial sampler parameters, the convergence diagnostics we require, and contingency plans if the sampler does not converge.
3.1.3.1 Initial Sampler Configuration
The model is initially estimated using the following sampler settings:
- Number of chains: 4
- Number of iterations per chain: 1000
- Warmup iterations: 1000
- Target acceptance rate (adapt_delta): 0.95
- Maximum tree depth (max_treedepth): 12
These values represent our baseline configuration for model estimation.
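As an illustration, assuming the model is fit with cmdstanpy (the file and data paths below are placeholders), the baseline configuration corresponds to a call such as:

```python
from cmdstanpy import CmdStanModel

# Baseline sampler call, assuming the model is fit with cmdstanpy
# (file and data paths are placeholders).
model = CmdStanModel(stan_file="model.stan")
fit = model.sample(
    data="data.json",        # prepared data, assembled elsewhere
    chains=4,
    iter_warmup=1000,
    iter_sampling=1000,
    adapt_delta=0.95,
    max_treedepth=12,
)
```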
3.1.3.1.1 Convergence Criteria
Sampler convergence and reliability will be assessed using the following diagnostics:
- \(\hat{R}\) values below 1.03 for all group-level parameters
- Effective sample size (ESS) above 400 for all group-level parameters
- Zero divergent transitions after warmup
- No saturation of the maximum tree depth
- Well-mixing trace plots
If these criteria are met, the sampling procedure will be considered satisfactory.
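Assuming the fit is loaded into ArviZ (the group-level parameter names below are hypothetical), the first criteria can be checked along these lines; tree-depth saturation and trace-plot mixing (e.g., az.plot_trace) are inspected separately.

```python
import arviz as az

# Automated convergence checks, assuming the cmdstanpy fit from above;
# the group-level parameter names are hypothetical.
idata = az.from_cmdstanpy(fit)
group_pars = ["mu_sigma_m", "mu_beta_c"]
summary = az.summary(idata, var_names=group_pars)
converged = (
    (summary["r_hat"] < 1.03).all()
    and (summary["ess_bulk"] > 400).all()
    and int(idata.sample_stats["diverging"].sum()) == 0
)
```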
3.1.3.1.2 Contingency Procedures
If the above convergence criteria are not met, we will apply the following steps depending on the convergence failure:
- Increase the number of iterations.
- Increase the target acceptance rate (adapt_delta).
- Increase the maximum tree depth (max_treedepth).
- Reconsider the model parameterization (i.e., centered vs. non-centered; see the sketch below).
- Estimate the lapse-rate parameter non-hierarchically.
- Estimate the choice variability parameter non-hierarchically.
These steps are intended to improve sampler stability and ensure reliable posterior estimation.
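For reference, the centered versus non-centered distinction mentioned above, illustrated with numpy draws rather than Stan code (all values are arbitrary):

```python
import numpy as np

# Centered vs. non-centered parameterization of a hierarchical effect.
rng = np.random.default_rng(4)
mu, sigma, n_subjects = 0.5, 0.3, 20

# Centered: subject-level effects drawn directly around the group mean.
theta_centered = rng.normal(mu, sigma, n_subjects)

# Non-centered: standard-normal 'raw' effects shifted and scaled afterwards.
# In Stan this often removes the funnel geometry that causes divergences.
z = rng.normal(0.0, 1.0, n_subjects)
theta_noncentered = mu + sigma * z
```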
The full Stan model and the scripts for checking diagnostics and model fit are available in the sequential-sampling directory (../Sequential Sampling).
3.2 Participants
Eligible participants will be 18–40 years old, have normal or corrected-to-normal vision, and be proficient in English. Individuals with neurological or psychiatric diagnoses, or those taking medication that may affect cognitive performance, will not be eligible. Participants will be recruited through COBE Lab’s participant pool to ensure fast and effective recruitment and will be compensated 110 DKK.
3.3 Design
This study consists of a single one-hour lab visit, during which participants perform a computer experiment on a COMPUTERSPEC(XXX) run using PsychoPy (Peirce et al. 2019). The experiment uses a random dot motion (RDM) decision‑making task. On each trial, participants view a random dot kinematogram in which a proportion of dots move coherently upward or downward. While the dots are displayed, participants indicate whether the overall motion is ‘up’ or ‘down’ by pressing a corresponding key. The dots remain on the screen until the participant responds, for at most 2 seconds; trials without a response within this window are classified as no‑response trials. After indicating their response, participants view a post‑decisional fixation cross. The duration of this interval is the main experimental manipulation and is randomly drawn from a uniform distribution ranging from 2 to 5 seconds (\(PostD_i \sim \text{U}(2, 5)\)). After the interval, participants provide a metacognitive evaluation of their decision. They are asked, ‘How confident were you in your decision?’ and respond using a visual analogue scale ranging from ‘certainly wrong’ to ‘certainly right.’ Participants navigate the scale using the arrow keys and confirm their response with a key press. If no response is made within 3 seconds, the trial is classified as a no‑response trial. Before the start of each new trial, an inter‑trial interval consisting of a fixation cross is presented. The duration of this interval is drawn from a log‑normal distribution with a mean of 1 second and a standard deviation of 0.1 (\(ITI_i \sim \text{Lognormal}(1, 0.1)\)).
In total, participants will complete 288 trials, spread over 8 blocks of 36 trials. Pilot data indicated that this trial count was sufficient for good subject-level parameter estimation. Before the main experiment begins, participants receive instructions on how to perform the decision task. They then complete six easy, non‑speeded practice trials without confidence ratings, followed by 120 more difficult practice trials. Afterward, participants receive additional instructions about the confidence scale and complete 10 easy, non‑speeded practice trials that include confidence ratings, after which the main experiment begins. The key variables of interest are response choice, accuracy, response time, confidence rating, and the response time associated with the confidence rating.
To ensure that task difficulty is calibrated at the individual level, the 120 harder practice trials serve as a staircase procedure. In a random dot kinematogram, task difficulty is determined by several stimulus properties that can be broadly grouped into two categories: stimulus parameters and stimulus noisiness. To ensure that difficulty is optimally calibrated for each participant, we use the first 40 of the harder practice trials to staircase the dotLife parameter of the DotStim class in PsychoPy. This parameter determines how many frames a dot persists before disappearing. Once an appropriate dot‑life value is identified, we begin staircasing coherence. Coherence controls the stimulus signal-to-noise ratio and is the most commonly manipulated difficulty parameter in the literature. It ranges from 0 to 1 and determines the proportion of signal to noise dots, with higher values indicating a greater proportion of signal dots.
The remaining 80 trials are used to estimate the slope and threshold of the participant’s psychometric function on coherence. Both staircases are implemented in PsychoPy using the QUEST+ Bayesian adaptive procedure, which selects each stimulus to minimize posterior entropy over the psychometric parameters (Watson 2017). Using the estimated psychometric function, we determine the coherence levels for the main experiment based on a set of predefined target accuracies (\(ACC = [0.51, 0.6, 0.7, 0.8, 0.99]\)). The corresponding coherence values are then sampled in a quasi‑normal fashion: the extreme accuracy levels (0.51 and 0.99) appear once per direction, the intermediate levels appear twice, and the 0.70 level appears three times. In each block of the main experiment, trials are presented in a pseudo‑randomized sequence of upward and downward motion such that no more than three consecutive trials share the same direction. This reduces the risk of implicit directional bias. Together, these procedures ensure that participants use the full confidence scale and that difficulty is balanced across upward and downward motion directions.
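As an illustration of how the coherence levels could be derived from the estimated psychometric function, assuming a cumulative-normal form for a 2AFC task (the actual form follows the QUEST+ configuration; the threshold, slope, and lapse values below are hypothetical participant estimates):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical participant estimates from the QUEST+ stage.
threshold, slope, lapse = 0.3, 10.0, 0.005

def p_correct(coherence):
    # Accuracy rises from 0.5 (chance) to 1 - lapse with coherence.
    return 0.5 + (0.5 - lapse) * norm.cdf(slope * (coherence - threshold))

def coherence_for(accuracy):
    # Invert the psychometric function at a target accuracy.
    c = threshold + norm.ppf((accuracy - 0.5) / (0.5 - lapse)) / slope
    return float(np.clip(c, 0.0, 1.0))

targets = [0.51, 0.6, 0.7, 0.8, 0.99]
coherences = [coherence_for(a) for a in targets]
```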
3.4 Exclusion criteria
Because the task is a two‑alternative forced‑choice decision, chance performance is 50%. For each participant, we will compute overall accuracy across all main‑task trials and will exclude all participants with an accuracy lower than 60%. For our trial count, accuracy below this threshold is not reliably above chance under a binomial model of guessing. Furthermore, given the stimulus calibration procedure, an optimal observer is expected to achieve approximately 71% accuracy. Accuracy below 60% is therefore inconsistent with meaningful task engagement.
Additionally, trials with a response time below 150 ms will be coded as no-response trials and excluded from further analysis, as responses this fast are unlikely to reflect genuine perceptual processing. Participants with a no-response rate greater than 30% (combining missed decisions, missed confidence responses, and responses faster than 150 ms) will be excluded, as a high proportion of missed responses also indicates insufficient engagement with the task.
3.5 Physiological measures
As an additional data stream, we also collect physiological measures from each participant. Specifically, we record eye tracking (including pupil size) using an EyeLink 1000, and we collect electrocardiogram (ECG) and respiration measures using Biopac equipment (BIOPAC Systems, Inc., n.d.). To ensure accurate eye-tracking calibration, the experimental script performs recalibration after each block, which also provides participants with regular breaks.
These physiological measures will allow us to test the theoretical assumptions of our model of confidence ratings using physiological data. In particular, we will investigate whether autonomic responses reflect the agent’s appraisal of their own action. We expect an initial physiological response following stimulus presentation, corresponding to the orienting response (Gabay et al. 2011). Crucially, we predict a second physiological response shortly after the initial decision. This response is expected to unfold within a time window of approximately 500–2500 ms after the initial response, corresponding to the period during which the action is being appraised (the range is wide because autonomic signals are generally sluggish). Importantly, this physiological response is expected to occur before the explicit presentation of the confidence rating, which would indicate that the appraisal process and the computation of confidence begin prior to the overt confidence response. Additionally, we expect the initial response to the stimulus to be conditional on stimulus intensity, such that higher coherence leads to larger amplitudes of the physiological responses. We further expect the second physiological response, the appraisal of the action, to be conditional on the internal prediction error between the expected evidence distribution and the observed action, quantified by \(S_i\). This quantity is largest for errors at high stimulus intensity and decreases toward threshold, while for correct responses it is largest near threshold and decays to zero at high stimulus intensity. These predictions are directly derivable from the model and generate two dissociable physiological signatures: a stimulus-locked response scaling with \(|X_D|\) that should be independent of the action and choice accuracy, and an action-locked response scaling with \(S_i\) that diverges by accuracy as stimulus intensity increases.
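The predicted qualitative pattern can be sketched with a stand-in definition of the prediction error, namely the discrepancy between the executed action and its expected probability under the evidence distribution (the preregistered \(S_i\) follows the model derivation):

```python
import numpy as np
from scipy.stats import norm

# Qualitative sketch of the two predicted signatures, using a stand-in
# definition of the prediction error (the preregistered S_i follows the
# model derivation).
d = np.linspace(0.0, 3.0, 7)    # stimulus intensity (scaled evidence strength)
p = norm.cdf(d)                 # probability that evidence favours the correct action

S_correct = 1 - p   # surprise about a correct action: largest near threshold (d = 0)
S_error = p         # surprise about an error: grows with stimulus intensity
stim_locked = d     # stimulus-locked signature: scales with intensity, accuracy-independent
```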
3.5.1 Physiological analyses
Physiological-Hypothesis 1: Autonomic responses, particularly pupil size, will show an initial response to the stimulus, but also a second response after the decision, reflecting action appraisal.
Physiological-Hypothesis 2: The autonomic response to the stimulus will scale with stimulus intensity, whereas the second, post-decisional response will scale with the internal prediction error \(S_i\), which predicts the greatest response for incorrect choices at high stimulus intensity, an intermediate response near threshold regardless of accuracy, and the smallest response for correct choices at high stimulus intensity.
As the analysis plans for these two hypotheses are not fully developed, and do not depend on the sequential sampling, we designate them as exploratory.