Objective by Design: How Video-Derived Biomarkers Cut Assessor Variance in CNS Trials

AI Summary

Video-derived biomarkers use multimodal AI to measure movement, mood, and behavior objectively in CNS trials, reducing the assessor variance that inflates placebo response and sample size. In one depression trial, site raters classified 28% of participants as placebo responders, versus 14% for centralized raters scoring the same interviews over video. Applying one consistent standard to every assessment supports a cleaner efficacy signal and a more defensible endpoint.

Key Takeaways:

Subjective rating scales introduce assessor variance that inflates placebo response and the sample size a trial needs.
In one depression trial, placebo response was 28% by site raters versus 14% by centralized raters scoring the same interviews over video.
Up to 32.5% of per-patient Phase III data comes from procedures not tied to primary or key secondary endpoints.
Videra Health’s video-based, multimodal AI applies one objective standard across raters and sites, reducing assessor variance.
The aim of objective measurement is a cleaner efficacy signal and a more defensible endpoint, not a commercial outcome.

The most carefully designed endpoint in a CNS trial still passes through a human rater, and that rater is a variable. Two trained clinicians can score the same patient differently. The same clinician can score differently on a Friday than on a Monday. None of that reflects a change in the patient, yet all of it lands in the data as if it did. In a field where drug and placebo are often separated by a few points on a rating scale, that noise is not a rounding error. It is frequently the difference between a positive trial and a failed one.

This is the quiet problem underneath a lot of CNS trial design, and it has less to do with the molecule than with how the effect is measured.

The rater is a variable, and an expensive one

The clearest illustration comes from depression research. In a controlled trial that scored the same patient interviews two ways, 28% of participants were placebo responders when scored by site raters, versus 14% when scored by centralized raters reviewing the interviews over video. Same patients, same interviews, half the placebo response. The gap was not in the people enrolled. It was in the measurement.

That pattern is consistent with what the broader literature shows about rater reliability. When interrater reliability sits at a commonly accepted level, a meaningful share of any difference between two raters’ scores reflects rating error rather than true change in the patient. Analyses of interrater reliability in depression trials have put that error fraction high enough to materially affect a study’s ability to detect a real signal. The rater, in other words, is not a neutral instrument. They are a source of variance that has to be managed like any other.

Why variance costs more than it looks

Variance is expensive in a specific, quantifiable way. The noisier the measurement, the lower the statistical power, and the larger the sample a trial needs to detect the same true effect. Larger samples mean more sites, longer timelines, and higher cost, and they raise the odds that a trial fails to separate drug from placebo not because the drug does not work, but because the measurement could not see the difference.
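
To make that relationship concrete, here is a minimal Python sketch using the standard normal-approximation sample-size formula for comparing two means. It assumes classical test theory, where reliability is the ratio of true-score variance to observed variance, so rating error inflates the observed variance by a factor of 1/reliability. The function name `n_per_arm` and every number in the usage lines (a 2-point true drug-placebo difference, a true-score SD of 6 points) are hypothetical illustrations, not figures from any trial.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sd_true, reliability, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-arm comparison of means.

    Assumption (classical test theory): reliability = true variance /
    observed variance, so observed variance = sd_true**2 / reliability.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # target power
    var_observed = sd_true ** 2 / reliability
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * var_observed / delta ** 2)


# Hypothetical inputs: same true effect, progressively noisier measurement.
for reliability in (1.0, 0.8, 0.6):
    print(reliability, n_per_arm(delta=2.0, sd_true=6.0, reliability=reliability))
```

Under these assumptions the required sample scales with 1/reliability, so a drop in reliability from 0.8 to 0.6 raises the per-arm requirement by roughly a third while the true effect stays exactly the same.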

Inflated placebo response makes this worse. When site ratings drift toward improvement that centralized review does not confirm, the placebo arm looks better than it is, and the drug has a higher bar to clear. A trial can be designed perfectly and still fail on noise. For a sponsor, that is among the most painful ways to lose a program.
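
The higher bar can be sketched with the usual normal-approximation formula for comparing two response rates. The 28% and 14% placebo response rates are the figures quoted above; the 40% drug-arm response rate, the assumption that it stays fixed while the placebo rate moves, and the function name `n_per_arm_rates` are hypothetical choices made only for illustration.

```python
import math
from statistics import NormalDist


def n_per_arm_rates(p_drug, p_placebo, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to separate two response rates."""
    if p_drug == p_placebo:
        raise ValueError("response rates must differ")
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Sum of the binomial variances of the two arms.
    variance_sum = p_drug * (1 - p_drug) + p_placebo * (1 - p_placebo)
    return math.ceil(z_total ** 2 * variance_sum / (p_drug - p_placebo) ** 2)


P_DRUG = 0.40  # hypothetical drug-arm response rate, assumed fixed

# Placebo response as scored by centralized raters vs. site raters.
for label, p_placebo in (("centralized", 0.14), ("site", 0.28)):
    print(label, n_per_arm_rates(P_DRUG, p_placebo))
```

With these illustrative inputs, the scenario with the higher placebo rate needs several times as many participants per arm, even though the assumed drug response is unchanged.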

Objective by design

The way to shrink that variance is to stop relying on a subjective judgment that varies by person and day, and to measure the same thing the same way every time.

Video-based, multimodal AI does this by deriving measures directly from facial expression, vocal prosody, speech patterns, and behavior, the signals that underlie the movement, mood, and behavioral endpoints common in CNS trials. The same model applies the same standard to every assessment, and because the source is a recorded, structured capture, the measure can be reviewed and re-reviewed against that standard rather than reconstructed from a rater’s memory. That turns the rater from a variable into something much closer to a constant.

This is the logic behind the centralized-video finding taken one step further. Centralized review reduced variance by replacing many site raters with a consistent few. Objective, video-derived measurement reduces it further by replacing subjective scoring with reproducible scoring.

More signal, not more procedures

The reflex when a trial needs cleaner data is to add raters, add training, or add procedures. The evidence suggests that reflex is misplaced. A TransCelerate and Tufts analysis found that up to 32.5% of the per-patient data collected in Phase III trials comes from procedures not directly supporting the primary or key secondary endpoints. Piling on tends to add burden faster than insight.

Objective video capture works the other way. It draws richer, more consistent signal from the assessment that already happens, without expanding the procedure list at the site. The data gets cleaner while the burden on participants and coordinators does not grow.

What it changes for the sponsor

The payoff is not abstract. Lower assessor variance means a cleaner efficacy signal, which means a trial can detect a true effect with a tighter sample rather than buying its way out of noise with more patients and more time. It also means an endpoint that reviewers can interrogate, because the measure is consistent and the underlying capture can be examined, rather than taken on faith that every rater scored the same way.

Throughout, Videra Health’s role is as a neutral, evidence-generation partner. The aim is objective, defensible clinical assessment, not any commercial outcome.

The path forward

CNS measurement has lived with rater variance for decades, in part because there was no practical alternative to a trained human with a scale. There is one now. Treating measurement as something to engineer, rather than something to tolerate, is how sponsors give a real drug effect its best chance to show up in the data. For a closer look at how validated, video-based measures are built and applied, see Videra Health’s case study on building disease-specific predictive algorithms.
