The efficacy of app‐supported smartphone interventions for mental health problems: a meta‐analysis of randomized controlled trials — DoOperator Research

Authors	Jake Linardon, Pim Cuijpers, Per Carlbring, Mariel Messer, Matthew Fuller‐Tyszkiewicz
Journal	World Psychiatry
Year	2019
DOI	10.1002/wps.20673
Citations	813

TL;DR

App-based mental health interventions produce small-to-moderate improvements in depression, anxiety, stress, and quality of life compared to doing nothing or receiving minimal support. For depression, the pooled effect corresponds to roughly 1.5 points on the PHQ-9, about one-third of the improvement typically seen with face-to-face CBT. Apps did not outperform face-to-face therapy or computerized treatment when directly compared.

What they tested

This meta-analysis examined whether smartphone apps designed to support mental health (e.g., mood tracking, cognitive behavioral therapy exercises, mindfulness training) actually work when tested in randomized controlled trials. The researchers compared app-based interventions against several types of control conditions: waitlist (no treatment), information-only (e.g., a PDF about mental health), or active treatments (face-to-face therapy, computerized CBT programs). They looked at ten different mental health outcomes: depressive symptoms, generalized anxiety symptoms, stress levels, quality of life, general psychiatric distress, social anxiety symptoms, positive affect, panic symptoms, post-traumatic stress symptoms, and negative affect.

The key question was not just "do apps work?" but "for which conditions, by how much, and under what circumstances?"

Who was studied

The meta-analysis aggregated data from 66 randomized controlled trials, encompassing a total of 18,467 participants. Individual studies varied widely in their inclusion criteria, but the pooled sample consisted primarily of adults (mean ages typically ranged from 18 to 45 years) recruited from community settings, university campuses, and primary care clinics. Many studies excluded people with severe mental illness (e.g., active psychosis, bipolar disorder, current suicidal ideation), current substance use disorders, or those already receiving psychotherapy. Some trials focused on specific populations (e.g., college students with elevated depression scores, adults with diagnosed generalized anxiety disorder), while others recruited from the general public with no minimum symptom threshold. Approximately 60-70% of participants across studies were female, reflecting both the higher prevalence of anxiety and depression in women and potential recruitment biases.

How they measured it

Each individual study used validated self-report questionnaires to assess outcomes. The meta-analysis extracted standardized mean differences (Hedges' g) from each study, which allows comparison across different scales. Common instruments included:

Depressive symptoms: Patient Health Questionnaire-9 (PHQ-9, 0–27 scale, higher = worse), Beck Depression Inventory-II (BDI-II, 0–63 scale), Center for Epidemiologic Studies Depression Scale (CES-D, 0–60 scale)
Generalized anxiety symptoms: Generalized Anxiety Disorder-7 (GAD-7, 0–21 scale), Beck Anxiety Inventory (BAI, 0–63 scale), State-Trait Anxiety Inventory (STAI, 20–80 scale)
Stress: Perceived Stress Scale (PSS, 0–40 scale)
Quality of life: World Health Organization Quality of Life (WHOQOL-BREF), Short Form Health Survey (SF-12/SF-36)
Social anxiety: Liebowitz Social Anxiety Scale (LSAS, 0–144 scale), Social Phobia Inventory (SPIN, 0–68 scale)
Panic symptoms: Panic Disorder Severity Scale (PDSS, 0–28 scale)
PTSD symptoms: PTSD Checklist (PCL-5, 0–80 scale)
Positive/negative affect: Positive and Negative Affect Schedule (PANAS, 10–50 per subscale)

All measures were self-report, meaning participants filled out questionnaires on their phones or computers — no clinician interviews, physiological measures, or behavioral observations were included.
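The standardized mean difference described above can be sketched in a few lines. This is a minimal illustration of the usual textbook formula for Hedges' g, not the authors' own code; the function name and the PHQ-9 numbers are hypothetical, and the variance uses a common large-sample approximation.

```python
import math

def hedges_g(mean_ctrl, sd_ctrl, n_ctrl, mean_app, sd_app, n_app):
    """Return (g, variance) for a symptom scale where lower scores are better.

    Positive g means the app group scored lower (better) than control.
    """
    df = n_ctrl + n_app - 2
    # Pooled standard deviation across the two arms
    sd_pooled = math.sqrt(
        ((n_ctrl - 1) * sd_ctrl**2 + (n_app - 1) * sd_app**2) / df
    )
    d = (mean_ctrl - mean_app) / sd_pooled   # Cohen's d
    j = 1 - 3 / (4 * df - 1)                 # small-sample correction factor
    g = j * d
    # Large-sample approximation to the variance of d, scaled by j^2
    var_d = (n_ctrl + n_app) / (n_ctrl * n_app) + d**2 / (2 * (n_ctrl + n_app))
    return g, j**2 * var_d

# Hypothetical PHQ-9 post-test data (not taken from any included trial)
g, var_g = hedges_g(mean_ctrl=11.0, sd_ctrl=5.5, n_ctrl=60,
                    mean_app=9.5, sd_app=5.5, n_app=60)
print(f"g = {g:.3f}, SE = {math.sqrt(var_g):.3f}")
```

Because g is expressed in standard-deviation units, a trial that used the PHQ-9 and one that used the BDI-II can be placed on the same axis and pooled.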

Methodology

Design: This is a meta-analysis of randomized controlled trials (RCTs). A meta-analysis statistically combines results from multiple independent studies to produce a single pooled estimate of effect size. The researchers systematically searched databases (PubMed, PsycINFO, Cochrane Central Register of Controlled Trials) through January 2019, screened 1,567 records, and ultimately included 66 RCTs that met strict inclusion criteria: (a) the intervention was delivered primarily via a smartphone app, (b) the study was a randomized controlled trial, (c) participants were adults (18+), (d) the study measured a mental health outcome, and (e) sufficient data were reported to calculate an effect size.

Randomization: All included studies randomly assigned participants to either the app intervention or a control condition. Randomization ensures that, on average, the two groups are comparable at baseline on measured and unmeasured confounders (e.g., motivation, symptom severity, age). This is the gold standard for causal inference.

Blinding: Blinding in app studies is challenging. Participants obviously know whether they are using an app or not. Some studies attempted to blind outcome assessors (i.e., the person administering follow-up questionnaires did not know group assignment), but because all outcomes were self-report, the participant was effectively the assessor. This means participant expectations could influence results — people who volunteer for an app study likely believe apps can help, which may inflate apparent benefits. The authors coded each study for "risk of bias" using the Cochrane tool, which assesses sequence generation, allocation concealment, blinding, incomplete outcome data, and selective reporting.

Control conditions: The meta-analysis distinguished between three types of controls: (1) waitlist/no treatment (participants received nothing and were told they would get the app later), (2) information-only/attention control (participants received a non-therapeutic app, a PDF, or minimal support), and (3) active comparator (face-to-face therapy, computerized CBT, or another established treatment). This distinction matters because comparing an app to waitlist inflates effect sizes (any attention helps), while comparing to active treatment provides a much stricter test.

Duration: Individual study durations ranged from 2 weeks to 6 months, with most falling between 4 and 12 weeks. The meta-analysis did not separately analyze effects by duration, which is a limitation — a 2-week app trial may capture novelty effects rather than genuine clinical improvement.

Statistical approach: The researchers used random-effects meta-analysis, which assumes that the true effect size varies across studies (due to differences in populations, apps, durations, etc.) and estimates both the average effect and the degree of heterogeneity. They calculated Hedges' g, a standardized mean difference corrected for small sample bias. They also conducted moderator analyses (meta-regression) to test whether certain study features — type of app (CBT-based vs. other), presence of human guidance, frequency of reminders, risk of bias rating, type of control condition — predicted larger or smaller effects. Publication bias was assessed using funnel plots and Egger's test.
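As a rough sketch of what random-effects pooling involves, the example below uses the DerSimonian-Laird estimator, one common choice. The summary does not say which estimator or software the authors used, so treat the estimator, the function name, and the per-trial numbers as assumptions for illustration only.

```python
import math

def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooling: pooled g, 95% CI, tau^2, and I^2 (%)."""
    k = len(effects)
    w = [1 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance
    # I^2: share of total variation attributed to between-study differences
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * gi for wi, gi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {
        "g": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical per-trial effects (Hedges' g) and their variances
effects = [0.10, 0.25, 0.30, 0.45, 0.60]
variances = [0.02, 0.03, 0.015, 0.04, 0.05]
result = random_effects_pool(effects, variances)
print(result)
```

The added tau^2 term flattens the weights, so large trials dominate less than they would under a fixed-effect model. That is why a random-effects confidence interval is usually wider when heterogeneity is present.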

What this design can and cannot prove: A meta-analysis of RCTs provides the strongest evidence for causal effects — if the individual trials are well-conducted, the pooled estimate reflects the average causal impact of app interventions on mental health outcomes. However, this design cannot tell you which specific app works best for which person, because it averages across many different apps, populations, and protocols. It also cannot rule out that the effects are driven by "common factors" (e.g., attention, expectation, daily self-monitoring) rather than the specific therapeutic content of the apps. Furthermore, because all outcomes are self-report, the meta-analysis cannot distinguish between genuine symptom reduction and changes in how people report their symptoms (e.g., apps may teach people to label emotions differently without actually changing their emotional experience).

Major methodological weaknesses flagged by the authors: (1) High risk of bias in many individual studies — only 22 of 66 trials had adequate blinding of outcome assessment. (2) Substantial heterogeneity (I² values often >60%), meaning the effects varied widely across studies, so the average may not apply to any particular app or population. (3) Small number of studies for some outcomes (e.g., only 3 trials for panic, 4 for PTSD, 5 for negative affect), making those estimates unreliable. (4) Most studies used waitlist or minimal control conditions, which overestimates real-world utility. (5) Industry funding was not systematically reported, but many app studies are funded by app developers with a vested interest in positive results.

Key findings

Primary outcomes (statistically significant effects):

Depressive symptoms: g = 0.28 (95% CI: 0.20 to 0.36, p < 0.001), based on 54 studies. This is a small effect by conventional benchmarks (0.2 = small, 0.5 = medium, 0.8 = large). Heterogeneity was moderate (I² = 58%).
Generalized anxiety symptoms: g = 0.30 (95% CI: 0.20 to 0.40, p < 0.001), based on 39 studies. Small effect, moderate heterogeneity (I² = 55%).
Stress levels: g = 0.35 (95% CI: 0.22 to 0.48, p < 0.001), based on 27 studies. Small-to-medium effect, moderate heterogeneity (I² = 52%).
Quality of life: g = 0.35 (95% CI: 0.24 to 0.46, p < 0.001), based on 43 studies. Small-to-medium effect, moderate heterogeneity (I² = 49%).
General psychiatric distress: g = 0.40 (95% CI: 0.18 to 0.62, p = 0.001), based on 12 studies. Small-to-medium effect, low-to-moderate heterogeneity (I² = 38%).
Social anxiety symptoms: g = 0.58 (95% CI: 0.26 to 0.90, p < 0.001), based on 6 studies. Medium-to-large effect, but based on very few studies; heterogeneity not reported due to small k.
Positive affect: g = 0.44 (95% CI: 0.14 to 0.74, p = 0.004), based on 6 studies. Small-to-medium effect, again based on few studies.

Non-significant outcomes:

Panic symptoms: g = -0.05 (95% CI: -0.35 to 0.25, p = 0.74), based on 3 studies. Essentially zero effect.
Post-traumatic stress symptoms: g = 0.18 (95% CI: -0.06 to 0.42, p = 0.14), based on 4 studies. Small, non-significant effect.
Negative affect: g = -0.08 (95% CI: -0.29 to 0.13, p = 0.46), based on 5 studies. Essentially zero effect.
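One way to sanity-check figures like these is to recover the standard error from the 95% confidence interval and recompute an approximate two-sided p-value. The sketch below assumes a symmetric interval and a normal approximation; the helper name is hypothetical, and the inputs are the values quoted in this summary.

```python
import math

def p_from_ci(g, lower, upper):
    """Approximate SE, z, and two-sided p from an estimate and its 95% CI."""
    se = (upper - lower) / (2 * 1.96)        # CI half-width divided by 1.96
    z = g / se
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail area
    return se, z, p

# Values as reported in this summary
reported = {
    "depression": (0.28, 0.20, 0.36),
    "panic": (-0.05, -0.35, 0.25),
    "PTSD": (0.18, -0.06, 0.42),
    "negative affect": (-0.08, -0.29, 0.13),
}
for outcome, (g, lo, hi) in reported.items():
    se, z, p = p_from_ci(g, lo, hi)
    print(f"{outcome}: SE = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

An interval that spans zero always yields p above 0.05 under this approximation, which is why the three outcomes listed as non-significant are the ones whose intervals cross zero.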

Moderator analyses (what made apps work better):

CBT-based apps produced significantly larger effects than non-CBT apps (e.g., mindfulness-only, mood tracking without therapeutic content) for depression (β = 0.15, p = 0.03) and anxiety (β = 0.18, p = 0.02).
Human guidance (e.g., weekly check-ins with a therapist, coach, or researcher) significantly enhanced effects for depression (β = 0.20, p = 0.01) and anxiety (β = 0.22, p = 0.008). Apps with no human contact produced smaller, though still significant, effects.
Reminders to engage (push notifications, emails, SMS) were associated with larger effects for depression (β = 0.14, p = 0.04).
Type of control condition mattered: effects were larger when compared to waitlist (g ≈ 0.35–0.45) than when compared to active controls (g ≈ 0.10–0.20), though the difference was not always statistically significant.
Risk of bias did not significantly moderate effects, suggesting that lower-quality studies did not systematically inflate results.
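The moderator results above come from meta-regression. A simplified single-moderator version can be sketched as weighted least squares, where each trial is weighted by the inverse of its variance plus the between-study variance. The function name, the trial data, and the tau2 value are hypothetical, and the authors' actual model may differ. Reading β as the difference in g associated with the moderator is an interpretation, not something the summary states.

```python
def meta_regression_one_moderator(x, y, variances, tau2=0.0):
    """Weighted least squares with one moderator. Returns (intercept, beta).

    x: moderator value per trial (e.g., 1 = guided, 0 = unguided)
    y: effect size (Hedges' g) per trial
    Weights are 1 / (variance + tau2).
    """
    w = [1 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta

# Hypothetical trials: guided = 1 if the app came with human support
guided = [0, 0, 0, 1, 1, 1]
effects = [0.15, 0.20, 0.25, 0.35, 0.40, 0.45]
variances = [0.02] * 6
intercept, beta = meta_regression_one_moderator(guided, effects, variances, tau2=0.01)
print(f"unguided g = {intercept:.2f}, added effect of guidance = {beta:.2f}")
```

With a binary moderator, the intercept is the weighted mean effect in the reference group and β is the difference between the two groups' weighted means.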

Comparison to active treatments: When apps were directly compared to face-to-face therapy or computerized CBT (13 studies or fewer per outcome), the difference was non-significant (g = -0.05 to 0.12, all p > 0.30). This does not mean apps are equivalent to therapy — the number of studies is too small to conclude equivalence, and the confidence intervals are wide enough to include meaningful differences in either direction.

Effect magnitude

To translate these numbers into plain English: a g of 0.28 for depression means that the average person in the app group scored about 0.28 standard deviations lower on depression scales than the average person in the control group. On a common scale like the PHQ-9 (range 0–27, standard deviation ~5–6 in clinical samples), this corresponds to roughly a 1.4- to 1.7-point reduction (0.28 × 5–6). That is enough to move someone sitting just above a severity cut-off to just below it, but well short of a full severity band (PHQ-9 bands are 5 points wide), and it is about one-third of the typical improvement seen in face-to-face CBT (which produces g ≈ 0.7–0.9). For anxiety (g = 0.30), on the GAD-7 (range 0–21, SD ~5), this translates to about a 1.5-point reduction: moving from, say, a score of 10 (the bottom of the moderate range) to 8.5 (the top of the mild range).

For social anxiety (g = 0.58), the effect is larger: on the Social Phobia Inventory (SPIN, range 0–68, SD ~12), this corresponds to about a 7-point reduction, which is clinically meaningful (the minimal clinically important difference for social anxiety is typically 6–10 points).
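The conversions in this section amount to multiplying g by the scale's standard deviation. A small sketch, using the approximate SDs quoted in the text (the helper name is hypothetical, and real SDs vary by sample):

```python
def g_to_points(g, scale_sd):
    """Convert a standardized effect into raw scale points."""
    return g * scale_sd

# (scale, pooled g, assumed SD) as quoted in this summary
conversions = [
    ("PHQ-9, SD 5", 0.28, 5.0),
    ("PHQ-9, SD 6", 0.28, 6.0),
    ("GAD-7, SD 5", 0.30, 5.0),
    ("SPIN, SD 12", 0.58, 12.0),
]
for label, g, sd in conversions:
    print(f"{label}: about {g_to_points(g, sd):.1f} points")
```

The result is only as good as the assumed SD: a more homogeneous sample with a smaller SD would turn the same g into fewer raw points.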

The effect on quality of life (g = 0.35) is roughly equivalent to the difference between "somewhat satisfied" and "satisfied" on a life satisfaction scale — noticeable but not transformative.

Importantly, these are average effects. Some people likely experienced much larger improvements, while others saw no benefit or even worsened. The meta-analysis cannot tell us who falls into which category.

Limitations

What the authors acknowledge:

High heterogeneity across studies, meaning the average effect may not apply to any specific app or population.
Most studies used waitlist or minimal control conditions, which inflate effect sizes relative to real-world comparisons (where people might seek other help).
Small number of studies for several outcomes (panic, PTSD, social anxiety, affect), making those estimates unreliable.
Lack of long-term follow-up data — most studies measured outcomes immediately post-intervention, so durability of effects is unknown.
Potential publication bias: funnel plot asymmetry was detected for some outcomes, suggesting that small negative studies may be missing from the literature.
Inability to examine individual-level moderators (e.g., age, gender, baseline severity, personality) because the meta-analysis used study-level aggregates.

What a critical reader would add:

Self-report bias: All outcomes were self-reported. People who volunteer for app studies may be more motivated, more tech-savvy, or more likely to report improvement due to demand characteristics. No study used clinician-rated outcomes, behavioral measures, or objective biomarkers.
No active placebo control: Most "active" control conditions were minimal (e.g., a PDF about stress management). At most 13 studies per outcome compared apps to face-to-face therapy or computerized CBT. The effects may reflect the power of daily attention and self-monitoring rather than the specific app content.
Industry funding: The authors did not systematically report funding sources, but many app studies are funded by the app developers. Industry-funded trials in other domains (e.g., pharma) tend to show larger effects.
Adherence is not measured: The meta-analysis could not account for how much participants actually used the apps. In many individual studies, dropout rates were high (30–50%), and those who dropped out were often excluded from analyses, potentially inflating effects (completer bias).
No head-to-head comparisons: The meta-analysis cannot tell you which specific app is best, because it averages across dozens of different apps with different features, content, and quality.
Generalizability: Most studies excluded people with severe mental illness, current treatment, or substance use disorders. Results may not apply to clinical populations or those with complex comorbidities.

Practical takeaways

For someone running their own n=1 experiment:

What to test:

Choose a CBT-based app (e.g., Woebot, MoodKit, Sanvello, or a research-grade app like iCBT). Avoid apps that only offer
