| Authors | Jake Linardon, Pim Cuijpers, Per Carlbring, Mariel Messer, Matthew Fuller‐Tyszkiewicz |
| Journal | World Psychiatry |
| Year | 2019 |
| DOI | 10.1002/wps.20673 |
| Citations | 813 |
TL;DR
App-based mental health interventions produce small-to-moderate improvements in depression, anxiety, stress, and quality of life compared to doing nothing or receiving minimal support. The pooled effects for depression and anxiety (g ≈ 0.3) correspond to roughly 1.5 points on common symptom scales such as the PHQ-9 and GAD-7. When compared directly with face-to-face therapy or computerized treatment, apps did not perform significantly differently, although too few trials exist to conclude equivalence.
This meta-analysis examined whether smartphone apps designed to support mental health (e.g., mood tracking, cognitive behavioral therapy exercises, mindfulness training) actually work when tested in randomized controlled trials. The researchers compared app-based interventions against several types of control conditions: waitlist (no treatment), information-only (e.g., a PDF about mental health), or active treatments (face-to-face therapy, computerized CBT programs). They looked at ten different mental health outcomes: depressive symptoms, generalized anxiety symptoms, stress levels, quality of life, general psychiatric distress, social anxiety symptoms, positive affect, panic symptoms, post-traumatic stress symptoms, and negative affect.
The key question was not just "do apps work?" but "for which conditions, by how much, and under what circumstances?"
The meta-analysis aggregated data from 66 randomized controlled trials, encompassing a total of 18,467 participants. Individual studies varied widely in their inclusion criteria, but the pooled sample consisted primarily of adults (mean ages typically ranged from 18 to 45 years) recruited from community settings, university campuses, and primary care clinics. Many studies excluded people with severe mental illness (e.g., active psychosis, bipolar disorder, current suicidal ideation), current substance use disorders, or those already receiving psychotherapy. Some trials focused on specific populations (e.g., college students with elevated depression scores, adults with diagnosed generalized anxiety disorder), while others recruited from the general public with no minimum symptom threshold. Approximately 60-70% of participants across studies were female, reflecting both the higher prevalence of anxiety and depression in women and potential recruitment biases.
Each individual study used validated self-report questionnaires to assess outcomes. The meta-analysis extracted standardized mean differences (Hedges' g) from each study, which allows comparison across different scales. Common instruments included:
All measures were self-report, meaning participants filled out questionnaires on their phones or computers — no clinician interviews, physiological measures, or behavioral observations were included.
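As a rough illustration of how a standardized mean difference such as Hedges' g can be computed from one trial's post-treatment summary statistics, here is a minimal Python sketch. The function name `hedges_g`, the sign convention (control minus app, so positive values favor the app), and the example numbers are assumptions made for illustration, not values from the paper. The variance formula is the common large-sample approximation.

```python
import math


def hedges_g(mean_ctrl, sd_ctrl, n_ctrl, mean_app, sd_app, n_app):
    """Return (g, variance of g) for symptom scores where lower is better.

    Positive g means the app group ended with fewer symptoms than control.
    """
    df = n_ctrl + n_app - 2
    # Pooled standard deviation across the two arms.
    pooled_sd = math.sqrt(
        ((n_ctrl - 1) * sd_ctrl ** 2 + (n_app - 1) * sd_app ** 2) / df
    )
    d = (mean_ctrl - mean_app) / pooled_sd  # Cohen's d
    j = 1 - 3 / (4 * df - 1)                # small-sample correction factor
    g = j * d
    # Large-sample approximation to the variance of d, scaled by j^2.
    var_d = (n_ctrl + n_app) / (n_ctrl * n_app) + d ** 2 / (2 * (n_ctrl + n_app))
    return g, j ** 2 * var_d


# Hypothetical trial: 50 participants per arm, SD of 5 in both arms.
g, var_g = hedges_g(mean_ctrl=12.0, sd_ctrl=5.0, n_ctrl=50,
                    mean_app=10.5, sd_app=5.0, n_app=50)
print(f"g = {g:.3f}, SE = {math.sqrt(var_g):.3f}")
```

Because g is expressed in standard-deviation units, the same calculation works whether a trial used the PHQ-9, the GAD-7, or another questionnaire, which is what makes pooling across scales possible.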
Design: This is a meta-analysis of randomized controlled trials (RCTs). A meta-analysis statistically combines results from multiple independent studies to produce a single pooled estimate of effect size. The researchers systematically searched databases (PubMed, PsycINFO, Cochrane Central Register of Controlled Trials) through January 2019, screened 1,567 records, and ultimately included 66 RCTs that met strict inclusion criteria: (a) the intervention was delivered primarily via a smartphone app, (b) the study was a randomized controlled trial, (c) participants were adults (18+), (d) the study measured a mental health outcome, and (e) sufficient data were reported to calculate an effect size.
Randomization: All included studies randomly assigned participants to either the app intervention or a control condition. Randomization ensures that, on average, the two groups are comparable at baseline on measured and unmeasured confounders (e.g., motivation, symptom severity, age). This is the gold standard for causal inference.
Blinding: Blinding in app studies is challenging. Participants obviously know whether they are using an app or not. Some studies attempted to blind outcome assessors (i.e., the person administering follow-up questionnaires did not know group assignment), but because all outcomes were self-report, the participant was effectively the assessor. This means participant expectations could influence results — people who volunteer for an app study likely believe apps can help, which may inflate apparent benefits. The authors coded each study for "risk of bias" using the Cochrane tool, which assesses sequence generation, allocation concealment, blinding, incomplete outcome data, and selective reporting.
Control conditions: The meta-analysis distinguished between three types of controls: (1) waitlist/no treatment (participants received nothing and were told they would get the app later), (2) information-only/attention control (participants received a non-therapeutic app, a PDF, or minimal support), and (3) active comparator (face-to-face therapy, computerized CBT, or another established treatment). This distinction matters because comparing an app to waitlist inflates effect sizes (any attention helps), while comparing to active treatment provides a much stricter test.
Duration: Individual study durations ranged from 2 weeks to 6 months, with most falling between 4 and 12 weeks. The meta-analysis did not separately analyze effects by duration, which is a limitation — a 2-week app trial may capture novelty effects rather than genuine clinical improvement.
Statistical approach: The researchers used random-effects meta-analysis, which assumes that the true effect size varies across studies (due to differences in populations, apps, durations, etc.) and estimates both the average effect and the degree of heterogeneity. They calculated Hedges' g, a standardized mean difference corrected for small sample bias. They also conducted moderator analyses (meta-regression) to test whether certain study features — type of app (CBT-based vs. other), presence of human guidance, frequency of reminders, risk of bias rating, type of control condition — predicted larger or smaller effects. Publication bias was assessed using funnel plots and Egger's test.
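To make the random-effects idea concrete, the sketch below pools per-study effects and reports heterogeneity. The summary above does not say which between-study variance estimator the authors used, so this assumes the widely used DerSimonian-Laird method. The function name `random_effects_pool` and the input numbers are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study effects with DerSimonian-Laird random-effects weights.

    Assumption: DerSimonian-Laird is used here for illustration only.
    """
    k = len(effects)
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0    # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # heterogeneity in %
    # Random-effects weights add tau2 to each study's own variance.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "pooled": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
        "q": q,
    }


# Hypothetical Hedges' g values and variances from five trials.
result = random_effects_pool(
    effects=[0.10, 0.45, 0.25, 0.60, 0.05],
    variances=[0.02, 0.03, 0.015, 0.05, 0.01],
)
print(result)
```

When I² is high, the pooled value is an average over genuinely different effects, which is why the heterogeneity figure matters as much as the point estimate.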
What this design can and cannot prove: A meta-analysis of RCTs provides the strongest evidence for causal effects — if the individual trials are well-conducted, the pooled estimate reflects the average causal impact of app interventions on mental health outcomes. However, this design cannot tell you which specific app works best for which person, because it averages across many different apps, populations, and protocols. It also cannot rule out that the effects are driven by "common factors" (e.g., attention, expectation, daily self-monitoring) rather than the specific therapeutic content of the apps. Furthermore, because all outcomes are self-report, the meta-analysis cannot distinguish between genuine symptom reduction and changes in how people report their symptoms (e.g., apps may teach people to label emotions differently without actually changing their emotional experience).
Major methodological weaknesses flagged by the authors: (1) High risk of bias in many individual studies — only 22 of 66 trials had adequate blinding of outcome assessment. (2) Substantial heterogeneity (I² values often >60%), meaning the effects varied widely across studies, so the average may not apply to any particular app or population. (3) Small number of studies for some outcomes (e.g., only 3 trials for panic, 4 for PTSD, 5 for negative affect), making those estimates unreliable. (4) Most studies used waitlist or minimal control conditions, which overestimates real-world utility. (5) Industry funding was not systematically reported, but many app studies are funded by app developers with a vested interest in positive results.
Primary outcomes (statistically significant effects):
Non-significant outcomes:
Moderator analyses (what made apps work better):
Comparison to active treatments: When apps were directly compared to face-to-face therapy or computerized CBT (13 studies or fewer per outcome), the difference was non-significant (g = -0.05 to 0.12, all p > 0.30). This does not mean apps are equivalent to therapy — the number of studies is too small to conclude equivalence, and the confidence intervals are wide enough to include meaningful differences in either direction.
To translate these numbers into plain English: a g of 0.28 for depression means that the average person in the app group scored about 0.28 standard deviations lower on depression scales than the average person in the control group. On a common scale like the PHQ-9 (range 0–27, standard deviation ~5–6 in clinical samples), this corresponds to roughly a 1.4- to 1.7-point reduction (0.28 × 5 to 0.28 × 6). That is enough to move someone sitting just above a severity cutoff from "moderate" into "mild," but it is well short of the 5-point width of a PHQ-9 severity band, and it is about one-third of the typical improvement seen in face-to-face CBT (which produces g ≈ 0.7–0.9). For anxiety (g = 0.30), on the GAD-7 (range 0–21, SD ~5), this translates to about a 1.5-point reduction: moving from, say, a score of 10 (moderate anxiety) to 8.5 (mild anxiety).
For social anxiety (g = 0.58), the effect is larger: on the Social Phobia Inventory (SPIN, range 0–68, SD ~12), this corresponds to about a 7-point reduction, which is clinically meaningful (the minimal clinically important difference for social anxiety is typically 6–10 points).
The effect on quality of life (g = 0.35) is roughly equivalent to the difference between "somewhat satisfied" and "satisfied" on a life satisfaction scale — noticeable but not transformative.
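The scale conversions above all follow the same arithmetic: multiply g by an assumed standard deviation for the questionnaire. A minimal sketch, using the g values and approximate SDs quoted in the text (the helper name `g_to_points` is hypothetical, and the SDs are rough figures rather than values reported by the meta-analysis):

```python
def g_to_points(g, scale_sd):
    """Approximate raw-score difference implied by a standardized effect."""
    return g * scale_sd


# (label, g, assumed SD) as quoted in the surrounding text; SDs are approximate.
conversions = [
    ("PHQ-9, SD 5", 0.28, 5.0),
    ("PHQ-9, SD 6", 0.28, 6.0),
    ("GAD-7, SD 5", 0.30, 5.0),
    ("SPIN, SD 12", 0.58, 12.0),
]

for label, g, sd in conversions:
    print(f"{label}: g = {g:.2f} -> about {g_to_points(g, sd):.1f} points")
```

The result is only as good as the assumed SD: a sample with a narrower spread of scores would turn the same g into a smaller raw-point difference.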
Importantly, these are average effects. Some people likely experienced much larger improvements, while others saw no benefit or even worsened. The meta-analysis cannot tell us who falls into which category.
What the authors acknowledge:
What a critical reader would add:
For someone running their own n=1 experiment:
What to test: