| Authors | Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen |
| Journal | Frontiers of Computer Science |
| Year | 2024 |
| DOI | 10.1007/s11704-024-40231-1 |
| Citations | 1,039 |
TL;DR
This systematic review of over 200 papers on LLM-based autonomous agents finds that agents built on a unified architecture (profiling, memory, planning, and action modules) can perform complex tasks across social science, natural science, and engineering. Their reliability, safety, and generalizability remain unproven, and no single benchmark or standardised evaluation framework has yet been established.
This is a systematic review, not an experiment. The authors did not test any intervention. Instead, they:
The "outcome measures" were qualitative: the presence/absence of specific architectural features, the types of tasks agents could complete, and the evaluation methods used.
No human participants were studied. The "subjects" were:
No sample size, demographics, or population characteristics apply — this is a literature review, not a human study.
The authors used a qualitative systematic review methodology with the following approach:
No quantitative instruments (scales, questionnaires, physiological measures) were used because this is a review of existing literature, not a primary data collection study.
Study design: Systematic review with qualitative synthesis. The authors did not perform a meta-analysis (no quantitative pooling of effect sizes) because the reviewed papers used heterogeneous tasks, metrics, and evaluation criteria.
Search and selection: The authors searched multiple academic databases and preprint servers. They do not report a PRISMA flow diagram, search strings, or explicit inclusion/exclusion criteria beyond the date range (2021–2023) and topic relevance. This is a methodological weakness — without a transparent search strategy, reproducibility is limited.
Data extraction: For each paper, the authors extracted:
Synthesis approach: The authors organised findings into a unified framework (Figure 2) and discussed patterns across papers qualitatively. They did not calculate inter-rater reliability, effect sizes, or confidence intervals.
What this design can prove:
What this design cannot prove:
Major methodological weaknesses:
Architecture patterns:
Application domains:
Evaluation strategies:
Capability acquisition strategies:
Challenges identified:
This is a qualitative review, so no effect sizes, confidence intervals, or p-values are reported. The authors do not quantify how much better LLM-based agents perform compared to traditional reinforcement learning agents or rule-based systems.
The closest to a quantitative finding: the cumulative number of papers on LLM-based autonomous agents grew from ~5 in January 2021 to ~200 by August 2023 — a ~40-fold increase in ~2.5 years. This is a bibliometric observation, not an experimental effect.
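The arithmetic behind that bibliometric figure can be checked with a short sketch. The start and end counts are the approximate values quoted above. The doubling time is an extra derived number that assumes smooth exponential growth, which the review itself does not claim.

```python
import math

# Approximate cumulative paper counts quoted in the summary above.
papers_start = 5      # January 2021
papers_end = 200      # August 2023

# Elapsed time between the two observations, in whole months.
months = (2023 - 2021) * 12 + (8 - 1)
years = months / 12

# Fold increase over the whole period.
fold = papers_end / papers_start

# Assumption (not from the review): growth is exponential, so the
# doubling time is elapsed time divided by the number of doublings.
doubling_months = months / math.log2(fold)

print(f"{fold:.0f}-fold increase over {years:.1f} years")
print(f"implied doubling time: {doubling_months:.1f} months")
```

Under that assumption the paper count doubles roughly every six months, which is another way of reading the same 40-fold figure. It remains a description of publication volume and says nothing about agent performance.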
What the authors acknowledge:
What a critical reader would note:
For someone running their own n=1 experiment with LLM-based autonomous agents:
Warning: A single successful trial does not mean the agent is reliable. Run at least 10 trials before drawing conclusions. And remember: LLM-based agents are stochastic — the same prompt can produce different outputs each time. Track this variability.
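One way to track that variability is to repeat the same task and report a success rate with an uncertainty interval rather than a single pass/fail. The sketch below is a minimal illustration, not a method from the review: `run_trial` is a hypothetical callable standing in for your own agent invocation, `fake_agent_trial` is a simulated stand-in, and the Wilson score interval is one common choice for small samples.

```python
import math
import random
from typing import Callable, Dict, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def evaluate_agent(run_trial: Callable[[], bool], n_trials: int = 10) -> Dict[str, object]:
    """Run the same task n_trials times and summarise the outcomes."""
    outcomes = [bool(run_trial()) for _ in range(n_trials)]
    successes = sum(outcomes)
    low, high = wilson_interval(successes, n_trials)
    return {
        "trials": n_trials,
        "successes": successes,
        "success_rate": successes / n_trials,
        "ci95": (low, high),
        "outcomes": outcomes,  # keep raw results so variability stays visible
    }


# Hypothetical stand-in for a real agent call: succeeds about 70% of the time.
_rng = random.Random(0)


def fake_agent_trial() -> bool:
    return _rng.random() < 0.7


report = evaluate_agent(fake_agent_trial, n_trials=10)
print(report)
```

With only 10 trials the interval stays wide: `wilson_interval(10, 10)` has a lower bound of about 0.72, so even ten successes out of ten are compatible with an agent that fails more than a quarter of the time. Replace `fake_agent_trial` with a function that runs your agent on a fixed task and returns whether the result met your own success criterion.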