
A survey on large language model based autonomous agents

Systematic Review · Wiki · time_management · High confidence
Authors: Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
Journal: Frontiers of Computer Science
Year: 2024
DOI: 10.1007/s11704-024-40231-1
Citations: 1,039

TL;DR

This systematic review of over 200 papers on LLM-based autonomous agents found that agents built with a unified architecture (profiling, memory, planning, and action modules) can perform complex tasks across social science, natural science, and engineering. Their reliability, safety, and generalisability remain unproven, and no single benchmark or standardised evaluation framework has yet been established.

What they tested

This is a systematic review, not an experiment. The authors tested no single intervention. Instead, they:

  • Analysed the architectural designs of LLM-based autonomous agents across ~200 papers published between January 2021 and August 2023
  • Categorised agent designs into four modules: profiling, memory, planning, and action
  • Compared three methods for creating agent profiles: handcrafting, LLM-generation, and dataset alignment
  • Reviewed applications in three domains: social science (e.g., simulating human behaviour), natural science (e.g., scientific discovery), and engineering (e.g., software development)
  • Evaluated assessment strategies, distinguishing subjective (human judgement) from objective (automated metrics) approaches

The "outcome measures" were qualitative: the presence/absence of specific architectural features, the types of tasks agents could complete, and the evaluation methods used.

Who was studied

No human participants were studied. The "subjects" were:

  • ~200 published papers on LLM-based autonomous agents from January 2021 to August 2023
  • Specific agent systems reviewed in detail: Generative Agents, MetaGPT, ChatDev, RecAgent, PTLLM, and others
  • LLMs used as backbones: primarily GPT-3, GPT-4, ChatGPT, and open-source models like LLaMA

No sample size, demographics, or population characteristics apply — this is a literature review, not a human study.

How they measured it

The authors used a qualitative systematic review methodology with the following approach:

  • Literature search: Papers were collected from arXiv, ACL, NeurIPS, ICML, ICLR, and other major AI venues
  • Inclusion criteria: Papers proposing or evaluating LLM-based autonomous agents published between January 2021 and August 2023
  • Taxonomy development: The authors created a unified framework (profiling → memory → planning → action) and classified each paper according to which modules it used
  • Application categorisation: Papers were grouped by domain (social science, natural science, engineering)
  • Evaluation strategy analysis: Each paper's evaluation method was classified as subjective (human raters, surveys) or objective (automated metrics, task completion rates)

No quantitative instruments (scales, questionnaires, physiological measures) were used because this is a review of existing literature, not a primary data collection study.

Methodology

Study design: Systematic review with qualitative synthesis. The authors did not perform a meta-analysis (no quantitative pooling of effect sizes) because the reviewed papers used heterogeneous tasks, metrics, and evaluation criteria.

Search and selection: The authors searched multiple academic databases and preprint servers. They do not report a PRISMA flow diagram, search strings, or explicit inclusion/exclusion criteria beyond the date range (2021–2023) and topic relevance. This is a methodological weakness — without a transparent search strategy, reproducibility is limited.

Data extraction: For each paper, the authors extracted:

  • Agent architecture (which modules were used)
  • Profile generation method (handcrafted, LLM-generated, or dataset-aligned)
  • Application domain
  • Evaluation approach (subjective vs. objective)
  • Key findings
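
The extraction fields listed above can be pictured as one record per paper. The following is a minimal sketch of such a record and a tally over it, not the authors' actual coding sheet: the class name, field names, category labels, and the two placeholder papers are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ExtractionRecord:
    """One row of a hypothetical extraction sheet (field names are illustrative)."""
    title: str
    modules: FrozenSet[str]   # subset of {"profiling", "memory", "planning", "action"}
    profile_method: str       # "handcrafted", "llm_generated", "dataset_aligned" or "none"
    domain: str               # "social_science", "natural_science" or "engineering"
    evaluation: str           # "subjective" or "objective"
    key_findings: str


def tally(records: List[ExtractionRecord], field_name: str) -> Counter:
    """Count how many records take each value of a single-valued field."""
    return Counter(getattr(r, field_name) for r in records)


def module_coverage(records: List[ExtractionRecord]) -> Counter:
    """Count how many records use each architectural module."""
    counts: Counter = Counter()
    for r in records:
        counts.update(r.modules)
    return counts


# Placeholder rows, NOT data from the review.
records = [
    ExtractionRecord("Paper A", frozenset({"profiling", "memory", "planning", "action"}),
                     "handcrafted", "social_science", "subjective", "placeholder"),
    ExtractionRecord("Paper B", frozenset({"planning", "action"}),
                     "none", "engineering", "objective", "placeholder"),
]
print(tally(records, "evaluation"))
print(module_coverage(records))
```

Keeping modules as a set per paper makes it possible to report what share of papers fit the full four-module framework, a figure the review itself does not give.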

Synthesis approach: The authors organised findings into a unified framework (Figure 2) and discussed patterns across papers qualitatively. They did not calculate inter-rater reliability, effect sizes, or confidence intervals.

What this design can prove:

  • It can identify common architectural patterns across the field
  • It can map the landscape of current research and highlight gaps
  • It can propose a standardised taxonomy for future work

What this design cannot prove:

  • It cannot establish which agent architecture is "best" — no head-to-head comparisons were performed
  • It cannot quantify effect sizes or statistical significance of any intervention
  • It cannot control for publication bias (papers with positive results are more likely to be published)
  • It cannot assess the quality or rigour of individual studies systematically (no risk-of-bias assessment is reported)

Major methodological weaknesses:

  • No pre-registered protocol
  • No systematic quality assessment of included studies (e.g., no use of ROBINS-I or similar tools)
  • No quantitative synthesis or meta-analysis
  • The search strategy is not fully reproducible
  • The review is descriptive rather than evaluative — it catalogues what exists rather than testing hypotheses

Key findings

Architecture patterns:

  • The unified framework (profiling → memory → planning → action) encompasses "most" previous work, though the authors do not report what percentage of papers fit this framework
  • Three profile generation methods were identified: handcrafting (most common), LLM-generation, and dataset alignment
  • Memory modules were classified into short-term (within-session) and long-term (cross-session) variants
  • Planning modules ranged from simple chain-of-thought prompting to hierarchical task decomposition
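
To make the four-module framework concrete, here is a minimal sketch of how profiling, memory, planning, and action might fit together in code. It is an illustration under stated assumptions, not an implementation from the paper or from any reviewed system: every class and function name is hypothetical, the "LLM" is a stub function, planning is a naive string split standing in for LLM-driven decomposition, and long-term recall is a simple keyword match.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    """Profiling module: a handcrafted role description."""
    role: str
    traits: str = ""

    def as_prompt(self) -> str:
        return f"You are {self.role}. {self.traits}".strip()


@dataclass
class Memory:
    """Memory module: a short-term window that overflows into long-term storage."""
    short_term: List[str] = field(default_factory=list)
    long_term: List[str] = field(default_factory=list)
    window: int = 3

    def remember(self, item: str) -> None:
        self.short_term.append(item)
        while len(self.short_term) > self.window:
            self.long_term.append(self.short_term.pop(0))

    def recall(self, query: str) -> List[str]:
        # Naive keyword overlap for long-term items, plus everything short-term.
        words = set(query.lower().split())
        hits = [m for m in self.long_term if words & set(m.lower().split())]
        return hits + self.short_term


def plan(task: str) -> List[str]:
    """Planning module stand-in: split the task on ' then '."""
    return [step.strip() for step in task.split(" then ") if step.strip()]


@dataclass
class Agent:
    profile: Profile
    llm: Callable[[str], str]
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    memory: Memory = field(default_factory=Memory)

    def act(self, step: str) -> str:
        """Action module: call a tool for 'name: argument' steps, else the LLM."""
        name, _, arg = step.partition(":")
        if name.strip() in self.tools and arg:
            result = self.tools[name.strip()](arg.strip())
        else:
            context = "\n".join(self.memory.recall(step))
            result = self.llm(f"{self.profile.as_prompt()}\n{context}\nStep: {step}")
        self.memory.remember(f"{step} -> {result}")
        return result

    def run(self, task: str) -> List[str]:
        return [self.act(step) for step in plan(task)]


def echo_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "LLM reply to: " + prompt.splitlines()[-1]


agent = Agent(Profile("a careful assistant"), llm=echo_llm, tools={"upper": str.upper})
results = agent.run("upper: hello then summarise the result")
print(results)
```

Swapping `echo_llm` for a real model client and `plan` for a prompted decomposition step would be the natural next move; the stub keeps the sketch runnable without network access.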

Application domains:

  • Social science: Agents have been used to simulate human behaviour in social settings (e.g., Generative Agents simulating a small town of 25 agents), study opinion dynamics, and model economic decisions
  • Natural science: Agents have been applied to scientific discovery (e.g., ChemCrow for chemistry, BioGPT for biology), though the authors note these are "early-stage"
  • Engineering: The most mature application area, with agents used for software development (MetaGPT, ChatDev), code generation, and tool use

Evaluation strategies:

  • Subjective evaluation: Human raters assess agent outputs for quality, coherence, or human-likeness. Used in ~40% of reviewed papers (estimated from figures, not explicitly stated)
  • Objective evaluation: Automated metrics (e.g., task completion rate, BLEU score, accuracy on benchmarks). Used in ~60% of papers
  • No standardised benchmark exists: Different papers use different tasks and metrics, which prevents direct cross-study comparison

Capability acquisition strategies:

  • Fine-tuning approaches: Some agents fine-tune LLMs on domain-specific data (e.g., code for programming agents)
  • Prompt-based approaches: Most agents use in-context learning (prompt engineering) without modifying model weights
  • Tool use: Many agents are equipped with external tools (e.g., web search, calculators, code interpreters) to extend capabilities

Challenges identified:

  • Reliability: LLM-based agents can hallucinate, produce inconsistent outputs, or fail on simple tasks
  • Safety: Agents acting autonomously could cause harm (e.g., generating malicious code, giving dangerous advice)
  • Generalisation: Agents trained/tested in one domain often fail in others
  • Evaluation: No consensus on how to measure agent performance

Effect magnitude

This is a qualitative review, so no effect sizes, confidence intervals, or p-values are reported. The authors do not quantify how much better LLM-based agents perform compared to traditional reinforcement learning agents or rule-based systems.

The closest thing to a quantitative finding is the growth in publications: the cumulative number of papers on LLM-based autonomous agents grew from ~5 in January 2021 to ~200 by August 2023, a ~40-fold increase in a little over 2.5 years. This is a bibliometric observation, not an experimental effect.
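
The arithmetic behind the fold increase can be checked in a few lines. The sketch below also derives an implied monthly growth rate and doubling time, which rest on an assumption the review does not make: that growth was smoothly exponential over the whole period. The paper counts are the approximate figures quoted in this summary.

```python
import math

papers_start = 5    # approx. cumulative papers, January 2021 (figure quoted in this summary)
papers_end = 200    # approx. cumulative papers, August 2023 (figure quoted in this summary)

# Whole months elapsed between January 2021 and August 2023.
months = (2023 - 2021) * 12 + (8 - 1)

fold_increase = papers_end / papers_start

# ASSUMPTION: smooth exponential growth, used only to get a rough rate.
monthly_growth = fold_increase ** (1 / months) - 1
doubling_months = math.log(2) / math.log(1 + monthly_growth)

print(f"months elapsed:      {months}")
print(f"fold increase:       {fold_increase:.0f}x")
print(f"implied growth rate: {monthly_growth:.1%} per month")
print(f"implied doubling:    {doubling_months:.1f} months")
```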

Limitations

What the authors acknowledge:

  • The field is "rapidly developing" and the review may not capture the most recent work
  • The proposed unified framework may not encompass all possible agent architectures
  • Evaluation strategies are "not yet mature"
  • The review is descriptive rather than prescriptive

What a critical reader would note:

  • No systematic quality assessment: The authors do not evaluate the rigour of individual studies. A paper evaluated on 5 test cases and one evaluated on 500 are treated equally
  • Publication bias: The field is dominated by positive results (agents that work). Failed architectures or negative results are rarely published
  • LLM dependence: Most reviewed agents use proprietary LLMs (GPT-3/4). Results may not generalise to open-source models or future model versions
  • No replication analysis: The authors do not report whether any findings have been independently replicated
  • Industry funding: Many reviewed papers come from tech companies (OpenAI, Google, Meta) with commercial interests in LLMs. The review does not discuss conflicts of interest
  • Temporal bias: The review covers only 2021–2023. Given the field's rapid pace, findings may already be outdated
  • No negative results: The review focuses on what agents can do, not what they fail at. Failures are mentioned only briefly in the challenges section
  • Lack of quantitative synthesis: Without meta-analysis, it's impossible to know which approaches are statistically superior

Practical takeaways

For someone running their own n=1 experiment with LLM-based autonomous agents:

What to test (specific intervention and dose)

  • Intervention: Build an LLM-based autonomous agent using the unified framework (profiling + memory + planning + action modules)
  • Dose: Start with a single agent performing one well-defined task (e.g., "write a Python script to scrape website X and save results to CSV"). Do not attempt multi-agent collaboration initially
  • Comparison: Compare against (a) doing the task manually, (b) using a simple LLM prompt without agent architecture, or (c) using a traditional rule-based system

Minimum meaningful duration

  • Per trial: 1–3 hours for a single task completion
  • Total experiment: At least 10–20 trials across different tasks to assess generalisability
  • Long-term: If testing memory/learning, run 5–10 sessions over 1–2 weeks, with the agent retaining information across sessions

What to measure (specific metrics)

  • Task completion rate: Did the agent complete the task? (binary: yes/no)
  • Time to completion: Minutes from start to finish
  • Error rate: Number of mistakes (e.g., syntax errors in code, incorrect outputs)
  • Number of human interventions: How many times did you need to correct or redirect the agent?
  • Output quality: Rate on a 1–5 scale (subjective, but use a rubric: accuracy, completeness, readability)
  • Hallucination count: Number of false statements or fabricated information
  • Cost: API calls made, tokens used, total cost in USD
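
One way to keep these metrics consistent across trials is to log each trial as a structured record and summarise afterwards. This is a minimal sketch under stated assumptions: the `Trial` fields simply mirror the list above, the names are hypothetical, and the three example trials are placeholder values, not results.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, List


@dataclass
class Trial:
    task: str
    completed: bool       # task completion (yes/no)
    minutes: float        # time to completion
    errors: int           # mistakes in the output
    interventions: int    # times a human corrected or redirected the agent
    quality: int          # 1-5 rubric score
    hallucinations: int   # false or fabricated statements
    cost_usd: float       # total API cost


def summarise(trials: List[Trial]) -> Dict[str, float]:
    """Aggregate per-trial metrics into experiment-level numbers."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    quality = [t.quality for t in trials]
    return {
        "n": n,
        "completion_rate": sum(t.completed for t in trials) / n,
        "mean_minutes": mean([t.minutes for t in trials]),
        "mean_errors": mean([t.errors for t in trials]),
        "mean_interventions": mean([t.interventions for t in trials]),
        "mean_quality": mean(quality),
        # Sample standard deviation needs at least two trials.
        "quality_sd": stdev(quality) if n > 1 else 0.0,
        "total_hallucinations": sum(t.hallucinations for t in trials),
        "total_cost_usd": sum(t.cost_usd for t in trials),
    }


# Placeholder trials for illustration only.
trials = [
    Trial("scrape site to CSV", True, 12.0, 0, 0, 4, 0, 0.40),
    Trial("rename files", True, 5.0, 1, 1, 5, 0, 0.10),
    Trial("parse logs", False, 30.0, 3, 2, 3, 1, 0.75),
]
summary = summarise(trials)
for name, value in summary.items():
    print(f"{name}: {value}")
```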

Key confounds to control for

  • LLM version: Pin the same dated model snapshot throughout (e.g., gpt-4-turbo-2024-04-09) rather than a floating alias. Model updates can change behaviour
  • Prompt engineering: Small changes in prompts can cause large changes in output. Document your exact prompts
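  • Temperature setting: Keep temperature constant within a comparison (start with a low value such as 0.2 for tasks that need repeatable output, 0.7 for creative tasks). A low temperature reduces variability but does not guarantee identical outputs
  • Task difficulty: Match task difficulty across conditions, and if you vary it, vary it systematically. Don't compare easy tasks in one condition to hard tasks in another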
  • Order effects: Randomise the order of tasks if comparing multiple conditions
  • Learning effects: The agent may "learn" from previous tasks if memory is enabled. Control for this by resetting the agent between conditions
  • Human bias: If you're evaluating outputs subjectively, use blinded evaluation (rate the outputs without knowing which condition produced each one)
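
Two of the controls above, randomised task order and blinded rating, are easy to automate. The sketch below shows one possible approach, assuming outputs are stored in a dictionary keyed by (task, condition); the function names, the `item-001` labelling scheme, and the task and condition names are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (task, condition)


def randomised_schedule(tasks: List[str], conditions: List[str], seed: int) -> List[Pair]:
    """Return every task/condition pair in a shuffled but reproducible order."""
    rng = random.Random(seed)
    schedule = [(task, cond) for task in tasks for cond in conditions]
    rng.shuffle(schedule)
    return schedule


def blind(outputs: Dict[Pair, str], seed: int) -> Tuple[Dict[str, str], Dict[str, Pair]]:
    """Replace (task, condition) keys with neutral codes for rating.

    Returns the blinded outputs plus a key map to un-blind after scoring.
    """
    rng = random.Random(seed)
    keys = sorted(outputs)
    rng.shuffle(keys)
    blinded: Dict[str, str] = {}
    key_map: Dict[str, Pair] = {}
    for i, key in enumerate(keys, start=1):
        code = f"item-{i:03d}"
        blinded[code] = outputs[key]
        key_map[code] = key
    return blinded, key_map


tasks = ["task-A", "task-B", "task-C"]        # placeholder task names
conditions = ["agent", "plain_prompt"]        # placeholder condition names
schedule = randomised_schedule(tasks, conditions, seed=42)

# Placeholder outputs; in practice these come from running each pair.
outputs = {pair: f"output for {pair[0]} under {pair[1]}" for pair in schedule}
blinded, key_map = blind(outputs, seed=7)
print(schedule)
print(sorted(blinded))
```

Fixing the seed makes the schedule reproducible for your notes. Keep `key_map` out of sight until all ratings are recorded.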

What a positive result would look like

  • Task completion rate: ≥80% of tasks completed without human intervention (compared to ≤50% with simple LLM prompt)
  • Time savings: Agent completes tasks in ≤25% of the time it takes you manually
  • Error reduction: Agent makes ≤1 error per task (compared to ≥3 errors when you do it manually)
  • Cost efficiency: Total API cost is less than the value of your time saved (e.g., $2 in API calls saves you 30 minutes of work)
  • Consistency: Agent produces similar-quality outputs across 10+ trials (standard deviation in quality ratings ≤0.5 on 1–5 scale)
  • Generalisation: Agent succeeds on tasks it wasn't explicitly designed for (e.g., a "code-writing agent" can also debug existing code)
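
The quantitative thresholds above can be turned into a simple pass/fail check. This sketch makes several assumptions that are not part of the list: "≤1 error per task" is read as a mean across trials, time and cost are compared on totals, the hourly value of your time is a number you supply, and generalisation is left out because it is a qualitative judgement. The function name and the example figures are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def check_thresholds(
    completed: List[bool],
    agent_minutes: List[float],
    manual_minutes: List[float],
    errors: List[int],
    quality: List[int],
    api_cost_usd: float,
    hourly_value_usd: float,
) -> Dict[str, bool]:
    """Compare one experiment's results against the thresholds listed above."""
    n = len(completed)
    lengths = {len(agent_minutes), len(manual_minutes), len(errors), len(quality)}
    if n < 2 or lengths != {n}:
        raise ValueError("need at least two trials and equal-length inputs")
    minutes_saved = sum(manual_minutes) - sum(agent_minutes)
    return {
        "completion_rate >= 80%": sum(completed) / n >= 0.80,
        "time <= 25% of manual": sum(agent_minutes) <= 0.25 * sum(manual_minutes),
        "mean errors <= 1 per task": mean(errors) <= 1,
        "API cost < value of time saved": api_cost_usd < minutes_saved / 60 * hourly_value_usd,
        "quality SD <= 0.5": stdev(quality) <= 0.5,
    }


# Placeholder numbers for ten trials, not real results.
result = check_thresholds(
    completed=[True] * 9 + [False],
    agent_minutes=[10.0] * 10,
    manual_minutes=[60.0] * 10,
    errors=[0, 1, 0, 0, 1, 0, 0, 2, 0, 0],
    quality=[4, 4, 5, 4, 4, 4, 5, 4, 4, 4],
    api_cost_usd=2.0,
    hourly_value_usd=40.0,   # assumed value of your own time
)
for criterion, passed in result.items():
    print(f"{'PASS' if passed else 'FAIL'}  {criterion}")
```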

Warning: A single successful trial does not mean the agent is reliable. Run at least 10 trials before drawing conclusions. LLM-based agents are stochastic, so the same prompt can produce different outputs each time. Track this variability.
