A survey on large language model based autonomous agents — DoOperator Research

Authors	Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
Journal	Frontiers of Computer Science
Year	2024
DOI	10.1007/s11704-024-40231-1
Citations	1,039

TL;DR

This systematic review of roughly 200 papers on LLM-based autonomous agents found that agents built on a unified architecture (profiling, memory, planning, and action modules) can perform complex tasks across social science, natural science, and engineering. Their reliability, safety, and generalisability remain unproven, and no standardised benchmark or evaluation framework has yet been established.

What they tested

This is a systematic review, not an experiment, so the authors did not test an intervention. Instead, they:

Analysed the architectural designs of LLM-based autonomous agents across ~200 papers published between January 2021 and August 2023
Categorised agent designs into four modules: profiling, memory, planning, and action
Compared three methods for creating agent profiles: handcrafting, LLM-generation, and dataset alignment
Reviewed applications in three domains: social science (e.g., simulating human behaviour), natural science (e.g., scientific discovery), and engineering (e.g., software development)
Evaluated assessment strategies, distinguishing subjective (human judgement) from objective (automated metrics) approaches

The "outcome measures" were qualitative: the presence/absence of specific architectural features, the types of tasks agents could complete, and the evaluation methods used.

Who was studied

No human participants were studied. The "subjects" were:

~200 published papers on LLM-based autonomous agents from January 2021 to August 2023
Specific agent systems reviewed in detail: Generative Agents, MetaGPT, ChatDev, RecAgent, PTLLM, and others
LLMs used as backbones: primarily GPT-3, GPT-4, ChatGPT, and open-source models like LLaMA

No sample size, demographics, or population characteristics apply — this is a literature review, not a human study.

How they measured it

The authors took a qualitative systematic-review approach with the following steps:

Literature search: Papers were collected from arXiv, ACL, NeurIPS, ICML, ICLR, and other major AI venues
Inclusion criteria: Papers proposing or evaluating LLM-based autonomous agents published between January 2021 and August 2023
Taxonomy development: The authors created a unified framework (profiling → memory → planning → action) and classified each paper according to which modules it used
Application categorisation: Papers were grouped by domain (social science, natural science, engineering)
Evaluation strategy analysis: Each paper's evaluation method was classified as subjective (human raters, surveys) or objective (automated metrics, task completion rates)

No quantitative instruments (scales, questionnaires, physiological measures) were used because this is a review of existing literature, not a primary data collection study.

Methodology

Study design: Systematic review with qualitative synthesis. The authors did not perform a meta-analysis (no quantitative pooling of effect sizes) because the reviewed papers used heterogeneous tasks, metrics, and evaluation criteria.

Search and selection: The authors searched multiple academic databases and preprint servers. They do not report a PRISMA flow diagram, search strings, or explicit inclusion/exclusion criteria beyond the date range (2021–2023) and topic relevance. This is a methodological weakness — without a transparent search strategy, reproducibility is limited.

Data extraction: For each paper, the authors extracted:

Agent architecture (which modules were used)
Profile generation method (handcrafted, LLM-generated, or dataset-aligned)
Application domain
Evaluation approach (subjective vs. objective)
Key findings
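
The extraction fields listed above map naturally onto a small record type. The sketch below is one possible way to hold and tally them, not the authors' tooling: the names `PaperRecord`, `MODULES`, and `tally`, the string labels, and the two placeholder entries are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List

# The four modules of the unified framework described in the review.
MODULES = frozenset({"profiling", "memory", "planning", "action"})


@dataclass(frozen=True)
class PaperRecord:
    """One row of extracted data (hypothetical schema)."""
    title: str
    modules: FrozenSet[str]   # subset of MODULES
    profile_method: str       # "handcrafted", "llm_generated", "dataset_aligned", or "none"
    domain: str               # "social_science", "natural_science", or "engineering"
    evaluation: str           # "subjective" or "objective"

    def __post_init__(self) -> None:
        # Reject module names that are not part of the framework.
        unknown = self.modules - MODULES
        if unknown:
            raise ValueError(f"unknown modules: {sorted(unknown)}")


def tally(records: List[PaperRecord]) -> dict:
    """Count records by domain and evaluation type, and how many use all four modules."""
    return {
        "by_domain": Counter(r.domain for r in records),
        "by_evaluation": Counter(r.evaluation for r in records),
        "uses_all_four": sum(r.modules == MODULES for r in records),
    }


# Placeholder entries for illustration only; these are not papers from the review.
records = [
    PaperRecord("placeholder A", MODULES, "handcrafted", "engineering", "objective"),
    PaperRecord("placeholder B", frozenset({"planning", "action"}), "none",
                "social_science", "subjective"),
]
counts = tally(records)
```

A tally like this would also make it possible to report what share of papers fit the full four-module framework, a figure the review does not give.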

Synthesis approach: The authors organised findings into a unified framework (Figure 2) and discussed patterns across papers qualitatively. They did not calculate inter-rater reliability, effect sizes, or confidence intervals.

What this design can prove:

It can identify common architectural patterns across the field
It can map the landscape of current research and highlight gaps
It can propose a standardised taxonomy for future work

What this design cannot prove:

It cannot establish which agent architecture is "best" — no head-to-head comparisons were performed
It cannot quantify effect sizes or statistical significance of any intervention
It cannot control for publication bias (papers with positive results are more likely to be published)
It cannot assess the quality or rigour of individual studies systematically (no risk-of-bias assessment is reported)

Major methodological weaknesses:

No pre-registered protocol
No systematic quality assessment of included studies (e.g., no use of ROBINS-I or similar tools)
No quantitative synthesis or meta-analysis
The search strategy is not fully reproducible
The review is descriptive rather than evaluative — it catalogues what exists rather than testing hypotheses

Key findings

Architecture patterns:

The unified framework (profiling → memory → planning → action) encompasses "most" previous work, though the authors do not report what percentage of papers fit this framework
Three profile generation methods were identified: handcrafting (most common), LLM-generation, and dataset alignment
Memory modules were classified into short-term (within-session) and long-term (cross-session) variants
Planning modules ranged from simple chain-of-thought prompting to hierarchical task decomposition
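
To make the four modules and the short-term/long-term memory split concrete, here is a minimal sketch. It is an illustration under stated assumptions, not code from the review or from any reviewed system: `stub_llm` is a hypothetical stand-in for a real model call, the two tools are toy lambdas, and the "tool: argument" step format is invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    """Profiling module: here a handcrafted role description."""
    role: str


@dataclass
class Memory:
    """Memory module with short-term (within-session) and long-term (cross-session) stores."""
    short_term: List[str] = field(default_factory=list)
    long_term: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.short_term.append(note)

    def end_session(self) -> None:
        # Promote session notes to long-term storage, then clear the session.
        self.long_term.extend(self.short_term)
        self.short_term.clear()


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always returns the same two-step plan."""
    return "search: agent surveys\nsummarise: results"


@dataclass
class Agent:
    profile: Profile
    memory: Memory
    llm: Callable[[str], str]
    tools: Dict[str, Callable[[str], str]]

    def plan(self, task: str) -> List[str]:
        # Planning module: ask the model for one step per line.
        prompt = f"You are {self.profile.role}. Task: {task}. Notes: {self.memory.long_term}"
        return [line for line in self.llm(prompt).splitlines() if line.strip()]

    def act(self, step: str) -> str:
        # Action module: dispatch a "tool: argument" step to a registered tool.
        name, _, arg = step.partition(":")
        tool = self.tools.get(name.strip())
        result = tool(arg.strip()) if tool else f"no tool named {name.strip()!r}"
        self.memory.remember(f"{step} -> {result}")
        return result

    def run(self, task: str) -> List[str]:
        results = [self.act(step) for step in self.plan(task)]
        self.memory.end_session()
        return results


agent = Agent(
    profile=Profile(role="a research assistant"),
    memory=Memory(),
    llm=stub_llm,
    tools={"search": lambda q: f"results for {q}", "summarise": lambda t: f"summary of {t}"},
)
outputs = agent.run("collect recent agent surveys")
```

Swapping `stub_llm` for a real model client is the only change needed to drive the same loop with an actual LLM; the planning step here is the simplest case, a flat list of steps rather than a hierarchical decomposition.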

Application domains:

Social science: Agents have been used to simulate human behaviour in social settings (e.g., Generative Agents simulating a small town of 25 agents), study opinion dynamics, and model economic decisions
Natural science: Agents have been applied to scientific discovery (e.g., ChemCrow for chemistry, BioGPT for biology), though the authors note these are "early-stage"
Engineering: The most mature application area, with agents used for software development (MetaGPT, ChatDev), code generation, and tool use

Evaluation strategies:

Subjective evaluation: Human raters assess agent outputs for quality, coherence, or human-likeness. Used in ~40% of reviewed papers (estimated from figures, not explicitly stated)
Objective evaluation: Automated metrics (e.g., task completion rate, BLEU score, accuracy on benchmarks). Used in ~60% of papers
No standardised benchmark exists: Different papers use different tasks, making cross-study comparison impossible

Capability acquisition strategies:

Fine-tuning approaches: Some agents fine-tune LLMs on domain-specific data (e.g., code for programming agents)
Prompt-based approaches: Most agents use in-context learning (prompt engineering) without modifying model weights
Tool use: Many agents are equipped with external tools (e.g., web search, calculators, code interpreters) to extend capabilities

Challenges identified:

Reliability: LLM-based agents can hallucinate, produce inconsistent outputs, or fail on simple tasks
Safety: Agents acting autonomously could cause harm (e.g., generating malicious code, giving dangerous advice)
Generalisation: Agents trained/tested in one domain often fail in others
Evaluation: No consensus on how to measure agent performance

Effect magnitude

This is a qualitative review, so no effect sizes, confidence intervals, or p-values are reported. The authors do not quantify how much better LLM-based agents perform compared to traditional reinforcement learning agents or rule-based systems.

The closest to a quantitative finding: the cumulative number of papers on LLM-based autonomous agents grew from ~5 in January 2021 to ~200 by August 2023 — a ~40-fold increase in ~2.5 years. This is a bibliometric observation, not an experimental effect.
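
The arithmetic behind that bibliometric figure can be checked directly. The sketch below uses the approximate counts quoted above; the monthly growth rate and doubling time assume smooth exponential growth, which is an assumption of this example and not a claim made in the review.

```python
import math

papers_start = 5     # approximate cumulative count, January 2021 (from the text)
papers_end = 200     # approximate cumulative count, August 2023 (from the text)
months = (2023 - 2021) * 12 + (8 - 1)  # months from January 2021 to August 2023

fold_increase = papers_end / papers_start

# Assumption: constant month-over-month growth between the two endpoints.
monthly_growth = fold_increase ** (1 / months) - 1
doubling_months = math.log(2) / math.log(1 + monthly_growth)

print(f"{fold_increase:.0f}-fold increase over {months} months")
print(f"implied monthly growth: {monthly_growth:.1%}")
print(f"implied doubling time: {doubling_months:.1f} months")
```

Because both endpoint counts are rough, the derived rates are order-of-magnitude indications only.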

Limitations

What the authors acknowledge:

The field is "rapidly developing" and the review may not capture the most recent work
The proposed unified framework may not encompass all possible agent architectures
Evaluation strategies are "not yet mature"
The review is descriptive rather than prescriptive

What a critical reader would note:

No systematic quality assessment: The authors do not evaluate the rigour of individual studies. A paper with 5 participants and a paper with 500 are treated equally
Publication bias: The field is dominated by positive results (agents that work). Failed architectures or negative results are rarely published
LLM dependence: Most reviewed agents use proprietary LLMs (GPT-3/4). Results may not generalise to open-source models or future model versions
No replication analysis: The authors do not report whether any findings have been independently replicated
Industry funding: Many reviewed papers come from tech companies (OpenAI, Google, Meta) with commercial interests in LLMs. The review does not discuss conflicts of interest
Temporal bias: The review covers only 2021–2023. Given the field's rapid pace, findings may already be outdated
No negative results: The review focuses on what agents can do, not what they fail at. Failures are mentioned only briefly in the challenges section
Lack of quantitative synthesis: Without meta-analysis, it's impossible to know which approaches are statistically superior

Practical takeaways

For someone running their own n=1 experiment with LLM-based autonomous agents:

What to test (specific intervention and dose)

Intervention: Build an LLM-based autonomous agent using the unified framework (profiling + memory + planning + action modules)
Dose: Start with a single agent performing one well-defined task (e.g., "write a Python script to scrape website X and save results to CSV"). Do not attempt multi-agent collaboration initially
Comparison: Compare against (a) doing the task manually, (b) using a simple LLM prompt without an agent architecture, or (c) using a traditional rule-based system

Minimum meaningful duration

Per trial: 1–3 hours for a single task completion
Total experiment: At least 10–20 trials across different tasks to assess generalisability
Long-term: If testing memory/learning, run 5–10 sessions over 1–2 weeks, with the agent retaining information across sessions

What to measure (specific metrics)

Task completion rate: Did the agent complete the task? (binary: yes/no)
Time to completion: Minutes from start to finish
Error rate: Number of mistakes (e.g., syntax errors in code, incorrect outputs)
Number of human interventions: How many times did you need to correct or redirect the agent?
Output quality: Rate on a 1–5 scale (subjective, but use a rubric: accuracy, completeness, readability)
Hallucination count: Number of false statements or fabricated information
Cost: API calls made, tokens used, total cost in USD
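
One way to keep these metrics comparable across trials is to log each trial as a fixed record and aggregate afterwards. The sketch below is a suggestion, not something prescribed by the paper: `Trial`, `summarise`, and the field names are hypothetical, and the two logged rows are placeholder values, not measured results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Trial:
    task: str
    completed: bool       # task completion (yes/no)
    minutes: float        # time to completion
    errors: int           # mistakes found in the output
    interventions: int    # human corrections or redirections
    quality: int          # 1-5 rubric score
    hallucinations: int   # false or fabricated statements
    cost_usd: float       # total API cost for the trial


def summarise(trials: List[Trial]) -> dict:
    """Aggregate per-trial metrics into headline numbers."""
    if not trials:
        raise ValueError("need at least one trial")
    return {
        "completion_rate": sum(t.completed for t in trials) / len(trials),
        "mean_minutes": mean(t.minutes for t in trials),
        "errors_per_task": mean(t.errors for t in trials),
        "interventions_per_task": mean(t.interventions for t in trials),
        "mean_quality": mean(t.quality for t in trials),
        "total_hallucinations": sum(t.hallucinations for t in trials),
        "total_cost_usd": sum(t.cost_usd for t in trials),
    }


# Placeholder values for illustration only, not measured results.
log = [
    Trial("scrape-to-csv", True, 12.0, 1, 0, 4, 0, 0.40),
    Trial("scrape-to-csv", False, 30.0, 3, 2, 2, 1, 0.90),
]
summary = summarise(log)
```

Recording every field for every trial, including failed ones, keeps the completion rate honest and makes later comparisons between conditions straightforward.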

Key confounds to control for

LLM version: Use the same pinned model version throughout (e.g., gpt-4-turbo-2024-04-09). Model updates can change behaviour
Prompt engineering: Small changes in prompts can cause large changes in output. Document your exact prompts
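Temperature setting: Keep temperature constant within a comparison (start with a low value such as 0.2 for tasks that need consistent output, 0.7 for creative tasks). A low temperature reduces variability but does not guarantee identical outputs
Task difficulty: Match task difficulty across conditions and vary it systematically. Don't compare easy tasks in one condition with hard tasks in another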
Order effects: Randomise the order of tasks if comparing multiple conditions
Learning effects: The agent may "learn" from previous tasks if memory is enabled. Control for this by resetting the agent between conditions
Human bias: If you're evaluating outputs subjectively, use blinded evaluation (rate each output without knowing which condition produced it)
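
Two of these controls, order randomisation and blinded rating, are easy to script. The following is a minimal sketch using only the standard library; the function names, the `output_N` labels, and the example task and condition names are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def randomised_schedule(tasks: List[str], conditions: List[str], seed: int) -> List[Tuple[str, str]]:
    """Return every (task, condition) pair in a shuffled but reproducible order."""
    rng = random.Random(seed)  # a fixed seed makes the order repeatable
    pairs = [(task, cond) for task in tasks for cond in conditions]
    rng.shuffle(pairs)
    return pairs


def blind(outputs: Dict[str, str], seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Swap condition names for neutral labels; return (blinded outputs, answer key)."""
    rng = random.Random(seed)
    conditions = list(outputs)
    rng.shuffle(conditions)
    key = {f"output_{i + 1}": cond for i, cond in enumerate(conditions)}
    blinded = {label: outputs[cond] for label, cond in key.items()}
    return blinded, key


schedule = randomised_schedule(["task_a", "task_b"], ["agent", "plain_prompt"], seed=7)
blinded, key = blind(
    {"agent": "text from agent", "plain_prompt": "text from prompt"}, seed=7
)
```

Rate the entries in `blinded` first and consult `key` only after all scores are recorded. Storing the seed alongside the results lets the same schedule be regenerated later.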

What a positive result would look like

Task completion rate: ≥80% of tasks completed without human intervention (compared to ≤50% with a simple LLM prompt)
Time savings: Agent completes tasks in ≤25% of the time it takes you manually
Error reduction: Agent makes ≤1 error per task (compared to ≥3 errors when you do it manually)
Cost efficiency: Total API cost is less than the value of your time saved (e.g., $2 in API calls saves you 30 minutes of work)
Consistency: Agent produces similar-quality outputs across 10+ trials (standard deviation in quality ratings ≤0.5 on 1–5 scale)
Generalisation: Agent succeeds on tasks it wasn't explicitly designed for (e.g., a "code-writing agent" can also debug existing code)
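
The numeric criteria above (completion, time, cost, consistency) can be checked mechanically once trials are logged. This is a sketch under assumptions: `meets_thresholds` and its parameters are hypothetical, the sample standard deviation is used because the text does not say which variant is meant, and the hourly value of your time is a number you must supply yourself. The inputs in the example call are placeholders, not results.

```python
from statistics import stdev
from typing import List


def meets_thresholds(
    completed: List[bool],
    quality: List[int],
    agent_minutes: float,
    manual_minutes: float,
    api_cost_usd: float,
    hourly_value_usd: float,
) -> dict:
    """Check the completion, time, cost, and consistency criteria against logged trials."""
    minutes_saved = manual_minutes - agent_minutes
    return {
        # At least 80% of tasks completed without intervention.
        "completion": sum(completed) / len(completed) >= 0.80,
        # Agent takes at most 25% of the manual time.
        "time": agent_minutes <= 0.25 * manual_minutes,
        # API cost is below the value of the time saved.
        "cost": api_cost_usd < hourly_value_usd * minutes_saved / 60,
        # At least 10 rated trials with sample standard deviation of 0.5 or less.
        "consistency": len(quality) >= 10 and stdev(quality) <= 0.5,
    }


# Placeholder inputs for illustration only.
result = meets_thresholds(
    completed=[True] * 9 + [False],
    quality=[4, 4, 4, 4, 4, 4, 4, 4, 4, 5],
    agent_minutes=7,
    manual_minutes=30,
    api_cost_usd=2.0,
    hourly_value_usd=40.0,
)
```

Error reduction and generalisation are left out of the function because they depend on a manual baseline and on tasks outside the agent's design, both of which need separate trials.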

Warning: A single successful trial does not mean the agent is reliable. Run at least 10 trials before drawing conclusions. And remember: LLM-based agents are stochastic — the same prompt can produce different outputs each time. Track this variability.
