LLM Surveys in Financial Applications
| Citations | Title (First Author, Year) | Distilled Core Info |
| --- | --- | --- |
| 513 | Large language models in finance: A survey (Li et al., 2023) | Adoption Roadmap: Proposes a decision framework for choosing between zero-shot, fine-tuning, or custom LLMs based on data/compute constraints. |
| 140 | A survey of LLMs for financial applications (Nie et al., 2024) | Resource Hub: Categorizes tasks (sentiment, forecasting) and provides a curated list of specific datasets, model assets, and codebases. |
| 114 | TradingAgents: Multi-agents LLM financial trading framework (Xiao et al., 2024) | Firm Simulation: Uses specialized agents (Analyst, Risk Manager, Trader) debating in a group to outperform single-model baselines. |
| 69 | Large language models in finance (FinLLMs) (Lee et al., 2025) | Technical Benchmarking: Compares training techniques across 8 financial LLMs and evaluates them on 6 key benchmark tasks; highlights hallucination risks. |
| 58 | Large language model agent in financial trading: A survey (Ding et al., 2024) | Agent vs. Human: Reviews autonomous agent architectures specifically for trading and analyzes their backtesting performance against professional standards. |
| 52 | Automated trading with boosting and expert weighting (Creamer et al., 2010) | Classical ML: (Pre-LLM) Uses Alternating Decision Trees and boosting to aggregate technical indicators and optimize rule weighting. |
| 36 | Sentiment trading with large language models (Kirtac et al., 2024) | Model Superiority: Empirically proves OPT outperforms BERT/FinBERT in sentiment analysis, achieving a 3.05 Sharpe ratio. |
| 16 | From deep learning to LLMs: a survey of AI in quant investment (Cao et al., 2025) | Alpha Generation: Tracks the shift from Deep Learning prediction to LLM agents that autonomously generate alpha strategies and process unstructured data. |
| -- | The New Quant: A Survey of LLMs in Financial Prediction (Fu, 2025) | Production Safety: Focuses on "evidence-grounded" signals and production challenges like temporal leakage, hallucination, and deployment economics. |
Large Language Models in Finance: A Survey
Q1: The "Gap"
The specific spark for this paper was the rapid emergence of powerful Large Language Models (LLMs) like ChatGPT, which demonstrated unprecedented capabilities in understanding and generating natural language.
The authors identified a critical gap between potential and practical adoption in the financial sector. While the potential for LLMs to revolutionize trading, risk modeling, and customer service was clear, there was no structured guidance on how to implement them responsibly.
Specifically, the paper was written to address the lack of:
- A synthesized view of existing solutions: Financial professionals needed a clear comparison between using pre-trained APIs, fine-tuning open-source models, or training from scratch.
- A decision-making roadmap: There was no framework to help professionals weigh the complex trade-offs between data privacy, computational costs, and performance requirements when choosing an LLM solution.
Q2: The Concept
If you remember only one thing, it should be the "LLM Decision Hierarchy" (referred to as the Decision Process Framework in the paper).
Rather than viewing LLMs as a "one-size-fits-all" solution, the paper proposes a four-level progressive framework for adoption that scales with cost and complexity:
- Level 1 (Zero-shot): Start here. Use APIs (like GPT-4) or self-hosted open-source models for basic tasks. If this fails, move up.
- Level 2 (Few-shot): Improve performance by providing examples in the context window.
- Level 3 (Tool-Augmented/Fine-tuning): Inject knowledge via external tools (e.g., search engines, calculators) or fine-tune models on domain data if tasks are complex.
- Level 4 (Train from Scratch): The "nuclear option." Only necessary if you have millions in budget and billions of tokens, and lower levels have failed.
This framework provides a durable mental model for balancing resource investment against performance needs.
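The escalation logic of the four levels can be sketched as a simple gate that only moves up when the cheaper level fails. This is an illustrative sketch: the function name, pass/fail flags, and the budget/token thresholds are assumptions for the example, not figures from the paper.

```python
# Hypothetical encoding of the paper's four-level LLM decision hierarchy.
# The flags and thresholds below are illustrative, not from the survey.

def choose_llm_level(zero_shot_ok: bool,
                     few_shot_ok: bool,
                     tool_or_finetune_ok: bool,
                     budget_musd: float,
                     training_tokens_bn: float) -> str:
    """Escalate only when the cheaper level fails to meet requirements."""
    if zero_shot_ok:
        return "Level 1: zero-shot (API or self-hosted open-source model)"
    if few_shot_ok:
        return "Level 2: few-shot (examples in the context window)"
    if tool_or_finetune_ok:
        return "Level 3: tool augmentation or fine-tuning on domain data"
    # The "nuclear option": requires a large budget and a large corpus.
    if budget_musd >= 1.0 and training_tokens_bn >= 1.0:
        return "Level 4: train from scratch"
    return "Reconsider requirements: Level 4 is out of reach"
```

The point of encoding it this way is that cost only rises when a measurable failure at the current level justifies it.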
Q3: Evidence for Skeptics
Skeptics who question whether specialized "Financial LLMs" are worth the massive investment over general models (or whether they work at all) should consider these specific results:
- The "1% Rule" of Domain Data: Skeptics often argue that massive domain datasets are required to see improvement. However, BloombergGPT achieved significant gains on financial benchmarks (scoring 62.47 vs. 33.39 for the similarly sized BLOOM-176B model) even though Bloomberg's private financial data made up less than 1% (0.7%) of the total training corpus. This shows that even a small fraction of high-quality, domain-specific data can materially shift model performance.
- Beating the Giants on Classification: Fine-tuned financial models (such as FinMA and FinGPT) outperformed massive general-purpose models like GPT-4 and ChatGPT on specific financial classification tasks (e.g., sentiment analysis and news headline classification). This suggests that smaller, cheaper, specialized models can beat larger generalist models in specific vertical applications.
- Solving the "Unfeasible" via Tool Augmentation: The paper highlights Auto-GPT's ability to autonomously optimize a portfolio by formulating plans, acquiring data, and using Python packages for Sharpe-ratio optimization. This was cited as an end-to-end solution "previously unfeasible with a single model," demonstrating that LLMs can orchestrate complex workflows beyond simple text generation.
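The Sharpe-ratio step such an agent delegates to Python can be sketched with the standard library alone. Everything here is an illustrative assumption: the data shape, the helper names, and the brute-force two-asset weight grid stand in for whatever optimizer Auto-GPT actually invoked.

```python
# Illustrative Sharpe-ratio portfolio optimization, standard library only.
# Not Auto-GPT's actual code; a sketch of the task it automated.
import statistics

def sharpe(returns, rf=0.0):
    """Simple mean/stdev Sharpe on periodic returns (annualization omitted)."""
    excess = [r - rf for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def portfolio_returns(weights, asset_returns):
    """Per-period weighted sum across assets (asset_returns: one list per asset)."""
    return [sum(w * r for w, r in zip(weights, period))
            for period in zip(*asset_returns)]

def grid_search_weights(asset_returns, step=0.1):
    """Brute-force two-asset weight grid; a real agent would call an optimizer."""
    best_w, best_s = None, float("-inf")
    for i in range(int(1 / step) + 1):
        w = round(i * step, 2)
        s = sharpe(portfolio_returns((w, 1 - w), asset_returns))
        if s > best_s:
            best_w, best_s = (w, 1 - w), s
    return best_w, best_s
```

In practice the grid search would be replaced by a proper optimizer (e.g. `scipy.optimize.minimize` over the negative Sharpe ratio); the sketch only shows the shape of the workflow the agent chained together.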
A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges
| Task | Top Performing / Notable Model(s) | Why? (Key Capabilities & Results) |
| --- | --- | --- |
| Sentiment Analysis | FinBERT GPT-4 Llama 2 | • FinBERT: Outperforms general models like BERT due to domain-specific pre-training on financial corpora (news, filings), enabling better grasp of financial context • GPT-4: Demonstrates superior zero-shot capabilities in detecting sophisticated sentiments (e.g., in microblogs) and handling nuances that smaller models miss • Llama 2: In specific tests (e.g., news headlines), the 7B model outperformed previous BERT-based and LSTM methods |
| Stock Movement Prediction (Forecasting) | GPT-4 Ploutos RiskLabs | • GPT-4: Significantly outperforms GPT-2, BERT, and traditional models in predicting returns from news headlines, particularly for small stocks and negative news • Ploutos: Uses a "rearview-mirror" prompting strategy and an expert pool to achieve higher accuracy and interpretability than traditional methods • RiskLabs: Outperforms benchmarks by fusing multimodal data (text, vocal, time-series) from earnings calls rather than relying on a single modality |
| Named Entity Recognition (NER) | FLANG UniversalNER KPI-BERT | • FLANG: Outperforms general models by using financial-specific masking strategies during pre-training, allowing it to better handle financial terminology • UniversalNER: Uses targeted distillation and instruction tuning to achieve high accuracy without direct supervision, reducing computational costs • KPI-BERT: Specialized for extracting Key Performance Indicators from German documents using an end-to-end trainable architecture |
| Financial Text Summarization | LED (Longformer) | • LED: Addresses the limitation of standard transformers by handling long sequences (lengthy financial reports) through a scalable self-attention mechanism |
| Question Answering (Numerical Reasoning) | WeaverBird GPT-4 (with EEDP) | • WeaverBird: Outperforms other models in financial QA by combining an LLM with a local knowledge base and search engine to reduce hallucinations and cite sources • GPT-4 (EEDP): When used with the "Elicit-Extract-Decompose-Predict" prompting strategy, it outperforms Chain of Thought (CoT) prompting in multi-step numerical reasoning |
| Relation Extraction / Knowledge Graph | FinDKG SEC-BERT (MOAT) | • FinDKG: Uses LLMs to incorporate a temporal layer, allowing it to adapt to changing market trends and economic indicators better than static graphs • SEC-BERT (MOAT): Outperforms by masking one entity at a time to extract specific contextual embeddings for classifying relationships |
| Textual Classification (Industry/Topic) | Sentence-BERT (Fine-tuned) KGEB | • Sentence-BERT: Accurately reproduces GICS industry classifications and outperforms baselines in identifying peer companies based on return correlations • KGEB: Enriches BERT with knowledge graph embeddings, achieving higher accuracy (91.98%) than standard BERT or CNNs |
| Fraud Detection | FinChain-BERT RiskLabs | • FinChain-BERT: Enhanced accuracy by focusing specifically on key financial terms during training • RiskLabs: Outperforms traditional methods in predicting financial risk (volatility/variance) by integrating vocal and textual data from conference calls |
| Agent-Based Trading | StockAgent (GPT-3.5) TradingGPT FinAgent | • StockAgent: GPT-3.5 agents showed more diverse and independent trading behaviors compared to Gemini agents (which were more homogeneous) • TradingGPT: Uses a triple-layered memory system (short/medium/long term) to better adapt to historical trades and real-time cues • FinAgent: Superior in high-frequency trading due to its ability to integrate multimodal data (visual, text, numerical) |
| Code Generation (Trading Strategies) | GPT-4-Turbo | • GPT-4-Turbo: In tests comparing multiple models (including Llama 2 and Mistral), it showed strong capability in generating correct, executable code for technical indicators when provided with well-designed prompts |
| Auditing & Compliance | ZeroShotALI (GPT-4 + SentenceBERT) | • ZeroShotALI: Combines GPT-4's generation with SentenceBERT's matching to inspect lists and match text to legal requirements more efficiently than traditional methods |
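The sentiment-trading results above (scoring text with FinBERT/GPT-4-class models, then trading on the scores) typically reduce to a score-to-position mapping. A minimal sketch, assuming scores in [-1, 1] and illustrative thresholds; the function names and cutoffs are assumptions for the example, not taken from any of the surveyed papers.

```python
# Hypothetical mapping from model sentiment scores to long/short positions,
# as used in sentiment-trading studies; thresholds are illustrative.

def sentiment_to_position(score: float,
                          long_thresh: float = 0.2,
                          short_thresh: float = -0.2) -> int:
    """Return +1 (long), -1 (short), or 0 (flat) for a score in [-1, 1]."""
    if score >= long_thresh:
        return 1
    if score <= short_thresh:
        return -1
    return 0

def strategy_returns(scores, next_day_returns):
    """The position taken on today's sentiment earns the next day's return."""
    return [sentiment_to_position(s) * r
            for s, r in zip(scores, next_day_returns)]
```

The one-day lag between score and return is the part such studies must get right: signals may only use text available before the return they trade on, which is exactly the temporal-leakage risk the Fu (2025) survey flags.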