How to Use Large Language Models While Reducing Cost and Improving Performance


Researchers at Stanford have proposed a method called FrugalGPT to harness the power of large language models while significantly reducing their inference cost. It can potentially match GPT-4’s performance while reducing cost by 98%.

Some highlights:

  • There is a rapidly growing number of large language models (LLMs) available as commercial APIs. Using these APIs can be very expensive: depending on the use case, costs range from $21k per month up to $700k per day.
  • The cost of using different LLM APIs varies significantly, by up to two orders of magnitude. For example, processing 10 million input tokens costs $0.20 with GPT-J versus $30 with GPT-4.
  • The paper proposes three strategies to reduce the cost of using LLMs while maintaining performance:
    1. Prompt adaptation: Using shorter prompts to reduce input length and save cost. This includes prompt selection (using only relevant examples) and query concatenation (aggregating multiple queries into one prompt).
    2. LLM approximation: Approximating expensive LLMs with smaller, cheaper models for specific tasks. This includes caching previously generated answers and fine-tuning cheap models with answers from expensive LLMs.
    3. LLM cascade: Selectively choosing which LLM APIs to use for different queries based on cost and reliability. Cheaper LLMs are queried first, reserving expensive ones only for “hard” queries.
  • FrugalGPT’s key technique is LLM cascade. In experiments:
    • It matched GPT-4’s performance while reducing cost by an astonishing 98%!
    • It improved accuracy over GPT-4 by 4% at the same cost.
  • By composing strategies, greater gains are possible.
  • In a simple cascade, more affordable APIs like GPT-J and J1-L answer most queries, while GPT-4 is reserved for the hardest ones. This cuts costs drastically while maintaining performance.
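The cascade idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the model functions are stand-ins for real API calls, and the reliability scorer (which FrugalGPT learns from data) is replaced here by a hypothetical heuristic and threshold.

```python
def cheap_model(query: str) -> str:
    """Stand-in for an inexpensive API such as GPT-J."""
    return "cheap answer to: " + query

def expensive_model(query: str) -> str:
    """Stand-in for a costly API such as GPT-4."""
    return "expensive answer to: " + query

def reliability_score(query: str, answer: str) -> float:
    """Stand-in scorer. FrugalGPT trains a small model to predict
    whether a cheap answer is trustworthy; this toy heuristic just
    pretends queries mentioning 'easy' are reliably answered."""
    return 0.9 if "easy" in query else 0.2

def cascade(query: str, threshold: float = 0.5) -> str:
    """Query the cheap model first; escalate only when the
    reliability score falls below the threshold."""
    answer = cheap_model(query)
    if reliability_score(query, answer) >= threshold:
        return answer              # cheap answer accepted
    return expensive_model(query)  # "hard" query: escalate
```

In this sketch, the threshold controls the cost/quality trade-off: raising it escalates more queries to the expensive model, lowering it saves more money.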

