RanceLee Tutorials

API and Token Basics Explained

You may have noticed that many experienced users talk about API, Token, Temperature, and other terms that sound technical and confusing. This chapter explains these core concepts in plain language. Understanding them will help you truly grasp how AI works and use it more effectively.


What is an API?

API in Plain English

API = Application Programming Interface

That definition sounds technical, so let’s put it differently.

Think of AI as a restaurant:

  • Web version = You dine in at the restaurant
    • Nice decor (web interface)
    • Waitstaff (buttons, input fields)
    • You order, the chef cooks, the waiter serves
  • API = You call for takeout
    • No decor, you talk directly to the kitchen
    • No waiter, you speak directly to the chef
    • You say what you want, the chef prepares it and hands it to you

Key difference:

  • Web version: has an interface, convenient for humans
  • API: no interface, convenient for programs

Why Use the API?

If the web version is so convenient, why bother with API?

Reason 1: Automation

Suppose you need AI to process 1,000 documents and write 1,000 summaries:

  • Web version: You copy-paste 1,000 times and click send 1,000 times
  • API: Write a script that processes everything automatically while you grab coffee
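The automation case can be sketched in Python, the language used elsewhere in this chapter. This is a minimal sketch, not a finished tool: `summarize_all` and `make_openai_caller` are hypothetical helper names, and the real API call assumes the official `openai` package plus an `OPENAI_API_KEY` environment variable. The model call is passed in as a function, so the loop can be dry-run with a fake model and no API key.

```python
from typing import Callable, Dict


def summarize_all(documents: Dict[str, str],
                  call_model: Callable[[str], str]) -> Dict[str, str]:
    """Send each document to the model and collect one summary per document."""
    summaries = {}
    for name, text in documents.items():
        prompt = f"Summarize the following document in 3 sentences:\n\n{text}"
        summaries[name] = call_model(prompt)
    return summaries


def make_openai_caller(model: str = "gpt-5.2") -> Callable[[str], str]:
    """Wrap a real API call (assumes `pip install openai` and OPENAI_API_KEY)."""
    from openai import OpenAI  # imported here so the sketch runs without the SDK

    client = OpenAI()

    def call_model(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    return call_model


if __name__ == "__main__":
    # Dry run with a fake "model" so no API key is needed
    docs = {"a.txt": "First document...", "b.txt": "Second document..."}
    fake_model = lambda prompt: f"(summary of a {len(prompt)}-character prompt)"
    for name, summary in summarize_all(docs, fake_model).items():
        print(name, "->", summary)
```

For the real run, you would pass `make_openai_caller()` instead of `fake_model`, and the same loop works for 1,000 documents as for 2.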

Reason 2: Integration into your own apps

You want to build an auto-reply bot, a content generator, or a smart customer service agent:

  • Web version: Not possible
  • API: You can embed AI directly into your own programs

Reason 3: Lower cost

  • Web subscription: ChatGPT Plus $20/month, Claude Pro $20/month
  • API pay-as-you-go: Pay only for what you use; light usage might cost just a few dollars per month

Reason 4: More flexibility

  • Adjust generation parameters (Temperature, maximum response length, etc.)
  • Batch processing
  • Custom input/output formats

What Does an API Call Look Like?

Here’s a simple example (don’t worry if it looks unfamiliar – we’ll cover it in detail later):

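```python
# Call GPT-5.2 through the OpenAI Python SDK (pip install openai)
from openai import OpenAI

client = OpenAI()  # reads your API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
)
print(response.choices[0].message.content)
```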

Just a few lines of code, and the AI answers your question – no browser needed.

Official API model identifiers as of 2026-01-30:

  • OpenAI: gpt-5.2, gpt-5.2-chat-latest, gpt-5.2-pro
  • Anthropic Claude: claude-opus-4-5, claude-sonnet-4-5
  • Google Gemini: gemini-3-pro-preview, gemini-3-flash-preview

Web Version vs API Comparison

| Aspect | Web Version | API |
|---|---|---|
| How to use | Click around in a browser | Write code to call it |
| Learning curve | Low, anyone can use it | High, requires some programming |
| Best for | Daily chat, writing articles | Automation, batch processing, app integration |
| Cost | Monthly subscription ($20/month) | Pay-as-you-go (pay for what you use) |
| Flexibility | Limited by web features | Highly customizable |
| Speed | Average | Usually faster (no UI rendering) |

What is a Token?

The Concept of Token

Token = The smallest unit of text that AI understands

Unlike humans, who read words and sentences directly, AI needs to break text into small pieces. Each piece is called a token.

Examples:

Chinese:

  • “你好” (“hello”) ≈ 1–2 tokens
  • “今天天气不错” (“the weather is nice today”) ≈ 4–8 tokens, depending on the model

English:

  • “Hello” = 1 token
  • “How are you today?” ≈ 5 tokens

Simple rules of thumb:

  • English: 1 token ≈ 4 characters ≈ 0.75 words (so 1 word ≈ 1.3 tokens)
  • Chinese: 1 character ≈ 0.5–2 tokens (depends on the AI model)
  • Punctuation: usually 1 symbol = 1 token; long numbers are often split into several tokens

Important Discovery: Different AI Models Define Tokens Differently!

Here’s something many people don’t realize: the same text can have a different token count in different AI models!

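Why? Because each model family uses its own tokenizer, and different tokenizers split text differently. This is true even between models from the same company, as the example below shows.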

Real example:

The same sentence: “AI is revolutionizing market research.”

  • GPT-3: 11 tokens
  • GPT-3.5 and GPT-4: 9 tokens
  • GPT-4o and GPT-5.2: 8 tokens

See? The same sentence differs by 3 tokens across models!

A Chinese example:

The sentence “人工智能正在改变世界” (“Artificial intelligence is changing the world”):

  • GPT-4o: maybe 10 tokens
  • Claude Sonnet 4.5: maybe 12 tokens
  • Gemini 3: maybe 8 tokens

Why the difference?

Each company uses a different tokenization method when training its models:

  • OpenAI (GPT series): uses BPE (Byte-Pair Encoding)
  • Anthropic (Claude): uses its own optimized tokenizer
  • Google (Gemini): uses its own tokenizer; Google’s documentation estimates “1 token ≈ 4 characters”
  • DeepSeek: a tokenizer optimized for Chinese
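To make “BPE” less abstract, here is a toy version of the idea in plain Python: start from single characters and repeatedly merge the most frequent adjacent pair. It is a teaching sketch under simplifying assumptions, not OpenAI’s actual tokenizer (real BPE tokenizers work on bytes and learn tens of thousands of merges from huge corpora), and the tiny corpus and function names are made up for illustration. It does show why tokenizers trained differently split the same word into a different number of tokens.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(tokens: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learn merge rules: repeatedly merge the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        new_vocab: Counter = Counter()
        for word, freq in vocab.items():
            new_vocab[merge_pair(word, best)] += freq
        vocab = new_vocab
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Split a word into tokens by applying the learned merges in order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = train_bpe(corpus, num_merges=3)
print(merges)                        # [('l', 'o'), ('lo', 'w'), ('e', 's')]
print(encode("lowest", merges))      # ['low', 'es', 't']
print(encode("lowest", merges[:1]))  # ['lo', 'w', 'e', 's', 't']
```

The last two lines are the point: the same word becomes 3 tokens with one set of merges and 5 with another, which is the same effect you see between real models.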

How does this affect you?

1. Cost comparisons aren’t direct

Suppose you have 1,000 Chinese characters:

  • With GPT-5.2 it might be 1,500 tokens
  • With Claude Sonnet 4.5 it might be 1,600 tokens
  • With Gemini 3 it might be 1,400 tokens

Even when two models list the same price per 1M tokens, the same text can cost 10–20% more on one than on another, because it is split into a different number of tokens.
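Here is the arithmetic for the 1,000-character example above, assuming a hypothetical identical price of $2 per 1M input tokens for all three models (the price is made up; the token counts are the estimates from the list):

```python
# Estimated token counts for the same 1,000 Chinese characters (from the list above)
token_counts = {"GPT-5.2": 1500, "Claude Sonnet 4.5": 1600, "Gemini 3": 1400}
PRICE_PER_MILLION = 2.00  # hypothetical: the same input price for every model

costs = {model: n * PRICE_PER_MILLION / 1_000_000 for model, n in token_counts.items()}
for model, cost in costs.items():
    print(f"{model}: ${cost:.4f}")

cheapest, priciest = min(costs.values()), max(costs.values())
print(f"Priciest vs cheapest: {(priciest - cheapest) / cheapest:.0%} more")
```

Even with identical prices, the model that produces 1,600 tokens costs about 14% more than the one that produces 1,400.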

2. You can’t use the same token calculator for all models

  • OpenAI’s official tokenizer (https://platform.openai.com/tokenizer) only works for GPT series
  • Claude tokens need Anthropic’s calculation method
  • Gemini tokens need Google’s calculation method

3. Non‑English languages show even bigger differences

For Chinese, Japanese, Arabic, and other non‑English languages, the token count for the same text can vary by 30–40% between models. Most AI models are trained primarily on English, so their tokenizers are better optimized for English.

Why Tokens Matter

1. Tokens determine cost

API pricing is based on tokens, not character count.

Example (official prices as of 2026-01-30):

  • GPT-5.2: input $1.75/1M tokens, output $14/1M tokens
  • Claude Opus 4.5: input $5/1M tokens, output $25/1M tokens
  • Gemini 3 Flash: input $0.50/1M tokens, output $3/1M tokens (standard tier)

You send 500 tokens and the AI replies with 1,000 tokens:

  • With GPT-5.2: (500 × 1.75 + 1000 × 14) / 1,000,000 = $0.01488 (about 1.5 cents USD)
  • With Gemini 3 Flash: (500 × 0.50 + 1000 × 3) / 1,000,000 = $0.00325 (about 0.3 cents USD)
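The same calculation as a small Python helper, a minimal sketch using the per-1M-token prices listed above (the dictionary keys are display labels, not API model identifiers, and `request_cost` is an illustrative name):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request in USD; prices are USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# (input price, output price) per 1M tokens, from the list above (2026-01-30)
PRICES = {
    "GPT-5.2": (1.75, 14.00),
    "Claude Opus 4.5": (5.00, 25.00),
    "Gemini 3 Flash": (0.50, 3.00),
}

for model, (inp, out) in PRICES.items():
    print(f"{model}: ${request_cost(500, 1000, inp, out):.6f}")
# GPT-5.2: $0.014875
# Claude Opus 4.5: $0.027500
# Gemini 3 Flash: $0.003250
```

Note how the output side dominates: 1,000 output tokens at $14 cost far more than 500 input tokens at $1.75.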

2. Tokens determine context length

Every AI model has a token limit:

  • GPT-5.2 (API): up to 400,000 tokens
  • GPT-5.2-chat-latest: up to 128,000 tokens
  • Claude Sonnet 4.5: up to 200,000 tokens
  • Gemini 3 Pro Preview: up to 1,048,576 tokens (about 1M)

This limit includes: your prompt + AI’s response + conversation history.

What happens if you exceed the limit?

  • In chat apps, the earliest parts of the conversation are usually dropped, so the AI “forgets” them
  • Through the API, the request is usually rejected with an error and won’t continue
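The “forgetting” behavior can be sketched as a simple trimming loop that drops the oldest messages until the conversation fits. This is a hypothetical illustration, not how any particular chat app works: `trim_history` is a made-up helper, real apps use a real tokenizer and their own strategies, and the 4-characters-per-token counter is only a crude stand-in.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: about 4 characters per token
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], context_limit: int,
                 reserve_for_reply: int,
                 count: Callable[[str], int] = rough_token_count) -> List[Message]:
    """Drop the oldest non-system messages until the rest fits the budget."""
    budget = context_limit - reserve_for_reply  # leave room for the AI's answer
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: List[Message]) -> int:
        return sum(count(m["content"]) for m in msgs)

    while rest and total(system + rest) > budget:
        rest.pop(0)  # the earliest turn is "forgotten" first
    return system + rest  # system messages are kept and placed first
```

Reserving room for the reply matters because, as noted above, the limit covers the prompt, the history, and the response together.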

How to Count Tokens

Method 1: Estimate (quick but not precise)

  • Chinese: number of characters × 1.5
  • English: number of words × 1.3
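Method 1 fits in a few lines of Python. This is a minimal sketch of the rules of thumb above, not a real tokenizer: the regular expression covers only the common CJK Unified Ideographs block, and everything else is counted as whitespace-separated words.

```python
import re

CJK = r"[\u4e00-\u9fff]"  # common Chinese characters (CJK Unified Ideographs)


def estimate_tokens(text: str) -> float:
    """Quick estimate: Chinese characters x 1.5 plus English words x 1.3."""
    chinese_chars = len(re.findall(CJK, text))
    # Blank out the Chinese characters, then count the remaining words
    english_words = len(re.sub(CJK, " ", text).split())
    return chinese_chars * 1.5 + english_words * 1.3


print(round(estimate_tokens("How are you today?"), 1))  # 5.2
print(round(estimate_tokens("人工智能正在改变世界"), 1))    # 15.0
```

For the Chinese sentence the estimate (15) is noticeably higher than the 8–12 tokens quoted earlier for real tokenizers, which is exactly why this method is “quick but not precise.”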

Method 2: Use the official tool for the model you are using (for example, OpenAI’s tokenizer page for GPT models)

Important reminder: When estimating across models, always use the tool specific to that model. Don’t use GPT’s token count to estimate Claude’s cost!

Input Tokens, Output Tokens, Cached Tokens

API billing divides tokens into three types:

1. Input Tokens

  • The content you send to the AI
  • Includes your prompt, uploaded documents
  • Relatively cheap

2. Output Tokens

  • The content the AI returns to you
  • Includes the AI’s response
  • Usually 2–10 times more expensive than input tokens

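Why is output more expensive? Input tokens can be processed together in parallel, but output tokens are generated one at a time, each needing its own pass through the model. “Writing” therefore uses more computing resources than “reading.”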

Example (GPT-5.2):

  • Input: $1.75/1M tokens
  • Output: $14/1M tokens (8× the input price!)

3. Cached Tokens

This is a cost‑saving trick!

If you repeatedly send the same prompt (or prompts that start with the same long block of text), the provider can cache that part, skip reprocessing it next time, and bill it at a lower rate.

Example: You have a 1,000‑token prompt and ask 10 questions:

  • Without caching: each time processes 1,000 tokens → total 10,000 tokens
  • With caching: first time 1,000 tokens (normal price), next 9 times 1,000 tokens (cache price, 90% cheaper)
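To see what that means in dollars, here is a minimal sketch using the GPT-5.2 input price listed earlier. It assumes cache hits are billed at 10% of the normal price (“90% cheaper”), counts only the shared prompt (not the questions or answers), and ignores any extra charge some providers add for writing to the cache.

```python
PROMPT_TOKENS = 1_000        # shared prompt, sent with every question
QUESTIONS = 10
INPUT_PRICE = 1.75           # USD per 1M input tokens (GPT-5.2 price listed above)
CACHED_PRICE_FACTOR = 0.10   # assumption: cache hits billed at 10% ("90% cheaper")


def shared_prompt_cost(use_cache: bool) -> float:
    """Cost in USD of the repeated prompt alone (questions and answers excluded)."""
    if not use_cache:
        return QUESTIONS * PROMPT_TOKENS * INPUT_PRICE / 1_000_000
    first = PROMPT_TOKENS * INPUT_PRICE  # first request pays full price
    repeats = (QUESTIONS - 1) * PROMPT_TOKENS * INPUT_PRICE * CACHED_PRICE_FACTOR
    return (first + repeats) / 1_000_000


without = shared_prompt_cost(use_cache=False)
with_cache = shared_prompt_cost(use_cache=True)
print(f"Without caching: ${without:.6f}")        # $0.017500
print(f"With caching:    ${with_cache:.6f}")     # $0.003325
print(f"Saved: {1 - with_cache / without:.0%}")  # 81%
```

The saving is 81% rather than 90% because the first request still pays full price.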

Models that support caching:

  • Anthropic Claude (Prompt Caching)
  • OpenAI GPT-5.2 (supports caching, 90% discount)

Caching billing rules:

  • First request: normal price (some providers, such as Anthropic, charge slightly more for writing to the cache)
  • Cache hit: price reduced by 50–90%
  • Cache validity: usually 5–10 minutes

What is Temperature?

The Concept of Temperature

Temperature = Controls the “randomness” or “creativity” of AI responses

Recall that AI essentially “calculates probabilities.” When you ask “What color is the sky?”, the AI sees:

  • “Blue” probability 80%
  • “Gray” probability 10%
  • “Red” probability 5%

Temperature adjusts how the AI chooses among these options.

Temperature Values

Temperature typically ranges from 0 to 2 (or 0 to 1, depending on the platform):

Temperature = 0 (most conservative)

  • The AI always picks the highest‑probability answer
  • Very stable, predictable responses
  • Same question → almost identical answer every time
  • Best for: factual questions, code generation, data analysis

Temperature = 1 (balanced)

  • The AI chooses randomly according to probabilities
  • Responses vary a bit but stay reasonable
  • Default for most platforms
  • Best for: everyday conversation, general use

Temperature = 2 (most aggressive)

  • The AI tries many possibilities
  • Very diverse, creative responses
  • May be inaccurate or even nonsensical
  • Best for: creative writing, brainstorming, artistic work
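A small sketch makes the effect concrete. Real models apply temperature to the scores (logits) of every token in their vocabulary; this illustration applies the same math to just the three options from the sky example, renormalized so they sum to 100% (the listed 80/10/5 add up to 95%). `apply_temperature` is an illustrative name, not a library function.

```python
import math


def apply_temperature(probs: dict, temperature: float) -> dict:
    """Reshape a probability distribution the way temperature reshapes logits."""
    if temperature == 0:
        # Greedy decoding: the single most likely option gets everything
        best = max(probs, key=probs.get)
        return {k: (1.0 if k == best else 0.0) for k in probs}
    # Divide log-probabilities by T, then apply softmax again
    scaled = {k: math.log(p) / temperature for k, p in probs.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in scaled.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


sky = {"Blue": 0.80, "Gray": 0.10, "Red": 0.05}  # the example above
for t in (0, 0.5, 1, 2):
    shares = apply_temperature(sky, t)
    print(f"T={t}:", {k: f"{v:.0%}" for k, v in shares.items()})
```

With these numbers, Blue’s share rises to about 98% at T = 0.5 and drops to about 62% at T = 2, which is why higher temperatures produce more varied answers.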

A Practical Example

Question: Name my coffee shop

Temperature = 0:

  • “Starbucks Coffee” (most common, safest answer)
  • Almost the same every time

Temperature = 1:

  • “Morning Light Café”
  • “Aroma Time”
  • “Bean & Cozy”
  • Varies, but all reasonable

Temperature = 2:

  • “Quantum Coffee Dimension”
  • “Space‑Time Foam Lab”
  • “Cosmic Latte Terminal”
  • Very creative, but possibly too weird

When to Adjust Temperature

Lower Temperature (0–0.5):

  • Writing code, debugging
  • Data analysis, math problems
  • Translation, summarization
  • Any task that needs accuracy

Higher Temperature (1.5–2):

  • Writing novels, poetry
  • Naming things, creating slogans
  • Brainstorming
  • Any task that needs creativity

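Some providers publish recommended temperatures for their own models. For example, DeepSeek’s website suggests the following (these values are tuned for DeepSeek’s models, so other platforms may recommend different ones):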

| Scenario | Temperature |
|---|---|
| Code generation / math problem solving | 0.0 |
| Data extraction / analysis | 1.0 |
| General conversation | 1.3 |
| Translation | 1.3 |
| Creative writing / poetry | 1.5 |

Can you adjust it in the web version?

  • Most web versions don’t allow direct adjustment
  • But the API gives you precise control

Context Length

What is Context Length?

Context Length = How much content AI can “remember” at once

Unlike humans, AI doesn’t have long‑term memory. In each conversation, the AI can only remember a limited amount of content. This limit is called the context length, measured in tokens.

Why Does AI “Forget”?

You may have experienced this:

  • You chat with AI for a long time
  • Suddenly the AI doesn’t remember what was said at the beginning
  • It seems to have amnesia

The most common reason: the conversation exceeded the context length limit.

Example:

  • GPT-5.2-chat-latest context length = 128,000 tokens
  • You and the AI have 50 rounds of conversation, using 130,000 tokens total
  • Beyond the limit, the AI “forgets” the earliest parts

Practical Impact of Context Length

1. Affects conversation length

  • Short context: only a few dozen rounds
  • Long context: hundreds of rounds

2. Affects document processing

  • Short context: only short documents
  • Long context: entire books

3. Affects cost

  • Longer context → slower processing
  • More tokens → higher cost

How to Deal with Context Limits

Method 1: Clear the conversation regularly

  • Save important information
  • Start a new conversation
  • Re‑tell the AI the background

Method 2: Summarize the conversation history

  • Ask the AI to summarize previous content
  • Use that summary as the start of a new conversation
  • Saves tokens
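Method 2 can be automated. Below is a minimal sketch with a hypothetical `compact_history` helper: everything except the last few messages is replaced by one summary message. The `summarize` argument stands in for a real model call (for example, asking the AI “Summarize this conversation”); a fake summarizer is used here so the example runs offline, and putting the summary in a system message is one choice among several.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compact_history(messages: List[Message], keep_last: int,
                    summarize: Callable[[str], str]) -> List[Message]:
    """Replace all but the last `keep_last` messages with a single summary."""
    if len(messages) <= keep_last:
        return list(messages)  # nothing old enough to summarize
    split = len(messages) - keep_last
    old, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarize(transcript)  # in practice: ask the model to summarize
    note = {"role": "system",
            "content": "Summary of the earlier conversation: " + summary}
    return [note] + recent


history = [
    {"role": "user", "content": "I'm planning a coffee shop."},
    {"role": "assistant", "content": "Great! What style are you going for?"},
    {"role": "user", "content": "Cozy, with lots of books."},
    {"role": "assistant", "content": "Then a reading-corner theme could work."},
    {"role": "user", "content": "Can you suggest names?"},
]
fake_summarize = lambda text: f"{len(text.splitlines())} earlier messages about a cozy coffee shop."
for m in compact_history(history, keep_last=2, summarize=fake_summarize):
    print(m["role"], "->", m["content"])
```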

Method 3: Choose a model with a large context

  • For long documents: use Gemini 3 Pro
  • For long conversations: use Claude Sonnet 4.5

Other Important Concepts

Max Tokens

Max Tokens = Limits the maximum length of a single AI response

  • Set Max Tokens = 100: AI replies with at most 100 tokens
  • Set Max Tokens = 2000: AI replies with at most 2000 tokens

Why limit it?

  • Control cost (output tokens are more expensive)
  • Avoid overly verbose answers
  • Some scenarios only need short replies
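Because output is billed per token, Max Tokens also puts a ceiling on what a single reply can cost. A minimal sketch using the GPT-5.2 output price listed earlier (`max_output_cost` is an illustrative name). Keep in mind that a reply which hits the limit is simply cut off, possibly mid-sentence; OpenAI’s API reports this with `finish_reason` set to `"length"`.

```python
def max_output_cost(max_tokens: int, output_price_per_million: float) -> float:
    """Worst-case output cost of one reply in USD, given a max_tokens cap."""
    return max_tokens * output_price_per_million / 1_000_000


# Output price for GPT-5.2 from the pricing list above: $14 per 1M tokens
for limit in (100, 2000):
    print(f"max_tokens={limit}: at most ${max_output_cost(limit, 14):.4f}")
# max_tokens=100: at most $0.0014
# max_tokens=2000: at most $0.0280
```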

Top P (Nucleus Sampling)

Top P = Another way to control randomness

Similar to Temperature, but works differently:

  • Top P = 0.1: only considers the most probable options that together add up to 10% of the probability
  • Top P = 0.9: considers the most probable options that together add up to 90% of the probability

Usually:

  • Adjust either Temperature or Top P – one is enough
  • In most cases, Temperature is more intuitive
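Top P (nucleus sampling) keeps the smallest group of most-likely options whose probabilities add up to at least P, then samples only from that group. A minimal sketch, reusing the sky example with an assumed extra 5% “Purple” option so the probabilities sum to 100% (`top_p_filter` is an illustrative name):

```python
def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of most-likely options whose total probability >= top_p."""
    kept, cumulative = {}, 0.0
    for option, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[option] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}  # renormalize before sampling


# Sky example from the Temperature section, plus an assumed 5% "Purple" option
sky = {"Blue": 0.80, "Gray": 0.10, "Red": 0.05, "Purple": 0.05}
print(top_p_filter(sky, 0.1))   # {'Blue': 1.0}
print(top_p_filter(sky, 0.85))  # keeps only Blue and Gray
```

With Top P = 0.1 only “Blue” survives, because it alone already covers 80% of the probability.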

Frequency Penalty and Presence Penalty

Used to reduce repetition

  • Frequency Penalty: penalizes a word more each time it has already appeared, reducing repetition of the same word
  • Presence Penalty: applies a one‑time penalty to any word that has already appeared, encouraging the AI to introduce new topics

Range: -2.0 to 2.0

  • Positive values: reduce repetition
  • Negative values: allow more repetition
  • 0: no intervention
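OpenAI’s API documentation has described both penalties as a simple adjustment to each token’s score (logit) before sampling: score − count × frequency_penalty − (1 if count > 0 else 0) × presence_penalty. A minimal sketch of that formula, with made-up scores for three candidate words (`apply_penalties` is an illustrative name, not a library function):

```python
from collections import Counter


def apply_penalties(logits: dict, generated: list,
                    frequency_penalty: float, presence_penalty: float) -> dict:
    """Lower the scores of words that already appeared in the output."""
    counts = Counter(generated)
    adjusted = {}
    for word, score in logits.items():
        c = counts[word]  # 0 if the word hasn't appeared yet
        adjusted[word] = (score
                          - c * frequency_penalty                        # grows with each repeat
                          - (1.0 if c > 0 else 0.0) * presence_penalty)  # one-time penalty
    return adjusted


logits = {"great": 2.0, "good": 1.5, "excellent": 1.0}
history = ["great", "great", "great", "good"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.5))
# {'great': 0.0, 'good': 0.5, 'excellent': 1.0}
```

After three uses, “great” drops from the top choice to the bottom, so the AI is nudged toward “excellent.”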

Summary: How to Use These Concepts?

Daily Use (Web Version)

If you only use the web version, you don’t need to worry about these parameters – the defaults work fine.

But understanding these concepts helps you:

  • Understand why AI sometimes “forgets” earlier parts of the conversation (context limit)
  • Understand why API users can do things you can’t (parameter control)
  • Prepare for using the API in the future

When Using the API

If you decide to use the API, these parameters become very important:

Basic settings (every time):

  • model: choose the model (e.g., gpt-5.2, claude-sonnet-4-5)
  • max_tokens: limit the response length (newer OpenAI models use the name max_completion_tokens)

Adjust based on your needs:

  • temperature: 0–0.5 for factual tasks, 1–2 for creative tasks
  • top_p: usually fine at default
  • frequency_penalty: if the AI repeats too much, set it to 0.5–1
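Putting the checklist together, here is a minimal sketch of a hypothetical `build_request` helper that picks a temperature by task type and assembles the request arguments. Parameter names follow this chapter; check your provider’s API reference before sending, since some newer models rename or restrict them (newer OpenAI models expect `max_completion_tokens`, and some reasoning models accept only the default temperature). The specific temperature values are assumptions chosen within the ranges above.

```python
def build_request(prompt: str, task: str, model: str = "gpt-5.2",
                  max_tokens: int = 500) -> dict:
    """Hypothetical helper: choose sampling settings by task type."""
    temperature_by_task = {
        "factual": 0.2,   # code, data analysis, translation (0-0.5 range)
        "general": 1.0,   # everyday conversation
        "creative": 1.2,  # writing, naming, brainstorming (1-2 range)
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature_by_task[task],
        "max_tokens": max_tokens,  # caps the length (and cost) of the reply
        # top_p is left at its default: adjust temperature or top_p, not both
    }


request = build_request("Suggest five names for a coffee shop.", task="creative")
print(request["temperature"], request["max_tokens"])  # 1.2 500
# With the OpenAI SDK this would be sent as: client.chat.completions.create(**request)
```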

Cost optimization:

  • Use caching to save money
  • Control max_tokens to avoid waste
  • Choose the right model (you don’t always need the most expensive one)
  • Remember that different models define tokens differently