API and Token Basics Explained - RanceLee Tutorials

You may have noticed that many experienced users talk about API, Token, Temperature, and other terms that sound technical and confusing. This chapter explains these core concepts in plain language. Understanding them will help you truly grasp how AI works and use it more effectively.

What Is an API?

API in Plain English

API = Application Programming Interface

That definition sounds technical, so let’s put it differently.

Think of AI as a restaurant:

Web version = You dine in at the restaurant
- Nice decor (web interface)
- Waitstaff (buttons, input fields)
- You order, the chef cooks, the waiter serves

API = You order takeout
- No decor: you deal directly with the kitchen
- No waiter: you speak directly to the chef
- You say what you want, and the chef prepares it and hands it to you

Key difference:

Web version: has an interface, convenient for humans
API: no interface, convenient for programs

Why Use an API?

If the web version is so convenient, why bother with an API?

Reason 1: Automation

Suppose you need AI to process 1,000 documents and write 1,000 summaries:

Web version: You copy-paste 1,000 times and click send 1,000 times
API: Write a script that processes everything automatically while you grab coffee
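The automation idea above can be sketched as a simple loop. This is only a minimal sketch: `summarize_all` and `fake_summarize` are hypothetical names, and the stub stands in for a real model call so that the example runs without a network connection or API key.

```python
from typing import Callable, Iterable


def summarize_all(documents: Iterable[str], summarize: Callable[[str], str]) -> list[str]:
    """Run `summarize` on every document and collect the results in order."""
    return [summarize(doc) for doc in documents]


def fake_summarize(text: str) -> str:
    """Stand-in for a real model call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


docs = ["First report. More detail here.", "Second report. Even more detail."]
for summary in summarize_all(docs, fake_summarize):
    print(summary)
```

In a real script, `fake_summarize` would be replaced by a function that sends the document to the API and returns the reply text.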

Reason 2: Integration into your own apps

You want to build an auto-reply bot, a content generator, or a smart customer service agent:

Web version: Not possible
API: You can embed AI directly into your own programs

Reason 3: Lower cost for light usage

Web subscription: ChatGPT Plus $20/month, Claude Pro $20/month
API pay-as-you-go: Pay only for what you use; light usage might cost just a few dollars per month

Reason 4: More flexibility

Fine‑tune AI parameters (Temperature, max length, etc.)
Batch processing
Custom input/output formats

What Does an API Call Look Like?

Here’s a simple example (don’t worry if it looks unfamiliar – we’ll cover it in detail later):

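# Call the GPT-5.2 API with Python (requires: pip install openai)
from openai import OpenAI

# The client reads your API key from the OPENAI_API_KEY environment variable
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": "Hello, introduce yourself"}
    ]
)
# Print the text of the first reply
print(response.choices[0].message.content)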

Just a few lines of code, and the AI answers your question – no browser needed.

Official API model identifiers as of 2026-01-30:

OpenAI: gpt-5.2, gpt-5.2-chat-latest, gpt-5.2-pro
Anthropic Claude: claude-opus-4-5, claude-sonnet-4-5
Google Gemini: gemini-3-pro-preview, gemini-3-flash-preview

Web Version vs API Comparison

Aspect	Web Version	API
How to use	Click around in a browser	Write code to call it
Learning curve	Low, anyone can use it	High, requires some programming
Best for	Daily chat, writing articles	Automation, batch processing, app integration
Cost	Monthly subscription ($20/month)	Pay-as-you-go (pay for what you use)
Flexibility	Limited by web features	Highly customizable
Speed	Average	Usually faster (no UI rendering)

What Is a Token?

The Concept of a Token

Token = The smallest unit of text that AI understands

Unlike humans, who read words and sentences directly, AI needs to break text into small pieces. Each piece is called a token.

Examples:

Chinese:

“你好” ≈ 1–2 tokens
“今天天气不错” ≈ 4–8 tokens, depending on the model

English:

“Hello” = 1 token
“How are you today?” ≈ 5 tokens

Simple rules of thumb:

English: 1 token ≈ 4 characters ≈ 0.75 words (so 1 word ≈ 1.3 tokens)
Chinese: 1 character ≈ 0.5–2 tokens (depends on the AI model)
Punctuation: usually 1 symbol = 1 token; long numbers are often split into several tokens

Important Discovery: Different AI Models Define Tokens Differently!

Here’s a little‑known secret: The same text can have a completely different token count in different AI models!

Why? Because each AI company has its own tokenizer, and they split text differently.

Real example:

The same sentence: “AI is revolutionizing market research.”

GPT-3: 11 tokens
GPT-3.5 and GPT-4: 9 tokens
GPT-4o and GPT-5.2: 8 tokens

See? The same sentence differs by 3 tokens across models!

Another Chinese example:

The sentence “人工智能正在改变世界” (“Artificial intelligence is changing the world”):

GPT-4o: maybe 10 tokens
Claude Sonnet 4.5: maybe 12 tokens
Gemini 3: maybe 8 tokens

Why the difference?

Each company uses a different tokenization method when training its models:

OpenAI (GPT series): uses BPE (Byte-Pair Encoding)
Anthropic (Claude): uses its own optimized tokenizer
Google (Gemini): Gemini’s documentation says “1 token ≈ 4 characters”
DeepSeek: a tokenizer optimized for Chinese
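To see how two tokenizers can split the same text differently, here is a toy sketch. It uses a greedy longest-match split with two made-up vocabularies. Real tokenizers such as BPE learn their vocabularies from training data and work differently in detail, so the function name and both vocabularies below are purely illustrative.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    i = 0
    max_len = max((len(v) for v in vocab), default=1)
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Two hypothetical vocabularies standing in for two vendors' tokenizers.
VOCAB_A = {"token", "izer", "s"}
VOCAB_B = {"tok", "en", "iz", "er", "s"}

for name, vocab in (("A", VOCAB_A), ("B", VOCAB_B)):
    pieces = tokenize("tokenizers", vocab)
    print(name, len(pieces), pieces)
```

The same word comes out as 3 tokens with vocabulary A and 5 tokens with vocabulary B, which is the effect described above on a small scale.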

How does this affect you?

1. Cost comparisons aren’t direct

Suppose you have 1,000 Chinese characters:

With GPT-5.2 it might be 1,500 tokens
With Claude Sonnet 4.5 it might be 1,600 tokens
With Gemini 3 it might be 1,400 tokens

Even though each says “input $X/1M tokens,” the actual cost can differ by 10–20%!

2. You can’t use the same token calculator for all models

OpenAI’s official tokenizer (https://platform.openai.com/tokenizer) only works for GPT series
Claude tokens need Anthropic’s calculation method
Gemini tokens need Google’s calculation method

3. Non‑English languages show even bigger differences

For Chinese, Japanese, Arabic, and other non‑English languages, token efficiency can vary by 30–40%. Most AI models are trained primarily on English, so their tokenizers are better optimized for English.

Why Tokens Matter

1. Tokens determine cost

API pricing is based on tokens, not character count.

Example (official prices as of 2026-01-30):

GPT-5.2: input $1.75/1M tokens, output $14/1M tokens
Claude Opus 4.5: input $5/1M tokens, output $25/1M tokens
Gemini 3 Flash: input $0.50/1M tokens, output $3/1M tokens (standard tier)

You send 500 tokens and the AI replies with 1,000 tokens:

With GPT-5.2: (500 × 1.75 + 1000 × 14) / 1,000,000 = $0.014875 (about 1.5 cents USD)
With Gemini 3 Flash: (500 × 0.50 + 1000 × 3) / 1,000,000 = $0.00325 (about 0.3 cents USD)
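The arithmetic above can be wrapped in a small helper. This is a sketch only: `api_cost_usd` is a hypothetical name, and the prices are the ones quoted in this section, which may change.

```python
def api_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD, given token counts and prices per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# (input price, output price) per 1M tokens, as quoted in this section.
PRICES = {"GPT-5.2": (1.75, 14.0), "Gemini 3 Flash": (0.50, 3.0)}

for name, (in_price, out_price) in PRICES.items():
    cost = api_cost_usd(500, 1000, in_price, out_price)
    print(f"{name}: ${cost:.6f}")
```

Swapping in your own token counts gives a quick estimate before you run a large batch job.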

2. Tokens determine context length

Every AI model has a token limit:

GPT-5.2 (API): up to 400,000 tokens
GPT-5.2-chat-latest: up to 128,000 tokens
Claude Sonnet 4.5: up to 200,000 tokens
Gemini 3 Pro Preview: up to 1,048,576 tokens (about 1M)

This limit includes: your prompt + AI’s response + conversation history.

What happens if you exceed the limit?

The AI “forgets” the earliest parts of the conversation
Or it throws an error and won’t continue

How to Count Tokens

Method 1: Estimate (quick but not precise)

Chinese: number of characters × 1.5
English: number of words × 1.3
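The two multipliers above can be combined into a quick estimator. This is a rough sketch, not a real tokenizer: `estimate_tokens` is a hypothetical helper, it only recognizes common Chinese characters and Latin-alphabet words, and it ignores punctuation and digits.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough estimate: Chinese characters x 1.5 plus English words x 1.3."""
    chinese_chars = len(re.findall(r"[\u4e00-\u9fff]", text))
    english_words = len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))
    return round(chinese_chars * 1.5 + english_words * 1.3)


print(estimate_tokens("How are you today?"))
print(estimate_tokens("你好"))
```

Treat the result as a ballpark figure for budgeting; for billing-level accuracy, use the tokenizer tool that belongs to the model you are calling.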

Method 2: Use the corresponding online tool

OpenAI (GPT series): https://platform.openai.com/tokenizer
General token counter: https://token-counter.app (supports multiple models for comparison)
Gemini: use the count_tokens method of the Gemini API (Google AI Studio also displays a token count)

Important reminder: When estimating across models, always use the tool specific to that model. Don’t use GPT’s token count to estimate Claude’s cost!

Input Tokens, Output Tokens, Cached Tokens

API billing divides tokens into three types:

1. Input Tokens

The content you send to the AI
Includes your prompt, uploaded documents
Relatively cheap

2. Output Tokens

The content the AI returns to you
Includes the AI’s response
Usually 2–10 times more expensive than input tokens

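Why is output more expensive? The AI generates a response one token at a time, and each new token requires another pass through the model, whereas the input can be “read” in a single parallel pass. Generating text therefore uses more computing resources than reading it.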

Example (GPT-5.2):

Input: $1.75/1M tokens
Output: $14/1M tokens (8× the input price!)

3. Cached Tokens

This is a cost‑saving trick!

If you repeatedly send the same opening content (for example, a long system prompt or reference document), the provider can cache it and avoid reprocessing it next time.

Example: You have a 1,000‑token prompt and ask 10 questions:

Without caching: each time processes 1,000 tokens → total 10,000 tokens
With caching: first time 1,000 tokens (normal price), next 9 times 1,000 tokens (cache price, 90% cheaper)
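The savings in this example can be worked out with a short calculation. This is a simplified sketch: it assumes the first request is billed at the normal price, every later request hits the cache, and only the repeated prompt is counted (not the questions themselves or any cache-write surcharge). The function name is hypothetical.

```python
def billed_input_equivalent(prompt_tokens: int, requests: int, cache_discount: float) -> float:
    """Full-price-equivalent input tokens when the prompt is cached after the first request."""
    if requests <= 0:
        return 0.0
    cached_requests = requests - 1
    return prompt_tokens + cached_requests * prompt_tokens * (1.0 - cache_discount)


without_cache = billed_input_equivalent(1000, 10, cache_discount=0.0)
with_cache = billed_input_equivalent(1000, 10, cache_discount=0.9)
print(f"without caching: {without_cache:.0f} token-equivalents")
print(f"with caching:    {with_cache:.0f} token-equivalents")
```

Under these assumptions, the 10 requests are billed like roughly 1,900 full-price tokens instead of 10,000.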

Models that support caching:

Anthropic Claude (Prompt Caching)
OpenAI GPT-5.2 (supports caching, 90% discount)

Caching billing rules:

First request (cache write): normal price, or slightly higher on platforms that charge a premium for writing to the cache
Cache hit: price reduced by 50–90%
Cache validity: usually 5–10 minutes

What is Temperature?

The Concept of Temperature

Temperature = Controls the “randomness” or “creativity” of AI responses

Recall that AI essentially “calculates probabilities.” When you ask “What color is the sky?”, the AI sees:

“Blue” probability 80%
“Gray” probability 10%
“Red” probability 5%

Temperature adjusts how the AI chooses among these options.
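One common way to describe this adjustment is that each probability is raised to the power 1/T and the results are rescaled to sum to 1, which is equivalent to dividing the model's raw scores by T before the softmax. The sketch below applies that to the sky-color example. The "Other" entry is an assumption added to account for the remaining 5%, and `apply_temperature` is a hypothetical helper, not a library function.

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a probability distribution; equivalent to softmax(log(p) / T)."""
    if temperature <= 0:
        # Treat T = 0 as greedy decoding: all probability goes to the top option.
        best = max(probs, key=probs.get)
        return {token: (1.0 if token == best else 0.0) for token in probs}
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# "Other" collects the remaining 5% that the example does not list.
sky = {"Blue": 0.80, "Gray": 0.10, "Red": 0.05, "Other": 0.05}
for t in (0, 0.5, 1, 2):
    adjusted = apply_temperature(sky, t)
    print(t, {token: round(p, 3) for token, p in adjusted.items()})
```

Low temperatures push almost all the probability onto "Blue", while high temperatures flatten the distribution so that unlikely answers are picked more often.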

Temperature Values

Temperature typically ranges from 0 to 2 (or 0 to 1, depending on the platform):

Temperature = 0 (most conservative)

The AI always picks the highest‑probability answer
Very stable, predictable responses
Same question → almost identical answer every time
Best for: factual questions, code generation, data analysis

Temperature = 1 (balanced)

The AI chooses randomly according to probabilities
Responses vary a bit but stay reasonable
Default for most platforms
Best for: everyday conversation, general use

Temperature = 2 (most aggressive)

The AI tries many possibilities
Very diverse, creative responses
May be inaccurate or even nonsensical
Best for: creative writing, brainstorming, artistic work

A Practical Example

Question: Name my coffee shop

Temperature = 0:

“Starbucks Coffee” (most common, safest answer)
Almost the same every time

Temperature = 1:

“Morning Light Café”
“Aroma Time”
“Bean & Cozy”
Varies, but all reasonable

Temperature = 2:

“Quantum Coffee Dimension”
“Space‑Time Foam Lab”
“Cosmic Latte Terminal”
Very creative, but possibly too weird

When to Adjust Temperature

Lower Temperature (0–0.5):

Writing code, debugging
Data analysis, math problems
Translation, summarization
Any task that needs accuracy

Higher Temperature (1.5–2):

Writing novels, poetry
Naming things, creating slogans
Brainstorming
Any task that needs creativity

Recommended temperatures vary by model, and providers often publish their own guidance, so check the documentation for the model you use. For example, DeepSeek’s API documentation recommends:

Scenario	Temperature
Code generation / math problem solving	0.0
Data extraction / analysis	1.0
General conversation	1.3
Translation	1.3
Creative writing / poetry	1.5

Can you adjust it in the web version?

Most web versions don’t allow direct adjustment
But the API gives you precise control

Context Length

What is Context Length?

Context Length = How much content AI can “remember” at once

Unlike humans, AI doesn’t have long‑term memory. In each conversation, the AI can only remember a limited amount of content. This limit is called the context length, measured in tokens.

Why Does AI “Forget”?

You may have experienced this:

You chat with AI for a long time
Suddenly the AI doesn’t remember what was said at the beginning
It seems to have amnesia

Reason: You exceeded the context length limit.

Example:

Suppose the model’s context length is 128,000 tokens (for example, gpt-5.2-chat-latest)
You and the AI have 50 rounds of conversation, using 130,000 tokens total
Beyond the limit, the AI “forgets” the earliest parts
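The “forgetting” can be pictured as a sliding window over the conversation. The sketch below drops the oldest messages until the rest fits the limit. It is only an illustration: `trim_history` and `word_count` are hypothetical helpers, the word counter is a crude stand-in for a real tokenizer, and actual services may truncate differently or return an error instead.

```python
def trim_history(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    """Drop the oldest messages until the total fits within max_tokens."""
    kept = list(messages)
    total = sum(count_tokens(m["content"]) for m in kept)
    while kept and total > max_tokens:
        oldest = kept.pop(0)
        total -= count_tokens(oldest["content"])
    return kept


def word_count(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "user", "content": "My name is Ada"},
    {"role": "assistant", "content": "Nice to meet you Ada"},
    {"role": "user", "content": "What is my name"},
]
for message in trim_history(history, max_tokens=9, count_tokens=word_count):
    print(message["role"], ":", message["content"])
```

With a limit of 9 "tokens", the first message no longer fits, so the model would answer the final question without ever seeing the line where the user gave their name.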

Practical Impact of Context Length

1. Affects conversation length

Short context: only a few dozen rounds
Long context: hundreds of rounds

2. Affects document processing

Short context: only short documents
Long context: entire books

3. Affects cost

Longer context → slower processing
More tokens → higher cost

How to Deal with Context Limits

Method 1: Clear the conversation regularly

Save important information
Start a new conversation
Re‑tell the AI the background

Method 2: Summarize the conversation history

Ask the AI to summarize previous content
Use that summary as the start of a new conversation
Saves tokens
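The summarizing approach can be sketched as follows. This is a minimal illustration under stated assumptions: `compact_history` and `fake_summarize` are hypothetical names, and the stub merely truncates text where a real implementation would ask the model for a summary.

```python
def compact_history(messages: list[dict], summarize, keep_recent: int = 2) -> list[dict]:
    """Replace all but the last `keep_recent` messages with a single summary message."""
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = summarize("\n".join(m["content"] for m in older))
    summary_message = {"role": "user", "content": f"Summary of earlier conversation: {summary}"}
    return [summary_message] + recent


def fake_summarize(text: str) -> str:
    """Stand-in for a real model call: keep only the first 30 characters."""
    return text[:30] + "..."


chat = [
    {"role": "user", "content": "I am planning a trip to Kyoto in April."},
    {"role": "assistant", "content": "April is cherry blossom season, a great choice."},
    {"role": "user", "content": "Which temples should I visit?"},
    {"role": "assistant", "content": "Kinkaku-ji and Kiyomizu-dera are popular."},
]
for message in compact_history(chat, fake_summarize):
    print(message["role"], ":", message["content"])
```

The new conversation starts with one short summary plus the most recent exchange, so far fewer tokens are sent on each following turn.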

Method 3: Choose a model with a large context

For long documents: use Gemini 3 Pro
For long conversations: use Claude Sonnet 4.5

Other Important Concepts

Max Tokens

Max Tokens = Limits the maximum length of a single AI response

Set Max Tokens = 100: AI replies with at most 100 tokens
Set Max Tokens = 2000: AI replies with at most 2000 tokens

Why limit it?

Control cost (output tokens are more expensive)
Avoid overly verbose answers
Some scenarios only need short replies

Top P (Nucleus Sampling)

Top P = Another way to control randomness

Similar to Temperature, but works differently:

Top P = 0.1: only considers the most probable options whose combined probability adds up to 10%
Top P = 0.9: considers the most probable options whose combined probability adds up to 90%
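Top P works on cumulative probability: options are taken from most to least probable until their combined probability reaches the Top P value, and the rest are discarded. The sketch below shows this with the sky-color probabilities from the Temperature section. The "Other" entry is an assumption covering the unlisted 5%, and `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most probable options until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving options sum to 1 before sampling.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


sky = {"Blue": 0.80, "Gray": 0.10, "Red": 0.05, "Other": 0.05}
print(top_p_filter(sky, 0.1))
print(top_p_filter(sky, 0.85))
```

With Top P = 0.1 only "Blue" survives, because it alone already covers more than 10% of the probability. With 0.85, "Gray" is added as well.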

Usually:

Adjust either Temperature or Top P – one is enough
In most cases, Temperature is more intuitive

Frequency Penalty and Presence Penalty

Used to reduce repetition

Frequency Penalty: penalizes a word more each time it has already been used, reducing repetition of the same word
Presence Penalty: applies a one-time penalty to any word that has already appeared at least once, encouraging the AI to introduce new topics

Range: -2.0 to 2.0

Positive values: reduce repetition
Negative values: allow more repetition
0: no intervention
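The two penalties can be sketched as adjustments to each token's raw score before sampling. The form below follows the formula described in OpenAI's API documentation (score minus count × frequency penalty, minus a flat presence penalty if the token has appeared). Other providers may implement this differently, and the function name and sample scores are illustrative.

```python
def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower each token's score based on how often it has already been generated."""
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        frequency_term = count * frequency_penalty                # grows with every repeat
        presence_term = presence_penalty if count > 0 else 0.0    # flat, once the token has appeared
        adjusted[token] = logit - frequency_term - presence_term
    return adjusted


logits = {"coffee": 2.0, "tea": 1.5, "juice": 1.0}
counts = {"coffee": 3, "tea": 1}  # how many times each token has already appeared
print(apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=1.0))
```

Here "coffee", already used three times, drops the most, while "juice", not yet used, keeps its original score and becomes the most likely next choice.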

Summary: How to Use These Concepts?

Daily Use (Web Version)

If you only use the web version, you don’t need to worry about these parameters – the defaults work fine.

But understanding these concepts helps you:

Understand why AI sometimes “forgets” earlier parts of the conversation (context limit)
Understand why API users can do things you can’t (parameter control)
Prepare for using the API in the future

When Using the API

If you decide to use the API, these parameters become very important:

Basic settings (every time):

model: choose the model (e.g., gpt-5.2, claude-sonnet-4-5)
max_tokens: limit the response length

Adjust based on your needs:

temperature: 0–0.5 for factual tasks, 1–2 for creative tasks
top_p: usually fine at default
frequency_penalty: if the AI repeats too much, set it to 0.5–1
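As a rough illustration, the settings above can be collected into one set of request parameters. This is a sketch under assumptions: `build_request_params` is a hypothetical helper, the parameter names and ranges are the ones used in this chapter, and the exact names and limits accepted by a given provider or model may differ, so check its API reference before sending them.

```python
def build_request_params(model, max_tokens, temperature=1.0, frequency_penalty=0.0):
    """Collect the settings into one dict, checking the ranges given in this chapter."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not -2.0 <= frequency_penalty <= 2.0:
        raise ValueError("frequency_penalty must be between -2.0 and 2.0")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "frequency_penalty": frequency_penalty,
    }


# A factual task: low temperature, short answer.
factual = build_request_params("gpt-5.2", max_tokens=500, temperature=0.2)
# A creative task: higher temperature, longer answer, less repetition.
creative = build_request_params("gpt-5.2", max_tokens=2000, temperature=1.5, frequency_penalty=0.5)
print(factual)
print(creative)
```

Keeping the settings in one place like this makes it easy to switch between a "factual" and a "creative" profile without editing every call.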

Cost optimization:

Use caching to save money
Control max_tokens to avoid waste
Choose the right model (you don’t always need the most expensive one)
Remember that different models define tokens differently