================================================================================
                     cortexai.io API GATEWAY — DOCS
                     OpenAI / Anthropic compatible
================================================================================

Base URL
--------
https://vertex.claude.gg

Authentication
--------------
Provide your API key using ANY of the following (in order of preference):

    Authorization: Bearer <API_KEY>    (OpenAI / SDK style)
    x-api-key: <API_KEY>               (Anthropic style)
    x-goog-api-key: <API_KEY>          (Google native style)
    ?api_key=<API_KEY>                 (query string fallback)

The first matching credential is used. CORS pre-flight responses already
include `authorization, x-api-key, x-goog-api-key, content-type,
anthropic-version` in Access-Control-Allow-Headers, so browser clients can
use any of these.

Public endpoints (no auth required):

    GET /v1/models
    GET /v1beta/models
    GET /health
    GET /docs.txt

================================================================================
ENDPOINTS
================================================================================

OpenAI-compatible
-----------------
POST /v1/chat/completions
POST /v1/responses
POST /v1/images/generations
POST /v1/embeddings
GET  /v1/models
GET  /v1/me     (rate-limit / quota status for current API key)
GET  /api/me    (alias of /v1/me, supports ?key= query param)

Anthropic-compatible
--------------------
POST /v1/messages
POST /v1/messages/count_tokens

Gemini-native compatible
------------------------
POST /v1beta/models/{model}:{action}
GET  /v1beta/models

Vertex native passthrough
-------------------------
POST /v1/projects/{project}/locations/{loc}/publishers/google/models/{model}:{action}

Used for Virtual Try-On, which needs a structured personImage + productImages
base64 input shape that the OpenAI Images API can't represent. {project} and
{loc} can be any placeholder; the gateway substitutes its own internal
node + region.
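The four credential schemes above can be sketched as a small helper. This is
an illustrative client-side snippet, not part of any SDK; the helper name
`auth_credentials` is an assumption, and the `requests` usage at the end is
shown commented out.

```python
# Sketch: building credentials for each auth scheme the gateway accepts.
# The helper name (auth_credentials) is illustrative, not part of the API.

def auth_credentials(api_key: str, scheme: str = "bearer"):
    """Return (headers, query_params) for one of the gateway's auth schemes."""
    if scheme == "bearer":          # OpenAI / SDK style (preferred)
        return {"Authorization": f"Bearer {api_key}"}, {}
    if scheme == "x-api-key":       # Anthropic style
        return {"x-api-key": api_key}, {}
    if scheme == "x-goog-api-key":  # Google native style
        return {"x-goog-api-key": api_key}, {}
    if scheme == "query":           # query-string fallback
        return {}, {"api_key": api_key}
    raise ValueError(f"unknown auth scheme: {scheme}")

# Example: the same GET /v1/me call under any scheme (requests library):
# import requests
# headers, params = auth_credentials("sk-...", "bearer")
# r = requests.get("https://vertex.claude.gg/v1/me", headers=headers, params=params)
```

Since the gateway checks credentials in order of preference, sending more than
one scheme at once is harmless: the first match wins.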
================================================================================
MODEL CATALOG (28 models)
================================================================================

For an always up-to-date machine-readable list call:

    GET /v1/models      (or GET /v1/models/{id} for a single model)
    GET /v1beta/models  (or GET /v1beta/models/{id} for a single model)

Both endpoints return ONE union response that simultaneously satisfies the
OpenAI, Anthropic and Gemini schemas:

    {
      "object": "list",                     # OpenAI envelope
      "data":   [ ...models... ],           # OpenAI / Anthropic envelope
      "models": [ ...models... ],           # Gemini envelope
      "first_id": "...", "last_id": "...",  # Anthropic pagination
      "has_more": false
    }

Each model entry exposes the fields read by every popular client (OpenAI SDK,
Anthropic SDK, Gemini SDK, Roo Code, OpenRouter SDK, LiteLLM proxy, OpenWebUI,
LibreChat, Continue.dev, Cherry Studio). The context window is exposed under
FOUR different field names so any client picks it up correctly:

    context_window    -> Roo Code, Cherry Studio
    context_length    -> OpenRouter SDK, OpenWebUI
    max_input_tokens  -> Anthropic SDK, LiteLLM
    inputTokenLimit   -> Gemini SDK
    top_provider.context_length         (OpenRouter strict)
    top_provider.max_completion_tokens

The maximum-output limit is also mirrored under multiple names:
`max_tokens`, `max_output_tokens`, `max_output_length`, `outputTokenLimit`.

Sample (gemini-2.5-flash):

    {
      "id": "gemini-2.5-flash",
      "object": "model",
      "type": "language",                    # "language"|"embedding"|"image"|"audio"
      "mode": "chat",
      "created": 1748390400,                 # unix seconds
      "created_at": "2025-06-01T00:00:00Z",  # RFC 3339
      "owned_by": "google",                  # real vendor: google, openai,
                                             # anthropic, xai, alibaba,
                                             # deepseek, moonshot, minimax,
                                             # mistralai, meta, ...
      "name": "Gemini 2.5 Flash",            # human-readable
      "display_name": "Gemini 2.5 Flash",
      "displayName": "Gemini 2.5 Flash",
      "description": "Gemini 2.5 Flash by google",

      "context_window": 1048576,
      "context_length": 1048576,
      "max_input_tokens": 1048576,
      "inputTokenLimit": 1048576,

      "max_tokens": 65535,
      "max_output_tokens": 65535,
      "max_output_length": 65535,
      "outputTokenLimit": 65535,

      "input_modalities": ["text", "image", "file"],
      "output_modalities": ["text"],
      "architecture": { ... OpenRouter-shaped ... },

      "top_provider": {
        "context_length": 1048576,
        "max_completion_tokens": 65535,
        "is_moderated": false
      },

      "pricing": { "prompt": "0", "completion": "0", ... },
      "input_cost_per_token": 0,
      "output_cost_per_token": 0,

      "supported_parameters": ["temperature","top_p","max_tokens","stream","stop","tools","tool_choice"],
      "supported_features": ["tools","function_calling","vision","streaming"],
      "supportsImages": true,
      "supportsTools": true,
      "supportsStreaming": true,
      "supportsReasoning": false,

      // Gemini-native compat
      "baseModelId": "gemini-2.5-flash",
      "version": "001",
      "supportedGenerationMethods": ["generateContent","streamGenerateContent","countTokens"],

      // cortexai.io extension
      "capabilities": { "chat": true, "vision": true, "tool_use": true, ... },
      "canonical_slug": "google/gemini-2.5-flash",
      "tags": ["chat","vision","tools"]
    }

The Gemini-native endpoint `/v1beta/models` returns the same payload but with
`name = "models/{id}"` (Google resource-path format). Each SDK reads only the
fields it knows and ignores the rest, so the same unchanged call works for
OpenAI, Anthropic, Gemini, Roo Code, OpenRouter, LiteLLM, OpenWebUI,
LibreChat, Continue.dev and Cherry Studio.
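If you parse the catalog JSON by hand rather than through an SDK, the
mirrored field names are easiest to consume with a fallback chain. A minimal
sketch (the helper name `context_window` is illustrative):

```python
# Sketch: reading the input-token limit from a union-catalog model entry,
# trying each mirrored field name in turn.

def context_window(model: dict):
    """Return the context window from whichever field alias is present."""
    for field in ("context_window", "context_length",
                  "max_input_tokens", "inputTokenLimit"):
        if field in model:
            return model[field]
    # OpenRouter-strict clients read the nested copy instead
    return model.get("top_provider", {}).get("context_length")

entry = {"id": "gemini-2.5-flash", "inputTokenLimit": 1048576}
print(context_window(entry))  # 1048576
```

Because every catalog entry carries all four aliases, any single lookup also
works; the chain only matters if you cache or transform the payload.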
Gemini family
-------------
gemini-2.5-pro
gemini-2.5-flash
gemini-2.5-flash-lite
gemini-3-flash-preview
gemini-3.1-flash-lite-preview
gemini-3.1-pro-preview

Embeddings
----------
gemini-embedding-001                 (3072 dim, native :predict)
multilingual-e5-large-instruct-maas  (1024 dim, OpenAI-compat)
multilingual-e5-small-maas           ( 384 dim, OpenAI-compat)

Imagen — text to image (OpenAI Images API: POST /v1/images/generations)
-----------------------------------------------------------------------
imagen-4.0-fast-generate-001   ( ~6 s,  5 regions, GA )
imagen-4.0-generate-001        ( ~10 s, 5 regions, GA )
imagen-4.0-ultra-generate-001  ( ~13 s, 5 regions, GA, best quality )
imagen-3.0-generate-002        ( ~11 s, 4 regions, GA )
imagen-3.0-fast-generate-001   ( ~7 s,  5 regions, GA )

Nano Banana — Gemini image generation (generateContent + responseModalities)
----------------------------------------------------------------------------
gemini-2.5-flash-image          ( ~6 s, 4 regions, GA, supports multi-image
                                  fusion + edit + up to 10 output images )
gemini-3.1-flash-image-preview  ( ~45 s, global, Preview, "Nano Banana Pro
                                  flash" - up to 3 images per call with full
                                  thinking trace )
gemini-3-pro-image-preview      ( ~42 s, global, Preview, "Nano Banana Pro" -
                                  highest quality, single image with thinking
                                  trace )

Virtual Try-On — image fusion (Vertex native :predict only)
-----------------------------------------------------------
virtual-try-on-001  ( place a product image on a person image; 17 regions )

Virtual Try-On does NOT fit the OpenAI Images API shape (it requires
structured `personImage` + `productImages` base64 arrays). Call it via the
Vertex native passthrough route (see example below).
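Assembling the passthrough request for Virtual Try-On can be sketched as
follows. This is a hedged client-side sketch: the route, payload shape, and
placeholder project/location come from this document, while the helper name
`vto_request` is an assumption of the example.

```python
# Sketch: building URL + body for a virtual-try-on-001 :predict call.
# {project}=test and {loc}=global are arbitrary; the gateway substitutes
# its own internal node + region.
import base64
import json

BASE = "https://vertex.claude.gg"

def vto_request(person_png: bytes, product_png: bytes):
    """Return (url, json_body) for a Virtual Try-On :predict call."""
    url = (f"{BASE}/v1/projects/test/locations/global"
           f"/publishers/google/models/virtual-try-on-001:predict")
    body = json.dumps({
        "instances": [{
            "personImage": {"image": {
                "bytesBase64Encoded": base64.b64encode(person_png).decode()}},
            "productImages": [{"image": {
                "bytesBase64Encoded": base64.b64encode(product_png).decode()}}],
        }],
        "parameters": {"sampleCount": 1, "personGeneration": "allow-adult"},
    })
    return url, body
```

POST the returned body with your usual auth header; generated images come
back base64-encoded in `predictions[].bytesBase64Encoded`.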
Grok family (xAI)
-----------------
grok-4.20-reasoning
grok-4.20-non-reasoning
grok-4.1-fast-reasoning
grok-4.1-fast-non-reasoning

Qwen family (Alibaba)
---------------------
qwen3-235b-a22b-instruct-2507-maas
qwen3-coder-480b-a35b-instruct-maas
qwen3-next-80b-a3b-instruct-maas
qwen3-next-80b-a3b-thinking-maas

GPT-OSS (OpenAI open-weights)
-----------------------------
gpt-oss-120b-maas
gpt-oss-20b-maas

DeepSeek family
---------------
deepseek-v3.2-maas
deepseek-r1-0528-maas

NOTE: The catalog reflects only models with reliable capacity on the
cortexai.io compute fleet. Audio (TTS / live), video (Veo / Lyria) and a
number of preview models with insufficient quota have been removed; for the
current authoritative list always read GET /v1/models.

Model name aliases & normalization
----------------------------------
The gateway accepts a wide variety of common shorthand and SDK-default model
names; they are silently rewritten to the closest catalog model. Examples:

    "gpt-4o"                  -> gemini-2.5-flash
    "gpt-4o-mini"             -> gemini-2.5-flash-lite
    "gpt-4"                   -> gemini-2.5-pro
    "claude-3-5-sonnet"       -> gemini-2.5-flash
    "claude-3-opus"           -> gemini-2.5-pro
    "claude-haiku"            -> gemini-2.5-flash-lite
    "grok-4.2"                -> grok-4.20-non-reasoning
    "grok-fast"               -> grok-4.1-fast-non-reasoning
    "gemini-3-pro"            -> gemini-3.1-pro-preview
    "gemini-pro"              -> gemini-2.5-pro
    "deepseek-r1"             -> deepseek-r1-0528-maas
    "qwen-coder"              -> qwen3-coder-480b-a35b-instruct-maas
    "dall-e-3"                -> imagen-4.0-generate-001
    "vto" / "try-on"          -> virtual-try-on-001
    "nano-banana"             -> gemini-2.5-flash-image
    "nano-banana-pro"         -> gemini-3-pro-image-preview
    "nano-banana-flash"       -> gemini-3.1-flash-image-preview
    "text-embedding-3-large"  -> gemini-embedding-001
    "text-embedding-ada-002"  -> gemini-embedding-001
    "GEMINI_2.5_FLASH"        -> gemini-2.5-flash  (case + underscore tolerance)
    "gemini-2.5-flsh"         -> gemini-2.5-flash  (typo tolerance / fuzzy)

Each successful response carries the resolution in headers:

    X-Cortexai-Model-Requested:  what you sent
    X-Cortexai-Model-Resolved:
                                 the catalog model actually used
    X-Cortexai-Model-Resolution: <method>:<score>  (e.g. alias:0.95)

If the `model` field is missing or empty, a 400 invalid_request_error is
returned in the appropriate (OpenAI / Anthropic) envelope.

================================================================================
EXAMPLES
================================================================================

# 1) OpenAI Chat Completions (curl, non-stream)

curl https://vertex.claude.gg/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

# 1b) Same request using x-api-key header (Anthropic style)

curl https://vertex.claude.gg/v1/chat/completions \
  -H "x-api-key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [{"role":"user","content":"Hello"}]
  }'

# 2) OpenAI Chat Completions (streaming SSE)

curl -N https://vertex.claude.gg/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-4.20-reasoning",
    "stream": true,
    "messages": [{"role":"user","content":"Stream a haiku."}]
  }'

# 3) OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="https://vertex.claude.gg/v1",
    api_key="<API_KEY>"
)
resp = client.chat.completions.create(
    model="qwen3-next-80b-a3b-instruct-maas",
    messages=[{"role": "user", "content": "Hi"}]
)
print(resp.choices[0].message.content)

# 4) Anthropic SDK (Python)

from anthropic import Anthropic

client = Anthropic(
    base_url="https://vertex.claude.gg",
    api_key="<API_KEY>"
)
msg = client.messages.create(
    model="gemini-2.5-pro",
    max_tokens=512,
    messages=[{"role": "user", "content": "Hello"}]
)
print(msg.content[0].text)

# 5) Image generation (OpenAI Images API)

curl https://vertex.claude.gg/v1/images/generations \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "imagen-4.0-generate-001",
    "prompt": "A
               red panda riding a bicycle, photorealistic"
  }'

# 5b) Nano Banana (Gemini image generation, OpenAI Images API)

curl https://vertex.claude.gg/v1/images/generations \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-image",
    "prompt": "A serene Japanese garden in spring, koi pond, cherry blossoms"
  }'

# 5c) Virtual Try-On — place a product image on a person image (Vertex passthrough)
#     Returns 1-4 generated images (predictions[].bytesBase64Encoded).
#     PROJECT and LOCATION can be ANY value (e.g. "test" / "global"); the
#     gateway substitutes its own internal node + region.

curl https://vertex.claude.gg/v1/projects/test/locations/global/publishers/google/models/virtual-try-on-001:predict \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [{
      "personImage":   { "image": { "bytesBase64Encoded": "<BASE64_PERSON_IMAGE>" } },
      "productImages": [{ "image": { "bytesBase64Encoded": "<BASE64_PRODUCT_IMAGE>" } }]
    }],
    "parameters": { "sampleCount": 1, "personGeneration": "allow-adult" }
  }'

# 6) Embeddings (single input)

curl https://vertex.claude.gg/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "gemini-embedding-001", "input": "The quick brown fox" }'

# 6b) Embeddings (batch input + custom dimensions)

curl https://vertex.claude.gg/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "multilingual-e5-large-instruct-maas",
    "input": ["hello world", "merhaba dunya", "bonjour le monde"],
    "dimensions": 1024
  }'

# 7) List available models

curl https://vertex.claude.gg/v1/models

================================================================================
REQUEST BEHAVIOR
================================================================================

Path normalization
------------------
Common URL mistakes are silently corrected.
All of the following resolve to the same `/v1/chat/completions` route — no
redirect, just a transparent rewrite:

    /chat/completions
    /v1/chat/completions/
    //chat/completions
    ///v1//chat/completions/

The same applies to `/messages`, `/models`, `/me`, `/embeddings`,
`/responses`, and `/images/generations`. Trailing slashes and duplicate `//`
separators are also normalized.

Routing & retries
-----------------
Requests are dispatched to Vertex AI across multiple regions transparently.
Slow upstreams are hedged (time-to-first-byte aware) and transient failures
(timeout / 5xx / 429 / 401 / 403) are automatically retried on a fresh
upstream — up to 8 attempts within a 120-second budget. Clients see a single,
stable endpoint and never observe transient backend failures.

Rate limits
-----------
Each API key has the following request quotas (per calendar day / hour, UTC):

    Daily limit  : 3500 requests
    Hourly limit :  500 requests

In addition, per-key and global RPM (requests per minute) limits are enforced
on a per-model basis. High-throughput models (Gemini family) carry generous
RPM budgets; specialized partner models (Grok, Qwen-next, etc.) have their
own dedicated per-key allowances.

Every authenticated response includes the following headers so clients can
track usage without an extra round-trip:

    x-ratelimit-limit-requests-day       (= 3500)
    x-ratelimit-remaining-requests-day
    x-ratelimit-reset-requests-day       (seconds until reset)
    x-ratelimit-limit-requests-hour      (= 500)
    x-ratelimit-remaining-requests-hour
    x-ratelimit-reset-requests-hour      (seconds until reset)

To fetch a structured summary on demand (this call does NOT consume a request
slot — it is read-only):

    GET /v1/me              (auth via header, recommended)
    GET /api/me?key=sk-...
                            (auth via query string, browser-friendly)

Response:

    {
      "name": "sk-de3aad...65d0",
      "isAdmin": false,
      "usage": {
        "daily": 0,
        "dailyLimit": 3500,
        "dailyRemaining": 3500,
        "dailyResetAt": "2026-04-30T00:00:00.000Z",
        "hourly": 0,
        "hourlyLimit": 500,
        "hourlyRemaining": 500,
        "hourlyResetAt": "2026-04-29T11:00:00.000Z"
      }
    }

Counters reset on UTC calendar boundaries (00:00 UTC for daily, top-of-hour
for hourly), not on rolling windows.

Streaming
---------
SSE streams are fully supported for chat / messages endpoints, including
incremental tool-argument streaming.

Errors
------
Errors follow OpenAI's error envelope:

    { "error": { "message": "...", "type": "...", "code": ... } }

The Anthropic error envelope is used for /v1/messages routes:

    { "type": "error", "error": { "type": "...", "message": "..." } }

Common HTTP statuses returned by the gateway:

    400 invalid_request_error   Body parse error or missing required fields
                                (e.g. omitted "model").
    401 authentication_error    Unknown / revoked API key.
    404 not_found_error         Requested model is not in the cortexai.io
                                catalog (or has been removed).
    429 rate_limit_error        Per-key daily / hourly / RPM quota exceeded,
                                or upstream Vertex AI quota exhausted.
    5xx api_error               Upstream Vertex AI returned an error; the
                                request was retried automatically across
                                alternate routes before failing.

Vertex AI's own error messages (e.g. "Publisher Model not found",
"RESOURCE_EXHAUSTED: quota exceeded", "INVALID_ARGUMENT") are forwarded to
the client so that you can debug your request just like you would against
Vertex directly. Only secrets are stripped: project IDs, region names,
service-account emails, OAuth tokens, file paths, and stack frames are
redacted before reaching the client.

================================================================================
NOTES
================================================================================

* All models are reachable through the same base URL — no per-model URL
  prefix.
* Send the bare model id ("grok-4.20-reasoning",
  "qwen3-coder-480b-a35b-instruct-maas", "gemini-2.5-flash"); the gateway
  adds the correct Vertex AI publisher prefix automatically.
* Model name aliases are accepted: GPT, Claude and shorthand names are mapped
  to the closest catalog model. See the "Model name aliases" section above.
* Path mistakes (//chat/completions, /chat/completions/, /chat/completions
  without /v1) are silently corrected.
* Reasoning models populate `message.reasoning_content` (or `reasoning`) in
  addition to `message.content`. Streaming chunks use
  `delta.reasoning_content`.
* Streaming tool calls emit incremental `input_json_delta` chunks (Anthropic)
  or `tool_calls[].function.arguments` deltas (OpenAI).
* The catalog is refreshed every 24 hours; use GET /v1/models for the live
  list.

================================================================================
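Appendix: the reasoning-model streaming behavior noted above can be handled
with a small accumulator. This is a hedged sketch, not SDK code; the chunk
dicts stand in for decoded SSE `data:` payloads, and the helper name
`accumulate` is an assumption of the example.

```python
# Sketch: separating delta.reasoning_content from delta.content in
# OpenAI-style streaming chunks (decoded SSE payloads).

def accumulate(chunks):
    """Collect reasoning text and answer text from streaming deltas."""
    reasoning, content = [], []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)

demo = [
    {"choices": [{"delta": {"reasoning_content": "thinking... "}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": "!"}}]},
]
print(accumulate(demo))  # ('thinking... ', 'Hello!')
```

Non-reasoning models simply never emit `reasoning_content`, so the same loop
works unchanged across the whole catalog.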