================================================================================
                     cortexai.io API GATEWAY — DOCS
                     OpenAI / Anthropic compatible
================================================================================

Base URL
--------
https://vertex.claude.gg

Authentication
--------------
Provide your API key using ANY of the following (in order of preference):

    Authorization: Bearer <API_KEY>    (OpenAI / SDK style)
    x-api-key: <API_KEY>               (Anthropic style)
    x-goog-api-key: <API_KEY>          (Google native style)
    ?api_key=<API_KEY>                 (query string fallback)

The first matching credential is used. CORS pre-flight responses already
include `authorization, x-api-key, x-goog-api-key, content-type,
anthropic-version` in Access-Control-Allow-Headers, so browser clients can
use any of these.

Public endpoints (no auth required):

    GET /v1/models
    GET /v1beta/models
    GET /health
    GET /docs.txt

================================================================================
ENDPOINTS
================================================================================

OpenAI-compatible
-----------------
POST /v1/chat/completions
POST /v1/responses
POST /v1/images/generations
POST /v1/embeddings
GET  /v1/models
GET  /v1/me     (rate-limit / quota status for current API key)
GET  /api/me    (alias of /v1/me, supports ?key= query param)

Anthropic-compatible
--------------------
POST /v1/messages
POST /v1/messages/count_tokens

Gemini-native compatible
------------------------
POST /v1beta/models/{model}:{action}
GET  /v1beta/models

Vertex native passthrough
-------------------------
POST /v1/projects/{project}/locations/{loc}/publishers/google/models/{model}:{action}

Used for Virtual Try-On, which needs a structured personImage + productImages
base64 input shape that the OpenAI Images API can't represent. {project} and
{loc} can be any placeholder; the gateway substitutes its own internal
node + region.
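The four credential schemes above can be sketched as a small helper. This is
an illustrative client-side snippet, not part of any SDK; the helper name
`auth_credentials` is an assumption, and the `requests` usage at the end is
shown commented out.

```python
# Sketch: building credentials for each auth scheme the gateway accepts.
# The helper name (auth_credentials) is illustrative, not part of the API.

def auth_credentials(api_key: str, scheme: str = "bearer"):
    """Return (headers, query_params) for one of the gateway's auth schemes."""
    if scheme == "bearer":          # OpenAI / SDK style (preferred)
        return {"Authorization": f"Bearer {api_key}"}, {}
    if scheme == "x-api-key":       # Anthropic style
        return {"x-api-key": api_key}, {}
    if scheme == "x-goog-api-key":  # Google native style
        return {"x-goog-api-key": api_key}, {}
    if scheme == "query":           # query-string fallback
        return {}, {"api_key": api_key}
    raise ValueError(f"unknown auth scheme: {scheme}")

# Example: the same GET /v1/me call under any scheme (requests library):
# import requests
# headers, params = auth_credentials("sk-...", "bearer")
# r = requests.get("https://vertex.claude.gg/v1/me", headers=headers, params=params)
```

Since the gateway checks credentials in order of preference, sending more than
one scheme at once is harmless: the first match wins.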
================================================================================
MODEL CATALOG (28 models)
================================================================================

For an always up-to-date machine-readable list call:

    GET /v1/models      (or GET /v1/models/{id} for a single model)
    GET /v1beta/models  (or GET /v1beta/models/{id} for a single model)

Both endpoints return ONE union response that simultaneously satisfies the
OpenAI, Anthropic and Gemini schemas:

    {
      "object": "list",                     # OpenAI envelope
      "data":   [ ...models... ],           # OpenAI / Anthropic envelope
      "models": [ ...models... ],           # Gemini envelope
      "first_id": "...", "last_id": "...",  # Anthropic pagination
      "has_more": false
    }

Each model entry exposes the fields read by every popular client (OpenAI SDK,
Anthropic SDK, Gemini SDK, Roo Code, OpenRouter SDK, LiteLLM proxy, OpenWebUI,
LibreChat, Continue.dev, Cherry Studio). The context window is exposed under
FOUR different field names so any client picks it up correctly:

    context_window    -> Roo Code, Cherry Studio
    context_length    -> OpenRouter SDK, OpenWebUI
    max_input_tokens  -> Anthropic SDK, LiteLLM
    inputTokenLimit   -> Gemini SDK
    top_provider.context_length         (OpenRouter strict)
    top_provider.max_completion_tokens

The maximum-output limit is also mirrored under multiple names:
`max_tokens`, `max_output_tokens`, `max_output_length`, `outputTokenLimit`.

Sample (gemini-2.5-flash):

    {
      "id": "gemini-2.5-flash",
      "object": "model",
      "type": "language",                    # "language"|"embedding"|"image"|"audio"
      "mode": "chat",
      "created": 1748390400,                 # unix seconds
      "created_at": "2025-06-01T00:00:00Z",  # RFC 3339
      "owned_by": "google",                  # real vendor: google, openai,
                                             # anthropic, xai, alibaba,
                                             # deepseek, moonshot, minimax,
                                             # mistralai, meta, ...
      "name": "Gemini 2.5 Flash",            # human-readable
      "display_name": "Gemini 2.5 Flash",
      "displayName": "Gemini 2.5 Flash",
      "description": "Gemini 2.5 Flash by google",

      "context_window": 1048576,
      "context_length": 1048576,
      "max_input_tokens": 1048576,
      "inputTokenLimit": 1048576,

      "max_tokens": 65535,
      "max_output_tokens": 65535,
      "max_output_length": 65535,
      "outputTokenLimit": 65535,

      "input_modalities": ["text", "image", "file"],
      "output_modalities": ["text"],
      "architecture": { ... OpenRouter-shaped ... },

      "top_provider": {
        "context_length": 1048576,
        "max_completion_tokens": 65535,
        "is_moderated": false
      },

      "pricing": { "prompt": "0", "completion": "0", ... },
      "input_cost_per_token": 0,
      "output_cost_per_token": 0,

      "supported_parameters": ["temperature","top_p","max_tokens","stream","stop","tools","tool_choice"],
      "supported_features": ["tools","function_calling","vision","streaming"],
      "supportsImages": true,
      "supportsTools": true,
      "supportsStreaming": true,
      "supportsReasoning": false,

      // Gemini-native compat
      "baseModelId": "gemini-2.5-flash",
      "version": "001",
      "supportedGenerationMethods": ["generateContent","streamGenerateContent","countTokens"],

      // cortexai.io extension
      "capabilities": { "chat": true, "vision": true, "tool_use": true, ... },
      "canonical_slug": "google/gemini-2.5-flash",
      "tags": ["chat","vision","tools"]
    }

The Gemini-native endpoint `/v1beta/models` returns the same payload but with
`name = "models/{id}"` (Google resource-path format). Each SDK reads only the
fields it knows and ignores the rest, so the same unchanged call works for
OpenAI, Anthropic, Gemini, Roo Code, OpenRouter, LiteLLM, OpenWebUI,
LibreChat, Continue.dev and Cherry Studio.
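If you parse the catalog JSON by hand rather than through an SDK, the
mirrored field names are easiest to consume with a fallback chain. A minimal
sketch (the helper name `context_window` is illustrative):

```python
# Sketch: reading the input-token limit from a union-catalog model entry,
# trying each mirrored field name in turn.

def context_window(model: dict):
    """Return the context window from whichever field alias is present."""
    for field in ("context_window", "context_length",
                  "max_input_tokens", "inputTokenLimit"):
        if field in model:
            return model[field]
    # OpenRouter-strict clients read the nested copy instead
    return model.get("top_provider", {}).get("context_length")

entry = {"id": "gemini-2.5-flash", "inputTokenLimit": 1048576}
print(context_window(entry))  # 1048576
```

Because every catalog entry carries all four aliases, any single lookup also
works; the chain only matters if you cache or transform the payload.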
Gemini family
-------------
gemini-2.5-pro
gemini-2.5-flash
gemini-2.5-flash-lite
gemini-3-flash-preview
gemini-3.1-flash-lite-preview
gemini-3.1-pro-preview

Embeddings
----------
gemini-embedding-001                 (3072 dim, native :predict)
multilingual-e5-large-instruct-maas  (1024 dim, OpenAI-compat)
multilingual-e5-small-maas           ( 384 dim, OpenAI-compat)

Imagen — text to image (OpenAI Images API: POST /v1/images/generations)
-----------------------------------------------------------------------
imagen-4.0-fast-generate-001   ( ~6 s,  5 regions, GA )
imagen-4.0-generate-001        ( ~10 s, 5 regions, GA )
imagen-4.0-ultra-generate-001  ( ~13 s, 5 regions, GA, best quality )
imagen-3.0-generate-002        ( ~11 s, 4 regions, GA )
imagen-3.0-fast-generate-001   ( ~7 s,  5 regions, GA )

Nano Banana — Gemini image generation (generateContent + responseModalities)
----------------------------------------------------------------------------
gemini-2.5-flash-image          ( ~6 s, 4 regions, GA, supports multi-image
                                  fusion + edit + up to 10 output images )
gemini-3.1-flash-image-preview  ( ~45 s, global, Preview, "Nano Banana Pro
                                  flash" - up to 3 images per call with full
                                  thinking trace )
gemini-3-pro-image-preview      ( ~42 s, global, Preview, "Nano Banana Pro" -
                                  highest quality, single image with thinking
                                  trace )

Virtual Try-On — image fusion (Vertex native :predict only)
-----------------------------------------------------------
virtual-try-on-001  ( place a product image on a person image; 17 regions )

Virtual Try-On does NOT fit the OpenAI Images API shape (it requires
structured `personImage` + `productImages` base64 arrays). Call it via the
Vertex native passthrough route (see example below).
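Assembling the passthrough request for Virtual Try-On can be sketched as
follows. This is a hedged client-side sketch: the route, payload shape, and
placeholder project/location come from this document, while the helper name
`vto_request` is an assumption of the example.

```python
# Sketch: building URL + body for a virtual-try-on-001 :predict call.
# {project}=test and {loc}=global are arbitrary; the gateway substitutes
# its own internal node + region.
import base64
import json

BASE = "https://vertex.claude.gg"

def vto_request(person_png: bytes, product_png: bytes):
    """Return (url, json_body) for a Virtual Try-On :predict call."""
    url = (f"{BASE}/v1/projects/test/locations/global"
           f"/publishers/google/models/virtual-try-on-001:predict")
    body = json.dumps({
        "instances": [{
            "personImage": {"image": {
                "bytesBase64Encoded": base64.b64encode(person_png).decode()}},
            "productImages": [{"image": {
                "bytesBase64Encoded": base64.b64encode(product_png).decode()}}],
        }],
        "parameters": {"sampleCount": 1, "personGeneration": "allow-adult"},
    })
    return url, body
```

POST the returned body with your usual auth header; generated images come
back base64-encoded in `predictions[].bytesBase64Encoded`.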
Grok family (xAI)
-----------------
grok-4.20-reasoning
grok-4.20-non-reasoning
grok-4.1-fast-reasoning
grok-4.1-fast-non-reasoning

Qwen family (Alibaba)
---------------------
qwen3-235b-a22b-instruct-2507-maas
qwen3-coder-480b-a35b-instruct-maas
qwen3-next-80b-a3b-instruct-maas
qwen3-next-80b-a3b-thinking-maas

GPT-OSS (OpenAI open-weights)
-----------------------------
gpt-oss-120b-maas
gpt-oss-20b-maas

DeepSeek family
---------------
deepseek-v3.2-maas
deepseek-r1-0528-maas

NOTE: The catalog reflects only models with reliable capacity on the
cortexai.io compute fleet. Audio (TTS / live), video (Veo / Lyria) and a
number of preview models with insufficient quota have been removed; for the
current authoritative list always read GET /v1/models.

Model name aliases & normalization
----------------------------------
The gateway accepts a wide variety of common shorthand and SDK-default model
names; they are silently rewritten to the closest catalog model. Examples:

    "gpt-4o"                  -> gemini-2.5-flash
    "gpt-4o-mini"             -> gemini-2.5-flash-lite
    "gpt-4"                   -> gemini-2.5-pro
    "claude-3-5-sonnet"       -> gemini-2.5-flash
    "claude-3-opus"           -> gemini-2.5-pro
    "claude-haiku"            -> gemini-2.5-flash-lite
    "grok-4.2"                -> grok-4.20-non-reasoning
    "grok-fast"               -> grok-4.1-fast-non-reasoning
    "gemini-3-pro"            -> gemini-3.1-pro-preview
    "gemini-pro"              -> gemini-2.5-pro
    "deepseek-r1"             -> deepseek-r1-0528-maas
    "qwen-coder"              -> qwen3-coder-480b-a35b-instruct-maas
    "dall-e-3"                -> imagen-4.0-generate-001
    "vto" / "try-on"          -> virtual-try-on-001
    "nano-banana"             -> gemini-2.5-flash-image
    "nano-banana-pro"         -> gemini-3-pro-image-preview
    "nano-banana-flash"       -> gemini-3.1-flash-image-preview
    "text-embedding-3-large"  -> gemini-embedding-001
    "text-embedding-ada-002"  -> gemini-embedding-001
    "GEMINI_2.5_FLASH"        -> gemini-2.5-flash  (case + underscore tolerance)
    "gemini-2.5-flsh"         -> gemini-2.5-flash  (typo tolerance / fuzzy)

Each successful response carries the resolution in headers:

    X-Cortexai-Model-Requested:  what you sent
    X-Cortexai-Model-Resolved:
                                 the catalog model actually used
    X-Cortexai-Model-Resolution: <method>:<score>  (e.g. alias:0.95)

If the `model` field is missing or empty, a 400 invalid_request_error is
returned in the appropriate (OpenAI / Anthropic) envelope.

================================================================================
EXAMPLES
================================================================================

# 1) OpenAI Chat Completions (curl, non-stream)

curl https://vertex.claude.gg/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

# 1b) Same request using x-api-key header (Anthropic style)

curl https://vertex.claude.gg/v1/chat/completions \
  -H "x-api-key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [{"role":"user","content":"Hello"}]
  }'

# 2) OpenAI Chat Completions (streaming SSE)

curl -N https://vertex.claude.gg/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-4.20-reasoning",
    "stream": true,
    "messages": [{"role":"user","content":"Stream a haiku."}]
  }'

# 3) OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="https://vertex.claude.gg/v1",
    api_key="<API_KEY>"
)
resp = client.chat.completions.create(
    model="qwen3-next-80b-a3b-instruct-maas",
    messages=[{"role": "user", "content": "Hi"}]
)
print(resp.choices[0].message.content)

# 4) Anthropic SDK (Python)

from anthropic import Anthropic

client = Anthropic(
    base_url="https://vertex.claude.gg",
    api_key="<API_KEY>"
)
msg = client.messages.create(
    model="gemini-2.5-pro",
    max_tokens=512,
    messages=[{"role": "user", "content": "Hello"}]
)
print(msg.content[0].text)

# 5) Image generation (OpenAI Images API)

curl https://vertex.claude.gg/v1/images/generations \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "imagen-4.0-generate-001",
    "prompt": "A
               red panda riding a bicycle, photorealistic"
  }'

# 5b) Nano Banana (Gemini image generation, OpenAI Images API)

curl https://vertex.claude.gg/v1/images/generations \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-image",
    "prompt": "A serene Japanese garden in spring, koi pond, cherry blossoms"
  }'

# 5c) Virtual Try-On — place a product image on a person image (Vertex passthrough)
#     Returns 1-4 generated images (predictions[].bytesBase64Encoded).
#     PROJECT and LOCATION can be ANY value (e.g. "test" / "global"); the
#     gateway substitutes its own internal node + region.

curl https://vertex.claude.gg/v1/projects/test/locations/global/publishers/google/models/virtual-try-on-001:predict \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [{
      "personImage":   { "image": { "bytesBase64Encoded": "<BASE64_PERSON_IMAGE>" } },
      "productImages": [{ "image": { "bytesBase64Encoded": "<BASE64_PRODUCT_IMAGE>" } }]
    }],
    "parameters": { "sampleCount": 1, "personGeneration": "allow-adult" }
  }'

# 6) Embeddings (single input)

curl https://vertex.claude.gg/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "gemini-embedding-001", "input": "The quick brown fox" }'

# 6b) Embeddings (batch input + custom dimensions)

curl https://vertex.claude.gg/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "multilingual-e5-large-instruct-maas",
    "input": ["hello world", "merhaba dunya", "bonjour le monde"],
    "dimensions": 1024
  }'

# 7) List available models

curl https://vertex.claude.gg/v1/models

================================================================================
REQUEST BEHAVIOR
================================================================================

Path normalization
------------------
Common URL mistakes are silently corrected.
All of the following resolve to the same `/v1/chat/completions` route — no
redirect, just a transparent rewrite:

    /chat/completions
    /v1/chat/completions/
    //chat/completions
    ///v1//chat/completions/

The same applies to `/messages`, `/models`, `/me`, `/embeddings`,
`/responses`, and `/images/generations`. Trailing slashes and duplicate `//`
separators are also normalized.

Routing & retries
-----------------
Requests are dispatched to Vertex AI across multiple regions transparently.
Slow upstreams are hedged (time-to-first-byte aware) and transient failures
(timeout / 5xx / 429 / 401 / 403) are automatically retried on a fresh
upstream — up to 8 attempts within a 120-second budget. Clients see a single,
stable endpoint and never observe transient backend failures.

Rate limits
-----------
Each API key has the following request quotas (per calendar day / hour, UTC):

    Daily limit  : 3500 requests
    Hourly limit :  500 requests

In addition, per-key and global RPM (requests per minute) limits are enforced
on a per-model basis. High-throughput models (Gemini family) carry generous
RPM budgets; specialized partner models (Grok, Qwen-next, etc.) have their
own dedicated per-key allowances.

Every authenticated response includes the following headers so clients can
track usage without an extra round-trip:

    x-ratelimit-limit-requests-day       (= 3500)
    x-ratelimit-remaining-requests-day
    x-ratelimit-reset-requests-day       (seconds until reset)
    x-ratelimit-limit-requests-hour      (= 500)
    x-ratelimit-remaining-requests-hour
    x-ratelimit-reset-requests-hour      (seconds until reset)

To fetch a structured summary on demand (this call does NOT consume a request
slot — it is read-only):

    GET /v1/me              (auth via header, recommended)
    GET /api/me?key=sk-...
                            (auth via query string, browser-friendly)

Response:

    {
      "name": "sk-de3aad...65d0",
      "isAdmin": false,
      "usage": {
        "daily": 0,
        "dailyLimit": 3500,
        "dailyRemaining": 3500,
        "dailyResetAt": "2026-04-30T00:00:00.000Z",
        "hourly": 0,
        "hourlyLimit": 500,
        "hourlyRemaining": 500,
        "hourlyResetAt": "2026-04-29T11:00:00.000Z"
      }
    }

Counters reset on UTC calendar boundaries (00:00 UTC for daily, top-of-hour
for hourly), not on rolling windows.

Streaming
---------
SSE streams are fully supported for chat / messages endpoints, including
incremental tool-argument streaming.

Errors
------
Errors follow OpenAI's error envelope:

    { "error": { "message": "...", "type": "...", "code": ... } }

The Anthropic error envelope is used for /v1/messages routes:

    { "type": "error", "error": { "type": "...", "message": "..." } }

Common HTTP statuses returned by the gateway:

    400 invalid_request_error   Body parse error or missing required fields
                                (e.g. omitted "model").
    401 authentication_error    Unknown / revoked API key.
    404 not_found_error         Requested model is not in the cortexai.io
                                catalog (or has been removed).
    429 rate_limit_error        Per-key daily / hourly / RPM quota exceeded,
                                or upstream Vertex AI quota exhausted.
    5xx api_error               Upstream Vertex AI returned an error; the
                                request was retried automatically across
                                alternate routes before failing.

Vertex AI's own error messages (e.g. "Publisher Model not found",
"RESOURCE_EXHAUSTED: quota exceeded", "INVALID_ARGUMENT") are forwarded to
the client so that you can debug your request just like you would against
Vertex directly. Only secrets are stripped: project IDs, region names,
service-account emails, OAuth tokens, file paths, and stack frames are
redacted before reaching the client.

================================================================================
NOTES
================================================================================

* All models are reachable through the same base URL — no per-model URL
  prefix.
* Send the bare model id ("grok-4.20-reasoning",
  "qwen3-coder-480b-a35b-instruct-maas", "gemini-2.5-flash"); the gateway
  adds the correct Vertex AI publisher prefix automatically.
* Model name aliases are accepted: GPT, Claude and shorthand names are mapped
  to the closest catalog model. See the "Model name aliases" section above.
* Path mistakes (//chat/completions, /chat/completions/, /chat/completions
  without /v1) are silently corrected.
* Reasoning models populate `message.reasoning_content` (or `reasoning`) in
  addition to `message.content`. Streaming chunks use
  `delta.reasoning_content`.
* Streaming tool calls emit incremental `input_json_delta` chunks (Anthropic)
  or `tool_calls[].function.arguments` deltas (OpenAI).
* The catalog is refreshed every 24 hours; use GET /v1/models for the live
  list.

================================================================================
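Appendix: the reasoning-model streaming behavior noted above can be handled
with a small accumulator. This is a hedged sketch, not SDK code; the chunk
dicts stand in for decoded SSE `data:` payloads, and the helper name
`accumulate` is an assumption of the example.

```python
# Sketch: separating delta.reasoning_content from delta.content in
# OpenAI-style streaming chunks (decoded SSE payloads).

def accumulate(chunks):
    """Collect reasoning text and answer text from streaming deltas."""
    reasoning, content = [], []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)

demo = [
    {"choices": [{"delta": {"reasoning_content": "thinking... "}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": "!"}}]},
]
print(accumulate(demo))  # ('thinking... ', 'Hello!')
```

Non-reasoning models simply never emit `reasoning_content`, so the same loop
works unchanged across the whole catalog.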