Documentation

Everything you need to get started with CloudGPU.

Quick Start Guide

1. Create an Account
Sign up at CloudGPU. You'll receive $5 in free credits instantly.

2. Go to Deploy
Open the Deploy page and pick a model — DeepSeek, Qwen, Llama, or Mistral.

3. Click Deploy
Select a GPU, click Deploy. We handle everything — GPU setup, model download, API server.

4. Copy the API Endpoint
After ~60 seconds, you get an OpenAI-compatible API URL ready to use.

5. Paste Into Your Code
Set base_url to your endpoint. That's it — unlimited tokens, fixed hourly cost.
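If you'd rather script the wait in step 4 than watch the dashboard, a minimal stdlib-only sketch is below. It assumes the Ollama template, whose root path answers "Ollama is running" once the server is up; the deployment URL is a placeholder for your own.

```python
import urllib.request

# Placeholder endpoint from step 4 -- replace with your own.
ENDPOINT = "https://deployment-xxxx-11434.550w.link"

def v1_base_url(endpoint: str) -> str:
    """OpenAI-compatible routes live under /v1 on the same host."""
    return endpoint.rstrip("/") + "/v1"

def is_ready(endpoint: str) -> bool:
    """Ollama's root path answers 'Ollama is running' once the server is up."""
    try:
        with urllib.request.urlopen(endpoint, timeout=5) as resp:
            return b"Ollama is running" in resp.read()
    except OSError:
        return False
```

Poll is_ready(ENDPOINT) in a short loop after clicking Deploy, then hand v1_base_url(ENDPOINT) to any OpenAI-compatible client.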

Using your instance as an API endpoint

For developers building AI agents, chatbots, or any app that needs an LLM backend with a fixed hourly cost instead of per-token billing.

Every running Ollama instance gives you an OpenAI-compatible endpoint. Open the instance in your Dashboard and click the API button to copy-paste one of these:

OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://deployment-xxxx-11434.550w.link/v1",
    api_key="ollama",  # any non-empty string
)

resp = client.chat.completions.create(
    model="qwen2.5:0.5b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

curl

```bash
curl https://deployment-xxxx-11434.550w.link/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:0.5b",
    "messages": [{ "role": "user", "content": "Hello" }],
    "stream": false,
    "keep_alive": "30m"
  }'
```

Tip: keep_alive

Pass "keep_alive": "30m" in every request to keep the model in VRAM for 30 minutes after the last call. Without it, Ollama unloads the model after 5 minutes of idle and your next request pays a 30–90 second cold-load penalty.
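The same request can be made from Python with nothing but the standard library. A minimal sketch against the native /api/chat route shown in the curl example above; the deployment URL is a placeholder for your own.

```python
import json
import urllib.request

# Placeholder deployment URL -- copy your own from the dashboard.
BASE_URL = "https://deployment-xxxx-11434.550w.link"

def build_chat_payload(model: str, prompt: str, keep_alive: str = "30m") -> dict:
    """Native Ollama /api/chat body; keep_alive keeps the model warm in VRAM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because build_chat_payload sets keep_alive on every call, only the first request after a cold start pays the load penalty.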

Which GPU should I pick?

All prices below are per-hour retail. Supplier availability is live — if a card shows "Out of stock" in the Deploy page, pick the nearest larger tier.

| Model size | Min VRAM | Recommended GPU | Price |
| --- | --- | --- | --- |
| 0.5B – 8B (Qwen 2.5, Llama-3.1-8B, Mistral-7B) | 16 GB | RTX 4090 24 GB | $0.35/hr |
| 13B – 32B quantized (Qwen-32B-AWQ, CodeLlama-34B) | 32 GB | RTX 5090 32 GB | $0.49/hr |
| 30B+ full precision, medium batch inference | 48 GB | L40 / L40S 48 GB | $0.65 – $0.89/hr |
| 70B quantized (Llama-3.3-70B-AWQ, Qwen3-72B-Int4) | 40 GB | A100 40G | $0.66/hr |
| 70B full precision, production batch | 80 GB | A800 80G | $1.18/hr |
| Large-context LLM + long reasoning | 96 GB | H20 96G | $1.35/hr |
| Enterprise inference / multi-agent production | 80 GB | H800 80G | $3.09/hr |

Multi-card bundles (2× / 4× / 8× 4090) are also available and price-competitive for 70B+ quantized workloads — see the Deploy page for the live list.
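The table above reduces to a simple lookup: filter tiers by VRAM, then take the cheapest. A sketch using the single-card retail prices listed there; for the L40 / L40S row, the lower $0.65/hr bound is assumed.

```python
# (name, vram_gb, price_usd_per_hr) -- single-card tiers from the table above.
GPU_TIERS = [
    ("RTX 4090 24 GB", 24, 0.35),
    ("RTX 5090 32 GB", 32, 0.49),
    ("A100 40G", 40, 0.66),
    ("L40 / L40S 48 GB", 48, 0.65),  # lower bound of the listed price range
    ("A800 80G", 80, 1.18),
    ("H20 96G", 96, 1.35),
    ("H800 80G", 80, 3.09),
]

def cheapest_gpu(min_vram_gb: int):
    """Return the lowest-priced tier with at least min_vram_gb, or None."""
    fits = [t for t in GPU_TIERS if t[1] >= min_vram_gb]
    return min(fits, key=lambda t: t[2]) if fits else None
```

For example, cheapest_gpu(16) picks the RTX 4090; if that card shows out of stock, rerun with the next larger VRAM figure, mirroring the "nearest larger tier" advice above.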

Frequently Asked Questions

What GPUs are available right now?
Supplier inventory changes hourly. As of April 2026 the main cards in stock are RTX 4090 (24 GB, $0.35/hr) and A100 40G ($0.66/hr). Other cards such as L40, L40S, A800, H20 and H800 are listed but currently at zero stock — they become available as our supplier 共绩算力 brings more nodes online. Check /deploy for the live list.
How does billing work?
You are billed per hour of GPU usage with a 1-hour minimum. After the first hour, additional usage is billed in 1-second increments. Billing starts when your instance launches and stops when you destroy it. All prices are listed in USD.
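The rule above is easy to sanity-check in code. This is a sketch of the stated scheme (1-hour minimum, then 1-second increments), not an official billing calculator.

```python
import math

def usage_cost(seconds: float, hourly_rate: float) -> float:
    """Cost in USD: at least one full hour, then per-second granularity."""
    billable_seconds = max(3600, math.ceil(seconds))
    return round(hourly_rate * billable_seconds / 3600, 4)
```

For example, 14 hours on a $0.35/hr RTX 4090 comes to $4.90, which is why the $5 welcome credit covers roughly 14 hours.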
What's the minimum deposit?
No deposit required to start — new users receive $5 in free credits upon email verification, enough to run a 4090 instance for about 14 hours.
Can I stop and restart my instance?
You can destroy an instance at any time. Currently, stopped instances release their GPU — you'll need to create a new instance to resume. Persistent storage across restarts is on the roadmap.
What is pre-installed on an instance?
Today every template is a 1-click serverless container (e.g. Ollama with Qwen/Llama, Stable Diffusion WebUI, Whisper). The container comes pre-built with model weights and an HTTP API exposed on a unique URL. Raw SSH-into-Ubuntu rentals are on the roadmap.
What regions are available?
The CloudGPU control plane runs in Hong Kong. The GPU instances themselves run in our supplier's data centres in mainland China (several regions including Guangdong, Henan, Zhejiang, and Xinjiang). The specific region for your instance is visible in your dashboard.
Is there an API?
Yes. Every instance exposes an OpenAI-compatible REST API (for LLM templates) or model-specific REST API (for image / audio templates). See the API button on any running instance in your dashboard for copy-pasteable curl, Python and Node examples.
How do I get support?
Email support@cloudgpu.app or visit our Contact page. We aim to respond within 24 hours while the service is in beta.
What's the difference between Deploy and Marketplace?
Deploy gives you a ready-to-use AI model API in 60 seconds — no terminal needed. Marketplace is currently a preview of upcoming raw GPU rentals; for now use Deploy.
Can I deploy custom models?
For LLMs, our Ollama template supports any model in the Ollama library — after deploy, use the 'Pick Model' button to pull anything from qwen2.5:0.5b up to llama3.1:70b. For non-LLM custom models, raw GPU rentals are coming soon.
How do I connect my deployed model to my application?
Open your running instance in the dashboard and click the API button. You'll get an OpenAI-compatible endpoint at /v1/chat/completions plus ready-to-paste code for curl, Python (requests + OpenAI SDK), and Node.js. Drop it into LangChain / AutoGen / CrewAI with one line.

Still have questions?

Contact Support