Documentation

Everything you need to get started with CloudGPU.

Quick Start Guide

1. Create an Account
Sign up at CloudGPU. You'll receive $5 in free credits instantly.

2. Go to Deploy
Open the Deploy page and pick a model — DeepSeek, Qwen, Llama, or Mistral.

3. Click Deploy
Select a GPU, click Deploy. We handle everything — GPU setup, model download, API server.

4. Copy the API Endpoint
After ~60 seconds, you get an OpenAI-compatible API URL ready to use.

5. Paste Into Your Code
Set base_url to your endpoint. That's it — unlimited tokens, fixed hourly cost.
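If you'd rather script the wait in step 4 than watch the dashboard, a minimal stdlib-only sketch is below. It assumes the Ollama template, whose root path answers "Ollama is running" once the server is up; the deployment URL is a placeholder for your own.

```python
import urllib.request

# Placeholder endpoint from step 4 -- replace with your own.
ENDPOINT = "https://deployment-xxxx-11434.550w.link"

def v1_base_url(endpoint: str) -> str:
    """OpenAI-compatible routes live under /v1 on the same host."""
    return endpoint.rstrip("/") + "/v1"

def is_ready(endpoint: str) -> bool:
    """Ollama's root path answers 'Ollama is running' once the server is up."""
    try:
        with urllib.request.urlopen(endpoint, timeout=5) as resp:
            return b"Ollama is running" in resp.read()
    except OSError:
        return False
```

Poll is_ready(ENDPOINT) in a short loop after clicking Deploy, then hand v1_base_url(ENDPOINT) to any OpenAI-compatible client.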

Using your instance as an API endpoint

For developers building AI agents, chatbots, or any app that needs an LLM backend with a fixed hourly cost instead of per-token billing.

Every running Ollama instance gives you an OpenAI-compatible endpoint. Open the instance in your Dashboard and click the API button to copy-paste one of these:

OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://deployment-xxxx-11434.550w.link/v1",
    api_key="ollama",  # any non-empty string
)

resp = client.chat.completions.create(
    model="qwen2.5:0.5b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

curl

```bash
curl https://deployment-xxxx-11434.550w.link/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:0.5b",
    "messages": [{ "role": "user", "content": "Hello" }],
    "stream": false,
    "keep_alive": "30m"
  }'
```

Tip: keep_alive

Pass "keep_alive": "30m" in every request to keep the model in VRAM for 30 minutes after the last call. Without it, Ollama unloads the model after 5 minutes of idle and your next request pays a 30–90 second cold-load penalty.
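The same request can be made from Python with nothing but the standard library. A minimal sketch against the native /api/chat route shown in the curl example above; the deployment URL is a placeholder for your own.

```python
import json
import urllib.request

# Placeholder deployment URL -- copy your own from the dashboard.
BASE_URL = "https://deployment-xxxx-11434.550w.link"

def build_chat_payload(model: str, prompt: str, keep_alive: str = "30m") -> dict:
    """Native Ollama /api/chat body; keep_alive keeps the model warm in VRAM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because build_chat_payload sets keep_alive on every call, only the first request after a cold start pays the load penalty.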

Which GPU should I pick?

All prices below are per-hour retail. Supplier availability is live — if a card shows "Out of stock" in the Deploy page, pick the nearest larger tier.

| Model size | Min VRAM | Recommended GPU | Price |
| --- | --- | --- | --- |
| 0.5B – 8B (Qwen 2.5, Llama-3.1-8B, Mistral-7B) | 16 GB | RTX 4090 24 GB | $0.35/hr |
| 13B – 32B quantized (Qwen-32B-AWQ, CodeLlama-34B) | 32 GB | RTX 5090 32 GB | $0.49/hr |
| 30B+ full precision, medium batch inference | 48 GB | L40 / L40S 48 GB | $0.65 – $0.89/hr |
| 70B quantized (Llama-3.3-70B-AWQ, Qwen3-72B-Int4) | 40 GB | A100 40G | $0.66/hr |
| 70B full precision, production batch | 80 GB | A800 80G | $1.18/hr |
| Large-context LLM + long reasoning | 96 GB | H20 96G | $1.35/hr |
| Enterprise inference / multi-agent production | 80 GB | H800 80G | $3.09/hr |

Multi-card bundles (2× / 4× / 8× 4090) are also available and price-competitive for 70B+ quantized workloads — see the Deploy page for the live list.
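The table above reduces to a simple lookup: filter tiers by VRAM, then take the cheapest. A sketch using the single-card retail prices listed there; for the L40 / L40S row, the lower $0.65/hr bound is assumed.

```python
# (name, vram_gb, price_usd_per_hr) -- single-card tiers from the table above.
GPU_TIERS = [
    ("RTX 4090 24 GB", 24, 0.35),
    ("RTX 5090 32 GB", 32, 0.49),
    ("A100 40G", 40, 0.66),
    ("L40 / L40S 48 GB", 48, 0.65),  # lower bound of the listed price range
    ("A800 80G", 80, 1.18),
    ("H20 96G", 96, 1.35),
    ("H800 80G", 80, 3.09),
]

def cheapest_gpu(min_vram_gb: int):
    """Return the lowest-priced tier with at least min_vram_gb, or None."""
    fits = [t for t in GPU_TIERS if t[1] >= min_vram_gb]
    return min(fits, key=lambda t: t[2]) if fits else None
```

For example, cheapest_gpu(16) picks the RTX 4090; if that card shows out of stock, rerun with the next larger VRAM figure, mirroring the "nearest larger tier" advice above.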

Frequently Asked Questions

What GPUs are available right now?
Supplier inventory changes hourly. As of April 2026 the main cards in stock are RTX 4090 (24 GB, $0.35/hr) and A100 40G ($0.66/hr). Other cards such as L40, L40S, A800, H20 and H800 are listed but currently at zero stock — they become available as our supplier 共绩算力 brings more nodes online. Check /deploy for the live list.
How does billing work?
You are billed per hour of GPU usage with a 1-hour minimum. After the first hour, additional usage is billed in 1-second increments. Billing starts when your instance launches and stops when you destroy it. All prices are listed in USD.
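The rule above is easy to sanity-check in code. This is a sketch of the stated scheme (1-hour minimum, then 1-second increments), not an official billing calculator.

```python
import math

def usage_cost(seconds: float, hourly_rate: float) -> float:
    """Cost in USD: at least one full hour, then per-second granularity."""
    billable_seconds = max(3600, math.ceil(seconds))
    return round(hourly_rate * billable_seconds / 3600, 4)
```

For example, 14 hours on a $0.35/hr RTX 4090 comes to $4.90, which is why the $5 welcome credit covers roughly 14 hours.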
What's the minimum deposit?
No deposit required to start — new users receive $5 in free credits upon email verification, enough to run a 4090 instance for about 14 hours.
Can I stop and restart my instance?
You can destroy an instance at any time. Currently, stopped instances release their GPU — you'll need to create a new instance to resume. Persistent storage across restarts is on the roadmap.
What is pre-installed on an instance?
Today every template is a 1-click serverless container (e.g. Ollama with Qwen/Llama, Stable Diffusion WebUI, Whisper). The container comes pre-built with model weights and an HTTP API exposed on a unique URL. Raw SSH-into-Ubuntu rentals are on the roadmap.
What regions are available?
The CloudGPU control plane runs in Hong Kong. The GPU instances themselves run in our supplier's data centres in mainland China (several regions including Guangdong, Henan, Zhejiang, and Xinjiang). The specific region for your instance is visible in your dashboard.
Is there an API?
Yes. Every instance exposes an OpenAI-compatible REST API (for LLM templates) or model-specific REST API (for image / audio templates). See the API button on any running instance in your dashboard for copy-pasteable curl, Python and Node examples.
How do I get support?
Email support@cloudgpu.app or visit our Contact page. We aim to respond within 24 hours while the service is in beta.
What's the difference between Deploy and Marketplace?
Deploy gives you a ready-to-use AI model API in 60 seconds — no terminal needed. Marketplace is currently a preview of upcoming raw GPU rentals; for now use Deploy.
Can I deploy custom models?
For LLMs, our Ollama template supports any model in the Ollama library — after deploy, use the 'Pick Model' button to pull anything from qwen2.5:0.5b up to llama3.1:70b. For non-LLM custom models, raw GPU rentals are coming soon.
How do I connect my deployed model to my application?
Open your running instance in the dashboard and click the API button. You'll get an OpenAI-compatible endpoint at /v1/chat/completions plus ready-to-paste code for curl, Python (requests + OpenAI SDK), and Node.js. Drop it into LangChain / AutoGen / CrewAI with one line.

Still have questions?

Contact Support