Documentation
Everything you need to get started with CloudGPU.
Quick Start Guide
1. Create an Account: Sign up at CloudGPU. You'll receive $5 in free credits instantly.
2. Go to Deploy: Open the Deploy page and pick a model — DeepSeek, Qwen, Llama, or Mistral.
3. Click Deploy: Select a GPU and click Deploy. We handle everything — GPU setup, model download, API server.
4. Copy the API Endpoint: After ~60 seconds, you get an OpenAI-compatible API URL ready to use.
5. Paste Into Your Code: Set base_url to your endpoint. That's it — unlimited tokens, fixed hourly cost.
Using your instance as an API endpoint
For developers building AI agents, chatbots, or any app that needs an LLM backend with a fixed hourly cost instead of per-token billing.
Every running Ollama instance exposes an OpenAI-compatible endpoint. Open the instance in your Dashboard and click the API button to copy a ready-made snippet for either the OpenAI Python SDK or curl.
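The dashboard's copy-paste snippets aren't reproduced here, but the request shape is the standard OpenAI chat-completions format. Below is a minimal sketch using only the Python standard library; the endpoint URL, API key, and model tag are placeholders, not real values:

```python
import json
import urllib.request

# Placeholders: replace with the endpoint copied from your instance's
# API button. Ollama ignores the key, but OpenAI clients require one.
BASE_URL = "https://your-instance.example/v1"
API_KEY = "ollama"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

req = build_chat_request("llama3.1:8b", "Say hello in one sentence.")
# resp = urllib.request.urlopen(req)  # uncomment against a live instance
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

With the official openai package the equivalent is `OpenAI(base_url=..., api_key=...)` followed by `client.chat.completions.create(...)`; only the base_url changes from a stock OpenAI setup.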
Tip: keep_alive
Pass "keep_alive": "30m" in every request to keep the model in VRAM for 30 minutes after the last call. Without it, Ollama unloads the model after 5 minutes of idle and your next request pays a 30–90 second cold-load penalty.
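As a stdlib-only sketch of that tip (the model tag is a placeholder), a small helper can add the field to every outgoing payload:

```python
import json

def with_keep_alive(payload: dict, duration: str = "30m") -> dict:
    """Return a copy of an OpenAI-style chat payload with Ollama's
    keep_alive field added, so the model stays loaded in VRAM."""
    return {**payload, "keep_alive": duration}

payload = with_keep_alive({
    "model": "qwen2.5:7b",  # placeholder model tag
    "messages": [{"role": "user", "content": "ping"}],
})
body = json.dumps(payload)  # send this as the POST body on every request
```

If you use the openai Python package, extra fields like this can be passed via `extra_body={"keep_alive": "30m"}` on `client.chat.completions.create(...)`.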
Which GPU should I pick?
All prices below are per-hour retail. Supplier availability is live — if a card shows "Out of stock" in the Deploy page, pick the nearest larger tier.
| Model size | Min VRAM | Recommended GPU | Price |
|---|---|---|---|
| 0.5B – 8B (Qwen 2.5, Llama-3.1-8B, Mistral-7B) | 16 GB | RTX 4090 24 GB | $0.35/hr |
| 13B – 32B quantized (Qwen-32B-AWQ, CodeLlama-34B) | 32 GB | RTX 5090 32 GB | $0.49/hr |
| 30B+ full precision, medium batch inference | 48 GB | L40 / L40S 48 GB | $0.65 – $0.89/hr |
| 70B quantized (Llama-3.3-70B-AWQ, Qwen3-72B-Int4) | 40 GB | A100 40G | $0.66/hr |
| 70B full precision, production batch | 80 GB | A800 80G | $1.18/hr |
| Large-context LLM + long reasoning | 96 GB | H20 96G | $1.35/hr |
| Enterprise inference / multi-agent production | 80 GB | H800 80G | $3.09/hr |
Multi-card bundles (2× / 4× / 8× 4090) are also available and price-competitive for 70B+ quantized workloads — see the Deploy page for the live list.
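The VRAM minimums in the table roughly track a common community rule of thumb (an assumption for illustration, not CloudGPU's official sizing method): weights take parameters times bytes per parameter, plus about 20% headroom for the KV cache and runtime. A quick sanity check:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM estimate: weight size plus ~20% headroom.

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for 8-bit, 0.5 for 4-bit (AWQ/Int4).
    """
    return params_billions * bytes_per_param * overhead

# Llama-3.1-8B in FP16: ~19 GB, which fits the 24 GB RTX 4090 tier
print(round(estimate_vram_gb(8, 2.0), 1))
# Llama-3.3-70B in 4-bit: ~42 GB, i.e. the 40-48 GB tier
print(round(estimate_vram_gb(70, 0.5), 1))
```

Actual fit also depends on context length and batch size, so when an estimate lands near a tier boundary, pick the larger card.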
Frequently Asked Questions
- What GPUs are available right now?
- How does billing work?
- What's the minimum deposit?
- Can I stop and restart my instance?
- What is pre-installed on an instance?
- What regions are available?
- Is there an API?
- How do I get support?
- What's the difference between Deploy and Marketplace?
- Can I deploy custom models?
- How do I connect my deployed model to my application?
Still have questions?
Contact Support