Remote GPU Server
Host a larger coding model on a workstation / server (e.g., 4060/3090/4090/M‑series Ultra) and connect to it from a separate machine running VS Code + Cline Local.
Who this is for: You have a capable GPU box for inference and want better quality/latency than a
laptop can provide. Your client machine connects over LAN/VPN.
Topology
Client (VS Code + Cline Local) connects over network to the model server.
Recommended Models
- Qwen2.5‑Coder 32B (or Qwen3 Coder 30B A3B) — strong default for coding quality.
- GPT‑OSS‑120B — top‑tier quality; requires server‑class hardware.
See Quick Tips for what to avoid (e.g., GPT‑OSS‑20B) and low‑resource local‑only fallbacks.
Server Setup (GPU box)
- Install LM Studio: Download from https://lmstudio.ai and install on the GPU server.
- Download your model: In LM Studio, search for and download Qwen2.5‑Coder‑32B‑Instruct (or GPT‑OSS‑120B if your hardware supports it).
- Start the API server (listen on network):
  - Open the Server tab in LM Studio.
  - Host: `0.0.0.0` (listen on all interfaces).
  - Port: `1234`.
  - Enable CORS and keep‑alive.
  - Start the server and ensure the model is loaded. Test from the server itself: `curl http://127.0.0.1:1234/v1/models`
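For a fuller check than the models list, you can send a small chat request from the server itself. This is a sketch assuming the default port and the Qwen model name used in this guide; use whatever name `/v1/models` actually returns on your machine:

```bash
# Send a minimal chat completion to the OpenAI-compatible endpoint.
# The model name is an example; it must match what /v1/models reports.
curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b-instruct",
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 64
  }'
```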
- Find the server IP:
  - Windows: `ipconfig`
  - Linux/macOS: `ip addr` or `ifconfig`
  Use the LAN/VPN IP reachable by the client.
- Open firewall for the port:
  - Windows (PowerShell as admin): `New-NetFirewallRule -DisplayName "LM Studio 1234" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 1234`
  - Linux (ufw): `sudo ufw allow 1234/tcp`
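If you want something tighter than "allow from anywhere", ufw can scope the rule to your LAN. A sketch, assuming your subnet is 192.168.1.0/24 (substitute your own range):

```bash
# Allow inbound 1234/tcp only from the local subnet (example range).
sudo ufw allow from 192.168.1.0/24 to any port 1234 proto tcp
```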
- Optional: Reverse proxy + TLS: If exposing beyond the LAN, put NGINX/Caddy in front, terminate TLS, and restrict access (IP allowlists/VPN/auth). Prefer a VPN (WireGuard/Tailscale) over direct WAN exposure.
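If you take the VPN route, Tailscale involves the least setup. A minimal sketch, assuming Tailscale is already installed on both machines:

```bash
# On both the server and the client, join your tailnet:
sudo tailscale up
# On the server, print its tailnet IPv4 address; use it as SERVER_IP in the client setup below:
tailscale ip -4
```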
Client Setup (VS Code machine)
- Install Cline Local (VSIX): Download the latest release VSIX from Releases. In VS Code: Extensions → ••• → Install from VSIX… → pick the file → Reload.
- Open Cline Local settings within VS Code.
- Provider: LM Studio (or OpenAI‑compatible if using an alternative server).
- Endpoint: `http://SERVER_IP:1234` (replace `SERVER_IP` with the GPU server's IP).
- Model: the exact model name shown by the server (e.g., `qwen2.5-coder-32b-instruct`).
- Run a small coding task and verify token streaming.
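To confirm the endpoint is reachable from the client machine, query the models list (replace SERVER_IP as above); the response also shows the exact model name to enter in the settings:

```bash
# From the client: list the models the server exposes.
curl http://SERVER_IP:1234/v1/models
```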
Alternative servers: You can run an OpenAI‑compatible server (e.g., vLLM) on the GPU box and point Cline Local to it. The steps are similar: bind to `0.0.0.0`, enable CORS, open the firewall, and use the server IP in the Cline settings.
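As a sketch of the vLLM route (the model ID and flags are illustrative; check your vLLM version's docs):

```bash
# Serve an OpenAI-compatible API on all interfaces, port 1234.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --host 0.0.0.0 --port 1234
```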
Troubleshooting
- Connection refused/timeouts: Ensure the server host is `0.0.0.0`, the port matches, and the firewall allows inbound TCP on 1234 (see the quick check after this list).
- CORS errors: Enable CORS on the server.
- Model not found / 404: Make sure the model is loaded and the name in Cline matches exactly.
- Slow tokens / high latency: Check GPU utilization, reduce context length, prefer LAN/VPN, and avoid heavy OS tasks on the server.
- Security: Avoid exposing the server to the public internet. Use a VPN (WireGuard/Tailscale) or IP allowlists + TLS if external access is necessary.
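A quick way to separate network problems from server problems, run from the client (SERVER_IP as above):

```bash
# Is the TCP port reachable at all? Exits 0 if the port is open.
nc -vz SERVER_IP 1234
# Does the API answer? A JSON model list means the server side is healthy.
curl -sS http://SERVER_IP:1234/v1/models
```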
Need quick model guidance? See Quick Tips.