Remote GPU Server
Host a larger coding model on a workstation / server (e.g., 4060/3090/4090/M‑series Ultra) and connect to it from a separate machine running VS Code + Cline Local.
Who this is for: You have a capable GPU box for inference and want better quality/latency than a
laptop can provide. Your client machine connects over LAN/VPN.
Topology
Client (VS Code + Cline Local) connects over network to the model server.
Recommended Models
- Qwen2.5‑Coder 32B (or Qwen3 Coder 30B A3B) — strong default for coding quality.
- GPT‑OSS‑120B — top‑tier quality; requires server‑class hardware.
See Quick Tips for what to avoid (e.g., GPT‑OSS‑20B) and low‑resource local‑only fallbacks.
Server Setup (GPU box)
- Install LM Studio: Download from https://lmstudio.ai and install on the GPU server.
- Download your model: In LM Studio, search for and download Qwen2.5‑Coder‑32B‑Instruct (or GPT‑OSS‑120B if your hardware supports it).
- Start the API server (listen on network):
  - Open the Server tab in LM Studio.
  - Host: `0.0.0.0` (listen on all interfaces).
  - Port: `1234`.
  - Enable CORS and keep‑alive.
  - Start the server and ensure the model is loaded. Test from the server itself: `curl http://127.0.0.1:1234/v1/models`
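For a fuller check than the models list, you can send a small chat request from the server itself. This is a sketch assuming the default port and the Qwen model name used in this guide; use whatever name `/v1/models` actually returns on your machine:

```bash
# Send a minimal chat completion to the OpenAI-compatible endpoint.
# The model name is an example; it must match what /v1/models reports.
curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b-instruct",
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 64
  }'
```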
- Find the server IP:
  - Windows: `ipconfig`
  - Linux/macOS: `ip addr` or `ifconfig`
  Use the LAN/VPN IP reachable by the client.
- Open firewall for the port:
  - Windows (PowerShell as admin): `New-NetFirewallRule -DisplayName "LM Studio 1234" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 1234`
  - Linux (ufw): `sudo ufw allow 1234/tcp`
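If you want something tighter than "allow from anywhere", ufw can scope the rule to your LAN. A sketch, assuming your subnet is 192.168.1.0/24 (substitute your own range):

```bash
# Allow inbound 1234/tcp only from the local subnet (example range).
sudo ufw allow from 192.168.1.0/24 to any port 1234 proto tcp
```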
- Optional: Reverse proxy + TLS: If exposing beyond the LAN, put NGINX/Caddy in front, terminate TLS, and restrict access (IP allowlists/VPN/auth). Prefer a VPN (WireGuard/Tailscale) over direct WAN exposure.
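If you take the VPN route, Tailscale involves the least setup. A minimal sketch, assuming Tailscale is already installed on both machines:

```bash
# On both the server and the client, join your tailnet:
sudo tailscale up
# On the server, print its tailnet IPv4 address; use it as SERVER_IP in the client setup below:
tailscale ip -4
```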
Client Setup (VS Code machine)
- Install Cline Local (VSIX): Download the latest release VSIX from Releases. In VS Code: Extensions → ••• → Install from VSIX… → pick the file → Reload.
- Open Cline Local settings within VS Code.
- Provider: LM Studio (or OpenAI‑compatible if using an alternative server).
- Endpoint: `http://SERVER_IP:1234` (replace `SERVER_IP` with the GPU server's IP).
- Model: the exact model name shown by the server (e.g., `qwen2.5-coder-32b-instruct`).
- Run a small coding task and verify token streaming.
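To confirm the endpoint is reachable from the client machine, query the models list (replace SERVER_IP as above); the response also shows the exact model name to enter in the settings:

```bash
# From the client: list the models the server exposes.
curl http://SERVER_IP:1234/v1/models
```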
Alternative servers: You can run an OpenAI‑compatible server (e.g., vLLM) on the GPU box and point Cline Local to it. The steps are similar: bind to `0.0.0.0`, enable CORS, open the firewall, and use the server IP in the Cline settings.
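As a sketch of the vLLM route (the model ID and flags are illustrative; check your vLLM version's docs):

```bash
# Serve an OpenAI-compatible API on all interfaces, port 1234.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --host 0.0.0.0 --port 1234
```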
Troubleshooting
- Connection refused/timeouts: Ensure the server host is `0.0.0.0`, the port matches, and the firewall allows inbound TCP on 1234 (see the quick check after this list).
- CORS errors: Enable CORS on the server.
- Model not found / 404: Make sure the model is loaded and the name in Cline matches exactly.
- Slow tokens / high latency: Check GPU utilization, reduce context length, prefer LAN/VPN, and avoid heavy OS tasks on the server.
- Security: Avoid exposing the server to the public internet. Use a VPN (WireGuard/Tailscale) or IP allowlists + TLS if external access is necessary.
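A quick way to separate network problems from server problems, run from the client (SERVER_IP as above):

```bash
# Is the TCP port reachable at all? Exits 0 if the port is open.
nc -vz SERVER_IP 1234
# Does the API answer? A JSON model list means the server side is healthy.
curl -sS http://SERVER_IP:1234/v1/models
```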
Need quick model guidance? See Quick Tips.