AI with BASIC
Local LLMs, ONNX, RAG - all in one binary

jdBasic ships with ONNX Runtime for any pretrained model, llama.cpp for local LLM chat, a built-in RAG engine, and a k-NN classifier. No Python, no Docker, no cloud. Everything runs on your hardware out of a single self-contained EXE.

On this page

› The AI stack at a glance
› 1) ONNX inference
› 2) Local LLM chat
› 3) Retrieval-augmented Q&A
› 4) k-NN classifier
› Performance & tips

The AI stack at a glance

ONNX Runtime

AI.LOAD + AI.RUN - drop in any .onnx model. MNIST, ResNet, stable-diffusion encoders, you name it.

llama.cpp

AI.LOAD_LLM + AI.CHAT_STREAM - load any GGUF (Phi-3, Llama, Qwen, gpt-oss). CUDA / Vulkan / ROCm offload depending on the build.

RAG engine

AI.RAG_* - TF-IDF or dense embeddings, HNSW index, file/dir ingestion, streaming answers grounded in your own data.

Classifier

AI.CLASSIFIER_* - k-NN over embeddings, ideal for ticket routing, intent detection, anything where you have labelled examples but no training budget.

1) ONNX inference

AI.LOAD takes any .onnx file; AI.RUN feeds it the inputs as nested jdBasic arrays and returns the outputs the same way. Use it for image classifiers, embedding extractors, or, as in the example below, as a tiny SIMD compute backend for convolutions.

jdb/mini_onnx.jdb

' Load a tiny 3×3 convolution model and run it on a 5×5 input.
' Same pattern works for full image classifiers - swap the .onnx
' file and reshape the input.

DIM conv = AI.LOAD("bench/conv3x3.onnx")

' ONNX Conv expects the input as [batch, channels, h, w] and the
' weights as [out_channels, in_channels, kh, kw]. Edge-detect kernel below.
LET edge_kernel = [[[[-1.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, -1.0]]]]
LET image = [[[[0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]]]

LET out = AI.RUN(conv, [image, edge_kernel])

PRINT "Edge-detected response:"
PRINT out[0][0]
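
To sanity-check what a model like this should return, the same arithmetic can be reproduced by hand. The sketch below is plain Python rather than jdBasic, and it assumes the model is a single convolution with stride 1, no padding, and no bias; the contents of bench/conv3x3.onnx are not shown on this page, so treat that as a guess. The helper name conv2d_valid is made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and sum the products.

    This is cross-correlation, which is what the ONNX Conv operator computes.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Same values as the jdBasic listing, without the batch/channel dimensions.
image = [[0.0, 0.0, 1.0, 0.0, 0.0] for _ in range(5)]
row_kernel = [[-1.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, -1.0]]
# Transposed variant, for comparison.
col_kernel = [list(col) for col in zip(*row_kernel)]

row_response = conv2d_valid(image, row_kernel)
col_response = conv2d_valid(image, col_kernel)
print(row_response)
print(col_response)
```

Under these assumptions the kernel from the listing responds to horizontal strokes, so the vertical stroke in the image gives an all-zero 3×3 map, while the transposed kernel gives -3, 6, -3 across each row. If the real model pads its input, the top and bottom rows of the output would differ.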

See jdb/ai_demo.jdb for an interactive MNIST digit recogniser using the same two natives plus AI.SOFTMAX + AI.ARGMAX.
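
For a classifier such as the MNIST recogniser, the model's raw outputs (logits) are usually turned into probabilities with softmax, and the predicted class is the index of the largest one. A minimal Python sketch of those two steps follows. It illustrates the maths only and makes no claim about how AI.SOFTMAX and AI.ARGMAX are implemented or what argument shapes they accept; the logits are made-up values.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for the digits 0-9.
logits = [0.1, -1.2, 0.3, 4.0, 0.0, 1.5, -0.7, 0.2, 0.9, -2.0]
probs = softmax(logits)
digit = argmax(probs)
print(digit, round(probs[digit], 3))
```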

2) Local LLM chat

AI.LOAD_LLM takes a GGUF file plus a context size and a GPU-layer count. AI.CHAT_STREAM hands each new token to your callback as it's generated - perfect for typewriter-style UIs or for piping into a TUI.

jdb/mini_llm.jdb

DIM model = "models/Phi-3-mini-4k-instruct-q4.gguf"  ' ~2.4 GB
DIM ctx_size = 2048
DIM gpu_layers = 99   ' all on GPU if the build has CUDA/Vulkan/ROCm

DIM llm = AI.LOAD_LLM(model, ctx_size, gpu_layers)
DIM rc1 = AI.SET(llm, "system", "You are a concise assistant. Answer in 2 sentences.")
DIM rc2 = AI.SET(llm, "temperature", 0.5)
DIM rc3 = AI.SET(llm, "max_tokens", 120)

' AI.CHAT_STREAM hands each new token to the callback as it's generated.
' Return TRUE to keep going, FALSE to stop early.
FUNC OnToken(t)
  PRINT t;
  RETURN TRUE
ENDFUNC

PRINT "Q: Why is BASIC a good language to teach with?"
PRINT "A: ";
DIM full_response = AI.CHAT_STREAM(llm, "Why is BASIC a good language to teach with?", OnToken@)
PRINT

DIM rc4 = AI.FREE_LLM(llm)

AI.SET also takes "top_p", "top_k", "repeat_penalty", "grammar" (GBNF), and JSON-mode toggles. See jdb/ai_chat_demo.jdb for a full ImGui chat studio (history, streaming, tokenizer view).
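
As a rough picture of what the sampling settings control: temperature rescales the logits before softmax, top_k keeps only the k most likely tokens, and top_p keeps the smallest set of tokens whose cumulative probability reaches p. The Python sketch below applies that filtering to a toy distribution. It is a generic illustration, not llama.cpp's or jdBasic's code; the function name, the order in which the filters run, and the token scores are all assumptions.

```python
import math

def filter_candidates(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p.

    temperature must be greater than 0 in this sketch.
    """
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1 again.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}

toy_logits = {"the": 2.0, "a": 1.0, "BASIC": 0.5, "zebra": -3.0}
survivors = filter_candidates(toy_logits, temperature=0.5, top_k=3, top_p=0.9)
print(survivors)
```

With these toy numbers only "the" and "a" survive: top_k drops "zebra", and top_p stops once the first two tokens cover at least 90% of the probability mass.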

3) Retrieval-augmented Q&A

Bundle your own docs with a query interface in ~20 lines. AI.RAG_CREATE in TF-IDF mode needs no model files at all; AI.RAG_SEARCH returns raw chunk + score hits, and AI.RAG_QUERY folds the retrieved context into an LLM prompt and streams an answer.

jdb/mini_rag.jdb

' TF-IDF mode (llm_id=0, no embed_id): zero model files required.
DIM rag = AI.RAG_CREATE(0, 400, 40)

DIM rc1 = AI.RAG_ADD(rag, "jdBasic is a modern BASIC dialect with APL-style array ops, ImGui graphics, and a built-in LLVM-18 native compiler.", "overview")
DIM rc2 = AI.RAG_ADD(rag, "AI features include AI.LOAD for ONNX models, AI.LOAD_LLM for GGUF local language models via llama.cpp, and AI.RAG_QUERY for retrieval-augmented generation.", "ai")
DIM rc3 = AI.RAG_ADD(rag, "The jdBasic MCP server lets Claude or Cursor pair-program on a persistent VM with STOP/RESUME and LLVM compile-and-ship.", "mcp")
DIM rc4 = AI.RAG_ADD(rag, "Graphics commands include SCREEN, LINE / RECT / CIRCLE, SPRITE.LOAD, and a Tiled-map loader.", "graphics")

DIM hits = AI.RAG_SEARCH(rag, "How can I pair-program with Claude?", 2)

DIM i AS INTEGER
FOR i = 0 TO LEN(hits) - 1
  PRINT "── hit ", i + 1, " - source=", hits[i]{"source"}, "  score=", hits[i]{"score"}
  PRINT hits[i]{"text"}
NEXT i
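
TF-IDF retrieval can be pictured as follows: each chunk becomes a sparse vector of term weights, the query is vectorised the same way, and chunks are ranked by cosine similarity. The Python sketch below shows the idea on shortened versions of the chunks above. It is not jdBasic's implementation; the tokenizer, the smoothed IDF formula, and the helper names are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(tokens, idf):
    """Term frequency times IDF; terms never seen at index time are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def build_index(docs):
    """docs: {source: text}. Returns (idf, {source: tf-idf vector})."""
    tokens = {src: tokenize(text) for src, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    # Smoothed IDF so a term present in every chunk still gets a small weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = {src: vectorize(toks, idf) for src, toks in tokens.items()}
    return idf, vectors

def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, idf, vectors, top_k=2):
    q = vectorize(tokenize(query), idf)
    scored = [(cosine(q, vec), src) for src, vec in vectors.items()]
    return sorted(scored, reverse=True)[:top_k]

docs = {
    "overview": "jdBasic is a modern BASIC dialect with array ops and a native compiler.",
    "mcp": "The MCP server lets Claude or Cursor pair-program on a persistent VM.",
    "graphics": "Graphics commands include SCREEN, LINE, RECT, CIRCLE and a map loader.",
}
idf, vectors = build_index(docs)
hits = search("How can I pair-program with Claude?", idf, vectors)
print(hits)
```

The "mcp" chunk ranks first because it shares the rare terms "pair", "program", and "claude" with the query; "overview" trails with only the word "with" in common.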

Swap to dense embeddings by passing a 4th arg to AI.RAG_CREATE: the id of an embedding model loaded via AI.LOAD_EMBEDDINGS. Then AI.RAG_BUILD_INDEX builds an HNSW graph for sub-millisecond search over millions of chunks. Full example in jdb/rag_demo.jdb.
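
With dense embeddings, each chunk is a vector of floats and search means finding the vectors closest to the query vector. The Python sketch below does this exhaustively with cosine similarity; an HNSW index approximates the same ranking while visiting only a small part of the data, which is what makes it practical at scale. The three-dimensional vectors are invented for illustration, since real embedding models emit hundreds of dimensions or more.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query_vec, chunk_vecs, top_k=2):
    """Exact search: score every chunk, then keep the best top_k."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:top_k]

# Toy embeddings (made up).
chunk_vecs = {
    "mcp": [0.9, 0.1, 0.0],
    "graphics": [0.0, 0.2, 0.9],
    "ai": [0.6, 0.7, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
ranked = nearest(query_vec, chunk_vecs)
print(ranked)
```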

4) k-NN classifier

For text classification (ticket routing, intent detection, sentiment, …) you usually don't need to fine-tune a model - just embed your labelled examples and look up the nearest neighbours. The AI.CLASSIFIER_* family does this with an HNSW index under the hood.

Sketch - see `jdb/train_classifier.jdb` for the full demo

DIM emb = AI.LOAD_EMBEDDINGS("models/bge-m3-Q4_K_M.gguf", 2048, 99)
DIM clf = AI.CLASSIFIER_CREATE(emb)

' Add labelled examples - one per call, or AI.CLASSIFIER_ADD_BATCH for many.
DIM n1 = AI.CLASSIFIER_ADD(clf, "Scanner offline, red LED blinking", "hardware")
DIM n2 = AI.CLASSIFIER_ADD(clf, "Reset my password please", "account")
DIM n3 = AI.CLASSIFIER_ADD(clf, "Excel macro fails after the update", "software")
' ... more examples ...

DIM rc = AI.CLASSIFIER_BUILD_INDEX(clf)

DIM pred = AI.CLASSIFIER_PREDICT(clf, "My printer won't turn on", 5)
PRINT "Top label: ", pred{"label"}, " - confidence ", pred{"confidence"}
PRINT "Votes: ", pred{"votes"}

DIM rc2 = AI.CLASSIFIER_SAVE(clf, "tickets.clf")  ' reload next time with AI.CLASSIFIER_LOAD
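
Conceptually, a k-NN prediction retrieves the k stored examples closest to the query and lets their labels vote. The Python sketch below shows only the voting step, on hand-written (label, similarity) pairs. How jdBasic actually computes pred{"confidence"} is not documented on this page; the sketch assumes the simplest option, the winning label's share of the votes.

```python
from collections import Counter

def vote(neighbours):
    """neighbours: list of (label, similarity) for the k nearest examples."""
    votes = Counter(label for label, _ in neighbours)
    label, count = votes.most_common(1)[0]
    return {
        "label": label,
        "confidence": count / len(neighbours),  # assumed definition
        "votes": dict(votes),
    }

# Hypothetical 5 nearest neighbours for "My printer won't turn on".
neighbours = [
    ("hardware", 0.81),
    ("hardware", 0.77),
    ("software", 0.64),
    ("hardware", 0.60),
    ("account", 0.41),
]
pred = vote(neighbours)
print(pred)
```

One practical consequence of voting: if k is larger than the number of examples stored for a label, that label can never win outright, so keep the smallest class in mind when choosing k.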

Performance & tips

GPU offload

The third arg to AI.LOAD_LLM and AI.LOAD_EMBEDDINGS is the number of model layers to push onto the GPU. 99 exceeds the layer count of most models, so in practice it means "all of them".

Windows release builds ship with CUDA - pick a model small enough to fit in VRAM.
Linux builds inside the AMD Strix-Halo distrobox offload via Vulkan / RADV.
Tested Strix-Halo baseline: gpt-oss-20b runs at ~74 tokens/sec on a Radeon 8060S iGPU.
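
When the whole model does not fit in VRAM, a lower layer count keeps part of it on the CPU. A back-of-the-envelope way to pick a starting value is to divide the GGUF file size evenly across the layers and see how many fit after reserving room for the KV cache and scratch buffers. The Python sketch below does that arithmetic; the even-split assumption, the 1.5 GB reserve, and the 32-layer figure are rough guesses, so treat the result as a starting point to tune rather than a guarantee.

```python
def layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes):
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))

GB = 1024 ** 3
# Hypothetical: a 2.4 GB model with 32 layers, 1.5 GB reserved for cache/buffers.
on_4gb_card = layers_that_fit(2.4 * GB, 32, 4 * GB, 1.5 * GB)
on_2gb_card = layers_that_fit(2.4 * GB, 32, 2 * GB, 1.5 * GB)
print(on_4gb_card, on_2gb_card)
```

The estimate would then be passed as the third argument in place of 99.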

Where to grab models

jdBasic loads anything .gguf (LLMs) or .onnx (general inference).

huggingface.co - the GGUF hub. Start with Phi-3-mini Q4 (2.4 GB) for a fast baseline.
github.com/onnx/models - pretrained ONNX zoo.
For embeddings: nomic-embed-text-v1.5 or bge-m3.

Dotted-native syntax

Call dotted natives in function form: write DIM rc = AI.SET(id, "k", v), not AI.SET id, "k", v. In some contexts the statement form is parsed as a method call on a value.

Compile to a single EXE

jdbasic -c my_ai_script.jdb produces a standalone EXE via the built-in LLVM-18 backend. Pair with MCP-based pair coding and your AI co-builder can hand you a redistributable binary in the same turn.
