What each important control means, how it changes outcomes, and when to touch it.
Settings exist at two levels. System settings define global defaults and access rules. Runtime instance settings define how a specific model runs. Existing instances keep their own values when you change new-instance defaults.
System > Settings
Each row has two layers. The plain-English line is for new users. The operator-detail line gives the precise behavior and the rule for when to change the value.
Just changes how the app looks. Pick whatever's easiest on your eyes.
Light, dark, or system-follow. Does not change model output or runtime behavior.
Squishes the spacing so more fits on screen. Turn on if your screen feels cramped.
Reduces global UI padding. Useful on small displays or when monitoring many panes side-by-side.
Shows little help bubbles when you hover over things. Leave on until the app feels familiar.
Inline contextual help. Turn off once you know the interface to reduce visual noise.
Hides sections of the left menu you don't use, so the app looks simpler.
Shows or hides primary sidebar hubs. Hidden hubs still exist and remain reachable via route aliases or direct links.
Makes Ai Keeper start automatically when your Mac starts.
Enable for always-on proxy, channel connectors, scheduled automations, or server mode. Disable if you only run models on demand.
Remembers what you were typing so you don't lose it if the app closes.
Persists unfinished chat input to local storage. Disable if drafts should not persist on disk.
The voice-to-text model. Pick a small one for quick notes; a larger one if you record meetings or noisy audio.
Model repo used for voice messages, microphone capture, and the audio_transcribe tool. Larger models are slower but handle accents and noise better.
When a chat gets very long, the app automatically writes a short summary of the older parts so the conversation can keep going.
Summarizes older messages when the conversation approaches the context-window threshold. Keeps long sessions alive but can lose exact phrasing — export important transcripts before it kicks in.
How "full" a chat has to get before the auto-summary kicks in. Lower means it summarizes sooner.
Percent of the context window at which compaction triggers. Lower preserves memory for new content; higher preserves raw transcript longer.
Which model writes the summary. Leave blank to use the same model you're chatting with.
Optional model id used during compaction. A small local or cheap cloud model makes compaction fast and inexpensive without affecting chat quality.
A history of summaries the app has made, so you can see when and how much was condensed.
Tracked compaction events with token-reduction stats. Use to verify auto-compaction is firing as expected.
Connection And Access
This Mac is the one doing the heavy lifting — it runs the models and other devices can connect to it.
Local instances run here; the management API and proxy accept remote clients. Pick this on the machine with the most RAM, fastest disk, and your model storage.
This Mac is a remote control — it talks to another Mac that's running the models.
Defers runtime, RAG, and automation to a remote server. Cannot start local instances or host channels itself. Use on a laptop or secondary machine.
The number other apps use to reach Ai Keeper on this Mac. Like the door number for a building.
TCP port for the management API and web surfaces. Change only for port conflicts or network policy. Clients must update their URL after a change.
A password that lets other devices talk to your Ai Keeper. Don't share it.
Bearer secret required for remote management API calls and web dashboard access. Rotate immediately if exposed.
Lets you open Ai Keeper in a normal web browser, not just the app.
Serves the web UI at the management root URL. Disable to fully block browser access.
Lets other Macs running Ai Keeper connect to this one.
Permits remote Ai Keeper clients to use the management API. Disable when the host should be local-only.
Lets the web version actually run things on your computer (open files, run commands). Risky — only turn on for people you trust.
Permits web-initiated tool calls to execute locally. Keep off for read-only browsing; enable only for trusted users on trusted networks.
Only used in client mode — the address of the other Mac you're connecting to.
Client-mode-only. Copy verbatim from the server's Connection settings (includes scheme, host, port).
Only used in client mode — the password to talk to the other Mac. Get it from that Mac's settings.
Client-mode credential matching the remote server's Management API Key. Required for any remote use.
A web address other apps (like coding tools) can use to talk to your local AI as if it were ChatGPT.
OpenAI-compatible endpoint path, typically ending in /v1. Use in any external client that expects an OpenAI-shaped API.
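Any OpenAI client library can point at this endpoint. A minimal Python sketch, assuming the endpoint is http://127.0.0.1:8080/v1 and an instance named local-model is running (copy the real URL and model id from the app):

    from openai import OpenAI

    # Placeholder values; local servers typically ignore the API key.
    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused-locally")
    reply = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(reply.choices[0].message.content)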
Storage And Maintenance
The folder where downloaded AI models are kept. They're big — put this on a drive with lots of free space.
Library path scanned for model assets. Use a fast SSD; MLX and GGUF weights can be tens of gigabytes each.
Where Ai Keeper saves your settings, chats, and memory.
Application support location for state, sessions, and configuration. Back this up regularly.
Temporary copies of files you downloaded. Safe to clear if you need disk space.
Cached Hugging Face downloads. Deleting frees disk; assets re-download on next use.
Speed-up files the engine builds while running. Clear them if a model starts acting weird.
Runtime cache at ~/.omlx/cache. Rebuilds on next use; clear when stale state causes startup or generation issues.
A safe spot for your API keys and passwords so you never have to paste them into chat.
Encrypted local credential vault. Use exclusively — never inline secrets in prompts, custom args, or saved profiles.
Keeps the supporting tools Ai Keeper uses up to date by itself.
Updates Homebrew-managed packages (llama.cpp, oMLX, Python). Disable when reproducibility matters more than freshness.
A list of helper tools the app needs. If something is broken here, fix this first.
Required and optional runtime packages. Use Re-check, Install, Update, Repair, and Uninstall actions to recover local runtime health.
New Instance Defaults
How much the AI can "see" at once — your message, the chat history, attached files, and its reply, all measured in word-pieces. Bigger means it remembers more, but costs more memory.
Default token budget for new local instances. Consumed by system prompt, history, tools, retrieved chunks, and output. Larger windows increase RAM and prefill time.
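The budget arithmetic is simple enough to sketch; the token counts below are illustrative, not measured:

    # Everything the model sees, plus its reply, must fit inside the window.
    def fits(window, system_prompt, history, tools, retrieved, max_output):
        return system_prompt + history + tools + retrieved + max_output <= window

    # An 8192-token window with a 1024-token reply cap leaves 7168 tokens for input.
    print(fits(8192, 400, 5200, 300, 900, 1024))  # True: 7824 <= 8192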
A cap on how long each AI reply can be. Leave blank to let the model decide.
Per-response output length cap. Lower for chatty models that ramble; higher for code, long-form writing, or reports.
How long a model can sit unused before the app offers to unload it and free up memory.
Idle timer for memory-freeing prompts. Lower for aggressive RAM recovery; higher keeps models warm for a faster next response.
If your Mac is running out of memory, the app stops unused models automatically instead of asking you.
Stops instances idle past the threshold when macOS reports critical memory pressure. Enable on low-RAM systems; disable if you prefer manual control.
Runtime Instance Settings
A nickname for this running model. Pick something that tells you what it's for.
Human-readable label. Use names that reveal model, role, or route — e.g. qwen-coder-fast, llama-vision-vlm.
Which AI model this instance will run.
Local asset or provider model. MLX for Apple Silicon local serving; GGUF for llama.cpp; provider model for cloud/CLI routes.
Who's allowed to reach this model. localhost means only your Mac; broader values let other devices in.
Bind address. 127.0.0.1/localhost for local-only; 0.0.0.0 exposes to the network — never enable without an API key and firewall rule.
The number this instance listens on. Like a doorway number — only one model can use it at a time.
TCP listening port. Change to avoid conflicts when running multiple instances or to match a client's hardcoded expectation.
Which engine runs the model. Leave on Automatic unless a model needs a specific one.
Automatic, omlx, vllm-mlx, vmlx, mlx-lm, or llama.cpp. Override only when a format or feature requires a specific engine.
A button that picks good settings for you based on your Mac and the chosen model. Run it before fiddling manually.
Analyzer that chooses engine and settings from model metadata + hardware profile. Reset reapplies defaults and clears local overrides.
A slider that trades faster replies against better-quality replies.
Family-specific preset slider. Fast lowers latency; Best uses higher-effort sampling and reasoning when memory and time permit.
How much memory this instance is allowed to set aside for short-term speed-ups. Bigger is faster on repeat questions but uses more RAM.
Runtime cache budget. Raise for repeated prompts or sustained workloads; lower to protect system RAM headroom.
When you send a very long prompt, it's processed in pieces. This sets the piece size.
Splits long prefill into chunks. Useful for huge contexts; values too small add scheduling overhead.
How the instance handles several requests at once.
Continuous batching improves multi-user throughput; stricter policies make single-user latency more predictable.
How many recent prompts the instance remembers verbatim to speed up repeats.
Number of cached prompt-prefix entries. Helps system prompts, agent scaffolds, and templated workflows.
A memory cap for the prompt-remembering feature above.
RAM ceiling for prompt cache. Raise only if cache-hit rate matters more than headroom for new requests.
Advanced — splits the model across multiple chips. Most users leave this alone.
Parallel split degree. Backend- and hardware-dependent. Wrong values fail launch — verify with the engine's docs first.
Power-user box for typing raw command-line flags directly. Skip unless you know what you're doing.
Raw backend CLI arguments. Verify against the backend's --help output before saving — a single bad flag prevents launch.
Generation And Tool Behavior
How creative versus consistent the AI is. Low = predictable; high = surprising.
Sampling randomness. Low for deterministic code, extraction, and tests; higher for brainstorming or prose variety.
Limits how adventurous each word choice can be. Most people leave this alone.
Nucleus sampling cumulative probability. Lower narrows the candidate pool; higher allows broader alternatives.
Limits the number of words the AI considers each step. Default is usually best.
Hard cap on candidate token count. Lower stabilizes small models; the default is tuned for the model family.
Filters out very unlikely word choices so creative replies don't go off the rails.
Rejects tokens whose probability is too small relative to the top candidate. Keeps creative output coherent without making it rigid.
Discourages the AI from saying the same thing over and over.
Penalizes repeated token patterns. Raise for stuck loops; lower if it starts avoiding necessary syntax or exact terms in code.
Pushes the AI to use a wider vocabulary. Good for writing; risky for code.
Discourages tokens by usage frequency. Useful for prose; risky for code, citations, and exact identifiers.
Encourages the AI to bring up new ideas instead of staying on one.
Boosts unseen tokens to encourage topic shifts. Useful for ideation; avoid for tight, factual answers.
Locks in a "luck number" so the AI gives the same answer every time. Useful for testing.
PRNG seed for repeatability. Fix for regression tests; leave default for normal conversation.
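Most of the sampling rows above map onto standard fields in OpenAI-style requests. A hedged sketch (field support varies by backend, and extensions such as min_p are not part of the OpenAI schema):

    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused-locally")
    reply = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "List every date in the text below."}],
        temperature=0.2,        # low randomness: extraction, code, tests
        top_p=0.9,              # nucleus sampling pool
        frequency_penalty=0.3,  # damp verbatim repetition
        presence_penalty=0.0,   # no push toward new topics
        seed=42,                # fixed seed for repeatable test runs
    )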
Decides whether the AI is allowed to actually use tools (browse, run code, read files), and how strict to be about it.
Controls tool exposure and call interpretation. Auto for agent work; Restricted or Disabled for sensitive conversations.
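On the wire, tool exposure is simply the list of tool definitions attached to each request. An OpenAI-style sketch with a hypothetical tool name and schema; a Restricted or Disabled policy would shrink or empty this list:

    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",  # hypothetical tool
            "description": "Read a text file from disk.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]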
How the app understands tool requests from the model. If tool calls keep failing, try a different one.
Parser for model-family-specific tool-call syntax. Switch when calls are malformed — match the model family (Llama, Hermes, Mistral, etc.).
Lets the AI take more time to "think" before answering. Better answers, slower replies.
Enables/disables reasoning preambles. Use deep thinking for hard planning; disable for speed or models that degrade with decorated prompts.
Forces the AI to reply in a specific shape (like a form). Useful when another tool needs to read the answer.
Constrains response format to a JSON schema or grammar. Use for automation; disable for freeform chat.
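A sketch of a schema-constrained request, assuming the backend accepts OpenAI-style response_format (exact support varies by engine, and the schema here is hypothetical):

    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused-locally")
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "integer"},
        },
        "required": ["title", "priority"],
    }
    reply = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "File a ticket: the nightly build is broken."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "ticket", "schema": schema},
        },
    )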
What kind of job this instance does — chat, look-at-images, search, audio, etc. The app uses this to send the right work to the right model.
Declares specialization: LLM, VLM, Embedding, Reranker, Audio STT, Audio TTS, or Audio STS. Routing and UI selectors filter on this.
Backend-Specific Controls
These controls only appear on engines that support them. Most users never touch them.
KV cache quantization
Squeezes the model's short-term memory smaller so more fits in RAM. Stronger squeeze saves more memory but can hurt quality.
Compresses the key/value attention cache. Off preserves quality at full memory cost. Q8_0 ≈ half the memory with negligible quality loss. Q4_0 saves more but can affect generation stability.
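The memory claim is easy to sanity-check with the usual cache-size formula, a rough estimate that ignores block and scale overhead (the model shape below is hypothetical):

    # Keys plus values, per layer, per cached token.
    def kv_cache_gib(layers, kv_heads, head_dim, tokens, bytes_per_value):
        return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1024**3

    # 32 layers, 8 KV heads, head_dim 128, 32k cached tokens:
    print(kv_cache_gib(32, 8, 128, 32768, 2.0))  # fp16   -> 4.0 GiB
    print(kv_cache_gib(32, 8, 128, 32768, 1.0))  # ~Q8_0  -> 2.0 GiB
    print(kv_cache_gib(32, 8, 128, 32768, 0.5))  # ~Q4_0  -> 1.0 GiB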
Speculative prefill
Lets the model skip re-reading parts of the prompt it has already seen. Speeds up agent loops and very long chats.
Reuses a percentage of cached prefix state above a similarity threshold. Reduces redundant prompt processing for repeated agent scaffolds.
Speculative decoding
Uses a tiny "draft" model to predict ahead, then the main model checks. Faster replies — but you have to load two models.
Draft-model accelerated decoding. Improves throughput at the cost of additional VRAM/RAM and configuration complexity.
Adapters
Tiny add-on files that customize a model for a specific task without retraining the whole thing.
LoRA / adapter paths attached at load time. Verify base-model compatibility before launch — mismatches cause silent quality drops.
llama.cpp extras
A pile of advanced knobs that only the llama.cpp engine exposes. Skip unless you're tuning a specific GGUF model.
GGUF-oriented controls: Mirostat, rope scaling, tail-free sampling, dynamic temperature, slot similarity, concurrency. Only relevant on the llama.cpp backend.
Cloud providers
If you don't want to run a model locally, you can use one from a company like OpenAI, Anthropic, or Google instead. Most need an API key from that company; some (Claude Code, ChatGPT Codex) just use their CLI app.
Provider list: ChatGPT Codex, Claude Code, OpenAI, Anthropic, Google Gemini, Groq, Mistral, DeepSeek, xAI, OpenRouter, Together AI, Perplexity, Fireworks AI, Cerebras, Custom. CLI providers auth via their installed binary; the rest use API keys from the Secret Store.
Security Settings
A diary the app keeps of everything it did — every tool it ran, every file it touched. Useful for "wait, what did it do last night?" moments.
Append-only event log of tool calls, approvals, exec sessions, channel events, and identity changes. Export periodically for forensic review.
A health check that flags risky settings: open doors, missing passwords, plugins from unknown sources.
Posture report covering exposed ports, allowlist gaps, default-deny coverage, secret hygiene, and unsigned plugins. Treat warnings as blockers.
A locked vault for your passwords, API keys, and tokens. Always store them here.
Encrypted local credential vault for API keys, OAuth tokens, webhook secrets, and connector credentials. Never inline these in prompts or saved profiles.
Save a copy of all your settings, chats, and notes so you can restore them later if something breaks.
Snapshot, restore, and export of state, sessions, knowledge, prompts, and configuration. Restore wipes post-snapshot state — don't roll back past the creation of channel webhooks that are still in use.
A safety net that catches sensitive info (like phone numbers or social security numbers) before it leaves your machine — and catches sneaky inputs trying to trick the AI.
Input/output filters for PII, prompt-injection patterns, and policy keywords. Layer over Tool Policy and approvals; not a sole defense.
A guest list — only people on the list can DM the assistant on connected channels (Telegram, Slack, etc.).
Per-channel allowlist of senders permitted to direct-message the assistant. Configure welcome/rejection messages; enable Auto-Reply only for paired senders.
Connectivity Settings
A backup plan: if your first AI provider fails (down, rate-limited, slow), the app automatically tries the next one on the list.
Ordered list of providers/instances per route, with cooldowns and key rotation. Keep a local fallback at the tail of the chain for offline resilience.
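The routing logic amounts to ordered fallback. A minimal sketch of the idea, not the app's actual implementation:

    def first_healthy(providers, prompt):
        # Try each provider in order; fall through on errors or rate limits.
        for call in providers:
            try:
                return call(prompt)
            except Exception:
                continue  # a real router also honors per-provider cooldowns
        raise RuntimeError("all providers in the chain failed")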
Settings for letting other devices reach this Mac. Don't turn on without setting a password (the API key) first.
Server-mode reachability profile. Always pair with a Management API Key — never enable without authentication.
Advanced — lets one Ai Keeper server serve different groups (teams, apps, batch jobs) without their traffic stepping on each other.
Server-side routing lanes that group endpoints by purpose, priority, or tenant. Use to isolate batch traffic from latency-sensitive interactive traffic.
Lets your AI agents talk to other AI agents using a shared language. Advanced — only relevant if you're hooking up multiple agent systems.
Agent Communication Protocol surface for structured agent-to-agent messages. Configure inbound auth and allowlists before exposing.
Connects multiple Macs running Ai Keeper into a small network so they can share work.
Peer-to-peer discovery and work-routing across hosts. Signed pairing only; keep per-peer roles narrow — a mesh is not a substitute for proper auth.
Pair an iPhone, iPad, or other Mac with this server, like pairing a Bluetooth speaker.
Trust handshake for iOS/iPadOS or other Mac clients via QR or short code. Revoke a pairing immediately if the device is lost.
Web addresses that other services can ping to trigger something here, or that Ai Keeper pings to notify other services.
Inbound HTTP triggers and outbound notification targets. Each webhook holds its own signing secret — rotate after any audit-trail anomaly.
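Verifying a signed inbound webhook usually follows the HMAC pattern below. The header name and encoding are the app's own, so treat both as placeholders:

    import hashlib
    import hmac

    def verify(secret: str, raw_body: bytes, signature: str) -> bool:
        # Hex-encoded HMAC-SHA256 of the raw body is the common convention.
        expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)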
Media & Devices Settings
Mic input, spoken replies, and how voice gets handed off when chatting in places like Slack.
Microphone capture, TTS playback, and channel voice-handoff templates. WebChat voice is local; external-channel voice uses the configured handoff template.
Lets the AI see your screen. Be careful — share specific windows, not the whole desktop, so notifications don't leak.
Permission and pipeline for sharing the screen with vision-capable models. Limit to specific windows or displays — full-screen capture exposes notifications and unrelated apps.
Lets the AI use your webcam. Off by default for privacy.
Webcam capture for VLM and presence flows. Keep disabled by default; enable per-session.
The app's display language and the language the AI replies in by default.
App localization and default assistant language hint. Models that respect locale hints will switch reply language accordingly.
Operator Settings
Bundles of tools you can give an AI agent in one go, like a "web tools" or "file tools" pack — instead of toggling each one.
Named bundles of individual tools (e.g. fs-read, web, shell-safe). Assign groups to agent roles for cleaner audit and rename-safety.
Starter files an agent can drop into a folder when starting work — like project boilerplate.
File scaffolds dropped at task start (lint configs, README skeletons, .editorconfig). Keep templates idempotent.
Persistent terminal windows the AI can keep using, instead of opening a fresh one for every command. Faster, but cap the lifetime so they don't get stale.
Persistent shell/REPL handles the model can attach to. Avoids spawn-per-call overhead. Limit per-session lifetime — an idle session that survived a long compaction may carry stale state.
A document describing how your default assistant should sound — its voice, tone, what it cares about. Different from a system prompt because it sticks across all chats.
Long-form persona document applied outside per-conversation system prompts. Edit here for permanent voice; edit the agent's role prompt for task-specific behavior.
An address book the assistant can use — names, groups, tags, and how to reach each person.
People, groups, tags, and reachability metadata used by channels and presence routing. Treat as personal data; back up encrypted, scrub on export.
Connects to your Obsidian notes vault so the assistant can read (and optionally write) to it.
Vault path, note search, and import. Read-only by default. Enable write only on an isolated vault — never your primary notes.
Knowledge & Memory Settings
Decides whether the AI remembers things about you across separate chats and days.
Survival of memory facts across sessions/devices. Enable for personal-assistant flows; disable for task-specialist or audit roles.
Lets the AI save things from your conversation as long-term notes by itself. Turn on only if you're OK with what's said being remembered.
Permits the model to write durable facts during chat. Each write is captured in the Audit Trail. Enable only on chats safe to persist.
Which model is in charge of indexing your documents for search. A small, dedicated one is best.
Runtime that produces embeddings for Documents and Wiki search. Use a small, fast embedding model — not the main chat model.
When documents get indexed, they're cut into pieces. This sets the size of those pieces and how much they overlap. Smaller for precise lookups, bigger for context.
Document split parameters. Smaller chunks favor precise retrieval; larger preserve narrative continuity. Re-index after any change.
How many document pieces the AI gets to read when you ask it a question. More = thorough but noisy; less = focused but might miss facts.
Number of retrieved chunks per query. Raise when answers miss facts; lower when irrelevant context floods the prompt.
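A sketch of what size and overlap mean during splitting. This version counts characters for clarity; real indexers usually count tokens:

    def split(text: str, size: int = 800, overlap: int = 100) -> list[str]:
        # Sliding window: each chunk repeats the last `overlap` characters of
        # the previous one, so facts on a boundary are not cut in half.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]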
How often the AI runs in the background to "think about" your notes and journal new ideas. Treat the output like suggestions, not facts.
Idle-synthesis schedule. Output is suggestion-grade, not authoritative memory. Disable on shared or untrusted workstations.
Automation & Workspace Settings
Decides whether the AI asks for permission before doing things. Pick "Confirm" until you're sure a workflow is safe.
Per-tool / per-agent gating: Auto, Confirm, Block. Default to Confirm for anything touching files, network, or external services; lift to Auto only after a flow has run clean under Confirm.
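The three modes reduce to a simple gate. An illustrative sketch, not the app's code:

    def allowed(action: str, mode: str) -> bool:
        if mode == "auto":
            return True
        if mode == "block":
            return False
        # "confirm": ask the operator before every run
        return input(f"Allow {action}? [y/N] ").strip().lower() == "y"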
Permanent rules the AI must follow, like "never share my address" or "always reply in English". They stick around forever.
Persistent instructions applied globally or per agent. Use for invariants, not transient task details.
Auto-run actions on events — like "every time a Slack message arrives, log it" or "if there's an error, send me a notification".
Event-triggered actions (pre-prompt, post-tool, on-channel-message, on-error). Each can inject context, run a skill, log memory, call a webhook, or exec. Audit before enabling on production channels.
How often a recurring AI task wakes up to do something. Always set a stop condition so it doesn't run forever.
Recurring agent-turn schedule. Always attach explicit max-iterations — heartbeats without a stop condition burn tokens silently.
A friendly way to schedule recurring jobs ("every Monday at 9am"). Always check the next-run preview to make sure it's right.
Visual cron-expression builder. Validate the next-fire preview before saving — typo'd cron strings are a leading cause of silent automation failures.
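"Every Monday at 9am" is the 5-field expression "0 9 * * 1". If you want to double-check a preview outside the app, the third-party croniter package computes next-fire times from the same syntax:

    from datetime import datetime
    from croniter import croniter  # third-party: pip install croniter

    expr = "0 9 * * 1"  # minute, hour, day-of-month, month, day-of-week
    print(croniter(expr, datetime.now()).get_next(datetime))  # next Monday 09:00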
Long jobs that run in the background. Always wire up a notification so you know when they finish.
Long-running non-interactive jobs. Cap concurrency. Always attach a notification or webhook so completion is observable.