How Smart Is a Local LLM, Really? My Weekend Experiment on the MacBook M4 Max

· 3 min read ·
llm local-llm apple-silicon mlx ollama ai

I’ve been wanting to set up a locally hosted LLM on my MacBook Pro (M4 Max, 36 GB RAM) — partly out of curiosity, and partly for those times when I’m travelling without internet but still want AI assistance.

This isn’t something I’ll use often (< 5% of the time), but it was a fun experiment worth sharing for anyone considering running models locally.


Setting It Up: Local Hosting Options

The most common approach is Ollama — a free, open-source app that makes it easy to run large language models (LLMs) on your computer. Behind the scenes, it uses llama.cpp — a C++ library that efficiently runs quantized models (in GGUF format) on both CPUs and GPUs.


For Apple Silicon Users: Enter MLX

Macs have a unique advantage thanks to Apple Silicon’s unified memory architecture, where the CPU, GPU, and Neural Engine share the same memory pool.

Apple’s MLX framework takes full advantage of this. It’s a lightweight array framework (C++ / Python) that lets models run directly on Apple Silicon without copying data between CPU and GPU.

In practice, MLX can act as an alternative runtime to llama.cpp — newer, Apple-native, and optimized for M-series chips like the M4.

Why it matters: on the M4 Max, unified memory lets the system allocate RAM flexibly between the CPU and GPU, so larger quantized models can run without shuttling weights between separate memory pools.


Choosing the Interface: LM Studio

Rather than building a custom interface, I went with LM Studio — a desktop app that provides a ChatGPT-like environment for local models.

You can browse and download models directly inside it, choose between GGUF (llama.cpp) or MLX formats, and start chatting immediately. LM Studio also lets you tweak parameters such as the system prompt, context window, and temperature/top-K sampling.
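To make those sampling knobs concrete, here's a toy sketch of what temperature and top-K do to a model's next-token distribution. The token scores below are made up for illustration; a real inference engine does this over logits for the full vocabulary:

```python
import math
import random

def sample_top_k(logits, temperature=0.8, k=3, rng=None):
    """Toy illustration of temperature + top-K sampling over raw token scores."""
    rng = rng or random.Random(0)
    # Top-K: keep only the K highest-scoring tokens.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Temperature rescales logits: <1 sharpens the distribution, >1 flattens it.
    scaled = [(tok, score / temperature) for tok, score in top]
    # Softmax over the surviving tokens (shifted by the max for stability).
    m = max(s for _, s in scaled)
    exps = [(tok, math.exp(s - m)) for tok, s in scaled]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Draw one token according to those probabilities.
    r, acc = rng.random(), 0.0
    for tok, p in probs:
        acc += p
        if acc >= r:
            return tok
    return probs[-1][0]

logits = {"the": 5.1, "a": 4.7, "banana": 1.2, "qux": -3.0}
print(sample_top_k(logits, temperature=0.8, k=2))  # "the" or "a"; top-K cuts the rest
```

Lowering the temperature makes the highest-scoring token win more often; raising K lets unlikely tokens back into the draw — which is exactly the trade-off LM Studio's sliders expose.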


Models and Performance

There’s a wide range of open-source models available in different sizes and quantization levels.

On my MacBook Pro M4 Max (36 GB RAM), models under ~30 billion parameters run smoothly once quantized (e.g., Q4 or Q5 precision). Even small multimodal (vision) models can run locally — letting you paste an image and ask questions about it.
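A quick back-of-envelope calculation shows why ~30B parameters is a comfortable ceiling on 36 GB. The bits-per-weight figures are rough assumptions (real GGUF quants such as Q4/Q5 variants land somewhere around 4.5–5.5 bits once overhead is included), and this counts weights only, not the KV cache or the OS:

```python
def approx_model_size_gb(params_billion, bits_per_weight):
    """Back-of-envelope weight footprint: parameter count x bits per weight, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumed bits/weight per format -- rough figures, not exact quant specs.
for label, bits in [("FP16", 16), ("Q5-ish", 5.5), ("Q4-ish", 4.5)]:
    print(f"30B @ {label}: ~{approx_model_size_gb(30, bits):.0f} GB")
```

At FP16, 30B parameters need ~60 GB and won't fit; at Q4-ish precision the same model drops to ~17 GB, leaving headroom for the KV cache and the rest of the system — hence "under ~30B, quantized" as the sweet spot.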


Offline Capabilities and Limitations

I experimented with basic RAG (retrieval-augmented generation) using local embeddings, along with simple “memory” and “file access” features.
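This isn't how LM Studio implements RAG internally — just a minimal, self-contained sketch of the retrieval half, with toy bag-of-words vectors standing in for a real local embedding model:

```python
import math
from collections import Counter

docs = [
    "MLX is Apple's array framework for Apple Silicon",
    "GGUF is the quantized model format used by llama.cpp",
    "Unified memory lets the CPU and GPU share one pool",
]

# Fixed vocabulary built from the corpus; unknown query words are simply ignored.
vocab = sorted({w for d in docs for w in d.lower().split()})

def embed(text):
    """Toy bag-of-words vector (a real setup would use a local embedding model)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

index = [(d, embed(d)) for d in docs]

def retrieve(query, top_n=1):
    """Rank stored chunks by similarity to the query; the winners get pasted into the prompt."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [d for d, _ in ranked[:top_n]]

print(retrieve("what format does llama.cpp use")[0])
# -> GGUF is the quantized model format used by llama.cpp
```

The full pipeline then stuffs the retrieved chunks into the model's context window before generating — all of which works offline, but keeping the index fresh and persistent is exactly where the cracks below show up.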

While core chat functionality works offline, a few things quickly stand out:

  • No internet context: the model can’t access live data or online tools.

  • Fragile memory: persistent, cross-session memory (vector stores, embeddings, saved context) is hard to replicate fully in an offline setup.

  • Ecosystem gap: much of what makes modern LLMs powerful lies in the surrounding systems (agents, skills, toolchains), not just the base model, and that layer is far less mature off the shelf in a local setup.


Key Takeaways

  • A local LLM setup is surprisingly capable for short offline sessions (e.g., 3–4 hours on a plane).

  • The experiment highlights how much the invisible infrastructure — memory, retrieval, tool integration — shapes the experience.

  • It’s like chatting with a smart but isolated person: competent, but forgetful and cut off from the live world.

Running everything offline gave me a new appreciation for the orchestration and context management that power today’s cloud LLMs.


Final Thought

If you’ve tried running models locally — I’d love to hear about your setup, what models worked best, and any performance surprises.