They said unquantized local AI was impossible on budget phones. We got a 2.3GB FP32 model running locally on a €120 Galaxy A25 CPU. No GPU, no NPU, uses less RAM than Chrome

TheOpenMachineAI · May 4, 2026, 9:10am

The current meta in local AI is that you have to quantize. Big Tech is telling us that to run anything on the edge, we need to compress 2B+ models down to 4-bit, sacrifice the signal-to-noise ratio, and rely on flagship NPUs or Apple Silicon just to survive the memory bandwidth bottleneck.

We at Open Machine didn’t buy it. So we built a 245M parameter model from scratch, kept it in raw uncompressed 32-bit float (FP32), and ran it on the absolute worst hardware we could find: a two-year-old plastic Samsung Galaxy A25.

Attached is the raw screen recording. Airplane mode on.

The Specs:

Model: Open Machine 245M (Trained from scratch on 20B tokens)
Weights: 2.3GB pure FP32 (ONNX export)
Hardware: €120 Samsung A25 (Exynos 1280)
Compute: CPU ONLY. GPU is off. NPU is off.
RAM: ~4.4GB used (literally lighter than opening a few Chrome tabs).
Thermals: 33.3°C. Zero battery drain. No OOM crashes.
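
For anyone who wants to sanity-check the spec list, here is a quick back-of-the-envelope sketch of how raw storage scales with precision. It only uses numbers from the post (245M parameters, a 2.3GB file) plus the standard bytes-per-value for each format. The helper names `weights_gb` and `values_in_file` are made up for illustration and are not from the Open Machine code.

```python
# Back-of-the-envelope storage math. Helper names are illustrative only.
BYTES_PER_VALUE = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(num_values: float, dtype: str) -> float:
    """Raw storage in GB (1 GB = 1e9 bytes) for num_values numbers of the given dtype."""
    return num_values * BYTES_PER_VALUE[dtype] / 1e9


def values_in_file(file_gb: float, dtype: str) -> float:
    """How many numbers of the given dtype fit in a file of file_gb gigabytes."""
    return file_gb * 1e9 / BYTES_PER_VALUE[dtype]


if __name__ == "__main__":
    for dtype in BYTES_PER_VALUE:
        print(f"245M values as {dtype}: {weights_gb(245e6, dtype):.2f} GB")
    print(f"FP32 values in a 2.3GB file: {values_in_file(2.3, 'fp32') / 1e6:.0f}M")
```

By this arithmetic, 245M FP32 values take about 0.98GB, while a 2.3GB file has room for roughly 575M FP32 values. The post does not break down what the 2.3GB ONNX export contains beyond the 245M counted parameters.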

The Elephant in the Room: 0.17 tokens/s. Yes, it is slow as hell right now. If it were running at 50 tok/s on a budget CPU in FP32, you guys would immediately (and rightfully) call BS and accuse me of hiding a 4-bit quantized lookup table or using an API.

This speed is because it’s a heavily unoptimized Python loop forcing raw 32-bit math sequentially through a budget mobile CPU. We deliberately handicapped it to prove a point about physics: The memory wall is a routing problem, not a compression problem. If a budget Exynos chip can physically route a 2.3GB FP32 graph without the OS killing it for memory or melting the battery, the architecture works. Writing a C++ kernel and dropping to FP16 will make it fly later.
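
To make the 0.17 tokens/s figure concrete, here is a minimal sketch of how a strictly sequential, token-by-token Python loop is typically timed. `fake_next_token` is a placeholder, not the Open Machine model; you would swap in a real forward pass to measure an actual device.

```python
import time
from typing import Callable, List, Tuple


def fake_next_token(context: List[int]) -> int:
    """Placeholder for one forward pass; a real model call would go here."""
    return (sum(context) + 1) % 50_000


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int) -> Tuple[List[int], float]:
    """Greedy, strictly sequential decoding: one model call per generated token."""
    tokens = list(prompt)
    start = time.perf_counter()
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    elapsed = time.perf_counter() - start
    return tokens, elapsed


def seconds_per_token(tokens_per_second: float) -> float:
    """Convert a throughput figure into latency per token."""
    return 1.0 / tokens_per_second


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, elapsed = generate(fake_next_token, prompt, max_new_tokens=20)
    print(f"{len(out) - len(prompt)} tokens in {elapsed:.6f}s")
    # At the speed quoted in the post:
    print(f"0.17 tok/s = {seconds_per_token(0.17):.1f} s per token")
    print(f"100-token reply = {100 * seconds_per_token(0.17) / 60:.1f} min")
```

At the quoted 0.17 tokens/s, each token takes about 5.9 seconds, so a 100-token reply would take roughly 10 minutes.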

How it fits without OOMing: We didn’t compress the weights; we fixed the network topology. We’re using what we call a “Synthetic Neural Engine” architecture. Instead of vanilla dense transformers (this is still a transformer, just done our way), where you’re wasting compute on 90% static noise, we proceduralize the weights. We store a semantic dictionary of primitives and a per-context recipe that dynamically reconstructs the exact full weight matrix W. We calculate exact attention but store a compact state. Basically, we only compute the pure signal.
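
The post does not publish the actual math, so the following is only a toy illustration of the general idea of keeping a shared dictionary of primitives plus a small recipe, and rebuilding a weight matrix on demand. It uses one generic technique (a linear combination of fixed basis matrices). It is an assumption for illustration, not a description of the Synthetic Neural Engine, and every name and size in it is made up.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_dictionary(num_primitives: int, rows: int, cols: int, seed: int = 0) -> List[Matrix]:
    """Shared dictionary: num_primitives fixed basis matrices of shape rows x cols."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
            for _ in range(num_primitives)]


def reconstruct(dictionary: List[Matrix], recipe: List[float]) -> Matrix:
    """Rebuild W = sum_k recipe[k] * dictionary[k] only when it is needed."""
    rows, cols = len(dictionary[0]), len(dictionary[0][0])
    w = [[0.0] * cols for _ in range(rows)]
    for coeff, prim in zip(recipe, dictionary):
        if coeff == 0.0:
            continue  # primitives with a zero coefficient are skipped entirely
        for i in range(rows):
            for j in range(cols):
                w[i][j] += coeff * prim[i][j]
    return w


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product using the reconstructed weights."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


if __name__ == "__main__":
    dictionary = make_dictionary(num_primitives=4, rows=3, cols=3)
    recipe = [0.5, 0.0, -1.0, 0.0]  # hypothetical per-context coefficients
    w = reconstruct(dictionary, recipe)
    print(matvec(w, [1.0, 2.0, 3.0]))
```

In a scheme like this, each recipe is only a handful of floats, but the dictionary itself still has to be stored, so any memory saving depends on the same primitives being reused across many matrices or contexts.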

The Benchmarks: Even though it was trained on only 20B tokens (DCLM subset) for less than €1,000, this 245M model is already hitting 66% on PIQA and matching Meta’s 350M logic.

We built it anyway over the weekend and dropped this APK in their inbox today.

Stop letting Big Tech convince you that you need $7 Trillion and a massive server farm to solve edge logic.

Roast our Python loop, ask me about the math, and let me know what you think. I’ll drop the Hugging Face benchmark links in the comments so you can download it and test it yourself.

automajicly · May 7, 2026, 11:47pm

I love it. Never say die. Just yesterday I released Q8 and Q4 quantized versions of an uncensored, abliterated, local and private Qwen 2.5 1.5B that is iPhone-compatible, and I’m currently running it on my own iPhone 13. I also built my entire security system around an autonomous AI agent loop via MCP. I released both here and on GitHub, and my iPhone-compatible models went from zero downloads on day one to 341 downloads within 24 hours, and rising as we speak. And “budget” isn’t even the word for it: I literally did all this for free, with only two months of experience in coding architecture and cyber security. Keep going, we’re gonna change the world one autonomous AI agent at a time.

TheOpenMachineAI · May 8, 2026, 11:01am

Hey Uzer-namo, I checked out your 4PDA thread.

Your post explicitly states your model is: “работающая на моем сервере с RTX 3050” (running on my server with an RTX 3050).

Your 6MB app is a thin client/API wrapper sending network requests to a desktop GPU. That is cloud/remote hosting, not Edge AI. If you put your phone in Airplane Mode, your app stops working.

My video shows a 2.3GB uncompressed FP32 model running locally, on-device, in Airplane Mode, using zero network, processing exclusively on a budget mobile CPU. We are solving the physical memory bandwidth wall of mobile silicon without relying on external servers or quantization.

There is no quantization, that is the point: the model is literally 2.3GB running on the phone. It is not a wrapper around some other model or anything similar. The model and the APK are one thing, not separate: when you install the APK, you get the model as well. Turn off the internet or anything else and it still works. The slow part is because we are working with a heavy Python loop and unoptimized code. It’s a POC, not the final product.

API wrappers are great, but bridging a network ping to a desktop GPU is not the same sport as running raw floating-point math directly on mobile silicon.

It’s a nice project tho… I love it!

TheOpenMachineAI · May 8, 2026, 11:45am

Yeah, the thing is that this is not a standard transformer but a custom version of one. We rewrote the whole stack, and together it works more like a biological brain than a typical AI model. It uses different math for the multiplications and for building the matrices, so there is a whole new architecture behind it; we will call it Post-Transformers or Synthetic Neural Engine. It’s not the same thing.

Yeah, we also ran a 7B model on an ordinary 2GB graphics card with almost 4k tokens, but we are still experimenting; as a startup we are in an early phase.

TheOpenMachineAI · May 8, 2026, 6:50pm

I completely agree that standard dense transformers hit a Thermal Wall and an Energy Tax. That is exactly why we had to throw away the vanilla transformer architecture and build the Synthetic Neural Engine.

And you are absolutely right: to fit a standard model into RAM, you have to quantize it to death. But that’s exactly what my post addresses: we didn’t quantize. The video proves the model is running in pure, uncompressed FP32, pulling only ~4.4GB RAM total, with the CPU sitting at a cool 33.3°C and zero battery drain.

We bypassed the thermal and memory walls not by offloading the compute to a desktop GPU over Wi-Fi, but by changing the fundamental math of the matrix multiplications so the mobile CPU only computes the pure signal.

Client-server orchestration (‘Nitro-nodes’) is a great workaround for standard models, cool! But our goal isn’t to work around the mobile hardware; it’s to write better math so the mobile hardware can actually do the thinking itself.

lulavc · May 9, 2026, 10:56am

Nice work.
