How to Run Neural Networks on ESP32: A Practical Guide to TinyML Deployment

Shrinking Neural Networks for Microcontrollers (Quantization & Pruning)
My Hands-On Journey with TinyML on ESP32
The Core Deployment Pipeline: From Python to C++ Array
Writing and Optimizing Your ESP32 TinyML Code
Frequently Asked Questions

Shrinking Neural Networks for Microcontrollers (Quantization & Pruning)

When you try to run a neural network on a microcontroller like the ESP32, your biggest enemy isn't accuracy; it's RAM. A typical ESP32 has about 520 KB of SRAM, and not all of it is available to your application, while a standard deep learning model built on a computer can easily be hundreds of megabytes. To bridge this gap, we have to shrink our models before they ever touch the hardware. This is where optimization techniques like quantization and pruning come in.

Quantization is the process of converting your model's weights and activations from 32-bit floating-point numbers (FP32) down to 8-bit integers (INT8); in TensorFlow Lite's INT8 scheme, biases are kept as 32-bit integers. It sounds like a massive downgrade, but each weight now takes one byte instead of four, which cuts the model's storage footprint by roughly 75% while usually costing very little accuracy. Just as importantly, microcontrollers are fast at integer math. Many chips don't even have a hardware floating-point unit, so FP32 math has to be emulated in software and runs slowly. Switching to INT8 typically gives your microcontroller a substantial speed boost.

Pruning takes optimization a step further by looking for connections inside the network that don't do much. If a weight is very close to zero, it contributes almost nothing to the final decision, so we can usually snip it out (set it to zero, ideally followed by a little fine-tuning) with minimal effect on the output. When you combine quantization and pruning, you can turn a heavy, power-hungry model into a lean, fast-running binary file ready for silicon.
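To make the arithmetic concrete, here is a minimal C++ sketch of affine INT8 quantization and magnitude pruning. It is an illustration only: in practice the TensorFlow Lite Converter chooses the scale and zero point for you, and the names `QuantParams`, `choose_quant_params`, `quantize`, `dequantize`, and `prune_by_magnitude` are made up for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Affine INT8 parameters: real_value ~= scale * (q - zero_point).
struct QuantParams {
    float scale;
    int zero_point;
};

// Pick scale and zero point so that [min_val, max_val] maps onto [-128, 127].
// The range is widened to include 0.0 so that zero stays exactly representable.
QuantParams choose_quant_params(float min_val, float max_val) {
    min_val = std::min(min_val, 0.0f);
    max_val = std::max(max_val, 0.0f);
    float scale = (max_val - min_val) / 255.0f;
    if (scale == 0.0f) scale = 1.0f;  // all-zero tensor: any scale works
    int zero_point = static_cast<int>(std::lround(-128.0f - min_val / scale));
    zero_point = std::max(-128, std::min(127, zero_point));
    return {scale, zero_point};
}

// FP32 -> INT8, saturating at the ends of the INT8 range.
std::int8_t quantize(float x, const QuantParams& p) {
    long q = std::lround(x / p.scale) + p.zero_point;
    q = std::max(-128L, std::min(127L, q));
    return static_cast<std::int8_t>(q);
}

// INT8 -> approximate FP32.
float dequantize(std::int8_t q, const QuantParams& p) {
    return p.scale * static_cast<float>(q - p.zero_point);
}

// Magnitude pruning: zero every non-zero weight whose absolute value is
// below the threshold. Returns how many weights were zeroed.
std::size_t prune_by_magnitude(std::vector<float>& weights, float threshold) {
    std::size_t pruned = 0;
    for (float& w : weights) {
        if (w != 0.0f && std::fabs(w) < threshold) {
            w = 0.0f;
            ++pruned;
        }
    }
    return pruned;
}
```

With a range of [-1, 1] the scale works out to 2/255, so a round trip through INT8 moves an in-range value by at most about 0.004. Keep in mind that zeroing weights does not shrink a dense array by itself; the saving only shows up once the model is compressed or stored in a sparse format.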

A flowchart showing the model optimization process, starting from a large FP32 TensorFlow model, going through quantization and pruning, and ending as a lightweight INT8 TFLite Micro model.

My Hands-On Journey with TinyML on ESP32

Honestly, I've tried this myself quite a few times, and the learning curve can be steep if you don't know what to expect. I remember building a simple voice-activated light switch using an ESP32-CAM board and an external microphone. My first attempt at running the keyword recognition model resulted in immediate crashes. I kept getting the dreaded out-of-memory error because I didn't optimize my tensor arena size. I decided to try both Edge Impulse and a raw TensorFlow Lite Micro setup to see which worked better. Edge Impulse is incredibly friendly for beginners because it handles the memory allocation and code generation behind the scenes. However, if you want full control over your heap and need to squeeze every last byte of performance out of your ESP32, compiling your own custom TensorFlow Lite Micro model is the way to go. Once I manually tuned my model down to 8-bit integers and hand-allocated the memory buffer, the ESP32 processed the audio and recognized the wake word in less than 50 milliseconds. Seeing that onboard LED light up instantly without any internet connection was a massive win.

The Core Deployment Pipeline: From Python to C++ Array

To deploy a model, you don't actually write neural network code in C++. Instead, you train your model in a standard Python environment using tools like Keras or TensorFlow. Once you are happy with how the model performs, you save it, convert it, and embed it as a static array directly into your firmware.

Here is how the transition works. After training your model in Python, you run it through the TensorFlow Lite Converter, which outputs a `.tflite` file. A bare-metal microcontroller project usually has no file system from which to open and read a `.tflite` file at runtime. To get around this, we use a command-line utility called `xxd` to convert that binary model file into a C++ header file. The command is straightforward:

xxd -i model.tflite > model.h

This generates a large `unsigned char` array holding your model's bytes as hex literals, plus an `unsigned int` variable with its length. `xxd` derives both names from the file name, so here they are `model_tflite` and `model_tflite_len`. The generated array is not `const`, so add the `const` keyword yourself. That tells the ESP32 toolchain to keep the model in the chip's flash memory rather than loading it into the precious SRAM at boot time, which keeps your RAM free for the tensor arena and the rest of your code.
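As a rough sketch, the edited header could look like the following. The eight bytes are placeholders rather than a real model (real `xxd` output runs to thousands of entries), the `alignas(16)` is a precaution I am assuming is useful rather than something `xxd` emits, and `looks_like_tflite` is a hypothetical helper. It relies on the fact that a `.tflite` flatbuffer carries the identifier `TFL3` in bytes 4 to 7.

```cpp
#include <cstddef>
#include <cstring>

// Placeholder bytes standing in for the real xxd output. Only bytes 4..7
// ("TFL3", the TFLite flatbuffer identifier) reflect the real file format.
// `const` is added by hand; `alignas(16)` is an assumed precaution.
alignas(16) const unsigned char model_tflite[] = {
    0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
};
const unsigned int model_tflite_len = sizeof(model_tflite);

// Hypothetical helper: a cheap sanity check that the embedded array still
// looks like a TFLite flatbuffer, by comparing bytes 4..7 against "TFL3".
bool looks_like_tflite(const unsigned char* data, std::size_t len) {
    return len >= 8 && std::memcmp(data + 4, "TFL3", 4) == 0;
}
```

Calling `looks_like_tflite(model_tflite, model_tflite_len)` once at startup is an inexpensive way to catch a truncated or mis-pasted array before handing it to the interpreter.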

Screenshot of a terminal running the xxd command to convert a .tflite file into a C++ header file, alongside a snippet of the generated hex array code.

Writing and Optimizing Your ESP32 TinyML Code

Now that the model is sitting in your flash memory as a C++ array, you need to write the ESP32 code to run it. This requires the TensorFlow Lite Micro library. The setup involves initializing the interpreter, allocating memory for your tensors, and feeding sensor data into the model.

One of the most critical parts of your code is defining the "tensor arena". This is a pre-allocated chunk of memory where TensorFlow Lite Micro keeps the input, output, and intermediate tensors during inference. If you make this arena too small, the interpreter's `AllocateTensors()` call fails during initialization, and using the interpreter after that will crash. If you make it too large, your ESP32 won't have enough RAM left to run your other tasks, like handling Wi-Fi or Bluetooth connections. Finding the sweet spot takes a bit of trial and error. A common approach is to start generously, check how much the interpreter reports using through `arena_used_bytes()`, and then trim the arena to that figure plus a small margin.

Pro-Tip: Don't use the default AllOpsResolver when initializing your model. This resolver links the kernel for every operation supported by TensorFlow Lite Micro into your firmware, bloating your binary. Instead, use MicroMutableOpResolver and manually register only the specific operations your model uses, such as Conv2D or FullyConnected. Depending on the model, this can save you hundreds of kilobytes of flash space.

Once your interpreter is ready, you read your sensor in a loop (accelerometer data, temperature, or microphone samples, for example) and copy each batch of readings into the input tensor. For an INT8 model, that means first converting the float readings to integers using the input tensor's scale and zero point. You then call the interpreter's `Invoke()` function and read the predictions from the output tensor to take action, like spinning a motor or sending an alert.
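The two ends of that loop can be sketched in plain C++ without the library. This is a sketch under assumptions, not TensorFlow Lite Micro's own code: `quantize_readings` and `pick_class` are hypothetical helpers, and in a real sketch the scale and zero point would typically come from each tensor's `params` field, with the buffers being the tensors' `data.int8` pointers.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize float readings into an INT8 buffer using
// q = round(x / scale) + zero_point, saturated to [-128, 127].
void quantize_readings(const float* readings, std::size_t count,
                       float scale, int zero_point, std::int8_t* dest) {
    for (std::size_t i = 0; i < count; ++i) {
        long q = std::lround(readings[i] / scale) + zero_point;
        q = std::max(-128L, std::min(127L, q));
        dest[i] = static_cast<std::int8_t>(q);
    }
}

// Return the index of the highest-scoring class, or -1 if there are no
// scores or the best dequantized score is below min_confidence.
int pick_class(const std::int8_t* scores, std::size_t count,
               float scale, int zero_point, float min_confidence) {
    if (count == 0) return -1;
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; ++i) {
        if (scores[i] > scores[best]) best = i;
    }
    float confidence = scale * static_cast<float>(scores[best] - zero_point);
    return confidence >= min_confidence ? static_cast<int>(best) : -1;
}
```

For example, with an assumed output scale of 1/256 and zero point of -128, raw scores of {-128, 100, -100} give class 1 a confidence of 228/256, about 0.89, so `pick_class` returns 1 for a threshold of 0.8 and -1 for a threshold of 0.95. Returning -1 lets the firmware do nothing when the model is unsure instead of acting on a weak prediction.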

Hardware setup diagram showing an ESP32 connected to an external sensor (like an MPU6050 accelerometer) feeding real-time data into a TinyML inference block on a computer screen.

Frequently Asked Questions

Can any ESP32 chip run machine learning models?

Yes, standard ESP32, ESP32-S3, and ESP32-C3 chips can all run TinyML models. However, the ESP32-S3 is particularly great for this because it features vector instructions that accelerate neural network math, making inference run significantly faster than on the standard dual-core ESP32.

How much accuracy do I lose when converting a model to INT8?

In most real-world scenarios, the drop in accuracy is small, often no more than 1 to 2 percentage points. The benefits of quantization, such as a 4x reduction in model size and much faster integer inference, usually far outweigh this drop.

Can I train a model directly on the ESP32?

Generally, no. Training a neural network requires massive amounts of computing power, memory, and data, which microcontrollers simply do not have. You should always train your model on a computer or cloud platform, optimize it, and then deploy it to the ESP32 for inference only.

What happens if my model is too big for the ESP32's internal flash memory?

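If your model exceeds the flash available to your firmware, you can move to an ESP32 module with more flash (most modules ship with 4 MB, and 8 MB or 16 MB variants exist) or enlarge the app partition. If RAM is the bottleneck instead, a module with external PSRAM (pseudo-static RAM) gives you room for a larger tensor arena. However, accessing external memory is slower than accessing internal memory, which will increase your model's inference latency.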