What is Embedded Software and 10 Ways to Optimize It

Today, embedded systems in small devices are more popular and used for more purposes than ever. They appear in automation (including home, industrial, and building automation) and in consumer, automotive, appliance, medical, telecommunication, commercial, and military applications.

It’s been estimated that 98% of all microprocessors manufactured are used in embedded systems.

Modern embedded systems are often based on microcontrollers (i.e. microprocessors with integrated memory and peripheral interfaces), but ordinary microprocessors (using external chips for memory and peripheral interface circuits) are also common, especially in more complex systems.

In either case, the processor(s) used may be types ranging from general purpose to those specialized in a certain class of computations, or even custom designed for the application at hand.

Since the embedded system is dedicated to specific tasks, design engineers can optimize it to reduce the size and cost of the product and increase the reliability and performance.

Because devices with embedded systems typically have much less computing power and memory than personal computers, it’s more important to optimize the resource usage of these devices than it is on personal computers.

The clock frequency may be a hundred or a thousand times lower and the amount of Random Access Memory (RAM) may be a million times less than in a personal computer (Kilobytes vs Gigabytes).

What is Embedded Software

Embedded software is written to control machines or devices that are not typically thought of as computers, commonly known as embedded systems. It is usually specialized for the particular hardware that it runs on and has time and memory constraints. The term embedded software is sometimes used interchangeably with the term firmware.

Firmware is a special type of embedded software that is stored in non-volatile memory such as ROM (Read Only Memory), EPROM (Erasable Programmable Read-Only Memory), or flash memory. Updating firmware requires ROM integrated circuits to be physically replaced, or EPROM or flash memory to be reprogrammed. Some firmware memory devices are permanently installed and cannot be changed after manufacture. Common reasons for updating firmware include fixing bugs or adding features to the device.

One characteristic feature of embedded software is that some or all of its functions are initiated and controlled through machine interfaces rather than through a human interface.

Embedded software systems are often used in machines that are expected to run continuously for years without errors, and in some cases recover by themselves if an error occurs. So, the software for embedded devices is usually developed and tested more carefully than that for personal computers.

Because an embedded software system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints.

Embedded software can be very simple, such as lighting controls running on an 8-bit microcontroller with a few kilobytes of memory. It can also be very sophisticated, as in routers, optical network elements, airplanes, missiles, and process control systems.

Unlike application software:

  • Embedded software has fixed hardware requirements and capabilities.
  • Embedded software needs to include all needed device drivers.
  • Embedded software development requires use of a cross compiler, which runs on a computer and produces executable code for the target device.

How to Optimize Embedded Software

The smaller the system, the more important it is to design embedded software that uses fewer resources. On the smallest devices, you don’t even have an operating system.

Embedded systems have reliability and performance requirements that demand a software development style that is optimized from the beginning.

With well-optimized embedded software design, it’s possible to get good performance for many applications, even on such small devices, by avoiding large libraries, graphics frameworks, interpreters, just-in-time compilers, system databases, and other extra software layers or frameworks typically used on larger systems.

The best performance on embedded systems is obtained by choosing and using features of the programming language effectively based on the particular hardware design and avoiding wasted resources.

The preferred embedded software programming language will often be C or C++. Critical functions or device drivers may be better written in assembly language.

Contrary to what some may think, C++, when used properly, can use the same resources as embedded software written in C. Modern C++ has added safety and productivity benefits, including zero-overhead abstraction, type-system-enforced correctness, and the ability to move much error checking to compile time.

The use of C++ programming language is beneficial for enhancing the reliability and maintainability of the written code and improving the productivity of programming through the application of techniques like object-oriented programming and data abstraction.

Once you have some working code, you should have a pretty good idea of which functions are the most critical for overall code efficiency.

Interrupt services, high-priority tasks, calculations with real-time deadlines, and functions that are either compute-intensive or frequently called are all likely candidates. A tool called a profiler, included with some software development packages, can be used to narrow your focus to those functions in which the program spends most (or too much) of its time.

Now that you’ve identified the functions that require greater code efficiency, one or more of the following techniques can be used to reduce their execution time:

Direct memory access (DMA)

Direct memory access (DMA) is a way to have a peripheral device control a processor’s memory bus directly. DMA permits the peripheral to transfer data directly to or from memory without having each byte (or word) handled by the processor. DMA enables more efficient use of interrupts, increases data throughput, and potentially reduces hardware costs by eliminating the need for peripheral-specific buffers.

Hand-coded assembly

Some software functions are best written in assembly language. This gives the embedded software programmer an opportunity to make them as efficient as possible. Though most C/C++ compilers produce much better machine code than the average programmer, a good programmer can still do better than the average compiler for a given function.

Choose correct algorithms

The best algorithm for the job can typically outperform a poor choice by a factor of two to ten, and sometimes by much more when the input is large.

The first thing to do when you want to optimize a piece of CPU-intensive software is to find the best algorithm. The choice of algorithm is very important for tasks such as sorting, searching, and mathematical calculations.

Optimized function libraries for many standard tasks are available from a number of sources. For example, the Boost collection contains well-tested libraries for many common purposes.

It is often easier said than done to choose the optimal algorithm before you start to program. Many programmers have discovered that there are smarter ways of doing things only after they have put the whole software project together and tested it.

Be sure to benchmark the various algorithms in your application and then make an objective and informed decision.
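As a small illustration of how much the choice of algorithm matters for searching, the sketch below looks up a value in a sorted table in two ways. The table kSensorIds and both function names are hypothetical and exist only for this example. The binary search is only valid because the table is sorted.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical sorted table of valid sensor IDs (illustrative values only).
constexpr std::array<std::uint16_t, 8> kSensorIds = {3, 7, 12, 19, 25, 31, 44, 58};

// O(n): examines elements one by one until a match is found.
bool contains_linear(std::uint16_t id) {
    return std::find(kSensorIds.begin(), kSensorIds.end(), id) != kSensorIds.end();
}

// O(log n): halves the search range each step; requires ascending order.
bool contains_binary(std::uint16_t id) {
    return std::binary_search(kSensorIds.begin(), kSensorIds.end(), id);
}
```

For a table of eight entries the difference is small, and a linear scan may even win. Benchmarking both on the real target and the real data size is the only reliable way to decide.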

Using the STL (Standard Template Library)

By using parts of the STL, when appropriate, in a microcontroller software project, it’s possible to significantly decrease coding complexity while simultaneously improving legibility and performance.

The algorithms in the STL have been meticulously optimized by their implementers. As a result, the object code can be optimized particularly well by the compiler.

C++ is a great language to use for embedded applications and templates are a powerful aspect of it. The standard library offers a great deal of well-tested functionality, but some parts do not fit well with deterministic behavior and limited resource requirements. These limitations can prevent the use of STL containers with the default allocator (std::allocator), because they allocate memory dynamically.

One alternative is to use your own memory allocator. Another is to use a library like the Embedded Template Library (ETL), which has been designed for lower-resource embedded applications. It defines a set of containers, algorithms, and utilities, some of which emulate parts of the STL.

The Embedded Template Library makes no use of the heap. All the containers (apart from intrusive types) have a fixed capacity, allowing all memory allocation to be determined at compile time.
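To show the idea behind a fixed-capacity container, here is a minimal sketch. FixedVector is a hypothetical class written for this example. It is not the Embedded Template Library's API, and it assumes that T is default-constructible and copy-assignable.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-capacity vector: the storage lives inside the object,
// so no heap allocation takes place at any point.
template <typename T, std::size_t Capacity>
class FixedVector {
public:
    // Returns false instead of allocating when the container is full.
    bool push_back(const T& value) {
        if (size_ == Capacity) {
            return false;
        }
        data_[size_++] = value;
        return true;
    }

    std::size_t size() const { return size_; }
    constexpr std::size_t capacity() const { return Capacity; }

    // No bounds check: the caller must ensure i < size().
    const T& operator[](std::size_t i) const { return data_[i]; }

private:
    std::array<T, Capacity> data_{};
    std::size_t size_ = 0;
};
```

Because the capacity is a template parameter, the memory footprint of a FixedVector<int, 8> is known when the program is compiled, and running out of space becomes an explicit return value rather than a failed allocation.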

Use Template Metaprogramming to Unroll Loops

Template metaprogramming can be used to improve code performance by forcing compile-time loop unrolling.
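A minimal sketch of the technique is shown below, using a dot product as the example. The name DotProduct is hypothetical. Each template instantiation handles one element, so the recursion is resolved while compiling. Whether the generated code is truly loop-free still depends on the compiler inlining the calls, and many optimizing compilers will also unroll an ordinary loop with a constant trip count on their own.

```cpp
#include <cstddef>

// Recursive template: instantiation N handles one element and delegates
// the remaining N - 1 elements to the next instantiation.
template <std::size_t N>
struct DotProduct {
    static int compute(const int* a, const int* b) {
        return a[0] * b[0] + DotProduct<N - 1>::compute(a + 1, b + 1);
    }
};

// Specialization that terminates the compile-time recursion.
template <>
struct DotProduct<0> {
    static int compute(const int*, const int*) { return 0; }
};
```

A call such as DotProduct<3>::compute(a, b) then expands to three multiply-add steps with no loop counter or branch in the source.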

Metaprogramming with constexpr functions

A constexpr function is a function that can perform almost any calculation at compile time if its arguments are compile-time constants. Since the C++14 standard, a constexpr function can contain branches, loops, and so on.

A good optimizing compiler will do simple calculations at compile time anyway, if all the inputs are known constants, but few compilers are able to do more complicated calculations at compile time if they involve branches, loops, function calls, etc. A constexpr function can be useful to make sure that certain calculations are done at compile time.

The result of a constexpr function can be used wherever a compile time constant is required, for example for an array size or in a compile time branch.
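The following sketch shows a C++14 constexpr function containing a loop, with its result used as an array size. The names ipow, kSampleCount, and sample_buffer are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// C++14 constexpr function with a loop: integer exponentiation.
constexpr std::uint32_t ipow(std::uint32_t base, unsigned exp) {
    std::uint32_t result = 1;
    for (unsigned i = 0; i < exp; ++i) {
        result *= base;
    }
    return result;
}

// Initializing a constexpr variable forces evaluation at compile time.
constexpr std::uint32_t kSampleCount = ipow(2, 6);
static_assert(kSampleCount == 64, "ipow(2, 6) should be 64");

// An array size must be a compile-time constant, so this line would not
// compile if ipow could not be evaluated by the compiler.
std::uint8_t sample_buffer[kSampleCount];
```

The same function can still be called with run-time arguments, in which case it behaves like an ordinary function.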

Fixed-point arithmetic

Unless your target platform includes a floating-point coprocessor, you’ll pay a very large penalty for manipulating float data in your program. The compiler-supplied floating-point library contains a set of software subroutines that emulate the instruction set of a floating-point coprocessor. Many of these functions take a long time to execute relative to their integer counterparts and also might not be reentrant.

To avoid potentially slow floating-point emulation libraries for 32-bit single-precision (float) or even 64-bit double-precision (double) values, you can use integer-based fixed-point arithmetic.

A fixed-point number is an integer-based data type representing a real-valued fractional number, optionally signed, having a fixed number of integer digits to the left of the radix point and another fixed number of fractional digits to the right of it. Fixed-point data types are commonly implemented in base-2 or base-10. Fixed-point calculations can be highly efficient in microcontroller programming because they use a near-integer representation of the data type.
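As a rough sketch of base-2 fixed-point arithmetic, the code below uses a Q16.16 layout (16 integer bits and 16 fractional bits in a 32-bit integer). The type alias fixed_t and the helper names are hypothetical. The format is an assumption made for this example, and the 64-bit intermediate used for multiplication and division may itself be costly on very small processors.

```cpp
#include <cstdint>

// Q16.16: the stored integer equals the real value multiplied by 65536.
using fixed_t = std::int32_t;
constexpr int kFracBits = 16;
constexpr fixed_t kOne = 1 << kFracBits;  // representation of 1.0

// Valid for -32768 <= v <= 32767; larger values overflow the integer part.
constexpr fixed_t from_int(std::int32_t v) { return v * kOne; }

// Discards the fractional part (truncates toward zero).
constexpr std::int32_t to_int(fixed_t v) { return v / kOne; }

// The raw product carries 32 fractional bits, so scale back down by 2^16.
constexpr fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) / kOne);
}

// Pre-scale the dividend by 2^16 so the quotient keeps 16 fractional bits.
// The caller must ensure b != 0.
constexpr fixed_t fixed_div(fixed_t a, fixed_t b) {
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * kOne) / b);
}
```

Addition and subtraction need no helpers: two values in the same format can be added or subtracted as plain integers.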

Use the Optimum Integer Size

One of the more important things you can do is to choose the correct integer size for the task and the processor at hand. For all processors, choosing an integer size that matches the natural word length of the CPU is usually a good idea. For example, on an 8-bit processor, using an 8-bit integer when possible is usually the correct thing to do, while using 16-bit or 32-bit integers will carry a significant penalty.
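One portable way to express this intent is through the fixed-width and "fast" integer types from <cstdint>. The sketch below is an illustration only, and the function name checksum is hypothetical. std::uint8_t is exactly 8 bits, while std::uint_fast8_t is at least 8 bits in whatever width the implementation considers fastest on the target.

```cpp
#include <cstdint>

// Sums up to 255 bytes modulo 256.
// - The result is exactly 8 bits, so wrap-around is well defined.
// - The loop counter uses uint_fast8_t so that an 8-bit CPU can keep it in
//   one register, while a 32-bit CPU is free to use its native word size.
std::uint8_t checksum(const std::uint8_t* data, std::uint_fast8_t length) {
    std::uint8_t sum = 0;
    for (std::uint_fast8_t i = 0; i < length; ++i) {
        sum = static_cast<std::uint8_t>(sum + data[i]);
    }
    return sum;
}
```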

Use Lookup Tables

A powerful way to speed up embedded system software is to use lookup tables.

A lookup table is an array that replaces runtime computation with a simpler array indexing operation. The processing time savings can be significant, because retrieving a value from memory is often much faster than carrying out an “expensive” computation.

The lookup tables may be precalculated and stored in static memory, calculated (or pre-fetched) as part of a program’s initialization phase (memoization), or even stored in hardware for application-specific platforms.

It is preferable to declare a lookup table static and constant.

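int SomeFunction (int x) {
   // static const: the table is built once, not on every call
   static const int list[] = {1, 3, 2, 5, 8};
   return list[x];   // the caller must ensure 0 <= x <= 4
}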

The advantage of this is that the list does not need to be initialized every time the function is called. The static declaration helps the compiler decide that the table can be reused from one call to the next. The const declaration helps the compiler see that the table never changes.
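As a second, slightly larger sketch of the same idea, the code below counts the set bits in a byte with two table lookups instead of a loop over all eight bits. The names kBitsInNibble and count_set_bits are hypothetical and used only for this example.

```cpp
#include <cstdint>

// Number of set bits for every possible 4-bit value (0 to 15).
static const std::uint8_t kBitsInNibble[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Two lookups (low nibble and high nibble) replace an 8-iteration loop.
std::uint8_t count_set_bits(std::uint8_t value) {
    return static_cast<std::uint8_t>(kBitsInNibble[value & 0x0F] +
                                     kBitsInNibble[value >> 4]);
}
```

A 16-entry table costs 16 bytes. A 256-entry table would need only one lookup but sixteen times the memory, which is the usual size-versus-speed trade-off with lookup tables.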

Avoid High Frequency Interrupts

Interrupt service routines are often used to improve program efficiency. However, high-frequency interrupts can consume a large share of CPU time. The intrinsic interrupt overhead is the CPU cycles consumed in handling an interrupt request, for example saving and restoring the program counter.

Interrupt service routines and device drivers are particularly critical because they can block the execution of everything else.

For many reasons, it is highly desirable that the interrupt handler execute as briefly as possible, and it is strongly discouraged (or forbidden) for a hardware interrupt handler to invoke potentially blocking system calls.

In a low-level microcontroller, the chip might lack protection modes and have no memory management unit (MMU). In these chips, the execution context of an interrupt handler will be essentially the same as the interrupted program, which typically runs on a small stack of fixed size. Nested interrupts are often provided, which exacerbates stack usage.

In cases in which the average time between interrupts is of the same order of magnitude as the interrupt latency, it might be better to use polling to communicate with the hardware device.
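A common way to keep a handler brief is to have it only store the incoming data and leave all processing to the main loop. The sketch below illustrates this with a small ring buffer. All names are hypothetical. On real hardware the handler would read the byte from a device register and would be attached to an interrupt vector in a toolchain-specific way; here the byte is passed as a parameter so that the logic can be followed without any hardware.

```cpp
#include <cstdint>

constexpr std::uint8_t kRxBufferSize = 16;

// Single-byte indices so that each read or write is one access on an 8-bit CPU.
volatile std::uint8_t g_rx_buffer[kRxBufferSize];
volatile std::uint8_t g_rx_head = 0;  // written only by the interrupt handler
volatile std::uint8_t g_rx_tail = 0;  // written only by the main loop

// Hypothetical receive handler: stores one byte and returns immediately.
void uart_rx_isr(std::uint8_t received_byte) {
    const std::uint8_t next =
        static_cast<std::uint8_t>((g_rx_head + 1) % kRxBufferSize);
    if (next != g_rx_tail) {  // buffer full: drop the byte rather than block
        g_rx_buffer[g_rx_head] = received_byte;
        g_rx_head = next;
    }
}

// Called from the main loop; returns false when no data is waiting.
bool read_byte(std::uint8_t& out) {
    if (g_rx_tail == g_rx_head) {
        return false;
    }
    out = g_rx_buffer[g_rx_tail];
    g_rx_tail = static_cast<std::uint8_t>((g_rx_tail + 1) % kRxBufferSize);
    return true;
}
```

The handler contains no loops and no calls that could block, so its execution time is short and predictable. Parsing, logging, and any other slow work happen later in read_byte's caller.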

Inline functions

In C++, the keyword inline can be added to function declarations. This keyword makes a request to the compiler to replace all calls to the indicated function with copies of the code that’s inside. This eliminates the runtime overhead associated with the actual function call and is most effective when the inline function is called frequently but contains only a few lines of code. Inline functions are an example of how execution speed and code size are sometimes inversely linked.

The repeated insertion of the inline code will increase the size of your program in proportion to the number of places from which the function is called. And, obviously, the larger the function, the more significant the size increase will be. The resulting program runs faster, but now requires more ROM.
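A typical candidate for inlining is a very small helper that is called often, as in the sketch below. The function name high_nibble is hypothetical. Note that the keyword is only a request, and the compiler makes the final decision.

```cpp
#include <cstdint>

// One shift: the call overhead would outweigh the work, so a copy of the
// body at each call site is usually both faster and no larger.
inline std::uint8_t high_nibble(std::uint8_t value) {
    return static_cast<std::uint8_t>(value >> 4);
}
```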

Reducing Memory Usage

In some cases, it’s RAM (Random Access Memory) rather than ROM (Read Only Memory) that is the limiting factor for your application. In these cases, you’ll want to reduce your dependence on global data, the stack, and the Free Store (heap). These are all optimizations better made by the programmer than by the compiler. Because ROM is usually cheaper than RAM (on a per-byte basis), one acceptable strategy for reducing the amount of global data might be to move constant data into ROM. Linkers for embedded systems allow static const data to be kept in ROM.

If a system includes a lot of data that can be kept in ROM, special attention to class design is needed to ensure that the relevant objects are ROM-able. For an object to be ROM-able, it must be capable of initialization by a static initializer like a C struct.

This technique is most valuable if there are lots of strings or table-oriented data that does not change at runtime.
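The sketch below shows what a ROM-able table can look like. BaudRateEntry, kBaudTable, and divisor_for are hypothetical names, and the divisor values are illustrative only. Whether the table actually ends up in ROM depends on the toolchain and linker configuration, and some architectures need additional toolchain-specific qualifiers.

```cpp
#include <cstdint>

// Plain aggregate: no constructors and no virtual functions, so it can be
// initialized statically like a C struct.
struct BaudRateEntry {
    std::uint32_t baud;
    std::uint16_t divisor;
};

// Constant table with a static initializer; nothing is computed or copied
// at start-up. The values are for illustration only.
constexpr BaudRateEntry kBaudTable[] = {
    {9600, 103},
    {19200, 51},
    {115200, 8},
};

// Returns the divisor for a baud rate, or 0 if the rate is not in the table.
std::uint16_t divisor_for(std::uint32_t baud) {
    for (const BaudRateEntry& entry : kBaudTable) {
        if (entry.baud == baud) {
            return entry.divisor;
        }
    }
    return 0;
}
```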

Methods of reducing stack usage

In general, you can lower the stack requirements of your program by:

  • Writing small functions that only require a small number of variables.
  • Avoiding the use of large local structures or arrays.
  • Avoiding recursion, for example, by using an alternative algorithm.
  • Minimizing the number of variables that are in use at any given time at each point in a function.
  • Using C++ block scope and declaring variables only where they are required.
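As an illustration of the points about recursion and variable scope, the sketch below computes a greatest common divisor in two ways. Both function names are hypothetical. The recursive version may add a stack frame per step unless the compiler happens to optimize the tail call, while the iterative version uses a constant amount of stack.

```cpp
#include <cstdint>

// Recursive: stack depth grows with the number of steps (unless the
// compiler turns the tail call into a jump, which is not guaranteed).
std::uint32_t gcd_recursive(std::uint32_t a, std::uint32_t b) {
    return b == 0 ? a : gcd_recursive(b, a % b);
}

// Iterative: same algorithm, fixed stack usage.
std::uint32_t gcd_iterative(std::uint32_t a, std::uint32_t b) {
    while (b != 0) {
        // Declared inside the loop body, so it is live only where needed.
        const std::uint32_t remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}
```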

Simplify your code

When writing embedded software you should strive for your code to be as simple and elegant as possible. Doing so allows the compiler to easily understand your intentions, and consequently to optimize it and generate object code that is as efficient as possible.

Amdahl’s Law should always be considered when deciding whether to optimize a specific part of the program. The impact on the overall program depends very much on how much time is actually spent in that specific part, which is not always clear from looking at the code without a performance analysis.
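For reference, Amdahl's Law can be written as follows, where p is the fraction of the total run time spent in the part being optimized and s is the speedup achieved for that part. The numbers in the worked example are made up for illustration.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\qquad \text{e.g. } p = 0.2,\ s = 10: \quad
S_{\text{overall}} = \frac{1}{0.8 + 0.02} \approx 1.22
```

In other words, making a function that accounts for 20% of the run time ten times faster speeds up the whole program by only about 22%. A mere doubling in speed of a part that accounts for 80% of the run time gives 1 / (0.2 + 0.4), or roughly 1.67.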

Rather than optimizing prematurely, a better approach is to design first, code from the design, and then profile or benchmark the resulting code to see which parts should be optimized. A simple and elegant software design is often easier to optimize at this stage, and profiling may reveal unexpected performance problems that would not have been addressed by premature optimization.


Embedded systems present some special software development challenges due to their limited resources and their requirements for real-time operation, availability, and reliability. By following best practices, with well-optimized embedded software design, it’s possible to have an efficient system with excellent performance.


For more than 30 years Nexus Software Systems has been developing embedded software.
