Hide code cell content

# ruff: noqa: N802, N803, N806, N815, N816

import numpy as np
from scipy import signal

import archimedes as arc

Generated C API¶

In the previous part of this tutorial, we learned how to automatically translate Python code to highly efficient C implementations. However, the C code generated by CasADi uses a generic interface that requires that all arrays be initialized externally. If we want to call the generated function from our own C code or use it in an embedded setting, this creates overhead of having to manually determine and adjust array sizes, names, initializations, etc.

In embedded systems, memory management is critical. The “kernel” implementation of the algorithm is extremely efficient, but it expects:

  1. Pre-allocated memory buffers for all inputs and outputs

  2. Properly configured pointers to these buffers

  3. Initialization of working memory needed by the algorithm

Manually handling these details would be tedious and error-prone - especially when algorithms change during development.

Archimedes handles this with an “interface layer”, which creates a predictable and easy-to-use API for working with the low-level kernel code. On this page we will go through some of the details of this interface layer.

First, we will generate the code just as before.

Hide code cell content

# Note: give descriptive names for return values (these don't need
# to match the variable names, but must be different from arg names)
@arc.compile(return_names=("u_hist", "y_hist"))
def iir_filter(u, b, a, u_prev, y_prev):
    # Update input history
    u_prev[1:] = u_prev[:-1]
    u_prev[0] = u

    # Compute output using the direct II transposed structure
    y = (np.dot(b, u_prev) - np.dot(a[1:], y_prev[: len(a) - 1])) / a[0]

    # Update output history
    y_prev[1:] = y_prev[:-1]
    y_prev[0] = y

    return u_prev, y_prev


# Design a simple IIR filter with SciPy
dt = 0.01  # Sampling time [seconds]
Wn = 10  # Cutoff frequency [Hz]
order = 4
b, a = signal.butter(order, Wn, "low", analog=False, fs=1 / dt)

# Create "template" arguments for type inference
u = 1.0
u_prev = np.zeros(len(b))
y_prev = np.zeros(len(a) - 1)
args = (u, b, a, u_prev, y_prev)

arc.codegen(iir_filter, args)

The interface code¶

Last time we looked at the kernel code, which used flattened arrays of pointers to access inputs, outputs, and working memory. While highly efficient, this code on its own is difficult to work with directly, particularly for more complex functions with multiple arguments and return values.

Here’s what the “interface” code looks like:


#ifndef IIR_FILTER_H
#define IIR_FILTER_H

#include "iir_filter_kernel.h"

#ifdef __cplusplus
extern "C" {
#endif

// Input arguments struct
typedef struct {
    float u;
    float b[5];
    float a[5];
    float u_prev[5];
    float y_prev[4];
} iir_filter_arg_t;

// Output results struct
typedef struct {
    float u_hist[5];
    float y_hist[4];
} iir_filter_res_t;

// Workspace struct
typedef struct {
    long int iw[iir_filter_SZ_IW];
    float w[iir_filter_SZ_W];
} iir_filter_work_t;

// Runtime API
int iir_filter_init(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work);
int iir_filter_step(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work);


#ifdef __cplusplus
}
#endif

#endif // IIR_FILTER_H

Now, instead of manually wrangling pointer arrays, we have function-specific input and output structs with meaningful names.

The header is enough for us to work with this in our own code, but for completeness here’s the corresponding C code:


#include <string.h>
#include "iir_filter.h"

int iir_filter_init(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work) {
    if (!arg || !res || !work) {
        return -1; // Invalid pointers
    }

    /* Initialize inputs */
    memset(arg, 0, sizeof(iir_filter_arg_t));

    /* Initialize outputs */
    memset(res, 0, sizeof(iir_filter_res_t));

    /* Nonzero assignments */
    arg->u = 1.000000f;
    arg->b[0] = 0.004824f;
    arg->b[1] = 0.019297f;
    arg->b[2] = 0.028946f;
    arg->b[3] = 0.019297f;
    arg->b[4] = 0.004824f;
    arg->a[0] = 1.000000f;
    arg->a[1] = -2.369513f;
    arg->a[2] = 2.313988f;
    arg->a[3] = -1.054665f;
    arg->a[4] = 0.187379f;

    _Static_assert(sizeof(iir_filter_arg_t) == 20 * sizeof(float),
        "Non-contiguous arg struct; please enable -fpack-struct or equivalent.");

    _Static_assert(sizeof(iir_filter_res_t) == 9 * sizeof(float),
        "Non-contiguous res struct; please enable -fpack-struct or equivalent.");

    return 0;
}

int iir_filter_step(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work) {
    if (!arg || !res || !work) {
        return -1; // Invalid pointers
    }

    // Marshal inputs to CasADi format
    const float* kernel_arg[iir_filter_SZ_ARG];
    kernel_arg[0] = &arg->u;
    kernel_arg[1] = arg->b;
    kernel_arg[2] = arg->a;
    kernel_arg[3] = arg->u_prev;
    kernel_arg[4] = arg->y_prev;

    // Marshal outputs to CasADi format
    float* kernel_res[iir_filter_SZ_RES];
    kernel_res[0] = res->u_hist;
    kernel_res[1] = res->y_hist;

    // Call kernel function
    return iir_filter(kernel_arg, kernel_res, work->iw, work->w, 0);
}

If you are familiar with C, this code will be largely self-explanatory. It takes care of all array and pointer initialization, specifically initializing all arrays to the values of the “template arguments” we passed to codegen.

One additional piece that’s worth noting: the kernel code had some nice properties including deterministic and static memory allocation, and portable, self-contained code. The interface layer preserves these features, so it is also suitable for either standalone C/C++ applications or deployment to embedded platforms.

Working with the generated API¶

In order to use the generated code we have to do three things:

#include "iir_filter.h"

// 1. Declare the input, output, and workspace structures
iir_filter_arg_t arg;
iir_filter_res_t res;
iir_filter_work_t work;

// 2. Initialize the structs
iir_filter_init(&arg, &res, &work);

// 3. Call the function
iir_filter_step(&arg, &res, &work);

Note that, as in the “quick start” example earlier in this series, for recursive filters like the IIR, the iir_filter_step function does not place the outputs back in the input arrays. For example, a more useful iteration written as a callback might look like the following:

int n = sizeof(arg.b) / sizeof(arg.b[0]) - 1;  // Filter order
float y;  // Output, to be fed to actuator, control algorithm, etc.

void controller_callback(void) {
    arg.u = dma_buffer[0];  // Read from sensor, other algorithm output, etc.
    iir_filter_step(&arg, &res, &work);
    y = res.y_hist[0];  // Output, to be fed to actuator, control algorithm, etc.

    // Copy output arrays back to inputs
    for (int j = 0; j < n; j++) {
        arg.u_prev[j] = res.u_hist[j];
        arg.y_prev[j] = res.y_hist[j];
    }
    arg.u_prev[n] = res.u_hist[n];
}

As a side note, you may notice here that we’re using roughly twice as much memory as we strictly need, since in this case we know that the outputs will be fed back to the inputs. This will generally be the case for recursive structures (IIR filters, sensor fusion, and similar algorithms) and highlights a limitation of the “functionally pure” assumption underlying code generation in Archimedes (and CasADi). While the generated code is efficient, it is not necessarily optimal; if you are working with tightly resource-constrained platforms or otherwise need extreme performance, you may still need hand-coding or other specialized tools.

On the other hand, if you can afford to keep a few extra floats in memory and perform the associated copy operations, this workflow provides a clean way to abstract the algorithm implementation from the rest of the embedded application.

For example, if we want to choose a different order and cutoff filter, we can do so easily:

# Design an IIR filter with SciPy
dt = 0.01  # Sampling time [seconds]
Wn = 20  # Cutoff frequency [Hz]
order = 2
b, a = signal.butter(order, Wn, "low", analog=False, fs=1 / dt)

# Create "template" arguments for type inference
u = 1.0
u_prev = np.zeros(len(b))
y_prev = np.zeros(len(a) - 1)
args = (u, b, a, u_prev, y_prev)

arc.codegen(iir_filter, args)

#ifndef IIR_FILTER_H
#define IIR_FILTER_H

#include "iir_filter_kernel.h"

#ifdef __cplusplus
extern "C" {
#endif

// Input arguments struct
typedef struct {
    float u;
    float b[3];
    float a[3];
    float u_prev[3];
    float y_prev[2];
} iir_filter_arg_t;

// Output results struct
typedef struct {
    float u_hist[3];
    float y_hist[2];
} iir_filter_res_t;

// Workspace struct
typedef struct {
    long int iw[iir_filter_SZ_IW];
    float w[iir_filter_SZ_W];
} iir_filter_work_t;

// Runtime API
int iir_filter_init(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work);
int iir_filter_step(iir_filter_arg_t* arg, iir_filter_res_t* res, iir_filter_work_t* work);


#ifdef __cplusplus
}
#endif

#endif // IIR_FILTER_H

This has an identical API to the first version and will automatically initialize the correct filter coefficients, meaning we can tune the filter and re-deploy without changing any of the application code!

Platform Portability¶

Similar usage patterns can be applied in a range of contexts; since the generated code is self-contained it is highly portable (provided the compiler supports ANSI C).

For example, here’s a “cookbook recipe” for running a generated function in a timed loop on an Arduino.

#include <Arduino.h>
#include <TimerOne.h>
#include "iir_filter.h"

// Sampling rate: 100 Hz
const unsigned long SAMPLE_RATE_US = 10000;

// Declare the input, output, and workspace structures
iir_filter_arg_t arg;
iir_filter_res_t res;
iir_filter_work_t work;

volatile bool ctrl_flag = false;
int n = sizeof(arg.b) / sizeof(arg.b[0]) - 1;  // Filter order
float y;  // Output, to be fed to actuator, control algorithm, etc.

void controller_callback(void) {
    arg.u = 1.0f;  // Or read from sensor, other algorithm output, etc.
    iir_filter_step(&arg, &res, &work);
    y = res.y_hist[0];  // Output, to be fed to actuator, control algorithm, etc.

    // Copy output arrays back to inputs
    for (int j = 0; j < n; j++) {
        arg.u_prev[j] = res.u_hist[j];
        arg.y_prev[j] = res.y_hist[j];
    }
    arg.u_prev[n] = res.u_hist[n];

    ctrl_flag = false;
}

// Timer interrupt handler
void timerInterrupt() {
    ctrl_flag = true;
}

void setup(){
    // Initialize the structs
    iir_filter_init(&arg, &res, &work);

    Serial.begin(9600);

    // Initialize Timer1 for interrupts at 100 Hz
    Timer1.initialize(SAMPLE_RATE_US);
    Timer1.attachInterrupt(timerInterrupt);
}

void loop() {
    // ...non-time-critical tasks
    
    if (ctrl_flag) controller_callback();
}

Note that the “callback” function is identical to the previous snippet; the same auto-generated code can target everything from rapid prototyping platforms to production-grade embedded controllers to Unix clusters.

For example, the same timed-loop behavior could be implemented on an STM32 with something like the following:

#include <stdbool.h>
#include "iir_filter.h"

// ...

/* USER CODE BEGIN PV */
volatile bool ctrl_flag = false;

// Declare the input, output, and workspace structures
iir_filter_arg_t arg;
iir_filter_res_t res;
iir_filter_work_t work;
/* USER CODE END PV */

// ...

int main(void)
{
    // ...

    /* USER CODE BEGIN Init */
    iir_filter_init(&arg, &res, &work);
    /* USER CODE END Init */

    // ...
    
    /* USER CODE BEGIN WHILE */
    while (1)
    {
        /* USER CODE END WHILE */

        /* USER CODE BEGIN 3 */
        if (ctrl_flag) controller_callback();
    }
    /* USER CODE END 3 */
}

// ...

/* USER CODE BEGIN 4 */
void controller_callback(void) {
    // Same implementation as above
}

void HAL_TIM_PeriodElapsedCallback(TIM_HandleTypeDef *htim) {
    if (htim->Instance == TIM3) {  // Or whichever timer you're using
        ctrl_flag = true;
    }
}
/* USER CODE END 4 */

// ...

Summary¶

In this part of the hardware deployment tutorial, we covered the basic structure of the “interface” layer, or API, generated by Archimedes as a way to abstract application code from the low-level kernel implementation.

For this simple case, managing inputs, outputs, and state as individual scalars and arrays works fine. However, as functions grow in complexity, manually dealing with individual arrays quickly becomes difficult to maintain. In the final part of this series we will rewrite this same example using structured data types to autogenerate C struct types for hierarchical structured data.