Hello!
I need some help improving vocal audio quality and reducing background noise in captured audio.
I am working on a project with an I2S MEMS microphone and a speaker, using two ESP32-S3 devkits. I am on PlatformIO with the Arduino framework.
My mic is configured to capture audio at a 16000 Hz sampling rate, 16 bits per sample, mono. My audio buffer is 1024 bytes (512 samples).
I use an RTOS task to capture audio, then send the buffer to a queue. I use ESP-NOW to send the audio to my other ESP32-S3. There I push the received buffer to a queue and then write the audio to my speaker, which is connected through an amplifier (MAX98357A).
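For context: ESP-NOW caps a single payload at 250 bytes (ESP_NOW_MAX_DATA_LEN), so each 1024-byte buffer goes out in chunks, roughly like this (sketch; peerMac and the 2-byte header are simplified stand-ins for my actual packet format):
#include <esp_now.h>
#include <string.h>

static const uint8_t peerMac[6] = {0x24, 0x6F, 0x28, 0x00, 0x00, 0x01}; // placeholder

void send_audio_chunked(const uint8_t *buf, size_t len)
{
    // Reserve 2 bytes for a chunk index + last-chunk flag so the
    // receiver can reassemble the full buffer.
    const size_t maxChunk = ESP_NOW_MAX_DATA_LEN - 2;
    uint8_t pkt[ESP_NOW_MAX_DATA_LEN];
    uint8_t idx = 0;
    for (size_t off = 0; off < len; off += maxChunk, idx++) {
        size_t n = (len - off < maxChunk) ? (len - off) : maxChunk;
        pkt[0] = idx;                       // chunk number within this buffer
        pkt[1] = (off + n >= len) ? 1 : 0;  // 1 on the last chunk
        memcpy(&pkt[2], buf + off, n);
        esp_now_send(peerMac, pkt, n + 2);
    }
}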
If I write the audio buffers to the speaker without processing, the amplitude is very low and I have to put my ear to the speaker to hear anything. So I created another task to process the audio. Now, after capturing, I send the buffer to a processing queue instead of the sending queue. The processing task takes the buffer from that queue, processes it, and pushes it to the sending queue.
In the processing task, I apply some gain to the audio. The problem is that the background noise, which was barely audible before, gets amplified as well. In complete silence there is almost no noise, but even a fan or an air conditioner introduces background noise (it sounds like TV static).
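The gain stage itself is just multiply-and-saturate, something like this (sketch; the GAIN value is a placeholder for whatever factor I'm testing):
const float GAIN = 8.0f; // placeholder value

void apply_gain(int16_t *samples, int count)
{
    // Multiply each 16-bit sample by a fixed factor and clamp,
    // so loud input saturates instead of wrapping around.
    for (int i = 0; i < count; i++) {
        float y = samples[i] * GAIN;
        if (y > 32767.0f) y = 32767.0f;
        else if (y < -32768.0f) y = -32768.0f;
        samples[i] = (int16_t)y;
    }
}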
I have tried a band-pass filter to attenuate the higher and lower frequencies. With a small quality factor it barely reduces the background noise, and with a higher quality factor it degrades the vocal quality significantly.
In my testing, the best vocal quality is the unprocessed audio straight from the mic, but that audio also contains any and all background noise present.
After playing with different high-pass, low-pass, and band-pass filters, I have found that there is vocal energy present even at the extreme high frequencies. If I remove any frequency band completely, the effect shows up as reduced vocal quality. This leads me to the conclusion that just targeting frequencies isn't the best approach for noise reduction; I need some method to distinguish between vocals and noise.
I have also tested a voice activity detection algorithm. It works great during silent periods and gives complete silence on the speaker, but when vocals are present the background noise comes through with them.
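The VAD I tested is a simple RMS energy gate with a short hangover, along these lines (sketch; the threshold and hangover count are placeholders I tuned by ear):
#include <math.h>
#include <string.h>

bool vad_gate(int16_t *samples, int count)
{
    const float THRESHOLD = 500.0f; // tune against the mic's noise floor
    static int hangover = 0;        // buffers to keep open after speech ends

    // RMS energy of this buffer.
    float sumSq = 0.0f;
    for (int i = 0; i < count; i++)
        sumSq += (float)samples[i] * samples[i];
    float rms = sqrtf(sumSq / count);

    if (rms > THRESHOLD) {
        hangover = 8;               // ~8 buffers of 32 ms = ~256 ms
    } else if (hangover > 0) {
        hangover--;
    } else {
        memset(samples, 0, count * sizeof(int16_t));
        return false;               // silence: buffer muted
    }
    return true;                    // speech (or hangover)
}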
I tried spectral subtraction. I recorded an audio buffer during a silent period, then performed an FFT (fast Fourier transform) on both my audio buffer and my noise buffer. I calculated the magnitudes and phases of the frequency bins and subtracted the noise magnitude from the mic-audio magnitude, then performed an inverse FFT to get the audio signal back. It was somewhat effective, but I haven't heard purely clean, good-quality audio even once.
I want crisp audio quality with no background noise at all. Is that achievable on an ESP32-S3? I have been working on this on and off for almost half a year, and it seems I have made no progress at all: either there is background noise, or the vocals are no longer crisp.
Any help or guidance is appreciated. Thanks.
Here is my code:
I2S setup:
i2s_config_t micConfig = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = sampleRate,                 // 16000 Hz
    .bits_per_sample = bitRate,                // I2S_BITS_PER_SAMPLE_16BIT
    .channel_format = micChannel,              // mono (single slot)
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 256,                        // samples per DMA buffer
    .use_apll = true
};
i2s_pin_config_t micPins = {
    .bck_io_num = micClockPin,
    .ws_io_num = micWordSelectPin,
    .data_out_num = I2S_PIN_NO_CHANGE,         // RX only, no TX pin
    .data_in_num = micDataPin
};
i2s_driver_install(micPort, &micConfig, 0, NULL);
i2s_set_pin(micPort, &micPins);
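(mic below is my thin wrapper class; mic.read() boils down to the legacy driver's i2s_read(), roughly:)
size_t bytesRead = 0;
i2s_read(micPort, vocalSamples, AUDIO_BUFFER_SIZE, &bytesRead, portMAX_DELAY);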
Audio capture task:
void audio_task(void *arg)
{
    DEBUGLN("Audio task started");
    while (true) {
        // Read one buffer (512 samples / 1024 bytes) from I2S.
        mic.read(vocalSamples, AUDIO_BUFFER_SIZE);
        // Queue the buffer; if the queue is full, drop the oldest
        // entry so latency doesn't grow unbounded.
        if (xQueueSend(vocal_queue, vocalSamples, 0) != pdTRUE) {
            ESP_LOGW(TAG, "Audio queue full, dropping oldest");
            xQueueReceive(vocal_queue, dropBuf, 0);
            xQueueSend(vocal_queue, vocalSamples, 0);
        }
        vTaskDelay(1); // yield to lower-priority tasks
    }
}
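The processing task that sits between this and the sender looks roughly like this (sketch; send_queue is my sending queue and process_buffer() stands in for whichever processing step I'm testing):
void processing_task(void *arg)
{
    static int16_t buf[AUDIO_BUFFER_SIZE / sizeof(int16_t)];
    while (true) {
        // Block until the capture task hands over a buffer.
        if (xQueueReceive(vocal_queue, buf, portMAX_DELAY) != pdTRUE)
            continue;
        process_buffer(buf, AUDIO_BUFFER_SIZE / sizeof(int16_t));
        // Forward to the ESP-NOW sender; drop the frame if full.
        if (xQueueSend(send_queue, buf, 0) != pdTRUE) {
            ESP_LOGW(TAG, "Send queue full, dropping frame");
        }
    }
}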
Initializing my band-pass filter coefficients:
const float low_hz  = 300.0f;   // lower cutoff
const float high_hz = 3000.0f;  // upper cutoff
const float fs      = 16000.0f; // sample rate

// Center frequency is the geometric mean of the two cutoffs.
float f0 = sqrtf(low_hz * high_hz);
// For a band-pass biquad, Q = f0 / bandwidth; for 300-3000 Hz that is
// about 0.35. (A fixed Q of 0.707 would give a narrower band than the
// stated cutoffs.)
float Q = f0 / (high_hz - low_hz);
float normF = f0 / fs;          // esp-dsp expects frequency normalized to fs
if (dsps_biquad_gen_bpf0db_f32(coeffs, normF, Q) != ESP_OK) {
    return false;
}
Applying my band-pass filter:
// Run the biquad over the float buffer, then clamp back to int16.
dsps_biquad_f32_ansi(in_f, out_f, N, coeffs, state);
for (int i = 0; i < N; i++) {
    float y = out_f[i];
    if (y > 32767.0f) y = 32767.0f;
    else if (y < -32768.0f) y = -32768.0f;
    out_samples[i] = (int16_t)y;
}
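A single biquad only rolls off at 12 dB/octave, so when I do keep a filter in the chain I can cascade two sections for steeper skirts (at the cost of slightly narrowing the -3 dB band). A sketch (coeffs2/state2 are a second section generated the same way as the first; every biquad needs its own 2-float delay-line state):
static float tmp_f[512]; // scratch, sized to the 512-sample buffer

void bandpass_cascade(const float *in_f, float *out_f, int n)
{
    dsps_biquad_f32_ansi(in_f,  tmp_f, n, coeffs,  state);   // section 1
    dsps_biquad_f32_ansi(tmp_f, out_f, n, coeffs2, state2);  // section 2
}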
Spectral subtraction on vocal and noise:
Forward FFT (dsps_fft2r_init_fc32 is called once at startup):
dsps_fft2r_fc32(fft_data1, N);
dsps_bit_rev2r_fc32(fft_data1, N);
Calculating the magnitudes:
float mag1_sq = real1 * real1 + imag1 * imag1;
float mag1 = sqrtf(mag1_sq);
Phase calculation:
float phase1 = atan2f(imag1, real1);
Subtraction (clamped at zero, since a magnitude cannot be negative):
float enhanced_mag = mag1 - mag2;
if (enhanced_mag < 0.0f) enhanced_mag = 0.0f;
Reconstructing with the original phase:
float cos_phase = cosf(phase1);
float sin_phase = sinf(phase1);
fft_data1[2*i] = enhanced_mag * cos_phase;
fft_data1[2*i+1] = enhanced_mag * sin_phase;
Taking the conjugate for the inverse FFT:
for (int i = 0; i < N; i++) {
    fft_data1[2*i + 1] = -fft_data1[2*i + 1];
}
Applying the forward FFT to the conjugate to get the audio back; the real part must then be scaled by 1/N, otherwise the output is N times too loud:
dsps_fft2r_fc32(fft_data1, N);
dsps_bit_rev2r_fc32(fft_data1, N);
for (int i = 0; i < N; i++) {
    out_f[i] = fft_data1[2*i] / N;
}
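From what I've read, plain magnitude subtraction on raw, non-overlapping frames is exactly what produces this watery "musical noise" residue. The usual refinements are an over-subtraction factor, a spectral floor, Hann windowing with 50% overlap-add, and a noise estimate averaged over many silent frames instead of a single snapshot. Here is the per-bin core of that as a sketch (ALPHA, BETA, and noise_mag[] are assumptions to tune, not values I have validated):
#include <math.h>

const float ALPHA = 2.0f;  // over-subtraction factor (tune ~1.5-3)
const float BETA  = 0.02f; // spectral floor (tune ~0.01-0.1)

// fft_data holds complex bins as [re0, im0, re1, im1, ...];
// noise_mag[k] is the averaged noise magnitude estimate for bin k.
void subtract_bins(float *fft_data, const float *noise_mag, int nBins)
{
    for (int k = 0; k < nBins; k++) {
        float re = fft_data[2 * k];
        float im = fft_data[2 * k + 1];
        float mag = sqrtf(re * re + im * im);
        float phase = atan2f(im, re);

        // Over-subtract, but never go below a floor proportional to
        // the input; bins zeroed outright are what cause musical noise.
        float enhanced = mag - ALPHA * noise_mag[k];
        float floorMag = BETA * mag;
        if (enhanced < floorMag) enhanced = floorMag;

        fft_data[2 * k]     = enhanced * cosf(phase);
        fft_data[2 * k + 1] = enhanced * sinf(phase);
    }
}
esp-dsp's dsps_wind_hann_f32() can generate the Hann window, and the noise_mag[] estimate can be refreshed whenever the VAD reports silence.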