Hello!
I need some help improving vocal audio quality and reducing background noise in captured audio.
I am working on a project with an I2S MEMS microphone and a speaker, using two ESP32-S3 devkits. I am on PlatformIO with the Arduino framework.
My mic is configured to capture audio at a 16000 Hz sampling rate, 16 bits per sample, mono. My audio buffer is 1024 bytes (512 samples).
I use an RTOS task to capture audio, then send the buffer to a queue. I use ESP-NOW to send the audio to my other ESP32-S3. There I push the received buffer to a queue and then write the audio to my speaker, which is connected through an amplifier (MAX98357A).
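For context: ESP-NOW caps a single payload at 250 bytes (ESP_NOW_MAX_DATA_LEN), so each 1024-byte buffer goes out in chunks, roughly like this (sketch; peerMac and the 2-byte header are simplified stand-ins for my actual packet format):
#include <esp_now.h>
#include <string.h>

static const uint8_t peerMac[6] = {0x24, 0x6F, 0x28, 0x00, 0x00, 0x01}; // placeholder

void send_audio_chunked(const uint8_t *buf, size_t len)
{
    // Reserve 2 bytes for a chunk index + last-chunk flag so the
    // receiver can reassemble the full buffer.
    const size_t maxChunk = ESP_NOW_MAX_DATA_LEN - 2;
    uint8_t pkt[ESP_NOW_MAX_DATA_LEN];
    uint8_t idx = 0;
    for (size_t off = 0; off < len; off += maxChunk, idx++) {
        size_t n = (len - off < maxChunk) ? (len - off) : maxChunk;
        pkt[0] = idx;                       // chunk number within this buffer
        pkt[1] = (off + n >= len) ? 1 : 0;  // 1 on the last chunk
        memcpy(&pkt[2], buf + off, n);
        esp_now_send(peerMac, pkt, n + 2);
    }
}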
If I write the audio buffers to the speaker without processing, the amplitude is very low and I have to put my ear to the speaker to hear anything. So I created another task to process the audio. Now, after capturing, I send the buffer to a processing queue instead of the sending queue. The processing task takes the buffer from that queue, processes it, and pushes it to the sending queue.
In the processing task, I apply some gain to the audio. The problem is that the background noise, which was barely audible before, gets amplified as well. In complete silence there is almost no noise, but even a fan or an air conditioner introduces background noise (it sounds like TV static).
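The gain stage itself is just multiply-and-saturate, something like this (sketch; the GAIN value is a placeholder for whatever factor I'm testing):
const float GAIN = 8.0f; // placeholder value

void apply_gain(int16_t *samples, int count)
{
    // Multiply each 16-bit sample by a fixed factor and clamp,
    // so loud input saturates instead of wrapping around.
    for (int i = 0; i < count; i++) {
        float y = samples[i] * GAIN;
        if (y > 32767.0f) y = 32767.0f;
        else if (y < -32768.0f) y = -32768.0f;
        samples[i] = (int16_t)y;
    }
}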
I have tried a band-pass filter to attenuate the higher and lower frequencies. With a small quality factor it barely reduces the background noise, and with a higher quality factor it degrades the vocal quality significantly.
In my testing, the best vocal quality is the unprocessed audio straight from the mic, but that audio also contains any and all background noise present.
After playing with different high-pass, low-pass, and band-pass filters, I have found that there is vocal energy present even at the extreme high frequencies. If I remove any frequency band completely, the effect shows up as reduced vocal quality. This leads me to the conclusion that just targeting frequencies isn't the best approach for noise reduction; I need some method to distinguish between vocals and noise.
I have also tested a voice activity detection algorithm. It works great during silent periods and gives complete silence on the speaker, but when vocals are present the background noise comes through with them.
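The VAD I tested is a simple RMS energy gate with a short hangover, along these lines (sketch; the threshold and hangover count are placeholders I tuned by ear):
#include <math.h>
#include <string.h>

bool vad_gate(int16_t *samples, int count)
{
    const float THRESHOLD = 500.0f; // tune against the mic's noise floor
    static int hangover = 0;        // buffers to keep open after speech ends

    // RMS energy of this buffer.
    float sumSq = 0.0f;
    for (int i = 0; i < count; i++)
        sumSq += (float)samples[i] * samples[i];
    float rms = sqrtf(sumSq / count);

    if (rms > THRESHOLD) {
        hangover = 8;               // ~8 buffers of 32 ms = ~256 ms
    } else if (hangover > 0) {
        hangover--;
    } else {
        memset(samples, 0, count * sizeof(int16_t));
        return false;               // silence: buffer muted
    }
    return true;                    // speech (or hangover)
}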
I tried spectral subtraction. I recorded an audio buffer during a silent period, then performed an FFT (fast Fourier transform) on both my audio buffer and my noise buffer. I calculated the magnitudes and phases of the frequency bins and subtracted the noise magnitude from the mic-audio magnitude, then performed an inverse FFT to get the audio signal back. It was somewhat effective, but I haven't heard purely clean, good-quality audio even once.
I want crisp audio quality with no background noise at all. Is that achievable on an ESP32-S3? I have been working on this on and off for almost half a year, and it seems I have made no progress at all: either there is background noise, or the vocals are no longer crisp.
Any help or guidance is appreciated. Thanks.
Here is my code:
I2S setup:
i2s_config_t micConfig = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = sampleRate,                 // 16000 Hz
    .bits_per_sample = bitRate,                // I2S_BITS_PER_SAMPLE_16BIT
    .channel_format = micChannel,              // mono (single slot)
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 256,                        // samples per DMA buffer
    .use_apll = true
};
i2s_pin_config_t micPins = {
    .bck_io_num = micClockPin,
    .ws_io_num = micWordSelectPin,
    .data_out_num = I2S_PIN_NO_CHANGE,         // RX only, no TX pin
    .data_in_num = micDataPin
};
i2s_driver_install(micPort, &micConfig, 0, NULL);
i2s_set_pin(micPort, &micPins);
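(mic below is my thin wrapper class; mic.read() boils down to the legacy driver's i2s_read(), roughly:)
size_t bytesRead = 0;
i2s_read(micPort, vocalSamples, AUDIO_BUFFER_SIZE, &bytesRead, portMAX_DELAY);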
Audio capture task:
void audio_task(void *arg)
{
    DEBUGLN("Audio task started");
    while (true) {
        // Read one buffer (512 samples / 1024 bytes) from I2S.
        mic.read(vocalSamples, AUDIO_BUFFER_SIZE);
        // Queue the buffer; if the queue is full, drop the oldest
        // entry so latency doesn't grow unbounded.
        if (xQueueSend(vocal_queue, vocalSamples, 0) != pdTRUE) {
            ESP_LOGW(TAG, "Audio queue full, dropping oldest");
            xQueueReceive(vocal_queue, dropBuf, 0);
            xQueueSend(vocal_queue, vocalSamples, 0);
        }
        vTaskDelay(1); // yield to lower-priority tasks
    }
}
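The processing task that sits between this and the sender looks roughly like this (sketch; send_queue is my sending queue and process_buffer() stands in for whichever processing step I'm testing):
void processing_task(void *arg)
{
    static int16_t buf[AUDIO_BUFFER_SIZE / sizeof(int16_t)];
    while (true) {
        // Block until the capture task hands over a buffer.
        if (xQueueReceive(vocal_queue, buf, portMAX_DELAY) != pdTRUE)
            continue;
        process_buffer(buf, AUDIO_BUFFER_SIZE / sizeof(int16_t));
        // Forward to the ESP-NOW sender; drop the frame if full.
        if (xQueueSend(send_queue, buf, 0) != pdTRUE) {
            ESP_LOGW(TAG, "Send queue full, dropping frame");
        }
    }
}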
Initializing my band-pass filter coefficients:
const float low_hz  = 300.0f;   // lower cutoff
const float high_hz = 3000.0f;  // upper cutoff
const float fs      = 16000.0f; // sample rate

// Center frequency is the geometric mean of the two cutoffs.
float f0 = sqrtf(low_hz * high_hz);
// For a band-pass biquad, Q = f0 / bandwidth; for 300-3000 Hz that is
// about 0.35. (A fixed Q of 0.707 would give a narrower band than the
// stated cutoffs.)
float Q = f0 / (high_hz - low_hz);
float normF = f0 / fs;          // esp-dsp expects frequency normalized to fs
if (dsps_biquad_gen_bpf0db_f32(coeffs, normF, Q) != ESP_OK) {
    return false;
}
Applying my band-pass filter:
// Run the biquad over the float buffer, then clamp back to int16.
dsps_biquad_f32_ansi(in_f, out_f, N, coeffs, state);
for (int i = 0; i < N; i++) {
    float y = out_f[i];
    if (y > 32767.0f) y = 32767.0f;
    else if (y < -32768.0f) y = -32768.0f;
    out_samples[i] = (int16_t)y;
}
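A single biquad only rolls off at 12 dB/octave, so when I do keep a filter in the chain I can cascade two sections for steeper skirts (at the cost of slightly narrowing the -3 dB band). A sketch (coeffs2/state2 are a second section generated the same way as the first; every biquad needs its own 2-float delay-line state):
static float tmp_f[512]; // scratch, sized to the 512-sample buffer

void bandpass_cascade(const float *in_f, float *out_f, int n)
{
    dsps_biquad_f32_ansi(in_f,  tmp_f, n, coeffs,  state);   // section 1
    dsps_biquad_f32_ansi(tmp_f, out_f, n, coeffs2, state2);  // section 2
}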
Spectral subtraction on vocal and noise:
Forward FFT (dsps_fft2r_init_fc32 is called once at startup):
dsps_fft2r_fc32(fft_data1, N);
dsps_bit_rev2r_fc32(fft_data1, N);
Calculating the magnitudes:
float mag1_sq = real1 * real1 + imag1 * imag1;
float mag1 = sqrtf(mag1_sq);
Phase calculation:
float phase1 = atan2f(imag1, real1);
Subtraction (clamped at zero, since a magnitude cannot be negative):
float enhanced_mag = mag1 - mag2;
if (enhanced_mag < 0.0f) enhanced_mag = 0.0f;
Reconstructing with the original phase:
float cos_phase = cosf(phase1);
float sin_phase = sinf(phase1);
fft_data1[2*i] = enhanced_mag * cos_phase;
fft_data1[2*i+1] = enhanced_mag * sin_phase;
Taking the conjugate for the inverse FFT:
for (int i = 0; i < N; i++) {
    fft_data1[2*i + 1] = -fft_data1[2*i + 1];
}
Applying the forward FFT to the conjugate to get the audio back; the real part must then be scaled by 1/N, otherwise the output is N times too loud:
dsps_fft2r_fc32(fft_data1, N);
dsps_bit_rev2r_fc32(fft_data1, N);
for (int i = 0; i < N; i++) {
    out_f[i] = fft_data1[2*i] / N;
}
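From what I've read, plain magnitude subtraction on raw, non-overlapping frames is exactly what produces this watery "musical noise" residue. The usual refinements are an over-subtraction factor, a spectral floor, Hann windowing with 50% overlap-add, and a noise estimate averaged over many silent frames instead of a single snapshot. Here is the per-bin core of that as a sketch (ALPHA, BETA, and noise_mag[] are assumptions to tune, not values I have validated):
#include <math.h>

const float ALPHA = 2.0f;  // over-subtraction factor (tune ~1.5-3)
const float BETA  = 0.02f; // spectral floor (tune ~0.01-0.1)

// fft_data holds complex bins as [re0, im0, re1, im1, ...];
// noise_mag[k] is the averaged noise magnitude estimate for bin k.
void subtract_bins(float *fft_data, const float *noise_mag, int nBins)
{
    for (int k = 0; k < nBins; k++) {
        float re = fft_data[2 * k];
        float im = fft_data[2 * k + 1];
        float mag = sqrtf(re * re + im * im);
        float phase = atan2f(im, re);

        // Over-subtract, but never go below a floor proportional to
        // the input; bins zeroed outright are what cause musical noise.
        float enhanced = mag - ALPHA * noise_mag[k];
        float floorMag = BETA * mag;
        if (enhanced < floorMag) enhanced = floorMag;

        fft_data[2 * k]     = enhanced * cosf(phase);
        fft_data[2 * k + 1] = enhanced * sinf(phase);
    }
}
esp-dsp's dsps_wind_hann_f32() can generate the Hann window, and the noise_mag[] estimate can be refreshed whenever the VAD reports silence.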