Continuous increase in Memory usage

Beyond the hypothesis below, there are so many ways a dataset of media files can consume more RAM than expected, compared with text, that I’m not sure pinpointing the problem will be easy:


The most likely causes in your case are, in order: host RAM growth from the request path itself, whole-file Wav2Vec2 inference on long audio, and allocator retention that makes memory appear “not freed” even after cleanup. I do not think the main problem is your finally block being too weak. I think the main problem is that the expensive allocations have already happened before that block runs. FastAPI’s UploadFile is built around an internal SpooledTemporaryFile, Torchaudio can load from a file-like object directly, and Wav2Vec2 has well-known long-audio memory problems when you push full clips through in one shot instead of chunking. (FastAPI)

What your handler is doing to memory

The first big amplification happens here:

audio_bytes = await audio_file.read()
waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))

UploadFile already wraps a spooled file object, so reading it fully into audio_bytes forces a whole extra in-process copy of the upload. Then torchaudio.load() decodes that into a waveform tensor, and the docs state that it accepts a file-like object directly and returns float32 tensors for common compressed formats. So a compressed file can become a much larger decoded tensor immediately, before inference even starts. (FastAPI)
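To put rough numbers on that expansion, here is a back-of-the-envelope sketch. The helper names and the example clip (10 minutes, stereo, 44.1 kHz, 128 kbps upload) are assumptions chosen for illustration, not values from your service.

```python
def decoded_bytes(duration_s, sample_rate, channels, bytes_per_sample=4):
    """Size of the decoded waveform tensor (float32 = 4 bytes per sample)."""
    return int(duration_s * sample_rate) * channels * bytes_per_sample


def compressed_bytes(duration_s, bitrate_kbps):
    """Approximate size of a constant-bitrate compressed upload."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    # Hypothetical clip: 10 minutes, stereo, 44.1 kHz, uploaded at 128 kbps.
    upload = compressed_bytes(600, 128)
    tensor = decoded_bytes(600, 44_100, 2)
    print(f"upload:  {upload / 2**20:.1f} MiB")
    print(f"decoded: {tensor / 2**20:.1f} MiB ({tensor / upload:.0f}x larger)")
```

The extra audio_bytes copy sits on top of the decoded figure, and every later full-size tensor (mono, resampled) adds to the temporary peak.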

The next amplification is the preprocessing chain. Your mono conversion creates a new tensor. Your resampling step creates another tensor. Torchaudio’s resampling docs note that transforms.Resample precomputes and caches a kernel, which is useful when reused, but in your code you instantiate it per request. None of these steps is “wrong,” but together they mean one incoming request can temporarily hold multiple full-waveform tensors in RAM. (PyTorch Docs)

Then the audio goes into the Hugging Face ASR pipeline. The pipeline source shows that the automatic speech recognition pipeline preprocesses audio with return_attention_mask=True, and it also has built-in chunk_length_s and stride_length_s handling for chunked processing. That matters because the high-level pipeline is convenient, but it is not the minimum-allocation path, and it is generic rather than tailored to your exact Wav2Vec2 checkpoint. (GitHub)

Why Wav2Vec2 is especially prone to this

Wav2Vec2 is a CTC model, and Hugging Face’s own long-audio guide exists because long files should be handled by chunking with stride, not by pushing the entire waveform through one forward pass. There are public reports of Wav2Vec2 consuming all 64 GB of RAM on a 7-minute file, more than 200 GB of RAM on a large decoding case, and OOM even on a 2 minute 17 second sample on a 32 GB machine. That is the same symptom family as yours. (Hugging Face)

So the background model should be this: Wav2Vec2 on long raw waveforms is memory-hungry by default. If your endpoint accepts arbitrary-duration audio and does full decode plus full inference plus generic pipeline preprocessing, then steady RAM growth under real traffic is exactly what one would expect. (Hugging Face)
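A rough way to see why duration matters so much is to estimate how large a single layer's attention-score tensor gets. This is a sketch under stated assumptions: the default Wav2Vec2 convolutional kernel and stride settings, 16 attention heads (base checkpoints use 12), float32, batch size 1, and an attention implementation that materializes the full score matrix. Actual peak memory depends on the checkpoint, the library version, and the attention backend.

```python
# Default Wav2Vec2 feature-encoder layout (kernel sizes and strides).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of encoder frames produced from raw 16 kHz samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


def attention_scores_bytes(num_samples, num_heads=16, bytes_per_value=4):
    """Bytes for one layer's (heads x frames x frames) score tensor."""
    frames = num_frames(num_samples)
    return num_heads * frames * frames * bytes_per_value


if __name__ == "__main__":
    for seconds in (10, 30, 137, 420):
        size = attention_scores_bytes(seconds * 16_000)
        print(f"{seconds:>4} s -> {size / 2**30:.2f} GiB per layer")
```

The quadratic term is the point: doubling the clip length roughly quadruples this tensor, which is why bounding or chunking the input helps far more than post-hoc cleanup.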

The most important checkpoint-specific detail

The Wav2Vec2 docs state that models with config.feat_extract_norm == "group" such as wav2vec2-base were not trained using attention_mask, and for those models inputs should simply be padded with zeros and no attention mask should be passed. Only layer-norm variants such as wav2vec2-lv60 should get attention_mask for batched inference. The pipeline source, however, shows it requests return_attention_mask=True in its preprocessing path. That does not mean the pipeline is broken. It means the pipeline is generic, while your service may need a tighter manual path that avoids unnecessary tensors for your specific model family. (Hugging Face)

Why your cleanup is not solving it

torch.cuda.empty_cache() only releases unoccupied cached GPU memory. PyTorch explicitly says it does not increase the amount of GPU memory available to PyTorch itself, though it can help reduce fragmentation in some cases. It also says nothing about host RAM because it is a GPU-cache function. So it cannot fix CPU-side growth from uploads, decoded waveforms, or Python/native heap behavior. (PyTorch Docs)

malloc_trim(0) is also weaker than people often think. The Linux man page says it only attempts to release free heap memory from the process heap back to the OS. That means it may help sometimes and do nothing sometimes. It is not a primary control mechanism for a service that is over-allocating per request. (man7.org)
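If you want to check what malloc_trim actually does in your process, a minimal wrapper looks like the sketch below. It assumes Linux with glibc ("libc.so.6"); the function name trim_heap is mine, and on other platforms or C libraries it simply returns None.

```python
import ctypes


def trim_heap():
    """Ask glibc to hand free heap memory back to the OS.

    Returns True if some memory was released, False if nothing was
    released, and None if malloc_trim is not available here.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None  # not glibc, or not Linux
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return bool(malloc_trim(0))


if __name__ == "__main__":
    print("malloc_trim result:", trim_heap())
```

Logging the return value next to RSS shows quickly whether the call is releasing anything at all. A steady stream of False results means the memory is still live or held somewhere malloc_trim cannot reach.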

This is why your logs can show “I deleted everything” while RSS stays high. Some of that can be real live memory. Some can be allocator retention. Hugging Face users have reported the same “first batch fits, second similar batch OOMs” pattern with Wav2Vec2, and FastAPI/Uvicorn users have also reported persistent growth under repeated inference loads. (Hugging Face Forums)

My diagnosis, ranked

1. Highest-probability cause: whole-file CPU memory amplification

You are reading the full upload into Python memory, decoding the full file into float32, then creating more full-size tensors for mono conversion and resampling. That is the clearest architectural problem in the code. (FastAPI)

2. Very likely: full-length Wav2Vec2 inference instead of chunking

The public Wav2Vec2 OOM reports and the official chunking guide point strongly in this direction. Even a couple of minutes can be enough to blow memory depending on the exact model and path. (Hugging Face)

3. Likely: generic pipeline preprocessing doing more work than needed

The ASR pipeline preprocesses with return_attention_mask=True and has its own chunking behavior. A manual processor + model path gives you tighter control over what tensors are built and when. (GitHub)

4. Secondary amplifier: allocator retention and fragmentation

This explains why RAM does not visibly return to baseline after cleanup. It does not explain the initial spike by itself. (PyTorch Docs)

What I would change first

First: stop materializing the upload as bytes

Use the file object you already have:

await audio_file.seek(0)
waveform, sample_rate = torchaudio.load(audio_file.file)

FastAPI documents that UploadFile exposes the underlying spooled file, and Torchaudio documents that load() accepts a file-like object. This removes one full-copy allocation of the uploaded payload. (FastAPI)

Second: cap duration or chunk before full inference

Do not let arbitrary-duration audio go straight into the model. Use chunking with overlap, or reject or trim overly long inputs. Hugging Face’s long-audio guide is explicit that chunking with stride is the right approach for Wav2Vec2 on long files. (Hugging Face)
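As an illustration of the windowing part only, here is a small sketch that computes overlapping sample ranges. The function name and the 10 second / 2 second sizes are assumptions for the example. Stitching the per-chunk CTC outputs back together (dropping logits in the overlapped regions) is a separate step, which the pipeline’s chunk_length_s and stride_length_s handling does for you.

```python
def chunk_spans(num_samples, chunk_len, overlap):
    """Return (start, end) sample index pairs covering the whole waveform.

    Consecutive spans share `overlap` samples so that words cut at a
    boundary appear whole in at least one chunk.
    """
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break
        start += step
    return spans


if __name__ == "__main__":
    sr = 16_000
    # Hypothetical settings: 10 s chunks with 2 s of overlap on a 25 s clip.
    for start, end in chunk_spans(25 * sr, 10 * sr, 2 * sr):
        print(f"{start / sr:5.1f}s -> {end / sr:5.1f}s")
```

Each span can then be sliced out of the waveform and run through the model on its own, so peak memory is bounded by the chunk length rather than by the upload length.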

Third: replace the high-level pipeline in the hot path

Load AutoProcessor and AutoModelForCTC once at startup, then call them directly in the request handler. This lets you control return_attention_mask, input dtype, chunking, and device transfer yourself. The pipeline docs describe pipeline as a convenience abstraction, which is exactly why it is good for prototypes and sometimes suboptimal for tight production serving paths. (Hugging Face)

Fourth: use torch.inference_mode()

PyTorch states that inference_mode is analogous to no_grad, but removes additional overhead by disabling view tracking and version-counter bumps. For pure inference endpoints, that is generally the better mode. (PyTorch Docs)

Fifth: only pass attention_mask if your checkpoint needs it

If your model is a group-norm Wav2Vec2 checkpoint, drop the attention mask. If it is a layer-norm variant, keep it. That one decision can remove a large extra tensor from every request. (Hugging Face)

A safer version of the endpoint

This is the shape I would move toward:

import gc
import logging

import psutil
import torch
import torchaudio

from fastapi import APIRouter, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
from transformers import AutoProcessor, AutoModelForCTC

logger = logging.getLogger(__name__)
router = APIRouter()

TARGET_SR = 16000
MAX_SECONDS = 30
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the processor and model once at import time, not per request.
MODEL_ID = "your-model-id"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE).eval()

# Only layer-norm checkpoints were trained with an attention mask.
USE_ATTENTION_MASK = getattr(model.config, "feat_extract_norm", None) == "layer"

@router.post("/transcribe")
async def quran(audio_file: UploadFile = File(...)):
    process = psutil.Process()
    start_ram = process.memory_info().rss / (1024 ** 2)

    waveform = None
    inputs = None
    logits = None
    pred_ids = None

    try:
        await audio_file.seek(0)

        # No full bytes copy
        waveform, sample_rate = torchaudio.load(audio_file.file)

        # Stereo -> mono
        if waveform.ndim == 2 and waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0)
        else:
            waveform = waveform.squeeze(0)

        # Resample only if needed
        if sample_rate != TARGET_SR:
            waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SR)

        # Hard cap input size before inference
        waveform = waveform[: TARGET_SR * MAX_SECONDS].contiguous()

        inputs = processor(
            waveform.numpy(),
            sampling_rate=TARGET_SR,
            return_tensors="pt",
            padding=False,
            return_attention_mask=USE_ATTENTION_MASK,
        )

        inputs = {k: v.to(DEVICE, non_blocking=True) for k, v in inputs.items()}

        with torch.inference_mode():
            logits = model(**inputs).logits
            pred_ids = torch.argmax(logits, dim=-1)
            transcript = processor.batch_decode(pred_ids)[0]

        return JSONResponse({"transcript": transcript}, status_code=200)

    except Exception:
        raise HTTPException(status_code=500, detail="Internal processing error")

    finally:
        try:
            await audio_file.close()
        except Exception:
            pass

        # Drop references so the tensors can be collected. Deleting keys from
        # locals() has no effect inside a function, so rebind the names instead.
        waveform = inputs = logits = pred_ids = None

        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        gc.collect()

        final_ram = process.memory_info().rss / (1024 ** 2)
        logger.info("RAM Status: %.1fMB -> %.1fMB", start_ram, final_ram)

This version changes the core memory behavior in the places that matter: it removes the extra upload copy, bounds input length before the model sees it, uses a manual path instead of the generic pipeline, and makes attention-mask creation checkpoint-aware. Those recommendations follow directly from the FastAPI, Torchaudio, Wav2Vec2, pipeline, and PyTorch docs. (FastAPI)

How to tell whether this is a true leak or allocator retention

Add logging for:

  • process RSS
  • torch.cuda.memory_allocated()
  • torch.cuda.memory_reserved()

PyTorch documents that memory_allocated() is memory currently occupied by tensors, while memory_reserved() is memory managed by the caching allocator. If allocated stays flat but reserved climbs, that points to allocator behavior or fragmentation. If both are flat while RSS rises, the pressure is on the CPU side: uploads, audio decode, resample buffers, Python/native heap, or FastAPI/Uvicorn request handling. PyTorch’s memory snapshot tools are the right next step if CUDA behavior is unclear. (PyTorch Docs)
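A minimal sketch of that logging and the decision rule follows. The helper names and the 50 MB tolerance are assumptions for illustration. The sketch also adds one case not spelled out above (allocated itself growing, which means live tensors are being kept alive), and it skips psutil or torch gracefully if either is not installed.

```python
import importlib.util

MB = 2 ** 20


def memory_snapshot():
    """Collect process RSS and CUDA allocator counters, in megabytes."""
    snap = {"rss_mb": None, "cuda_allocated_mb": None, "cuda_reserved_mb": None}
    if importlib.util.find_spec("psutil") is not None:
        import psutil
        snap["rss_mb"] = psutil.Process().memory_info().rss / MB
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            snap["cuda_allocated_mb"] = torch.cuda.memory_allocated() / MB
            snap["cuda_reserved_mb"] = torch.cuda.memory_reserved() / MB
    return snap


def classify_growth(before, after, tolerance_mb=50.0):
    """Apply the rule of thumb above to two snapshots."""
    def grew(key):
        a, b = before.get(key), after.get(key)
        return a is not None and b is not None and (b - a) > tolerance_mb

    if grew("cuda_allocated_mb"):
        return "live CUDA tensors are accumulating"
    if grew("cuda_reserved_mb"):
        return "CUDA allocator caching or fragmentation"
    if grew("rss_mb"):
        return "host-side growth (upload, decode, resample, heap)"
    return "no significant growth"


if __name__ == "__main__":
    baseline = memory_snapshot()
    # ... serve a batch of requests here ...
    print(classify_growth(baseline, memory_snapshot()))
```

Take the baseline after the model is loaded and warmed up, then compare against snapshots taken every few hundred requests, since a single request-to-request delta is usually too noisy to classify.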

The big picture

The core issue is not that you forgot one magic cleanup call. The core issue is that your endpoint is currently shaped like this:

  1. read all uploaded bytes into Python
  2. decode all audio into float32
  3. create more full-size tensors for mono and resample
  4. run a generic pipeline on the whole clip
  5. try to clean everything after the expensive work already happened

For Wav2Vec2 serving, the more stable shape is:

  1. keep the upload as a file-like object
  2. decode without creating an extra bytes copy
  3. bound or chunk the audio early
  4. use a manual processor/model path
  5. make attention-mask use depend on the checkpoint
  6. treat empty_cache() and malloc_trim() as secondary hygiene, not the main fix (FastAPI)

My final judgment: your RAM growth is most likely caused by architectural over-allocation in the request path plus long-audio Wav2Vec2 inference, with allocator retention making the symptom look worse over time. The highest-value fixes are: remove audio_bytes, stop whole-file inference, move off the generic pipeline in the API route, and make preprocessing explicit. (Hugging Face)