Description
Hi HyperQueue team,
First off, thank you for the excellent work on HyperQueue. It has been working great for me -- especially when used with a compatible version of Nextflow (hq 0.17.0, nextflow 24.10.2). HyperQueue has really improved our workflow pipeline by efficiently managing many small tasks, which is important for us since our cluster penalizes large numbers of small jobs submitted individually.
I’ve encountered a potential issue regarding memory usage on the hq server. Over time, as more jobs are submitted, the server’s memory usage keeps increasing. For example, our current hq-server process is using over 4.6 GB of RAM (RES column in htop) on the login node, which accounts for nearly 30% of the node’s total memory. The newest version of HyperQueue, 0.22.0, also follows this pattern, although its memory usage grows much more slowly.
I tried running:
hq job forget --filter finished,failed,canceled all
This appears to help temporarily -- newly submitted jobs don’t cause a further increase in memory usage for a while. However, the memory already used by the hq-server process does not seem to be released after the forget operation. My current workaround is to use hq 0.22.0 and repeatedly run hq job forget, e.g. watch -n 60 'hq job forget --filter finished,failed,canceled all'.
I’m wondering if this is expected behavior or a potential area for improvement. Conceptually, if job metadata is stored in a structure like Vec<JobInfo>, maybe the vector is being cleared without calling shrink_to_fit() or similar? That could explain why memory usage doesn’t drop even after forgotten jobs are removed. Or is this behavior coming from the Rust/system allocator, which simply won’t return freed memory to the OS?
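To illustrate what I mean, here is a small standalone sketch (not HyperQueue's actual code -- the JobInfo struct below is just a hypothetical stand-in): clearing a Vec drops its elements but keeps the allocated capacity, so the heap footprint stays the same until the capacity is explicitly released.

```rust
fn main() {
    // Hypothetical stand-in for per-job metadata; the real type in hq differs.
    struct JobInfo {
        _id: u64,
        _name: String,
    }

    // Fill the vector with a large number of entries.
    let mut jobs: Vec<JobInfo> = (0..1_000_000)
        .map(|i| JobInfo { _id: i, _name: format!("job-{i}") })
        .collect();
    println!("capacity after fill:   {}", jobs.capacity()); // >= 1,000,000

    // Dropping the elements does not release the backing allocation.
    jobs.clear();
    println!("capacity after clear:  {}", jobs.capacity()); // unchanged

    // Only an explicit shrink gives the capacity back to the allocator.
    jobs.shrink_to_fit();
    println!("capacity after shrink: {}", jobs.capacity()); // 0
}
```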
Would it be feasible to add an option to explicitly release unused memory after forgotten jobs are removed? I understand memory management can be complex and sometimes OS-level caching is involved, but any insight would be appreciated.
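For the allocator side, here is a hedged sketch of what such an explicit release could look like on Linux with the default glibc allocator (this assumes the libc crate and a gnu target; I don't know which allocator hq-server is actually built with, and e.g. jemalloc behaves differently):

```rust
// glibc's malloc often keeps freed memory cached in its arenas, so RES can stay
// high even after the application drops its data. malloc_trim(0) asks glibc to
// return free pages at the top of the heap to the OS. Requires the `libc` crate.
#[cfg(target_env = "gnu")]
fn release_unused_memory() {
    // Safety: malloc_trim only consolidates allocator-internal free lists.
    unsafe {
        libc::malloc_trim(0);
    }
}
```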
Thanks again for all your hard work on this project!
Best,
Bing