I've just stumbled upon this paper as well and asked myself the same question. Here's how I'd go about it.
If there is one ground truth tag, the ideal predicted vector would have a single 1 and all other predictions at 0. If there were two tags, the ideal prediction would have two values of 0.5 and all others at 0. It therefore makes sense to sort the predicted values by descending confidence and to look at the cumulative probability as we increase the number of candidates, in order to decide the final number of tags.
We need to distinguish which option was the (sorted) ground truth:
1, 0, 0, 0, 0, ...
0.5, 0.5, 0, 0, 0, ...
1/3, 1/3, 1/3, 0, 0, ...
1/4, 1/4, 1/4, 1/4, 0, ...
1/5, 1/5, 1/5, 1/5, 1/5, 0, ...
The same tag position could have completely different ground truth values: 1.0 when it is the only tag, 0.5 when paired with another one, 0.1 when there are 10 of them, and so on. A fixed threshold on the individual predictions couldn't tell which case is the correct one.
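For illustration, here's a minimal sketch (in NumPy, with made-up tag counts and indices) of how that normalized ground truth looks, and why the value at any single position depends entirely on how many tags are active:

```python
import numpy as np

def uniform_target(active_indices, num_tags):
    """Ground-truth vector with the probability mass split evenly over the active tags."""
    target = np.zeros(num_tags)
    target[list(active_indices)] = 1.0 / len(active_indices)
    return target

# The same tag (index 0) gets very different target values
# depending on how many other tags are active alongside it.
print(uniform_target([0], 10))        # 1.0 at position 0
print(uniform_target([0, 3], 10))     # 0.5 at position 0
print(uniform_target(range(5), 10))   # 0.2 at position 0
```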
Instead, we can sort the predicted values in descending order and look at the corresponding cumulative sum. As soon as that sum exceeds a certain threshold (say 0.95), that is the number of tags we predict. Tweaking the exact cumulative-sum threshold serves as a way to trade off precision and recall.
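A minimal sketch of that decision rule (assuming the model outputs a NumPy array of softmax probabilities; the 0.95 cutoff is just an illustrative value):

```python
import numpy as np

def predict_tags(probs, cumulative_threshold=0.95):
    """Return the indices of the predicted tags.

    probs: softmax output over all tags (sums to ~1).
    The number of tags is the smallest k whose top-k probabilities
    sum to at least the cumulative threshold.
    """
    order = np.argsort(probs)[::-1]        # tags sorted by descending confidence
    cumulative = np.cumsum(probs[order])   # running probability mass
    k = int(np.searchsorted(cumulative, cumulative_threshold)) + 1
    return order[:k]

# Example: three tags share most of the mass, so k = 3 at threshold 0.95.
probs = np.array([0.33, 0.32, 0.31, 0.02, 0.02])
print(predict_tags(probs))  # -> [0 1 2]
```

Lowering the threshold makes the predictions more conservative (fewer tags, higher precision), while raising it admits more tags and favors recall.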