Results

Evaluation Method

All evaluations are measured on our evaluation servers on data that is entirely unknown to the submitted methods. This resembles the nature of true anomalies and makes it harder for methods to overfit our data generation processes. In practice, submissions contain binaries that we run over different input data.

The ‘Fishyscapes Web’ dataset is updated every three months with a fresh query of objects from the web, which are overlaid on Cityscapes images using varying techniques for every run. In particular, methods are tested on new datasets that are generated only after the method has been submitted to our benchmark.

Metrics

We use Average Precision (AP) as the primary metric of our benchmark. It is invariant to data balance, so we can accurately compare methods regardless of how many pixels they label as anomalous. The tested methods output a continuous score for every pixel, and the metric is computed over all possible thresholds that a binary classifier could apply to this score. The Average Precision is therefore also independent of any particular threshold a binary classifier might use.
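As a rough illustration of how such a pixel-wise AP can be computed from continuous scores, the sketch below uses scikit-learn's average_precision_score. The function name, array names, and shapes are assumptions for illustration, not the benchmark's actual evaluation code.

```python
# Minimal sketch: pixel-wise Average Precision from continuous anomaly scores.
# Assumes `scores` (float, one value per pixel) and `labels` (1 = anomaly, 0 = not).
import numpy as np
from sklearn.metrics import average_precision_score

def pixel_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    # Flatten all pixels into one retrieval problem; AP summarizes the
    # precision-recall curve over every possible score threshold.
    return float(average_precision_score(labels.reshape(-1), scores.reshape(-1)))
```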

To highlight safety-critical applications, we also compute the False Positive Rate at 95% True Positive Rate (FPR95). This corresponds to the False Positive Rate of a binary classifier that compares the method's output score against a threshold and classifies as anomalous every pixel above that threshold. We take exactly the threshold that results in a 95% True Positive Rate, because it is important in safety-critical systems to catch almost all anomalies, and at this threshold prefer the method with the lowest number of false positives.
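A minimal sketch of how FPR95 could be derived from the ROC curve is shown below, again assuming flattened per-pixel scores and binary labels; it is an illustration, not the benchmark's evaluation code.

```python
# Minimal sketch: False Positive Rate at 95% True Positive Rate (FPR95).
import numpy as np
from sklearn.metrics import roc_curve

def fpr_at_95_tpr(scores: np.ndarray, labels: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels.reshape(-1), scores.reshape(-1))
    # roc_curve returns operating points with non-decreasing TPR; take the
    # first one that reaches 95% TPR and report its FPR.
    idx = int(np.searchsorted(tpr, 0.95, side="left"))
    return float(fpr[idx])
```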

Methods that cannot build on pretrained segmentation models, but instead require training with a special loss, may lose semantic segmentation performance through this training or retraining. We therefore also report the mean Intersection over Union (mIoU) on the Cityscapes validation set.
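For illustration, the sketch below shows one way mIoU could be computed from predicted and ground-truth label maps; the class count and ignore index follow the Cityscapes convention, while the function and argument names are assumptions.

```python
# Minimal sketch: mean Intersection over Union (mIoU) on Cityscapes-style labels.
# Assumes `pred` and `gt` hold integer class ids in [0, num_classes) and that
# ignored pixels in `gt` are marked with 255, as in the Cityscapes label format.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray,
             num_classes: int = 19, ignore_index: int = 255) -> float:
    valid = gt != ignore_index
    # Accumulate a num_classes x num_classes confusion matrix over valid pixels.
    cm = np.bincount(num_classes * gt[valid].astype(int) + pred[valid].astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    # Average the per-class IoU; assumes every class occurs in the evaluation set.
    return float((intersection / np.maximum(union, 1)).mean())
```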

Runtime is measured in seconds as the total time it takes a submission to load an image from disk, run inference, and write the results back to disk. We measure this as an average over 5000 images on an NVIDIA RTX 3090 Ti. Slower methods will have higher runtime, but the exact measurements should not be mistaken for pure inference time.
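The sketch below illustrates what such an end-to-end measurement could look like; the helper functions are hypothetical placeholders and not part of the benchmark infrastructure.

```python
# Minimal sketch of the end-to-end timing: loading from disk, inference, and
# writing results back to disk. `load_image`, `run_inference`, and `write_result`
# are hypothetical placeholders and not part of the benchmark code.
import time

def average_runtime(image_paths, load_image, run_inference, write_result) -> float:
    start = time.perf_counter()
    for path in image_paths:
        image = load_image(path)        # read input image from disk
        result = run_inference(image)   # forward pass of the submitted method
        write_result(path, result)      # write the per-pixel scores to disk
    return (time.perf_counter() - start) / len(image_paths)
```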

Benchmark Results



Methods that are not attributed in the table are adaptations of different related works to semantic segmentation. The details of these methods are presented in the benchmark paper.