Lately I’ve been reading a lot of literature, specifically GAN-related work. The main focus is on increasing the Inception Score (IS) or decreasing the Fréchet Inception Distance (FID), but there never seem to be any tests of significance. I’ve seen few papers that claim SOTA of some kind and also argue that their improvement is statistically significant.
Most of what I’ve seen is ablation studies with bolded values showing slight boosts in mean performance on the chosen metric. Why is it not the norm to include p-values when comparing changes to models? Am I missing a subset of research that lives and dies by t-tests and chi-square analysis?
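To make the question concrete, here is a minimal sketch of the kind of check I have in mind: train each model with several random seeds, record the metric per seed, and test whether the difference in means could plausibly be chance. I use an exact permutation test rather than a t-test so the example needs only the standard library and makes no normality assumption. The function name `permutation_p_value` is my own, and the FID numbers are made-up placeholders, not results from any paper.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way to split the pooled scores into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)

    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point noise does not drop ties.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Placeholder FID scores (hypothetical, one value per training seed).
baseline_fid = [12.4, 11.9, 12.8, 12.1, 12.6]
variant_fid = [12.0, 11.7, 12.5, 12.3, 11.8]

p = permutation_p_value(baseline_fid, variant_fid)
print(f"mean baseline FID: {mean(baseline_fid):.2f}")
print(f"mean variant FID:  {mean(variant_fid):.2f}")
print(f"two-sided permutation p-value: {p:.3f}")
```

With five seeds per model there are only 252 possible splits, so the enumeration is instant. The expensive part in practice is training each model several times, which may be part of why this kind of reporting is rare.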
I’m a very young researcher and am fascinated by the philosophy behind the current paradigm for hypothesis testing. Where do you believe the strengths and faults of our current methodologies lie?