Bug in pdf resampling
I encountered a problem with training samples produced at the end of November: when the pdf sampling is used, the labels appear to be encoded in the training variables. This leads to an immediate training accuracy of around 99% but a very poor validation accuracy. To verify that the bug is introduced by the resampling, I looked at the hybrid samples and at the training sample after the resampling but before the scaling and shifting. I also checked whether the problem occurs with a different resampling method. For the count method, at least, it does not, which is why I assume that something goes wrong specifically in the pdf sampling.
The first plot shows the distribution of IP3D_signed_d0_significance for track 1, split by class. It is one of the variables that contributes most to the high accuracy. Each class seems to take only a small set of more or less discrete values for this variable, so the class information might be encoded here.
The second plot shows the distribution of the same variable before the scaling and shifting, again for the first track only and this time not separated by class, together with a working training sample for comparison.
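The "discrete values per class" effect from the first plot can also be checked numerically, without plots. The sketch below is only an illustration under assumptions: it expects the variable and the labels to be already loaded as flat NumPy arrays, and the function name `unique_fraction_per_class` is hypothetical, not part of the framework. For a continuous variable the fraction of unique values per class should be close to 1, while a value far below 1 would point to the same few entries being drawn over and over for that class.

```python
import numpy as np


def unique_fraction_per_class(values, labels):
    """Return {class label: n_unique / n_total} for a single variable.

    `values` and `labels` are assumed to be 1D arrays of equal length
    (hypothetical inputs, e.g. IP3D_signed_d0_significance of track 1
    and the integer class label of each jet).
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    fractions = {}
    for cls in np.unique(labels):
        cls_values = values[labels == cls]
        # Ignore NaN entries (e.g. padded or missing tracks).
        cls_values = cls_values[~np.isnan(cls_values)]
        if cls_values.size == 0:
            continue
        fractions[cls.item()] = np.unique(cls_values).size / cls_values.size
    return fractions


if __name__ == "__main__":
    # Toy data only, not taken from the real samples.
    rng = np.random.default_rng(0)
    toy_labels = np.repeat([0, 1, 2], 1000)

    # Continuous variable: essentially every value is unique.
    toy_continuous = rng.normal(size=toy_labels.size)
    print(unique_fraction_per_class(toy_continuous, toy_labels))

    # Mimic the suspected bug: each class reuses only five distinct values.
    toy_discrete = np.concatenate(
        [rng.choice(rng.normal(size=5), size=1000) for _ in range(3)]
    )
    print(unique_fraction_per_class(toy_discrete, toy_labels))
```

Running the same function on the samples produced with the pdf method and with the count method would show directly whether only the pdf sampling collapses the variable onto a few values per class.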