Background: We posted about our new tool, which automates a lot of the legwork in exploratory data analysis, and asked for feedback. We were blown away by the number of sign-ups and the great in-app feedback, but some valid points of concern were raised and upvoted quite a bit, so I thought it was worth a deeper dive to provide some data for the discussion.
There were a lot of questions flying around about the different methods of dropping features and whether doing so is good or bad practice. So I put together a post looking at all the options mentioned, with a worked example of a complex interaction that is hard for these sorts of methods to pick up.
The approaches listed and conclusion:
Feature Importances (from tree-based models) – Not a good way to select features for ML
Boruta – From our testing, not a good way to select features for ML (if you wrote it and are reading this, sorry; it was brought up for comparison and we didn’t want to seem like we were hiding from the challenge)
Kortical – Works across a wide range of diverse cases we’ve tested, where other methods did not
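To make the first point concrete, here’s a minimal sketch of importance-based feature selection, assuming a Python/scikit-learn stack (the post names no libraries); the data, column choices and the median threshold are all illustrative, not taken from our benchmark. The target depends on an interaction between two columns, which is exactly the kind of structure the worked example shows can trip up marginal screening.

```python
# Minimal sketch of the importance-based selection the post argues against.
# Assumes scikit-learn; all names and thresholds here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Target driven by an interaction between columns 0 and 1, plus column 2;
# columns 3-5 are pure noise.
y = ((X[:, 0] * X[:, 1] > 0) ^ (X[:, 2] > 0)).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_

# A common (and fragile) selection rule: keep features above the median importance.
kept = np.where(importances > np.median(importances))[0]
print("importances:", np.round(importances, 3))
print("kept columns:", kept)
```

Whether the interacting columns survive this cut depends on tree depth, randomness and the threshold chosen, which is the core of the criticism.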
We’re not claiming it’s perfect, but it improves or maintains model performance and outperforms every other feature-dropping approach in our diverse 40-dataset benchmark. We think it can provide a lot of value for many data scientists, especially those who have to work with lots of datasets or large numbers of columns, or who don’t know the pitfalls of dropping features in different scenarios but want better model results.
Save time spent on exploratory data analysis of redundant features
Smaller datasets train and predict faster, speeding up modelling iteration
Better model results than not dropping features (17% better in the linked example, but normally datasets are not quite as tricky as that one)
We’re obviously leaning towards the idea that in most cases dropping features early is “smart, time-saving and efficient”, but hopefully we’ve provided enough evidence for you to make up your own mind. By all means test your models with and without it and let us know how you get on; we’re still in beta, so there may still be some kinks to work out.
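If you do want to run that with/without check yourself, here’s a hedged sketch of what the comparison can look like, again assuming scikit-learn; the synthetic dataset and the choice of which columns to drop are placeholders, not a reproduction of our benchmark.

```python
# Sketch of comparing cross-validated scores with the full feature set
# versus a reduced one. Assumes scikit-learn; which columns get dropped
# is a hypothetical choice for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

full = cross_val_score(model, X, y, cv=5).mean()
reduced = cross_val_score(model, X[:, :5], y, cv=5).mean()  # hypothetical drop
print(f"full features: {full:.3f}  reduced features: {reduced:.3f}")
```

If the reduced score holds up (or improves), the drop was probably safe for that dataset; if it falls, you’ve caught the problem before it reaches production.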
Other criticisms of Kortical were:
We'll cover these in future posts, but hopefully this one helped answer the question: Dropping Features Early – Smart, Time-Saving and Efficient, or Amateur, Lazy Bad Practice? If you’ve got any other feedback or questions on the approach, let me know.

Andy