[D] Dropping Features Early: Smart, Time Saving and Efficient or Amateur, Bad Practice and Lazy?


Background: We posted about our new tool, which automates much of the legwork in exploratory data analysis, and asked for feedback. We were blown away by the number of sign-ups and the great in-app feedback, but some valid concerns were raised and heavily upvoted, so I thought a deep dive was worthwhile to bring some data to the discussion.

There were a lot of questions about different methods of dropping features and whether doing so is good or bad practice. So I put together a post that looks at each of the options mentioned, using a worked example of a complex interaction that these sorts of methods struggle to pick up.

See the Jupyter notebook article if you want the full detail.
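To give a feel for the kind of interaction meant here, the following is a minimal sketch on made-up data, not the example from the notebook. It assumes a target that is the XOR of two binary columns, hypothetically named `a` and `b`. Each column on its own has close to zero correlation with the target, so a simple correlation filter would likely drop both, even though together they determine the target exactly.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lookup_accuracy(keys, targets):
    """In-sample accuracy of predicting the majority target for each distinct key."""
    counts = {}
    for key, target in zip(keys, targets):
        counts.setdefault(key, [0, 0])[target] += 1
    return sum(max(c) for c in counts.values()) / len(targets)


random.seed(0)
n = 20_000

# Synthetic data (an assumption for illustration): two fair coin flips
# and a target equal to their XOR.
a = [random.randint(0, 1) for _ in range(n)]
b = [random.randint(0, 1) for _ in range(n)]
y = [ai ^ bi for ai, bi in zip(a, b)]

# Each feature alone looks useless to a correlation filter.
corr_a = pearson(a, y)
corr_b = pearson(b, y)

# A predictor that sees only `a` is at roughly chance level,
# while one that sees both columns recovers the target exactly.
acc_a_only = lookup_accuracy(a, y)
acc_both = lookup_accuracy(list(zip(a, b)), y)

print(f"corr(a, y) = {corr_a:+.3f}, corr(b, y) = {corr_b:+.3f}")
print(f"accuracy with a only = {acc_a_only:.3f}, with a and b = {acc_both:.3f}")
```

Real datasets are rarely this clean, but the same effect can appear whenever the signal lives in a combination of columns rather than in any single one.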

The approaches listed and our conclusions:

  • Correlations – Not a good way to select features for ML
  • Feature Importances (from tree based models) – Not a good way to select features for ML

  • Boruta – From our testing, not a good way to select features for ML (If you wrote it and read this, sorry. It got brought up for comparison and we didn’t want to seem like we were hiding from the challenge)

  • Kortical – Works across a wide range of diverse cases we’ve tested, where other methods did not

We’re not claiming it’s perfect, but it improves or maintains model performance and outperforms every other feature-dropping approach in our diverse 40-dataset benchmark. We think it can provide a lot of value for many data scientists, especially those who work with lots of datasets or large numbers of columns, or who don’t know the pitfalls of dropping features in different scenarios but want better model results. The benefits:

  • Less time spent on exploratory data analysis of redundant features

  • Smaller datasets train and predict faster, which speeds up modelling iteration

  • Better model results than not dropping features (17% better in the linked example, but datasets are not normally quite as tricky as that one)

We’re obviously leaning towards the idea that in most cases dropping features early is “Smart, Time Saving and Efficient”, but hopefully we’ve provided enough evidence for you to make up your own mind. By all means test your models with and without it and let us know how you get on. We’re still in beta, so there may be some kinks to work out.
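One way to run that kind of with-and-without comparison is sketched below. Everything in it is an assumption for illustration: the data is synthetic, the model is a hand-rolled 1-nearest-neighbour classifier, and `kept_cols` is picked by hand rather than produced by Kortical or any other tool. The point is only the shape of the test: same train/test split, same model, once with all columns and once with the reduced set.

```python
import random


def one_nn_accuracy(train_X, train_y, test_X, test_y, cols):
    """Hold-out accuracy of a 1-nearest-neighbour classifier using only `cols`."""
    correct = 0
    for x, target in zip(test_X, test_y):
        best_dist, best_label = None, None
        for tx, ty in zip(train_X, train_y):
            # Squared Euclidean distance restricted to the chosen columns.
            dist = sum((x[c] - tx[c]) ** 2 for c in cols)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, ty
        correct += best_label == target
    return correct / len(test_y)


def make_row(label, n_noise, rng):
    """Two informative columns centred on -2 or +2, then pure-noise columns."""
    centre = 2.0 if label else -2.0
    informative = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
    noise = [rng.gauss(0.0, 3.0) for _ in range(n_noise)]
    return informative + noise


rng = random.Random(0)
n_noise = 50  # number of irrelevant columns (toy setting)
labels = [i % 2 for i in range(400)]
rows = [make_row(label, n_noise, rng) for label in labels]

# Fixed hold-out split so both runs see exactly the same rows.
train_X, test_X = rows[:200], rows[200:]
train_y, test_y = labels[:200], labels[200:]

all_cols = list(range(2 + n_noise))
kept_cols = [0, 1]  # hypothetical output of a feature-dropping step

acc_all = one_nn_accuracy(train_X, train_y, test_X, test_y, all_cols)
acc_kept = one_nn_accuracy(train_X, train_y, test_X, test_y, kept_cols)

print(f"all columns:  {acc_all:.3f}")
print(f"kept columns: {acc_kept:.3f}")
```

This toy data is deliberately built so that the noise columns hurt a distance-based model. On real data the gap may be smaller or go the other way, which is exactly why the comparison is worth running on a held-out set.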

Other criticisms of Kortical were:

  • Data leakage – looking at the approach suggested in section 7.10.2 of The Elements of Statistical Learning
  • How is this different from pandas profiling (or insert alternative)?
  • EDA of Titanic is too simple; show us something more complex

We'll cover these in future posts, but hopefully this one helped answer the question: is dropping features early smart, time saving and efficient, or amateur, lazy and bad practice? If you’ve got any other feedback or questions on the approach, let me know. Andy

submitted by /u/andy_gray_kortical