[D] How do you deal with categorical variables with a large set of possible values?



Hi,

Categorical variables appear a lot in tabular data. When there are only a handful of possible values (e.g. gender, age range, …), one simply uses one-hot encoding and it normally works. But what if there are many possible values, each appearing in only a few samples? Examples include zip/postal codes, some sort of non-unique ID, the sender of an e-mail, …
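
To make the concern concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a low-cardinality column versus a high-cardinality one. The helper `one_hot` and the toy `gender` and `zip_code` values are made up for illustration; they are not from any real dataset or library.

```python
from collections import Counter


def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one active column per sample
        rows.append(row)
    return categories, rows


# Hypothetical toy columns (assumed values, illustration only).
gender = ["f", "m", "f", "m", "f", "m", "f", "m"]
zip_code = ["10115", "80331", "20095", "50667",
            "10115", "60311", "70173", "04109"]

for name, column in [("gender", gender), ("zip_code", zip_code)]:
    categories, rows = one_hot(column)
    counts = Counter(column)
    # Categories seen only once give a model almost nothing to learn from.
    singletons = sum(1 for n in counts.values() if n == 1)
    print(f"{name}: {len(categories)} columns for {len(rows)} rows, "
          f"{singletons} categories seen only once")
```

With a column like the toy `zip_code`, the number of one-hot columns grows almost as fast as the number of rows, and most columns are active for a single sample, which is the situation I am asking about.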

I don't think one-hot encoding works here, or does it? Computing an embedding for the variable looks nice, but may be overkill and also feels like a chicken-and-egg problem :-/

I'd appreciate any ideas or links to research on best practices here.

Many thanks

Best

submitted by /u/ihatebeinganonymous