Hi,
Categorical variables appear a lot with tabular data. In case there are a handful of possible values (e.g. gender, age range, …) one simply uses one-hot encoding and it normally works. But what if there are many possible values, each appearing in a few samples? Examples include zip/postal codes, some sort of non-unique ID, the sender of an e-mail, …
I don't think one-hot encoding works here, or does it? Computing an embedding on the variable looks nice, but may be overkill and also resembles some sort of a chicken-and-egg solution :-/
I appreciate any idea or link to research around best practices here.
Many thanks
Best
submitted by /u/ihatebeinganonymous
[link] [comments]
Source