I’ve been thinking about this now and again. IMO gender, if one insists on tracking it at all (which I mostly find counterproductive), would need to be a vector / tuple of floating-point values. The components would be something like:
- Sexual Development Index: Encodes chromosomal sex (e.g., X/Y chromosome ratio), genitalia, and other primary sexual characteristics.
- Hormonal Balance & Secondary Sexual Characteristics: Combines hormonal levels and the resulting secondary traits (body hair, muscle mass, etc.).
- Brain Structure: A dimension indicating how a person’s brain structure aligns with typical male or female patterns.
- Gender Identity: A measure of self-identified gender, representing the psychological and social dimension.
- Fertility/Intersex Traits: A combined measure of fertility potential and the presence of intersex traits (e.g., ambiguous genitalia, mixed gonadal structures, etc.).
Ideally it would track the specific genes that code for all of the above factors, but unfortunately science hasn’t got those down yet.
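To make the tuple idea concrete, here is a minimal sketch of the five components as a typed tuple. The class name, field names, and the 0.0 to 1.0 scale are all placeholders assumed for illustration; nothing above fixes a scale or a unit, and the example values are arbitrary.

```python
from typing import NamedTuple


class GenderVector(NamedTuple):
    """Hypothetical five-component tuple; every field is a float.

    The 0.0-1.0 range is an assumption for this sketch, not something
    the list above defines.
    """
    sexual_development: float   # chromosomal sex, primary characteristics
    hormonal_secondary: float   # hormone levels and secondary traits
    brain_structure: float      # alignment with typical structural patterns
    gender_identity: float      # self-identified gender
    fertility_intersex: float   # fertility potential and intersex traits


# Arbitrary illustrative values, not real data.
example = GenderVector(0.1, 0.3, 0.5, 0.8, 0.0)
print(example)
```

A plain tuple of floats would work just as well; named fields only make it harder to mix up the components.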
I like how you think, but I’m not sure that alone will hold water. A variable can vary wildly even though it’s not very relevant to the property you’re interested in, and because PCA ranks directions purely by how much variance they capture, it would treat such a variable as very significant. Perhaps a neural network could find a latent space. But ideally we want the components to have some intuitive meaning for humans.
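A quick way to see the PCA problem is a toy example with made-up data. In this sketch, `relevant` tracks a hypothetical binary `label` closely but has a small spread, while `noisy` is unrelated to the label and has a huge spread. All names and numbers are assumptions for illustration, and PCA is done by hand with NumPy’s SVD rather than with any particular library’s PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (assumed for illustration only).
label = rng.integers(0, 2, size=n)               # the property of interest
relevant = label + rng.normal(0.0, 0.1, size=n)  # informative, small variance
noisy = rng.normal(0.0, 50.0, size=n)            # uninformative, large variance
X = np.column_stack([relevant, noisy])


def first_component(data):
    """Return the unit vector of the first principal component."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Absolute loading of each column (relevant, noisy) on the first component.
loadings = np.abs(first_component(X))

# How strongly each column actually correlates with the label.
corr_relevant = np.corrcoef(relevant, label)[0, 1]
corr_noisy = np.corrcoef(noisy, label)[0, 1]

print("PC1 loadings (relevant, noisy):", loadings)
print("correlation with label:", corr_relevant, corr_noisy)
```

The first component ends up pointing almost entirely along `noisy`, even though `relevant` is the column that correlates with the label. Standardizing the columns first removes the scale effect, but it still doesn’t tell PCA which variables matter for the property you care about.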