In the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron, a regression project (predicting California housing prices) is worked through. My question is about his example of stratified sampling:
> So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick 1,000 people randomly in a phone book. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is 51.3% females and 48.7% males, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population.
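To make the quoted idea concrete, here is a minimal sketch of stratified sampling with scikit-learn's `train_test_split` (this is not the book's exact code; the toy data below is made up to mirror the 51.3% / 48.7% survey example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy "population": labels drawn ~51.3% "F" / ~48.7% "M", as in the survey example.
labels = rng.choice(["F", "M"], size=10_000, p=[0.513, 0.487])
X = rng.normal(size=(10_000, 1))  # dummy feature matrix

# stratify=labels makes the split sample proportionally from each stratum,
# so the test set preserves the F/M ratio of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.1, stratify=labels, random_state=42
)

print("full dataset share F:", round((labels == "F").mean(), 3))
print("test set share F:   ", round((y_test == "F").mean(), 3))
```

The two printed shares match to within one sample per stratum, whereas a purely random split would let the test-set ratio drift.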
I do understand his analogy, but he is comparing a case where we know the whole population (the US population), whereas in the housing project we only have a tiny fraction of it... How can we stratify if we don't even know the correct strata?