I would do in a way to seek an exact condition, which depending on the quality of the random number generator would be ideal for this case.
I would simulate a "roulette", kind of one that turns with one’s hands (as in events or TV shows), where each circle track indicates a single individual of the population, considering the total population of all cities summed (This total is exactly the total of tracks in which the roulette will go through one or more times until it stops in one of them, indicating which individual of which city stopped (drew).
I
The colored bands indicate the total population of each city, while the internal bands that divide each city indicate each of its individuals.
If it were possible to physically build such a roulette with millions or billions of individuals, and it had exactly the same space between the tracks and was "calibrated" by spinning, not favoring or harming any of these individuals, the draw would be perfect, since "all individuals have the same possibility of being drawn" each time the roulette was turned.
How to build and spin this roulette mathematically?
I made and tested the case on a excel spreadsheet.
There’s the column "Cities" (text) and Population (value, with the total population of each city).
The cell C5 has the total sum of the populations, the column "% of Total", calculates the participation of each population of each city of this total population (Population of the city / Total Population).
The column "Accumulated" sum the value of the previous cell in the column "Population" with the population of the city itself. In the end, we arrive at the same Total Population that is presented in the cell C5.
The field "Spin roulette until" ... "times", shall receive an integer number of "maximum laps" that roulette can give about itself. For example, if starting from the individual 1 of city to, will be considered a turn when there is the passage of this same individual of this city by the marker.
The field "Upshot" informs exactly how much the roulette spun, ie how many houses (individuals) passed by the marker until stopping. The calculation is done thus:
=INT(ALEATÓRIO()*$C$5*($F$3-1))+$C$5
One can discuss the efficiency of the function "Random" of Excel (see following the answer that I deal with this subject), but it is the one that we have in hands practically per hour, and it meets well the purpose of the way it was applied here, as can be confirmed below.
Note the formula that the population value is summed up at the end, this ensures that "at least one complete spin is given on the roulette".
So the number of spin times is subtracted 1, because this value is added once at the end.
The field "Stopped at" indicates where the marker stopped, and its formula is:
=(F6/$C$5-INT(F6/$C$5))*$C$5
Result over the total population minus the entire part of this division, this will result only in the fractional part of this operation (only decimal places), which indicates the "how much the roulette ran after the last complete round". For example, if you gave 15,25 apart the whole part would be 0,25 this indicates that after fifteen laps the roulette turned over 25% of individuals. By multiplying this percentage by "Total Population", you arrive at exactly which individual from which city the roulette stopped.
The column "Roulette stop" points to the city of this individual, which in this case is the city drawn. The formula is:
=SE(OU(E($F$8>E17;$F$8<=E18);E(F14=0;E17=""));"<=== " & " Cdade sorteada: "& B18;"")
If the value belongs to the population band "accumulated" concerning a city, the message appears pointing to this city and indicating its name.
The red background is done through the "conditional formatting of cells", where if the cell is not empty, it passes the cell background to red and the letter color to white.
If it is not the interval of a city, the formula returns "" (double quotes), which leaves the cell "empty", and its background remains as it was, blank.
By pressing the button "F9", calculations are done with other numbers, and the example below shows what happens immediately after typing "F9" (this was done after the previous example), eventually it may occur to result in the same city in the sequence, hardly or rarely will fall on the same individual (it would have to come out exactly the same value of the previous draw).
How to know if all this works as expected?
You can do "step by step", or "hold" the key "F9", so that there is one draw after another, and you will see that the draw will always be concentrated in the cities with the largest population. The more times you calculate, the more that will be observed.
What’s the problem of using random numbers to select values in a data range?
The "random" numbers that are used in programming languages or on other platforms (as is the case I described), is that in fact these numbers are "pseudorandom", that is, they seek to approach what would really be a "draw without vices"; however, that is not what happens. See this page of Universidade Federal Fluminense on the subject that clarifies very well the problem and access the links indicated.
HOW COMPUTERS GENERATE RANDOM NUMBERS?
Because numbers are generated by mathematical equations (calculation functions) from an initial number (commonly referred to as "seed"), a sequence of numbers is generated in exactly the same sequence and always of the same value, that is, the sequence and the results do not change (whenever the "seed" that started to generate the results has the same value).
To "circumvent" this problem, there are commands that "generate" a new value for this "seed" (the initial value only), thus starting to use another sequence of results, however, they are other values of the same behavior.
The problem this causes is that there is a "trend" of concentration of results at certain points and little or no incidence of results in others (especially if these numbers are generated a few times).
So it is expected that the results are not fair, and it may even occur that a city is not drawn after numerous attempts, even if it is not the city with the smallest population (of course it will depend on the value of the "seed" and the number of times the draw is made, if it is a significantly high number, it may occur, but there will certainly be disproportionate incidence among other cities).
Why the proposed "roulette" significantly reduces this problem?
As each "draw" is not done rigidly within a range of dice (once a draw), but through the "turns" of the roulette, these tracks are completely exceeded several times, which causes the effect of there being a "trend" is significantly reduced. For the effect of this "trend" to appear in this case, it would be necessary to occur a gigantic coincidence, that regardless of the "turns" that the roulette of this, the numbers "drawn" pointed exactly to the same individual. Given the high number of items (population of the country) treated individually, this fact would be extremely rare if it occurred.
This is easy, just add the inhabitants of all existing cities and divide the number of inhabitants by the total, safeguarding that 1% would be the minimum.
– Jorge B.
@Jorgeb’s answer. is very good, but another option is the rule 3simples http://matematica.no.sapo.pt/regra3sind.htm, If the river has the population of 6453682 and corresponds to a number of repetitions of 50000 -- Salvador has 2902927 which corresponds to X
– jsantos1991
@Caputo This would occur only if I kept the strings instead of treating them as a number, I expressed myself badly at the end, what I meant is that if I did the crazy of increasing the amount of values in the array, I could compensate by repeating the values with the highest percentage, but as I said, it’s crazy
– Fernando
@Fernando Entendido! , I will remove the comment above and then this
– Caputo