Stratified sampling using SQL given an absolute sample size

Question

I have the following population:

a b b c c c c

I am looking for a SQL statement to generate a stratified sample of arbitrary size. For this example, say I would like a sample size of 4. I would expect the output to be:

a b c c

Please clarify what rules you want to apply to get a stratified sample. With 1000 x A and 1000 x B and 2 x C and a sample size of 6, what result do you expect? Dismiss C completely, because its proportion is too small to be considered, thus ending up with AAABBB? Have each stratum at least once in the result and then fill up proportionally, thus getting either AAABBC or AABBBC? Get as many rows per stratum as possible, thus getting AABBCC? Please be very precise formulating the rules, considering such edge cases.
– Thorsten Kettner, Commented Feb 4 at 7:41
Interesting problem, though there are many ways to cope with corner cases, and a precise specification is your (rather than our) task, Saqib. Tim's count trick is fine and simple. Personally I think an election algorithm such as the D'Hondt method could be applied too.
– Tomáš Záluský, Commented Feb 4 at 9:24
This is an algorithm question. IMHO it's a better fit for softwareengineering.stackexchange.com
– Jan Doggen, Commented Feb 7 at 12:55
@JanDoggen From "Which programming Stack Exchange sites do I post on?": "Software Engineering: If your question is directly related to the Systems Development Life Cycle (except for troubleshooting, writing or explaining specific code), you can ask it on Software Engineering". This does sound like a question about writing code. The threads over there don't seem to discuss much code.
– Zegarek, Commented Feb 7 at 13:25

Zegarek · Accepted Answer · 2025-02-07 11:57:22Z

select * from population order by row_number() over (partition by stratum) limit 4 offset 0;

stratum
c
b
a
c

demo at db<>fiddle

Establish member numbers within each stratum using row_number().
ORDER BY that.
Use LIMIT to cut off your sample.
Increase OFFSET to progress through samples.
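The steps above can be tried without a Postgres instance. What follows is only a minimal sketch in Python, using the standard-library sqlite3 module (SQLite 3.25 or later is assumed, for window functions). The id column, the draw_sample helper, and the extra ORDER BY tie-breakers are assumptions added here to make the result deterministic; they are not part of the query above.

```python
import sqlite3

# In-memory copy of the question's population: a b b c c c c
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (id INTEGER PRIMARY KEY, stratum TEXT)")
conn.executemany(
    "INSERT INTO population (stratum) VALUES (?)",
    [(s,) for s in "abbcccc"],
)

# Number the members inside each stratum, then draft round-robin:
# every stratum's 1st member first, then every stratum's 2nd member, etc.
# The tie-breakers (stratum, id) are an assumption for reproducibility.
QUERY = """
SELECT stratum
FROM (
    SELECT stratum, id,
           row_number() OVER (PARTITION BY stratum ORDER BY id) AS member_no
    FROM population
) AS numbered
ORDER BY member_no, stratum, id
LIMIT ? OFFSET ?
"""

def draw_sample(size, offset=0):
    """Return `size` strata labels, skipping the first `offset` drafted rows."""
    return [row[0] for row in conn.execute(QUERY, (size, offset))]

first_sample = draw_sample(4)             # ['a', 'b', 'c', 'b']
second_sample = draw_sample(4, offset=4)  # ['c', 'c', 'c']
print(first_sample)
print(second_sample)
```

With the alphabetical tie-breaker the fourth drafted member is b. The second call shows how raising OFFSET moves on to the next, non-overlapping sample, which here is just the three remaining c rows.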

You can use different pagination methods to progress through consecutive, non-overlapping samples of your population. LIMIT..OFFSET isn't the best, but it's the simplest.

Once it has sampled from each group, it picks another member however Postgres finds it quickest. If you instead want to force it to pick them alphabetically (get b instead of c as the fourth member drafted to this sample), add another order by item accordingly, as shown in the demo.

To later order the whole extracted sample, you can wrap it in a subquery or a CTE and add another order by outside so that it sorts the result without affecting how members are sampled.
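As a rough illustration of that wrapping, the sketch below drafts the sample in an inner query and sorts it in an outer one. It again uses Python's sqlite3 module rather than Postgres. The id column, the CTE name drawn, and the descending tie-breaker (chosen only so that the fourth drafted member is c) are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (id INTEGER PRIMARY KEY, stratum TEXT)")
conn.executemany(
    "INSERT INTO population (stratum) VALUES (?)",
    [(s,) for s in "abbcccc"],
)

# Inner query: decides WHICH rows are drafted (round-robin over strata).
# The "stratum DESC" tie-breaker is an assumption for this demo only.
DRAFT = """
SELECT stratum, id
FROM (
    SELECT stratum, id,
           row_number() OVER (PARTITION BY stratum ORDER BY id) AS member_no
    FROM population
) AS numbered
ORDER BY member_no, stratum DESC, id
LIMIT 4
"""

# Outer query: only changes the presentation order of the drafted rows.
SORTED = f"WITH drawn AS ({DRAFT}) SELECT stratum FROM drawn ORDER BY stratum, id"

draft_order = [row[0] for row in conn.execute(DRAFT)]     # ['c', 'b', 'a', 'c']
sorted_sample = [row[0] for row in conn.execute(SORTED)]  # ['a', 'b', 'c', 'c']
print(draft_order, sorted_sample)
```

The outer ORDER BY runs after LIMIT has already fixed the sample, so it cannot change which members were picked.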

There are also built-in random sampling methods you can specify with tablesample clause:

select * from population tablesample system (50) repeatable (.42) limit 4;

But those don't operate on data-level strata.

TABLESAMPLE SYSTEM uses pages. 50 means every page of the table has a 50% chance of being drafted. The number of live records on a page isn't constant, and this method typically gets you neighbouring rows that were inserted together or consecutively. To arrive at a specific sample size, you need to know the total row count of the table and adjust the percentage to it. You still need a limit clause on top, because the number of rows you actually get is probabilistic.
TABLESAMPLE BERNOULLI uses records. With 50, every record of every page has a 50% chance. Again, it needs to be combined with the total row count and trimmed with limit to arrive at a specific sample size.
TABLESAMPLE SYSTEM_TIME from tsm_system_time is like TABLESAMPLE SYSTEM, but instead of accepting a target sample %, it takes a time limit. It keeps drafting until it runs out of time.
TABLESAMPLE SYSTEM_ROWS from tsm_system_rows is like TABLESAMPLE SYSTEM with LIMIT applied during sampling: it drafts page by page until it collects the target sample size.
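The arithmetic behind "adjust the percentage to the row count, then trim with limit" can be sketched in plain Python. This is a simulation of record-level (Bernoulli-style) sampling, not Postgres's implementation. The function names and the oversample factor are assumptions introduced here, the latter to make it likelier that enough rows are drafted before trimming.

```python
import random

def bernoulli_percentage(target_size, total_rows, oversample=1.5):
    """Percentage for a Bernoulli-style sampler so that the EXPECTED number
    of drafted rows is oversample * target_size (capped at 100%).
    The oversample factor is an assumption, not something Postgres provides."""
    if total_rows <= 0:
        return 0.0
    return min(100.0, 100.0 * oversample * target_size / total_rows)

def bernoulli_sample(rows, percentage, limit, seed=None):
    """Keep each row independently with the given percentage chance,
    then trim to at most `limit` rows, as a LIMIT clause would."""
    rng = random.Random(seed)
    drafted = [r for r in rows if rng.random() * 100.0 < percentage]
    return drafted[:limit]

population = list(range(1, 1001))  # stand-in for a 1000-row table
pct = bernoulli_percentage(target_size=40, total_rows=len(population))
sample = bernoulli_sample(population, pct, limit=40, seed=42)
print(pct, len(sample))
```

Here pct comes out as 6.0. The trimmed sample can still be smaller than the target when too few rows happen to be drafted, and trimming keeps the earliest drafted rows, so neither step gives any guarantee about strata.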

Tim Biegeleisen · Accepted Answer · 2025-02-04 05:14:50Z


We can use a count trick here, with the help of window functions:

WITH cte AS (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY col1) cnt,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) rn
    FROM yourTable t
)
SELECT col1
FROM cte
WHERE 1.0*rn/cnt <= (4.0 / (SELECT COUNT(*) FROM yourTable))
ORDER BY col1;

The idea is to sequentially number every value, and then retain only a certain percentage.
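To see what this rule keeps for the question's seven-row population, the query can be run as a small experiment. The sketch below uses Python's sqlite3 module instead of the asker's database (SQLite 3.25 or later is assumed); the proportional_sample helper and the parameterised sample size are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (col1 TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?)", [(s,) for s in "abbcccc"])

# Same count trick as above; only the literal 4.0 is replaced by a parameter.
QUERY = """
WITH cte AS (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY col1) AS cnt,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS rn
    FROM yourTable t
)
SELECT col1
FROM cte
WHERE 1.0 * rn / cnt <= (? * 1.0 / (SELECT COUNT(*) FROM yourTable))
ORDER BY col1
"""

def proportional_sample(size):
    """Keep, per stratum, the rows whose rank fraction rn/cnt is at most size/total."""
    return [row[0] for row in conn.execute(QUERY, (size,))]

four_of_seven = proportional_sample(4)   # ['b', 'c', 'c']
seven_of_seven = proportional_sample(7)  # every row
print(four_of_seven, seven_of_seven)
```

With a target of 4 the threshold is 4/7, about 0.57. The single a row has rn/cnt = 1.0 and is dropped, one of the two b rows passes (0.5), and two of the four c rows pass (0.25 and 0.5). The result therefore has three rows rather than four, and a only appears once the whole table is requested.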


7 Comments

Saqib Ali Feb 4 at 5:25

Hi Tim. I tried this. It didn't work: db-fiddle.com/f/cJTZRhe7p5nXu2NjZcSrHY/0

Tim Biegeleisen Feb 4 at 5:30

It is working as per the logic you actually articulated. If you want an evenly spread 4 out of 7 draw, then rightfully a should not appear because it is only 1 record out of 7. a should only ever appear if you select all records. If my answer doesn't meet your expectations, then you need to update with the intended logic.

Saqib Ali Feb 4 at 5:56

Tim. With stratified sampling, shouldn't a be in the sample as well?

Tim Biegeleisen Feb 4 at 6:04

You need to quantify what your stratified sampling is. If you want to take 4 out of 7, with an even spread, then a should not appear, because at that small sample size, a is just noise.

Zegarek Feb 4 at 12:40

@ManuelFedele LLMs are trained on the work posted here, and users like Tim, and Tim specifically, make both this platform and that LLM training work. Vote up what you find helpful, vote down what you find unhelpful, edit or suggest edits that you think can improve something, post if you have a different idea, comment to clarify - these things make SE work. Use flags to report abuse, and address platform problems and questionable behaviour on meta. The internet tends to feel toxic at times and SE is no exception, but I don't see how the exchange here earned this sort of comment in any particular way.


Mark Rotteveel · Accepted Answer · 2025-02-07 10:46:57Z

You can use the NTILE window function to define the number of buckets (or tiles) you want, and then use ROW_NUMBER() to define the first of the group, and then filter on that:

select col
from (
    select col, tile,
           row_number() over (partition by tile order by col) as rownr
    from (
        select col, ntile(4) over (order by col) as tile
        from (values ('a'), ('b'), ('b'), ('c'), ('c'), ('c'), ('c')) as a(col)
    ) b
) c
where rownr = 1
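The bucket approach can be checked the same way. The following sketch runs the NTILE query through Python's sqlite3 module (SQLite 3.25 or later is assumed). The population table, the ntile_sample helper, and the final ORDER BY tile are assumptions added so the bucket count can be varied and the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (col TEXT)")
conn.executemany("INSERT INTO population VALUES (?)", [(s,) for s in "abbcccc"])

# Split the sorted rows into N equal-sized tiles, then keep the first row
# of each tile. The bucket count is formatted in as a validated integer.
QUERY_TEMPLATE = """
SELECT col
FROM (
    SELECT col, tile,
           row_number() OVER (PARTITION BY tile ORDER BY col) AS rownr
    FROM (
        SELECT col, ntile({buckets}) OVER (ORDER BY col) AS tile
        FROM population
    ) AS b
) AS c
WHERE rownr = 1
ORDER BY tile
"""

def ntile_sample(size):
    """Return one row per tile, i.e. `size` rows if the table has at least that many."""
    query = QUERY_TEMPLATE.format(buckets=int(size))
    return [row[0] for row in conn.execute(query)]

sample_of_4 = ntile_sample(4)  # ['a', 'b', 'c', 'c']
sample_of_2 = ntile_sample(2)  # ['a', 'c']
print(sample_of_4, sample_of_2)
```

For a sample size of 4 the tiles are (a, b), (b, c), (c, c) and (c), which gives the a b c c from the question. With only 2 tiles, (a, b, b, c) and (c, c, c), stratum b drops out, so this method does not guarantee that every stratum is represented.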
