
I have 2 tables:

Types
+----+-------+
| id | name  |
+----+-------+
| 1  | name1 |
| 2  | name2 |
| 3  | name3 |
| 4  | name4 |
| 5  | name5 |
| 6  | name6 |
| 7  | name7 |
| .. | ..    |
+----+-------+

Data
+-------+-------+
| id    | type  |
+-------+-------+
| 1     | 1     |
| 2     | 5     |
| 3     | 7     |
| 4     | 4     |
| 5     | 2     |
| 6     | 6     |
| 7     | 3     |
| 8     | 5     |
| 9     | 5     |
| 10    | 4     |
| 11    | 1     |
| 12    | 2     |
| 13    | 6     |
| 14    | 5     |
| 15    | 2     |
| ...   | ...   |
| 1...? | 1...? |
+-------+-------+

The Data table is very large; it contains millions of rows. I need to select 1000 rows, but the result has to come from the whole table, so I select every nth row. I've done this using the answer from How to select every nth row in mySQL starting at n, but I need to add some more logic to it: I need a select query that selects every nth row of each of the types. I guess this sounds complicated, so I'll try to describe what I would like to achieve:

Let's say there are 7 types and the Data table has 7.5M rows: 0.5M rows for each of types 1, 2, 3 and 1.5M rows for each of types 4, 5, 6, 7 (just to be clear, the counts may not be equal for all the types).

I need 1000 records containing equal amounts of each type, so with 7 types each type can occur in the result set FLOOR(1000 / 7) = 142 times, which means I need to select 142 rows per type from the Data table.

For types 1, 2, 3, which contain 0.5M rows each, that is ROUND(0.5M / 142), i.e. every 3521st row; for types 4, 5, 6, 7, which contain 1.5M rows each, that is ROUND(1.5M / 142), i.e. every 10563rd row.
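The per-type gap arithmetic above can be sketched in a single GROUP BY query (a sketch; table and column names follow the question, and 142 is the rows-per-type target from above):

```sql
-- For each type: the total row count and the resulting sampling gap.
-- ROUND(cnt / 142) gives e.g. 500000 / 142 = 3521 for types 1-3
-- and 1500000 / 142 = 10563 for types 4-7.
SELECT type,
       COUNT(*)              AS cnt,
       ROUND(COUNT(*) / 142) AS nth
FROM data
GROUP BY type;
```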

So result would look something like this:

Result
+-------+------+
| id    | type |
+-------+------+
| 1     | 1    |
| 3522  | 1    |
| 7043  | 1    |
| ..    | ..   |
| ..    | 2    |
| ..    | 2    |
| ..    | ..   |
| ..    | 3    |
| ..    | 3    |
| ..    | ..   |
| ..    | 4    |
| ..    | 4    |
| ..    | ..   |
| ..    | 5    |
| ..    | 5    |
| ..    | ..   |
| ..    | 6    |
| ..    | 6    |
| ..    | ..   |
| ..    | 7    |
| ..    | 7    |
| ..    | ..   |
+-------+------+

I could do this simply in any programming language with multiple queries: first return each type's count from the Data table, then, after doing the maths, select a single type at a time.

But I would like to do this purely in MySQL, using as few queries as possible.

EDIT

I'll try to explain in more detail what I want to achieve, with a real example.

I have a table with 1437823 rows. The table schema looks like this:

+---------+----------+------+-----+---------+----------------+
| Field   | Type     | Null | Key | Default | Extra          |
+---------+----------+------+-----+---------+----------------+
| id      | int(11)  | NO   | PRI | NULL    | auto_increment |
| type    | int(11)  | NO   |     | NULL    |                |
| counter | int(11)  | NO   |     | NULL    |                |
| time    | datetime | NO   |     | NULL    |                |
+---------+----------+------+-----+---------+----------------+

The table's per-type statistics are:

+------+-----------+
| Type | Row Count |
+------+-----------+
| 1    | 135160    |
| 2    | 291416    |
| 3    | 149863    |
| 4    | 296293    |
| 5    | 273459    |
| 6    | 275929    |
| 7    | 15703     |
+------+-----------+

(P.S. The number of types can change over time.)

Let's say I need to select sample data from a time interval. In the first version of the question I omitted time because I thought it was insignificant, but now I think it might have some significance for ordering, to improve performance.

So anyway, I need to select a sample of approximately 1000 rows containing an equal chunk of data for each type, so that the statistics of the end result would look like this: I am selecting 1000 rows with 7 types, so ROUND(1000 / 7) = 143 rows per type;

+------+-----------+
| Type | Row Count |
+------+-----------+
| 1    | 143       |
| 2    | 143       |
| 3    | 143       |
| 4    | 143       |
| 5    | 143       |
| 6    | 143       |
| 7    | 143       |
+------+-----------+

So now I need to select 143 rows for each type, at equal gaps within the time interval. For a single type it would look something like this:

SET @start_date := '2014-04-06 22:20:21';
SET @end_date   := '2015-02-20 16:20:58';

SET @nth := ROUND(
    (SELECT COUNT(*) FROM data
     WHERE type = 1 AND time BETWEEN @start_date AND @end_date)
    / ROUND(1000 / (SELECT COUNT(*) FROM types))
);

SELECT r.*
FROM (SELECT * FROM data
      WHERE type = 1 AND time BETWEEN @start_date AND @end_date) r
CROSS JOIN (SELECT @i := 0) s
HAVING (@i := @i + 1) MOD @nth = 1

Statistics:

+------+-----------+
| Type | Row Count |
+------+-----------+
| 1    | 144       |
+------+-----------+

This query gives me the needed results with tolerable performance, but I would need one query per type, which would decrease performance and would require concatenating the results into a single data set afterwards, since that's what I need for further processing. So I would like to do it in a single query, or at least get a single result set.
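One way to get a single result set without changing the approach is to concatenate the per-type queries with UNION ALL. A sketch following the query above, where @nth1/@nth2 are hypothetical precomputed per-type gaps and @i1/@i2 separate counters (the same pattern would be repeated for types 3 through 7):

```sql
-- Sketch: per-type sampling branches glued together with UNION ALL.
-- Each branch needs its own counter variable, initialised in its own
-- CROSS JOIN, and its own precomputed gap (@nth1, @nth2, ...).
(SELECT r.*
 FROM (SELECT * FROM data
       WHERE type = 1 AND time BETWEEN @start_date AND @end_date) r
 CROSS JOIN (SELECT @i1 := 0) s
 HAVING (@i1 := @i1 + 1) MOD @nth1 = 1)
UNION ALL
(SELECT r.*
 FROM (SELECT * FROM data
       WHERE type = 2 AND time BETWEEN @start_date AND @end_date) r
 CROSS JOIN (SELECT @i2 := 0) s
 HAVING (@i2 := @i2 + 1) MOD @nth2 = 1);
```

This still costs one scan per type, but it returns everything as one result set.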

P.S. I can tolerate row count deviation in result set as long as type chunks are equal.

2 Answers


This should do what you want (tested on a table with 100 rows with TYPE=1, 200 rows with TYPE=2, 300 rows with TYPE=3, 400 rows with TYPE=4; with the value 10 in _c / 10, I get 40 rows, 10 of each type). Please check the performance, since I'm obviously using a smaller sample table than what you really have.

select *
from (
    select @n := @n + 1 _n, _c, data.*
    from (select type _t, count(*) _c from data group by type) _1
    inner join data on (_t = data.type)
    inner join (select @n := 0) _2
    order by data.type
) _3
where mod(_n, floor(_c / 10)) = 0
order by type, id;

Although this aims to get the same number of rows from each group, it isn't guaranteed to get exactly the same number from each, since there are obviously rounding inaccuracies introduced by the floor(_c / 10).



What you want is a stratified sample. A good way to get a stratified sample is to order the rows by the type and assign a sequential number -- the numbering does not have to start over for each type.

You can then get 1000 rows by taking each nth value:

select d.*
from (
    select d.*, (@rn := @rn + 1) as rn
    from data d
    cross join (select @rn := 0) vars
    order by type
) d
where mod(rn, floor(@rn / 1000)) = 1;

Note: The final comparison is getting 1 out of n rows to approximate 1000. It might be off by one or two depending on the number of values.

EDIT:

Oops, the above does a stratified sample that matches the original distribution of the types in the data. To get equal counts for each group, enumerate them randomly and choose the first "n" for each group:

select d.*
from (
    select d.*,
           (@rn := if(@t = type, @rn + 1,
                      if(@t := type, 1, 1))) as rn
    from data d
    cross join (select @rn := 0, @t := -1) vars
    order by type, rand()
) d
cross join (select count(*) as numtypes from types) as t
where rn <= 1000 / numtypes;
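On MySQL 8.0+, where assigning user variables inside SELECT is deprecated, window functions can express the every-nth-row-per-type sample directly. A sketch, assuming the same data table and a target of ROUND(1000 / 7) = 143 rows per type:

```sql
-- Number the rows within each type and keep every nth one,
-- where nth = FLOOR(per-type count / 143); GREATEST(..., 1)
-- guards against types with fewer than 143 rows.
SELECT id, type
FROM (
    SELECT d.*,
           ROW_NUMBER() OVER (PARTITION BY type ORDER BY time) AS rn,
           COUNT(*)     OVER (PARTITION BY type)               AS cnt
    FROM data d
) numbered
WHERE MOD(rn, GREATEST(FLOOR(cnt / 143), 1)) = 1
ORDER BY type, id;
```

Ordering the partition by time keeps the per-type rows evenly spaced over the time interval; ordering by RAND() instead would give a random balanced sample as in the second query above.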

10 Comments

Hmmm, interesting approach, but this doesn't give me equal chunks for each type :/
@sepikas_antanas . . . Ooops. That is a stratified sample, but not a balanced stratified sample. I just updated the answer.
Hmm, strange, I get millions of rows now. I guess that is because rn is always 1, at least that is what I see in the results table.
For one thing, this won't work because it doesn't ever take into account the number of rows of each type to know what the "gap between rows" of that type should be. This is why I have the group by in my answer.
@JonathanCottrill . . . It doesn't need to (assuming that there are enough rows). It takes "n" rows from each group, where "n" is the appropriate number to get 1000 over everything. There is no "gap between rows" -- but it is important that you look at the second query. The first does something else.
