本文翻译自:Select n random rows from SQL Server table
I've got a SQL Server table with about 50,000 rows in it. 我有一个SQL Server表,其中包含大约50,000行。 I want to select about 5,000 of those rows at random. 我想随机选择大约5,000行。 I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND()
, and then selecting from that table where the random number column < 0.1. 我想到了一个复杂的方法,创建一个带有“随机数”列的临时表,将我的表复制到其中,循环访问临时表并使用RAND()
更新每一行,然后从该表中选择随机数列<0.1。 I'm looking for a simpler way to do it, in a single statement if possible. 我正在寻找一种更简单的方法,如果可能的话,在一个声明中。
This article suggest using the NEWID()
function. 本文建议使用NEWID()
函数。 That looks promising, but I can't see how I could reliably select a certain percentage of rows. 这看起来很有希望,但我看不出如何可靠地选择一定比例的行。
Anybody ever do this before? 有人曾经这样做过吗? Any ideas? 有任何想法吗?
参考:https://stackoom.com/question/3YpU/从SQL-Server表中选择n个随机行
Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns. 从 MSDN上的大表中随机选择行有一个简单明了的解决方案,可以解决大规模的性能问题。
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this: 如果你(不像OP)需要特定数量的记录(这使得CHECKSUM方法难以实现)并且希望比TABLESAMPLE本身提供的更随机的样本,并且还希望比CHECKSUM更快的速度,你可以合并TABLESAMPLE和NEWID()方法,如下所示:
DECLARE @sampleCount int = 50
SET STATISTICS TIME ON
SELECT TOP (@sampleCount) *
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()
SET STATISTICS TIME OFF
In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. 在我的情况下,这是随机性(它不是真的,我知道)和速度之间最直接的妥协。 Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. 根据需要改变TABLESAMPLE百分比(或行) - 百分比越高,样本越随机,但预计速度会线性下降。 (Note that TABLESAMPLE will not accept a variable) (注意TABLESAMPLE不接受变量)
Didn't quite see this variation in the answers yet. 尚未在答案中看到这种变化。 I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time. 在给定初始种子的情况下,我需要一个额外的约束来每次选择相同的行集。
For MS SQL: 对于MS SQL:
Minimum example: 最小例子:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00 标准化执行时间:1.00
NewId() example: NewId()示例:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02 标准化执行时间:1.02
NewId()
is insignificantly slower than rand(checksum(*))
, so you may not want to use it against large record sets. NewId()
比rand(checksum(*))
慢得多,所以你可能不想在大型记录集中使用它。
Selection with Initial Seed: 选择初始种子:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work. 如果您需要选择给定种子的相同集合,这似乎有效。
This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows. 此链接在Orderby(NEWID())和具有1,7和13百万行的表的其他方法之间进行了有趣的比较。
Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; 通常,当在讨论组中询问有关如何选择随机行的问题时,建议使用NEWID查询; it is simple and works very well for small tables. 它很简单,适用于小桌子。
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
However, the NEWID query has a big drawback when you use it for large tables. 但是,当您将NEWID查询用于大型表时,它有一个很大的缺点。 The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. ORDER BY子句将表中的所有行复制到tempdb数据库中,并对它们进行排序。 This causes two problems: 这会导致两个问题:
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. 您需要的是一种随机选择不使用tempdb的行的方法,并且随着表变大而不会慢得多。 Here is a new idea on how to do that: 这是一个关于如何做到这一点的新想法:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. 这个查询背后的基本思想是我们想要为表中的每一行生成0到99之间的随机数,然后选择随机数小于指定百分比值的所有行。 In this example, we want approximately 10 percent of the rows selected randomly; 在这个例子中,我们希望随机选择大约10%的行; therefore, we choose all of the rows whose random number is less than 10. 因此,我们选择随机数小于10的所有行。
Please read the full article in MSDN . 请阅读MSDN中的完整文章。
It appears newid() can't be used in where clause, so this solution requires an inner query: 看来newid()不能在where子句中使用,所以这个解决方案需要一个内部查询:
SELECT *
FROM (
SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
FROM MyTable
) vw
WHERE Rnd % 100 < 10 --10%