问题描述
我对数据库工作还很陌生,所以请耐心等待.我已经阅读了许多类似的问题,但似乎没有一个在谈论我面临的同一问题.
I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.
只是一些关于我在做什么的信息,我有一个填满联系信息的表格,一些联系人是重复的,但大多数重复的行都有一个截断的电话号码,这使得这些数据毫无用处.
Just a bit of info on what I am doing, I have a table filled with contact information, and some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.
我编写了以下查询来搜索重复项:
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
SELECT
CID, Firstname, lastname, phone, email, LEN(phone) AS length,
ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
ORDER BY Firstname) AS dupcnt
FROM
[data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
AND length <= 10
我假设此查询会根据我指定的三列查找所有具有重复项的记录,并选择 dupcnt
大于 1 的任何记录,以及具有长度的电话列小于或等于 10.但是当我多次运行查询时,每次执行都会得到不同的结果集.一定有一些我在这里遗漏的逻辑,但我对此完全感到困惑.所有列都是 varchar
数据类型,除了 CID,它是 int
.
I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt
greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar
datatype, except for CID, which is int
.
推荐答案
代替 ROW_NUMBER()
使用 COUNT(*)
,并删除 ORDER BY 因为那不是必须使用 COUNT(*)
.
Instead of ROW_NUMBER()
use COUNT(*)
, and remove the ORDER BY since that's not necessary with COUNT(*)
.
按照您现在的方式,您正在通过 firstname
/lastname
/email
将记录分成相似的记录组/分区.然后您按 firstname
对每个组/分区进行排序.Firstname
是分区的一部分,这意味着该组/分区中的每个名字都是相同的.您将获得不同的结果,具体取决于 SQL Server 从存储中获取结果的方式(它首先找到的记录是 1
,第二个找到的是 2
).每次获取记录时(每次运行此 sql 时),它都可能以不同的顺序从磁盘或缓存中获取每条记录.
The way you have it now, you are chunking up records into similar groups/partitions of records by firstname
/lastname
/email
. Then you are ORDERING each group/partition by firstname
. Firstname
is part of the partition, meaning every firstname in that group/partition is identical. You will get different results depending on how SQL Server fetches the results from storage (which record it found first is 1
, what it found second is 2
). Every time it fetches records (every time you run this sql) it may fetch each record from disk or cache at a different order.
Count(*)
将返回所有重复的行
改为:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
这将返回共享相同名字、姓氏和电子邮件的记录数.然后您保留任何大于 1 的记录.
Which will return the number of records that share the same firstname, lastname, and email. You then keep any record that is greater than 1.
这篇关于相同的查询给出不同的结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!