Managing duplicates

Improving Query Performance in SQL Server

Dean Smith

Founder, Atamai Analytics

Query returning duplicates

SELECT PlayerName,
       Team
FROM PlayerStats; 
PlayerName Team
... ...
Emmanuel Mudiay DEN
Emmanuel Mudiay NYK
Enes Kanter NYK
Eric Bledsoe PHO
Eric Bledsoe MIL
... ...
Improving Query Performance in SQL Server

Removing duplicates with DISTINCT()

SELECT DISTINCT(PlayerName)
FROM PlayerStats;
PlayerName
...
Emmanuel Mudiay
Enes Kanter
Eric Bledsoe
Eric Gordon
Eric Moreland
...
Improving Query Performance in SQL Server

GROUP BY instead of DISTINCT()

SELECT DISTINCT(PlayerName)
FROM PlayerStats;
SELECT PlayerName
FROM PlayerStats 
GROUP BY PlayerName;
Improving Query Performance in SQL Server

GROUP BY instead of DISTINCT()

SELECT DISTINCT(PlayerName)
FROM PlayerStats;
SELECT PlayerName
FROM PlayerStats 
GROUP BY PlayerName;
SELECT PlayerName,
    COUNT(Team) AS TeamsPlayedFor
FROM PlayerStats 
GROUP BY PlayerName;
PlayerName TeamsPlayedFor
... ...
Emmanuel Mudiay 2
Enes Kanter 1
Eric Bledsoe 2
Eric Gordon 1
Eric Moreland 1
... ...
Improving Query Performance in SQL Server

Is there another way?

SELECT PlayerName
FROM Players;
PlayerName
...
Emmanuel Mudiay
Enes Kanter
Eric Bledsoe
Eric Gordon
Eric Moreland
...
Improving Query Performance in SQL Server

Fruits tables

Fruits1

FruitName FruitType
Grapefruit Citrus
Orange Citrus
Peach Stone
Strawberry Berry

Fruits2

FruitName FruitType
Marionberry Berry
Orange Citrus
Plum Stone
Strawberry Berry
Improving Query Performance in SQL Server

Duplicate fruits

 

 

 

  • Orange and Strawberry are duplicated

Fruits1 appended to Fruits2

FruitName FruitType
... ...
Orange Citrus
Orange Citrus
Plum Stone
Strawberry Berry
Strawberry Berry
Improving Query Performance in SQL Server

Remove duplicates with UNION

SELECT FruitName, FruitType
FROM Fruits1

UNION

SELECT FruitName, FruitType
FROM Fruits2
FruitName FruitType
Grapefruit Citrus
Orange Citrus
Peach Stone
Strawberry Berry
Marionberry Berry
Plum Stone
Improving Query Performance in SQL Server

What about UNION ALL?

 

  • UNION ALL
    • appends rows from one or more tables and does not remove duplicate rows

 

  • UNION
    • appends rows from one or more tables and removes duplicate rows
Improving Query Performance in SQL Server

DISTINCT() and UNION

  • Use with caution
  • May use an internal sort mechanism to order and check for duplicates
  • Potentially increase the time it takes for a query to run
Improving Query Performance in SQL Server

Using DISTINCT()

Before using DISTINCT() we should ask:

  • Is there an alternative method?
    • using a table with a unique key instead
  • Is the query using an aggregate function?
    • group with GROUP BY
Improving Query Performance in SQL Server

Using UNION

Before using UNION we should ask:

  • Are duplicate rows OK?
  • Will appending queries produce duplicate rows?
  • Consider using UNION ALL if duplicate rows are OK or if no duplicate rows will be created
Improving Query Performance in SQL Server

Let's practice!

Improving Query Performance in SQL Server

Preparing Video For Download...