Removing duplicates, is a challenging task.
Sometimes you need something special. Using DISTINCT/ GROUP BY / UNION is not enough.
You need to remove duplicates is some other way: using window function:
You can do it using following query:
WITH loc_dim AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u
)
SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1
If you would like to see how input data looks like. What is the expected result. When this approach might not be a good idea. Please continue reading.
Sample data
Data sample comes from Stack Overflow 2013 edition provided by Brent Ozar.
It has the following structure:
And following data. We are interested in last column that you can see in the screenshot: Location
Expected results
You would like to create a dimension from it and have only unique list of locations, no duplicates are expected. Every location should exist only once.
Like this:
This is expected for the purpose of this exercise. In the real world, you will still perform data curation, but now let us keep it simple.
Window function to eliminate duplicates
This is the final query:
WITH loc_dim AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u
)
SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1
Let me break down it for you.
CTE loc_dim
This is where window fuction is executed:
SELECT
ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u
PARTITION BY creates partitions, you can think about them like groups, but don't be misled it is not the same as GROUP BY. Window function is applied on partition.
Location | Id |
---|---|
Hawaii | 423 |
Hawaii | 17 |
Hawaii | 932 |
Hawaii | 53 |
ORDER BY sorts by a given column. It helps us to define order of importance for the records. In this case Id was selected as it is unique acros all the records. Also there is a clustered index created on it, so it improves performance.
In this example sorting is ascending, but you can make it also descending. It depends on your preference.
Location | Id |
---|---|
Hawaii | 17 |
Hawaii | 53 |
Hawaii | 423 |
Hawaii | 932 |
As a last step ROW NUMBER is applied, for records partitioned and ordered.
Location | Id | RowNumber |
---|---|---|
Hawaii | 17 | 1 |
Hawaii | 53 | 2 |
Hawaii | 423 | 3 |
Hawaii | 932 | 4 |
Querying CTE
Result of previous step is filtered:
SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1
Our interest is only to have only 1 record per partition. We know that all groups will have at least 1 entry. Also ROW NUMBER starts counting from 1. It does not start from 0.
After filtering you will only have one row:
Location | Id | RowNumber |
---|---|---|
Hawaii | 17 | 1 |
Why window function might not be a good idea?
Using window function could not be a good idea from performance perspective. It uses much more memory than DISTINCT, or GROUP BY. In this example it consumed the same amount of logical reads as DISTINCT or GROUP BY.
It is also more complex to write and maintain this query. It is easier to make an mistake.
Also it is easier to explain DISTINCT or GROUP BY then, window function.
Active learning:
- What are possibilities of removing duplicate?
- Is PARTITION BY the same as GROUP BY?
- Why window function might not be a good idea?