How to remove duplicates using window function?

Removing duplicates, is a challenging task.

Sometimes you need something special. Using DISTINCT/ GROUP BY / UNION is not enough.

You need to remove duplicates is some other way: using window function:

You can do it using following query:

WITH loc_dim AS (
SELECT 
  ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u
)
SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1

If you would like to see how input data looks like. What is the expected result. When this approach might not be a good idea. Please continue reading.

Sample data

Data sample comes from Stack Overflow 2013 edition provided by Brent Ozar.

It has the following structure:

And following data. We are interested in last column that you can see in the screenshot: Location

Expected results

You would like to create a dimension from it and have only unique list of locations, no duplicates are expected. Every location should exist only once.
Like this:

This is expected for the purpose of this exercise. In the real world, you will still perform data curation, but now let us keep it simple.

Window function to eliminate duplicates

This is the final query:

WITH loc_dim AS (
SELECT
  ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u
)
SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1

Let me break down it for you.

CTE loc_dim

This is where window fuction is executed:

SELECT
  ROW_NUMBER() OVER (PARTITION BY u.Location ORDER BY u.Id) AS RowNumber, u.Location
FROM dbo.Users u

PARTITION BY creates partitions, you can think about them like groups, but don't be misled it is not the same as GROUP BY. Window function is applied on partition.

Location	Id
Hawaii	423
Hawaii	17
Hawaii	932
Hawaii	53

ORDER BY sorts by a given column. It helps us to define order of importance for the records. In this case Id was selected as it is unique acros all the records. Also there is a clustered index created on it, so it improves performance.

In this example sorting is ascending, but you can make it also descending. It depends on your preference.

Location	Id
Hawaii	17
Hawaii	53
Hawaii	423
Hawaii	932

As a last step ROW NUMBER is applied, for records partitioned and ordered.

Location	Id	RowNumber
Hawaii	17	1
Hawaii	53	2
Hawaii	423	3
Hawaii	932	4

Querying CTE

Result of previous step is filtered:

SELECT loc.Location FROM loc_dim loc
WHERE loc.RowNumber = 1

Our interest is only to have only 1 record per partition. We know that all groups will have at least 1 entry. Also ROW NUMBER starts counting from 1. It does not start from 0.

After filtering you will only have one row:

Location	Id	RowNumber
Hawaii	17	1

Why window function might not be a good idea?

Using window function could not be a good idea from performance perspective. It uses much more memory than DISTINCT, or GROUP BY. In this example it consumed the same amount of logical reads as DISTINCT or GROUP BY.

It is also more complex to write and maintain this query. It is easier to make an mistake.

Also it is easier to explain DISTINCT or GROUP BY then, window function.

Active learning:

What are possibilities of removing duplicate?
Is PARTITION BY the same as GROUP BY?
Why window function might not be a good idea?

How to remove duplicates using window function?

Sample data

Expected results

Window function to eliminate duplicates

CTE loc_dim

Querying CTE

Why window function might not be a good idea?

Active learning:

Zapisz się na newsletter!

Archiwa

Sample data

Expected results

Window function to eliminate duplicates

CTE loc_dim

Querying CTE

Why window function might not be a good idea?

Active learning:

Zapisz się na newsletter!

Tagi

Archiwa