You have a big table. The biggest in your system.

You may say big table, big fun but also in some situation a big challenge.

The manager gives you a task:

  • delete a small portion of data in the table. Only about 1% of rows need to be removed.

How would you approach this task?

What query will you build? How would you minimize logical reads? Would you approach this task differently if it was a one time activity or task executed on a regular basis?

Consider: is this table used exclusively used by you? Maybe in parallel, some other process execute inserts into this table?

Continue reading to see how to delete data in batch on SQL Server.

Setting up the stage

Stack Overflow database has Comments table. It has 24 million rows.

You need to remove records from 2008. They are not needed anymore.

how to delete

If this one time, do and forget task, you will do it using a query like this:

DELETE FROM dbo.F_Comment
WHERE CreationDate < '2009-01-01';

And pray that DBA would not notice how many resources it consumes.

Deleting in batches

There is another way. To delete data from table in batches.

Run it as many times as you need, and give DB time and allow other processes to run on the table.

View will give you a batch of rows to delete:

CREATE VIEW dbo.V_F_Comments_5000 AS
    SELECT TOP 5000 *
    FROM dbo.F_Comment
    ORDER BY CreationDate ASC;

Index will allow finding the right data faster:

CREATE INDEX IX_F_Comment_CreationDate ON dbo.F_Comment(CreationDate);

Then delete from view will take place. It will delete data from the underlying table.

DELETE FROM dbo.V_F_Comments_5000
WHERE CreationDate < '2009-01-01';

You require executing it several times, depending on how many rows you have to delete.

Why only 5 000 records?
You should test it and verify how it looks on your end. It depends on your data characteristics. This is related to lock escalation threshold.

Stored procedure - caution don't copy paste!

Let us automate it and move it to another level.

Create stored procedure that will release us from manual work and delete not needed rows based on filter.

This automation will save us a lot of manual work.

Probably you would not read the next line, but I will write it:

Please don't run it on PROD, it is not tested enough. This is not production-ready code.
These are just developers considerations. Without any warranty.

CREATE PROCEDURE dbo.sp_CommentDelete
    @cutOverDate nvarchar(10)
AS
  SET NOCOUNT OFF ;
  DECLARE @deleted INT = 1
  WHILE (@deleted > 0)
  BEGIN
    DELETE dbo.V_F_Comments_5000
    WHERE CreationDate < @cutOverDate

    SET @deleted = @@ROWCOUNT
    WAITFOR DELAY '00:00:01'
  END
GO

Execution window

This part of the blog post is probably not for you but for your younger friend that is starting his data engineer career, but it adds an important point.

So you have it. Your code ready to execute. But wait. There is another point.

Do you have free access to PROD? And you would like to try this code right away? What can possibly go wrong?

Probably copying the code from a stranger on the internet can give you a bit of adrenaline, but it is not recommended.

Also consider when would you like to execute this script?

During the office hours? When you have a lunch break? Or maybe after 5 PM when there is hardly anyone in the office.

When your organization is mature enough you may also have something like maintenance, service or release windows, so you already have a window when your script will be executed.

Summary and other considerations

You may ask is this method faster then regular delete?

In my scenario, slightly, but it has other advantages.

It allows you to avoid exclusive looks that might happen. Also it does not eat all the server resources. It performs part of the job, waits a bit for and gives other, maybe more important processes time to jump in.

If you would like to use it on production environment, please play with delay time and number of records that you are removing in one batch. If you have more sophisticated demand on deletion, and you have even bigger constraint on performance, please give this a read:
https://michaeljswart.com/2014/09/take-care-when-scripting-batches/

This post was inspired by Brent Ozar's:
https://www.brentozar.com/archive/2018/04/how-to-delete-just-some-rows-from-a-really-big-table/

Active learning:

  1. Can you delete data using Views in SQL Server?
  2. What other delete technics from BIG table do you know? (https://nextlevelbi.pl/jak-usunac-dane-z-tabeli-bez-delete/)
  3. When is the best time to execute this query?
  4. Can you copy and paste scripts from the stranger on the internet and put it into production?