PostgreSQL Checkpoint: Database blocks are cached in the database shared buffers. As blocks are read from disk, they are kept in shared buffers so that if any user accesses them later, they are already in memory and need not be read from disk again.
When we update a row, the buffer in shared buffers corresponding to the block containing that row is updated in memory, and a record of the change is kept in the WAL buffers. On commit, the changes we made are written to the WAL file on disk, thereby making them permanent. If the instance crashes right then, the shared buffers are wiped out, but on restarting the database, PostgreSQL applies the changes recorded in the WAL files to the corresponding data files.
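You can watch the WAL position advance as a change is committed; a minimal sketch, assuming a hypothetical table t1 and PostgreSQL 10 or later (on 9.x the function was pg_current_xlog_location):

-- Note the current end of WAL
SELECT pg_current_wal_lsn();

-- Make and commit a change (t1 is a hypothetical table)
BEGIN;
UPDATE t1 SET val = 'new value' WHERE id = 1;
COMMIT;

-- The WAL position has moved forward: the commit record is safely on disk,
-- even though the modified data page may still exist only in shared buffers
SELECT pg_current_wal_lsn();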
Why doesn’t PostgreSQL write the changes to data files right away when we commit the transaction?
The reason is simple. If it wrote directly to the data files, it would first have to physically locate the data block within the file and then update it, which means that after committing, the user would have to wait until that block is found and written.
Moreover, writes to data files are scattered, whereas writes to the WAL are sequential.
So the WAL writer process writes the changes to the WAL file, and later the checkpointer writes the corresponding dirty blocks to their respective data files.
Wait, what?
Checkpointer process?
Yes.
Dirty blocks are written to the data files by three different processes: the checkpointer process, the background writer process, and the user backend processes.
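These background processes are visible at the operating-system level; a quick way to list them (process titles vary across versions, e.g. "checkpointer process" on 9.x versus "postgres: checkpointer" on newer releases):

# List the PostgreSQL background processes, including the checkpointer
# and the background writer
ps -ef | grep postgres | grep -v grep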
We will limit our discussion here to CHECKPOINT only.
What is a checkpoint?
A checkpoint is a synchronization event at a specific point in time that causes some or all dirty pages in shared buffers to be written to disk.
But why?
In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the write-ahead log (WAL) from which it should start the REDO operation.
Confused?
Let’s understand the life cycle of a database change.
As we all know, PostgreSQL keeps a cache of pages in RAM (shared_buffers) to make them available to all backends for faster access.
How does a change happen in a database?
STEP 1: Whenever PostgreSQL needs to change anything in a table, it checks whether the required pages are in shared_buffers; if yes, it modifies them directly, and if not, it loads the required pages into shared_buffers and then modifies them.
STEP 2: The modified pages are termed dirty pages.
STEP 3: Upon commit, the changes we made are written to the write-ahead log, thereby making them permanent on disk.
STEP 4: The dirty pages are periodically flushed to their respective data files on disk.
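If you want to see how many dirty pages are sitting in shared buffers at any moment, the pg_buffercache contrib extension (also used in the case study below) exposes that; a minimal sketch, assuming the extension can be installed:

-- pg_buffercache is a contrib extension and must be installed once per database
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the pages in shared_buffers that are currently dirty
SELECT count(*) AS dirty_pages
FROM pg_buffercache
WHERE isdirty;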
But how often are dirty pages written to disk?
This is where the checkpoint comes into the picture.
At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. Any changes made to data files before that point are guaranteed to be already on disk.
If your system crashes, recovery will start from this latest checkpoint, the point at which the database knew that everything on disk was consistent.
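You can see the checkpoint record that recovery would start from with the pg_controldata utility; a quick sketch (exact field wording varies slightly between versions):

# Show the latest checkpoint record and its REDO location,
# i.e. where crash recovery would begin replaying WAL
pg_controldata "$PGDATA" | grep -i checkpoint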
A checkpoint occurs in the following cases (a quick way to observe these triggers is shown after the list):
When the administrator runs the CHECKPOINT statement.
With the interval specified in the parameter checkpoint_timeout (default: 300 seconds)
When the amount of WAL written since the last checkpoint reaches max_wal_size (default: 1 GB)
At the start of an online backup
At the execution of the pg_start_backup function
At the execution of a pg_basebackup command
At the shutdown of an instance (except for pg_ctl stop -m immediate)
At the execution of statements such as CREATE DATABASE / DROP DATABASE
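A simple way to observe these triggers is the checkpoint counters kept by the statistics views; a minimal sketch (the columns live in pg_stat_bgwriter up to PostgreSQL 16 and in pg_stat_checkpointer from 17 onwards):

-- Checkpoints triggered by checkpoint_timeout vs. those requested
-- (max_wal_size reached, manual CHECKPOINT, backups, shutdown, ...)
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;

-- Force one manually and re-run the query: checkpoints_req increases
CHECKPOINT;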
During a CHECKPOINT, the database needs to perform these three basic steps:
1. Identify all the dirty pages in shared buffers.
2. Write the dirty pages to the respective files.
3. Issue fsync() on all the modified files to disk.
NOTE: The fsync() function forces a physical write of data from the buffer cache and assures that, after a system crash or other failure, all data up to the time of the fsync() call is recorded on disk.
Checkpoint parameters in PostgreSQL:
1. checkpoint_timeout:
It is the maximum time between automatic WAL checkpoints (default is 5 minutes). Increasing this parameter can increase the amount of time needed for crash recovery.
2. max_wal_size:
Maximum size to let the WAL grow to between automatic WAL checkpoints. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery.
If we set both parameters, a checkpoint is triggered by whichever limit is reached first.
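A quick check of the values in effect on your instance:

-- Current values of the two checkpoint triggers
SHOW checkpoint_timeout;
SHOW max_wal_size;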
3. min_wal_size:
As long as WAL disk usage stays below this setting, old WAL files are always recycled for future use at a checkpoint, rather than removed. This can be used to ensure that enough WAL space is reserved to handle spikes in WAL usage, for example when running large batch jobs. (default is 80 MB)
4. checkpoint_completion_target :
Since a checkpoint occurs every 5 minutes or whenever the max_wal_size threshold is reached, all the dirty pages sitting in shared buffers are flushed to disk at checkpoint time, which can cause a huge burst of I/O.
checkpoint_completion_target comes to the rescue here.
It slows down the flushing so that PostgreSQL spreads the writes over checkpoint_completion_target * checkpoint_timeout amount of time.
For example, if checkpoint_completion_target is 0.5 and checkpoint_timeout is 5 minutes, the database will throttle the writes so that the last write is done after 0.5 * 5 = 2.5 minutes.
More about the checkpoint_completion_target is discussed here
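As a rough postgresql.conf sketch, assuming the default values:

# With these (default) settings, PostgreSQL spreads the checkpoint writes over
# checkpoint_completion_target * checkpoint_timeout = 0.5 * 300 s = 150 s,
# so the last dirty page is written about 2.5 minutes after the checkpoint starts.
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5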
5. wal_buffers :
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB.
6. checkpoint_flush_after:
Whenever more than checkpoint_flush_after bytes have been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel’s page cache, reducing the likelihood of stalls when an fsync is issued at the end of the checkpoint.
This setting may have no effect on some platforms.
7. checkpoint_warning: Write a message to the server log if checkpoints caused by the filling of checkpoint segment files happen closer together than this many seconds.
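All of these parameters can be inspected in one place via pg_settings; a quick sketch:

-- Show the checkpoint-related settings currently in effect
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'checkpoint%'
   OR name IN ('max_wal_size', 'min_wal_size', 'wal_buffers');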
CASE STUDY on CHECKPOINT:
Let us do a case study on a table “checkpoint_example” to understand checkpoints in practice.
We will first find the filenode for the checkpoint_example table with the following simple commands.
[postgres@asristgdb ~]$ oid2name -d testdb -t checkpoint_example
From database "testdb":
  Filenode          Table Name
------------------------------
     33428  checkpoint_example
[postgres@asristgdb ~]$ oid2name
All databases:
    Oid  Database Name  Tablespace
----------------------------------
  16738           nhps  pg_default
  25044           pkdb  pg_default
  13269       postgres  pg_default
  13268      template0  pg_default
      1      template1  pg_default
  16749          test1  pg_default
  16386         testdb  pg_default
  25042           tkdb  pg_default
[postgres@asristgdb ~]$
Now, we will update a record in the table.
testdb=# update checkpoint_example set city='CHENNAI' where city='HYDERABAD';
UPDATE 1
testdb=# commit;
WARNING:  there is no transaction in progress
COMMIT
Querying pg_buffercache for the status of the page, we find it is dirty:
testdb=# SELECT reldatabase,relfilenode,isdirty FROM pg_buffercache WHERE relfilenode='33428';
 reldatabase | relfilenode | isdirty
-------------+-------------+---------
       16386 |       33428 | t
(1 row)

testdb=#
Here, the page became dirty after the change. Now, let’s issue a checkpoint.
testdb=# checkpoint;
CHECKPOINT
testdb=# SELECT reldatabase,relfilenode,isdirty FROM pg_buffercache WHERE relfilenode='33428';
 reldatabase | relfilenode | isdirty
-------------+-------------+---------
       16386 |       33428 | f
(1 row)

testdb=#
The server log records:
LOG:  checkpoint starting: immediate force wait
LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.016 s, sync=0.001 s, total=0.020 s; sync files=3, longest=0.001 s, average=0.000 s; distance=17 kB, estimate=14895 kB
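These messages appear in the server log only when log_checkpoints is enabled (off by default before PostgreSQL 15, on from 15 onwards); one way to turn it on without a restart, assuming PostgreSQL 9.4+ for ALTER SYSTEM:

-- Enable checkpoint logging and reload the configuration
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();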
Conclusion:
- The checkpointer process writes dirty buffers from shared buffers to the respective data files on a periodic basis, depending on the configuration.
- checkpoint_timeout and max_wal_size are the two important parameters one has to look at when configuring checkpoints.
Words from postgreshelp
Thank you for giving your valuable time to read the above information. I hope the content served the purpose that brought you to the blog.
Suggestions for improving the blog are highly appreciated. Please contact us with any information, suggestions, or feedback.