In this post we are going to understand PostgreSQL Page Layout in detail.
Bonus is coming up towards the end!!
Do I have ORACLE’s PCTFREE and PCTUSED kind of mechanism in PostgreSQL?
Table of Contents
PostgreSQL Page Layout
Each data file in PostgreSQL is divided into PostgreSQL disk blocks of 8 Kb each.The blocks are numbered sequentially, 0 to 0xFFFFFFFE. A PostgreSQL disk page is a memory version of a PostgreSQL disk block.
Below image depicts the page format used within PostgreSQL tables and indexes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
+----------------+---------------------------------+ | PageHeaderData | linp1 linp2 linp3 ... | +-----------+----+---------------------------------+ | ... linpN | | +-----------+--------------------------------------+ | ^ pd_lower | | | | v pd_upper | +-------------+------------------------------------+ | | tupleN ... | +-------------+------------------+-----------------+ | ... tuple3 tuple2 tuple1 | "special space" | +--------------------------------+-----------------+ ^ pd_special |
Inside the data file (heap table and index, as well as the free space map and visibility map), it is divided into pages (or blocks) of fixed length, the default is 8192 byte (8 KB). Those pages within each file are numbered sequentially from 0, and such numbers are called as block numbers.
A page is full when nothing can be added between pd_lower and pd_upper. If the file has been filled up, PostgreSQL adds a new empty page to the end of the file to increase the file size.
We will keep the PageHeaderData understanding to later this post and we continue with linep1..N.
linp1..N form an ItemId (line pointer) array. ItemPointers point to a physical block number and a logical offset (line pointer number) within that block/page.
tuple1..N are added “backwards” on the page. Since an ItemPointer offset is used to access an ItemId entry rather than an actual byte-offset position, tuples can be physically shuffled on a page whenever the need arises. This indirection also keeps crash recovery relatively simple, because the low-level details of page space management can be controlled by standard buffer page code during logging, and during recovery.
pd_upper: starting position of the new record to be inserted.
pd_lower: maximum allowed location in to the page.
The final section is the “special section“ which can contain anything the access method wishes to store. For example, b-tree indexes store links to the page’s left and right siblings, as well as some other data relevant to the index structure. Ordinary tables do not use a special section at all.
Example.,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
postgres=# create table a(name varchar(10)); CREATE TABLE postgres=# postgres=# postgres=# insert into a values('ABC'); INSERT 0 1 postgres=# insert into a values('DEF'); INSERT 0 1 postgres=# commit; WARNING: there is no transaction in progress COMMIT postgres=# checkpoint; CHECKPOINT postgres=# select pg_relation_filepath('a'); pg_relation_filepath ---------------------- base/13455/16462 (1 row) |
here, I just created a table with two records, lets understand its page structure with linux ‘od’ command.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
[postgres@test 13455]$ [postgres@test 13455]$ pwd /u01/pgsql/dir_data/base/13455 [postgres@test 13455]$ [postgres@test 13455]$ [postgres@test 13455]$ od -a 16462 0000000 nul nul nul nul P n del soh L E nul nul sp nul @ us 0000020 nul sp eot sp nul nul nul nul ` us 8 nul @ us 8 nul 0000040 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul * 0017700 D stx nul nul nul nul nul nul nul nul nul nul nul nul nul nul 0017720 stx nul soh nul stx bs can nul ht D E F nul nul nul nul 0017740 C stx nul nul nul nul nul nul nul nul nul nul nul nul nul nul 0017760 soh nul soh nul stx bs can nul ht A B C nul nul nul nul 0020000 [postgres@test 13455]$ |
It is confirmed that my first record ABC has gone to bottom of the page and the next DEF is coming towards pg_lower. The first couple of lines indicate PageHeaderData.
PageHeaderData Layout in PostgreSQL
pageheader information can be viewed with functions.To work with these functions we must install pageinspect extension.
1 2 |
postgres=# create extension pageinspect; CREATE EXTENSION |
Now, by querying page_header() we get page header data
1 2 3 4 5 6 7 8 |
postgres=# postgres=# SELECT * FROM page_header(get_raw_page('a',0)); lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid -----------+----------+-------+-------+-------+---------+----------+---------+----------- 0/17F6E50 | 0 | 0 | 32 | 8128 | 8192 | 8192 | 4 | 0 (1 row) postgres=# |
lets breakdown the columns.
pd_lsn – identifies xlog record for last change to this page.
0/17F6E50 – Current WAL position
pd_checksum – page checksum, if set.
0 (because not set) If enabled, checksums are calculated for each data page. The detection of a checksum failure will cause an error when reading data and will abort the transaction currently running. So, this brings additional control for the detection of an I/O or hardware problem directly at the level of the database server.
pd_flags – flag bits.
#define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */
#define PD_PAGE_FULL 0x0002 /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
pd_lower – offset to start of free space.
32
pd_upper – offset to end of free space.
8192
pd_special – offset to start of special space.
pd_pagesize_version – size in bytes and page layout version number.
since 7.3 page version and page size is clubbed.
pd_prune_xid – oldest XID among potentially prunable tuples on page.
any dead tuples?
PostgreSQL Data Structure in Page
To find the data structure in page we need to query heap_page_items()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
postgres=# select * from heap_page_items(get_raw_page('a',0)); -[ RECORD 1 ]----------- lp | 1 lp_off | 8160 lp_flags | 1 lp_len | 28 t_xmin | 579 t_xmax | 0 t_field3 | 0 t_ctid | (0,1) t_infomask2 | 1 t_infomask | 2050 t_hoff | 24 t_bits | t_oid | t_data | \x09414243 -[ RECORD 2 ]----------- lp | 2 lp_off | 8128 lp_flags | 1 lp_len | 28 t_xmin | 580 t_xmax | 0 t_field3 | 0 t_ctid | (0,2) t_infomask2 | 1 t_infomask | 2050 t_hoff | 24 t_bits | t_oid | t_data | \x09444546 postgres=# |
Here
lp
A line pointer on a buffer page
lp_off
A line pointer offset
lp_flags
#define LP_UNUSED 0 /* unused (should always have lp_len=0) */
#define LP_NORMAL 1 /* used (should always have lp_len>0) */
#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 /* dead, may or may not have storage */
lp_len
byte length of a tuple.
t_xmin
txid of insert command
t_xmax
txid of change command
t_field3
t_field3 is an Xvac identifier. This is an older feature that tracks the transaction id of the older VACUUM FULL command. t_field3 is blank in this case because there is not a transaction in progress.
t_ctid
A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid is initialized with its own TID (location). If the tuple is ever updated,
its t_ctid is changed to point to the replacement version of the tuple.
To make minimal changes on disk PostgreSQL will write the new row, update thet_xmax of the original row, and write to the t_ctid field of the original.
t_infomask2 | t_infomask
Botht_infomask2 and t_infomask are flags and attributes which tells more about XMAX that is current state of the row.
t_hoff
The t_hoff field is the header offset – this tells us where we can find the beginning of data inside the row.
t_bits
t_bits is a variable length bitmap of null values.
Bonus
Do I have ORACLE’s PCTFREE and PCTUSED kind of mechanism in PostgreSQL?
FILLFACTOR attribute
When INSERT statement is executed, the tuple is added to the page. If executor cannot store the tuple into the working page anymore, it searches the next free page. FILLFACTOR is an attribute to indicate the percentage of the size in the page where tuples can be stored. The default value for FILLFACTOR is 100 (%). For this reason, tuples are normally stored in the page without gaps. FILLFACTOR can be also specified to the index.
The advantage of the decrease of FILL FACTOR from 100% is the performance improvement of the access in page unit, because free spaces are used in case of update using UPDATE statement. On the other hand, since the number of the pages used by the table to expand, I/O increases in case of reading entire table. For the table or index on which update is frequently performed is recommended to lower the value of the FILLFACTOR from the default value.
FILLFACTOR verification at the time of CREATE TABLE statement is executed
To set the FILLFACTOR when creating tables, it should be written to the WITH clause of the CREATE TABLE statement. To verify, refer the reloptions column of pg_class catalog.
Overall PostgreSQL Page Layout
Item | Description |
PageHeaderData | 24 bytes long. Contains general information about the page, including free space pointers. |
ItemIdData | Array of item identifiers pointing to the actual items. Each entry is an (offset,length) pair. 4 bytes per item. |
Free space | The unallocated space. New item identifiers are allocated from the start of this area, new items from the end. |
Items | The actual items themselves. |
Special space | Index access method specific data. Different methods store different data. Empty in ordinary tables. |
Overall PostgreSQL Page Header Layout
Field | Type | Length | Description |
pd_lsn | PageXLogRecPtr | 8 bytes | LSN: next byte after last byte of WAL record for last change to this page |
pd_checksum | uint16 | 2 bytes | Page checksum |
pd_flags | uint16 | 2 bytes | Flag bits |
pd_lower | LocationIndex | 2 bytes | Offset to start of free space |
pd_upper | LocationIndex | 2 bytes | Offset to end of free space |
pd_special | LocationIndex | 2 bytes | Offset to start of special space |
pd_pagesize_version | uint16 | 2 bytes | Page size and layout version number information |
pd_prune_xid | TransactionId | 4 bytes | Oldest unpruned XMAX on page, or zero if none |
Thank you for giving your valuable time to read the above information. I hope the content served your purpose in reaching out to the blog.
Suggestions for improvement of the blog are highly appreciable. Please contact us for any information/suggestions/feedback.
If you want to be updated with all our articles
please follow us on Facebook | Twitter
Please subscribe to our newsletter.