Notes on the Principles of the PostgreSQL TOAST Mechanism

文一

Published 2025-03-23


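A gentleman is slighted by those who do not understand him, and vindicated among those who do.

Records of the Grand Historian (《史记》)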

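All of our work revolves around one overriding goal: "uniting domestic kernel development talent and lowering the barrier to building industrial-grade databases." To that end we must keep a broad technical outlook and a positive, proactive, and inclusive attitude toward learning, in the hope that one day an environment of mutual understanding and mutual trust will take shape and lay the groundwork for a thriving open-source ecosystem in China.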

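Preface

While organizing material on PostgreSQL's hash index recently, I began to build a concrete understanding of PostgreSQL pages and, along the way, of several key functions, as shown here: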


bool
hashinsert(Relation rel, Datum *values, bool *isnull,
		   ItemPointer ht_ctid, Relation heapRel,
		   IndexUniqueCheck checkUnique,
		   bool indexUnchanged,
		   IndexInfo *indexInfo)
{
	// ...
	/* arguments: the index relation, the index tuple, the heap relation the index speeds up, and sorted = false */
	_hash_doinsert(rel, itup, heapRel, false);
	// ...
}

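Here is an important piece of supplementary material (I combed through many articles at first and could not find it anywhere):

When is the IndexRelation of an index created?

The answer lies in how this function handles indexRelationId. In addition, the accessMethodId parameter of index_create shows how "index_create" ties in with the various secondary index implementations.

Back to the main thread. Let us read on: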

void
_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted)
{
	// ...
	hashkey = _hash_get_indextuple_hashkey(itup);

	// ...
	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
	metapage = BufferGetPage(metabuf);

	// ...

	/* metapage operations */
	metap = HashPageGetMeta(metapage);
	metap->hashm_ntuples += 1;
	// ...
}

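As the code shows, PostgreSQL's hash index reads and writes pages stored on disk through the abstract interface provided by the buffer layer (PostgreSQL organizes data into pages). The documentation has this to say about those pages: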

PostgreSQL uses a fixed page size
(commonly 8 kB), and does not allow tuples to span multiple pages.
Therefore, it is not possible to store very large field values directly.
To overcome this limitation, large field values are compressed and/or
broken up into multiple physical rows.

(Source: the "TOAST" chapter of the PostgreSQL documentation.)

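At the level of the kernel interfaces (say we want to extract data that a tuple stores on a page), an ordinary fixed-size attribute can be read with a direct XXXDatumGetXXX-style call. A large value usually takes extra steps first, such as applying the PG_DETOAST_DATUM macro, or pulling the attribute out with fastgetattr and inspecting its header, before its contents can be read.

This is where the PostgreSQL TOAST mechanism comes in, and it is the subject of this article.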
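Why the extra step exists can be sketched by looking at the first header byte of a variable-length value. The block below is a standalone toy, not PostgreSQL code: `ToyVarlenaKind`, `toy_classify`, and `toy_needs_detoast` are hypothetical names, and the bit tests follow the little-endian header layout as I understand it from PostgreSQL's varlena macros (big-endian builds use different bit positions). Anything other than a plain 4-byte-header value needs some processing before its bytes can be read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the forms a variable-length value can take. */
typedef enum ToyVarlenaKind
{
	TOY_PLAIN_4B,		/* 4-byte header, stored inline and uncompressed */
	TOY_COMPRESSED_4B,	/* 4-byte header, stored inline but compressed */
	TOY_SHORT_1B,		/* 1-byte header, small inline value */
	TOY_EXTERNAL		/* pointer to data kept out of line in a TOAST table */
} ToyVarlenaKind;

/* Classify a value by its first header byte (little-endian layout assumed). */
static ToyVarlenaKind
toy_classify(uint8_t first_byte)
{
	if (first_byte == 0x01)
		return TOY_EXTERNAL;		/* 1-byte header with no inline length */
	if ((first_byte & 0x01) == 0x01)
		return TOY_SHORT_1B;		/* low bit set: 1-byte header */
	if ((first_byte & 0x03) == 0x02)
		return TOY_COMPRESSED_4B;	/* low two bits 10: compressed */
	return TOY_PLAIN_4B;			/* low two bits 00: plain value */
}

/* Nonzero when the value is in any "extended" form and must be detoasted. */
static int
toy_needs_detoast(uint8_t first_byte)
{
	return toy_classify(first_byte) != TOY_PLAIN_4B;
}
```

In this toy model, only a header byte whose low two bits are both zero can be used as is, which is the case the plain-datum branch of a detoast routine relies on.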

The idea behind PostgreSQL TOAST: split large values into small pieces

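As described above, PostgreSQL pages have a fixed length, so a large value must also live in fixed-length pages rather than variable-length ones. The solution is therefore to cut the large value into small pieces, as the figure illustrates:

Two things stand out:

The large value is stored separately in a "Toast Table", which can be located through pg_class.reltoastrelid (the relation's toast relation id); in its place, the original table stores a pointer into the Toast Table.
The Toast Table holds the large value as a sequence of pieces, loosely resembling a linked list, except that the pieces are ordered by a sequence number rather than chained by pointers. Each piece is called a "chunk".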
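To make the splitting concrete, here is a minimal, self-contained sketch in C. It does not use PostgreSQL's headers: `TOY_MAX_CHUNK_SIZE`, `ToyChunk`, `toy_total_chunks`, `toy_split`, and `toy_join` are hypothetical names invented for illustration, and the 8-byte chunk size is deliberately tiny (the real chunk size is about 2 kB by default). Each toy chunk carries a value ID, a sequence number, and the data, loosely mirroring the (chunk_id, chunk_seq, chunk_data) columns of a TOAST table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHUNK_SIZE 8	/* hypothetical; far smaller than the real value */
#define TOY_MAX_CHUNKS 16		/* hypothetical capacity of the output array */

/* One row of a toy "TOAST table". */
typedef struct ToyChunk
{
	unsigned	chunk_id;	/* identifies the large value this chunk belongs to */
	int			chunk_seq;	/* position of the chunk within the value, from 0 */
	size_t		len;		/* number of bytes used in data[] */
	char		data[TOY_MAX_CHUNK_SIZE];
} ToyChunk;

/* Number of chunks needed for a value of `size` bytes (size > 0). */
static int
toy_total_chunks(size_t size)
{
	assert(size > 0);
	return (int) ((size - 1) / TOY_MAX_CHUNK_SIZE) + 1;
}

/* Split `value` into chunks written to `out`; returns the chunk count. */
static int
toy_split(unsigned chunk_id, const char *value, size_t size, ToyChunk *out)
{
	int			n = toy_total_chunks(size);

	assert(n <= TOY_MAX_CHUNKS);
	for (int i = 0; i < n; i++)
	{
		size_t		off = (size_t) i * TOY_MAX_CHUNK_SIZE;
		size_t		rest = size - off;
		size_t		len = rest < TOY_MAX_CHUNK_SIZE ? rest : TOY_MAX_CHUNK_SIZE;

		out[i].chunk_id = chunk_id;
		out[i].chunk_seq = i;
		out[i].len = len;
		memcpy(out[i].data, value + off, len);
	}
	return n;
}

/* Reassemble chunks (already ordered by chunk_seq) into `dst`; returns total bytes. */
static size_t
toy_join(const ToyChunk *chunks, int n, char *dst)
{
	size_t		total = 0;

	for (int i = 0; i < n; i++)
	{
		assert(chunks[i].chunk_seq == i);	/* chunks must arrive in order */
		memcpy(dst + total, chunks[i].data, chunks[i].len);
		total += chunks[i].len;
	}
	return total;
}
```

With these toy settings a 20-byte value splits into three chunks of 8, 8, and 4 bytes, and joining them in sequence order restores the original bytes.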

In the source code, this corresponds to:

struct varlena *
pg_detoast_datum(struct varlena *datum)
{
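	if (VARATT_IS_EXTENDED(datum)) // is the datum in an extended form (out-of-line, compressed, or short-header)?
		return detoast_attr(datum); // fetch it from the TOAST table and/or decompress it
	else // plain, uncompressed value with a 4-byte header
		return datum; // the datum can be used as is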
}

struct varlena *
detoast_attr(struct varlena *attr)
{
	if (VARATT_IS_EXTERNAL_ONDISK(attr))
	{
		attr = toast_fetch_datum(attr);
		if (VARATT_IS_COMPRESSED(attr)) // decompress if the fetched TOAST value is compressed
		{
			struct varlena *tmp = attr;

			attr = toast_decompress_datum(tmp);
			pfree(tmp);
		}
	}
	// ...

	return attr;
}

static struct varlena *
toast_fetch_datum(struct varlena *attr)
{
	Relation	toastrel;
	struct varlena *result;
	struct varatt_external toast_pointer;
	int32		attrsize;

	// ...

	// Open the TOAST table (its index is opened later, inside the slice fetch)
	toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);

	// Fetch every chunk of the value: offset 0, full length
	table_relation_fetch_toast_slice(toastrel, toast_pointer.va_valueid,
									 attrsize, 0, attrsize, result);

	// Close the TOAST table
	table_close(toastrel, AccessShareLock);

	return result;
}

void
heap_fetch_toast_slice(Relation toastrel, Oid valueid, int32 attrsize,
					   int32 sliceoffset, int32 slicelength,
					   struct varlena *result)
{
	Relation   *toastidxs;
	ScanKeyData toastkey[3];
	TupleDesc	toasttupDesc = toastrel->rd_att;
	int			nscankeys;
	SysScanDesc toastscan;
	HeapTuple	ttup;
	int32		expectedchunk;
	int32		totalchunks = ((attrsize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
	int			startchunk;
	int			endchunk;
	int			num_indexes;
	int			validIndex;
	SnapshotData SnapshotToast;

	/* Look for the valid index of toast relation */
	validIndex = toast_open_indexes(toastrel,
									AccessShareLock,
									&toastidxs,
									&num_indexes);

	startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
	endchunk = (sliceoffset + slicelength - 1) / TOAST_MAX_CHUNK_SIZE;
	Assert(endchunk <= totalchunks);

	/* Set up a scan key to fetch from the index. */
	ScanKeyInit(&toastkey[0],
				(AttrNumber) 1,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(valueid));

	/*
	 * No additional condition if fetching all chunks. Otherwise, use an
	 * equality condition for one chunk, and a range condition otherwise.
	 */
	if (startchunk == 0 && endchunk == totalchunks - 1)
		nscankeys = 1;
	else if (startchunk == endchunk)
	{
		ScanKeyInit(&toastkey[1],
					(AttrNumber) 2,
					BTEqualStrategyNumber, F_INT4EQ,
					Int32GetDatum(startchunk));
		nscankeys = 2;
	}
	else
	{
		ScanKeyInit(&toastkey[1],
					(AttrNumber) 2,
					BTGreaterEqualStrategyNumber, F_INT4GE,
					Int32GetDatum(startchunk));
		ScanKeyInit(&toastkey[2],
					(AttrNumber) 2,
					BTLessEqualStrategyNumber, F_INT4LE,
					Int32GetDatum(endchunk));
		nscankeys = 3;
	}

	/* Prepare for scan */
	init_toast_snapshot(&SnapshotToast);
	toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
										   &SnapshotToast, nscankeys, toastkey);

	/*
	 * Read the chunks by index
	 *
	 * The index is on (valueid, chunkidx) so they will come in order
	 */
	expectedchunk = startchunk;
	while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
	{
		int32		curchunk;
		Pointer		chunk;
		bool		isnull;
		char	   *chunkdata;
		int32		chunksize;
		int32		expected_size;
		int32		chcpystrt;
		int32		chcpyend;

		/*
		 * Have a chunk, extract the sequence number and the data
		 */
		curchunk = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
		Assert(!isnull);
		chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
		Assert(!isnull);
		if (!VARATT_IS_EXTENDED(chunk))
		{
			chunksize = VARSIZE(chunk) - VARHDRSZ;
			chunkdata = VARDATA(chunk);
		}
		else if (VARATT_IS_SHORT(chunk))
		{
			/* could happen due to heap_form_tuple doing its thing */
			chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
			chunkdata = VARDATA_SHORT(chunk);
		}
		else
		{
			/* should never happen */
			elog(ERROR, "found toasted toast chunk for toast value %u in %s",
				 valueid, RelationGetRelationName(toastrel));
			chunksize = 0;		/* keep compiler quiet */
			chunkdata = NULL;
		}

		/*
		 * Some checks on the data we've found
		 */
		if (curchunk != expectedchunk)
			ereport(ERROR,
					(errcode(ERRCODE_DATA_CORRUPTED),
					 errmsg_internal("unexpected chunk number %d (expected %d) for toast value %u in %s",
									 curchunk, expectedchunk, valueid,
									 RelationGetRelationName(toastrel))));
		if (curchunk > endchunk)
			ereport(ERROR,
					(errcode(ERRCODE_DATA_CORRUPTED),
					 errmsg_internal("unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
									 curchunk,
									 startchunk, endchunk, valueid,
									 RelationGetRelationName(toastrel))));
		expected_size = curchunk < totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
			: attrsize - ((totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
		if (chunksize != expected_size)
			ereport(ERROR,
					(errcode(ERRCODE_DATA_CORRUPTED),
					 errmsg_internal("unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
									 chunksize, expected_size,
									 curchunk, totalchunks, valueid,
									 RelationGetRelationName(toastrel))));

		/*
		 * Copy the data into proper place in our result
		 */
		chcpystrt = 0;
		chcpyend = chunksize - 1;
		if (curchunk == startchunk)
			chcpystrt = sliceoffset % TOAST_MAX_CHUNK_SIZE;
		if (curchunk == endchunk)
			chcpyend = (sliceoffset + slicelength - 1) % TOAST_MAX_CHUNK_SIZE;

		memcpy(VARDATA(result) +
			   (curchunk * TOAST_MAX_CHUNK_SIZE - sliceoffset) + chcpystrt,
			   chunkdata + chcpystrt,
			   (chcpyend - chcpystrt) + 1);

		expectedchunk++;
	}

	/*
	 * Final checks that we successfully fetched the datum
	 */
	if (expectedchunk != (endchunk + 1))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg_internal("missing chunk number %d for toast value %u in %s",
								 expectedchunk, valueid,
								 RelationGetRelationName(toastrel))));

	/* End scan and close indexes. */
	systable_endscan_ordered(toastscan);
	toast_close_indexes(toastidxs, num_indexes, AccessShareLock);
}

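As you can see, the whole process boils down to walking an ordered, list-like sequence of chunks. The function only looks complicated because:

It relies on APIs that are tightly coupled to the relational machinery.
PostgreSQL supports multiple table implementations through the Table Access Method interface (so that several storage engines can coexist), which puts an extra abstraction layer between table and heap.
PostgreSQL's support for many data types means that values have to be passed around and unpacked as Datum.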
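Stripped of the index scan and error reporting, the core of the slice fetch is a small amount of integer arithmetic. The sketch below replays that arithmetic on an in-memory array of chunks. It is a simplified stand-in, not the PostgreSQL function: `TOY_MAX_CHUNK_SIZE`, `toy_nscankeys`, and `toy_fetch_slice` are hypothetical names, the 8-byte chunk size is chosen only to keep the numbers readable, and a plain array lookup replaces the ordered index scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHUNK_SIZE 8	/* hypothetical; the real constant is about 2 kB */

/* Scan keys the fetch would use: 1 = whole value, 2 = one chunk, 3 = a range. */
static int
toy_nscankeys(int attrsize, int sliceoffset, int slicelength)
{
	int			totalchunks = ((attrsize - 1) / TOY_MAX_CHUNK_SIZE) + 1;
	int			startchunk = sliceoffset / TOY_MAX_CHUNK_SIZE;
	int			endchunk = (sliceoffset + slicelength - 1) / TOY_MAX_CHUNK_SIZE;

	if (startchunk == 0 && endchunk == totalchunks - 1)
		return 1;
	return startchunk == endchunk ? 2 : 3;
}

/*
 * Copy bytes [sliceoffset, sliceoffset + slicelength) of a chunked value
 * into `result`. chunks[i] points at the data of chunk number i.
 * Returns the number of chunks visited.
 */
static int
toy_fetch_slice(const char *const *chunks, int attrsize,
				int sliceoffset, int slicelength, char *result)
{
	int			totalchunks = ((attrsize - 1) / TOY_MAX_CHUNK_SIZE) + 1;
	int			startchunk = sliceoffset / TOY_MAX_CHUNK_SIZE;
	int			endchunk = (sliceoffset + slicelength - 1) / TOY_MAX_CHUNK_SIZE;
	int			visited = 0;

	assert(sliceoffset >= 0 && slicelength > 0);
	assert(sliceoffset + slicelength <= attrsize);
	assert(endchunk < totalchunks);

	for (int curchunk = startchunk; curchunk <= endchunk; curchunk++)
	{
		/* every chunk is full-sized except possibly the last one */
		int			chunksize = curchunk < totalchunks - 1
			? TOY_MAX_CHUNK_SIZE
			: attrsize - (totalchunks - 1) * TOY_MAX_CHUNK_SIZE;
		int			chcpystrt = 0;
		int			chcpyend = chunksize - 1;
		int			dstoff;

		/* trim the first and last chunk to the requested byte range */
		if (curchunk == startchunk)
			chcpystrt = sliceoffset % TOY_MAX_CHUNK_SIZE;
		if (curchunk == endchunk)
			chcpyend = (sliceoffset + slicelength - 1) % TOY_MAX_CHUNK_SIZE;

		dstoff = curchunk * TOY_MAX_CHUNK_SIZE - sliceoffset + chcpystrt;
		memcpy(result + dstoff, chunks[curchunk] + chcpystrt,
			   (size_t) (chcpyend - chcpystrt) + 1);
		visited++;
	}
	return visited;
}
```

With 8-byte chunks and a 20-byte value, asking for 5 bytes at offset 6 touches chunks 0 and 1: the last two bytes of chunk 0 and the first three bytes of chunk 1. Because that request spans a range of chunks, it falls into the three-scan-key case.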

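Closing remarks

This article looks at the PostgreSQL TOAST mechanism from a partial rather than a comprehensive angle (hence "notes" rather than "analysis of principles"). I hope it helps readers who are trying to understand TOAST.

My thanks go to my undergraduate advisor, department head 袁国铭; to 魏波 and 王其达 of the China PostgreSQL Association; and to 张凯 and all the staff of the R&D department at the OpenAtom Foundation.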

