PostgreSQL Source Code (122): How LISTEN / NOTIFY Interacts with Transactions

Preface

NOTIFY and LISTEN are PostgreSQL's facility for asynchronous messaging between sessions. For example:

LISTEN virtual;
NOTIFY virtual;
Asynchronous notification "virtual" received from server process with PID 8448.
NOTIFY virtual, 'This is the payload';
Asynchronous notification "virtual" with payload "This is the payload" received from server process with PID 8448.

LISTEN foo;
SELECT pg_notify('fo' || 'o', 'pay' || 'load');
Asynchronous notification "foo" with payload "payload" received from server process with PID 14728.

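The feature is assembled from existing PostgreSQL infrastructure: an SLRU-backed message queue (the pg_notify/ directory, described in the last section) plus the inter-process signal mechanism.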

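Listening and notifying also respect transaction semantics: rolling back a transaction discards the LISTEN registrations it made, and notifications are sent only when the transaction commits.

This article analyzes how the asynchronous message queue is tied to transactions.

Triggered at transaction commit

A NOTIFY takes effect only once its transaction commits: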

postgres=# listen a1;
LISTEN
postgres=# begin;
BEGIN
postgres=*# notify a1;
NOTIFY
postgres=*# notify a1;
NOTIFY
postgres=*# commit;
COMMIT
Asynchronous notification "a1" received from server process with PID 17111.

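The flow is fairly simple: first apply the LISTEN / UNLISTEN actions recorded in pendingActions, then send signals to trigger asynchronous delivery of the notifications. (The two identical NOTIFY commands above produce a single notification because duplicates within one transaction are collapsed.)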

void
AtCommit_Notify(void)
{
	...
    
	if (pendingActions != NULL)
	{
		foreach(p, pendingActions->actions)
		{
			ListenAction *actrec = (ListenAction *) lfirst(p);

			switch (actrec->action)
			{
				case LISTEN_LISTEN:
					Exec_ListenCommit(actrec->channel);
					break;
				case LISTEN_UNLISTEN:
					Exec_UnlistenCommit(actrec->channel);
					break;
				case LISTEN_UNLISTEN_ALL:
					Exec_UnlistenAllCommit();
					break;
			}
		}
	}

	...


	if (pendingNotifies != NULL)
		SignalBackends();
	
  ...
}

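Cleanup at transaction rollback

After a rollback, both the pending LISTEN registrations and the pending notifications are discarded: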

postgres=# begin;
BEGIN
postgres=*# listen k123;
LISTEN
postgres=*# notify k123;
NOTIFY
postgres=*# abort;
ROLLBACK
postgres=# notify k123;
NOTIFY
postgres=#

The cleanup is performed when the transaction rolls back:

void
AtAbort_Notify(void)
{
	if (amRegisteredListener && listenChannels == NIL)
		asyncQueueUnregister();

	pendingActions = NULL;
	pendingNotifies = NULL;
}

Everything that was pending is discarded.
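
The commit and rollback behavior described so far can be mimicked with a toy model. This is a minimal sketch, not PostgreSQL code: ToySession and the toy_* functions are hypothetical names, a single session acts as both sender and listener on one channel, and delivery is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one backend; not a PostgreSQL structure. */
typedef struct ToySession
{
	bool		listening;			/* committed state, cf. listenChannels */
	bool		pending_listen;		/* queued LISTEN, cf. pendingActions */
	int			pending_notifies;	/* queued NOTIFYs, cf. pendingNotifies */
	int			delivered;			/* notifications delivered so far */
} ToySession;

/* LISTEN inside a transaction only queues the action. */
void
toy_listen(ToySession *s)
{
	s->pending_listen = true;
}

/* NOTIFY inside a transaction only queues the event. */
void
toy_notify(ToySession *s)
{
	s->pending_notifies++;
}

/* Commit: apply the queued LISTEN first, then deliver queued events. */
void
toy_commit(ToySession *s)
{
	assert(s->pending_notifies >= 0);

	if (s->pending_listen)
		s->listening = true;

	/* Identical events collapse into one; deliver only if listening. */
	if (s->pending_notifies > 0 && s->listening)
		s->delivered++;

	s->pending_listen = false;
	s->pending_notifies = 0;
}

/* Abort: drop everything that was queued and deliver nothing. */
void
toy_abort(ToySession *s)
{
	s->pending_listen = false;
	s->pending_notifies = 0;
}
```

Replaying the two sessions above against this model gives the same outcome: a committed LISTEN followed by a transaction with two NOTIFYs delivers one notification, while a LISTEN and NOTIFY that are rolled back leave the session not listening, so a later NOTIFY delivers nothing.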

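Subtransaction commit does not trigger notification; pending items are handed to the parent level

A committed subtransaction hands its notifications over to its parent transaction.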

postgres=# listen k000;
LISTEN
postgres=# begin;
BEGIN
postgres=*# savepoint sp1;
SAVEPOINT
postgres=*# savepoint sp2;
SAVEPOINT
postgres=*# notify k000;
NOTIFY
postgres=*# release sp2;
RELEASE
postgres=*# commit;
COMMIT
Asynchronous notification "k000" received from server process with PID 18902.

Implementation:

void
AtSubCommit_Notify(void)
{
	int			my_level = GetCurrentTransactionNestLevel();

	if (pendingActions != NULL &&
		pendingActions->nestingLevel >= my_level)
	{
		if (pendingActions->upper == NULL ||
			pendingActions->upper->nestingLevel < my_level - 1)
		{
			/* no list at the parent level: hand this one over as-is */
			--pendingActions->nestingLevel;
		}
		else
		{
			ActionList *childPendingActions = pendingActions;

			pendingActions = pendingActions->upper;

			pendingActions->actions =
				list_concat(pendingActions->actions,
							childPendingActions->actions);
			pfree(childPendingActions);
		}
	}


	if (pendingNotifies != NULL &&
		pendingNotifies->nestingLevel >= my_level)
	{
		Assert(pendingNotifies->nestingLevel == my_level);

		if (pendingNotifies->upper == NULL ||
			pendingNotifies->upper->nestingLevel < my_level - 1)
		{
			/* no list at the parent level: hand this one over as-is */
			--pendingNotifies->nestingLevel;
		}
		else
		{
			NotificationList *childPendingNotifies = pendingNotifies;
			ListCell   *l;

			pendingNotifies = pendingNotifies->upper;

			foreach(l, childPendingNotifies->events)
			{
				Notification *childn = (Notification *) lfirst(l);

				if (!AsyncExistsPendingNotify(childn))
					AddEventToPendingNotifies(childn);
			}
			pfree(childPendingNotifies);
		}
	}
}

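pendingActions: holds channel information (used by the LISTEN command; populated in Async_Listen).
pendingNotifies: holds channel and payload information (used by the NOTIFY command; populated in Async_Notify).

When a subtransaction commits, its notifications are not actually sent. As with other resources, the nestingLevel they are bound to is moved up to the parent level. (Note that the binding is to nestingLevel rather than to an xid, which is the more sensible choice.)

There are two cases overall:

Case 1: there is a gap between subtransaction levels, so the branch pendingNotifies->upper->nestingLevel < my_level - 1 is taken (pendingActions applies the same test). In the session below, the list created under sp3 (nesting level 4) sits directly above the list created under sp1 (level 2); no list exists at level 3, so releasing sp3 only decrements the level of the sp3 list to 3.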

begin;
savepoint sp1;
notify ch123;
savepoint sp2;
savepoint sp3;
notify ch789;
release sp3;

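Case 2: there is no gap between subtransaction levels, so the else branch is taken. In the session below, sp2 (level 3) already owns a list, so releasing sp3 (level 4) merges the sp3 events into that list.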

begin;
savepoint sp1;
notify ch123;
savepoint sp2;
notify ch456;
savepoint sp3;
notify ch789;
release sp3;

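pendingActions and pendingNotifies each form a chain through their upper pointers. At subtransaction commit, both structures move their contents to the parent level. The difference is that pendingActions simply appends its entries to the parent's actions list with list_concat, whereas pendingNotifies adds each event through AddEventToPendingNotifies and skips events that AsyncExistsPendingNotify reports as already pending.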
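
The two cases can be reproduced with a stripped-down version of the chain. This is a sketch under simplifying assumptions, not the PostgreSQL implementation: ToyList, toy_push, and toy_subcommit are hypothetical names, events are reduced to a count, and the duplicate check done by the real pendingNotifies merge is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical simplified stand-in for ActionList / NotificationList. */
typedef struct ToyList
{
	int				nestingLevel;	/* transaction nest level owning the list */
	int				nevents;		/* number of queued events */
	struct ToyList *upper;			/* list owned by an outer level, if any */
} ToyList;

/* Push a new list for the given nest level on top of the chain. */
ToyList *
toy_push(ToyList *top, int level, int nevents)
{
	ToyList    *l = malloc(sizeof(ToyList));

	if (l == NULL)
		abort();
	l->nestingLevel = level;
	l->nevents = nevents;
	l->upper = top;
	return l;
}

/* Subtransaction commit at my_level; returns the new top of the chain. */
ToyList *
toy_subcommit(ToyList *top, int my_level)
{
	if (top == NULL || top->nestingLevel < my_level)
		return top;				/* nothing queued at this level */

	assert(top->nestingLevel == my_level);

	if (top->upper == NULL || top->upper->nestingLevel < my_level - 1)
	{
		/* Case 1: the parent level has no list, so relabel ours. */
		top->nestingLevel--;
		return top;
	}
	else
	{
		/* Case 2: the parent level has its own list, so merge into it. */
		ToyList    *parent = top->upper;

		parent->nevents += top->nevents;
		free(top);
		return parent;
	}
}
```

With lists at levels 2 and 4 (Case 1), toy_subcommit(top, 4) leaves two lists and relabels the top one to level 3. With lists at levels 2, 3, and 4 (Case 2), the same call folds the level-4 events into the level-3 list.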

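Subtransaction rollback does not trigger notification; the subtransaction's pending items are discarded

A rolled-back subtransaction drops its LISTEN registrations. In the session below, LISTEN k000 is issued after sp2 and is discarded by ROLLBACK TO sp2, so only the k123 notification arrives at commit.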

postgres=# begin;
BEGIN
postgres=*# savepoint sp1;
SAVEPOINT
postgres=*# listen k123;
LISTEN
postgres=*# savepoint sp2;
SAVEPOINT
postgres=*# listen k000;
LISTEN
postgres=*# rollback to sp2;
ROLLBACK
postgres=*# notify k123;
NOTIFY
postgres=*# notify k000;
NOTIFY
postgres=*# commit;
COMMIT
Asynchronous notification "k123" received from server process with PID 18098.
postgres=#

void
AtSubAbort_Notify(void)
{
	int			my_level = GetCurrentTransactionNestLevel();
	...
	
	while (pendingActions != NULL &&
		   pendingActions->nestingLevel >= my_level)
	{
		ActionList *childPendingActions = pendingActions;

		pendingActions = pendingActions->upper;
		pfree(childPendingActions);
	}

	while (pendingNotifies != NULL &&
		   pendingNotifies->nestingLevel >= my_level)
	{
		NotificationList *childPendingNotifies = pendingNotifies;

		pendingNotifies = pendingNotifies->upper;
		pfree(childPendingNotifies);
	}
}

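On subtransaction rollback, everything owned by the subtransaction is simply freed; nothing is handed up to the parent level any more.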
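
The discard loop can be tried in isolation with a reduced chain. This is a sketch only: PendList, pend_push, and pend_subabort are hypothetical names, and each list carries nothing but its nesting level.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reduced form of the pending-list chain. */
typedef struct PendList
{
	int				 nestingLevel;	/* transaction nest level owning the list */
	struct PendList *upper;			/* list owned by an outer level, if any */
} PendList;

/* Push a new list for the given nest level on top of the chain. */
PendList *
pend_push(PendList *top, int level)
{
	PendList   *l = malloc(sizeof(PendList));

	if (l == NULL)
		abort();
	l->nestingLevel = level;
	l->upper = top;
	return l;
}

/* Subtransaction abort: free every list owned by my_level or deeper. */
PendList *
pend_subabort(PendList *top, int my_level)
{
	assert(my_level >= 1);

	while (top != NULL && top->nestingLevel >= my_level)
	{
		PendList   *child = top;

		top = top->upper;
		free(child);
	}
	return top;
}
```

For the session above, LISTEN k123 lives in a list at level 2 (under sp1) and LISTEN k000 in a list at level 3 (under sp2). Aborting level 3 frees only the k000 list, which is why only k123 is still registered at commit.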

How Listen/Notify is implemented

(This content is a summary derived from code comments.)

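There are multiple backends on the same machine, and multiple backends listen on several channels. (Channels are also called "conditions" in other parts of the code.)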
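There is one central queue in disk-based storage (directory pg_notify/), with actively used pages mapped into shared memory by the slru.c module. All notification messages are placed in the queue and later read by the listening backends. There is no central knowledge of which backend listens on which channel; every backend keeps its own list of channels it is interested in. Although there is only one queue, notifications are treated as database-local; this is achieved by including the sender's database OID in each notification message. Listening backends ignore messages that do not match their own database OID. This matters because it guarantees that senders and receivers share the same database encoding and will not misinterpret non-ASCII text in the channel name or payload string. Since notifications are not expected to survive a database crash, the pg_notify data can simply be cleared on any restart, and no WAL support or fsync is needed.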
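Every backend that listens on at least one channel registers by entering its PID into the array in AsyncQueueControl. It then scans all incoming notifications in the central queue, first comparing the notification's database OID with its own and then comparing the notified channel with its list of listened channels. On a match it passes the notification event to its frontend; non-matching events are simply skipped.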
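
The two-step filter described above (database OID first, then channel name) can be sketched as follows. This is an illustration under assumptions, not the async.c code: QueueEntry and scan_queue are hypothetical names, the queue is a plain in-memory array, and positions are array indexes rather than page/offset pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for one queued notification. */
typedef struct QueueEntry
{
	unsigned int dboid;		/* database OID of the sender */
	const char *channel;	/* channel name */
} QueueEntry;

/*
 * Scan entries in [pos, head) and count those this listener would pass on
 * to its frontend: same database, and a channel it is listening on.
 */
int
scan_queue(const QueueEntry *queue, size_t pos, size_t head,
		   unsigned int my_dboid,
		   const char *const *channels, size_t nchannels)
{
	int			delivered = 0;

	assert(pos <= head);

	for (size_t i = pos; i < head; i++)
	{
		if (queue[i].dboid != my_dboid)
			continue;			/* other database: skip */

		for (size_t c = 0; c < nchannels; c++)
		{
			if (strcmp(queue[i].channel, channels[c]) == 0)
			{
				delivered++;	/* match: would be sent to the frontend */
				break;
			}
		}
	}
	return delivered;
}
```

Because the filter runs in each listener, the sender never needs to know who listens on what; it only appends to the queue.
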
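The NOTIFY statement (routine Async_Notify) stores the notification in a backend-local list that is not processed until the end of the transaction. Duplicate notifications from the same transaction are sent only once. This saves work when, for example, a trigger on a 2-million-row table fires a notification for every changed row. If the application needs to receive every single notification that was sent, it can easily add a unique string in the extra payload parameter.

When the transaction is ready to commit, PreCommit_Notify() adds the pending notifications to the head of the queue. The head pointer of the queue always points to the next free position, and a position is just a page number plus an offset within that page. This is done before the transaction is marked as committed. If a problem occurs while writing the notifications, we can still call elog(ERROR, ...) and the transaction will roll back.

Once all notifications are in the queue, control returns to CommitTransaction(), which then performs the actual commit.

After commit we are called again (AtCommit_Notify()). Here we make any actual updates to the effective listen state (listenChannels). Then we signal the backends that may be interested in our messages (including our own backend, if it is listening). This is done by SignalBackends(), which scans the list of listening backends and sends a PROCSIG_NOTIFY_INTERRUPT signal to each of them (we do not know which backend listens on which channel, so we must signal them all). We can, however, exclude backends that are already up to date, and we can also exclude backends in other databases (unless they are far behind and should be kicked so that they advance their pointers).

Finally, after we are completely out of the transaction and about to go idle, we scan the queue for messages that need to be sent to our frontend (these may be notifications from other backends or ones we sent ourselves). This step is not part of the CommitTransaction sequence for two important reasons. First, an error could occur while sending data to the frontend, and an error during post-commit cleanup is very bad. Second, when a procedure issues multiple commits within a single frontend command, we do not want to send notifications to our frontend until the command is done; notifications to other backends, however, should go out immediately after each commit.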
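
The duplicate suppression mentioned at the start of this item can be sketched with a small fixed-size set. This is only an illustration: PendingSet, pending_exists, and pending_add are hypothetical names, the capacity of 16 is arbitrary, and a linear search stands in for whatever lookup the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING 16			/* arbitrary capacity for this sketch */

typedef struct PendingNotify
{
	const char *channel;
	const char *payload;
} PendingNotify;

typedef struct PendingSet
{
	PendingNotify items[MAX_PENDING];
	int			count;
} PendingSet;

/* True if an event with the same channel and payload is already queued. */
bool
pending_exists(const PendingSet *set, const char *channel, const char *payload)
{
	for (int i = 0; i < set->count; i++)
	{
		if (strcmp(set->items[i].channel, channel) == 0 &&
			strcmp(set->items[i].payload, payload) == 0)
			return true;
	}
	return false;
}

/* Queue an event; returns false for a duplicate or when the set is full. */
bool
pending_add(PendingSet *set, const char *channel, const char *payload)
{
	assert(set->count >= 0 && set->count <= MAX_PENDING);

	if (set->count == MAX_PENDING || pending_exists(set, channel, payload))
		return false;

	set->items[set->count].channel = channel;
	set->items[set->count].payload = payload;
	set->count++;
	return true;
}
```

Two events are duplicates only when both channel and payload match, which is why adding a unique string to the payload is enough to receive every notification.
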
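Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler sets the process's latch, which causes the event to be processed immediately if this backend is idle (that is, waiting for a frontend command and not inside a transaction block; see ProcessClientReadInterrupt()). Otherwise the handler may only set a flag, and processing happens just before the backend next goes idle. Inbound notification processing consists of reading all notifications that have arrived since the last scan. We read each notification until we reach either a notification from an uncommitted transaction or the head pointer's position.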
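To limit disk space consumption, the tail pointer needs to be advanced so that old pages can be truncated. This is relatively expensive (in particular, it requires an exclusive lock), so we do not want to do it often. It is attempted when a sending backend advances the queue head onto a new page, but only once every QUEUE_CLEANUP_DELAY pages.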
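
The throttling rule in the last sentence can be written as a small predicate. This is a sketch under assumptions: should_try_advance_tail is a hypothetical name, TOY_CLEANUP_DELAY stands in for QUEUE_CLEANUP_DELAY, and its value of 4 is chosen here for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CLEANUP_DELAY 4		/* assumed value, for illustration only */

/*
 * Decide whether a sender that has just moved the head to
 * (head_page, head_offset) should attempt to advance the tail.
 * Offset 0 means the head has just entered a new page.
 */
bool
should_try_advance_tail(int head_page, int head_offset)
{
	assert(head_page >= 0 && head_offset >= 0);

	return head_offset == 0 && head_page % TOY_CLEANUP_DELAY == 0;
}
```

With a delay of 4, entering page 4 or page 8 triggers an attempt, while entering page 5 or writing into the middle of page 4 does not.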

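An application that listens on the same channel it notifies will receive the NOTIFY messages it sent itself. If they are of no use to the application, they can be ignored by comparing the be_pid in the NOTIFY message with the PID of the application's own backend. (As of FE/BE protocol 2.0, the backend's PID is provided to the frontend during startup.) The design above ensures that ignoring self-notifications never causes notifications from other backends to be missed. The amount of shared memory used for notification management (NUM_NOTIFY_BUFFERS) can be varied as needed without affecting anything but performance. The maximum amount of notification data that can be queued at one time is determined by the max_notify_queue_pages GUC.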
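
The be_pid comparison amounts to a one-line check on the client side. The sketch below is not tied to any client library: ToyNotification and is_self_notification are hypothetical names. In a libpq client the corresponding values would come from the be_pid field of PGnotify and from PQbackendPID().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields a client sees in a notification. */
typedef struct ToyNotification
{
	const char *channel;	/* channel name */
	int			be_pid;		/* PID of the notifying backend */
} ToyNotification;

/* True if the notification was sent by the client's own backend. */
bool
is_self_notification(const ToyNotification *n, int my_backend_pid)
{
	assert(n != NULL);

	return n->be_pid == my_backend_pid;
}
```

A client whose backend PID is 8448 would drop a notification carrying be_pid 8448 and keep one carrying be_pid 14728.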

高铭杰