cluster故障转移配置

设置过程

配置.pgpass文件

此文件用于udbcluster访问unvdb数据库时使用的密码文件，通常位于~/.pgpass。格式为 hostname:port:database:username:password 示例：

#hostname:port:database:username:password
# In a standby server, a database field of replication matches streaming replication connections made to the master server.
192.168.2.151:5678:unvdb:unvdb:12345678
192.168.2.152:5678:unvdb:unvdb:12345678
192.168.2.153:5678:unvdb:unvdb:12345678

配置ssh免密访问

配置unvdb节点之间和pgpool与unvdb节点之间相互免密ssh访问，切换到普通用户udb，再执行命令ssh-keygen -t rsa生成当前用户的私钥和公钥：

ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zjyq/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/zjyq/.ssh/id_rsa
Your public key has been saved in /home/zjyq/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:OP7lX2GRDaYHQKpEJNiHO45oNNNCoGdSlSzKbepY8cA zjyq@zjyq
The key's randomart image is:
+---[RSA 3072]----+
|o .=o+o  .o.. o  |
|.oo =o.  .   + + |
|=o=. o. .   . + .|
|.XE+o. o     . . |
|..** .+ S     o  |
|.oo o. .     . . |
|+.    .   .   .  |
|..     . o   .   |
|        . ...    |
+----[SHA256]-----+

生成后的的公钥文件在~/.ssh/目录下的id_rsa.pub中，打开文件可以看到公钥的内容

分发密钥

[udb@clus-1 ~]$ ssh-copy-id 后端udb数据库用户的IP或hosts解析的主机名

配置pcp认证参数

打开pcp.conf文件，增加如下参数:
# USERID:MD5PASSWD
unvdb:25d55ad283aa400af464c76d713c07ad

其中，unvdb是unvdb数据库的账户。
25d55ad283aa400af464c76d713c07ad是unvdb账户的密码123456的MD5值，该值可以通过如下命令计算：
./pg_md5 12345678
25d55ad283aa400af464c76d713c07ad

配置unvdb数据库集群相关参数

修改如下参数

wal_level = logical                     # minimal, replica, or logical
wal_log_hints = on                      # also do full page writes of non-critical updates

配置udbcluster集群相关参数

设置集群模式为异构复制模式：
backend_clustering_mode = 'heter_replication'
设置socket id目录，设置到udbcluster用户有读写权限的目录
unix_socket_directories = '/home/unvdb/udbcluster'
pcp_socket_dir = '/home/unvdb/udbcluster'
设置集群中unvdb数据库节点的参数，根据实际情况配置IP地址、端口、选择权重，数据目录、是否允许故障转移、应用名称、异构节点数据库实例名称，异构数据库访问用户名和密码等信息。节点编号从0开始，新增加一个节点，参数后缀的数字递增。
注意：如果节点不是异构节点，则需要配置heter_dbs参数设置成空值''。
#节点0是流复制节点
backend_hostname0 = '192.168.2.151'
                                   # Host name or IP address to connect to for backend 0
backend_port0 = 5678
                                   # Port number for backend 0
backend_weight0 = 1
                                   # Weight for backend 0 (only in load balancing mode)
backend_data_directory0 = '/home/unvdb/unvdb-data'
                                   # Data directory for backend 0
backend_flag0 = 'ALLOW_TO_FAILOVER'
                                   # Controls various backend behavior
                                   # ALLOW_TO_FAILOVER, DISALLOW_TO_FAILOVER
                                   # or ALWAYS_PRIMARY
backend_application_name0 = '151'
                                   # walsender's application_name, used for "show unvdbcluster_nodes" command
heter_dbs0 = ''
                                   # Automatically detect heterogeneous tables within these databases
                                   # with multiple dbs separated by ','       
heter_user0 = 'unvdb'
                                   # heterogeneous tables detection user
heter_password0 = '12345678'
                                   # password for heterogeneous tables detection user                                
#节点1是异构节点
backend_hostname1 = '192.168.2.152'
backend_port1 = 5678
backend_weight1 = 1
backend_data_directory1 = '/home/unvdb/unvdb-data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = '152'
heter_dbs1 = 'udbench'
heter_user1 = 'unvdb'
heter_password1 = '12345678'
#节点2是流复制节点
backend_hostname2 = '192.168.2.153'
backend_port2 = 5678
backend_weight2 = 1
backend_data_directory2 = '/home/unvdb/unvdb-data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
backend_application_name2 = '153'
heter_dbs2 = ''
heter_user2 = 'unvdb'
heter_password2 = '12345678'
sr_check_user = 'unvdb'
sr_check_password = '12345678'
health_check_user = 'unvdb'
health_check_password = '12345678'
#设置故障转移命令脚本，主节点发生故障时，调用此脚本切换到其他从节点，从节点提升为主节点。
failover_command = '/home/unvdb/udbcluster/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
#故障恢复命令脚本，当故障节点修复后，手动加入集群时，触发执行此脚本，新加入的节点跟随主节点并成为从节点。
failback_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
#主节点故障，切换到某个从节点时，自动触发此脚本，使其他从节点跟随新的主节点。
follow_primary_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'

启动udbcluster服务

进入udbcluser/bin目录，执行如下命令启动：
./pgpool -n

查看udbcluster集群状态

执行ud_sql -p9999，连接udbcluster：
ud_sql -p9999
ud_sql (22.4)
Type "help" for help.
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_sta
tus_change  | heter_tables_cnt 
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------
------------+------------------
 0       | 192.168.2.151 | 5678 | down   | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-
16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
(3 rows)
unvdb=# 
stauts状态表示数据库节点在udbcluster集群中的状态，up表示在集群中，down表示已经被踢出集群。waiting状态表示，数据库节点重新加入到集群中，因为当前有应用通过udbcluster连接了数据库节点，可以断开再连接的步骤来使节点恢复成up状态。
pg_status状态表示数据库节点自身的状态，up表示节点正常运行，down表示节点已经停止。
role角色包括三种：primary主节点、standby从节点和heter standby异构节点。

模拟故障转移

可以用命令ud_ctl -D /home/unvdb/unvdb-data stop来停止主节点，触发故障转移，主节点将转移到可用的从节点。故障转移后，查看集群节点信息

unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_sta
tus_change  | heter_tables_cnt 
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------
------------+------------------
 0       | 192.168.2.151 | 5678 | down   | down      | 0.333333  | standby       | unknown | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-
16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
(3 rows)
unvdb=# 

节点0 的状态已经变为down状态，已经被踢出集群。primary也由节点0，转移到了节点2。

故障节点恢复并加回集群

下面的操作，重新把节点0加回到集群中：启动节点0 ud_ctl -D /home/unvdb/unvdb-data start starting UnvDB as a demo version waiting for server to start…. done server started

查看集群节点状态
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_sta
tus_change  | heter_tables_cnt 
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------
------------+------------------
 0       | 192.168.2.151 | 5678 | down   | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-
16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-
16 20:51:29 | 0
(3 rows)
unvdb=# 
上面命令输出显示，节点0已经处于up状态，但是在集群中的状态仍然为down，即并没有自动加入到集群中。而是需要手动执行如下命令来加入集群：
[unvdb@localhost ~]$ pcp_attach_node -h localhost -U unvdb -n 0 -w
pcp_attach_node -- Command Successful
再查看集群中节点状态：
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_st
atus_change  | heter_tables_cnt 
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+--------
-------------+------------------
 0       | 192.168.2.151 | 5678 | waiting | up        | 0.333333  | standby       | standby | 0          | false             | 0                 |                   |                        | 2023-06
-20 08:58:39 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 0                 | streaming         | async                  | 2023-06
-20 08:57:20 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06
-20 08:57:20 | 0
(3 rows)
unvdb=#