Cluster Failover Configuration

Setup procedure

Configure the .pgpass file

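This is the password file that udbcluster uses when it connects to the unvdb database; it is normally located at ~/.pgpass. Each line has the format hostname:port:database:username:password. Example: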

#hostname:port:database:username:password
# In a standby server, a database field of replication matches streaming replication connections made to the master server.
192.168.2.151:5678:unvdb:unvdb:12345678
192.168.2.152:5678:unvdb:unvdb:12345678
192.168.2.153:5678:unvdb:unvdb:12345678
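
PostgreSQL's libpq ignores a password file that is readable by group or others. Assuming the unvdb client library behaves the same way (this is not stated above), the file should be restricted to its owner. A minimal sketch that writes the three example entries and sets mode 0600; the function name write_pgpass is made up for illustration:

```shell
# Sketch: write a .pgpass-style file and restrict it to the owner.
# Port, database, user, and password are the example values used above.
write_pgpass() (
    # $1 = target file, remaining arguments = node addresses
    file="$1"; shift
    umask 077    # runs in a subshell, so the caller's umask is untouched
    {
        echo "#hostname:port:database:username:password"
        for host in "$@"; do
            echo "${host}:5678:unvdb:unvdb:12345678"
        done
    } > "$file"
    chmod 600 "$file"
)

# Demo against a scratch directory; point it at "$HOME/.pgpass" for real use.
demo_dir=$(mktemp -d)
write_pgpass "$demo_dir/.pgpass" 192.168.2.151 192.168.2.152 192.168.2.153
cat "$demo_dir/.pgpass"
```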

Configure passwordless SSH access

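Set up passwordless SSH in both directions between the unvdb nodes, and between the pgpool host and the unvdb nodes. Switch to the regular user udb, then run ssh-keygen -t rsa to generate a private and public key pair for the current user: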

ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zjyq/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/zjyq/.ssh/id_rsa
Your public key has been saved in /home/zjyq/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:OP7lX2GRDaYHQKpEJNiHO45oNNNCoGdSlSzKbepY8cA zjyq@zjyq
The key's randomart image is:
+---[RSA 3072]----+
|o .=o+o  .o.. o  |
|.oo =o.  .   + + |
|=o=. o. .   . + .|
|.XE+o. o     . . |
|..** .+ S     o  |
|.oo o. .     . . |
|+.    .   .   .  |
|..     . o   .   |
|        . ...    |
+----[SHA256]-----+

The generated public key is stored in id_rsa.pub under ~/.ssh/; open the file to view the key.

Distribute the key

[udb@clus-1 ~]$ ssh-copy-id <IP address, or hostname resolvable through hosts, of the backend udb database node>
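
With several nodes, the key distribution can be looped. The sketch below assumes the three example node addresses and the udb user; the helper names run and distribute_key are illustrative. It defaults to a dry run that only prints the commands, and the BatchMode check afterwards confirms that login works without a password prompt:

```shell
# Sketch: copy the current user's public key to every node, then confirm
# that a non-interactive login works. NODES is an illustrative host list;
# DRY_RUN=1 (the default here) only prints the commands.
NODES="192.168.2.151 192.168.2.152 192.168.2.153"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute_key() {
    for host in $NODES; do
        run ssh-copy-id "udb@${host}"
        # BatchMode=yes makes ssh fail instead of prompting for a password
        run ssh -o BatchMode=yes "udb@${host}" true
    done
}

distribute_key
```

Set DRY_RUN=0 to execute the commands; run the loop on every host that needs outgoing access.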

Configure pcp authentication

Open the pcp.conf file and add the following entry:
# USERID:MD5PASSWD
unvdb:25d55ad283aa400af464c76d713c07ad

Here, unvdb is the unvdb database account.
25d55ad283aa400af464c76d713c07ad is the MD5 hash of that account's password, 12345678. It can be computed with the following command:
./pg_md5 12345678
25d55ad283aa400af464c76d713c07ad
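
The value above is the plain MD5 of the password string, so it can also be produced with md5sum where pg_md5 is not at hand. A small sketch; the function name pcp_entry is illustrative:

```shell
# Sketch: build a pcp.conf entry (USERID:MD5PASSWD) with md5sum.
pcp_entry() {
    # $1 = user name, $2 = plain-text password
    # printf avoids the trailing newline that echo would add to the hashed input
    hash=$(printf '%s' "$2" | md5sum | cut -d' ' -f1)
    echo "$1:${hash}"
}

pcp_entry unvdb 12345678
# prints: unvdb:25d55ad283aa400af464c76d713c07ad
```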

Configure the unvdb database cluster parameters

Modify the following parameters:

wal_level = logical                     # minimal, replica, or logical
wal_log_hints = on                      # also do full page writes of non-critical updates

Configure the udbcluster parameters

Set the clustering mode to heterogeneous replication:
backend_clustering_mode = 'heter_replication'
Set the socket directories to a directory that the udbcluster user can read and write:
unix_socket_directories = '/home/unvdb/udbcluster'
pcp_socket_dir = '/home/unvdb/udbcluster'
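Set the parameters for each unvdb database node in the cluster according to your environment: IP address, port, load-balancing weight, data directory, whether failover is allowed, application name, the database instance names on a heterogeneous node, and the user name and password used to access the heterogeneous databases. Node numbering starts at 0; for each additional node, increment the numeric suffix of the parameter names.
Note: if a node is not a heterogeneous node, set its heter_dbs parameter to the empty value ''.
#Node 0 is a streaming replication node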
backend_hostname0 = '192.168.2.151'
                                   # Host name or IP address to connect to for backend 0
backend_port0 = 5678
                                   # Port number for backend 0
backend_weight0 = 1
                                   # Weight for backend 0 (only in load balancing mode)
backend_data_directory0 = '/home/unvdb/unvdb-data'
                                   # Data directory for backend 0
backend_flag0 = 'ALLOW_TO_FAILOVER'
                                   # Controls various backend behavior
                                   # ALLOW_TO_FAILOVER, DISALLOW_TO_FAILOVER
                                   # or ALWAYS_PRIMARY
backend_application_name0 = '151'
                                   # walsender's application_name, used for "show unvdbcluster_nodes" command
heter_dbs0 = ''
                                   # Automatically detect heterogeneous tables within these databases
                                   # with multiple dbs separated by ','       
heter_user0 = 'unvdb'
                                   # heterogeneous tables detection user
heter_password0 = '12345678'
                                   # password for heterogeneous tables detection user                                
#Node 1 is a heterogeneous node
backend_hostname1 = '192.168.2.152'
backend_port1 = 5678
backend_weight1 = 1
backend_data_directory1 = '/home/unvdb/unvdb-data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = '152'
heter_dbs1 = 'udbench'
heter_user1 = 'unvdb'
heter_password1 = '12345678'
#Node 2 is a streaming replication node
backend_hostname2 = '192.168.2.153'
backend_port2 = 5678
backend_weight2 = 1
backend_data_directory2 = '/home/unvdb/unvdb-data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
backend_application_name2 = '153'
heter_dbs2 = ''
heter_user2 = 'unvdb'
heter_password2 = '12345678'
sr_check_user = 'unvdb'
sr_check_password = '12345678'
health_check_user = 'unvdb'
health_check_password = '12345678'
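#Failover command script: when the primary node fails, this script is called to switch over to another standby node, which is promoted to primary.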
failover_command = '/home/unvdb/udbcluster/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
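#Failback command script: after a failed node has been repaired and is manually attached to the cluster, this script is triggered; the newly attached node follows the primary and becomes a standby.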
failback_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
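#When the primary fails and a standby takes over, this script is triggered automatically so that the remaining standbys follow the new primary.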
follow_primary_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
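
The contents of failover.sh are not shown here. As a rough sketch of the decision such a script has to make, the function below assumes that the placeholders keep their pgpool-II meaning (%d failed node id, %h failed node host, %m new main node id, %H new main node host, %P old primary node id, %R new main node data directory) and that ud_ctl accepts a promote action; both are assumptions to verify against the udbcluster manual. It only prints what it would do:

```shell
# Sketch of failover decision logic. ASSUMPTION: arguments arrive in the order
# %d %h %p %D %m %H %M %P %r %R %N %S with pgpool-II semantics.
failover_action() {
    failed_id="$1"; failed_host="$2"
    new_main_id="$5"; new_main_host="$6"
    old_primary_id="$8"; new_main_pgdata="${10}"

    if [ "$new_main_id" -lt 0 ]; then
        echo "skip: no node left to promote"
    elif [ "$failed_id" != "$old_primary_id" ]; then
        # a standby went down; the primary is still in place
        echo "skip: standby ${failed_host} failed, primary unchanged"
    else
        # a real script would run this over ssh on the new main host;
        # "ud_ctl ... promote" is a hypothetical command line
        echo "promote: ssh ${new_main_host} ud_ctl -D ${new_main_pgdata} promote"
    fi
}

# Node 0 (the old primary) fails and node 2 is chosen as the new main node.
failover_action 0 192.168.2.151 5678 /home/unvdb/unvdb-data \
                2 192.168.2.153 0 0 5678 /home/unvdb/unvdb-data \
                192.168.2.151 5678
```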

Start the udbcluster service

Change to the udbcluster/bin directory and run the following command:
./pgpool -n

Check the udbcluster cluster status

Run ud_sql -p9999 to connect to udbcluster:
ud_sql -p9999
ud_sql (22.4)
Type "help" for help.
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down   | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=# 
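status is the state of the database node within the udbcluster cluster: up means the node is part of the cluster, and down means it has been removed from the cluster. waiting means the node has been re-attached to the cluster but is not yet up because applications are currently connected to the database nodes through udbcluster; disconnecting and reconnecting returns the node to up.
pg_status is the state of the database node itself: up means the node is running normally, and down means the node has stopped.
role takes one of three values: primary (primary node), standby (standby node), and heter standby (heterogeneous node).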
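
For monitoring, the status column can be filtered with awk. The sketch below reads the printed table on standard input and lists every node whose cluster status is not up; the function name nodes_not_up is illustrative, and the sample rows are trimmed to the first eight columns of the output above:

```shell
# Sketch: list nodes whose cluster status is not "up".
# Reads the text printed by "show cluster_nodes" on stdin; columns are split on "|".
nodes_not_up() {
    awk -F'|' '
        # data rows start with a numeric node_id in the first column
        $1 ~ /^ *[0-9]+ *$/ {
            gsub(/ /, "", $1); gsub(/ /, "", $2); gsub(/ /, "", $4); gsub(/ /, "", $5)
            if ($4 != "up") print $1, $2, "status=" $4, "pg_status=" $5
        }'
}

nodes_not_up <<'EOF'
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role
---------+---------------+------+--------+-----------+-----------+---------------+---------
 0       | 192.168.2.151 | 5678 | down   | up        | 0.333333  | standby       | primary
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary
EOF
# prints: 0 192.168.2.151 status=down pg_status=up
```

A node reported with status=down and pg_status=up is running but still detached from the cluster.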

Simulate a failover

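Stop the primary node with ud_ctl -D /home/unvdb/unvdb-data stop to trigger a failover; the primary role moves to an available standby. After the failover, check the cluster node information: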

unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down   | down      | 0.333333  | standby       | unknown | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=# 

Node 0 is now in the down state and has been removed from the cluster. The primary role has moved from node 0 to node 2.

Recover the failed node and re-attach it to the cluster

The following steps add node 0 back to the cluster.
Start node 0:
ud_ctl -D /home/unvdb/unvdb-data start
starting UnvDB as a demo version
waiting for server to start…. done
server started

Check the cluster node status:
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+--------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down   | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up     | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up     | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=# 
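The output above shows that node 0 itself is running again (pg_status is up), but its status in the cluster is still down: it was not re-attached to the cluster automatically. Attach it manually with the following command: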
[unvdb@localhost ~]$ pcp_attach_node -h localhost -U unvdb -n 0 -w
pcp_attach_node -- Command Successful
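
The -w option tells the pcp client never to prompt for a password, so the password has to come from a file. In pgpool-II that file is ~/.pcppass, with one hostname:port:username:password entry per line, the password in plain text (unlike pcp.conf), and 9898 as the default pcp port. Whether udbcluster keeps these conventions is an assumption. A sketch with the illustrative function name write_pcppass:

```shell
# Sketch: create a pcp password file so that "-w" can authenticate without a prompt.
# ASSUMPTION (pgpool-II convention, not confirmed for udbcluster):
# file ~/.pcppass, entry format hostname:port:username:password, pcp port 9898.
write_pcppass() (
    # $1 = target file, $2 = host, $3 = pcp port, $4 = user, $5 = plain-text password
    umask 077
    echo "$2:$3:$4:$5" > "$1"
    chmod 600 "$1"
)

# Demo against a scratch directory; point it at "$HOME/.pcppass" for real use.
demo_dir=$(mktemp -d)
write_pcppass "$demo_dir/.pcppass" localhost 9898 unvdb 12345678
cat "$demo_dir/.pcppass"
# prints: localhost:9898:unvdb:12345678
```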
Check the node status again:
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | waiting | up        | 0.333333  | standby       | standby | 0          | false             | 0                 |                   |                        | 2023-06-20 08:58:39 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 0                 | streaming         | async                  | 2023-06-20 08:57:20 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-20 08:57:20 | 0
(3 rows)
unvdb=#