Cluster Failover Configuration
Setup Procedure
Configure the .pgpass file
This password file is used by udbcluster when it connects to the unvdb databases; it normally lives at ~/.pgpass. Each line has the format hostname:port:database:username:password. Example:
#hostname:port:database:username:password
# In a standby server, a database field of replication matches streaming replication connections made to the master server.
192.168.2.151:5678:unvdb:unvdb:12345678
192.168.2.152:5678:unvdb:unvdb:12345678
192.168.2.153:5678:unvdb:unvdb:12345678
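libpq-style clients silently ignore a password file whose permissions allow group or world access, so the file must be restricted to mode 0600. A minimal sketch using the entries above (the PGPASSFILE override is optional; the default location is ~/.pgpass):

```shell
# Create the password file and restrict it to the owner; files with
# looser permissions are ignored by libpq-style clients.
PGPASSFILE="${PGPASSFILE:-$HOME/.pgpass}"
cat > "$PGPASSFILE" <<'EOF'
# hostname:port:database:username:password
192.168.2.151:5678:unvdb:unvdb:12345678
192.168.2.152:5678:unvdb:unvdb:12345678
192.168.2.153:5678:unvdb:unvdb:12345678
EOF
chmod 0600 "$PGPASSFILE"
```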
Configure passwordless SSH access
Set up passwordless SSH between the unvdb nodes, and between the pgpool host and each unvdb node. Switch to the regular user udb, then run ssh-keygen -t rsa to generate a key pair for the current user:
ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zjyq/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/zjyq/.ssh/id_rsa
Your public key has been saved in /home/zjyq/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:OP7lX2GRDaYHQKpEJNiHO45oNNNCoGdSlSzKbepY8cA zjyq@zjyq
The key's randomart image is:
+---[RSA 3072]----+
|o .=o+o .o.. o |
|.oo =o. . + + |
|=o=. o. . . + .|
|.XE+o. o . . |
|..** .+ S o |
|.oo o. . . . |
|+. . . . |
|.. . o . |
| . ... |
+----[SHA256]-----+
The generated public key is written to id_rsa.pub in the ~/.ssh/ directory; open that file to view the key's contents.
Distribute the key
[udb@clus-1 ~]$ ssh-copy-id <IP address or resolvable hostname of each backend udb database host>
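With three nodes, key distribution can be scripted in one pass. A sketch (the node list and the udb user are taken from the examples above; the echo prints each command instead of running it, so drop it to actually push the key, and repeat from every host that must reach the others):

```shell
# Print the ssh-copy-id command for each backend node.
NODES="192.168.2.151 192.168.2.152 192.168.2.153"
for host in $NODES; do
    echo "ssh-copy-id udb@${host}"   # remove 'echo' to actually copy the key
done
```

After the keys are in place, `ssh udb@<host> true` should return from every node without a password prompt.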
Configure the PCP authentication parameters
Open the pcp.conf file and add the following entry:
# USERID:MD5PASSWD
unvdb:25d55ad283aa400af464c76d713c07ad
Here, unvdb is the unvdb database account.
25d55ad283aa400af464c76d713c07ad is the MD5 hash of that account's password, 12345678. The value can be computed with the following command:
./pg_md5 12345678
25d55ad283aa400af464c76d713c07ad
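pg_md5 here produces the plain MD5 digest of the password string, so the value can be cross-checked with standard tools (note echo -n, which avoids hashing a trailing newline):

```shell
# MD5 of the literal password string; must match the pg_md5 output above.
echo -n '12345678' | md5sum | awk '{print $1}'
# -> 25d55ad283aa400af464c76d713c07ad
```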
Configure the unvdb database cluster parameters
Modify the following parameters:
wal_level = logical # minimal, replica, or logical
wal_log_hints = on # also do full page writes of non-critical updates
Configure the udbcluster cluster parameters
Set the clustering mode to heterogeneous replication:
backend_clustering_mode = 'heter_replication'
Set the Unix socket directories to a location where the udbcluster user has read and write permission:
unix_socket_directories = '/home/unvdb/udbcluster'
pcp_socket_dir = '/home/unvdb/udbcluster'
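The socket directory must exist and be writable by the user running udbcluster before the service starts. A sketch using a demo path (in production this would be /home/unvdb/udbcluster, owned by the unvdb user):

```shell
# Create the socket directory and restrict it to the owning user.
SOCKDIR=/tmp/udbcluster-demo          # production: /home/unvdb/udbcluster
mkdir -p "$SOCKDIR"
chmod 700 "$SOCKDIR"
```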
Configure the parameters for each unvdb database node in the cluster: the IP address, port, load-balancing weight, data directory, whether failover is allowed, the application name, the database instance names on heterogeneous nodes, and the username and password used to access the heterogeneous databases, all according to your environment. Node numbering starts at 0; each additional node increments the numeric suffix of its parameters.
Note: if a node is not a heterogeneous node, its heter_dbs parameter must be set to the empty string ''.
# Node 0 is a streaming replication node
backend_hostname0 = '192.168.2.151'
# Host name or IP address to connect to for backend 0
backend_port0 = 5678
# Port number for backend 0
backend_weight0 = 1
# Weight for backend 0 (only in load balancing mode)
backend_data_directory0 = '/home/unvdb/unvdb-data'
# Data directory for backend 0
backend_flag0 = 'ALLOW_TO_FAILOVER'
# Controls various backend behavior
# ALLOW_TO_FAILOVER, DISALLOW_TO_FAILOVER
# or ALWAYS_PRIMARY
backend_application_name0 = '151'
# walsender's application_name, used for "show unvdbcluster_nodes" command
heter_dbs0 = ''
# Automatically detect heterogeneous tables within these databases
# with multiple dbs separated by ','
heter_user0 = 'unvdb'
# heterogeneous tables detection user
heter_password0 = '12345678'
# password for heterogeneous tables detection user
# Node 1 is a heterogeneous node
backend_hostname1 = '192.168.2.152'
backend_port1 = 5678
backend_weight1 = 1
backend_data_directory1 = '/home/unvdb/unvdb-data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = '152'
heter_dbs1 = 'udbench'
heter_user1 = 'unvdb'
heter_password1 = '12345678'
# Node 2 is a streaming replication node
backend_hostname2 = '192.168.2.153'
backend_port2 = 5678
backend_weight2 = 1
backend_data_directory2 = '/home/unvdb/unvdb-data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
backend_application_name2 = '153'
heter_dbs2 = ''
heter_user2 = 'unvdb'
heter_password2 = '12345678'
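Following the suffix rule above, adding a fourth node means appending a block with suffix 3. A sketch for a streaming replication node (the host address and application name are illustrative):

```
# Node 3 is a streaming replication node (illustrative values)
backend_hostname3 = '192.168.2.154'
backend_port3 = 5678
backend_weight3 = 1
backend_data_directory3 = '/home/unvdb/unvdb-data'
backend_flag3 = 'ALLOW_TO_FAILOVER'
backend_application_name3 = '154'
heter_dbs3 = ''
heter_user3 = 'unvdb'
heter_password3 = '12345678'
```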
sr_check_user = 'unvdb'
sr_check_password = '12345678'
health_check_user = 'unvdb'
health_check_password = '12345678'
# Failover command script. When the primary node fails, this script is invoked to switch over to another standby node, promoting it to primary.
failover_command = '/home/unvdb/udbcluster/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
# Failback command script. After a failed node has been repaired and is manually re-attached to the cluster, this script runs so that the rejoining node follows the primary and becomes a standby.
failback_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
# When the primary fails and a standby is promoted, this script is triggered automatically so that the remaining standbys follow the new primary.
follow_primary_command = '/home/unvdb/udbcluster/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
Start the udbcluster service
Change into the udbcluster/bin directory and run the following command to start it:
./pgpool -n
Check the udbcluster cluster status
Run ud_sql -p9999 to connect to udbcluster:
ud_sql -p9999
ud_sql (22.4)
Type "help" for help.
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down    | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=#
The status column shows a node's state within the udbcluster cluster: up means the node is in the cluster, and down means it has been detached from the cluster. waiting means the node has rejoined the cluster but applications were still connected to it through udbcluster at the time; disconnecting and reconnecting those sessions brings the node back to the up state.
pg_status shows the state of the database node itself: up means the node is running normally, down means it has stopped.
The role column takes one of three values: primary, standby, or heter standby (a heterogeneous node).
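When scripting health checks, these columns can be scanned mechanically. A minimal sketch that flags any node whose cluster status is not up, run here against inlined sample rows (in practice the rows would come from the show cluster_nodes output):

```shell
# Flag nodes whose 4th column (cluster status) is not 'up'.
cat > /tmp/cluster_nodes.txt <<'EOF'
 0 | 192.168.2.151 | 5678 | down | up | 0.333333 | standby       | primary
 1 | 192.168.2.152 | 5678 | up   | up | 0.333333 | heter standby | standby
 2 | 192.168.2.153 | 5678 | up   | up | 0.333333 | primary       | primary
EOF
awk -F'|' '$4 !~ /up/ { gsub(/ /, "", $1); print "node " $1 " is not up" }' /tmp/cluster_nodes.txt
# -> node 0 is not up
```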
Simulate a failover
Stop the primary node with the command ud_ctl -D /home/unvdb/unvdb-data stop to trigger a failover; the primary role moves to an available standby node. After the failover, check the cluster node information:
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down    | down      | 0.333333  | standby       | unknown | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=#
Node 0 has changed to the down state and has been detached from the cluster, and the primary role has moved from node 0 to node 2.
Recover the failed node and re-attach it to the cluster
The following steps bring node 0 back into the cluster. First, start node 0:
ud_ctl -D /home/unvdb/unvdb-data start
starting UnvDB as a demo version
waiting for server to start…. done
server started
Check the cluster node status:
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | down    | up        | 0.333333  | standby       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 10240             |                   |                        | 2023-06-16 20:49:49 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-16 20:51:29 | 0
(3 rows)
unvdb=#
The output above shows that node 0 is now up as a database instance, but its status within the cluster is still down; it does not rejoin the cluster automatically. Attach it manually with the following command:
[unvdb@localhost ~]$ pcp_attach_node -h localhost -U unvdb -n 0 -w
pcp_attach_node -- Command Successful
Check the cluster node status again:
unvdb=# show cluster_nodes;
 node_id |   hostname    | port | status  | pg_status | lb_weight |     role      | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  | heter_tables_cnt
---------+---------------+------+---------+-----------+-----------+---------------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+------------------
 0       | 192.168.2.151 | 5678 | waiting | up        | 0.333333  | standby       | standby | 0          | false             | 0                 |                   |                        | 2023-06-20 08:58:39 | 0
 1       | 192.168.2.152 | 5678 | up      | up        | 0.333333  | heter standby | standby | 0          | true              | 0                 | streaming         | async                  | 2023-06-20 08:57:20 | 2
 2       | 192.168.2.153 | 5678 | up      | up        | 0.333333  | primary       | primary | 0          | false             | 0                 |                   |                        | 2023-06-20 08:57:20 | 0
(3 rows)
unvdb=#