Cluster Testing
Load-Balancing Test
Test script
This script opens 100 connections; each connection inserts 100 rows, recording the IP of the server that handled the read (select_ip) and the current primary's IP (server_ip) for every row. The goal is to measure how a failover affects both in-flight and new connections.
cat test.sh
#!/bin/bash
# Clear out old data first
ud_sql -U unvdb -p 9999 -h 192.168.2.98 test << EOF
TRUNCATE tb1 RESTART IDENTITY;
EOF
con=1
while [ $con -le 100 ]
do
echo $con
ud_sql -U unvdb -p 9999 -h 192.168.2.98 test << EOF
$(for ins in {1..100}; do echo -e "select inet_server_addr() as a \gset \n INSERT INTO tb1(time,con_id,ins_id,select_ip,server_ip) values(now(),$con,$ins,:'a',inet_server_addr());"; done)
EOF
#sleep 1;
let con++
done
sh test.sh
1
INSERT 0 1
INSERT 0 1
...
Checking replication lag
For example, the output below shows a replication lag of 288 bytes.
unvdb=# \x
Expanded display is on.
unvdb=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
delay_wal_size   | 288 bytes
pid              | 1842
usesysid         | 16386
usename          | rep
application_name | walreceiver
client_addr      | 192.168.2.82
client_hostname  |
client_port      | 59660
backend_start    | 2022-08-22 15:36:47.38258+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/998AE70
write_lsn        | 0/998AD50
flush_lsn        | 0/998AD50
replay_lsn       | 0/998AD50
write_lag        | 00:00:00.000187
flush_lag        | 00:00:00.000554
replay_lag       | 00:00:00.000554
sync_priority    | 0
sync_state       | async
reply_time       | 2022-08-22 17:01:06.779248+08
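The 288-byte figure can be reproduced by hand from the two LSNs in the output (sent 0/998AE70 vs. written 0/998AD50). A minimal shell sketch of the same arithmetic that pg_wal_lsn_diff() performs (the helper name lsn_to_bytes is ours):

```shell
#!/bin/bash
# Convert an LSN of the form HI/LO into an absolute byte position.
# Each unit of the HI half is 2^32 bytes, matching pg_wal_lsn_diff().
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

flush_lsn=0/998AE70   # pg_current_wal_flush_lsn() from the output above
write_lsn=0/998AD50   # write_lsn reported by pg_stat_replication
echo "$(( $(lsn_to_bytes $flush_lsn) - $(lsn_to_bytes $write_lsn) )) bytes"   # 288 bytes
```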
Checking the results
10,000 rows were written in total; 5097 reads were served by 192.168.2.81 and 4903 by 192.168.2.82.
test=# select count(*) from tb1;
count
-------
10000
(1 row)
test=# select count(*) from tb1 where select_ip='192.168.2.81';
count
-------
5097
(1 row)
test=# select count(*) from tb1 where select_ip='192.168.2.82';
count
-------
4903
(1 row)
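The counts above amount to a near-even split, consistent with the 0.500000 lb_weight assigned to each node. A quick sketch of the percentage arithmetic, using the counts from the two queries above:

```shell
#!/bin/bash
# Turn the per-node read counts into percentages of all reads.
awk 'BEGIN {
  n81 = 5097; n82 = 4903              # counts from the two queries above
  total = n81 + n82
  printf "%.2f%% / %.2f%%\n", n81 * 100 / total, n82 * 100 / total
}'
```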
Failover Simulation Test
Check the cluster information
The leader node is 192.168.2.92.
[root@clus-92 udb-clus]# /data/soft/unvdbcluster/bin/pcp_watchdog_info -U admin -h 192.168.2.98 -p 9898
Password:
2 2 YES 192.168.2.92:9999 Linux clus-92 192.168.2.92
192.168.2.92:9999 Linux clus-92 192.168.2.92 9999 9000 4 LEADER 0 MEMBER
192.168.2.91:9999 Linux clus-91 192.168.2.91 9999 9000 7 STANDBY 0 MEMBER
[root@clus-92 udb-clus]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 192.168.2.92/24 brd 192.168.2.255 scope global eth0
inet 192.168.2.98/24 scope global secondary eth0:0
Check the current primary/standby status
The primary node is 81 and the standby is 82.
unvdb=# show cluster_nodes;
 node_id | hostname     | port | status | ud_status | lb_weight | role    | ud_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.81 | 5678 | up     | up        | 0.500000  | primary | primary | 0          | false             | 0                 |                   |                        | 2022-08-22 17:36:38
 1       | 192.168.2.82 | 5678 | up     | up        | 0.500000  | standby | standby | 0          | true              | 0                 |                   |                        | 2022-08-22 17:36:38
(2 rows)
Test script
Run the same test.sh as in the load-balancing test above: 100 connections, each inserting 100 rows and recording the read server IP and the current primary IP, to measure the impact of failover on in-flight and new connections.
sh test.sh
1
INSERT 0 1
INSERT 0 1
...
Simulation: the cluster leader node 92 goes down. Does the high-availability IP migrate, and how large is the disruption?
Conclusion: 18 insert operations were lost, all from a single in-flight connection (connection 10 in the checks below). Business impact lasted 15 seconds.
sudo systemctl stop unvdbcluster
# Stop the service on the active node: connections fail, but recover after a few seconds.
...
INSERT 0 1
INSERT 0 1
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
3
ud_sql: error: connection to server at "10.252.9.22", port 9999 failed: No route to host
Is the server running on that host and accepting TCP/IP connections?
4
ud_sql: error: connection to server at "10.252.9.22", port 9999 failed: No route to host
Is the server running on that host and accepting TCP/IP connections?
5
ud_sql: error: connection to server at "10.252.9.22", port 9999 failed: No route to host
Is the server running on that host and accepting TCP/IP connections?
...
The check finds only 9982 rows, so 18 rows were lost.
test=# select count(*) from tb1;
count
-------
9982
(1 row)
Connection 10 has only 82 rows; its remaining 18 inserts were lost.
test=# select con_id,count(*) a from tb1 group by con_id order by a asc;
con_id | a
--------+-----
10 | 82
All 100 connections completed successfully.
test=# select con_id from tb1 group by con_id order by con_id;
con_id
--------
1
...
99
100
(100 rows)
Business interruption: 17:49:04 - 17:48:49 = 15 seconds.
test=# select * from tb1 where con_id in (10,11);
 981 | 2022-08-22 17:48:49.5052+08   | 10 | 81 | 192.168.2.82 | 192.168.2.81/32
 982 | 2022-08-22 17:48:49.515651+08 | 10 | 82 | 192.168.2.82 | 192.168.2.81/32
 983 | 2022-08-22 17:49:04.58487+08  | 11 |  1 | 192.168.2.81 | 192.168.2.81/32
 984 | 2022-08-22 17:49:04.586657+08 | 11 |  2 | 192.168.2.82 | 192.168.2.81/32
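The 15-second figure comes straight from those timestamps: the last row of connection 10 landed at 17:48:49 and the first row of connection 11 at 17:49:04. A quick sketch of the arithmetic, assuming GNU `date -d` is available:

```shell
#!/bin/bash
# Interruption window = first write after failover - last write before it.
last_before=$(date -d "2022-08-22 17:48:49" +%s)
first_after=$(date -d "2022-08-22 17:49:04" +%s)
echo "$(( first_after - last_before )) seconds"   # 15 seconds
```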
Simulation: the udb primary node 82 goes down. How much are reads and writes affected?
Conclusion: 3069 insert operations were lost in total, including connections 9 through 38 and 13 inserts from connection 6.
Business impact lasted under 1 second.
sudo systemctl stop unvdb
At this point the primary has automatically failed over to 81.
unvdb=# show cluster_nodes;
 node_id | hostname     | port | status | ud_status | lb_weight | role    | ud_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.81 | 5678 | up     | up        | 0.500000  | primary | primary | 6979       | true              | 0                 |                   |                        | 2022-08-22 18:11:24
 1       | 192.168.2.82 | 5678 | down   | down      | 0.500000  | standby | unknown | 793        | false             | 0                 |                   |                        | 2022-08-22 18:11:24
(2 rows)
The check finds only 6931 rows, so 3069 rows were lost.
test=# select count(*) from tb1;
count
-------
6931
(1 row)
Everything from insert 94 of connection 8 up to connection 39 was lost, yet the fault window was under one second.
test=# select * from tb1;
 705 | 2022-08-22 18:11:24.732418+08 | 8  | 5  | 192.168.2.82 | 192.168.2.82/32
 706 | 2022-08-22 18:11:24.733377+08 | 8  | 6  | 192.168.2.82 | 192.168.2.82/32
 ... (intermediate rows lost)
 844 | 2022-08-22 18:11:25.078536+08 | 39 | 85 | 192.168.2.81 | 192.168.2.81/32
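To pinpoint exactly which writes disappeared, one option is to dump the (con_id, ins_id) pairs and count rows per connection offline. A hypothetical helper sketch (the report_incomplete name and the dump file are ours; ud_sql is assumed to accept psql-compatible -A/-t/-F flags):

```shell
#!/bin/bash
# Print every connection that kept fewer than the expected 100 inserts,
# together with how many rows it actually kept.
# Input file: one "con_id ins_id" pair per line.
report_incomplete() {
  awk '{ seen[$1]++ } END { for (c in seen) if (seen[c] < 100) print c, seen[c] }' "$1" | sort -n
}

# Example of producing the input (assumes psql-compatible flags):
#   ud_sql -U unvdb -p 9999 -h 192.168.2.98 -At -F' ' \
#     -c "select con_id, ins_id from tb1" test > pairs.txt
#   report_incomplete pairs.txt
```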
Repairing Data After a Primary/Standby Switchover
Background: after the original primary goes down, how do you quickly restore it as a standby? Note: back up your data before performing these operations.
Simulated procedure
unvdb=# show cluster_nodes;
 node_id | hostname     | port | status | ud_status | lb_weight | role    | ud_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.81 | 5678 | up     | up        | 0.500000  | primary | primary | 6986       | true              | 0                 |                   |                        | 2022-08-22 18:27:34
 1       | 192.168.2.82 | 5678 | up     | up        | 0.500000  | standby | standby | 793        | false             | 2072616           |                   |                        | 2022-08-22 18:33:53
(2 rows)
The primary is currently 81.
After 81 is shut down, the primary automatically switches to 82.
unvdb=# show cluster_nodes;
 node_id | hostname     | port | status | ud_status | lb_weight | role    | ud_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 192.168.2.81 | 5678 | down   | down      | 0.500000  | standby | unknown | 6986       | false             | 0                 |                   |                        | 2022-08-22 18:34:47
 1       | 192.168.2.82 | 5678 | up     | up        | 0.500000  | primary | primary | 793        | true              | 0                 |                   |                        | 2022-08-22 18:34:47
(2 rows)
Create a test database on node 82
unvdb=# create database d82_1;
Next, repair node 81 and reconfigure it as a standby.
ud_rewind method
[root@udb-81 tmp]# su udb
[udb@udb-81 tmp]$ ud_rewind --target-uddata=/data/udb --source-server='host=192.168.2.82 port=5678 user=unvdb dbname=unvdb password=<database admin password>'
ud_rewind: servers diverged at WAL location 0/11000000 on timeline 4
ud_rewind: rewinding from last common checkpoint at 0/10000140 on timeline 4
ud_rewind: Done!
echo "standby_mode = on" > /data/udb/standby.signal
vi unvdbsvr.conf
Update the connection information to point at the new primary
systemctl start unvdb
rsync file synchronization method
systemctl stop unvdb
rsync -avzP 192.168.2.82:/data/udb/ /data/udb/
echo "standby_mode = on" > /data/udb/standby.signal
vi unvdbsvr.conf
Update the connection information to point at the new primary
systemctl start unvdb
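The "update the connection information" step means pointing the rebuilt node at the new primary. A sketch of generating that configuration line (the primary_conninfo parameter name is an assumption based on PostgreSQL-style configuration, and make_conninfo is our own helper; host, port, and replication user are the values used elsewhere in this chapter):

```shell
#!/bin/bash
# Build the replication connection setting for the rebuilt standby.
# primary_conninfo is assumed; adjust to whatever unvdbsvr.conf expects.
make_conninfo() {
  printf "primary_conninfo = 'host=%s port=%s user=%s'\n" "$1" "$2" "$3"
}

make_conninfo 192.168.2.82 5678 rep
# Append the output to unvdbsvr.conf on node 81, then: systemctl start unvdb
```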
basebackup full-backup method
Use ud_basebackup; refer to the udb standby configuration section in this chapter.
Re-attaching the node
pcp_attach_node -d -U admin -p 9898 -h 192.168.2.98 -n <node id>