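Understanding Linux routing

Understanding routing rules

A host consults its routing rules to decide how to send each outgoing packet.

Routing is bidirectional: you need to understand both the route a packet takes going out and the rules that apply to the reply coming back.

Common commands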

  • route
    route -n
    route add default gw 172.16.130.1 eth2
    route del default gw 172.16.130.1 eth2
    route add -host 192.168.168.110 dev eth0
    route del -host 192.168.168.110 dev eth0
    route add -net 172.16.130.0/24 gw 172.16.130.1 eth2
    route del -net 172.16.0.0 netmask 255.255.0.0 dev eth0
    ip route flush cache
  • sysctl -a |grep forward
  • sysctl -a |grep ignore
  • echo 2 > /proc/sys/net/ipv4/conf/default/rp_filter
  • echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
  • sysctl -p
  • vim /etc/sysctl.conf
  • traceroute www.baidu.com
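On newer distributions the route command (net-tools) is often not installed and ip route (iproute2) is used instead. As a rough sketch, the route invocations listed above map to the ip route forms below. The run helper and the DRY_RUN variable are hypothetical names introduced here so that the commands are only printed unless you opt in; the addresses and interfaces are the ones from the list above.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0 (needs root).
# `run` and DRY_RUN are illustrative helpers, not part of iproute2.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run ip route show                                            # route -n
run ip route add default via 172.16.130.1 dev eth2           # route add default gw ... eth2
run ip route del default via 172.16.130.1 dev eth2           # route del default gw ... eth2
run ip route add 192.168.168.110/32 dev eth0                 # route add -host ... dev eth0
run ip route del 192.168.168.110/32 dev eth0                 # route del -host ... dev eth0
run ip route add 172.16.130.0/24 via 172.16.130.1 dev eth2   # route add -net .../24 gw ... eth2
run ip route del 172.16.0.0/16 dev eth0                      # route del -net ... netmask 255.255.0.0
```

By default the script only echoes the commands; setting DRY_RUN=0 as root would actually apply them.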

Related configuration files

[root@localhost ~]# cat /etc/iproute2/rt_tables
#
# reserved values
#
255    local
254    main
253    default
0    unspec
#
# local
#
#1    inr.ruhep
[root@localhost ~]# sysctl -a |grep ignore
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.default.arp_ignore = 0
net.ipv4.conf.lo.arp_ignore = 0
net.ipv4.conf.eth0.arp_ignore = 0
net.ipv4.conf.eth1.arp_ignore = 0
net.ipv4.icmp_echo_ignore_all = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
[root@localhost ~]# cat /etc/sysctl.conf 

Examples

Suppose a host has two NICs configured with IPs on the same subnet:

  • eth0 : 192.168.0.100
  • eth1 : 192.168.0.200

The kernel will then typically generate routing rules like these:

[root@www ~]# route -n
Kernel IP routing table
Destination     Gateway   Genmask         Flags Metric Ref   Use Iface
192.168.0.0     0.0.0.0   255.255.255.0   U     0      0       0 eth1
192.168.0.0     0.0.0.0   255.255.255.0   U     0      0       0 eth0

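In other words:

  • When the host initiates packets to the 192.168.0.0/24 network, only the first rule is used, so they all go out through eth1!
  • For reply packets, no matter whether the incoming packet arrived on eth0 or on eth1, the reply is sent back through eth1!

Source: http://linux.vbird.org/linux_server/0230router.php, [8.1.3 重複路由的問題] (the duplicate-route problem)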
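To see which of two overlapping routes the kernel actually picks, ip route get reports the chosen interface and source address for a destination. A minimal sketch follows. route_dev is a hypothetical helper that pulls the interface name out of that output, 192.168.0.50 stands for any other host on the subnet, and the sample line is illustrative rather than captured from the machine above.

```shell
#!/bin/sh
# route_dev: read `ip route get` output on stdin and print the word after "dev".
# The helper name is made up for this example.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Illustrative line in the shape `ip route get 192.168.0.50` would print.
echo "192.168.0.50 dev eth1 src 192.168.0.200" | route_dev
```

On a live system the same helper can be fed directly, e.g. ip route get 192.168.0.50 | route_dev. Adding from 192.168.0.100 to the ip route get command asks which interface would carry traffic sourced from the other address.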

Only one interface has an IP configured

[dennis@localhost ~]$ ssh root@172.16.60.53
root@172.16.60.53's password: 
Last login: Wed Mar 11 17:54:12 2015 from 172.16.50.39
[root@localhost ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.60.0     *               255.255.255.0   U     0      0        0 eth2
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth2
[root@localhost ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether a0:36:9f:32:b0:d0 brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether a0:36:9f:32:b0:d1 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:f4:6b:60 brd ff:ff:ff:ff:ff:ff
    inet 172.16.60.53/24 brd 172.16.60.255 scope global eth2
    inet6 fe80::225:90ff:fef4:6b60/64 scope link 
       valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:f4:6b:61 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::225:90ff:fef4:6b61/64 scope link 
       valid_lft forever preferred_lft forever

A configuration that works without problems

[dennis@localhost ~]$ ssh root@172.16.130.105
root@172.16.130.105's password: 
Permission denied, please try again.
root@172.16.130.105's password: 
Last login: Wed Mar 11 16:34:38 2015 from 172.16.50.39
[root@localhost ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.130.0    *               255.255.255.0   U     0      0        0 eth0
172.16.130.0    *               255.255.255.0   U     0      0        0 eth1
default         172.16.130.1    0.0.0.0         UG    0      0        0 eth1
default         172.16.130.1    0.0.0.0         UG    0      0        0 eth0
[root@localhost ~]# ping 172.16.50.39 -I eth1
PING 172.16.50.39 (172.16.50.39) from 172.16.130.106 eth1: 56(84) bytes of data.
64 bytes from 172.16.50.39: icmp_seq=1 ttl=63 time=11.7 ms
64 bytes from 172.16.50.39: icmp_seq=2 ttl=63 time=0.265 ms
^C
--- 172.16.50.39 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1568ms
rtt min/avg/max/mdev = 0.265/6.030/11.795/5.765 ms
[root@localhost ~]# ping 172.16.50.39 -I eth0
PING 172.16.50.39 (172.16.50.39) from 172.16.130.105 eth0: 56(84) bytes of data.
64 bytes from 172.16.50.39: icmp_seq=1 ttl=63 time=0.244 ms
64 bytes from 172.16.50.39: icmp_seq=2 ttl=63 time=6.12 ms
64 bytes from 172.16.50.39: icmp_seq=3 ttl=63 time=6.50 ms
^C
--- 172.16.50.39 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2414ms
rtt min/avg/max/mdev = 0.244/4.290/6.504/2.866 ms
[root@localhost ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:25:90:de:fc:13 brd ff:ff:ff:ff:ff:ff
    inet 172.16.130.106/24 scope global eth1
    inet6 fe80::225:90ff:fede:fc13/64 scope link 
       valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:25:90:de:fc:12 brd ff:ff:ff:ff:ff:ff
    inet 172.16.130.105/24 scope global eth0
    inet6 fe80::225:90ff:fede:fc12/64 scope link 
       valid_lft forever preferred_lft forever
[root@localhost ~]# ip route show
172.16.130.0/24 dev eth0  proto kernel  scope link  src 172.16.130.105 
172.16.130.0/24 dev eth1  proto kernel  scope link  src 172.16.130.106 
default via 172.16.130.1 dev eth1 
default via 172.16.130.1 dev eth0 

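A machine with a strange problem

Problem machine: two NICs, eth0: 172.16.60.150, eth1: 172.16.60.151, gateway 172.16.60.1

Local host: IP 172.16.50.39, gateway 172.16.50.1

The local host (50.39) can log in to 60.150 over ssh normally,

but ping 172.16.60.151 gets no reply: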

[root@localhost ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:1e:67:c9:9a:f4
TYPE=Ethernet
UUID=5adb47dc-d361-43b7-a23a-f17c2ade1e2d
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=none
IPADDR=172.16.60.150
NETMASK=255.255.255.0
GATEWAY=172.16.60.1
IPV6INIT=no
USERCTL=no
[root@localhost ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=00:1e:67:c9:9a:f5
TYPE=Ethernet
UUID=2af28c1e-abe2-4805-8c78-880ffbfc567e
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
IPADDR=172.16.60.151
NETMASK=255.255.255.0
GATEWAY=172.16.60.1
[root@localhost ~]# ip addr  
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN   
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00  
  inet 127.0.0.1/8 scope host lo  
  inet6 ::1/128 scope host   
     valid_lft forever preferred_lft forever  
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000  
  link/ether 00:1e:67:c9:9a:f4 brd ff:ff:ff:ff:ff:ff  
  inet 172.16.60.150/24 brd 172.16.60.255 scope global eth0  
  inet6 fe80::21e:67ff:fec9:9af4/64 scope link   
     valid_lft forever preferred_lft forever  
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000  
  link/ether 00:1e:67:c9:9a:f5 brd ff:ff:ff:ff:ff:ff  
  inet 172.16.60.151/24 brd 172.16.60.255 scope global eth1  
  inet6 fe80::21e:67ff:fec9:9af5/64 scope link   
     valid_lft forever preferred_lft forever  
[root@localhost ~]# route  
Kernel IP routing table  
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface  
172.16.60.0     *               255.255.255.0   U     0      0        0 eth0  
172.16.60.0     *               255.255.255.0   U     0      0        0 eth1  
link-local      *               255.255.0.0     U     1002   0        0 eth0  
link-local      *               255.255.0.0     U     1003   0        0 eth1  
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth0  

From a machine on the same subnet (172.16.60.53), pinging 151 returns Destination Host Unreachable:

[dennis@localhost ~]$ ssh root@172.16.60.53
The authenticity of host '172.16.60.53 (172.16.60.53)' can't be established.
RSA key fingerprint is e6:b0:c1:60:53:cd:77:7c:32:e4:27:4f:01:43:5a:8a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.16.60.53' (RSA) to the list of known hosts.
root@172.16.60.53's password: 
Last login: Fri Dec  6 04:46:04 2013
[root@localhost ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.60.0     *               255.255.255.0   U     0      0        0 eth2
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth2
[root@localhost ~]# ping 172.16.60.151
PING 172.16.60.151 (172.16.60.151) 56(84) bytes of data.
From 172.16.60.53 icmp_seq=2 Destination Host Unreachable
From 172.16.60.53 icmp_seq=3 Destination Host Unreachable
From 172.16.60.53 icmp_seq=4 Destination Host Unreachable
^C
--- 172.16.60.151 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4988ms
pipe 3
[root@localhost ~]# ping 172.16.60.150
PING 172.16.60.150 (172.16.60.150) 56(84) bytes of data.
64 bytes from 172.16.60.150: icmp_seq=1 ttl=64 time=0.933 ms
64 bytes from 172.16.60.150: icmp_seq=2 ttl=64 time=0.235 ms
64 bytes from 172.16.60.150: icmp_seq=3 ttl=64 time=0.223 ms
^C
--- 172.16.60.150 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2045ms
rtt min/avg/max/mdev = 0.223/0.463/0.933/0.332 ms
[root@localhost ~]# 

Also, the gateway (172.16.60.1) cannot be pinged from eth1 (60.151), and a ping to another machine on the same subnet (60.53) stalls at first:

[root@localhost ~]# arp
Address                  HWtype  HWaddress           Flags Mask            Iface
172.16.60.1              ether   00:0f:e2:b1:c7:5d   C                     eth0
[root@localhost ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.60.0     *               255.255.255.0   U     0      0        0 eth0
172.16.60.0     *               255.255.255.0   U     0      0        0 eth1
link-local      *               255.255.0.0     U     1002   0        0 eth0
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth0
[root@localhost ~]# ping 172.16.60.53 -I eth1
PING 172.16.60.53 (172.16.60.53) from 172.16.60.151 eth1: 56(84) bytes of data.
64 bytes from 172.16.60.53: icmp_seq=10 ttl=64 time=1.08 ms
64 bytes from 172.16.60.53: icmp_seq=11 ttl=64 time=0.228 ms
64 bytes from 172.16.60.53: icmp_seq=12 ttl=64 time=0.230 ms
64 bytes from 172.16.60.53: icmp_seq=13 ttl=64 time=0.226 ms
64 bytes from 172.16.60.53: icmp_seq=14 ttl=64 time=0.238 ms
64 bytes from 172.16.60.53: icmp_seq=15 ttl=64 time=0.225 ms
64 bytes from 172.16.60.53: icmp_seq=16 ttl=64 time=0.227 ms
^C
--- 172.16.60.53 ping statistics ---
16 packets transmitted, 7 received, 56% packet loss, time 15646ms
rtt min/avg/max/mdev = 0.225/0.351/1.085/0.299 ms
[root@localhost ~]# ping 172.16.50.39 -I eth1
PING 172.16.50.39 (172.16.50.39) from 172.16.60.151 eth1: 56(84) bytes of data.
From 172.16.60.151 icmp_seq=2 Destination Host Unreachable
From 172.16.60.151 icmp_seq=3 Destination Host Unreachable
From 172.16.60.151 icmp_seq=4 Destination Host Unreachable
From 172.16.60.151 icmp_seq=6 Destination Host Unreachable
From 172.16.60.151 icmp_seq=7 Destination Host Unreachable
From 172.16.60.151 icmp_seq=8 Destination Host Unreachable
^C
--- 172.16.50.39 ping statistics ---
8 packets transmitted, 0 received, +6 errors, 100% packet loss, time 7646ms
pipe 3
[root@localhost ~]# ping 172.16.60.1 -I eth1
PING 172.16.60.1 (172.16.60.1) from 172.16.60.151 eth1: 56(84) bytes of data.

After running route add default gw 172.16.60.1 eth1, 172.16.60.151 answers ping, but now 150 has problems.

Routing information seen after logging in with ssh root@172.16.60.151:

[root@localhost ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.60.0     *               255.255.255.0   U     0      0        0 eth0
172.16.60.0     *               255.255.255.0   U     0      0        0 eth1
link-local      *               255.255.0.0     U     1002   0        0 eth0
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth1
default         172.16.60.1     0.0.0.0         UG    0      0        0 eth0
[root@localhost ~]# ping 172.16.60.1 -I eth0
PING 172.16.60.1 (172.16.60.1) from 172.16.60.150 eth0: 56(84) bytes of data.
64 bytes from 172.16.60.1: icmp_seq=1 ttl=255 time=8.88 ms
64 bytes from 172.16.60.1: icmp_seq=2 ttl=255 time=6.85 ms
64 bytes from 172.16.60.1: icmp_seq=3 ttl=255 time=1.74 ms
^C
--- 172.16.60.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2142ms
rtt min/avg/max/mdev = 1.747/5.830/8.885/3.003 ms
[root@localhost ~]# ping 172.16.60.1 -I eth1
PING 172.16.60.1 (172.16.60.1) from 172.16.60.151 eth1: 56(84) bytes of data.
^C
--- 172.16.60.1 ping statistics ---
17 packets transmitted, 0 received, 100% packet loss, time 16976ms

[root@localhost ~]# ping 172.16.50.39 -I eth0
PING 172.16.50.39 (172.16.50.39) from 172.16.60.150 eth0: 56(84) bytes of data.
^C
--- 172.16.50.39 ping statistics ---
13 packets transmitted, 0 received, 100% packet loss, time 12830ms

[root@localhost ~]# ping 172.16.50.39 -I eth1
PING 172.16.50.39 (172.16.50.39) from 172.16.60.151 eth1: 56(84) bytes of data.
64 bytes from 172.16.50.39: icmp_seq=1 ttl=63 time=0.245 ms
64 bytes from 172.16.50.39: icmp_seq=2 ttl=63 time=0.296 ms
^C
--- 172.16.50.39 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1391ms
rtt min/avg/max/mdev = 0.245/0.270/0.296/0.030 ms
[root@localhost ~]# ping 172.16.60.53 -I eth0
PING 172.16.60.53 (172.16.60.53) from 172.16.60.150 eth0: 56(84) bytes of data.
64 bytes from 172.16.60.53: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 172.16.60.53: icmp_seq=2 ttl=64 time=0.196 ms
^C
--- 172.16.60.53 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1342ms
rtt min/avg/max/mdev = 0.196/0.224/0.252/0.028 ms
[root@localhost ~]# ping 172.16.60.53 -I eth1
PING 172.16.60.53 (172.16.60.53) from 172.16.60.151 eth1: 56(84) bytes of data.
64 bytes from 172.16.60.53: icmp_seq=1 ttl=64 time=0.236 ms
64 bytes from 172.16.60.53: icmp_seq=2 ttl=64 time=0.232 ms
64 bytes from 172.16.60.53: icmp_seq=3 ttl=64 time=0.240 ms
^C
--- 172.16.60.53 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2431ms
rtt min/avg/max/mdev = 0.232/0.236/0.240/0.003 ms

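How should the routes be set up in this situation?

Nothing in the routing table looks obviously wrong: traffic to the 172.16.60.XX range goes out through eth0, and traffic to other ranges is sent through eth0 to the default gateway 172.16.60.1.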
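One common way to make two interfaces on the same subnet behave predictably is source-based policy routing: each address gets its own routing table, and an ip rule selects the table by source address, so replies leave through the interface that owns the address. The sketch below only prints the commands. The table numbers 100 and 101 and the function name print_policy_routes are assumptions made for illustration, and nothing in these notes confirms that this resolves the problem on this particular machine. Kernel settings such as arp_ignore, arp_announce and rp_filter may also need adjusting.

```shell
#!/bin/sh
# print_policy_routes IFACE ADDR SUBNET GATEWAY TABLE
# Prints (does not run) the iproute2 commands for one interface.
# Function name and table numbers are illustrative choices.
print_policy_routes() {
    iface=$1 addr=$2 subnet=$3 gw=$4 table=$5
    echo "ip route add $subnet dev $iface src $addr table $table"
    echo "ip route add default via $gw dev $iface table $table"
    echo "ip rule add from $addr table $table"
}

print_policy_routes eth0 172.16.60.150 172.16.60.0/24 172.16.60.1 100
print_policy_routes eth1 172.16.60.151 172.16.60.0/24 172.16.60.1 101
```

After reviewing the printed commands and running them as root, ip rule show and ip route show table 100 can be used to check what was installed.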

Reference

Message Queue Middleware

Introduction

  • MOM: Message Oriented Middleware
  • A message queue (MQ) is middleware that sits between applications and the network.

There are a number of open source choices of messaging middleware systems, including

Compare

  • Application A <-> Network <-> Application B
  • Application A <-> MOM <-> Application B

Standard and protocol

  • AMQP

Using ActiveMQ

1. Download and run ActiveMQ:

  • Download the latest ActiveMQ release from http://activemq.apache.org/download.html
  • tar xvf apache-activemq-5.11.0-bin.tar.gz
  • cd apache-activemq-5.11.0
  • ./bin/activemq start
  • Open the web console at http://localhost:8161. The default username and password are admin/admin
  • Check that it is running:
    • netstat -apn |grep 61616
    • ps aux |grep activemq
  • ./bin/activemq stop
  • For more information, read docs/user-guide.html

2. Build the openwire-cpp example:

  • yum install activemq-cpp-devel
  • cd /home/dennis/Downloads/apache-activemq-5.11.0/examples/openwire/cpp
  • gcc Listener.cpp -o listener -I/usr/include/activemq-cpp-3.8.3 -I/usr/include/apr-1 -lactivemq-cpp -lstdc++
  • gcc Publisher.cpp -o publisher -I/usr/include/activemq-cpp-3.8.3 -I/usr/include/apr-1 -lactivemq-cpp -lstdc++

3. Run the example:

  • 3.1 Run service as root

    [dennis@localhost apache-activemq-5.11.0]$ su
    Password:
    [root@localhost apache-activemq-5.11.0]# ./bin/activemq start
    INFO: Loading '/home/dennis/Downloads/apache-activemq-5.11.0/bin/env'
    INFO: Using java '/usr/bin/java'
    INFO: Process with pid '3510' is already running

  • 3.2 Run listener

    [dennis@localhost cpp]$ ./listener

    Starting the Listener example:

    Waiting for messages...
    Received 1000 messages.
    Received 2000 messages.
    Received 3000 messages.
    Received 4000 messages.
    Received 5000 messages.
    Received 6000 messages.
    Received 7000 messages.
    Received 8000 messages.
    Received 9000 messages.
    Received 10000 messages.

    Received 10000 in 1.606 seconds

    Finished with the example.

  • 3.3 Run publisher

    [dennis@localhost cpp]$ ./publisher

    Starting the Publisher example:

    Sent 1000 messages
    Sent 2000 messages
    Sent 3000 messages
    Sent 4000 messages
    Sent 5000 messages
    Sent 6000 messages
    Sent 7000 messages
    Sent 8000 messages
    Sent 9000 messages

    Sent 10000 messages

    Finished with the example.

Reference

Linux network performance monitoring

ethtool

iptraf

netperf

iperf

Build from source:

[root@localhost ~]# tar xvf iperf-2.0.5.tar.gz 
...
[root@localhost ~]# cd iperf-2.0.5
[root@localhost iperf-2.0.5]# ./configure 
...
[root@localhost iperf-2.0.5]# make
...

On server machine:

[root@Ustor iperf-2.0.5]# ./src/iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.16.110.80 port 5001 connected with 192.16.110.50 port 43341
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  9.20 GBytes  7.90 Gbits/sec
[  5] local 192.16.110.80 port 5001 connected with 192.16.110.50 port 43342
[  5]  0.0-10.0 sec  9.26 GBytes  7.95 Gbits/sec
[  4] local 192.16.110.80 port 5001 connected with 192.16.110.50 port 43343
[  4]  0.0-10.0 sec  8.98 GBytes  7.71 Gbits/sec

On Client machine:

[root@localhost iperf-2.0.5]# ./src/iperf -c 192.16.110.80 -f M -i 2
------------------------------------------------------------
Client connecting to 192.16.110.80, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[  3] local 192.16.110.50 port 43343 connected with 192.16.110.80 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  1598 MBytes   799 MBytes/sec
[  3]  2.0- 4.0 sec  1919 MBytes   960 MBytes/sec
[  3]  4.0- 6.0 sec  1892 MBytes   946 MBytes/sec
[  3]  6.0- 8.0 sec  1894 MBytes   947 MBytes/sec
[  3]  8.0-10.0 sec  1893 MBytes   947 MBytes/sec
[  3]  0.0-10.0 sec  9196 MBytes   920 MBytes/sec
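
Parameters:

  • General options
    • -f [k|m|K|M] report in Kbits, Mbits, KBytes or MBytes respectively; without -f the server run above reported Gbits/sec, so the default unit appears to scale with the rate. e.g. iperf -c 222.35.11.23 -f K
    • -i sec interval between reports, in seconds, e.g. iperf -c 222.35.11.23 -i 2
    • -l buffer size, default 8 KB, e.g. iperf -c 222.35.11.23 -l 16
    • -m print the TCP maximum segment size (MTU minus the TCP/IP header)
    • -o write the report and error messages to a file, e.g. iperf -c 222.35.11.23 -o c:\iperflog.txt
    • -p port the server listens on or the client connects to, e.g. iperf -s -p 9999; iperf -c 222.35.11.23 -p 9999
    • -u use UDP; when testing HTB it is best to use UDP, which has less protocol overhead, so the measured bandwidth is more accurate
    • -w set the TCP window size. The default is often quoted as 8 KB, but the runs above report 85.3 KByte (server) and 0.02 MByte (client) as the defaults. If the window is too small, packets may be lost
    • -B bind to a host address or interface (use this when the host has several addresses or interfaces)
    • -C compatibility mode for older versions (use this when the server and client versions differ)
    • -M set the TCP maximum segment size (MSS)
    • -N set TCP no delay (disable Nagle's algorithm)
    • -V use IPv6
  • Server-only options
    • -D run iperf as a daemon, e.g. iperf -s -D
    • -R stop the iperf service started with -D, e.g. iperf -s -R
  • Client-only options
    • -d run a bidirectional test in both directions simultaneously
    • -n number of bytes to transmit, e.g. iperf -c 222.35.11.23 -n 100000
    • -r run a bidirectional test one direction at a time
    • -b set the sending bandwidth, default 1 Mbit/s. This is the most useful option when testing QoS.
    • -t test duration, default 10 seconds, e.g. iperf -c 222.35.11.23 -t 5
    • -F file to transmit
    • -T TTL value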
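
The server and client reports above use different units: the server prints Gbits/sec, while the client, run with -f M, prints MBytes/sec. Judging from the figures themselves (9.20 GBytes in 10 s shown as 7.90 Gbits/sec), this iperf build counts bytes in powers of 1024 and bits in powers of 1000. A small sketch of the conversion under that assumption follows; mbytes_to_gbits is a made-up helper name.

```shell
#!/bin/sh
# mbytes_to_gbits RATE: convert MBytes/sec (2^20 bytes) to Gbits/sec (10^9 bits).
# Assumes binary megabytes and decimal gigabits, as the reports above suggest.
mbytes_to_gbits() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 1048576 * 8 / 1e9 }'
}

# Client total: 9196 MBytes over 10 s = 919.6 MBytes/sec.
mbytes_to_gbits 919.6
```

This prints 7.71, the bandwidth the server shows for the same connection (client port 43343).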

tcpdump tcptrace

  • tcpdump -i eth0
  • tcpdump -i eth0 host 172.16.30.44 and port 80
  • tcpdump -U port 3260 -w /tmp/tcpdump.pcap

Reference

checking network speed

Record of a network performance check

  • Start by collecting various information (network, CPU, memory, I/O, etc.)

    [root@Ustor ~]# ifconfig
    eth0      Link encap:Ethernet  HWaddr 40:16:7E:35:C7:C2  
              inet addr:172.16.130.158  Bcast:0.0.0.0  Mask:255.255.255.0
              inet6 addr: fe80::4216:7eff:fe35:c7c2/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:37825154 errors:0 dropped:0 overruns:0 frame:0
              TX packets:214565 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:2321429968 (2.1 GiB)  TX bytes:226462903 (215.9 MiB)
              Interrupt:16 Memory:dc400000-dc420000 
    
    eth1      Link encap:Ethernet  HWaddr 40:16:7E:35:C7:C3  
              UP BROADCAST MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
              Interrupt:17 Memory:dc300000-dc320000 
    
    eth2      Link encap:Ethernet  HWaddr 00:90:FA:6C:E4:0A  
              inet addr:192.16.110.50  Bcast:0.0.0.0  Mask:255.255.255.0
              inet6 addr: fe80::290:faff:fe6c:e40a/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:775026147 errors:0 dropped:1576 overruns:0 frame:0
              TX packets:1032091357 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:892065492548 (830.8 GiB)  TX bytes:1034532529750 (963.4 GiB)
    
    eth3      Link encap:Ethernet  HWaddr 00:90:FA:6C:E4:0E  
              inet6 addr: fe80::290:faff:fe6c:e40e/64 Scope:Link
              UP BROADCAST MULTICAST  MTU:1500  Metric:1
              RX packets:6593 errors:0 dropped:0 overruns:0 frame:0
              TX packets:5748 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:2011976 (1.9 MiB)  TX bytes:345000 (336.9 KiB)
    
    lo        Link encap:Local Loopback  
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:21995 errors:0 dropped:0 overruns:0 frame:0
              TX packets:21995 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:1234012 (1.1 MiB)  TX bytes:1234012 (1.1 MiB)
    
    [root@Ustor ~]# ping 192.16.110.60
    PING 192.16.110.60 (192.16.110.60) 56(84) bytes of data.
    64 bytes from 192.16.110.60: icmp_seq=1 ttl=128 time=0.147 ms
    64 bytes from 192.16.110.60: icmp_seq=2 ttl=128 time=0.197 ms
    64 bytes from 192.16.110.60: icmp_seq=3 ttl=128 time=0.195 ms
    64 bytes from 192.16.110.60: icmp_seq=4 ttl=128 time=0.173 ms
    64 bytes from 192.16.110.60: icmp_seq=5 ttl=128 time=0.197 ms
    ^C
    --- 192.16.110.60 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4079ms
    rtt min/avg/max/mdev = 0.147/0.181/0.197/0.026 ms
    
    [root@Ustor ~]# route 
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    172.16.130.0    *               255.255.255.0   U     0      0        0 eth0
    192.16.110.0    *               255.255.255.0   U     0      0        0 eth2
    default         192.16.110.1    0.0.0.0         UG    0      0        0 eth2
    default         172.16.130.1    0.0.0.0         UG    0      0        0 eth0
    [root@Ustor ~]# ethtool -i eth2
    driver: be2net
    version: 4.1.307r
    firmware-version: 10.0.803.19
    bus-info: 0000:02:00.0
    [root@Ustor ~]# ethtool eth2
    Settings for eth2:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full 
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
    [root@Ustor ~]# iostat 
    Linux 2.6.32-279.el6.x86_64 (Ustor)     12/31/2014     _x86_64_    (2 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.36    0.00    1.23    1.08    0.00   97.33
    
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    sda               3.56        45.00        24.38   19491848   10557699
    sdb              28.76      1274.39      2100.99  551950065  909958848
    dm-0            224.75       639.32      1158.70  276896736  501843480
    dm-1            197.05       634.96       941.44  275005928  407746632
    dm-2              0.01         0.02         0.85      10001     368272
    
    [root@Ustor ~]# fdisk -l
    
    Disk /dev/sda: 8012 MB, 8012390400 bytes
    255 heads, 63 sectors/track, 974 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x53c628b9
    
       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *           1          26      204800   83  Linux
    Partition 1 does not end on cylinder boundary.
    /dev/sda2              26         287     2097152   82  Linux swap / Solaris
    Partition 2 does not end on cylinder boundary.
    /dev/sda3             287         300      102400   83  Linux
    Partition 3 does not end on cylinder boundary.
    /dev/sda4             300         975     5419224    5  Extended
    Partition 4 does not end on cylinder boundary.
    /dev/sda5             300         313      102400   83  Linux
    /dev/sda6             313         975     5314560   83  Linux
    
    Disk /dev/sdb: 36002.0 GB, 36002026487808 bytes
    255 heads, 63 sectors/track, 4376997 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 4096 bytes / 4096 bytes
    Disk identifier: 0x00000000
    
Disk /dev/mapper/r55-i01: 2097.2 GB, 2097152000000 bytes
255 heads, 63 sectors/track, 254964 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x2c0893c9


Disk /dev/mapper/r55-i02: 2097.2 GB, 2097152000000 bytes
255 heads, 63 sectors/track, 254964 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0xcbbcf789


Disk /dev/mapper/r55-ee: 349.5 GB, 349526032384 bytes
255 heads, 63 sectors/track, 42494 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


#### Hardware RAID card ####
[root@Ustor ~]# lspci  
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express Root Port (rev 09)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b5)
00:1c.5 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C202 Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
02:00.0 Ethernet controller: Emulex Corporation OneConnect 10Gb NIC (be3) (rev 01)
02:00.1 Ethernet controller: Emulex Corporation OneConnect 10Gb NIC (be3) (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 10)

[root@Ustor ~]# MegaCli -LdPdInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :r55
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 32.743 TB
Parity Size         : 3.637 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 10

[root@Ustor ~]# MegaCli -LdPdInfo -aAll |grep -i 'raw size'
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]

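#### /dev/sdb is reported as 36002 GB, i.e. 36002/1024 = 35.158 TB ####
#### But the whole RAID should be 3.637 TB * 9 = 32.733 TB, so why does /dev/sdb below come out as 35.158 TB? ####
#### Note: fdisk's GB is decimal (10^9 bytes), so dividing by 1024 mixes units; 36002026487808 bytes / 2^40 = 32.74 TiB, in line with the 32.743 TB MegaCli reports ####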
[root@Ustor ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 36002.0 GB, 36002026487808 bytes
255 heads, 63 sectors/track, 4376997 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
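
The size question in the comments above comes down to units. As a quick check, the sketch below converts the byte count from fdisk into decimal terabytes (10^12 bytes) and binary tebibytes (2^40 bytes). bytes_to_units is a hypothetical helper name.

```shell
#!/bin/sh
# bytes_to_units BYTES: print the size in decimal TB and binary TiB.
bytes_to_units() {
    awk -v b="$1" 'BEGIN { printf "%.2f TB, %.2f TiB\n", b / 1e12, b / 1099511627776 }'
}

bytes_to_units 36002026487808   # byte count fdisk reports for /dev/sdb
```

The binary figure is close to the 32.743 TB that MegaCli shows for the virtual drive, which suggests MegaCli counts in powers of 1024 while fdisk counts in powers of 1000.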

[root@Ustor ~]# iostat 1
Linux 2.6.32-279.el6.x86_64 (Ustor)     12/31/2014     _x86_64_    (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.37    0.00    1.25    1.08    0.00   97.30

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               3.57        46.38        24.39   20146336   10593379
sdb              28.77      1270.72      2136.50  551950697  928014000
dm-0            229.04       637.48      1194.85  276896840  518997016
dm-1            196.74       633.13       940.80  275006032  408647752
dm-2              0.01         0.02         0.85      10169     368768

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   18.59    0.00    0.00   81.41

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sdb              72.00         0.00     24584.00          0      24584
dm-0              0.00         0.00         0.00          0          0
dm-1           3072.00         0.00     24576.00          0      24576
dm-2              1.00         0.00         8.00          0          8

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   11.50    0.00    0.00   88.50

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sdb              37.00         0.00     16400.00          0      16400
dm-0              0.00         0.00         0.00          0          0
dm-1           2048.00         0.00     16384.00          0      16384
dm-2              2.00         0.00        16.00          0         16

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00   18.69    0.00    0.00   76.26

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sdb              72.00         0.00     32768.00          0      32768
dm-0              0.00         0.00         0.00          0          0
dm-1           4096.00         0.00     32768.00          0      32768
dm-2              0.00         0.00         0.00          0          0
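
A quick arithmetic check helps when reading these samples. The sketch below assumes iostat's "blocks" are 512-byte sectors (the usual case on 2.6 kernels); the helper names `blk_to_mib` and `avg_req_kib` are made up for illustration.

```shell
# Convert an iostat Blk_wrtn/s (or Blk_read/s) value to MiB/s.
# Assumption: one iostat block is a 512-byte sector.
blk_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b * 512 / 1048576 }'
}

# Average request size in KiB, from blocks per second and tps.
avg_req_kib() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.1f\n", b * 512 / t / 1024 }'
}

blk_to_mib 24584         # sdb in the second sample, prints 12.00
avg_req_kib 24576 3072   # dm-1 in the second sample, prints 4.0
```

Under that assumption, sdb is writing only about 12 MiB/s in the second sample, and dm-1 is receiving 4 KiB requests.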

#### dd sustains about 1 GB/s, which suggests the disk is not the bottleneck ####
[root@Ustor ~]# dd if=/dev/zero of=/dev/sdb bs=1M count=102400
28309+0 records in
28309+0 records out
29684137984 bytes (30 GB) copied, 28.4613 s, 1.0 GB/s
33270+0 records in
33270+0 records out
34886123520 bytes (35 GB) copied, 33.4982 s, 1.0 GB/s
38234+0 records in
38234+0 records out
40091254784 bytes (40 GB) copied, 38.5252 s, 1.0 GB/s
^C42879+0 records in
42879+0 records out
44961890304 bytes (45 GB) copied, 43.2484 s, 1.0 GB/s

[root@Ustor ~]# ls /dev/r55/
ee  i01  i02
[root@Ustor ~]# ls /dev/r55/ee -l
lrwxrwxrwx 1 root root 7 Dec 30 16:42 /dev/r55/ee -> ../dm-2
[root@Ustor ~]# dd if=/dev/zero of=/dev/r55/ee bs=1M count=1024000
2552+0 records in
2552+0 records out
2675965952 bytes (2.7 GB) copied, 2.31478 s, 1.2 GB/s
7516+0 records in
7516+0 records out
7881097216 bytes (7.9 GB) copied, 7.34263 s, 1.1 GB/s

[root@Ustor ~]# dd if=/dev/zero of=/dev/r55/i01 bs=1M count=1024000
2869+0 records in
2869+0 records out
3008364544 bytes (3.0 GB) copied, 2.73699 s, 1.1 GB/s
7841+0 records in
7841+0 records out
8221884416 bytes (8.2 GB) copied, 7.76994 s, 1.1 GB/s

[root@Ustor ~]# dd if=/dev/zero of=/dev/r55/i02 bs=1M count=1024000
2751+0 records in
2751+0 records out
2884632576 bytes (2.9 GB) copied, 2.53248 s, 1.1 GB/s
7704+1 records in
7704+0 records out
8078229504 bytes (8.1 GB) copied, 7.56027 s, 1.1 GB/s

[root@Ustor ~]# blockdev --report /dev/sdb
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0  36002026487808   /dev/sdb
[root@Ustor ~]# blockdev --report /dev/r55/ee 
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw 16384   512  4096          0    349526032384   /dev/r55/ee
[root@Ustor ~]# blockdev --report /dev/r55/i01
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   2097152000000   /dev/r55/i01
[root@Ustor ~]# blockdev --report /dev/r55/i02
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   2097152000000   /dev/r55/i0

[root@Ustor ~]# sysctl -a |grep ipv4.tcp
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 131072
net.ipv4.tcp_max_tw_buckets = 131072
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 173856    231808    347712
net.ipv4.tcp_wmem = 4096    16384    4194304
net.ipv4.tcp_rmem = 4096    87380    4194304
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_available_congestion_control = cubic reno
net.ipv4.tcp_allowed_congestion_control = cubic reno
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.tcp_thin_linear_timeouts = 0
net.ipv4.tcp_thin_dupack = 0

[root@Ustor ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

[root@Ustor ~]# tcpdump -i eth2 -w /tmp/tcpdump01.pcap
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 65535 bytes
^C154277 packets captured
979189 packets received by filter
824910 packets dropped by kernel
[root@Ustor ~]# ls /tmp/tcpdump01.pcap -lh
-rw-r--r-- 1 root root 187M Dec 31 10:47 /tmp/tcpdump01.pcap
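
The tcpdump counters above are worth turning into a percentage before trusting any statistics taken from this capture. A minimal sketch follows; the helper name `drop_pct` is hypothetical.

```shell
# Percentage of packets the kernel dropped before tcpdump could save them.
# Args: "packets received by filter", "packets dropped by kernel".
drop_pct() {
    awk -v recv="$1" -v drop="$2" 'BEGIN { printf "%.1f\n", 100 * drop / recv }'
}

drop_pct 979189 824910   # prints 84.2
```

Roughly 84% of the packets never reached the pcap file, so the tshark numbers below describe only a sample of the traffic. A smaller snap length (`tcpdump -s`) or a larger capture buffer (`tcpdump -B`) are the usual things to try, though whether they would have helped here is untested.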
  • tshark analysis

    #### Download the capture with scp and analyze it locally with tshark ####
    [dennis@localhost ~]$ scp root@172.16.130.158:/tmp/tcpdump01.pcap ./
    root@172.16.130.158's password: 
    tcpdump01.pcap                         100%  186MB  11.0MB/s   00:17 
    [dennis@localhost ~]$ capinfos tcpdump01.pcap 
    File name:           tcpdump01.pcap
    File type:           Wireshark/tcpdump/... - pcap
    File encapsulation:  Ethernet
    Packet size limit:   file hdr: 65535 bytes
    Number of packets:   154 k
    File size:           195 MB
    Data size:           192 MB
    Capture duration:    40 seconds
    Start time:          Wed Dec 31 10:47:13 2014
    End time:            Wed Dec 31 10:47:53 2014
    Data byte rate:      4,816 kBps
    Data bit rate:       38 Mbps
    Average packet size: 1249.82 bytes
    Average packet rate: 3,853 packets/sec
    SHA1:                b08cff272eb5b0d9aa64fb31343314603df44309
    RIPEMD160:           70123da08082954af0d91dde9d9766a61c0a93db
    MD5:                 0571ac4f35410047a13cdfbd3bf3bfe9
    Strict time order:   False
    
    #### Retransmissions do not seem to have much impact? ####
    [dennis@localhost ~]$ tshark -n -q -r tcpdump01.pcap  -z "io,stat,0,tcp.analysis.retransmission"
    
    =======================================================
    | IO Statistics                                       |
    |                                                     |
    | Interval size: 40.0 secs (dur)                      |
    | Col 1: Frames and bytes                             |
    |     2: tcp.analysis.retransmission                  |
    |-----------------------------------------------------|
    |              |1                   |2                |
    | Interval     | Frames |   Bytes   | Frames |  Bytes |
    |-----------------------------------------------------|
    |  0.0 <> 40.0 | 154277 | 192818724 |    141 | 191110 |
    =======================================================
    [dennis@localhost ~]$ tshark -n -q -r tcpdump01.pcap  -z "io,stat,0,tcp.analysis.out_of_order"
    
    =======================================================
    | IO Statistics                                       |
    |                                                     |
    | Interval size: 40.0 secs (dur)                      |
    | Col 1: Frames and bytes                             |
    |     2: tcp.analysis.out_of_order                    |
    |-----------------------------------------------------|
    |              |1                   |2                |
    | Interval     | Frames |   Bytes   | Frames |  Bytes |
    |-----------------------------------------------------|
    |  0.0 <> 40.0 | 154277 | 192818724 |    620 | 898920 |
    =======================================================
    [dennis@localhost ~]$ tshark -n -q -r tcpdump01.pcap  -z "io,stat,5,tcp.analysis.out_of_order"
    
    ==================================================
    | IO Statistics                                  |
    |                                                |
    | Interval size: 5 secs                          |
    | Col 1: Frames and bytes                        |
    |     2: tcp.analysis.out_of_order               |
    |------------------------------------------------|
    |          |1                  |2                |
    | Interval | Frames |   Bytes  | Frames |  Bytes |
    |------------------------------------------------|
    |  0 <>  5 |  10837 | 14246599 |      1 |   1514 |
    |  5 <> 10 |  49803 | 61236860 |    184 | 264780 |
    | 10 <> 15 |  48890 | 61739426 |    186 | 271384 |
    | 15 <> 20 |  42976 | 55481613 |    249 | 361242 |
    | 20 <> 25 |    397 |    24870 |      0 |      0 |
    | 25 <> 30 |    461 |    29156 |      0 |      0 |
    | 30 <> 35 |    488 |    31931 |      0 |      0 |
    | 35 <> 40 |    421 |    28029 |      0 |      0 |
    | 40 <> 40 |      4 |      240 |      0 |      0 |
    ==================================================
    [dennis@localhost ~]$ tshark -n -q -r tcpdump01.pcap  -z "io,stat,5,tcp.analysis.retransmission"
    
    =================================================
    | IO Statistics                                 |
    |                                               |
    | Interval size: 5 secs                         |
    | Col 1: Frames and bytes                       |
    |     2: tcp.analysis.retransmission            |
    |-----------------------------------------------|
    |          |1                  |2               |
    | Interval | Frames |   Bytes  | Frames | Bytes |
    |-----------------------------------------------|
    |  0 <>  5 |  10837 | 14246599 |      3 |  4542 |
    |  5 <> 10 |  49803 | 61236860 |     62 | 85484 |
    | 10 <> 15 |  48890 | 61739426 |     41 | 53682 |
    | 15 <> 20 |  42976 | 55481613 |     35 | 47402 |
    | 20 <> 25 |    397 |    24870 |      0 |     0 |
    | 25 <> 30 |    461 |    29156 |      0 |     0 |
    | 30 <> 35 |    488 |    31931 |      0 |     0 |
    | 35 <> 40 |    421 |    28029 |      0 |     0 |
    | 40 <> 40 |      4 |      240 |      0 |     0 |
    =================================================
    [dennis@localhost ~]$ 
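
The raw frame counts are easier to judge as a share of all frames. The sketch below reads a tshark `io,stat` table like the ones above on stdin and prints, per interval, column 2 frames as a percentage of column 1 frames. The function name `tshark_ratio` is made up, and the parsing assumes the exact table layout shown above.

```shell
# Read a tshark "io,stat" table on stdin; for every data row print
# "<interval> <col2 frames as % of col1 frames>".
tshark_ratio() {
    awk -F'|' '$2 ~ /<>/ && $3 + 0 > 0 {
        gsub(/^ +| +$/, "", $2)                 # trim the interval label
        printf "%s %.2f%%\n", $2, 100 * $5 / $3
    }'
}

printf '%s\n' '|  5 <> 10 |  49803 | 61236860 |    184 | 264780 |' | tshark_ratio
```

A possible use: `tshark -n -q -r tcpdump01.pcap -z "io,stat,5,tcp.analysis.retransmission" | tshark_ratio`.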
    
  • Could the NIC driver be at fault? Reportedly this Emulex card works out of the box, with no separate driver installation needed

    [root@Ustor ~]# modinfo be2net
    filename:       /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/be2net.ko
    license:        GPL
    author:         ServerEngines Corporation
    description:    ServerEngines BladeEngine 10Gbps NIC Driver 4.1.307r
    version:        4.1.307r
    srcversion:     7076CA6C80C3CD968BAFFFC
    alias:          pci:v000010DFd00000720sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E228sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E220sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000710sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000700sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000221sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000211sv*sd*bc*sc*i*
    depends:        
    vermagic:       2.6.32-279.el6.x86_64 SMP mod_unload modversions 
    parm:           num_vfs:Number of PCI VFs to initialize (uint)
    parm:           multi_rxq:Obsolete and used only for compatibility (bool)
    parm:           rx_frag_size:Size of a fragment that holds rcvd data. (ushort)
    [root@Ustor ~]# lspci -nn |grep Emulex
    02:00.0 Ethernet controller [0200]: Emulex Corporation OneConnect 10Gb NIC (be3) [19a2:0710] (rev 01)
    02:00.1 Ethernet controller [0200]: Emulex Corporation OneConnect 10Gb NIC (be3) [19a2:0710] (rev 01)
    The model ID (vendor:device) is 19a2:0710
    [root@Ustor ~]# grep -irn '19a2.*0710' /lib/modules/2.6.32-279.el6.x86_64/modules.alias
    3732:alias pci:v000019A2d00000710sv*sd*bc*sc*i* be2net
    
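  • Increasing the disk read-ahead size brought no real improvement: throughput stayed at about 100MB/s, and the network graph on the Windows side was still uneven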

    [root@Ustor SOURCES]# blockdev --report
    RO    RA   SSZ   BSZ   StartSec            Size   Device
    rw 16384   512  4096          0      8012390400   /dev/sda
    rw 16384   512  1024       2048       209715200   /dev/sda1
    rw 16384   512  4096     411648      2147483648   /dev/sda2
    rw 16384   512  1024    4605952       104857600   /dev/sda3
    rw 16384   512  1024    4810752            1024   /dev/sda4
    rw 16384   512  1024    4812800       104857600   /dev/sda5
    rw 16384   512  4096    5019648      5442109440   /dev/sda6
    rw   256   512  4096          0  36002026487808   /dev/sdb
    rw   256   512  4096          0   2097152000000   /dev/dm-0
    rw   256   512  4096          0   2097152000000   /dev/dm-1
    rw 16384   512  4096          0    349526032384   /dev/dm-2
    [root@Ustor SOURCES]# blockdev --setra 16384 /dev/dm-1
    [root@Ustor SOURCES]# blockdev --setra 16384 /dev/dm-0
    [root@Ustor SOURCES]# blockdev --setra 16384 /dev/sdb
    [root@Ustor SOURCES]# blockdev --report
    RO    RA   SSZ   BSZ   StartSec            Size   Device
    rw 16384   512  4096          0      8012390400   /dev/sda
    rw 16384   512  1024       2048       209715200   /dev/sda1
    rw 16384   512  4096     411648      2147483648   /dev/sda2
    rw 16384   512  1024    4605952       104857600   /dev/sda3
    rw 16384   512  1024    4810752            1024   /dev/sda4
    rw 16384   512  1024    4812800       104857600   /dev/sda5
    rw 16384   512  4096    5019648      5442109440   /dev/sda6
    rw 16384   512  4096          0  36002026487808   /dev/sdb
    rw 16384   512  4096          0   2097152000000   /dev/dm-0
    rw 16384   512  4096          0   2097152000000   /dev/dm-1
    rw 16384   512  4096          0    349526032384   /dev/dm-2
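
For reference, the RA column of `blockdev --report` (and the argument of `--setra`) is counted in 512-byte sectors. A small sketch of the conversion, with a made-up helper name `ra_to_kib`:

```shell
# Convert a blockdev read-ahead value (512-byte sectors) to KiB.
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_to_kib 256     # prints 128  (the old setting on sdb, dm-0, dm-1)
ra_to_kib 16384   # prints 8192 (the new setting, i.e. 8 MiB)
```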
    

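The 4k; 0% read; 0% random workload turned out faster than 1M; 0% read; 0% random,
although 1M is normally the faster of the two.
4k reached about 25% of the 10G link with a fairly smooth throughput curve, while 1M reached only about 8% with a very uneven curve.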

  • Build the driver from the source RPM

    rpm -ivh be2net-10.2.470.14-1.src.rpm
    cd rpmbuild/SOURCES
    tar xvf be2net-10.2.470.14.tar.gz 
    cd be2net-10.2.470.14 
    make
    cp /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/be2net.ko{,.bak}
    cp ./be2net.ko /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/be2net.ko
    modinfo be2net
    reboot
    ethtool -i eth2
    
    [root@Ustor be2net-10.2.470.14]# modinfo ./be2net.ko
    filename:       ./be2net.ko
    supported:      external
    license:        GPL
    author:         Emulex Corporation
    description:    Emulex OneConnect NIC Driver 10.2.470.14
    version:        10.2.470.14
    srcversion:     DE6CC9D92A3CE9C042DA005
    alias:          pci:v000010DFd00000728sv*sd*bc*sc*i*
    alias:          pci:v000010DFd00000730sv*sd*bc*sc*i*
    alias:          pci:v000010DFd00000720sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E228sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E220sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000710sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000700sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000221sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000211sv*sd*bc*sc*i*
    depends:        
    vermagic:       2.6.32-279.el6.x86_64 SMP mod_unload modversions 
    parm:           rss_on_mc:Enable RSS in multi-channel functions with the capability. Disabled by default. (ushort)
    parm:           tx_prio:Create priority based TX queues. Disabled by default (uint)
    parm:           num_vfs:Number of PCI VFs to initialize (uint)
    parm:           rx_frag_size:Size of receive fragment buffer - 2048 (default), 4096 or 8192 (ushort)
    parm:           gro:Enable or Disable GRO. Enabled by default (uint)
    parm:           emi_canceller:Enable or Disable EMI Canceller. Disabled by default (uint)
    
  • Install the new driver and reboot the machine

    [root@Ustor be2net-10.2.470.14]# cp be2net.ko /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/
    [root@Ustor be2net-10.2.470.14]# ls /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/
    be2net.ko  be2net.ko.bak
    [root@Ustor be2net-10.2.470.14]# modinfo be2net
    filename:       /lib/modules/2.6.32-279.el6.x86_64/kernel/drivers/net/benet/be2net.ko
    supported:      external
    license:        GPL
    author:         Emulex Corporation
    description:    Emulex OneConnect NIC Driver 10.2.470.14
    version:        10.2.470.14
    srcversion:     DE6CC9D92A3CE9C042DA005
    alias:          pci:v000010DFd00000728sv*sd*bc*sc*i*
    alias:          pci:v000010DFd00000730sv*sd*bc*sc*i*
    alias:          pci:v000010DFd00000720sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E228sv*sd*bc*sc*i*
    alias:          pci:v000010DFd0000E220sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000710sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000700sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000221sv*sd*bc*sc*i*
    alias:          pci:v000019A2d00000211sv*sd*bc*sc*i*
    depends:        
    vermagic:       2.6.32-279.el6.x86_64 SMP mod_unload modversions 
    parm:           rss_on_mc:Enable RSS in multi-channel functions with the capability. Disabled by default. (ushort)
    parm:           tx_prio:Create priority based TX queues. Disabled by default (uint)
    parm:           num_vfs:Number of PCI VFs to initialize (uint)
    parm:           rx_frag_size:Size of receive fragment buffer - 2048 (default), 4096 or 8192 (ushort)
    parm:           gro:Enable or Disable GRO. Enabled by default (uint)
    parm:           emi_canceller:Enable or Disable EMI Canceller. Disabled by default (uint)
    [root@Ustor be2net-10.2.470.14]# modprobe be2net
    [root@Ustor be2net-10.2.470.14]# ethtool eth2
    Settings for eth2:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full 
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
    [root@Ustor be2net-10.2.470.14]# ethtool -i eth2
    driver: be2net
    version: 4.1.307r
    firmware-version: 10.0.803.19
    bus-info: 0000:02:00.0
    [root@Ustor be2net-10.2.470.14]# reboot
    
    Broadcast message from root@Ustor
        (/dev/pts/1) at 15:50 ...
    [dennis@localhost ~]$ ssh root@172.16.130.158
    root@172.16.130.158's password: 
    Last login: Wed Dec 31 09:32:07 2014 from 172.16.50.39
    [root@Ustor ~]# 
    [root@Ustor ~]# ethtool -i eth2
    driver: be2net
    version: 10.2.470.14
    firmware-version: 10.0.803.19
    bus-info: 0000:02:00.0
    [root@Ustor ~]# 
    

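Rerun the iometer test to see whether the new driver improves performance:
10G link utilization was 65% at about 780MB/s, and the network throughput curve was reasonably smooth.

Then restore the old driver and test again; if throughput falls back to about 100MB/s, the driver really is the cause.

After testing, throughput did fall back to about 100MB/s, so the driver appears to be the problem.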
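
The throughput and utilization figures can be cross-checked with a little arithmetic. The sketch below assumes the reported "MB/s" values are really MiB/s (2^20 bytes), which is a guess; the helper name `link_util` is made up.

```shell
# Link utilization in percent.
# Args: throughput in MiB/s (assumed unit), link speed in Gbit/s.
link_util() {
    awk -v mib="$1" -v gbps="$2" \
        'BEGIN { printf "%.1f\n", 100 * mib * 1048576 * 8 / (gbps * 1e9) }'
}

link_util 780 10   # prints 65.4
link_util 100 10   # prints 8.4
```

Under that assumption, 780 lines up with the 65% utilization seen with the new driver, and 100 with the roughly 8% seen with the old one.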

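Summary: low performance (only about 100MB/s) when testing the Emulex 10GbE NIC

  1. Hardware RAID: RAID5 built from 10 disks, exported as iSCSI volumes
  2. Windows client running iometer with 1M; 0% Read; 0% Random
  3. NIC driver be2net upgraded from 4.1.307r to 10.2.470.14

After checking and testing, the be2net NIC driver on the storage machine (172.16.130.158) was upgraded from 4.1.307r to 10.2.470.14.

With iometer (1M; 0% Read; 0% Random), throughput then rose from about 100MB/s to about 780MB/s.

It looks like the Emulex 10GbE driver shipped with the original system was at fault, and upgrading it fixed the problem.

Driver download URL:

Takeaway: a complicated-looking problem often has a simple fix behind it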

network analysis with wireshark

Filter

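1. Filter by IP, e.g. source or destination IP equal to a given address

Examples:
ip.src eq 192.168.1.107 or ip.dst eq 192.168.1.107
or
ip.addr eq 192.168.1.107 // matches the address as either source or destination

2. Filter by port

Examples:
tcp.port eq 80 // matches whether the port is source or destination
tcp.port == 80
tcp.port eq 2722
tcp.port eq 80 or udp.port eq 80
tcp.dstport == 80 // only TCP packets with destination port 80
tcp.srcport == 80 // only TCP packets with source port 80
udp.port eq 15000
Filter a port range
tcp.port >= 1 and tcp.port <= 80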

3. Filter by protocol

Examples:
tcp
udp
arp
icmp
http
smtp
ftp
dns
msnms
ip
ssl
oicq
bootp
and so on
To exclude ARP packets, use !arp or not arp

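4. Filter by MAC

Filter on the Ethernet header
eth.dst == A0:00:00:04:C5:84 // filter by destination MAC
eth.src eq A0:00:00:04:C5:84 // filter by source MAC
eth.dst==A0:00:00:04:C5:84
eth.dst==A0-00-00-04-C5-84
eth.addr eq A0:00:00:04:C5:84 // matches when either the source or the destination MAC is A0:00:00:04:C5:84
Comparison operators:
less than: < or lt
less than or equal: <= or le
equal: == or eq
greater than: > or gt
greater than or equal: >= or ge
not equal: != or ne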

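5. Filter by packet length

Examples:
udp.length == 26 // the fixed 8-byte UDP header plus the payload that follows it
tcp.len >= 7 // the TCP payload only (the data after the TCP header), not the TCP header itself
ip.len == 94 // everything except the fixed 14-byte Ethernet header, i.e. from the IP header to the end
frame.len == 119 // the whole frame, from the Ethernet header to the end
eth -> ip or arp -> tcp or udp -> data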
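
How the length fields nest can be shown with a small calculation. This is only a sketch: it assumes IPv4 with a 20-byte header (no options) and an untagged 14-byte Ethernet header, and the helper name `udp_lens` is made up.

```shell
# Arg: UDP payload length in bytes.
# Prints the values expected in udp.length, ip.len and frame.len.
udp_lens() {
    payload=$1
    udp_len=$((payload + 8))      # UDP header is always 8 bytes
    ip_len=$((udp_len + 20))      # assumes a 20-byte IPv4 header (no options)
    frame_len=$((ip_len + 14))    # assumes a 14-byte Ethernet header (no VLAN tag)
    echo "$udp_len $ip_len $frame_len"
}

udp_lens 18   # prints: 26 46 60
```

So a packet matching `udp.length == 26` carries 18 bytes of payload and would, under these assumptions, also match `ip.len == 46` and `frame.len == 60`.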

6. Filter on HTTP

Examples:
http.request.method == "GET"
http.request.method == "POST"
http.request.uri == "/img/logo-edu.gif"
http contains "GET"
http contains "HTTP/1."
// GET requests
http.request.method == "GET" && http contains "Host:"
http.request.method == "GET" && http contains "User-Agent:"
// POST requests
http.request.method == "POST" && http contains "Host:"
http.request.method == "POST" && http contains "User-Agent:"
// responses
http contains "HTTP/1.1 200 OK" && http contains "Content-Type:"
http contains "HTTP/1.0 200 OK" && http contains "Content-Type:"
These must contain:
Content-Type:

7. Filter on TCP parameters

tcp.flags // show packets that carry TCP flags
tcp.flags.syn == 1 // show packets with the TCP SYN flag set (the SYN bit is 0x02 in tcp.flags)
tcp.window_size == 0 && tcp.flags.reset != 1

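8. Filter on content

tcp[20] means take 1 byte at offset 20
tcp[20:] means take everything from offset 20 to the end
tcp[20:8] means take 8 bytes starting at offset 20
tcp[offset:n]
udp[8:3]==81:60:03 // skip 8 bytes, take 3 bytes, and compare them with the value after ==
udp[8:1]==32 // the general form is udp[offset:length]==value
eth.addr[0:3]==00:06:5B

Examples:
Check whether the first three bytes of the UDP payload are 0x20 0x21 0x22
The UDP header has a fixed length of 8 bytes
udp[8:3]==20:21:22
Check whether the first three bytes of the TCP payload are 0x20 0x21 0x22
The TCP header is usually 20 bytes long, but not always
tcp[20:3]==20:21:22
For an exact match, find out the TCP header length first
Syntax of matches (regular expression match) and contains (contains a given string)
ip.src==192.168.1.107 and udp[8:5] matches "\\x02\\x12\\x21\\x00\\x22"
ip.src==192.168.1.107 and udp contains 02:12:21:00:22
ip.src==192.168.1.107 and tcp contains "GET"
udp contains 7c:7c:7d:7d // matches UDP packets whose payload contains 0x7c7c7d7d, not necessarily starting at the first byte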
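
Since a TCP payload offset of 20 only holds when the header carries no options, it can help to compute the real header length. The TCP header stores it as a "data offset" in the high 4 bits of byte 12, counted in 32-bit words. A sketch with a made-up helper name `tcp_hdr_len`:

```shell
# Arg: byte 12 of the TCP header, e.g. 0x50.
# The high nibble is the data offset in 32-bit words, so multiply by 4.
tcp_hdr_len() {
    echo $(( ($1 >> 4) * 4 ))
}

tcp_hdr_len 0x50   # prints 20 (no options)
tcp_hdr_len 0x80   # prints 32 (20 bytes plus 12 bytes of options)
```

With a 32-byte header the payload starts at tcp[32:], not tcp[20:]. Wireshark also exposes the header length directly as the `tcp.hdr_len` field.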

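Examples:
Capture the local QQ login packet (the first byte is 0x02, the fourth and fifth bytes are 0x00 0x22, and the last byte is 0x03)
0x02 xx xx 0x00 0x22 ... 0x03
Correct:
oicq and udp[8:] matches "^\\x02[\\x00-\\xff][\\x00-\\xff]\\x00\\x22[\\x00-\\xff]+\\x03$"
oicq and udp[8:] matches "^\\x02[\\x00-\\xff]{2}\\x00\\x22[\\x00-\\xff]+\\x03$" // login packet
oicq and (udp[8:] matches "^\\x02[\\x00-\\xff]{2}\\x03$" or tcp[8:] matches "^\\x02[\\x00-\\xff]{2}\\x03$")
oicq and (udp[8:] matches "^\\x02[\\x00-\\xff]{2}\\x00\\x22[\\x00-\\xff]+\\x03$" or tcp[20:] matches "^\\x02[\\x00-\\xff]{2}\\x00\\x22[\\x00-\\xff]+\\x03$")
QQ numbers are not only found in 00:22 packets; other packets carry them too, subject to the conditions below (TCP packets carry them as well, but that case is not covered here):
oicq and udp[8:] matches "^\\x02[\\x00-\\xff]+\\x03$" and !(udp[11:2]==00:00) and !(udp[11:2]==00:80)
oicq and udp[8:] matches "^\\x02[\\x00-\\xff]+\\x03$" and !(udp[11:2]==00:00) and !(udp[15:4]==00:00:00:00)
Notes:
udp[15:4]==00:00:00:00 means the QQ number is empty
udp[11:2]==00:00 means the command code is 00:00
udp[11:2]==00:80 means the command code is 00:80
When the command code is 00:80, the QQ number is 00:00:00:00
Capture the account of a successful MSN login (the condition is USR 7 OK: the first three bytes are USR, OK follows after two 0x20 bytes, then comes one 0x20 byte, and after that the mail address)
USR xx OK mail@hotmail.com
Correct:
msnms and tcp and ip.addr==192.168.1.107 and tcp[20:] matches "^USR\\x20[\\x30-\\x39]+\\x20OK\\x20[\\x00-\\xff]+"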

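9. Filter on DNS

10. DHCP

Finding a rogue DHCP server is a good example of how to use Wireshark. Add a display filter rule
that shows every bootp.type==0x02 (Offer/Ack) message that does not come from the legitimate DHCP server:
bootp.type==0x02 and not ip.src==192.168.1.1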

11. msn

msnms && tcp[23:1] == 20 // MSN packets whose fourth payload byte is 0x20
msnms && tcp[20:1] >= 41 && tcp[20:1] <= 5A && tcp[21:1] >= 41 && tcp[21:1] <= 5A && tcp[22:1] >= 41 && tcp[22:1] <= 5A
msnms && tcp[20:3]=="USR" // packets whose command code is USR
msnms && tcp[20:3]=="MSG" // packets whose command code is MSG
tcp.port == 1863 || tcp.port == 80
How to tell whether a packet is an MSN packet that carries a command code:
1) The port is 1863 or 80, e.g. tcp.port == 1863 || tcp.port == 80
2) The first three bytes of the data are uppercase letters, e.g.
  tcp[20:1] >= 41 && tcp[20:1] <= 5A && tcp[21:1] >= 41 && tcp[21:1] <= 5A && tcp[22:1] >= 41 && tcp[22:1] <= 5A
3) The fourth byte is 0x20, e.g. tcp[23:1] == 20
4) MSN runs over TCP, e.g. tcp

case 1

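  • 1, “Our service suddenly became slow; neither the client (confirmed) nor the server (presumably) was changed.”

[沛满]: If you suspect the network, capture about 500MB of traffic while the service is slow and analyze it.
If the forum allows uploads, I can help analyze it too.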

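  • 2, “In Wireshark, is there a convenient way to find the response packet for a given TCP packet
    (or the original packet for a given response)? It can be worked out by hand from the sequence
    and ack numbers, but does Wireshark offer a handy tool? Our service sends many packets over
    one long-lived TCP connection.”

[沛满]: This has to start from the basics. TCP does not send one packet at a time; it sends a
batch of packets in one go to improve transfer efficiency, much as a courier brings many parcels
to our front desk in a single trip to cut down on round trips. The receiver then has two choices:
acknowledge every packet (the "response" you mention), or acknowledge only the last one to imply
that everything before it arrived. For example, if the sender sends 10 packets numbered 1 to 10
and none is lost, the receiver can either send 10 acknowledgements or just reply “ack 11”,
meaning packet 10 and everything before it arrived. When a packet is lost, say packet 10 of the
same 10 packets, the receiver can send 9 acknowledgements or just reply “ack 10”, meaning
packet 9 and everything before it arrived.

Once you understand this, it is clear there is no need to pair up every ack and seq in Wireshark,
because most packets never get an ack of their own even when they arrive normally.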
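
The cumulative acknowledgement rule from the answer above can be sketched in a few lines. This is a toy model that numbers whole packets 1, 2, 3, ... as in the example (real TCP counts bytes), and the function name `cum_ack` is made up.

```shell
# Args: the packet numbers that arrived, in increasing order.
# Prints the cumulative ack: the first packet number not yet received in order.
cum_ack() {
    next=1
    for seg in "$@"; do
        [ "$seg" -eq "$next" ] || break   # stop at the first gap
        next=$((next + 1))
    done
    echo "$next"
}

cum_ack 1 2 3 4 5 6 7 8 9 10   # prints 11: everything arrived
cum_ack 1 2 3 4 5 6 7 8 9      # prints 10: packet 10 is missing
```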

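  • 3, “Is there a convenient way to get delay statistics, for example the number of packets in some period whose ack took more than 1s to arrive?”

[沛满]: If you only want to know the network latency, ping is the simplest and most accurate tool; there is no need for Wireshark.
What usually matters is application-level latency, for example the time between sending a read request to a server and receiving
the read response. Wireshark provides this: for CIFS, the filter “smb.time > 1” shows every CIFS operation
that took more than one second. For NFS, use rpc.time (NFS is an RPC-based protocol).
HTTP has http.time as well.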

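  • 4, “Is there a convenient way to get packet-loss statistics, for example how many packets sent in some period never received an ack?”

[沛满]: Again, a missing ack does not necessarily mean a lost packet. For statistics on loss and retransmission,
go to Analyze -> Expert Info and look at the warnings and notes tabs. Getting the numbers is easy, but
interpreting them takes a grounding in TCP: a timeout retransmission, for instance, costs far more than
a fast retransmission, and with SACK enabled multiple losses are handled much more efficiently than without it. The full story is too long to cover here.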

Reference

how to fix NIC name confusion

Description

Here is an excerpt from the dmesg output:

igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
igb 0000:02:00.0: eth0: (PCIe:2.5GT/s:Width x1)
igb 0000:02:00.0: eth0: MAC: 00:1e:67:58:cc:b8
igb 0000:02:00.0: eth0: PBA No: 009000-000
igb 0000:02:00.0: LRO is disabled
igb 0000:02:00.0: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
igb 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
igb 0000:03:00.0: setting latency timer to 64
  alloc irq_desc for 32 on node -1
  alloc kstat_irqs on node -1
igb 0000:03:00.0: irq 32 for MSI/MSI-X
  alloc irq_desc for 33 on node -1
  alloc kstat_irqs on node -1
igb 0000:03:00.0: irq 33 for MSI/MSI-X
igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
igb 0000:03:00.0: eth1: (PCIe:2.5GT/s:Width x1)
igb 0000:03:00.0: eth1: MAC: 00:1e:67:58:cc:b9
igb 0000:03:00.0: eth1: PBA No: 009000-000
igb 0000:03:00.0: LRO is disabled
igb 0000:03:00.0: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
udev: renamed network interface eth0 to eth2
udev: renamed network interface eth1 to eth3

We can see that the interfaces are named eth0 and eth1 at first, and then udev renames them.

Why?

Checking and fix

Check the udev rules:

[root@Ustor network]# cat /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:d7:be:b4", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:d7:be:b5", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

Check the IP information:

[root@Ustor network]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1e:67:58:cc:b8 brd ff:ff:ff:ff:ff:ff
    inet 172.16.60.220/24 brd 172.16.60.255 scope global eth2
    inet6 fe80::21e:67ff:fe58:ccb8/64 scope link 
       valid_lft forever preferred_lft forever
3: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1e:67:58:cc:b9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::21e:67ff:fe58:ccb9/64 scope link 
       valid_lft forever preferred_lft forever

The MAC addresses in the udev rules differ from those reported by ip addr.

Because the udev rules already bind eth0 and eth1 to other NICs, the new NICs (with new MAC
addresses) have to take new interface names (here, eth2 and eth3).

BTW, I asked my colleague what he had done before the problem appeared. He said this
OS had previously been installed on another board. That was the cause: the NICs changed
but the OS was not updated, so the interface names changed.

Fix:

  1. Update the MAC addresses in /etc/udev/rules.d/70-persistent-net.rules.mb2
  2. Reboot
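
The comparison between the rule file and the live NICs can be scripted. The sketch below reads rule text on stdin and assumes the `ATTR{address}=="..."` syntax shown in the rule file above; the function names `rule_macs` and `mac_is_pinned` are made up.

```shell
# List the MAC addresses pinned in udev persistent-net rules read from stdin.
rule_macs() {
    grep -o 'ATTR{address}=="[^"]*"' | cut -d'"' -f2
}

# Succeed when MAC address $1 appears in the rules read from stdin.
mac_is_pinned() {
    rule_macs | grep -qix "$1"
}
```

A possible use on the affected host: `mac_is_pinned 00:1e:67:58:cc:b8 < /etc/udev/rules.d/70-persistent-net.rules || echo "MAC not in rules, udev will assign a new name"`.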

Reference

ssd cache and implementation

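Introduction

  • SSD caching technology

  • A fast block device as a cache for a slower block device

    • Red Hat Enterprise Linux 7.0 introduces, as a Technology Preview, the ability to use a fast block device
      as a cache for a slower block device. This lets a PCIe SSD act as a cache for direct-attached storage (DAS)
      or storage area network (SAN) storage in order to improve file system performance
    • Reference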

Facebook flashcache

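Flashcache is an open source project from the Facebook engineering team, originally designed to speed up MySQL.
Flashcache adds a cache layer between the file system (VFS) and the device driver to cache hot data.

Where Flashcache sits in the kernel:
VFS -> Block layer -> DM layer -> flashcache -> DeviceDriver -> Disk

Flashcache was first implemented as a write back cache; write through and write around modes were added later

write back: writes go to the cache first, and dirty blocks in the cache are flushed to persistent storage periodically in the background.
write through: writes go to the cache and to persistent storage synchronously.
write around: writes go to persistent storage only.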
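
As a mental model only (this is not Flashcache code), the three policies differ in where a write lands before it is acknowledged. The function name `write_targets` and the policy spellings are made up for this sketch.

```shell
# Print where a write lands immediately under each policy.
write_targets() {
    case "$1" in
        write_back)    echo "cache" ;;        # disk is updated later by a background flush
        write_through) echo "cache disk" ;;   # both are updated synchronously
        write_around)  echo "disk" ;;         # the cache is bypassed on writes
        *) echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

write_targets write_back      # prints: cache
write_targets write_through   # prints: cache disk
write_targets write_around    # prints: disk
```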

References:

What skills need to master

Implementation

Reference

LIO

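History

  • Linux 2.6.38 is the dividing line: before it the standard was the Linux SCSI Target (STGT), and since then
    the standard has been the Linux-IO Target (LIO). More precisely, Linus Torvalds merged the
    LIO SCSI Target engine into Linux 2.6.38 on January 15, 2011
  • For now LIO is used as a replacement for IET, because IET is no longer updated (its latest release dates from 2010),
    while LIO is in the kernel, should keep developing for years to come, and is the mainstream choice.

LIO code size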

  • /home/dennis/work/kernel/linux-3.2.63/drivers/target
  • [dennis@localhost target]$ find ./* -name "*.*" |xargs wc -l |awk 'END{print $1}'
    52187
  • [dennis@localhost target]$ find iscsi/ -name "*.*" |xargs wc -l |awk 'END{print $1}'
    21801

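Survey of open source iSCSI targets

  • SCST and LIO
    • SCST is a relatively early and fairly mature open source SCSI target implementation.
    • LIO is a later open source SCSI target implementation than SCST, yet it was LIO that won the competition
      for inclusion in the Linux kernel. LWN has an interesting article on the debate between the two,
      A tale of two SCSI Targets (translated into Chinese as “SCSI Target之双城记”).
    • Although being in the kernel gives LIO better prospects, SCST is doing fine too: Fusion-io
      recently acquired ID7, the company that provides commercial support for SCST.
  • Tgt
    • Tgt is also a general-purpose open source SCSI target implementation. Unlike the other two, all of its
      iSCSI protocol code runs in user space.
    • Tgt treats an LU as a backstore, and backstores are modular, so you can write a module
      to support an LU type of your own. Tgt provides a multithreaded API so that a backstore can
      process SCSI requests with several threads at once.
    • Tgt's main thread uses the epoll LT (level-triggered) model to listen for and receive read/write requests
      and commands from initiators, and then calls the corresponding backstore handler.
  • Optimizing an iSCSI target whose LU is a distributed file system
    • Support multiple concurrent connections between the iSCSI target and the LU, so that SCSI commands and data with no ordering requirement can be sent to the LU concurrently
    • Have the iSCSI target merge SCSI commands and data before sending them to the LU.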
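  • Comparison
    • I do not think either SCST or LIO is the best choice for supporting a distributed file system.
      First, both run in kernel space, so a failure can bring down the whole system and directly affect other production services on it.
      Second, as general-purpose SCSI target implementations, SCST and LIO hand SCSI processing to the kernel
      SCSI driver after handling the iSCSI protocol, which makes custom development for a distributed file system harder.
    • For each LU, LIO allocates one recv thread and one send thread. The recv thread receives iSCSI PDUs from the
      initiator, parses them into SCSI requests, and passes them to the send thread; the send thread sends the requests
      to the LU and returns the LU's results to the initiator. When the LU is a distributed file system, this
      single-send-thread design makes concurrent multi-connection reads and writes between the iSCSI target and the LU
      fairly hard to support. LIO's iSCSI support is also hard to optimize for a distributed-file-system LU. The send
      and recv threads communicate through one queue; some SCSI requests in that queue are order-sensitive and some are
      not, and this is only sorted out when the send thread walks the queue. Supporting concurrent multi-connection
      access to the LU would need an extra queue of SCSI requests with no ordering requirement on arrival at the LU, plus extra multithreading support.
    • Because Tgt runs in user space it avoids the first drawback above, and its modular backstores are very convenient to develop.
      Backstores also support multithreaded processing, and the list that Tgt hands to the backstore threads already carries no ordering requirement.
    • From this analysis, Tgt is the more advantageous and more convenient way to give a distributed file system iSCSI support.
      The open source distributed storage projects sheepdog and hlfs both support iSCSI through modules built on Tgt.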
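  • Drawbacks of Tgt and possible improvements
    • With multiple threads, Tgt's backstore threads all contend for one list, which is costly. Each thread could instead keep
      its own list, with the main thread adding SCSI requests to the per-thread lists round-robin through a CAS-based lock-free queue.
    • The number of connections between a Tgt backstore and the LU is tied 1:1 to the number of threads, which is hard-coded to 4.
      The code could be changed to make the number of connections configurable.
    • Tgt uses a single main thread with epoll to accept read/write requests from all initiators, which may become a bottleneck
      when many initiators are logged in. This is usually not a problem, because several iSCSI targets are normally deployed.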
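
The round-robin idea in the first improvement can be illustrated with a toy dispatcher. It shows only the assignment rule (request i goes to list i mod N), not the lock-free queue itself; the function name `rr_dispatch` and the `workerN` labels are made up.

```shell
# Args: number of worker lists, then the request IDs in arrival order.
# Prints "worker<k> <request>" for each request, assigned round-robin.
rr_dispatch() {
    n=$1
    shift
    i=0
    for req in "$@"; do
        echo "worker$((i % n)) $req"
        i=$((i + 1))
    done
}

rr_dispatch 4 r0 r1 r2 r3 r4 r5
```

With 4 lists, r0 to r3 land on worker0 to worker3, and r4 wraps around to worker0 again.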

targetcli

  • Run targetcli as root
  • Browse the storage objects: ls shows the object tree, cd changes to a given directory
  • Create file-backed storage objects
    • cd /backstores/fileio
    • create disk0 /tmp/disk0.img 10MB
    • cd /backstores/ramdisk
    • create rd0 10MB
  • Create the iSCSI target
    • cd /iscsi
    • create [optionally give a custom WWN (World Wide Name) here]
    • cd iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e/tpg1/
    • luns/ create /backstores/fileio/disk0
    • luns/ create /backstores/ramdisk/rd0
    • portals/ create 0.0.0.0
    • set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
    • cd /
    • ls
    • saveconfig
  • When you finish working in targetcli, remember to run saveconfig to save the configuration; otherwise
    the changes are lost when the target service restarts. targetcli saveconfig
    Last 10 configs saved in /etc/target/backup.
    Configuration saved to /etc/target/saveconfig.json
  • Start the iSCSI target service
    • [root@localhost ~]# service target start
    • [root@localhost ~]# service target status
  • Attach the iSCSI target
    • [root@localhost ~]# iscsiadm -m discovery -t sendtargets -p 127.0.0.1
      127.0.0.1:3260,1 iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e
    • [root@localhost ~]# iscsiadm --mode node \
      --targetname iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e \
      --portal 127.0.0.1 --login
    • The command above also has a short form: iscsiadm -m node -T \
      iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e \
      -p 127.0.0.1 -l
    • [root@localhost dennis]# lsscsi
      [2:0:0:0] disk ATA ST3160815AS A /dev/sda
      [6:0:0:0] disk LIO-ORG disk0 4.0 /dev/sdb
      [6:0:0:1] disk LIO-ORG rd0 4.0 /dev/sdc
  • Detach and delete the iSCSI target
    • iscsiadm --mode node --targetname iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e --portal 127.0.0.1 --logout
    • or in short form: iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e -p 127.0.0.1 -u
    • targetcli iscsi/ delete iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e
  • Other operations
    • iscsiadm --mode node -U all
    • iscsiadm --mode node -o delete
    • iscsiadm --mode node

Test script

mkdir -p /lio/
targetcli /backstores/fileio create disk0 /lio/disk0.img 1024MB
targetcli /iscsi create iqn.2007-10.lio.com:lio.test
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1/luns/ create /backstores/fileio/disk0
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1/portals/ create 0.0.0.0 
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1 set attribute authentication=1 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1/acls/ create iqn.2007-10.lio.com:acl01
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1/acls/iqn.2007-10.lio.com:acl01/ set auth userid=dennis
targetcli /iscsi/iqn.2007-10.lio.com:lio.test/tpg1/acls/iqn.2007-10.lio.com:acl01/ set auth password=Dennis@123456
targetcli saveconfig
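
The script above hard-codes IQNs such as iqn.2007-10.lio.com:lio.test. When such names are made up by hand, a quick format check before calling targetcli can catch typos early. The sketch below assumes the usual IQN layout (iqn.YYYY-MM.reversed-domain, optionally followed by ":" and a free-form identifier); is_valid_iqn is a hypothetical helper name and the pattern is deliberately simple, not a full validator.

```shell
#!/bin/sh
# Sketch: rough sanity check of an iSCSI qualified name (IQN).
# Assumed layout: iqn.YYYY-MM.<reversed domain>[:<identifier>]
# is_valid_iqn is a hypothetical helper, not a targetcli feature.

is_valid_iqn() {
    printf '%s\n' "$1" |
        grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:.+)?$'
}

# Check a few names; the second one has an impossible month.
for name in iqn.2007-10.lio.com:lio.test iqn.2007-13.lio.com:bad lio.test; do
    if is_valid_iqn "$name"; then
        echo "ok:  $name"
    else
        echo "bad: $name"
    fi
done
```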

Other operations

  • Add a local listening IP address and port

    /iscsi/iqn.20.../tpg1/portals> create 172.16.110.11 8089
    Created network portal 172.16.110.11:8089.
    /iscsi/iqn.20.../tpg1/portals> ls
    o- portals ................................................... [Portals: 2]
      o- 0.0.0.0:3260 .................................................... [OK]
      o- 172.16.110.11:8089 .............................................. [OK]
    /iscsi/iqn.20.../tpg1/portals> status
    Status for /iscsi/iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.6b448471ba5e/tpg1/portals: Portals: 2

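  • Restricting initiator access by IP? No way found so far. The older IET could limit access
    to specified IPs, but the iSCSI protocol does not require that feature, so it is reasonable that LIO does not have it.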

  • How to set the I/O mode of a block device to write-back or write-through? No way found so far. For FILEIO it is possible:

    [root@localhost src]# targetcli /backstores/fileio/ create fileio01.dat /tmp/fileio01.dat 10M write_back=False
    Created fileio fileio01.dat with size 10485760
    [root@localhost src]# ls /tmp/fileio01.dat -lh
    -rw-r--r--. 1 root root 10M Feb 3 15:01 /tmp/fileio01.dat
    [root@localhost src]# targetcli /backstores/fileio/ ls
    o- fileio ................................................. [Storage Objects: 1]
      o- fileio01.dat ........ [/tmp/fileio01.dat (10.0MiB) write-thru deactivated]
    [root@localhost src]# targetcli /backstores/fileio/ create fileio02.dat /tmp/fileio02.dat 10M write_back=True
    Created fileio fileio02.dat with size 10485760
    [root@localhost src]# targetcli /backstores/fileio/ ls
    o- fileio ................................................. [Storage Objects: 2]
      o- fileio01.dat ........ [/tmp/fileio01.dat (10.0MiB) write-thru deactivated]
      o- fileio02.dat ........ [/tmp/fileio02.dat (10.0MiB) write-back deactivated]

  • Set up CHAP login with targetcli: create the initiator's WWN under acls, then set its user ID and password

    • /> cd /iscsi/iqn.2007-10.lio.com:dg2.lv1/tpg1/
    • set attribute authentication=1
    • acls/ create iqn.1991-05.com.microsoft:ibm-t410s
    • cd acls/iqn.1991-05.com.microsoft:ibm-t410s
    • set auth userid=iqn.1991-05.com.microsoft:ibm-t410s
    • set auth password=mytargetsecret
    • saveconfig
    • Reference: http://linux-iscsi.org/wiki/ISCSI

      [root@localhost dennis]# iscsiadm -m node -p 172.16.130.200 -l
      Logging in to [iface: default, target: iqn.2007-10.lio.com:dg2.lv1, portal: 172.16.130.200,3260] (multiple)
      iscsiadm: Could not login to [iface: default, target: iqn.2007-10.lio.com:dg2.lv1, portal: 172.16.130.200,3260].
      iscsiadm: initiator reported error (24 - iSCSI login failed due to authorization failure)
      iscsiadm: Could not log into all portals
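
Error 24 in the log above is an authorization failure. Two possible causes, neither confirmed by the notes: the initiator did not present CHAP credentials, or its initiator name does not match any entry under acls (the ACL above was created for a Microsoft initiator name). For the first case, the sketch below prints the open-iscsi commands that store CHAP settings in the node record before logging in. chap_initiator_cmds is a hypothetical helper name; the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: print the iscsiadm commands that configure CHAP on the initiator.
# chap_initiator_cmds is a hypothetical helper; nothing is executed here.
# Note: the password ends up in the printed text, so treat the output as secret.

chap_initiator_cmds() {
    # $1: target IQN, $2: portal, $3: CHAP user ID, $4: CHAP password
    base="iscsiadm -m node -T $1 -p $2 -o update"
    printf '%s -n node.session.auth.authmethod -v CHAP\n' "$base"
    printf '%s -n node.session.auth.username -v %s\n' "$base" "$3"
    printf '%s -n node.session.auth.password -v %s\n' "$base" "$4"
}

# Values taken from the targetcli steps above.
chap_initiator_cmds iqn.2007-10.lio.com:dg2.lv1 172.16.130.200 \
    iqn.1991-05.com.microsoft:ibm-t410s mytargetsecret
```

After the three settings are applied, the login can be retried with iscsiadm -m node -p 172.16.130.200 -l as in the log above.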

Configuration check

  • /sys/kernel/config/target/iscsi/

    [root@localhost ~]# ls -l /sys/kernel/config/target/iscsi
    total 0
    drwxr-xr-x. 2 root root 0 Dec 15 15:59 discovery_auth
    drwxr-xr-x. 4 root root 0 Dec 15 15:58 iqn.2003-01.org.linux-iscsi.localhost.x8664:sn.f264093a034e
    drwxr-xr-x. 4 root root 0 Dec 15 15:59 iqn.2014-12.org.linux-iscsi.localhost.x8664:sn.f3849a0b356e
    -r--r--r--. 1 root root 4096 Dec 15 16:05 lio_version

  • When you have finished working in targetcli, remember to run saveconfig to save the
    configuration; otherwise it is lost when the target service restarts.

    [root@localhost ~]# targetcli saveconfig
    Last 10 configs saved in /etc/target/backup.
    Configuration saved to /etc/target/saveconfig.json

  • About parsing the configuration file /etc/target/saveconfig.json

    • JSON (JavaScript Object Notation) is a lightweight data-interchange format
    • http://www.json.org/
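
Because the saved configuration is plain JSON, simple text tools are enough for a quick look at which targets it contains. The sketch below is an illustration only: it assumes each target is stored as an object with a "wwn" key whose value starts with "iqn.", and list_target_iqns is a hypothetical helper name. A real JSON parser is the better choice for anything beyond a quick check.

```shell
#!/bin/sh
# Sketch: list the target IQNs found in a targetcli saveconfig.json.
# Assumption: targets carry a "wwn" key whose value starts with "iqn.".
# list_target_iqns is a hypothetical helper name.

list_target_iqns() {
    # $1: path to the JSON file (normally /etc/target/saveconfig.json)
    grep -o '"wwn"[[:space:]]*:[[:space:]]*"iqn\.[^"]*"' "$1" |
        sed 's/.*"\(iqn\.[^"]*\)"$/\1/'
}

# Usage (reading the file normally requires root):
#   list_target_iqns /etc/target/saveconfig.json
```

Matching only values that begin with "iqn." keeps other "wwn" entries, such as those of storage objects, out of the output.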

Features that still need to be verified and tested

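  • Performance comparison of the backstore types
    • FILEIO performance
    • Block performance
  • Whether adding or removing an unused LUN requires a service restart, and whether it affects running workloads
  • Destructive testing
    • Whether a failing block device affects I/O on the other iSCSI volumes
  • What new features it has compared with IET
  • Advantages and drawbacks compared with all other open-source target implementations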

References

A fast way to empty a directory with rsync

Methods

  • rm
  • rsync

Test

[dennis@localhost ttt]$ mkdir empty
[dennis@localhost ttt]$ mkdir tmp/; seq 1 40000 | xargs -I{} touch tmp/file_{} 
[dennis@localhost ttt]$ mkdir tmp1/; time seq 1 40000 | xargs -I{} touch tmp1/file_{} 

real    0m28.781s
user    0m0.361s
sys    0m3.091s
[dennis@localhost ttt]$ ls -ld *
drwxrwxr-x. 2 dennis dennis    4096 Sep 15 09:07 empty
drwxrwxr-x. 2 dennis dennis 1114112 Sep 15 09:11 tmp
drwxrwxr-x. 2 dennis dennis 1114112 Sep 15 09:13 tmp1
[dennis@localhost ttt]$ strace -c rsync -a --delete empty/ tmp/
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.91    0.001000         100        10         1 select
  1.09    0.000011           2         6         3 wait4
  0.00    0.000000           0        15           read
  0.00    0.000000           0         5           write
  0.00    0.000000           0        21         9 open
  0.00    0.000000           0        20           close
  0.00    0.000000           0         5         3 stat
  0.00    0.000000           0        12           fstat
  0.00    0.000000           0         1           lstat
  0.00    0.000000           0        25           mmap
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         6           brk
  0.00    0.000000           0        10           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1         1 rt_sigreturn
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         4           socket
  0.00    0.000000           0         4         4 connect
  0.00    0.000000           0         2           socketpair
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0        11           fcntl
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         1           chdir
  0.00    0.000000           0         2           umask
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.001011                   193        22 total
[dennis@localhost ttt]$ strace -c rm -rf tmp1/
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.25    0.035329           1     40001           unlinkat
 17.50    0.007800         186        42           getdents
  3.07    0.001369           7       196           brk
  0.18    0.000082          21         4           munmap
  0.00    0.000000           0         2           read
  0.00    0.000000           0         8         4 open
  0.00    0.000000           0        10           close
  0.00    0.000000           0         4         3 stat
  0.00    0.000000           0         6           fstat
  0.00    0.000000           0         1           lstat
  0.00    0.000000           0         1         1 lseek
  0.00    0.000000           0        11           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         9           fcntl
  0.00    0.000000           0         1           fstatfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2           openat
  0.00    0.000000           0         1           newfstatat
------ ----------- ----------- --------- --------- ----------------
100.00    0.044580                 40307         9 total
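
The test above can be repeated at a smaller scale with a short script. This is a minimal sketch, not the original test harness: make_files, empty_with_rsync and count_entries are hypothetical helper names, the file count is reduced from 40000 to 1000, and the demo part is skipped when rsync is not installed.

```shell
#!/bin/sh
# Sketch: empty a directory by syncing an empty directory over it.
# make_files, empty_with_rsync and count_entries are hypothetical helpers.

make_files() {
    # $1: directory to create, $2: number of empty files to put in it
    mkdir -p "$1" || return 1
    i=1
    while [ "$i" -le "$2" ]; do
        : > "$1/file_$i"
        i=$((i + 1))
    done
}

empty_with_rsync() {
    # $1: directory whose contents are deleted (the directory itself stays)
    empty=$(mktemp -d) || return 1
    rsync -a --delete "$empty/" "$1/"
    status=$?
    rmdir "$empty"
    return "$status"
}

count_entries() {
    # $1: directory; prints how many entries it contains
    ls -A "$1" | wc -l | tr -d ' '
}

# Demo: fill a scratch directory, empty it, report what is left.
if command -v rsync >/dev/null 2>&1; then
    work=$(mktemp -d)
    make_files "$work/tmp" 1000
    echo "entries before: $(count_entries "$work/tmp")"
    empty_with_rsync "$work/tmp"
    echo "entries after:  $(count_entries "$work/tmp")"
    rm -rf "$work"
fi
```

When comparing system calls, note that strace -c on its own counts only the process it starts; adding -f also follows child processes, which matters for rsync because it forks.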

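Analyze system calls

  • rm
    Makes a very large number of system calls (40307 in total), above all unlinkat:
    40001 calls, which account for 79.25% of the traced time.

  • rsync
    Only 193 calls are recorded and none of them is an unlink. strace was run without -f,
    so calls made by the child process created through the single clone call are not
    counted; the deletions most likely happen there, which means the two summaries are
    not directly comparable.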

Analyze source code

Reference