About QEMU and the 802.1p protocol
I stumbled upon a rather amusing artifact the other day and decided to share it, because what Google turns up on the subject is only slightly better than "white noise".
In short, I had to bring up a "playground" on the Cisco UCS platform. It is not widely used and not without oddities, but that's not the point now. To run all sorts of telemetry and the OpenStack control core, our shop uses the "Greek" (in light of recent Linux news, the quotes here may disappear) Ganeti. Proxmox and the various XCP flavors were once rejected for this purpose as too complex, and the team was strong enough to add the "little things" we wanted.
The design is chosen so that there is practically nothing to break. Everything basic runs in the native (untagged) VLAN. (Plain 802.1q for maximum performance; L3+ only between data centers.) A couple more VLANs are used for prioritizing auxiliary traffic. No network managers, no Open vSwitch. (Open vSwitch lives at the next level of abstraction, and I have no interesting stories about it. Which, you must agree, characterizes the product in a certain way.)
Key virtual machines (VMs) also run in the native VLAN. The solution is not without flaws: in this mode, tagged packets from the VLANs carried on the same bridge also reach the VMs' tap interfaces attached to a regular Linux bridge. But in a dozen and a half years of operation, there have been no surprises in exactly this place.
Now, having outlined the general picture, I get down to business. I launch the first VM. It doesn't ping over the network. Locally, from the hypervisor, everything is fine. I go to the console and run something like "tcpdump -nvvvi any" (promiscuous mode is set by default), and ping starts working. I turn off tcpdump, and again nothing. When run with the "-p" key (disable promiscuous mode), tcpdump doesn't see anything at all.
The cause is obviously somewhere in the (virtio) network path.
As one movie character in a striped sweater said, "let's look deeper". On closer inspection it turns out that "the right tail is longer than the left": 42 bytes go out, while the matching ARP frame (who-has and the reply are the same size) comes back as a 54-byte packet. That is because an additional 802.1q header is attached to it, only with vlan id = 0 (a value originally reserved for service purposes). A quick Google search shows that this is how a relatively recent implementation of 802.1p (a way to carry traffic priority information) works: "let's send an untagged frame with tag 0", something like that.
This is how it looks in tcpdump at the entrance to the hypervisor:
# tcpdump -nvvvi bond0 -e dst host 172.x.x.x | head
15:07:32.847072 00:3a:9c:57:51:fc > ae:9b:f1:f1:90:83, ethertype 802.1Q (0x8100), length 70: vlan 0, p 0, ethertype IPv4, (tos 0x0, ttl 62, id 44387, offset 0, flags [DF], proto
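To check quickly whether a capture contains such priority-tagged frames, the "tcpdump -e" text output can be filtered for the 802.1Q ethertype combined with "vlan 0". This is a minimal sketch that assumes the line format shown above; the function name count_prio_tagged is hypothetical and not part of any package mentioned here.

```shell
#!/bin/sh
# count_prio_tagged: read "tcpdump -e" text output on stdin and print the
# number of lines that carry an 802.1Q header with VLAN ID 0.
# (Hypothetical helper; it relies on the output format shown above.)
count_prio_tagged() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'ethertype 802\.1Q (0x8100), length [0-9]*: vlan 0,' || true
}
```

Possible usage on the hypervisor uplink: tcpdump -nei bond0 -c 100 2>/dev/null | count_prio_tagged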
I didn't dig into further details, as this is "not my bread". My proposal to colleagues to switch the policy on UCS from "Platinum" to something simpler did not arouse enthusiasm either (which is understandable, it's a new toy after all). Well, "we can handle it ourselves"...
QEMU is quite recent (from Oracle AppStream), so there is not much to be gained on that side: either dig into it later at leisure, or simply file a bug. (Clearly, nobody has put together exactly this use case in production, which is why it has gone unnoticed until now.) So we move straight on to "plan B".
Trying to strip the zero tag "head-on" looks like a rather foolish idea. Even if the VLAN module could be patched, the result would be a one-off. So we look straight at iproute2 ("tc filter" etc.), since it can (in principle) do most of the tricks for which "netmap" was once invented in FreeBSD.
I have a package, "network-scripts-extra", that extends "network-scripts" with various macvlans, veths, and other pleasant things. It also contains a script, "ifup-eth-bond", that automates the configuration of buffers and queues on the "physical" interfaces. And QoS, "to boot". There is nothing secret in it, and I will try to publish it at some point. (Most likely closer to the time the now-deprecated "network-scripts" are adopted for maintenance, because no replacement with acceptable characteristics has been proposed, and fighting a "fatal flaw" is not in my habits.)
For incoming "queues", "tc qdisc" has a special discipline, "ingress". I suppose even a reader not too versed in network tricks will realize that this is not really a "queue" (the packet has already been received!). Usually it is used to "police" incoming traffic (roughly, to drop incoming packets beyond some limit). Here we use it for more interesting manipulations.
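For comparison, the "usual" policing use of ingress can be sketched as follows. This is only an illustration of the classic pattern, not a piece of "ifup-eth-bond": the names run and police_ingress, the DRY_RUN switch, the 1m burst and the filter priority 50 are all assumptions made for the example.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# police_ingress DEVICE RATE
# Attach the ingress qdisc and drop inbound IPv4 traffic above RATE (e.g. 100mbit).
# The burst size and filter priority are placeholder values.
police_ingress() {
    dev=$1
    rate=$2
    run tc qdisc replace dev "$dev" handle ffff: ingress
    run tc filter add dev "$dev" parent ffff: protocol ip prio 50 \
        u32 match ip src 0.0.0.0/0 \
        police rate "$rate" burst 1m drop flowid :1
}
```

With DRY_RUN=1 the function only prints the two tc commands, which is handy for reviewing them before touching a live interface.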
For some reason, this trick works only on "physical" interfaces.
On a bond, which also has an "input" (theoretically on the same side), it does not work. If someone has dug deeper into the issue, I would appreciate comments.
On the "tapNN" taps the queue is already "outgoing", and it is hard to hook in there, since by default I put a primitive classless pfifo_fast on them.
The updated piece of "ifup-eth-bond" now looks like this:
# tc -d -s qdisc show dev em1
tc qdisc del dev ${DEVICE} root # >> net.core.default_qdisc=fq_codel
tc qdisc replace dev ${DEVICE} handle ffff: ingress
tc qdisc replace dev ${DEVICE} root handle 1: htb default 0
# tc -d -s class show dev em1
tc class add dev ${DEVICE} parent 1:1 classid 1:10 htb prio 1 rate ${SPEED}gbit
tc class add dev ${DEVICE} parent 1:1 classid 1:20 htb prio 2 rate ${SPEED}gbit
tc class add dev ${DEVICE} parent 1:1 classid 1:30 htb prio 3 rate ${SPEED}gbit
tc qdisc add dev ${DEVICE} parent 1:10 handle 10: pfifo_fast
# Multi-band pfifo_fast for VM's like the default
# Since UEKR7 kernels, multiq needs "kernel-uek-modules-extra" to be installed.
if modprobe sch_multiq; then
    # Try to use hardware queues
    tc qdisc add dev ${DEVICE} parent 1:20 handle 20: multiq || tc qdisc add dev ${DEVICE} parent 1:20 handle 20: fq_codel
else
    tc qdisc add dev ${DEVICE} parent 1:20 handle 20: fq_codel
fi
tc qdisc add dev ${DEVICE} parent 1:30 handle 30: pfifo_fast
# tc -d -s filter show dev em1
tc filter add dev ${DEVICE} protocol all prio 10 basic match "meta(vlan mask 0xfff eq ${VLAN_LOW_PRIO})" flowid 1:30
tc filter add dev ${DEVICE} protocol all prio 10 basic match "meta(vlan mask 0xfff eq ${VLAN_HIGH_PRIO})" flowid 1:10
tc filter add dev ${DEVICE} protocol all prio 100 u32 match u32 0 0 flowid 1:20 # default, neutral prio
# "Untag" vlan 0 from 802.1p in native vlan (ingress)
#tc filter add dev ${DEVICE} parent ffff: protocol 802.1q basic match 'meta(vlan eq 0)' action vlan pop continue
tc filter add dev ${DEVICE} parent ffff: protocol all basic match "meta(vlan mask 0xfff eq 0)" action vlan pop continue
○ The speed on the interface can usually be found in "/sys/class/net/", although there are nuances.
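As an illustration of those nuances, here is a hedged sketch of how ${SPEED} could be derived from sysfs. It assumes the "speed" attribute reports Mbit/s and may be unreadable or report "-1" when the speed is unknown (link down, virtual interface). The function name link_speed_gbit and its optional second argument are hypothetical; the second argument exists only so the function can be exercised against a fake directory tree.

```shell
#!/bin/sh
# link_speed_gbit DEVICE [SYSFS_ROOT]
# Print the link speed in whole Gbit/s, or return 1 if it cannot be determined.
# SYSFS_ROOT defaults to /sys/class/net (hypothetical parameter, for testing).
link_speed_gbit() {
    dev=$1
    root=${2:-/sys/class/net}
    # Reading "speed" can fail outright on some interfaces, so treat that as unknown.
    mbit=$(cat "$root/$dev/speed" 2>/dev/null) || return 1
    case $mbit in
        ''|*[!0-9]*) return 1 ;;   # empty, "-1" or any other non-numeric value
    esac
    # Sub-gigabit links do not fit a whole-number "gbit" rate.
    [ "$mbit" -ge 1000 ] || return 1
    echo $((mbit / 1000))
}
```

A caller might use it as SPEED=$(link_speed_gbit "${DEVICE}") || SPEED=10, where the fallback of 10 is an arbitrary placeholder.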
We create a pair of qdiscs (ingress "from there" and htb "to there").
○ HTB gives the lowest overhead among those tested. The hypervisor has other things to do besides handling network traffic.
We distribute 3 priority classes: control channel, general crowd (if possible, offload it to multiq) and "what's left".
We filter outgoing traffic by classes, depending on the vlan tag.
And finally, in the incoming traffic, we look for packets with an 802.1Q header (0x8100) and VLAN tag 0, and tell tc to untag them.
○ On the router side, only "meta(vlan mask 0xfff eq 0)" works. Because of https://en.wikipedia.org/wiki/IEEE_P802.1p, and because Linux "meta(vlan)" for some reason captures the upper bits too (in the 802.1Q tag, the 3 priority bits sit above the 12-bit VLAN ID).
○ Why didn't I specify "802.1q" instead of "all"? I was simply afraid of catching yet another glitch. The prioritization has been polished over the last decade, and the performance impact is within statistical error. Let it stay "all". (But "802.1q" is easier to read, yes.)
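The need for the 0xfff mask can be illustrated with a little arithmetic on the 16-bit 802.1Q tag control field (3 bits of priority, 1 bit DEI, 12 bits of VLAN ID). This is a minimal sketch; the function name decode_tci is hypothetical, and the idea that "meta(vlan)" sees the whole field rather than just the VLAN ID is an assumption consistent with the behaviour described above, not something verified in the kernel source.

```shell
#!/bin/sh
# decode_tci TCI
# Split a 16-bit 802.1Q Tag Control Information value into its fields:
# PCP (3 priority bits), DEI (1 bit), VID (12-bit VLAN ID).
decode_tci() {
    tci=$(($1))   # accepts decimal or 0x-prefixed hex
    echo "pcp=$(( (tci >> 13) & 0x7 )) dei=$(( (tci >> 12) & 0x1 )) vid=$(( tci & 0xfff ))"
}
```

For example, a frame with priority 5 and VLAN ID 0 has the field value 0xa000: compared against 0 as a whole it does not match, while after "mask 0xfff" only the zero VLAN ID remains and the filter fires.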
That's all. I hope this note will be useful to someone.