vpp: Enhanced memory, buffers, and CGNAT documentation, added troubleshooting (#1687)

* vpp: Enhanced memory and buffer configuration documentation

- Added physmem configuration section with practical examples and troubleshooting
- Clarified relationship between physmem and buffer allocation with cross-references
- Improved VPP logging documentation with detailed log location descriptions
- Fixed formatting issues in system configuration

* vpp: Added CGNAT memory requirements

Expanded CGNAT settings page with information about:
- Memory requirements
- Hardcoded simultaneous sessions limit

* vpp: Added troubleshooting page

Added page with basic steps for troubleshooting:
- Capturing packets (PCAP)
- Tracing packets
- Additional diagnostics information from VPP
- Automatic collection of most details with Python script

---------

Co-authored-by: Daniil Baturin <daniil@baturin.org>
7 changed files with 551 additions and 3 deletions


@@ -12,6 +12,12 @@ Buffers are essential for handling network packets efficiently, and proper confi
Buffers are used to temporarily store packets during processing; therefore, their configuration should be in sync with the NIC configuration, CPU threads, and overall system resources.
.. important::
VPP buffers are allocated from the physical memory pool (physmem). The total amount of memory available for buffer allocation is controlled by the ``physmem max-size`` setting, while the buffer configuration parameters below control how that memory is used.
See :ref:`VPP Physical Memory Configuration <vpp_config_dataplane_physmem>` for details on configuring physmem.
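As a rough, illustrative check (the parameter names below follow VPP's buffers configuration and may be exposed under different names in the VyOS CLI), the memory consumed by the buffer pool must fit within the configured physmem:
.. code-block:: none
# Illustrative sizing only, not a recommendation
#   total buffer memory ≈ buffers-per-numa x data-size x number of NUMA nodes
#   e.g. 128000 buffers x 2048 bytes x 1 NUMA node ≈ 250 MB
#   which fits comfortably within the default 16GB physmem max-size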
Buffer Configuration Parameters
===============================


@@ -10,7 +10,10 @@ VPP Logging Configuration
VPP logging is an important part of monitoring and troubleshooting the performance and behavior of the VPP dataplane.
VPP logs are stored in the ``/var/log/vpp.log`` file. Additionally daemon logs can be found in the system journal.
VPP stores logs in two places:
- ``/var/log/vpp.log`` — contains logs related to daemon startup and commands executed directly via the VPP CLI. Note that VyOS does not use the VPP CLI for configuration, so this log will not contain configuration changes made via the VyOS CLI and is not informative in most cases.
- System journal — contains logs related to the operation of the VPP daemon, including errors, warnings, and informational messages. This is the main destination for logs generated by VPP.
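To inspect these logs directly (the journal unit name is assumed to be ``vpp`` here; adjust it if it differs on your system):
.. code-block:: none
# Follow the VPP startup/CLI log file
tail -f /var/log/vpp.log
# Read recent VPP daemon messages from the system journal
sudo journalctl -u vpp --since "1 hour ago"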
The logging detail level can be configured with the following command:


@@ -32,6 +32,39 @@ Sets the main heap page size for VPP.
Sets the main heap size for VPP.
.. _vpp_config_dataplane_physmem:
Physical Memory Configuration
=============================
VPP uses physical memory for packet buffers and interface operations. The ``physmem`` setting controls how much memory VPP can allocate for these operations.
.. cfgcmd:: set vpp settings physmem max-size <size>
Sets the maximum amount of physical memory VPP can use for packet processing and interface buffers.
**Default**: 16GB (sufficient for most deployments)
You may need to increase the value in high-throughput environments with many interfaces, large packet buffers, or very high packet rates, or decrease it on memory-constrained systems where you need to limit VPP's memory usage.
**Physmem is independent of the main heap size** - physmem is used for packet buffers, while the main heap is used for routing tables.
.. seealso::
- :ref:`Hugepages in VyOS Configuration for VPP <vpp_config_hugepages>`
- :ref:`VPP Buffer Configuration <vpp_config_dataplane_buffers>` - for controlling buffer allocation within physmem
Common configurations
---------------------
.. code-block:: none
# Reduce for memory-constrained systems
set vpp settings physmem max-size 4G
# Increase for high-throughput environments
set vpp settings physmem max-size 32G
Potential Issues and Troubleshooting
====================================
@@ -43,3 +76,11 @@ Improper configuration of main heap size can lead to performance degradation or
- Error messages related to memory allocation failures
You need to tune the main heap size based on the expected number of FIB entries. Note that the same number of routes will consume different amounts of memory depending on whether they use a single next-hop or multiple next-hops.
For physmem, insufficient allocation can lead to packet drops, interface initialization failures, and overall degraded performance. Symptoms include:
- Packet drops or failures to allocate buffers
- Increased latency or jitter in packet processing
- Crashes or restarts of VPP processes under heavy load
You need to tune the physmem settings based on expected traffic patterns and interface usage. Monitor memory usage closely and adjust the configuration as needed to ensure optimal performance.
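To check how much of the configured physmem and buffer pool is actually in use, start with the following VPP CLI commands (also listed on the troubleshooting page):
.. code-block:: none
sudo vppctl show physmem detail
sudo vppctl show buffers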


@@ -85,9 +85,9 @@ The NMI (Non-Maskable Interrupt) watchdog can interfere with VPP performance by
* Range: ``2-5``
* Mixed: ``1,3-5,7``
..important::
.. important::
Always reserve at least 1-2 cores for the operating system to ensure system stability. For example, on a 4-core system, isolate cores 2-3 for VPP and leave cores 0-1 for the OS.
Always reserve at least 2 cores for the operating system to ensure system stability. For example, on a 4-core system, isolate cores 2-3 for VPP and leave cores 0-1 for the OS.
Assign the first isolated core as the VPP main core and the remaining isolated cores as VPP worker cores. Ensure that VPP CPU assignments match the isolated CPU range.
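A minimal sketch of the matching assignment for the 4-core example above, assuming the CPU settings expose ``main-core`` and ``corelist-workers`` options (verify the exact option names in your version's CLI):
.. code-block:: none
# Cores 0-1 stay with the OS; cores 2-3 are isolated for VPP
set vpp settings cpu main-core 2
set vpp settings cpu corelist-workers 3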


@@ -59,6 +59,23 @@ Sets the inside prefix (private IP range) that will be translated.
Sets the outside prefix (public IP range) that will be used for translation.
.. important::
**Memory Requirements**
CGNAT memory usage scales with the number of internal customers.
**Each group of 256 customers** (equivalent to a /24 subnet) requires approximately **4 MB of main heap memory**. This memory is used to maintain customer-to-port mappings and session state information.
Ensure your VPP main heap size is configured appropriately based on your expected customer count. See :ref:`VPP Memory Configuration <vpp_config_dataplane_memory>` for details on adjusting main heap size.
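For example, applying the 4 MB per 256 customers figure:
.. code-block:: none
# Illustrative main heap sizing for CGNAT customer mappings
#   16,384 customers (a /18)  =  64 groups of 256  ->  approx. 256 MB
#   65,536 customers (a /16)  = 256 groups of 256  ->  approx. 1 GB
# This is in addition to the memory needed for FIB entries and other VPP state.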
Session Limitations
-------------------
CGNAT has built-in session limitations to ensure fair resource allocation:
**Each customer (internal IP address) is limited to a maximum of 1000 simultaneous sessions**, even if more than 1000 ports are allocated to that customer. This limitation applies to all types of sessions (TCP, UDP, ICMP).
Timeouts Configuration
----------------------


@@ -20,3 +20,4 @@ the Linux kernel networking stack.
requirements
limitations
configuration/index
troubleshooting


@@ -0,0 +1,480 @@
:lastproofread: 2025-09-23
.. _vpp_troubleshooting:
.. include:: /_include/need_improvement.txt
#############################
VPP Dataplane Troubleshooting
#############################
This page provides essential troubleshooting information for VPP dataplane issues. It covers data collection techniques that are useful both for diagnosing problems yourself and for providing comprehensive information to support teams.
When experiencing VPP issues, collecting the right diagnostic information is crucial for effective troubleshooting. The following sections describe the most important data collection methods.
Packet Capture (PCAP)
=====================
Packet capture is one of the most valuable debugging tools for analyzing network traffic and identifying issues with packet processing, routing, or filtering.
VPP's pcap trace functionality can capture packets at several points: received (rx), transmitted (tx), and dropped (drop).
Starting Packet Capture
-----------------------
**Command syntax:**
.. opcmd::
sudo vppctl pcap trace [rx] [tx] [drop] [max <n>] [intfc <interface-name|any>] [file <name>] [max-bytes-per-pkt <n>]
**Key parameters:**
- ``rx`` - Capture received packets
- ``tx`` - Capture transmitted packets
- ``drop`` - Capture dropped packets
- ``max <n>`` - Depth of the local buffer. Once ``n`` packets are received, the buffer is flushed to the file. When the next ``n`` packets are received, the file is overwritten with new data. (default: 100)
- ``intfc <interface-name|any>`` - Specify interface or use ``any`` for all interfaces (default: any)
- ``file <name>`` - Output filename. The PCAP file with this name is stored in the ``/tmp/`` directory.
- ``max-bytes-per-pkt <n>`` - Maximum bytes to capture per packet (must be >= 32, <= 9000)
**Examples:**
.. code-block:: none
# Start capturing tx packets with specific parameters
sudo vppctl pcap trace tx max 35 intfc eth1 file vpp_eth1.pcap
# Capture all packet types from any interface
sudo vppctl pcap trace rx tx drop max 1000 intfc any file vpp_capture.pcap max-bytes-per-pkt 128
Monitoring Capture Status
-------------------------
To check the current capture status:
.. opcmd::
sudo vppctl pcap trace status
This command shows:
- Whether capture is active
- Capture parameters
- Number of packets captured so far
- Output file location
Stopping Packet Capture
-----------------------
.. warning::
VPP does not automatically stop packet captures. If left running, captures will continue indefinitely, consuming resources. Always remember to stop captures when they are no longer needed.
To stop the active packet capture:
.. opcmd::
sudo vppctl pcap trace off
Example output when stopping:
.. code-block:: none
Write 35 packets to /tmp/vpp_eth1.pcap, and stop capture...
**Important notes:**
- PCAP files are stored in the ``/tmp/`` directory
- Files will be overwritten if they already exist
- If no filename is provided, default names are used: ``/tmp/rx.pcap``, ``/tmp/tx.pcap``, ``/tmp/rxandtx.pcap``
- Large captures can consume significant disk space - monitor available space
- Stop captures promptly to avoid filling up storage
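Once a capture has been stopped, the resulting file can be copied off the router or inspected in place with standard tools, for example tcpdump (shown here as one common option, assuming it is available on your system):
.. code-block:: none
# Print a summary of the first 20 packets in the capture
sudo tcpdump -nn -r /tmp/vpp_eth1.pcap -c 20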
Packet Tracing
==============
VPP packet tracing provides detailed information about how packets flow through the VPP processing graph, showing exactly which nodes process each packet and any transformations applied. Packet tracing is essential for understanding VPP's internal packet processing flow.
.. warning::
Tracing can generate a large amount of data, especially on high-traffic systems. Use it judiciously and limit the number of packets traced to avoid overwhelming the system.
Basic Packet Tracing Commands
-----------------------------
Start tracing
^^^^^^^^^^^^^
To start tracing packets at a specific graph node:
.. opcmd::
sudo vppctl trace add <input-graph-node> <pkts> [verbose]
- ``<input-graph-node>`` - Name of the graph node to start tracing from (e.g., ``dpdk-input``, ``ethernet-input``, ``ip4-input``)
- ``<pkts>`` - Number of packets to trace (e.g., 100)
- ``[verbose]`` - Optional flag to include detailed buffer information in the trace output
**Common node names for tracing:**
- ``dpdk-input``: Packets received from DPDK interfaces
- ``ethernet-input``: Ethernet frame processing
- ``ip4-input``: IPv4 packet processing
- ``ip6-input``: IPv6 packet processing
- ``ip4-lookup``: IPv4 routing table lookup
- ``ip6-lookup``: IPv6 routing table lookup
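If you are unsure which input node names are available on your system, VPP can list its graph nodes; filtering the output is a quick way to confirm a name before tracing (output varies by build):
.. code-block:: none
sudo vppctl show vlib graph | grep -- '-input'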
View traces
^^^^^^^^^^^
When packets have been traced, view the results with:
.. opcmd::
sudo vppctl show trace [max COUNT]
- ``[max COUNT]`` - Optional limit on number of packets to display (default: all)
Clear traces
^^^^^^^^^^^^
After reviewing traces, clear them to free up resources:
.. opcmd::
sudo vppctl clear trace
Example Workflow
^^^^^^^^^^^^^^^^
.. code-block:: none
# Add traces for 100 packets on dpdk-input node
sudo vppctl trace add dpdk-input 100
# Send some traffic, then view results
sudo vppctl show trace
# Clear traces for next test
sudo vppctl clear trace
Understanding Trace Output
--------------------------
Trace output shows the packet's journey through VPP processing nodes:
.. code-block:: none
Packet 1
01:00:09:508438: dpdk-input
eth2 rx queue 0
buffer 0x8533: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x1000000
ext-hdr-valid
PKT MBUF: port 1, nb_segs 1, pkt_len 98
buf_len 1828, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x78814d40
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 0c:87:6c:4e:00:01 -> 0c:de:0d:e2:00:02
ICMP: 192.168.102.2 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xb88d dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508449: ethernet-input
frame: flags 0x1, hw-if-index 2, sw-if-index 2
IP4: 0c:87:6c:4e:00:01 -> 0c:de:0d:e2:00:02
01:00:09:508455: ip4-input
ICMP: 192.168.102.2 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xb88d dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508458: ip4-sv-reassembly-feature
[not-fragmented]
01:00:09:508460: nat-pre-in2out
in2out next_index 2 arc_next_index 10
01:00:09:508462: nat44-ed-in2out
NAT44_IN2OUT_ED_FAST_PATH: sw_if_index 2, next index 10, session 0, translation result 'success' via i2of
i2of match: saddr 192.168.102.2 sport 3024 daddr 192.168.99.3 dport 3024 proto ICMP fib_idx 0 rewrite: saddr 192.168.99.1 daddr 192.168.99.3 icmp-id 3024 txfib 0
o2if match: saddr 192.168.99.3 sport 3024 daddr 192.168.99.1 dport 3024 proto ICMP fib_idx 0 rewrite: saddr 192.168.99.3 daddr 192.168.102.2 icmp-id 3024 txfib 0
search key local 192.168.102.2:3024 remote 192.168.99.3:3024 proto ICMP fib 0 thread-index 0 session-index 0
01:00:09:508469: ip4-lookup
fib 0 dpo-idx 10 flow hash: 0x00000000
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xbb8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508472: ip4-rewrite
tx_sw_if_index 1 dpo-idx 10 : ipv4 via 192.168.99.3 eth1: mtu:1500 next:5 flags:[] 0ccea70400010cde0de200010800 flow hash: 0x00000000
00000000: 0ccea70400010cde0de2000108004500005437c540003f01bc8ec0a86301c0a8
00000020: 63030800064e0bd00d9a52c2d26800000000f4490000000000001011
01:00:09:508474: eth1-output
eth1 flags 0x0038000d
IP4: 0c:de:0d:e2:00:01 -> 0c:ce:a7:04:00:01
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 63, length 84, checksum 0xbc8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508477: eth1-tx
eth1 tx queue 0
buffer 0x8533: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x1000000
ext-hdr-valid
natted l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 1, nb_segs 1, pkt_len 98
buf_len 1828, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x78814d40
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 0c:de:0d:e2:00:01 -> 0c:ce:a7:04:00:01
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 63, length 84, checksum 0xbc8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
In this case, the trace shows:
- The packet was received on ``eth2`` interface (``dpdk-input`` node)
- It was processed by the ``ethernet-input`` and ``ip4-input`` nodes
- NAT translation occurred at the ``nat44-ed-in2out`` node, changing the source IP
- The packet was routed via ``ip4-lookup`` and ``ip4-rewrite`` nodes
- Finally, it was transmitted out of ``eth1`` interface (``eth1-tx`` node)
Additional Diagnostic Information
=================================
When reporting issues to support teams or performing advanced troubleshooting, you may need to collect additional diagnostic information.
Before/After Traffic Analysis
-----------------------------
Before sending traffic:
.. code-block:: none
sudo vppctl clear hardware-interfaces
sudo vppctl clear interfaces
sudo vppctl clear error
sudo vppctl clear runtime
After sending traffic:
.. code-block:: none
sudo vppctl show version verbose
sudo vppctl show hardware-interfaces
sudo vppctl show interface address
sudo vppctl show interface
sudo vppctl show runtime
sudo vppctl show error
Core System Information
-----------------------
**Memory and buffer information:**
.. code-block:: none
sudo vppctl show memory api-segment stats-segment numa-heaps main-heap map verbose
sudo vppctl show buffers
sudo vppctl show physmem detail
sudo vppctl show physmem map
**Runtime and performance data:**
.. code-block:: none
sudo vppctl show cpu
sudo vppctl show threads
sudo vppctl show runtime
sudo vppctl show node counters
Protocol-Specific Information
-----------------------------
**Layer 2 information (if configured):**
.. code-block:: none
sudo vppctl show l2fib
sudo vppctl show bridge-domain
**IPv4 information (if configured):**
.. code-block:: none
sudo vppctl show ip fib
sudo vppctl show ip neighbors
**IPv6 information (if configured):**
.. code-block:: none
sudo vppctl show ip6 fib
sudo vppctl show ip6 neighbors
**MPLS information (if configured):**
.. code-block:: none
sudo vppctl show mpls fib
sudo vppctl show mpls tunnel
Creating Support Packages
=========================
When contacting support or reporting issues, use the automated diagnostic collection script to create a comprehensive package. This ensures all relevant VPP troubleshooting information is collected systematically.
VPP Diagnostic Collection Script
--------------------------------
Create the diagnostic collection script:
.. code-block:: python
#!/usr/bin/env python3
"""VyOS VPP Diagnostic Collection Script"""
import datetime
import shutil
import subprocess
import tarfile
from pathlib import Path
def run_cmd(cmd, output_file, diag_dir):
"""Run command and save output to file."""
try:
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True, timeout=30
)
content = f"Command: {cmd}\nExit code: {result.returncode}\nTimestamp: {datetime.datetime.now()}\n{'-' * 50}\n"
if result.stdout:
content += f"\nSTDOUT:\n{result.stdout}"
if result.stderr:
content += f"\nSTDERR:\n{result.stderr}"
(diag_dir / output_file).write_text(content)
except Exception as e:
(diag_dir / output_file).write_text(f"Command: {cmd}\nERROR: {e}")
def collect_diagnostics():
"""Collect all VPP diagnostics and create archive."""
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
diag_dir = Path.home() / f"vpp-diagnostics-{timestamp}"
# VPP commands to collect
commands = [
("sudo vppctl show version verbose cmdline", "vpp-version.txt"),
("sudo vppctl show hardware-interfaces", "hardware-interfaces.txt"),
("sudo vppctl show interface address", "interface-addresses.txt"),
("sudo vppctl show interface", "interfaces.txt"),
("sudo vppctl show errors", "errors.txt"),
("sudo vppctl show runtime", "runtime.txt"),
(
"sudo vppctl show memory api-segment stats-segment numa-heaps main-heap map verbose",
"memory.txt",
),
("sudo vppctl show buffers", "buffers.txt"),
("sudo vppctl show physmem detail", "physmem.txt"),
("sudo vppctl show physmem map", "physmem-map.txt"),
("sudo vppctl show cpu", "cpu.txt"),
("sudo vppctl show threads", "threads.txt"),
("sudo vppctl show node counters", "node-counters.txt"),
("sudo vppctl show l2fib", "l2fib.txt"),
("sudo vppctl show bridge-domain", "bridge-domains.txt"),
("sudo vppctl show ip fib", "ip4-fib.txt"),
("sudo vppctl show ip neighbors", "ip4-neighbors.txt"),
("sudo vppctl show ip6 fib", "ip6-fib.txt"),
("sudo vppctl show ip6 neighbors", "ip6-neighbors.txt"),
("sudo vppctl show mpls fib", "mpls-fib.txt"),
("sudo vppctl show mpls tunnel", "mpls-tunnels.txt"),
("sudo vppctl show trace", "packet-traces.txt"),
]
try:
# Create diagnostics directory
diag_dir.mkdir(parents=True, exist_ok=True)
# Collect VPP data
for cmd, output_file in commands:
run_cmd(cmd, output_file, diag_dir)
# Collect PCAP files
pcap_files = list(Path("/tmp").glob("*.pcap"))
if pcap_files:
pcap_dir = diag_dir / "pcap-files"
pcap_dir.mkdir(exist_ok=True)
for pcap_file in pcap_files:
try:
shutil.copy2(pcap_file, pcap_dir)
except (PermissionError, OSError):
pass
# Create archive
archive_name = f"vpp-diagnostics-{timestamp}.tar.gz"
archive_path = Path.home() / archive_name
with tarfile.open(archive_path, "w:gz") as tar:
tar.add(diag_dir, arcname=diag_dir.name)
# Cleanup
shutil.rmtree(diag_dir)
print(f"VPP diagnostics collected: {archive_path}")
return archive_path
except Exception as e:
if diag_dir.exists():
shutil.rmtree(diag_dir)
print(f"Collection failed: {e}")
return None
def main():
"""Main function."""
collect_diagnostics()
if __name__ == "__main__":
main()
Save this script as ``/config/scripts/vpp-collect-diagnostics``.
Installation and Usage
----------------------
**1. Make the script executable**
.. opcmd::
sudo chmod +x /config/scripts/vpp-collect-diagnostics
**2. Run VPP diagnostic collection**
The script will automatically collect all VPP diagnostics and store the resulting archive in your home directory.
.. opcmd::
/config/scripts/vpp-collect-diagnostics
**3. Generate VyOS tech-support archive separately**
Additionally, you can generate a VyOS tech-support archive that includes system-wide diagnostics:
.. opcmd::
generate tech-support archive
What the Script Collects
------------------------
- **System Information**: Version details, build information, command line parameters
- **Interface Data**: Hardware interfaces, interface addresses, interface statistics and configurations
- **Performance Metrics**: Runtime statistics, error counters, node counters, CPU and thread information
- **Memory Analysis**: Memory usage (API segment, stats segment, NUMA heaps, main heap), buffer information, physical memory details
- **Layer 2 Data**: L2 forwarding table (L2FIB), bridge domain configurations
- **IPv4 Information**: IPv4 forwarding table (FIB), IPv4 neighbor table
- **IPv6 Information**: IPv6 forwarding table (FIB), IPv6 neighbor table
- **MPLS Data**: MPLS forwarding table (FIB), MPLS tunnel information
- **Packet Traces**: Captured VPP packet traces (if available)
- **Packet Dumps**: PCAP files from ``/tmp`` directory (if available)