vyos-documentation/docs/vpp/troubleshooting.rst
zdc 6b2e69a687
vpp: Enhanced memory, buffers, and CGNAT documentation, added troubleshooting (#1687)
* vpp: Enhanced memory and buffer configuration documentation

- Added physmem configuration section with practical examples and troubleshooting
- Clarified relationship between physmem and buffer allocation with cross-references
- Improved VPP logging documentation with detailed log location descriptions
- Fixed formatting issues in system configuration

* vpp: Added CGNAT memory requirements

Expanded CGNAT settings page with information about:
- Memory requirements
- Hardcoded simultaneous sessions limit

* vpp: Added troubleshooting page

Added page with basic steps for troubleshooting:
- Capturing packets (PCAP)
- Tracing packets
- Additional diagnostics information from VPP
- Automatic collection of most details with Python script

---------

Co-authored-by: Daniil Baturin <daniil@baturin.org>
2025-09-25 12:44:23 +01:00

481 lines
16 KiB
ReStructuredText

:lastproofread: 2025-09-23
.. _vpp_troubleshooting:
.. include:: /_include/need_improvement.txt
#############################
VPP Dataplane Troubleshooting
#############################
This page provides essential troubleshooting information for VPP dataplane issues. It covers data collection techniques that are useful both for self-assistance and for providing comprehensive information to support teams when seeking help.
When experiencing VPP issues, collecting the right diagnostic information is crucial for effective troubleshooting. The following sections describe the most important data collection methods.
Packet Capture (PCAP)
=====================
Packet capture is one of the most valuable debugging tools for analyzing network traffic and identifying issues with packet processing, routing, or filtering.
VPP's pcap trace functionality captures packets at various points: received (rx), transmitted (tx), and dropped (drop) packets.
Starting Packet Capture
-----------------------
**Command syntax:**
.. opcmd::
sudo vppctl pcap trace [rx] [tx] [drop] [max <n>] [intfc <interface-name|any>] [file <name>] [max-bytes-per-pkt <n>]
**Key parameters:**
- ``rx`` - Capture received packets
- ``tx`` - Capture transmitted packets
- ``drop`` - Capture dropped packets
- ``max <n>`` - Depth of local buffer. Once ``n`` packets are received, buffer is flushed to file. Once next ``n`` packets are received the file is overwritten with new data. (default: 100)
- ``intfc <interface-name|any>`` - Specify interface or use ``any`` for all interfaces (default: any)
- ``file <name>`` - Output filename. The PCAP file with this name is stored in ``/tmp/`` directory.
- ``max-bytes-per-pkt <n>`` - Maximum bytes to capture per packet (must be >= 32, <= 9000)
**Examples:**
.. code-block:: none
# Start capturing tx packets with specific parameters
sudo vppctl pcap trace tx max 35 intfc eth1 file vpp_eth1.pcap
# Capture all packet types from any interface
sudo vppctl pcap trace rx tx drop max 1000 intfc any file vpp_capture.pcap max-bytes-per-pkt 128
Monitoring Capture Status
-------------------------
To check the current capture status:
.. opcmd::
sudo vppctl pcap trace status
This command shows:
- Whether capture is active
- Capture parameters
- Number of packets captured so far
- Output file location
Stopping Packet Capture
-----------------------
.. warning::
VPP does not automatically stop packet captures. If left running, captures will continue indefinitely, consuming resources. Always remember to stop captures when they are no longer needed.
To stop the active packet capture:
.. opcmd::
sudo vppctl pcap trace off
Example output when stopping:
.. code-block:: none
Write 35 packets to /tmp/vpp_eth1.pcap, and stop capture...
**Important notes:**
- PCAP files are stored in ``/tmp/`` directory
- Files will be overwritten if they already exist
- If no filename is provided, default names are used: ``/tmp/rx.pcap``, ``/tmp/tx.pcap``, ``/tmp/rxandtx.pcap``
- Large captures can consume significant disk space - monitor available space
- Stop captures promptly to avoid filling up storage
Packet Tracing
==============
VPP packet tracing provides detailed information about how packets flow through the VPP processing graph, showing exactly which nodes process each packet and any transformations applied. Packet tracing is essential for understanding VPP's internal packet processing flow.
.. warning::
Tracing can generate a large amount of data, especially on high-traffic systems. Use it judiciously and limit the number of packets traced to avoid overwhelming the system.
Basic Packet Tracing Commands
-----------------------------
Start tracing
^^^^^^^^^^^^^
To start tracing packets at a specific graph node:
.. opcmd::
sudo vppctl trace add <input-graph-node> <pkts> [verbose]
- ``<input-graph-node>`` - Name of the graph node to start tracing from (e.g., ``dpdk-input``, ``ethernet-input``, ``ip4-input``)
- ``<pkts>`` - Number of packets to trace (e.g., 100)
- ``[verbose]`` - Optional flag to include detailed buffer information in the trace output
**Common node names for tracing:**
- ``dpdk-input``: Packets received from DPDK interfaces
- ``ethernet-input``: Ethernet frame processing
- ``ip4-input``: IPv4 packet processing
- ``ip6-input``: IPv6 packet processing
- ``ip4-lookup``: IPv4 routing table lookup
- ``ip6-lookup``: IPv6 routing table lookup
View traces
^^^^^^^^^^^
When packets have been traced, view the results with:
.. opcmd::
sudo vppctl show trace [max COUNT]
- ``[max COUNT]`` - Optional limit on number of packets to display (default: all)
Clear traces
^^^^^^^^^^^^
After reviewing traces, clear them to free up resources:
.. opcmd::
sudo vppctl clear trace
Example Workflow
^^^^^^^^^^^^^^^^
.. code-block:: none
# Add traces for 100 packets on dpdk-input node
sudo vppctl trace add dpdk-input 100
# Send some traffic, then view results
sudo vppctl show trace
# Clear traces for next test
sudo vppctl clear trace
Understanding Trace Output
--------------------------
Trace output shows the packet's journey through VPP processing nodes:
.. code-block:: none
Packet 1
01:00:09:508438: dpdk-input
eth2 rx queue 0
buffer 0x8533: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x1000000
ext-hdr-valid
PKT MBUF: port 1, nb_segs 1, pkt_len 98
buf_len 1828, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x78814d40
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 0c:87:6c:4e:00:01 -> 0c:de:0d:e2:00:02
ICMP: 192.168.102.2 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xb88d dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508449: ethernet-input
frame: flags 0x1, hw-if-index 2, sw-if-index 2
IP4: 0c:87:6c:4e:00:01 -> 0c:de:0d:e2:00:02
01:00:09:508455: ip4-input
ICMP: 192.168.102.2 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xb88d dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508458: ip4-sv-reassembly-feature
[not-fragmented]
01:00:09:508460: nat-pre-in2out
in2out next_index 2 arc_next_index 10
01:00:09:508462: nat44-ed-in2out
NAT44_IN2OUT_ED_FAST_PATH: sw_if_index 2, next index 10, session 0, translation result 'success' via i2of
i2of match: saddr 192.168.102.2 sport 3024 daddr 192.168.99.3 dport 3024 proto ICMP fib_idx 0 rewrite: saddr 192.168.99.1 daddr 192.168.99.3 icmp-id 3024 txfib 0
o2if match: saddr 192.168.99.3 sport 3024 daddr 192.168.99.1 dport 3024 proto ICMP fib_idx 0 rewrite: saddr 192.168.99.3 daddr 192.168.102.2 icmp-id 3024 txfib 0
search key local 192.168.102.2:3024 remote 192.168.99.3:3024 proto ICMP fib 0 thread-index 0 session-index 0
01:00:09:508469: ip4-lookup
fib 0 dpo-idx 10 flow hash: 0x00000000
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 64, length 84, checksum 0xbb8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508472: ip4-rewrite
tx_sw_if_index 1 dpo-idx 10 : ipv4 via 192.168.99.3 eth1: mtu:1500 next:5 flags:[] 0ccea70400010cde0de200010800 flow hash: 0x00000000
00000000: 0ccea70400010cde0de2000108004500005437c540003f01bc8ec0a86301c0a8
00000020: 63030800064e0bd00d9a52c2d26800000000f4490000000000001011
01:00:09:508474: eth1-output
eth1 flags 0x0038000d
IP4: 0c:de:0d:e2:00:01 -> 0c:ce:a7:04:00:01
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 63, length 84, checksum 0xbc8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
01:00:09:508477: eth1-tx
eth1 tx queue 0
buffer 0x8533: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x1000000
ext-hdr-valid
natted l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 1, nb_segs 1, pkt_len 98
buf_len 1828, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x78814d40
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 0c:de:0d:e2:00:01 -> 0c:ce:a7:04:00:01
ICMP: 192.168.99.1 -> 192.168.99.3
tos 0x00, ttl 63, length 84, checksum 0xbc8e dscp CS0 ecn NON_ECN
fragment id 0x37c5, flags DONT_FRAGMENT
ICMP echo_request checksum 0x64e id 3024
In this case, the trace shows:
- The packet was received on ``eth2`` interface (``dpdk-input`` node)
- It was processed by the ``ethernet-input`` and ``ip4-input`` nodes
- NAT translation occurred at the ``nat44-ed-in2out`` node, changing the source IP
- The packet was routed via ``ip4-lookup`` and ``ip4-rewrite`` nodes
- Finally, it was transmitted out of ``eth1`` interface (``eth1-tx`` node)
Additional Diagnostic Information
=================================
When reporting issues to support teams or performing advanced troubleshooting, you may need to collect additional diagnostic information.
Before/After Traffic Analysis
-----------------------------
Before sending traffic:
.. code-block:: none
sudo vppctl clear hardware-interfaces
sudo vppctl clear interfaces
sudo vppctl clear error
sudo vppctl clear runtime
After sending traffic:
.. code-block:: none
sudo vppctl show version verbose
sudo vppctl show hardware-interfaces
sudo vppctl show interface address
sudo vppctl show interface
sudo vppctl show runtime
sudo vppctl show error
Core System Information
-----------------------
**Memory and buffer information:**
.. code-block:: none
sudo vppctl show memory api-segment stats-segment numa-heaps main-heap map verbose
sudo vppctl show buffers
sudo vppctl show physmem detail
sudo vppctl show physmem map
**Runtime and performance data:**
.. code-block:: none
sudo vppctl show cpu
sudo vppctl show threads
sudo vppctl show runtime
sudo vppctl show node counters
Protocol-Specific Information
-----------------------------
**Layer 2 information (if configured):**
.. code-block:: none
sudo vppctl show l2fib
sudo vppctl show bridge-domain
**IPv4 information (if configured):**
.. code-block:: none
sudo vppctl show ip fib
sudo vppctl show ip neighbors
**IPv6 information (if configured):**
.. code-block:: none
sudo vppctl show ip6 fib
sudo vppctl show ip6 neighbors
**MPLS information (if configured):**
.. code-block:: none
sudo vppctl show mpls fib
sudo vppctl show mpls tunnel
Creating Support Packages
=========================
When contacting support or reporting issues, use the automated diagnostic collection script to create a comprehensive package. This ensures all relevant VPP troubleshooting information is collected systematically.
VPP Diagnostic Collection Script
--------------------------------
Create the diagnostic collection script:
.. code-block:: python
#!/usr/bin/env python3
"""VyOS VPP Diagnostic Collection Script"""
import datetime
import shutil
import subprocess
import tarfile
from pathlib import Path
def run_cmd(cmd, output_file, diag_dir):
"""Run command and save output to file."""
try:
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True, timeout=30
)
content = f"Command: {cmd}\nExit code: {result.returncode}\nTimestamp: {datetime.datetime.now()}\n{'-' * 50}\n"
if result.stdout:
content += f"\nSTDOUT:\n{result.stdout}"
if result.stderr:
content += f"\nSTDERR:\n{result.stderr}"
(diag_dir / output_file).write_text(content)
except Exception as e:
(diag_dir / output_file).write_text(f"Command: {cmd}\nERROR: {e}")
def collect_diagnostics():
"""Collect all VPP diagnostics and create archive."""
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
diag_dir = Path.home() / f"vpp-diagnostics-{timestamp}"
# VPP commands to collect
commands = [
("sudo vppctl show version verbose cmdline", "vpp-version.txt"),
("sudo vppctl show hardware-interfaces", "hardware-interfaces.txt"),
("sudo vppctl show interface address", "interface-addresses.txt"),
("sudo vppctl show interface", "interfaces.txt"),
("sudo vppctl show errors", "errors.txt"),
("sudo vppctl show runtime", "runtime.txt"),
(
"sudo vppctl show memory api-segment stats-segment numa-heaps main-heap map verbose",
"memory.txt",
),
("sudo vppctl show buffers", "buffers.txt"),
("sudo vppctl show physmem detail", "physmem.txt"),
("sudo vppctl show physmem map", "physmem-map.txt"),
("sudo vppctl show cpu", "cpu.txt"),
("sudo vppctl show threads", "threads.txt"),
("sudo vppctl show node counters", "node-counters.txt"),
("sudo vppctl show l2fib", "l2fib.txt"),
("sudo vppctl show bridge-domain", "bridge-domains.txt"),
("sudo vppctl show ip fib", "ip4-fib.txt"),
("sudo vppctl show ip neighbors", "ip4-neighbors.txt"),
("sudo vppctl show ip6 fib", "ip6-fib.txt"),
("sudo vppctl show ip6 neighbors", "ip6-neighbors.txt"),
("sudo vppctl show mpls fib", "mpls-fib.txt"),
("sudo vppctl show mpls tunnel", "mpls-tunnels.txt"),
("sudo vppctl show trace", "packet-traces.txt"),
]
try:
# Create diagnostics directory
diag_dir.mkdir(parents=True, exist_ok=True)
# Collect VPP data
for cmd, output_file in commands:
run_cmd(cmd, output_file, diag_dir)
# Collect PCAP files
pcap_files = list(Path("/tmp").glob("*.pcap"))
if pcap_files:
pcap_dir = diag_dir / "pcap-files"
pcap_dir.mkdir(exist_ok=True)
for pcap_file in pcap_files:
try:
shutil.copy2(pcap_file, pcap_dir)
except (PermissionError, OSError):
pass
# Create archive
archive_name = f"vpp-diagnostics-{timestamp}.tar.gz"
archive_path = Path.home() / archive_name
with tarfile.open(archive_path, "w:gz") as tar:
tar.add(diag_dir, arcname=diag_dir.name)
# Cleanup
shutil.rmtree(diag_dir)
print(f"VPP diagnostics collected: {archive_path}")
return archive_path
except Exception as e:
if diag_dir.exists():
shutil.rmtree(diag_dir)
print(f"Collection failed: {e}")
return None
def main():
"""Main function."""
collect_diagnostics()
if __name__ == "__main__":
main()
Save this script as ``/config/scripts/vpp-collect-diagnostics``
Installation and Usage
----------------------
**1. Make the script executable**
.. opcmd::
sudo chmod +x /config/scripts/vpp-collect-diagnostics
**2. Run VPP diagnostic collection**
The script will automatically collect all VPP diagnostics and store it in your home directory.
.. opcmd::
/config/scripts/vpp-collect-diagnostics
**3. Generate VyOS tech-support archive separately**
Additionally, you can generate a VyOS tech-support archive that includes system-wide diagnostics:
.. opcmd::
generate tech-support archive
What the Script Collects
------------------------
- **System Information**: Version details, build information, command line parameters
- **Interface Data**: Hardware interfaces, interface addresses, interface statistics and configurations
- **Performance Metrics**: Runtime statistics, error counters, node counters, CPU and thread information
- **Memory Analysis**: Memory usage (API segment, stats segment, NUMA heaps, main heap), buffer information, physical memory details
- **Layer 2 Data**: L2 forwarding table (L2FIB), bridge domain configurations
- **IPv4 Information**: IPv4 forwarding table (FIB), IPv4 neighbor table
- **IPv6 Information**: IPv6 forwarding table (FIB), IPv6 neighbor table
- **MPLS Data**: MPLS forwarding table (FIB), MPLS tunnel information
- **Packet Traces**: Captured packet traces VPP (if available)
- **Packet Dumps**: PCAP files from ``/tmp`` directory (if available)