Network Debug and Troubleshooting Guide
Introduction
Starting with Ansible version 2.1, you can now use the familiar Ansible models of playbook authoring and module development to manage heterogeneous networking devices. Ansible supports a growing number of network devices using both CLI over SSH and API (when available) transports.
This section discusses how to debug and troubleshoot network modules in Ansible 2.3.
How to troubleshoot
This section covers troubleshooting issues with Network Modules.
Errors generally fall into one of the following categories:
- Authentication issues
  - Not correctly specifying credentials
  - Remote device (network switch/router) not falling back to other authentication methods
  - SSH key issues
- Timeout issues
  - Can occur when trying to pull a large amount of data
  - May actually be masking an authentication issue
- Playbook issues
  - Use of delegate_to, instead of ProxyCommand
  - Not using connection: local
The unable to open shell message is new in Ansible 2.3. It means that the ansible-connection daemon has not been able to successfully talk to the remote network device. This generally means that there is an authentication issue. See the "Authentication and connection issues" section in this document for more information.
Enabling Networking logging and how to read the logfile
Platforms: Any
Ansible 2.3 features improved logging to help diagnose and troubleshoot issues regarding Ansible Networking modules.
Because logging is very verbose, it is disabled by default. It can be enabled via the ANSIBLE_LOG_PATH and ANSIBLE_DEBUG options:
# Specify the location for the log file
export ANSIBLE_LOG_PATH=~/ansible.log
# Enable Debug
export ANSIBLE_DEBUG=True
# Run with 4*v for connection level verbosity
ansible-playbook -vvvv ...
After Ansible has finished running you can inspect the log file:
2017-03-30 13:19:52,740 p=28990 u=fred | creating new control socket for host veos01:22 as user admin
2017-03-30 13:19:52,741 p=28990 u=fred | control socket path is /home/fred/.ansible/pc/ca5960d27a
2017-03-30 13:19:52,741 p=28990 u=fred | current working directory is /home/fred/ansible/test/integration
2017-03-30 13:19:52,741 p=28990 u=fred | using connection plugin network_cli
...
2017-03-30 13:20:14,771 paramiko.transport userauth is OK
2017-03-30 13:20:15,283 paramiko.transport Authentication (keyboard-interactive) successful!
2017-03-30 13:20:15,302 p=28990 u=fred | ssh connection done, setting terminal
2017-03-30 13:20:15,321 p=28990 u=fred | ssh connection has completed successfully
2017-03-30 13:20:15,322 p=28990 u=fred | connection established to veos01 in 0:00:22.580626
From the log notice:
- p=28990
  Is the PID (Process ID) of the ansible-connection process
- u=fred
  Is the user running ansible, not the remote-user you are attempting to connect as
- creating new control socket for host veos01:22 as user admin
  host:port as user
- control socket path is
  Location on disk where the persistent connection socket is created
- using connection plugin network_cli
  Informs you that persistent connection is being used
- connection established to veos01 in 0:00:22.580626
  Time taken to obtain a shell on the remote device
If the log reports the port as None, this means that the default port is being used. A future Ansible release will improve this message so that the port is always logged.
Because the log files are verbose, you can use grep to look for specific information. For example, once you have identified the pid from the creating new control socket for host line, you can search for other connection log entries:
grep "p=28990" $ANSIBLE_LOG_PATH
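The two steps above (find the pid, then grep for it) can be combined. The following is a minimal sketch, not part of Ansible: the function name connection_log_for_host is hypothetical, and it assumes the log lines have the same layout as the sample log shown earlier.

```shell
# Hypothetical helper: print every log entry written by the most recent
# ansible-connection process that opened a control socket for a given host.
# Assumes the "p=<pid> u=<user> | message" layout shown in the sample log.
connection_log_for_host() {
    host="$1"
    logfile="${2:-$ANSIBLE_LOG_PATH}"
    # Take the p=<pid> field from the last matching "creating new control socket" line.
    pid=$(grep -F "creating new control socket for host ${host}:" "$logfile" \
          | tail -n 1 \
          | sed -n 's/.* p=\([0-9][0-9]*\) .*/\1/p')
    [ -n "$pid" ] || return 1
    # The trailing space keeps p=2899 from also matching p=28990.
    grep -F "p=${pid} " "$logfile"
}

# Example: connection_log_for_host veos01 ~/ansible.log
```

The function returns a non-zero status when no control socket line is found for the host, which usually means logging was not enabled for that run.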
Isolating an error
Platforms: Any
As with any effort to troubleshoot it's important to simplify the test case as much as possible.
For Ansible this can be done by ensuring you are only running against one remote device:
- Using ansible-playbook --limit switch1.example.net ...
- Using an ad-hoc ansible command
Ad-hoc refers to running Ansible to perform some quick command using /usr/bin/ansible, rather than the orchestration language, which is /usr/bin/ansible-playbook. In this case we can ensure connectivity by attempting to execute a single command on the remote device:
ansible -m eos_command -a 'commands=?' -i inventory switch1.example.net -e 'ansible_connection=local' -u admin -k
In the above example, we:
- connect to switch1.example.net specified in the inventory file inventory
- use the module eos_command
- run the command ?
- connect using the username admin
- inform ansible to prompt for the ssh password by specifying -k
If you have SSH keys configured correctly, you don't need to specify the -k parameter.
If the connection still fails, you can combine it with the logging settings described in "Enabling Networking logging and how to read the logfile". For example:
# Specify the location for the log file
export ANSIBLE_LOG_PATH=~/ansible.log
# Enable Debug
export ANSIBLE_DEBUG=True
# Run with 4*v for connection level verbosity
ansible -vvvv -m eos_command -a 'commands=?' -i inventory switch1.example.net -e 'ansible_connection=local' -u admin -k
Then review the log file and find the relevant error message in the rest of this document.
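Exported variables stay set for the rest of the shell session, so every later run is logged too. As a minimal sketch, a small wrapper can enable logging for a single command only. The function name with_network_debug is hypothetical; the two environment variables are the ones described in this guide.

```shell
# Hypothetical wrapper: run one command with network logging enabled,
# without leaving ANSIBLE_LOG_PATH / ANSIBLE_DEBUG set in the current shell.
with_network_debug() {
    logfile="$1"
    shift
    # env sets the variables only for the command that follows.
    env ANSIBLE_LOG_PATH="$logfile" ANSIBLE_DEBUG=True "$@"
}

# Example:
# with_network_debug ~/ansible.log \
#     ansible -vvvv -m eos_command -a 'commands=?' -i inventory switch1.example.net \
#     -e 'ansible_connection=local' -u admin -k
```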
Category "Unable to open shell"
Platforms: Any
The unable to open shell message is new in Ansible 2.3.
This message means that the ansible-connection daemon has not been able to successfully talk to the remote network device. This generally means that there is an authentication issue. It is a "catch all" message, meaning you need to enable ANSIBLE_LOG_PATH to find the underlying issues.
For example:
TASK [prepare_eos_tests : enable cli on remote device] **************************************************
fatal: [veos01]: FAILED! => {"changed": false, "failed": true, "msg": "unable to open shell"}
or:
TASK [ios_system : configure name_servers] *************************************************************
task path:
fatal: [ios-csr1000v]: FAILED! => {
"changed": false,
"failed": true,
"msg": "unable to open shell",
"rc": 255
}
Suggestions to resolve:
Follow the steps detailed in "Enabling Networking logging and how to read the logfile".
Once you've identified the error message from the log file, the specific solution can be found in the rest of this document.
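In the sample logs in this guide, the underlying error appears on the line immediately after a connecting to host ... returned an error entry. Assuming your log follows the same layout, a short sketch like the following pulls out just those messages. The function name underlying_errors is hypothetical.

```shell
# Hypothetical helper: for each "connecting to host ... returned an error"
# entry, print the message part of the log line that follows it.
# Assumes the "timestamp p=<pid> u=<user> | message" layout used in this guide.
underlying_errors() {
    logfile="${1:-$ANSIBLE_LOG_PATH}"
    awk -F' [|] ' '
        /connecting to host .* returned an error/ {
            # Read the next log line; $2 is the text after the " | " separator.
            if ((getline) > 0) print $2
        }
    ' "$logfile"
}

# Example: underlying_errors ~/ansible.log
```

The printed text (for example No authentication methods available) is the string to look up in the error sections of this document.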
Error: "[Errno -2] Name or service not known"
Platforms: Any
Indicates that the name of the remote host you are trying to connect to cannot be resolved, so the host cannot be reached.
For example:
2017-04-04 11:39:48,147 p=15299 u=fred | control socket path is /home/fred/.ansible/pc/ca5960d27a
2017-04-04 11:39:48,147 p=15299 u=fred | current working directory is /home/fred/git/ansible-inc/stable-2.3/test/integration
2017-04-04 11:39:48,147 p=15299 u=fred | using connection plugin network_cli
2017-04-04 11:39:48,340 p=15299 u=fred | connecting to host veos01 returned an error
2017-04-04 11:39:48,340 p=15299 u=fred | [Errno -2] Name or service not known
Suggestions to resolve:
- If you are using the provider: options, ensure that its suboption host: is set correctly.
- If you are not using provider: nor top-level arguments, ensure your inventory file is correct.
Error: "Authentication failed"
Platforms: Any
Occurs if the credentials (username, passwords, or ssh keys) passed to ansible-connection (via ansible or ansible-playbook) cannot be used to connect to the remote device.
For example:
<ios01> ESTABLISH CONNECTION FOR USER: cisco on PORT 22 TO ios01
<ios01> Authentication failed.
Suggestions to resolve:
If you are specifying credentials via password: (either directly or via provider:) or the environment variable ANSIBLE_NET_PASSWORD, it is possible that paramiko (the Python SSH library that Ansible uses) is using ssh keys, and therefore the credentials you are specifying are being ignored. To find out if this is the case, disable "look for keys". This can be done like this:
export ANSIBLE_PARAMIKO_LOOK_FOR_KEYS=False
To make this a permanent change, add the following to your ansible.cfg file:
[paramiko_connection]
look_for_keys = False
Error: "connecting to host <hostname> returned an error" or "Bad address"
This may occur if the SSH fingerprint hasn't been added to Paramiko's (the Python SSH library) known hosts file.
When using persistent connections with Paramiko, the connection runs in a background process. If the host doesn't already have a valid SSH key, by default Ansible will prompt to add the host key. This will cause connections running in background processes to fail.
For example:
2017-04-04 12:06:03,486 p=17981 u=fred | using connection plugin network_cli
2017-04-04 12:06:04,680 p=17981 u=fred | connecting to host veos01 returned an error
2017-04-04 12:06:04,682 p=17981 u=fred | (14, 'Bad address')
2017-04-04 12:06:33,519 p=17981 u=fred | number of connection attempts exceeded, unable to connect to control socket
2017-04-04 12:06:33,520 p=17981 u=fred | persistent_connect_interval=1, persistent_connect_retries=30
Suggestions to resolve:
Use ssh-keyscan to pre-populate the known_hosts. You need to ensure the keys are correct.
ssh-keyscan veos01
or
You can tell Ansible to automatically accept the keys
Environment variable method:
export ANSIBLE_PARAMIKO_HOST_KEY_AUTO_ADD=True
ansible-playbook ...
ansible.cfg method:
[paramiko_connection]
host_key_auto_add = True
Care should be taken before accepting keys.
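For the ssh-keyscan option, note that ssh-keyscan only prints the keys; they still have to be saved. The following is a minimal sketch that appends scanned keys to a known_hosts file without creating duplicates. The function name add_host_keys is hypothetical, and it assumes ~/.ssh/known_hosts is the file your connection reads. Verify the key fingerprints out of band before trusting them.

```shell
# Hypothetical helper: read host key lines on stdin (for example from
# ssh-keyscan) and append the ones not already present to a known_hosts file.
# Assumption: ~/.ssh/known_hosts is the file in use unless a path is given.
add_host_keys() {
    known_hosts="${1:-$HOME/.ssh/known_hosts}"
    touch "$known_hosts"
    while IFS= read -r line; do
        # Skip blank lines and comment lines.
        case "$line" in ''|'#'*) continue ;; esac
        # -x matches whole lines, -F treats the key as a fixed string.
        grep -qxF -- "$line" "$known_hosts" || printf '%s\n' "$line" >> "$known_hosts"
    done
}

# Example: ssh-keyscan veos01 | add_host_keys
```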
Error: "No authentication methods available"
For example:
2017-04-04 12:19:05,670 p=18591 u=fred | creating new control socket for host veos01:None as user admin
2017-04-04 12:19:05,670 p=18591 u=fred | control socket path is /home/fred/.ansible/pc/ca5960d27a
2017-04-04 12:19:05,670 p=18591 u=fred | current working directory is /home/fred/git/ansible-inc/ansible-workspace-2/test/integration
2017-04-04 12:19:05,670 p=18591 u=fred | using connection plugin network_cli
2017-04-04 12:19:06,606 p=18591 u=fred | connecting to host veos01 returned an error
2017-04-04 12:19:06,606 p=18591 u=fred | No authentication methods available
2017-04-04 12:19:35,708 p=18591 u=fred | number of connection attempts exceeded, unable to connect to control socket
2017-04-04 12:19:35,709 p=18591 u=fred | persistent_connect_interval=1, persistent_connect_retries=30
Suggestions to resolve:
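No password or SSH key was supplied. Supply a password (for example with -k, password:, or the environment variable ANSIBLE_NET_PASSWORD) or configure an SSH key.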
Clearing Out Persistent Connections
Platforms: Any
In Ansible 2.3, persistent connection sockets are stored in ~/.ansible/pc for all network devices. When an Ansible playbook runs, the persistent socket connection is displayed when verbose output is specified.
<switch> socket_path: /home/fred/.ansible/pc/f64ddfa760
To clear out a persistent connection before it times out (the default timeout is 30 seconds of inactivity), simply delete the socket file.
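As a minimal sketch, the following removes every persistent connection socket at once rather than one file at a time. The function name clear_persistent_connections is hypothetical, and it assumes the entries in ~/.ansible/pc are Unix socket files, so anything else in that directory is left alone.

```shell
# Hypothetical helper: delete all persistent connection sockets.
# Assumption: the control sockets are Unix socket files (find -type s).
clear_persistent_connections() {
    pc_dir="${1:-$HOME/.ansible/pc}"
    # Nothing to do if no persistent connection has been created yet.
    [ -d "$pc_dir" ] || return 0
    find "$pc_dir" -type s -exec rm -f {} +
}

# Example: clear_persistent_connections
# To remove a single connection instead: rm -f /home/fred/.ansible/pc/f64ddfa760
```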
Timeout issues
Timeouts
All network modules support a timeout value that can be set on a per task basis. The timeout value controls the amount of time in seconds before the task will fail if the command has not returned.
For example:
Suggestions to resolve:
- name: save running-config
  ios_command:
    commands: copy running-config startup-config
    provider: "{{ cli }}"
    timeout: 30
Some operations take longer than the default 10 seconds to complete. One good example is saving the current running config on IOS devices to startup config. In this case, changing the timeout value from the default 10 seconds to 30 seconds will prevent the task from failing before the command completes successfully.
Playbook issues
This section details issues that are caused by the Playbook itself.
Error: "invalid connection specified, expected connection=local, got ssh"
Platforms: Any
Network modules require that the connection is set to local. Any other connection setting will cause the playbook to fail. Ansible will now detect this condition and return an error message:
fatal: [nxos01]: FAILED! => {
"changed": false,
"failed": true,
"msg": "invalid connection specified, expected connection=local, got ssh"
}
To fix this issue, set the connection value to local using one of the following methods:
- Set the play to use connection: local
- Set the task to use connection: local
- Run ansible-playbook using the -c local setting
Error: "Unable to enter configuration mode"
Platforms: eos and ios
This occurs when you attempt to run a task that requires privileged mode in a user mode shell.
For example:
TASK [ios_system : configure name_servers] *****************************************************************************
task path:
fatal: [ios-csr1000v]: FAILED! => {
"changed": false,
"failed": true,
"msg": "unable to enter configuration mode",
"rc": 255
}
Suggestions to resolve:
Add authorize: yes to the task. For example:
- name: configure hostname
  ios_system:
    provider:
      hostname: foo
      authorize: yes
  register: result
If the user requires a password to go into privileged mode, this can be specified with auth_pass; if auth_pass isn't set, the environment variable ANSIBLE_NET_AUTH_PASS will be used instead.
Add authorize: yes and auth_pass to the task. For example:
- name: configure hostname
  ios_system:
    provider:
      hostname: foo
      authorize: yes
      auth_pass: "{{ mypasswordvar }}"
  register: result