Giovanni Tirloni | 3 May 2010 03:56

dladm show-ether doesn't show interfaces

Hello,

 I think we've been hit by bug 6908043: dladm show-ether has stopped showing any interfaces at all.

 All details added to http://www.sysdroid.com/opensolaris/bugs/6908043.txt

 The only recent changes were moving LACP from "short" to "long" on aggr0 (e1000g1+e1000g2) and, a few days ago, starting to use "dladm show-ether" from Zabbix to monitor interface status. So I don't know whether this was happening before and we never noticed, or whether heavy use of dladm show-ether is triggering the problem.

Thank you,

--
Giovanni

Lo Zio | 3 May 2010 17:39

snv 134 crash

Hi
 I installed snv 134 on a Dell PowerEdge T110. The installation went fine, but after one day the system crashed. fmadm faulty reports:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 02 07:41:15 ffb55ba2-8103-44ad-be0c-a2abb0cf0ebb  PCIEX-8000-J5  Major

Host        : newfiler
Platform    : PowerEdge-T110    Chassis_id  : 1TTJQ4J
Product_sn  :

Fault class : fault.io.pciex.device-interr-corr
Affects     : dev:////pci@0,0/pci8086,3b4a@1c,4/pci1028,2a6@0
                  faulted but still in service
FRU         : "MB" (hc://:product-id=PowerEdge-T110:server-id=newfiler:chassis-id=1TTJQ4J/motherboard=0)
                  faulty

Description : Too many recovered internal errors have been detected within the
              specified PCIEX device. This may degrade into a non-recoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-J5 for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device. Use
              fmadm faulty to identify the device or contact Sun for support.

This is similar to a topic found in networking. If I run prtconf -v and grep for 3b4a, I get:

 prtconf -v | grep -B5 -A5 3b4a
                    value=00000005
                name='vendor-id' type=int items=1
                    value=00008086
                name='device-id' type=int items=1
                    value=00003b42
        pci8086,3b4a, instance #3
            System software properties:
                name='ddi-forceattach' type=int items=1
                    value=00000001
            Driver properties:
                name='device_type' type=string items=1 dev=none
--
                name='acpi-namespace' type=string items=1
                    value='\_SB_.PCI0.PEX4'
                name='reg' type=int items=5
                    value=0000e400.00000000.00000000.00000000.00000000
                name='compatible' type=string items=8
                    value='pciex8086,3b4a.5' + 'pciex8086,3b4a' + 'pciexclass,060400' + 'pciexclass,0604' + 'pci8086,3b4a.5' + 'pci8086,3b4a' + 'pciclass,060400' + 'pciclass,0604'
                name='model' type=string items=1
                    value='PCI-PCI bridge'
                name='ranges' type=int items=16
                    value=81000000.00000000.00001000.81000000.00000000.00001000.00000000.00001000.82000000.00000000.df900000.82000000.00000000.df900000.00000000.00100000
                name='bus-range' type=int items=2
--
                name='revision-id' type=int items=1
                    value=00000005
                name='vendor-id' type=int items=1
                    value=00008086
                name='device-id' type=int items=1
                    value=00003b4a
            Device Minor Nodes:
                dev=(80,1023)
                    dev_path=/pci@0,0/pci8086,3b4a@1c,4:devctl
                        spectype=chr type=minor
            pci1028,2a6, instance #0
                System software properties:
                    name='bge-known-subsystems' type=int items=16
                        value=108e1647.108e1648.108e16a7.108e16a8.17c20010.17341013.101402a6.10f12885.17c20020.10b71006.10280109.102801f8.1028865d.0e11005a.0e1100cb.103c12bc
--
                        value=000014e4
                    name='device-id' type=int items=1
                        value=0000165a
                Device Minor Nodes:
                    dev=(116,1)
                        dev_path=/pci@0,0/pci8086,3b4a@1c,4/pci1028,2a6@0:bge0
                            spectype=chr type=minor
                            dev_link=/dev/bge0
                    dev=(116,1003)
                        dev_path=<clone>
                        Device Minor Layered Under:

The problem is that I still can't work out which device is at fault. Is it the network card? If so, any
ideas on how to solve the problem?
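
One way to narrow this down is to look up the path from the "Affects" line among the dev_path entries that prtconf -v prints. In the listing above, the dev_path for that path ends in :bge0, which suggests the faulted node is the bge0 network interface behind the 3b4a bridge. A minimal sketch of that lookup follows; the function name is made up for illustration, and the path is written with a literal @ as the unit-address separator. It only maps the path to a device node and says nothing about whether the hardware is really failing.

```shell
# Sketch: list the minor-node names that prtconf -v reports for a device path.
# Reads prtconf -v style text on stdin; $1 is the path from the "Affects"
# line without the leading "dev:///" part.
minor_nodes_for_path() {
    path=$1
    # dev_path lines look like: dev_path=<path>:<minor>
    # grep -F treats the path as a fixed string, not a regular expression.
    grep -F "dev_path=${path}:" | sed 's/.*://' | sort -u
}

# Hypothetical usage on the affected host:
#   prtconf -v | minor_nodes_for_path '/pci@0,0/pci8086,3b4a@1c,4/pci1028,2a6@0'
```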

Thanks
--

-- 
This message posted from opensolaris.org
Giovanni Tirloni | 4 May 2010 03:11

Re: dladm show-ether doesn't show interfaces

On Sun, May 2, 2010 at 10:56 PM, Giovanni Tirloni <gtirloni <at> sysdroid.com> wrote:
Hello,

 I think we've been hit by bug 6908043, dladm show-ether stopped showing any interface at all.

 All details added to http://www.sysdroid.com/opensolaris/bugs/6908043.txt

 The only recent changes that we had were moving LACP from "short" to "long" on aggr0 (e1000g1+e1000g2) and we started using "dladm show-ether" on Zabbix to monitor the interface status since a few days ago. So I don't know if it was happening before and we never noticed or if heavy use of dladm show-ether is triggering the problem.

Turned out dlmgmtd(1M) was stuck and had to be restarted:

# svcadm restart datalink-management

# dladm show-ether
LINK            PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
e1000g0         current  up       yes   1G-f                            bi
e1000g1         current  up       yes   1G-f                            bi
e1000g2         current  up       yes   1G-f                            bi
e1000g3         current  up       yes   1G-f                            bi

I'm still trying to understand what causes it. Perhaps dladm should have better error reporting in case it doesn't get a satisfactory answer from /dev/dld.
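
Until dladm reports this condition itself, a monitoring script could at least tell "no links listed" apart from a healthy answer. The sketch below assumes the output format shown above (one header line, then one row per link). The function names and the warning text are invented for illustration, and the sketch deliberately only reports the condition rather than restarting anything.

```shell
# Sketch: count data rows in `dladm show-ether` output read from stdin.
# Assumes the first line is the LINK/PTYPE/... header.
count_ether_links() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Exit status 0 if at least one link is listed, 1 otherwise.
ether_links_present() {
    [ "$(count_ether_links)" -gt 0 ]
}

# Hypothetical monitoring use:
#   dladm show-ether | ether_links_present || echo "no links listed; is dlmgmtd stuck?"
```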

--
Giovanni

Giovanni Tirloni | 4 May 2010 03:39

Re: dladm show-ether doesn't show interfaces

On Mon, May 3, 2010 at 10:11 PM, Giovanni Tirloni <gtirloni <at> sysdroid.com> wrote:
On Sun, May 2, 2010 at 10:56 PM, Giovanni Tirloni <gtirloni <at> sysdroid.com> wrote:
Hello,

 I think we've been hit by bug 6908043, dladm show-ether stopped showing any interface at all.

 All details added to http://www.sysdroid.com/opensolaris/bugs/6908043.txt

 The only recent changes that we had were moving LACP from "short" to "long" on aggr0 (e1000g1+e1000g2) and we started using "dladm show-ether" on Zabbix to monitor the interface status since a few days ago. So I don't know if it was happening before and we never noticed or if heavy use of dladm show-ether is triggering the problem.

Turned out dlmgmtd(1M) was stuck and had to be restarted:

# svcadm restart datalink-management

# dladm show-ether
LINK            PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
e1000g0         current  up       yes   1G-f                            bi
e1000g1         current  up       yes   1G-f                            bi
e1000g2         current  up       yes   1G-f                            bi
e1000g3         current  up       yes   1G-f                            bi

I'm still trying to understand what causes it. Perhaps dladm should have better error reporting in case it doesn't get a satisfactory answer from /dev/dld.

There seems to be a memory leak in dlmgmtd, since it's using 3.9GB of memory (12% of 32GB).

Should I file a new bug? If anyone is interested, I can send the core dump.

I also updated the file below with a dtrace output of the functions being called when you issue a "dladm show-ether" in another terminal on the same server (it's quite long).

  http://www.sysdroid.com/opensolaris/bugs/6908043.txt

# ps aux | grep dlmgmt
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
dladm       15  0.0 12.1 4039376 4039376 ?        S   Mar 30  6:47 /sbin/dlmgmtd

# gcore 15
# ls -lh core.15
-rw-r--r-- 1 root root 3.9G 2010-05-03 22:17 core.15

# pstack 15
15:    /sbin/dlmgmtd
-----------------  lwp# 1 / thread# 1  --------------------
 feef0547 pause    ()
 08053a18 main     (1, 8047e50, 8047e58, 8047e0c) + b8
 0805326d _start   (1, 8047ef0, 0, 8047efe, 8047f0e, 8047f1f) + 7d
-----------------  lwp# 2 / thread# 2  --------------------
 feef0ea1 door     (fec9e980, 410, 0, fec9ee00, f5f00, a)
 08054b98 dlmgmt_handler (0, fec9edd8, 28, 0, 0, 8054a9c) + fc
 feef0ed2 __door_return () + 52
-----------------  lwp# 3 / thread# 3  --------------------
 feef0ea1 door     (feb9f980, 410, 0, feb9fe00, f5f00, a)
 08054b98 dlmgmt_handler (0, feb9fdd8, 28, 0, 0, 8054a9c) + fc
 feef0ed2 __door_return () + 52
-----------------  lwp# 4 / thread# 4  --------------------
 feef0ea1 door     (fea7e980, 410, 0, fea7ee00, f5f00, a)
 08054b98 dlmgmt_handler (0, fea7edd8, 28, 0, 0, 8054a9c) + fc
 feef0ed2 __door_return () + 52
-----------------  lwp# 5 / thread# 5  --------------------
 feef0ea1 door     (fe80ed90, 18, 0, fe80ee00, f5f00, a)
 08054b98 dlmgmt_handler (0, fe80ede8, 18, 0, 0, 8054a9c) + fc
 feef0ed2 __door_return () + 52
-----------------  lwp# 6 / thread# 6  --------------------
 feef0ea1 door     (fe70f980, 410, 0, fe70fe00, f5f00, a)
 08054b98 dlmgmt_handler (0, fe70fdd8, 28, 0, 0, 8054a9c) + fc
 feef0ed2 __door_return () + 52
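
As a cross-check of the figures above, the arithmetic can be sketched as follows. This takes the RSS figure from the ps listing as 4039376 KB (an interpretation of that column) and the 32GB of RAM mentioned earlier. The function name is made up.

```shell
# Sketch: convert an RSS value in kilobytes to GB and to a share of RAM.
# $1 = RSS in KB (as ps reports it), $2 = installed RAM in GB.
rss_summary() {
    awk -v rss_kb="$1" -v ram_gb="$2" 'BEGIN {
        gb = rss_kb / 1024 / 1024
        printf "%.2f GB, %.1f%% of %d GB\n", gb, 100 * gb / ram_gb, ram_gb
    }'
}

# rss_summary 4039376 32    # prints "3.85 GB, 12.0% of 32 GB"
```

That is roughly in line with the 3.9G core file and the 12% quoted above.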


--
Giovanni
Yun Zhou | 4 May 2010 04:07

Re: dladm show-ether doesn't show interfaces

Can you please preload libumem to diagnose the memory leak problem? I 
think the below should do the trick:

# svcadm disable svc:/network/datalink-management:default
# export LD_PRELOAD=libumem.so
# export UMEM_DEBUG=default
# /sbin/dlmgmtd -d 10 &
# mdb -p `pgrep dlmgmtd`
 > ::findleaks

Please post the output of ::findleaks if there is any.
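
A note on the export lines: once exported, LD_PRELOAD and UMEM_DEBUG apply to every later command started from that shell, mdb included. A small wrapper can confine them to the one process being debugged. This is only a sketch, and the function name is invented.

```shell
# Sketch: run a single command with libumem preloaded and its debugging
# enabled, without exporting anything into the calling shell.
run_with_libumem() {
    env LD_PRELOAD=libumem.so UMEM_DEBUG=default "$@"
}

# Hypothetical usage, following the steps above:
#   run_with_libumem /sbin/dlmgmtd -d 10 &
#   mdb -p `pgrep dlmgmtd`
```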

Thanks
- Cathy
> On Mon, May 3, 2010 at 10:11 PM, Giovanni Tirloni <gtirloni@...> wrote:
>
>     On Sun, May 2, 2010 at 10:56 PM, Giovanni Tirloni <gtirloni@...> wrote:
>
>         Hello,
>
>          I think we've been hit by bug 6908043, dladm show-ether
>         stopped showing any interface at all.
>
>          All details added to
>         http://www.sysdroid.com/opensolaris/bugs/6908043.txt
>
>          The only recent changes that we had were moving LACP from
>         "short" to "long" on aggr0 (e1000g1+e1000g2) and we started
>         using "dladm show-ether" on Zabbix to monitor the interface
>         status since a few days ago. So I don't know if it was
>         happening before and we never noticed or if heavy use of dladm
>         show-ether is triggering the problem.
>
>
>     Turned out dlmgmtd(1M) was stuck and had to be restarted:
>
>     # svcadm restart datalink-management
>
>     # dladm show-ether
>     LINK            PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
>     e1000g0         current  up       yes   1G-f                            bi
>     e1000g1         current  up       yes   1G-f                            bi
>     e1000g2         current  up       yes   1G-f                            bi
>     e1000g3         current  up       yes   1G-f                            bi
>
>     I'm still trying to understand what causes it. Perhaps dladm
>     should have better error reporting in case it doesn't get a
>     satisfactory answer from /dev/dld.
>
>
> There seems to be a memory leak in dlmgmtd since it's using 3.9GB of 
> memory (12% of 32GB).
>
> Should I file a new bug ? If anyone is interested I can send the core 
> dump.
>
> I also updated the file below with a dtrace output of the functions 
> being called when you issue a "dladm show-ether" in another terminal 
> on the same server (it's quite long).
>
>   http://www.sysdroid.com/opensolaris/bugs/6908043.txt
>
> # ps aux | grep dlmgmt
> USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
> dladm       15  0.0 12.1 4039376 4039376 ?        S   Mar 30  6:47 /sbin/dlmgmtd
>
> # gcore 15
> # ls -lh core.15
> -rw-r--r-- 1 root root 3.9G 2010-05-03 22:17 core.15
>
> # pstack 15
> 15:    /sbin/dlmgmtd
> -----------------  lwp# 1 / thread# 1  --------------------
>  feef0547 pause    ()
>  08053a18 main     (1, 8047e50, 8047e58, 8047e0c) + b8
>  0805326d _start   (1, 8047ef0, 0, 8047efe, 8047f0e, 8047f1f) + 7d
> -----------------  lwp# 2 / thread# 2  --------------------
>  feef0ea1 door     (fec9e980, 410, 0, fec9ee00, f5f00, a)
>  08054b98 dlmgmt_handler (0, fec9edd8, 28, 0, 0, 8054a9c) + fc
>  feef0ed2 __door_return () + 52
> -----------------  lwp# 3 / thread# 3  --------------------
>  feef0ea1 door     (feb9f980, 410, 0, feb9fe00, f5f00, a)
>  08054b98 dlmgmt_handler (0, feb9fdd8, 28, 0, 0, 8054a9c) + fc
>  feef0ed2 __door_return () + 52
> -----------------  lwp# 4 / thread# 4  --------------------
>  feef0ea1 door     (fea7e980, 410, 0, fea7ee00, f5f00, a)
>  08054b98 dlmgmt_handler (0, fea7edd8, 28, 0, 0, 8054a9c) + fc
>  feef0ed2 __door_return () + 52
> -----------------  lwp# 5 / thread# 5  --------------------
>  feef0ea1 door     (fe80ed90, 18, 0, fe80ee00, f5f00, a)
>  08054b98 dlmgmt_handler (0, fe80ede8, 18, 0, 0, 8054a9c) + fc
>  feef0ed2 __door_return () + 52
> -----------------  lwp# 6 / thread# 6  --------------------
>  feef0ea1 door     (fe70f980, 410, 0, fe70fe00, f5f00, a)
>  08054b98 dlmgmt_handler (0, fe70fdd8, 28, 0, 0, 8054a9c) + fc
>  feef0ed2 __door_return () + 52
>
>
> -- 
> Giovanni

Giovanni Tirloni | 4 May 2010 04:29

Re: dladm show-ether doesn't show interfaces

On Mon, May 3, 2010 at 11:07 PM, Yun Zhou <Cathy.Zhou <at> oracle.com> wrote:
Can you please preload libumem to diagnose the memory leak problem? I think the below should do the trick:

# svcadm disable svc:/network/datalink-management:default
# export LD_PRELOAD=libumem.so
# export UMEM_DEBUG=default
# /sbin/dlmgmtd -d 10 &
# mdb -p `pgrep dlmgmtd`
> ::findleaks

# LD_PRELOAD=libumem.so UMEM_DEBUG=default /sbin/dlmgmtd -d 10

# while [ 1 ]; do dladm show-ether; dladm show-phys; dladm show-link; dladm show-aggr; sleep 1; done

It iterated 100 times and then I fired up mdb (which had the effect of making dladm wait):

# mdb -p 13859
Loading modules: [ ld.so.1 libumem.so.1 libavl.so.1 ]
> ::findleaks
findleaks: no memory leaks detected
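
The load loop above runs until it is interrupted. For a repeatable test, a bounded variant might look like the sketch below. The function name and argument order are made up.

```shell
# Sketch: run a command a fixed number of times with a pause between runs.
# $1 = iterations, $2 = pause in seconds, remaining arguments = the command.
repeat_cmd() {
    count=$1
    pause=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@"
        i=$((i + 1))
        sleep "$pause"
    done
}

# Hypothetical usage: 100 rounds of one of the queries from the loop above.
#   repeat_cmd 100 1 dladm show-ether
```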

--
Giovanni

Yun Zhou | 4 May 2010 04:39

Re: dladm show-ether doesn't show interfaces

Giovanni Tirloni wrote:
> On Mon, May 3, 2010 at 11:07 PM, Yun Zhou <Cathy.Zhou@...> wrote:
>
>     Can you please preload libumem to diagnose the memory leak
>     problem? I think the below should do the trick:
>
>     # svcadm disable svc:/network/datalink-management:default
>     # export LD_PRELOAD=libumem.so
>     # export UMEM_DEBUG=default
>     # /sbin/dlmgmtd -d 10 &
>     # mdb -p `pgrep dlmgmtd`
>     > ::findleaks
>
>
> # LD_PRELOAD=libumem.so UMEM_DEBUG=default /sbin/dlmgmtd -d 10
>
> # while [ 1 ]; do dladm show-ether; dladm show-phys; dladm show-link; 
> dladm show-aggr; sleep 1; done
>
> It iterated 100 times and then I fired up mdb (which had the effect of 
> making dladm wait):
>
> # mdb -p 13859
> Loading modules: [ ld.so.1 libumem.so.1 libavl.so.1 ]
> > ::findleaks
> findleaks: no memory leaks detected
Hmm, it does not report any memory leaks. Do you still see 
dlmgmtd using a lot of memory?

- Cathy
>
> -- 
> Giovanni
>

Giovanni Tirloni | 4 May 2010 04:52

Re: dladm show-ether doesn't show interfaces


On Mon, May 3, 2010 at 11:39 PM, Yun Zhou <Cathy.Zhou <at> oracle.com> wrote:
Giovanni Tirloni wrote:

On Mon, May 3, 2010 at 11:07 PM, Yun Zhou <Cathy.Zhou <at> oracle.com> wrote:

   Can you please preload libumem to diagnose the memory leak
   problem? I think the below should do the trick:

   # svcadm disable svc:/network/datalink-management:default
   # export LD_PRELOAD=libumem.so
   # export UMEM_DEBUG=default
   # /sbin/dlmgmtd -d 10 &
   # mdb -p `pgrep dlmgmtd`
   > ::findleaks


# LD_PRELOAD=libumem.so UMEM_DEBUG=default /sbin/dlmgmtd -d 10

# while [ 1 ]; do dladm show-ether; dladm show-phys; dladm show-link; dladm show-aggr; sleep 1; done

It iterated 100 times and then I fired up mdb (which had the effect of making dladm wait):

# mdb -p 13859
Loading modules: [ ld.so.1 libumem.so.1 libavl.so.1 ]
> ::findleaks
findleaks: no memory leaks detected
Hmm, it does not report any memory leak problem. Do you still see dlmgmtd uses a lot of memory?


No, it doesn't seem to be growing right now.

We have had Zabbix hammering the servers with `dladm show-ether` for a few days now, and they all stopped working between yesterday and today. I restarted dlmgmtd on a few servers, so we should be able to see whether they break again in a few days.
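
To see whether dlmgmtd grows again over those days, its RSS could be sampled periodically and the first and last samples compared. A rough sketch follows; the log path and function name are made up, and it assumes one RSS value in KB per line, oldest first.

```shell
# Sketch: read one RSS sample (KB) per line on stdin, oldest first, and
# print how much the last sample differs from the first.
rss_growth_kb() {
    awk 'NF { if (!seen) { first = $1; seen = 1 } last = $1 }
         END { if (seen) print last - first; else print 0 }'
}

# Hypothetical collection, run periodically (e.g. from cron):
#   ps -o rss= -p "$(pgrep dlmgmtd)" >> /var/tmp/dlmgmtd-rss.log
#   rss_growth_kb < /var/tmp/dlmgmtd-rss.log
```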

--
Giovanni

Reang Su | 4 May 2010 08:09

NICDRV Netstress test

Hi All,

Has anybody had a chance to clear netstress recently on the latest source, B138 and above?
I'm seeing a strange issue: after 8 hours or so, Tx stalls on the remote side, whereas Rx keeps working fine.

When I run snoop on the remote interface, I don't see any packets reaching it (i.e. no packets from the stack to the driver on ping). The system is up and the driver stats also look OK.
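
One way to put numbers on "Tx stalls while Rx works" is to take two samples of the link's packet counters a few seconds apart and compare them. The sketch below assumes the counters have already been read from the driver statistics (for example output and input packet counts). The function name and the three labels are invented for illustration.

```shell
# Sketch: classify two samples of a link's packet counters.
# Arguments: opackets_before opackets_after ipackets_before ipackets_after
link_progress() {
    tx=$(( $2 - $1 ))
    rx=$(( $4 - $3 ))
    if [ "$tx" -eq 0 ] && [ "$rx" -gt 0 ]; then
        echo "tx-stalled"      # receiving, but nothing was sent
    elif [ "$tx" -eq 0 ] && [ "$rx" -eq 0 ]; then
        echo "idle"            # no traffic in either direction
    else
        echo "ok"
    fi
}

# Hypothetical usage with made-up counter values:
#   link_progress 500 500 100 180
```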

Any clues?

regards,
~Su

tian robin luo | 4 May 2010 08:27

Re: [driver-discuss] NICDRV Netstress test

Did your NIC driver hang during the test? Did you see any chip reset events?

Regards,
Robin

Reang Su:
Hi All,

Anybody had a chance to clear netstress recently on latest source, B138 above.
I'm experiencing strange issue. After 8 hr or so Tx stalls on remote, whreas Rx works fine.

On putting remote interface through snoop, I don't see any packet reaching snoop (i.e no packets from stack to driver on ping). System is up and driver stats also looks ok.

Any clue.

regards,
~Su

