[pfSense] Fatal trap 12 page fault

Hiren Joshi josh at moonfruit.com
Thu Jan 12 07:52:17 EST 2012


We had another crash this morning...

It seems to cascade: whatever crashes the primary seems to persist during the failover and crashes the secondary as well, about a minute later. Not good. Does anyone have any clues/ideas?

Thanks,

Josh.

-----Original Message-----
From: list-bounces at lists.pfsense.org [mailto:list-bounces at lists.pfsense.org] On Behalf Of Russell Howe
Sent: 11 January 2012 15:25
To: pfSense support and discussion
Subject: Re: [pfSense] Fatal trap 12 page fault

On 04/01/12 11:53, Hiren Joshi wrote:
> And another one:
> http://sysops2.moonfruit.com/communities/4/004/009/843/874/images/4560450091_525x290.jpg
> http://sysops2.moonfruit.com/communities/4/004/009/843/874/images/4560450095_525x291.jpg
> http://sysops2.moonfruit.com/communities/4/004/009/843/874/images/4560450097_525x294.jpg
>
> We're not upgrading to pfsence 2.0 and will change the memory over soon...

s/not/now/, I believe. We upgraded to 2.0.1.

pfSense 2.0.1 has problems with our bge interfaces (see the separate thread
"WAN interface (bge) receive path died after 4 hours").

We tried the workarounds suggested at

	http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

but that just seemed to make the interface lock up more frequently (twice in 10 minutes)
so we reverted the changes.

We added a script which did "ifconfig bge1 down; sleep 10; ifconfig bge1 up" if the
default gateway became unreachable, which works around the interface lockup issue.
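
For illustration, the workaround described above could be sketched roughly as follows. This is not the script we actually ran; it assumes Python 3 is available on the node (a stock pfSense install may not provide it), and the choice of three pings is an arbitrary assumption. The function names are made up for this example.

```python
import subprocess
import time


def gateway_reachable(gateway):
    """Return True if the gateway answers at least one of three pings."""
    # "-c 3" (three echo requests) is an assumed choice, not taken from the original script.
    result = subprocess.run(
        ["ping", "-c", "3", gateway],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def bounce_interface(iface, run=subprocess.call, sleep=time.sleep, delay=10):
    """Equivalent of: ifconfig <iface> down; sleep 10; ifconfig <iface> up."""
    run(["ifconfig", iface, "down"])
    sleep(delay)
    run(["ifconfig", iface, "up"])


def check_once(gateway, iface, reachable=gateway_reachable, bounce=bounce_interface):
    """Bounce the interface only if the gateway is unreachable; return True if bounced."""
    if reachable(gateway):
        return False
    bounce(iface)
    return True
```

Something like `check_once("192.0.2.1", "bge1")` could then be run periodically, for example from cron; 192.0.2.1 is a placeholder address, not our real gateway.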

We tried a different server as the primary pfSense node, replacing the Dell PowerEdge
R200 with a Dell PowerEdge 850. The replacement was not a new machine, but it had
been reliably running VMware on Linux and 3 highly utilised Windows VMs until now.
The tasks it was doing were CPU and memory intensive but fairly light on network
traffic compared to our firewalls. Both old and new servers had bge interfaces
onboard.

We saw the same kernel panics and interface lockups on the PE850 as we had seen on
the R200, with pfSense 2.0.1 (we didn't try 1.2.3 on the replacement hardware).

We also witnessed the switch we were using reset itself, losing its VLAN
configuration which meant that traffic was no longer segregated between LAN and WAN
interfaces (although things still worked as there were no overlapping subnets).
Whilst the VLANs were merged we witnessed total system freezes which were
recoverable only by hitting the power switch. That was fun. We have replaced the
switch with a warranty replacement from Dell.

We tried installing a pair of Realtek network cards (the only ones we could
get at short notice) in each node to eliminate usage of the bge interfaces and
driver. We continued to see similar kernel crashes and also, interestingly,
interface lockups. The Realtek cards are still in the servers, and one in each
node is now being used for pfsync traffic (~10Mbit/s), which was previously on
the WAN interfaces.

So, we're pretty much back where we started. A pair of Dell R200 servers using
the onboard bge NICs keep crashing and restarting, using either pfSense 1.2.3
or 2.0.1. In addition, under 2.0.1 the receive path on the bge interfaces
sometimes stops working in a way consistent with pfSense issue 1425 and FreeBSD
pr 152295. It happens seemingly at random, at times of both high and low traffic
(low would be ~50Mbit/s, high around 200Mbit/s, LAN->WAN). WAN->LAN
traffic is generally between 10 and 20Mbit/s. The 95th percentiles are
190Mbit/s (LAN->WAN) and 15Mbit/s (WAN->LAN).

Yesterday I witnessed this message on the console of the primary node, shortly
before it crashed and rebooted:

	bge1: No memory for std Rx buffers

Looking at commit r198928 (merged to the branch in r201685), which introduced
that log message, I found the note quoted below. (The message can be found twice
in r201685; one occurrence was a copy & paste error, corrected in r210014 and
added to the FreeBSD 8 branch in r211371.)

	http://lists.freebsd.org/pipermail/svn-src-stable-8/2010-January/001146.html

"bge_newbuf_std still has a bug for handling dma map load failure
under high network load. Just reusing mbuf is not enough as driver
already unloaded the dma map of the mbuf. Graceful recovery needs
more work."

I don't know if that is relevant; it may be more a symptom than a cause.
I've only seen it the once. Is it a sign of memory pressure? mbuf clusters are
currently at 24180/25600 (total/max), but the upper limit appears to increase as
necessary. Is that correct?


pciconf -l|grep bge
bge0 at pci0:3:0:0:	class=0x020000 card=0x023c1028 chip=0x165914e4 rev=0x21 hdr=0x00
bge1 at pci0:4:0:0:	class=0x020000 card=0x023c1028 chip=0x165914e4 rev=0x21 hdr=0x00

(same on both nodes)

From lspci in Linux on the PowerEdge 850 we swapped in:

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
04:00.0 0200: 14e4:1659 (rev 11)
         Subsystem: 1028:01b6

Which I assume maps to card=0x01b61028 chip=0x165914e4 rev=0x11 in pciconf parlance

So, pretty similar but not identical. My decoding of the above is:

chip=0x165914e4
0x14e4 is Broadcom
0x1659 is BCM5721

card=0x023c1028 (i.e. 1028:023c) - listed in pci.ids on my Debian box as
"PowerEdge R200 Broadcom NetXtreme BCM5721" so that seems OK to me

1028:01e6 is listed in the same file as "PowerEdge 860" so 1028:01b6 presumably
is the PowerEdge 850 variant.
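
To make the decoding above explicit: pciconf prints each ID as one 32-bit value with the vendor in the low 16 bits and the device (or subsystem) in the high 16 bits, whereas lspci prints vendor:device. A small sketch of the conversion, with helper names invented for this example:

```python
def split_pciconf_id(value):
    """Split a 32-bit pciconf chip=/card= value into (vendor, device)."""
    vendor = value & 0xFFFF          # low 16 bits, e.g. 0x14e4 (Broadcom) or 0x1028 (Dell)
    device = (value >> 16) & 0xFFFF  # high 16 bits, e.g. 0x1659 (BCM5721)
    return vendor, device


def lspci_to_pciconf(pair):
    """Convert an lspci 'vendor:device' string to the pciconf 32-bit form."""
    vendor_hex, device_hex = pair.split(":")
    return "0x{:04x}{:04x}".format(int(device_hex, 16), int(vendor_hex, 16))
```

With these, `split_pciconf_id(0x023c1028)` gives vendor 0x1028 and subsystem 0x023c, and `lspci_to_pciconf("1028:01b6")` gives the `0x01b61028` assumed above for the PowerEdge 850.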

From the primary:

netstat -m
14666/16834/31500 mbufs in use (current/cache/total)
10305/13875/24180/25600 mbuf clusters in use (current/cache/total/max)
10303/13504 mbuf+clusters out of packet secondary zone in use (current/cache)
4223/2931/7154/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
41605K/43682K/85287K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/7/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
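
As a rough way to keep an eye on the cluster figures above, the "mbuf clusters in use" line can be parsed and compared against its max. This is only a sketch: it assumes the output format shown here, and the function names are invented for the example. It also rests on my reading that "cache" means clusters held for reuse rather than in active use, so that total = current + cache.

```python
import re

# Matches e.g. "10305/13875/24180/25600 mbuf clusters in use (current/cache/total/max)"
CLUSTER_LINE = re.compile(r"^(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use")


def parse_mbuf_clusters(netstat_output):
    """Return current/cache/total/max cluster counts from `netstat -m` text, or None."""
    for line in netstat_output.splitlines():
        match = CLUSTER_LINE.match(line.strip())
        if match:
            current, cache, total, maximum = (int(g) for g in match.groups())
            return {"current": current, "cache": cache, "total": total, "max": maximum}
    return None


def utilisation(counts):
    """Fractions of the max taken by in-use clusters and by in-use plus cached clusters."""
    return counts["current"] / counts["max"], counts["total"] / counts["max"]
```

On the figures above this gives roughly 40% of the max in use and roughly 94% once cached clusters are counted, which is the 24180/25600 figure I quoted earlier.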


-- 
Russell Howe
rhowe at moonfruit.com