DIMM clearly busted, but memtest86+ does not find any problems


FesselEnte
09/02/2009, 11:45
Hello, just a little note about a problem I recently had with a DDR2-800 DIMM. It passed all the memtest86+ tests, including the bit fade test (I ran them for several nights), but was eventually shown to be faulty. <_<

Under Fedora 10, large (gigabyte-sized) files containing only zeros, written to disk with "dd if=/dev/zero" and then read back, consistently showed bit flips. The flips sat at 32-byte intervals, but where in the file they appeared changed between reads. This only happened for large files, on the order of the system's memory size. For example, one might get the following hexdump output instead of the expected zeros only:


0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
16fcf40 0000 0000 0000 0000 0020 0000 0000 0000
16fcf50 0000 0000 0000 0000 0000 0000 0000 0000
16fcf60 0000 0000 0000 0000 0024 0000 0000 0000
16fcf70 0000 0000 0000 0000 0000 0000 0000 0000
16fcf80 0000 0000 0000 0000 0024 0000 0000 0000
16fcf90 0000 0000 0000 0000 0000 0000 0000 0000
16fcfa0 0000 0000 0000 0000 0024 0000 0000 0000
16fcfb0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfc0 0000 0000 0000 0000 0020 0000 0000 0000
16fcfd0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfe0 0000 0000 0000 0000 0031 0000 0000 0000
16fcff0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a660 0000 0000 0000 0000 0004 0000 0000 0000
4df0a670 0000 0000 0000 0000 0000 0000 0000 0000
4df0a680 0000 0000 0000 0000 0020 0000 0000 0000
4df0a690 0000 0000 0000 0000 0000 0000 0000 0000
4df0a6a0 0000 0000 0000 0000 0004 0000 0000 0000
4df0a6b0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a6e0 0000 0000 0000 0000 0021 0000 0000 0000
4df0a6f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a700 0000 0000 0000 0000 0024 0000 0000 0000
4df0a710 0000 0000 0000 0000 0000 0000 0000 0000
4df0a720 0000 0000 0000 0000 0020 0000 0000 0000
4df0a730 0000 0000 0000 0000 0000 0000 0000 0000
4df0a740 0000 0000 0000 0000 0020 0000 0000 0000
4df0a750 0000 0000 0000 0000 0000 0000 0000 0000
4df0a760 0000 0000 0000 0000 0030 0000 0000 0000
4df0a770 0000 0000 0000 0000 0000 0000 0000 0000
4df0a780 0000 0000 0000 0000 0024 0000 0000 0000
4df0a790 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a7c0 0000 0000 0000 0000 0026 0000 0000 0000
4df0a7d0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a7e0 0000 0000 0000 0000 0037 0000 0000 0000
4df0a7f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a800 0000 0000 0000 0000 0027 0000 0000 0000
4df0a810 0000 0000 0000 0000 0000 0000 0000 0000
4df0a820 0000 0000 0000 0000 0038 0000 0000 0000
4df0a830 0000 0000 0000 0000 0000 0000 0000 0000
*
73f78000
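
One way to read a dump like the one above is to OR together all the non-zero words and note where each one falls within its 32-byte block. The following is a minimal bash sketch, not part of the original test. The function name summarize_flips is made up for illustration, and it assumes the default hexdump layout (an offset followed by eight 16-bit hex words per line).

```shell
#!/bin/bash
# summarize_flips: read default `hexdump` output on stdin and report
#   words     = number of non-zero 16-bit words
#   mask      = bitwise OR of all non-zero words
#   positions = byte offsets (mod 32) at which non-zero words start
summarize_flips() {
    local -a fields=() seen=()
    local offset i pos mask=0 count=0
    while read -r -a fields; do
        # Skip "*" repeat markers and the final size-only line.
        (( ${#fields[@]} == 9 )) || continue
        offset=$(( 16#${fields[0]} ))
        for i in 1 2 3 4 5 6 7 8; do
            [[ ${fields[i]} == 0000 ]] && continue
            mask=$(( mask | 16#${fields[i]} ))
            count=$(( count + 1 ))
            # Byte position of this word within its 32-byte block.
            pos=$(( (offset + 2 * (i - 1)) % 32 ))
            seen[pos]=1
        done
    done
    # Join the distinct positions (array indices, ascending) with commas.
    local IFS=,
    printf 'words=%d mask=0x%04x positions=%s\n' "$count" "$mask" "${!seen[*]}"
}
```

Run it as "summarize_flips < hexdump_file". For the dump above it should report 19 non-zero words, a mask of 0x003f and a single position, 8. On a little-endian machine, where hexdump shows each byte pair as one 16-bit word, that would mean every flip is confined to the low six bits of the byte at offset 8 within each 32-byte block.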

Replacing one of the DIMMs fixed this. I have now made a point of putting ECC RAM into that system ... and all my future ones.

So, how often does it happen that memtest86+ finds nothing but the memory is busted anyway? Or why would this problem not be caught by the memtest86+ test suite?

For reference purposes, here is the test script that generated the large files, then re-read them and checked for zeros:


#!/bin/bash

BLOCKS=4200000      # initial file size in 512-byte blocks (dd's default block size)
ERROR=0
SOURCE=/dev/zero
OUTFILE=entropy
BLOCKSSTEP=10000    # blocks appended after each clean pass

# Write initial file full of zeros
echo "Writing initial file of $BLOCKS blocks"
dd if="$SOURCE" of="$OUTFILE" count="$BLOCKS" conv=fdatasync

while [[ $ERROR == 0 ]]; do
    echo -n "Testing $BLOCKS blocks at "
    date
    # hexdump the file full of zeros
    HEXDUMP=hexdump_$(date +%Y%m%d_%H%M%S)
    hexdump "$OUTFILE" > "$HEXDUMP"
    # For an all-zero file, hexdump prints one line of zeros, a "*" and the
    # final offset. Only the first of these contains spaces, so it is the
    # only line that survives --only-delimited.
    LINE=$(cut --fields=1- --delimiter=" " --only-delimited "$HEXDUMP")
    if [[ $LINE != "0000000 0000 0000 0000 0000 0000 0000 0000 0000" ]]; then
        ERROR=1
        echo "Errors found in $HEXDUMP ... dumping a second time"
        hexdump "$OUTFILE" > "${HEXDUMP}_repeat"
    else
        # Clean pass: grow the file and test again
        BLOCKS=$((BLOCKS + BLOCKSSTEP))
        dd if="$SOURCE" of="$OUTFILE" \
            count="$BLOCKSSTEP" \
            conv=fdatasync,notrunc \
            status=noxfer \
            oflag=append
    fi
done

Wichetael
09/02/2009, 15:35
It is entirely possible for some errors not to be caught by memtest. For one, memtest does not have 100% test coverage of the entire memory subsystem, so there could be problems elsewhere in the subsystem that memtest is completely unaware of. Second, and more importantly, errors are not just a matter of a broken bit. A memory module might work fine under certain conditions but start failing under others, and it is simply impossible to test under all possible conditions; there are just too many different factors in play.

As to how big the chance is, that is difficult to say. You can be pretty confident that the memory is OK when memtest doesn't report any errors, but it cannot be guaranteed.

FesselEnte
09/02/2009, 20:50
As the saying goes, "if you want a guarantee, buy a toaster" ;)

I have also wondered whether something in the disk DMA path brings this error out of hiding, in which case memtest86+ would of course not stand a chance. If I have some time, I will test the DIMM on another motherboard and see what happens.
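
One way to probe the DMA idea without swapping hardware might be to repeat the zero-file test in a RAM-backed directory, so that no disk transfer is involved. The following is a rough bash sketch under several assumptions: the function names is_all_zero and check_zero_file are made up for illustration, GNU dd, stat and cmp are available, and the directory passed in (for example /dev/shm, if it is a tmpfs mount with enough free space) is chosen by the user.

```shell
#!/bin/bash
# is_all_zero FILE
# Succeeds if FILE contains only zero bytes, by comparing it against
# /dev/zero over exactly the file's length (GNU cmp -n).
is_all_zero() {
    local file=$1 bytes
    bytes=$(stat -c %s "$file") || return 2
    cmp -s -n "$bytes" "$file" /dev/zero
}

# check_zero_file DIR SIZE_MIB
# Writes SIZE_MIB MiB of zeros to DIR/zerotest.bin, reads the file back
# and prints "clean" or "corrupt". Returns 0 when clean, 1 when corrupt.
check_zero_file() {
    local dir=$1 size_mib=$2 status=0
    local file="$dir/zerotest.bin"
    dd if=/dev/zero of="$file" bs=1M count="$size_mib" conv=fsync 2>/dev/null || return 2
    if is_all_zero "$file"; then
        echo clean
    else
        status=1
        echo corrupt
        # Show up to 20 differing bytes: 1-based offset and octal values.
        cmp -l -n "$(stat -c %s "$file")" "$file" /dev/zero | head -n 20
    fi
    rm -f "$file"
    return "$status"
}
```

For example, "check_zero_file /dev/shm 1500" versus "check_zero_file /some/disk/dir 1500". If corruption also shows up in the RAM-backed run, the disk DMA path is probably not the trigger; a clean run there proves little, since the data may simply have landed in good memory.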