David Woodhouse | 1 Dec 1999 13:19

Re: Temporary downtime


dwmw2 <at> infradead.org said:
> My machine was hacked this afternoon. I'm in the process of completely
> reinstalling it. Expect a day or two of downtime on the mailing list
> and www/ftp sites. 

As far as I know, it's back up and functioning correctly.

If I'm mistaken, please let me know ASAP.

--
dwmw2

To unsubscribe, send "unsubscribe mtd" to majordomo <at> infradead.org

Bob Canup | 3 Dec 1999 20:28

power down

The reason that I said that expecting anything to work during power down
is wishful thinking is this: once the voltage to a digital chip goes
below the minimum specification of the chip, the behavior of the chip
becomes indeterminate.

For example: the old Western Digital 1791 double density disk controller
chip would sometimes glitch in such a way during power down that it
would write to the floppy - you could see the floppy light blink when
this happened.

Unless chips are specifically designed to handle power down conditions
this sort of thing happens.  For example - any competently designed
Flash memory has to refuse to write if the voltage is below spec.

As to flushing the buffers and doing a shutdown when a power fail
condition occurs - I believe that Linux already has code to handle a
power down such as I described. What I have described is very similar to
a UPS signaling the kernel that power is going down. Linux can do an
ordered shutdown when it receives the signal.
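
As a rough illustration of that kind of handling, here is a minimal
Python sketch. It assumes a UPS monitoring daemon delivers SIGPWR (the
signal traditionally sent to init on Linux when power is failing) to the
process. The PowerFailGuard class, its method names and the temporary
log file are all hypothetical, and a real system would more likely rely
on init's own power-fail handling than on an application-level handler.

```python
import os
import signal
import tempfile

# SIGPWR is Linux-specific; fall back to SIGTERM elsewhere so the sketch still runs.
POWER_FAIL_SIGNAL = getattr(signal, "SIGPWR", signal.SIGTERM)


class PowerFailGuard:
    """Hypothetical helper: stops accepting writes once power is failing."""

    def __init__(self, path):
        self.log = open(path, "a")
        self.power_failing = False
        signal.signal(POWER_FAIL_SIGNAL, self._on_power_fail)

    def _on_power_fail(self, signum, frame):
        # Only set a flag here; the flushing is done by shutdown().
        self.power_failing = True

    def write(self, record):
        if self.power_failing:
            return False  # refuse new writes once power is going down
        self.log.write(record + "\n")
        return True

    def shutdown(self):
        # Push buffered data to the device before the supply drops out.
        self.log.flush()
        os.fsync(self.log.fileno())
        self.log.close()


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    guard = PowerFailGuard(path)
    accepted_before = guard.write("record 1")
    os.kill(os.getpid(), POWER_FAIL_SIGNAL)  # simulate the UPS daemon's signal
    accepted_after = guard.write("record 2")
    guard.shutdown()
    with open(path) as f:
        contents = f.read()
    os.remove(path)
    return accepted_before, accepted_after, contents
```

Calling demo() simulates the signal by sending it to the process itself;
the point is only that writes stop and buffers are flushed while the
supply is still holding up.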

Qualifying digital circuitry with a POWER GOOD signal is very similar to
protecting the circuitry with a typical 'SCR overvoltage crowbar
circuit': it makes the engineer feel good - but it doesn't actually do
much of anything.

Why doesn't the crowbar work? After all, it is a textbook circuit. The
answer is that the SCR is a power device which takes on the order of 10
microseconds to turn on, while the delicate chips are destroyed by a few
nanoseconds of overvoltage. The result is that the SCR never turns on -
the fuse blows because the weakest digital chip shorts the power supply

Vipin Malik | 7 Dec 1999 00:22

[Fwd: [Fwd: Flash reliability]]

Bjorn Eriksson wrote:
> 
>  I found Bob's comment right on target; The "20K+ power cycles"-test
> described is interesting but not really applicable to real-world<tm>
> embedded systems <snip>

Sorry if I gave someone the wrong impression. The problems I was having
with the flash disks were not once in 20K cycles. I did a TOTAL of about
20K+ cycles for ALL my tests.

The problems I was seeing were occurring about once in 250 cycles in the
case of M-Systems IDE2000 flash IDE drives and about 1 in 5(!) (power
down) cycles for Hitachi and SanDisk CompactFlash!!!!

For me (and hopefully a lot more embedded designers out there) any file
system integrity problem that happens once in 250 power cycles is
serious enough. Imagine 10K or 20K of these devices out in the field.
Just by probability and statistics one may encounter a couple of these
problems per week!
Not acceptable, and pretty "real world" for me.
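
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python. The 1-in-250 failure rate and the 10K/20K fleet sizes are the
figures above; the number of unexpected power cycles per device per week
is purely an assumption, and the function names are made up for
illustration.

```python
def expected_failures_per_week(devices, failure_prob_per_cycle,
                               cycles_per_device_per_week):
    """Expected number of file system failures across the fleet in one week."""
    return devices * failure_prob_per_cycle * cycles_per_device_per_week


def prob_device_survives(failure_prob_per_cycle, cycles):
    """Probability that one device gets through `cycles` power cycles unharmed."""
    return (1.0 - failure_prob_per_cycle) ** cycles


P_FAIL = 1 / 250  # observed rate for the IDE flash drives

# ASSUMPTION: each unit loses power unexpectedly once every 20 weeks.
CYCLES_PER_WEEK = 0.05

for fleet in (10_000, 20_000):
    n = expected_failures_per_week(fleet, P_FAIL, CYCLES_PER_WEEK)
    print(f"{fleet} devices: about {n:.1f} failures per week")

survive = prob_device_survives(P_FAIL, 250)
print(f"chance one device survives 250 power cycles: {survive:.2f}")
```

Even with that fairly gentle assumed power-cycle rate, the expected
count comes out at a few failures per week across the fleet, and the
result scales linearly with whatever rate applies in the field.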

>- they are not (IIRC) designed to withstand sudden
> power failure and Bob's analysis as to why the two flash technologies
> described fared worse than the magnetic media seems reasonable. I didn't
> see him say that "this problem cannot be managed" or "reboots/crashes are a
> way of life, get used to them", but then again, I didn't follow this very
> closely :-)
> 
>  Re: Linux latency problem you're describing - You're talking about a
> user-space process, right? Anyway, 'My' hardware designer says I've got

Vipin Malik | 7 Dec 1999 00:41

[Fwd: power down]

Bob Canup wrote:
> 
> The reason that I said that expecting anything to work during power down
> is wishful thinking is this: once the voltage to a digital chip goes
> below the minimum specification of the chip, the behavior of the chip
> becomes indeterminate.

That's why the stuff you need to protect during a power down (SRAM, say)
has its own backup battery, and writes to the SRAM are shut off as soon as
the system voltage falls below the operational threshold.

> 
> For example: the old Western Digital 1791 double density disk controller
> chip would sometimes glitch in such a way during power down that it
> would write to the floppy - you could see the floppy light blink when
> this happened.

Someone's buggy design does not mean that a better way does not exist.
Obviously the chip was buggy if it exhibited this behavior.

> 
> Unless chips are specifically designed to handle power down conditions
> this sort of thing happens.  For example - any competently designed
> Flash memory has to refuse to write if the voltage is below spec.

This is true. Flash chips will not initiate a write if power is not
within specs. So this helps design a system that CAN survive random
power downs.


Bob Canup | 7 Dec 1999 16:47

Re: [Fwd: power down]

Vipin Malik wrote:

> Bob Canup wrote:
> >

I don't think that you understand what we're trying to tell you. There is a
difference in philosophy.

If you are running a flash as a normal read-write imitation of a disk there
are severe time limitations as to how long the flash is going to work because
of the limit on write cycles which flash technology has. As has been pointed
out in an earlier post - one write a second will ruin a flash chip in a few
weeks - which is not very long for an embedded system to work.

Because of this limitation most of the people in this group who do design
with flash use it in a Write Rarely Read Mostly manner. The only time the
flash is written to is when there is a firmware upgrade. This is also the
manner in which flash chips are used on conventional PC motherboards - if you
lose power during a firmware upgrade - you are in trouble - nor do I see any
practical method of handling that problem.

If you are trying to use the flash in a data-logging application where the
file system has to be read-write to store data you are very quickly going
to run into the write cycle limitations of the technology. I don't think that
flash is the correct technology to use in such an application.

We use our DOC2000 in read-only mode - with things like /var in a volatile RAM
disk - we have found this to be a satisfactory way of doing things.

Now - as to the issue of a POWER GOOD signal. The inverse of a POWER GOOD

Oron Ogdan | 7 Dec 1999 17:36

RE: [Fwd: power down]

Bob Canup wrote :

>I don't think that you understand what we're trying to tell you. There is a
>difference in philosophy.
>
>If you are running a flash as a normal read-write imitation of a disk there
>are severe time limitations as to how long the flash is going to work because
>of the limit on write cycles which flash technology has. As has been pointed
>out in an earlier post - one write a second will ruin a flash chip in a few
>weeks - which is not very long for an embedded system to work.

This is actually not correct. With good wear leveling the calculations are
different. When you implement a correct wear leveling mechanism such as the
one DiskOnChip implements, writing again and again to sector number 1 will
spread the writes and erase cycles fairly evenly all over the media.
If you take for example an 8MB flash and write a page (512 bytes) every
second, you will get something like 512 years until the device starts to
wear out. An 8MB flash has 16384 pages and each supports a minimum of 1M
erase cycles. Of course there is some overhead of garbage collection and
writing and erasing pages when implementing flash management algorithms,
but still, even if we take the overhead as
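
The arithmetic above can be checked with a short Python sketch. It uses
the figures quoted in the message (8MB device, 512-byte pages, 1M erase
cycles, one page write per second) and assumes ideal wear leveling, so
the result is an upper bound. It also follows the message in treating
each 512-byte page as independently erasable, which is a simplification.
The function name and the overhead_factor parameter are illustrative
assumptions, not part of any DiskOnChip interface.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def lifetime_years(flash_bytes, page_bytes, erase_cycles, writes_per_second,
                   overhead_factor=1.0, wear_leveling=True):
    """Rough time until wear-out, in years.

    overhead_factor: physical writes per logical write (ASSUMPTION;
    1.0 means no garbage-collection overhead at all).
    """
    pages = flash_bytes // page_bytes
    # With ideal wear leveling every page shares the load; without it,
    # a single page absorbs every write.
    usable_pages = pages if wear_leveling else 1
    total_writes = usable_pages * erase_cycles
    seconds = total_writes / (writes_per_second * overhead_factor)
    return seconds / SECONDS_PER_YEAR


leveled = lifetime_years(8 * 1024 * 1024, 512, 1_000_000, 1)
unleveled = lifetime_years(8 * 1024 * 1024, 512, 1_000_000, 1,
                           wear_leveling=False)
print(f"with wear leveling:    about {leveled:.0f} years")
print(f"without wear leveling: about {unleveled * 365:.1f} days")
```

Raising overhead_factor shortens the leveled figure proportionally, so
even a large management overhead leaves the lifetime far beyond the
no-leveling case.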

Bob Canup | 7 Dec 1999 19:54

Data Reliability

Is Vipin correct that there is a problem if there is a power loss when
writing to flash in embedded systems? Yes, no question.

There are 2 main ways of handling any problem. 1. Fix the problem once
it occurs. 2. Avoid the problem in the first place.

I have already given two ways to avoid the problem: use the flash in
Read Only mode, or have a power supply which signals impending power
loss and holds itself up long enough to allow an ordered shut down.

It has been my experience that it is generally better to avoid problems
than to attempt to fix them after they have occurred. Sometimes this is
not possible. For example: hard drives have to have ECC built into them
because it is virtually impossible to keep from having read errors.

I don't see any practical ways of fixing the problem of trashed flash
chips once it occurs. Does anyone else have any other suggestions as to
how to either avoid the problem or fix it once it happens?

Bjorn Eriksson | 7 Dec 1999 20:43

RE: Data Reliability


>I have already given two ways to avoid the problem: use the flash in
>Read Only mode, or have a power supply which signals impending power
>loss and holds itself up long enough to allow an ordered shut down.

Definitely my choice. Anyone else working on this?

//Björnen.

Jon Burford | 7 Dec 1999 21:36

Re: [Fwd: power down]

I am actually extremely interested in this issue, although I am not very
qualified to present possible solutions.  I am primarily a systems and
software guy and have been constructing an embedded Linux system which boots
off an M-Systems DOC2000 and runs mostly out of a RAM disk.  The board I am
using has a watchdog timer which could spuriously reset the board (just like
hitting the reset button on your PC).  Power failures are also a reality I
must deal with.  I must at least make an attempt to guarantee that the
system will always come back up (the damaged DOC2000 filesystem will be
repaired by e2fsck upon subsequent boot up).  To give you an idea of
what/when I am doing flash writes, I am running postgres whose db files are
in flash and am doing about one 20-100 byte record insert per minute (on
average).  The log files in /var/log/* are also in flash.  There are no
custom apps which write often to syslog and I am not running mail (although
I am running Apache, for which I could turn off logging but haven't yet).
I mount the DOC2000 on /usr, but write only to the logs and db files (I have
'chattr +i' on all other files in /usr).  What I would like to get an opinion
on is:

1) What is the probability that e2fsck will not be able to repair the
filesystem?
2) What is the probability that I will damage the boot sector and LILO will
not be able to begin to boot at all?
3) Since I use a pretty standard 5/12 V switching power supply and embedded
PC board (a 40W compact version of a standard PC power supply w/o fan), do I
have any hope of making HW or SW changes to possibly reduce or fix this
problem?

Any suggestions or insight much appreciated.

Regards,

Bob Canup | 7 Dec 1999 22:19

Re:Power Down

Watchdogs are generally there to catch the problem of a runaway
machine - this ought to be a very rare occurrence.

According to Vipin's statistics about 1 in 250 random power failures
during writes to a DOC2000 results in a bad sector on the device. Since
you are required to run the chip in RW mode the only way I see to avoid
the problem is a UPS on the front end - with signaling to indicate power
failure so that an ordered shutdown could occur.

As for the problem of a bad sector which he discussed, I have not seen
any solution other than the erase and start over one he originally came
up with - which, for the reasons he discussed, is unacceptable.

The first step toward solving a problem is understanding exactly what
the problem is. My theory is that if you interrupt a sector write while
it is in progress the data and the error checking code don't match -
thus you get a bad sector. Any other theories?
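
That theory can be illustrated with a small Python simulation. It is
only a sketch: CRC32 stands in for whatever error checking code the
device really uses, the sector layout (512 data bytes followed by a
4-byte check field) is an assumption, and the function names are made
up.

```python
import zlib

SECTOR_DATA = 512  # data bytes per sector; the check field follows them


def make_sector(data):
    """Return data followed by its 4-byte CRC32 (stand-in for the real ECC)."""
    assert len(data) == SECTOR_DATA
    return data + zlib.crc32(data).to_bytes(4, "big")


def sector_ok(sector):
    """True if the stored check field matches the stored data."""
    data, check = sector[:SECTOR_DATA], sector[SECTOR_DATA:]
    return zlib.crc32(data).to_bytes(4, "big") == check


def interrupted_write(old_sector, new_sector, bytes_written):
    """Simulate power loss after only `bytes_written` bytes reached the medium."""
    return new_sector[:bytes_written] + old_sector[bytes_written:]


old = make_sector(bytes([0x11]) * SECTOR_DATA)
new = make_sector(bytes([0x22]) * SECTOR_DATA)

torn = interrupted_write(old, new, 100)  # power fails 100 bytes in
print("old sector ok: ", sector_ok(old))
print("new sector ok: ", sector_ok(new))
print("torn sector ok:", sector_ok(torn))
```

In this model the torn sector holds a mix of new and old data with the
old check field, so it no longer verifies, which is what a reader of the
device would then report as a bad sector.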
