Discussion:
ZFS really fragile? ZFS problem
Chad Leigh -- Shire.Net LLC
2006-08-30 18:04:42 UTC
Permalink
Hi

Sol 10 U2.

I have a test box that has a single raid 6 device that is used as the
source device for a ZFS pool called "local". As of now there is no
important data in it but what happened worries me.

First, I get this message on boot and when I do a "zpool status -x"
command. I seem to be hosed.

# zpool status -x
pool: local
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
see: http://www.sun.com/msg/ZFS-8000-CS
scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
local       FAULTED      0     0     6  corrupted data
  c1t1d0    ONLINE       0     0     6
#


How I got here:

The machine was idle. I did a

# shutdown -i 5 -y -g 1

This completely shuts down and powers off (is there a better way to
do this?)

The ZFS pool was not being used for anything and had not been used
for quite some time when I did this. According to Areca tech
support (the underlying RAID device used as the pool source device is
an Areca RAID), Solaris has a bug in that it does not call the driver
flush routines to get the device drivers to flush when you do this
sort of shutdown (reboot and some others do). The RAID device has
a battery backup so it should be ok on reboot. But I had the case
open to check on a BMC/remote console device (Tyan) that is not
working right and I accidentally removed the battery backup cable. So
any pending data on the raid controller was lost.

That was all that happened. What I am concerned about is that the
ZFS meta data is so fragile that such a simple process could
permanently destroy the whole pool. With UFS we could do an fsck
which would probably fix whatever the underlying problems here were.
With ZFS we seem to have fragile meta data that is easily corrupted.
This makes ZFS unusable for production use if it is so easily
corrupted. ZFS does not keep copies of its meta data or have any
other way to fight such a simple corruption? The machine had been
idle for a few days and the ZFS pool should have had 0 activity on
it. This is a machine still being tested before it goes into
production.

Thanks
Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Please check the Links page before posting:
http://groups.yahoo.com/group/solarisx86/links
Post message: ***@yahoogroups.com
UNSUBSCRIBE: solarisx86-***@yahoogroups.com
John D Groenveld
2006-08-30 19:00:59 UTC
Permalink
In message <607BBE99-ED5E-4DFA-AA44-***@shire.net>, "Chad Leigh --
Shire.Net LLC" writes:
>The ZFS pool was not being used for anything and had not been used
>for quite some time when I did this. According to the Areca tech
>support (the underlying RAID device used as the pool source device is
>an Areca Raid), Solaris has a bug in that it does not call the driver

Solaris 10 System Administrator Collection >>
Solaris ZFS Administration Guide
<URL:http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qrh?a=view>
| As described in ZFS Pooled Storage, ZFS eliminates the need for a
| separate volume manager. ZFS operates on raw devices, so it is
| possible to create a storage pool comprised of logical volumes, either
| software or hardware. This configuration is not recommended, as ZFS
| works best when it uses raw physical devices. Using logical volumes
| might sacrifice performance, reliability, or both, and should be
| avoided.

Can you configure the Areca to be a plain dumb HBA?
If not, then I think you want to stick to ufs(7FS).

John
***@acm.org



Chad Leigh -- Shire.Net LLC
2006-08-30 19:34:26 UTC
Permalink
On Aug 30, 2006, at 1:00 PM, John D Groenveld wrote:

> In message <607BBE99-ED5E-4DFA-AA44-***@shire.net>, "Chad Leigh --
> Shire.Net LLC" writes:
>> The ZFS pool was not being used for anything and had not been used
>> for quite some time when I did this. According to the Areca tech
>> support (the underlying RAID device used as the pool source device is
>> an Areca Raid), Solaris has a bug in that it does not call the driver
>
> Solaris 10 System Administrator Collection >>
> Solaris ZFS Administration Guide
> <URL:http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qrh?a=view>
> | As described in ZFS Pooled Storage, ZFS eliminates the need for a
> | separate volume manager. ZFS operates on raw devices, so it is
> | possible to create a storage pool comprised of logical volumes,
> either
> | software or hardware. This configuration is not recommended, as ZFS
> | works best when it uses raw physical devices. Using logical volumes
> | might sacrifice performance, reliability, or both, and should be
> | avoided.

This does not address the issue of fragile meta data.

>
> Can you configure the Areca to be a plain dumb HBA?

We are going to add in a second raid and do a mirror between them
with ZFS. Just need to get some more funds to finish that off later
this Fall.

However, this does not address the fragility of ZFS. The ZFS
pool was idle and had not had anything done against it for several
days; the system was shut down and brought back up, and the ZFS pool
was hosed.

best
Chad

> If not, then I think you want to stick to ufs(7FS).
>
> John
> ***@acm.org
>

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Rich Teer
2006-08-30 19:46:36 UTC
Permalink
On Wed, 30 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:

> However, this does not address the poor fragility of ZFS. The ZFS
> pool was idle and had not had anything done against it for several
> days and the system was shutdown and brought back up and the ZFS pool
> was hosed.

Just a guess, but is it possible that the array signalled to ZFS that
the meta data was correctly written to disk, when in fact it wasn't?
So ZFS carries on, thinking that the meta data is safe (flushed), when
in fact it isn't.
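Rich's guess can be illustrated with a toy sketch (hypothetical Python, not actual ZFS or driver code): if a controller acknowledges a flush before the data reaches stable storage, the filesystem's view of what is durable and the disk's actual contents diverge the moment the cache loses power.

```python
# Toy model of a controller cache that "lies" about flushes.
# Everything here is illustrative; no real ZFS or Areca behaviour is modeled.

class LyingController:
    def __init__(self):
        self.cache = {}   # battery-backed write cache (volatile without battery)
        self.disk = {}    # stable storage

    def write(self, block, data):
        self.cache[block] = data          # lands in cache only, not on disk

    def flush(self):
        # A well-behaved controller would drain the cache to disk here.
        # This one acknowledges immediately and plans to drain "later".
        return "ok"                        # filesystem now assumes durability

    def power_loss_without_battery(self):
        self.cache.clear()                 # pending writes vanish

ctl = LyingController()
ctl.write("uberblock", "new pool metadata")
assert ctl.flush() == "ok"                 # FS believes the metadata is safe
ctl.power_loss_without_battery()
print(ctl.disk.get("uberblock"))           # None: the "committed" write is gone
```

The filesystem kept its end of the bargain; the acknowledged write simply never existed on stable storage.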

BTW, to answer another question you posed: "init 5" is an easier way
of accomplishing a clean shutdown and power off. Much kinder to the
fingers, too!

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich


Chad Leigh -- Shire.Net LLC
2006-08-30 19:42:55 UTC
Permalink
On Aug 30, 2006, at 1:00 PM, John D Groenveld wrote:

> This configuration is not recommended, as ZFS
> | works best when it uses raw physical devices. Using logical volumes
> | might sacrifice performance, reliability, or both, and should be
> | avoided.

The Areca raid IS a raw physical device to the OS. It is not a
"logical volume" in any sense that the OS would care about.

It is just a really big physical volume that has its own redundancy
built in.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Al Hopper
2006-08-30 19:51:22 UTC
Permalink
On Wed, 30 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:

> Hi
>
> Sol 10 U2.
>
> I have a test box that has a single raid 6 device that is used as the
> source device for a ZFS pool called "local". As of now there is no
> important data in it but what happened worries me.
>
> First, I get this message on boot and when I do a "zpool status -x"
> command. I seem to be hosed.
>
> # zpool status -x
> pool: local
> state: FAULTED
> status: The pool metadata is corrupted and the pool cannot be opened.
> action: Destroy and re-create the pool from a backup source.
> see: http://www.sun.com/msg/ZFS-8000-CS
> scrub: none requested
> config:
>
> NAME STATE READ WRITE CKSUM
> local FAULTED 0 0 6 corrupted data
> c1t1d0 ONLINE 0 0 6
> #

The issue here is that ZFS does not have a fault-tolerant configuration -
like a ZFS mirror or raidz. You can run a zfs scrub on it and get to a
consistent state. You may or may not suffer data loss ... for the same
reason - it's not a redundant/fault-tolerant configuration.

>
> How I got here:
>
> The machine was idle. I did a
>
> # shutdown -i 5 -y -g 1
>
> This completely shuts down and powers off (is there a better way to
> do this?)
>
> The ZFS pool was not being used for anything and had not been used
> for quite some time when I did this. According to the Areca tech
> support (the underlying RAID device used as the pool source device is
> an Areca Raid), Solaris has a bug in that it does not call the driver
> flush routines to get the device drivers to flush when you do this
> sort of shutdown. (reboot and some others do). The RAID device has
> a battery backup so it should be ok on reboot. But I had the case
> open to check on a BMC/remote console device (Tyan) that is not
> working right and I accidently removed the battery backup cable. So
> any pending data on thh raid controller was lost.
>
> That was all that happened. What I am concerned about is that the
> ZFS meta data is so fragile that such a simple process could
> permanently destroy the whole pool. With UFS we could do an fsck

No - the ZFS meta data is not "fragile" - it guarantees that every write
will either be committed or not. Except, in this case, you've presented
it with a device (the Areca) that, when issued a sync (fsync, dsync) by
ZFS, "told" ZFS that the data was committed to stable storage, when, in
fact, it was probably only committed to the Areca's battery-backed memory.
Or, it ignored the sync command from ZFS and "told" ZFS that it was
committed to stable storage.

> which would probably fix whatever the underlying problems here were.
> With ZFS we seem to have fragile meta data that is easily corrupted.
> This makes ZFS unusable for production use if it is so easily
> corrupted. ZFS does not keep copies of its meta data or have any

Your conclusion is incorrect - for the reasons outlined above.

> other way to fight such a simple corruption? The machine had been
> idle for a few days and the ZFS pool should have had 0 activity on
> it. This is a machine still being tested before it goes into
> production.

ZFS can't do anything to protect your data if it is not configured with
fault-tolerant hardware or if the attached hardware does not honor sync
requests. This is why you see zfs being used with directly attached disk
drives.

ZFS, in its current release, has its issues - but your conclusions are
fatally flawed.

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


John D Groenveld
2006-08-30 20:01:02 UTC
Permalink
In message <***@logical.logical-approach.com>, Al
Hopper writes:
>No - the ZFS meta data is not "fragile" - it guarantees that every write
>will either be committed or not. Except, in this case, you've presented
>it with a device (the Areca) that, when issued a sync (fsync, dsync) by
>ZFS, "told" ZFS that the data was committed to stable storage, when, in
>fact, it was probably only committed to the Areca's battery-backed memory.
>Or, it ignored the sync command from ZFS and "told" ZFS that it was
>committed to stable storage.

Assuming it has a JBOD mode, does the Areca do the right thing?

John
***@acm.org



Chad Leigh -- Shire.Net LLC
2006-08-30 20:38:03 UTC
Permalink
On Aug 30, 2006, at 1:51 PM, Al Hopper wrote:

> On Wed, 30 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:
>
>> Hi
>>
>> Sol 10 U2.
>>
>> I have a test box that has a single raid 6 device that is used as the
>> source device for a ZFS pool called "local". As of now there is no
>> important data in it but what happened worries me.
>>
>> First, I get this message on boot and when I do a "zpool status -x"
>> command. I seem to be hosed.
>>
>> # zpool status -x
>> pool: local
>> state: FAULTED
>> status: The pool metadata is corrupted and the pool cannot be opened.
>> action: Destroy and re-create the pool from a backup source.
>> see: http://www.sun.com/msg/ZFS-8000-CS
>> scrub: none requested
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> local FAULTED 0 0 6 corrupted data
>> c1t1d0 ONLINE 0 0 6
>> #
>
> This issue here is that ZFS does not have a fault tolerant
> configuration -
> like a ZFS mirror or raidz. You can run a zfs scrub on it and get
> to a
> consistant state.

# zpool scrub local
cannot scrub 'local': pool is currently unavailable
#

UFS keeps redundant super blocks and fsck can try and repair the file
system. ZFS can't seem to handle that.

> You may or may not suffer data loss ...

I am not worried about data loss in this case. That risk I can
handle. It is ZFS itself going South that concerns me.

> for the same
> reason - it's not a redundant/fault-tolerant configuration.
>
>>
>> How I got here:
>>
>> The machine was idle. I did a
>>
>> # shutdown -i 5 -y -g 1
>>
>> This completely shuts down and powers off (is there a better way to
>> do this?)
>>
>> The ZFS pool was not being used for anything and had not been used
>> for quite some time when I did this. According to the Areca tech
>> support (the underlying RAID device used as the pool source device is
>> an Areca Raid), Solaris has a bug in that it does not call the driver
>> flush routines to get the device drivers to flush when you do this
>> sort of shutdown. (reboot and some others do). The RAID device has
>> a battery backup so it should be ok on reboot. But I had the case
>> open to check on a BMC/remote console device (Tyan) that is not
>> working right and I accidently removed the battery backup cable. So
>> any pending data on thh raid controller was lost.
>>
>> That was all that happened. What I am concerned about is that the
>> ZFS meta data is so fragile that such a simple process could
>> permanently destroy the whole pool. With UFS we could do an fsck
>
> No - the ZFS meta data is not "fragile" - it guarantees that every
> write
> will be either committed or not. Except, in this case, you've
> presented
> it with a device (the Acrea) that when issued a sync (fsync, dsync) by
> ZFS, "told" ZFS that the data was committed to stable storage,
> when, in
> fact, it was probably only committed to the Acreas' battery backed
> memory.
> Or, it ignored the sync command from ZFS and "told" ZFS that it was
> committed to stable storage.

The same issue exists with JBOD. There are disks out there known
(and probably many more unknown) to lie about the state of the onboard
cache. ZFS *is* fragile if there is no good way to recover from this
except to restore the whole pool from backup.

In this case, the issue is that the OS did not signal a shutdown to
the disk driver to allow it to flush (i.e. a Solaris bug unrelated to
ZFS). Because I have the optional battery for the cache, this
wouldn't have been a problem except for my fat fingers.

My issue is that ZFS is fragile and cannot easily recover from
errors. You may have some data lost etc but the whole pool should
not be bad due to this -- ZFS should try and be able to recover. It
can't.

And why there was anything in the cache is beyond me. The ZFS pool
had literally had NO activity for several days. Most OSes flush
things more regularly than that and the Areca writes to the disk
faster than that as well.

>
>> which would probably fix whatever the underlying problems here were.
>> With ZFS we seem to have fragile meta data that is easily corrupted.
>> This makes ZFS unusable for production use if it is so easily
>> corrupted. ZFS does not keep copies of its meta data or have any
>
> Your conclusion is incorrect - for the reasons outlined above.

I still think it is correct. ZFS should have a recovery mode and
more robust meta data, or meta data handling that allows a recovery
to take place. I am not asking for 100% of my data to be ok. I am
asking for it to be able to recover itself so the whole pool is not
destroyed. I seriously doubt that the meta data for all the files in
the pool was corrupted, and if it was, then ZFS has bad strategies
for vital meta data storage or for how it uses its meta data.

>
>> other way to fight such a simple corruption? The machine had been
>> idle for a few days and the ZFS pool should have had 0 activity on
>> it. This is a machine still being tested before it goes into
>> production.
>
> ZFS can't do anything to protect your data

I am not asking it to protect my data in this case. I am asking it to
protect its own data a bit better. Vital pieces of meta data should
be replicated in the pool if they are vital to the complete pool's
ability to live, much like superblocks are written all over a file
system by UFS. If a small piece of meta data affecting a couple of
files is corrupted, I should not have to lose the whole pool. So by
inference, the meta data that became corrupted was something vital to
the whole pool (or ZFS has design issues that basically mean that any
sort of meta data corruption will kill the whole pool, which makes it
fragile), and ZFS should have ways of dealing with that central vital
meta data.

> if is not configured with
> fault tolerant hardware or if the attached hardware does not honor
> sync
> requests.

Or if the OS does not issue flushes.

> This is why you see zfs being used with directly attached disk
> drives.

They have the same issues. Directly attached disk drives all behave
differently and many of them do not honor cache flush requests or
lie about the state of writes.

>
> ZFS, in its current release, has its issues - but your conclusions are
> fatally flawed.

If they are, you have not yet pointed them out.

I am concerned that the whole pool is wiped out by some meta data
corruption instead of a more local problem with a few files, etc.

ZFS *IS* Fragile

What happened to me should not have happened. The issue should have
been more localized and the ZFS pool itself should have been able to
recover itself and then flagged any user files etc inside the pool
that had been affected as possibly corrupted so the user could
restore them from backup.

best regards
Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Tao Chen
2006-08-30 20:14:52 UTC
Permalink
On 8/30/06, Chad Leigh -- Shire.Net LLC <***@shire.net> wrote:
> According to the Areca tech
> support (the underlying RAID device used as the pool source device is
> an Areca Raid), Solaris has a bug in that it does not call the driver
> flush routines to get the device drivers to flush when you do this
> sort of shutdown. (reboot and some others do). The RAID device has
> a battery backup so it should be ok on reboot. But I had the case
> open to check on a BMC/remote console device (Tyan) that is not
> working right and I accidently removed the battery backup cable. So
> any pending data on thh raid controller was lost.
>

Do they have a bug id? This behaviour is filesystem-dependent, so
"Solaris has a bug" doesn't mean it applies to zfs in this case. It
might be UFS-specific.
From various discussions, zfs does flush the write cache whenever
necessary, more so than UFS. It's up to the storage then to handle the
flush request correctly.

"Accidentally removed the battery backup cable" could prevent the flush
from passing through to the physical disks.

> That was all that happened. What I am concerned about is that the
> ZFS meta data is so fragile that such a simple process could
> permanently destroy the whole pool.

It is strange.
I believe at least in Update 2, zfs uses so called "Ditto block" to
replicate meta-data.
See here:
http://blogs.sun.com/bill/entry/ditto_blocks_the_amazing_tape

"We use ditto blocks to ensure that the more "important" a filesystem
block is (the closer to the root of the tree), the more replicated it
becomes. Our current policy is that we store one DVA for user data,
two DVAs for filesystem metadata, and three DVAs for metadata that's
global across all filesystems in the storage pool."

So I wouldn't expect the meta-data to be so easily corrupted.
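The quoted policy can be sketched in a few lines (illustrative only; the copy counts come from the blog post above, not from ZFS source, and the function names are made up):

```python
# Toy sketch of the "ditto block" replication policy: more copies for
# blocks closer to the root of the tree. Not real ZFS code.

def ditto_copies(block_kind):
    # Per the quoted policy: one DVA (copy) for user data, two for
    # filesystem metadata, three for pool-wide metadata.
    return {"user_data": 1,
            "fs_metadata": 2,
            "pool_metadata": 3}[block_kind]

def place_block(block_kind, data, device):
    # Each copy would normally be placed at a different physical location.
    # On a pool built from one underlying "disk" (like a single hardware
    # RAID volume), every copy still shares that one device's fate.
    return [(device, data)] * ditto_copies(block_kind)

print(len(place_block("pool_metadata", b"...", "c1t1d0")))  # 3
```

Which also hints at what may have happened here: all the copies sat behind the same Areca cache, so losing that cache could take out every ditto copy of the same in-flight transaction at once.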

> With UFS we could do an fsck
> which would probably fix whatever the underlying problems here were.

Not if all copies of super blocks are corrupted.

> With ZFS we seem to have fragile meta data that is easily corrupted.
> This makes ZFS unusable for production use if it is so easily
> corrupted. ZFS does not keep copies of its meta data or have any
> other way to fight such a simple corruption? The machine had been
> idle for a few days and the ZFS pool should have had 0 activity on
> it. This is a machine still being tested before it goes into
> production.

If you can reproduce the problem (with or without the NVRAM), please
report it here:
http://www.opensolaris.org/jive/forum.jspa?forumID=80
Even if it is not a bug (just bad luck), some might re-think whether
the current replication policy is sufficient.
This is the first time I have heard of a simple shutdown causing zfs
meta-data corruption, although unplugging the NVRAM is not what people
usually do.

Tao


John D Groenveld
2006-08-30 20:41:21 UTC
Permalink
In message <***@mail.gmail.com>, "Tao
Chen" writes:
>Do they have a bug id? This behaviour is filesystem-dependent, so
>"Solaris has a bug" doesn't mean it applies to zfs in this case. It
>might be UFS-specific.
>From various discussion, zfs does flush the write cache whenever
>necessary, more so than UFS. It's up to the storage then to handle the
>flush request correctly.

If the Areca doesn't actually flush its write cache when asked, then
my suggestion to revert back to UFS is a bad idea.[tm]

Do lsimega(7D) and the LSI RAID controllers exhibit this problem?

John
***@acm.org



Timothy Kennedy
2006-08-31 16:05:15 UTC
Permalink
Part of the problem with having a RAID system underneath ZFS, and not
letting ZFS handle the RAID, is that the Areca is trying to
intercept the writes, and is using its own algorithms to calculate
whether the RAID is still in sync.

If all you did is reboot, and it came up corrupted, then I would
think that that is definitely a problem with the RAID,
since ZFS by its own nature
ensures that data is *always* in a consistent state on disk. For
it to be corrupt on a reboot, the only way I can see that happening is
when you have a middleman intercepting the disk write requests and
reporting back to zfs as the hardware "Yep, we wrote it". Thus, ZFS
has satisfied its requirement that the data on the disk is
consistent, to the best of its knowledge.

So when you abstract the real hardware from ZFS, then yes, ZFS
apparently can be fragile. Possibly even more fragile than a
filesystem with a journal that can be replayed, since ZFS doesn't
expect to have to replay a journal, as "all data on disk is always
consistent."

I've pulled the power out on several disk enclosures that have ZFS
filesystems, sometimes only on one disk, sometimes with pools spread
across disks, and I've never had bad data. Granted, that's only on
USB or firewire Disks, but still never bad data.

I did recently lose a 225GB UFS filesystem, with all the superblock
copies corrupted, that required a reformat and caused a 10 hour
service outage while we restored from backup, though.

-Tim

--
There are 10 types of people on this planet. Those who understand
binary, and those who don't.


Chad Leigh -- Shire.Net LLC
2006-08-31 17:00:06 UTC
Permalink
On Aug 31, 2006, at 10:05 AM, Timothy Kennedy wrote:

> Part of the problem with having a RAID system underneath ZFS, and not
> letting ZFS handling the RAID, is that the Areca is trying to
> intercept the writes,

No it is not. You seem to not know what a HW raid is. HW raid IS
the device and it is receiving the writes and performing them.

> and is using it's own algorithms to calculate
> whether the RAID is still in sync.

So? To the OS the RAID is a single device and there is only 1 write.

>
> If all you did is reboot, and it came up corrupted, then I would
> definitely think that that is definitely a problem with the RAID,

No, it is with ZFS because it cannot recover itself after disk
corruption. JBOD is not a panacea. There are several brands of
disks that lie about finishing disk writes, for example, or that have
bad caching algorithms.

> since ZFS by it's own nature
> ensures that data is *always* in a consistent state on the disk.

But if the disk suffers bitrot, then ZFS makes a bad assumption as it
cannot recover from it. It may have been written 100% but it did not
come back as written later on.

> For
> it to be corrupt on a reboot, the only way I can see that happening is
> when you have a middleman intercepting the disk write requests,

There is no middleman intercepting disk writes. The OS is writing to
the device, which is a raid device.

--

In this case, it had been several days since the ZFS pool had been
touched. The whole machine had been idle, with no processes beyond
basic OS processes running, for several DAYS. The OS/ZFS obviously
decided it needed to write something out, some status or something,
when the machine was shut down, and shut down before the physical disk
(in this case the raid) could finish it -- the vendor claims due to
the kernel not issuing a flush request on init 5 (which I am
investigating).

The point is, ZFS needs to be better able to recover from meta data
corruption, which can happen with JBOD just as easily as a RAID
device underneath it, due to driver errors, or disk firmware errors,
etc.

> and
> reporting back to zfs as the hardware "Yep, we wrote it".

Kind of like a plain disk reporting back, yep, we wrote it once it
enters the on-disk cache? This is well known to happen with many disks.

> Thus, ZFS
> has satisfied it's requirement that the data on the disk is
> consistent, to the best of it's knowledge.

But if something happens after it was written ZFS should be able to
recover. It is not enough to have written it correctly.

>
> So yea, when you abstract the real hardware from ZFS, then yes, ZFS
> apparently can be fragile. Possibly even more fragile than a
> filesystem with a journal that can be replayed, since ZFS doesn't
> expect to have to replay a journal, as "all data on disk is always
> consistent."

No it's not. Having written it consistently does not guard against a bit
error on reading, or degradation on disk.

>
> I've pulled the power out on several disk enclosures that have ZFS
> filesystems, sometimes only on one disk, sometimes with pools spread
> across disks, and I've never had bad data. Granted, that's only on
> USB or firewire Disks, but still never bad data.
>
> I did recently lose a 225GB UFS filesystem, with all the superblock
> copies corrupted, that required a reformat and caused a 10 hour
> service outage while we restored from backup, though.

In 10 years I have never had UFS completely go bad. I have had
corruptions but did not lose my entire file system -- with many bad
crashes and HW lock-ups.

best
Chad

>
> -TIm
>
> --
> There are 10 types of people on this planet. Those who understand
> binary, and those who don't.

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Al Hopper
2006-08-31 17:47:40 UTC
Permalink
On Thu, 31 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:

>
> On Aug 31, 2006, at 10:05 AM, Timothy Kennedy wrote:
>
> > Part of the problem with having a RAID system underneath ZFS, and not
> > letting ZFS handling the RAID, is that the Areca is trying to
> > intercept the writes,
>
> No it is not. You seem to not know what a HW raid is. HW raid IS
> the device and it is receiving the writes and performing them.
>
> > and is using it's own algorithms to calculate
> > whether the RAID is still in sync.
>
> So? To the OS the RAID is a single device and there is only 1 write.
>
> >
> > If all you did is reboot, and it came up corrupted, then I would
> > definitely think that that is definitely a problem with the RAID,
>
> No, it is with ZFS because it cannot recover itself after disk
> corruption. JBOD is not a panacea. There are several brands of
> disks that lie about finishing disk writes, for example, or that have
> bad caching algorithms.
>
> > since ZFS by it's own nature
> > ensures that data is *always* in a consistent state on the disk.
>
> But if the disk suffers bitrot, then ZFS makes a bad assumption as it
> cannot recover from it. It may have been written 100% but it did not
> come back as written later on.
>
> > For
> > it to be corrupt on a reboot, the only way I can see that happening is
> > when you have a middleman intercepting the disk write requests,
>
> There is no middleman intercepting disk writes. The OS is writing to
> the device, which is a raid device.

Yes - there are several "middlemen" and each one of them has the potential
to not function correctly - IOW - they can all have bugs.

1) When you write to your Areca, you may experience bugs in their driver,
running on the Solaris box. First middleman.

2) The Areca HW RAID is an embedded system which runs code. In this type
of system, we generally refer to the code as firmware. That firmware can
have bugs. 2nd middleman.

3) The Areca HW RAID firmware will probably, like most H/W RAID
controllers, write the data into battery-backed memory - then report
back to the driver that the data has been written to the "disk". But, in
fact, it has not yet been written to disk. It's in the H/W RAID's memory
and will probably be written to disk some time in the future. Now there
could be a bad memory SIMM in the Areca HW RAID controller. 3rd
middleman.

4) When ZFS issues a sync, it's telling the "disk", in this case an
emulated disk, to write the data to persistent storage and not come back
until it's committed to disk. The Areca HW RAID firmware will probably
come back immediately - after ensuring that the data has been written into
memory. This is how they achieve good benchmarks and win customers. So
now ZFS thinks that the data is committed to stable storage - when it's
not.

5) Additionally, H/W RAID controllers have been known to corrupt data, not
because the firmware was bad/buggy per se, but because the algorithms that
the RAID controller implemented were incorrect. For example,
consider if a H/W RAID controller finds a bad block in one of the disk
drives that is part of a RAID-5 volume and then "botches" the relocation
of the stripe that represents that data. That is not a firmware bug as
such - it's a correct implementation of a bad/incorrect algorithm. This is
simply another middleman that is handling your data.

When ZFS detects corrupt data, it immediately "knows" that the data is
corrupted - because everything is checksummed. It then recovers from that
corruption by using *another* copy of the data, one which has good
checksums. IOW - you must provide ZFS with a redundant storage mechanism -
either a mirror or a (zfs) raidz configuration. How can you expect zfs to
function correctly if you present it with a flawed/buggy/broken *single*
disk, or a single H/W RAID device that looks to zfs like a *single* disk?
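The detect-and-recover path described above can be sketched as a toy read routine (Python; sha256 stands in for ZFS's block checksums, and a two-entry list stands in for a two-way mirror - a simplification, not the real on-disk format):

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

class MirroredBlock:
    """Toy self-healing read: verify each copy against the stored
    checksum, return a good copy, and rewrite any copy that failed."""
    def __init__(self, data):
        self.expected = checksum(data)  # kept in the parent block pointer
        self.copies = [data, data]      # the two sides of a mirror

    def read(self):
        good = None
        for copy in self.copies:
            if checksum(copy) == self.expected:
                good = copy
                break
        if good is None:
            raise IOError("all copies corrupt - nothing to recover from")
        # heal: overwrite any side whose checksum does not match
        self.copies = [good if checksum(c) != self.expected else c
                       for c in self.copies]
        return good
```

Corrupt one side and the read still succeeds and repairs it; corrupt both (the single-disk case) and the read fails - which is the FAULTED pool from the start of this thread in miniature.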

IOW: If I make a storage pool from a single disk drive, sitting on a
tabletop and I take a screwdriver, pop the cover off that single disk
drive and jam it into the disk platter and corrupt the data, should I
expect ZFS to "recover" my data? No - I don't think so.

What if the "single disk drive" is a H/W RAID device and the screwdriver
is a bug in that H/W RAID firmware, should I expect ZFS to "recover" my
data? No - I don't think so.

If I take two disk drives and form a zfs mirror and I jam a screwdriver
into *one* of the disks, should I expect zfs to recover my data and mark
the damaged drive as "bad"? Yes. And that is exactly what ZFS does.

Now if I form a zfs mirror using a bunch of disk drives configured in a
raidz pool as one side of a mirror, and a H/W RAID device of equal size as
the other side of the mirror, and I suffer a bug/failure/RAM corruption or
bad algorithm in the H/W RAID system, zfs will figure out pretty
quickly that the H/W RAID "disk" is bad and offline it. You can then run
a zfs scrub and zfs will sync up the mirror by correcting the data on the
bad side of the mirror.

And if you read the zfs blogs, you'll see several people have already
simulated this by using a dd command to write zeros onto one side of a zfs
mirror and zfs will correctly figure out which disk is corrupted and can
rebuild the pool via a zfs scrub. But it was *never* designed to do this
for a single disk. You *must* provide a redundant storage mechanism -
which is usually provided by multiple disk drives.
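The dd experiment mentioned above amounts to the following toy scrub loop (Python; hashes stand in for ZFS's checksums and each "side" is just a list of blocks - a sketch of the idea, not the real algorithm):

```python
import hashlib

def cksum(block):
    return hashlib.sha256(block).hexdigest()

def scrub(sides, checksums):
    """Toy scrub over a mirror: for every block, find a side that matches
    the stored checksum and rewrite the sides that don't. Returns the
    number of block copies repaired."""
    repaired = 0
    for i, expected in enumerate(checksums):
        good = None
        for side in sides:
            if cksum(side[i]) == expected:
                good = side[i]
                break
        if good is None:
            raise IOError(f"block {i}: no valid copy on any side")
        for side in sides:
            if cksum(side[i]) != expected:
                side[i] = good
                repaired += 1
    return repaired

data = [b"block-0", b"block-1", b"block-2"]
sums = [cksum(b) for b in data]
side_a = list(data)
side_b = [b"\x00" * 7 for _ in data]   # one side overwritten with zeros
```

Running `scrub([side_a, side_b], sums)` repairs all three blocks and leaves the zeroed side identical to the good side; zero *both* sides and the scrub raises - the single-"disk" failure again.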

If you want to see what zfs can do, form a mirror with a zfs raidz pool
comprised of inexpensive, well cooled SATA drives as one side of the
mirror and your Areca H/W RAID controller as the other side of the mirror
and see where the problems really are. My WAG is that you'll find that
the Areca system is the source of your issues.

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Please check the Links page before posting:
http://groups.yahoo.com/group/solarisx86/links
Post message: ***@yahoogroups.com
UNSUBSCRIBE: solarisx86-***@yahoogroups.com
Chad Leigh -- Shire.Net LLC
2006-08-31 18:18:05 UTC
Permalink
On Aug 31, 2006, at 11:47 AM, Al Hopper wrote:

> Yes - there are several "middlemen" and each one of them has the
> potential
> to not function correctly - IOW - they can all have bugs.
>
> 1) When you write to your Areca, you may experience bugs in their
> driver,
> running on the Solaris box. First middleman.
>
> 2) The Areca HW RAID is an embedded system which runs code. In
> this type
> of system, we generally refer to the code as firmware. That
> firmware can
> have bugs. 2nd middleman.
>
> 3) The Areca HW RAID firmware will probably, like most H/W RAID
> controllers, write the data into battery-backed memory - then
> report
> back to the driver that the data has been written to the "disk".
> But, in
> fact, it has not yet been written to disk. It's in the H/W RAID's
> memory
> and will probably be written to disk some time in the future. Now
> there
> could be a bad memory SIMM in the Areca HW RAID controller. 3rd
> middleman.

These are not middlemen in the way implied. The implication is that
the OS is trying to write to the disks and that the controller card
is "intercepting" this somehow.

Also, JBOD has all the same "middlemen" above.

1) When you write to a JBOD controller/set of disks, you may
experience bugs in the driver controlling the controller and disks,
running on the Solaris box. First middleman.

2) The JBOD controller is an embedded system which runs code. In
this type of system, we generally refer to the code as firmware or
the chipset. That firmware/chipset can have bugs. 2nd middleman.

3) The JBOD controller and/or the onboard firmware of the actual
disks will probably write into cache memory on the controller or on
the disk and signal to the driver that the data has been written to
the "disk." But, in fact, it has not yet been written to the disk.
It's in the disk's onboard cache or perhaps, if existent, in the JBOD
controller's onboard cache, and will probably be written to disk some
time in the future. Now there could be a bad memory SIMM [DIMM] in
the JBOD controller or a bad cache RAM chip on the disk. 3rd middleman.

If you want to call those middlemen, the exact same middlemen exist
for JBOD controllers and disks.

The issue is that ZFS CANNOT assume that things are coherent and
uncorrupted. It has to be able to recover. Such assumptions are the
same sort of assumptions that get you in security trouble with buffer
overflows.

> 4) When ZFS issues a sync, it's telling the "disk", in this case an
> emulated disk, to write the data to persistent storage and don't
> come back
> until it's committed to disk. The Areca HW RAID firmware will probably
> come back immediately - after ensuring that the data has been
> written into
> memory. This is how they achieve good benchmarks and win
> customers. So
> now ZFS thinks that the data is committed to stable storage - when it's
> not.
>

The same thing can happen with JBOD and ondisk cache.

> 5) Additionally, H/W RAID controllers have been known to corrupt data,

Not on a regular basis or else they would go out of business.
Statistically speaking, a HW RAID controller is just as likely to
have a bad RAID algorithm as ZFS is to have a bad RAID algorithm.
Firmware is just software burned into a chip.

> not
> because the firmware was bad/buggy per se, but because the
> algorithms that
> the RAID controller implemented were incorrect. For example,
> consider if a H/W RAID controller finds a bad block in one of the disk
> drives that is part of a RAID-5 volume and then "botches" the
> relocation
> of the stripe that represents that data.

Most drives today, I believe, have on-disk bad block relocation and
could suffer from the same sort of bad algorithm.

> That is not a firmware bug as
> such - its a correct implementation of a bad/incorrect algorithm.
> This is
> simply another middleman that is handling your data.

No more different than with ZFS having a bad or incorrect algorithm,
which is just as likely, or for the firmware on a JBOD disk to screw up.

People keep trying to blame the underlying disk/controller for the
problem, when that is not correct. The underlying disk/controller in
ANY situation may have a problem and ZFS needs to be more resilient
to corruption. Let's say that in a JBOD situation, ZFS writes to a
disk and re-reads it and is sure it is correct. That disk suffers a
HW issue in its firmware (HW degradation) so that later on it starts
to have a few random bit errors in reading, or heaven forbid, the
data on disk is corrupted due to this HW fault developing. It
appears ZFS will kill your pool if these read errors happen as it
cannot deal with corrupted meta data. The data was confirmed to be
correct on disk when written, after all, which appears, from the
argumentation made here in this list (not just Al but in many
replies), to be a major assumption of ZFS.


best
Chad


---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Chad Leigh -- Shire.Net LLC
2006-08-31 18:32:09 UTC
Permalink
On Aug 31, 2006, at 12:18 PM, Chad Leigh -- Shire.Net LLC wrote:

>
>> 5) Additionally, H/W RAID controllers have been known to corrupt
>> data,
>
> Not on a regular basis or else they would go out of business.
> Statistically speaking, a HW RAID controller is just as likely to
> have a bad RAID algorithm as ZFS is to have a bad RAID algorithm.
> Firmware is just software burned into a chip.
>
>> not
>> because the firmware was bad/buggy per se, but because the
>> algorithms that
>> the RAID controller implemented were incorrect. For example,
>> consider if a H/W RAID controller finds a bad block in one of the
>> disk
>> drives that is part of a RAID-5 volume and then "botches" the
>> relocation
>> of the stripe that represents that data.
>
> Most drives today, I believe, have on disk bad block relocation and
> could suffer from the same sort of bad algorithm.
>
>> That is not a firmware bug as
>> such - its a correct implementation of a bad/incorrect algorithm.
>> This is
>> simply another middleman that is handling your data.
>
> No more different than with ZFS having a bad or incorrect algorithm,
> which is just as likely, or for the firmware on a JBOD disk to
> screw up.

In fact, high end HW RAID controllers should be a lot safer than most
JBOD controllers. High end RAID controllers get a lot of engineering
effort since they are expensive high end things (etc etc etc). Most
JBOD controllers are based on cheap commodity disk controller chips/
chipsets that often seem to have skipped the QA phase. I read on the
FreeBSD lists about the ongoing efforts to fix issues and bugs in the ATA/
SATA/IDE disk controller chips/chipsets through special cases in the
drivers. This is a LOT more common than reading about faulty
algorithms or problems with high end RAID controllers.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





C***@Sun.COM
2006-08-31 19:32:49 UTC
Permalink
>In fact, high end HW RAID controller should be a lot safer than most
>JBOD controllers. High end RAID controllers get a lot of engineering
>effort since they are expensive high end things (etc etc etc). Most
>JBOD controller are based on cheap commodity disk controller chips/
>chipsets that often seem to have skipped the QA phase. I read on the
>FreeBSD lists the ongoing efforts to fix issues and bugs in the ATA/
>SATA/IDE disk controller chips/chipsets through special cases in the
>drivers. This is a LOT more common than reading about faulty
>algorithms or problems with high end RAID controllers.


I don't think I could agree with that, on the basis of an
important principle:

- a RAID controller is a JBOD controller with added
stuff

So it *must* be more fragile.

There is a lot of commonality between a "HW" RAID and a JBOD.

They use the same disks (including firmware); they use the same
bus and similar controllers.

The HW RAID adds more hardware and software between the host and platters;
there is no denying that.

Because there's more stuff there are more bugs. With JBODs those bugs
move to the host OS. (But if you use the same OS + FS on the RAID you
have a larger software stack and therefore more bugs)

Casper


C***@Sun.COM
2006-08-31 18:39:35 UTC
Permalink
>The issue is that ZFS CANNOT assume that things are coherent and
>uncorrupted. It has to be able to recover. Such assumptions are the
>same sort of assumptions that get you in security trouble with buffer
>overflows.

But it does not make that assumption - no more, and in fact much less,
than other file systems do.

The particular device breaks a promise: the promise that when
it says data has been written, it has indeed been written.

The only thing that can happen in that case is data corruption.

The only way to recover from such data corruption is to duplicate
all data; and that means that you need to have more than one
device in a pool.

The fact that this device *always* and quickly gives this error
indicates that there is a serious bug somewhere. But I cannot
see why this is somehow ZFS's fault; no data duplication means
that you cannot recover.

Now it's easy to point the finger at ZFS as it tells you about the
failure, but I think this is not correct.

In all of this there seems to be an assumption that the OS somehow
"tells" a device that the system is going away; but this is not and
never has been the case. When a device returns "data has been written",
it MUST ensure that the data HAS BEEN written or WILL BE written.

There are several reasons for this:

- power can be removed by the OS at any moment
(shutdown/poweroff)

- power can be removed by the environment at any moment
(power/UPS failure)

ZFS guarantees no corruption in either case AS LONG AS THE DEVICE DOES
NOT LIE.

UFS will simply not detect the corruption nor will other filesystems
likely find the corruption.

Perhaps ZFS does a transaction late in shutdown which makes it much more
likely to trigger the event. And perhaps the only checksum failure
is on a timestamp (but we can't tell).

Casper


Chad Leigh -- Shire.Net LLC
2006-08-31 19:04:05 UTC
Permalink
On Aug 31, 2006, at 12:39 PM, ***@Sun.COM wrote:

>
>> The issue is that ZFS CANNOT assume that things are coherent and
>> uncorrupted. It has to be able to recover. Such assumptions are the
>> same sort of assumptions that get you in security trouble with buffer
>> overflows.
>
> But it does not assume so. Not any more and in fact much less
> so than other file systems.

Everyone is saying it does assume that. It writes it out and re-
reads it and assumes it is safely on disk. Let's say it is safely on
disk; it then should not assume the data will be read back
correctly. No lying necessary by the devices.

>
> The particular device breaks a promise: the promise that when
> it tells that data has been written it has indeed been written.

Defensive programming.

But, see my comment above. This is irrelevant. It is making the
assumption that once on the disk safely, it will come back off the
disk safely. THAT is the bad assumption.

>
> The only thing that can happen in that case is data corruption.
>
> The only way to recover from such data corruption is to duplicate
> all data;

No. Parity and the like can help you recover 100% (the premise behind
raidz and raid3/5). However, even if it cannot recover the corrupted
data, it should be able to try to recover the pool and flag whatever
files are affected as possibly bad. Punting and throwing out
the whole thing is not a good response.

> and that means that you need to have more than one
> device in a pool.

Duplication of vital meta data does not require more than one
device. It just requires more space on the device.

But there can also be other strategies to deal with corrupted data
than to just punt and say the pool is totally gone and inaccessible
(scrub and import don't work). There were 10GB of data in the pool
-- even if just one timestamp was corrupt, the whole pool was
booted. That is wrong. ZFS needs to be able to try and recover what
it can. Punting is not acceptable.

best
Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





C***@Sun.COM
2006-08-31 19:39:22 UTC
Permalink
>But, see my comment above. This is irrelevant. It is making the
>assumption that once on the disk safely, it will come back off the
>disk safely. THAT is the bad assumption.

Well, that's the assumption *ALL* filesystems make about underlying
disks.

Is that a bad assumption to make?

Clearly, because that's why we have mirroring, RAID5, RAIDZ etc.

But on a device without a mirror, is there anything you can do
but to make such an assumption?

>No. Parity and stuff can help you recover 100% (the premise behind
>raidz and raid3/5). However, even if it cannot recover the corrupted
>data, it should be able to try and recover the pool and flag whatever
>files are affected as being possible bad. Punting and throwing out
>the whole thing are not a good response.

Ah, so this is not about finding errors but about the order of
magnitude.

The failure mode of this device appears to be such that a recovery
like that cannot be attempted.

ZFS generally does recover (and gives EIO for the affected files, and
duplicates metadata such that files and filenames generally can be
located).

>Duplication of vital meta data does not require more than one
>device. It just requires more space on the device.

And ZFS does that.

>But there can also be other strategies to deal with corrupted data
>than to just punt and say the pool is totally gone and inaccessible
>(scrub and import don't work). There were 10GB of data in the pool
>-- even if just one timestamp was corrupt, the whole pool was
>booted. That is wrong. ZFS needs to be able to try and recover what
>it can. Punting is not acceptable.

When all you have is pointers pointing to corrupted blocks, what
can you do?

Casper


Chad Leigh -- Shire.Net LLC
2006-08-31 20:18:17 UTC
Permalink
On Aug 31, 2006, at 1:39 PM, ***@Sun.COM wrote:

>
>> But, see my comment above. This is irrelevant. It is making the
>> assumption that once on the disk safely, it will come back off the
>> disk safely. THAT is the bad assumption.
>
> Well, that's the assumption *ALL* filesytems make about underlying
> disks.
>
> Is that a bad assumption to make?
>
> Clearly, because that's why we have mirroring, RAID5, RAIDZ etc.
>
> But on a device without a mirror, is there anything you can do
> but to make such an assumption?

Yes, defensive programming -- instead of punting and killing the
whole FS, you attempt to recover and then mark those things that did
not recover as flawed.

>
>> No. Parity and stuff can help you recover 100% (the premise behind
>> raidz and raid3/5). However, even if it cannot recover the corrupted
>> data, it should be able to try and recover the pool and flag whatever
>> files are affected as being possible bad. Punting and throwing out
>> the whole thing are not a good response.
>
> Ah, so this is not about finding errors but about the order of
> magnitude.
>
> The failure mode of this device appears to be such that a recovery
> like that cannot be attempted.
>
> ZFS generally does recover (and gives EIO for the affected files and
> duplicates meta data such that files and filenames generally can be
> located)

It does not seem so here, and the pool had not been used in days, so
there was no big outstanding data to be written out -- just whatever ZFS
does on shutdown.

>
>> Duplication of vital meta data does not require more than one
>> device. It just requires more space on the device.
>
> And ZFS does that.

There seems to be a problem in the implementation then. In this case
the pool is marked such that I cannot find any way to do any recovery
with it. The only "use" of the pool in the several days before the event
happened was to shut down the system, so whatever data ZFS decides to
flush back to the pool on a shutdown. Then the whole pool was wiped
out, and there does not seem to be a way to try to recover anything
of it, though all the caches involved were much smaller than the total
of the data on the disk. If ZFS cached all the filenames themselves
this whole time and did not write them out until shutdown (I am not
saying it did -- but it fits the evidence) then that is a ZFS problem
for not writing them out. I would like to know what ZFS actually does
on system shutdown -- what sort of data it was writing out.

>
>> But there can also be other strategies to deal with corrupted data
>> than to just punt and say the pool is totally gone and inaccessible
>> (scrub and import don't work). There were 10GB of data in the pool
>> -- even if just one timestamp was corrupt, the whole pool was
>> booted. That is wrong. ZFS needs to be able to try and recover what
>> it can. Punting is not acceptable.
>
> When all you have is pointers pointing to corrupted blocks, what
> can you do?

The whole disk was not corrupted.

best
Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Tony Reeves
2006-08-31 22:12:53 UTC
Permalink
On 9/1/06, Chad Leigh -- Shire.Net LLC <***@shire.net> wrote:
> On Aug 31, 2006, at 1:39 PM, ***@Sun.COM wrote:
>
> >
> >> But, see my comment above. This is irrelevant. It is making the
> >> assumption that once on the disk safely, it will come back off the
> >> disk safely. THAT is the bad assumption.
> >
> > Well, that's the assumption *ALL* filesytems make about underlying
> > disks.
> >
> > Is that a bad assumption to make?
> >
> > Clearly, because that's why we have mirroring, RAID5, RAIDZ etc.
> >
> > But on a device without a mirror, is there anything you can do
> > but to make such an assumption?
>
>
> Yes, defensive programming -- instead of punting and killing the
> whole FS, you attempt to recover and then mark those things that did
> not recover as flawed.


I get the impression that you believe that ZFS should write duplicate
images of key meta data somewhere, but where? Duplicate it on the
disk system the original was written to perhaps, as ufs does. But if
the RAID system caches the duplicate meta data in the same way it
caches the original, then you are still screwed if the RAID cache is
not written to the spindle.

This is an age-old problem with memory-caching RAID systems: if the
memory corrupts data or the memory power fails before writeback, then
you are screwed. There is a compromise between speed and data
integrity, frequently not recognised because the RAID systems are
inherently quite reliable. Most times when memory-backed RAID reports
a write was successful it is lying, but 99.99999% of the time it gets
away with it.


Robert Milkowski
2006-08-31 22:50:44 UTC
Permalink
On Fri, 1 Sep 2006, Tony Reeves wrote:
> I get the impression that you believe that ZFS should write duplicate
> images of key meta data somewhere, but where? Duplicate it on the
> disk system the original was written to perhaps, as ufs does. But if
> the RAID system caches the duplicate meta data in the same way it
> caches the original, then you are still screwed if the RAID cache is
> not written to the spindle.
>

Actually that's what ZFS is doing - writing pool metadata in 3 copies,
filesystem metadata in 2 copies, and user data in one copy if there is no
redundancy in the pool. If there's redundancy, each copy is additionally
protected.

Now if you have a pool with only one disk, these so-called ditto blocks are
written to the same device (although some spread is "guaranteed"). If you
have more than one device, then these blocks use the 1st, 2nd and 3rd (in
the case of pool metadata) disks.
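A rough sketch of those copy counts (Python; the numbers come from the post above, and the round-robin placement is a deliberate simplification of the real spreading logic):

```python
# Copies kept per block class when the pool itself has no redundancy.
DITTO_COPIES = {"pool_metadata": 3, "fs_metadata": 2, "user_data": 1}

def place_copies(kind, vdevs):
    """Round-robin the ditto copies across the pool's devices. With a
    single device every copy lands on that device (merely at different
    offsets); with several devices the copies spread across them."""
    n = DITTO_COPIES[kind]
    return [vdevs[i % len(vdevs)] for i in range(n)]
```

So `place_copies("pool_metadata", ["c1t1d0"])` puts all three copies on the one disk - which is why a controller that drops a whole cache flush can still take out every copy at once.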


--
Robert Milkowski
***@rudy.mif.pg.gda.pl



Tao Chen
2006-09-01 00:21:45 UTC
Permalink
On 8/31/06, Robert Milkowski <***@rudy.mif.pg.gda.pl> wrote:
> On Fri, 1 Sep 2006, Tony Reeves wrote:
> > I get the impression that you believe that ZFS should write duplicate
> > images of key meta data somewhere, but where? Duplicate it on the
> > disk system the original was written to perhaps, as ufs does. But if
> > the RAID system caches the duplicate meta data in the same way it
> > caches the original, then you are still screwed if the RAID cache is
> > not written to the spindle.
> >
>
> Actually that's what ZFS is doing - writing pool meta data to 3 copies,
> file system meta data to 2 copies, and user data in one copy if no
> redundancy is in a pool. If there's redundancy each copy is additionally
> protected.
>
> Now if you have a pool with only one disk, these so-called ditto blocks are
> written to the same device (although some spread is "guaranteed"). If you
> have more than one device, then these blocks use the 1st, 2nd and 3rd (in
> the case of pool metadata) disks.
>

The ironic part is, with such aggressive caching in the storage,
copies of metadata on _all_ disks can be corrupted when you unplug the
NVRAM power.

Why do they get corrupted? It could be the "RAID-5/6 Write Hole", as
explained in Jeff's blog:
http://blogs.sun.com/bonwick/entry/raid_z

Tao


Tao Chen
2006-09-01 02:54:27 UTC
Permalink
On 8/31/06, Tao Chen <***@gmail.com> wrote:

> The ironic part is, with such aggressive caching in the storage,
> copies of metadata on _all_ disks can be corrupted when you unplug the
> NVRAM power.
>

Sorry, in this context each 'disk' is a HW-RAID with its own NVRAM, and
the chance that all NVRAMs are unpowered is very small :). So my comment
above is wrong.


John D Groenveld
2006-09-01 01:57:56 UTC
Permalink
In message <***@mail.gmail.com>, "Tony
Reeves" writes:
>This is an age old problem with memory caching RAID systems, if the
>memory corrupts data or the memory power fails before writeback then
>you are screwed. There is a compromise between speed and data
>integrity, frequently not recognised because the RAID systems are
>inherently quite reliable. Most times when memory backed RAID reports

How much faster is RAID5 on the Areca with its big fast volatile cache
(that may or may not get flushed to disk) and RAIDZ on a dumb JBOD
connected to a SATA/SAS HBA?

Will the Areca work without its ((dead?) battery backed) cache?

John
***@acm.org



Al Hopper
2006-09-02 17:41:48 UTC
Permalink
On Thu, 31 Aug 2006, John D Groenveld wrote:

> In message <***@mail.gmail.com>, "Tony
> Reeves" writes:
> >This is an age old problem with memory caching RAID systems, if the
> >memory corrupts data or the memory power fails before writeback then
> >you are screwed. There is a compromise between speed and data
> >integrity, frequently not recognised because the RAID systems are
> >inherently quite reliable. Most times when memory backed RAID reports
>
> How much faster is RAID5 on the Areca with its big fast volatile cache
> (that may or may not get flushed to disk) and RAIDZ on a dumb JBOD
> connected to a SATA/SAS HBA?

I have no experience with the Areca - but it looks like a great product at
a price/performance level that is shaking up the marketplace. Here's a
review to get some performance numbers:

http://www.xbitlabs.com/articles/storage/display/areca-arc1220.html

Re ZFS performance - it's a mixed bag right now (Update 2), depending on
how you're using it. Let me explain. There are some things that go
incredibly fast - you see zfs issuing over 1k IO Ops/Sec (IOPS) and you
can't believe the operation has already completed before you even get a
chance to evaluate how it's performing. And there are times when zfs won't
issue more than 200 to 300 IOPS and you're left scratching your head,
wondering why. And its behavior defies any attempt to predict how it'll
handle different usage scenarios.

That being said, a lot of silly bugs that escaped the initial release of
ZFS are already fixed and will be released in Update 3. And there were
some oversights in the code where the developers went duuhhhh[0] .... and
fixed them pretty quickly. And there are a bunch of changes that will
improve performance quite a bit.

But none of what I write here should discourage anyone from grabbing an
8-channel SATA card and a bunch of SATA drives and actually using zfs.
The admin model and usability will blow you away. You can create a > 1Tb
pool in under 10 Seconds and copy CDROM sized images (to a 5 disk raidz
pool) in under 3 Seconds. You have to change your thinking and start
creating filesystems where you would normally create directory entries.
And then you have the 3 most important features of zfs: snapshots,
snapshots and snapshots. :)

Also, looking forward, zfs is being integrated with Zones to create
features/facilities that will fundamentally change the way most
progressive[1] users deploy Solaris systems. The concept is to create a
snapshot of a zone, and then simply clone that snapshot when you want to
create your next zone[2]. In practical terms, the time it'll take to
create a "fat" zone will go from 12 to 20 minutes (depending on your
hardware) to probably (?? WAG) 60 Seconds.

You can already try this with Solaris Express or by building Opensolaris -
but it's not supported in the commercial release of Solaris. In fact,
building zones on top of ZFS is currently unsupported because patching
will cause *major* breakage[2.5].

Re: dedicated RAID5 hardware versus ZFS. Obviously, buying a $100 SATA
controller versus a $700 H/W RAID5 controller will leave you with more $s
to buy inexpensive SATA disk drives. The fundamental difference, IMHO, is
that the H/W RAID controller is a one shot purchase that will have a fixed
useful life - before it begins to feel too slow as the attached storage
continues to increase in size. OTOH, ZFS is at rev 1.0 and will continue
to grow, in terms of performance, reliability, features/facilities etc
over time and will take advantage of faster CPUs, more CPU cores and
larger system main memory capacity. It will also be integrated with
Solaris in clever/ingenious ways in the future.

Concluding remarks. Personally I hate it when people only do "good" news
and don't report the downside. So.. Q: Where is ZFS currently weakest?

- IMHO the "variability" in performance that you experience in Update 2 is
troublesome - versus the predictable performance we've come to expect of a
UFS based filesystem.[3] Try zfs with *your* application data in *your*
system environment first.

- ZFS does not "understand" how disk drives fail and what might be done to
work around the more common failure modes. I get the sense that team zfs
wants to examine the whole issue of how/what/why disk drives fail, and
then develop and implement a comprehensive strategy to deal with
failures. IOW, a clean-sheet-of-paper approach to disk drive
failures[3.5]. It does not help that failure data on SATA drives,
particularly the newer, monster drives on the market, is not widely
available ... yet.[4]

- the usability of the new ACL scheme leaves a lot to be desired.

- zfs is not "known" by the popular commercial backup tools - although
support is coming from some vendors. In contrast to this remark, there
was a recent discussion on the zfs-discuss list on opensolaris.org that
details how you can make a killer incremental backup facility using
snapshots and rsync that will rival the facilities available only in
high $ commercial backup products.

- zfs puts incredible pressure on Solaris virtual memory and appears
excessively greedy with memory usage. It won't give up any memory until
the system reaches the low-memory watermark. To be completely fair to
zfs: zfs and dtrace have put incredible pressure on the current Solaris
virtual memory implementation and the complete fix for zfs may not arrive
until more work has been put into the virtual memory management code.

- zfs needs the large virtual memory address space offered by a 64-bit
architecture. IOW: it does not perform as well on a 32-bit system.

- there is a required mindset shift when working with zfs. People fail to
understand that a 12-way raidz system is not a good idea, or that
configuring zpools from partial disks is also not a good idea. And that
zfs is a rev 1.0 release that does have deficiencies and that lacks the
stability/performance/polish etc. of a rev 5.0, rev 6.0 or rev N
filesystem. Or that putting all your disk storage "eggs" into a zfs
revision 1.0 filesystem "basket" is probably not a good idea. I'm not sure
what, if anything, Sun can do to help educate the (potential) user
community. It's a difficult problem to solve and people tend to get
really, really pissed when something goes wrong with a filesystem.

- building on the last point: ZFS works best with large numbers of
inexpensive disk drives. But people have become so accustomed to getting
by with the minimum # of disk drives and carefully managing that disk
space, that they fail to deploy zfs based systems that make sense. When
thinking of zfs based systems, please think of *large* numbers of disk
drives. The fact that the Sun x4500 includes 48 disk drives is your first
clue! Think in terms of 3-way or 4-way mirrors. Think about a 4 or 5-way
raidz[5] config with at least one spare drive. Think about a system
enclosure with between 4 and 10+ disk drive bays.
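As a concrete illustration of those layouts, the zpool commands look roughly like this (device names are hypothetical, and this is a sketch rather than a tested configuration; the spare support is the Update 3 feature mentioned in footnote [4]):

```shell
# 3-way mirror: any two of the three drives can die
zpool create tank mirror c1t0d0 c1t1d0 c1t2d0

# 5-way raidz plus a hot spare: one drive's worth of parity,
# one standby drive that resilvers in automatically on failure
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      spare c1t5d0

# raidz2, per footnote [5]: two drives' worth of parity
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
```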

Recommendation: If you're serious about using ZFS, put together a test box
with between 5 and 10 SATA drives and gain experience with it *before* you
put it into production. If you're concerned with performance and can wait
a little longer, don't put it into production before Update 3 ships.

[0] Of course, this has *never* happened to _me_ with _my_ code! :) Yeah
... right!!

[1] IOW - those who are *not* still running Solaris 8! :)

[2] This is already documented in the ZFS Administration Guide (Solaris
Express System Administrators Collection). Why is docs.sun.com so bloody
slow?!

[2.5] A really experienced Solaris admin can still figure out how to work
around these issues. A less experienced admin will probably lose several
zones and then discover that his/her system will not upgrade from Update 2
to Update 3 without major breakage.

[3] What do you expect from something that has had tens (possibly
hundreds) of man-years of development/tuning work?

[3.5] Just like zfs is a clean sheet of paper approach to Unix filesystems.

[4] In Update 3, you'll be able to define spares, and associate them with
one or more pools. But ZFS will still fundamentally "see" a disk drive as
good or bad - without any "rescue mode" logic.

[5] Or a 4-way or 5-way raidz2 system - which uses 2 drives for parity.

> Will the Areca work without its ((dead?) battery backed) cache?

Don't know. But don't try this at home! :)

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Please check the Links page before posting:
http://groups.yahoo.com/group/solarisx86/links
Post message: ***@yahoogroups.com
UNSUBSCRIBE: solarisx86-***@yahoogroups.com
rogerfujii
2006-09-04 11:55:27 UTC
Permalink
--- In ***@yahoogroups.com, Al Hopper <***@...> wrote:

> - there is a required mindset shift when working with zfs. People
fail to
> understand that a 12-way raidz system is not a good idea, or that
> configuring zpools from partial disks is also not a good idea. And that

I wouldn't say "not a good idea", but rather "less than optimal".
Obviously, if you are trying to optimize SPACE usage (as opposed
to performance), you might try a 12-way raidz system in a 12
drive system. Likewise, you might try partial disks right now,
since you still need svm for reliable root (ie, use the rest of
the non-root space for zfs so that transitioning to zfs WHEN
you can boot off it is a lot easier).
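The partial-disk arrangement described above might look like this (slice names are hypothetical; a sketch only, assuming the usual SVM/UFS root on slice 0):

```shell
# s0 stays under SVM/UFS for the bootable root;
# the large leftover slice goes to ZFS.
zpool create tank c0t0d0s7

# When ZFS root boot becomes supported, the bulk of the data is
# already in a pool and only root itself has to move.
```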

> - building on the last point: ZFS works best with large numbers of
> inexpensive disk drives. But people have become so accustomed to
> getting by with the minimum # of disk drives and carefully
> managing that disk space, that they fail to deploy zfs based
> systems that make sense. When thinking of zfs based systems,
> please think of *large* numbers of disk drives. The fact that
> the Sun x4500 includes 48 disk drives if your first clue!
> Think in terms of 3-way or 4-way mirrors. Think about a 4 or
> 5-way raidz[5] config with at least one spare drive. Think
> about a system enclosure with between 4 and 10+ disk drive bays.

actually, I think this is a little unfair to ZFS. There are lots
of advantages to zfs even with a small number of disks: snapshots,
easier administration, and expandability (much easier to grow a filesystem).

The big problem is that it's hard to figure out how to configure
zfs. Here are some very good blog entries that were most helpful:
http://blogs.sun.com/roch/entry/when_to_and_not_to
http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

(I know you've seen them (since they're in your post), but it's worth
repeating here, since it did take a little digging to find them).

The killer feature of zfs that is not obvious is that you can
change configurations on-the-fly (so long as you are not trying
to make things smaller). Try growing a partition on a slice
onto a new disk. It is *trivial* with zfs. Not having to deal
with svm alone would make zfs worthwhile.
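That on-the-fly growth is a one-liner (hypothetical device names; a sketch under the assumption the pool already exists):

```shell
# Add a new mirror pair to an existing pool; the extra space is
# usable immediately - no downtime, no svm metadevice juggling.
zpool add tank mirror c2t0d0 c2t1d0
zpool list tank    # shows the enlarged capacity
```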

-r






Chad Leigh -- Shire.Net LLC
2006-09-02 18:50:04 UTC
Permalink
On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:

>
> Will the Areca work without its ((dead?) battery backed) cache?

The cache memory is built into the controller. Some versions have a
removable (upgradable) DIMM while others have it onboard. The
battery module is an optional item and the card works fine without
it, barring OS and/or driver bugs

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Phillip Bruce
2006-09-02 19:07:38 UTC
Permalink
Chad Leigh -- Shire.Net LLC wrote:
> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>
>
>> Will the Areca work without its ((dead?) battery backed) cache?
>>
>
> The cache memory is built into the controller. Some versions have a
> removable (upgradable) DIMM while others have it onboard. The
> battery module is an optional item and the card works fine without
> it, barring OS and/or driver bugs
>
> Chad
>
> ---
> Chad Leigh -- Shire.Net LLC
> Your Web App and Email hosting provider
> chad at shire.net
>
What is the battery life? I've seen some that last anywhere from 24
hours to 72 hours only. Because of that, it is recommended that you put
the array and system on a UPS so you could have a longer life span, but
that is also limited by the UPS you may have.

Large arrays like EMC and HDS have backup batteries too, used for cache
purposes. So it is important to have power to critical systems at all
times. You may have protected your data for the short term, but if
you're in a situation like the tri-state power outages in the Northeast
of the US a few years ago, power was gone for more than 3 days. So make
sure your critical business systems are well protected.


Phillip


Chad Leigh -- Shire.Net LLC
2006-09-02 19:33:30 UTC
Permalink
On Sep 2, 2006, at 1:07 PM, Phillip Bruce wrote:

> Chad Leigh -- Shire.Net LLC wrote:
>> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>>
>>
>>> Will the Areca work without its ((dead?) battery backed) cache?
>>>
>>
>> The cache memory is built into the controller. Some versions have a
>> removable (upgradable) DIMM while others have it onboard. The
>> battery module is an optional item and the card works fine without
>> it, barring OS and/or driver bugs
>>
> What is the battery life? I've seen some that last anywhere from 24
> hours to 72 hours only. Because of
> that it is recommended that you put the array and system on UPS so you
> could have longer life span but
> that is limited also to the UPS you may have.
>
> Large Arrays like EMC, and HDS have backup battery and it too is used
> for cache purposes. So it is important
> to have power to critical system at all time. You may have protected
> your data for short term but if your in
> a situation like the tristate power outages in the North East of
> the US
> a few years ago, power was gone for
> more than 3 days. So make sure your systems are well protected for
> critical business systems.

Yes, about 60-80 hours. Our systems all have heavy duty industrial
UPS through the data center which has a couple of diesel generators
and a week's worth of oil on standby, and contracts to continually
supply oil as the case may arise.

Good points.

Chad


---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Chad Leigh -- Shire.Net LLC
2006-09-02 19:38:34 UTC
Permalink
On Sep 2, 2006, at 1:33 PM, Chad Leigh -- Shire.Net LLC wrote:

>
> On Sep 2, 2006, at 1:07 PM, Phillip Bruce wrote:
>
>> Chad Leigh -- Shire.Net LLC wrote:
>>> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>>>
>>>
>>>> Will the Areca work without its ((dead?) battery backed) cache?
>>>>
>>>
>>> The cache memory is built into the controller. Some versions
>>> have a
>>> removable (upgradable) DIMM while others have it onboard. The
>>> battery module is an optional item and the card works fine without
>>> it, barring OS and/or driver bugs
>>>
>> What is the battery life? I've seen some that last anywhere from 24
>> hours to 72 hours only. Because of
>> that it is recommended that you put the array and system on UPS so
>> you
>> could have longer life span but
>> that is limited also to the UPS you may have.
>>
>> Large Arrays like EMC, and HDS have backup battery and it too is used
>> for cache purposes. So it is important
>> to have power to critical system at all time. You may have protected
>> your data for short term but if your in
>> a situation like the tristate power outages in the North East of
>> the US
>> a few years ago, power was gone for
>> more than 3 days. So make sure your systems are well protected for
>> critical business systems.
>
> Yes, about 60-80 hours. Our systems all have heavy duty industrial
> UPS through the data center which has a couple of diesel generators
> and a weeks worth of oil on standby and contracts to continually
> supply oil as the case may arise

This should read: "Our systems all have heavy duty industrial UPS
through the data center, which also has a couple of diesel
generators..."

It was unclear to me after I posted, and one could assume I thought
UPS = diesel generator.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Tony Reeves
2006-09-02 20:11:24 UTC
Permalink
Chad Leigh -- Shire.Net LLC wrote:
>
>
>
> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>
> >
> > Will the Areca work without its ((dead?) battery backed) cache?
>
> The cache memory is built into the controller. Some versions have a
> removable (upgradable) DIMM while others have it onboard. The
> battery module is an optional item and the card works fine without
> it, barring OS and/or driver bugs
>

So what OS does not have a "bug" such that if the system is completely
powered off with the last sync (or equiv.) written to the RAID cache
only, not to disk, and then the RAID memory is lost or fails before the
system comes back up, the file system is inconsistent?


Chad Leigh -- Shire.Net LLC
2006-09-02 20:29:07 UTC
Permalink
On Sep 2, 2006, at 2:11 PM, Tony Reeves wrote:

> Chad Leigh -- Shire.Net LLC wrote:
>>
>>
>>
>> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>>
>>>
>>> Will the Areca work without its ((dead?) battery backed) cache?
>>
>> The cache memory is built into the controller. Some versions have a
>> removable (upgradable) DIMM while others have it onboard. The
>> battery module is an optional item and the card works fine without
>> it, barring OS and/or driver bugs
>>
>
> So what OS does not have a "bug" such that if the system is completely
> powered off with the last sync (or equiv.) written to the RAID cache
> only, not to disk, and then the RAID memory is lost or fails before
> the
> system comes back up, the file system is inconsistent?
>

Same issue with JBOD and on-disk cache.

FreeBSD tells the drivers to flush and waits for them to finish

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Tony Reeves
2006-09-02 20:47:03 UTC
Permalink
Chad Leigh -- Shire.Net LLC wrote:
>
>
>
> On Sep 2, 2006, at 2:11 PM, Tony Reeves wrote:
>
> > Chad Leigh -- Shire.Net LLC wrote:
> >>
> >>
> >>
> >> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
> >>
> >>>
> >>> Will the Areca work without its ((dead?) battery backed) cache?
> >>
> >> The cache memory is built into the controller. Some versions have a
> >> removable (upgradable) DIMM while others have it onboard. The
> >> battery module is an optional item and the card works fine without
> >> it, barring OS and/or driver bugs
> >>
> >
> > So what OS does not have a "bug" such that if the system is completely
> > powered off with the last sync (or equiv.) written to the RAID cache
> > only, not to disk, and then the RAID memory is lost or fails before
> > the
> > system comes back up, the file system is inconsistent?
> >
>
> Same issue with JBOD and on disk cache
>
> FreeBSD tells the drivers to flush and waits for them to finish
>

Does FreeBSD rely on the RAID controller telling it that it has finished
(when it may just be written to RAID cache), or does FreeBSD look beyond
the RAID controller to the physical media to determine if the writes are
finished?

I already know the answer, it is the former, it is no different to
Solaris in that respect and so could suffer the same issue.


Chad Leigh -- Shire.Net LLC
2006-09-02 21:36:24 UTC
Permalink
On Sep 2, 2006, at 2:47 PM, Tony Reeves wrote:

> Chad Leigh -- Shire.Net LLC wrote:
>>
>>
>>
>> On Sep 2, 2006, at 2:11 PM, Tony Reeves wrote:
>>
>>> Chad Leigh -- Shire.Net LLC wrote:
>>>>
>>>>
>>>>
>>>> On Aug 31, 2006, at 7:57 PM, John D Groenveld wrote:
>>>>
>>>>>
>>>>> Will the Areca work without its ((dead?) battery backed) cache?
>>>>
>>>> The cache memory is built into the controller. Some versions have a
>>>> removable (upgradable) DIMM while others have it onboard. The
>>>> battery module is an optional item and the card works fine without
>>>> it, barring OS and/or driver bugs
>>>>
>>>
>>> So what OS does not have a "bug" such that if the system is
>>> completely
>>> powered off with the last sync (or equiv.) written to the RAID cache
>>> only, not to disk, and then the RAID memory is lost or fails before
>>> the
>>> system comes back up, the file system is inconsistent?
>>>
>>
>> Same issue with JBOD and on disk cache
>>
>> FreeBSD tells the drivers to flush and waits for them to finish
>>
>
> Does FreeBSD rely on the RAID controller telling it that it has
> finished
> (when it may just be written to RIAD cache) or does FreeBSD look
> beyond
> the RAID controller to the physical media to determine if the
> writes are
> finished?
>
> I already know the answer, it is the former, it is no different to
> Solaris in that respect and so could suffer the same issue.


Actually, as far as I can tell, FreeBSD tells the controller the
system is going down and waits for the controller driver to return.
It is not just a standard flush. I have a Highpoint controller in a
FreeBSD box that always comes up at the very end, after all the
buffers and inodes are flushed by the OS, and it reports back that it
is flushing all caches.

According to what I understand Casper to have said, Solaris can do
the same sort of thing but it is not documented in the driver docs.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





John D Groenveld
2006-09-02 23:11:00 UTC
Permalink
In message <21887196-D44A-4292-A5D5-***@shire.net>, "Chad Leigh -- Shi
re.Net LLC" writes:
>According to what I understand Casper to have said, Solaris can do
>the same sort of thing but it is not documented in the driver docs.

Given their business relationship with Sun, does LSI Logic make use
of the undocumented interface in lsimega(7D) or is it also vulnerable
to the unflushed cache problem on sync;poweroff?

John
***@acm.org



Dan Mick
2006-09-03 01:17:58 UTC
Permalink
John D Groenveld wrote:
> In message <21887196-D44A-4292-A5D5-***@shire.net>, "Chad Leigh -- Shi
> re.Net LLC" writes:
>> According to what I understand Casper to have said, Solaris can do
>> the same sort of thing but it is not documented in the driver docs.
>
> Given their business relationship with Sun, does LSI Logic make use
> of the undocumented interface in lsimega(7D) or is it also vulnerable
> to the unflushed cache problem on sync;poweroff?

Reverse-engineering it, lsimega has a routine called mega_reset that calls a
routine called mega_adapter_cache_flush... so I bet so.

why is that relevant to the issue with the Areca?


palowoda
2006-09-03 09:00:02 UTC
Permalink
--- In ***@yahoogroups.com, Dan Mick <***@...> wrote:
>
> John D Groenveld wrote:
> > In message <21887196-D44A-4292-A5D5-***@...>, "Chad Leigh
-- Shi
> > re.Net LLC" writes:
> >> According to what I understand Casper to have said, Solaris can do
> >> the same sort of thing but it is not documented in the driver docs.
> >
> > Given their business relationship with Sun, does LSI Logic make use
> > of the undocumented interface in lsimega(7D) or is it also vulnerable
> > to the unflushed cache problem on sync;poweroff?
>
> reverse-engineering, lsimega has a routine called mega_reset that
calls a
> routine called mega_adapter_cache_flush...so I bet so.
>
> why is that relevant to the issue with the Areca?

Sounds like a good discussion for OpenSolaris device drivers
but that would not be relevant either. I agree it's all
pot luck and you don't know if a dog is on the other side
developing your device driver.

---Bob






John D Groenveld
2006-09-03 16:18:38 UTC
Permalink
In message <***@sun.com>, Dan Mick writes:
>why is that relevant to the issue with the Areca?

I'm not too concerned about Chad's Areca since I don't own one, but
since I do administer Solaris on a Dell with an OEM LSI MegaRAID
controller, I was a bit concerned about the generic unflushed write
cache issue.

LSIutils does report battery status so the dead battery isn't
a big issue but occasionally the pointy hair bosses in my little
corner of the blogosphere ask the Teamsters to move systems to
the other end of the building or campus and the trip may last longer
than the controller's battery life.

Until Areca resolves their issue and assuming the Areca firmware
flushes its cache on boot, it might be smart for Areca owners powering
off their hardware to not init 5 and forget, but instead init 6 and
power-off at the GRUB menu.
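Spelled out, that workaround looks like this (a sketch, not a verified procedure; it assumes, as reported earlier in the thread, that the driver does get its flush call on the reboot path):

```shell
# Instead of:  shutdown -i 5 -y -g 0
#   (the driver reportedly never sees the final flush on this path)
shutdown -i 6 -y -g 0   # clean reboot; driver gets its flush call
# ...then power the machine off by hand while it sits at the GRUB
# menu, before the OS comes back up and starts writing again.
```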

John
***@acm.org



rogerfujii
2006-09-01 10:53:50 UTC
Permalink
--- In ***@yahoogroups.com, "Chad Leigh -- Shire.Net LLC"
<***@...> wrote:
> On Aug 31, 2006, at 1:39 PM, ***@... wrote:

> > ZFS generally does recover (and gives EIO for the affected files
> > and duplicates meta data such that files and filenames generally
> > can be located)
> Does not seem so here and the pool had not been used in days so
> there was no big outstanding data to be written out -- just
> whatever ZFS does to shutdown.

"shutdown" is a little misleading - think of "unmounting" the
filesystem and all that entails.

> >> Duplication of vital meta data does not require more than one
> >> device. It just requires more space on the device.
> >
> > And ZFS does that.
>
> There seems to be a problem in the implementation then. In this
> case the pool is marked such that I cannot find any way to do
> any recovery with it. The only "use" of the pool in the several
> days before the event happened was to shut down the system, so
> whatever data ZFS decides to purge back to the pool on a shutdown.

You don't know this. If you were running raid5, the data affected
is the data you're writing out *AND* whatever other data is on the
disk block not being written out (depending on implementation).
This other data could be critical data.

> Then the whole pool was wiped out and there does not seem to be
> a way to try and recover any thing of it though all the caches
> involved were much smaller than the total of the data on the disk.

What difference does this make? Try recovering a ufs disk if the
block containing the slice information gets corrupted.

> I would like to know what ZFS actually does
> on system shutdown -- what sort of data it was writing out

The real question is "which is more recoverable?" a random single
block error in ZFS (in non-mirrored mode), or UFS. This is not
clear from the docs (everything has mirrored/raidz on). As a side
note, perhaps a zfs equivalent of fsdb would address this issue.

> >> But there can also be other strategies to deal with corrupted
> >> data than to just punt and say the pool is totally gone and
> >> inaccessible (scrub and import don't work). There were 10GB
> >> of data in the pool -- even if just one timestamp was corrupt,
> >> the whole pool was booted. That is wrong.

You don't know this. For all you know, the entire root data might
have been wiped out. Unless you know what the problem really is,
you're just blowing FUD.

> > When all you have is pointers pointing to corrupted blocks, what
> > can you do?
> The whole disk was not corrupted.

That isn't what he said. If zfs writes 3 redundant copies, corrupting
just those 3 blocks would cause a failure.

Personally, I'd rename this to "ZFS should be more robust", which
I think is a statement that everyone probably could agree on.
Sun could also lessen future problems by adding a simple
configurable DELAY before poweroff, so this sort of problem could
be avoided.

-r





Chad Leigh -- Shire.Net LLC
2006-08-30 20:45:25 UTC
Permalink
On Aug 30, 2006, at 2:14 PM, Tao Chen wrote:

> On 8/30/06, Chad Leigh -- Shire.Net LLC <***@shire.net> wrote:
>> According to the Areca tech
>> support (the underlying RAID device used as the pool source device is
>> an Areca Raid), Solaris has a bug in that it does not call the driver
>> flush routines to get the device drivers to flush when you do this
>> sort of shutdown. (reboot and some others do). The RAID device has
>> a battery backup so it should be ok on reboot. But I had the case
>> open to check on a BMC/remote console device (Tyan) that is not
>> working right and I accidently removed the battery backup cable. So
>> any pending data on the raid controller was lost. So
>>
>
> Do they have a bug id?

I'll have to ask them. It was discovered a month or so ago and I
have not gotten back with them since.

> This behaviour is filesystem-dependent, so
> "Solaris has a bug" doesn't mean it applies to zfs in this case. It
> might be UFS-specific.

This was at a lower level I think.

> From various discussion, zfs does flush the write cache whenever
> necessary, more so than UFS. It's up to the storage then to handle the
> flush request correctly.

Yes, but the device drivers rely on the OS telling them it is going
down to be able to do a final flush. This happens in some cases like
a reboot but not as I did it with shutdown -i 5

>
> "accidently removed the battery backup cable" could prevent the flush
> pass through to the physical disks.

Yes, the raid would have recovered itself if I had not done that, but
not all people have battery backups and the card would have completed
any flushes before shutdown if Solaris had told it.

>
>> That was all that happened. What I am concerned about is that the
>> ZFS meta data is so fragile that such a simple process could
>> permanently destroy the whole pool.
>
> It is strange.
> I believe at least in Update 2, zfs uses so called "Ditto block" to
> replicate meta-data.
> See here:
> http://blogs.sun.com/bill/entry/ditto_blocks_the_amazing_tape
>
> "We use ditto blocks to ensure that the more "important" a filesystem
> block is (the closer to the root of the tree), the more replicated it
> becomes. Our current policy is that we store one DVA for user data,
> two DVAs for filesystem metadata, and three DVAs for metadata that's
> global across all filesystems in the storage pool."
>
> So I wouldn't expect the meta-data are so easily corrupted.

Neither would I. I will read more above.

>
>> With UFS we could do an fsck
>> which would probably fix whatever the underlying problems here were.
>
> Not if all copies of super blocks are corrupted.

Which is unlikely to happen (it can but it is unlikely -- I have been
using UFS on FreeBSD for more than 10 years with many hard crashes
and have never had that happen)

>
>> With ZFS we seem to have fragile meta data that is easily corrupted.
>> This makes ZFS unusable for production use if it is so easily
>> corrupted. ZFS does not keep copies of its meta data or have any
>> other way to fight such a simple corruption? The machine had been
>> idle for a few days and the ZFS pool should have had 0 activity on
>> it. This is a machine still being tested before it goes into
>> production.
>
> If you can reproduce the problem (with or without the NVRAM), please
> report it here:
> http://www.opensolaris.org/jive/forum.jspa?forumID=80

Will do. I am not sure I can reproduce this, as the server is in
the rack in the data center 20 miles away and time constraints keep
me from going there, but it is probably worth reporting anyway.

> Even it is not a bug (just unlucky), some might re-think if current
> replicating policy is sufficient.
> This is the first time I heard a simple shutdown caused zfs meta-data
> corruption, although unplugging the NVRAM is not what people usually
> do.

Agreed. Not my idea to unplug the battery. I had to unplug the BMC/
remote console card for a diagnostic check and the battery cable came
loose :-)

In my case, no important data was lost as I was just testing things.
And later this year we are adding another raid volume to the ZFS pool
in a mirror configuration for redundancy. Just the fact that it
happened so easily scares me.

thanks
best
Chad


---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





C***@Sun.COM
2006-08-31 11:14:16 UTC
Permalink
>Yes, but the device drivers rely on the OS telling them it is going
>down to be able to do a final flush. This happens in some cases like
>a reboot but not as I did it with shutdown -i 5

The OS *does* tell device drivers when it goes down:

reboot / shutdown -i 5 both follow the "careful sync path"

panic and others also give devices a last control option

If the device driver does not work on any of the above circumstances,
then it is broken.

Casper


Chad Leigh -- Shire.Net LLC
2006-08-31 13:11:35 UTC
Permalink
On Aug 31, 2006, at 5:14 AM, ***@Sun.COM wrote:

>
>> Yes, but the device drivers rely on the OS telling them it is going
>> down to be able to do a final flush. This happens in some cases like
>> a reboot but not as I did it with shutdown -i 5
>
> The OS *does* tell device drivers when it goes down:
>
> reboot;/shutdown -i 5 all follow the "careful sync path"
>
> panic and others also give devices a last control option
>
> If the device driver does not work on any of the above circumstances,
> then it is broken.

According to the driver people for this device, shutdown -i 5 does
not do this but the other forms (reboot, other levels of init) do.
They never get it in the shutdown -i 5 case but they do other times.
Because they do get it for other levels of shutdown/reboot it seems
to me that they are handling it. They claim Solaris is not sending
it in this one case. I am asking them what the bug number is.

Chad


---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





C***@Sun.COM
2006-08-31 15:04:51 UTC
Permalink
>According to the driver people for this device, shutdown -i 5 does
>not do this but the other forms (reboot, other levels of init) do.
>They never get it in the shutdown -i 5 case but they do other times.
>Because they do get it for other levels of shutdown/reboot it seems
>to me that they are handling it. They claim Solaris is not sending
>it in this one case. I am asking them what the bug number is.

What exactly is it not sending?

shutdown -i 5 (aka init 5) is a clean shutdown.

What specifically are they looking for/expecting?

What driver is being used for the device?

Casper


Chad Leigh -- Shire.Net LLC
2006-08-31 17:06:27 UTC
On Aug 31, 2006, at 9:04 AM, ***@Sun.COM wrote:

>
>> According to the driver people for this device, shutdown -i 5 does
>> not do this but the other forms (reboot, other levels of init) do.
>> They never get it in the shutdown -i 5 case but they do other times.
>> Because they do get it for other levels of shutdown/reboot it seems
>> to me that they are handling it. They claim Solaris is not sending
>> it in this one case. I am asking them what the bug number is.
>
> What exactly is it not sending?
>
> shutdown -i 5 (aka init 5) is a clean shutdown.
>
> What specifically are they looking for/expecting?
>
> What driver is being used for the device?

This is an Areca RAID board, which has its own driver tied into the
sd system. I have asked them what the result of their contacting
Sun about the issue was and whether they got a bug number. I have not
heard back. However, this is what they told me back in mid-July when
we were debugging a related issue with the battery backup module (I
had a bad controller). They did confirm that the controller was not
being flushed when an init 5 happened

On Jul 19, 2006, at 8:01 AM, Areca Support wrote:
> Dear Sir,
>
> many thanks for your updated information.
> we finally reproduced your situation.
> the procedure we missing is the shutdown command.
> normally we shutdown solaris by
> # init 0 or
> # shutdown -n
> it will need a manual power off, with this procedure, the BBM will not
> activated.
>
> but when we used the command you provided (#shutdown -y -i 5 -g 5),
> the
> problem appears,
> check with our debug tool, the problem is the system don't flush
> controller
> before power off.
> that's the reason why BBM always activated after powered off.
> this no flush problem could be a driver issue or kernel bug, i will
> ask our
> engineer to check it.

and then as a follow-up


On Jul 20, 2006, at 6:18 AM, Areca Support wrote:
> Dear Sir,
>
> after some test, the abnormal BBM activated after power off looks
> like a
> kernel bug.
> when you shutdown system by the command you provide, kernel will
> not send
> flush command to our driver.
> if you shutdown system in the graphic mode by choose the shutdown
> feature.
> kernel will send flush command.
> our engineer will contact with the Sun engineer to get more detail
> about
> this.
> and we will added a FAQ in our knowledge base to avoid similar
> problem on
> other customer side.


This is all I know of the situation.

best regards
Chad




---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





Mike Riley
2006-08-31 17:15:17 UTC
I have sent this to the IHV group, in case they are not already talking to
Areca to help them.

Mike

Chad Leigh -- Shire.Net LLC wrote:
> This is an Areca raid board which has their own driver tied into the
> sd system. I have asked them what the results of their contacting
> Sun about the issue was and if they got a bug number. I have not
> heard back. However, this is what they told me back mid-July when we
> were debugging a related issue with the battery backup module (I had
> a bad controller). However, they did confirm the issue that the
> controller was not flushing when an init 5 happened



C***@Sun.COM
2006-08-31 17:34:08 UTC
>This is an Areca raid board which has their own driver tied into the
>sd system. I have asked them what the results of their contacting
>Sun about the issue was and if they got a bug number. I have not
>heard back. However, this is what they told me back mid-July when we
>were debugging a related issue with the battery backup module (I had
>a bad controller). However, they did confirm the issue that the
>controller was not flushing when an init 5 happened

Interesting; this seems to indicate that there's some time
needed by the device after the last write and before the power-off
to commit the writes.

>> many thanks for your updated information.
>> we finally reproduced your situation.
>> the procedure we missing is the shutdown command.
>> normally we shutdown solaris by
>> # init 0 or
>> # shutdown -n
>> it will need a manual power off, with this procedure, the BBM will not
>> activated.

This seems to be timing related; I don't think there's a difference
in Solaris between the two steps.

Devices do not get a final warning when a system shuts down unless they
define a devo_reset() entry point (this is not defined in the DDI), and
few devices need to use it.

>> not send
>> flush command to our driver.
>> if you shutdown system in the graphic mode by choose the shutdown
>> feature.
>> kernel will send flush command.
>> our engineer will contact with the Sun engineer to get more detail
>> about
>> this.
>> and we will added a FAQ in our knowledge base to avoid similar
>> problem on
>> other customer side.
>
>
>This is all I know of the situation.

This is strange, as "init 5" is a graceful shutdown followed by a power-off;
init 0 is the same except that it does not power off the hardware.

Casper


Mike Riley
2006-08-31 17:58:29 UTC
***@Sun.COM wrote:
>> This is an Areca raid board which has their own driver tied into the
>> sd system. I have asked them what the results of their contacting
>> Sun about the issue was and if they got a bug number. I have not
>> heard back. However, this is what they told me back mid-July when we
>> were debugging a related issue with the battery backup module (I had
>> a bad controller). However, they did confirm the issue that the
>> controller was not flushing when an init 5 happened
>
> Interesting; this seems to indicate that there's some time
> needed by the device after the last write and before the power-off
> to commit the writes.
>
>>> many thanks for your updated information.
>>> we finally reproduced your situation.
>>> the procedure we missing is the shutdown command.
>>> normally we shutdown solaris by
>>> # init 0 or
>>> # shutdown -n
>>> it will need a manual power off, with this procedure, the BBM will not
>>> activated.
>
> This seems to be timing related; I don't think there's a difference
> in Solaris between the two steps.
>
> Devices do not get a final warning when a system shuts down unless they
> define a devo_reset() entry point (this is not defined in the DDI), and
> few devices need to use it.
>
>>> not send
>>> flush command to our driver.
>>> if you shutdown system in the graphic mode by choose the shutdown
>>> feature.
>>> kernel will send flush command.
>>> our engineer will contact with the Sun engineer to get more detail
>>> about
>>> this.
>>> and we will added a FAQ in our knowledge base to avoid similar
>>> problem on
>>> other customer side.
>>
>> This is all I know of the situation.
>
> This is strange as "init 5" is a graceful shutdown followed by a power-off;
> init 0 is the same except that it does not power-off the hardware.
>
> Casper

Maybe it is not so strange...

If we are talking about an internal card then it may not have completed all
the flushing before power goes away. But using init 0 with a manual power
off does give it the needed time to complete the flush operation.

Mike


C***@Sun.COM
2006-08-31 18:29:18 UTC
>If we are talking about an internal card then it may not have completed all
>the flushing before power goes away. But using init 0 with a manual power
>off does give it the needed time to complete the flush operation.

Which points to a bug in the driver, not Solaris?

(The driver will need to have a reset entry point if it needs to
delay power-off sufficiently).

But the reset entry point is not in the DDI/DDK documentation, so it's
not strange it is missing.
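The failure mode under discussion can be sketched with a toy model (plain Python; all names here are invented for illustration, this is not Solaris or DDI code): a controller with a battery-less write-back cache only persists acknowledged writes if the shutdown path invokes a registered reset/flush hook before power is cut.

```python
# Toy model of the shutdown-flush problem. Nothing here is real
# Solaris/DDI code; all names are invented for illustration.

class RaidController:
    """A controller with a write-back cache in front of 'disk'."""
    def __init__(self):
        self.cache = []   # writes buffered in controller RAM
        self.disk = []    # what actually survives power-off

    def write(self, block):
        self.cache.append(block)      # acknowledged, but not yet on media

    def flush(self):
        self.disk.extend(self.cache)  # commit cached writes to media
        self.cache.clear()

def shutdown(controller, reset_hook=None):
    """Simulate power-off; only a registered hook gets a last chance."""
    if reset_hook is not None:
        reset_hook()                  # e.g. a devo_reset-style entry point
    controller.cache.clear()          # power gone: cache contents are lost

# Without a flush hook, acknowledged writes vanish at power-off:
c1 = RaidController()
c1.write("metadata")
shutdown(c1)                          # no hook registered
assert c1.disk == []

# With a registered hook that flushes, the same writes survive:
c2 = RaidController()
c2.write("metadata")
shutdown(c2, reset_hook=c2.flush)
assert c2.disk == ["metadata"]
```

The point of the sketch is only the ordering: whoever cuts power must give the cache a chance to drain first, which is exactly what a reset entry point provides.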

Casper


Mike Riley
2006-08-31 18:43:56 UTC
***@Sun.COM wrote:
>> If we are talking about an internal card then it may not have completed all
>> the flushing before power goes away. But using init 0 with a manual power
>> off does give it the needed time to complete the flush operation.
>
> Which points to a bug in the driver, not Solaris?
>
> (The driver will need to have a reset entry point if it needs to
> delay power-off sufficiently).
>
> But the reset entry point is not in the DDI/DDK documentation, so it's
> not strange it is missing.
>
> Casper

I agree. Do you know of any reason why such an apparently vital part of the
DDI/DDK is not documented? A RAID device driver would definitely want to
have that so it could be sure it flushed everything prior to shutdown.

Mike


C***@Sun.COM
2006-08-31 19:26:15 UTC
>***@Sun.COM wrote:
>>> If we are talking about an internal card then it may not have completed all
>>> the flushing before power goes away. But using init 0 with a manual power
>>> off does give it the needed time to complete the flush operation.
>>
>> Which points to a bug in the driver, not Solaris?
>>
>> (The driver will need to have a reset entry point if it needs to
>> delay power-off sufficiently).
>>
>> But the reset entry point is not in the DDI/DDK documentation, so it's
>> not strange it is missing.
>>
>> Casper
>
>I agree. Do you know of any reason why such an apparently vital part of the
>DDI/DDK is not documented? A RAID device driver would definitely want to
>have that so it could be sure it flushed everything prior to shutdown.

"It came from New Jersey"?

I have no clue; I'm not *that* old.

Casper


Dan Mick
2006-08-31 22:41:15 UTC
Mike Riley wrote:
> ***@Sun.COM wrote:
>>> If we are talking about an internal card then it may not have completed all
>>> the flushing before power goes away. But using init 0 with a manual power
>>> off does give it the needed time to complete the flush operation.
>> Which points to a bug in the driver, not Solaris?
>>
>> (The driver will need to have a reset entry point if it needs to
>> delay power-off sufficiently).
>>
>> But the reset entry point is not in the DDI/DDK documentation, so it's
>> not strange it is missing.
>>
>> Casper
>
> I agree. Do you know of any reason why such an apparently vital part of the
> DDI/DDK is not documented? A RAID device driver would definitely want to
> have that so it could be sure it flushed everything prior to shutdown.

do we have to have the devo_reset discussion *again*?


C***@Sun.COM
2006-08-31 22:59:22 UTC
>Mike Riley wrote:
>> ***@Sun.COM wrote:
>>>> If we are talking about an internal card then it may not have completed all
>>>> the flushing before power goes away. But using init 0 with a manual power
>>>> off does give it the needed time to complete the flush operation.
>>> Which points to a bug in the driver, not Solaris?
>>>
>>> (The driver will need to have a reset entry point if it needs to
>>> delay power-off sufficiently).
>>>
>>> But the reset entry point is not in the DDI/DDK documentation, so it's
>>> not strange it is missing.
>>>
>>> Casper
>>
>> I agree. Do you know of any reason why such an apparently vital part of the
>> DDI/DDK is not documented? A RAID device driver would definitely want to
>> have that so it could be sure it flushed everything prior to shutdown.
>
>do we have to have the devo_reset discussion *again*?

Yes, until the problem is fixed.

devo_reset is needed so it should be documented and supported.

Casper


Dan Mick
2006-08-31 23:18:14 UTC
***@sun.com wrote:
>> Mike Riley wrote:
>>> ***@Sun.COM wrote:
>>>>> If we are talking about an internal card then it may not have completed all
>>>>> the flushing before power goes away. But using init 0 with a manual power
>>>>> off does give it the needed time to complete the flush operation.
>>>> Which points to a bug in the driver, not Solaris?
>>>>
>>>> (The driver will need to have a reset entry point if it needs to
>>>> delay power-off sufficiently).
>>>>
>>>> But the reset entry point is not in the DDI/DDK documentation, so it's
>>>> not strange it is missing.
>>>>
>>>> Casper
>>> I agree. Do you know of any reason why such an apparently vital part of the
>>> DDI/DDK is not documented? A RAID device driver would definitely want to
>>> have that so it could be sure it flushed everything prior to shutdown.
>> do we have to have the devo_reset discussion *again*?
>
> Yes, until the problem is fixed.
>
> devo_reset is needed so it should be documented and supported.

I agree, as I have said tens of times in the past. I'm not going to pass
on again what I know about the stumbling blocks there, though; I know both
you and Mike have heard them and know where to go for help.


Chad Leigh -- Shire.Net LLC
2006-08-31 18:57:09 UTC
On Aug 31, 2006, at 12:29 PM, ***@Sun.COM wrote:

>
>> If we are talking about an internal card then it may not have
>> completed all
>> the flushing before power goes away. But using init 0 with a
>> manual power
>> off does give it the needed time to complete the flush operation.
>
> Which points to a bug in the driver, not Solaris?
>
> (The driver will need to have a reset entry point if it needs to
> delay power-off sufficiently).
>
> But the reset entry point is not in the DDI/DDK documentation, so it's
> not strange it is missing.
>
> Casper
>

This kind of fits this situation. The "bug" was discovered while
debugging a separate problem with the controller and battery module
(which turned out to be a defective controller). I learned that when I
turned off ACPI in the motherboard BIOS, so the system could no longer
power itself off, the card-flushing problem did not happen, even with
an init 5. Whether that is because the RAID card happened to have time
to flush by itself or for some other reason, it flushed successfully
with init 5 when power-off was disabled in the BIOS.

best
Chad


---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





z***@buffy.sighup.org.uk
2006-08-31 16:54:17 UTC
On Thu, Aug 31, 2006 at 07:11:35AM -0600, Chad Leigh -- Shire.Net LLC wrote:
> According to the driver people for this device, shutdown -i 5 does
> not do this but the other forms (reboot, other levels of init) do.
> They never get it in the shutdown -i 5 case but they do other times.
> Because they do get it for other levels of shutdown/reboot it seems
> to me that they are handling it. They claim Solaris is not sending
> it in this one case. I am asking them what the big number is.

What about sync? Surely this must tell all storage device drivers to flush
any cache to storage. If down the line a firmware layer tells lies then all
bets are off no matter how good the OS driver is.

But then, even if random blocks are not flushed, ZFS on the disk must
still be valid, assuming it started out valid. That is the central aspect
of the design: the blocks on the disk are _always_ a valid file system.
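That design property can be illustrated with a toy copy-on-write sketch (plain Python; the names are invented and this is not the actual ZFS implementation): new state goes to fresh blocks first, and a single root pointer is updated last, so a crash at any intermediate point still leaves the root pointing at a complete, valid state.

```python
# Toy copy-on-write update: the root pointer flips only after the
# new blocks are fully written, so any crash leaves a valid state.

storage = {}          # block address -> contents
root = None           # the one mutable pointer (like an uberblock)

def cow_update(addr, contents, crash_before_flip=False):
    global root
    storage[addr] = contents      # 1. write new data to fresh blocks
    if crash_before_flip:
        return                    # power lost before the pointer flip
    root = addr                   # 2. atomically point at the new state

cow_update("blk0", "state-A")
assert storage[root] == "state-A"

# A crash mid-update leaves root pointing at the old, valid state:
cow_update("blk1", "state-B", crash_before_flip=True)
assert storage[root] == "state-A"

# Retrying the update after "reboot" completes the transition:
cow_update("blk1", "state-B")
assert storage[root] == "state-B"
```

If the pointer flip itself never reaches stable storage (a lying cache, say), the old state should still be the one found on disk, which is why a fully lost pool is surprising.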

So, there is either a bug that is triggered by this specific hardware, or
maybe nothing was ever written to disk? If there is a big enough cache
and a buggy driver at some level could the file system have been totally
cache resident?

Just a half-baked theory :-)

--
Geoff Lane

Rainy days and automatic weapons get me down....


Al Hopper
2006-08-31 14:57:10 UTC
On Wed, 30 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:

..... snip .....
> Agreed. Not my idea to unplug the battery. I had to unplug the BMC/
> remote console card for a diagnostic check and the battery cable came
> loose :-)
>
> In my case, no important data was lost as I was just testing things.
> And later this year we are adding another raid volume to the ZFS pool
> in a mirror configuration for redundancy. Just the fact that it
> happened so easily scares me.

I installed the kernel patch 118855-15 on an otherwise stock Update 2 zfs
test box with 3 zfs pools defined - different pools[1] on a total of 10
SATA drives. After the mandatory reboot, all my ZFS volumes were *gone*!
The stupid patch or the stupid patchadd wiped away my zfs config file.

Assuming all my zpools were gone, I went to re-create the first pool using
the same zpool create command I had originally used and zfs warned me that
the disks might be part of another pool or I might want to import the
pool. It presented the pool name. I did a zpool import poolname (IIRC) -
and within about 10 seconds my pool was back - totally intact.

ZFS had scanned the disks and recovered the pool info and rebuilt the pool
config files. I did the same for the other two pools, and, in less than a
minute my 3 pools were back from extinction - completely intact.

I was very impressed....

[1] raidz pool using 5 disks; 2-way mirror using 2 disks and a 3-way
mirror using 3 drives. Drives connected to two 8-port supermicro dumb
SATA controllers.
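That recovery, rebuilding pool configuration by scanning the disks themselves, can be sketched as follows (plain Python; the label structure and field names are invented for illustration, not the real ZFS on-disk label format):

```python
# Toy version of rebuilding pool configs by scanning device labels.
# Each "disk" carries a label naming the pool it belongs to, so a lost
# host-side config file can be reconstructed from the disks alone.

disks = [
    {"dev": "c2t0d0", "label": {"pool": "tank",   "guid": 1}},
    {"dev": "c2t1d0", "label": {"pool": "tank",   "guid": 2}},
    {"dev": "c3t0d0", "label": {"pool": "backup", "guid": 3}},
]

def scan_for_pools(devices):
    """Group devices by the pool name recorded in their labels."""
    pools = {}
    for d in devices:
        pools.setdefault(d["label"]["pool"], []).append(d["dev"])
    return pools

found = scan_for_pools(disks)
assert found == {"tank": ["c2t0d0", "c2t1d0"], "backup": ["c3t0d0"]}
```

Because the authoritative membership information lives on the disks, losing the host's config file is recoverable, which matches what `zpool import` did here.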

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Mike Riley
2006-08-31 16:08:06 UTC
Al Hopper wrote:
> On Wed, 30 Aug 2006, Chad Leigh -- Shire.Net LLC wrote:
>
> ..... snip .....
>> Agreed. Not my idea to unplug the battery. I had to unplug the BMC/
>> remote console card for a diagnostic check and the battery cable came
>> loose :-)
>>
>> In my case, no important data was lost as I was just testing things.
>> And later this year we are adding another raid volume to the ZFS pool
>> in a mirror configuration for redundancy. Just the fact that it
>> happened so easily scares me.
>
> I installed the kernel patch 118855-15 on an otherwise stock Update 2 zfs
> test box with 3 zfs pools defined - different pools[1] on a total of 10
> SATA drives. After the mandatory reboot, all my ZFS volumes were *gone*!
> The stupid patch or the stupid patchadd wiped away my zfs config file.
>
> Assuming all my zpools were gone, I went to re-create the first pool using
> the same zpool create command I had originally used and zfs warned me that
> the disks might be part of another pool or I might want to import the
> pool. It presented the pool name. I did a zpool import poolname (IIRC) -
> and within about 10 seconds my pool was back - totally intact.
>
> ZFS had scanned the disks and recovered the pool info and rebuilt the pool
> config files. I did the same for the other two pools, and, in less than a
> minute my 3 pools were back from extinction - completely intact.
>
> I was very impressed....
>
> [1] raidz pool using 5 disks; 2-way mirror using 2 disks and a 3-way
> mirror using 3 drives. Drives connected to two 8-port supermicro dumb
> SATA controllers.
>
> Regards,

We had a crash on the system most developers use for home directories in
Menlo Park a while back. It was apparently caused by a bad disk. ZFS was
able to recover everything very quickly once the cause was determined with
no apparent data loss.

I am certain that ZFS is going to be one (of many!) crown jewels in Solaris.

Mike

