Tank reaching 97%

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
As suspected, the condition of that pool is very serious, bordering on terminal. You have one dead drive, so you have no redundancy. You have a second drive that's throwing lots of errors--when that one dies, your entire pool dies. Replace the failed disk immediately. That won't fix the data that's already corrupt, but it will bring back a little bit of redundancy. Once that's done, you can worry about the other disk that's throwing errors--expect you'll need to replace that one as well.

What is this system being used for? If it stores important data, you really need to rebuild the pool into a more robust configuration--you have too little redundancy, and too wide a vdev.

But we're getting distracted from the subject of your thread. I still think your most likely issue is snapshots. Post, in code tags, the output of zfs list and zfs list -t snapshot.
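
For a quick read on where the space is actually going, the space-accounting view can also help (just a sketch; pool name tank taken from the posts above):

Code:
[root@nas1 ~]# zfs list -o space -r tank    # USEDSNAP vs. USEDDS shows whether snapshots or live data hold the space
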
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
I see..

zfs list output ..

Code:
[root@nas1 ~]# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
tank                 19.2T   398G  19.2T  /mnt/tank
tank/.system         3.12M   398G  56.7K  /mnt/tank/.system
tank/.system/cores   51.2K   398G  51.2K  /mnt/tank/.system/cores
tank/.system/samba4  2.97M   398G  2.97M  /mnt/tank/.system/samba4
tank/.system/syslog  51.2K   398G  51.2K  /mnt/tank/.system/syslog
tank/iscsi            104K   398G  53.0K  /mnt/tank/iscsi
tank/iscsi/backupa   51.2K   398G  51.2K  /mnt/tank/iscsi/backupa


zfs list -t snapshot output..
Code:
[root@nas1 ~]# zfs list -t snapshot
NAME                         USED  AVAIL  REFER  MOUNTPOINT
tank@auto-20171109.0900-2w  14.8G      -  19.2T  -
tank@auto-20171109.1000-2w  1.65M      -  19.2T  -


TQVM
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
OK, doesn't look like it's snapshots after all--there weren't any there at all before today (assuming it's 9 Nov wherever you are; it's still 8 Nov here). How about the output (again, in code tags) of du -sh /mnt/tank/*?
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
Output..

Code:
[root@nas1 ~]# du -sh /mnt/tank/*											   
19T	/mnt/tank/data1														 
5.0k	/mnt/tank/iscsi   


TQVM
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
OK, ls -lh /mnt/tank.
Code:
[root@nas1 ~]# ls -lh /mnt/tank
total 41126218213
drwxr-xr-x  2 www     www       3B May 24  2014 .freenas
drwxr-xr-x  5 root    wheel     5B Jun  4  2014 .system
-rw-r--r--  1 root    wheel    19T Nov  9 22:14 data1
drwxr-xr-x  3 nobody  nobody    3B May 31  2013 iscsi

TQVM
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
OK, you have a single file called data1, that's 19TB in size, and owned by root. That file is consuming almost all the space on your pool. How are you using that file?
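
If it helps, comparing the file's on-disk usage with its apparent size will show whether that file is sparse or fully allocated (sketch only; assumes the du on this FreeBSD build supports -A for apparent size):

Code:
[root@nas1 ~]# du -sh  /mnt/tank/data1     # space actually allocated on the pool
[root@nas1 ~]# du -Ash /mnt/tank/data1     # apparent (logical) size of the file
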
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
Actually, that file holds backup data from some of our servers. The data is important and is backed up every day. We use backup software and one server to manage the backup process, and the backups are then sent to the NAS via iSCSI.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
So data1 is a file-based iSCSI extent that's consuming pretty much all your pool space? That's just all kinds of wrong, and it's probably why deleting data on the other end isn't freeing space on your pool. I'm not sure how you'd go about shrinking this; maybe one of the experts can chime in. @Ericloewe? @Arwen?

If this data is important, you need to get the pool fixed--replace the failed drive, and probably replace the second drive showing errors. That's the first priority. The second priority is to reduce the size of data1 to give yourself some breathing room.

The third priority is to rebuild the server with a sane configuration. If you need 20 TB of block storage (i.e., iSCSI), you need 40 TB of pool capacity, net of redundancy. About the smallest way to do that, as far as I can see, would be to put 8 x 8 TB disks in RAIDZ2.
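
Rough numbers behind that suggestion, as a back-of-the-envelope only (ignores ZFS metadata and slop space):

Code:
# 8 x 8 TB drives in a single RAIDZ2 vdev
#   data disks   = 8 - 2 parity         = 6
#   raw capacity = 6 x 8 TB             = 48 TB  (~43.7 TiB)
#   keep block storage at or below ~50% => roughly 20-21 TiB usable for iSCSI
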
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The filesystem that's using the extent is presumably not going to react well to sudden shrinkage of the underlying disk, so that would require one or more of the following:
  • Nuke it, restore from backup to a new share that is properly configured to not allow for the pool to get so full.
  • Add more storage.
  • Get rid of snapshots (a removal sketch follows this list).
  • Move other data elsewhere.
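
On the snapshot option above: removal is just a zfs destroy per snapshot (names taken from the zfs list -t snapshot output up-thread; in this case they only hold about 15 GB, so it would not buy much):

Code:
[root@nas1 ~]# zfs destroy tank@auto-20171109.0900-2w
[root@nas1 ~]# zfs destroy tank@auto-20171109.1000-2w
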
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
@dnet just be sure you only replace one drive at a time. You must replace the totally failed drive first, let the resilver complete, then replace the drive that is giving errors, which looks to be da6 from the graphic you posted earlier.
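
Not a full procedure, just a quick sanity-check sketch before and between replacements (da6 taken from the post above):

Code:
[root@nas1 ~]# glabel status | grep da6    # map da6 to its gptid label so the right disk gets replaced
[root@nas1 ~]# smartctl -a /dev/da6        # confirm it really is the drive logging errors
[root@nas1 ~]# zpool status tank           # make sure the first resilver has finished before touching the second disk
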
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Or functioning TRIM/UNMAP on the iSCSI initiator, if the virtual disk can afford to delete stuff. But this is a bit of a stopgap.

Forgot that one.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
From the output of zfs list -t snapshot posted up-thread, doesn't look like those are the issue.
I figured as much. It's not often that insanely full pools are insanely full due to snapshots.
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
OK, I will replace the failed drive immediately and reduce the size of the data.
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
In our environment, the NAS is mapped to a Windows server using iSCSI. After reducing the size of the data on the Windows server, the tank on the NAS still shows 97%. Why? Or do I also need to adjust something on the NAS?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
iSCSI provides raw block storage. Unless the client is issuing UNMAP, the server has no way of knowing if something was deleted.
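
If someone wanted to check the Windows side, something along these lines would do it (drive letter E is only an example, and whether the FreeNAS end honors UNMAP for a file extent depends on the version and extent settings):

Code:
C:\> rem A result of 0 means Windows sends delete (TRIM/UNMAP) notifications
C:\> fsutil behavior query DisableDeleteNotify

C:\> rem Re-send UNMAP for space already freed on the iSCSI volume (PowerShell)
C:\> powershell -Command "Optimize-Volume -DriveLetter E -ReTrim -Verbose"
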
 

dnet

Dabbler
Joined
Mar 27, 2014
Messages
23
I have replaced the hard disk and the resilvering process has started, but the progress has not changed after several hours. Why? Or do I need to restart the machine?

Code:
[root@nas1 ~]# zpool status
  pool: tank
 state: UNAVAIL
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 20 11:34:53 2017
        148M scanned out of 21.4T at 4.37K/s, (scan is slow, no estimated time)
        12.4M resilvered, 0.00% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        tank                                              UNAVAIL   1.16K     0     0
          raidz1-0                                        UNAVAIL   1.16K     0     0
            gptid/314214ac-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/31d202ce-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/3264371b-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/32f0c656-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            replacing-4                                   UNAVAIL       0     0     0
              9773083294005733761                         UNAVAIL       0     0     0  was /dev/gptid/3380f7fb-c8a9-11e2-8927-002590c1fcf4
              gptid/b60ebc44-cda3-11e7-916b-002590c1fcf4  ONLINE        0     0     0  (resilvering)
            gptid/3418fcca-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            1338157980908881363                           REMOVED       0     0     0  was /dev/gptid/34b1725d-c8a9-11e2-8927-002590c1fcf4
            gptid/35421063-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/35dbbdfb-c8a9-11e2-8927-002590c1fcf4    DEGRADED  1.16K     0     0  too many errors
            gptid/36690f03-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/36fcd7d8-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
            gptid/378c07c4-c8a9-11e2-8927-002590c1fcf4    ONLINE        0     0     0
        logs
          gptid/37d94c7f-c8a9-11e2-8927-002590c1fcf4      ONLINE        0     0     0

errors: 6 data errors, use '-v' for a list

TQVM
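
One way to tell whether the resilver is creeping forward rather than stuck is to compare the scanned counter some minutes apart (sketch only; pool name from the output above):

Code:
[root@nas1 ~]# zpool status tank | grep -E 'scanned|resilvered'
[root@nas1 ~]# sleep 600; zpool status tank | grep -E 'scanned|resilvered'    # run again ~10 minutes later and compare
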
 