uberthoth
Dabbler
- Joined: Mar 15, 2022
- Messages: 11
I have a MinIO app running in distributed mode with four nodes, each with 32 GB of RAM.
Two of the nodes failed with an out-of-memory condition. After rebooting the machines, they show logs like this:
```
2022-03-16 15:47:49.798681+00:00 Waiting for all other servers to be online to format the disks.
```
```
2022-03-16 15:48:45.789080+00:00 Waiting for the first server to format the disks.
```
The other two machines have logs like this:
```
2022-03-16 15:52:14.513028+00:00 2022-03-16T15:52:14.513028950Z
2022-03-16 15:52:14.513080+00:00 API: SYSTEM()
2022-03-16 15:52:14.513133+00:00 Time: 15:52:14 UTC 03/16/2022
2022-03-16 15:52:14.513185+00:00 Error: Marking https://1.example.local:9000/minio/storage/data/v42 temporary offline; caused by Post "https://1.example.local:9000/minio/...k-id=&file-path=format.json&volume=.minio.sys": lookup 1.example.local on 10.0.0.11:53: dial udp 10.0.0.11:53: i/o timeout (*fmt.wrapError)
2022-03-16 15:52:14.513239+00:00 6: internal/rest/client.go:149:rest.(*Client).Call()
2022-03-16 15:52:14.513292+00:00 5: cmd/storage-rest-client.go:152:cmd.(*storageRESTClient).call()
2022-03-16 15:52:14.513354+00:00 4: cmd/storage-rest-client.go:520:cmd.(*storageRESTClient).ReadAll()
2022-03-16 15:52:14.513430+00:00 3: cmd/format-erasure.go:406:cmd.loadFormatErasure()
2022-03-16 15:52:14.513504+00:00 2: cmd/format-erasure.go:326:cmd.loadFormatErasureAll.func1()
2022-03-16 15:52:14.513560+00:00 1: internal/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()
```
```
2022-03-16 14:19:11.134801+00:00 2022-03-16T14:19:11.134801123Z
2022-03-16 14:19:11.135063+00:00 API: SYSTEM()
2022-03-16 14:19:11.135086+00:00 Time: 14:19:11 UTC 03/16/2022
2022-03-16 14:19:11.135104+00:00 DeploymentID: deadbeef
2022-03-16 14:19:11.135121+00:00 Error: Operation timed out (cmd.OperationTimedOut)
2022-03-16 14:19:11.135139+00:00 1: cmd/iam.go:339:cmd.(*IAMSys).watch()
```
Important to note: the two nodes that have not been rebooted continue to operate just fine; I can still pull files from them, etc.
How do I get the two rebooted nodes to rejoin the cluster without initializing?
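For what it's worth, the first stack trace shows `loadFormatErasure()` failing to read `format.json` from `.minio.sys` over the peer REST API because the DNS lookup against 10.0.0.11:53 times out. One thing that can be checked on the rebooted nodes is whether that erasure-set metadata is still present on each data drive — a sketch, assuming a single drive mounted at the hypothetical path `/data` (substitute your real drive paths):

```shell
# Check for MinIO's erasure-set metadata on each data drive.
# /data is a hypothetical mount point -- adjust to your configuration.
for drive in /data; do
  if [ -f "$drive/.minio.sys/format.json" ]; then
    echo "$drive: format.json present"
  else
    echo "$drive: format.json MISSING"
  fi
done
```

If the file is present on every drive, the rebooted nodes have not lost their metadata, which would point at the DNS timeouts in the second trace as the blocker rather than anything needing a re-format.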