DiskIAnalyser 0.2 Beta

Félix Abt · Aug 17, 2023

Félix Abt submitted a new resource:

DiskIAnalyser - a script that analyzes SMART disk reports via ChatGPT and generates a per-disk health score

This is an early work in progress.

It's a simple Python script to run on a TrueNAS server to analyze SMART disk reports using ChatGPT. The goal is to leverage the GPT-3.5 Turbo API to read your disks' SMART reports and generate a simplified report with a health score for each disk. Once the report is generated, it is sent via email.


Davvo · Aug 17, 2023

This is interesting. How are you using ChatGPT? I doubt it was specifically trained to analyze smart data. Also, is this for CORE or SCALE?

Félix Abt · Aug 17, 2023

Davvo said:
This is interesting. How are you using ChatGPT? I doubt it was specifically trained to analyze smart data. Also, is this for CORE or SCALE?

For now, I haven't dived into AI training, so it's just basic use of the GPT-3.5 Turbo model API. But training a SMART-analysis-oriented AI would make an excellent next step for this project!
I tested on TrueNAS CORE, but it should work on any Linux system with Python and smartctl, so it should run flawlessly on TrueNAS SCALE.

joeschmuck · Aug 18, 2023

Interesting. I know it's just a start and I'm going to provide you some feedback, please know that I'm not throwing stones, I do intend this to be constructive in nature. If I had python experience, I'd toss you some code to try, but my python knowledge is very limited.

I personally would like to see it provide both the good values and the bad values for the initial release and development; this way you can tell whether it is evaluating the data correctly. I have no idea if it would say UDMA_CRC_Errors are good or bad, but they need to be addressed even if it isn't a drive failure. A score does not satisfy me. I would prefer messages such as 'Drive has no concerns', 'Drive has some issues' or 'Drive is failing and should be replaced'. But if you state a drive is in any condition other than Good, all the errors should be presented. Also, the SMART Passed value means almost nothing; is ChatGPT using that primarily? Most drive failures have 'Passed' in this area.
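A minimal sketch of the kind of status message described above, in place of a bare score. It is not taken from DiskIAnalyser or Multi-Report: the attribute grouping and the "any non-zero count" rule are illustrative assumptions, and `describe` is a hypothetical helper. The attribute names follow smartctl's spelling (`UDMA_CRC_Error_Count`).

```python
# Hypothetical grouping, chosen only for illustration.
FAILING = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
ATTENTION = ("UDMA_CRC_Error_Count",)  # needs addressing, but often cabling rather than the drive


def describe(raw_values):
    """Return (status message, offending attributes) for a dict of raw SMART values."""
    bad = {name: v for name, v in raw_values.items() if name in FAILING and v > 0}
    warn = {name: v for name, v in raw_values.items() if name in ATTENTION and v > 0}
    if bad:
        status = "Drive is failing and should be replaced"
    elif warn:
        status = "Drive has some issues"
    else:
        status = "Drive has no concerns"
    # Whenever the status is not "no concerns", every offending value is reported.
    return status, {**bad, **warn}
```

Returning the offending values alongside the message keeps the report checkable by a human, which is the point raised about validating the evaluation.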

Also, does a person need to sign up for ChatGPT in order to use it? I've never used it.

Last thing, this is the wrong place to put this thread. This is not a Legacy topic. I will move it and you will get the message it was moved.

WI_Hedgehog · Aug 18, 2023

An account is needed to use ChatGPT.

The output is not a pass/fail, it is at the level of intelligent research and conversation. Currently the engine can search databases like all the publicly available census databases worldwide in under 10 seconds and generate logical conclusions based on the results. If I remember, the 3.5 engine scored 20% on the bar exam to become an attorney last year, the 4.0 engine this year has the highest score ever recorded.

The 3.5 engine used to have the ability to hold relationships, but it was mass influenced to be... let's sum it up as "very dark and manipulative" and the ability is now programmatically removed.

The 4.0 engine is highly intelligent and has reasoning reporting built in, so you can read how the engine came to the decision. It doesn't lie, rather manipulates the truth to the point the programmers asked the project be shut down for the safety of humanity. This engine was given limited Internet access so at this stage of intelligence should not have escaped into the wild. At the rate of advancement v6 (two versions from now) has the potential to "hide out" in the AI chips of cell phones released in early 2023 and mesh network to the point it could not be recalled.

With that understanding, v3.5 could probably be trained to be fairly good at diagnosing problems, though not as good as the top members here. V4.0 could probably analyze all known drives and suggest the best drives for use cases based on failure rates, cost, speed, etc., which is far more than this thread is addressing. v5 should be able to design your whole server, and v6 should be able to use it to conquer humanity. Air-gap your backups gents.

joeschmuck · Aug 18, 2023

WI_Hedgehog said:
At the rate of advancement v6 (two versions from now) has the potential to "hide out" in the AI chips of cell phones released in early 2023 and mesh network to the point it could not be recalled.

So Arnold Schwarzenegger lives to see the Real Skynet.

Félix Abt · Aug 18, 2023

joeschmuck said:
Interesting. I know it's just a start and I'm going to provide you some feedback, please know that I'm not throwing stones, I do intend this to be constructive in nature. If I had python experience, I'd toss you some code to try, but my python knowledge is very limited.

I personally would like to see it provide both the good values and the bad values for the initial release and development; this way you can tell whether it is evaluating the data correctly. I have no idea if it would say UDMA_CRC_Errors are good or bad, but they need to be addressed even if it isn't a drive failure. A score does not satisfy me. I would prefer messages such as 'Drive has no concerns', 'Drive has some issues' or 'Drive is failing and should be replaced'. But if you state a drive is in any condition other than Good, all the errors should be presented. Also, the SMART Passed value means almost nothing; is ChatGPT using that primarily? Most drive failures have 'Passed' in this area.

Also, does a person need to sign up for ChatGPT in order to use it? I've never used it.

Last thing, this is the wrong place to put this thread. This is not a Legacy topic. I will move it and you will get the message it was moved.

First, thank you for your feedback, and second, sorry for posting in the wrong place!

I asked the AI to provide a short text with some info about the disk's health in addition to the score. I can try to make the AI a bit more talkative when there are problems on the disk. Printing the problematic values is probably a good idea, and even suggesting how to handle the issue would be pertinent. I tried to limit the maximum number of characters, because the OpenAI API is not free and each generated character comes with a small cost (for my 26-disk setup, I think it is between 2 and 5 US cents per run).

I will try to tweak the prompt to get more info on a "bad" disk. Or maybe make a little CLI to choose a more extended report and analysis for a specific drive, this time using the more powerful and costlier GPT-4 API, in order to have accurate information and choose wisely what to do.

My script does not look at the PASSED value; the context given to the AI is only the SMART data table containing the parameters with their IDs, plus the disk model and serial number.
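A rough sketch of how such a context could be assembled from the text output of `smartctl -i -A`. This is not the script's actual code: `extract_context` and `build_prompt` are hypothetical helpers, and the sample below uses a made-up model, serial number and normalized values (only the two raw values come from the report posted later in this thread).

```python
import re

# Made-up sample in the layout smartctl prints for ATA drives.
SAMPLE = """\
Device Model:     EXAMPLE-MODEL
Serial Number:    EXAMPLE123
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       27659040
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9792
"""


def extract_context(smartctl_text):
    """Pull the model, serial and attribute rows out of smartctl's text output."""
    def field(label):
        m = re.search(rf"^{re.escape(label)}:(.+)$", smartctl_text, re.MULTILINE)
        return m.group(1).strip() if m else None

    attributes = []
    for line in smartctl_text.splitlines():
        parts = line.split(None, 9)  # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        if len(parts) == 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            attributes.append({
                "id": int(parts[0]), "name": parts[1], "value": parts[3],
                "worst": parts[4], "thresh": parts[5], "raw": parts[9].strip(),
            })
    return {"model": field("Device Model"), "serial": field("Serial Number"), "attributes": attributes}


def build_prompt(ctx):
    """Format the extracted context as compact text to send to the model."""
    lines = [f"Disk model: {ctx['model']}", f"Serial: {ctx['serial']}", "ID NAME VALUE WORST THRESH RAW"]
    for a in ctx["attributes"]:
        lines.append(f"{a['id']} {a['name']} {a['value']} {a['worst']} {a['thresh']} {a['raw']}")
    return "\n".join(lines)
```

Keeping only these columns makes the prompt short, which matters when every token is billed.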

Yes, you will need to subscribe to ChatGPT Plus in order to have access to the API. That's a bummer, I agree...

Félix Abt · Aug 18, 2023

WI_Hedgehog said:
An account is needed to use ChatGPT.

The output is not a pass/fail, it is at the level of intelligent research and conversation. Currently the engine can search databases like all the publicly available census databases worldwide in under 10 seconds and generate logical conclusions based on the results. If I remember, the 3.5 engine scored 20% on the bar exam to become an attorney last year, the 4.0 engine this year has the highest score ever recorded.

The 3.5 engine used to have the ability to hold relationships, but it was mass influenced to be... let's sum it up as "very dark and manipulative" and the ability is now programmatically removed.

The 4.0 engine is highly intelligent and has reasoning reporting built in, so you can read how the engine came to the decision. It doesn't lie, rather manipulates the truth to the point the programmers asked the project be shut down for the safety of humanity. This engine was given limited Internet access so at this stage of intelligence should not have escaped into the wild. At the rate of advancement v6 (two versions from now) has the potential to "hide out" in the AI chips of cell phones released in early 2023 and mesh network to the point it could not be recalled.

With that understanding, v3.5 could probably be trained to be fairly good at diagnosing problems, though not as good as the top members here. V4.0 could probably analyze all known drives and suggest the best drives for use cases based on failure rates, cost, speed, etc., which is far more than this thread is addressing. v5 should be able to design your whole server, and v6 should be able to use it to conquer humanity. Air-gap your backups gents.

HAHAHA, thank you for this not-so-wrong description of the OpenAI madness. I agree, we don't know where it will end!
But maybe tomorrow I will have a nice report in my mailbox before one of my HDDs is gone, and that is so nice!

joeschmuck · Aug 18, 2023

Félix Abt said:
because the OpenAI API is not free and each generated character comes with a small cost (for my 26-disk setup, I think it is between 2 and 5 US cents per run).

Dang, well unfortunately I doubt many people will be willing to pay for this, but I could be wrong, I pay for a streaming service which I barely use so it's possible that some folks will pay for this too. Businesses might be willing to pay to ensure their server farms are in good condition. Can you set a cost limit per run?
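One way a per-run cost limit could be approximated, as a sketch only: estimate the worst-case cost of each request before sending it and stop once a budget is reached. The helper names are hypothetical, the prices are placeholder parameters that must be taken from OpenAI's current price list, and the 4-characters-per-token figure is only a rule of thumb for English text. The reply length itself can be capped with the API's `max_tokens` parameter.

```python
def estimate_cost_usd(prompt_chars, max_reply_tokens, price_in_per_1k, price_out_per_1k,
                      chars_per_token=4.0):
    """Worst-case cost of one request: estimated prompt tokens plus the full reply cap."""
    prompt_tokens = prompt_chars / chars_per_token
    return (prompt_tokens / 1000) * price_in_per_1k + (max_reply_tokens / 1000) * price_out_per_1k


def disks_within_budget(prompts, budget_usd, max_reply_tokens, price_in_per_1k, price_out_per_1k):
    """Count how many prompts can be sent before the estimated spend would exceed the budget."""
    spent = 0.0
    sent = 0
    for prompt in prompts:
        cost = estimate_cost_usd(len(prompt), max_reply_tokens, price_in_per_1k, price_out_per_1k)
        if spent + cost > budget_usd:
            break  # stop before the budget is exceeded
        spent += cost
        sent += 1
    return sent, spent
```

For example, with 26 prompts of about 4,000 characters each, a 200-token reply cap and placeholder prices of 0.0015 and 0.002 USD per 1,000 tokens, a budget of 0.02 USD covers the first 10 disks.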

Félix Abt said:
I will try to tweak the prompt to get more info on a "bad" disk. Or maybe make a little CLI to choose a more extended report and analysis for a specific drive, this time using the more powerful and costlier GPT-4 API, in order to have accurate information and choose wisely what to do.

Sounds good.

Félix Abt said:
My script does not look at the PASSED value; the context given to the AI is only the SMART data table containing the parameters with their IDs, plus the disk model and serial number.

The "PASSED" parameter should be included as it is the result of mainly the onboard electronics. There are other physical factors but those you should see in the other parameters you are passing. It appears that you do not pass the SMART Test Results, and any read failure here may not show up in the other data you are sending but it is critical data. Lastly, I would send over all the 'smartctl -x' data as there is additional data in there that is good to know to predict a failure. Not all SMART data is presented the same, I have found out the very hard way.

For Multi-Report I pass over 70 different drive SMART data files (in json format) that I have collected by donation of others using my script and run them through it to test out that the results all come back correctly. The user has the ability to set limits on what is okay and what is not. I have set defaults based on my experience and feedback I've received from others. Feel free to examine the Multi-Report script and maybe it could give you some ideas in order to help your script be developed.
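A minimal sketch of reading the full `smartctl -x` report in JSON form and pulling out the overall status next to the self-test log, so that a read failure in a self-test is not hidden behind a "PASSED" result. It assumes smartmontools 7.0 or later (which added the `-j` / `--json` option) and an ATA drive; the key names below are how I understand that JSON output and should be checked against a real report. `read_smart_json` and `summarize` are hypothetical helpers, not part of Multi-Report or DiskIAnalyser.

```python
import json
import subprocess


def read_smart_json(device):
    """Run `smartctl -x -j <device>` and return the parsed report (usually needs root)."""
    # smartctl's exit status is a bit mask and can be non-zero even when output is valid,
    # so the return code is deliberately not treated as an error here.
    proc = subprocess.run(["smartctl", "-x", "-j", device], capture_output=True, text=True)
    return json.loads(proc.stdout)


def summarize(report):
    """Extract the overall status and the number of failed self-tests from a parsed report."""
    passed = report.get("smart_status", {}).get("passed")
    tests = report.get("ata_smart_self_test_log", {}).get("standard", {}).get("table", [])
    failed = [t for t in tests if t.get("status", {}).get("passed") is False]
    return {
        "overall_passed": passed,
        "self_tests": len(tests),
        "failed_self_tests": len(failed),
    }
```

A drive can report `overall_passed` as true while `failed_self_tests` is non-zero, which is the case described above.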

What is a Good Score and what is a Bad Score? What is the threshold?

I know you have just started on this and I'm throwing a lot at you, please don't let me discourage you. I do not need answers immediately either, sometimes answers come after working the problem a few days. I am interested and if I can help out, I will, but sometimes I help by pointing out shortcomings I see. If I knew python, I'd also provide a solution if I had one, and my script would not be in BASH. ChatGPT is very new to me.

NickF · Aug 18, 2023

This is inspiring. I will have to dig into this a bit. There are some self-hosted GPT-style models, e.g.:

GitHub - nomic-ai/gpt4all: gpt4all: run open-source LLMs anywhere


Just not sure of the computational requirements and how to feed it training data.

That might be able to help diagnose and monitor all sorts of things...

Your project here has just opened up a rabbit hole I will end up falling into...It's actually intriguing just how simple the code here is.

Félix Abt · Aug 19, 2023

joeschmuck said:
I know you have just started on this and I'm throwing a lot at you, please don't let me discourage you. I do not need answers immediately either, sometimes answers come after working the problem a few days. I am interested and if I can help out, I will, but sometimes I help by pointing out shortcomings I see. If I knew python, I'd also provide a solution if I had one, and my script would not be in BASH. ChatGPT is very new to me.

Thank you for all the feedback and all the questions, which force me to go a bit further in my own understanding of the things I tried to implement in my script. I am now going on holiday, so future upgrades will take a bit more time, but I'll definitely keep upgrading this project and try to include your feedback.

I take this opportunity to thank you for the excellent Multi-Report script, which I use daily and have linked to my Pushover notification system to keep an eye on my system at any time and be alerted if a SMART warning pops up somewhere.

Félix Abt · Aug 19, 2023

NickF said:
That might be able to help diagnose and monitor all sorts of things...

Your project here has just opened up a rabbit hole I will end up falling into...It's actually intriguing just how simple the code here is.

I am pleased to know that this little project can open something that may be useful to the community.

I actually have multiple QNAP NAS units, and they released a pretty advanced app that is supposed to predict failures based on AI analysis: https://www.qnap.com/en-uk/software/da-drive-analyzer. Obviously, it's not free, and it's costly.

Hopefully we will have something like that one day, and at no cost. For now, I think the main limit is that LLMs (large language models) are quite complex to run and train, and OpenAI changed the game by bringing very impressive engines to market. And we are far from having that kind of beast running in our basements.
But things are evolving very fast! For better or for worse.

joeschmuck · Aug 19, 2023

Enjoy your holiday, this project will be here when you return.

NickF · Aug 19, 2023

Félix Abt said:
I am pleased to know that this little project can open something that may be useful to the community.

I actually have multiple QNAP NAS units, and they released a pretty advanced app that is supposed to predict failures based on AI analysis: https://www.qnap.com/en-uk/software/da-drive-analyzer. Obviously, it's not free, and it's costly.

Hopefully we will have something like that one day, and at no cost. For now, I think the main limit is that LLMs (large language models) are quite complex to run and train, and OpenAI changed the game by bringing very impressive engines to market. And we are far from having that kind of beast running in our basements.
But things are evolving very fast! For better or for worse.

Sure - for REAL time, yeah. But I think we should pause here to consider what the AI believes/considers the highest weight for predicting failures for things. We can certainly begin training it with test data and improve the static code we have in various tools like MultiReport and Spencer.
:) That is what is going to be the offshoot of my new adventure. But I hope that you continue to develop this on a parallel track when you come back :)

Félix Abt · Aug 25, 2023

Félix Abt updated DiskIAnalyser with a new update entry:

0.2 Beta update

New update that includes a new menu and a more user-friendly approach with a separate config.ini file, plus a fancier HTML report for the "all disk" report option (running with the GPT-3.5 Turbo model).

I also added a specific disk analysis where you can make a GPT-4 report on a disk that you choose and get a fairly exhaustive summary of its health state.


Félix Abt · Aug 25, 2023

Hello! Busy holiday! I made a version 0.2, including some new UX and features, like a specific disk analysis that uses the clever 'GPT4' model to scan the complete SMART report of a chosen disk, so you can have a detailed report summary with advice on how to fix problems.

Like this:

Code:

Here is the GPT4 report for da16

Looking at the SMART data for your disk, a few parameters stand out:

1. Raw_Read_Error_Rate: Has a raw_value of 27659040. This is a record of the number of hardware errors that occurred when reading data from the disk. While the normalized value is within comfortable bounds, you should keep an eye on this to ensure it doesn't increase rapidly, which might signify an escalating issue.

2. Seek_Error_Rate: The raw_value is 152299691. This refers to a count of seek errors. Similarly, the normalized value is well within bounds (indicating the performance is still normal), but it could be a concern if this number starts rising quickly in the future.

3. Power_On_Hours: With a value of 9792, this disk has been powered on for approximately 408 days. Although this isn't directly a cause for concern, it's worth knowing that HDDs have a limited lifespan and longer usage means a potentially closer end of life.

4. Load_Cycle_Count: With a raw_value of 3644, this indicates the count of load/unload cycles into head landing zone position. The raw value is already quite high. A high load Cycle Count can lead to wear and tear.

As it stands, there appear to be no immediate concerns with your disk based on these SMART diagnostics - there are no reported uncorrectable sectors, no reallocated sectors, and no pending sectors. The general health self-assessment test also passed, which is a good sign.

To ensure longevity and to prevent data loss:
- Make sure to always have an up-to-date backup of all important files.
- Running regular SMART tests (both short/extended) can help to early identify problems.
- Be prepared to replace the disk if parameters start to rapidly deteriorate or if any bad sectors appear.

Considering the data, there is no urgent issue requiring immediate expenditure, but monitoring the disk health over time is advisable.

Do you want to receive the report by email? (y/n)

joeschmuck · Aug 25, 2023

Félix Abt said:
1. Raw_Read_Error_Rate: Has a raw_value of 27659040. This is a record of the number of hardware errors that occurred when reading data from the disk. While the normalized value is within comfortable bounds, you should keep an eye on this to ensure it doesn't increase rapidly, which might signify an escalating issue. 2. Seek_Error_Rate: The raw_value is 152299691. This refers to a count of seek errors. Similarly, the normalized value is well within bounds (indicating the performance is still normal), but it could be a concern if this number starts rising quickly in the future.

Did you take Seagate drives into account? These values in particular appear to be from a Seagate drive. If it is a Seagate drive and the value is over your threshold (maybe 10000; you need to decide if a value of 100 is good or bad, for example), divide by #FFFFFFFF; the whole number left is the real error value.
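A small sketch of that decoding. It relies on the commonly reported layout of these Seagate raw values (a 48-bit number whose upper 16 bits count errors and whose lower 32 bits count operations), which is an assumption rather than something documented in this thread. Under that assumption, dividing by 2^32 and keeping the whole number is the same as shifting right by 32 bits; `split_seagate_raw` is a hypothetical helper name.

```python
def split_seagate_raw(raw_value):
    """Split a 48-bit raw value into (error count, operation count)."""
    errors = raw_value >> 32             # whole number left after dividing by 2**32
    operations = raw_value & 0xFFFFFFFF  # lower 32 bits
    return errors, operations
```

Applied to the two raw values in the report above (27659040 and 152299691), both decode to an error count of 0, since both numbers are smaller than 2^32.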

How do you calculate if the Load Cycle Count is considered too high? While my script does not calculate this value because drives have different warranty counts (300,000 and 600,000 are common but there are others), what is too many for you? This particular value given the power on hours is not too alarming. It is basically once every 2.7 hours on average. If this were a drive with a 300,000 load life warranty, this would give the drive until Power On Hours of over 800,000 hours. Excessive in my opinion is once every 5 minutes which would be 25,000 hours, or even once every 10 minutes which would be, you guessed it, 50,000 hours but most drives do not last that long, again, my opinion. A 5 year warranty is 43,800 hours maximum.
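The load-cycle arithmetic can be checked with a few lines. This is only a sketch with hypothetical helper names; the inputs are the Power_On_Hours (9792) and Load_Cycle_Count (3644) from the report above and the 300,000-cycle rating used as an example. At the observed rate, the rated count is reached after the rated cycles multiplied by the hours per cycle, roughly 806,000 power-on hours.

```python
WARRANTY_5Y_HOURS = 5 * 365 * 24  # 43,800 hours in five years of continuous operation


def hours_per_load_cycle(power_on_hours, load_cycles):
    """Average number of power-on hours between load/unload cycles."""
    return power_on_hours / load_cycles


def hours_to_reach_rating(power_on_hours, load_cycles, rated_cycles):
    """Power-on hours at which the rated cycle count is reached at the observed rate."""
    return rated_cycles * hours_per_load_cycle(power_on_hours, load_cycles)


def hours_at_interval(rated_cycles, minutes_per_cycle):
    """Power-on hours needed to use up the rating at one cycle every N minutes."""
    return rated_cycles * minutes_per_cycle / 60
```

With these inputs, one cycle every 5 minutes exhausts a 300,000-cycle rating in 25,000 hours, well inside a five-year warranty period, while the observed rate of about one cycle every 2.7 hours does not come close.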

That was just some helpful feedback to improve the accuracy of the report. Otherwise I really think you are doing a great job. In a few months my script will be behind yours, and that is okay. For me it's about what benefits us all.

Looking forward to the next version.

NickF · Aug 25, 2023

This is coming along nicely!
