I've had my PSU instance running nearly 2 years in my current org.
One of my earliest and longest-running scripts, a sync job, has particularly poor error handling. While iterating through users, a try/catch block would, on error, put the output of $UserObject | ConvertTo-Json into the thrown error message.
Stupidly, I didn't really check this during development. I recently found that it must have been triggering error after error across thousands of objects, each time spitting out a JSON object containing far more metadata and properties than any error requires.
Subsequently I think it borked my instance.
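For anyone with a similar pattern, here is a minimal sketch of a bounded error message, assuming the goal is only to identify the failing user. Get-SafeErrorMessage, its parameters, and the property names are hypothetical and not part of PSU or the original script.

```powershell
# Hypothetical helper: builds a short, bounded error message for one user.
function Get-SafeErrorMessage {
    param(
        [Parameter(Mandatory)] $UserObject,
        [Parameter(Mandatory)] [System.Management.Automation.ErrorRecord] $ErrorRecord,
        # Assumed identifying properties; adjust to whatever the objects carry.
        [string[]] $Property = @('SamAccountName', 'Id'),
        [int] $MaxLength = 500
    )

    # Serialize only the chosen properties, one level deep, on a single line.
    $summary = $UserObject |
        Select-Object -Property $Property |
        ConvertTo-Json -Depth 1 -Compress

    $message = 'Sync failed for {0}: {1}' -f $summary, $ErrorRecord.Exception.Message

    # Hard cap on length so a single failure cannot flood the job output.
    if ($message.Length -gt $MaxLength) {
        $message = $message.Substring(0, $MaxLength) + '...[truncated]'
    }
    $message
}

# Illustrative data: one user object with a very large extra property.
$users = @(
    [pscustomobject]@{ SamAccountName = 'jdoe'; Id = 1; Extra = ('x' * 10000) }
)

foreach ($user in $users) {
    try {
        throw 'Simulated failure'
    }
    catch {
        Write-Warning (Get-SafeErrorMessage -UserObject $user -ErrorRecord $_)
    }
}
```

The point is that the message size no longer depends on the size of the object that failed; the large Extra property never reaches the output.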
Before I found out that the root cause was the above script, I was having performance issues: some jobs were failing before they even started, I couldn't get to job logs, and the server was flipping all the time and maxing out resources, triggering infra alerts.
I have errors up to my eyeballs in the PSU system logs, relating to SQL timeouts or similar.
Grooming was also failing.
This was impacting both my nodes.
Eventually, after spotting the large data in my script errors, I decided to back up my DB and truncate the following tables (since I didn't care about historical job logs/output):
TRUNCATE TABLE JobOutput
TRUNCATE TABLE JobPipelineOutput
TRUNCATE TABLE JobFeedback
TRUNCATE TABLE JobLog
TRUNCATE TABLE JobQueues
TRUNCATE TABLE JobParameter
I left the Job table alone since it complained about constraints, and it was relatively small at around 283 rows.
Then rebooted my service.
After this, all my errors and issues went away: SQL monitoring cleared, infra resource alerts stopped, and PSU is behaving normally.
So while I realize that this is to a large degree self-inflicted, given the sheer amount of data I was pushing into the JobOutput table, I thought I'd post my experience and my recovery steps here. I'm wondering whether the application itself could handle these issues better, or whether this is just a tough-luck scenario and a case of making sure to implement safe output in future?
Thoughts?
@insomniacc
I appreciate the feedback. The fact that the groom job was failing is really a big part of the problem, because it should have been trimming the JobOutput table for you. That said, I do think it is relatively easy to fall into this trap, and we may need to put some guard rails in place to keep folks from running into this.
If you have any logs left from before your cleanup, I would like to see the error you were getting out of the groom job.
Adam Driscoll
PowerShell Expert and Developer at Devolutions
I got a lot of this:
2026-05-13 09:20:40.578 +01:00 [ERR][Universal.Server.Services.Configuration.ConfigurationSystemWatcher] Error processing configuration change notifications System.ObjectDisposedException: Cannot access a disposed object.
and
2026-05-13 09:31:59.410 +01:00 [ERR][Microsoft.EntityFrameworkCore.Query] An exception occurred while iterating over the results of a query for context type 'PowerShellUniversal.SQL.PsuDbContext'. Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
and
2026-05-13 09:57:58.316 +01:00 [ERR][Microsoft.EntityFrameworkCore.Update] An exception occurred in the database while saving changes for context type 'PowerShellUniversal.SQL.PsuDbContext'. Microsoft.EntityFrameworkCore.DbUpdateException: An error occurred while saving the entity changes. See the inner exception for details. ---> Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
I searched the system log for that day for the word 'groom' and nothing shows up. It's set to 'error' level logging.
The JSON object I mentioned that worked its way into the job's error output was also in the system log, where it was consuming 67k rows!!
As well as the AD user properties, it had a bunch of SQL metadata from the object, like this, for countless lines in the log:
"DefaultView": [
"System.Data.DataRowView",
"System.Data.DataRowView",
"System.Data.DataRowView",
"System.Data.DataRowView",
"System.Data.DataRowView",
"System.Data.DataRowView",
"System.Data.DataRowView",
SQL was in such a state that when I tried a SELECT statement on one of the offending job IDs (JobOutput table), it just timed out, even from SSMS. DELETE statements similarly timed out; thankfully, TRUNCATE worked.
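The DefaultView and System.Data.DataRowView lines in the excerpt above are what you would expect if the serialized object was, or contained, a System.Data.DataRow: ConvertTo-Json follows the row's Table property into the whole DataTable. That is an assumption about the script, not something confirmed in this thread. A minimal sketch of flattening a row to plain column values before serializing, using a hypothetical ConvertTo-PlainRow helper and made-up data:

```powershell
# Small DataTable standing in for a SQL query result (illustrative data only).
$table = [System.Data.DataTable]::new('Users')
[void]$table.Columns.Add('SamAccountName', [string])
[void]$table.Columns.Add('Department', [string])
[void]$table.Rows.Add('jdoe', 'Finance')
$row = $table.Rows[0]

# Hypothetical helper: copies only the column values into a plain object,
# leaving behind DataRow members such as Table, RowState and ItemArray.
function ConvertTo-PlainRow {
    param([Parameter(Mandatory)] [System.Data.DataRow] $Row)

    $plain = [ordered]@{}
    foreach ($column in $Row.Table.Columns) {
        $plain[$column.ColumnName] = $Row[$column.ColumnName]
    }
    [pscustomobject]$plain
}

# The JSON now contains one key per column and nothing else.
$json = ConvertTo-PlainRow -Row $row | ConvertTo-Json -Compress
$json
```

Serializing the flattened object keeps the output proportional to the number of columns rather than to the size of the table the row came from.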
@insomniacc Thanks. That's good info. I've captured this in a work item for us to address.
What I'm considering is adding some sort of "large job log or output" setting. By default, it would warn. I think a lot of times it isn't clear this is happening. At least you would get an early indicator and could address it before it gets out of hand.
Additionally, you could configure it to automatically truncate or ignore. I also think we could have some sort of compaction mechanism where large logs are truncated and only a subset of the job log and output is actually stored per row. The remainder would then be zipped and made accessible for download so there isn't any data loss.
Obviously, there are ways we could further optimize the SQL as well, so we will do some review there as part of this work too.
Adam Driscoll
@Adam Driscoll
Sounds like a good plan, thanks! :)
We have been seeing this a lot in our production environment, where we eventually had to burn and rebuild the DB. We never got this addressed by support either.
Looking forward to seeing what you come up with.