After collecting data over time, use the Puppet metrics dashboard to review your historical data and see how performance changes over time. You can use timestamps from the metrics as starting points to review changes, logs, and code deployments, and then resolve performance issues.
Version and installation information
Puppet version: 5.0 and later
Component and version: Puppet Server, PuppetDB, and PostgreSQL
OS: Any
Installation type: Any
Solution
The steps in this article assume that you’ve installed the Puppet metrics dashboard and have been collecting metrics for a minimum of two hours. If you have time, collect metrics for a week or more.
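If you still need to set up collection, the following is a minimal sketch that classifies a dedicated dashboard node with the puppetlabs/puppet_metrics_dashboard module. The node name is hypothetical, and the module must already be installed from the Forge; see its documentation for parameters that control which Puppet Server, PuppetDB, and PostgreSQL hosts are polled.

    # Minimal sketch: classify the node that hosts the metrics dashboard.
    # Assumes the puppetlabs/puppet_metrics_dashboard module is installed;
    # the node name below is hypothetical.
    node 'dashboard.example.com' {
      include puppet_metrics_dashboard
    }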
Let’s go through an example of troubleshooting.
Look for times when performance changed
After you’ve collected data, open the dashboards and look through your historical data. In the graphs and linked dashboards, look for sudden spikes or dips to figure out when a performance issue started. Let’s start by looking at the Puppetserver Performance dashboard to check Puppet Server performance.
These graphs show how performance changes over time for Average Requested JRubies and Average Free JRubies. Before 8 a.m. on 18 November, Puppet Server was running normally and had free resources. At around 8 a.m., a change affected performance: requests started taking longer, and there was a spike in requested JRuby instances. After the initial spike ended, there were additional spikes until around 8 a.m. on 19 November. At that point, a change caused an outage, and there were not enough free JRuby instances available to process requests. Requests backed up, and the number of average free JRuby instances stayed near zero. With no free JRuby instances to process requests, incoming requests were queued. After the queue filled, requests were rejected. As a result, agent runs either timed out or took a very long time.
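For reference, the requested and free JRuby counts are measured against the JRuby pool size configured in Puppet Server. The sketch below shows one way to manage that ceiling on an open source installation; it assumes the puppetlabs/hocon module and the default conf.d path, and Puppet Enterprise manages this setting through its own configuration classes instead. Each extra JRuby instance also consumes more heap, so treat pool tuning as a mitigation while you look for the underlying cause.

    # Sketch: manage the JRuby pool ceiling that the free-JRubies metric is
    # measured against. Assumes the puppetlabs/hocon module and the default
    # open source config path; restart puppetserver after changing it.
    hocon_setting { 'puppetserver max-active-instances':
      ensure  => present,
      path    => '/etc/puppetlabs/puppetserver/conf.d/puppetserver.conf',
      setting => 'jruby-puppet.max-active-instances',
      value   => 4,
    }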
This timeline gives you two starting points to investigate. However, because many people were working in this environment at the time, you can’t yet be sure what changes were made on the 18th and 19th at around 8 a.m. Let’s take a deeper look at the metrics.
Look for other performance changes at the times you identified
If you look at other dashboards, you can see which components were overloaded or had performance issues at the times you identified. They might even show the underlying cause. Let’s look at the Heap Usage graph to see heap utilization and garbage collection metrics.
In this screenshot, committed memory is the orange line, second from the top, and it trends downward. This is likely heap exhaustion causing full garbage collections on the JVM heap. When the number of queued JRuby requests goes up, heap utilization also goes up. The heap can also be consumed by complex JRuby requests, including large facts, catalogs, and reports.
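For context, the heap ceiling in this graph comes from the JAVA_ARGS setting in Puppet Server’s defaults file. The sketch below assumes an EL-family layout, the puppetlabs/stdlib module for file_line, and a 2 GB heap chosen purely as an example; Debian-family systems use /etc/default/puppetserver, and Puppet Enterprise manages JAVA_ARGS through its own classes.

    # Sketch: pin Puppet Server's JVM heap size. The path and heap size are
    # assumptions; restart puppetserver after changing JAVA_ARGS.
    file_line { 'puppetserver java_args':
      path  => '/etc/sysconfig/puppetserver',
      line  => 'JAVA_ARGS="-Xms2g -Xmx2g"',
      match => '^JAVA_ARGS=',
    }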
The PuppetDB Workload dashboard shows the performance of PuppetDB’s connection to PostgreSQL. Average Write Duration is the average time it takes to write a request to the database. Look for trends around the times you identified earlier. You can see an upward trend over a few days, with a large spike starting on 18 November, showing that a performance issue was building for several days and reached a critical point on the 18th. Because the SQL writes are taking longer, the amount of data coming from the agents is probably getting larger, so check the PuppetDB Performance dashboard to identify performance issues with PuppetDB command processing.
This screenshot shows the average request time for each request type. Catalog and facts request times are growing slightly over time, likely due to the increased write time in PostgreSQL. Store Report Time was close to 0 ms until 11 November, when it started growing, and it showed clear degradation by 17 November, spiking to several seconds. The cause of the performance issues is likely this sustained, large increase in report store time, driven by increased report size. When the store report time increases, JRuby instances are occupied for longer, causing a backup of queued requests. The backup depletes the Puppet Server heap, causing further performance issues.
In the timeline, you can see that the code deployment containing the changes happened a while ago on a subset of nodes. Store report time increased every day at around 8 a.m., with large increases on some days. You can use this information to ask questions about changes to the environment, and then review code commits and new node deployments for issues.
What really happened
The performance issue in this example was due to a recursive file resource in the agent’s catalog that had been in the code for many years. When applications were updated, thousands of files were added to a recursively managed directory, which increased the report size from a few hundred KB to several MB. Managing the recursive file permissions with the npwalker-recursive_file_permissions module prevented thousands of file resources from being added to the report. The report size was reduced, and the issue was resolved.
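To make the difference concrete, here is a hedged sketch of both approaches. The directory path, ownership, and modes are hypothetical, and the recursive_file_permissions parameters follow the module’s documented usage, so verify them against the module version you install.

    # Anti-pattern: with recurse => true, every file under the directory
    # becomes a resource in the catalog and an entry in the report.
    file { '/opt/app/data':
      ensure  => directory,
      owner   => 'app',
      group   => 'app',
      recurse => true,
    }

    # Alternative with npwalker-recursive_file_permissions: ownership and
    # modes are corrected by a single resource, keeping the report small.
    recursive_file_permissions { '/opt/app/data':
      owner     => 'app',
      group     => 'app',
      file_mode => '0644',
      dir_mode  => '0755',
    }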