This is a description of a high-level performance optimization workflow, primarily focused on optimizing the design of our program, its algorithms and data structures, and ignoring low-level optimizations unless they are absolutely necessary.
I've written this as a reference for my future self, as someone who only occasionally does performance optimization and has to re-learn the tools and tricks each time. It contains a summary of helpful tools and a suggested workflow that anyone can use as inspiration for their own.
Detecting a performance problem
It's easy to waste time when working on performance, because it's an inherently interesting and fun activity, but one with no value in itself beyond the thing being optimized. Therefore we must start by finding a real problem that needs improvement.
I'll use a real example in this post, from new analysis capabilities that we are developing at CodeScene right now. In this specific case, it's an analysis that deals with knowledge, and the loss of knowledge in a code base over time, for example measuring the impact on the rest of the team when a developer leaves their job.
See Knowledge Distribution in the CodeScene docs if you want to know more.
After wiring this into the analysis in our staging environments, we saw a spike in analysis durations, as reported by our cloud monitoring, and we set off to investigate.
As an aside, but an important one: if you're not monitoring your performance, you won't know if something has a real performance impact. We could write a separate blog post about monitoring performance, but a very simple way to do it is to log durations for potentially expensive operations and extract them using some of the log aggregation tools available in your environment.
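For illustration, logging a duration can be as simple as wrapping the operation in a small macro. This is a minimal sketch rather than our actual implementation; it assumes clojure.tools.logging and emits the step-name/duration fields that the query below parses (the step name and wrapped call are illustrative):

(require '[clojure.tools.logging :as log])

;; Sketch: run the body, log the step name and elapsed time, return the result.
(defmacro timed-step
  [step-name & body]
  `(let [start# (System/nanoTime)
         result# (do ~@body)
         duration# (quot (- (System/nanoTime) start#) 1000000)]
     (log/infof "step-name='%s' duration=%d" ~step-name duration#)
     result#))

;; Usage (hypothetical entry point):
;; (timed-step "4 factor trends and KPIs"
;;   (calculate-trends-and-kpis project-data))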
We use something like this in our AWS CloudWatch monitoring:
fields @message
| filter @message like /timed-analysis-progress/
| parse @message "step-name='*' duration=*" as stepName, duration
| stats min(duration), avg(duration), max(duration) by bin(1h)
This is the basis of finding something worth optimizing. Now we can set off to investigate what's causing the issue.
Investigation
After being alerted by our monitoring, we can drill down and see which analysis step is taking too long:
INFO [codescene.analysis.timed-analysis-progress:64] - Completed step-name='4 factor trends and KPIs' in duration=46085 ms
This is immediately suspicious to us, since we were working on those new analysis features at the time, and we had implemented them in a naive way at first. Now we have something to work with: an identified namespace, and measurements in place around the suspicious function.
Analyze
We've established that we have a problem: some function in the calculation takes too long to run. Now we can run clj-async-profiler to get a good idea of what that time is being spent on.
This profiler is a sampling profiler, which we will use in its CPU profiling mode to figure out how much time is being spent in different parts of the code. After following the step-by-step instructions in the README we can produce sampled output. The output can be visualized as a flame graph in the browser, read as plain text, or loaded into a dedicated flame graph viewer like speedscope.
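In practice that is roughly this small. A sketch, assuming a REPL with clj-async-profiler on the classpath and the JVM flags its README asks for; calculate-trend* and project-data stand in for whatever you want to measure:

(require '[clj-async-profiler.core :as prof])

;; Profile one run of the suspicious analysis step.
(prof/profile
  (calculate-trend* project-data))

;; Or wrap a longer scenario explicitly:
;; (prof/start)
;; ... exercise the code under investigation ...
;; (prof/stop)   ; writes the flame graph to a temp directory (by default under /tmp/clj-async-profiler/)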
Here's output from the sampler, when running our instrumented suspicious function:
total time symbol name
43,754 (94%) codescene/factors/knowledge/knowledge_kpi$calculate_trend_STAR_.invoke
43,709 (94%) codescene/factors/knowledge/knowledge_kpi$historical_knowledge_at.invoke
22,975 (49%) codescene/factors/knowledge/knowledge_kpi$main_developers_from.invoke
18,801 (40%) codescene/factors/knowledge/knowledge_kpi$knowledge_loss_at.invoke
We see that half the time of our suspicious function calculate-trend is being spent in main-developers-from. Let's start there!
Looking for places to improve
There are a number of approaches we can take, but let's look deeper in the call stack of main-developers-from:
All the time there is spent in churn/by-main-developer. This is also where the other large chunk of time in calculate-trend is being spent, so it seems promising.
My first suspicion when something is slow is that I'm waiting for IO. The reason I always suspect this is Latency Numbers Every Programmer Should Know: CPU and memory are so fast that if something is taking a long time, it's often because we're waiting for some of the slower parts of the system, like the network or disk IO.
In a sampling CPU profiler, this should show up as waiting in a function that does IO, but here it seems we're not stuck waiting, since I can't see any IO-related function calls. If we were working on a black box, we'd need to use wall clock profiling to know where our time is spent, since we can't know which threads the CPU profiler ignores. This could be a problem if the IO happens on a different thread than the main calculation we are investigating.
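For completeness: clj-async-profiler can switch which event it samples via an options map, which is how you'd approximate wall clock profiling. Treat the exact event value as an assumption to verify against the README for your version:

;; Assumption: your clj-async-profiler/async-profiler build supports a wall clock event.
;; With it, threads blocked on IO show up in the samples instead of disappearing.
(prof/profile {:event :wall}
  (calculate-trend* project-data))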
But in this case, we do have the source code where the slow function resides, and there's no IO in the call stack. We can see that we're spending a lot of time sorting and iterating the data, which can hint at a poorly chosen data structure or algorithm.
Let's have a look at the offending function:
(defn by-main-developer
  [ds _options]
  (let [vs (->> (grouped-by-author-contrib ds)
                (map pick-main-developer)
                (sort-by (juxt :entity)))]
    {:analysis-result-header [:entity :main-dev :added :total-added :ownership]
     :analysis-result-values vs}))
After instrumenting that function, we see that it's being called with long sequences (30k+ entries in our testing project) that are growing incrementally, with entries like this:
{:author-email "first.last@gmail.com", :date #clj-time/date-time
"2015-06-01T00:00:00.000Z", :entity "testing/software/config.clj" ...}
It would seem we are re-calculating who the main developer is over the whole sequence of commits, for every new commit we consider.
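As a side note, the instrumentation mentioned above doesn't have to be elaborate. A throwaway sketch (illustrative, not our production code) is to wrap the var and log the input size together with the elapsed time:

;; Redefine by-main-developer to log how big its input is and how long it takes.
(alter-var-root #'churn/by-main-developer
  (fn [original]
    (fn [ds options]
      (let [start (System/nanoTime)
            result (original ds options)]
        (log/infof "by-main-developer count=%d duration=%d ms"
                   (count ds)
                   (quot (- (System/nanoTime) start) 1000000))
        result))))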
Improvement
Now we have a place to work on - we have a function that looks like it's performing poorly, and we know where it's called from and with what data structure. We also see that the problem is an inefficiency in our algorithm (or data structure), and not in waiting for IO.
To backtrack a bit, the point of this function is to back-fill a historic trend for knowledge, so users can see the evolution of their knowledge of the codebase over time:
The total runtime for this calculation is a function of both the number of samples and the efficiency of the calculate-trend function. We currently sample the knowledge each day, for a whole year back, so we have 365 invocations of our bad function.
A couple of approaches come to mind.
If we first look at fixing the function itself, it looks like we could do it by restructuring the data a little. To drastically change the performance characteristics, we'd need to save the results from the preceding calculations somehow, so we don't have to process the whole sequence for every calculation.
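The general shape of that change would be something like the sketch below: walk the commits once, keeping running totals per file and author, so each sample date picks up already-accumulated state instead of recomputing the tally from scratch. The names are illustrative, and the :added key is an assumption based on the result header above:

;; Sketch of an incremental approach: accumulate added lines per [entity author].
(defn add-commit [totals {:keys [entity author-email added]}]
  (update-in totals [entity author-email] (fnil + 0) added))

;; Pairs each commit with the accumulated state up to and including it,
;; built in a single pass over the commit sequence.
(defn running-knowledge [commits]
  (map vector commits (rest (reductions add-commit {} commits))))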
But the context for this work is one of getting a feature of speculative value out to customers as fast as possible. So instead of doing an in-depth job of algorithmic optimization, we can lower the sample rate and invoke our bad function fewer times.
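In code, that cheaper fix amounts to generating fewer sample dates. A sketch, assuming clj-time (the dates in our data are clj-time instances) and the illustrative calculate-trend* entry point:

(require '[clj-time.core :as t])

;; One sample per day for a year back: 365 invocations of the expensive step.
(def daily-sample-dates
  (let [one-year-ago (t/minus (t/now) (t/years 1))]
    (map #(t/plus one-year-ago (t/days %)) (range 365))))

;; One sample per week instead: roughly 52 invocations.
(def weekly-sample-dates
  (take-nth 7 daily-sample-dates))

;; (mapv calculate-trend* weekly-sample-dates)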
We decided that from a product perspective we can go for a weekly, or even monthly, sample rate. By doing that we lower the calculation cost linearly, so going from 365 samples to somewhere between 20 and 60 samples should lower the runtime by roughly an order of magnitude. Since we added performance metrics around the function at the beginning, it's now simple to verify that we get the desired performance increase from our simple optimization.
INFO [codescene.analysis.timed-analysis-progress:64] - Completed step-name='4 factor trends and KPIs' in duration=5041 ms
Working on performance optimization in a real product is ultimately an exercise in judgement, and in this case going from about 45 seconds to under 5 seconds is good enough. Hopefully this becomes a hit with our users, and then we'll revisit it and do the algorithmic optimization.
If you're looking to read more, my colleague Joseph has some simple examples on how you can avoid code complexity.