Adam, can you comment on FixCache and its use for bug prediction in codebases by mining CM logs, as Code Maat does? Is it based on the same ideas you present in Crime Scene, or a subset of them? From what I have read about its algorithm, by far the strongest piece of evidence for predicting bugs is “does the module have a lot of bug fixes already?”, i.e. if so, there are probably more! Any insights or comments you may have on this or, more generally, on FixCache would be appreciated. Thanks… -Jay
I haven’t used FixCache myself, but it’s based on related ideas. In general, process and evolutionary metrics outperform any metric you measure from the code itself. The strategies I use in the book, like identifying Hotspots and files with a fragmented development history, are also good predictors of defects. However, I haven’t seen any research that compares these different approaches, so I cannot recommend one over another.
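As a rough illustration of the heuristic raised in the question ("does the module have a lot of bug fixes already?"), the sketch below counts bug-fix commits per file. It is not FixCache itself and not how Code Maat computes its metrics. The commit history is a hypothetical in-memory list, and the keyword pattern used to recognize a bug fix is an assumption; many projects would link commits to an issue tracker instead.

```python
import re
from collections import Counter

# Assumption: a commit is treated as a bug fix if its message contains one
# of these words. This convention is illustrative only.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug|defect)\b", re.IGNORECASE)


def rank_by_bug_fixes(commits):
    """Rank files by number of bug-fix commits, then by total changes.

    commits: iterable of (message, [paths]) pairs.
    Returns a list of (path, fix_count, change_count) tuples.
    """
    changes = Counter()
    fixes = Counter()
    for message, paths in commits:
        is_fix = bool(FIX_PATTERN.search(message))
        for path in set(paths):  # count each file once per commit
            changes[path] += 1
            if is_fix:
                fixes[path] += 1
    return sorted(
        ((path, fixes[path], changes[path]) for path in changes),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


# Hypothetical history, made up for the example.
history = [
    ("Fix null check in parser", ["src/parser.py"]),
    ("Add CSV export", ["src/export.py", "src/parser.py"]),
    ("Bug: wrong offset in parser", ["src/parser.py", "src/lexer.py"]),
    ("Refactor export", ["src/export.py"]),
]

ranking = rank_by_bug_fixes(history)
for path, fix_count, change_count in ranking:
    print(f"{path}: {fix_count} fixes in {change_count} changes")
```

Keeping the total change count next to the fix count makes it easy to compare the two signals, since a file that changes often is likely to collect more fixes as well.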
That said, I’m a big supporter of the idea that we should support our decisions with data, and algorithms like FixCache and Hotspots help us focus on the areas of the code that matter the most for our productivity. I also suspect there’s a strong correlation between the different metrics.
I’ve continued to work on the tooling since I wrote Your Code as a Crime Scene. You can have a look at the new tools at Empear’s site or try the demo of the tools on real projects. The main technique I use now is a combination of different evolutionary metrics that I feed into a machine learning algorithm to prioritize the results. The final step is important since in a large codebase it’s not particularly helpful for any algorithm to report 10% of the files in the project as potential problems; those 10% may represent hundreds of thousands of lines of code. Using the machine learning approach, I’m usually able to narrow it down to 2-5% of the total codebase. And those 2-5% of the code are typically responsible for 20-70% of all reported defects.
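To make the arithmetic behind that prioritization concrete, here is a minimal sketch that selects the highest-ranked files within a fixed share of the codebase and reports what share of known defects they account for. It does not reproduce the Empear algorithm, which is not described here. The `score` field stands in for whatever a prioritization step would output, and all file names and numbers are invented for the example.

```python
def prioritize(files, code_budget=0.05):
    """Pick top-scored files until the lines-of-code budget is used up.

    files: list of dicts with 'path', 'loc', 'score' and 'defects' keys.
    code_budget: maximum share of total lines of code to report.
    """
    total_loc = sum(f["loc"] for f in files)
    total_defects = sum(f["defects"] for f in files)
    budget_loc = code_budget * total_loc

    selected, used_loc = [], 0
    for f in sorted(files, key=lambda f: f["score"], reverse=True):
        if used_loc + f["loc"] > budget_loc:
            break  # the next-ranked file would exceed the budget
        selected.append(f)
        used_loc += f["loc"]

    covered = sum(f["defects"] for f in selected)
    return {
        "paths": [f["path"] for f in selected],
        "code_share": used_loc / total_loc if total_loc else 0.0,
        "defect_share": covered / total_defects if total_defects else 0.0,
    }


# Hypothetical data: scores and defect counts are made up.
files = [
    {"path": "core/billing.py", "loc": 250, "score": 0.9, "defects": 12},
    {"path": "core/session.py", "loc": 200, "score": 0.8, "defects": 8},
    {"path": "ui/forms.py", "loc": 4550, "score": 0.3, "defects": 15},
    {"path": "util/strings.py", "loc": 5000, "score": 0.1, "defects": 5},
]

result = prioritize(files, code_budget=0.05)
print(result)
```

With this made-up data, two files covering 4.5% of the code account for half of the recorded defects. Real codebases will give different numbers.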
I am at the SEI and interested in learning anything I can regarding Code Maat, but particularly your newer tools - looks like you founded your own company around this - congrats! Are there any papers or other technical discussions that you might point me to describing what you’ve done at Empear, the machine learning algorithms you use, performance of the tools, etc.? Thx, Jay
Thanks! Yes, I founded Empear last year with the goal of automating all analyses in Your Code as a Crime Scene. Since then we’ve taken the tooling beyond the analyses in the book and developed a bunch of new techniques that help us make sense of large-scale systems.
The Empear tools also have much better performance than Code Maat. I use them to analyze projects with 15 million lines of code, 200,000 commits and hundreds of developers. We also support analyses of codebases that are split across multiple repositories (quite a common case these days).
The Empear tools are commercial software, but we do support academia with free licenses for research purposes. Perhaps that could be interesting to you?
I’m in the process of writing up some of my case studies and sharing some of the findings we’ve made on how software evolves. I’d be happy to tell you more. You can reach me at adam.tornhill at empear dot com