Machine Learning for Locating Bugs in Source Code
Machine learning can be used to do some nifty stuff. Did you know it can find programming bugs before you do?
It's called software defect prediction, or SDP for short. This article describes how to implement a basic SDP system. I'd also recommend any of my publications, which you can find here.
The global cost of debugging software was estimated at $312 billion annually as of 2014. On average, software developers spend 50% of their programming time finding and fixing bugs.
This article presents a quick and easy example of an SDP implementation, using the scikit-learn implementation of Random Forest. Like any classification task, SDP needs a training dataset:
The Training Data
Software engineers have invented a ton of software metrics for measuring source code. They can be as simple as counting the number of lines or the number of variables, or as intimidating-sounding as cyclomatic complexity. If you want to check out more of the software metrics used in SDP, read this research paper. In SDP, the term module refers to a section of source code. Typically, we'll consider each function/method as a module, though classes are commonly used instead. Each module then gets measured using a suite of software metrics. The most important measure is whether or not the module contains bugs: that becomes the label we train on.
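As a toy illustration of module-level measurement (not the metric suite used in the datasets below), here's a sketch that measures a single Python function with two of the simplest metrics mentioned above: lines of code and a crude count of distinct variable names. The `simple_metrics` helper is hypothetical, just to show the idea of turning a module into a row of numbers.

```python
import ast
import textwrap

def simple_metrics(source: str) -> dict:
    """Measure one function with two toy metrics:
    lines of code and number of distinct variable names."""
    tree = ast.parse(textwrap.dedent(source))
    func = tree.body[0]  # assume the snippet is a single function definition
    loc = func.end_lineno - func.lineno + 1
    # Collect distinct names referenced in the function body
    variables = {
        node.id for node in ast.walk(func) if isinstance(node, ast.Name)
    }
    return {"loc": loc, "variable_count": len(variables)}

example = """
def clamp(x, lo, hi):
    result = x
    if result < lo:
        result = lo
    if result > hi:
        result = hi
    return result
"""
print(simple_metrics(example))  # {'loc': 7, 'variable_count': 4}
```

A real metric suite would compute dozens of such measures per module and stack them into a table, one row per function.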
NASA's Public SDP Training Data
NASA developed a spacecraft instrument in the C programming language. The NASA Metrics Data Program considered each function to be a module, then measured each module using software metrics. They stored all of these measures in a table where each column is a software metric and each row is a module. You can check out this file online! It's called CM1. If you're interested in the metrics NASA used to measure their source code, check out the Halstead complexity measures. NASA also used several variants of the above-mentioned cyclomatic complexity, plus several basic metrics such as number of lines. In CM1 you'll find 22 software measures for each module.
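The Halstead measures mentioned above are derived from four raw counts per module: distinct operators (n1), distinct operands (n2), total operators (N1), and total operands (N2). As a sketch of how the standard formulas fall out of those counts, here they are in Python; the counts fed in at the end are hand-tallied, hypothetical values for a small function, not figures from CM1.

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead complexity measures from four raw counts:
    n1/n2 = distinct operators/operands, N1/N2 = total operators/operands."""
    vocabulary = n1 + n2          # n = n1 + n2
    length = N1 + N2              # N = N1 + N2
    volume = length * math.log2(vocabulary)      # V = N * log2(n)
    difficulty = (n1 / 2) * (N2 / n2)            # D = (n1/2) * (N2/n2)
    effort = difficulty * volume                 # E = D * V
    return {"volume": volume, "difficulty": difficulty, "effort": effort}

# Hypothetical counts for a small function, for illustration only
measures = halstead(n1=7, n2=5, N1=14, N2=11)
print(round(measures["volume"], 1))  # 25 * log2(12) ≈ 89.6
```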
Building a Classifier from the Training Data
In the previous step, a flat dataset was extracted from some existing source code. The unstructured data (source code) was transformed into structured data (measures for each module). In technical jargon, to build a classifier, a set of training features X is fitted to a set of training labels Y. In everyday language, the flat dataset is fed into an algorithm which generates a classifier. In my work with SDP, I have always referred to the resulting classifier as an SDP system. Any new function can be passed to the SDP system, and either the word “buggy” or “not buggy” will be returned. Depending on the choice of algorithm, a confidence measure is also given.
Doing This With Python
I’ll run through how I used Python (with Scikit-Learn and Pandas) to build an SDP system using the supervised classification algorithm Random Forest. First, I read in the data from CM1.csv (the dataset mentioned above). Then I split the data into the software measures and whether or not each module was buggy. In CM1, the Defective column is True if the module contains bugs, and False otherwise.
import pandas as pd

data = pd.read_csv('CM1.csv')
software_measures = data.drop('Defective', axis=1)
buggy = data.loc[:, 'Defective']
Let's take one third of the data away so that we can use it to test our SDP system later.
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = \
    train_test_split(software_measures, buggy, test_size=0.33, random_state=42)
Building the SDP system using a random forest classifier is a simple three-liner.
from sklearn.ensemble import RandomForestClassifier

sdp_system = RandomForestClassifier()
sdp_system.fit(features_train, labels_train)
Now that the SDP system is built, it's important to test how accurate its predictions are. We can check that using the testing data which was set aside above.
from sklearn.metrics import accuracy_score

accuracy_score(labels_test, sdp_system.predict(features_test))
This outputs 85% as the accuracy score. Pretty great performance, right? Maybe not! SDP suffers from a problem known as class imbalance, which can render accuracy useless as a performance measure. Check out my published works if you're interested in how I've tackled this problem in the past.
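To see why accuracy can mislead here, consider a small sketch with synthetic labels (not the CM1 data): a do-nothing classifier that marks every module as clean still scores 90% accuracy on a dataset where only 10% of modules are buggy, while an imbalance-aware measure like F1 exposes that it never finds a single bug.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels mimicking an imbalanced SDP dataset:
# 90 clean modules, 10 buggy ones.
y_true = np.array([False] * 90 + [True] * 10)

# A useless "classifier" that labels every module as clean.
y_pred = np.zeros(100, dtype=bool)

print(accuracy_score(y_true, y_pred))              # 0.9 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0 -- catches no bugs
```

This is why SDP evaluations usually report measures such as F1, precision/recall, or balanced accuracy alongside (or instead of) raw accuracy.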
I've worked academically in the SDP space for over 4 years now. I'd love to see it adopted more in industry. My PhD research proposes a few ways for this to happen, but it would likely require a fruitful collaboration between academia and industry.