1. It forces me to learn to be more productive

A few days ago, I needed to copy and paste some JavaScript. Even for such a simple task, I wanted to know how to perform it as efficiently as possible. While a simple (command + c), (command + v) would have done it, I wanted to use Vim’s single-key commands instead (‘y’ to yank and ‘p’ to put). The first challenge was to learn how to quickly select the block of code that I wanted to copy. The Vim wikia has this great article on copying and pasting within Vim.

First I move my cursor to the start of the code I want to copy. Then I enter visual block mode (control + v). Using ‘j’, I move the cursor down until I reach the last line. Finally, I use ‘$’ to extend the selection to the end of every selected line. Pressing ‘y’ then copies the selection, and ‘p’ pastes it after the cursor.

Selecting in Vim

2. Keeping my hands on the keyboard maintains my focus

This isn’t something that I can back up with research or online articles. However, I feel that my focus is steadily maintained if my hands stay on the keyboard. The fingers of Vim users naturally spend more time on the keyboard than on the mouse. Googling “how to maintain focus programming” will illustrate just how common it is to have trouble focussing while programming. Reading the first three results will uncover a common answer: “minimise distractions”. Perhaps, for me, the mouse is also a distraction. This may be particularly true for programmers who use a laptop touchpad instead of a mouse.

3. Great control over tabs and spaces

Every time I have set up Vim on a new machine, the first thing I change in my .vimrc is how tabs and spaces are handled. For the past year or so, I have been on the spaces side of the never-ending tabs vs spaces debate. The very first change I make is to have every press of ‘tab’ create two spaces. I always find myself at the same Stack Overflow question. The top answer on this question teaches you how to set tabs to be the same as four spaces. I simply put the same commands into my .vimrc and replace each ‘4’ with a ‘2’.

  • set tabstop=2
  • set shiftwidth=2

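These two settings control how wide a tab is displayed and how far each indent shifts; for the ‘tab’ key to insert spaces rather than a tab character, ‘set expandtab’ is needed as well. The last time I set up my .vimrc, I ran into a problem where putting a space on either side of the ‘=’ caused an error. Vim’s ‘set’ command requires that there be no whitespace around the ‘=’. The TV show Silicon Valley has a hilarious bit about tabs vs spaces.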
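The whitespace pitfall described above is easy to check for before Vim complains. Here is a minimal Python sketch, assuming a hypothetical helper called find_spaced_set_lines (it is not part of Vim or of any library) and a deliberately simple rule: it only inspects the first option on each ‘set’ line.

```python
import re

# Matches a `set` line where whitespace sits directly before or after the '='.
# Simplifying assumption: only the first option on each `set` line is inspected.
SPACED_SET = re.compile(r"^\s*set\s+\w+(\s+=|=\s+)")


def find_spaced_set_lines(vimrc_text):
    """Return (line_number, line) pairs for `set` lines with spaces around '='."""
    problems = []
    for number, line in enumerate(vimrc_text.splitlines(), start=1):
        if SPACED_SET.match(line):
            problems.append((number, line.strip()))
    return problems


# A made-up two-line .vimrc: the first line is fine, the second is not.
example_vimrc = """set tabstop=2
set shiftwidth = 2
"""

flagged = find_spaced_set_lines(example_vimrc)
for number, line in flagged:
    print(f"line {number}: {line}")
```

To check a real file, the same function could be given the contents of the .vimrc in your $HOME directory instead of the example string.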


4. Sharing .vimrc files among the community is fun

The Vim community regularly shares Vim configurations in the form of .vimrc files. GitHub user amix has a great repo with Vim configurations that I would recommend. The Vim subreddit is an active community with over 30,000 subscribers. There are always posts which describe new and better ways to configure Vim. For example, one of the top posts has some excellent tips for being more efficient in Vim. It’s so easy to share your .vimrc files. I always place my .vimrc file in my $HOME directory. Sometimes it can be difficult to work out which .vimrc file your Vim is loading. I found a simple trick to work this out: typing ‘:echo $MYVIMRC’ will give the path of the .vimrc that was loaded for the current Vim session.

Checking location of the loaded .vimrc

5. Seamless integration with console

I’m a big fan of doing tasks through the console on any OS. You get a feeling of pure control when you are entering very specific commands rather than working through a GUI. For example, I didn’t understand how LaTeX files were compiled until I started regularly compiling them from the command line, but that’s a different story. Lately, I’ve been using macOS, which has a beautiful console. Being able to edit my code in the console using Vim gives me that same sense of control.

I haven’t used other advanced text editors, but I’m content to invest my time into learning Vim rather than Emacs simply because I’ve already started learning Vim.

xkcd’s take on editors for programming.

1. Comment well enough so that you can read your code in six months

My first personal data mining code base was horribly commented. Fast-approaching deadlines put so much pressure on me to produce code quickly. I remember thinking I could come back to the Java code at a later date and comment it. Oh boy, was I wrong! I eventually came back to this code base and got lost in my own code very quickly. A new deadline was approaching and I needed to use the code base. This resulted in multiple all-nighters of slowly losing my sanity until the work was finally completed. Today, I’m certain that I could have avoided ALL of those all-nighters if my code had been commented when it was written. I’ve now written a new code base from scratch which has had MUCH better comments since the very first line of code.

2. Use version control even for small personal projects

I was once writing a decision support tool for the video game Dota 2 using Java. During that project, I switched computers several times, changed operating systems, and sent alpha copies to a few friends for testing. Eventually, I had several versions of the tool on multiple computers. After a year of not touching the project, I decided to finish it. However, I got lost when trying to remember where the latest version was stored, and I ended up giving up on the project altogether. Now, every project that I work on, no matter how small, is kept on GitHub. I try to gradually improve my projects, pushing my code along the way, so I always know the location of the latest version.

3. Always look for existing solutions for your coding problem

During the first year of my Bachelor in Computer Science, I learned C++. Like most new coders, I decided I wanted to make a game. Being the complete newbie I was, I didn’t even think about using an existing library like SDL. I ended up learning several graphics libraries for C++ and realised how naive I had been. (I definitely recommend LazyFoo’s tutorials for learning SDL http://lazyfoo.net/tutorials/SDL/.) Now, when presented with a coding problem, I default to believing that someone else has solved this problem before. This leads me to look for the solution online, which saves a LOT of time!

4. Only collaborate with people who won’t block your progress

Ouch – I learned this lesson the hard way! I was once very enthusiastic about a UI project I was working on with AngularJS. I invited a friend to help me and gave him some tasks to do for the project. This brought the project to a total halt because he kept telling me he was going to work on it “tomorrow” and wouldn’t let me help him. I’m now very careful about who I collaborate with. I also try to make it clear that if they do not do the coding that they commit to, I will jump in and do it for them.

5. It’s very easy to start coding in languages you haven’t learned!

There have been several times when I’ve asked someone to collaborate on a project with me and they have refused because they don’t know the syntax of the language that I want to use. I used to have the same mindset: that I couldn’t work on projects that were written in a language that I didn’t know. For some of the projects that I wanted to work on, I didn’t have the confidence to start unless I had studied a textbook on the necessary language. I once did a major refactoring of a friend’s game AI project written in Lua. At that point, I knew nothing of Lua’s syntax. However, I was able to get started and make progress by googling example code for what I was trying to do. My progress wasn’t even slow! I’ve since fallen in love with learning new languages by doing. It feels so productive!

Machine learning has been used to perform some nifty tasks. My favourite is locating bugs in source code. We data scientists often refer to this as software defect prediction (SDP for short). I’ve been involved in SDP for 4 years now, constantly making improvements to the existing machine learning algorithms used for SDP. I reckon I can do a pretty swell job of explaining it… So grab a cup of coffee and a biscuit and by the time you’ve finished your snack, you’ll understand how SDP works!

An SDP system typically determines which functions are buggy.

Step 1: Measure Existing Source Code

Software engineers have invented a ton of software metrics which are used to measure your source code. They can be as simple as counting the number of lines or the number of variables, or sound as intimidating as cyclomatic complexity. If you wanna check out some more software metrics used in SDP, read this research paper. In SDP, we use the term module to refer to a section of source code. Typically, we’ll consider each function/method to be a module; however, classes can be used as well. Each module then gets measured using a suite of software metrics. The most important measure is whether or not the module contains bugs, since this is the label that the SDP system will learn to predict.
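To make the idea of measuring modules concrete, here is a minimal sketch that treats each Python function as a module and measures it with the standard library’s ast module. The function measure_modules and the example source are made up for illustration, and the complexity figure is only a rough approximation of cyclomatic complexity (one plus the number of branching constructs), not the exact metric a dedicated tool would report.

```python
import ast

# Node types that add a decision point; a rough approximation of
# cyclomatic complexity, not a full implementation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def measure_modules(source):
    """Return one dict of simple metrics per function found in `source`."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            # Count distinct names that are assigned to (parameters are not included).
            assigned = {
                child.id
                for child in ast.walk(node)
                if isinstance(child, ast.Name) and isinstance(child.ctx, ast.Store)
            }
            rows.append({
                "module": node.name,
                "lines": node.end_lineno - node.lineno + 1,
                "variables": len(assigned),
                "complexity": branches + 1,
            })
    return rows


# Made-up source code to measure.
example_source = '''
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value


def total(numbers):
    result = 0
    for number in numbers:
        result += number
    return result
'''

rows = measure_modules(example_source)
for row in rows:
    print(row)
```

Each printed dict is one row of a flat table, with one column per metric. The buggy/not buggy label cannot be computed this way; it has to come from records of which modules actually contained bugs.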

Real Life Example of Step 1

A spacecraft instrument was developed by NASA in the C programming language. The NASA Metrics Data Program considered each function to be a module. Then, by using software metrics, they measured each module. They stored all of these measures in a table where each column was a software metric, and each row was a module. You can check out this file online! It’s called CM1. If you’re interested in the metrics which NASA used to measure their source code, check out the Halstead complexity measures. NASA also used several variants of the above-mentioned cyclomatic complexity and several basic metrics such as number of lines. In CM1 you’ll find 22 software measures for each module!

Step 2: Run a Supervised Classification Algorithm

In Step 1, a flat dataset is extracted from some existing source code. In data science terms, the unstructured data (source code) is transformed into structured data (measures for each module). In technical jargon, a classifier is built by fitting a model to a set of training features X and a set of training labels Y. In everyday language, the flat dataset is fed into an algorithm which generates a classifier. In my work with SDP, I have always referred to the resulting classifier as an SDP system. Any new function can be passed to the SDP system and either the word “buggy” or “not buggy” will be spat out. Depending on the choice of algorithm, a confidence measure is also given.
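The fit-then-predict idea can be sketched without any libraries. The toy classifier below is a k-nearest-neighbour vote, chosen only because it fits in a few lines; it is a stand-in for illustration, not the algorithm used in the rest of this post. The class name NearestNeighbourSDP and all of the measures are made up.

```python
from collections import Counter
import math


class NearestNeighbourSDP:
    """Toy k-nearest-neighbour classifier standing in for a real SDP system."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, Y):
        # "Training" for k-NN just means remembering the labelled modules.
        self.X = list(X)
        self.Y = list(Y)
        return self

    def predict(self, measures):
        # Rank the training modules by Euclidean distance to the new module.
        nearest = sorted(
            zip(self.X, self.Y),
            key=lambda pair: math.dist(pair[0], measures),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        label, count = votes.most_common(1)[0]
        # Confidence is the share of the k neighbours that voted for the label.
        return label, count / self.k


# Made-up measures: (lines of code, cyclomatic complexity) for six modules.
X = [(10, 1), (12, 2), (15, 2), (80, 14), (95, 17), (120, 21)]
Y = ["not buggy", "not buggy", "not buggy", "buggy", "buggy", "buggy"]

sdp_system = NearestNeighbourSDP(k=3).fit(X, Y)
print(sdp_system.predict((100, 18)))  # a large, complex new function
print(sdp_system.predict((50, 8)))    # a mid-sized function, less clear-cut
```

The second prediction comes back with a lower confidence than the first because its three nearest neighbours disagree, which is the kind of confidence measure mentioned above.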

Python Example of Step 2

In this example, I’ll run through how I used Python (with Scikit-Learn and Pandas) to build an SDP system using the supervised classification algorithm Random Forest. First I’ll read in the data from CM1.csv (the awesome dataset from above). Then I’ll split the data into the software measures, and whether or not each module was buggy. In CM1, the Defective column is True if the module contains bugs, and False otherwise.

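import pandas as pd

# Load CM1: one row per module, one column per software measure
data = pd.read_csv('CM1.csv')
# Features: every column except the label
software_measures = data.drop('Defective', axis=1)
# Labels: True if the module contains bugs (.ix was removed from pandas; use .loc)
buggy = data.loc[:, 'Defective']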

We can use a random forest as the supervised classification algorithm…but first let’s take one third of the data away so that we can use it to test our SDP system later.

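# train_test_split now lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

# Hold out 33% of the modules for testing
features_train, features_test, labels_train, labels_test = train_test_split(
    software_measures, buggy, test_size=0.33, random_state=42)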

Building the SDP system using a random forest classifier is a simple three-liner!

from sklearn.ensemble import RandomForestClassifier

sdp_system = RandomForestClassifier()
sdp_system.fit(features_train, labels_train.tolist())

Now that I’ve built an SDP system, I’m interested in how accurate its predictions are. We can check that easily using the testing data we made earlier!

from sklearn.metrics import accuracy_score

# Compare the true labels of the held-out modules with the predictions
print(accuracy_score(labels_test.tolist(), sdp_system.predict(features_test)))


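I found that this SDP system gets 85.09% accuracy, sweet! (Your exact number may differ a little, since the random forest is built with some randomness and no seed was fixed for it.)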

85% Accuracy is Great!…but…

There are a few caveats to this… SDP commonly suffers from what’s referred to as the class imbalance problem. Simply put, datasets contain far more examples of bug-free modules than of buggy modules. This can render accuracy meaningless… I’ll explain why.

My PhD supervisor told me a story he heard from a data scientist who used machine learning to predict whether or not a patient has breast cancer. This data scientist could get 99% accuracy on his models! Sounds great, right? No! The classifier could potentially be useless. Consider that 100 patients are used to test the system and that, in reality, 1 out of the 100 has breast cancer. When the cancer prediction system was generated, it may have been optimised for accuracy, in which case it thinks “Easy! Hardly anyone has cancer, so I’ll just predict everyone as cancer-free! Then I’ll get a really high accuracy.” It would indeed score 99% accuracy while missing the only patient who actually has cancer.

This is obviously a terrible mindset, so class imbalance techniques are often applied in situations like these. In SDP, we typically use AUC (the Area Under the Receiver Operating Characteristic curve) to measure the performance of SDP systems.
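The 100-patient story can be worked through in a few lines of plain Python. This is a sketch under stated assumptions: the helper functions accuracy, recall and auc are written here for illustration rather than taken from a library, and AUC is computed using its interpretation as the chance that a randomly chosen positive case is scored higher than a randomly chosen negative one, with ties counting as half.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    return correct / len(labels)


def recall(labels, predictions):
    """Fraction of truly positive cases that were predicted positive."""
    positives = [pred for label, pred in zip(labels, predictions) if label == 1]
    return sum(positives) / len(positives)


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# 100 patients, exactly one of whom has cancer (label 1).
labels = [1] + [0] * 99

# The lazy classifier: everyone is predicted cancer-free, with the same score.
lazy_predictions = [0] * 100
lazy_scores = [0.0] * 100

print(accuracy(labels, lazy_predictions))  # 0.99
print(recall(labels, lazy_predictions))    # 0.0
print(auc(labels, lazy_scores))            # 0.5
```

The lazy classifier looks brilliant on accuracy, finds none of the actual cancer cases, and scores an AUC of 0.5, which is no better than guessing. That gap is why AUC is a more honest measure on imbalanced data.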

SDP and class imbalance have been a huge topic of my data science work for the last 4 years. I hope you’ve learnt something from this post and enjoyed that biscuit in the meantime. 😉