Demystifying programming

If you are not familiar with programming, or if your only experience is using your colleagues' scripts without really digging into them, then code can look very intimidating. For instance, here's a random snapshot of some of my own ugly MATLAB code (for EEG data preprocessing); this looks confusing and intimidating even to me, and I wrote the darn thing!

[Figure: screenshot of the author's MATLAB code for EEG data preprocessing]

The truth, however, is that coding is easy. It might look confusing because you're not used to looking at it. (Or because, in this case, the code you're looking at is not well-written.) But once you get used to what coding actually is, it's very straightforward. The purpose of this tutorial is to give you a quick introduction to what programming is. It's not specifically meant to teach any particular programming language, as the details of each language are different; rather, the purpose is to show you how to think like a programmer. Once you know that, learning any new programming language is easy.

What is programming?

Programming is just problem-solving. You have a small set of tools (the things that a programming language can do) and a task to accomplish (maybe it's finding all the words in a corpus that have a certain property, maybe it's filtering a digital signal, maybe it's taking a folder full of data from an experiment and doing statistics on them, or maybe it's making a game that plays Rock-Paper-Scissors against you; the sky's the limit). Programming is just figuring out how to use the tools to accomplish the task. The sequence of steps you will take to accomplish the task is, in programming jargon, an algorithm (the "AL" part of HAL 9000). An algorithm is just a plan to solve a problem. You use algorithms all the time in your daily life without realizing it. Maybe you have a sink full of dirty dishes and you need to clean them, so you mentally make a plan to take the tools you have available (a sponge, soap, a tap), wash the first dish, rinse the first dish, place it in the rack, and then move on to the next dish and repeat... or maybe you wash all the dishes, then rinse all the dishes, then place them on a rack. Those are two different algorithms you could use to approach the same problem.

Coding is just thinking of an algorithm to solve the problem you want to solve, and then translating that algorithm from concepts in your head (e.g., "First I need to load all the files, then I need to strip off the headers of each file, then I need to combine them into one sheet...") into language that the computer understands. Translating your ideas into the computer's language is the implementation part of programming, and that varies from one programming language to another. For instance, getting the mean of some list of numbers is done with mean() in languages like R and MATLAB, but with =AVERAGE() in an Excel formula. Don't worry about these details; the crucial part of programming is the problem-solving algorithm part. Once you're good at that, you can learn any implementation (i.e., any programming language) very easily.

In this tutorial we'll walk through what a simple algorithm looks like. This will show you what it means to think like a programmer. Once you know how to think up an algorithm and know what tools are available in programming, you'll have a better idea how to understand other people's code and you'll know how to write your own code to solve just about any task.

An algorithm to average numbers

Let's say you have a list of numbers and you want to get their average. Most programming languages have built-in functions to do this (we discussed a few of them above). But this is also a good starting point for an example of a programming task and an algorithm to solve it. Here we won't use any particular language to solve it; instead, we'll do it in pseudocode, which is a fancy way of saying that instead of writing actual code we will just describe, in plain English, what we want the computer to do each step of the way.

Let's start out with our list: LIST = [90 112 103 88 103 118 93 99]. Now, we know that to get the mean, we need to divide the sum of the list by the number of elements in the list. So we already have a rough outline of the algorithm we need to use:
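In outline form (the step numbers match the way the later sections refer to the first, second, and third steps of the algorithm; the exact wording of each step is only illustrative):

```
1. Make LIST
2. Get the SUM of the numbers in LIST
3. Get N, the number of numbers in LIST
4. Divide SUM by N to get the mean
```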

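How do we figure out the sum of the numbers in LIST, and the number of numbers in LIST? This is where the tools of programming come in. Across just about any programming language, the same three basic tools are available: variables (names that store a value so you can use it later), loops (which repeat the same steps, for example once for each item in a list), and conditionals (which carry out a step only if some condition is met). Each tool is simple, but once you use them together you can do just about anything.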

Now let's think about how we can use these three basic tools to accomplish the task of getting a mean of LIST.

First we want to get the sum, and store it in a variable so we can use it later (when we divide the sum by the N to get the mean). How can we do that? We can start with the value 0, and loop through each item in LIST, adding that item to the value. Here's how that will look in our pseudocode algorithm:
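A sketch of the pseudocode at this stage, using the names LIST, SUM, and NUMBER from the explanation (the exact wording of each step is illustrative):

```
1. LIST = [90 112 103 88 103 118 93 99]
2. SUM = 0
   For each NUMBER in LIST:
       SUM = SUM + NUMBER
3. Get N, the number of numbers in LIST
4. Divide SUM by N to get the mean
```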

Notice the new part of the code above. We first created a variable, SUM, which starts at 0. We then used a loop to go through each item in our LIST, and at each item we took it and added it back to the SUM (specifically, we set the new value of SUM to be NUMBER plus the old value of SUM). For instance, we start with SUM set to zero, and we grab the first number in our list, which is 90; we set SUM to SUM+90, so now SUM is 90 instead of zero. Then we grab the next number in our list, which is 112, and set SUM to SUM+112, which is 90+112, which is 202. We continue this until we've done it for every number in the list; at the end of this loop, the value of SUM will be the sum of all the numbers in LIST, and we have it stored in our variable SUM so we can use it again later. After that, any time we refer to "SUM", the program will plug in the actual number in the place of "SUM". (Don't worry about the details of how the loop itself is actually written; this will vary from one programming language to another, but hopefully you see the logic behind why we used a loop.)

Now how about finding the number of items in our LIST? We can use a very similar strategy. I'll put an updated pseudocode algorithm here; before reading on, see if you can understand the algorithm and see how it works, then continue on to read my explanation of it and see if you were right.
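One way the updated pseudocode might read, with step 3 now filled in (again a sketch; only the names LIST, SUM, NUMBER, and N come from the text):

```
1. LIST = [90 112 103 88 103 118 93 99]
2. SUM = 0
   For each NUMBER in LIST:
       SUM = SUM + NUMBER
3. N = 0
   For each NUMBER in LIST:
       N = N + 1
4. Divide SUM by N to get the mean
```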

To get the number of items in the list, we used another loop. It's almost the same as the first one. We start with a variable N, which begins at 0 but which we will use to keep track of the number of items in LIST. Then we loop through each item in the list. Only this time, instead of adding the value of each item to N, we just add 1 to the current value of N; thus, by the end of the loop, N will represent the number of items that there were in the list.

Once we've gotten SUM and N, all we need to do is divide and, ta-da! We have the mean.
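The complete pseudocode, as an illustrative sketch in which the name MEAN is an assumption added here for the final result, then reads:

```
1. LIST = [90 112 103 88 103 118 93 99]
2. SUM = 0
   For each NUMBER in LIST:
       SUM = SUM + NUMBER
3. N = 0
   For each NUMBER in LIST:
       N = N + 1
4. MEAN = SUM / N
```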

This example was very simple. But using this same logic, you can program almost anything. Of course, most programming languages don't require you to make loops to figure out means and sums on your own; these capabilities are already built in. For instance, R has functions sum() and length(), which would have automatically given you the SUM and N values without needing to make loops; it also has a function mean(). Deep down inside those functions, though, is more code that does, somewhere at the low level, execute a very simple algorithm something like this. Functions like mean() and sum() are simply variables that, instead of holding some number, hold some computer code, and they can be called upon to run that computer code and accomplish that task. So, for instance, if you are writing a program that needs to find means of numbers 5 different times, then instead of writing out a long algorithm each time like we did above, you can simply call on the mean() function repeatedly, saving yourself a lot of time and avoiding the risk of making a mistake one of the times you write out the same bit of code.
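To make the idea of a function concrete, here is a minimal sketch in R (the language used later in this tutorial). The names my_mean and scores are made up for this illustration and are not built in; the body is just the sum-and-count logic described above, packaged so that it can be reused.

```r
# Hypothetical helper for illustration; R's real built-in is mean().
my_mean <- function( numbers ){
	total <- 0
	count <- 0
	for( number in numbers ){
		total <- total + number   # running sum
		count <- count + 1        # running count of items
	}
	total / count   # the last expression is the function's return value
}

# Call the same code as many times as needed, without rewriting the loop
scores <- c( 90, 112, 103, 88, 103, 118, 93, 99 )
my_mean( scores )          # 100.75
my_mean( c( 2, 4, 6 ) )    # 4
```

This version uses a single loop to track both the sum and the count, which is one of several equally valid ways to write it.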

Functions (also called subroutines, which philosophically speaking is a more accurate name for them) are another one of the tools you have available when making an algorithm to solve a problem (just keep in mind that functions are, deep down inside, just more code which uses the basic tools we discussed above: variables, loops, and conditionals). If you look back up at the messy MATLAB example I illustrated above, you will see several functions: the built-in function dlmread() is used twice to read some external data into the program and store it in a variable, and at the very bottom you can even see the MATLAB version of the mean() function being used.

To sum up, programming is just thinking up an algorithm that uses the tools available to you to solve some task. Those tools include variables, loops, conditionals, and functions. Once you're used to thinking in those terms, you can pick up any programming language. Let's look at how to implement this simple algorithm in a specific programming language, R.

Implementing a programming algorithm in R

You're going to need to install the free programming language R to run this code.

Open R and let's start writing this code. First of all, create your list of numbers:

my_list <- c( 90, 112, 103, 88, 103, 118, 93, 99 )

The "<-" here is an assignment operator: it sets (or assigns) the value of the variable my_list to equal the list of numbers to its right. We used the built-in function c() (short for "combine", though it is often read as "concatenate") to create a list of numbers, and we fed that function (inside its parentheses) a comma-separated list of the numbers that the variable will be storing. (Strictly speaking, R calls what c() creates here a vector, and uses the word "list" for a different kind of object, but we'll keep saying "list" to match the algorithm above.) Note that all of this is specific to R; the details of how to create a list, and how to assign it to a variable, might differ in other programming languages, but crucially, no matter how specifically we code it, it's all accomplishing the first step of our algorithm from above, which is to make that list of numbers. If you want, you can type "my_list" in R and hit ENTER to see that it now is holding the list of numbers you gave it.

Now let's make a loop to get the sum, the second step of our algorithm:

my_sum <- 0
for( number in my_list ){
	my_sum <- my_sum + number
}

Here we first created a variable my_sum, which begins as 0 but by the end of the loop will hold the sum of all the values in my_list. Next we have a for loop. This loop works by iterating through some list (in this case, my_list); each time through, it sets the value of a temporary variable, number, and then executes all the code that is within the curly brackets {}. Specifically, the first time through the loop, it sets the value of number to be the very first number in my_list, and then it adds that number to the value of my_sum. Then the first iteration through the loop is finished, and the loop begins again, only this time it sets the value of number to the second number in my_list and adds that to the value of my_sum. It repeats this process until it has done it for every number in my_list, at which point the loop has finished. You can type in my_sum in R to see what the sum ended up being (and since R already has a sum() built-in function, you can check if this is correct: type sum( my_list ) and you should see the same value).

Now let's make a loop to get the N, the third step of our algorithm. This will work in almost the same way as the previous loop.

my_N <- 0
for( number in my_list ){
	my_N <- my_N + 1
}

By now you should be able to understand the loop above and see how it works to figure out the number of elements in my_list. Once again, R already has a built-in function that does the same thing, so we can check our work. Compare the values of my_N and of length( my_list ) in R to see if our loop figured out the correct value.

Now we have all the values we need to figure out the mean, and both are stored in variables. All we need to do is divide them:

my_sum / my_N

This division should give you the same value as the built-in mean function. Check it by entering mean( my_list ).
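Putting the steps together, the whole algorithm could be run as one script. This is only a sketch: the variable name my_mean_value and the comparison lines at the end are additions for illustration, checking the loop results against R's built-ins.

```r
# Step 1: make the list
my_list <- c( 90, 112, 103, 88, 103, 118, 93, 99 )

# Step 2: loop through the list to get the sum
my_sum <- 0
for( number in my_list ){
	my_sum <- my_sum + number
}

# Step 3: loop through the list to get the N
my_N <- 0
for( number in my_list ){
	my_N <- my_N + 1
}

# Step 4: divide the sum by the N
my_mean_value <- my_sum / my_N

my_sum          # 806
my_N            # 8
my_mean_value   # 100.75

# Compare each result with the corresponding built-in function
my_sum == sum( my_list )                   # TRUE
my_N == length( my_list )                  # TRUE
all.equal( my_mean_value, mean( my_list ) )  # TRUE
```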

Wrap-up

We have reviewed the basics of what a program actually does, seen how to think like a programmer, and gone through how to translate an algorithm into actual computer code. These are basically all the skills you need to write your own code. Of course, real-life programs will look different from this; we were just getting the mean of some numbers, which is such a common operation that we usually don't need to write code like this to do it, as most programming languages already have built-in functions that do common things like this. But the concepts we observed here are used in any other computer program as well: the program is simply a series of steps that the computer will take to accomplish something.

In many programs, each of these steps might be yet another function. We saw several examples in the big ugly MATLAB program I illustrated at the beginning of the tutorial—almost every step of that program was using some function! So then you may wonder, how do people writing computer code know what all these functions do? When someone is approaching a task and figuring out their algorithm and, for example, in step 3 they need to sort their list of numbers in descending order, how do they know which built-in function to use to do that? And when someone is reading another program and trying to understand it, how do they know what all these crazy functions do? I'll let you in on a secret: programmers aren't magic. We use Google. Sure, after writing R code for several years I remember that the function for getting a mean is called mean() and not average(), and I remember that the function for finding which number in a list is the one I want is which() and not find() (but vice versa in MATLAB!), but I don't have the entire R programming language memorized, and in fact no one in the world does. Rather, any time I need to do a thing and I don't remember how, I just Google it, e.g., "read xlsx file in R" or "r find unique numbers in array", etc. In fact, that is how most people learn to code: not by sitting down with a book and learning the whole language at once, but by Googling one question at a time, as they need it, until eventually they become pretty comfortable with coding in that language. So don't feel intimidated by code; you can figure out how to do whatever you need by first coming up with an algorithm, then using Google one step at a time as you convert that algorithm into real code. Indeed, that's how most code gets written.
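Searches like the ones mentioned above typically lead to short, one-line answers. As a sketch, here are a few base R functions applied to the list from the R section (the variable name my_list is reused from there):

```r
my_list <- c( 90, 112, 103, 88, 103, 118, 93, 99 )

# Sort the list in descending order
sort( my_list, decreasing = TRUE )   # 118 112 103 103 99 93 90 88

# Find the positions of the numbers you want
which( my_list == 103 )              # 3 5
which( my_list > 100 )               # 2 3 5 6

# Find the unique numbers in the list
unique( my_list )                    # 90 112 103 88 118 93 99
```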

This tutorial was, of course, only scratching the surface. In addition to the basic programming tools we discussed above (variables, conditionals, and loops, as well as functions), hardcore programming languages also include some other things, but you probably won't need to worry about them. For example, all programming languages have input/output streams, which control getting data into the program through either the keyboard or a file, and sending data out of the program either to be shown on the screen or to be saved in a file. In most high-level programming languages used for applied science (such as R, MATLAB, and Python), these are handled just like other functions; for example, in the MATLAB code I showed at the beginning of this tutorial, data were read into the program with the dlmread() function, and in the R code we ran just now, data were shown on the screen pretty much automatically. If you ever use a more hardcore programming language like C++, however, handling input/output streams will be a separate concept to learn. Another tool we haven't discussed is pointers/references, which are a thing in languages like C++ and Perl, but which you probably won't need to think about at all for writing scripts in languages like R, MATLAB, and Python. And another tool is operators: we are all familiar with things like +, -, *, and /, but many programming languages also allow you to define your own operators or redefine existing ones; this usually isn't necessary for writing simple programs, though, and more often is just something experts do to save themselves time. Anyway, that is all to say, in this tutorial we have not covered all the main tools of programming (so don't be surprised if someday you come across something new), but we have covered the core ones that are likely to show up in almost any code you read or write.
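As one illustration of a user-defined operator, R lets you create a new operator by giving a function a name wrapped in percent signs. The operator name %avg% below is invented for this sketch; it is not part of R.

```r
# Define a hypothetical operator that averages the two numbers around it
`%avg%` <- function( a, b ){
	( a + b ) / 2
}

90 %avg% 112   # 101
```

Under the hood this is just another function, which is why it is rarely needed in simple scripts.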

What's next? Get out and start programming! There are lots of books and online tutorials available for whatever programming language you need to use, and these can be helpful. In particular, even though you know the basics now, for any given programming language you will need to check the details of its implementation: for example, whether "for" loops are marked off with curly brackets, indentation, or a closing keyword like "end", how the language lets you create your own functions, and how the language handles indexing (i.e., specifying that you want to grab a certain number out of a list rather than grabbing the whole list). But sometimes the best way is just to start out with some code that has already been written and modify it to do what you need. Of course, because every programming language has its own details of implementation, don't feel bad if you don't understand everything immediately. For example, while we talked about the basics of programming here, we didn't talk about anything that would explain to you what's going on with all the periods in my MATLAB code (stuff like averages.(condition).(subject)), or all the stuff in [] square brackets, or all the ~ squigglies. Those are bits of syntax that are specific to MATLAB code, but as you need each one you can learn how to use it via Googling. For instance, [] is just how you make a list of numbers in MATLAB; it's the equivalent of R's c() function that we used to create our list earlier. And the ~ is MATLAB's version of "not", so it can be used in conditionals to make sure something happens if a given condition is NOT met instead of if a given condition IS met. All of these details will vary from one programming language to another and even from one person's code to another's (in most languages there is more than one way to do the same thing, i.e., more than one algorithm to accomplish the same task), but now you have the foundational knowledge to figure out how to use them.
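As an illustration of a few of the details mentioned above (indexing, creating your own function, and "not" in a conditional), here is a rough sketch in R. The function name is_above and the cutoff of 100 are made up for this example.

```r
my_list <- c( 90, 112, 103, 88, 103, 118, 93, 99 )

# Indexing: square brackets grab items by position (R counts from 1)
my_list[ 1 ]       # 90
my_list[ 2:3 ]     # 112 103

# Creating your own function (hypothetical name, for illustration only)
is_above <- function( number, cutoff ){
	number > cutoff
}

# A conditional using "not": ! is R's counterpart of MATLAB's ~
if( !is_above( my_list[ 1 ], 100 ) ){
	message( "The first number is not above 100" )
}
```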


by Stephen Politzer-Ahles. Last modified on 2016-04-08.