Thursday 26 August 2010

Data manipulation and statistics with pyxplot

One of the handiest features of pyxplot is the ease at which common tasks can be grouped into procedures, and then can be called as functions. For a start, we will define a function that calculates the average of a column in a file. The general form of a subroutine is
subroutine my_function(argument list)
{
      do whatever we need
      return something, if necessary
}
Now, we want to calculate the average, therefore, we have to access the data. That is really easily done, simply by calling
foreach datum foo in 'bar' using baz

which, instead of plotting, takes the numbers in column 'baz' of file 'bar' and puts them into the variable 'foo'. Once the variable is assigned, we can do whatever we want with it. Our average-calculating routine would then look like this
subroutine ave(filename, column)
{
        sum_x = 0
        N = 0
        foreach datum x in filename using $(column) 
        {
                N = N + 1
                sum_x = sum_x + x
        }
        if(N > 0) 
        {
                return sum_x / N
        } else 
        {
                return NaN
        }
}
Firs, we initialise two variables, sum_x, and N, then we read the values in 'filename', column 'column' into x. We increment N by one, just to know how many records there are in the file, and add x to sum_x. At the end, depending on the number of records, we either return the average, or a NaN, to indicate that nothing was in the file.

Having defined the function, we can call it from the command line as
x = average('data1.dat', 2)
print x
which will print out the average of the second column in 'data1.dat'. A function which calculates the standard deviation could be defined as follows.

subroutine std(filename, column)
{
        sum_x = 0
        sum_sq = 0
        N = 0
        foreach datum x in filename using $(column) 
        {
                N = N + 1
                sum_x = sum_x + x
                sum_sq = sum_sq + x * x
        }
        if(N > 0) 
        {
                sum_x = sum_x / N
                return sqrt(sum_sq / N - sum_x * sum_x)
        } else 
        {
                return NaN
        }
}
Now, this works all right for cases, where we want to calculate the statistics of some column in a data file. But what, if before doing that, we also want to modify the data. For instance, we might need the average of not the data per se, but the average of the sine of the data points. What should we do then?

Well, the trivial solution is to hard-wire the expression in our the average routine that we defined above. However, this method does not lend itself to very flexible use, does it? So, is there a way to supply the data file, the column that we want to process, and the expression that we want to use? Sure, there is, in fact, it is rather simple. All we have to do is to add a string to our function, and evaluate the string to yield a function that we can apply to the data points. Thus, our new version of the average subroutine would read as follows
subroutine ave_e(filename, column, expr)
{
        str_temp = "f(x) = %s"%(expr)
        exec str_temp
        sum_x = 0
        N = 0
 
        foreach datum x in filename using $(column)
        {
                N = N + 1
                sum_x = sum_x + f(x)
        }
        if(N > 0)
        {
                return sum_x / N
        } else
        {
                return NaN
        }
}
The first two arguments are as above, and the third one is supposed to be a string variable. In the subroutine, first we create a new string, which is nothing but the function definition. Note that the sprintf function as such does not exist in pyxplot, but we have
str_temp = "f(x) = %s"%(expr)
instead. Except for the slight difference in the form, it behaves in the same way as sprintf in gnuplot. Also note that string concatenation does not work as it does in gnuplot, namely, with the dots between the string variables. Instead, if one has to append a string to another one, one can do
a = "foo"
b = "bar"
c = "%s"%(a,b)
Now, back to our average subroutine! At this point, we have a string that contains the definition of our new function. In gnuplot, we would just use eval to turn this into an actual definition. In pyxplot, we execute exec instead. But the results is the same, our string has been promoted to the rank of a full-fledged function, which we can call in the common way. In the body of the foreach loop, we add f(x) instead of x, and we are done. So, if we wanted the calculate the average of the sin(x*tan(x)+cos(x)) of the 2nd data column in file 'foo', we would call our subroutine as
print ave_e('foo', 2, "sin(x*tan(x)+cos(x)")
And finally, what happens, when we have to calculate the average of some quantity that depends on two or more columns? If this is the case, we can print the manipulated data to a file, and apply the average function to the new data set. In gnuplot, we would simply plot the data after having set the tabular flag. In pyxplot, we use the tabulate command to indicate that we do not want to plot, but send the output to a data file. Thus, if we need the average of sin($1)*cos($2), we can do the following
set output 'bar.dat'
tabulate 'foo.dat' using (sin($1)*cos($2))
ave('foo.dat', 1)
I hope that today I could convince you that data can very easily be accessed and manipulated in pyxplot. Next time I will discuss "real" plotting.
So long,
Zoltán

2 comments:

  1. Hi I have been following the gnuplot tricks(awesome pages I have to admit) but now you switched to pyxplot. Well, then let me rise the question why not R?
    Cheers,

    Schwan

    ReplyDelete
  2. Greetings Schwan,

    The answer is, I don't quite know. I think, R is too big for my taste as a plotting application, in fact, it is much more, than that. But then, I started with gnuplot some 10 years ago, and I grew sort of fond of it, until I looked into the code recently.
    Cheers,
    Zoltán

    ReplyDelete