Thursday 26 August 2010

A new beginning

As I announced on my older blog, gnuplot-trick.blogspot.com, I will run a parallel one, dealing with pyxplot-related matters. Pyxplot is quite similar to gnuplot in its command structure, though there are some subtle differences. However, it has some features that gnuplot hasn't, and which make life rather easy, when it comes to data manipulation. If you recall from that old blog of mine, data manipulation was not an easy task in gnuplot, and in some sense, it was almost deemed illegal. The argument of the developers was that gnuplot is a plotting utility, and not a
data processing unit, so why should it manipulate data at all? I have tried to argue on a number of occasions that plotting and data processing are part of the very same task, and they cannot be disentangled, but I had the feeling that my pleas fell on deaf ears. Anyway, it seems that some other people have had the same experience, and decided to do something about it. Pyxplot is a well-written plotting utility with a number of conveniences that you will certainly appreciate. On this blog, I will visit a number of methods on producing informative graphs with pyxplot. I hope you will enjoy it!
Cheers,
Zoltán

Data manipulation and statistics with pyxplot

One of the handiest features of pyxplot is the ease at which common tasks can be grouped into procedures, and then can be called as functions. For a start, we will define a function that calculates the average of a column in a file. The general form of a subroutine is
subroutine my_function(argument list)
{
      do whatever we need
      return something, if necessary
}
Now, we want to calculate the average, therefore, we have to access the data. That is really easily done, simply by calling
foreach datum foo in 'bar' using baz

which, instead of plotting, takes the numbers in column 'baz' of file 'bar' and puts them into the variable 'foo'. Once the variable is assigned, we can do whatever we want with it. Our average-calculating routine would then look like this
subroutine ave(filename, column)
{
        sum_x = 0
        N = 0
        foreach datum x in filename using $(column) 
        {
                N = N + 1
                sum_x = sum_x + x
        }
        if(N > 0) 
        {
                return sum_x / N
        } else 
        {
                return NaN
        }
}
Firs, we initialise two variables, sum_x, and N, then we read the values in 'filename', column 'column' into x. We increment N by one, just to know how many records there are in the file, and add x to sum_x. At the end, depending on the number of records, we either return the average, or a NaN, to indicate that nothing was in the file.

Having defined the function, we can call it from the command line as
x = average('data1.dat', 2)
print x
which will print out the average of the second column in 'data1.dat'. A function which calculates the standard deviation could be defined as follows.

subroutine std(filename, column)
{
        sum_x = 0
        sum_sq = 0
        N = 0
        foreach datum x in filename using $(column) 
        {
                N = N + 1
                sum_x = sum_x + x
                sum_sq = sum_sq + x * x
        }
        if(N > 0) 
        {
                sum_x = sum_x / N
                return sqrt(sum_sq / N - sum_x * sum_x)
        } else 
        {
                return NaN
        }
}
Now, this works all right for cases, where we want to calculate the statistics of some column in a data file. But what, if before doing that, we also want to modify the data. For instance, we might need the average of not the data per se, but the average of the sine of the data points. What should we do then?

Well, the trivial solution is to hard-wire the expression in our the average routine that we defined above. However, this method does not lend itself to very flexible use, does it? So, is there a way to supply the data file, the column that we want to process, and the expression that we want to use? Sure, there is, in fact, it is rather simple. All we have to do is to add a string to our function, and evaluate the string to yield a function that we can apply to the data points. Thus, our new version of the average subroutine would read as follows
subroutine ave_e(filename, column, expr)
{
        str_temp = "f(x) = %s"%(expr)
        exec str_temp
        sum_x = 0
        N = 0
 
        foreach datum x in filename using $(column)
        {
                N = N + 1
                sum_x = sum_x + f(x)
        }
        if(N > 0)
        {
                return sum_x / N
        } else
        {
                return NaN
        }
}
The first two arguments are as above, and the third one is supposed to be a string variable. In the subroutine, first we create a new string, which is nothing but the function definition. Note that the sprintf function as such does not exist in pyxplot, but we have
str_temp = "f(x) = %s"%(expr)
instead. Except for the slight difference in the form, it behaves in the same way as sprintf in gnuplot. Also note that string concatenation does not work as it does in gnuplot, namely, with the dots between the string variables. Instead, if one has to append a string to another one, one can do
a = "foo"
b = "bar"
c = "%s"%(a,b)
Now, back to our average subroutine! At this point, we have a string that contains the definition of our new function. In gnuplot, we would just use eval to turn this into an actual definition. In pyxplot, we execute exec instead. But the results is the same, our string has been promoted to the rank of a full-fledged function, which we can call in the common way. In the body of the foreach loop, we add f(x) instead of x, and we are done. So, if we wanted the calculate the average of the sin(x*tan(x)+cos(x)) of the 2nd data column in file 'foo', we would call our subroutine as
print ave_e('foo', 2, "sin(x*tan(x)+cos(x)")
And finally, what happens, when we have to calculate the average of some quantity that depends on two or more columns? If this is the case, we can print the manipulated data to a file, and apply the average function to the new data set. In gnuplot, we would simply plot the data after having set the tabular flag. In pyxplot, we use the tabulate command to indicate that we do not want to plot, but send the output to a data file. Thus, if we need the average of sin($1)*cos($2), we can do the following
set output 'bar.dat'
tabulate 'foo.dat' using (sin($1)*cos($2))
ave('foo.dat', 1)
I hope that today I could convince you that data can very easily be accessed and manipulated in pyxplot. Next time I will discuss "real" plotting.
So long,
Zoltán