Dealing with Large Data Sets

Meteorological data sets are often so large that we run up against the limits of our computing power. Space and speed become an issue. The basic trade-off is between memory (chips), which is very fast but in limited supply, and storage (magnetic media) which is usually more abundant but slow. Below are some techniques for data management: if you plan ahead you may avoid restructuring your program.

Sample Task: two equal sized arrays are stored in two files. You need to perform simple array addition and copy the result to a third file.

Sample Data :

• Arrays A and B have dimensions of 1000x1000 elements.
• They are stored in File_A and File_B respectively.
Programming for Speed - you will want to get the data out of the files (storage) and into the arrays (memory) as soon as possible. Most modern languages allow you to extract the data in one step IF you know the size of the file ahead of time (Fortran90 will allow this, but not Fortran77). Many languages used for meteorology also allow algebraic operations on entire arrays (see previous Fortran comment). Below is a sample program written in pseudo code.
! this program will read arrays A, B; add them to get array C
! all data is put into memory as soon as possible

integer xdim=1000,ydim=1000
real A(xdim,ydim), B(xdim,ydim), C(xdim,ydim)

open(unit=1) File_A
open(unit=2) File_B

C = A+B

open(unit=3) File_C

write(3) C

end
The algebraic addition of arrays is equivalent to the loop operation:
for y = 1, ydim
for x = 1, xdim

C(x,y) = A(x,y) + B(x,y)

write(3) C(x,y)

endfor
endfor
Note that you have three entire arrays in memory as you calculate the value of C. This could be enough to crash your program.

Programming to Conserve Memory - you will want to use a loop to read and add one element at a time. We are assuming the elements in the file are read like the words in a book: all the elements in the first row, then the second row, etc (this may not be the case for many languages).

Program memory_save
! Each element of the arrays A and B are read from the file and added
! reading and writing one element at a time saves memory

integer xdim=1000,ydim=1000
real A, B, C

open(unit=1) File_A
open(unit=2) File_B
open(unit=3) File_C

for y = 1, ydim
for x = 1, xdim

C = A + B

write(3) C

endfor
endfor

end
There are several things to note here:
• only one element of each array is in memory at any given time.
• The read commands are used for each element of the array, causing multiple magnetic head operations that slow down the code considerably.
• Reading x on the inside of the loop is meant to mimic the physical layout of data on the disk, speeding up the code. This may differ from system to system or language to language. It is a good idea to time such a loop in both arrangements...the 'slow' arrangement requires more physical motion of the head, which could eventually crash your drive.
A compromise between speed and memory may be to read an entire row at a time:

integer xdim=1000,ydim=1000
real A_row(xdim), B_row(xdim), C_row(xdim)

open(1) File_A
open(2) File_B
open(3) File_C

for y = 1, ydim

C_row = A_row + B_row

write(3) C_row

endfor

end

Avoiding Redundant Variables in Memory

Once variables are defined, they remain in memory until overwritten or otherwise destroyed. If the variables in question are large arrays, this can present a problem. There are several scenarios in which this can happen. Let's assume A is a large array variable that you want to multiply by a constant c. In the expression

A_scaled = c*A

you have created a second variable A_scaled. If you don't need A anymore, you are better off writing

A = c*A

so that you have overwritten the old value without taking up extra memory. For a brief amount of time you still have the value of A on both sides of this equation. Some languages allow you to notify the compiler that both memory spaces are not needed: IDL for example uses the "temporary" function:

A = c*temporary(A)

Warning - in most languages, any variable that is passed into a subroutine is passed back out to the main program in its modified version. Make sure this is harmless before overwriting variables.

A more pervasive memory problem occurs during the use of subroutines, in which the variable list is copied from the main program memory into the subroutine memory. Two copies of each variable therefore exist simultaneously until the subroutine is exited. There are several ways to keep programs from duplicating this memory.

Global Variables or Common Blocks - these are two different ways of telling the program to maintain a single copy of a variable: by declaring them as global variables, or by making common blocks of variables. Most languages use one or the other but not both. Here are abstract examples of both, in which we are scaling the array A by the constant c. (extraneous things like variable definitions and comments are not shown):

***** Global Program *****

float A(500,1000), c
global A

A_scaled = MULTIPLY(c)

end

* * * * * * * * * * * * *
function MULTIPLY(c)

A_scaled = c*A     ! note that A is automatically available!

return A_scaled

end

------------------------------------------------------

***** Common Program *****

float A(500,1000),B(500,1000), c
common 'arraylist' A, B

D = MULTIPLY(c)

end

* * * * * * * * * * * *
function MULTIPLY(c)

common 'arraylist' A, B

D = A*B*c

return D

end
A drawback of using global variables and common blocks is that the subroutines and functions are no longer completely self contained and independant, so it takes more work to graft them into another program. Changes in one block of code may also effect others. For this reason this technique should only be used when needed.

Pointers - a "pointer" is simply the computer address of a variable. Rather than passing around the entire value of a large array, you can pass the short address that "points" to the location of the array. The address itself becomes a separate variable that references the original variable. In order to use pointers, you need a way to put the address into a pointer variable (all languages do this a little different) and a way to expose (or 'dereference') the actual value of the variable behind the address. To avoid confusion, it is useful to give the pointer variable a name that both reflects the original variable and makes it evident that you are using a pointer. For example:

A = [1,2,3,4,5] is an integer array.

in IDL we could use the ptr_new function to create a pointer variable 'Aptr' that holds the address of A:

Aptr = ptr_new(A)

We pass this address to the subroutine, then get the value of A by the '*' function which is the same in nearly every language:

A = *Aptr
print A
[1 2 3 4 5]
Once again, instead of passing (and therefore copying) the entire value of A into a subroutine, we pass the much shorter Aptr, which is the address of A. When needed, the value of A can be exposed, but at any one time there is only one value of A residing at one physical address.

Efficient File Storage

As has been mentioned, the best way to store a large file is in binary rather than text format. If you know the variable type of the data, the language will know the length of each data field. Since binary files are stored as one long stream, you will also need to know the dimensions of the array. For example, if you have the 2x3 array

A =  2 3 4
3 2 1
The file might be stored as a stream 2 3 4 3 2 1, but in binary, not ASCII, so if we shorten each field to 3 bits (instead of the normal 8) we'd have 010011100011010001. To read this you'd have to know to break it up into 3 bit words (variable type length), know whether to read the digits from right to left or left to right (PC versus mainframe or 'big endian' versus 'little endian'), convert to base 10 (done automatically), and arrange into a 2x3 array (done by your program).

Floating point (or real) numbers require words at least twice as long as integers. But any set of real numbers can be converted to integers by judicious use of a scaling factor and perhaps an offset. For example, the pair of numbers [0.233, -1.497] can be converted to positive integers by the following manipulation:

[0.233, -1.497]*1000 + 1497 = [1730, 0]

To convert the integers back to the original real numbers, you'd offset the numbers by subtracting 1497, then scale them by division by 1000. This technique is used for most HDF data sets.