Mumps/MDH Toolkit
MDH: The Multi-Dimensional and Hierarchical
Database Toolkit Programmer's Guide
Version 2.1
Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
okane@cs.uni.edu
http://www.cs.uni.edu/~okane
April 18, 2005
Except as otherwise noted, this document is Copyright (c) 2004 Kevin C. O'Kane, Ph.D. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". The software is distributed under one of the following licenses (please see each source code module for specific copyright and license details applicable to that module). In general, the compiler itself is distributed under the GNU GPL license and the run-time support routines are distributed under the GNU LGPL.
Full texts of the licenses appear at the end of this document. Programs may call upon the Perl Compatible Regular Expression Library which, in some cases, is distributed with the Mumps Compiler. The separate license and copyright statement for PCRE appears in Appendix B. You should also read the license provided with the Berkeley Data Base (http://www.sleepycat.com).
Contents
Part I - Programmers Guide
Source code distributions are available at:
http://www.cs.uni.edu/~okane/source/
see also:
http://math-cs.cns.uni.edu/~okane/cgi-bin/newpres/m.compiler/compiler/index.cgi
ulimit -s unlimited
(Other options are ulimit -a and ulimit -aH to show limits).
The MDH (Multi-Dimensional and Hierarchical) Database Toolkit is a Linux-based, open source toolkit of portable software that supports very fast, flexible, multi-dimensional and hierarchical storage, retrieval and manipulation of information in data bases ranging in size up to 256 terabytes. The package is written in C and C++ and is available under the GNU GPL/LGPL licenses in source code form. The distribution kit contains demonstration implementations of network-capable, interactive text and sequence retrieval tools that function with very large genomic data bases and illustrate the toolkit's capability to manipulate massive data sets of genomic information.
The toolkit is distributed as part of the Mumps Compiler. Versions exist for Linux, Cygwin, the DJGPP port of the GCC compiler for Windows XP, and the command line version of the Microsoft Visual C++ compiler.
The toolkit is a solution to the problem of manipulating very large, character string indexed, multi-dimensional, sparse matrices. It is based on Mumps (also referred to as M), a general purpose programming language that originated in the mid 60's at the Massachusetts General Hospital. The toolkit supports access to the PostgreSQL relational data base server, the Perl Compatible Regular Expression Library, the Berkeley Data Base, and the Glade GUI builder as well as server-side development of interactive web pages.
The principal database feature in this project is the global array, which permits direct, efficient manipulation of multi-dimensional arrays of effectively unlimited size. A global array is a persistent, sparse, undeclared, multi-dimensional, string indexed, disk based data structure. A global array may appear anywhere an ordinary array reference is permitted, and data may be stored at leaf nodes as well as intermediate nodes in the data base array. The number of subscripts in an array reference is limited only by the total length of the array reference with all subscripts expanded to their string values. The toolkit includes several functions to traverse the data base and manipulate the arrays.
The toolkit makes the data base and function set available as C++ classes and also permits execution of legacy Mumps scripts. To use the toolkit, you install the MDH and Mumps distribution kit and related code. You must also use a recent version of the g++ compiler. Many older versions do not include recent changes to the C preprocessor standard and will not work. The code presented here was compiled and tested using g++ version 3.2.2.
The class, function and macro libraries primarily operate on global arrays. Global arrays are undimensioned, string indexed, disk resident data structures whose size is limited only by available disk space. They can be viewed either as multi-dimensional sparse matrices or as tree structured hierarchies. Global arrays are a C++ class and must be declared or instantiated in your C++ program as an instance of the global class. For example, to create the global named "gbl", do the following:

    #include <mumpsc/libmpscpp.h>
    global gbl("gbl");

The instantiation consists of two parts: the name of the global array object and the name of the global array on disk associated with this object. In the above example, these are both "gbl". The disk name is given as a parenthesized character string expression following the object name; it need not (but usually does) match the name of the object. The global array object is associated with the disk name when the object is created. When the object is destroyed, the disk based global array persists.
Global objects may be created through declarations as shown above or dynamically:
    global *gptr;
    gptr = new global("gbl_name");
    (*gptr)("1","2","3") = "test";

which is equivalent to:

    global g("gbl_name");
    g("1","2","3") = "test";

The #include <mumpsc/libmpscpp.h> statement brings in the necessary header files for your C++ program. These include, in addition to the header files necessary to access the toolkit, the standard system libraries:

    #include <iostream>
    #include <string>
    #include <string.h>
    #include <math.h>
    #include <stdlib.h>

These are referenced at the beginning of libmpscpp.h and you may modify them if your system uses different naming conventions.
Each global declaration makes the declared name (gbl in the example) an object, or instance, of the global class. Every global array you use must first be declared as an object of the global class.
You create a global by substituting the name of the global you want to create for "gbl" in the above. Global names can be any valid C/C++ variable name.
A global array will typically have one or more subscripts as discussed below. These will be of type mstring, string or "pointer to character" (examples: character arrays, character string constants, pointers to character strings). Subscripts of global arrays must evaluate to printable characters in the range decimal 32 (space) through decimal 126 (tilde, ~). No data types other than mstring, string or pointer to character may be used as subscripts. Numeric data types (int, short, long, float, double, etc.) may not be used as global array subscripts.
mstring is a data type (class) whose behavior is similar to the basic string data type in Mumps. Objects of mstring are stored internally as strings but may contain text, integers and floating point values. Addition, multiplication, subtraction, division, modulo, and concatenation may be performed directly on mstring objects (see details below). Many of the following examples use mstring objects.
Global arrays may be viewed either as multi-dimensional matrices or as tree structured hierarchies. As matrices, data may be stored not only at fully subscripted matrix elements but also at other levels. For example, given a three dimensional matrix mat1, you could initialize it as follows:
In this example, all the elements of a three dimensional matrix of 100 rows, 100 columns and 100 planes are initialized to zero. The function cvt() converts from int to string. The mstring usage is less efficient in that it does more conversions between int and string.
In the view expressed by the code above, the matrix is a traditional three dimensional structure with data stored at each fully indexed position or node.
Unlike other programming languages, however, there are additional nodes of the matrix which could have been initialized such as indicated by the following example:
In effect, this means that mat1 can also be a single dimensional vector, a two dimensional matrix and a three dimensional matrix simultaneously.
Furthermore, not all elements of a matrix need exist. That is, the matrix can be sparse. For example:

    #include <mumpsc/libmpscpp.h>

    global mat1("mat1");

    int main() {
        mstring i,j,k;
        // create only every tenth index in each dimension
        for (i=0; i<100; i=i+10) {
            for (j=0; j<100; j=j+10) {
                for (k=0; k<100; k=k+10) {
                    mat1(i,j,k)=0;
                }
            }
        }
        return 0;
    }

In the above, only index values 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 are used to create each of the dimensions of the array and only those elements of the matrix are created. The omitted elements do not exist.
For example, if you are running a drug protocol on a number of patients and are dosing with medications M1, M2, M3, ... on patients P1, P2, P3, ... and collecting observations on days D1, D2, D3, ... you could create a three dimensional matrix named protocol in which each plane consisted of the observations for each patient on each medication for a given day:

               D1                    D2                    D3                    D4
        M1  M2  M3  M4  M5    M1  M2  M3  M4  M5    M1  M2  M3  M4  M5    M1  M2  M3  M4  M5
    P1   .   .   .   .   .     .   .   .   .   .     .   .   .   .   .     .   X   .   .   .
    P2   .   .   .   .   .     .   .   .   .   .     .   .   .   .   .     .   .   .   .   .
    P3   .   .   .   .   .     .   .   .   .   .     .   .   .   .   .     .   .   .   .   .

    (a dot marks an empty cell)

You could refer to patient P1, medication M2 on day D4 with the reference:
protocol("P1","M2","D4")="X";
Alternatively, you can view the same data base as a tree structure with patient id at the root, followed by medication, followed by day of study:
Note that at each node in the tree, a data box may appear containing information about the node. Addressing a node is accomplished by giving its path description such as:
protocol("P2","M2","D2")
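The matrix view and the tree view address the same nodes. The following is a minimal sketch of that idea using only the standard library, not the toolkit: it assumes an in-memory std::map keyed by the full subscript path as a stand-in for a disk based global array. The names Path, SparseTree, make_protocol and descendants are hypothetical and are not part of the MDH API.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a global array: each node is
// addressed by its full path of string subscripts.
using Path = std::vector<std::string>;
using SparseTree = std::map<Path, std::string>;

// Build a tiny version of the protocol example: one observation at a
// fully subscripted node plus data stored at an intermediate node.
SparseTree make_protocol() {
    SparseTree protocol;
    protocol[Path{"P1", "M2", "D4"}] = "X";        // leaf: patient, medication, day
    protocol[Path{"P1"}] = "patient record";       // data at an interior node
    return protocol;
}

// Count the stored nodes that lie strictly below the given path prefix.
std::size_t descendants(const SparseTree& t, const Path& prefix) {
    std::size_t n = 0;
    for (const auto& kv : t) {
        const Path& p = kv.first;
        if (p.size() > prefix.size() &&
            std::equal(prefix.begin(), prefix.end(), p.begin()))
            ++n;
    }
    return n;
}
```

Only the two nodes that were assigned exist in the map; every other combination of patient, medication and day is simply absent, which is the sense in which the structure is sparse.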
To compile programs written in C++ that use the MDH (Multi-Dimensional and Hierarchical) library, use the command:
mumpsc myprog.cpp
This will invoke the g++ compiler and make available the necessary libraries. The result will be a program named myprog.cgi which is executable. The cgi extension is used as the default because very often these programs may be used in connection with web servers. You may rename the program as you see fit, however. The script mumpsc is part of the Mumps Compiler which must be installed prior to using the toolkit.
Note: prior to exiting a program that accessed global arrays, you should execute a GlobalClose macro to shut down the global array facility. This flushes the system buffers to disk and ensures that the file system is properly closed. This appears in your program as:
GlobalClose;
There are several ways to insert and extract global array elements. They include:
You can create/modify elements of the global array using either the assignment or the shift operator. The indices of the global array may be specified as variables of type mstring, string, character string constants or pointers to character strings. The values stored at a global array node may be character string constants, pointers to strings, mstrings, strings, integers, other global arrays and floating point values. Examples (where "indx1" and "indx2" may be of either type mstring or string):

    global array1("array1");
    global global_array("global_array");

    mstring indx1="1", indx2="2";      // example index values
    mstring mstring_var="test";
    char * char_pointer="test";
    long long_variable=99;
    string string_variable="test";
    double double_variable=99.0;
    float float_variable=99.0;
    int int_variable=99;
    short short_variable=99;

    global_array("10")=99;

    // store values with the assignment operator
    array1("100") = "character string";
    array1("101") = mstring_var;
    array1(indx1) = char_pointer;
    array1(indx1,"3") = long_variable;
    array1(indx2,indx1) = string_variable;
    array1("10","2","3") = double_variable;
    array1("10","2","4") = int_variable;
    array1("10","2","5") = global_array("10");

    // store values with the shift operator
    array1("100") << "character string";
    array1("101") << mstring_var;
    array1(indx1) << char_pointer;
    array1(indx1,"3") << long_variable;
    array1(indx2,indx1) << string_variable;
    array1("10","2","3") << double_variable;
    array1("10","2","4") << int_variable;
    array1("10","2","5") << global_array("10");

    // fetch values with the assignment operator
    mstring_var = array1(indx1,"3");
    char_pointer = array1(indx1,"3");
    string_variable = (string &) array1(indx2,indx1);
    float_variable = array1("10","2","3");
    int_variable = array1("10","2","4");
    long_variable = array1("10","2","5");
    short_variable = array1("10");
    global_array("10") = array1("10");

    // fetch values with the shift operator
    array1("100") >> mstring_var;
    array1("100") >> char_pointer;
    array1(indx1) >> float_variable;
    array1(indx1,"3") >> double_variable;
    array1(indx2,indx1) >> string_variable;
    array1("10","2","3") >> int_variable;
    array1("10","2","4") >> long_variable;
    array1("10","2","5") >> global_array("1");
    array1("10","2","5") >> char_pointer;

Global arrays are sparse so not all elements need to exist. In the examples above, the lowest value of the first index of the global array is "10" but this does not imply that elements "1" through "9" exist.
The shift operator may only be used as shown above. It may not be used in multiple chained format as is the case with cin and cout. Internally, all data is stored at nodes in character string form. If you shift or assign a global array to a target whose data type is incompatible with the contents of the global array, for example, shifting text data into an integer variable, an error will result:

    int i;
    array1("100")="this is a test";
    i=array1("100");        // error - string cannot be converted to int
    array1("100") >> i;     // error - string cannot be converted to int

Note: when assigning to a global from a pointer-to-character, the contents of the array pointed to by the pointer are copied to the global array whether you use the shift (<<) or assignment form (=). However, when you assign from a global to a pointer-to-character using the assignment operator, only the address of the character string is assigned to the pointer. The actual string is not copied and the pointer reference is valid only until the global array is referenced again. Instead, you should copy the contents of the character array to the target:

    char tmp[]="this is a test";
    array1("100")=tmp;              // works - char array is copied to global
    tmp=array1("100");              // error - attempt to alter value of pointer "tmp"
    strcpy(tmp,array1("100"));      // works - global value copied to "tmp"

The above notes only apply to char arrays - not to string or mstring data:

    string tmp="this is a test";
    array1("100")=tmp;              // works - string is copied to global
    tmp=(string &)array1("100");    // works - global is copied to string
    strcpy(tmp,array1("100"));      // error - string variables may not be used with strcpy()

Alternatively, if you use the shift operator form of assignment, character strings are copied to the address specified by the contents of the target pointer:

    char tmp[]="this is a test";
    array1("100") << tmp;   // works - char array is copied to global
    array1("100") >> tmp;   // works - value of global copied to char array

The reason for the above is a restriction in the C++ language on overloading the assignment operator: the overloaded operator must be a member of the class of the left hand operand. To work around this when the left hand side is a fundamental data type (int, float, etc.), we use an overloaded cast operator that converts the right hand side to a basic data type prior to non-overloaded assignment. Thus, in the case of character strings, only the pointer is assigned. If you use the assignment operator with a pointer to character, be aware that the pointer is only valid until the next access to the same global. After another access, the pointer is undefined. For other data types, the assignment is as expected.
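The effect of a cast operator that hands out a pointer can be illustrated without the toolkit. The sketch below is a simplified model, not the toolkit's implementation: it assumes a hypothetical class NodeRef whose fetched value lives in one shared character buffer. Conversion to a number returns a copy by value, while conversion to a pointer returns the address of the buffer, so a later fetch changes what the earlier pointer sees.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a reference to a global array node. Every
// fetch writes the stored value into one shared buffer.
class NodeRef {
public:
    explicit NodeRef(const char* stored) {
        std::strncpy(buffer_, stored, sizeof(buffer_) - 1);
        buffer_[sizeof(buffer_) - 1] = '\0';
    }
    // Conversion to a pointer hands out the address of the shared buffer.
    operator const char*() const { return buffer_; }
    // Conversion to a number parses the buffer and returns a copy.
    operator long() const { return std::strtol(buffer_, nullptr, 10); }
private:
    static char buffer_[64];
};
char NodeRef::buffer_[64] = "";

// Returns true when a pointer taken from one fetch shows the contents
// of a later fetch, which is why the text recommends copying instead.
bool pointer_aliases_later_fetch() {
    const char* p = NodeRef("first");   // only the address is assigned
    long n = NodeRef("42");             // second fetch overwrites the buffer
    return n == 42 && std::strcmp(p, "42") == 0;
}
```

Copying the characters out with strcpy() immediately after the fetch, as recommended above, avoids the aliasing because the copy no longer depends on the shared buffer.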
If a numeric value is stored in a global, it may be assigned to an appropriate numeric variable. The assignment or shift operator will convert the strings stored in the global to the appropriate numeric form. It is important, however, that the data stored in the array nodes conform to the numeric type requested. For example:

    global array1("array1");
    long x;
    double y;
    string z;

    array1("1","2","3") = "test string";
    array1("1","2","4") = "100";
    array1("1","2","5") = "100.123";

    x = array1("1","2","4");              // integer 100 assigned to x
    y = array1("1","2","4");              // 100 converted to double and assigned to y
    z = (string &)array1("1","2","4");    // character string "100" assigned to z
    x = array1("1","2","5");              // integer 100 assigned to x
    y = array1("1","2","5");              // 100.123 assigned to y
    x = array1("1","2","3");              // error - string cannot be converted to long

Alternatively, the following shift operator versions have the same effect:

    array1("1","2","3") >> z;   // character string copied to z
    array1("1","2","4") >> x;   // 100 stored in x
    array1("1","2","4") >> y;   // 100. stored in y

When global array references are passed to a function, no more than one instance of the same global object should be used in the argument list. Each global object maintains a private static string which contains the most recent value fetched from the data base. When a global object is passed to a function, this string value is effectively passed. This means that, in a function reference where two references to the same global object are passed, even though they have differing indices, the value passed will be the value for the second instance of the global. This restriction only applies where there are two or more instances of the same global.
If you use a reference to a global without a parenthesized list following the name of the global, the reference will be to the most recent referenced global. Effectively, this is similar to the "naked indicator" from Mumps. Example:
Internally, the indices of global arrays are always stored as character strings (null terminated array of char). If you initialize a global array with a loop, you must ensure that the indices are converted to an appropriate character string format before using them as global array indices. Indices to globals may be either char*, string or mstring but MUST all be of the same type (i.e. all string, all char * or all mstring). For example:

    mstring A,B,C;
    for (A=0; A<1000; A++)
        for (B=0; B<1000; B++)
            for (C=0; C<1000; C++) {
                array1(A,B,C) << "0";
            }

The above initializes an array of 1 billion elements to zero.
There are several builtin functions used to navigate the globals. The two most important are the data functions and the order functions. The data functions tell you if a node exists and if it has descendants and the order functions give you the next higher (or lower) index at a given level in the global array tree.
The data functions return an integer which indicates whether the global array node is defined: 0 if the node does not exist, 1 if it has data but no descendants, 10 if it has descendants but no data, and 11 if it has both data and descendants:

    global array1("array1");
    int result;

    array1("1","11") << "foo";
    array1("1","11","21") << "bar";

    result = $data(array1("1"));                // yields 10
    result = $data(array1("1","11"));           // yields 11
    result = $data(array1("1","11","21"));      // yields 1

    result = array1("1").Data();                // yields 10
    result = array1("1","11").Data();           // yields 11
    result = array1("1","11","21").Data();      // yields 1

The $data() function corresponds to legacy usage while the Data() function is in the traditional C++ notation. Either format produces the same results.
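The result codes can be reproduced on an ordinary in-memory map, which may help in reasoning about what a given tree will report. The sketch below uses only the standard library and is not the toolkit's code; Path, SparseTree and data_code are hypothetical names, and the map keyed by subscript path is an assumed stand-in for a global array.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using Path = std::vector<std::string>;
using SparseTree = std::map<Path, std::string>;

// Emulates the data-function result codes on an in-memory tree:
// 0 = undefined, 1 = data only, 10 = descendants only, 11 = both.
int data_code(const SparseTree& t, const Path& node) {
    int code = t.count(node) ? 1 : 0;
    // Paths compare lexicographically, so if any descendant of `node`
    // exists, the first key greater than `node` is one of them.
    auto it = t.upper_bound(node);
    if (it != t.end() && it->first.size() > node.size() &&
        std::equal(node.begin(), node.end(), it->first.begin()))
        code += 10;
    return code;
}
```

With nodes ("1","11") and ("1","11","21") stored, data_code gives 10 for ("1"), 11 for ("1","11") and 1 for ("1","11","21"), matching the values listed in the example above.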
The other major navigation functions are the order functions. These give you, for a given global array index, the next ascending or descending value for the last index. There are several forms of the function. For example:

    mstring x;
    char y[16];

    array1("100") << "a";   // initialize the array with three entries
    array1("200") << "b";
    array1("300") << "c";

    x = "";                 // initialize the index with empty string
    strcpy(y,"");           // char array form

    x = $order(array1(x),1);            // get the first value of the first index: 100
    cout << x << endl;                  // writes 100
    strcpy(y,$order(array1(y),1));      // gets first index: 100
    cout << y << endl;                  // writes 100

    x = $order(array1(x),1);            // get the second value of the first index: 200
    cout << x << endl;                  // writes 200
    strcpy(y,$order(array1(y),1));      // get the second value of the first index: 200
    cout << y << endl;                  // writes 200

    x = $order(array1(x),1);            // get the third value of the first index: 300
    cout << x << endl;                  // writes 300
    strcpy(y,$order(array1(y),1));      // get the third value of the first index: 300
    cout << y << endl;                  // writes 300

    x = $order(array1(x),1);            // get the next value of the first index: empty string
    if (x == "") cout << "done" << endl;            // writes "done"
    strcpy(y,$order(array1(y),1));      // get the next value of the first index: empty string
    if (strcmp(y,"")==0) cout << "done" << endl;    // writes "done"

Each call to $order() gives the next value of the last index. The second parameter specifies the direction: 1 means ascending key order and -1 means descending key order. To get the first index, the empty string is supplied and the function returns the first index of the global array. For subsequent calls, it returns the next ascending index value until there are no more indices. Then it returns the empty string. Thus, if in the above each of the 1's in the $order() function were replaced by -1, the sequence of values printed would be 300, 200, 100, empty rather than 100, 200, 300, empty.
The $order() form of the function derives from legacy usage. Other forms of the Order() function in more traditional C++ notation that may be applied to an object of type global are:
All forms of the order functions set $test to true (1) if a non-empty index is returned and false (0) if an empty string is returned.
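The traversal rule (empty string in, first key out; last key in, empty string out) can be modelled on a std::map. This is a sketch under stated assumptions rather than the toolkit's algorithm: next_key and Vector1 are hypothetical names, and std::string's byte-wise ordering is assumed, which may not match the collating order the toolkit applies to its indices.

```cpp
#include <map>
#include <string>

using Vector1 = std::map<std::string, std::string>;

// Returns the neighbouring key of `key` in direction dir (1 ascending,
// -1 descending), or "" when no further key exists. An empty `key`
// starts from the first (ascending) or last (descending) entry.
std::string next_key(const Vector1& v, const std::string& key, int dir) {
    if (dir >= 0) {
        auto it = key.empty() ? v.begin() : v.upper_bound(key);
        return it == v.end() ? std::string() : it->first;
    }
    auto it = key.empty() ? v.end() : v.lower_bound(key);
    if (it == v.begin()) return std::string();   // nothing below `key`
    --it;
    return it->first;
}
```

A full scan is then a loop that starts with the empty string and feeds each returned key back in until the empty string comes back, the same shape as the $order() loops in this guide.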
In the following example, we build a global array vector from an input file consisting of keywords with one keyword per line, keep a count of each time the keyword is used, and, at the end, print an alphabetized list of the keywords followed by the number of times each occurs:

    #include <mumpsc/libmpscpp.h>

    global key("key");

    int main() {
        mstring w;
        char word[64];

        while (1) {
            cin.getline(word,63,'\n');      // read a word
            if (cin.eof()) break;           // exit if none
            if ($data(key(word)))           // is word in vector?
                key(word)++;                // yes, increment count
            else
                key(word) << 1;             // not in vector - add
        }

        w = "";                                         // empty string begins
        while ((w = $order(key(w),1)) != "")            // next word
            cout << w << " " << key(w) << endl;         // print word and count

        GlobalClose;
        return 0;
    }

In the above, each line is read into the variable word until the end of file is reached. Each word is tested with the $data() function of the global array to determine if word exists in the key vector. The $data() function returns zero if the element does not exist, non-zero if it does. In the case where the word is in the key global array vector, the count stored in the vector for the word is incremented in place. If the word does not exist in the vector, it is added and its initial count is set to one.
When all the words have been read and stored into the vector, the program sequences through the word entries and prints the words and the total number of times each one was present in the input file. Since global arrays are stored in ascending key order, the display of words will be alphabetic. The function that sequences through the vector is the $order() function. When the function is passed a string containing a value, it returns the contents of the string with the next ascending index from the vector or the empty string if there are no indices in the vector greater than the string passed. If the empty string is passed to the function, the function replaces it with the first index in the vector.
Note that the char * variable word is used initially in the above because the cin.getline() function does not accept a string variable.
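For comparison, the same counting pattern can be written against a std::map, which is also kept in ascending key order but is not persistent. This is a minimal standard-library sketch and does not use the toolkit; count_keywords and report are hypothetical helper names.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Read one keyword per line and count how often each one occurs.
std::map<std::string, long> count_keywords(std::istream& in) {
    std::map<std::string, long> key;
    std::string word;
    while (std::getline(in, word))
        ++key[word];            // creates the entry at 0 on first use
    return key;
}

// Format the counts as "word count" lines in ascending key order.
std::string report(const std::map<std::string, long>& key) {
    std::ostringstream out;
    for (const auto& kv : key)
        out << kv.first << " " << kv.second << "\n";
    return out.str();
}
```

The map version needs no existence test because operator[] creates a zero count on first use; the global array version makes that step explicit with $data().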
Similarly, given a global array of patient lab data organized hierarchically first by patient id, then by lab test, then by date, we can print a table of patient id's, labs, dates and results with the following:

    #include <mumpsc/libmpscpp.h>

    global labs("labs");

    int main() {
        mstring ptid,lab_test,date,rslt;

        // create dummy example data base
        labs("1000","hct","July 12, 2003")="45";
        labs("1000","hct","July 13, 2003")="46";
        labs("1000","hct","July 14, 2003")="47";
        labs("1000","hct","July 15, 2003")="48";
        labs("1000","hgb","July 12, 2003")="15";
        labs("1000","hgb","July 15, 2003")="14";
        labs("1001","hct","July 12, 2003")="35";
        labs("1001","hct","July 13, 2003")="36";
        labs("1001","hct","July 14, 2003")="37";
        labs("1001","hct","July 15, 2003")="38";
        labs("1001","hgb","July 13, 2003")="15";
        labs("1001","hgb","July 14, 2003")="15";
        labs("1002","hct","Sept 12, 2003")="35";
        labs("1002","hct","Sept 13, 2003")="36";
        labs("1002","hct","Sept 14, 2003")="37";
        labs("1002","hct","Sept 15, 2003")="38";
        labs("1002","hgb","Sept 13, 2003")="15";
        labs("1002","hgb","Sept 14, 2003")="15";

        ptid = "";
        while (1) {                                 // each patient id
            ptid = $order(labs(ptid),1);
            if (ptid == "") break;
            lab_test = "";
            while (1) {                             // each lab test for this patient
                lab_test = $order(labs(ptid,lab_test),1);
                if (lab_test == "") break;
                date = "";
                while (1) {                         // each date for this test
                    date = $order(labs(ptid,lab_test,date),1);
                    if (date == "") break;
                    cout << ptid << " " << lab_test << " " << date;
                    cout << " " << labs(ptid,lab_test,date) << endl;
                }
            }
        }

        GlobalClose;
        return 0;
    }

The above begins with an empty string for patient id ptid. This is used at the outer loop level to cycle through all the patient ids. At the first nested loop, the program cycles through all the lab test names (lab_test) then at the innermost level, it cycles through all the dates (date). The resulting table is of the form:

    1000 hct July 12, 2003 45
    1000 hct July 13, 2003 46
    1000 hct July 14, 2003 47
    1000 hct July 15, 2003 48
    1000 hgb July 12, 2003 15
    1000 hgb July 15, 2003 14
    1001 hct July 12, 2003 35
    1001 hct July 13, 2003 36
    1001 hct July 14, 2003 37
    1001 hct July 15, 2003 38
    1001 hgb July 13, 2003 15
    1001 hgb July 14, 2003 15

If the database from the previous example is modified slightly, it can be viewed purely as a table or a relation (for more detail on relational access, see below). To accomplish this, the data values are moved into the array reference as a final index and the empty string is stored for each node.
To perform tabular access to the database, we use the Select() primitive function which returns successive rows from a global array viewed as a tree. In the following example, we access and print the lab values for patient "1001":

    #include <mumpsc/libmpscpp.h>

    global labs("labs");

    int main() {
        mstring ptid,test,date,rslt;

        // create dummy example data base
        labs("1000","hct","July 12, 2003","45") = "";
        labs("1000","hct","July 13, 2003","46") = "";
        labs("1000","hct","July 14, 2003","47") = "";
        labs("1000","hct","July 15, 2003","48") = "";
        labs("1000","hgb","July 12, 2003","15") = "";
        labs("1000","hgb","July 15, 2003","14") = "";
        labs("1001","hct","July 12, 2003","35") = "";
        labs("1001","hct","July 13, 2003","36") = "";
        labs("1001","hct","July 14, 2003","37") = "";
        labs("1001","hct","July 15, 2003","38") = "";
        labs("1001","hgb","July 13, 2003","15") = "";
        labs("1001","hgb","July 14, 2003","15") = "";
        labs("1002","hct","Sept 12, 2003","35") = "";
        labs("1002","hct","Sept 13, 2003","36") = "";
        labs("1002","hct","Sept 14, 2003","37") = "";
        labs("1002","hct","Sept 15, 2003","38") = "";
        labs("1002","hgb","Sept 13, 2003","15") = "";
        labs("1002","hgb","Sept 14, 2003","15") = "";

        ptid=""; test=""; date=""; rslt="";
        while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
            if (ptid != "1001") continue;       // skip rows for other patients
            cout << ptid << " " << test << " " << date << " " << rslt << endl;
        }

        GlobalClose;
        return 0;
    }

Rows of the database are presented in overall key ascending order. Those rows whose first columns do not contain "1001" are rejected while those containing the value are printed.
Using the database from above, a set of simple speedup techniques can be applied by starting the scan at the patient id and terminating the scan when the next patient id appears:

    #include <mumpsc/libmpscpp.h>

    global labs("labs");

    int main() {
        mstring ptid,test,date,rslt;

        ptid="1001"; test=""; date=""; rslt="";
        while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
            if (ptid != "1001") break;          // next patient reached - stop
            cout << ptid << " " << test << " " << date << " " << rslt << endl;
        }

        GlobalClose;
        return 0;
    }

In the above example, by setting the initial value of ptid to "1001", the scan will begin at that point in the table. Any or all of the leading column values may be specified in this manner to target to a specific starting point. For example, to print the "hct" values only of patient "1001" using the database from above:

    #include <mumpsc/libmpscpp.h>

    global labs("labs");

    int main() {
        mstring ptid,test,date,rslt;

        ptid="1001"; test="hct"; date=""; rslt="";
        while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
            if (ptid != "1001" || test != "hct" ) break;    // past the rows of interest
            cout << ptid << " " << test << " " << date << " " << rslt << endl;
        }

        GlobalClose;
        return 0;
    }

Note: if one or more column values are supplied, they must be the initial column values and there may be no intervening values specified as the empty string.
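The speedup relies on rows with the same leading column values being adjacent in key order. The sketch below models that with a std::map keyed by the whole row; it is an illustration under that assumption and does not call Select(). Row, Table and scan_prefix are hypothetical names.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<std::string>;
using Table = std::map<Row, std::string>;

// Collect every row whose leading columns equal `prefix`. The scan
// starts at the first candidate row and stops at the first row that
// no longer matches, so rows outside the range are never visited.
std::vector<Row> scan_prefix(const Table& t, const Row& prefix) {
    std::vector<Row> out;
    for (auto it = t.lower_bound(prefix); it != t.end(); ++it) {
        const Row& r = it->first;
        if (r.size() < prefix.size() ||
            !std::equal(prefix.begin(), prefix.end(), r.begin()))
            break;                      // left the matching range
        out.push_back(r);
    }
    return out;
}
```

Because the prefix must be a run of leading columns for the matching rows to be contiguous, a value supplied for a later column with an empty earlier column cannot be used to position the scan, which is the restriction stated in the note above.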
To copy the results to another global array using the database from above:

    #include <mumpsc/libmpscpp.h>

    global labs("labs");
    global tmp("tmp");

    int main() {
        mstring ptid,test,date,rslt;

        tmp().Kill();   // delete any prior values

        ptid="1001"; test="hct"; date=""; rslt="";
        while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
            if ( test != "hct" || !(rslt > "40") ) continue;    // keep only hct values above 40
            cout << ptid << " " << test << " " << date << " " << rslt << endl;
            tmp(ptid,test,date,rslt) = "";      // build new array
        }

        GlobalClose;
        return 0;
    }

In the above example, the array tmp() is built consisting only of "hct" tests whose values were above "40". The array being built may be constructed from all or some of the column values extracted from the source array, arranged in any order, and may contain column values from other sources. For example, to identify all the patients with diagnosis code "Y06" whose "hct" values are less than "40" using a global array named diagnosis whose columns are patient id, diagnostic code and date and the labs() from above:

    #include <mumpsc/libmpscpp.h>

    global labs("labs");
    global diagnosis("diagnosis");
    global tmp("tmp");

    int main() {
        mstring ptid,test,date,rslt;

        tmp().Kill();   // delete any prior values

        ptid=""; test=""; date=""; rslt="";
        while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
            if ( test != "hct" || rslt > "40" || rslt == "40" ) continue;   // keep only hct values below 40
            if ( !$data(diagnosis(ptid,"Y06")) ) continue;                  // row does not exist
            cout << ptid << " " << test << " " << date << " " << rslt << endl;
            tmp(ptid) = "";     // build new array
        }

        GlobalClose;
        return 0;
    }

Relational Operations on Globals
Global arrays, if properly constructed, can be the subject of basic relational operations. For example, consider the following:
Global array names and column meanings:

    patient(P,NAME,ADDRESS,SEX)
    lab(L,TEST,NORMALS)
    med(M,MED,QTY)
    ptlab(P,L,RSLT,DATE)
    ptmed(P,M,DATE)

Where:

    P        is patient id number
    NAME     is patient name
    ADDRESS  is patient home city
    SEX      is patient gender
    L        is test id number
    TEST     is lab test name
    NORMALS  is lab test normal values
    M        is medication id number
    MED      is medication name
    QTY      is quantity in inventory
    RSLT     is lab test result
    DATE     is date of administration

The global arrays are defined in code as:

    global patient("patient");
    global lab("lab");
    global med("med");
    global ptlab("ptlab");
    global ptmed("ptmed");

A possible set of values in the global array data base might be:

    patient("001","Jones","Boston","Male") = "";
    patient("002","Smith","New York","Female") = "";
    patient("003","Blake","Washington","Male") = "";
    patient("004","Doe","Hartford","Female") = "";
    patient("005","Morley","New York","Male") = "";

    lab("100","Hct","38-54%") = "";
    lab("101","Hgb","14-18 Gm.") = "";
    lab("102","Platelets","200-500k") = "";
    lab("103","Acetone","0.3-2 mg/100ml") = "";
    lab("104","Cholesterol","150-250 mg/100ml") = "";
    lab("105","Creatinine","70-140 mcg/100ml") = "";
    lab("106","Iron","75-175 mcg/100ml") = "";
    lab("107","Uric Acid","3-6 mg/100ml") = "";

    med("200","Protamine Sulfate","125") = "";
    med("201","Quinidine Sulfate","150") = "";
    med("202","Probenecid","90") = "";
    med("203","Allopurinol","200") = "";
    med("204","Colchicine","50") = "";
    med("205","Hydrochlorothiazide","100") = "";

    ptlab("001","107","8.5","1-Jul-84") = "";
    ptlab("001","100","42","1-Jul-84") = "";
    ptlab("002","103","250k","1-Aug-84") = "";
    ptlab("003","107","80","1-Sep-84") = "";
    ptlab("004","104","1.1","1-Oct-84") = "";
    ptlab("005","107","9.0","1-Nov-84") = "";

    ptmed("001","204","1-Jul-84") = "";
    ptmed("001","205","1-Jul-84") = "";
    ptmed("005","203","1-Nov-84") = "";
    ptmed("005","206","1-Nov-84") = "";

Example queries answered by relational manipulations:
    // get medication codes for medication names
    global t1("t1");
    kill (t1());
    mstring mcode="",mname="",qty="";
    while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
        if (mname != "Colchicine" && mname != "benemid" ) continue;
        t1(mcode)="";   // create node by medication code
    }

    // get list of patients who are taking one or more of these meds
    mstring ptid="",date="",code="";
    mcode="";
    // for each row of "ptmed"
    while ( ptmed(ptid,mcode,date).Select(ptid,mcode,date) != NULL) {
        code = "";
        // for each medication code sought
        while ( t1(code).Select(code) != NULL)
            if (code == mcode ) {
                // get/print the name and address of ptid
                mstring name="",addr="",s="";
                patient(ptid,name,addr,s).Select(ptid,name,addr,s);
                cout << "PTID=" << ptid << endl;
                cout << "Name=" << name << endl;
                cout << "Address=" << addr << endl;
            }
    }
    /* get med code for med name */
    mstring mcode="",mname="",qty="";
    while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
        if (mname != "Hydrochlorothiazide") continue;
        break;
    }

    /* get list of patients who are taking this med */
    mstring ptid="",code="",date="";
    global t1("t1");
    kill (t1());
    while ( ptmed(ptid,code,date).Select(ptid,code,date) != NULL)
        if (code == mcode ) t1(ptid) = "";

    /* create list t2() of patients who are not in t1() */
    global t2("t2");
    kill (t2());
    mstring name="",addr="",s="";
    ptid="";
    while ( patient(ptid,name,addr,s).Select(ptid,name,addr,s) != NULL ) {
        if ($data(t1(ptid))) continue;
        t2(ptid)="";
    }

    /* get the names and addresses of patients in t2() */
    ptid="";
    while ( t2(ptid).Select(ptid) != NULL ) {
        name = addr = s = "";
        patient(ptid,name,addr,s).Select(ptid,name,addr,s);
        cout << "PTID=" << ptid << endl;
        cout << "Name=" << name << endl;
        cout << "Address=" << addr << endl;
    }
    /* get lab code number */
    mstring lcode="",test="",norm="";
    while (lab(lcode,test,norm).Select(lcode,test,norm) != NULL)
        if (test == "Uric Acid" ) break;

    /* find ptid's of those who have had a result > 7 for test lcode */
    global t1("t1");
    kill (t1());
    mstring ptid="",code="",rslt="",date="";
    while (ptlab(ptid,code,rslt,date).Select(ptid,code,rslt,date) != NULL)
        if (code == lcode && rslt > 7) t1(ptid)="";

    /* get med code for "Probenecid" */
    mstring mcode="",mname="",qty="";
    while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
        if (mname != "Probenecid") continue;
        break;
    }

    /* print those patients in t1() who are taking the medication */
    ptid="";
    while (t1(ptid).Select(ptid) != NULL) {
        if ($data(ptmed(ptid,mcode))) {
            mstring name="",addr="",s="";
            patient(ptid,name,addr,s).Select(ptid,name,addr,s);
            cout << "PTID=" << ptid << endl;
            cout << "Name=" << name << endl;
            cout << "Address=" << addr << endl;
        }
    }
As can be seen, these manipulations are very similar from one query to the next. From a relational algebra point of view, the basic manipulations are:
Union:

    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) t3(a,b,c)="";
    a = b = c = "";
    while (t2(a,b,c).Select(a,b,c) != NULL ) t3(a,b,c)="";
Intersection:

    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL )
        if ($data(t2(a,b,c))) t3(a,b,c)="";
Difference:

    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL )
        if (!$data(t2(a,b,c))) t3(a,b,c)="";
Cartesian product:

    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d="",e="",f="";   // restart the scan of t2() for each row of t1()
        while (t2(d,e,f).Select(d,e,f) != NULL )
            t3(a,b,c,d,e,f)="";
    }
Selection:

    global t1("t1");
    global t2("t2");
    kill (t2());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL )
        if (a == "aaa" && b < "bbb" ) t2(a,b,c)="";
Projection:

    global t1("t1");
    global t2("t2");
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) t2(a,c)="";

or, in combination with selection:

    global t1("t1");
    global t2("t2");
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL )
        if (a == "aaa" && b < "bbb" ) t2(a,c)="";
Join:

    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d="",e="",f="";
        while (t2(d,e,f).Select(d,e,f) != NULL )
            if ( c == d ) t3(a,b,c,e,f)="";
    }
In the example code, the rows of both relations are scanned and the values of the third column of the first relation (variable "c") are compared with the values of the first column (variable "d") of the second relation. If each relation contains 100 rows, the above would test 10,000 row combinations. This can be sped up considerably by rewriting the code as follows:
    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d=c,e="",f="";   // begin scan of t2() at value of d equal to c
        while (t2(d,e,f).Select(d,e,f) != NULL ) {
            if ( c != d ) break;  // past the last matching row
            t3(a,b,c,e,f)="";
        }
    }
Here, each scan of the second relation begins with the first row containing a value for the first column which is equal to the third column of the first relation. The scan of the second relation terminates when the value of the first column is no longer equal to the value of the third column from the first relation.
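The effect of starting the inner scan at the matching key can be illustrated outside the toolkit. The following is only a sketch in standard C++: it assumes that a relation can be modeled as a std::set of three-column rows held in ascending order, and the names Row3, Relation, Row5 and equi_join are hypothetical and not part of the MDH library.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-ins: a relation is an ordered set of three-column rows.
using Row3 = std::tuple<std::string, std::string, std::string>;
using Relation = std::set<Row3>;
using Row5 = std::tuple<std::string, std::string, std::string,
                        std::string, std::string>;

// Join t1 and t2 where column 3 of t1 equals column 1 of t2.
std::vector<Row5> equi_join(const Relation& t1, const Relation& t2) {
    std::vector<Row5> out;
    for (const Row3& r : t1) {
        const std::string& c = std::get<2>(r);
        // Position the inner scan at the first row of t2 whose first column is >= c.
        auto it = t2.lower_bound(std::make_tuple(c, std::string(), std::string()));
        // Stop as soon as the first column of t2 is no longer equal to c.
        for (; it != t2.end() && std::get<0>(*it) == c; ++it) {
            out.emplace_back(std::get<0>(r), std::get<1>(r), c,
                             std::get<1>(*it), std::get<2>(*it));
        }
    }
    return out;
}
```

With 100 rows in each relation, the inner loop of this sketch visits only the rows of the second relation that actually match, rather than all 100 for every row of the first.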
For comparisons other than equality:
    // join if col 3 of t1() < col 1 of t2()
    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d=c,e="",f="";   // begin scan of t2() at value of d equal to c
        while (t2(d,e,f).Select(d,e,f) != NULL ) {
            // skip initial cases where c is still equal to d
            if ( c == d ) continue;
            t3(a,b,c,e,f)="";
        }
    }
In the above, the scan of the second relation begins at the first row where the first column is equal to the third column of the first relation. The continue causes those rows where "c" and "d" are equal to be skipped. Since the rows are presented in ascending key order, once the rows where "c" equals "d" have been skipped, only rows where "c" is less than "d" follow.
Similarly, for a greater-than relation:
    // join if col 3 of t1() > col 1 of t2()
    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d="",e="",f="";
        while (t2(d,e,f).Select(d,e,f) != NULL ) {
            // join rows so long as c is greater than d
            if ( c <= d ) break;
            t3(a,b,c,e,f)="";
        }
    }

The above terminates the inner loop when "c" is less than or equal to "d". Prior to that point, where "c" is greater than "d", rows are joined.
For relations involving columns that are not the initial columns of the second relation, other speed-up techniques are possible.
    // join if col 3 of t1() > col 3 of t2()
    global t1("t1");
    global t2("t2");
    global t3("t3");
    kill (t3());
    mstring a="",b="",c="";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d="",e="",f="";
        while (t2(d,e,f).Select(d,e,f) != NULL ) {
            // rows arrive in order of d, then e, so f is not in ascending
            // order and every row must be tested
            if ( c <= f ) continue;
            t3(a,b,c,e,f)="";
        }
    }

The above will produce minimal savings as many combinations of "d" and "e" may need to be tried in locating rows with values of "f" that meet the search criteria. In such cases, it may be more efficient to build a temporary copy of the second relation with the columns reordered so that the scan can proceed more quickly:
    // join if col 3 of t1() > col 3 of t2()
    global t1("t1");
    global t2("t2");
    global t3("t3");
    global t4("t4");
    kill (t3());
    kill (t4());
    mstring a="",b="",c="";
    while (t2(a,b,c).Select(a,b,c) != NULL ) t4(c,a,b)="";   // reordered relation
    a = b = c = "";
    while (t1(a,b,c).Select(a,b,c) != NULL ) {
        mstring d="",e="",f="";
        while (t4(f,d,e).Select(f,d,e) != NULL ) {
            // join rows so long as c is greater than f
            if ( c <= f ) break;
            t3(a,b,c,e,f)="";
        }
    }
    kill (t4());
In large joins which may result in many iterations of the inner loop, a single pass to build a temporary, reordered relation may be faster.
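The reordering idea can also be sketched in standard C++ without the toolkit. This is an illustration under stated assumptions, not the library's implementation: relations are modeled as ordered std::set containers, strings compare lexicographically, and the names reorder_third_first and join_greater are hypothetical. Unlike the listing above, the sketch keeps all six columns in the result.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Row3 = std::tuple<std::string, std::string, std::string>;
using Relation = std::set<Row3>;  // rows held in ascending key order
using Row6 = std::tuple<std::string, std::string, std::string,
                        std::string, std::string, std::string>;

// Build a temporary copy of t2 with its third column moved to the front,
// so that an ordered scan presents rows in ascending order of that column.
Relation reorder_third_first(const Relation& t2) {
    Relation t4;
    for (const Row3& r : t2)
        t4.insert(Row3(std::get<2>(r), std::get<0>(r), std::get<1>(r)));
    return t4;
}

// Join where column 3 of t1 is greater than column 3 of t2.
std::vector<Row6> join_greater(const Relation& t1, const Relation& t2) {
    const Relation t4 = reorder_third_first(t2);  // single pass over t2
    std::vector<Row6> out;
    for (const Row3& r : t1) {
        const std::string& c = std::get<2>(r);
        for (const Row3& s : t4) {                // s holds (f, d, e)
            if (c <= std::get<0>(s)) break;       // no later row can qualify
            out.emplace_back(std::get<0>(r), std::get<1>(r), c,
                             std::get<1>(s), std::get<2>(s), std::get<0>(s));
        }
    }
    return out;
}
```

The cost of the extra pass is paid once, while the early exit from the inner loop is gained on every row of the first relation.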
There are several builtin relational functions, written in Mumps, that can be called from the C++ environment. To use these, you must include the following at the beginning of your C++ program:
#include <mumpsc/libmpsrdbms.h>
The functions available (implemented as macros) are:
For a full description, see the Mumps Compiler Programmers Guide section on Relational algebra for global arrays. The macros above correspond to the functions described in the manual except the macro names are all upper case. The actual functions, which have the same names except that only the first letter is in upper case and the remainder are lower case, have two additional initial parameters used internally by the Mumps service routines. The macros automatically substitute these added parameters.
The processing functions are written in Mumps and have been compiled to an object code library. When compiling a Mumps program for use with the class library, the first line of the Mumps program must be:
+#define CPP
This line causes the compiler to omit some lines of code that would conflict with the C++ runtime routines.
There are several functions for locking portions of the data base. Following legacy convention, a lock does not prevent access to an element but merely flags the element as locked. Locking views a global array as a tree structure. If an element is locked, its descendants are locked. An attempt to lock a locked element, or an element that has a locked parent or a locked descendant, will fail. The primary locking functions are $lock(), Lock() and UnLock():

    if ($lock(gbl(a,b,c))) cout << "locked" << endl;
    if (gbl(a,b,c).Lock()) cout << "locked" << endl;
    gbl(a,b,c).UnLock();

The $lock() and Lock() functions test whether the node can be locked and lock it if possible. They return true (1) if successful and false (0) otherwise ($test is set accordingly). A node can be locked if it is not itself locked, if it has no descendants that are locked and if it is not the descendant of a locked node. The UnLock() function releases a lock on a node.
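The tree rule for locks can be modeled in a few lines of standard C++. The sketch below is not the toolkit's lock manager: it assumes a single process, represents a node as the list of its name and subscripts, and uses hypothetical names (Path, LockTable, is_prefix). A lock request is refused when the node, one of its ancestors or one of its descendants is already locked.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// A node is identified by its global name followed by its subscripts.
using Path = std::vector<std::string>;

// True if p is q itself or an ancestor of q in the tree.
static bool is_prefix(const Path& p, const Path& q) {
    return p.size() <= q.size() && std::equal(p.begin(), p.end(), q.begin());
}

class LockTable {
public:
    // Returns true and records the lock if the node can be locked.
    bool lock(const Path& node) {
        for (const Path& held : locks_) {
            // Conflict: a held lock is the node, an ancestor or a descendant.
            if (is_prefix(held, node) || is_prefix(node, held)) return false;
        }
        locks_.insert(node);
        return true;
    }
    void unlock(const Path& node) { locks_.erase(node); }  // release one lock
    void clear() { locks_.clear(); }                       // release all locks
private:
    std::set<Path> locks_;
};
```

For example, once the node for gbl("a","b") is locked in this model, requests for gbl("a") and gbl("a","b","c") both fail, while gbl("a","x") succeeds.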
Additionally, there are functions to release all locks for the current process and all locks for all processes:
    CleanLocks();      // release all locks for this process only
    CleanAllLocks();   // release all locks for all processes
There are several other basic support functions available. One of these is the $piece() function. This function takes either three or four arguments. The first is a source string (pointer to character), the second is a pattern string (pointer to character), and the third and fourth are integers. The fourth may be omitted. The function returns the "piece" of the source string delimited by the pattern. If the fourth argument is not present, the "piece" returned is the text between the (i-1)th and ith instances of the pattern string, where i is the third argument. For example:

    $piece("abc.def.ghi",".",1) yields abc
    $piece("abc.def.ghi",".",2) yields def
    $piece("abc.def.ghi",".",3) yields ghi

If the fourth argument is present, the piece returned is the text delimited by the instances of the pattern selected by the third and fourth arguments.
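To make the piece numbering concrete, here is a small stand-alone sketch in standard C++. It is not the toolkit's $piece(); the function name piece is hypothetical, and the four-argument form assumes the conventional Mumps behavior of returning pieces i through j with the delimiters between them retained.

```cpp
#include <string>

// Return pieces `from` through `to` (1-based, inclusive) of `src`, where
// pieces are separated by `delim`. Interior delimiters are retained.
std::string piece(const std::string& src, const std::string& delim,
                  int from, int to) {
    if (from < 1) from = 1;
    if (delim.empty() || to < from) return "";
    std::size_t start = 0;
    // Skip the first from-1 pieces.
    for (int k = 1; k < from; ++k) {
        std::size_t pos = src.find(delim, start);
        if (pos == std::string::npos) return "";   // fewer pieces than requested
        start = pos + delim.size();
    }
    // Extend the result up to the end of piece `to`.
    std::size_t scan = start;
    for (int k = from; k <= to; ++k) {
        std::size_t pos = src.find(delim, scan);
        if (pos == std::string::npos) return src.substr(start);  // ran off the end
        if (k == to) return src.substr(start, pos - start);
        scan = pos + delim.size();
    }
    return src.substr(start);
}

// Three-argument form: a single piece.
std::string piece(const std::string& src, const std::string& delim, int i) {
    return piece(src, delim, i, i);
}
```

Under these assumptions, piece("abc.def.ghi", ".", 2) gives def and piece("abc.def.ghi", ".", 1, 2) gives abc.def.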
Taking the above into account, it is possible to build a larger example using the GenBank gbkey.idx file. The format of this file is a line of keyword text followed by one or more lines of reference to locations where the keyword text applies. Each reference line begins with a TAB character followed by one or more locus codes, followed by a TAB character followed by a division code followed by an accession id. A typical entry is:
1,4-alpha-D-glucan glucanohydrolase
	ECOFTAA	BCT	L01642
	STYFTAA	BCT	L01643
	RICAAMYA	PLN	M24286
	RICAAMYB	PLN	M24287
	BMAMY	BCT	X07261
	AHAAMYG	BCT	X58627
	ECMALS	BCT	X58994
The following program will construct a matrix giving for each accession (row) the keyword phrases that apply to the accession:
    #include <mumpsc/libmpscpp.h>

    global mat("mat");

    int main() {
        char line[1024];
        mstring key;
        mstring locus;
        mstring div;
        mstring accession;
        long key_count=0,acc_count=0;

        // read the index: a keyword line followed by TAB-prefixed reference lines
        while (1) {
            cin.getline(line,1023, '\n');
            if (cin.eof()) break;
            if (line[0]!='\t') {
                key = line;          // new keyword phrase
                continue;
            }
            locus = $piece(line,"\t",2);
            div = $piece(line,"\t",3);
            accession = $piece(line,"\t",4);
            mat(accession,key) = "";
        }

        // count the accessions and the keywords stored under each
        accession = "";
        while (1) {
            accession = $order(mat(accession),1);
            if (accession == "") break;
            acc_count++;
            key = "";
            while (1) {
                key = $order(mat(accession,key),1);
                if ( key == "") break;
                key_count++;
            }
        }

        cout << "average number of keys per accession: "
             << (float) key_count / acc_count << endl;
        GlobalClose;
        return 0;
    }
Additionally, the Perl Compatible Regular Expression Library is available through the $perl() macro (see Appendices C and D). For example:
    #include <mumpsc/libmpscpp.h>
    #include <stdlib.h>

    int main () {
        char line[1024]="acgtcgctcggctgcgctcgagctcgagagactgcgctgctcgaagagctagag";
        cout << $perl(line,"gctgcg[acgt]tcgagctcga") << endl;
        GlobalClose;
        return 0;
    }
The above prints 1.
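The same test can be approximated without the toolkit using the standard C++ regular expression library. This is a sketch only: std::regex uses ECMAScript syntax by default rather than PCRE, so the two agree for simple patterns such as this one but not for every Perl construct, and the function name matches is hypothetical.

```cpp
#include <regex>
#include <string>

// Return 1 if `pattern` matches anywhere within `subject`, 0 otherwise.
int matches(const std::string& subject, const std::string& pattern) {
    return std::regex_search(subject, std::regex(pattern)) ? 1 : 0;
}
```

Called with the sequence and pattern from the example above, this sketch also returns 1, because the subsequence gctgcg, one further base and tcgagctcga occur consecutively in the line.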
Invoking the Mumps Interpreter
The full facilities of the Mumps interpreter can be invoked from C++ programs. The interpreter reads, parses and executes commands presented to it at run time. It may also read and execute text files containing Mumps programs. The interpreter is invoked by means of the Xecute() macro and xecute() functions:
int Xecute("command")
int xecute(mstring command)
int xecute(string command)
int xecute(char * command)
These functions and the macro invoke the Mumps interpreter and execute the text supplied in place of "command". They return 1 if successful, 0 otherwise. With Xecute(), if the Mumps command contains quotes or other special symbols, they will be automatically prefixed with backslashes (e.g., a quote becomes \").
    Xecute("set i="test"");
    Xecute("for s i=$order(^a(i)) quit:i="" set sum=sum+^a(i)");
Details on the Mumps Language are contained in the file compiler.html in the mumpsc/doc subdirectory of the Mumps Compiler distribution.
Hashing Example
The following example stores lines of text into a global array based on a hash function calculation of each line. It reads lines of text from stdin and submits each line to a simple hash function that produces an unsigned long which is converted to a character string (char *) and returned. The resulting character string is copied to the string variable x. The input line is stored at hash_table(x,ii) where ii is a string value between 0 and 999. The value of ii is determined by locating the first ascending integer not already in use. If a given hash result produces more than 1000 collisions, the process terminates with an error message.
    #include <mumpsc/libmpscpp.h>

    global hash_table("hash");   // global array

    // hash() and cvt() are assumed to be defined elsewhere in the program

    int main() {
        char in[1024];
        string x;
        long i;

        while (fgets(in,1024,stdin)!=NULL) {
            x = hash(in);                              // hash input line
            for (i=0; i<1000; i++) {
                string ii=cvt(i);
                if ($data(hash_table(x,ii))==0) {      // find a slot
                    hash_table(x,ii)=in;               // add line to database
                    cout << x << "," << ii << " " << in << endl;
                    break;
                }
            }
            if (i>=1000) {                             // all 1000 slots in use
                cout << "Too many collisions " << x << endl;
                GlobalClose;
                return 1;
            }
        }
        GlobalClose;
        return 0;
    }
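The listing above calls hash() and cvt() without showing them. One possible pair of helpers is sketched below in standard C++. These are assumptions, not the functions used by the author: the hash is a simple multiply-and-add scheme chosen for illustration, and the names hash_line and cvt_index are hypothetical (they differ from hash and cvt to avoid a clash with std::hash when the std namespace is in scope).

```cpp
#include <string>

// Hypothetical helper: hash a line of text to an unsigned long and return
// the value as a decimal string suitable for use as a global array index.
std::string hash_line(const char* line) {
    unsigned long h = 5381;                        // arbitrary starting value
    for (const unsigned char* p =
             reinterpret_cast<const unsigned char*>(line); *p; ++p) {
        h = h * 33 + *p;                           // wraps on overflow
    }
    return std::to_string(h);
}

// Hypothetical helper: convert a slot number to its decimal string form.
std::string cvt_index(long i) {
    return std::to_string(i);
}
```

Any function that maps a line to a reasonably well distributed string would serve; the collision loop in the listing handles lines that hash to the same value.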
Linking to Compiled Mumps Functions
You may compile functions in Mumps and call them from C++ programs. If you do, you must begin each file of functions with:
#define CPP
which disables some code that would otherwise conflict with the class libraries. If you do not use the class libraries, you may omit this line.
See the Mumps Compiler Programmers Guide for details.
Writing Active Web Server Pages
C++ programs can be written with the toolkit to be web server active pages. For example:
Web page HTML code:
A C++ program can accept data from the web page, store the data in global arrays and return a summary web page to the browser. When using "get" mode data transmission from HTML forms, the form names and data are concatenated into a string of "name=value" tokens delimited by ampersands. The string is passed in an environment variable named QUERY_STRING. The include file mumpsc/cgi.h contains code to extract data from QUERY_STRING and store the data in the runtime symbol table. The function $SymGet() can be used to retrieve values from the runtime symbol table.
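The splitting of QUERY_STRING into names and values can be sketched in standard C++ as follows. This is not the code in mumpsc/cgi.h: the function names parse_query and query_from_environment are hypothetical, a std::map stands in for the runtime symbol table, and decoding of %XX escapes and of "+" as a space is omitted for brevity.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Split "name=value&name=value..." into a name-to-value table.
std::map<std::string, std::string> parse_query(const std::string& qs) {
    std::map<std::string, std::string> table;
    std::size_t pos = 0;
    while (pos <= qs.size()) {
        std::size_t amp = qs.find('&', pos);
        if (amp == std::string::npos) amp = qs.size();
        const std::string token = qs.substr(pos, amp - pos);
        if (!token.empty()) {
            const std::size_t eq = token.find('=');
            if (eq == std::string::npos)
                table[token] = "";                       // name with no value
            else
                table[token.substr(0, eq)] = token.substr(eq + 1);
        }
        pos = amp + 1;                                   // step past the '&'
    }
    return table;
}

// Read the string from the environment, as a web server would supply it.
std::map<std::string, std::string> query_from_environment() {
    const char* qs = std::getenv("QUERY_STRING");
    return parse_query(qs ? qs : "");
}
```

For a form submitted as name=Jones&city=Boston, the table in this sketch holds the value Jones under name and Boston under city.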