MDH Toolkit Programmer's Guide


Mumps/MDH Toolkit
MDH: The Multi-Dimensional and Hierarchical Database Toolkit
Programmer's Guide
Version 2.1
Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
okane@cs.uni.edu
http://www.cs.uni.edu/~okane
April 18, 2005

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

The software is distributed under one of the following licenses (please see each source code module for specific copyright and license details applicable to that module). In general, the compiler itself is distributed under the GNU GPL license and the run-time support routines are distributed under the GNU LGPL.

GNU General Public License
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
GNU Lesser General Public License
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Full texts of the licenses appear at the end of this document. Programs may call upon the Perl Compatible Regular Expression Library which, in some cases, is distributed with the Mumps Compiler. The separate license and copyright statement for PCRE appears in Appendix B. You should also read the license provided with the Berkeley Data Base (http://www.sleepycat.com).

Contents

Part I - Programmer's Guide

Software Distributions
Important Notes
Introduction
Creating Global Arrays
Structure of Global Arrays
Compiling Programs
Accessing Global Arrays
Global Array Indices
Navigating Globals
Tabular Access to Globals
Relational Algebra Operations on Globals
Builtin Relational Algebra Functions
Locking the Data Base
Pattern Matching
Invoking the Mumps Interpreter
Programming Examples
Linking to Compiled Mumps Functions
Writing Active Web Server Pages
Hash class
Class mstring
Btree Access
Function and Macro Library
- Mumps Related Functions
- Vector and Matrix Functions
- Text Processing, Searching and Retrieval Functions
- Bioinformatics Related Functions
  - Smith-Waterman Alignment Function
- Global Array Functions
- Pattern Matching Functions
- Miscellaneous Functions
- Error Exceptions
Appendix A - Example Code
Appendix B - PCRE License
Appendix C - Using PERL Expressions
Appendix D - Mumps 95 Pattern Matching

Part I - Programmer's Guide

Software Distribution

Source code distributions are available at:

http://www.cs.uni.edu/~okane/source/

Important Notes:

Some older g++ compilers have errors in string processing libraries and may generate errors when compiling. This code was developed using g++ (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk). Use of earlier compilers may cause problems.
Likewise, recent versions of the g++ compiler have changed the manner in which string casting takes place. This affects use of the assignment operator (=) to assign global array values to strings. Consequently, either (string &) or (char *) must be used when assigning from a global to a string, for example: tmp = (string &) array1("100");
Versions after April 17, 2004, changed some function call parameters, most notably xecute().
Details on installation of the Toolkit for Linux, Windows XP and Cygwin are contained in the Mumps Compiler Manual (compiler.html) which is part of the distribution package as well as operating specific "INSTALL" files contained in the distribution.
Stack size can be an issue for some functions, most notably the Smith-Waterman alignment procedure. For Windows XP programs, the stack size is set in the file "mumpsc.bat" in the batch variable "STACK", which defaults to 5,000,000. It may be raised (or lowered) as needed. Under Linux, the stack size is controlled by the "ulimit" command and can be increased with:
ulimit -s unlimited
(Other options are ulimit -a and ulimit -aH to show limits).

Introduction

The MDH (Multi-Dimensional and Hierarchical) Database Toolkit is a Linux-based, open source toolkit of portable software that supports very fast, flexible, multi-dimensional and hierarchical storage, retrieval and manipulation of information in data bases ranging in size up to 256 terabytes. The package is written in C and C++ and is available under the GNU GPL/LGPL licenses in source code form. The distribution kit contains demonstration implementations of network-capable, interactive text and sequence retrieval tools that function with very large genomic data bases and illustrate the toolkit's capability to manipulate massive data sets of genomic information.

The toolkit is distributed as part of the Mumps Compiler. Versions exist for Linux, Cygwin, the DJGPP port of the GCC compiler for Windows XP and the command line version of the Microsoft Visual C++ Compiler.

The toolkit is a solution to the problem of manipulating very large, character string indexed, multi-dimensional, sparse matrices. It is based on Mumps (also referred to as M), a general purpose programming language that originated in the mid-1960s at the Massachusetts General Hospital. The toolkit supports access to the PostgreSQL relational data base server, the Perl Compatible Regular Expression Library, the Berkeley Data Base, and the Glade GUI builder as well as server-side development of interactive web pages.

The principal database feature in this project is the global array, which permits direct, efficient manipulation of multi-dimensional arrays of effectively unlimited size. A global array is a persistent, sparse, undeclared, multi-dimensional, string indexed, disk based data structure. A global array may appear anywhere an ordinary array reference is permitted, and data may be stored at leaf nodes as well as intermediate nodes in the data base array. The number of subscripts in an array reference is limited only by the total length of the array reference with all subscripts expanded to their string values. The toolkit includes several functions to traverse the data base and manipulate the arrays.

The toolkit makes the data base and function set available as C++ classes and also permits execution of legacy Mumps scripts. To use the toolkit, you install the MDH and Mumps distribution kit and related code.

You must also use a recent version of the g++ compiler. Many older versions do not include recent changes to the C preprocessor standard and will not work. The code presented here was compiled and tested using g++ version 3.2.2.

Creating Global Arrays

The class, function and macro libraries primarily operate on global arrays. Global arrays are undimensioned, string indexed, disk resident data structures whose size is limited only by available disk space. They can be viewed either as multi-dimensional sparse matrices or as tree structured hierarchies. Global arrays are a C++ class and must be declared or instantiated in your C++ program as an instance of the global class. For example, to create the global named "gbl", do the following:

      #include <mumpsc/libmpscpp.h>
      global gbl("gbl");

The instantiation consists of two parts: the name of the global array object and the name of the global array on disk associated with this object. In the above example, these are both "gbl". Note that the disk name of the global is enclosed in a parenthesized character string expression following the object name. The name in the expression need not (but usually does) match the name of the object. The name given in the parenthesized character string is the disk name of the global array. The global array object is associated with the disk name when the object is created. When the object is destroyed, the disk based global array persists.

Global objects may be created through declarations as shown above or dynamically:

      global *gptr;
      gptr = new global ("gbl_name");
      (*gptr)("1","2","3") = "test";

which is equivalent to:

      global g("gbl_name");
      g("1","2","3") = "test";

The #include <mumpsc/libmpscpp.h> statement brings in the necessary header files for your C++ program. These include, in addition to the header files necessary to access the toolkit, the standard system libraries:

      #include <iostream>
      #include <string>
      #include <string.h>
      #include <math.h>
      #include <stdlib.h>

These are referenced at the beginning of libmpscpp.h and you may modify them if your system uses different naming conventions.

Each global declaration creates a global array name (gbl) to be an object or instance of the global class. Each global array you use must be first declared to be an object of the global class.

You create a global by substituting the name of the global you want to create for "gbl" in the above. Global names can be any valid C/C++ variable name.

A global array will typically have one or more subscripts as discussed below. These will be of type mstring, string or "pointer to character" (examples: character arrays, character string constants, pointers to character strings). Subscripts of global arrays must evaluate to printable characters in the range from decimal 32 (space) through decimal 126 (tilde, ~). No data types other than mstring, string or pointer to character may be used as subscripts. Numeric data types (int, short, long, float, double, etc.) may not be used as global array subscripts.
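
The character range rule above can be checked before a string is used as a subscript. The following is a minimal sketch in standard C++ only; the helper name is_valid_subscript is hypothetical and is not part of the toolkit.

```cpp
#include <string>

// Hypothetical helper (not part of the MDH toolkit): returns true when
// every character of s lies in the printable range 32 (space) to 126 (~).
// An empty string passes vacuously.
bool is_valid_subscript(const std::string &s) {
    for (unsigned char c : s) {
        if (c < 32 || c > 126) return false;
    }
    return true;
}
```

A caller could apply such a check to text read from a file before using it as an index, since tabs and other control characters fall outside the permitted range.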

mstring is a data type (class) whose behavior is similar to the basic string data type in Mumps. Objects of mstring are stored internally as strings but may contain text, integers and floating point values. Addition, multiplication, subtraction, division, modulo, and concatenation may be performed directly on mstring objects (see details below). Many of the following examples use mstring objects.

Structure of Global Arrays

Global arrays may be viewed either as multi-dimensional matrices or as tree structured hierarchies. As matrices, data may be stored not only at fully subscripted matrix elements but also at other levels. For example, given a three dimensional matrix mat1, you could initialize it as follows:

      global mat1("mat1");

      int main() {
      mstring i,j,k;
      for (i=0; i<100; i++)
            for (j=0; j<100; j++)
                  for (k=0; k<100; k++) {
                        mat1(i,j,k)=0;
                        }
      return 0;
      }

Alternatively, the above can be performed with strings but the numeric indices must be integer and converted to string before use:

      global mat1("mat1");

      int main() {
      int i,j,k;
      for (i=0; i<100; i++)
            for (j=0; j<100; j++)
                  for (k=0; k<100; k++) {
                        string s1=cvt(i);
                        string s2=cvt(j);
                        string s3=cvt(k);
                        mat1(s1,s2,s3)=0;
                        }
      return 0;
      }

In this example, all the elements of a three dimensional matrix of 100 rows, 100 columns and 100 planes are initialized to zero. The function cvt() converts from int to string. The mstring usage is less efficient in that it does more conversions between int and string.

In the view expressed by the code above, the matrix is a traditional three dimensional structure with data stored at each fully indexed position or node.

Unlike other programming languages, however, there are additional nodes of the matrix which could have been initialized such as indicated by the following example:

      global mat1("mat1");

      int main() {
      mstring i,j,k;
      for (i=0; i<100; i++) {
            mat1(i)=i;
            for (j=0; j<100; j++) {
                  mat1(i,j)=j;
                  for (k=0; k<100; k++) {
                        mat1(i,j,k)=0;
                        }
                  }
            }
      return 0;
      }

In effect, this means that mat1 can also be a single dimensional vector, a two dimensional matrix and a three dimensional matrix simultaneously.
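
One way to picture this is as an ordered map keyed by the full subscript list, in which a key of any length may hold a value. The sketch below is a stand-in built on std::map and is not the toolkit's storage format; the names Subscripts, NodeMap and build_example are made up for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in model: each node is addressed by its list of subscripts.
using Subscripts = std::vector<std::string>;
using NodeMap = std::map<Subscripts, std::string>;

// Stores values at one-, two- and three-subscript nodes of the same array.
NodeMap build_example() {
    NodeMap mat1;
    mat1[Subscripts{"5"}] = "5";            // like mat1(i)=i
    mat1[Subscripts{"5", "7"}] = "7";       // like mat1(i,j)=j
    mat1[Subscripts{"5", "7", "9"}] = "0";  // like mat1(i,j,k)=0
    return mat1;
}
```

All three entries coexist in one map, which mirrors how a single global can hold data at several depths at once.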

Furthermore, not all elements of a matrix need exist. That is, the matrix can be sparse. For example:

      global mat1("mat1");

      int main() {
      mstring i,j,k;
      for (i=0; i<100; i=i+10) {
            for (j=0; j<100; j=j+10) {
                  for (k=0; k<100; k=k+10) {
                        mat1(i,j,k)=0;
                        }
                  }
            }
      return 0;
      }

In the above, only index values 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 are used to create each of the dimensions of the array and only those elements of the matrix are created. The omitted elements do not exist.

For example, if you are running a drug protocol on a number of patients and are dosing with medications M1, M2, M3, ... on patients P1, P2, P3, ... and collecting observations on days D1, D2, D3, ... you could create a three dimensional matrix named protocol in which each plane consisted of the observations for each patient on each medication for a given day:

You could refer to patient P1, medication M2 on day D4 with the reference:

protocol("P1","M2","D4")="X";

Alternatively, you can view the same data base as a tree structure with patient id at the root, followed by medication, followed by day of study:

Note that at each node in the tree, a data box may appear containing information about the node. Addressing a node is accomplished by giving its path description such as:

protocol("P2","M2","D2")

Compiling Programs

To compile programs written in C++ that use the MDH (Multi-Dimensional and Hierarchical) library, use the command:

      mumpsc myprog.cpp

This will invoke the g++ compiler and make available the necessary libraries. The result will be a program named myprog.cgi which is executable. The cgi extension is used as the default because very often these programs may be used in connection with web servers. You may rename the program as you see fit, however. The script mumpsc is part of the Mumps Compiler which must be installed prior to using the toolkit.

Accessing Global Arrays

Note: prior to exiting a program that accesses global arrays, you should execute a GlobalClose macro to shut down the global array facility. This flushes the system buffers to disk and ensures that the file system is properly closed. This appears in your program as:

GlobalClose;

There are several ways to insert and extract global array elements. They include:

An overloaded form of the assignment operator;
Functions applied to global class objects;
An overloaded shift operator

You can create/modify elements of the global array using either the assignment or the shift operator. The indices of the global array may be specified as variables of type mstring, string, character string constants or pointers to character strings. The values stored at a global array node may be character string constants, pointers to strings, mstrings, strings, integers, other global arrays and floating point values. Examples (where "indx1" and "indx2" may be of either type mstring or string):

      global array1("array1");
      global global_array("global_array");

      mstring mstring_var="test";
      char char_buffer[64]="test";
      char * char_pointer=char_buffer;   // points at writable storage
      long long_variable=99;
      string string_variable="test";
      double double_variable=99.0;
      float float_variable=99.0;
      int int_variable=99;
      short short_variable=99;

      global_array("10")=99;

      array1("100") =        "character string";
      array1("101") =        mstring_var;
      array1(indx1) =        char_pointer;
      array1(indx1,"3") =    long_variable;
      array1(indx2,indx1) =  string_variable;
      array1("10","2","3") = double_variable;
      array1("10","2","4") = int_variable;
      array1("10","2","5") = global_array("10");

      array1("100")        << "character string";
      array1("101")        << mstring_var;
      array1(indx1)        << char_pointer;
      array1(indx1,"3")    << long_variable;
      array1(indx2,indx1)  << string_variable;
      array1("10","2","3") << double_variable;
      array1("10","2","4") << int_variable;
      array1("10","2","5") << global_array("1");

      mstring_var        = array1(indx1,"3");
      char_pointer       = array1(indx1,"3");
      string_variable    = (string &) array1(indx2,indx1);
      float_variable     = array1("10","2","3");
      int_variable       = array1("10","2","4");
      long_variable      = array1("10","2","5");
      short_variable     = array1("10");
      global_array("10") = array1("10");

      array1("100")        >> mstring_var;
      array1("100")        >> char_pointer;
      array1(indx1)        >> float_variable;
      array1(indx1,"3")    >> double_variable;
      array1(indx2,indx1)  >> string_variable;
      array1("10","2","3") >> int_variable;
      array1("10","2","4") >> long_variable;
      array1("10","2","5") >> global_array("1");
      array1("10","2","5") >> char_pointer;

Global arrays are sparse so not all elements need to exist. In the examples above, the lowest value of the first index of the global array is "10" but this does not imply that elements "1" through "9" exist.

The shift operator may only be used as shown above. It may not be used in multiple chained format as is the case with cin and cout. Internally, all data is stored at nodes in character string form. If you shift or assign a global array to a target whose data type is incompatible with the contents of the global array, for example, shifting text data into an integer variable, an error will result:

      int i;
      array1("100")="this is a test";
      i=array1("100");    // error - string cannot be converted to int
      array1("100") >> i; // error - string cannot be converted to int
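
Because nodes hold character strings, a numeric read can only succeed when the stored text actually parses as a number. The following is a minimal sketch of a deliberately strict check in standard C++. It assumes nothing about how the toolkit itself performs or reports the conversion, and parse_long is a hypothetical name.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts text to long and reports failure instead of
// guessing. Returns true and sets out only if the entire string is a number.
bool parse_long(const std::string &text, long &out) {
    if (text.empty()) return false;
    errno = 0;
    char *end = nullptr;
    long value = std::strtol(text.c_str(), &end, 10);
    // Reject out-of-range values and any trailing non-numeric characters.
    if (errno == ERANGE || *end != '\0') return false;
    out = value;
    return true;
}
```

A program could read a node into a string first, run a check of this kind, and only then use the numeric value.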

Note: when assigning to a global from a pointer-to-character, the contents of the array pointed to by the pointer are copied to the global array whether you use the shift (<<) or assignment form (=). However, when you assign from a global to a pointer-to-character using the assignment operator, only the address of the character string is assigned to the pointer. The actual string is not copied and the pointer reference is valid only until the global array is referenced again. Instead, you should copy the contents of the character array to the target:

      char tmp[]="this is a test";
      array1("100")=tmp;        // works - char array is copied to global
      tmp=array1("100");        // error - attempt to alter value of pointer "tmp"
      strcpy(tmp,array1("100")); // works - global value copied to "tmp"

The above notes only apply to char arrays - not to string or mstring data:

      string tmp="this is a test";
      array1("100")=tmp;        // works - string is copied to global
      tmp=(string &)array1("100");        // works - global is copied to string
      strcpy(tmp,array1("100")); // error - string variables may not be used with strcpy()

Alternatively, if you use the shift operator form of assignment, character strings are copied to the address specified by the contents of the target pointer:

      char tmp[]="this is a test";
      array1("100") << tmp;     // works - char array is copied to global
      array1("100") >> tmp;     // works - value of global copied to char array

The reason for the above is a restriction in the C++ language on overloading the assignment operator: an overloaded assignment operator must be a member of the class of its left hand operand. To work around this for fundamental data types (int, float, etc.), we use an overloaded cast operator on the right hand side that converts the global to a basic data type prior to the ordinary, non-overloaded assignment. Thus, in the case of character strings, only the pointer is assigned. If you use the assignment operator with a pointer to character, be aware that the pointer is only valid until the next access to the same global. After another access, the pointer must be treated as undefined. For other data types, the assignment is as expected.

If a numeric value is stored in a global, it may be assigned to an appropriate numeric variable. The assignment or shift operator will convert the strings stored in the global to the appropriate numeric form. It is important, however, that the data stored in the array nodes conform to the numeric type requested. For example:

      global array1("array1");

      long x;
      double y;
      string z;

      array1("1","2","3") = "test string";
      array1("1","2","4") = "100";
      array1("1","2","5") = "100.123";

      x = array1("1","2","4");           // integer 100 assigned to x
      y = array1("1","2","4");           // 100 converted to double and assigned to y
      z = (string &)array1("1","2","4"); // character string "100" assigned to z
      x = array1("1","2","5");           // integer 100 assigned to x
      y = array1("1","2","5");           // 100.123 assigned to y
      x = array1("1","2","3");           // error - string cannot be converted to long

Alternatively, the following shift operator versions have the same effect:

      array1("1","2","3") >> z;  // character string copied to z
      array1("1","2","4") >> x;  // 100 stored in x
      array1("1","2","4") >> y;  // 100. stored in y

When global array references are passed to a function, no more than one instance of the same global object should be used in the argument list. Each global object maintains a private static string which contains the most recent value fetched from the data base. When a global object is passed to a function, this string value is effectively passed. This means that, in a function reference where two references to the same global object are passed, even though they have differing indices, the value passed will be the value for the second instance of the global. This restriction only applies where there are two or more instances of the same global.
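
The hazard can be reproduced in plain C++ with any accessor that returns a pointer into one shared buffer. The sketch below is an analogy only: Store, fetch and same are hypothetical names, and the toolkit's internals may differ.

```cpp
#include <cstring>
#include <map>
#include <string>

// Analogy only: one shared buffer holds the most recently fetched value.
struct Store {
    std::map<std::string, std::string> nodes;
    char last[256];  // overwritten by every fetch

    // Returns a pointer into the shared buffer, valid until the next fetch.
    const char *fetch(const std::string &key) {
        std::strncpy(last, nodes[key].c_str(), sizeof(last) - 1);
        last[sizeof(last) - 1] = '\0';
        return last;
    }
};

// Compares two C strings. When called as same(s.fetch("a"), s.fetch("b")),
// both arguments point at the same buffer, so the result is always true.
bool same(const char *a, const char *b) { return std::strcmp(a, b) == 0; }
```

Copying each fetched value into its own string before the next fetch, and passing the copies, avoids the problem.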

If you use a reference to a global without a parenthesized list following the name of the global, the reference will be to the most recently referenced global. Effectively, this is similar to the "naked indicator" from Mumps.

Global Array Indices

Internally, the indices of global arrays are always stored as character strings (null terminated array of char). If you initialize a global array with a loop, you must ensure that the indices are converted to an appropriate character string format before using them as global array indices. Indices to globals may be either char *, string or mstring but MUST all be of the same type (i.e. all string, all char * or all mstring). For example:

      mstring A,B,C;
      for (A=0; A<1000; A++)
            for (B=0; B<1000; B++)
                  for (C=0; C<1000; C++) {
                        array1(A,B,C) << "0";
                        }

The above initializes an array of 1 billion elements to zero.

Navigating Globals

There are several builtin functions used to navigate the globals. The two most important are the data functions and the order functions. The data functions tell you if a node exists and if it has descendants and the order functions give you the next higher (or lower) index at a given level in the global array tree.

The data functions return an integer which indicates whether the global array node is defined:

0 if the global array node is undefined;
1 if it is defined and has no descendants;
10 if no value is stored at the node but it has descendants;
11 if it is defined and has descendants.

A global array node is defined if data has been stored at it. A "10" is returned for a node at which nothing has been stored but the node has descendants. For example, assuming the global array has only the contents created in the example below:

      global array1("array1");

      int result;

      array1("1","11") << "foo";
      array1("1","11","21") << "bar";

      result = $data(array1("1"));             // yields 10
      result = $data(array1("1","11"));        // yields 11
      result = $data(array1("1","11","21"));   // yields 1

      result = array1("1").Data() ;            // yields 10
      result = array1("1","11").Data();        // yields 11
      result = array1("1","11","21").Data();   // yields 1

The $data() function corresponds to legacy usage while the Data() function is in the traditional C++ notation. Either format produces the same results.
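
The four return values listed above can be modelled with an ordered map keyed by subscript lists: a node is "defined" when its key holds a value and "has descendants" when some longer key begins with it. This is a stand-in sketch, not the toolkit's implementation; data_code, Subscripts and NodeMap are hypothetical names.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using Subscripts = std::vector<std::string>;
using NodeMap = std::map<Subscripts, std::string>;

// Returns 0, 1, 10 or 11 following the list of return values above.
int data_code(const NodeMap &m, const Subscripts &node) {
    int defined = m.count(node) ? 1 : 0;
    int descendants = 0;
    // Keys sort lexicographically, so any descendant of node sorts
    // immediately after it; one look past node is enough.
    auto it = m.upper_bound(node);
    if (it != m.end() && it->first.size() > node.size() &&
        std::equal(node.begin(), node.end(), it->first.begin())) {
        descendants = 10;
    }
    return defined + descendants;
}
```

With only the keys ("1","11") and ("1","11","21") present, the model gives 10, 11 and 1 for the three nodes queried in the example above.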

The other major navigation functions are the order functions. These give you, for a given global array index, the next ascending or descending value for the last index. There are several forms of the function. For example:

      mstring x;
      char y[16];

      array1("100") << "a";            // initialize the array with three entries
      array1("200") << "b";
      array1("300") << "c";

      x = "";                          // initialize the index with empty string
      strcpy(y,"");                    // char array form
      
      x = $order(array1(x),1);         // get the first value of the first index: 100
      cout << x << endl;               // writes 100

      strcpy(y,$order(array1(y),1));   // gets first index: 100
      cout << y << endl;               // writes 100

      x = $order(array1(x),1);         // get the second value of the first index: 200
      cout << x << endl;               // writes 200

      strcpy(y,$order(array1(y),1));   // get the second value of the first index: 200
      cout << y << endl;               // writes 200

      x = $order(array1(x),1);         // get the third value of the first index: 300
      cout << x << endl;               // writes 300

      strcpy(y,$order(array1(y),1));   // get the third value of the first index: 300
      cout << y << endl;               // writes 300

      x = $order(array1(x),1);         // get the next value of the first index: empty string
      if (x == "") 
            cout << "done" << endl;    // write "done"

      strcpy(y,$order(array1(y),1));   // get the next value of the first index: empty string
      if (strcmp(y,"")==0) 
            cout << "done" << endl;    // write "done"

Each call to $order() gives the next value of the last index. The second parameter specifies the direction: 1 means ascending key order and -1 means descending key order. To get the first index, the empty string is supplied and the function returns the first index of the global array. For subsequent calls, it returns the next ascending index value until there are no more indices; then it returns the empty string. Thus, if each of the 1's in the $order() calls above were replaced by -1, the sequence of values printed would be 300, 200, 100, empty rather than 100, 200, 300, empty.
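
The traversal rule can be modelled for a single index level with an ordered set of strings. This is a stand-in sketch in standard C++, not the toolkit's code: the function name order is hypothetical, and std::set compares strings byte by byte, which is assumed here and may not match the toolkit's collating order in every case.

```cpp
#include <set>
#include <string>

// Stand-in for one index level: an ordered set of subscript strings.
// direction is 1 for ascending, -1 for descending; "" marks both the
// starting point and the end of the sequence.
std::string order(const std::set<std::string> &keys,
                  const std::string &current, int direction) {
    if (keys.empty()) return "";
    if (direction >= 0) {
        // First key strictly greater than current, or the first key overall.
        auto it = current.empty() ? keys.begin() : keys.upper_bound(current);
        return it == keys.end() ? "" : *it;
    }
    // Last key strictly less than current, or the last key overall.
    auto it = current.empty() ? keys.end() : keys.lower_bound(current);
    if (it == keys.begin()) return "";
    return *--it;
}
```

Starting from the empty string and feeding each result back in walks the keys in the chosen direction until the empty string comes back.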

The $order() form of the function derives from legacy usage. Other forms of the Order() function in more traditional C++ notation that may be applied to an object of type global are:

Order(1) - returns a pointer to a null terminated character array of the next ascendant index relative to x.
Order(-1) - returns a pointer to a null terminated character array of the next descendant index relative to x.
Order_Next(char * x) - returns 0 if the empty string is found, 1 otherwise, and copies the ascendant index found to the character array pointed to by x.
Order_Prior(char * x) - returns 0 if the empty string is found, 1 otherwise, and copies the descendant index found to the character array pointed to by x.

All forms of the order functions set $test to true (1) if a non-empty index is returned and false (0) if an empty string is returned.

The following example builds a global array vector from an input file of keywords, one keyword per line, keeps a count of the number of times each keyword is used and, at the end, prints an alphabetized list of the keywords, each followed by the number of times it occurs:

    #include <mumpsc/libmpscpp.h>
    global key("key"); 

    int main() { 

    mstring w;
    char word[64];
    long i; 

    while (1) {
        cin.getline(word,63,'\n'); // read a word
        if (cin.eof()) break;      // exit if none
        if ($data(key(word)))      // is word in vector?
            key(word)++;           // yes, increment count
        else key(word) << 1;       // not in vector - add
        } 

    w = "";                              // empty string begins
    while ((w = $order(key(w),1)) != "") // next word
      cout << w << " " << key(w) << endl; // print word and count
    return 0;
    }

In the above, each line is read into the variable word until the end of file is reached. Each word is tested with the $data() function to determine whether it already exists in the key vector. $data() returns zero if the element does not exist and non-zero if it does. If the word is already in the key global array vector, the count stored for it is incremented with the ++ operator. If the word does not exist in the vector, it is added and its initial count is set to one.

When all the words have been read and stored into the vector, the program sequences through the word entries and prints the words and the total number of times each one was present in the input file. Since global arrays are stored in ascending key order, the display of words will be alphabetic. The function that sequences through the vector is the $order() function. When the function is passed a string containing a value, it returns the contents of the string with the next ascending index from the vector or the empty string if there are no indices in the vector greater than the string passed. If the empty string is passed to the function, the function replaces it with the first index in the vector.

Note that the char * variable word is used initially in the above because the cin.getline() function does not accept a string variable.

Similarly, given a global array of patient lab data organized hierarchically first by patient id, then by lab test, then by date, we can print a table of patient id's, labs, dates and results with the following:

      #include <mumpsc/libmpscpp.h>
      global labs("labs");
      int main() {
      mstring ptid,lab_test,date,rslt;

      // create dummy example data base

      labs("1000","hct","July 12, 2003")="45";
      labs("1000","hct","July 13, 2003")="46";
      labs("1000","hct","July 14, 2003")="47";
      labs("1000","hct","July 15, 2003")="48";
      labs("1000","hgb","July 12, 2003")="15";
      labs("1000","hgb","July 15, 2003")="14";
      labs("1001","hct","July 12, 2003")="35";
      labs("1001","hct","July 13, 2003")="36";
      labs("1001","hct","July 14, 2003")="37";
      labs("1001","hct","July 15, 2003")="38";
      labs("1001","hgb","July 13, 2003")="15";
      labs("1001","hgb","July 14, 2003")="15";
      labs("1002","hct","Sept 12, 2003")="35";
      labs("1002","hct","Sept 13, 2003")="36";
      labs("1002","hct","Sept 14, 2003")="37";
      labs("1002","hct","Sept 15, 2003")="38";
      labs("1002","hgb","Sept 13, 2003")="15";
      labs("1002","hgb","Sept 14, 2003")="15";

      ptid = "";
      while (1) {
          ptid = $order(labs(ptid),1);
          if (ptid == "") break;
          lab_test = "";
          while (1) {
              lab_test = $order(labs(ptid,lab_test),1);
              if (lab_test == "") break;
              date = "";
              while (1) {
                  date = $order(labs(ptid,lab_test,date),1);
                  if (date == "") break;
                  cout << ptid << " " << lab_test << " " << date ;
                  cout << " " << labs(ptid,lab_test,date) << endl;
                  }
              }
          }
      GlobalClose;
      return 0;
      }

The above begins with an empty string for the patient id ptid. This is used at the outer loop level to cycle through all the patient ids. At the first nested loop, the program cycles through all the lab test names (lab_test); then, at the innermost level, it cycles through all the dates (date). The resulting table is of the form:

      1000 hct July 12, 2003 45
      1000 hct July 13, 2003 46
      1000 hct July 14, 2003 47
      1000 hct July 15, 2003 48
      1000 hgb July 12, 2003 15
      1000 hgb July 15, 2003 14
      1001 hct July 12, 2003 35
      1001 hct July 13, 2003 36
      1001 hct July 14, 2003 37
      1001 hct July 15, 2003 38
      1001 hgb July 13, 2003 15
      1001 hgb July 14, 2003 15

Tabular Access to Globals

If the database from the previous example is modified slightly, it can be viewed purely as a table or a relation (for more detail on relational access, see below). To accomplish this, the data values are moved into the array reference as a final index and the empty string is stored for each node.

To perform tabular access to the database, we use the Select() primitive function which returns successive rows from a global array viewed as a tree. In the following example, we access and print the lab values for patient "1001":

#include <mumpsc/libmpscpp.h>
global labs("labs");

int main() {
mstring ptid,test,date,rslt;

// create dummy example data base

labs("1000","hct","July 12, 2003","45") = "";
labs("1000","hct","July 13, 2003","46") = "";
labs("1000","hct","July 14, 2003","47") = "";
labs("1000","hct","July 15, 2003","48") = "";
labs("1000","hgb","July 12, 2003","15") = "";
labs("1000","hgb","July 15, 2003","14") = "";
labs("1001","hct","July 12, 2003","35") = "";
labs("1001","hct","July 13, 2003","36") = "";
labs("1001","hct","July 14, 2003","37") = "";
labs("1001","hct","July 15, 2003","38") = "";
labs("1001","hgb","July 13, 2003","15") = "";
labs("1001","hgb","July 14, 2003","15") = "";
labs("1002","hct","Sept 12, 2003","35") = "";
labs("1002","hct","Sept 13, 2003","36") = "";
labs("1002","hct","Sept 14, 2003","37") = "";
labs("1002","hct","Sept 15, 2003","38") = "";
labs("1002","hgb","Sept 13, 2003","15") = "";
labs("1002","hgb","Sept 14, 2003","15") = "";

ptid="";
test="";
date="";
rslt="";

while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
      if (ptid != "1001") continue;
      cout << ptid << " " << test << " " << date << " " << rslt << endl;
      }
GlobalClose;
}

Rows of the database are presented in overall ascending key order. Those rows whose first column does not contain "1001" are rejected, while those containing the value are printed.

Using the database from above, a set of simple speedup techniques can be applied by starting the scan at the patient id and terminating the scan when the next patient id appears:

#include <mumpsc/libmpscpp.h>
global labs("labs");

int main() {
mstring ptid,test,date,rslt;

ptid="1001";
test="";
date="";
rslt="";

while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
      if (ptid != "1001") break;
      cout << ptid << " " << test << " " << date << " " << rslt << endl;
      }
GlobalClose;
}

In the above example, setting the initial value of ptid to "1001" causes the scan to begin at that point in the table. Any or all of the leading column values may be specified in this manner to target a specific starting point. For example, to print only the "hct" values of patient "1001" using the database from above:

#include <mumpsc/libmpscpp.h>
global labs("labs");

int main() {
mstring ptid,test,date,rslt;

ptid="1001";
test="hct";
date="";
rslt="";

while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
      if (ptid != "1001" || test != "hct" ) break;
      cout << ptid << " " << test << " " << date << " " << rslt << endl;
      }
GlobalClose;
}

Note: if one or more column values are supplied, they must be the initial column values, and there may be no intervening values specified as the empty string.
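The positioned scan described above can be illustrated without the toolkit. The sketch below models a four-column table as a std::set of rows in ascending order and uses lower_bound() to start at a given prefix. The names Row, Table and scan_prefix are hypothetical, and this is a model of the idea rather than of how Select() is implemented.

```cpp
#include <array>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a four-column table: each row is a tuple of
// keys and rows are held in ascending order, column by column.
using Row = std::array<std::string, 4>;
using Table = std::set<Row>;

// Collect the rows whose leading columns equal `ptid` and `test`.
// The scan starts at the first row not less than (ptid, test, "", "")
// and stops at the first row whose leading columns differ.
std::vector<Row> scan_prefix(const Table& t,
                             const std::string& ptid,
                             const std::string& test) {
    std::vector<Row> out;
    for (auto it = t.lower_bound(Row{ptid, test, "", ""}); it != t.end(); ++it) {
        if ((*it)[0] != ptid || (*it)[1] != test) break;  // left the target range
        out.push_back(*it);
    }
    return out;
}
```

Because rows are ordered by the first column, then the second, and so on, only a leading run of columns can position the scan in this way, which is the reason for the restriction in the note above.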

To copy the results to another global array using the database from above:

#include <mumpsc/libmpscpp.h>
global labs("labs");
global tmp("tmp");

int main() {
mstring ptid,test,date,rslt;

tmp().Kill();  // delete any prior values

ptid="1001";
test="hct";
date="";
rslt="";

while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
      if ( test != "hct" || rslt <= "40" ) continue; // keep only "hct" rows above "40"
      cout << ptid << " " << test << " " << date << " " << rslt << endl;
      tmp(ptid,test,date,rslt) = "";  // build new array
      }
GlobalClose;
}

In the above example, the array tmp() is built consisting only of "hct" tests whose values were above "40". The array being built may be constructed from all or some of the column values extracted from the source array, arranged in any order, and may contain column values from other sources. For example, to identify all the patients with diagnosis code "Y06" whose "hct" values are less than "40", using a global array named diagnosis whose columns are patient id, diagnostic code and date, and the labs() array from above:

#include <mumpsc/libmpscpp.h>
global labs("labs");
global diagnosis("diagnosis");
global tmp("tmp");

int main() {
mstring ptid,test,date,rslt,dx,dxDate;

tmp().Kill();  // delete any prior values

ptid="";  // start at the first row so that all patients are scanned
test="";
date="";
rslt="";

while ( labs(ptid,test,date,rslt).Select(ptid,test,date,rslt) != NULL ) {
      if ( test != "hct" || rslt >= "40" ) continue; // keep only "hct" rows below "40"
      if ( !$data(diagnosis(ptid,"Y06")) ) continue; // row does not exist
      cout << ptid << " " << test << " " << date << " " << rslt << endl;
      tmp(ptid) = "";  // build new array
      }
GlobalClose;
}

Relational Operations on Globals

Global arrays, if properly constructed, can be the subject of basic relational operations. For example, consider the following:

Global array names and column meanings:

patient(P,NAME,ADDRESS,SEX)
lab(L,TEST,NORMALS)
med(M,MED,QTY)
ptlab(P,L,RSLT,DATE)
ptmed(P,M,DATE)

Where:      P is patient id number
            NAME is patient name
            ADDRESS is patient home city
            SEX is patient gender
            L is test id number
            TEST is lab test name
            NORMALS is lab test normal values
            M is medication id number
            MED is medication name
            QTY is quantity in inventory
            RSLT is lab test result
            DATE is date of administration

The global arrays are defined in code as:

global patient("patient");
global lab("lab");
global med("med");
global ptlab("ptlab");
global ptmed("ptmed");

A possible set of values in the global array data base might be:

patient("001","Jones","Boston","Male") = "";
patient("002","Smith","New York","Female") = "";
patient("003","Blake","Washington","Male") = "";
patient("004","Doe","Hartford","Female") = "";
patient("005","Morley","New York","Male") = "";

lab("100","Hct","38-54%") = "";
lab("101","Hgb","14-18 Gm.") = "";
lab("102","Platelets","200-500k") = "";
lab("103","Acetone","0.3-2 mg/100ml") = "";
lab("104","Cholesterol","150-250 mg/100ml") = "";
lab("105","Creatinine","70-140 mcg/100ml") = "";
lab("106","Iron","75-175 mcg/100ml") = "";
lab("107","Uric Acid","3-6 mg/100ml") = "";

med("200","Protamine Sulfate","125") = "";
med("201","Quinidine Sulfate","150") = "";
med("202","Probenecid","90") = "";
med("203","Allopurinol","200") = "";
med("204","Colchicine","50") = "";
med("205","Hydrochlorothiazide","100") = "";

ptlab("001","107","8.5","1-Jul-84") = "";
ptlab("001","100","42","1-Jul-84") = "";
ptlab("002","103","250k","1-Aug-84") = "";
ptlab("003","107","80","1-Sep-84") = "";
ptlab("004","104","1.1","1-Oct-84") = "";
ptlab("005","107","9.0","1-Nov-84") = "";

ptmed("001","204","1-Jul-84") = "";
ptmed("001","205","1-Jul-84") = "";
ptmed("005","203","1-Nov-84") = "";
ptmed("005","206","1-Nov-84") = "";

Example queries answered by relational manipulations:

Query: "Find the names of those patients who have received colchicine and probenecid."

      // get medication codes for medication names

      global t1("t1");
      kill (t1());
      mstring mcode="",mname="",qty="";
      while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
            if (mname != "Colchicine" && mname != "Probenecid" ) continue;
            t1(mcode)=""; // create node by medication code
            }

      // get list of patients who are taking one or more of these meds 

      mstring ptid="",date="",code="";
      mcode="";

      // for each row of "ptmed"
      while ( ptmed(ptid,mcode,date).Select(ptid,mcode,date) != NULL) {

            code = "";
            // for each medication code sought
            while ( t1(code).Select(code) != NULL)

                  if (code == mcode ) {
                        // get/print the name and address of ptid
                        mstring name="",addr="",s="";
                        patient(ptid,name,addr,s).Select(ptid,name,addr,s);
                        cout << "PTID=" << ptid << endl;
                        cout << "Name=" << name << endl;
                        cout << "Address=" << addr << endl;
                        }
            }

Query: "Find the id numbers of those patients who are not receiving medication hydrochlorothiazide."

      /* get med code for med name */

      mstring mcode="",mname="",qty="";
      while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
            if (mname != "Hydrochlorothiazide") continue;
            break;
            }

      /* get list of patients who are taking this med */

      mstring ptid="",code="",date="";
      global t1("t1");
      kill (t1());
      while ( ptmed(ptid,code,date).Select(ptid,code,date) != NULL)
            if (code == mcode ) t1(ptid) = "";

      /* create list t2() of patients who are not in t1() */

      global t2("t2");
      kill (t2());
      mstring name="",addr="",s="";
      ptid="";
      while ( patient(ptid,name,addr,s).Select(ptid,name,addr,s) != NULL ) {
            if ($data(t1(ptid))) continue;
            t2(ptid)="";
            }

      /* get the names and address of patients in t2() */

      ptid="";
      while ( t2(ptid).Select(ptid) != NULL ) {
            name=""; addr=""; s="";
            patient(ptid,name,addr,s).Select(ptid,name,addr,s);
            cout << "PTID=" << ptid << endl;
            cout << "Name=" << name << endl;
            cout << "Address=" << addr << endl;
            }

Query: "Find the names of those patients whose uric acid is greater than 7 who are not receiving probenecid."

      /* get lab code number */

      mstring lcode="",test="",norm="";
      while (lab(lcode,test,norm).Select(lcode,test,norm) != NULL) 
            if (test == "Uric Acid" ) break;

      /* find ptid's of those whose result for test lcode is > 7 */

      global t1("t1");
      kill (t1());
      mstring ptid="",lc="",rslt="",date="";
      while (ptlab(ptid,lc,rslt,date).Select(ptid,lc,rslt,date) != NULL) 
            if (lc == lcode && rslt > 7) t1(ptid)="";

      /* get med code for "Probenecid" */

      mstring mcode="",mname="",qty="";
      while ( med(mcode,mname,qty).Select(mcode,mname,qty) != NULL) {
            if (mname != "Probenecid") continue;
            break;
            }
      ptid="";
      while (t1(ptid).Select(ptid) != NULL) {
            if (!$data(ptmed(ptid,mcode))) {  // patient is not receiving the medication
                  mstring name="",addr="",s="";
                  patient(ptid,name,addr,s).Select(ptid,name,addr,s);
                  cout << "PTID=" << ptid << endl;
                  cout << "Name=" << name << endl;
                  cout << "Address=" << addr << endl;
                  }
            }

As can be seen, these manipulations have considerable similarity from one query to the next. The basic manipulations, from a relational algebra point of view are:

Union: the set of all rows belonging to either of two global arrays, provided that the number and meanings of the columns are the same. Example assuming two global arrays t1() and t2(), each containing three columns, and leaving the result in t3():

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            t3(a,b,c)="";
      a = b = c = "";  // restart at the first row for the scan of t2()
      while (t2(a,b,c).Select(a,b,c) != NULL )
            t3(a,b,c)="";

Intersection: the set of rows belonging to both of two global arrays:

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            if ($data(t2(a,b,c))) t3(a,b,c)="";

Difference: the set of rows in one global array not in another:

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            if (!$data(t2(a,b,c))) t3(a,b,c)="";

Cartesian Product: the set of rows consisting of all rows from one global array concatenated to all rows of a second global array:

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d="",e="",f="";  // restart the scan of t2() for each row of t1()
            while (t2(d,e,f).Select(d,e,f) != NULL )
                  t3(a,b,c,d,e,f)="";
            }

Selection: the set of rows from a global array satisfying some predicate. The predicate can be any string expression involving the column values of the global array being inspected or other data known to the program:

      global t1("t1");
      global t2("t2");
      kill (t2());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            if (a == "aaa" && b < "bbb" ) t2(a,b,c)="";

Projection: selecting one or more columns from a global array. For example:

      global t1("t1");
      global t2("t2");
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            t2(a,c)="";

or, in combination with selection:

      global t1("t1");
      global t2("t2");
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL )
            if (a == "aaa" && b < "bbb" ) t2(a,c)="";

Join: a Cartesian product where the selection of rows to concatenate is based on a predicate involving one or more columns from the participating globals. Example: suppose you have two relations named t1() and t2(), each with three columns, and you want to create a third relation t3() consisting of columns 1, 2, and 3 from t1() and columns 2 and 3 from t2(). A row from the first relation is joined to a row from the second relation if the value in the third column of the first is equal to the value in the first column of the second. You would write a code segment of the form:

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d="",e="",f="";
            while (t2(d,e,f).Select(d,e,f) != NULL )
                  if ( c == d ) t3(a,b,c,e,f)="";
            }

In the example code, the rows of both relations are scanned and the values of the third column from the first relation (variable "c") are compared with the values of the first column (variable "d") of the second relation. If each relation contains 100 rows, the above would test 10,000 row combinations. This could be sped up considerably by rewriting the code as follows:

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d=c,e="",f="";
            while (t2(d,e,f).Select(d,e,f) != NULL ) {
                  if ( c != d ) break;
                  t3(a,b,c,e,f)="";
                  }
            }

Here, each scan of the second relation begins with the first row containing a value for the first column which is equal to the third column of the first relation. The scan of the second relation terminates when the value of the first column is no longer equal to the value of the third column from the first relation.
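The same idea can be tried out in standard C++. The sketch below models each relation as a std::set of three-column rows in ascending order and positions the inner scan with lower_bound(). The names Row3, Row5, Relation and join_equal are hypothetical and are not part of the toolkit.

```cpp
#include <array>
#include <set>
#include <string>
#include <vector>

using Row3 = std::array<std::string, 3>;
using Row5 = std::array<std::string, 5>;
using Relation = std::set<Row3>;  // rows kept in ascending key order

// Join rows of t1 and t2 where column 3 of t1 equals column 1 of t2.
// For each row of t1, the inner scan is positioned at the first row of
// t2 whose first column is not less than c, and ends once it differs.
std::vector<Row5> join_equal(const Relation& t1, const Relation& t2) {
    std::vector<Row5> out;
    for (const Row3& r1 : t1) {
        const std::string& c = r1[2];
        for (auto it = t2.lower_bound(Row3{c, "", ""}); it != t2.end(); ++it) {
            if ((*it)[0] != c) break;  // past the matching rows
            out.push_back(Row5{r1[0], r1[1], c, (*it)[1], (*it)[2]});
        }
    }
    return out;
}
```

With this arrangement the inner loop visits only the matching rows of the second relation plus at most one extra row, instead of every row.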

For comparisons other than equality:

      // join if col 3 of t1() < col 1 of t2()

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d=c,e="",f="";
            // begin scan of t2() at value of d equal to c
            while (t2(d,e,f).Select(d,e,f) != NULL ) {
                  // skip initial cases where c is still equal to d
                  if ( c == d ) continue; 
                  t3(a,b,c,e,f)="";
                  }
            }

In the above, the scan of the second relation begins at the first row where the first column is equal to the third column of the first relation. The continue will cause those rows where "c" and "d" are equal to be skipped. Since the rows are presented in ascending key order, after the rows where "c" and "d" are equal have been skipped, there will follow only rows where "c" is less than "d".

Similarly, for a greater-than relation:

      // join if col 3 of t1() > col 1 of t2()

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d="",e="",f="";
            while (t2(d,e,f).Select(d,e,f) != NULL ) {
                  // rows are joined so long as c is greater than d
                  if ( c <= d ) break; 
                  t3(a,b,c,e,f)="";
                  }
            }

The above terminates the inner loop when "c" is less than or equal to "d" . Prior to that point, where "c" is greater than "d", rows are joined.

For relations involving columns that are not the initial columns of the second relation, other speed-up techniques are possible.

      // join if col 3 of t1() > col 3 of t2()

      global t1("t1");
      global t2("t2");
      global t3("t3");
      kill (t3());
      mstring a="",b="",c="";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d="",e="",f="";
            while (t2(d,e,f).Select(d,e,f) != NULL ) {
                  // rows are not ordered by f, so every row must be tested
                  if ( c <= f ) continue; 
                  t3(a,b,c,e,f)="";
                  }
            }

The above will produce minimal savings, as many combinations of "d" and "e" may need to be tried in locating rows with values of "f" that meet the search criteria. In such cases, it may be more efficient to build a temporary copy of the second relation with the columns reordered so that the scan can proceed more quickly:

      // join if col 3 of t1() > col 3 of t2()

      global t1("t1");
      global t2("t2");
      global t3("t3");
      global t4("t4");
      kill (t3());
      kill (t4());

      mstring a="",b="",c="";
      while (t2(a,b,c).Select(a,b,c) != NULL )
            t4(c,a,b)=""; // reordered relation

      a = b = c = "";
      while (t1(a,b,c).Select(a,b,c) != NULL ) {
            mstring d="",e="",f="";
            while (t4(f,d,e).Select(f,d,e) != NULL ) {
                  // rows are joined so long as c is greater than f
                  if ( c <= f ) break; 
                  t3(a,b,c,e,f)="";
                  }
            }
      kill (t4());

In large joins which may result in many iterations of the inner loop, a single pass to build a temporary, reordered relation may be faster.

Builtin Relational Algebra Functions

There are several builtin relational functions, written in Mumps, that can be called from the C++ environment. To use these, you must include the following at the beginning of your C++ program:

#include <mumpsc/libmpsrdbms.h>

The functions available (implemented as macros) are:

SELECT(arr,out,exp) - copy rows from global "arr" to global "out" if "exp" is true.
PRINT(arr,exp) - print those rows of global "arr" for which "exp" is true.
UNION(arr1,arr2,out) - copy rows of globals "arr1" and "arr2" to global "out".
PROJECT(arr,out,cols) - copy only "cols" columns of rows of global "arr" to global "out".
SUBTRACT(arr,sub,out) - copy rows of global "arr" to global "out" which are not in "sub".
INTERSECT(arr,sub,out) - copy rows of global "arr" to global "out" if there is an identical row in global "sub".
JOIN(arr1,arr2,out,exp) - concatenate rows of global "arr1" and global "arr2" and copy to global "out" if "exp" is true.

For a full description, see the Mumps Compiler Programmers Guide section on Relational algebra for global arrays. The macros above correspond to the functions described in the manual except the macro names are all upper case. The actual functions, which have the same names except that only the first letter is in upper case and the remainder are lower case, have two additional initial parameters used internally by the Mumps service routines. The macros automatically substitute these added parameters.

The processing functions are written in Mumps and have been compiled to an object code library. When compiling a Mumps program for use with the class library, the first line of the Mumps program must be:

+#define CPP

This line causes the compiler to omit some lines of code that would conflict with the C++ runtime routines.

Locking the Data Base

There are several functions for locking portions of the data base. Following legacy convention, a lock does not prevent access to an element but merely flags the element as locked. Locking views a global array as a tree structure. If an element is locked, its descendants are locked. An attempt to lock a locked element, or an element that has a locked parent or a locked descendant, will fail. The primary locking functions are $lock(), Lock() and UnLock():

      if ($lock(gbl(a,b,c))) cout << "locked" << endl;
      if (gbl(a,b,c).Lock()) cout << "locked" << endl;
      gbl(a,b,c).UnLock();

The $lock() and Lock() functions test whether the node can be locked and lock it if possible. They return true (1) if successful and false (0) otherwise ($test is set accordingly). A node can be locked if it is not itself locked, if it has no descendants that are locked and if it is not the descendant of a locked node. The UnLock() function releases a lock on a node.

Additionally, there are functions to release all locks for the current process and all locks for all processes:

    CleanLocks();      // release all locks for this process only
    CleanAllLocks();  // release all locks for all processes
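The locking rule stated above (a node can be locked only if neither it, nor any ancestor, nor any descendant is already locked) can be sketched in standard C++ as follows. This is an in-memory model under stated assumptions: Path, is_prefix and LockTable are hypothetical names, and the per-process ownership of locks in the toolkit is not modeled.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

using Path = std::vector<std::string>;  // subscripts of a node, root first

// True when `a` is a prefix of `b` (a equals b, or a is an ancestor of b).
bool is_prefix(const Path& a, const Path& b) {
    return a.size() <= b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Hypothetical in-memory lock table following the rules stated above.
class LockTable {
    std::set<Path> locked_;
public:
    // Lock `p` if it is not locked, has no locked ancestor and has no
    // locked descendant. Returns true on success, false otherwise.
    bool lock(const Path& p) {
        for (const Path& q : locked_)
            if (is_prefix(q, p) || is_prefix(p, q)) return false;
        locked_.insert(p);
        return true;
    }
    void unlock(const Path& p) { locked_.erase(p); }
    void clear() { locked_.clear(); }  // release every lock in this table
};
```

In this model, once the node ("a","b") is locked, attempts to lock ("a","b","c") or ("a") fail, while a sibling such as ("a","x") can still be locked.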

Pattern Matching

There are several other basic support functions available. One of these is the $piece() function. This function takes either three or four arguments. The first is a source string (pointer to character), the second is a pattern string (pointer to character), and the third and fourth are integers. The fourth may be omitted. The function returns the "piece" of the source string delimited by the pattern. If the fourth argument is not present and the third argument is i, the "piece" returned lies between the (i-1)th and i'th instances of the pattern string. For example:

      $piece("abc.def.ghi",".",1) yields abc
      $piece("abc.def.ghi",".",2) yields def
      $piece("abc.def.ghi",".",3) yields ghi

If the fourth argument is present, the piece returned lies between the instances of the pattern given by the third and fourth arguments.
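As an illustration of piece extraction, here is a minimal sketch in standard C++. It is not the toolkit's $piece(): the function name piece is hypothetical, and the four-argument form assumes the standard Mumps convention, in which the result spans pieces numbered from the third argument through the fourth, including the delimiters between them.

```cpp
#include <string>

// Sketch of piece(src, delim, from, to): pieces are numbered from 1 and
// the result spans pieces `from`..`to`, including the delimiters that
// lie between them (standard Mumps convention, assumed here).
std::string piece(const std::string& src, const std::string& delim,
                  int from, int to) {
    if (delim.empty() || from > to) return "";
    if (from < 1) from = 1;
    std::size_t start = 0;  // offset where piece `from` begins
    for (int i = 1; i < from; ++i) {
        std::size_t pos = src.find(delim, start);
        if (pos == std::string::npos) return "";  // fewer than `from` pieces
        start = pos + delim.size();
    }
    std::size_t end = start;  // becomes the offset just past piece `to`
    for (int i = from; i <= to; ++i) {
        std::size_t pos = src.find(delim, end);
        if (pos == std::string::npos) return src.substr(start);  // ran out of pieces
        end = (i == to) ? pos : pos + delim.size();
    }
    return src.substr(start, end - start);
}

// Three-argument form: a single piece.
std::string piece(const std::string& src, const std::string& delim, int n) {
    return piece(src, delim, n, n);
}
```

Under these assumptions, piece("abc.def.ghi", ".", 2) gives def, matching the example above, and piece("abc.def.ghi", ".", 1, 2) gives abc.def.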

Taking the above into account, it is possible to build a larger example using the GenBank gbkey.idx file. The format of this file is a line of keyword text followed by one or more lines of reference to locations where the keyword text applies. Each reference line begins with a TAB character followed by one or more locus codes, followed by a TAB character followed by a division code followed by an accession id. A typical entry is:

      1,4-alpha-D-glucan glucanohydrolase
            ECOFTAA   BCT   L01642
            STYFTAA   BCT   L01643
            RICAAMYA  PLN   M24286
            RICAAMYB  PLN   M24287
            BMAMY     BCT   X07261
            AHAAMYG   BCT   X58627
            ECMALS    BCT   X58994

The following program will construct a matrix giving for each accession (row) the keyword phrases that apply to the accession:

#include <mumpsc/libmpscpp.h>
 
global mat("mat");
 
int main() {
 
char line[1024];
mstring key;
mstring locus;
mstring div;
mstring accession;
long key_count=0,acc_count=0;
mstring s1="1";
 
while (1) {
 
      cin.getline(line,1023, '\n');
      if (cin.eof()) break;
      if (line[0]!='\t') {
            key = line;
            continue;
            }
      locus = $piece(line,"\t",2);
      div = $piece(line,"\t",3);
      accession = $piece(line,"\t",4);
      mat(accession,key) = "";
      }
 
accession = "";
 
while (1) {
      accession = $order(mat(accession),1);
      if (accession == "") break;
      acc_count++;
      key = "";
      while (1) {
            key = $order(mat(accession,key),1);
            if ( key == "") break;
            key_count++;
            }
      }
 
cout << "average number of keys per accession: " << (float) key_count / acc_count << endl;
GlobalClose;
return 0;
}

Additionally, the Perl Compatible Regular Expression Library is available through the $perl() macro (see Appendices C and D). For example:

#include <mumpsc/libmpscpp.h>
#include <stdlib.h>

int main () {
     char line[1024]="acgtcgctcggctgcgctcgagctcgagagactgcgctgctcgaagagctagag";
     cout << $perl(line,"gctgcg[acgt]tcgagctcga") << endl;
     GlobalClose;
     return 1;
}

The above prints 1.
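For readers without the toolkit at hand, the same match can be checked with the C++ standard library. This sketch uses std::regex (ECMAScript grammar) rather than the Perl Compatible Regular Expression Library, and the function name matches is hypothetical; it only mirrors the 1/0 result described above.

```cpp
#include <regex>
#include <string>

// Return 1 if `pattern` matches anywhere in `text`, 0 otherwise.
// Uses std::regex (ECMAScript grammar), not the PCRE library.
int matches(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern)) ? 1 : 0;
}
```

Called with the line and pattern from the example above, matches() returns 1, since the character class [acgt] accepts the single base between the two fixed fragments.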

Invoking the Mumps Interpreter

The full facilities of the Mumps interpreter can be invoked from C++ programs. The interpreter reads, parses and executes commands presented to it at run time. It may also read and execute text files containing Mumps programs. The interpreter is invoked by means of the Xecute() macro and xecute() functions:

int Xecute("command")
int xecute(mstring command)
int xecute(string command)
int xecute(char * command)

These functions and the macro invoke the Mumps interpreter and execute the text given as "command". They return 1 if successful, 0 otherwise. With Xecute(), if the Mumps command contains quotes or other special symbols, they will be automatically prefixed with backslashes (e.g., a quote becomes \").

Xecute("set i="test"");
Xecute("for  s i=$order(^a(i)) quit:i=""  set sum=sum+^a(i)");

Details on the Mumps Language are contained in the file compiler.html in the mumpsc/doc subdirectory of the Mumps Compiler distribution.

Programming Examples

Hashing Example

The following example stores lines of text into a global array based on a hash function calculation of each line. It reads lines of text from stdin and submits each line to a simple hash function that produces an unsigned long, which is converted to a character string (char *) and returned. The resulting character string is copied to the string variable x. The input line is stored at hash_table(x,ii) where ii is a string value between 0 and 999. The value of ii is determined by locating the first ascending integer not already in use. If a given hash result produces more than 1000 collisions, the process terminates with an error message.

#include <mumpsc/libmpscpp.h>
global hash_table("hash"); // global array 
int main() {
      char in[1024];
      string x;
      long i;
      
while (fgets(in,1024,stdin)!=NULL) {
      x = hash(in);  // hash input line
      for (i=0; i<1000; i++) {
            string ii=cvt(i);
            if ($data(hash_table(x,ii))==0) {  // find a slot
                  hash_table(x,ii)=in;  // add line to database
                  cout << x << "," << ii  << " " << in << endl;
                  break;
                  }
            }
      if (i>=1000) {  // loop ended without finding a free slot
            cout << "Too many collisions " << x << endl;      
            GlobalClose;
            return 1;
            }
      }
      GlobalClose;
      return 0;
      }
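The program above relies on the helper functions hash() and cvt(), which are not shown. The following standard C++ sketch illustrates the same slot-finding logic with a std::map in place of the global array. Everything here is an assumption made for illustration: hash_line is a placeholder hash, not the one used by the guide, and HashTable and store_line are hypothetical names.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: a simple string hash returned as decimal text.
// The guide does not show its hash() function; this is a placeholder.
std::string hash_line(const std::string& line) {
    unsigned long h = 0;
    for (unsigned char ch : line) h = h * 31 + ch;
    return std::to_string(h % 1000003UL);
}

// Stand-in for the two-level global array hash_table(x, ii).
using HashTable = std::map<std::pair<std::string, std::string>, std::string>;

// Store `line` in the first free slot 0..999 under its hash value.
// Returns the slot number used, or -1 after 1000 collisions.
int store_line(HashTable& table, const std::string& line) {
    const std::string x = hash_line(line);
    for (int i = 0; i < 1000; ++i) {
        auto key = std::make_pair(x, std::to_string(i));
        if (table.find(key) == table.end()) {  // slot not yet in use
            table[key] = line;
            return i;
        }
    }
    return -1;
}
```

Storing the same line three times places it in slots 0, 1 and 2 under the same hash value, which is the collision behavior the example describes.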

Linking to Compiled Mumps Functions

You may compile functions in Mumps and call them from C++ programs. If you do, you must begin each file of functions with:

#define CPP

which disables some code that would otherwise conflict with the class libraries. If you do not use the class libraries, you may omit this line.

See the Mumps Compiler Programmers Guide for details.

Writing Active Web Server Pages

C++ programs can be written with the toolkit to be web server active pages. For example:

Web page HTML code:

A C++ program can accept data from the web page, store the data in global arrays and return a summary web page to the browser. When using "get" mode data transmission from HTML forms, the form names and data are concatenated into a string, delimited by ampersands, containing "name=value" tokens. These are passed in an environment variable named QUERY_STRING. The include file mumpsc/cgi.h contains code to extract data from QUERY_STRING and store the data in the runtime symbol table. The function $SymGet() can be used to retrieve values from the runtime symbol table.

Note: you can test code by simulating input from a web browser with the following code:

#!/bin/bash
QUERY_STRING="abc=xyz&cde=123"
export QUERY_STRING
your_program.cgi

The "name=value" sets (delimited by ampersands) will be passed to the program. Note: the web server CGI protocol requires the value strings to be encoded (see EncodeHTML()).
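To make the "name=value" format concrete, the following standard C++ sketch splits such a string into a map. It is not the code in mumpsc/cgi.h: the function name parse_query is hypothetical, and decoding of encoded values is deliberately left out.

```cpp
#include <map>
#include <string>

// Split a "name=value&name=value" query string into a map.
// Decoding of encoded values is deliberately omitted in this sketch.
std::map<std::string, std::string> parse_query(const std::string& qs) {
    std::map<std::string, std::string> out;
    std::size_t start = 0;
    while (start <= qs.size()) {
        std::size_t amp = qs.find('&', start);
        if (amp == std::string::npos) amp = qs.size();
        const std::string token = qs.substr(start, amp - start);
        const std::size_t eq = token.find('=');
        if (!token.empty()) {
            if (eq == std::string::npos) out[token] = "";  // name with no value
            else out[token.substr(0, eq)] = token.substr(eq + 1);
        }
        start = amp + 1;
    }
    return out;
}
```

In a CGI program the argument would normally come from std::getenv("QUERY_STRING"), which returns a null pointer when the variable is not set, so the result should be checked before it is converted to a std::string.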

Hash class

The hash class permits quick direct access by means of a hash table. Objects of the hash class are created by:

hash hashname(filename, filesize, filedisp);

where:

hashname is the name of the object;
filename is the name of the external file for the object, to which ".key" is appended;
filesize is the size in bytes of the object (at least 1,000, default 100,000);
filedisp is the disposition: "new" or "old".

If the disposition is "new", a new hash will be created and any previous hash discarded. If the disposition is "old", a previously existing disk based hash object will be used. The default is "new". Example:

hash x("x",10000,"new");

will create an object named "x" which will reside in a file named "x.key" that will be 10,000 bytes long and will be newly created.

You may assign values to a hash object by providing the value and the hash key:

x("key one")="test";

where "key one" is a string key and "test" is the value to be stored.

Values may be retrieved into strings by:

string s;
s=x("key one");

You may replace the value stored at a hash key with one of equal or shorter length with no penalty. If the replacement value is longer, the original space will be marked as unavailable, new space will be allocated and the old space will not be reused. In the event of a collision (i.e., two keys produce the same hash code), the functions search forwards in the file for available space. The value for a key is stored immediately after the key in the file.
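The forward search on collision can be pictured with a small in-memory table. The sketch below is a generic linear-probing table in standard C++, not the toolkit's file-based hash class: ProbeTable is a hypothetical name, the wrap-around at the end of the table is an assumption, and the handling of longer replacement values described above is not modeled.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical fixed-size table that resolves collisions by searching
// forward (with wrap-around) for the key or for a free slot.
class ProbeTable {
    struct Slot { bool used = false; std::string key, value; };
    std::vector<Slot> slots_;
public:
    explicit ProbeTable(std::size_t n) : slots_(n) {}

    // Insert or replace; returns false when the table is full.
    bool put(const std::string& key, const std::string& value) {
        if (slots_.empty()) return false;
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) % slots_.size()) {
            if (!slots_[i].used || slots_[i].key == key) {
                slots_[i].used = true;
                slots_[i].key = key;
                slots_[i].value = value;
                return true;
            }
        }
        return false;
    }

    // Returns the stored value, or the empty string when the key is absent.
    std::string get(const std::string& key) const {
        if (slots_.empty()) return "";
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) % slots_.size()) {
            if (!slots_[i].used) return "";
            if (slots_[i].key == key) return slots_[i].value;
        }
        return "";
    }
};
```

As with the hash class, the key is kept alongside its value so that a lookup can tell a genuine match from a different key that happened to hash to the same position.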

Class mstring

The mstring class provides Mumps-like strings that can be used to write programs in C++ that treat variables in a manner similar to that of Mumps. This means that mstring objects are essentially strings on which arithmetic operations may be performed. For example:

Note: the code "(a || b)" in the cout expression is parenthesized. If not parenthesized, the C++ compiler precedence will result in an error.

Objects of class mstring may:

Contain character strings, integers or floating point values;
Be assigned to from char *, string, mstring, float, int, or double. Objects of mstring may be initialized with character string constants in declaration statements.
Participate in add (+, +=), subtract (-, -=), multiply (*, *=), divide (/, /=), modulo (%, %=) (integer values only), pre/post increment/decrement (++/--), and concatenation (||) operations. The mode of the operation will depend on the mode of the other operand. Available modes are ASCII string, integer and floating point.
Participate in relational expressions >, >=, <, <=. The mode of comparison will depend on the mode of the other operand. Available modes are ASCII string, integer and floating point.
Participate in equality expressions == and !=. The mode of the comparison will depend on the mode of the other operand. Available modes are ASCII string, integer and floating point.
Participate in input and output stream operations >> and <<.
Participate in assignment to objects of mstring and string.
Be declared as arrays or allocated/freed by the new/delete operators. Only numeric subscripts permitted at this time.

Access functions defined on mstring are:

c_str() - returns the address of a character string containing the value in the mstring.
s_str() - returns a string containing the value in the mstring.

Objects of mstring may be passed to all Mumps $ functions.

Btree Access

Programmers may access the btree directly through the builtin BTREE macro. A number of examples can be found in mumpsc/doc/examples/btree in the distribution.

To access the btree directly from a C++ program:

You must first install the Mumps compiler and MDH. Include the toolkit header (libmpscpp.h, which in turn includes btree.h) at the beginning of your program. You can then access the btree directly with the BTREE macro (see description below). Note: any keys you store in the btree co-exist with Mumps/MDH keys. In rare cases, these can interfere with one another if a key you store lies in the range of a global array key set.

For example, the following program stores NBR_ITERATIONS keys and data items into the btree and then retrieves them (NBR_ITERATIONS is defined in btree.h, which is included by libmpscpp.h, usually with the value 100,000). This is "btest1.cpp" from mumpsc/doc/examples/btree.cpp. See the other examples and the documentation below for further details.

Assignment operations on global arrays

Assignments to and from global arrays may be accomplished either with overloaded versions of the shift operators (<< and >>) or with the assignment operator (=). Originally, only the shift forms were permitted since restrictions in the C++ language made it difficult to construct assignments from globals to ordinary data types. This was bypassed by using an overloaded implied cast operator, which permits most forms of assignment.

When you access a global array, the access may result in the thrown error exceptions GlobalNotFoundException and/or ConversionException. The first can occur in any context that attempts to retrieve data from a global array where none exists. The second occurs if you attempt to convert the contents of a global to a numeric type where the contents of the global are not valid data for the conversion.

If uncaught, both exceptions will result in program termination. Both exceptions may be caught, however, with code such as the following:

The following discussion is divided into two parts: assignment to global arrays and assignment from global arrays:

Assignment TO global arrays

Assignments using the overloaded assignment operator are permitted using the following assignment operator overloads:

global & global::operator = (char *)
global & global::operator = (int)
global & global::operator = (string)
global & global::operator = (mstring)
global & global::operator = (double)
global & global::operator = (global)
global & global::operator = (unsigned int)
global & global::operator = (float)
global & global::operator = (short)
global & global::operator = (unsigned short)
global & global::operator = (long)
global & global::operator = (unsigned long)

Assignment to a global array using the "=" operator is enabled for right hand side variables of types character array (char *), mstring, string, integer, double, etc (see above). For example:

gbl(a,b,c) = "test string";
gbl(a,b,c) = 123;
gbl(a,b,c) = 123.45;

Assignment to global arrays can also be accomplished by using the overloaded shift operator:

global & global::operator << (char *) 
global & global::operator << (int) 
global & global::operator << (unsigned int)
global & global::operator << (short)
global & global::operator << (unsigned short)
global & global::operator << (long)
global & global::operator << (unsigned long)
global & global::operator << (float)
global & global::operator << (double)
global & global::operator << (string)
global & global::operator << (mstring)
global & global::operator << (global)

Examples:

Assignment from global arrays to other data types:

Assignment from global arrays by overloaded shift operator:

char *         global::operator >> (char *)
int            global::operator >> (int &)
unsigned int   global::operator >> (unsigned int &)
long           global::operator >> (long &)
unsigned long  global::operator >> (unsigned long &)
short          global::operator >> (short &)
unsigned short global::operator >> (unsigned short &)
float          global::operator >> (float &)
double         global::operator >> (double &)
string         global::operator >> (string &)
mstring        global::operator >> (mstring &)
global &       global::operator >> (global)

Alternatively, the overloaded cast operator can be utilized in combination with the assignment operator ("="):

global::operator char*()
global::operator string &()
global::operator int()
global::operator unsigned int()
global::operator short()
global::operator unsigned short()
global::operator long()
global::operator unsigned long()
global::operator float()
global::operator double()

Each of the above converts the value stored at a global variable to a builtin data type, mstring or string.

Note: the C++ language specification does not permit a fundamental data type (e.g. int, double, char) to be placed on the left hand side of an overloaded assignment ("=") operator. In order to get around this, we use two techniques: (1) the overloaded right-shift operator; and (2) the overloaded cast operator.

Assignment by overloaded right-shift operator copies and, if necessary, converts the global array string value to the target. It works in all cases.

The overloaded cast operator permits fundamental data types to be placed on the left hand side of the assignment operator ("=") and a global array reference to be placed on the right hand side. When the C++ compiler detects a fundamental data type on the left hand side of the assignment operator and a global on the right hand side, it invokes a default cast to convert the right hand side.

In all cases except the one in which the left hand side is char *, the overloaded cast will make the necessary conversion and the assignment will take place as expected. In the case of a char * left hand side, only a pointer to a char * is copied from the global to the left hand side char *. The pointer copied is the address of a public char * string in the global class that contains the value of the global. This pointer is only valid until the next reference to the same global object.

Assignments from global to string or mstring using the assignment operator ("=") copy the value from the global to the object of type string or mstring. Assignments from global to string require a cast of the global consisting of (string &).

The case of assignment from global to mstring is handled by an overload in the class mstring. The case of assignment from global to string is handled by class string which copies the contents of the character string pointed to by the global.

Examples using an arbitrary global array reference with three string indices named gbl(a,b,c) at which is stored the string "12345":

Right-shift based assignments (overloaded ">>" operator):

Cast operator based assignments:

The character string stored at gbl(a,b,c) is converted and copied to the target variable in all cases except those where the left hand side of the assignment operator is char *. This case does not copy character strings but, instead, copies the address of a string containing the result to the target (see first example).

This also means that the left hand side of an assignment operator may not be the name of an array of type char, since this would imply altering the address of the array. The only permitted char left hand side is a variable pointer to char.

The value copied to the pointer will be a public address of an array of char in the class containing the value of the global reference. This reference is valid only until another reference to the same global object. This usage is not preferred. Instead, use the shift form or strcpy():

char out[STR_MAX];
strcpy(out, gbl(a,b,c));

The overloaded cast assignment form may be used within larger programming structures such as:

The above will print numbers 10 through 1.

double global::Avg()

Returns the average of the values of data bearing nodes beneath the given global array reference. Example:

The above prints 5.5, the average value of the numeric data bearing nodes beneath A("100"). If there are non-numeric data elements, they are treated as zero values and still contribute to the result.
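The averaging rule described above can be modeled without the toolkit. The following is a minimal sketch under stated assumptions: the node values are supplied as a plain vector of strings rather than read from a global array, average_of_nodes is a hypothetical name, and an empty input is assumed to average to zero. It is not the toolkit's implementation.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Average of node values held as text. A value with no numeric prefix
// counts as zero but is still included in the node count.
double average_of_nodes(const std::vector<std::string>& values) {
    if (values.empty()) return 0.0;          // assumption: nothing to average
    double sum = 0.0;
    for (const std::string& v : values) {
        char* end = nullptr;
        double x = std::strtod(v.c_str(), &end);
        if (end == v.c_str()) x = 0.0;       // no digits parsed: treat as zero
        sum += x;
    }
    return sum / static_cast<double>(values.size());
}
```

Calling average_of_nodes with the strings "1" through "10" gives 5.5, and mixing in a non-numeric value lowers the result because the divisor still counts that node.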

The global array object must be specified with indices (i.e., a parenthesized list must follow the name of the global array object). An empty list means the entire array.

int mstring::begins(mstring pattern);

Returns an integer which is the starting point in the string of pattern or -1 if the pattern is not found. Throws: PatternException if the pattern is in error.

Boyer-Moore-Gosper Functions

extern "C" int bmg_fullsearch(char * search_string, char * buffer_base);
int bmg_fullsearch(string search_string, string buffer_base);
int bmg_fullsearch(mstring search_string, mstring buffer_base);
int bmg_fullsearch(mstring search_string, global buffer_base);

Returns the number of non-overlapping instances of "search_string" in "buffer_base". This function is covered by the LGPL license. The string, global and mstring versions are less efficient than the char * version since they copy the contents of the string, global or mstring variable to a character array.
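To make "non-overlapping instances" concrete, here is a minimal sketch of the counting rule. It uses std::string::find rather than the Boyer-Moore-Gosper algorithm, so it illustrates only the result, not the search method; count_nonoverlapping is a hypothetical name, and returning zero for an empty pattern is an assumption.

```cpp
#include <cstddef>
#include <string>

// Count non-overlapping occurrences of pattern in text.
// After each match the scan resumes just past the matched characters.
int count_nonoverlapping(const std::string& pattern, const std::string& text) {
    if (pattern.empty()) return 0;           // assumption: empty pattern matches nothing
    int count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(pattern, pos)) != std::string::npos) {
        ++count;
        pos += pattern.size();               // skip the whole match
    }
    return count;
}
```

With this rule the pattern "aa" occurs twice in "aaaa", not three times, because matches may not share characters.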

Note: if you use the char * version, the "buffer_base" may NOT be a character constant or pointer to character constant. The search routines modify this string and using a character constant will generate a segmentation fault on most machines.

Examples:

All use the following functions which are not covered by the GPL/LGPL:

extern "C" void bmg_setup(char * search_string, int case_fold_flag);
extern "C" int bmg_search(char * buffer_base, int buffer_length, int (*action_func)());

These functions are publicly available from:

ftp://ftp.uu.net/usenet/comp.sources.unix/volume5/bmgsubs.Z

and are believed to be contributed source that is unrestricted with respect to use and redistribution. Most, if not all, of the code is believed to have been written by employee(s) of the United States and thus to be in the public domain. The distribution contains, in part, the following notes:

Here are routines to perform fast string searches using the Boyer-Moore-Gosper algorithm; they can be used in any Unix program (and should be portable to non-Unix systems). You can search either a file or a buffer in memory. The code is mostly due to James A. Woods (jaw@ames-aurora.arpa) although I have modified it heavily, so all bugs are my fault. The original code is from his sped-up version of egrep, recently posted on mod.sources and available via anonymous FTP from ames-aurora.arpa as pub/egrep.one and pub/egrep.two. That code handles regular expressions; mine does not. These have only been tested on 4.2BSD Vax systems. -Jeff Mogul mogul@navajo.stanford.edu decwrl!glacier!navajo!mogul

BMGSUBS(3L)                                                    BMGSUBS(3L)

NAME

(bmgsubs) bmg_setup, bmg_search, bmg_fsearch - Boyer-Moore-Gosper string search routines

SYNOPSIS

bmg_setup(search_string, case_fold_flag)
char *search_string;
int case_fold_flag;

bmg_fsearch(file_des, action_func)
int file_des;
int (*action_func)();

bmg_search(buffer_base, buffer_length, action_func)
char *buffer_base;
int buffer_length;
int (*action_func)();

DESCRIPTION

These routines perform fast searches for strings, using the Boyer-Moore-Gosper algorithm. No meta-characters (such as `*' or `.') are interpreted, and the search string cannot contain newlines.

Bmg_setup must be called as the first step in performing a search. The search_string parameter is the string to be searched for. Case_fold_flag should be false (zero) if characters should match exactly, and true (non-zero) if case should be ignored when checking for matches.

Once a search string has been specified using bmg_setup, one or more searches for that string may be performed.

Bmg_fsearch searches a file, open for reading on file descriptor file_des (this is not a stdio file.) For each line that contains the search string, bmg_fsearch will call the action_func function specified by the caller as action_func(matching_line, byte_offset). The matching_line parameter is a (char *) pointer to a temporary copy of the line; byte_offset is the offset from the beginning of the file to the first occurence of the search string in that line. Action_func should return true (non-zero) if the search should continue, or false (zero) if the search should terminate at this point.

Bmg_search is like bmg_fsearch, except that instead of searching a file, it searches the buffer pointed to by buffer_base; buffer_length specifies the number of bytes in the buffer. The byte_offset parameter to action_func gives the offset from the beginning of the buffer.

If the user merely wants the matching lines printed on the standard output, the action_func parameter to bmg_fsearch or bmg_search can be NULL.

AUTHOR

Jeffrey Mogul (Stanford University), based on code written by James A. Woods (NASA Ames)

BUGS

Might be nice to have a version of this that handles regular expressions.

There are large, but finite, limits on the length of both pattern strings and text lines. When these limits are exceeded, all bets are off.

The string pointer passed to action_func points to a temporary copy of the matching line, and must be copied elsewhere before action_func returns.

Bmg_search does not permanently modify the buffer in any way, but during its execution (and therefore when action_func is called), the last byte of the buffer may be temporarily changed.

The Boyer-Moore algorithm cannot find lines that do not contain a given pattern (like "grep -v") or count lines ("grep -n"). Although it is fast even for short search strings, it gets faster as the search string length increases.

16 May 1986                                                    BMGSUBS(3L)

int BTREE(int code, unsigned char * key, unsigned char * data)

BTREE() is a macro permitting direct access to the underlying btree system. The first argument, "code", is an integer indicating the operation to be performed (see below). The second argument is the key to be stored, consisting of a null-terminated array of printable ASCII characters. The length of the key should be no greater than one quarter of the btree block size, whose default value is 8192 (i.e., the maximum key length is about 2048 bytes in the default case). The third argument is the data to be stored with the key. It is a null-terminated string of printable ASCII characters no longer than the system defined limit STR_MAX (defaults to 4096). An empty string is interpreted as no data to be stored. Note that the second and third arguments must be unsigned char *. The macro returns an integer indicating success. It may also alter "key" or "data" to return values or for other purposes. The contents of "key" and "data" are not preserved across an invocation of BTREE(). Examples of using BTREE() are given in mumpsc/doc/examples/btree.

Permitted btree operations:

STORE - store a key and data value in the btree; returns zero if successful, non-zero otherwise:

      unsigned char key[]="test key";
      unsigned char data[]="test data";
      if ( BTREE(STORE,key,data) == 0 ) cout << "stored" << endl;
      else cout << "not stored" << endl;

RETRIEVE - retrieve data stored with a key; returns zero if successful, non-zero otherwise:

      unsigned char key[]="test key";
      unsigned char data[STR_MAX];
      if ( BTREE(RETRIEVE,key,data) == 0 ) cout << "retrieved: " << data << endl;
      else cout << "not retrieved." << endl;

CLOSE - close the btree data base; returns zero:

      unsigned char key[]="";
      unsigned char data[]="";
      BTREE(CLOSE,key,data);

XNEXT/PREVIOUS - retrieve the next ascending/descending key; returns one. The values of the second and third arguments become the value of the next ascending/descending key. An initial value of the empty string for the second argument will retrieve the first/last key, and the value of the second argument becomes the empty string when there are no more ascending/descending values.
```
      unsigned char key[STR_MAX]="";   // large enough to receive each returned key
      unsigned char data[STR_MAX];
      int i;
      printf("\nbegin retrieve...\n");
      while(1) { // retrieve keys in ascending order
            i=BTREE(XNEXT,key,data);
            if (strlen( (char *) data)==0) break;
            cout << key << endl;
            }
```

void global::Centroid(global B)

A centroid vector B is calculated for the invoking two dimensional global array. The centroid vector is the average value for each column of the matrix. Any previous contents of the global array named to receive the centroid vector are lost. The invoking global array (A) must contain at least two dimensions. For example:

Yields:

The above yields a vector giving the average value of each named column of the matrix "A" (5 in this case since each column is initialized with 5).
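The column-averaging step can be sketched independently of the toolkit. In this minimal sketch the matrix is a nested std::map indexed by row name and then column name, standing in for a two dimensional global array. The names Matrix and centroid_sketch are hypothetical, and dividing each column sum by the number of rows that actually contain that column is an assumption; the toolkit may divide by the total row count instead.

```cpp
#include <map>
#include <string>

// Row name -> (column name -> value); a stand-in for a 2-D global array.
using Matrix = std::map<std::string, std::map<std::string, double>>;

// Average each named column over the rows in which it appears.
std::map<std::string, double> centroid_sketch(const Matrix& a) {
    std::map<std::string, double> sum;
    std::map<std::string, int> count;
    for (const auto& row : a) {
        for (const auto& cell : row.second) {
            sum[cell.first] += cell.second;
            ++count[cell.first];
        }
    }
    for (auto& column : sum) column.second /= count[column.first];
    return sum;
}
```

If every cell of the matrix holds 5, every entry of the returned vector is 5, matching the result described above.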

void CleanLocks(void);
void CleanAllLocks(void);

"CleanLocks()" removes all locks for the current process. "CleanAllLocks()" removes all locks for all processes for which the current directory is the default directory. Locks are implemented by entries in a file named "Mumps.Locks" created and maintained in the current directory. This file must be read/write enabled for the current process. You may also delete all locks by removing this file. Locks are discussed elsewhere but, in brief, they are used to signal ownership of a portion of a global array. When a lock has been applied to a node, no other process may lock this node, any descendant node or any parent node. Locking does not actually prevent access, it merely marks a resource as locked.

char * mstring::c_str()

Returns a char * to a NULL terminated character string containing the same value as the mstring variable.

command(string)

"command()" is a macro that takes a quoted string constant argument. The macro surrounds the string with an extra set of quotes and processes any embedded quotes to backslash-quote. It then invokes a function (__command__()) which strips the extra surrounding quotes. The net effect of this is that you can pass a quoted string containing quotes without the need for "leaning toothpick" notation. Example:

Normal usage: 

$pattern(source_str, "3n1\"-\"2n1\"-\"4n") 
strcpy(target, "for i=1:1:10 write \"test \",i,!"); 

with command(): 

$pattern(source_string, command("3n1"-"2n1"-"4n")) 
xecute(command("for i=1:1:10 write "test ",i,!")); 
strcpy(target, command("for i=1:1:10 write "test ",i,!"));

The argument must be a character string constant.
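The quoting behavior described above matches what the C preprocessor's stringizing operator (#) does: it wraps the argument text in quotes and escapes any embedded quotes. The following is a minimal sketch of that idea, assuming this is the mechanism involved. COMMAND_SKETCH and strip_outer_quotes are hypothetical names, not the toolkit's command() and __command__().

```cpp
#include <string>

// Hypothetical helper: drop the first and last character (the outer quotes
// that were part of the original argument text).
inline std::string strip_outer_quotes(const std::string& s) {
    return s.size() >= 2 ? s.substr(1, s.size() - 2) : s;
}

// #s turns the macro argument into a string literal, escaping embedded
// quote characters; the helper then removes the outermost pair.
#define COMMAND_SKETCH(s) strip_outer_quotes(#s)
```

With this sketch, COMMAND_SKETCH("3n1"-"2n1"-"4n") produces the same text as the hand-escaped literal "3n1\"-\"2n1\"-\"4n". Because stringizing happens at compile time, the argument has to be written literally in the source, which is consistent with the restriction to string constants.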

Comparison operations involving globals.

The comparison operators >, >=, <, <=, ==, and != are defined for global arrays. The mode of the comparison is determined by the other operand. That is, if a global array is compared with an integer, integer comparison will be used; if it is compared with a character string, character string comparison will be used. The contents of the data stored at the global array node must be compatible with the comparison mode. When two global array elements are compared, the comparison will be a string comparison, regardless of the contents. For example:

Note that the mode of comparison is dependent upon the second operand. In the case of string comparisons, an ASCII comparison takes place; thus "123" is less than "2".
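The difference between the two comparison modes can be seen with ordinary strings. This minimal sketch uses hypothetical helper names and plain std::string values in place of global array nodes. It shows only how the same stored text orders differently depending on the mode.

```cpp
#include <cstdlib>
#include <string>

// String mode: character-by-character (ASCII) comparison.
bool less_as_string(const std::string& stored, const std::string& other) {
    return stored < other;
}

// Integer mode: the stored text is converted to a number before comparing.
bool less_as_integer(const std::string& stored, long other) {
    return std::strtol(stored.c_str(), nullptr, 10) < other;
}
```

Here less_as_string("123", "2") is true because '1' sorts before '2', while less_as_integer("123", 2) is false because 123 is greater than 2.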

Conversion functions

char *cvt(long i)
char *cvt(double i)
char *cvt(float i)
char *cvt(int i)

These functions return a null terminated, varying length character string containing a printable version of the argument. The functions contain short static character arrays and, consequently, are not threadsafe. Note that a char * can be assigned to a variable of type string. These functions are mainly useful to get string values for global array indices. Because these functions use static character arrays, do not use them directly as indices to global arrays. Example:
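The hazard of a shared static buffer can be shown with a small stand-in. This is a minimal sketch, assuming a single static array per function as the text describes. The names cvt_sketch and cvt_copy are hypothetical and this is not the toolkit's cvt().

```cpp
#include <cstdio>
#include <string>

// Stand-in for a cvt()-style function: every call formats into the same
// static buffer and returns its address.
const char* cvt_sketch(long i) {
    static char buf[32];
    std::snprintf(buf, sizeof buf, "%ld", i);
    return buf;
}

// Safer pattern: copy the result into its own std::string before the
// function is called again.
std::string cvt_copy(long i) {
    return std::string(cvt_sketch(i));
}
```

Two calls to cvt_sketch return the same address, so the text from the first call is overwritten by the second. Copying each result into a string first keeps the values distinct when several converted numbers are needed in one expression.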

GlobalClose;

This macro closes the global array files. The global arrays must be closed on exit or they will be corrupt. The macro causes the file system to flush all its buffers and cache and then close. Normally, a "GlobalClose" is executed automatically when your program ends, except if your program is terminated by SIGKILL or SIGSTOP (which cannot be trapped). If your program is using a large memory based cache (caches can be 1 GB or more on some systems), there may be a noticeable delay in file system shutdown due to the time required to write the cache to disk.

Correlation functions

void global::TermCorrelate(global B)
void global::DocCorrelate(global B, string fcnname, double threshold)
void global::DocCorrelate(global B, mstring fcnname, double threshold)
void global::DocCorrelate(global B, char * fcnname, double threshold)

These functions build document indexing correlation matrices. The invoking global is assumed to be a two dimensional document-term matrix whose rows are documents and whose columns represent the occurrence of terms in the documents (either weights or frequencies).

TermCorrelate() builds a square term-term correlation matrix in B from the invoking document-term matrix.

DocCorrelate() builds a square document-document correlation matrix from the invoking document-term matrix. The name of the function to be used in calculating the document-document similarity is given in fcnname and may be Cosine, Jaccard, Dice, or Sim1. The minimum correlation threshold is given in threshold, which defaults to 0.80 if omitted.

TermCorrelate() Example:

Yields:

The above gives the number of co-occurrences of each word with each other word. For example, the words "computer" and "memory" co-occur in two vectors (2 and 3) while the words "laptop" and "computer" co-occur in all three vectors. If each vector is thought of as a document, the strength of the co-occurrences between words is a measure of similarity for indexing purposes.
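The co-occurrence counting can be sketched without the toolkit. In this minimal sketch each document is a set of words and the result is a nested map of counts. The names are hypothetical, and leaving out the diagonal (a word paired with itself) is an assumption about the layout, not a statement about what TermCorrelate() stores.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Cooccurrence = std::map<std::string, std::map<std::string, int>>;

// For every pair of distinct words appearing in the same document,
// add one to the count for that pair.
Cooccurrence term_cooccurrence(const std::vector<std::set<std::string>>& docs) {
    Cooccurrence co;
    for (const auto& doc : docs)
        for (const auto& a : doc)
            for (const auto& b : doc)
                if (a != b) ++co[a][b];
    return co;
}
```

For three documents in which "laptop" and "computer" appear in all three and "memory" appears in the second and third, the count for ("computer", "memory") is 2 and the count for ("laptop", "computer") is 3.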

DocCorrelate() Example:

Yields:

The above program calculates the similarities between the document vectors according to the Cosine method.
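For reference, the standard cosine measure between two term vectors can be sketched as follows. This is the textbook formula (dot product divided by the product of the vector lengths), and it is an assumption that the toolkit's Cosine option computes exactly this. TermVector and cosine_similarity are hypothetical names.

```cpp
#include <cmath>
#include <map>
#include <string>

// Term -> weight (or frequency) for one document.
using TermVector = std::map<std::string, double>;

// Cosine of the angle between two sparse term vectors.
double cosine_similarity(const TermVector& x, const TermVector& y) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (const auto& t : x) {
        nx += t.second * t.second;
        auto it = y.find(t.first);
        if (it != y.end()) dot += t.second * it->second;
    }
    for (const auto& t : y) ny += t.second * t.second;
    if (nx == 0.0 || ny == 0.0) return 0.0;  // assumption: empty vector scores 0
    return dot / (std::sqrt(nx) * std::sqrt(ny));
}
```

Identical vectors score 1 and vectors with no terms in common score 0. A threshold such as 0.80 would then keep only document pairs whose score is at or above that value.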

long global::Count()

Returns the number of data bearing nodes beneath the given global array reference. Example:

Data functions

int $data(global)
int global::Data()

The functions $data() and global::Data() return an integer which indicates whether the global array node is defined. The value returned is 0 if the global array node is undefined; 1 if it has a value and no descendants; 10 if it has no value stored at the node but does have descendants; and 11 if it has both a value and descendants.

If a global array with no indices is passed to these functions, a value of "10" will be returned if the array exists and "0" if the array does not exist. For example:
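The four return codes can be modeled with a small in-memory tree. This is a minimal sketch: the tree is a std::map from index paths to values, holding only nodes that have a value, and the names Path, Tree and data_code are hypothetical. It models the return codes only, not how the toolkit searches its btree.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using Path = std::vector<std::string>;
using Tree = std::map<Path, std::string>;   // only nodes holding a value are stored

// 0: undefined, 1: value only, 10: descendants only, 11: value and descendants.
int data_code(const Tree& tree, const Path& node) {
    int code = tree.count(node) ? 1 : 0;     // 1 if a value is stored here
    for (const auto& entry : tree) {
        const Path& p = entry.first;
        if (p.size() > node.size() &&
            std::equal(node.begin(), node.end(), p.begin())) {
            code += 10;                       // some longer path starts with node
            break;
        }
    }
    return code;
}
```

An empty path plays the role of a global array reference with no indices: it has no value of its own, so the result is 10 when any node exists and 0 otherwise.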

int mstring::decorate(mstring pattern, mstring left, mstring right);

Locates the pattern in the invoking mstring and inserts left immediately to the left of the string that matched the pattern and inserts right immediately to the right of the found pattern. Returns 1 if the pattern was found and the insertions were made, -1 if the pattern was not found, and less than -1 for other errors (see PCRE documentation concerning pcre_exec() return codes). Throws: PatternException().

char * EncodeHTML(char *)

Encodes the argument string according to HTML rules and returns the result. Alphabetics and numbers are unchanged. Blanks become plus signs and all other characters are replaced by "%xx" where "xx" is the hexadecimal value of the character in the ASCII collating sequence. The function is used mainly in connection with parameters passed with URLs, which may not contain blanks or special characters. The code in cgi.h is used to decode these strings. Example:
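The encoding rule stated above can be sketched with the standard library alone. This minimal sketch follows the three cases in the description; encode_sketch is a hypothetical name, and the use of upper case hexadecimal digits is an assumption since the text does not say which case the toolkit emits.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Letters and digits pass through, blanks become '+', everything else
// becomes %xx with xx the character code in hexadecimal.
std::string encode_sketch(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c)) {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            char hex[4];                      // '%', two digits, terminator
            std::snprintf(hex, sizeof hex, "%%%02X", static_cast<unsigned>(c));
            out += hex;
        }
    }
    return out;
}
```

For instance, the text "a b" becomes "a+b", and "x=1&y" becomes "x%3D1%26y".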

int mstring::ends(mstring pattern)

Returns an integer giving the character position (relative to zero) immediately following the string that matched pattern. Returns -1 if the string did not match. Throws: PatternException.

extern "C" void ErrorMessage(char * message, int line_number)

This function (written in C and part of the underlying legacy library) will print an error message, close the global array files and terminate the program. The integer "line_number" will be printed with the message. The pre-processor predefined macro "__LINE__" can be used here. Example:

ErrorMessage("Cannot locate patient",__LINE__);

Error Exceptions

The toolkit generates (throws) exceptions for certain conditions. For example, when you access global arrays with the toolkit, the accesses may result in the following thrown error exceptions:

ConversionException
GlobalNotFoundException
MumpsSymbolTableException
NumericRangeException

GlobalNotFoundException can occur in any context that attempts to retrieve data from a global array where none exists. ConversionException occurs if you attempt to convert the contents of a global to a numeric type where the contents of the global are not valid data for the conversion.

If uncaught, these exceptions will result in program termination.

The following are the exceptions thrown by the toolkit:

ConversionException() - usually occurs when you attempt to store a value from a global array into a numeric variable but the string in the global is not a valid number.
GlobalNotFoundException() - thrown by an attempt to reference non-existent global array data.
MumpsSymbolTableException() - thrown by an attempt to fetch the value of a non-existent variable from the Mumps runtime symbol table.
NumericRangeException() - thrown by attempts to divide by zero or to pass arguments with values less than or equal to zero to log functions.

char * $extract(char * source_string [, int start, [int end]])
char * $extract(string source_string [, int start, [int end]])
mstring $extract(mstring source_string [, int start, [int end]])

Returns a pointer to a substring of the first argument. The substring begins at the position given by the second argument. If the third argument is omitted, the substring consists only of the "start" character of "source_string". If the third argument is present, the substring begins at position "start" and ends at position "end". If only "source_string" is given, the function returns the first character of "source_string". If "end" specifies a position beyond the end of "source_string", the substring ends at the end of "source_string". String position counting begins at one. Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section). For example:
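The position rules above can be modeled on std::string. This is a minimal sketch under stated assumptions: extract_sketch is a hypothetical name, a negative "end" stands for an omitted third argument, and start positions below one are clamped to one. It is not the toolkit's $extract().

```cpp
#include <string>

// start and end are 1-based and inclusive; an omitted end equals start,
// and an omitted start selects the first character.
std::string extract_sketch(const std::string& s, int start = 1, int end = -1) {
    if (end < 0) end = start;                 // third argument omitted
    if (start < 1) start = 1;                 // assumption: clamp low positions
    const int len = static_cast<int>(s.size());
    if (end > len) end = len;                 // stop at the end of the string
    if (start > end) return "";
    return s.substr(start - 1, end - start + 1);
}
```

With the source string "abcdef", positions 2 through 4 select "bcd", position 3 alone selects "c", no positions select "a", and positions 4 through 99 select "def".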

char * $find(char * source_string, char * pattern_string [, int start])
char * $find(string source_string, string pattern_string [, int start])
mstring $find(mstring source_string, mstring pattern_string [, int start])
mstring $find(mstring source_string, string pattern_string [, int start])
mstring $find(mstring source_string, const char * pattern_string [, int start])

$find() searches the first argument for an occurrence of the second argument. If one is found, the value returned is one greater than the end position of the second argument in the first argument. If "start" is specified, the search begins at position "start" in argument 1. If the second argument is not found, the value returned is 0. String position counting begins at position one. For example:
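The return value convention (one past the end of the match, counting from one, or zero when there is no match) can be sketched on std::string. The name find_sketch is hypothetical, it returns an int rather than a string for simplicity, and clamping start values below one is an assumption.

```cpp
#include <cstddef>
#include <string>

// Returns the 1-based position just after the first match at or after
// start, or 0 if the pattern does not occur.
int find_sketch(const std::string& s, const std::string& pat, int start = 1) {
    if (start < 1) start = 1;                 // assumption: clamp low positions
    std::size_t from = static_cast<std::size_t>(start - 1);
    if (from > s.size()) return 0;
    std::size_t pos = s.find(pat, from);
    if (pos == std::string::npos) return 0;
    return static_cast<int>(pos + pat.size()) + 1;
}
```

Searching "abcdef" for "cd" gives 5: the match occupies positions 3 and 4, and the result is one greater than the end position.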

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

Interpreter Get, Set, Order and Data functions on global arrays

mstring GlobalGet (mstring global_ref)
string GlobalGet (string global_ref)
char * GlobalGet (char * global_ref)

mstring GlobalOrder (mstring global_ref, int direction)
string GlobalOrder (string global_ref, int direction)
char * GlobalOrder (char * global_ref, int direction)

int GlobalData (mstring global_ref)
int GlobalData (char * global_ref)
int GlobalData (string global_ref)

int GlobalSet (mstring global_ref, mstring source)
int GlobalSet (mstring global_ref, char * source)
int GlobalSet (char * global_ref, mstring source)
int GlobalSet (string global_ref, string source)
int GlobalSet (string global_ref, char * source)
int GlobalSet (char * global_ref, string source)
int GlobalSet (char * global_ref, char * source)

These functions use the interpreter. They permit runtime construction of, and access to, global array references. In all cases global_ref is a string containing a global array reference. This string can be dynamically constructed at runtime or may be read from a file or another global. Note: as this facility uses the interpreter, global array references must be preceded by the circumflex character (^).

In the case of the GlobalGet() functions, the string global array reference is interpreted and the value stored at the reference returned. If the reference is invalid or no data is stored, the value returned is the empty string and $test is set to false (zero). If a value is found, $test is set to true and the value is returned.

GlobalOrder() gives the next or prior value of the last index of the global array reference depending upon if direction is 1 (next) or -1 (prior). $test is set to 0 in the event of an error and 1 if there is no error. See $order().

GlobalData() returns a number indicating if the node exists and has descendants (see $data()). $test is set to 0 if there is an error, 1 otherwise.

In the case of the GlobalSet() functions, the second argument is a string of data to be stored at the global array reference. The runtime routines will interpret the global_ref and assign the source to it. The value returned is one if successful ($test is set to 1), zero if not successful ($test is set to 0). Examples:

These functions can be used to allow a program to create a text string global array reference and then use the string to address the global. Note that the target must contain either quoted literals or variables previously instantiated to the interpreter environment (see $SymSet() and $SymGet()).

Generally speaking, these functions will be only used for dynamically constructed global array references. Most access to globals will be by overloaded shift or assignment operators.

double HitRatio(void)

Calculates the native global array processor cache hit ratio since the beginning of the program or the last call to "HitRatio()". The native global array file processor, as opposed to the Berkeley Data Base, keeps track of how many file I/O requests are satisfied from data already in the file system's cache. This function gives the percentage of cache hits. It only works with the native global array processor.

Hashing functions

char * hash(char * str)
long lhash(char * str)

hash() returns a null terminated character string of up to 10 characters containing a numeric hash code of the string passed as an argument. The argument may be up to STR_MAX characters in length. lhash() returns an unsigned long value of the hash value. See also: Hash class.

char * $horolog

Returns a pointer to a static null terminated array of characters containing two numbers. The first is the number of days since December 31, 1840 and the second is the number of seconds since the most recent midnight. These values are relative to Greenwich Mean Time.
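The day count can be checked without the toolkit. The sketch below is a standalone illustration in standard C++, not the toolkit's implementation: horolog_from_unix() is a hypothetical helper, and the comma between the two numbers is assumed from the usual Mumps $horolog convention. It relies on the fact that January 1, 1970 is day 47117 when counting from December 31, 1840.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper (not part of the MDH toolkit): formats a Unix
// timestamp, which is already GMT based, as "days,seconds".
// The comma separator is an assumption taken from standard Mumps usage.
std::string horolog_from_unix(std::time_t t) {
    const long long epoch_offset_days = 47117;  // Jan 1, 1970 counted from Dec 31, 1840
    const long long secs_per_day = 86400;
    long long secs = static_cast<long long>(t);
    long long days = secs / secs_per_day;
    long long rem = secs % secs_per_day;
    if (rem < 0) {                              // keep seconds in 0..86399 for pre-1970 times
        rem += secs_per_day;
        days -= 1;
    }
    return std::to_string(days + epoch_offset_days) + "," + std::to_string(rem);
}
```

Calling horolog_from_unix(std::time(nullptr)) gives the current value under these assumptions.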

Inverse Document Frequency function

void global::IDF(double DocCount)

The IDF() function calculates, for the global array vector provided, the inverse document frequency weight of each term. The vector should be indexed by words and have stored the number of documents in which each word occurs. The document count will be replaced by the calculated IDF value. The IDF is log2(DocCount/Wn)+1 where Wn is the number of documents in which a term appears (the document frequency). The value DocCount is the total number of documents present in the collection. Example:
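The arithmetic of the formula can be sketched in standard C++ without the toolkit. Here a std::map stands in for the global array vector; the names TermVector and idf_in_place() are hypothetical, and skipping entries with a non-positive count is an assumption, not documented behavior.

```cpp
#include <cmath>
#include <map>
#include <string>

// Stand-in for a global array vector: word -> number of documents containing it.
using TermVector = std::map<std::string, double>;

// Replaces each document count Wn with log2(DocCount/Wn)+1.
// Entries with a non-positive count are left untouched (an assumption).
void idf_in_place(TermVector& v, double doc_count) {
    for (auto& entry : v) {
        if (entry.second > 0.0) {
            entry.second = std::log2(doc_count / entry.second) + 1.0;
        }
    }
}
```

With 8 documents in the collection, a word found in 2 of them gets log2(8/2)+1 = 3, while a word found in all 8 gets log2(1)+1 = 1.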

char * $justify(char * source_string, int field_width, int precision)
string $justify(string source_string, int field_width, int precision)
mstring $justify(mstring source_string, int field_width, int precision)

$justify() right justifies the first argument in a string field whose length is given by the second argument. If the third argument is -1, the first argument is interpreted as a string. If the third argument is a positive integer, the first argument is right justified in a field whose length is given by the second argument with "precision" decimal places. The three argument form imposes a numeric interpretation upon the first argument. Both the "field_width" and "precision" MUST be present. For example:

$justify("39",3,-1) yields " 39"
$justify("TEST",7,-1) yields "   TEST"
$justify(39,4,1) yields "39.0"
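The rules above can be approximated in standard C++ as follows. This is only a sketch: justify() is a hypothetical stand-in for the macro, a precision of zero is treated like a positive precision, and a value wider than the field is returned unpadded and untruncated, all of which are assumptions.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for $justify(): precision -1 means "treat as string",
// otherwise the source is read as a number and formatted with that many decimals.
std::string justify(const std::string& src, int field_width, int precision) {
    std::string body = src;
    if (precision >= 0) {
        char buf[512];
        std::snprintf(buf, sizeof buf, "%.*f", precision,
                      std::strtod(src.c_str(), nullptr));
        body = buf;
    }
    // Pad on the left with blanks until the field width is reached.
    if (static_cast<int>(body.size()) < field_width) {
        body.insert(0, field_width - body.size(), ' ');
    }
    return body;
}
```

Under these assumptions justify("TEST", 7, -1) pads with three leading blanks, and justify("39", 4, 1) produces "39.0", which already fills the field.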

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

void global::Kill()
void global.Kill()

These functions delete a node and all its descendants. Examples:

gbl().Kill();     // kill entire global array "gbl"
gbl(a,b,c).Kill();  // kill stated node and all descendants

int $length(char * source_string [, char * pattern_string])
int $length(string source_string [, string pattern_string])
int $length(mstring source_string [, mstring pattern_string])
int $length(mstring source_string [, string pattern_string])
int $length(mstring source_string [, char * pattern_string])

The function returns the string length of its argument. For example:

$length("ABC") yields 3
$length("22.5") yields 4

If a second argument is given, the function returns the number of non-overlapping occurrences of "pattern_string" in "source_string" plus 1.
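The two argument count is the same as the number of pieces that $piece() would see. A standalone C++ sketch of the counting rule follows; piece_count() is a hypothetical name, and returning 0 for an empty pattern is an assumption borrowed from standard Mumps.

```cpp
#include <string>

// Counts non-overlapping occurrences of pattern in source, plus 1.
// An empty pattern yields 0 (assumption, following standard Mumps).
int piece_count(const std::string& source, const std::string& pattern) {
    if (pattern.empty()) return 0;
    int count = 1;
    std::string::size_type pos = 0;
    while ((pos = source.find(pattern, pos)) != std::string::npos) {
        ++count;
        pos += pattern.size();   // skip the whole match so occurrences never overlap
    }
    return count;
}
```

For "A.BX.Y" with pattern "." there are two occurrences, so the result is 3.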

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

int global::Lock()
int $lock(global gbl)

Creates a lock on the named node. If successful, "$test" will be true (1), false (0) otherwise. Both forms return a 1 if the lock succeeds and a 0 otherwise.

The "Lock()" function marks a portion of the data base for exclusive access by an individual user. The "UnLock()" function frees prior locks (see below). The locks are stored in a file named "Mumps.Locks" which is opened for exclusive access by the locking/unlocking job. The contents of the file may be deleted to remove all locks. A lock does not actually prevent access to a global but merely marks it as locked. If another task attempts to place a lock on a locked node, the descendant of a locked node or a direct parent of a locked node, the lock attempt will fail. Examples:

See also: CleanLocks(), CleanAllLocks(), and UnLock().

double global::Max()

Returns the maximum numeric value of the data bearing nodes beneath the given reference. Non-numeric values are treated as zeros. Example:

int global::Merge(global)

Copies the global passed as the argument, together with its descendants, to the invoking global. Examples:

Xecute("for i=1:1:9 for j=1:1:9 set ^a(i,j)=i+j");
c().Merge(a()); // copies all of ^a to ^c 
Xecute("for i=100:1:109 s ^b(i)=i"); 
b("103").Merge(a("3")); // copies ^a(3) to ^b(103) and children of 
                        // ^a(3) to be children of ^b(103) 
d("").Merge(a("3"));  // creates ^d=^a(3); ^d(1)=^a(3,1),...

double global::Min()

Returns the minimum numeric value of the data bearing nodes beneath the given reference. Non-numeric values are treated as zeros. Example:

void Multiply(char * A,char * B,char * C);
void Multiply(string A,string B,string C);
void global::Multiply(global B,global C);

For the non-member versions, the two dimensional matrix named in the string A is multiplied by the two dimensional matrix named in the string B and the result becomes the contents of two dimensional matrix named in the string C. The number of columns of A must equal the number of rows of B. The resulting matrix C will have "n" rows and "m" columns where "n" is the number of rows of "A" and "m" is the number of columns of "B".

For the member version, the invoking object is multiplied by B and the result is placed in C. The dimensionality of "C" is as above.

In all cases C will be deleted before the operation commences. The data stored at each node must be numeric. All calculations are performed in double arithmetic. Each matrix must be two dimensional. Example:
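The dimension rule and the clearing of C can be illustrated outside the toolkit. In the sketch below a std::map keyed by (row, column) strings stands in for a two dimensional global array; Key, Matrix and multiply() are hypothetical names, and the simple double loop is chosen for clarity rather than speed.

```cpp
#include <map>
#include <string>
#include <utility>

// Stand-in for a two dimensional global array: (row, column) -> value.
using Key = std::pair<std::string, std::string>;
using Matrix = std::map<Key, double>;

// Computes c = a * b. c is emptied first, mirroring the rule that the
// result matrix is deleted before the operation commences.
// c must not be the same object as a or b.
void multiply(const Matrix& a, const Matrix& b, Matrix& c) {
    c.clear();
    for (const auto& ea : a) {
        for (const auto& eb : b) {
            // A column index of A must match a row index of B.
            if (ea.first.second == eb.first.first) {
                c[Key(ea.first.first, eb.first.second)] += ea.second * eb.second;
            }
        }
    }
}
```

The result has one row for every row of a and one column for every column of b, as described above.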

$name(global)

Returns a pointer to a null terminated array of characters containing the global reference with all variables and expressions in the indices evaluated. Example:

Order/Next functions - several forms:

char * $order(global, int direction)
char * global.Order(int direction)
int global.Order_Next(char * result)
int global.Order_Prior(char * result)

The order functions give the next ascending or descending value of the last index in a global array reference. The direction, ascending or descending, is given by either the name of the function or an integer "direction" which is either 1 - next ascending index, or -1 - next descending index. For example, if an array named "test" has nodes:

Use the empty string ("") to get the initial value of an index. When there are no further values, the empty string is returned.

The "order()" functions traverse an array from one sibling node to the next in key ascending or descending order. The result returned is the next value of the last index of the global or local array given as the first argument. The default traversal is in key ascending order except if the second argument evaluates to "-1" in which case the traversal is in descending key order. If the second argument has a value of "1", the traversal will be in ascending key order.

Note: all keys are stored in ASCII character collating order. This means that numeric keys are sorted alphabetically rather than numerically. Example:

string a = ""; 
gbl().Kill(); // delete any instances of gbl() 
Xecute("for i=1:1:20 set ^gbl(i)=i"); 
while(1) { a = $order(gbl(a),1); 
      if (a == "") break; 
      cout << a << endl; 
      }

The above will print 1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9 (ASCII collating sequence).
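The same ordering can be reproduced without the toolkit, because std::string comparison is also character by character. The following standalone sketch uses a std::set of strings as a stand-in for the index level of a global array; ascii_order() is a hypothetical name.

```cpp
#include <set>
#include <string>
#include <vector>

// Inserts the integers first..last as text keys and returns them in the
// order a character-collated index would visit them.
std::vector<std::string> ascii_order(int first, int last) {
    std::set<std::string> keys;
    for (int i = first; i <= last; ++i) {
        keys.insert(std::to_string(i));
    }
    return std::vector<std::string>(keys.begin(), keys.end());
}
```

For the range 1 through 20 the key "10" immediately follows "1", and "2" appears only after "19".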

int global.Order_Next(char * result)

Returns next ascending value of the last index (c) - direction implied - and sets "result" to the value of the index. Use the empty string ("") to get the initial value. When there are no further values, the empty string is returned in result and the value of the function is zero.

int global.Order_Prior(char * result)

Returns next descending value of the last index (c) - direction implied - and sets "result" to the value of the index. Use the empty string ("") to get the initial value. When there are no further values, the empty string is returned in result and the value of the function is zero.

friend ostream & operator << (ostream&, global);

A global array may participate in "cout" stream output. For example:

gbl("A","B","C") << "test test test";
cout << gbl("A","B","C") << endl;

The above will print "test test test" (without quotes) followed by the newline character. Alternatively:

cout << gbl("A","B","C").Get() << endl;

will do the same thing (the Get() function returns "char *" which is already defined for "cout").

int $pattern(char * source_string, char * pattern_string)
int $pattern(string source_string, string pattern_string)
int $pattern(mstring source_string, mstring pattern_string)

Evaluates the source_string according to the pattern_string and returns 0 (does not match) or 1 (does match). Pattern_string rules are as shown below but you must remember to place a backslash before quotes in the pattern string (as per usual C++ rules). The pattern match function is used to determine if a string conforms to a certain pattern. Pattern match operations are converted to Perl Compatible Regular Expressions and are executed by functions in the PCRE library which must be present. You may access the PCRE directly, using Perl expression format with the "perl_pm(string, pattern, 1, svPtr)" function discussed in Appendix D.

The pattern codes are:

A for the entire upper and lower case alphabet.
C for the 33 control characters.
E for any of the 128 ASCII characters.
L for the 26 lower case letters.
N for the numerics
P for the 33 punctuation characters.
U for the 26 upper case characters.
A literal string.

A pattern code is made up of one or more of the above, each preceded by a count specifier. The count specifier indicates how many of the named item must be present. Alternatively, an indefinite specifier - a decimal point - may be used to indicate any count (including zero). For example:

strcpy(A,"123-45-6789");
if ($pattern(A, command("3N1"-"2N1"-"4N") )) cout << "OK" << endl;
if (!$pattern(A, command("3N1"-"2N1"-"4N") )) cout << "OK" << endl;
strcpy(A,"JONES, J. L.");
if ($pattern(A, command(".A1",".A") )) cout << "OK" << endl;
if (!$pattern(A, command(".A1",".A") )) cout << "OK" << endl;

Full pattern matching syntax, including alternation, is supported as described in Appendix D of the Compiler manual. The macro "command()" will insert the backslash escape characters required before quote marks.

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

int $perl(char * string, char * regex)

The regular expression in the null terminated character array pointed to by regex is applied to the null terminated character array pointed to by string. If the pattern match succeeds, true (1) is returned, false (0) otherwise and $test is set accordingly. This macro also sets variables in the run-time symbol table. See $SymGet() and $SymPut() for details on accessing the symbol table. See Appendix D for examples of using this function.

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

char * $piece(char * source-string, char * pattern-string, int start[, int end])
char * $piece(string source-string, string pattern-string, int start[, int end])
char * $piece(mstring source-string, mstring pattern-string, int start[, int end])
char * $piece(mstring source-string, const string pattern-string, int start[, int end])
char * $piece(mstring source-string, const char * pattern-string, int start[, int end])

The $piece() function returns a substring of the first argument delimited by instances of the second argument. In the three argument form, the substring returned is the part of the first argument that lies between the "start" minus one and "start" occurrences of the second argument. In the four argument form, the string returned is the part of the first argument delimited by the "start" minus one instance and the "end" instance of the second argument. If only two arguments are given, "start" is assumed to be 1. For example:

$piece("A.BX.Y",".",2) yields "BX"
$piece("A.BX.Y",".",1) yields "A"
$piece("A.BX.Y",".",2,3) yields "BX.Y"
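The piece numbering can be made concrete with a standalone C++ sketch. The function piece() below is a hypothetical stand-in, not the toolkit macro; it follows standard Mumps behavior, in which a range of pieces keeps the delimiters that lie between them (so pieces 2 through 3 of "A.BX.Y" give "BX.Y") and a delimiter that never occurs leaves the whole string as piece 1.

```cpp
#include <string>

// Hypothetical stand-in for $piece(). end == 0 selects the single piece "start".
std::string piece(const std::string& src, const std::string& delim,
                  int start = 1, int end = 0) {
    if (end == 0) end = start;
    if (start < 1) start = 1;
    if (delim.empty() || end < start) return "";
    std::string::size_type pos = 0;                   // where the current piece begins
    std::string::size_type from = std::string::npos;  // where piece "start" begins
    for (int n = 1; ; ++n) {
        if (n == start) from = pos;
        std::string::size_type hit = src.find(delim, pos);
        if (hit == std::string::npos) {
            // Piece n is the last one: return the tail if "start" was reached.
            return from == std::string::npos ? "" : src.substr(from);
        }
        if (n == end) return src.substr(from, hit - from);
        pos = hit + delim.size();
    }
}
```

Asking for a piece number beyond the last piece returns the empty string in this sketch.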

Global arrays may be used in any argument position but only one instance of the same global may appear (see the note in the Accessing global arrays section).

Query functions

char * global.Query()
char * global.Query(char)
$query(global)
$queryX(global,char)

Returns a character string pointer to the next global array reference in the data base or the empty string. In the forms global::Query(char) and $queryX(global,char), the char argument is a pattern character that will replace commas and parentheses in the returned result. Global array references are returned with the elements of the global separated by commas and parentheses for Query() and $query(). Example:

$query("A(1,1)") yields A(1,2) where A(1,2) follows A(1,1) in the data base.
A(1,1).Query() yields A(1,2) where A(1,2) follows A(1,1) in the data base.

Global array references are returned with commas and parentheses replaced by a pattern character for Query(char pat) and $queryX(global ref,char pat). Example:

$queryX("A(1,1)",'#') yields A#1#2# where A(1,2) follows A(1,1) in the data base.
A(1,1).Query('#') yields A#1#2# where A(1,2) follows A(1,1) in the data base.

Note that text indices returned by these functions are not quoted. If you build a new global array reference (as shown below), you will need to enclose any text indices in double quotation marks. Numeric indices do not need to be quoted.

int mstring::replace(mstring pattern, mstring replacement)

Replaces the string matching pattern with replacement. Returns 1 if successful, -1 if there was no match, and a value less than -1 on error (see the PCRE documentation for pcre_exec()). Throws: PatternException.

string mstring::s_str()

Returns a string containing the same value as the mstring variable.

mstring Shred(mstring str, int size)
string Shred(string str, int size)
char * Shred(char * str, int size)

The Shred() function shreds the input string str into fragments of length size upon successive calls. The function returns a string of length zero when there are no more fragments of length size remaining (thus, short fragments at the end of a string are not returned). Shred() copies the input string to an internal buffer upon the first call. Subsequent calls retrieve from this buffer. When the buffer is consumed, the function will copy the contents of the next string submitted to the buffer. Example:
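The calling protocol can be sketched without the toolkit. The class below is a hypothetical stand-in written in standard C++: it keeps the buffer in an object rather than in hidden static storage, but otherwise follows the description, including dropping a short tail and accepting a new string once the buffer is consumed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for Shred(): call next() repeatedly with the same
// arguments until it returns the empty string.
class Shredder {
public:
    std::string next(const std::string& str, std::size_t size) {
        if (!loaded_) {              // first call, or previous buffer consumed
            buffer_ = str;
            pos_ = 0;
            loaded_ = true;
        }
        if (size == 0 || pos_ + size > buffer_.size()) {
            loaded_ = false;         // a short tail, if any, is discarded
            return "";
        }
        std::string frag = buffer_.substr(pos_, size);
        pos_ += size;
        return frag;
    }
private:
    std::string buffer_;
    std::size_t pos_ = 0;
    bool loaded_ = false;
};
```

Shredding "ABCDEFGH" with size 3 gives "ABC", then "DEF", then the empty string; the trailing "GH" is never returned.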

mstring ShredQuery(mstring str, int size)
string ShredQuery(string str, int size)
char * ShredQuery(char * str, int size)

The ShredQuery() function shreds size successively shifted copies of the input string str into fragments of length size upon successive calls. That is, the function first returns all the size fragments of the string in the same manner as Shred(). However, it then shifts the starting point of the input string to the right by one and returns all the size length fragments relative to the shifted starting point. It repeats this process a total of size times.

The function returns a string of length zero when there are no more fragments of length size remaining (thus, short fragments at the end of a string are not returned). ShredQuery() initially copies the input string to an internal buffer upon the first call. Subsequent calls retrieve from this buffer. When the buffer is consumed, the function will copy the contents of the next string submitted to the buffer. Example:
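The shifting scheme can be sketched in standard C++ as below. QueryShredder is a hypothetical stand-in, not the toolkit code, and it assumes that fragments from successive passes are returned back to back, with the empty string appearing only after the final pass.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for ShredQuery(): pass k (k = 0..size-1) starts at
// offset k and returns whole fragments of length size.
class QueryShredder {
public:
    std::string next(const std::string& str, std::size_t size) {
        if (!loaded_) {              // first call, or previous buffer consumed
            buffer_ = str;
            shift_ = 0;
            pos_ = 0;
            loaded_ = true;
        }
        while (size > 0 && shift_ < size) {
            if (pos_ + size <= buffer_.size()) {
                std::string frag = buffer_.substr(pos_, size);
                pos_ += size;
                return frag;
            }
            ++shift_;                // pass exhausted: restart one character to the right
            pos_ = shift_;
        }
        loaded_ = false;             // all passes done; accept a new string next time
        return "";
    }
private:
    std::string buffer_;
    std::size_t shift_ = 0;
    std::size_t pos_ = 0;
    bool loaded_ = false;
};
```

For "ABCDEFGH" and size 3 this yields "ABC", "DEF", "BCD", "EFG", "CDE", "FGH" and then the empty string.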

Similarity functions

double global::Sim1(global B)
double global::Cosine(global B)
double global::Jaccard(global B)
double global::Dice(global B)

The global arrays referenced by the invoking object and the passed object are compared and a similarity value is computed. The functions compute the similarities of the data bearing nodes beneath the global array references.

double global::Sim1(global B) computes the simple pairwise sum of the products of the values stored at nodes having the same suffix array reference. For example:

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1","1","1")=1; A("1","1","2")=1; A("1","1","3")=1; A("1","1","5")=1;
B("1","1","1")=1; B("1","1","2")=1; B("1","1","4")=1; B("1","1","6")=1;
cout << A("1","1").Sim1(B("1","1")) << endl;
return 0;
}

The above prints 2 since there are two nodes in common below the "1,1" levels.

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1","1","1")=2; A("1","1","2")=1; A("1","1","3")=1; A("1","1","5")=1;
B("1","1","1")=2; B("1","1","2")=1; B("1","1","4")=1; B("1","1","6")=1;
cout << A("1","1").Sim1(B("1","1")) << endl;
return 0;
}

The above prints 5 since there are two nodes in common below the "1,1" levels but one of the set of nodes in common have a stored value of 2. (2*2+1*1)

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1","1","1")=1; A("1","1","2")=1; A("1","1","3")=1; A("1","1","5")=1;
B("1")=1; B("2")=1; B("4")=1; B("6")=1;
cout << A("1","1").Sim1(B()) << endl;
return 0;
}

prints 2 also.

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1")=3; A("2")=2; A("3")=1; A("4")=0; A("5")=0; A("6")=0; A("7")=1; A("8")=1;
B("1")=1; B("2")=1; B("3")=1; B("4")=0; B("5")=0; B("6")=1; B("7")=0; B("8")=0;
cout << A().Jaccard(B()) << endl;
return 0;
}

prints 1

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1")=3; A("2")=2; A("3")=1; A("4")=0; A("5")=0; A("6")=0; A("7")=1; A("8")=1;
B("1")=1; B("2")=1; B("3")=1; B("4")=0; B("5")=0; B("6")=1; B("7")=0; B("8")=0;
cout << A().Dice(B()) << endl;
return 0;
}

prints 1

#include <mumpsc/libmpscpp.h>

global A("A");
global B("B");

int main() {
A("1")=3; A("2")=2; A("3")=1; A("4")=0; A("5")=0; A("6")=0; A("7")=1; A("8")=1;
B("1")=1; B("2")=1; B("3")=1; B("4")=0; B("5")=0; B("6")=1; B("7")=0; B("8")=0;
cout << A().Cosine(B()) << endl;
return 0;
}

prints 0.75

double global::Cosine(global B) is invoked in the same manner and computes the Cosine similarity metric. Likewise, Dice and Jaccard are other commonly used similarity metrics invoked in the same manner (see Salton, G; and McGill, M, Introduction to Modern Information Retrieval, McGraw Hill, 1983).
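The four measures can be written out in standard C++ for checking by hand. This is a sketch, not the toolkit source: it assumes the term-weight forms given by Salton and McGill (Dice and Jaccard use plain sums of the weights, Cosine uses sums of squares), which reproduce the results 1, 1 and 0.75 for the eight element vectors used above. The names Vec, sim1(), cosine(), dice() and jaccard() are hypothetical, and returning 0 for a zero denominator is an assumption.

```cpp
#include <cmath>
#include <map>
#include <string>

// Stand-in for the data bearing nodes beneath a global reference.
using Vec = std::map<std::string, double>;

double weight_sum(const Vec& v) {
    double s = 0.0;
    for (const auto& e : v) s += e.second;
    return s;
}

double square_sum(const Vec& v) {
    double s = 0.0;
    for (const auto& e : v) s += e.second * e.second;
    return s;
}

// Sum of products over indices present in both vectors.
double sim1(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (const auto& e : a) {
        auto it = b.find(e.first);
        if (it != b.end()) s += e.second * it->second;
    }
    return s;
}

double cosine(const Vec& a, const Vec& b) {
    double d = std::sqrt(square_sum(a) * square_sum(b));
    return d > 0.0 ? sim1(a, b) / d : 0.0;
}

double dice(const Vec& a, const Vec& b) {
    double d = weight_sum(a) + weight_sum(b);
    return d > 0.0 ? 2.0 * sim1(a, b) / d : 0.0;
}

double jaccard(const Vec& a, const Vec& b) {
    double d = weight_sum(a) + weight_sum(b) - sim1(a, b);
    return d > 0.0 ? sim1(a, b) / d : 0.0;
}
```

For A = (3,2,1,0,0,0,1,1) and B = (1,1,1,0,0,1,0,0) the sum of products is 6, the weight sums are 8 and 4, and the sums of squares are 16 and 4, giving Dice = 12/12, Jaccard = 6/6 and Cosine = 6/8.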

Smith-Waterman Alignment Function

int sw(mstring s, mstring t, [int show_aligns=0, int show_mat=0, int gap=-1, int mismatch=-1, int match=2])
int sw(string s, string t, [int show_aligns=0, int show_mat=0, int gap=-1, int mismatch=-1, int match=2])
int sw(char *s, char *t, [int show_aligns=0, int show_mat=0, int gap=-1, int mismatch=-1, int match=2])

Calculate the Smith-Waterman Alignment between strings "s" and "t". Result returned is the highest alignment score achieved. Parameters other than the first two are optional. If only some of the optional parameters are supplied, only trailing parameters may be omitted, as per C/C++ rules.

If you compare very long strings (>100,000 characters), you may exceed stack space. This can be increased under Linux with the command:

ulimit -s unlimited

(Other options are ulimit -a and ulimit -aH to show limits).

If "show_aligns" is zero, no printout of alternative alignments is produced (default). If "show_aligns" is not zero, a summary of the alternative alignments will be printed. If "show_mat" is zero, intermediate matrices will not be printed (default). The parameters "gap", "mismatch" and "match" are the gap and mismatch penalties (negative integers) and the match reward (a positive integer). These values default to -1, -1 and 2 respectively. If insufficient memory is available, a segmentation violation will be raised. The first character of each sequence string MUST be blank. Example:

#include <mumpsc/libmpscpp.h>

int main() {

char s[]=" now is the time for all good men to come to the aid of the party";
char t[]=" time  for   good   men";

int i=sw(s,t,1,0,-1,-1,3);

return 0;
}

results in:

S-W Alignments for:
64  now is the time for all good men to come to the aid of the party
22  time  for   good   men

  29  men 32
     ::::
  19  men 22
score=12

  29 - men 32
      ::::
  18   men 22
score=11

  23 l good-- men 32
      :::::  ::::
  11   good   men 22
score=24

  22 ll good-- men 32
       :::::  ::::
  11 -  good   men 22
score=23

  16  for all good-- men 32
     :::::   :::::  ::::
   6  for --  good   men 22
score=37

  12 time- for all good-- men 32
     :::: :::::   :::::  ::::
   1 time  for --  good   men 22
score=48
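How such scores arise can be seen from a minimal Smith-Waterman scoring routine in standard C++. This is a sketch only and not the toolkit's sw(): sw_score() is a hypothetical name, it uses a linear gap penalty, returns just the best local score without printing alignments, does not require the leading blank, and keeps two rolling rows of the matrix instead of the full table.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Best local alignment score of s and t (linear gap penalty).
int sw_score(const std::string& s, const std::string& t,
             int gap = -1, int mismatch = -1, int match = 2) {
    std::vector<int> prev(t.size() + 1, 0);   // row i-1 of the score matrix
    std::vector<int> cur(t.size() + 1, 0);    // row i
    int best = 0;
    for (std::size_t i = 1; i <= s.size(); ++i) {
        cur[0] = 0;
        for (std::size_t j = 1; j <= t.size(); ++j) {
            int diag = prev[j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int up = prev[j] + gap;           // gap in t
            int left = cur[j - 1] + gap;      // gap in s
            cur[j] = std::max({0, diag, up, left});   // never below zero: local alignment
            best = std::max(best, cur[j]);
        }
        std::swap(prev, cur);
    }
    return best;
}
```

With the default weights, "ACGT" against "AGT" scores 5: one match, one gap, then two more matches (2 - 1 + 2 + 2).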

mstring stem(mstring & word)
string stem(string & word)
char * stem(char * word)

Returns the original word or the English linguistic root stem of the word, if one can be found. The char * form alters the word passed if a linguistic stem is found.

double global::Sum()

The global array nodes beneath the referenced global array are summed. Non numeric quantities are treated as zero. Example:

mstring $SymGet(mstring name)
string $SymGet(string name)
char * $SymGet(char * name)
mstring $SymPut(mstring name, mstring value)
string $SymPut(string name, string value)
char * $SymPut(char * name, char * value)

These macros retrieve and store values from/to the run-time symbol table. In all forms, name is a string containing the name of the variable and value is the value to be stored. The $SymPut() functions return the string stored. A MumpsSymbolTableException exception is raised if $SymGet() fails. If $SymPut() fails, the program terminates (out of memory).

char **global.Select(char * [,char *, ... ])
char ** global.Select(string [,string &, ...])
char ** global.Select(mstring [,mstring &, ...])

Used in connection with a relational treatment of a global array. The Select() functions locate the next ascending row of the named global and set the specified variables to the corresponding columns of the row. If the next row has fewer columns than there are variables specified, the excess variables are set to the empty string. If a row has more columns than there are variables specified, only the variables specified are set. The function returns an array of pointers to char * that point to the text values of all existing columns successively. Pointers for non-existing columns have the value of NULL. That is, if the row has 5 columns, there will be 5 pointers numbered 0 through 4. Pointers above 4 will be NULL. If pointer 0 is NULL, there were no more rows for this global array. Example:

#include <mumpsc/libmpscpp.h>
 
global s("s");
global tmp("tmp");
 
int main() {
string  a,b,c;
char **p;
 
s().Kill();
tmp().Kill();
 
for (int i=0; i<100; i++ ){
      a=cvt(i);
      for (int j=0; j<100; j++){
            b=cvt(j);
            for (int k=0; k<10; k++) {
                  c=cvt(k);
                  s(a,b,c)="";
                  }
            }
      }
       
a=""; b=""; c="";
 
while (s(a,b,c).Select(a,b,c) != NULL) {
      if (a == "1") {
            tmp(a,b,c) = "";
            }
      }
 
a=""; b=""; c="";
 
while (tmp(a,b,c).Select(a,b,c) != NULL ) cout << a << " " << b << " " << c << endl;
 
return 0;
}

The above program creates a global array s() of 100,000 elements. The array is then accessed row by row selecting only those rows where the first column has the value "1". Rows selected are copied to another array named tmp() which is ultimately printed.

This facility can be used to manipulate global arrays as relational tables.

Stop list functions

void StopInit(mstring file)
void StopInit(string file)
void StopInit(char * file)
int StopLookup(mstring word)
int StopLookup(string word)
int StopLookup(char * word)

StopInit() reads the sorted file "file" of stoplist words into the stoplist container (one word per line). StopLookup() returns 0 if "word" is not found and 1 if "word" is found in the stoplist.

Synonym Functions

int SynInit(mstring filename)
int SynInit(string filename)
int SynInit(char * filename)
mstring SYN(mstring word)
string SYN(string word)
char * SYN(char * word)

SynInit() opens and reads a synonym file and returns the number of lines read. The maximum number of synonyms permitted is determined by "SYNMAX" in libmpscpp.h (default is 20,000). Each line of the synonym file consists of multiple words, in lower case, separated from one another by a single blank. The first word is the root alias and the remaining words are alternative synonyms. The function SYN() looks up a word. If the word is an alternative synonym, the root alias is returned. If not, the original word is returned.
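The file layout and lookup rule can be sketched in standard C++. SynonymTable is a hypothetical stand-in that reads from any input stream instead of a file name, and it ignores the SYNMAX limit; it is not the toolkit implementation.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for SynInit()/SYN(): each input line is
// "root alt1 alt2 ...", and lookup maps an alternative back to its root.
class SynonymTable {
public:
    // Reads all lines and returns the number of lines read.
    int load(std::istream& in) {
        int lines = 0;
        std::string line;
        while (std::getline(in, line)) {
            ++lines;
            std::istringstream words(line);
            std::string root, alt;
            if (!(words >> root)) continue;          // blank line: nothing to store
            while (words >> alt) root_of_[alt] = root;
        }
        return lines;
    }
    // Returns the root alias for an alternative synonym, else the word itself.
    std::string lookup(const std::string& word) const {
        auto it = root_of_.find(word);
        return it == root_of_.end() ? word : it->second;
    }
private:
    std::map<std::string, std::string> root_of_;
};
```

Given the line "car auto automobile", looking up "auto" returns "car", while "car" and any unknown word are returned unchanged.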

void global::TablePrint([int indt [, const char indtchr]]);

Prints the global array in tabular form. If "indt" is given, it is the number of positions between columns. If "indtchr" is given, it is the character repeated between columns. If not given, the character between columns is space.

#include <mumpsc/libmpscpp.h>

global a("a");

int main() {
mstring i,j,k,m,n,p,q,r,s;
for (i=1; i<4; i++) {
for (j=1; j<4; j++) {
for (k=1; k<4; k++) {
for (m=1; m<4; m++) {
for (n=1; n<4; n++) {
for (p=1; p<4; p++) {
for (q=1; q<4; q++) {
for (r=1; r<4; r++) {
for (s=1; s<4; s++) {
      a("a"||i, "b"||j, "c"||k, "d"||m, "e"||n, "f"||p, "g"||q, "h"||r, "i"||s)="";
      }
} } } } } } } }
a().TablePrint();
}

a1 b1 c1 d1 e1 f1 g1 h1 i1
a1 b1 c1 d1 e1 f1 g1 h1 i2
a1 b1 c1 d1 e1 f1 g1 h1 i3
a1 b1 c1 d1 e1 f1 g1 h2 i1
a1 b1 c1 d1 e1 f1 g1 h2 i2
. . .
a3 b3 c3 d3 e3 f3 g3 h2 i2
a3 b3 c3 d3 e3 f3 g3 h2 i3
a3 b3 c3 d3 e3 f3 g3 h3 i1
a3 b3 c3 d3 e3 f3 g3 h3 i2
a3 b3 c3 d3 e3 f3 g3 h3 i3

int $test

Returns integer 1 or 0 indicating the success or failure of certain previous commands. Some, but not all, commands set "$test".

mstring Token(mstring)
string Token(string)
char * Token(const char *)

Returns a pointer to the next word token from the input string. This function is passed a line of text. For each subsequent call, a pointer is returned to the next lexical word from the string with punctuation removed. When there are no more words, a pointer to the empty string is returned. After the pointer to the empty string is returned (or when initially called), the function will accept and store a new line of text. Lexical words are alphanumeric strings delimited by punctuation and white space.
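The calling pattern can be sketched without the toolkit. Tokenizer below is a hypothetical stand-in in standard C++ that keeps its state in an object; it treats every run of alphanumeric characters as a word and everything else as a delimiter, which is an interpretation of the description rather than the toolkit's exact rule.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Token(): call next() repeatedly with the same
// line until it returns the empty string, then pass a new line.
class Tokenizer {
public:
    std::string next(const std::string& line) {
        if (!loaded_) {              // first call, or previous line exhausted
            buffer_ = line;
            pos_ = 0;
            loaded_ = true;
        }
        // Skip punctuation and white space.
        while (pos_ < buffer_.size() &&
               !std::isalnum(static_cast<unsigned char>(buffer_[pos_]))) ++pos_;
        std::size_t begin = pos_;
        // Collect one alphanumeric word.
        while (pos_ < buffer_.size() &&
               std::isalnum(static_cast<unsigned char>(buffer_[pos_]))) ++pos_;
        if (begin == pos_) {
            loaded_ = false;         // no more words: accept a new line next time
            return "";
        }
        return buffer_.substr(begin, pos_ - begin);
    }
private:
    std::string buffer_;
    std::size_t pos_ = 0;
    bool loaded_ = false;
};
```

For the line "Hello, world! 42" successive calls give "Hello", "world", "42" and then the empty string.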

void Transpose(mstring in, mstring out)
void Transpose(char * in, char * out)
void Transpose(string in, string out)
void global::Transpose(global out)

For the non-member versions, the first two dimensions of the array named in "in" are transposed into the array named in "out". The array names are specified as unsubscripted null terminated character strings. For the member version, the invoking object is transposed and the result is placed in "out". In all versions, "out" is deleted before the operation commences. Example:

void global::TreePrint([int indt [, const char indtchr]]);

The invoking object is printed as an indented tree. If one argument is present (indt), it is the amount of indentation. If the second argument is present (indtchr) it is the character used in the indentation. The default indentation character is blank and the default amount of indentation is one. Example:

#include <mumpsc/libmpscpp.h>

global d("d");

int main() {
for (int i=1; i<6; i++)
      for (int j=1; j<6; j++)
            for (int k=1; k<6; k++) {
                  string a=cvt(i);
                  string b=cvt(j);
                  string c=cvt(k);
                  d(a)=rand()%100;
                  d(a,b)=rand()%100;
                  d(a,b,c)=rand()%100;
                  }
d().TreePrint(1,'.');
return 0;
}

Yields

1=82
.1=59
..1=77
..2=35
..3=49
..4=27
..5=63
.2=67
..1=26
..2=11
..3=29
..4=62
..5=35
.3=19
..1=22
..2=67
..3=11
..4=73
..5=84
.4=96
..1=24
..2=13
..3=80
..4=62
..5=81
.5=45
..1=84
..2=5
..3=13
..4=95
..5=14

2=68
.1=54
..1=64
..2=87
..3=78
..4=3
..5=99
.2=78
..1=76
..2=12
..3=94
..4=70
..5=67
.3=44
..1=2
..2=52
..3=80
..4=65
..5=19
.4=53
..1=31
..2=71
..3=9
..4=56
..5=86
.5=8
..1=83
..2=28
..3=29
..4=70
..5=15

3=72
.1=28
..1=96
..2=45
..3=21
..4=88
..5=41
.2=59
..1=0
..2=24
..3=56
..4=27
..5=36
.3=93
..1=37
..2=7
..3=58
..4=37
..5=18
.4=4
..1=11
..2=76
..3=63
..4=6
..5=18
.5=25
..1=69
..2=96
..3=70
..4=99
..5=44

4=66
.1=48
..1=39
..2=69
..3=64
..4=55
..5=11
.2=30
..1=99
..2=68
..3=11
..4=1
..5=78
.3=62
..1=36
..2=22
..3=16
..4=24
..5=24
.4=94
..1=52
..2=50
..3=73
..4=30
..5=60
.5=84
..1=81
..2=59
..3=68
..4=26
..5=40

5=79
.1=72
..1=76
..2=7
..3=79
..4=12
..5=59
.2=21
..1=10
..2=6
..3=72
..4=19
..5=4
.3=69
..1=40
..2=28
..3=84
..4=24
..5=96
.4=98
..1=84
..2=72
..3=85
..4=40
..5=13
.5=69
..1=24
..2=81
..3=32
..4=4
..5=73

int global::UnLock()

UnLock() removes a lock from the designated node.

int Xecute(char * command)
int xecute(mstring command)
int xecute(string command)
int xecute(char * command)

These functions invoke the Mumps interpreter which executes command. Returns 1 if successful, 0 otherwise. The macro Xecute() is a special case. It is used with character string constants. It will pre-process a character string constant command and insert the backslash escape character prior to any embedded quotes, thus permitting more normal appearing text (see the similar macro command()). Global arrays created by C++ programs are fully compatible with the interpreter except that, in the interpreter, global array names are prefixed by the circumflex character in the normal manner. Examples:

    mstring c;
    Xecute("for  s i=$Order(^a(i)) q:i="" s sum=sum+^a(i)");
    c="for i=1:1:10 write i,!";
    xecute(c);
    c=command("for i=1:1:10 write "ans=",i,!");
    xecute(c);

Appendix A
Code Examples

Appendix B

Perl Compatible Regular Expression Library License

Programs written with the MDH may call upon the Perl Compatible Regular Expression Library. In some cases, this library is distributed with the Mumps Compiler. The PCRE Library is not covered by the GNU GPL/LGPL Licenses but, rather, by the license shown below. The following is the PCRE license:

PCRE LICENCE
------------
PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language.
Written by: Philip Hazel 
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.
Copyright (c) 1997-2001 University of Cambridge
Permission is granted to anyone to use this software for any purpose on any
computer system, and to redistribute it freely, subject to the following
restrictions:
1. This software is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. The origin of this software must not be misrepresented, either by
   explicit claim or by omission. In practice, this means that if you use
   PCRE in software which you distribute to others, commercially or
   otherwise, you must put a sentence like this
     Regular expression support is provided by the PCRE library package,
     which is open source software, written by Philip Hazel, and copyright
     by the University of Cambridge, England.
   somewhere reasonably visible in your documentation and in any relevant
   files or online help data or similar. A reference to the ftp site for
   the source, that is, to
     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
   should also be given in the documentation.
3. Altered versions must be plainly marked as such, and must not be
   misrepresented as being the original software.
4. If PCRE is embedded in any software that is released under the GNU
   General Purpose Licence (GPL), or Lesser General Purpose Licence (LGPL),
   then the terms of that licence shall supersede any condition above with
   which it is incompatible.
The documentation for PCRE, supplied in the "doc" directory, is distributed
under the same terms as the software itself.
End

Appendix C

Using Perl Regular Expressions
Author: Matthew Lockner

In addition to Mumps 95 pattern matching using the '?' operator, it is also possible to perform pattern matching against Perl regular expressions via the perlmatch function. Support for this functionality is provided by the Perl-Compatible Regular Expressions library (PCRE), which supports a majority of the functionality found in Perl's regular expression engine.

The perlmatch function works in a somewhat similar fashion to the '?' operator. It is provided with a subject string and a Perl pattern against which to match the subject. The result of the function is boolean and may be used in boolean expression contexts such as the "If" statement.

Some subtleties that differ significantly from Mumps pattern matching should be noted:

A Mumps match must consume the entire subject string: the match fails if any characters are left unmatched, even when the pattern matches an initial segment of the subject. With perlmatch, it is sufficient for the entire Perl pattern to match an initial segment of the subject string.
The perlmatch function has the side effect of creating variables in the local symbol table to hold backreferences, the equivalent of $1, $2, $3, ... in Perl. Up to nine backreferences are currently supported, and they are accessed through the same naming scheme as in Perl ($1 through $9). These variables remain defined until the next call to perlmatch, at which point they are replaced by the backreferences captured from that invocation. Backreferences that a match does not capture are cleared between invocations; that is, if a match operation captured five backreferences, then $6 through $9 will contain the null string.
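The two differences above can be sketched outside of Mumps. The following Python fragment is only an illustration of the semantics described here, not the toolkit's implementation: the helper name perlmatch_sketch is hypothetical, Python's re module stands in for PCRE, and the dictionary of "$1" through "$9" stands in for the variables that perlmatch creates in the local symbol table.

```python
import re


def perlmatch_sketch(subject, pattern):
    """Hypothetical stand-in for perlmatch: returns (matched, backrefs).

    backrefs always holds "$1".."$9"; any group that captured nothing
    is the empty string, mirroring the "cleared between invocations" rule.
    """
    # re.match anchors at the start of the subject but may stop before
    # its end, which models "matches an initial segment".
    m = re.match(pattern, subject)
    backrefs = {f"${i}": "" for i in range(1, 10)}
    if m is not None:
        for i in range(1, min(m.re.groups, 9) + 1):
            backrefs[f"${i}"] = m.group(i) or ""
    return m is not None, backrefs


def whole_string_match(subject, pattern):
    """Models the Mumps '?' rule: the pattern must consume the whole subject."""
    return re.fullmatch(pattern, subject) is not None


if __name__ == "__main__":
    ok, refs = perlmatch_sketch("abc123", r"([a-z]+)")
    print(ok, refs["$1"], repr(refs["$2"]))
    print(whole_string_match("abc123", r"([a-z]+)"))
```

Each call builds a fresh set of backreferences, so values captured by an earlier call never leak into a later one.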

Examples

This program asks the user to input a telephone number. If the data entered looks like a valid telephone number, it extracts and prints the area code portion using a backreference; otherwise, it prints a failure message and exits.

   Zmain
   Write "Please enter a telephone number:",!
   Read phonenum
   If $$^perlmatch(phonenum,"^(1-)?(\(?\d{3}\)?)?(-| )?\d{3}-?\d{4}$") Do
   . Write "+++ This looks like a phone number.",!
   . Write "The area code is: ",$2,!
   Else  Do
   . Write "--- This didn't look like a phone number.",!
   Halt

The output of several sample runs of the program follows:

Please enter a telephone number:
1-123-555-4567
+++ This looks like a phone number.
The area code is: 123
Please enter a telephone number:
(123)-555-1234
+++ This looks like a phone number.
The area code is: (123)
Please enter a telephone number:
(123) 555-0987
+++ This looks like a phone number.
The area code is: (123)

As in Perl, sections of the regular expression contained in parentheses define what is contained in the backreferences following a match operation. The backreference variables are named in a left-to-right order with respect to the expression, meaning that $1 is assigned the portion matched against the leftmost parenthesized section of the regular expression, with further references assigned names in increasing order. For a much more in-depth treatment of the subject of Perl regular expressions, refer to the perlre manpage distributed with the Perl language (also widely available online).
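To make the left-to-right numbering concrete, the sketch below applies the telephone pattern from the example with Python's re module, which is used here only as a convenient stand-in for PCRE. The function name area_code is hypothetical. Group 1 is the optional "1-" prefix, group 2 the optional area code, and group 3 the optional separator.

```python
import re

# The same pattern as in the Mumps example above.
PHONE = re.compile(r"^(1-)?(\(?\d{3}\)?)?(-| )?\d{3}-?\d{4}$")


def area_code(number):
    """Return the text of group 2, "" if it did not take part, None on no match."""
    m = PHONE.match(number)
    if m is None:
        return None
    # Groups are numbered by the position of their opening parenthesis,
    # so the area code is the second group.
    return m.group(2) or ""


if __name__ == "__main__":
    for sample in ("1-123-555-4567", "(123)-555-1234", "555-1234"):
        print(sample, "->", repr(area_code(sample)))
```

A seven-digit number with no area code still matches, and the second group is then empty, which corresponds to $2 holding the null string.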

Appendix E

Mumps 95 Pattern Matching
Author: Matthew Lockner

Mumps 95 compliant pattern matching (the '?' operator) is implemented in this compiler as given by the following grammar:

 pattern         ::= {pattern_atom}
 pattern_atom    ::= count pattern_element
 count           ::= int | '.' | '.' int
                   | int '.' | int '.' int
 pattern_element ::= pattern_code {pattern_code} | string | alternation
 pattern_code    ::= 'A' | 'C' | 'E' | 'L' | 'N' | 'P' | 'U'
 alternation     ::= '(' pattern_atom {',' pattern_atom} ')'

The largest difference between the current and previous standards is the introduction of the alternation construct, an extension that works as in other popular regular expression implementations. It allows any one of several pattern fragments to match a given portion of the subject text.

A string literal must be quoted. Also note that alternations may contain only pattern atoms and not full patterns; while this is a possible shortcoming, it is in accordance with the standard. It is a trivial matter to extend alternations to contain full patterns, and this may be implemented upon sufficient demand.

Pattern matching is supported by the Perl-Compatible Regular Expressions library (PCRE). Mumps patterns are translated by a recursive-descent parser in the Mumps library into a form consistent with Perl regular expressions, and PCRE then does the actual work of matching. Internally, much of this translation is simple character-level transliteration (substituting '|' for the comma in alternation lists, for example). Pattern code sequences are supported using the POSIX character classes available in PCRE and are mostly intuitive, with the possible exception of 'E', which is translated to [[:print:][:cntrl:]]. Currently, this construct should cover the 7-bit ASCII character set (lower ASCII).
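The translation described above can be sketched as a small recursive-descent translator that follows the grammar given earlier. This is an illustration under stated assumptions, not the code in the Mumps library: the names translate and mumps_match are hypothetical, Python's re module stands in for PCRE, and because re has no POSIX character classes the pattern codes are mapped to plain ASCII ranges chosen for this sketch. Doubled quotes inside string literals are not handled.

```python
import re

# Assumed ASCII stand-ins for the pattern codes; the toolkit itself is
# described as using PCRE's POSIX classes instead.
CODE_CLASSES = {
    "A": "A-Za-z",            # alphabetic
    "C": r"\x00-\x1f\x7f",    # control characters
    "E": r"\x00-\x7f",        # everything in 7-bit ASCII
    "L": "a-z",               # lower case
    "N": "0-9",               # numeric
    "P": r"!-/:-@\[-`{-~",    # punctuation (space excluded, an assumption)
    "U": "A-Z",               # upper case
}


def translate(pattern):
    """Translate a Mumps pattern into a Python regular expression string."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else ""

    def digits():
        nonlocal pos
        start = pos
        while peek().isdigit():
            pos += 1
        return pattern[start:pos]

    def count():
        # count ::= int | '.' | '.' int | int '.' | int '.' int
        nonlocal pos
        low = digits()
        if peek() != ".":
            if not low:
                raise ValueError(f"count expected at offset {pos}")
            return "{%s}" % low
        pos += 1
        high = digits()
        if not low and not high:
            return "*"
        return "{%s,%s}" % (low or "0", high)

    def element():
        # pattern_element ::= pattern_code {pattern_code} | string | alternation
        nonlocal pos
        if peek() == '"':
            end = pattern.index('"', pos + 1)
            literal = pattern[pos + 1:end]
            pos = end + 1
            return "(?:%s)" % re.escape(literal)
        if peek() == "(":
            pos += 1
            branches = [atom()]
            while peek() == ",":   # the comma becomes '|' in the output
                pos += 1
                branches.append(atom())
            if peek() != ")":
                raise ValueError(f"')' expected at offset {pos}")
            pos += 1
            return "(?:%s)" % "|".join(branches)
        codes = ""
        while peek() in CODE_CLASSES:
            codes += CODE_CLASSES[peek()]
            pos += 1
        if not codes:
            raise ValueError(f"pattern element expected at offset {pos}")
        return "[%s]" % codes

    def atom():
        # pattern_atom ::= count pattern_element; the regex puts the count last
        repeat = count()
        return element() + repeat

    parts = []
    while pos < len(pattern):
        parts.append(atom())
    return "".join(parts)


def mumps_match(subject, pattern):
    """A Mumps match must consume the whole subject, hence fullmatch."""
    return re.fullmatch(translate(pattern), subject) is not None
```

For example, mumps_match("555-1234", '3N1"-"4N') succeeds, while the same pattern rejects "555-12345" because the whole subject must be consumed.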

Due to the heavy string-handling requirements of the pattern translation process, this module uses a separate set of string-handling functions built on top of the C standard string functions. These functions use no dynamic memory allocation; all operations work on fixed-length buffers whose length is given by the constant STR_MAX in sysparms.h. If an operation overflows a buffer during the execution of a compiled Mumps binary, a diagnostic is written to stderr and the program terminates. If such termination occurs too frequently, simpl
