Chapter 1. Developing GUS Plugins

Table of Contents

GUS Plugins
Supported versus Community Plugins
The Plugin API
The Plugin Standard
Portability
Plugin Naming
GUS Primary Keys
Application Specific Tables
Command Line Arguments
Documentation
Use of GUS Objects
Database Access
Logging
Standard Output
Commenting
Handling Errors
Failure Recovery and Restart
Opening Files
Caching to Minimize Database Access
Regular Expressions
Variable and Method Names
Methods
Syntax
Application Specific Controlled Vocabularies
Assigning an External Database Release Id

GUS Plugins

GUS plugins are Perl programs that load data into GUS. They are written using the Plugin API (the section called “The Plugin API”). You may use plugins that are bundled with the GUS distribution or you may write your own.

The standard GUS practice is to use only plugins, not straight SQL or bulk loading, to load the database. The reason is that plugins:

  • track the data that is loaded

  • copy any updated or deleted rows to "version" tables that store a history of the changes

  • are known programs that can be scrutinized and used again

  • have a standard documentation process so that they are easily understood

  • use the Plugin API and so are easier to write than regular scripts.

Supported versus Community Plugins

The distribution of GUS comes with two types of plugins:

  • Supported plugins:

    • are confirmed to work

    • are portable

    • are useful to sites other than the site that developed the plugin

    • meet the Plugin Standard described below

  • Community plugins:

    • are contributed by the staff at CBIL and any other plugin developers

    • have not been reviewed with respect to the criteria for being supported

When you begin writing your plugin, use an existing supported plugin as a guideline or template. Supported plugins are found in $PROJECT_HOME/GUS/Supported/plugin/perl.

The Plugin API

Plugin.pm: The Plugin Superclass

GUS plugins are subclasses of GUS::PluginMgr::Plugin. The public subroutines in Plugin.pm (private ones begin with an underscore) constitute the Plugin API. GUS also provides Perl objects for each table and view in the GUS schema. These are also part of the API (the section called “The Plugin API”).

The plugin's package and @ISA statements

All plugins must declare their package, using Perl's package statement. The package name of a plugin is derived as follows:

ProjectName::ComponentName::Plugin::PluginName

Plugins must also declare that they are subclasses of Plugin.pm, using Perl's @ISA array. The first lines of a plugin will look like this:

package GUS::Supported::Plugin::SubmitRow;

@ISA = qw(GUS::PluginMgr::Plugin);

Plugin Initialization

Plugins are objects and so must have a constructor. This constructor is the new() method. The new() method has exactly two tasks to accomplish: constructing the object (and returning it), and initializing it. Construction of the object follows standard Perl practice. Initialization is handled by the Plugin.pm superclass method initialize(). See the section called “The Plugin API” for details about that method.

Example 1.1. A Sample new() method

sub new {
    my ($class) = @_;
    my $self = {};

    bless($self,$class);

    $self->initialize({
        requiredDbVersion => 3.5,
        cvsRevision => '$Revision: 3009 $',
        name => ref($self),
        argsDeclaration => $argsDeclaration,
        documentation => $documentation
    });

    return $self;
}

The $Revision: 3009 $ string is a CVS or Subversion keyword. When the plugin is checked into source control, the repository substitutes the file's revision into that keyword. The keyword must be in single quotes to prevent Perl from interpreting $Revision as a variable.
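As a small illustration of why the single quotes matter: inside single quotes Perl performs no interpolation, so the keyword is stored exactly as written and the revision number can be recovered from it later. The helper revisionNumber() below is hypothetical and is not part of Plugin.pm.

```perl
use strict;
use warnings;

# Single quotes: no interpolation, so the keyword survives intact.
my $cvsRevision = '$Revision: 3009 $';

# Hypothetical helper (not part of Plugin.pm): pull the numeric
# revision out of the keyword string, or return undef if absent.
sub revisionNumber {
    my ($keyword) = @_;
    return $keyword =~ /\$Revision:\s*([\d.]+)\s*\$/ ? $1 : undef;
}

print revisionNumber($cvsRevision), "\n";
```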

Keeping your Plugin Current as GUS Changes

If you follow the pattern used by supported plugins, you will only ever need to change one line in the new() method. As you can probably tell, initialize() takes one argument, a reference to a hash that contains a set of parameter values. The one you will need to change is requiredDbVersion. As the GUS schema evolves, you will need to review your plugin to make sure it is compatible with the latest version of GUS, upgrading it if not. When it is compatible with the new version of GUS, update requiredDbVersion to that version of GUS.

Declaring the plugin's command line arguments

In the example above (Example 1.1, “A Sample new() method”), the line

argsDeclaration => $argsDeclaration,

provides to the initialize() method a reference to an array, $argsDeclaration, that declares what command line arguments the plugin will offer. When you look at a supported plugin you will see the $argsDeclaration variable being set like this:

Example 1.2. Defining Command Line Arguments

my $argsDeclaration = [
    tableNameArg({name           => 'tablename',
                  descr          => 'Table to submit to, eg, Core::UserInfo',
                  reqd           => 1,
                  constraintFunc => undef,
                  isList         => 0,
    }),

    stringArg({name           => 'attrlist',
               descr          => 'List of attributes to update (comma delimited)',
               reqd           => 1,
               constraintFunc => undef,
               isList         => 1,
    }),

    enumArg({name           => 'type',
             descr          => 'Dimension of attributes (comma delimited)',
             reqd           => 1,
             constraintFunc => undef,
             enum           => "one, two, three",
             isList         => 1,
    }),

    fileArg({name           => 'matrixFile',
             descr          => 'File containing weight matrix',
             reqd           => 1,
             constraintFunc => \&checkFileFormat,
             mustExist      => 0,
             isList         => 0,
    }),
];

If you look carefully at the list above you will notice that each element of it is a call to a method such as stringArg(). These are methods of Plugin.pm and they all return subclasses of GUS::PluginMgr::Args::Arg. In the case of stringArg(), it returns GUS::PluginMgr::Args::StringArg. All you really need to know is that there is a set of methods available for you to use when declaring your command line arguments. That is, the argsDeclaration parameter of the initialize() method expects a reference to an array of Arg objects. You can learn about them in detail in the Plugin API (the section called “The Plugin API”).

The Arg objects are very powerful. They parse the command line, validate the input, handle list values, deal with optional arguments and default values, and provide for documentation of the arguments. Each Arg object validates the input in two ways. First, it applies its standard validation. For example, a FileArg confirms that the input is a file, and throws an error otherwise. Second, if you provide a constraintFunc, it runs that as well, throwing an error if the value violates the constraint.
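A constraint function might look like the following sketch. Example 1.2 names checkFileFormat, but its body is not shown in this chapter, so everything here is invented for illustration. In particular, the calling convention (the function receives the plugin object and the argument value, and returns undef on success or a string describing the problem) and the file format (whitespace-separated numeric weights) are assumptions; consult Plugin.pm and the Arg classes for the actual contract.

```perl
use strict;
use warnings;

# ASSUMED contract, not confirmed by this chapter: called as
# ($plugin, $value); returns undef if the value is acceptable,
# otherwise a string describing the problem.
sub checkFileFormat {
    my ($plugin, $fileName) = @_;

    open(my $fh, '<', $fileName)
        or return "cannot open '$fileName': $!";

    my $lineNum = 0;
    while (my $line = <$fh>) {
        $lineNum++;
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines

        # Hypothetical format: whitespace-separated numeric weights.
        foreach my $field (split ' ', $line) {
            unless ($field =~ /^-?\d+(?:\.\d+)?$/) {
                close($fh);
                return "line $lineNum of '$fileName': '$field' is not a number";
            }
        }
    }
    close($fh);
    return;    # undef in scalar context: no problem found
}
```

Returning a message rather than dying inside the function leaves the decision about how to report the error to the caller; whether the real Arg classes expect this is, again, an assumption.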

Declaring the Plugin's Documentation

In a way that parallels the declaration of command line arguments, the initialize() method also expects a reference to a hash that provides standardized fields documenting the plugin (Example 1.1, “A Sample new() method”):

documentation => $documentation,

Here is a code snippet that demonstrates the standard way $documentation is set:

Example 1.3. Defining Plugin Documentation

my $purposeBrief = <<PURPOSE_BRIEF;
Load blast results from a condensed file format into the DoTS.Similarity table.
PURPOSE_BRIEF

my $purpose = <<PLUGIN_PURPOSE;
Load a set of BLAST similarities from a file in the form generated by the blastSimilarity command.
PLUGIN_PURPOSE

my $tablesAffected = 
    [ ['DoTS::Similarity', 'One row per similarity to a subject'],
      ['DoTS::SimilaritySpan', 'One row per similarity span (HSP)'],
    ];

my $tablesDependedOn =
    [
    ];

my $howToRestart = <<PLUGIN_RESTART;
Use the restartAlgInvs argument to provide a list of algorithm_invocation_ids that represent 
previous runs of loading these similarities. The algorithm_invocation_id of a run of this 
plugin is logged to stderr. If you don't have that information for a previous run or runs,  
you will have to poke around in the Core.AlgorithmInvocation table and others to find your 
runs and their algorithm_invocation_ids.
PLUGIN_RESTART

my $failureCases = <<PLUGIN_FAILURE_CASES;
PLUGIN_FAILURE_CASES

my $notes = <<PLUGIN_NOTES;
The definition lines of the sequences involved in the BLAST (both query and subject) must 
begin with the na_sequence_ids of those sequences. The standard way to achieve that is to
first load the sequences into GUS, using the InsertFastaSequences plugin, and then to 
extract them into a file with the dumpSequencesFromTable.pl command. That command places 
the na_sequence_id of the sequence as the first thing in the definition line.
PLUGIN_NOTES

my $documentation = { purpose=>$purpose,
                      purposeBrief=>$purposeBrief,
                      tablesAffected=>$tablesAffected,
                      tablesDependedOn=>$tablesDependedOn,
                      howToRestart=>$howToRestart,
                      failureCases=>$failureCases,
                      notes=>$notes
                     };

When you look at this example, you will see that a number of variables, such as $purposeBrief and $tablesAffected, are being set. They are used as values of the hash called $documentation. $documentation is in turn passed as a value to the initialize() method. You will also notice that Perl's here-document (heredoc) syntax is used. The setting of the variables begins with, eg, <<PLUGIN_PURPOSE and ends with, eg, PLUGIN_PURPOSE. This is Perl's way of allowing you to create paragraph-style strings without worrying about quoting or escape sequences such as \n. Note that a heredoc with an unquoted tag still interpolates variables, just as a double-quoted string does.
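The following sketch shows how the heredoc tag controls interpolation. The variable $table and its value are hypothetical and serve only to make the difference visible.

```perl
use strict;
use warnings;

my $table = 'DoTS.Similarity';    # hypothetical value, for illustration only

# Unquoted tag: behaves like a double-quoted string, so $table is interpolated.
my $interpolated = <<PURPOSE_BRIEF;
Load blast results into the $table table.
PURPOSE_BRIEF

# Single-quoted tag: the text is taken literally, dollar signs included.
my $literal = <<'PURPOSE_BRIEF';
Load blast results into the $table table.
PURPOSE_BRIEF

print $interpolated;    # Load blast results into the DoTS.Similarity table.
print $literal;         # Load blast results into the $table table.
```

A single-quoted tag is the safer choice when the documentation text contains characters such as $ or @ that should appear verbatim.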

The documentation is shown to the user when the user passes the help flag or makes a command line error.

The documentation is formatted using Perl's documentation generation facility, pod. This means that you can include simple pod directives in your documentation to, say, emphasize a word. Run the command perldoc perlpod for more information.
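As a rough preview of what pod markup in a documentation string looks like, the sketch below embeds B<> (bold) and C<> (code) directives in a notes string and renders it with the core Pod::Text module. The wording of the note is invented, and Pod::Text is used here only as a stand-in; this chapter does not say which formatter the plugin machinery itself uses.

```perl
use strict;
use warnings;
use Pod::Text;

# Hypothetical notes text containing simple pod formatting codes.
my $notes = <<'PLUGIN_NOTES';
The input file B<must> be sorted by C<na_sequence_id>.
PLUGIN_NOTES

# Wrap the text in a minimal pod document and render it as plain text.
my $pod      = "=pod\n\n$notes\n=cut\n";
my $parser   = Pod::Text->new(width => 72);
my $rendered = '';
$parser->output_string(\$rendered);
$parser->parse_string_document($pod);

print $rendered;
```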

The run() Method

Plugins are run by a command called ga (which stands for "GUS application"). ga constructs the plugin (by calling its new() method) and then runs the plugin by calling its run() method.

The purpose of the run() method is to provide at a glance the structure of the plugin. It should be very concise and under no circumstances be longer than one screen. A good practice, when reasonable, is for the run() method to call high level methods that return the objects to be submitted to the database, and then to submit them in the run() method. This way, a reader of the run() method will know just what is being written to the database, which is the main purpose of a plugin.

The run() method is expected to return a string describing the result of running the plugin. An example would be "inserted 3432 sequences".
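A run() method in this style might be shaped like the sketch below. To keep it self-contained, the plugin class does not inherit from GUS::PluginMgr::Plugin and the row class is a stub standing in for a GUS table object; both package names, the makeSequences() helper, and the submit() call on the stub are assumptions made for illustration.

```perl
package My::Hypothetical::Plugin;    # stand-in; a real plugin subclasses GUS::PluginMgr::Plugin
use strict;
use warnings;

sub new { my ($class) = @_; return bless {}, $class; }

# run() only outlines the work: build the objects, submit them, report.
sub run {
    my ($self) = @_;

    my $sequences = $self->makeSequences();    # high-level method returns the objects
    my $count = 0;
    foreach my $sequence (@$sequences) {
        $sequence->submit();                   # the database write is visible here
        $count++;
    }
    return "inserted $count sequences";
}

# Hypothetical helper; a real plugin would parse its input and build GUS objects.
sub makeSequences {
    my ($self) = @_;
    return [ map { My::Hypothetical::Row->new($_) } qw(ACGT GGCC TTAA) ];
}

package My::Hypothetical::Row;       # stub standing in for a GUS table object
sub new    { my ($class, $seq) = @_; return bless { seq => $seq, submitted => 0 }, $class; }
sub submit { my ($self) = @_; $self->{submitted} = 1; return 1; }

package main;
print My::Hypothetical::Plugin->new()->run(), "\n";
```

Keeping the submit loop in run() and the object construction in a helper is what lets a reader see at a glance what is written to the database.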

The Pointer Cache

The pointer cache is a somewhat infamous component of the GUS object layer. It is a memory management facility that was designed to steer around poor garbage collection in Perl (in 2000). Whether or not it is still needed is another matter; for now it remains part of the object layer. The pointer cache is a way for the plugin to re-use objects that have been allocated but are no longer in active use. Because Perl was not properly garbage collecting objects when they were no longer referred to, the memory footprint of plugins was getting huge.

As a plugin developer, what you need to know is that at points in your code where you no longer need any of the GUS objects that you have created (typically at the bottom of your outermost loop), you should call the Plugin.pm method undefPointerCache(). This method clears out the cache.

If the default capacity (10000) is not enough to hold all the objects you are creating in one cycle through your logic, you can augment its size with the Plugin.pm method setPointerCacheSize().
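The placement of these calls can be sketched as follows. The class below is a stub so that the example runs on its own: in a real plugin, undefPointerCache() and setPointerCacheSize() are inherited from Plugin.pm. The package name, the stub bodies, and the assumption that setPointerCacheSize() takes the new capacity as its single argument are all illustrative.

```perl
package My::Hypothetical::LoaderPlugin;    # stub; the real methods live in Plugin.pm
use strict;
use warnings;

sub new { my ($class) = @_; return bless { cacheClears => 0, cacheSize => 10000 }, $class; }

# Stubs that only record the calls, so the sketch can run without GUS.
sub setPointerCacheSize { my ($self, $size) = @_; $self->{cacheSize} = $size; return; }
sub undefPointerCache   { my ($self) = @_; $self->{cacheClears}++; return; }

sub run {
    my ($self, $records) = @_;

    $self->setPointerCacheSize(50000);    # assumed signature: new capacity

    my $count = 0;
    foreach my $record (@$records) {
        # ... create and submit the GUS objects for this record ...
        $count++;
        $self->undefPointerCache();       # bottom of the outermost loop
    }
    return "processed $count records";
}

package main;
my $plugin = My::Hypothetical::LoaderPlugin->new();
print $plugin->run([1 .. 3]), "\n";
```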