The Plugin Standard

Portability

A supported plugin must be useful to sites other than the site that developed it. It also must run at other sites without modification.

Plugin Naming

  • Plugin names begin with one of four verbs:

    • insert if the plugin inserts only

    • delete if the plugin deletes only

    • update if the plugin updates only

    • load if the plugin does any two or more of insert, delete or update

  • Plugin names are concise

    • for example, a plugin named InsertNewSequences is not concise because Insert and New are redundant

  • Plugin names are precise

    • for example, a plugin named InsertData is far too general. The name should reflect the type of data inserted

    • if a plugin expects exactly one file type, that file type should be in the name. For example, InsertFastaSequences.

  • Plugin names are accurate

    • for example, a plugin named InsertExternalSequences is inaccurate if it can also insert internally generated sequences. A better name would be InsertSequences.

GUS Primary Keys

Plugins never directly use (hard-code) GUS primary keys, either in the body of the code or for command line argument values. Instead they use semantically meaningful alternate keys. The reason that plugins cannot use primary keys in their code is that doing so makes the plugin site specific, not portable. The reason they cannot use primary keys as values in their command line arguments is that plugins are often incorporated as steps in a pipeline (using the GUS Pipeline API described elsewhere). The pipelines should be semantically transparent so that people both on site and externally who look at the pipeline will understand it.

Application Specific Tables

Some sites augment GUS with their own application specific tables. These are not permitted in supported plugins.

Command Line Arguments

  • The name of the argument should be concise and precise

  • The Plugin API provides a means for you to declare arguments of different types, such as integers, strings and files (see the section called “Declaring the plugin's command line arguments”). Use the most appropriate type. For example, don't use a string for a file argument.

  • Use camel caps (eg matrixFile) not underscores (eg matrix_file) in the names of the arguments.

Documentation

The Plugin API provides a means for you to document the plugin and its arguments. Be thorough in your documentation. See the section called “Declaring the Plugin's Documentation”.

Use of GUS Objects

The GUS object layer assists in writing clean plugin code. The guidelines for using GUS objects are:

  • When writing data to the database, use GUS objects when possible. Avoid using SQL directly.

  • When forming a relationship between two objects, use the setParent() or setChildren() method. Do not explicitly set the foreign keys of the objects.

Database Access

The GUS objects are good at writing data to the database. That is because they allow you to build up a tree structure of objects and then to simply submit the root. However they are not as useful at reading the database. You can only read one object at a time (more on this in the Guide to GUS Objects). For this reason, you will need to use SQL to efficiently read data from the database as needed by your plugin.

This is how a typical database access looks:

Example 1.4. Typical Database Access

my $sql = 
  "SELECT $self->{primaryKeyColumn}, $self->{termColumn} 
   FROM $self->{table}";

my $queryHandle = $self->getQueryHandle();
my $statementHandle = $queryHandle->prepareAndExecute($sql);

my %vocabFromDb;

while (my ($primaryKey, $term) = $statementHandle->fetchrow_array()) {
    $vocabFromDb{$term} = $primaryKey;
}

The SQL is formatted on multiple lines for clarity (Perl allows this), and the SQL keywords are upper case. The Plugin API provides a method to easily get a query handle, returning a GUS::ObjRelP::DbiDbHandle. That object provides an easy-to-use method that prepares and executes the SQL.

Logging

The Plugin API offers a set of logging methods. They print to standard error. Use these and no other means of writing out logging messages.

Standard Output

Do not write to standard output. If your plugin generates data (such as a list of IDs already loaded, for restart) write it to a file.

Commenting

Less is more with commenting. Comment only the non-obvious. For example, do not comment a method called getSize() with a comment # gets the size. Most methods should need no commenting, as they should be self-explanatory. In many cases, if you find that you need to comment because something non-obvious needs explaining, that is a red flag indicating that your code might need simplification.

Handling Errors

There is only one permissible way to handle errors: call die(). Never log errors or write them to standard error or standard output. Doing that masks the error (the logs are not read reliably), so that in effect the plugin fails silently. Causing the plugin to die forces the user of the plugin or its developer to fix the problem.

When you call die, give it an informative message, including the values of the suspicious variables. Surround the variables in single quotes so that white space errors will be apparent. Provide enough information so that the user can track down the source of the problem in the input files.
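As an illustration of such a message, here is a minimal sketch. The subroutine parseSequenceLine, the tab-delimited input layout and the file name are hypothetical, invented only for this example. It expects a line that has already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one chomped, tab-delimited input line of the
# form "sourceId<TAB>length" and die with a precise message if it is malformed.
sub parseSequenceLine {
  my ($line, $lineNumber, $inputFile) = @_;

  my ($sourceId, $length) = split(/\t/, $line);
  if (!defined($sourceId)) {
    $sourceId = '';
  }
  if (!defined($length)) {
    $length = '';
  }

  # Single quotes around the values expose stray white space
  if ($length !~ /^\d+$/) {
    die "Invalid sequence length '$length' for source id '$sourceId' "
      . "at line $lineNumber of file '$inputFile'\n";
  }

  return ($sourceId, $length);
}
```

With a length field of " 120" (leading space), the quotes in the message make the white space problem visible, and the line number and file name tell the user where to look.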

If you would like your program to continue past errors, then dedicate a file or directory to hold descriptions of the errors. The user will know to look there for a list of inputs that caused problems. Typically you use this strategy if you expect the input to be huge and don't want to abort the run because of a few errors. You may want to include a command line argument that sets the number of errors the user will tolerate before the plugin aborts.
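The error file strategy might be sketched as follows. This is only an illustration: the subroutine processWithErrorFile, its arguments and the stand-in validation rule (a line must look like "id<TAB>integer") are assumptions, not part of the Plugin API.

```perl
use strict;
use warnings;

# Hypothetical sketch: process input lines, record the bad ones in a dedicated
# error file, and abort once more than $maxErrors bad lines have been seen.
sub processWithErrorFile {
  my ($inputLines, $errorFile, $maxErrors) = @_;

  open(my $errorHandle, '>', $errorFile)
    || die "could not open error file '$errorFile' for writing: $!\n";

  my $errorCount = 0;
  my $loadedCount = 0;
  my $lineNumber = 0;

  foreach my $line (@$inputLines) {
    $lineNumber++;
    if ($line =~ /^\S+\t\d+$/) {
      $loadedCount++;    # a real plugin would build and submit objects here
    } else {
      $errorCount++;
      print $errorHandle "line $lineNumber: '$line'\n";
      if ($errorCount > $maxErrors) {
        close($errorHandle);
        die "Giving up after $errorCount bad input lines "
          . "(limit is $maxErrors). See '$errorFile'\n";
      }
    }
  }

  close($errorHandle) || die "could not close error file '$errorFile': $!\n";
  return ($loadedCount, $errorCount);
}
```

The plugin still dies once the tolerance is exceeded, so the failure is never silent. The error file only lets a large load survive a handful of bad inputs.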

Failure Recovery and Restart

Plugins abort. They do so for many reasons. When they do, the user must be able to recover from the failure, one way or another.

A few strategies you could adopt are:

  • If the plugin is inserting data (rather than inserting and updating) the plugin can check if an object that is about to be written to the database is already there. If so, it can skip that object. Because this checking will slow the plugin down, the plugin should offer a restart flag on the command line that turns that check on.

  • If the plugin is updating, it can include a command line argument that takes a list of row_alg_invocation_ids, one for each run of the plugin with this dataset. (Each table in GUS has a row_alg_invocation_id column to store the identifier of the particular run of a plugin that put data there. This is part of the automatic tracking that plugins do.) The plugin can take the same approach as the previous strategy, but must additionally check that the object has one of the provided row_alg_invocation_ids.

  • The plugin can store in a dedicated file the identifiers of the objects it has already loaded. In this case, the plugin should offer a command line argument to ask for the name of the file.
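The last strategy, a dedicated file of already-loaded identifiers, could look roughly like this. The subroutine loadWithRestartFile and the $loadOne code reference (a stand-in for whatever writes one object to the database) are hypothetical names used only for this sketch.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical sketch of restart via a dedicated file of already-loaded ids.
sub loadWithRestartFile {
  my ($inputIds, $restartFile, $loadOne) = @_;

  # Read the ids recorded by any previous, aborted run
  my %alreadyLoaded;
  if (-e $restartFile) {
    open(my $readHandle, '<', $restartFile)
      || die "could not open restart file '$restartFile': $!\n";
    while (my $id = <$readHandle>) {
      chomp($id);
      $alreadyLoaded{$id} = 1;
    }
    close($readHandle);
  }

  # Append, so that ids from earlier runs are kept
  open(my $appendHandle, '>>', $restartFile)
    || die "could not append to restart file '$restartFile': $!\n";
  $appendHandle->autoflush(1);

  my $loadedCount = 0;
  foreach my $id (@$inputIds) {
    if ($alreadyLoaded{$id}) {
      next;
    }
    $loadOne->($id);
    print $appendHandle "$id\n";    # record only after a successful load
    $loadedCount++;
  }

  close($appendHandle) || die "could not close restart file '$restartFile': $!\n";
  return $loadedCount;
}
```

Writing the identifier only after the load succeeds means that an object that caused the abort is retried on the next run rather than skipped.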

Opening Files

A very common error is to open files without dying if the open fails. The proper way to open a file is like this:

Example 1.5. Properly Opening a File

open(FILE, $myFile) || die "could not open file '$myFile': $!\n";

Caching to Minimize Database Access

One of the most time consuming operations in a plugin is accessing the database. The typical flow of a plugin is that it reads the input and as it goes it constructs and submits GUS objects to the database. Some plugins additionally need to read data from the database to do their work. While it is often impossible to avoid writing to the database with each new input value, it is often possible to avoid reading it.

If most of the values of a table (or tables) will be needed then the plugin should read the table (or tables) outside the loop that processes the input. It should store the values in a hash keyed on a primary or alternate key. Storing multiple megabytes of data this way in memory should not be a problem. Gigabytes may well be a problem.

If only a few values from the table will be needed then an alternative caching strategy may be appropriate. Wrap the access to the values in a getter method, such as getGeneType(). This method stores values it gets in a hash. When the method is called, it first looks in the hash for the value. If the hash does not have it, then the method reads the database and stores the value in the hash to optimize future accesses.
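A minimal sketch of such a caching getter follows. It is written as a plain subroutine so that it runs on its own. The $lookup code reference is a hypothetical stand-in for the SQL query that fetches one value. In a real plugin the method would take $self and would probably keep the hash in $self rather than in a file-level variable.

```perl
use strict;
use warnings;

# Hypothetical cache: gene name => gene type
my %geneTypeCache;

sub getGeneType {
  my ($geneName, $lookup) = @_;

  if (!exists($geneTypeCache{$geneName})) {
    # Cache miss: read the database once and remember the answer
    $geneTypeCache{$geneName} = $lookup->($geneName);
  }
  return $geneTypeCache{$geneName};
}
```

Repeated calls for the same gene name run the lookup only once. Later calls are answered from the hash.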

Regular Expressions

Complicated regular expressions should be accompanied by a comment line that shows what the input string looks like. Otherwise it is often very difficult to figure out what the regular expression is doing. Long regular expressions should be split into multiple lines with embedded whitespace and comments using the /x modifier. See the "Readability" section of Maintaining Regular Expressions.
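For instance, a commented /x expression might look like the sketch below. The subroutine parseDefline and the definition line format it parses are chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a FASTA-style definition line such as
#   >gi|12345|gb|AF123456.1| Homo sapiens mRNA
sub parseDefline {
  my ($defline) = @_;

  if ($defline =~ /
        ^>gi\|(\d+)        # gi number, eg 12345
        \|(\w+)            # source database tag, eg gb
        \|([\w.]+)         # accession with version, eg AF123456.1
        \|\s*(.*)$         # free-text description
      /x) {
    return ($1, $2, $3, $4);
  }
  die "Could not parse definition line '$defline'\n";
}
```

Under /x, white space in the pattern is ignored and # starts a comment, so the comments must not contain the pattern delimiter (here the forward slash).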

Variable and Method Names

Choosing good names for your variables and methods makes your code much more understandable. To make your code clear:

  • Variable and method names should start with a lower case letter.

  • Use "camel caps" ($sequenceLength) for variable names and method names, not underscores ($sequence_length).

  • Variables should be named after the type of data they hold (unless there is more than one variable for a given type, in which case they are qualified). For example, a good name for a sequence would be $sequence.

  • In plugins, there are typically:

    • strings parsed from the input

    • objects created from the input (if you are using an object based parser such as Bioperl)

    • GUS object layer objects

  • Input objects or strings should be named with 'input' as a prefix. For example: $inputSequence

  • Object layer objects are named for their type, for example $NASequence

  • Method names should be self-explanatory. A bad method name would be process() (what is being processed?). Don't "save keystrokes" with short names. If being self-explanatory requires using a long name, then use a long name.

Methods

Use "structured programming" when you create your methods:

  • No method should ever be longer than one screen. If it is, refactor part of it into its own method.

  • Never repeat code. Repeated code must be in a method.

Some methods in the API are marked as deprecated. Do not use them. They are for backward compatibility only.

Syntax

  • Use C and Java like syntax. Do not use weird Perl specific syntax.

  • Indenting must be spaces, not tabs. Two or four spaces are acceptable.

  • Use $self to refer to the object itself

  • Declare method arguments using this syntax:

    my ($self, $sequence, $length) = @_;

    Do not use shift.

Application Specific Controlled Vocabularies

A controlled vocabulary (CV) is a restricted set of terms that are allowed values for a data type. They may be simple lists or they may be complex trees, graphs or ontologies. In GUS the CVs fall into two categories: standard CVs such as the Gene Ontology, and small application specific CVs such as ReviewStatus.

The complete list of application specific CVs in the GUS 3.5 schema is:

  • DoTS.BlatAlignmentQuality

  • DoTS.GOAssociationInstanceLOE

  • DoTS.GeneInstanceCategory

  • DoTS.InteractionType

  • DoTS.MotifRejectionReason

  • DoTS.ProteinCategory

  • DoTS.ProteinInstanceCategory

  • DoTS.ProteinProteinCategory

  • DoTS.ProteinPropertyType

  • DoTS.RNACategory

  • DoTS.RNAInstanceCategory

  • DoTS.RNARNACategory

  • DoTS.RepeatType

  • SRes.BibRefType

  • SRes.ReviewStatus

Acquiring a standard CV typically involves downloading files from the CV provider and running a plugin to load it.

Application specific CVs are handled by the plugin that will use the CV. For example, a plugin that inserts bibliographic references will use the SRes.BibRefType CV. It is these plugins that are responsible for making sure that the CV they want to use is in the database.

Plugins that use CVs fall into two categories:

  1. those that hard code the CV

  2. those that do not hard code the CV, but, rather, get it from the input

In case 1, the plugin hard codes the CV in the Perl code.

In case 2, the plugin hard codes only a default. It also offers an optional command line argument that takes a file that contains the CV. If the user of the plugin determines that the input has a different CV than the default, the user will provide such a file.

In both cases, the plugin reads the table in GUS that contains the CV and compares it to the CV it expects to use. If the expected vocab is not found, the plugin updates the table.
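The comparison step can be sketched as below, assuming the CV table has already been read into a hash of term to primary key, as in Example 1.4. The subroutine findMissingTerms is a hypothetical name. Inserting the missing terms into the table is left out because it depends on the GUS object for the CV table in question.

```perl
use strict;
use warnings;

# Hypothetical sketch: given the terms the plugin expects and the terms read
# from the CV table (term => primary key), return the expected terms that
# are not yet in the database and so need to be inserted.
sub findMissingTerms {
  my ($expectedTerms, $vocabFromDb) = @_;

  my @missingTerms;
  foreach my $term (@$expectedTerms) {
    if (!exists($vocabFromDb->{$term})) {
      push(@missingTerms, $term);
    }
  }
  return @missingTerms;
}
```

The expected terms come either from the list hard coded in the plugin or from the optional CV file supplied on the command line.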

Assigning an External Database Release Id

GUS is a data warehouse so it is very common for plugins to load into GUS data from another source. Whether the source is external or in-house, tracking its origin is often required. The tables in GUS that handle this are SRes.ExternalDatabase and SRes.ExternalDatabaseRelease. The former describes the database, eg, PFam, and the latter describes the particular release of the database that is being loaded, eg, 1.0.0. The data loaded will have a foreign key to the database release, which in turn has a foreign key to the database.

In order to create that relationship, the plugin must know the primary key of the external database release. To accomplish this, the plugin takes as command line arguments the name of the database and its release. It does not take the primary key of the external database release (that violates the plugin standard). The plugin passes that information to the API subroutine getExtDbRlsId($dbName, $dbVersion).

If the plugin is inserting the dataset as opposed to updating it, create new entries for the database and the release by using the plugins GUS::Supported::Plugin::InsertExternalDatabase and GUS::Supported::Plugin::InsertExternalDatabaseRls.