Problems with gene association files

Karen Christie kchris at genome.Stanford.EDU
Mon Jul 1 10:53:21 PDT 2002

Hi Gabriel,

On Fri, 28 Jun 2002, Gabriel Berriz wrote:

> Here is a summary of the above for the current version of the listed files:
> gene_associations.sgd:
> 2 "descendant-of-obsolete" attribs (2 associations)
> 2 deprecated attribs (6 associations)
> 3 "descendant-of-unknown" attribs (6762 associations)

With SGD as well, all of the obsolete or synonymous GOids you list are
no longer in use.

Harold gave a really nice explanation of what MGI does regarding
obsolete IDs and also the use of the unknown terms. The use of unknown
terms was discussed at length at the October meeting of the GO
Consortium, so I thought it might be helpful to summarize the
discussion and resulting action (below after Wish List item 1).

Regarding obsolete and synonymous GOids, I agree with Harold and Becky
that this is very much a timing issue. Because each group updates its
gene_associations file at a different frequency, there is no easy
solution to this problem. I've tried to address the scope of the
problem and also why we cannot computationally assign new GOids for
obsoleted or synonymous GOids (below under Wish List items 2 & 3).

Hope this helps clarify what's going on.


> I would like to add the following few points to The Wish List:
> 1. Is it useful to have gene associations to the category of unknown? 
> Wouldn't lack of evidence already imply that?  I propose that all these
> associations be eliminated, or reassigned to a GO attribute other than
> one of the "unknown" ones.

At the October GO meeting, we discussed this issue and all groups felt
that it is *definitely useful* to distinguish between genes where a
curator has looked and there is no data available, versus those genes
that have not been looked at or annotated yet. Marking genes as
having been looked at helps us to target the genes that have not
yet been looked at by a curator and to progress towards completing a
set of annotations for the genome that reflects the state of current
knowledge, even when that is 'unknown'.

Because we all felt so strongly that the unknown terms are useful to
represent that the research community does not know what a given gene
does, we also came up with a procedure to standardize use of the 3
unknown terms, so that all the groups should be doing something fairly
consistent in their use of these terms. As a result of this
discussion, the date field was added in order to provide a time
context for annotations to any of the 3 unknown terms.

> 2. It seems reasonable that if an attribute has been classified as
> obsolete (by making it a descendant of one of the "obsolete" attributes),
> then the associations in which this attribute appears should also become
> obsolete, and therefore expunged from the association file, or at least
> reassigned to a non-obsolete attribute.
> 3. It would be very helpful if the gene association files were in synch
> with the ontology files.  The next best thing would be a clean-up script
> that replaces all occurrences of deprecated attributes in the association
> files with their current versions.

For SGD, we have a script that runs nightly to pick up both obsoleted
and synonymous (deprecated) GOids. Several of us SGD curators get the
resulting email and are responsible for manually fixing them. These GOids
disappear from the associations file within a couple days of being

These GOids are *not* automatically deleted or transferred because,
particularly with obsoletes, there is not a computational way to
reassign a correct GOid. Even with synonymous GOids, we cannot make
the transfer computationally, because historically, an original GOid
that gets split into two separate terms gets made synonymous with both
new, non-equivalent, terms as part of the mechanism of GOid
tracking. We have discussed that this is confusing, but until and
unless we change this, it is impossible to computationally assign a
correct new GOid for synonymous GOids.
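
To make the one-to-many problem concrete, here is a minimal Python
sketch. It is not SGD's actual script. It assumes two hypothetical
inputs: a set of obsolete GOids, and a dictionary mapping each
synonymous GOid to the list of current GOids that carry it. The GOids
are placeholders, not real annotations. Treating a single-target
synonym as a candidate for automatic transfer is also an assumption of
the sketch; anything obsolete or split goes to a curator.

```python
from typing import Dict, Iterable, List, Set, Tuple


def triage_goids(
    annotated: Iterable[str],
    obsolete: Set[str],
    secondary_to_current: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split stale GOids into candidate remaps and ones needing a curator.

    Returns (remap, needs_curator). A GOid is put in remap only when it
    is synonymous with exactly one current term.
    """
    remap: Dict[str, str] = {}
    needs_curator: List[str] = []
    for goid in sorted(set(annotated)):
        if goid in obsolete:
            # An obsolete term has no successor; a curator must choose one.
            needs_curator.append(goid)
        elif goid in secondary_to_current:
            targets = secondary_to_current[goid]
            if len(targets) == 1:
                remap[goid] = targets[0]
            else:
                # The original term was split into non-equivalent terms,
                # so there is no way to pick the right one automatically.
                needs_curator.append(goid)
    return remap, needs_curator


# Hypothetical example data (placeholder identifiers).
remap, needs_curator = triage_goids(
    annotated=["GO:9999001", "GO:9999002", "GO:9999003", "GO:9999004"],
    obsolete={"GO:9999002"},
    secondary_to_current={
        "GO:9999003": ["GO:9999010"],
        "GO:9999004": ["GO:9999011", "GO:9999012"],
    },
)
print("candidate remaps:", remap)
print("needs a curator:", needs_curator)
```

In this example only GO:9999003 has a single possible target; the
obsolete GOid and the split GOid both end up on the curator's list.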

It may also be worth pointing out that the various groups update their
gene_associations files at different frequencies. Harold mentioned
that MGI's file is updated weekly; SGD's is written nightly; FlyBase's
update procedure is fairly complicated and takes longer than
MGI's. Every file was up-to-date and in-synch with the ontologies when
it was written. However, while the ontology files change almost
daily, the gene_association files are often updated less
frequently, depending on the organization. If it is important to
you not to run into associations to obsolete or synonymous GOids, you
may want to use the monthly release of the ontology files (available
from the ontology-archive directory of the GO ftp site) that most
closely correlates with the date of a particular gene associations
file. As you can see below, the dates (also visible on the ftp site)
span a large time frame, from October to July, so if you want to use
all of the gene associations files at the same time, you may
just have to accept that some of the older files will contain
associations to obsolete or synonymous terms. Each group has its own
mechanism for finding these and replacing them with valid terms, but
the time frame varies.

gocvs         512 Jul  1 01:15 CVS/
gocvs      250399 Jun 27 16:15 gene_association.GeneDB_tsetse
gocvs    20224664 Oct 17  2001 gene_association.compugen.Genbank
gocvs    27393781 Oct 17  2001 gene_association.compugen.Swissprot
gocvs     1782823 Jun 11 08:15 gene_association.fb
gocvs     9369524 Jun 20 05:28 gene_association.goa_human
gocvs    15256631 Jun 21 05:35 gene_association.goa_sptr.gz
gocvs     2456470 Mar 17 15:06 gene_association.gramene_oryza
gocvs     5194820 Jun 28 09:15 gene_association.mgi
gocvs     1200676 Apr 24 16:15 gene_association.pombase
gocvs      488210 Dec 22  2001 gene_association.rgd
gocvs     2199444 Jul  1 01:15 gene_association.sgd
gocvs     8382946 May 10 10:15 gene_association.tair
gocvs      558939 May 11 04:15 gene_association.tigr_ath
gocvs    15421798 Feb 12 00:15 gene_association.tigr_gene_index
gocvs      914919 Feb 26 11:47 gene_association.tigr_vibrio
gocvs     2317338 May  1 15:15 gene_association.wb
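
Picking the ontology release that most closely matches a given file
date could be sketched as follows in Python. The release dates here
are hypothetical (the first of each month); the real ones would have
to be read from the ontology-archive directory. The two file dates are
taken from the listing above.

```python
from datetime import date
from typing import Iterable


def closest_release(file_date: date, releases: Iterable[date]) -> date:
    """Return the release date nearest to file_date.

    If two releases are equally close, the earlier one is returned.
    """
    return min(sorted(releases), key=lambda r: abs((r - file_date).days))


# Hypothetical monthly releases, October 2001 through July 2002.
releases = [date(2001, m, 1) for m in (10, 11, 12)] + [
    date(2002, m, 1) for m in range(1, 8)
]

# Dates of two files from the listing above.
file_dates = {
    "gene_association.fb": date(2002, 6, 11),
    "gene_association.rgd": date(2001, 12, 22),
}

for name, written in file_dates.items():
    print(name, "->", closest_release(written, releases).isoformat())
```

The sketch picks the nearest release in either direction. Someone who
prefers the latest release on or before the file date would filter
the list first.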

> 4. The most serious problem are the records that are completely invalid,
> due to a missing or malformed field.  It would be nice if the gene
> association files were checked for data integrity before they are
> published.  It would be easy to write a Perl script to automate this
> checking.
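
As an illustration of the kind of check point 4 describes, here is a
minimal sketch, in Python rather than Perl. It assumes a tab-delimited
file whose comment lines start with "!", with the GOid in the fifth
column and 15 columns per record. The column count and position are
assumptions and are left as parameters; the sample records are made up.

```python
import re
from typing import Iterable, Iterator, Tuple

# A GOid is "GO:" followed by seven digits.
GOID_PATTERN = re.compile(r"GO:\d{7}")


def find_invalid_records(
    lines: Iterable[str],
    expected_columns: int = 15,  # assumption; adjust to the file format
    goid_column: int = 4,        # zero-based index; assumption
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, reason) for each malformed record."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_columns:
            yield number, f"expected {expected_columns} fields, found {len(fields)}"
        elif not GOID_PATTERN.fullmatch(fields[goid_column]):
            yield number, f"malformed GOid {fields[goid_column]!r}"


# Made-up sample records for illustration.
good = "\t".join(
    ["SGD", "S0000001", "ABC1", "", "GO:9999001", "ref", "IEA", "",
     "P", "", "", "gene", "taxon:4932", "20020701", "SGD"]
)
sample = [
    "! comment line\n",
    good + "\n",
    "SGD\tS0000001\tGO:9999001\n",                    # too few fields
    good.replace("GO:9999001", "GO:12") + "\n",       # malformed GOid
]

for number, reason in find_invalid_records(sample):
    print(f"line {number}: {reason}")
```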

