Problems with gene association files

Karen Christie kchris at genome.Stanford.EDU
Tue Jul 2 10:04:30 PDT 2002

Hi Gabriel,

Below, I've tried to address (at some length, I admit) some of the reasons
why different groups will have used the 'unknown' terms to differing
degrees, and also tried to address your question about automated cleanup
of obsoleted and synonymous GOids.

Let me also go back to one of your questions from your initial email, this
one, because I think there is another point worth stating explicitly.
> 1. Is it useful to have gene associations to the category of unknown? 
> Wouldn't lack of evidence already imply that?  

Right now, definitely for SGD, and I think for many other model organism
databases, lack of a GO annotation does *not* imply lack of evidence. 

For us it definitely only means that no one has annotated it yet, and
that's a curator-power issue, not a reflection of the state of knowledge
generated by the research community. Use of 'unknown', especially now with
a date, clearly indicates that as of the date, the curator could not find
any information on which to base an annotation in a given ontology. Thus
the 'unknown' terms are a better reflection of the state of knowledge than
lack of annotation.



On Mon, 1 Jul 2002, Gabriel Berriz wrote:

> Hi, Karen.  Thanks for the clarification.
> There must be a significant difference between the way SGD handles these 
> attributions and the way MGD, FB, and WB handle them, because about 1/3 of 
> all the SGD associations are to the unknown attribs, whereas for the other 
> three organisms, the fraction is under 3%.  In fact, for FB, not a single 
> association has been made to these attributes.  The total numbers of 
> associations for these organisms are approximately 20K (SGD), 38K (MGD), 
> 21K (FB), and 23K (WB). (We have not looked at any of the other 
> files.)

Thanks for the numbers. It's interesting to see it expressed in those
terms.

I'm guessing that the differences in percentage may relate more to things
like the length of time over which the 'unknown' terms have been used by a
given database to indicate that the available literature/information does
not permit assignment of a specific term. SGD and MGI have probably been
using the 'unknown' terms for longer than anyone else. 

With SGD, another factor is probably the relative completeness of our
sequence and the vastly greater ease of calling the genes for cerevisiae
and knowing which gene you're working with. MGI (and the other databases
for the larger eukaryotes) have larger genomes with more genes, more
complex gene structures which make it much trickier to predict
genes from sequence, lots of multigene families, and sequence data which
is not quite as far along the way to "completeness" as that for
cerevisiae.  So for us at SGD, it may be easier to say 'unknown' with

Differences in how genes are chosen for annotation will also have an
impact. At SGD, with a relatively small genome for a eukaryote, we would
like to have a set of GO annotations that represents the state of
knowledge of the entire genome, even when the state of knowledge for a
given gene is that we have no idea what it does. There are still a few
hundred essential genes in S. cerevisiae (inviable when deleted in
haploids) whose function is entirely unknown. At the moment, we're
targeting things with no GO annotations at all. Other groups may be
targeting things for different reasons to best serve the needs of their
organism's research community and may be using different annotation
strategies, both of which may impact the need to use the 'unknown'
terms.

At the October meeting, it became apparent that there was some
difference in the practice of how the unknown terms were being used,
which was why the system I outlined was agreed upon. Before
October, both SGD and MGI had used the 'unknown' terms to indicate
that the gene has been looked at with respect to trying to associate
it with the relevant GO terms, with some differences in the
detail. The best bits of the two procedures were combined to produce
the one agreed upon by the whole Consortium. A big advantage of using
the unknown terms combined with a date is that we will be able to
implement a way to flag when a gene associated with one of them and
some date becomes associated with a paper that has a more recent date,
in order to examine whether this paper contains relevant info to
upgrade to a specific term, and delete the unknown.
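
The date-based flagging described above could look roughly like the
following. This is only a sketch under stated assumptions: the function
name, the tuple layouts, and the idea of comparing against the newest
paper per gene are all hypothetical, not a description of any existing
SGD or MGI script. The three GOids are the 'unknown' terms listed in the
addendum.

```python
from datetime import date

# The three 'unknown' terms (function, process, component).
UNKNOWN_TERMS = {"GO:0005554", "GO:0000004", "GO:0008372"}

def genes_to_review(annotations, papers):
    """Flag 'unknown' annotations that are older than a paper on the same gene.

    annotations: iterable of (gene, go_id, annotation_date)  -- assumed layout
    papers:      iterable of (gene, paper_date)              -- assumed layout
    Returns a sorted list of (gene, go_id) pairs for a curator to re-examine.
    """
    # Remember the most recent paper date seen for each gene.
    newest_paper = {}
    for gene, paper_date in papers:
        if gene not in newest_paper or paper_date > newest_paper[gene]:
            newest_paper[gene] = paper_date

    flagged = set()
    for gene, go_id, annotated_on in annotations:
        if (go_id in UNKNOWN_TERMS
                and gene in newest_paper
                and newest_paper[gene] > annotated_on):
            flagged.add((gene, go_id))
    return sorted(flagged)
```

The function only produces a review list; deciding whether the newer
paper actually supports a specific term remains a curator's job.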

Since I was curious now, I grepped for the counts of the three unknown
terms in all of the files. 12 of 16 files contain instances of
GO:0005554 (molecular_function unknown). Some groups which have used
the term 'molecular_function unknown' have not had occasion to use the
other 2 unknown terms. I included all the numbers below. (NB: one file
is zipped, so though gene_association.goa_sptr.gz contains instances
of the unknown terms, my grep counted 0's for it. Thus 13 of 16 files
utilize the 5554 unknown.)
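
For anyone repeating the count, a minimal sketch that also reads gzipped
files, so that a .gz file does not silently report 0. The function name
count_term is hypothetical. Like grep -c, it counts lines that contain
the GOid, not individual occurrences.

```python
import gzip

def count_term(path, go_id):
    """Count lines containing go_id, reading .gz files transparently."""
    # Pick the opener from the file name; both accept text mode "rt".
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if go_id in line)
```

For example, count_term("gene_association.goa_sptr.gz", "GO:0005554")
would count matching lines in the compressed file directly.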

So that was kind of long, but I hope it illustrates that, although each
group will of course have its own strategy for doing GO annotations,
for a variety of reasons, the whole GO Consortium is in agreement
about the basic principle of using the 3 unknown terms, though
there may always be differences in the specifics of

Automatic cleanup stuff:

> >For SGD, we have a script that runs nightly to pick up both obsoleted
> >or synonymous (deprecated) GOids. Several of us SGD curators get this
> >email and are responsible for manually fixing them. These GOids
> >disappear from the associations file within a couple days of being
> >identified.
> >
> >These GOids are *not* automatically deleted or transferred because,
> >particularly with obsoletes, there is not a computational way to
> >reassign a correct GOid. Even with synonymous GOids, we cannot make
> >the transfer computationally, because historically, an original GOid
> >that gets split into two separate terms gets made synonymous with both
> >new, non-equivalent, terms as part of the mechanism of GOid
> >tracking. We have discussed that this is confusing, but until and
> >unless we change this, it is impossible to computationally assign a
> >correct new GOid for synonymous GOids.
> If this is the case, we must be doing something wrong here, because we 
> perform the appropriate clean-up automatically.  In the case of deprecated 
> synonyms, our script generates the synonyms from the *.ontology 
> files.  E.g., for the line
>   %chaperone ; GO:0003754, GO:0003757, GO:0003758, GO:0003760, GO:0003761
> the script sets GO:0003754 as the correct id, and GO:0003757, GO:0003758, 
> GO:0003760, and GO:0003761 as deprecated synonyms.  Whenever an association 
> is found that uses one of the deprecated synonyms, the script simply 
> replaces it with the correct id.  Is this an incorrect interpretation of 
> the ordering in this list?

Most of the time, it is fine to map a synonymous GOid to the primary, but
I believe there are still instances where a primary GOid was split into
two GO terms, each with a new GOid and the original (now secondary) GOid
was listed as a synonym for two primary GOids. In these cases, it is not
possible to computationally decide which of the two primary GOids is the
correct one for a given gene. For the majority of cases, where there is a
1 to 1 mapping of secondary ID to primary ID, then a computational method
is fine.  If/when we notate the splits differently, then there shouldn't
be a problem here at all, because I can't think of any issues, other than
splits, that would affect a computational mapping of secondary to primary
IDs.
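
To make the distinction concrete, here is a rough sketch of a mapping
that remaps 1-to-1 cases and holds back splits for a curator. It assumes
each ontology line has the simple form quoted above (term name, then
" ; ", then the primary GOid followed by its secondary GOids); real
*.ontology lines may carry fields this sketch does not handle. The
function names are hypothetical.

```python
import re
from collections import defaultdict

GO_ID = re.compile(r"GO:\d{7}")

def secondary_to_primaries(ontology_lines):
    """Map each secondary GOid to the set of primary GOids that list it."""
    mapping = defaultdict(set)
    for line in ontology_lines:
        fields = line.split(" ; ")
        if len(fields) < 2:
            continue
        # Assumption: the first GOid in the field is the primary,
        # any further GOids in the same field are secondary.
        ids = GO_ID.findall(fields[1])
        if not ids:
            continue
        primary = ids[0]
        for secondary in ids[1:]:
            mapping[secondary].add(primary)
    return mapping

def resolve(go_id, mapping):
    """Return (go_id_to_use, needs_curator)."""
    primaries = mapping.get(go_id)
    if not primaries:
        return go_id, False                    # not a secondary GOid
    if len(primaries) == 1:
        return next(iter(primaries)), False    # 1-to-1: safe to remap
    return go_id, True                         # split: cannot choose automatically
```

With the chaperone line quoted above, resolve("GO:0003757", mapping)
returns ("GO:0003754", False). A secondary GOid listed under two
different primaries comes back unchanged with the flag set to True.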

> For obsoleted attributes, again, we use the information in the *.ontology 
> file to determine whether an attribute in an association is a descendant of 
> one of the attributes called "obsolete" in one of the three main 
> branches.  If it is, we discard the association.  It seems to me that the 
> same approach could be used to automatically weed out the obsolete 
> associations from a gene associations file.

In my previous email, I meant that a computational method would not be
appropriate for choosing a replacement GOid. We could probably modify
the script to not output associations to synonymous or obsolete IDs
while curators are deciding on the replacements. 
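
One way such a modification might work, as a sketch only: split the
association lines into those safe to publish and those held back for
curators. It assumes a tab-delimited file with the GOid in the fifth
column and header lines starting with "!"; the column position is a
parameter because that layout is an assumption here, and the function
name is hypothetical.

```python
def split_associations(lines, stale_ids, go_column=4):
    """Separate association lines into (kept, held_back).

    stale_ids: set of obsolete or secondary GOids awaiting curation.
    go_column: zero-based index of the GOid column (assumed to be the fifth).
    """
    kept, held_back = [], []
    for line in lines:
        if line.startswith("!"):
            kept.append(line)          # header/comment lines pass through
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) > go_column and columns[go_column] in stale_ids:
            held_back.append(line)     # withheld until a curator picks a replacement
        else:
            kept.append(line)
    return kept, held_back
```

The held_back list could feed the same nightly email the curators
already receive, so nothing is dropped without being reviewed.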



Addendum - Unknown counts:

molecular_function unknown
whiskey 172 > grep -c GO:0005554 gene_association.*

biological_process unknown
whiskey 173 > grep -c GO:0000004 gene_association.*

cellular_component unknown
whiskey 174 > grep -c GO:0008372 gene_association.*

line counts included only for rough idea of file size, gives total number
of annotations (+0-3 header lines), but doesn't allow determination of
annotations per each of the 3 ontologies

line count	filename
    2085	gene_association.GeneDB_tsetse
 2345084	gene_association.compugen.Genbank
 3262671	gene_association.compugen.Swissprot
   20844	gene_association.fb
   81037	gene_association.goa_human
  557221	gene_association.goa_sptr.gz
   21180	gene_association.gramene_oryza
   38276	gene_association.mgi
   13468	gene_association.pombase
    3777	gene_association.rgd
   19903	gene_association.sgd
   54422	gene_association.tair
    3970	gene_association.tigr_ath
 1816368	gene_association.tigr_gene_index
    7253	gene_association.tigr_vibrio
   22625	gene_association.wb

This message is from the GOFriends moderated mailing list.  A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list?           E-mail: owner-gofriends at
Subscribing   send   "subscribe"   to   gofriends-request at
Unsubscribing send   "unsubscribe"  to  gofriends-request at
