Problems with gene association files
Karen Christie
kchris at genome.Stanford.EDU
Mon Jul 1 10:53:21 PDT 2002
Hi Gabriel,
On Fri, 28 Jun 2002, Gabriel Berriz wrote:
> Here is a summary of the above for the current version of the listed files:
>
> gene_associations.sgd:
> 2 "descendant-of-obsolete" attribs (2 associations)
> 2 deprecated attribs (6 associations)
> 3 "descendant-of-unknown" attribs (6762 associations)
With SGD as well, all of the obsolete or synonymous GOids you list are
no longer in use.
Harold gave a really nice explanation of what MGI does regarding
obsolete IDs and also the use of the unknown terms. The use of unknown
terms was discussed at length at the October meeting of the GO
Consortium, so I thought it might be helpful to summarize the
discussion and resulting action (below after Wish List item 1).
Regarding obsolete and synonymous GOids, I agree with Harold and Becky
that this is very much a timing issue. Because each group updates its
gene_associations file at a different frequency, there is no easy
solution to this problem. I've tried to address the scope of the
problem and also why we cannot computationally assign new GOids for
obsoleted or synonymous GOids (below under Wish List items 2 & 3).
Hope this helps clarify what's going on.
-Karen
> I would like to add the following few points to The Wish List:
>
> 1. Is it useful to have gene associations to the category of unknown?
> Wouldn't lack of evidence already imply that? I propose that all these
> associations be eliminated, or reassigned to a GO attribute other than
> one of the "unknown" ones.
At the October GO meeting, we discussed this issue and all groups felt
that it is *definitely useful* to distinguish between genes where a
curator has looked and there is no data available, versus those genes
that have not been looked at or annotated yet. By marking genes as
having been looked at, it helps us to target the genes that have not
yet been looked at by a curator and progress towards completing a set
of annotations for the genome that reflect the state of the current
knowledge, even when that is 'unknown'.
Because we all felt so strongly that the unknown terms are useful to
represent that the research community does not know what a given gene
does, we also came up with a procedure to standardize use of the 3
unknown terms, so that all the groups should be doing something fairly
consistent in their use of these terms. As a result of this
discussion, the date field was added in order to provide a time
context for annotations to any of the 3 unknown terms.
> 2. It seems reasonable that if an attribute has been classified as
> obsolete (by making it a descendant of one of the "obsolete" attributes),
> then the associations in which this attribute appears should also become
> obsolete, and therefore expunged from the association file, or at least
> reassigned to a non-obsolete attribute.
>
> 3. It would be very helpful if the gene association files were in synch
> with the ontology files. The next best thing would be a clean-up script
> that replaces all occurrences of deprecated attributes in the association
> files with their current versions.
>
For SGD, we have a script that runs nightly to pick up both obsoleted
and synonymous (deprecated) GOids. Several of us SGD curators get the
resulting email and are responsible for manually fixing them. These
GOids disappear from the associations file within a couple of days of
being identified.
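
A check of this kind could look roughly like the sketch below. It is not
the SGD script; it assumes the association file is tab-delimited with the
GOid in the fifth field, that comment lines start with "!", and that the
sets of obsolete and synonymous GOids have already been extracted from the
ontology files. The function name and the sample IDs are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

# Assumption: the GOid is the 5th tab-separated field (index 4).
GO_ID_COLUMN = 4


def flag_stale_associations(
    lines: Iterable[str],
    obsolete_ids: Set[str],
    synonymous_ids: Set[str],
    go_col: int = GO_ID_COLUMN,
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, GOid, reason) for associations needing curator review."""
    for lineno, line in enumerate(lines, start=1):
        # Skip blank lines and comment lines (assumed to start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= go_col:
            continue  # too short to hold a GOid; not this check's concern
        go_id = fields[go_col]
        if go_id in obsolete_ids:
            yield lineno, go_id, "obsolete"
        elif go_id in synonymous_ids:
            yield lineno, go_id, "synonymous"
```

The output is only a report. Nothing is rewritten, which matches the
manual fixing described above.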
These GOids are *not* automatically deleted or transferred because,
particularly with obsoletes, there is not a computational way to
reassign a correct GOid. Even with synonymous GOids, we cannot make
the transfer computationally, because historically, an original GOid
that gets split into two separate terms gets made synonymous with both
new, non-equivalent, terms as part of the mechanism of GOid
tracking. We have discussed that this is confusing, but until and
unless we change this, it is impossible to computationally assign a
correct new GOid for synonymous GOids.
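
To make the ambiguity concrete, here is a minimal sketch. It assumes a
mapping from each synonymous (old) GOid to the set of current terms that
list it as a synonym. The function name and IDs are hypothetical. An old
GOid that was split points at more than one current term, so no single
replacement can be chosen by a program.

```python
from typing import Dict, List, Set, Tuple


def separate_split_ids(
    old_to_current: Dict[str, Set[str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Separate old GOids with one current term from those with several.

    Returns (single_target, ambiguous). Whether the single-target cases
    are safe to transfer without a curator is NOT decided here.
    """
    single_target: Dict[str, str] = {}
    ambiguous: Dict[str, List[str]] = {}
    for old_id, current_ids in old_to_current.items():
        if len(current_ids) == 1:
            single_target[old_id] = next(iter(current_ids))
        else:
            # A split term: two or more non-equivalent candidates.
            ambiguous[old_id] = sorted(current_ids)
    return single_target, ambiguous


# Hypothetical example: GO:0000010 was split into two new terms.
single, split = separate_split_ids({
    "GO:0000010": {"GO:0000011", "GO:0000012"},
    "GO:0000020": {"GO:0000021"},
})
```

Here `split` holds GO:0000010 with both candidate terms, and picking
between them requires reading the original annotation.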
It may also be worth pointing out that the various groups update their
gene_associations files at different frequencies. Harold mentioned
that MGI's file is updated weekly; SGD's is written nightly; FlyBase's
update procedure is fairly complicated and takes longer than
MGI's. Every file was up to date and in sync with the ontologies when
it was written. However, while the ontology files change almost
daily, the gene_associations files are often updated less frequently,
depending on the organization. If it is important to
you not to run into associations to obsolete or synonymous GOids, you
may want to use the monthly release of the ontology files (available
from the ontology-archive directory of the GO ftp site) that most
closely matches the date of a particular gene associations
file. As you can see below, the dates (also visible on the ftp site)
span a large time frame, from October to July. If you want to use
all of the gene associations files at the same time, you may
just have to accept that some of the older files will contain
associations to obsolete or synonymous terms. Each group has its own
mechanism for finding these and replacing them with valid terms, but
the time frame varies.
gocvs 512 Jul 1 01:15 CVS/
gocvs 250399 Jun 27 16:15 gene_association.GeneDB_tsetse
gocvs 20224664 Oct 17 2001 gene_association.compugen.Genbank
gocvs 27393781 Oct 17 2001 gene_association.compugen.Swissprot
gocvs 1782823 Jun 11 08:15 gene_association.fb
gocvs 9369524 Jun 20 05:28 gene_association.goa_human
gocvs 15256631 Jun 21 05:35 gene_association.goa_sptr.gz
gocvs 2456470 Mar 17 15:06 gene_association.gramene_oryza
gocvs 5194820 Jun 28 09:15 gene_association.mgi
gocvs 1200676 Apr 24 16:15 gene_association.pombase
gocvs 488210 Dec 22 2001 gene_association.rgd
gocvs 2199444 Jul 1 01:15 gene_association.sgd
gocvs 8382946 May 10 10:15 gene_association.tair
gocvs 558939 May 11 04:15 gene_association.tigr_ath
gocvs 15421798 Feb 12 00:15 gene_association.tigr_gene_index
gocvs 914919 Feb 26 11:47 gene_association.tigr_vibrio
gocvs 2317338 May 1 15:15 gene_association.wb
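
Matching a file to the nearest monthly ontology release can be sketched as
follows. The release dates here are hypothetical placeholders (the actual
dates are on the ftp site), and the function name is made up for
illustration. The file date is taken from the listing above.

```python
from datetime import date
from typing import Iterable


def closest_release(file_date: date, release_dates: Iterable[date]) -> date:
    """Return the release date nearest in time to file_date.

    On a tie, the release that appears first in release_dates wins.
    """
    return min(release_dates, key=lambda d: abs((d - file_date).days))


# Hypothetical monthly releases; substitute the real archive dates.
releases = [date(2002, 5, 1), date(2002, 6, 1), date(2002, 7, 1)]

# gene_association.mgi is dated Jun 28 in the listing above.
best = closest_release(date(2002, 6, 28), releases)
```

With these placeholder dates, `best` is the July 1 release, which is 3
days from the file date versus 27 days for June 1.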
> 4. The most serious problem are the records that are completely invalid,
> due to a missing or malformed field. It would be nice if the gene
> association files were checked for data integrity before they are
> published. It would be easy to write a Perl script to automate this
> checking.
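
An integrity check along the lines suggested in item 4 might look like the
sketch below, written in Python rather than Perl. It assumes a
tab-delimited file, comment lines starting with "!", and a GOid in the
fifth field. The expected number of fields is passed in by the caller
rather than assumed, and the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# A GOid is "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


def find_invalid_records(
    lines: Iterable[str],
    expected_fields: int,
    go_col: int = 4,  # assumption: GOid is the 5th field
) -> List[Tuple[int, str]]:
    """Return (line number, problem) for records with missing or malformed fields."""
    problems: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("!"):
            continue  # blank or comment line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_fields:
            problems.append(
                (lineno, f"expected {expected_fields} fields, found {len(fields)}")
            )
            continue
        if not GO_ID_PATTERN.match(fields[go_col]):
            problems.append((lineno, f"malformed GO ID: {fields[go_col]!r}"))
    return problems
```

This only checks the field count and the GOid syntax. Checks on the other
fields would need the format specification for each column.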
--
This message is from the GOFriends moderated mailing list. A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list? E-mail: owner-gofriends at geneontology.org
Subscribing send "subscribe" to gofriends-request at geneontology.org
Unsubscribing send "unsubscribe" to gofriends-request at geneontology.org
Web: http://www.geneontology.org/