[Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]
Petri, Victoria
vpetri at mcw.edu
Tue Sep 9 13:27:31 PDT 2008
Hi Gabriel,
The gene association files are non-redundant.
The RGD GO annotations come from two sources: manual annotation of genes,
and annotations that are brought in electronically from MGI and GOA via
QC-based pipelines.
GOA data is appended 'as is' at the end of the gene association file in
two cases: when no match for it is found in RGD, or when a match is found
but the annotation is already in the database for that gene. It is
important to keep in mind that GOA annotates proteins rather than genes
(which we and other MODs do). If multiple protein transcripts of one gene
get the same annotation (which is not a redundancy), one could/would be
loaded into the database and the others would be appended at the end of
the GAF.
As Mike has already suggested, I would filter out IEAs, which would 1)
remove the Ensembl IDs in question and 2) keep the annotations that have
been experimentally determined either for rat or for an orthologous
gene. If possible, I would also compare the protein IDs associated with
one gene against the Ensembl IDs at the end of the gene association file,
because of the one-to-many gene-to-protein relationship.
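A minimal sketch of the IEA filter suggested above, in Python. It assumes
the standard tab-separated GAF layout, in which column 7 holds the
evidence code and lines starting with "!" are comments. The function name
drop_iea and the two sample rows are made up for illustration; the GO and
reference IDs in them are placeholders, not real RGD annotations.

```python
EVIDENCE_COL = 6  # GAF column 7 (0-based index 6) holds the evidence code


def drop_iea(gaf_lines):
    """Yield GAF rows (lists of fields) whose evidence code is not IEA."""
    for line in gaf_lines:
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only rows that have an evidence column and are not IEA.
        if len(fields) > EVIDENCE_COL and fields[EVIDENCE_COL] != "IEA":
            yield fields


# Illustrative rows only: the GO IDs and references are placeholders.
sample = [
    "! sample header line\n",
    "RGD\t1302948\tTacc3\t\tGO:0000001\tRGD:0000000\tIDA\t\tP\t\t\tgene\ttaxon:10116\t20080101\tRGD\n",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000001\tGOA:0000000\tIEA\t\tP\t\t\tprotein\ttaxon:10116\t20080101\tENSEMBL\n",
]
kept = list(drop_iea(sample))
```

With the real gene_association.rgd.gz file, the lines could be read with
gzip.open(path, "rt") and passed to the same function.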
Victoria
Victoria Petri, Ph.D.
Research Scientist
Rat Genome Database
(http://rgd.mcw.edu)
Bioinformatics Program
Human and Molecular Genetics Center
Medical College of Wisconsin
8701 Watertown Plank Road, Milwaukee, WI 53226
(414) 456-8871
Fax (414) 456-6595
vpetri at mcw.edu
vpetri at mail.brc.mcw.edu
-----Original Message-----
From: Judith Blake [mailto:jblake at informatics.jax.org]
Sent: Tuesday, September 09, 2008 1:14 PM
To: Shimoyama, Mary
Cc: Petri, Victoria
Subject: [Fwd: Re: [Gofriends] Redundancy in
go_XXXXXX-assocdb-tables/dbxref.txt]
Hi Mary,
Can you respond here? Is this a curation issue for these organisms?
Is mouse not on this list because of the substantial resources we can
bring to this project?
Judy
-------- Original Message --------
Subject: Re: [Gofriends] Redundancy in
go_XXXXXX-assocdb-tables/dbxref.txt
Date: Tue, 9 Sep 2008 13:22:49 -0400
From: Gabriel Berriz <gberriz at hms.harvard.edu>
To: Judith Blake <jblake at informatics.jax.org>
CC: <gofriends at genome.stanford.edu>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F at hms.harvard.edu>
<48C5A292.9030005 at informatics.jax.org>
On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
> Gabriel,
>
> The gene association files are non-redundant. Primary model organisms
> have responsibility for integrating annotations from multiple sources
> and submitting a non-redundant file to the GOdb. QC checks on the files
> also remove redundancies.
Hi, Judy. My word choice was not a very good one when I wrote of
"redundancies", so let me give an example of what I meant. It comes
from the latest gene_association.rgd.gz file. (This example is the
first one I followed up on of the 1000 or so that I mentioned in my
previous email.)
The latest gene_association.rgd.gz file contains 15 associations for RGD
ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933. In
fact, according to both Ensembl and RGD
(http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these
two identifiers both refer to the same entity (transforming acidic
coiled-coil containing protein 3, aka Tacc3). Hence, the file uses two
names for the same thing. Why?
The reason I bring this problem up is that, in our work, we compute
statistics that are very sensitive to how many genes have a particular
GO attribute. It is therefore crucial for us to count the associations
in this example as 19 belonging to the same protein, rather than 15
belonging to one and 4 belonging to another. This accounting task is
made significantly more difficult by the fact that the association file
uses two different names for the same thing.
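To illustrate the counting problem, here is a minimal Python sketch. It
assumes an identifier mapping is already available (for example from a
tool such as Synergizer); the function name genes_per_term, the alias
dictionary, and the GO IDs are hypothetical placeholders, not the actual
Tacc3 annotations.

```python
from collections import defaultdict


def genes_per_term(associations, alias_to_canonical):
    """Count distinct entities per GO term after mapping aliases."""
    term_to_genes = defaultdict(set)
    for object_id, go_id in associations:
        # Fall back to the ID itself when no alias is known.
        canonical = alias_to_canonical.get(object_id, object_id)
        term_to_genes[go_id].add(canonical)
    return {go_id: len(genes) for go_id, genes in term_to_genes.items()}


# Assumed mapping: the Ensembl protein ID is an alias of the RGD ID.
aliases = {"ENSRNOP00000034933": "1302948"}

# Placeholder (object ID, GO ID) pairs, for illustration only.
pairs = [
    ("1302948", "GO:0000001"),
    ("ENSRNOP00000034933", "GO:0000001"),
    ("ENSRNOP00000034933", "GO:0000002"),
]

naive = genes_per_term(pairs, {})        # two names counted separately
merged = genes_per_term(pairs, aliases)  # two names counted as one entity
```

Without the mapping, the first placeholder term appears to be attached
to two entities; with it, the count drops to one.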
Maybe I'm wrong here, but this looks to me like a bug rather than a
feature: I can't see that any good could come of using multiple names
for the same thing in a document like this.
If it is indeed a bug, would it be too difficult to fix? I.e., would it
be too difficult for GO and the purveyors of association files to use a
consistent nomenclature whenever possible?
If it's of any help with this, we have a tool, called Synergizer, for
bulk mapping of identifiers from one namespace to another, and it is a
simple matter to set up a pipeline to do it automatically (see
http://llama.med.harvard.edu/synergizer/doc). We'd be happy to help
with this in any way we can. (Although I imagine that the organizations
that generate such association files are the ultimate experts for
resolving such nomenclature issues.)
Also, as I said earlier, the example above is not isolated. For R.
norvegicus alone there are about 1000, and that's only focusing on RGD
vs. ENSEMBL IDs. And the problem is not limited to R. norvegicus.
Among the organisms that I have analyzed, I found similar
nomenclature inconsistencies with several others, including B. taurus,
G. gallus, C. elegans, and H. sapiens.
Thanks for your comments!
Gabriel Berriz
=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical
School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557