[Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]

Petri, Victoria vpetri at mcw.edu
Tue Sep 9 13:27:31 PDT 2008


 

 

Hi Gabriel,

 

The gene association files are non-redundant. 

 

The RGD GO annotations come from two sources: manual annotation of genes
and annotations that are brought in electronically from MGI and GOA via
QC_based pipelines. 

 

GOA data are appended 'as is' at the end of the gene association file in
two cases: no match is found in RGD, or a match is found but the
annotation is already in the database for that gene. It is important to
keep in mind that GOA annotates proteins rather than genes (which we and
other MODs do). If multiple protein transcripts get the same annotation
- which is not a redundancy - one could/would be loaded into the
database and the others would be appended at the end of the GAF.

 

As Mike has already suggested, I would filter out IEAs, which would 1)
remove the Ensembl IDs in question and 2) keep annotations that have
been experimentally determined either for rat or for an orthologous
gene. If possible, I would also compare the protein IDs associated with
one gene against the Ensembl IDs at the end of the gene association
file, because of the one-to-many gene-to-protein relationship.
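
A minimal sketch of the IEA filter suggested above, in Python. It assumes
the standard tab-delimited GAF layout, in which lines starting with "!"
are comments and the evidence code is the 7th column; the function name
filter_iea is hypothetical, and this is an illustration rather than
RGD's actual pipeline.

```python
def filter_iea(lines):
    """Yield GAF rows (lists of fields) whose evidence code is not IEA.

    Assumes tab-delimited GAF lines: '!' starts a comment line and the
    evidence code sits in column 7 (index 6).
    """
    for line in lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed or blank lines
        if fields[6] != "IEA":
            yield fields
```

It could be applied to the compressed file with something like
`gzip.open("gene_association.rgd.gz", "rt")` as the source of lines.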

 

Victoria

 

Victoria Petri, Ph.D.

Research Scientist

Rat Genome Database 

(http://rgd.mcw.edu)

Bioinformatics Program

Human and Molecular Genetics Center

Medical College of Wisconsin 

8701 Watertown Plank Road, Milwaukee, WI 53226

(414) 456-8871

Fax (414) 456-6595 

vpetri at mcw.edu

vpetri at mail.brc.mcw.edu

 

 

-----Original Message-----
From: Judith Blake [mailto:jblake at informatics.jax.org] 
Sent: Tuesday, September 09, 2008 1:14 PM
To: Shimoyama, Mary
Cc: Petri, Victoria
Subject: [Fwd: Re: [Gofriends] Redundancy in
go_XXXXXX-assocdb-tables/dbxref.txt]

 

Hi Mary,

 

Can you respond here?  Is this a curation issue for these organisms?
Is mouse not on this list because of the substantial resources we can
bring to this project?

 

Judy

 

-------- Original Message --------

Subject:    Re: [Gofriends] Redundancy in
go_XXXXXX-assocdb-tables/dbxref.txt

Date:       Tue, 9 Sep 2008 13:22:49 -0400

From:       Gabriel Berriz <gberriz at hms.harvard.edu>

To:   Judith Blake <jblake at informatics.jax.org>

CC:   <gofriends at genome.stanford.edu>

References:       <31552965-46E2-46A9-9C76-92C7EE3D179F at hms.harvard.edu>


<48C5A292.9030005 at informatics.jax.org>

 

 

 

On 2008.09.08 Mon, at 18:09, Judith Blake wrote:

> Gabriel,
>
> The gene association files are non-redundant.  Primary model organisms
> have responsibility for integrating annotations from multiple sources
> and submitting a non-redundant file to the GOdb.  QC checks on the files
> also remove redundancies.

 

 

Hi, Judy.  My word choice was not a very good one when I wrote of
"redundancies", so let me give an example of what I meant.  It comes
from the latest gene_association.rgd.gz file.  (This example is the
first one I followed up on of the 1000 or so that I mentioned in my
previous email.)

 

The latest gene_association.rgd.gz file contains 15 associations for RGD
ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.  In
fact, according to both Ensembl and RGD
(http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these
two identifiers both refer to the same entity (transforming acidic
coiled-coil containing protein 3, aka Tacc3).  Hence, the file uses two
names for the same thing.  Why?

 

The reason I bring this problem up is that, in our work, we compute
statistics that are very sensitive to how many genes have a particular
GO attribute, so it is crucial for us to count the associations in
this example as 19 belonging to the same protein, rather than 15
belonging to one and 4 belonging to another.  This accounting task is
made significantly more difficult by the fact that the association file
uses two different names for the same thing.
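
To illustrate the accounting problem, here is a minimal Python sketch of
counting distinct genes per GO term after mapping alternative
identifiers onto one canonical ID. It assumes the standard tab-delimited
GAF layout (DB in column 1, object ID in column 2, GO ID in column 5);
the function name and the to_canonical mapping are hypothetical, and
such a mapping would have to come from an external source such as an
identifier-translation tool.

```python
from collections import defaultdict

def genes_per_term(lines, to_canonical):
    """Map each GO ID to the set of canonical gene IDs annotated to it.

    to_canonical: hypothetical dict from (db, object_id) to a canonical
    (db, object_id); identifiers missing from it are kept unchanged.
    """
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # skip GAF comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or blank lines
        key = (fields[0], fields[1])
        genes[fields[4]].add(to_canonical.get(key, key))
    return genes
```

With an entry mapping ("ENSEMBL", "ENSRNOP00000034933") to
("RGD", "1302948"), a term annotated under both names would count one
gene rather than two.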

 

Maybe I'm wrong here, but this looks to me like a bug rather than a
feature:  I can't see that any good could come of using multiple names
for the same thing in a document like this.

 

If it is indeed a bug, would it be too difficult to fix?  I.e. would it
be too difficult for GO and the purveyors of association files to use a
consistent nomenclature whenever possible?

 

If it's of any help with this, we have a tool, called Synergizer, for
bulk mapping of identifiers from one namespace to another, and it is a
simple matter to set up a pipeline to do it automatically (see
http://llama.med.harvard.edu/synergizer/doc).  We'd be happy to help
with this in any way we can.  (Although I imagine that the organizations
that generate such association files are the ultimate experts for
resolving such nomenclature issues.)

 

Also, as I said earlier, the example above is not isolated.  For R.
norvegicus alone there are about 1000, and that's only focusing on RGD
vs. ENSEMBL IDs.  And the problem is not limited to R. norvegicus.
Among the organisms that I have analyzed, I found similar
nomenclature inconsistencies with several others, including B. taurus,
G. gallus, C. elegans, and H. sapiens.

 

Thanks for your comments!

 

Gabriel Berriz

=============================================================

Gabriel F. Berriz, PhD

Bioinformatics Developer

Roth Lab

Biological Chemistry and Molecular Pharmacology -- Harvard Medical
School

Seeley G. Mudd Building 322B

Boston, MA 02115-5701

Telephone: 617.432.3555

Fax: 617.432.3557

 

 

 

 
