[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
val at sanger.ac.uk
Fri Sep 12 01:53:23 PDT 2008
Some other points maybe worth considering here,
1. Ensembl appear to derive their primary GO data from UniProt; UniProt
only include a subset of evidence codes, which excludes some of the
curator-assigned annotations from the MODs (including ND, ISS, IC).
Wouldn't it be preferable for Ensembl to use the MOD-derived curated
data, removing the need to create many of the IEA mappings?
2. Could UniProt import all of the curated data for the MODs, rather
than just a subset, especially for the reference genomes?
3. The Ensembl entry has an IEA annotation to DNA binding, but Tacc3 does
not appear to have DNA binding domains. What is the source of the Ensembl
IEA data for Tacc3? (It isn't recorded, and knowing it would be useful.)
Mike Cherry wrote:
> I wouldn't say this is a bug. The 1302948 ID is used by RGD when the
> annotations have been created by the RGD project. Those annotations
> that have the ENSEMBL ID ENSRNOP00000034933 have been created by
> ENSEMBL. RGD is just passing the ENSEMBL annotations through in their
> The gene association file is created by RGD. While some groups do map
> all the external IDs to internal IDs this is not done by all.
> One suggestion for your example is to filter out the IEA annotations.
> That would remove the ENSEMBL associations for this example. You
> would likely want to do that anyway, or at least compare your
> statistics with and without the computationally defined annotations.
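As a rough illustration of the filtering suggested above, here is a minimal Python sketch. It assumes the tab-delimited gene association layout in which column 1 is the source database, column 2 the object ID, column 5 the GO ID and column 7 the evidence code, with comment lines starting with "!". The function name, the GO ID placeholders and the sample rows are invented for illustration and are not taken from the actual RGD file.

```python
def split_by_evidence(lines, excluded=("IEA",)):
    """Split association rows into (kept, dropped) by evidence code.

    Assumes tab-delimited rows with the evidence code in column 7
    (index 6) and comment lines starting with '!'.
    """
    kept, dropped = [], []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row: no evidence code column
        if cols[6] in excluded:
            dropped.append(cols)
        else:
            kept.append(cols)
    return kept, dropped


# Made-up rows: the two IDs come from the thread, everything else
# (GO IDs, references, evidence codes) is a placeholder.
SAMPLE = [
    "!illustrative header line",
    "\t".join(["RGD", "1302948", "Tacc3", "", "GO:XXXXXXX", "REF:placeholder", "IDA"]),
    "\t".join(["ENSEMBL", "ENSRNOP00000034933", "Tacc3", "", "GO:YYYYYYY", "REF:placeholder", "IEA"]),
]

kept, dropped = split_by_evidence(SAMPLE)
print(len(kept), "kept;", len(dropped), "dropped as IEA")
```

Running any downstream statistics once on `kept` alone and once on `kept + dropped` gives the with/without comparison mentioned above.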
> On Sep 9, 2008, at 10:22 AM, Gabriel Berriz wrote:
>> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
>>> The gene association files are non-redundant. Primary model organisms
>>> have responsibility for integrating annotations from multiple sources
>>> and submitting a non-redundant file to the GOdb. QC checks on the
>>> also remove redundancies.
>> Hi, Judy. My word choice was not a very good one when I wrote of
>> "redundancies", so let me give an example of what I meant. It comes
>> from the latest gene_association.rgd.gz file. (This example is the
>> first one I followed up on of the 1000 or so that I mentioned in my
>> previous email.)
>> The latest gene_association.rgd.gz file contains 15 associations for
>> RGD ID 1302948, and 4 associations for ENSEMBL ID
>> ENSRNOP00000034933. In fact, according to both Ensembl and RGD
>> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these two
>> identifiers both refer to the same entity (transforming acidic
>> coiled-coil containing protein 3, aka Tacc3). Hence, the file uses
>> two names for the same thing. Why?
>> The reason why I bring this problem up is that, in our work, we
>> compute statistics that are very sensitive to how many genes have a
>> particular GO attribute, therefore it is crucial for us to count the
>> associations in this example as being 19 belonging to the same
>> protein, rather than 15 belonging to one and 4 belonging to another.
>> This accounting task is made significantly more difficult by the fact
>> that the association file uses two different names for the same thing.
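The accounting problem described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the alias table is hand-written here (in practice it would come from an identifier-mapping resource), the function names are hypothetical, and the GO ID is a placeholder. The only facts taken from the thread are the two identifiers and that they name the same protein.

```python
from collections import defaultdict

# Hypothetical alias table: maps a (database, object ID) pair to the
# identifier chosen as canonical. Only this one pair is from the thread.
ALIASES = {
    ("ENSEMBL", "ENSRNOP00000034933"): ("RGD", "1302948"),
}


def canonical(db, obj_id, aliases):
    """Return the canonical identifier, or the input if no alias is known."""
    return aliases.get((db, obj_id), (db, obj_id))


def genes_per_term(rows, aliases):
    """Count distinct gene products per GO term after merging aliases.

    `rows` is an iterable of (database, object ID, GO ID) triples.
    """
    genes = defaultdict(set)
    for db, obj_id, go_id in rows:
        genes[go_id].add(canonical(db, obj_id, aliases))
    return {go_id: len(ids) for go_id, ids in genes.items()}


# Two placeholder associations to the same (made-up) term.
ROWS = [
    ("RGD", "1302948", "GO:XXXXXXX"),
    ("ENSEMBL", "ENSRNOP00000034933", "GO:XXXXXXX"),
]

print(genes_per_term(ROWS, aliases={}))       # IDs treated as two proteins
print(genes_per_term(ROWS, aliases=ALIASES))  # IDs merged into one protein
```

Without the alias table the term appears to be attached to two gene products; with it, both rows collapse onto one, which is the count the statistics need.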
>> Maybe I'm wrong here, but this looks to me like a bug rather than a
>> feature: I can't see that any good could come of using multiple
>> names for the same thing in a document like this.
>> If it is indeed a bug, would it be too difficult to fix? I.e. would
>> it be too difficult for GO and the purveyors of association files to
>> use a consistent nomenclature whenever possible?
>> If it's of any help with this, we have a tool, called Synergizer, for
>> bulk mapping of identifiers from one namespace to another, and it is
>> a simple matter to set up a pipeline to do it automatically (see
>> http://llama.med.harvard.edu/synergizer/doc). We'd be happy to help
>> with this in any way we can. (Although I imagine that the
>> organizations that generate such association files are the ultimate
>> experts for resolving such nomenclature issues.)
>> Also, as I said earlier, the example above is not isolated. For R.
>> norvegicus alone there are about 1000, and that's only focusing on
>> RGD vs. ENSEMBL IDs. And the problem is not limited to R.
>> norvegicus. Among the organisms that I have analyzed, I found
>> similar nomenclature inconsistencies with several others, including
>> B. taurus, G. gallus, C. elegans, and H. sapiens.
>> Thanks for your comments!
>> Gabriel Berriz
>> Gabriel F. Berriz, PhD
>> Bioinformatics Developer
>> Roth Lab
>> Biological Chemistry and Molecular Pharmacology -- Harvard Medical
>> Seeley G. Mudd Building 322B
>> Boston, MA 02115-5701
>> Telephone: 617.432.3555
>> Fax: 617.432.3557
>> Gofriends mailing list
>> Gofriends at geneontology.org
Valerie Wood Tel: 01223 496909
S. pombe Genome Project Fax: 01223 494919
Wellcome Trust Sanger Institute email: val at sanger.ac.uk
Wellcome Trust Genome Campus http://www.genedb.org/genedb/pombe
Hinxton, Cambridge, CB10 1HH http://www.sanger.ac.uk/Projects/S_pombe
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.