[Gofriends] Number of annotated gene products

Rachael Huntley huntley at ebi.ac.uk
Mon Nov 9 02:24:39 PST 2009


Hi Purvesh,

The reason you are seeing such a dramatic decrease in annotations and 
gene products is that in February 2009 we stopped using the 
International Protein Index human protein set to make the human gene 
association file and started using the complete human proteome from 
UniProtKB/Swiss-Prot. The sharp decrease is due to us no longer 
providing electronic annotations to UniProtKB/TrEMBL proteins in the 
human file.
Here is the news release we sent out on February 6th 2009 describing 
this change:
<ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/README> 

> Please note that the gene_association.goa_human file provided in the 
> next GOA release will no longer be made using the IPI non-redundant 
> human protein set (http://www.ebi.ac.uk/IPI/IPIhelp.html). Instead, the 
> next version of this file will use the complete human proteome now 
> available in UniProtKB/Swiss-Prot 
> (http://www.uniprot.org/news/2008/09/02/release). 
> This change will enable us to provide a non-redundant set of 
> annotations for the human proteome, therefore please expect a sharp 
> drop in both the number of distinct sequence identifiers and in the 
> total number of electronic annotations in the new file.
>
> The name and format of this human file will remain the same, however 
> annotations will be assigned to proteins only from the 'UniProtKB' 
> (column 1) database source. Human IPI identifiers will continue to be 
> included in column 11 of annotations.
>
> In addition, the cross-references file for human IPI set 
> (human.xrefs.gz), will no longer be provided. Instead, identifier 
> mapping will be possible using the UniProt ID mapping file, available 
> from: 
> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
>
> idmapping.dat.gz is a tab-delimited table, which includes mappings for 
> 20 different sequence identifier types (and will be expanded in time 
> for the next file release to include IPI identifiers).
>
> A readme for this file is available from: 
> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/README
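
As a rough illustration of how the ID mapping file could stand in for human.xrefs.gz, here is a minimal Python sketch. It assumes each line of idmapping.dat.gz holds three tab-separated columns (UniProtKB accession, identifier type, external identifier) and that IPI rows carry the type string "IPI"; both points are assumptions to check against the README. The function names and the sample rows are made up for the example.

```python
import gzip
import io
from collections import defaultdict


def load_mappings(lines, id_type):
    """Map external identifiers of one type to UniProtKB accessions.

    Assumes three tab-separated columns per line:
    UniProtKB accession, identifier type, external identifier.
    """
    mapping = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession, kind, external_id = fields
        if kind == id_type:
            mapping[external_id].add(accession)
    return dict(mapping)


def load_mappings_from_gz(path, id_type):
    """Same as load_mappings, but reads a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return load_mappings(handle, id_type)


# Made-up sample rows, not real entries from the file.
SAMPLE = io.StringIO(
    "P00001\tIPI\tIPI00000001\n"
    "P00001\tGeneID\t1111\n"
    "P00002\tIPI\tIPI00000001\n"
    "P00003\tIPI\tIPI00000002\n"
)
ipi_to_uniprot = load_mappings(SAMPLE, "IPI")
print(ipi_to_uniprot)
```

Values are kept as sets because one external identifier may map to more than one UniProtKB accession. For the real file, something like load_mappings_from_gz("idmapping.dat.gz", "IPI") would be the entry point, assuming the file has been downloaded locally.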

I hope that helps.

Best wishes,
Rachael.


Purvesh Khatri wrote:
> Hi Chris,
>
> Thank you for the quick reply. That explains the discrepancy. However, 
> this leads me to another question.
>
> Running the same query (i.e., counting the number of annotated genes) 
> on the October 2008 GO assocdb returns the number of genes as 36,726 
> versus 18,587 genes in the November 2009 release. The number of annotated 
> genes is essentially reduced by almost 50% between October 2008 and 
> November 2009. The number of associations for human in the same period 
> has gone down from 197,411 to 159,303. What is the reason for such a 
> dramatic reduction in the number of associations (and the 
> corresponding reduction in the number of annotated genes)?
>
> Once again, thank you for your help.
>
> Cheers,
>
> Purvesh
>
>
> ----- Original Message -----
> From: "Chris Mungall" <cjm at berkeleybop.org>
> To: "Purvesh Khatri" <pkhatri at stanford.edu>
> Cc: "gofriends" <gofriends at genome.stanford.edu>, "Daniel Barrell" 
> <dbarrell at ebi.ac.uk>
> Sent: Friday, November 6, 2009 6:32:20 PM GMT -08:00 US/Canada Pacific
> Subject: Re: [Gofriends] Number of annotated gene products
>
>
> Hi Purvesh,
>
> The reason for the lower number is that you are counting gene symbols,  
> not gene IDs. Try this:
>
> select count(distinct g.dbxref_id) from association a, gene_product g,  
> species s
>       where a.gene_product_id = g.id and g.species_id = s.id
>       and s.ncbi_taxa_id = 9606;
>
> It gives the right number (18587)
>
> Shouldn't symbols be unique within a species, you might ask? We can take  
> a look:
>
> select
>    g1.symbol,
>    x1.xref_dbname,
>    x1.xref_key,
>    x2.xref_dbname,
>    x2.xref_key
>   from
>    gene_product g1,
>    gene_product g2,
>    dbxref x1,
>    dbxref x2,
>    species s
>   where
>    g1.species_id = s.id and
>    g2.species_id = s.id and
>    g1.dbxref_id = x1.id and
>    g2.dbxref_id = x2.id and
>    s.ncbi_taxa_id = 9606 and
>    g1.symbol=g2.symbol and
>    g1.id != g2.id;
>
> As you can see, a subset of these are due to alternate isoforms of a  
> generic protein sharing the same symbol. This is an area we're  
> actively looking into.
>
> In other cases we have what appear to be different proteins sharing  
> the same symbol:
>
> | ERVK6     | UniProtKB/Swiss-Prot | Q9Y6I0    | UniProtKB/Swiss-Prot | Q9BXR3    |
> | ERVK6     | UniProtKB/Swiss-Prot | Q9Y6I0    | UniProtKB/Swiss-Prot | Q9WJR5    |
> | ERVK6     | UniProtKB/Swiss-Prot | Q9Y6I0    | UniProtKB/Swiss-Prot | Q7LDI9    |
> | ERVK6     | UniProtKB/Swiss-Prot | Q9Y6I0    | UniProtKB/Swiss-Prot | Q69383    |
> | ERVK6     | UniProtKB/Swiss-Prot | Q9Y6I0    | UniProtKB/Swiss-Prot | Q69384    |
>
> I haven't looked at these and I have to head off just now, but I can  
> get back to you later.
>
> Cheers
> Chris
>
> On Nov 6, 2009, at 3:41 PM, Purvesh Khatri wrote:
>
> > Hi,
> >
> > I am trying to count the number of gene products currently annotated  
> > in GOA for human. I imported "go_200911-assocdb-tables.tar.gz" in a  
> > local database and used the following two queries to count the  
> > number of unique gene products with and without "IEA" evidence code:
> >
> > select count(distinct g.symbol) from association a, gene_product g,  
> > species s
> >      where a.gene_product_id = g.id and g.species_id = s.id
> >      and s.ncbi_taxa_id = 9606;
> >
> > select count(distinct g.symbol) from association a, evidence e,  
> > gene_product g, species s
> >      where a.gene_product_id = g.id and e.association_id = a.id and  
> > g.species_id = s.id
> >      and s.ncbi_taxa_id = 9606 and e.code != 'IEA';
> >
> > My question is whether these queries are correct or not. The reason  
> > for asking the question is that using the first query, I get 18098  
> > gene products as being annotated. However, the "Current annotations"  
> > page on GO website  
> > (http://geneontology.org/GO.current.annotations.shtml?all) lists the  
> > number of annotated gene products as 18587.
> >
> > Thank you for your help.
> >
> > Best regards,
> >
> > Purvesh Khatri, Ph.D.
> > Postdoc (Butte/Sarwal labs)
> > Stanford University
> > Center for Biomedical Information Research (BMIR)
> > 251 Campus Dr., Stanford, CA 94305
> > Phone: (313) 433-2836
> >
> > Division of Nephrology
> > Department of Pediatrics
> > Stanford Medical School
> > 300 Pasteur Dr., Room G327
> > Stanford, CA 94305
> > Phone: (650) 724-3765
> >
> > _______________________________________________
> > Gofriends mailing list
> > Gofriends at geneontology.org
> > http://fafner.stanford.edu/mailman/listinfo/gofriends
>


-- 
GOA and IntAct Curator
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge, CB10 1SD
UK

Tel: 01223 492515
Fax: 01223 494468
Email: huntley at ebi.ac.uk
GOA: http://www.ebi.ac.uk/GOA
IntAct: http://www.ebi.ac.uk/intact
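
The symbol-versus-identifier discrepancy discussed in the quoted thread can be reproduced on a toy database. The sketch below uses Python's built-in sqlite3 module with a cut-down, assumed version of the association, gene_product and species tables (only the columns used in the thread's queries); the rows, including the symbol TOY1 and the dbxref ids, are invented for illustration and are not taken from the GO database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, ncbi_taxa_id INTEGER);
    CREATE TABLE gene_product (
        id INTEGER PRIMARY KEY,
        symbol TEXT,
        dbxref_id INTEGER,
        species_id INTEGER
    );
    CREATE TABLE association (id INTEGER PRIMARY KEY, gene_product_id INTEGER);

    INSERT INTO species VALUES (1, 9606);
    -- Three gene products; two of them share the symbol ERVK6.
    INSERT INTO gene_product VALUES (1, 'ERVK6', 101, 1);
    INSERT INTO gene_product VALUES (2, 'ERVK6', 102, 1);
    INSERT INTO gene_product VALUES (3, 'TOY1', 103, 1);
    -- Gene product 3 has two annotations; the others have one each.
    INSERT INTO association VALUES (1, 1), (2, 2), (3, 3), (4, 3);
""")

# Same join as in the thread, parameterised on the column being counted.
COUNT_SQL = """
    SELECT count(DISTINCT g.{column})
    FROM association a
    JOIN gene_product g ON a.gene_product_id = g.id
    JOIN species s ON g.species_id = s.id
    WHERE s.ncbi_taxa_id = 9606
"""

by_symbol = conn.execute(COUNT_SQL.format(column="symbol")).fetchone()[0]
by_dbxref = conn.execute(COUNT_SQL.format(column="dbxref_id")).fetchone()[0]
print(by_symbol, by_dbxref)  # 2 3
```

Counting distinct symbols merges the two ERVK6 rows into one, while counting distinct dbxref_id values keeps them apart, which is the same effect as the 18098 versus 18587 difference described in the thread.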



