Using GO annotation in a blast reflib

Mark Waugh mew at ncgr.org
Thu Oct 5 10:34:03 PDT 2000


Hi Suzanna,

Thanks for your thoughtful reply. My comments follow:

Suzanna Lewis wrote:

> > We are interested in augmenting the sequence definitions in standard
> > Blast output with their corresponding GO annotations where available.
> > To do this, we can either create reflibs containing only those
> > sequences with GO annotations,.....
> 
> Yep, I did that exercise once myself, but just as a one-shot. We
> have been ruminating and gradually approaching this over the last
> year.

We are building an automated sequence analysis "pipeline" which includes
similarity searching as well as motif searching against Blocks+. From
the standpoint of administering the pipeline, the fewer steps the better,
and the less we have to resolve or merge annotation the better. So while
this approach seems the easiest at the outset, it requires the
independent step of gathering together all of these sequences from each
of the contributing DBs, when presumably most are already in Genbank and
are represented in NR (I'm making the pretty big assumption that most of
the consortium members' sequences also get submitted to Genbank. Is this
right?). It also means an extra blast run in addition to NR, and then
resolving the results from the two runs. Further down the road we would
also like to traverse the GO tree to find lowest common denominators,
which could resolve conflicting annotations resulting from dissociating
high scoring blocks from Blocks+ into their protein sequences. This too
will require getting from an external db accession (Swiss-Prot or
Genbank) to a GOid, so this will be an ongoing need.
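
To make the "lowest common denominator" idea concrete, here is a minimal
sketch in Python. It reads the phrase as "the most specific terms that two
annotations share as ancestors", which is an interpretation on my part. The
term ids and the toy parent map are made up for illustration and are not
real GO ids; in practice the parent links would come from the GO database.

```python
from collections import deque

# Hypothetical toy fragment of a GO-like graph: child -> list of parents.
# These ids are invented for illustration only.
PARENTS = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:kinase": ["T:catalysis"],
    "T:protein_kinase": ["T:kinase"],
    "T:lipid_kinase": ["T:kinase"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def lowest_common_terms(term_a, term_b, parents):
    """Return shared ancestors that have no shared descendant below them."""
    common = ancestors(term_a, parents) & ancestors(term_b, parents)
    # A shared term is "lowest" if it is not an ancestor of another shared term.
    return {
        t
        for t in common
        if not any(t != other and t in ancestors(other, parents) for other in common)
    }
```

Because a term can have more than one parent, the function returns a set
rather than a single term; two annotations may share several equally
specific ancestors.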
> 
> > ....or we can scan the HSPs resulting from a search
> > against either NT or NR for GB identifiers that have corresponding GOids
> > from the GO database.
> 
> That would work too, -if- the GB identifiers were in the database.

Yep! Exactly. There's the rub.
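
For what it's worth, the scanning step itself could be as simple as the
sketch below. It assumes the hit definition lines carry NCBI-style
pipe-delimited identifiers (gi|...|gb|ACCESSION.VERSION|...) and that an
accession-to-GOid mapping is already available as a plain dictionary. The
function names, the accessions, and the GOids in the usage example are
placeholders, not real records.

```python
import re

# Matches the GenBank part of a pipe-delimited identifier such as
# "gi|0000001|gb|AAA00001.1|". The optional ".N" version suffix is dropped.
GB_PATTERN = re.compile(r"\bgb\|([A-Z]+_?\d+)(?:\.\d+)?\|")


def genbank_accessions(defline):
    """Return every GenBank accession found in one hit definition line."""
    return GB_PATTERN.findall(defline)


def annotate_hits(deflines, accession_to_goids):
    """Pair each definition line with the GOids known for its accessions."""
    annotated = []
    for defline in deflines:
        goids = []
        for accession in genbank_accessions(defline):
            # Hits without a known mapping simply contribute no GOids.
            goids.extend(accession_to_goids.get(accession, []))
        annotated.append((defline, goids))
    return annotated


# Usage with placeholder data:
hits = [
    "gi|0000001|gb|AAA00001.1| hypothetical protein A",
    "gi|0000002|emb|CAA00002.1| hypothetical protein B",
]
mapping = {"AAA00001": ["GO:0000001"]}
for defline, goids in annotate_hits(hits, mapping):
    print(defline, goids)
```

The hard part is not this loop but filling the mapping, which is exactly
what is missing from the database at the moment.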
> 
> > In either case it would seem to be a fairly
> > straightforward mapping between Genbank accessions and GOids. The
> > problem is that, although there is a place for Genbank accession numbers
> > in the table dbxref, there aren't any in the latest version of the
> > database (we have a local version running in house and obtained the
> > latest update on Tuesday from John Richter).
> 
> Right, although really there needs to be an additional table in the
> database to hold this information. The dbxref table is one half and the
> gene_product table is the other, but there isn't a table specifically
> to link a particular gene_product to a particular GB dbxref entry.  All
> we have right now are these 3
> 
> 1. term_dbxref  where the definition came from
> 2. gene_product where to find the complete model organism db entry
> 3. evidence     the supporting citation for an association between
>                 a term and a gene_product
> 
> It's easy to create a new table, say 'gene_product_seq',
> something like this:
> 
> create table gene_product_seq (
>         gene_product_id integer not null,
>         dbxref_id       integer not null
> );
> 
> or to add another dbxref_id to the gene_product table (though that
> mandates a 1:1 relationship, so I'll do the separate table).

Exactly. Our pipeline has its own relational database to store sequence,
annotation, analysis parameters, etc., and we're leaning towards
generating a new table in this DB rather than modifying GO itself. This
table will act as an index for looking up GOids based on GB accessions
to query GO directly (say, for traversing tree relationships), and we're
thinking we can also populate it with each sequence's "first-order" GO
annotation (without parent/child relationships) for rapid querying. All
of this is still in the planning stages, so anything could change, and we
welcome comments/ideas. Additionally, all of our stuff is implemented in
Sybase 11 and we are working on porting GO to that DBMS. As soon as I
have functional Sybase table creation scripts and have tested the DB
with a data dump from MySQL, I will send them off to John Richter for
inclusion on his database page. I'll also send an ER diagram
representing my interpretation of the GO schema for posting, if he would
like it.
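
As a rough illustration of the index table, here is a minimal sketch using
Python's built-in sqlite3 module as a stand-in for Sybase. The table name
gb_go_index, its column names, and the helper functions are all hypothetical,
and the accessions and GOids in the usage example are placeholders.

```python
import sqlite3


def create_index(conn):
    """Create the (hypothetical) accession-to-GOid index table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS gb_go_index (
            gb_accession TEXT NOT NULL,
            go_id        TEXT NOT NULL,
            PRIMARY KEY (gb_accession, go_id)
        )
        """
    )


def add_annotations(conn, pairs):
    """Insert (gb_accession, go_id) pairs, silently skipping duplicates."""
    conn.executemany(
        "INSERT OR IGNORE INTO gb_go_index (gb_accession, go_id) VALUES (?, ?)",
        pairs,
    )
    conn.commit()


def goids_for(conn, gb_accession):
    """Return the first-order GOids recorded for one GenBank accession."""
    rows = conn.execute(
        "SELECT go_id FROM gb_go_index WHERE gb_accession = ? ORDER BY go_id",
        (gb_accession,),
    )
    return [row[0] for row in rows]


# Usage with placeholder data:
conn = sqlite3.connect(":memory:")
create_index(conn)
add_annotations(conn, [("AAA00001", "GO:0000002"), ("AAA00001", "GO:0000001")])
print(goids_for(conn, "AAA00001"))
```

The composite primary key allows one accession to carry several GOids, so
the table does not force a 1:1 relationship.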

> 
> > We have experimented with
> > the idea of using a webbot to obtain the GB accession numbers from the
> > consortium members' individual web sites based on identifiers (the
> > contributing group's internal accession numbers) parsed from the
> > association files available on the GO website, and while this appears to
> > work, it's pretty convoluted and we don't want to flood these sites with
> > automated requests. This also introduces a potential problem of
> > synchronization between the version of GO we are running and all of the
> > ancillary files we are parsing to get the GB accessions.
> >
> > Are there any plans in the near future to populate the dbxref table with
> > GB accession numbers, and if not, can anyone suggest where we might
> > obtain this information for each sequence represented in GO?
> >
> 
> Ok so the answer is yes. The fiddly bit is implementing the plan. We
> have (thanks Mike, Midori, et al.) already a lovely file for yeast
> SGD_GO_assoc_prot (it's actually a fasta file) that has the data needed
> in its header lines.

Fantastic! How does one obtain it?
> 
> I am pretty sure we can provide the same for the fly soon. I'd like to
> wait until we have a handle on the second release of the fly so that
> the data is as good as we can make it. And we're late in providing this
> file to the GO repository, but it's on the job list.

Great, that would be a tremendous help.

> 
> Each of the above gives us a 1:1 correspondence, one gene product = one
> protein. 

What about enzymes composed of multiple subunits? This gets even
trickier when one or more of the subunits has catalytic activity on its
own, especially if it differs from that of the oligomer. I haven't poked
around enough yet to know how GO deals with this.

> Mouse is a bit more difficult. Do you have a set of mouse
> sequences already? 

Nope, not yet.

> Is it 1:M? Is this livable? If I remember correctly
> we can ultimately expect a file similar to the one yeast has already
> provided from MGD. It might be better just to wait until MGD provides
> this and we get updates from them.
> 
I just started looking around and don't know much about anybody's data
yet. I will contact MGD directly to see if they have something similar
to what you mentioned for yeast. Thanks again for your time.

Cheers,
M


-- 
Mark Waugh, Scientist 
National Center for Genome Resources
2935 Rodeo Park Drive East
Santa Fe, NM 87505, USA
mew at ncgr.org http://www.ncgr.org	
Ph: (505) 995-4446, (800) 450-4854 Fax: (505) 982-7690

--
This message is from the GOFriends Mailing list.  A list of public
announcements and discussion of the Gene Ontology (GO) project.
Problems with the list?           E-mail: owner-gofriends at genome.stanford.edu
Subscribing   send   "subscribe"   to   gofriends-request at genome.stanford.edu
Unsubscribing send   "unsubscribe"  to  gofriends-request at genome.stanford.edu
Web:          http://www.geneontology.org/


