H1N1 Sequence and Publications Mapping

May 1st, 2009

The recent outbreak of H1N1 has increased researchers' attention to flu viruses. The strain of H1N1 that originated in Mexico in April has already been sequenced by a number of different researchers, and the sequences have been posted to the NCBI databases [GenBank, 2009]. One question of interest to researchers is how this new strain resembles and differs from versions of the flu virus that have been experienced and studied in the past. It is notable that swine flu has historically been rare in humans and that the number of published studies is limited: “Part of the problem is that swine flu infections of humans have rarely been documented. Virologist Christopher Olsen of the University of Wisconsin, Madison, School of Veterinary Medicine co-authored a study in 2007 that found only 50 cases in the biomedical literature dating back to 1958” [Cohen, 2009].

The Large Graph Layout (LGL) algorithm was originally developed to show sequence homology between proteins across multiple organisms, and it was used to produce a map of proteins [Adai, 2004]. Within that map, co-location of sequences is shown to be correlated with similarities in function and evolution. It may therefore be useful to run LGL on the set of virus sequences in the NCBI databases to provide a map that places the new H1N1 virus in the functional and evolutionary context of other flu viruses.

There is some evidence that H1N1 may be related to the flu strains of 1918 and 1930, as reported by Juergen A. Richt of Kansas State University [Richt, 2009]. Richt’s work is scheduled to be published in the Journal of Virology this month. Olsen’s 2007 study of research on swine flu goes back to 1958. Plotting research papers on swine flu against the positions of the corresponding flu sequences in the map may help researchers understand how well the research covers the breadth of viral homology or dissimilarity.

The methodology to produce a homology map seems straightforward in theory but potentially time consuming in practice. The exercise could also be useful for identifying bottlenecks that slow the quick profiling of new viral strains. First, the virus sequences would need to be identified in the NCBI database. Then each subsequence of interest, such as H1, H5, N1, etc., could be identified. All of the subsequences selected for inclusion in the map would then need to be compared with one another using BLAST. The continuous measure of similarity for each pair would need to be coded into a boolean indicating whether an edge should be instantiated between the two subsequences. For this it seems reasonable to use the threshold used in the original LGL study. A graph would then be instantiated with nodes representing subsequences and edges representing BLAST homology measures that pass the threshold. At this point the data could be run through LGL to get a 2D or 3D layout for the map. This would serve as the base layer for additional annotation.
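The thresholding and graph-building steps could be sketched roughly as follows. This is only an illustration under several assumptions: that the all-against-all BLAST run was saved in the standard 12-column tabular format (query ID, subject ID, ..., e-value, bit score), that similarity is judged by e-value, and that LGL is given a plain edge list with one whitespace-separated pair of node names per line. The cutoff value, the function names, and the sequence IDs are hypothetical placeholders; the threshold from the original LGL study would be substituted for the cutoff.

```python
from typing import Iterable, Set, Tuple

# Hypothetical cutoff for illustration only. The post proposes reusing the
# threshold from the original LGL study, which is not reproduced here.
EVALUE_CUTOFF = 1e-12


def blast_hits_to_edges(
    lines: Iterable[str], evalue_cutoff: float = EVALUE_CUTOFF
) -> Set[Tuple[str, str]]:
    """Turn BLAST tabular hits into a set of undirected edges.

    Assumes the standard 12-column tabular layout, where column 1 is the
    query ID, column 2 is the subject ID, and column 11 is the e-value.
    """
    edges: Set[Tuple[str, str]] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if query == subject:
            continue  # a sequence always matches itself; not an edge
        if evalue <= evalue_cutoff:
            # Sort the pair so A-B and B-A collapse into one edge.
            first, second = sorted((query, subject))
            edges.add((first, second))
    return edges


def edges_to_edge_list(edges: Iterable[Tuple[str, str]]) -> str:
    """Format edges as one 'nodeA nodeB' pair per line."""
    return "\n".join(f"{a} {b}" for a, b in sorted(edges))


# Made-up hits between three placeholder subsequences.
sample_hits = [
    "seqA\tseqA\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t900",
    "seqA\tseqB\t95.0\t500\t25\t0\t1\t500\t1\t500\t1e-50\t700",
    "seqB\tseqA\t95.0\t500\t25\t0\t1\t500\t1\t500\t1e-50\t700",
    "seqA\tseqC\t40.0\t100\t60\t0\t1\t100\t1\t100\t0.5\t30",
]

if __name__ == "__main__":
    print(edges_to_edge_list(blast_hits_to_edges(sample_hits)))
```

Collapsing the continuous score into a boolean discards information, so keeping the raw e-values alongside the edge list would make it cheap to try other cutoffs later.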

Correlating the published research with the sequences will require processing the metadata, or perhaps the full text, of the articles, depending on how thoroughly the relations between the viruses studied and their NCBI indexes have already been established. Given the small number of swine flu publications, the survey paper might be used to establish the relations manually. These could then be plotted as a layer in the map to show the intensity of research relative to the distribution of sequences.
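Once the paper-to-sequence relations exist, the research layer reduces to a count per node. The sketch below assumes the relations have been collected by hand into a simple mapping from a paper label to the sequence IDs it studies. The function names, paper labels, and sequence IDs are all hypothetical placeholders, not real records.

```python
from collections import Counter
from typing import Dict, Iterable, List


def papers_per_sequence(paper_to_sequences: Dict[str, List[str]]) -> Counter:
    """Count how many papers mention each sequence ID."""
    counts: Counter = Counter()
    for sequences in paper_to_sequences.values():
        # A paper counts once per sequence, even if it lists it twice.
        for seq_id in set(sequences):
            counts[seq_id] += 1
    return counts


def research_coverage(all_sequences: Iterable[str], counts: Counter) -> float:
    """Fraction of sequences in the map with at least one linked paper."""
    sequences = set(all_sequences)
    if not sequences:
        return 0.0
    studied = sum(1 for seq_id in sequences if counts.get(seq_id, 0) > 0)
    return studied / len(sequences)


# Made-up relations between two placeholder papers and four sequences.
sample_relations = {
    "paperX": ["seqA", "seqB"],
    "paperY": ["seqA"],
}
sample_counts = papers_per_sequence(sample_relations)
sample_coverage = research_coverage(["seqA", "seqB", "seqC", "seqD"], sample_counts)
```

The per-sequence counts could drive node color or size in the laid-out map, while the coverage fraction gives a single rough number for how much of the sequence space has any published work attached.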

With these two layers in place, a researcher could get a sense of how well studied the broad space of viral sequences actually is. This may be a useful perspective for researchers such as Richt, who are evaluating the evolutionary development of H1N1, or for writers such as Cohen, who are trying to place the new strain in the context of existing work.

This entry was posted on Friday, May 1st, 2009 at 12:06 pm.