SARS-CoV-2 sequences deleted during the early stages of the Wuhan outbreak provide insight.
July 24, 2021
On June 22, Dr. Jesse Bloom of the Fred Hutchinson Cancer Research Center’s evolutionary biology department stated that he had retrieved SARS-CoV-2 sequences from the early stages of the Wuhan outbreak that had been deleted from a National Institutes of Health database.
In a related Twitter thread, he described how he recovered raw sequencing data from 34 early Wuhan SARS-CoV-2 cases, how he used the data in the files to reconstruct partial sequences for 13 of those cases, and what he learned from his research.
He went on to say that the sequences are consistent with previous lines of evidence that SARS-CoV-2 was circulating in Wuhan prior to the outbreak in a seafood market in December 2019. They do not establish whether the virus had a natural animal origin or escaped in an accidental lab leak, nor do they identify the infection’s first victim.
He concluded that more early sequences are almost certainly available, and that scientists should focus their efforts on locating and understanding all available data.
“Scientists must keep an eye on the origins and early spread of SARS-CoV-2. After spending the last four months thoroughly investigating this, I am cautiously optimistic that fresh relevant evidence may emerge,” he tweeted on June 22.
“We should thus avoid dogmatic discussions over the development and early spread of SARS-CoV-2 and instead focus on two problems,” he said. “(1) How can we receive further information? (2) How might we improve our analysis of the data we have now?”
Bloom’s discovery drew media attention and sparked controversy among scientists and government officials. In response to scientific criticism, he submitted a revised preprint on June 29. The preprint has not yet been peer-reviewed.
Last summer, Bloom noticed that little scientific progress had been made in determining how and when SARS-CoV-2 emerged. The two most plausible explanations were natural zoonosis, in which the virus evolved from bat coronaviruses into one capable of infecting humans, and a laboratory accident. He wanted more information on the early cases because there was scant scientific data to support or reject either hypothesis.
Coronaviruses acquire mutations to their genomic sequences during replication, and scientists can reconstruct the virus’s history by studying these sequence changes over time.
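The idea of reading history from sequence changes can be illustrated with a minimal Python sketch. It is not Bloom’s analysis: the sequences are invented toy fragments, the function name `count_differences` is hypothetical, and the sketch assumes the sequences are already aligned to the same length. It simply counts the positions at which each sample differs from a reference and orders the samples by that count.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an unknown base ("N") are skipped,
    since partial sequencing data often leaves gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != "N" and b != "N"
    )


# Toy aligned fragments, invented for illustration (not real SARS-CoV-2 data).
reference = "ACGTACGTAC"
samples = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTAT",
    "sample_3": "ACTTACGTAT",
}

# Number of differences from the reference for each sample.
distances = {name: count_differences(reference, seq) for name, seq in samples.items()}

# Samples ordered from fewest to most differences from the reference.
ordered = sorted(distances, key=distances.get)
```

Real analyses compare full genomes and build phylogenetic trees rather than ranking by a single count, but the underlying signal is the same: sequences separated by fewer mutations are more closely related.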
Bloom looked at records of the virus’s genetic sequences from early infected people to identify patterns in how it evolved. He had little to work with at first.
Then he came upon a paper that mentioned a sequence dataset he had never heard of before. When he looked for these sequences in the most likely internet data archive, he came up empty-handed.
He was aware that researchers could ask for the removal of sequences they had contributed to the archive. Recognizing that the data might still be stored online, he derived the relevant URLs and located files connected with the sequences that remained on Google Cloud.
“I was able to confirm that the deleted data corresponded to a study in which 45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 were partially sequenced early in the epidemic,” he tweeted.
Using additional clues, he eventually uncovered 241 data files that had been submitted to the database and subsequently withdrawn. When those data were pooled, they revealed parts of 34 previously unknown SARS-CoV-2 samples. However, each file contained only a subset of the sequencing information for its sample.
Bloom eventually gathered enough data to look into the partial sequences of 13 early SARS-CoV-2 cases.
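The pooling step described above can be sketched in Python under loudly stated assumptions. This is not Bloom’s pipeline: the records, the genome length, the threshold `MIN_FRACTION`, and the function names are all hypothetical. Each record states which stretch of the genome one file covers for one sample. The sketch merges those stretches per sample and reports what fraction of the genome they cover, which is one simple way to decide which samples have enough data for a partial sequence.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_by_sample(file_records, genome_length):
    """Pool per-file covered regions by sample; return covered fraction per sample."""
    pooled = defaultdict(list)
    for sample, start, end in file_records:
        pooled[sample].append((start, end))
    return {
        sample: sum(end - start for start, end in merge_intervals(regions)) / genome_length
        for sample, regions in pooled.items()
    }


# Hypothetical records: (sample name, start, end) of the region one file covers.
GENOME_LENGTH = 100  # toy value, far shorter than a real genome
file_records = [
    ("A", 0, 40), ("A", 30, 70), ("A", 80, 100),
    ("B", 10, 20), ("B", 15, 25),
]

coverage = coverage_by_sample(file_records, GENOME_LENGTH)

MIN_FRACTION = 0.5  # arbitrary illustrative threshold, not from the preprint
usable = [sample for sample, fraction in coverage.items() if fraction >= MIN_FRACTION]
```

In this toy data, sample A is covered well enough to pass the threshold and sample B is not, mirroring how only some of the recovered samples yielded usable partial sequences.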
What the sequences indicate about the Wuhan epidemic’s early beginnings
The 13 reconstructed partial genomes add only modestly to our understanding of the early stages of the Wuhan outbreak, and there is little information on sample collection. Nonetheless, they provide information that brings us closer to identifying the first spillover event.
To begin, the sequences support previous findings that the virus did not first pass from animals to humans at the Wuhan seafood market.
“The early viral sequences from Wuhan originate from individuals affiliated with the city’s Huanan Seafood Market in December 2019, which was previously thought to be the site of the coronavirus’s first human infection,” according to Nature News. “The seafood-market sequences, on the other hand, are more distantly related to SARS-CoV-2’s closest cousins in bats, the virus’s most likely ultimate source, than later variants, including one from the United States.”
Bloom’s discovery gives “evidence of what many of us thought: that the virus was circulating prior to the market outbreak,” Dr. W. Ian Lipkin, an epidemiologist at Columbia University, wrote in an email to the Washington Post. “The retraction of sequence data is unprecedented and must be addressed.”
“This path of inquiry may help us establish the virus’s genesis and replicate how it propagated during the pandemic’s early days,” Lipkin says.
“Wuhan market appears to be one of the first super-spreading instances,” says Dr. Sudhir Kumar, an evolutionary geneticist at Temple University.
The sequences, according to Kumar, “suggest that SARS-CoV-2 developed a significant degree of variation during the early stages of the pandemic in China – even in Wuhan.”
Before making conclusions about the virus’s origins, scientists need to uncover more pieces of the early outbreak jigsaw.
Explanations for deletion
The National Institutes of Health, which runs the repository that formerly housed the sequence data, told the media in a statement that the sequences were removed at the request of the scientist who submitted them.
In a statement obtained by USA Today, the NIH said that “the requestor indicated that the sequence information had been modified, that it was being submitted to another database, and that the data should be erased from SRA (the Sequence Read Archive) to minimise version control concerns.” “Submitting investigators retain ownership of their data and have the option to request that it be deleted.”
These reasons were provided by NIH in an email to Bloom, which he included in his revised preprint. Bloom, on the other hand, remarked that he couldn’t find any proof that the sequences had been uploaded to another database, as the authors claimed.
Stephen Goldstein, a virologist at the University of Utah in Salt Lake City, observes that the sequences recovered by Bloom were not hidden: they are described in detail in the paper published in the journal Small, along with enough sequence information to determine their evolutionary relationship to other early SARS-CoV-2 sequences. “I don’t think this preprint provides much new information, but it does bring to light sequence data that has been publicly available but has gone overlooked,” Goldstein says.
*Important Notice
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Reference
Bloom, J. D. (2021). Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic. bioRxiv. doi: https://doi.org/10.1101/2021.06.18.449051