In March 2020, when the WHO declared a pandemic, the publicly available GISAID sequence database contained 524 covid sequences. Over the next month, scientists uploaded another 6,000. By the end of May, the total exceeded 35,000. (In contrast, scientists worldwide added 40,000 influenza sequences to GISAID in all of 2019.)
“Without a name, forget it: we can’t figure out what others are saying,” says Anderson Brito, a postdoc in genomic epidemiology at the Yale School of Public Health who contributes to Pango’s work.
As the number of covid sequences spiraled, researchers trying to study them were forced to create entirely new infrastructure and standards on the fly. A universal naming system was one of the most important elements of these efforts: without it, scientists would have a hard time talking to each other about how the descendants of the virus travel and change, whether to flag a question or, more importantly, to sound the alarm. …
Where did Pango come from?
In April 2020, several prominent virologists from the UK and Australia proposed a system of letters and numbers to denote lineages, or new branches of the covid family tree. It had logic and hierarchy, although the names it generated, like B.1.1.7, were a little boring.
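The hierarchy built into names like B.1.1.7 can be illustrated with a short sketch. It assumes only what the name format suggests: that segments are separated by dots and that dropping the last segment gives the parent branch. The function names are hypothetical and are not taken from the Pango software, and the sketch ignores details of the real system, such as shorthand prefixes for long names.

```python
from typing import List, Optional


def parent(lineage: str) -> Optional[str]:
    """Return the immediate parent of a dotted name, or None for a root such as 'B'."""
    head, sep, _ = lineage.rpartition(".")
    # If there is no dot, the name has no parent in this simplified model.
    return head if sep else None


def ancestors(lineage: str) -> List[str]:
    """List ancestors from the nearest parent up to the root."""
    chain = []
    current = parent(lineage)
    while current is not None:
        chain.append(current)
        current = parent(current)
    return chain


def is_descendant(child: str, ancestor: str) -> bool:
    """True if `ancestor` appears anywhere above `child` in the hierarchy."""
    return ancestor in ancestors(child)


if __name__ == "__main__":
    print(ancestors("B.1.1.7"))
    print(is_descendant("B.1.1.7", "B.1"))
```

Read this way, a name doubles as a record of ancestry: every prefix of B.1.1.7 is itself the name of a branch higher up the tree.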
One of the authors of the proposal was Áine O’Toole, a doctoral student at the University of Edinburgh. She soon became the first person to actually do this sorting and classification, eventually combing through hundreds of thousands of sequences by hand.
She says: “In the beginning, it was just a question of who was available to curate the sequences. It ended up being my job for a long time. I guess I never understood what scale we were going to reach.”
She quickly set about creating software to assign new genomes to the correct lineages. Shortly thereafter, a postdoctoral researcher, Emily Scher, created a machine learning algorithm that further accelerated the process.
They called the program Pangolin, a tongue-in-cheek reference to the debate over covid’s animal origin. (The entire system is now known simply as Pango.)
The naming system, as well as the software that implements it, quickly became important around the world. Although the WHO has recently started using Greek letters for variants that seem particularly noteworthy, such as delta, these nicknames are intended for the public and the media. Delta actually belongs to a growing family of variants that scientists refer to by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
“When alpha arrived in the UK, Pango made it easier for us to look for these mutations in our genomes to see if we had this lineage in our country,” says Jolly. “Since then, Pango has been used as the basis for reporting and tracking variants in India.”
Because Pango offers a rational and orderly approach to what would otherwise be chaos, it could forever change the way scientists name viral strains, allowing experts from around the world to work together using a common vocabulary. Brito says, “This will most likely be the format we will use to track any other new virus.”
Many of the foundational tools for tracing covid genomes have been developed and maintained over the past year and a half by young scientists such as O’Toole and Scher. As the need for worldwide covid collaboration skyrocketed, scientists were quick to support it with dedicated infrastructure like Pango. Much of this work has been done by tech-savvy researchers in their 20s and 30s. They relied on informal networks and open source tools, which are free to use and which anyone can volunteer to change and improve.
“The people at the forefront of technology are generally graduate students and postdocs,” says Angie Hinrichs, a bioinformatician at the University of California, Santa Cruz, who joined the project earlier this year. For example, O’Toole and Scher work in the laboratory of Andrew Rambaut, a genomic epidemiologist who posted the first publicly available covid sequences on the Internet after receiving them from Chinese scientists. “They just proved to be perfect for delivering these tools, which have become absolutely critical,” says Hinrichs.
Building quickly
It was not easy. For much of 2020, O’Toole took on most of the responsibility for defining and naming new lineages. The university was shuttered, but she and another Rambaut graduate student, Verity Hill, received permission to enter the office. The commute, a 40-minute walk to school from the apartment where she lived alone, gave her some sense of normalcy.
Every few weeks, O’Toole downloaded the entire covid repository from the GISAID database, which had grown exponentially each time. She then looked for groups of genomes with similar mutations, or for sequences that looked strange and might have been mislabeled.
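The idea of looking for groups of genomes with similar mutations can be sketched in a few lines. This is only an illustration under loud assumptions, not the method O’Toole or the Pango software used: each genome is reduced to a set of mutation labels, similarity is measured with the Jaccard index, and a simple greedy pass places each genome in the first group whose representative is similar enough. The sample IDs, mutation labels, and threshold are all made up.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of mutations two genomes have in common (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(genomes: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Greedily group genomes whose mutation sets overlap with a group's first member."""
    groups: List[Tuple[Set[str], List[str]]] = []
    for name, mutations in genomes.items():
        for representative, members in groups:
            if jaccard(mutations, representative) >= threshold:
                members.append(name)
                break
        else:
            # No existing group is similar enough, so this genome starts a new one.
            groups.append((mutations, [name]))
    return [members for _, members in groups]


if __name__ == "__main__":
    # Hypothetical toy data: sample IDs mapped to placeholder mutation labels.
    toy = {
        "s1": {"m1", "m2", "m3"},
        "s2": {"m1", "m2", "m3", "m4"},
        "s3": {"m7", "m8"},
        "s4": {"m7", "m8", "m9"},
    }
    print(group_similar(toy))
```

A genome that fits no group ends up alone, which is one crude way to surface the kind of sequence that looks strange and may deserve a second look.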
When she got particularly stuck, Hill, Rambaut, and other members of the lab stepped in to discuss the designations. But most of the work fell to her.
Deciding when the descendants of the virus deserve a new family name can be as much an art as a science. It was a painstaking process, sifting through an unheard-of number of genomes and asking again and again: is this a new covid variant or not?
“It was pretty exhausting,” she says. “But it was always very humbling. Imagine going through 20,000 sequences from 100 different locations around the world. I’ve seen sequences from places that I hadn’t even heard of.”
As time went on, O’Toole struggled to keep up with the volume of new genomes that needed to be sorted and named.
In June 2020, there were over 57,000 sequences stored in the GISAID database, and O’Toole sorted them into 39 variants. By November 2020, a month after she was due to submit her dissertation, O’Toole completed her final solo review of the data. It took her 10 days to go through all the sequences, which by that time numbered 200,000. (Although covid has overshadowed her research on other viruses, she included a chapter on Pango in her dissertation.)
Fortunately, the Pango software is designed to be collaborative, and others have stepped up. The online community that Jolly reached out to when she noticed the variant spreading throughout India grew and grew. O’Toole’s work has been much more hands-off this year. New lineages are now mostly defined when epidemiologists around the world contact O’Toole and the rest of the team via Twitter, email, or GitHub, her preferred method.
“It’s more reactive now,” says O’Toole. “If a group of researchers somewhere in the world is working on some data and they believe they have identified a new lineage, they can submit a request.”
The stream of data continues. This spring, the team ran a pangothon, a kind of hackathon, in which they sorted 800,000 sequences into approximately 1,200 lineages.
“We gave ourselves three solid days,” says O’Toole. “It took two weeks.”
Since then, the Pango team has recruited several more volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, who both initially got involved by adding their two cents on Twitter and on the GitHub page. University of Cambridge postdoc Chris Ruis turned his attention to helping O’Toole clear the backlog of outstanding GitHub requests.
O’Toole recently asked them to formally join the organization as part of the newly formed Pango Lineage Designation Committee, which discusses and decides on variant names. Another committee, which includes lab head Rambaut, makes decisions at a higher level.
“We have a website and an email, and it’s not just my email,” says O’Toole. “It has become a lot more formalized, and I think it will really help it scale.”
Future
As the data grew, cracks began to appear at the edges. To date, there are nearly 2.5 million covid sequences in GISAID, which the Pango team has split into 1,300 branches. Each branch corresponds to a variant. According to the WHO, eight of them are worth watching.
With so much to process, the software has started to break. Sequences are being mislabeled. Many strains look similar to each other because the virus keeps evolving the same advantageous mutations over and over again.
As a stopgap, the team created new software that uses a different sorting method and can catch what Pango might miss.