In response to the constant emergence of new coronavirus mutations, the digital innovation house ART21 has taken on an ambitious project: developing a unique system, based on artificial intelligence and mathematical modelling, for predicting the evolution of the SARS-CoV-2 genome. The technology developed by the Lithuanians is intended to predict the course and trends of changes in the virus's genome from the data collected in databases. The 'learning' phase of the system has just ended with satisfactory results: accuracy is as high as 99%. If accuracy in the second stage of development stays at or above 80%, Lithuania will be able to take pride in having developed an advanced technology for fighting COVID-19.
‘The technology we are developing will allow us to control the dynamics of the pandemic more efficiently. Based on the results obtained, test and vaccine developers will be able to adjust their production and avoid a loss of efficacy caused by new mutations in the viral genome. This is particularly relevant to test production. For example, a person may be infected with the coronavirus and keep carrying it undetected, because tests fail to detect it and return false negative results. This technology, based on artificial intelligence and big data, therefore remains especially relevant until the virus is completely under control. It is also an excellent example of putting open data to use in developing advanced solutions, even in processes as complex as disease prevention,’ said Augustas Alešiūnas, the founder and director of the innovation company ART21.
According to the company’s representatives, there have been only a few other attempts in the world to develop a similar technology, and in this race they lag far behind the Lithuanians. The uniqueness and complexity of the prediction generator lie in several aspects.
Pharmaceutical companies currently respond to mutations by conducting research and observing changes in real time, so there is a time lag before adjustments are made to the production of tests or vaccines. The developers of the new technology, by contrast, are preparing to provide well-founded statistical predictions, which would make it possible to prepare for future mutations in advance. Another complicating factor, one that deters other developers from risking the cost of such projects, is the sheer length and volume of the data series. One line in such a series contains about 30,000 characters, and there are tens of thousands of such lines. Working with it not only requires professionals with specialised expertise but is also very time-consuming.
‘Based on the existing COVID-19 genome databases, we create a data set of mutation pairs. Genome sequences are compared and cleaned, various phylogenetic trees of the virus are constructed, and the whole COVID-19 genome and the major S (spike) protein are examined separately. The resulting mutation pairs indicate the (possible) direction of mutation from a ‘parent’ virus to its ‘child’ and allow us to train a complex neural network, the mutation generator. We are testing the resulting model on newly emerging mutations of the virus and expect to predict future mutations with sufficient reliability. The first results are promising, but there is still a long road, costly in both money and time, to fully training, testing and validating the system,’ said Dr. Valdas Rapševičius, one of the data scientists, while presenting the technology development process.
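To make the idea of a 'parent' - 'child' mutation pair more concrete, here is a minimal sketch in Python. It is not ART21's code: the function name `mutation_pairs`, the toy sequences and the rule of skipping any position that is not a plain A, C, G or T are illustrative assumptions. It also assumes the two sequences have already been aligned to the same length.

```python
from typing import List, Tuple

# (1-based position, parent base, child base); a hypothetical record layout
Mutation = Tuple[int, str, str]

VALID_BASES = set("ACGT")


def mutation_pairs(parent: str, child: str) -> List[Mutation]:
    """List point substitutions between two pre-aligned genome sequences.

    Assumption: positions holding gaps or ambiguous bases (for example
    '-' or 'N') are skipped, as a simple stand-in for data cleaning.
    """
    if len(parent) != len(child):
        raise ValueError("sequences must be aligned to the same length")

    mutations: List[Mutation] = []
    for index, (p, c) in enumerate(zip(parent.upper(), child.upper())):
        if p in VALID_BASES and c in VALID_BASES and p != c:
            mutations.append((index + 1, p, c))
    return mutations


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data
    for position, old, new in mutation_pairs("ACGTAC", "ACCTAN"):
        print(f"{old}{position}{new}")
```

In a real pipeline, which sequence counts as the parent and which as the child would come from the phylogenetic tree, and records like these would be only one possible input format for training a neural network.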