Text Mining & NLP analysis of SARS-CoV-2

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), a 50-200 nm virion with a genome of approximately 30,000 nucleotides, brought the world's routine activities to a halt as the causative agent of Coronavirus Disease 2019 (COVID-19). Questions came flooding in as I learned more about this novel coronavirus. The researcher in me wanted to understand and analyze the structural and behavioural properties of this virus. Stuck at home practicing social distancing, I decided to mine information from the many research papers being published on this coronavirus.


After downloading multiple research papers, I started reading one on drug targets and potential drugs to treat COVID-19 [1]. Half an hour later, I was still stuck on the initial pages, thanks to my well-functioning laziness gene(s). So I decided to use the ‘Extract relationships via NLP’ option in Strand NGS to identify interactions between the genes and biomolecules (entities) mentioned in the research article.

Figure 1: Interactions identified between entities mentioned in the research paper

In less than 3 minutes, the major interactions between various compounds and processes were mapped out as a pathway. With this, I could start to understand the major processes being targeted for drug discovery to mitigate the ongoing pandemic. The targeted processes include replication, translation, entry of the virus, and the defenses of the human body against the virus. Through the first three, the virus infiltrates the human body and uses human genes and proteins to reproduce and to make the proteins it needs to propagate. Major targets involved in such processes include Eukaryotic Initiation Factors (proteins that start translation, the step in which proteins are made), cytoplasmic stress granules (clusters of RNA and protein that form in stressed cells), and histone deacetylases.

Curious how this is done? Let’s go over it in detail.

NLP (Natural Language Processing) is a set of techniques that computers use to analyze and understand human language and to apply the results to specific tasks. For example, we could analyze the emotions expressed in the text of our favorite novel and use what we learn to recognize emotions in everyday conversations. Strand NGS uses its own NLP algorithm, based on a “deep-parsing” method, to extract interactions between entities from Medline abstracts, and it maintains an Interaction database for different organisms. The same algorithm can be used to extract the interactions in an entity list or a research paper.

There are 3 workflow options to use the Interaction database and NLP networks.

Figure 2: Workflow options to use NLP networks

‘Extract Relations via NLP’ extracts interactions from a research paper using the NLP algorithm. Choose files in a valid format, and the algorithm finds the entities in each sentence and marks them.

Figure 3: Extraction of entity terms from a research paper

If two entities are present in the same sentence, the algorithm also identifies the relationship between them. A pathway containing these interactions (Figure 1) is then generated. A search term can be used instead of a file: the workflow finds PubMed articles related to the term and extracts relationships from them. This is how the drug targets and their related processes, or the drug targets and their known drug compounds, could be extracted as a pathway from the research paper.
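
To make the idea concrete, here is a minimal Python sketch of sentence-level extraction. It is not the deep-parsing algorithm Strand NGS uses, only a much simpler co-occurrence approach. The entity dictionary, the list of relation verbs, and the example sentences are all made up for illustration.

```python
import re
from itertools import combinations

# Hypothetical entity dictionary and relation verbs, for illustration only.
ENTITIES = {"EIF4A", "EIF4E", "silvestrol", "translation"}
RELATION_VERBS = {"inhibits", "binds", "regulates", "activates"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_relations(text):
    """Return (entity_a, relation, entity_b) triples for entities sharing a sentence."""
    entity_lookup = {e.lower(): e for e in ENTITIES}
    triples = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        # Collect known entities in order of first appearance.
        found = []
        for tok in tokens:
            name = entity_lookup.get(tok.lower())
            if name and name not in found:
                found.append(name)
        # Use the first relation verb in the sentence, or a generic label.
        relation = next(
            (t.lower() for t in tokens if t.lower() in RELATION_VERBS),
            "co-occurs",
        )
        # Every pair of entities in the same sentence becomes one edge.
        for a, b in combinations(found, 2):
            triples.append((a, relation, b))
    return triples


text = "Silvestrol inhibits EIF4A. EIF4E regulates translation in the cytoplasm."
for triple in extract_relations(text):
    print(triple)
```

Each triple is one edge of the pathway: two entities joined by a relationship. A real tool parses the grammar of the sentence rather than picking the first verb it recognizes, which is why this sketch would mislabel longer sentences.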

Back to my research study: the NLP-generated pathway showed that the Eukaryotic Initiation Factor is an important target. Which processes is it involved in? To find out, I used the ‘MeSH Network Builder’ workflow option.

The ‘MeSH Network Builder’ builds a pathway from a given search term. MeSH (Medical Subject Headings) is a thesaurus used for indexing PubMed articles. When a term is entered, the workflow pulls all MeSH terms relevant to the input from the MeSH database and all the matching interactions from the Interaction database. You then choose the MeSH terms you need and generate a pathway based on direct or indirect relationships and the min frequency (the minimum number of PubMed articles associated with the MeSH term).
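
The filtering step can be pictured with a small Python sketch. This is only my guess at the logic, not how Strand NGS implements it. It assumes each interaction record carries an article count that plays the role of the frequency, and the records, the counts, and the names `build_network` and `include_indirect` are invented for illustration.

```python
# Hypothetical interaction records: (source, target, supporting article count).
INTERACTIONS = [
    ("EIF4E", "EIF4G1", 12),
    ("EIF4E", "translation", 5),
    ("EIF4G1", "PABPC1", 4),
    ("EIF4E", "MTOR", 2),
]


def build_network(seeds, interactions, min_frequency=3, include_indirect=False):
    """Keep edges at or above min_frequency that touch a seed term (direct).

    With include_indirect=True, also keep edges that touch a neighbour
    of a seed term, i.e. one extra hop.
    """
    seeds = set(seeds)
    # Drop weakly supported edges first.
    strong = [edge for edge in interactions if edge[2] >= min_frequency]
    direct = [edge for edge in strong if edge[0] in seeds or edge[1] in seeds]
    if not include_indirect:
        return direct
    # Seeds plus every node one hop away from them.
    neighbours = {node for a, b, _ in direct for node in (a, b)}
    return [edge for edge in strong if edge[0] in neighbours or edge[1] in neighbours]


print(build_network({"EIF4E"}, INTERACTIONS))
print(build_network({"EIF4E"}, INTERACTIONS, include_indirect=True))
```

Raising `min_frequency` trades coverage for confidence: fewer edges survive, but each one is backed by more articles.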

I entered Eukaryotic Initiation Factor as the input and looked up the related MeSH terms. I chose the matching MeSH term, and a pathway was generated with all direct interactions at a min frequency of 3.

Figure 4: MeSH terms related to Eukaryotic Initiation Factor

Figure 5: Pathway for the MeSH term: Eukaryotic Initiation Factor

The pathway can be viewed at the cellular level, which makes it easy to see where in the cell each interaction occurs. In Figure 5, all the processes occur in the cytoplasm. The Eukaryotic Initiation Factor has many subtypes and is mostly involved in translation.

Figure 6: NLP Pathway for EIF4 genes

Different types of Eukaryotic Initiation Factor 4 (EIF4) have been identified in humans [2]. I made an entity list of all EIF4 genes and ran ‘NLP Network Discovery’. This workflow identifies pathways related to the genes from the Interaction database.

The pathway also shows the interactions between EIF4 genes and some well-known drugs used on these targets, such as silvestrol and geldanamycin. In conclusion, in less than 15 minutes I could understand that Eukaryotic Initiation Factors are involved in SARS-CoV-2-human interactions, what role EIF4 plays, and which potential drugs act on these targets. This newly gained knowledge pleased the researcher in me.



  1. Gordon, D.E., Jang, G.M., Bouhaddou, M., Xu, J., Obernier, K., White, K.M., O’Meara, M.J., Rezelj, V.V., Guo, J.Z., Swaney, D.L. and Tummino, T.A., 2020. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature, pp.1-13.
  2. Jackson, R.J., Hellen, C.U. and Pestova, T.V., 2010. The mechanism of eukaryotic translation initiation and principles of its regulation. Nature Reviews Molecular Cell Biology, 11(2), pp.113-127.