I trained a machine learning model that found a diabetes drug that can be used for HIV

Dec 19, 2021

It might not come as a surprise to you that I’ve once again decided to examine the 30-year-old problem of HIV from another angle. Previously, I had examined preventative and curative therapeutics for HIV using emerging technologies such as gene editing. In this 3rd installment of my HIV series, I’m approaching HIV from a treatment lens.

You can also view the accompanying video to this article!

HIV is still a large problem in African countries

According to UNAIDS, 37.7 million people in the world were living with HIV in 2020. Each year, 2 million people are newly infected by the virus. In 2020, 68% of people living with HIV were located in Africa and ⅔ of the new infections were in Africa. Of the HIV patients living in Africa, 16.3 million (64% of the total) had access to HIV treatment called antiretroviral therapy (ART) in 2018.

It’s important to note that ART reduces viral load but does NOT cure the body of HIV; it is a treatment for HIV, NOT a cure. However, ART is still important because it prevents the progression of an HIV infection to AIDS and other health complications.

Great strides have been made to reduce the price of ART for Africa

For decades, NGOs and non-profit organizations like UNAIDS and the Bill & Melinda Gates Foundation have worked to reduce the price of ART drugs for African patients. As a result of these efforts, since 2018 African countries have been able to provide a year’s supply of Dolutegravir combination ART drugs per patient for $75. Dolutegravir combination ART is a type of ART used for patients who have developed resistance to the general ART drugs.

Even with this low cost of ART drugs, 36% of HIV patients in Africa still lack access to HIV treatment. During the COVID pandemic, the availability of ART drugs in South Africa was also impacted due to delays in importation. As a result of this crisis, some people in South Africa didn’t have access to ART which is alarming because if an HIV patient stops taking ART drugs for some time, the viral load will start to increase once more.

African countries are reliant on the importation of ART drugs

In 2015, 98% of ART supplies were imported to African countries. The inexpensive Dolutegravir combination ART drugs are being produced in India and imported to African countries. Although local production is being developed in countries like South Africa, Uganda, and Zimbabwe, their production is not enough to supply all the HIV patients in Africa. One solution to increase accessibility and availability of HIV drugs in Africa would be to produce locally instead of relying on importation from foreign countries.

Current ART drugs are too expensive to produce in bulk in Africa

While the immediate solution to overcome the limits imposed by importation is to produce ART drugs locally, current ART drugs are too expensive to be produced in bulk in African countries. This is because the Active Pharmaceutical Ingredients (APIs) for ART drugs cost $120–800 USD/kg. By comparison, the APIs of glyburide, a diabetes drug, cost only $43.1–115.2 USD/kg.

Given that African countries rely on importation because the cost of ART APIs is too high, I formed a hypothesis around repurposing drugs with less expensive APIs to treat HIV.

My hypothesis was to repurpose drugs that have less costly APIs to treat HIV.

I used machine learning to speed up the drug discovery process

To test my hypothesis, I trained a machine learning model to predict how active a drug with less expensive APIs would be against two HIV targets. Machine learning (ML) is a branch of Artificial Intelligence (AI) where a computational model learns to carry out a task or a function. ML is used in drug discovery and development to reduce the time, energy and cost put into the process. You can read more about “The power of AI and computation in drug discovery” in my last article. To explain how useful ML is in drug discovery, imagine you are looking for a piece of jewellery in the sand. You can search either by eye or with a metal detector. Which one would take you less time? Similarly, machine learning is a tool that facilitates the task of finding new drugs.

How I trained a machine learning model

There are 3 ways for an ML model to learn to carry out a task:

1. Supervised learning: data is provided to the model with labels. The model learns to associate data with labels. When test data is fed to the model, the model will give the test data a label.

2. Unsupervised learning: data is provided to the model without labels. The model learns patterns in the data on its own and creates groups based on those patterns. When test data is fed to the model, the model will place the test data in groups based on similar patterns to the training data.

3. Reinforcement learning: the model (referred to as the agent) learns to carry out a function by being “rewarded” for taking a correct action, so that over repeated actions it learns which actions lead to the highest reward.

For my project, I used supervised learning to train my ML model.

There are two subtypes of supervised learning:

Regression: a model is trained to predict a numerical value for a test sample (ex: 0.001).
Classification: a model is trained to predict the label for a test sample (ex: apples).

I decided to use a regression model because I wanted to predict how active a small molecule compound would be against an HIV target. The “activeness” was measured by a numerical value called the half-maximal inhibitory concentration (IC50), which is an indicator of the potency of a drug against a target. The IC50 value indicates how much of a drug is required to inhibit a target by 50%.

Using a regression model

A regression model predicts the value of a dependent variable (Y) from the independent variable (X). In the simplest case, linear regression, the relationship between the X and the Y value is modelled by a linear function Y = mX + b.
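
To make the linear form Y = mX + b concrete, here is a minimal sketch of an ordinary least-squares fit with a single input, where X is the input and Y is the value being predicted. The toy numbers and the function name `fit_line` are assumptions made purely for illustration; they are not data or code from this project.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Toy data (made up for illustration) lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(xs, ys)
predicted = m * 4.0 + b  # prediction for an unseen input x = 4
print(f"m = {m:.2f}, b = {b:.2f}, prediction at x = 4: {predicted:.2f}")
```

A molecular fingerprint has many input values rather than one, so real models fit many parameters at once, but the idea of learning a mapping from X to Y is the same.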

For my model, the X value was the molecular structure of small molecule compounds represented by PubChem fingerprints, a binary representation of molecular structure that is interpretable by the ML model. The Y value was the pIC50 value, the negative logarithm (base 10) of the IC50, which was used instead because IC50 values can span many orders of magnitude and are hard to interpret.
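
As a small worked example of the IC50 to pIC50 conversion, the sketch below computes pIC50 = -log10(IC50 in molar). It assumes the IC50 is given in nanomolar (nM), which is a common unit for this kind of data but is an assumption here; the function name `pic50_from_ic50_nm` is also just illustrative.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar (assumed unit) to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# A lower IC50 (a more potent compound) gives a higher pIC50.
for ic50_nm in (10.0, 1_000.0, 10_000.0):
    print(f"IC50 = {ic50_nm:>8.0f} nM -> pIC50 = {pic50_from_ic50_nm(ic50_nm):.2f}")
```

On this scale, a tenfold drop in IC50 raises the pIC50 by exactly 1, which is why it is easier to read and compare than the raw concentrations.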

I obtained my training data, which contained the aforementioned X and Y values, from the ChEMBL database, and I used the ML workflow developed by Chanin Nantasenamat.

Choosing 2 HIV targets

I chose 2 HIV targets to work with: the HIV reverse transcriptase (RT) and the C-C chemokine receptor type 5 (CCR5) which is found on human white blood cells.

1. HIV reverse transcriptase (RT): an enzyme that is a very common inhibitory drug target on the HIV virus because it is crucial to the replication of HIV in the infected person. Truvada and Epivir are examples of HIV drugs that target reverse transcriptase. The reverse transcriptase inhibitors are used by people who are already infected with HIV to prevent the progression of the HIV infection into AIDS; reverse transcriptase inhibitors do not cure someone of HIV.

2. C-C chemokine receptor type 5 (CCR5): a protein on the surface of white blood cells, which are involved in the immune system. The HIV virus uses CCR5 receptors to enter its target host cell. Inhibiting the CCR5 receptor on the surface of T cells prevents the HIV virus from infecting the T cell. CCR5 inhibitors are a new class of HIV drugs that are used by people who are already infected with HIV to prevent the virus from entering more T cells; they do not cure someone of HIV.

My workflow

From uploading data from the ChEMBL database onto my Google Colab workspace to obtaining my results, I followed a workflow that is common to many data science projects.

1. Pre-processing training data: I represented the molecular structure of small molecule compounds that have already been tested against the HIV targets with PubChem fingerprints and assigned them to the X value. I assigned the IC50 value of the tested small molecule compounds against the HIV targets, expressed as pIC50, to the Y value.
2. Pre-processing test data: Using PubChem fingerprints, I represented the molecular structure of small molecule compounds with less costly APIs and assigned them to the X value. The small molecule compounds that were tested were ibuprofen, glyburide, metformin and aspirin. There are no Y values for the test data because the regression model will be predicting them.
3. Choosing a regression model: I used LazyPredict, a Python library that tests a lot of regression models with the training data, to identify a regression model with the lowest RMSE and lowest training time. The LightGBM regressor (LGBMRegressor) was the model that best fit these criteria.
4. Training the regression model: the LightGBM regressor was trained on the training data.
5. Testing the regression model: The test data was fed to the LightGBM regressor. The model output a pIC50 value for each of the 4 tested small molecule compounds.
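
To illustrate the model selection criterion in the workflow above, here is a minimal sketch of how RMSE (root-mean-square error) is computed and how the candidate with the lowest RMSE is picked. It does not call LazyPredict; the model names and all numbers are hypothetical and only stand in for the comparison that such a library automates.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up pIC50 values for held-out compounds (not data from this project).
y_true = [5.0, 6.0, 7.0, 8.0]

# Made-up predictions from two hypothetical candidate models.
candidate_predictions = {
    "model_a": [5.5, 6.5, 6.5, 7.5],
    "model_b": [5.1, 6.1, 6.9, 7.9],
}

# Score every candidate and keep the one with the lowest error.
scores = {name: rmse(y_true, preds) for name, preds in candidate_predictions.items()}
best_model = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
print("Lowest RMSE:", best_model)
```

A lower RMSE means the predicted pIC50 values sit closer to the experimental ones on average, which is why it is a reasonable first filter before also considering training time.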

Glyburide is active for HIV reverse transcriptase and CCR5

Following Chanin Nantasenamat’s workflow, I set the cutoff for an active drug as pIC50 > 6 and for an inactive drug as pIC50 < 5. Any drug with 5≤pIC50≤6 was considered intermediate.

HIV reverse transcriptase

The graph to the left shows the results obtained for the training data of HIV RT. The experimental pIC50 value is compared to the pIC50 value that the regressor predicted. Because drugs considered intermediate were removed from the training data, there is a gap in the graph across the intermediate pIC50 range.

For the test data, the following pIC50 values were obtained:

aspirin: 4.35426596 = inactive
metformin: 5.27117421 = intermediate
glyburide: 6.1356888 = active
ibuprofen: 5.02812142 = intermediate

→ Thus, glyburide is the only tested drug with less costly APIs that is predicted to be active against the HIV RT target.
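
The labels in the list above follow directly from the cutoffs stated earlier (active if pIC50 > 6, inactive if pIC50 < 5, intermediate otherwise). As a minimal sketch, the mapping can be written as a small Python function; the name `activity_class` is an illustrative assumption, and the values are the HIV RT predictions listed above.

```python
def activity_class(pic50):
    """Label a predicted pIC50 using the cutoffs stated in this article."""
    if pic50 > 6:
        return "active"
    if pic50 < 5:
        return "inactive"
    return "intermediate"  # 5 <= pIC50 <= 6


# Predicted pIC50 values for the HIV RT target, copied from the list above.
hiv_rt_predictions = {
    "aspirin": 4.35426596,
    "metformin": 5.27117421,
    "glyburide": 6.1356888,
    "ibuprofen": 5.02812142,
}

labels = {drug: activity_class(value) for drug, value in hiv_rt_predictions.items()}
for drug, label in labels.items():
    print(f"{drug}: {hiv_rt_predictions[drug]:.2f} = {label}")
```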

CCR5

For the test data, the following pIC50 values were obtained:

aspirin: 5.2780148 = intermediate
metformin: 5.77697543 = intermediate
glyburide: 7.63694304 = active
ibuprofen: 5.668248 = intermediate

→ Thus, glyburide is the only tested drug with less costly APIs that is predicted to be active against the CCR5 target of HIV.

What is glyburide?

It is very surprising that glyburide is active for two HIV targets because glyburide is a drug used to treat Diabetes Mellitus Type 2. Glyburide stimulates the pancreatic beta cells to release insulin.

I was not able to find any literature that discussed using glyburide to treat HIV. However, I found that glyburide is a diabetes drug that complements current ART drugs for HIV and diabetes comorbidities.

How can my results be confirmed?

The results I have obtained would have to be confirmed for any sort of clinical applications by doing the following:

More test data: given that I was working on building a prototype, I only tested my model on 4 drugs with less costly APIs. More drugs should be tested against more HIV targets to find the one with the highest pIC50 value.
More models: I only used ML and only one regression model. More regression models and other types of AI frameworks such as neural networks can be used to validate the results I have obtained.
Testing: as with any drug that is being developed, a repurposed drug would also have to go through preclinical and clinical testing to ensure its safety before it is sold commercially.

The benefits of scaling up local production of HIV drugs in Africa

As outlined by this article, scaling up the local production of HIV drugs has many benefits for African countries:

Jobs are created which reduces the unemployment rate in African countries.
A case study in Brazil has shown that Brazil saved up to $200 million in 3 years by scaling up its local production of HIV drugs. Reducing importation of HIV drugs can also be beneficial for African countries economically.
Since most people infected with HIV in Africa are aged 15 to 24, the spread of HIV reduces the productivity of the working population. More access and availability of HIV drugs means a more productive working population for African countries.
As more people gain access to HIV treatment, we make faster progress towards reaching the UNAIDS 90–90–90 goal, which is for 90% of people with HIV to be diagnosed, 90% of those diagnosed to have access to sustained treatment, and 90% of those receiving treatment to have viral suppression.

Scaling up the local production of HIV drugs in Africa will not only increase sustained accessibility and availability of HIV drugs for patients, but it will also help African countries build a more sustainable way of meeting their pharmaceutical demands.

Thanks for reading, want to connect?

Follow me on LinkedIn (I post my progress on here + you can contact me directly): Diba Dindoust

Follow my newsletter: Diba Dindoust

Follow my YouTube: Diba Dindoust