hiitsmeme
committed on
Commit
·
bccf506
1
Parent(s):
d36f70f
README changed
README.md
CHANGED
|
@@ -1,12 +1,129 @@
|
|
| 1 |
---
|
| 2 |
-
title: Tox21
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
license: cc-by-nc-4.0
|
| 9 |
-
short_description:
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
|
| 1 |
---
|
| 2 |
+
title: Tox21 GROVER Classifier
|
| 3 |
+
emoji: 🤖
|
| 4 |
+
colorFrom: green
|
| 5 |
+
colorTo: blue
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
license: cc-by-nc-4.0
|
| 9 |
+
short_description: GROVER Classifier for Tox21
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Tox21 GROVER Classifier
|
| 13 |
+
|
| 14 |
+
This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/ml-jku/tox21_leaderboard).
|
| 15 |
+
|
| 16 |
+
Here the base version of [GROVER](https://arxiv.org/pdf/2007.02835) is finetuned on the Tox21 dataset, using the [code](https://github.com/tencent-ailab/grover) provided and the finetuning hyperparameters specified in the paper. The final model is provided for
|
| 17 |
+
inference. The model input is a SMILES string of a small molecule, and the output consists of 12 numeric values, one for
|
| 18 |
+
each of the toxic effects of the Tox21 dataset.
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
**Important:** For leaderboard submission, your Space needs to include training code. The file `train.py` should train the model using the config specified inside the `config/` folder and save the final model parameters into a file inside the `checkpoints/` folder. The model should be trained using the [Tox21 dataset](https://huggingface.co/datasets/ml-jku/tox21) provided on Hugging Face. The dataset splits can be loaded like this:
|
| 22 |
+
```python
|
| 23 |
+
from datasets import load_dataset
|
| 24 |
+
ds = load_dataset("ml-jku/tox21", token=token)  # token: your Hugging Face access token, defined beforehand
|
| 25 |
+
train_df = ds["train"].to_pandas()
|
| 26 |
+
val_df = ds["validation"].to_pandas()
|
| 27 |
+
```
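The `token` variable in the snippet above is not defined there. A minimal sketch of one way to supply it, assuming the access token is stored in an environment variable named `HF_TOKEN` (the variable name and the helper `get_hf_token` are assumptions for illustration, not part of this repository):

```python
import os
from typing import Optional


def get_hf_token(var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the access token from the environment, or None if unset or empty.

    `var_name` is an assumed variable name; adjust it to match your own setup.
    """
    value = os.environ.get(var_name, "").strip()
    return value or None


# The result can be passed on as the `token` argument when loading the dataset.
token = get_hf_token()
```

The resulting `token` can then be handed to `load_dataset("ml-jku/tox21", token=token)`. If the variable is unset, the helper returns `None`, in which case authentication is left to the library's default behavior.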
|
| 28 |
+
Additionally, the Space needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it should take a list of SMILES strings as input and return a nested prediction dictionary as output, with SMILES as keys and dictionaries containing target-name/prediction pairs as values. Therefore, any preprocessing of SMILES strings must be executed on the fly during inference.
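
The input/output contract described above can be sketched as follows. This is only an illustration of the dictionary layout, not the actual `predict.py` of this Space: `dummy_model` is a hypothetical stand-in for the finetuned model, and the placeholder names `target1` to `target12` stand for the real Tox21 target names.

```python
from typing import Dict, List

# Placeholder target names (assumption); the real names are the Tox21 target columns.
TARGET_NAMES = [f"target{i}" for i in range(1, 13)]


def dummy_model(smiles: str) -> List[float]:
    """Hypothetical stand-in for a trained model: one score per target."""
    return [0.5] * len(TARGET_NAMES)


def predict(smiles_list: List[str]) -> Dict[str, Dict[str, float]]:
    """Return {smiles: {target_name: prediction}} for every input SMILES."""
    results: Dict[str, Dict[str, float]] = {}
    for smiles in smiles_list:
        # Any SMILES preprocessing would happen here, on the fly, before inference.
        scores = dummy_model(smiles)
        results[smiles] = dict(zip(TARGET_NAMES, scores))
    return results
```

Because the outer dictionary is keyed by the SMILES string, duplicate inputs collapse into a single entry in this sketch.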
|
| 29 |
+
|
| 30 |
+
# Repository Structure
|
| 31 |
+
- `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
|
| 32 |
+
- `app.py` - FastAPI application wrapper (can be used as-is).
|
| 33 |
+
- `main.py` - Provided GROVER code.
|
| 34 |
+
- `evaluate.py` - Predicts the outputs of a given model on a dataset and computes the AUC.
|
| 35 |
+
- `generate_features.py` - Generates the features used as model input, given a CSV file containing SMILES.
|
| 36 |
+
- `hp_search.py` - Finetunes and evaluates 300 configs that are randomly drawn from a parameter grid specified in the paper.
|
| 37 |
+
- `prepare_data.py` - Cleans the SMILES in a given CSV file and saves a mask so that uncleanable SMILES can be accounted for during evaluation.
|
| 38 |
+
- `train.py` - Finetunes and saves a model using the config in the `config/` folder.
|
| 39 |
+
|
| 40 |
+
- `config/` - Contains the config file used by `train.py`.
|
| 40 |
+
- `checkpoint/` - Contains the saved model that is used in `predict.py`.
|
| 41 |
+
- `grover/` - [GROVER](https://github.com/tencent-ailab/grover) repository with slight changes in file structure and import paths.
|
| 42 |
+
- `predictions/` - [GROVER](https://github.com/tencent-ailab/grover) saves prediction results as CSV files, which are stored here.
|
| 43 |
+
- `pretrained/` - Contains the provided pretrained GROVER models.
|
| 44 |
+
- `tox21/` - Contains all masks, generated features, and cleaned-data CSV files.
|
| 46 |
+
|
| 47 |
+
- `src/` - Core model & preprocessing logic:
|
| 48 |
+
  - `preprocess.py` - SMILES preprocessing pipeline and dataset creation
|
| 48 |
+
  - `commands.py` - GROVER commands
|
| 49 |
+
  - `eval.py` - Computes the evaluation metric
|
| 50 |
+
  - `hp_search.py` - Generates configs for the hyperparameter search
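
The random search mentioned for `hp_search.py` (300 configs drawn from a parameter grid) could look roughly like the sketch below. The grid shown here is hypothetical: the parameter names and values are placeholders, and the actual grid is the one specified in the GROVER paper. Sampling is done with replacement, so duplicate configs are possible.

```python
import random
from typing import Any, Dict, List

# Hypothetical grid for illustration only; not the grid used in this repository.
PARAM_GRID: Dict[str, List[Any]] = {
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.2],
    "ffn_num_layers": [1, 2, 3],
}


def sample_configs(
    grid: Dict[str, List[Any]], n_configs: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Draw `n_configs` configs by picking one value per parameter at random."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_configs)
    ]


configs = sample_configs(PARAM_GRID, n_configs=300)
```

Each sampled config would then be used for one finetuning run, and the best config would be selected by its validation metric.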
|
| 52 |
+
|
| 53 |
+
# Quickstart with Spaces
|
| 54 |
+
|
| 55 |
+
You can easily adapt this project in your own Hugging Face account:
|
| 56 |
+
|
| 57 |
+
- Open this Space on Hugging Face.
|
| 58 |
+
|
| 59 |
+
- Click "Duplicate this Space" (top-right corner).
|
| 60 |
+
|
| 61 |
+
- Create a `.env` according to `.example.env`.
|
| 62 |
+
|
| 63 |
+
- Modify `src/` for your preprocessing pipeline and model class.
|
| 64 |
+
|
| 65 |
+
- Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
|
| 66 |
+
|
| 67 |
+
- Modify `train.py` according to your model and preprocessing pipeline.
|
| 68 |
+
|
| 69 |
+
- Modify the file inside `config/` to contain all hyperparameters that are set in `train.py`.
|
| 70 |
+
That’s it, your model will be available as an API endpoint for the Tox21 Leaderboard.
|
| 71 |
+
|
| 72 |
+
# Installation
|
| 73 |
+
To run the GROVER classifier, clone the repository and install dependencies:
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
git clone https://huggingface.co/spaces/ml-jku/tox21_grover_classifier
|
| 77 |
+
cd tox21_grover_classifier
|
| 78 |
+
conda env create -f environment.yaml
|
| 79 |
+
```
|
| 80 |
+
|
| 81 |
+
# Training
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
To finetune the GROVER model yourself, download the [Tox21](https://huggingface.co/datasets/ml-jku/tox21/tree/main) CSV files and put them into the `tox21/` folder.
|
| 85 |
+
|
| 86 |
+
Then run:
|
| 87 |
+
|
| 88 |
+
```bash
|
| 89 |
+
python prepare_data.py
|
| 90 |
+
python generate_features.py
|
| 91 |
+
python train.py
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
These commands will:
|
| 95 |
+
1. Load and preprocess the Tox21 training dataset
|
| 96 |
+
2. Generate and save features used as GROVER inputs
|
| 97 |
+
3. Finetune the GROVER base model
|
| 98 |
+
4. Store the resulting model in the `finetune/` directory.
|
| 99 |
+
|
| 100 |
+
# Inference
|
| 101 |
+
|
| 102 |
+
For inference, you only need `predict.py`.
|
| 103 |
+
|
| 104 |
+
Example usage inside Python:
|
| 105 |
+
|
| 106 |
+
```python
|
| 107 |
+
from predict import predict
|
| 108 |
+
|
| 109 |
+
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
|
| 110 |
+
results = predict(smiles_list)
|
| 111 |
+
|
| 112 |
+
print(results)
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
The output will be a nested dictionary in the format:
|
| 116 |
+
|
| 117 |
+
```python
|
| 118 |
+
{
|
| 119 |
+
"CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
|
| 120 |
+
"c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
|
| 121 |
+
"CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
|
| 122 |
+
}
|
| 123 |
+
```
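
Before submitting, it can help to check that the returned dictionary really has this layout. The helper below is a minimal sketch under the assumption that every input SMILES must appear as a key with exactly 12 numeric values; `check_predictions` is a hypothetical name and not part of this repository or the leaderboard.

```python
from typing import Dict, List


def check_predictions(
    smiles_list: List[str], results: Dict[str, dict], n_targets: int = 12
) -> bool:
    """Raise ValueError if `results` does not match the nested layout; else return True."""
    missing = [s for s in smiles_list if s not in results]
    if missing:
        raise ValueError(f"missing predictions for: {missing}")
    for smiles, per_target in results.items():
        if not isinstance(per_target, dict):
            raise ValueError(f"value for {smiles!r} is not a dictionary")
        if len(per_target) != n_targets:
            raise ValueError(
                f"{smiles!r} has {len(per_target)} targets, expected {n_targets}"
            )
        for name, value in per_target.items():
            if not isinstance(value, (int, float)):
                raise ValueError(f"prediction for {smiles!r}/{name!r} is not numeric")
    return True
```

Calling `check_predictions(smiles_list, predict(smiles_list))` right after inference catches missing molecules or malformed entries early.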
|
| 124 |
+
|
| 125 |
+
# Notes
|
| 126 |
+
|
| 127 |
+
- Adapting `predict.py`, `train.py`, `config/`, and `checkpoints/` is required for leaderboard submission.
|
| 128 |
+
|
| 129 |
+
- Preprocessing (here inside `src/preprocess.py`) must be done inside `predict.py`, not just in `train.py`.
|