hiitsmeme committed
Commit bccf506 · 1 Parent(s): d36f70f

README changed

Files changed (1):
  1. README.md +123 -6

README.md CHANGED
@@ -1,12 +1,129 @@
  ---
- title: Tox21 Grover Classifier
- emoji: 👀
- colorFrom: gray
- colorTo: gray
  sdk: docker
  pinned: false
  license: cc-by-nc-4.0
- short_description: Grover Classifier for Tox21
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Tox21 GROVER Classifier
+ emoji: 🤖
+ colorFrom: green
+ colorTo: blue
  sdk: docker
  pinned: false
  license: cc-by-nc-4.0
+ short_description: GROVER Classifier for Tox21
  ---

+ # Tox21 GROVER Classifier
+
+ This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/ml-jku/tox21_leaderboard).
+
+ Here, the base version of [GROVER](https://arxiv.org/pdf/2007.02835) is finetuned on the Tox21 dataset, using the [code](https://github.com/tencent-ailab/grover) provided by the authors and the finetuning hyperparameters specified in the paper. The final model is provided for inference. The model input is a SMILES string of a small molecule, and the output is 12 numeric values, one for each of the toxic effects in the Tox21 dataset.
+
+ **Important:** For leaderboard submission, your Space needs to include training code. The file `train.py` should train the model using the config specified inside the `config/` folder and save the final model parameters into a file inside the `checkpoints/` folder. The model should be trained using the [Tox21 dataset](https://huggingface.co/datasets/ml-jku/tox21) provided on Hugging Face. The datasets can be loaded like this:
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("ml-jku/tox21", token=token)
+ train_df = ds["train"].to_pandas()
+ val_df = ds["validation"].to_pandas()
+ ```
+ Additionally, the Space needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a nested prediction dictionary, with SMILES as keys and dictionaries of target-name/prediction pairs as values. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
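A minimal sketch of that required skeleton, assuming placeholder target names and a constant stand-in for the real GROVER model call:

```python
from typing import Dict, List

# Illustrative endpoint names; the leaderboard's actual target names may differ.
TARGETS = [f"target{i}" for i in range(1, 13)]

def predict(smiles_list: List[str]) -> Dict[str, Dict[str, float]]:
    """Required shape: {smiles: {target_name: prediction}}."""
    results = {}
    for smi in smiles_list:
        # Any SMILES preprocessing and the actual model inference happen
        # here, on-the-fly; a constant stands in for the model output.
        scores = [0.0] * len(TARGETS)
        results[smi] = dict(zip(TARGETS, scores))
    return results
```

Keeping this outer shape unchanged is what makes the Space compatible with the leaderboard harness.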
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+ - `main.py` - The provided GROVER code.
+ - `evaluate.py` - Predicts outputs of a given model on a dataset and computes the AUC.
+ - `generate_features.py` - Generates the features used as model input, given a CSV containing SMILES.
+ - `hp_search.py` - Finetunes and evaluates 300 configs randomly drawn from the parameter grid specified in the paper.
+ - `prepare_data.py` - Cleans the SMILES in a given CSV and saves a mask so that uncleanable SMILES are accounted for during evaluation.
+ - `train.py` - Finetunes and saves a model using the config in the `config/` folder.
+
+ - `config/` - The config file used by `train.py`.
+ - `checkpoint/` - The saved model used in `predict.py`.
+ - `grover/` - The [GROVER](https://github.com/tencent-ailab/grover) repository, with slight changes to the file structure and import paths.
+ - `predictions/` - [GROVER](https://github.com/tencent-ailab/grover) saves prediction results to a CSV; those files are stored here.
+ - `pretrained/` - The pretrained GROVER models provided by the authors.
+ - `tox21/` - All masks, generated features, and cleaned data CSV files are saved here.
+
+ - `src/` - Core model & preprocessing logic:
+   - `preprocess.py` - SMILES preprocessing pipeline and dataset creation
+   - `commands.py` - GROVER commands
+   - `eval.py` - Computes the evaluation metric
+   - `hp_search.py` - Generates configs for the hyperparameter search
+
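`evaluate.py` and `src/eval.py` compute the AUC metric. As an illustration of what that involves (a plain-Python sketch with hypothetical helper names, not the repository's actual implementation), a masked mean per-target ROC AUC can look like:

```python
import math

def roc_auc(labels, scores):
    # Pairwise (Mann-Whitney) ROC AUC; O(n^2), fine for an illustration.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def masked_mean_auc(label_rows, score_rows, n_targets=12):
    # Tox21 labels are sparse: NaN marks a missing measurement, so each
    # target's AUC is computed only over molecules labeled for that target.
    aucs = []
    for t in range(n_targets):
        ys, ss = [], []
        for labels, scores in zip(label_rows, score_rows):
            if not math.isnan(labels[t]):
                ys.append(labels[t])
                ss.append(scores[t])
        if 0 < sum(ys) < len(ys):  # both classes must be present
            aucs.append(roc_auc(ys, ss))
    return sum(aucs) / len(aucs)
```

In practice a library routine such as scikit-learn's `roc_auc_score` would replace the hand-rolled `roc_auc`.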
+ # Quickstart with Spaces
+
+ You can easily adapt this project in your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+
+ - Click "Duplicate this Space" (top-right corner).
+
+ - Create a `.env` file according to `.example.env`.
+
+ - Modify `src/` for your preprocessing pipeline and model class.
+
+ - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ - Modify `train.py` according to your model and preprocessing pipeline.
+
+ - Modify the file inside `config/` to contain all hyperparameters that are set in `train.py`.
+
+ That's it: your model will be available as an API endpoint for the Tox21 Leaderboard.
+
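This README does not show what `.example.env` contains; since `load_dataset("ml-jku/tox21", token=token)` needs a Hugging Face access token, the `.env` presumably holds at least that token. The variable name below is an illustrative guess, not the actual contents of `.example.env`:

```bash
# Hypothetical .env sketch; consult .example.env for the real variable names.
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
```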
+ # Installation
+
+ To run the GROVER classifier, clone the repository and install the dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/ml-jku/tox21_grover_classifier
+ cd tox21_grover_classifier
+ conda env create -f environment.yaml
+ ```
+
+ # Training
+
+ To train the GROVER model from scratch, download the [Tox21](https://huggingface.co/datasets/ml-jku/tox21/tree/main) CSV files and put them into the `tox21/` folder.
+
+ Then run:
+
+ ```bash
+ python prepare_data.py
+ python generate_features.py
+ python train.py
+ ```
+
+ These commands will:
+ 1. Load and preprocess the Tox21 training dataset
+ 2. Generate and save the features used as GROVER inputs
+ 3. Finetune the GROVER base model
+ 4. Store the resulting model in the `finetune/` directory
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage inside Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
+
+ The output will be a nested dictionary in the format:
+
+ ```python
+ {
+     "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
+     "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
+     "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
+ }
+ ```
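Downstream code often wants the predictions grouped per target rather than per molecule. Pivoting the nested output dictionary (shown with made-up values and the illustrative target names from above) can be done like this:

```python
# Pivot {smiles: {target: value}} into {target: [(smiles, value), ...]}.
# The input mirrors the documented output format; the values are
# placeholders, not real model predictions.
from collections import defaultdict

results = {
    "CCO": {"target1": 0, "target2": 1},
    "c1ccccc1": {"target1": 1, "target2": 0},
}

by_target = defaultdict(list)
for smiles, preds in results.items():
    for target, value in preds.items():
        by_target[target].append((smiles, value))
```

After this, `by_target["target1"]` holds every molecule's prediction for that endpoint, in input order.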
+
+ # Notes
+
+ - Adapting `predict.py`, `train.py`, `config/`, and `checkpoints/` is required for leaderboard submission.
+
+ - Preprocessing (here inside `src/preprocess.py`) must be done inside `predict.py`, not just `train.py`.