{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "## Tutorial on a human breast cancer dataset", "id": "11ff83f35e564bb5" }, { "metadata": {}, "cell_type": "markdown", "source": [ "This tutorial shows how to use **STAID** to deconvolve spatial transcriptomics (ST) data using single-cell RNA-seq (scRNA-seq) reference data.\n", "We will use a **breast cancer dataset** as an example.\n" ], "id": "ea10488e430400bb" }, { "metadata": {}, "cell_type": "markdown", "source": "### 1. Import packages", "id": "3ed553c6880d43a5" }, { "metadata": { "ExecuteTime": { "end_time": "2025-09-19T14:02:39.100993Z", "start_time": "2025-09-19T14:02:33.418757Z" } }, "cell_type": "code", "source": [ "import os\n", "import scanpy as sc\n", "from staid.utils import seed_everything\n", "from staid import run_deconvolution" ], "id": "79282d7905f00c20", "outputs": [], "execution_count": 1 }, { "metadata": {}, "cell_type": "markdown", "source": "### 2. Load data", "id": "436e5acb92a233b2" }, { "metadata": {}, "cell_type": "markdown", "source": "The human breast cancer Visium datasets are available at https://doi.org/10.5281/zenodo.4739739, and the matched human breast cancer scRNA-seq reference datasets are available through the Gene Expression Omnibus under accession number GSE176078. 
For convenience, we also provide a sorted version on Google Drive: [Download from Google Drive](https://drive.google.com/drive/folders/1-GhHslCBIYvNFb1Zs3DmLKVg9JZx1QSP?usp=sharing).", "id": "5344fd6b6e78268f" }, { "metadata": { "ExecuteTime": { "end_time": "2025-09-19T14:02:39.454047Z", "start_time": "2025-09-19T14:02:39.122655Z" } }, "cell_type": "code", "source": [ "# Set dataset paths\n", "sample = \"CID4535\"\n", "spa_data_path = \"/opt/data/private/jxliu/project/staid/data/Breast_cancer/merged_datasets\"\n", "sc_data_path = \"/opt/data/private/jxliu/project/staid/data/scRNA-seq/Breast_cancer\"\n", "\n", "# Load AnnData objects\n", "sp_adata = sc.read_h5ad(os.path.join(spa_data_path,\n", " f\"{sample}_visium_breast_cancer.h5ad\"))\n", "sc_adata = sc.read_h5ad(os.path.join(sc_data_path,\n", " f\"{sample}_scRNA_seq_with_annotations.h5ad\"))" ], "id": "f29bf7dbf849839e", "outputs": [], "execution_count": 2 }, { "metadata": {}, "cell_type": "markdown", "source": "### 3. Preprocessing", "id": "1007721028aa5420" }, { "metadata": {}, "cell_type": "markdown", "source": "The only preprocessing step is to remove genes detected in fewer than 10 spots (ST data) or cells (scRNA-seq data); the raw gene expression counts are kept without normalization.", "id": "473bfbf6f5475774" }, { "metadata": { "ExecuteTime": { "end_time": "2025-09-19T14:02:40.072303Z", "start_time": "2025-09-19T14:02:39.697574Z" } }, "cell_type": "code", "source": [ "sp_adata.var_names_make_unique()\n", "sc_adata.var_names_make_unique()\n", "\n", "sc.pp.filter_genes(sp_adata, min_cells=10)\n", "sc.pp.filter_genes(sc_adata, min_cells=10)" ], "id": "8d642b6d4c4fcdb3", "outputs": [], "execution_count": 3 }, { "metadata": {}, "cell_type": "markdown", "source": "### 4. Run deconvolution", "id": "508779ff784eb06b" }, { "metadata": {}, "cell_type": "markdown", "source": [ "We now run the iterative deconvolution process.\n", "For this demo, we set the number of iterations (`num_iter`) to 5 for faster execution." 
], "id": "7f58513d88c34c84" }, { "metadata": { "ExecuteTime": { "end_time": "2025-09-19T14:08:11.069310Z", "start_time": "2025-09-19T14:02:40.102027Z" } }, "cell_type": "code", "source": [ "# set seed\n", "seed_everything(2025)\n", "# set annotation key in sc_adata.obs.keys()\n", "anno_key = \"celltype_major\"\n", "\n", "cell_type_df = run_deconvolution(\n", " sp_adata=sp_adata,\n", " sc_adata=sc_adata,\n", " anno_key=anno_key,\n", " device=\"cuda:0\", # use GPU\n", " lr=0.0005,\n", " num_pseudo=5000,\n", " num_iter=5,\n", " batch_size=128,\n", " min_cells=1,\n", " max_cells=10,\n", " batch_correction=\"scanorama\", # platform correction\n", " hidden_dims=[512, 256, 256],\n", " library_size=1e4,\n", " dropout=0.1,\n", " c=0.1,\n", " weight=1e-5\n", ")" ], "id": "1925e72857ac6eb0", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of marker genes: 3872\n", "*************** Iteration: \t1 ***************\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Generate pseudo spots: 100%|██████████| 3/3 [00:12<00:00, 4.07s/it]\n", "AE Epoch: 100%|██████████| 100/100 [00:27<00:00, 3.61it/s, loss=4.44e-5]\n", "Epoch: 75%|███████▌ | 150/200 [00:45<00:15, 3.32it/s, loss=0.00138]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*************** Iteration:\t2 ***************\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Generate pseudo spots: 100%|██████████| 3/3 [00:13<00:00, 4.52s/it]\n", "Epoch: 13%|█▎ | 26/200 [00:05<00:38, 4.52it/s, loss=0.00132]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*************** Iteration:\t3 ***************\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Generate pseudo spots: 100%|██████████| 3/3 [00:13<00:00, 4.38s/it]\n", "Epoch: 10%|█ | 20/200 [00:04<00:38, 4.63it/s, loss=0.000965]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*************** Iteration:\t4 ***************\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "Generate pseudo spots: 100%|██████████| 3/3 [00:13<00:00, 4.37s/it]\n", "Epoch: 18%|█▊ | 35/200 [00:07<00:34, 4.73it/s, loss=0.000648]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "*************** Iteration:\t5 ***************\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Generate pseudo spots: 100%|██████████| 3/3 [00:13<00:00, 4.48s/it]\n", "Epoch: 10%|█ | 21/200 [00:04<00:35, 5.06it/s, loss=0.000888]\n" ] } ], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": "The cell type composition information can be obtained in ```cell_type_df```:", "id": "7526100887919981" }, { "metadata": { "ExecuteTime": { "end_time": "2025-09-19T14:08:11.175872Z", "start_time": "2025-09-19T14:08:11.150941Z" } }, "cell_type": "code", "source": "cell_type_df.iloc[:5, :5]", "id": "98994d3f3ccd0dbf", "outputs": [ { "data": { "text/plain": [ " B-cells CAFs Cancer Epithelial Endothelial \\\n", "AACATTGGTCAGCCGT-1 0.933518 0.006616 0.000000 0.000000 \n", "CATCGAATGGATCTCT-1 0.000000 0.000000 0.192064 0.000000 \n", "GCGTCCAGCTCGTGGC-1 0.000000 0.000000 0.719807 0.000000 \n", "CCAAAGTCCCGCTAAC-1 0.000000 0.068452 0.794619 0.005912 \n", "GTTACGGCCCGACTGC-1 0.000000 0.000000 0.555312 0.000000 \n", "\n", " Myeloid \n", "AACATTGGTCAGCCGT-1 0.000000 \n", "CATCGAATGGATCTCT-1 0.000000 \n", "GCGTCCAGCTCGTGGC-1 0.115706 \n", "CCAAAGTCCCGCTAAC-1 0.121707 \n", "GTTACGGCCCGACTGC-1 0.235157 " ], "text/html": [ "
|  | B-cells | CAFs | Cancer Epithelial | Endothelial | Myeloid |
|---|---|---|---|---|---|
| AACATTGGTCAGCCGT-1 | 0.933518 | 0.006616 | 0.000000 | 0.000000 | 0.000000 |
| CATCGAATGGATCTCT-1 | 0.000000 | 0.000000 | 0.192064 | 0.000000 | 0.000000 |
| GCGTCCAGCTCGTGGC-1 | 0.000000 | 0.000000 | 0.719807 | 0.000000 | 0.115706 |
| CCAAAGTCCCGCTAAC-1 | 0.000000 | 0.068452 | 0.794619 | 0.005912 | 0.121707 |
| GTTACGGCCCGACTGC-1 | 0.000000 | 0.000000 | 0.555312 | 0.000000 | 0.235157 |