Train a machine learning model on a collection

Here, we iterate over the artifacts within a collection to train a machine learning model at scale.

import lamindb as ln

→ connected lamindb: testuser1/test-scrna

ln.context.uid = "Qr1kIHvK506r0000"
ln.context.track()

→ notebook imports: lamindb==0.76.1 torch==2.4.0

→ created Transform('Qr1kIHvK506r0000') & created Run('2024-08-23 18:29:40.751036+00:00')

Query our collection:

collection = ln.Collection.get(
    name="My versioned scRNA-seq collection", version="2"
)
collection.describe()

Create a map-style dataset

Let us create a map-style dataset with mapped(), which returns a MappedCollection. This is the kind of input that, for example, the PyTorch DataLoader expects.

Under the hood, it performs a virtual join of the features of the underlying AnnData objects, which makes it possible to work with very large collections.

You can either perform a virtual inner join:

with collection.mapped(obs_keys=["cell_type"], join="inner") as dataset:
    print(len(dataset.var_joint))

Or a virtual outer join:

dataset = collection.mapped(obs_keys=["cell_type"], join="outer")

len(dataset.var_joint)
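To make the difference between the two join modes concrete, here is a minimal sketch in plain Python that does not use lamindb. The feature names and the helpers inner_join and outer_join are hypothetical and only illustrate the set logic. How MappedCollection orders the joined features is not stated here, so the ordering below is an assumption.

```python
# Hypothetical feature (var) names of three AnnData-like objects.
var_names_per_artifact = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneC", "geneD"],
    ["geneC", "geneB", "geneE"],
]


def inner_join(var_lists):
    """Keep only features present in every list, in the order of the first list."""
    common = set(var_lists[0]).intersection(*map(set, var_lists[1:]))
    return [name for name in var_lists[0] if name in common]


def outer_join(var_lists):
    """Keep every feature seen in any list, in order of first appearance."""
    seen = {}
    for names in var_lists:
        for name in names:
            seen.setdefault(name, None)
    return list(seen)


inner_vars = inner_join(var_names_per_artifact)
outer_vars = outer_join(var_names_per_artifact)
print(len(inner_vars), inner_vars)
print(len(outer_vars), outer_vars)
```

An inner join can only shrink the feature set, while an outer join can only grow it, so the outer variant needs some fill value for features that an object lacks.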

This is compatible with a PyTorch DataLoader because it implements __getitem__ over a list of backed AnnData objects. For example, the cell at index 5 (the sixth cell, since indexing is zero-based) can be accessed like this:

dataset[5]
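The idea of indexing across several stored objects can be sketched with cumulative offsets. This is not the MappedCollection implementation: the class ConcatenatedDataset is hypothetical, and plain Python lists stand in for backed AnnData objects. It only shows how a global index can be resolved to one part and a local position without copying the data.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatenatedDataset:
    """Map-style view over several sequences without copying them."""

    def __init__(self, parts):
        self.parts = parts
        # offsets[i] is the global index one past the last item of parts[i].
        self.offsets = list(accumulate(len(part) for part in parts))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the part holding the global index, then the local position in it.
        part_idx = bisect_right(self.offsets, index)
        start = self.offsets[part_idx - 1] if part_idx > 0 else 0
        return self.parts[part_idx][index - start]


# Three stand-in "artifacts" holding 3, 2, and 2 cells.
toy = ConcatenatedDataset([["c0", "c1", "c2"], ["c3", "c4"], ["c5", "c6"]])
sixth_cell = toy[5]  # index 5 is the sixth item because indexing is zero-based
print(len(toy), sixth_cell)
```

Any object that provides __len__ and __getitem__ in this way can be passed to a PyTorch DataLoader as a map-style dataset.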

The labels are encoded into integers:

dataset.encoders
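As an illustration of what such an encoding looks like, the following sketch builds a label-to-integer mapping in plain Python. The helper build_encoder and the example labels are hypothetical, and assigning codes in sorted order is an assumption. The order MappedCollection uses may differ.

```python
def build_encoder(labels):
    """Map each distinct label to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Hypothetical cell type annotations for four cells.
cell_types = ["T cell", "B cell", "T cell", "monocyte"]

encoder = build_encoder(cell_types)
encoded = [encoder[label] for label in cell_types]

# Inverting the mapping recovers the label from a predicted integer class.
decoder = {code: label for label, code in encoder.items()}
decoded = [decoder[code] for code in encoded]
print(encoder, encoded)
```

Keeping the mapping around matters after training, because model outputs are integer classes that have to be translated back into label names.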

Create a PyTorch DataLoader

Let us use a weighted sampler:

from torch.utils.data import DataLoader, WeightedRandomSampler

# the key used for the weights does not have to be among the obs_keys passed to mapped()
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_type"), num_samples=len(dataset)
)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)
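To see why a weighted sampler helps with imbalanced labels, here is a minimal sketch of inverse-frequency weights in plain Python. It assumes that each sample is weighted by one over the count of its label, which is a common choice. The source does not state how get_label_weights computes its values, and the helper inverse_frequency_weights is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical imbalanced labels: three T cells and one B cell.
labels = ["T cell", "T cell", "T cell", "B cell"]
weights = inverse_frequency_weights(labels)

# Summing the weights per class shows that every class gets the same total mass.
class_mass = {}
for label, weight in zip(labels, weights):
    class_mass[label] = class_mass.get(label, 0.0) + weight
print(weights, class_mass)
```

With weights like these, a sampler that draws with replacement picks each class about equally often, regardless of how many cells it contains.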

We can now iterate through the data loader:

for batch in dataloader:
    pass

Close the connections in MappedCollection:

dataset.close()