# Libraries
## PyTorch Lightning
This library was developed to be used with PyTorch Lightning first and foremost. Lightning helps you write clean and reproducible deep learning code that runs on most common training hardware. Datasets are represented by LightningDataModules, which give access to the data loaders of each data split. The RUL Datasets library implements several data modules that are 100% compatible with Lightning:
```python
import pytorch_lightning as pl

import rul_datasets
import rul_estimator  # (1)!

cmapss_fd1 = rul_datasets.CmapssReader(fd=1)
dm = rul_datasets.RulDataModule(cmapss_fd1, batch_size=32)

my_rul_estimator = rul_estimator.MyRulEstimator()  # (2)!

trainer = pl.Trainer(max_epochs=100)
trainer.fit(my_rul_estimator, dm)  # (3)!
trainer.test(my_rul_estimator, dm)
```
1. This is a hypothetical module containing your model.
2. This should be a subclass of LightningModule (a minimal sketch follows below).
3. The trainer calls the data module's `prepare_data` and `setup` functions automatically.
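
The library does not ship any models, so `MyRulEstimator` is yours to define. The following is a purely illustrative sketch of what such a LightningModule could look like; the network architecture and the RMSE objective are assumptions, not part of the library:

```python
import pytorch_lightning as pl
import torch


class MyRulEstimator(pl.LightningModule):
    """Minimal sketch of a hypothetical RUL estimator."""

    def __init__(self):
        super().__init__()
        # purely illustrative network; LazyLinear infers the flattened window size
        self.net = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.LazyLinear(64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, features):
        return self.net(features).squeeze(-1)

    def training_step(self, batch, batch_idx):
        features, targets = batch
        return self._rmse(self(features), targets)

    def validation_step(self, batch, batch_idx):
        features, targets = batch
        self.log("val_loss", self._rmse(self(features), targets))

    def test_step(self, batch, batch_idx):
        features, targets = batch
        self.log("test_loss", self._rmse(self(features), targets))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def _rmse(self, predictions, targets):
        return torch.sqrt(torch.mean((targets - predictions) ** 2))
```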
The RUL Datasets library loads all data into memory at once and uses the main process for creating batches, i.e., `num_workers=0` for all dataloaders. Unnecessary copies are avoided by using shared memory for both NumPy and PyTorch. This means that modifying a batch directly, e.g., `features += 1`, should be avoided, as illustrated below. When data is held in memory, multiple data loading processes are unnecessary and may even slow down training. The warning produced by PyTorch Lightning that `num_workers` is too low is therefore suppressed.
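
In practice this means preferring out-of-place operations on batches. A minimal sketch, assuming you want to shift the features by one (the transformation itself is arbitrary):

```python
for features, targets in dm.train_dataloader():
    # BAD: an in-place op would write through to the dataset in shared memory
    # features += 1

    # GOOD: an out-of-place op creates a new tensor and leaves the dataset intact
    features = features + 1
```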
## PyTorch
If you do not want to work with PyTorch Lightning, you can still use the RUL Datasets library in plain PyTorch. The data loaders provided by the data modules can be used as-is:
```python
import torch

import rul_datasets
import rul_estimator

cmapss_fd1 = rul_datasets.CmapssReader(fd=1)
dm = rul_datasets.RulDataModule(cmapss_fd1, batch_size=32)
dm.prepare_data()  # (1)!
dm.setup()  # (2)!

my_rul_estimator = rul_estimator.MyRulEstimator()  # (3)!
optim = torch.optim.Adam(my_rul_estimator.parameters())

best_val_loss = torch.inf

for epoch in range(100):
    print(f"Train epoch {epoch}")
    my_rul_estimator.train()
    for features, targets in dm.train_dataloader():
        optim.zero_grad()
        predictions = my_rul_estimator(features)
        loss = torch.sqrt(torch.mean((targets - predictions) ** 2))  # (4)!
        loss.backward()
        optim.step()
        print(f"Training loss: {loss}")

    print(f"Validate epoch {epoch}")
    my_rul_estimator.eval()
    val_loss = 0
    num_samples = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for features, targets in dm.val_dataloader():
            predictions = my_rul_estimator(features)
            val_loss += torch.sum((targets - predictions) ** 2)
            num_samples += predictions.shape[0]
    val_loss = torch.sqrt(val_loss / num_samples)  # (5)!

    if val_loss > best_val_loss:
        break  # stop training when the validation loss no longer improves
    else:
        best_val_loss = val_loss
        print(f"Validation loss: {best_val_loss}")

test_loss = 0
num_samples = 0
with torch.no_grad():
    for features, targets in dm.test_dataloader():
        predictions = my_rul_estimator(features)
        test_loss += torch.sum((targets - predictions) ** 2)
        num_samples += predictions.shape[0]
test_loss = torch.sqrt(test_loss / num_samples)  # (6)!

print(f"Test loss: {test_loss}")
```
1. You need to call `prepare_data` before using the reader. This downloads and pre-processes the dataset if that has not been done already.
2. You need to call `setup` to load all splits into memory before using them.
3. This should be a subclass of `torch.nn.Module`.
4. Calculates the RMSE loss.
5. Calculate the mean and the square root only after all squared errors are summed up. This ensures a correct validation loss.
6. Calculate the mean and the square root only after all squared errors are summed up. This ensures a correct test loss.
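
The reason for annotations 5 and 6: averaging per-batch RMSE values is not the same as computing the RMSE over all samples when batch sizes differ. A small illustration with made-up prediction errors:

```python
import torch

# made-up prediction errors: a batch of two samples and a batch of one sample
batch_errors = [torch.tensor([0.0, 0.0]), torch.tensor([3.0])]

# naive: compute one RMSE per batch and average the per-batch results
naive = torch.mean(torch.stack([torch.sqrt(torch.mean(e**2)) for e in batch_errors]))

# correct: sum the squared errors first, then take mean and square root
correct = torch.sqrt(sum(torch.sum(e**2) for e in batch_errors) / 3)

print(naive)  # tensor(1.5000)
print(correct)  # tensor(1.7321)
```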
## Others
All datasets in this library can be used with any other library as well. For this, you need to create a reader for your desired dataset and call its `load_split` function. Here is an example using tslearn:
```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

import rul_datasets

cmapss_fd1 = rul_datasets.CmapssReader(fd=1)
cmapss_fd1.prepare_data()  # (1)!

dev_features, _ = cmapss_fd1.load_split("dev")  # (2)!
dev_data = np.concatenate(dev_features)  # (3)!

km = TimeSeriesKMeans(n_clusters=5, metric="dtw")
km.fit(dev_data)
```
1. You need to call `prepare_data` before using the reader. This downloads and pre-processes the dataset if that has not been done already.
2. This yields a list of NumPy arrays, each with the shape `[len_time_series, window_size, num_features]`.
3. Concatenate to a single NumPy array with the shape `[num_windows, window_size, num_features]`.
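
The same pattern works for libraries that expect flat feature vectors, e.g., scikit-learn. A sketch under the assumption that the second return value of `load_split`, ignored in the tslearn example above, holds one RUL target per window; the choice of regressor is arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

import rul_datasets

cmapss_fd1 = rul_datasets.CmapssReader(fd=1)
cmapss_fd1.prepare_data()

dev_features, dev_targets = cmapss_fd1.load_split("dev")
features = np.concatenate(dev_features)  # [num_windows, window_size, num_features]
targets = np.concatenate(dev_targets)  # assumed: one RUL value per window

# scikit-learn expects 2D inputs, so flatten each window into a single row
features = features.reshape(len(features), -1)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(features, targets)
```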