Converting H5 files for analysis in the open source scRNA-seq data visualization tool Cellenics®


Introduction

The H5 format (short for HDF5, Hierarchical Data Format version 5) is an increasingly common format for storing single-cell RNA sequencing (scRNA-seq) data. Cellranger, for example, produces H5 output by default. One of its advantages is the ability to store both the count matrices and all metadata in a single file, rather than in separate features/barcodes/matrix files.

In this article, we’ll show how to convert H5 files to the features/barcodes/matrix format to be able to upload and analyze your data using the Biomage-hosted community instance of Cellenics® while we work on adding native support for H5 files.

The H5 file format is a container that can have many different things inside. So, your H5 file may be different from the file generated directly by Cellranger. Because of this, the article is divided into two parts:

  • The first section will show how to process standard H5 files, the direct output from cellranger count.
  • The second section will show how to take an arbitrary scRNA-seq H5 file, manually inspect its contents, pick what is necessary and convert them to a Cellenics®-supported format. Be mindful that this section is a bit more involved and might require a bit of manual code editing.

Processing standard H5 files - cellranger output

Standard Cellranger H5 files can be read using functionality already implemented in the Seurat package, which should work out of the box for Cellranger output. If this fails, refer to the “Processing non-standard H5 files” section of this document.

Libraries

We need to have Seurat, DropletUtils, and hdf5r installed. Seurat and hdf5r can be installed from CRAN, and DropletUtils is available on Bioconductor. To install them, you can use the install.packages() function for Seurat and hdf5r, and refer to Bioconductor for instructions on installing DropletUtils.

library(Seurat)
library(DropletUtils)
library(hdf5r)
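If any of these packages is missing, the installation step described above can be sketched as follows. The helper names missing_pkgs and install_if_missing are illustrative and not part of any package; Bioconductor packages are assumed to be installed through BiocManager, which is the usual route.

```r
# Return the subset of `pkgs` that is not installed yet.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Install whatever is missing: CRAN packages with install.packages(),
# Bioconductor packages through BiocManager.
install_if_missing <- function(cran = c("Seurat", "hdf5r"),
                               bioc = "DropletUtils") {
  cran_missing <- missing_pkgs(cran)
  if (length(cran_missing) > 0) install.packages(cran_missing)

  bioc_missing <- missing_pkgs(bioc)
  if (length(bioc_missing) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(bioc_missing)
  }

  # report what had to be installed
  invisible(c(cran_missing, bioc_missing))
}
```

Calling install_if_missing() once before the library() calls should be enough; nothing is installed when all packages are already available.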

Processing

Set the data_dir to the folder that contains the h5 files. After that, we create a list of all H5 files in the directory, which will be converted.

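data_dir <- "./"
setwd(data_dir)
# list.files() takes a regular expression: match names ending in ".h5".
# After setwd(), the files are looked up in the current directory.
h5_files <- list.files(pattern = "\\.h5$")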

Create an output directory to store the converted files.

output_dir <- "out"
dir.create(output_dir)

Convert the H5 files. The sample_name will be the folder name for each sample; feel free to edit as desired.

for (file in h5_files) {

  # make sample names, removing .h5
  sample_name <- sub("\\.h5$", "", basename(file))
  sample_path <- file.path(output_dir, sample_name)

  # to show progress
  print(sample_name)

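  # load the count matrix twice: once with gene symbols as row names,
  # once with gene IDs as row names (use.names = FALSE)
  gene_names <- rownames(Seurat::Read10X_h5(file))
  counts <- Seurat::Read10X_h5(file, use.names = FALSE)

  # convert to the features/barcodes/matrix format
  DropletUtils::write10xCounts(sample_path, counts, version = "3", gene.symbol = gene_names)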
}
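After the loop finishes, it may be worth checking that each sample folder contains the three files of the features/barcodes/matrix format. The sketch below assumes the gzipped file names used for version = "3" output; missing_10x_files is a hypothetical helper, and the demonstration runs on a temporary toy folder rather than on real output.

```r
# File names expected for version = "3" output (an assumption of this sketch).
expected_files <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")

# Hypothetical helper: which expected files are missing from a sample folder?
missing_10x_files <- function(sample_path) {
  setdiff(expected_files, list.files(sample_path))
}

# Toy demonstration: a temporary folder where one file is deliberately absent.
demo_path <- file.path(tempdir(), "demo_sample")
dir.create(demo_path, showWarnings = FALSE)
file.create(file.path(demo_path, c("barcodes.tsv.gz", "matrix.mtx.gz")))

print(missing_10x_files(demo_path))  # "features.tsv.gz"
```

On real output, the same helper could be applied to every folder returned by list.dirs(output_dir, recursive = FALSE); an empty result means the folder is complete.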

Processing non-standard H5 files

Non-standard H5 files should be treated with care. The general idea is to manually inspect the file using the hdf5r R package or the GUI program HDFView and take note of the names of the slots that contain the necessary data: the actual counts, the slot with the genes, and the slot with the barcodes (the names of the cells). All three should be single columns of integer numbers; if a slot contains decimals, it is not the one you’re looking for. The problem is that, depending on how the files were previously processed, the slots could be named differently, so there’s no easy way to automate this and manual decisions must be made.

The counts slot could be called anything from “data”, “counts”, or “reads” to “umi_corrected_reads” (we prefer UMI-corrected counts if available). The gene and barcode slots are usually named just that: “genes” and “barcodes”.

In addition, we will need two extra slots with metadata: the gene IDs (usually Ensembl IDs; look for a vector of strings that start with “ENS” followed by a number) and the gene names (gene symbols).

More details are provided in the Define Parameters section.
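The manual inspection described above can be sketched with hdf5r as follows. To keep the example self-contained, it first writes a tiny toy file; the slot names and values in it are made up for illustration and will differ in real data, where h5_path should point at your own file instead.

```r
library(hdf5r)

# Toy file, for illustration only. With real data, skip this part.
h5_path <- tempfile(fileext = ".h5")
toy <- H5File$new(h5_path, mode = "w")
toy[["umi_corrected_reads"]] <- c(3L, 1L, 7L)
toy[["gene"]] <- c(0L, 2L, 1L)
toy[["barcode"]] <- c(10L, 10L, 42L)
toy[["gene_ids"]] <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000146648")
toy[["gene_names"]] <- c("TP53", "BRCA1", "EGFR")
toy$close_all()

# Open read-only and list everything stored in the file.
h5 <- H5File$new(h5_path, mode = "r")
contents <- h5$ls(recursive = TRUE)
print(contents)

# Read one candidate slot and check that it holds whole numbers only.
values <- h5[["umi_corrected_reads"]][]
is_whole <- function(x) all(x == round(x))
print(head(values))
print(is_whole(values))

h5$close_all()
```

The name column of the listing gives the paths to use as slot names in the next section; a slot for which is_whole() returns FALSE is probably normalized data rather than raw counts.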

Libraries

We need to have hdf5r, data.table, Matrix, and DropletUtils installed. hdf5r, data.table, and Matrix can be installed using the install.packages() function; please refer to Bioconductor for instructions on how to install DropletUtils.

library(hdf5r)
library(Matrix)
library(DropletUtils)

Define parameters

Define slot names by inspecting the H5 files to be processed, using either hdf5r or HDFView. The slot names are the paths inside the H5 file that point to different pieces of information required to convert the files, such as the data or the gene names.

These slot names MUST be changed before processing. They are specific to each non-standard h5 file.

The counts, genes, and barcode slots' lengths must be the same. These three are used to build the sparse count matrix.

  • counts_slot should point to the actual data.
  • genes_slot should point to an integer vector with row indices.
  • barcodes_slot should point to an integer vector with column indices.

counts_slot <- "umi_corrected_reads"
genes_slot <- "gene"
barcodes_slot <- "barcode"
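To see why the three slots must have the same length, it helps to look at how they combine: each position describes one non-zero entry of the matrix as a (row, column, value) triplet. A minimal sketch with made-up toy values, using only the Matrix package:

```r
library(Matrix)

# One entry per non-zero value: position k of each vector describes
# the same matrix cell. Toy values, 1-based indices.
counts   <- c(3, 1, 7)  # value of each non-zero entry
genes    <- c(1, 3, 2)  # row index of each entry
barcodes <- c(1, 1, 2)  # column index of each entry

# The three vectors must line up one-to-one.
stopifnot(length(counts) == length(genes),
          length(genes) == length(barcodes))

# 3 genes x 2 cells; every cell not listed above is zero.
m <- sparseMatrix(i = genes, j = barcodes, x = counts, dims = c(3, 2))
print(m)
```

Here gene 1 has 3 counts and gene 3 has 1 count in the first cell, and gene 2 has 7 counts in the second cell.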

The gene names and gene IDs slots should be the same length as each other, and most likely shorter than the counts/genes/barcodes slots. These are the gene labels used when creating the 10x files.

Like the previous slots, these MUST be renamed according to the structure of the specific h5 file being processed.

  • gene_ids_slot should point to a character vector of gene IDs.
  • gene_names_slot should point to a character vector of gene symbols.

gene_ids_slot <- "gene_ids"
gene_names_slot <- "gene_names"
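The relationship between the two groups of slots can be checked before converting anything: every row index found in the genes slot needs a label in the gene IDs and gene names vectors. The sketch below uses a hypothetical helper, check_gene_labels, and assumes that a minimum index of 0 means the indices are 0-based.

```r
# Hypothetical helper: TRUE when every row index has a matching gene label.
check_gene_labels <- function(genes, gene_ids, gene_names) {
  # IDs and symbols describe the same genes, so they must pair up.
  stopifnot(length(gene_ids) == length(gene_names))

  # Heuristic (an assumption): a minimum of 0 means 0-based indices,
  # so the highest row number is max(genes) + 1.
  n_rows <- if (min(genes) == 0) max(genes) + 1 else max(genes)
  n_rows <= length(gene_ids)
}

# Toy values: three genes, 0-based row indices.
gene_ids <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000146648")
gene_names <- c("TP53", "BRCA1", "EGFR")

print(check_gene_labels(c(0L, 2L, 1L), gene_ids, gene_names))  # TRUE
print(check_gene_labels(c(0L, 3L, 1L), gene_ids, gene_names))  # FALSE
```

A FALSE result suggests that the wrong slots were picked, for example labels taken from a filtered gene list while the indices refer to the full one.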

Bulk processing

Use this section to bulk process h5 files.

Set the data_dir to the folder that contains the h5 files. After that, we create a list of all H5 files in the directory, which will be converted. It’s important to print the h5_files variable and check that the file names are correct, so that only the intended H5 files are processed.

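data_dir <- "./"
setwd(data_dir)
# list.files() takes a regular expression: match names ending in ".h5".
# After setwd(), the files are looked up in the current directory.
h5_files <- list.files(pattern = "\\.h5$")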
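The check on the file list can be sketched as below. The example runs on a temporary folder with made-up file names, so it shows which names a pattern anchored on the ".h5" extension picks up and which it skips.

```r
# Toy folder with two H5 files and two look-alikes (empty placeholder files).
demo_dir <- file.path(tempdir(), "h5_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(
  demo_dir,
  c("sample_A.h5", "sample_B.h5", "notes_h5.txt", "backup.h5.old")
))

# Only names that end in ".h5" are kept.
h5_files <- list.files(demo_dir, pattern = "\\.h5$")
print(h5_files)

# Stop early instead of silently looping over nothing.
if (length(h5_files) == 0) {
  stop("No .h5 files found in ", demo_dir)
}
```

With real data, replace demo_dir with your own folder and compare the printed names with the samples you expect to convert.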

Create an output directory to store the converted files.

output_dir <- file.path(data_dir, "out")
dir.create(output_dir)

Required functions

These functions do the actual work, so we need to load them. They extract the slots we defined earlier and build the sparse count matrix using them.

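extract_slots <- function(h5_path) {
  h5 <- H5File$new(h5_path, mode = "r")
  # release the file handle when the function exits, even on error
  on.exit(h5$close_all(), add = TRUE)

  counts <- h5[[counts_slot]][]
  genes <- h5[[genes_slot]][]
  barcodes <- h5[[barcodes_slot]][]

  gene_ids <- h5[[gene_ids_slot]][]
  gene_names <- h5[[gene_names_slot]][]

  # dense ranks turn arbitrary barcode identifiers into consecutive
  # 1-based column indices (1, 2, 3, ...)
  r_barcodes <- data.table::frankv(barcodes, ties.method = "dense")

  # heuristic: a minimum gene index of 0 means the row indices are 0-based.
  # Shift them to 1-based so they follow the same convention as r_barcodes.
  if (min(genes) == 0) genes <- genes + 1L

  return(
    list(
      "counts" = counts,
      "genes" = genes,
      "barcodes" = barcodes,
      "r_barcodes" = r_barcodes,
      "gene_ids" = gene_ids,
      "gene_names" = gene_names,
      # both index vectors are 1-based at this point
      "index1" = TRUE
    )
  )
}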

build_sparse_matrix <- function(slots) {
  sparse_matrix <-
    sparseMatrix(
      i = slots[["genes"]],
      j = slots[["r_barcodes"]],
      x = slots[["counts"]],
      repr = "C",
      index1 = slots[["index1"]]
    )

  return(sparse_matrix)
}
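The dense ranking used in extract_slots may look opaque, so here is a small base R sketch of the same idea on made-up barcode identifiers. For integer input, match() against the sorted unique values should give the same result as the dense rank.

```r
# Toy barcode identifiers, one per non-zero entry, in arbitrary order.
barcodes <- c(42L, 10L, 10L, 97L)

# Each distinct barcode becomes one matrix column, in sorted order.
labels <- sort(unique(barcodes))

# Position of every entry's barcode among the sorted labels:
# this is the 1-based column index of that entry.
col_index <- match(barcodes, labels)

print(labels)     # 10 42 97
print(col_index)  # 2 1 1 3

# Column k of the matrix therefore belongs to labels[k].
print(paste0("cell_", labels))
```

Because columns follow the sorted order of the identifiers, the barcode labels written to the output files need to be in that same sorted order.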

Processing

This block converts all the h5 files detected and stored in the h5_files variable using our previously defined parameters (slot names) and functions. The sample_name will be the folder name for each sample; feel free to edit as desired.

for (file in h5_files) {
  print(file)

  # make sample names, removing .h5
  sample_name <- sub("\\.h5$", "", basename(file))
  sample_path <- file.path(output_dir, sample_name)

  # to show progress
  print(sample_name)

  # read h5 files and build sparse matrix
  slots <- extract_slots(file)
  counts <- build_sparse_matrix(slots)

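  # write to files. Column k of the matrix holds the k-th smallest barcode
  # value (dense rank), so the labels must be in sorted order as well.
  DropletUtils::write10xCounts(sample_path,
                               counts,
                               barcodes = paste0("cell_", sort(unique(slots[["barcodes"]]))),
                               gene.id = slots[["gene_ids"]],
                               gene.symbol = slots[["gene_names"]],
                               version = "3")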
}

Now your data should be in a format that's compatible with Cellenics®! You can analyze your data for free using the Biomage-hosted community instance of Cellenics® available at https://scp.biomage.net/
