Converting CSV/TSV files to upload to Cellenics®

Tutorial

Jul 21

Introduction

“Comma (or Tab) Separated Value” files (CSV or TSV) are a common file type used for the storage of tabular data. They are generally not recommended for storing and sharing biological data, since more robust alternatives exist (such as H5 files), but they are very widely used and supported.

The main issue with these files, as far as compatibility with the open source scRNA-seq data analysis tool Cellenics® is concerned, is that there is no well-defined standard for how the single-cell RNA-seq information is represented. The genes and barcodes might be in rows or in columns. The sample information might be represented as one file per sample (the best case scenario), or it might be encoded in many other ways (in the barcode name, in an extra column, etc). All of this requires careful examination of the input files to decide what the processing should be, which may involve some modification of the code presented in this document.

We will make some generalizing assumptions:

Genes are stored in rows
Barcodes (cells) are stored in columns
Sample information is encoded in the name of the barcode

If the sample assignment is not encoded in the barcode (for example, when each sample is stored in a separate file), leaving the sample_regex variable as NA should be enough.
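As a minimal sketch of the layout these assumptions describe, the toy table below has genes in rows and "sample_barcode" column names. All gene names, sample names and barcodes are made up for illustration. If a file stores barcodes in rows instead, data.table's transpose() is one possible way to flip it before continuing; the "barcode" column name is an assumption about how such a file might look.

```r
library(data.table)

# Assumed layout: first column holds gene names, every other column is a cell
genes_in_rows <- data.table(
  V1 = c("GeneA", "GeneB"),
  sample1_AAAC = c(0L, 2L),
  sample1_CCCG = c(3L, 0L),
  sample2_GGGT = c(1L, 5L)
)

# Hypothetical file with the opposite orientation: one row per cell
cells_in_rows <- data.table(
  barcode = c("sample1_AAAC", "sample1_CCCG", "sample2_GGGT"),
  GeneA = c(0L, 3L, 1L),
  GeneB = c(2L, 0L, 5L)
)

# make.names: input column whose values become the new column names
# keep.names: name of the new first column holding the old column names
flipped <- transpose(cells_in_rows, keep.names = "V1", make.names = "barcode")

print(flipped)
```

After the flip, the table has the same shape as genes_in_rows and can go through the rest of the steps unchanged.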

Libraries

We need the data.table, DropletUtils and Matrix packages installed. DropletUtils is available from Bioconductor, while data.table and Matrix are available from CRAN.

library(data.table)
library(DropletUtils)
library(Matrix)
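If any of the library() calls fails, the package is probably not installed yet. The snippet below is a small sketch that reports which of the three packages are missing; the variable names are illustrative, and the install commands are left commented out so that nothing is installed by accident.

```r
# Packages used in this tutorial
required_pkgs <- c("data.table", "Matrix", "DropletUtils")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install whatever is missing:
# install.packages(c("data.table", "Matrix"))   # CRAN
# install.packages("BiocManager")               # CRAN, installer for Bioconductor
# BiocManager::install("DropletUtils")          # Bioconductor
```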

Function definition

These are the functions that will do the work for us, so we have to load them.

#' clean original data.table CSV column names
#' 
#' Removes sample information from column names. It modifies in place!
#'
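#' @param dt data.table original csv/tsv dataset; its column names are modified in place
#' @param clean_barcodes character vector of new column names, one per column of dt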
#'
clean_dt_colnames <- function(dt, clean_barcodes) {
  setnames(dt, base::colnames(dt), clean_barcodes)
}

#' make sample <-> barcode table
#'
#' Extracts sample name from "sample_barcode" encoded column names in csv table. 
#' Creates table with barcode - sample association.
#' Users should manually check if the regex is correct for the particular dataset
#' being demultiplexed.
#'
#' @param dt data.table original csv/tsv dataset
#' @param sample_regex chr regex to parse column names for sample and barcodes
#'
#' @return data.table
#'
make_sample_barcode_tab <- function(dt, sample_regex = NA) {
  samp_bc <- colnames(dt)

  if (!is.na(sample_regex)) {
    sample_names <- gsub(sample_regex, "\\1", samp_bc)
    barcodes <- gsub(sample_regex, "\\2", samp_bc)

    clean_dt_colnames(dt, barcodes)
  } else {
    barcodes <- samp_bc
    sample_names <- rep_len("single_sample", length(barcodes))
  }

  # first var in dt is the gene_names var (data.tables don't have rownames)
  data.table(
    sample = sample_names[-1],
    barcode = barcodes[-1]
  )
}


#' Create list of barcodes in samples
#'
#' @param sample_barcode_tab data.table sample/barcode table
#'
#' @return list one element per sample, with every barcode in sample
#'
list_barcodes_in_sample <- function(sample_barcode_tab) {
  # nest each barcode group to separate data.table
  nested_sample_dt <- sample_barcode_tab[, .(bc_list = list(.SD)), by = sample]

  # convert nested data table to list
  lapply(nested_sample_dt[["bc_list"]], unlist)
}


#' subset data.table
#'
#' Subsets cleaned (clean_dt_colnames) data.table, provided character vector of 
#' barcodes in sample. 
#' Helper function to simplify lapply calls. 
#'
#' @param columns character vector of barcodes to keep
#' @param dt data.table cleaned count csv, with gene names in column "V1"
#'
#' @return data.table subsetted data.table
#'
sub_dt <- function(columns, dt) {
  # subset a data table by character vector, to ease lapply
  columns <- c("V1", columns)
  dt[, ..columns]
}


#' export demultiplexed data
#' 
#' Exports 10X files, one folder per sample.
#' 
#' @param sample_dt data.table sample <-> barcode table
#' @param sparse_matrix_list list of count matrices per sample
#' @param data_dir chr root dir to export
#'
export_demultiplexed_data <- function(sample_dt, sparse_matrix_list, data_dir) {

  nested_sample_dt <- sample_dt[, .(bc_list = list(.SD)), by = sample]

  for (row in seq_len(nrow(nested_sample_dt))) {
    fname <- file.path(data_dir, "out", nested_sample_dt[row][["sample"]])

    # unnest barcodes in sample
    expected_barcodes_in_sample <- nested_sample_dt[row, bc_list[[1]]][["barcode"]]

    if (!identical(expected_barcodes_in_sample, colnames(sparse_matrix_list[[row]]))) {
      stop("not the same barcodes")
    }

    DropletUtils::write10xCounts(fname,
      sparse_matrix_list[[row]],
      version = "3"
    )
  }
}

Parameter definition

Files and Folders

Set the data_dir to the folder that contains the CSV/TSV file or files. After that, we create a list of all CSV/TSV files in the directory, which will be converted. We will refer to them as CSV files, but this applies to both types. If they are compressed, you should uncompress them beforehand.

After creating the list of CSV/TSV files to process, we should manually check if it contains the correct files by printing it.

data_dir <- "./"
setwd(data_dir)
csv_files <- list.files(data_dir, pattern = "\\.[ct]sv$")

print(csv_files)
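Note that list.files() interprets pattern as a regular expression rather than a shell glob. The sketch below tries an escaped-dot pattern on a throwaway folder with made-up file names, to show which names it picks up and which it leaves out (compressed files and other extensions).

```r
# Build a throwaway folder with made-up file names to test the pattern on
demo_dir <- file.path(tempdir(), "csv_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(
  file.path(demo_dir, c("counts.csv", "counts.tsv", "counts.csv.gz", "notes.txt"))
)

# "\\." is a literal dot and "$" anchors the match to the end of the file name
matched <- list.files(demo_dir, pattern = "\\.[ct]sv$")
print(matched)
```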

Create an output directory, to store the converted files.

output_dir <- file.path(data_dir, "out")
dir.create(output_dir)

Manual inspection

We should read at least one of the CSV files and take a look at it. We’re especially interested in the column names, to see if they contain sample information.

The output of some useful R functions, such as str, colnames and head, helps here:

csv_example <- fread(csv_files[1])

# Look at the general structure of the matrix.
str(csv_example)

# print the column names, usually the barcodes
colnames(csv_example)

# print the first 20 rows of the first column (usually gene names)
head(csv_example[, 1], 20)

Looking at the column names, we should be able to tell if there’s sample information encoded, which will inform our decision in the next section.
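Besides looking at the output by eye, a couple of quick checks can confirm that the table matches the assumed layout. This is only a sketch: it writes a tiny made-up count table to a temporary file so that it runs on its own, and the check names are illustrative.

```r
library(data.table)

# Write a tiny made-up count table so the checks below have something to read
demo_file <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    gene = c("GeneA", "GeneB", "GeneC"),
    sample1_AAAC = c(0L, 2L, 1L),
    sample2_GGGT = c(1L, 5L, 0L)
  ),
  demo_file
)
csv_example <- fread(demo_file)

# Class of every column, in order
col_classes <- vapply(csv_example, function(x) class(x)[1], character(1))
print(col_classes)

# The first column should hold gene names, i.e. be of type character
first_col_is_character <- col_classes[[1]] == "character"

# Every other column should hold counts, i.e. be integer or numeric
count_cols_are_numeric <- all(col_classes[-1] %in% c("integer", "numeric"))

print(c(first_col_is_character, count_cols_are_numeric))
```

If the second check fails, the file probably has extra annotation columns (or barcodes in rows) and needs some reshaping first.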

Sample Information

If the samples are encoded in the barcode names, you should write a regular expression (regex) that captures the sample name/id and the barcodes. For example, if the barcodes looked like “sampleX_AAACTAGCTCGCGA” our regex should have two groups (surrounded by parentheses), and match “sampleX” and “AAACTAGCTCGCGA”.

Explaining regex in depth is out of the scope of this document, but this should get you started:

The example regex has two groups, separated by an underscore:

1. The first group captures the sample ID: (sample[[:digit:]]+) matches the word “sample” followed by any digit (“[[:digit:]]”) repeated one or more times (“+”).
2. The second group captures the barcode, which is usually a short nucleotide sequence, so ([ACGT]+) matches any of A, C, G or T (“[ACGT]”) repeated one or more times (“+”).

Finally, we expect the two groups to be separated by an underscore (“_”).

sample_regex <- NA
# example regex: "(sample[[:digit:]]+)_([ACGT]+)"
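Before processing a whole file, it can help to try the regex on a few column names. The sketch below uses made-up column names and the same two gsub() calls that make_sample_barcode_tab relies on. It also shows two things worth checking by hand: names that the regex matches only in part, and barcodes that become duplicated once the sample prefix is stripped.

```r
sample_regex <- "(sample[[:digit:]]+)_([ACGT]+)"

# Made-up column names; "V1" stands for the gene-name column
col_names <- c("V1", "sample1_AAACTAGC", "sample2_AAACTAGC", "sample2_TTTGCATG-1")

# Group 1 is the sample, group 2 is the barcode
sample_names <- gsub(sample_regex, "\\1", col_names)
barcodes <- gsub(sample_regex, "\\2", col_names)
print(sample_names)
print(barcodes)

# gsub() only replaces the part that matches, so leftovers such as "-1" survive.
# Anchoring the regex flags the names that are not matched in full.
fully_matched <- grepl(paste0("^", sample_regex, "$"), col_names[-1])
print(col_names[-1][!fully_matched])

# Stripping the sample can leave the same barcode in two samples; the
# resulting duplicated column names can no longer be told apart by name
duplicated_barcodes <- unique(barcodes[-1][duplicated(barcodes[-1])])
print(duplicated_barcodes)
```

If either of the last two checks prints anything for your data, the regex (or the way columns are selected) likely needs adjusting before running the conversion.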

Processing the files

Now that we have loaded our packages, defined our functions and set our parameters, it’s time to process our files by running the next block.

NOTE: Since CSV/TSV files can be pretty big, we have to be careful with the RAM usage, which is why there are some calls to the rm() function (to remove unnecessary objects) and gc() to force R’s garbage collection.

for (file in csv_files) {
  csv_table <- fread(file)
  setnames(csv_table, old = 1, new = "V1")

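  sample_tab <- make_sample_barcode_tab(csv_table, sample_regex)

  # without a regex every barcode is labelled "single_sample"; use the file name
  # instead, so that several single-sample files don't export to the same folder
  if (is.na(sample_regex)) {
    sample_tab[, sample := sub("\\.[ct]sv$", "", basename(file))]
  }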

  gc()

  # subset the original count data.table, separating by samples if present
  dt_subset <- lapply(list_barcodes_in_sample(sample_tab), sub_dt, csv_table)
  rm(csv_table)
  gc()

  # convert each subsetted count data.table to count matrix
  counts <- lapply(dt_subset, as.matrix, rownames = "V1")
  rm(dt_subset)
  gc()

  # convert each count matrix to sparse matrices
  sparse_counts <- lapply(counts, Matrix, sparse = TRUE)
  rm(counts)
  gc()

  # export the data to one folder per sample
  export_demultiplexed_data(sample_tab, sparse_counts, data_dir)
}
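As a rough illustration of why the loop converts to sparse matrices and frees intermediate objects, the sketch below compares the memory footprint of a dense and a sparse version of the same matrix. The matrix is simulated (made-up dimensions, roughly 95% zeros), so the exact numbers will differ from real data.

```r
library(Matrix)

set.seed(1)

# Made-up count matrix: 2000 genes x 500 cells, roughly 95% zeros
n_genes <- 2000
n_cells <- 500
n_values <- n_genes * n_cells
dense_counts <- matrix(
  rbinom(n_values, size = 1, prob = 0.05) * rpois(n_values, lambda = 3),
  nrow = n_genes,
  ncol = n_cells
)

# Same conversion as in the loop: only non-zero entries are stored
sparse_counts <- Matrix(dense_counts, sparse = TRUE)

print(object.size(dense_counts), units = "MB")
print(object.size(sparse_counts), units = "MB")
```

The dense table and dense matrix are only needed as stepping stones, which is why they are removed as soon as the next representation exists.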

Once the loop has finished, you should have an “out” folder containing all the samples in a format compatible with Cellenics®! Academic users can analyze their data for free using the Biomage-hosted community instance of Cellenics® at https://scp.biomage.net/
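To double-check an export, one option is to list the files in a sample folder and read them back. The sketch below does this on a made-up three-cell matrix written to a temporary folder, so that it runs on its own; for real data you would point sample_dir at one of the folders inside “out”. It uses read10xCounts from DropletUtils, which is just one of several ways to load this format.

```r
library(Matrix)
library(DropletUtils)

# Made-up sparse count matrix standing in for one exported sample
toy_counts <- Matrix(
  matrix(
    c(0, 2, 3, 0, 1, 5),
    nrow = 2,
    dimnames = list(c("GeneA", "GeneB"), c("AAAC", "CCCG", "GGGT"))
  ),
  sparse = TRUE
)

# Export to a fresh temporary location (the sample folder must not exist yet)
out_root <- tempfile("out_")
dir.create(out_root)
sample_dir <- file.path(out_root, "sample1")
write10xCounts(sample_dir, toy_counts, version = "3")

# A version 3 export holds three gzipped files per sample folder
exported_files <- list.files(sample_dir)
print(exported_files)

# Reading the folder back should give the same dimensions as the input
reloaded <- read10xCounts(sample_dir)
print(dim(reloaded))
```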

Germán Beldorati Stark