Mtx format
The mtx, matrix market, format is a sparse format for matrices. It only stores non zero values and is becoming popular in single-cell softwares.
The main advantage is that it requires less space than a dense matrix and that you can easily add different feature names within the same object.
For CITE-seq-Count, the output looks like this:
OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml
File descriptions
features.tsv.gz
contains the feature names, in this context our tags.barcodes.tsv.gz
contains the cell barcodes.matrix.mtx.gz
contains the actual values. read_count and umi_count contain respectively the read counts and the collapsed umi counts. For analysis you should use the umi data. The read_count can be used to check if you have an overamplification or oversequencing issue with your protocol.unmapped.csv
contains the top N tags that haven't been mapped.run_report.yaml
contains the parameters used for the run as well as some statistics. here is an example:
Date: 2019-10-01
Running time: 13.86 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 1000000
Percentage mapped: 33
Percentage unmapped: 67
Uncorrected cells: 0
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 57
UMI collapsing threshold: 2
UMIs corrected: 329
Run parameters:
Read1_filename: fastq/test_R1.fastq.gz,fastq/test2_R1.fastq.gz
Read2_filename: fastq/test_R2.fastq.gz,fastq/test2_R2.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 26
Expected cells: 100
Tags max errors: 1
Start trim: 0
Packages to read MTX
R
I recommend using Seurat
and their Read10x
function to read the results.
With Seurat V3:
Read10x('OUTFOLDER/umi_count/', gene.column=1)
With Matrix:
library(Matrix)
matrix_dir = "/path_to_your_directory/out_cite_seq_count/umi_count/"
barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
mat <- readMM(file = matrix.path)
feature.names = read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
rownames(mat) = feature.names$V1
Python
I recommend using scanpy
and their read_mtx function to read the results.
Example:
import scanpy
import pandas as pd
import os
path = 'umi_cell_corrected'
data = scanpy.read_mtx(os.path.join(path,'umi_count/matrix.mtx.gz'))
data = data.T
features = pd.read_csv(os.path.join(path, 'umi_count/features.tsv.gz'), header=None)
barcodes = pd.read_csv(os.path.join(path, 'umi_count/barcodes.tsv.gz'), header=None)
data.var_names = features[0]
data.obs_names = barcodes[0]