RNASq - Seq Analysis Tools
This documents a set of tools, written for use in Python and using extensively the tools from the BioPython suite. It is intended not for genomic studies, but rather, for characterizing relatively short complete or RACE sequences that are expected to be based on an "expected" sequence but that nevertheless have significant sequence and/or length heterogeneity. We use the tools to study in vitro transcribed RNAs.
Introduction
This document describes a suite of Python tools for analysis of in vitro RNA-Seq data (not intended for genomic analyses). The basic usage follows an object oriented model. You start by defining a variable to contain information about a specific experimental data set.
myExptSetUp = Seqsetup(
)
Parameters above, in order (noting how you can reference each in programming):
- RNAset.filename - Illumina file name: fastq format (gzipped, or not)
- RNAset.tseq - Expected (encoded) sequence (can be in DNA or RNA format)
- RNAset.adptr5 - 3' end of the 5' adapter (default used in trimming; can be overridden at trimming)
- RNAset.adptr3 - 5' end of the 3' adapter (default used in trimming; can be overridden at trimming)
- RNAset.exptinfo - (einfo) special variable - see below - contains info on the transcription experiment
- RNAset.pnotes - one line description of the experiment - what's it about?
- RNAset.istemplate - True/False - is this a DNA sequence (typical use: TRUE if seq of template DNA)?
- RNAset.SeqsUsed - dictionary variable - see below - contains sequences relative to the experiment
- RNAset.dData - dictionary variable - see below - contains any other info to include
The default location for the sequence input file is in a directory called 'Data' one level up
from the Python source code.
The default for output is in a directory called 'Output" one level up.
These locations can be changed as follows:
config.Options["InputDir"] = "../Data"
config.Options["OutputDir"] = "../Output"
Note that here and below, all sequences should be written 5' to 3'
The following variables might be defined once (or twice, or three times) and then used in the definition of multiple sequencing data sets by passing the parameter as shown above.
einfo is a parameter set containing information on the experimental details. It can be setup using the following function:
einfo = Exptinfo(
)
SeqsUsed is a parameter set containing info about the DNA constructs used. All parameters are optional. It can be setup using the following syntax:
SeqsUsed = {
}
AlignSeq is used as a default by some functions for aligning sequences that might have been "frameshifted." So for example, if transcription started at +2, but you are interested in the 3' end of the transcript, that end will be shifted left by one. Referencing a sequence in the middle of the transcript, one can align all sequences by their common internal alignment sequences - now the 3' ends line up regardless of initiation variability. The value passed here is the default and typically can be overridden.
Routines that use this as a default: endAnalysis, getPhaseShifted (set keypos to 0); getSubseqBySeq, getSubseqByRelativeSeq, getWithMatched, getWithMatchWndw, RNAkeyseqPosAnal (set keyseq to False),
In your own programming, you can access these as: newvar = RNAset.SeqsUsed["Ntmpl"]
dData is a parameter set containing other information. All parameters are optional. It can be setup using the following syntax:
dData = {
}
In your own programming, you can access these as: newvar = RNAset.dData["Run Date"]
Click
After defining your new variable myExptSetUp using Seqsetup, you then use yExptSetUp to load (import) sequence data into a new object DNAset (you can have multiple objects for different sequences; this is the object oriented approach). The new variable (object) will then include both the raw sequence data and all of the information that was provided in the definition of the experimental data set.
DNAset = myExptSetUp.import_dataset();
places all of the Illumina sequences (in DNA format by default) into a new (object) variable called DNAset
DNAsetTrimmed = DNAset.trimAdaptors(None, None);
returns a new (object) variable containing sequences with the 5' and 3' adapters removed (trimmed). The two parameters specify the 5' and 3' adapter sequences used for trimming (specifying None or False for each says to use the sequences in the above definition of 'myExptSetUp'). Note that primer dimers (5' and 3' adapters directly ligated, with no intervening DNA) and sequences that do not contain both 5' and 3' adapter sequences are NOT returned - the console will report on these statistics.
Note that the sequences are still DNA format at this point.
RNAset = DNAsetTrimmed.toRNASet();
converts the trimmed DNA sequences into RNA (replaces T by U, and flags the set as RNA)
Finally, all of the above can be condensed into one line by chaining the objects together:
RNAset = myExptSetUp.import_dataset().trimAdaptors(None, None).toRNASet();
At this point, RNAset includes all of the RNA sequences, trimmed to remove adapter sequences. It also includes all of the information that was specified in the initial definition of 'myExptSetUp. This approach illustrates what you will be doing next.
For example, to filter the data, returning only sequences of length 10 bases or longer:
RNAsetNew = RNAset.getSpecificLengths(10,10000,'remove abortives');
Click
Typical usage involves first setting up an "experiment" by calling Seqsetup. This identifies the .fasta or .fastq file containing the sequences. It also sets various parameters, including the 5' and 3' adapter sequences and the expected RNA(DNA) sequence to be found between those adapters. Data files are expected to be found in the Data folder, one level above the code.
After defining the experiment with Seqsetup, the data are loaded into a NucleicSet object using the
.import_dataset command below (this command is an object property of the Seqsetup variable and so is called as
newNucleicSet = SeqSet.import_dataset().
The generated NucleicSet variable can then be further processed using any of the 'getXxxx" commands below. These operate on the NucleicSet and return a new NucleicSet. Unless the primary data have already been processed, the first call is typically something like RNAset = newNucleicSet.trimAdaptors(None,None). For this function, the first two parameters are the 5' and 3' adaptor sequences to use in trimming. Passing None for each of these defaults to using the adaptors defined in the Seqsetup step.
The trimmed data can now be processed to, for example, extract only post-abortive sequences using a call
like
postAbrtvSet = newNucleicSet.getSpecificLengths(9,2000,'post-abortives')
Note also that many other functions implicitely filter for minimum lengths. For example, .getPhaseShifted and
.getPrimedExt select for sequences containing key sequences at specific or minimal length positions.
Many of the .getXXX functions also report analyses of the processing. In addition, there are Analysis functions
that are solely for analysis. These are typically called as
RNAset.getGel('A','label RNA with alpha-ATP')
Note that most functions have a comment variable. This is simply for your use. Anything entered there will be printed above the results of the function (if this variable is set to "quiet", output will be suppressed.
Finally, output can be sent to a file by bracketing commands with StartCaptureToFile() and WriteCaptureToFile('_extra'). That file will be created in the Output folder, one level above the code.
What is ?
For example: newSet = RNAset.getSpecificLengths(8,10,'return 8mers, 9mers, and 10mers only')
Specifically,
xx.filename is the name of the file from which these data were read
xx.tseq = the expected RNA transcript sequence
xx.SeqsUsed is a python dictionary containing (5' to 3') sequences used in the experiment. Examples include:
- xx.SeqsUsed['Tmplt'] - sequence of the DNA template strand (remember: 5' to 3')
- xx.SeqsUsed['NTmpl'] - sequence of the DNA nontemplate strand
- xx.SeqsUsed['5Prmr'] - the entire sequence of the 5' adaptor (not just the barcode)
- xx.SeqsUsed['3Prmr'] - the entire sequence of the 3' adaptor
- xx.SeqsUsed['AlignSeq'] - an internal sequence used to align sequences, typically a subset of tseq
- xx.SeqsUsed['Toehld'] - the inverse complement of the template toehold if present (in other words, the expected RNA sequence, were the toehold to be transcribed)
- xx.SeqsUsed['T2a'] - in toehold synthesis, the upstream (promoter) template strand (remember: 5' to 3')
- xx.SeqsUsed['T2b'] - in toehold synthesis, the transcribed template strand (remember: 5' to 3')
- xx.SeqsUsed['Target'] - in the 'jump" experiment, the target template strand (remember: 5' to 3')
xx.dData is a python dictionary containing misc info
For the :
class NucleicSet(object):
def __init__(self, dData, seqreclist, setnotes, history, tseq, exptinfo, istemplate, SeqsUsed, filename, xtra):
self.dData = dData # Dictionary with all sorts of data (currently unused, but in the future...)
self.seqreclist = seqreclist # list of SeqRecord objects, one for each sequence
self.setnotes = setnotes # string - notes about this set
self.history = history # the history of this data (aka pnotes)
self.tseq = tseq # string - expected sequence
self.exptinfo = exptinfo # Exptinfo object - experimental details
self.istemplate = istemplate # boolean - is template DNA, not RNA
self.SeqsUsed = SeqsUsed # Dictionary of DNA (5' to 3') used in expt - 'fullseq' is full sequence, with 5' and 3' primers
self.filename = filename # original source file name (if any)
self.xtra = xtra # variable of class Xtra, for holding a variety of things analytical
Quick useful data:
- xx.count() - the number of sequences in this set
- xx.maxlength() - the length of the longest RNA in this set
Filter ("get") Functions
These functions return a subset of the sequences, depending on specific criteria. They do not modify the sequences returned, they just remove some.
Interesting uses:
- RNAset.getSpecificLengths(2,8,'get only short (abortive) RNAs')
- RNAset.getSpecificLengths(9,10000,'remove short (abortive) RNAs')
- RNAset.getSpecificLengths(len(RNAset.tseq,10000),len(RNAset.tseq,10000),'get only expected length (full length) RNAs')
pos = 0 allows the sequence to be found anywhere
pos > 0 returns a match only if the sequence is found at position 'pos'
If keyseq is a sequence preceded by '-' (e.g. '-GGACTTA' or '-5Prmr'), returns the sequences that do NOT contain the key sequence
Interesting uses:
- RNAset.getWithMatched(RNAset.tseq[:2],1,'get only RNAs with correct dinucleotide starting sequence')
- RNAset.getWithMatched(RNAset.SeqsUsed["AlignSeq"],0,'get only RNAs with originally specified align sequence (anywhere)')
- RNAset.getWithMatched(False,0,'get only RNAs with originally specified align sequence (anywhere)')
- RNAset.getWithMatched(False,False,'get only RNAs with originally specified align sequence at the expected position')
- RNAset.getWithMatched(RNAset.tseq[-4:],0,'get only RNAs with the last four encoded bases, at any position')