Python Sequence Analysis Tools
(another tool you may find useful is "Oligo Design")
Author: Craig Martin
Documentation last updated: Mar 09, 2020 (v5.3)
Introduction
This document describes a suite of Python tools for analysis of in vitro RNA-Seq data (not intended for genomic analyses). The basic usage follows an object oriented model. You start by defining a variable to contain information about a specific experimental data set.
myExptSetUp = Seqsetup(
)
Click
Parameters above, in order (noting how you can reference each in programming):
- RNAset.filename - Illumina file name: fastq format (gzipped, or not)
- RNAset.tseq - Expected (encoded) sequence (can be in DNA or RNA format)
- RNAset.adptr5 - 3' end of the 5' adapter (default used in trimming; can be overridden at trimming)
- RNAset.adptr3 - 5' end of the 3' adapter (default used in trimming; can be overridden at trimming)
- RNAset.exptinfo - (einfo) special variable - see below - contains info on the transcription experiment
- RNAset.pnotes - one line description of the experiment - what's it about?
- RNAset.istemplate - True/False - is this a DNA sequence (typical use: TRUE if seq of template DNA)?
- RNAset.SeqsUsed - dictionary variable - see below - contains sequences relative to the experiment
- RNAset.dData - dictionary variable - see below - contains any other info to include
The default location for sequence input files is a directory called 'Data' one level up
from the Python source code.
The default for output is in a directory called 'Output" one level up.
These locations can be changed as follows:
config.Options["InputDir"] = "../Data"
config.Options["OutputDir"] = "../Output"
Note that here and below, all sequences should be written 5' to 3'
The following variables might be defined once (or twice, or three times) and then used in the definition of multiple sequencing data sets by passing the parameter as shown above.
einfo is a parameter set containing information on the experimental details. It can be setup using the following function:
einfo = Exptinfo(
)
SeqsUsed is a parameter set containing info about the DNA constructs used. All parameters are optional. You can create your own. It can be setup using the following syntax:
SeqsUsed = {
}
AlignSeq is used as a default by some functions for aligning sequences that might have been "frameshifted." So for example, if transcription started at +2, but you are interested in the 3' end of the transcript, that end will be shifted left by one. Referencing a sequence in the middle of the transcript, one can align all sequences by their common internal alignment sequences - now the 3' ends line up regardless of initiation variability. The value passed here is the default and typically can be overridden.
Routines that use this as a default (set keyseq to False): endAnalyses, getAlignedToSeq, getSubseqBySeq, getSubseqByRelativeSeq, getWithMatched, getWithMatchWndw, RNAkeyseqPosAnal,
In your own programming, you can access these as: newvar = RNAset.SeqsUsed["Ntmpl"]
dData is a parameter set containing other information. All parameters are optional. It can be setup using the following syntax:
dData = {
}
In your own programming, you can access these as: newvar = RNAset.dData["Run Date"]
Click
After defining your new variable myExptSetUp using Seqsetup, you then use yExptSetUp to load (import) sequence data into a new object DNAset (you can have multiple objects for different sequences; this is the object oriented approach). The new variable (object) will then include both the raw sequence data and all of the information that was provided in the definition of the experimental data set.
DNAset = myExptSetUp.import_dataset();
places all of the Illumina sequences (in DNA format by default) into a new (object) variable called DNAset
DNAsetTrimmed = DNAset.trimAdaptors(None, None);
returns a new (object) variable containing sequences with the 5' and 3' adapters removed (trimmed). The two parameters specify the 5' and 3' adapter sequences used for trimming (specifying None or False for each says to use the sequences in the above definition of 'myExptSetUp'). Note that primer dimers (5' and 3' adapters directly ligated, with no intervening DNA) and sequences that do not contain both 5' and 3' adapter sequences are NOT returned - the console will report on these statistics.
Note that the sequences are still DNA format at this point.
RNAset = DNAsetTrimmed.toRNASet();
converts the trimmed DNA sequences into RNA (replaces T by U, and flags the set as RNA)
Finally, all of the above can be condensed into one line by chaining the objects together:
RNAset = myExptSetUp.import_dataset().trimAdaptors(None, None).toRNASet();
At this point, RNAset includes all of the RNA sequences, trimmed to remove adapter sequences. It also includes all of the information that was specified in the initial definition of 'myExptSetUp. This approach illustrates what you will be doing next.
For example, to filter the data, returning only sequences of length 10 bases or longer:
RNAsetNew = RNAset.getSpecificLengths(10,10000,'remove abortives');
Click
Typical usage involves first setting up an "experiment" by calling Seqsetup. This identifies the .fasta or .fastq file containing the sequences. It also sets various parameters, including the 5' and 3' adapter sequences and the expected RNA(DNA) sequence to be found between those adapters. Data files are expected to be found in the Data folder, one level above the code.
After defining the experiment with Seqsetup, the data are loaded into a NucleicSet object using the
.import_dataset command below (this command is an object property of the Seqsetup variable and so is called as
newNucleicSet = SeqSet.import_dataset().
The generated NucleicSet variable can then be further processed using any of the 'getXxxx" commands below. These operate on the NucleicSet and return a new NucleicSet. Unless the primary data have already been processed, the first call is typically something like RNAset = newNucleicSet.trimAdaptors(None,None). For this function, the first two parameters are the 5' and 3' adaptor sequences to use in trimming. Passing None for each of these defaults to using the adaptors defined in the Seqsetup step.
The trimmed data can now be processed to, for example, extract only post-abortive sequences using a call
like
postAbrtvSet = newNucleicSet.getSpecificLengths(9,2000,'post-abortives')
Note also that many other functions implicitely filter for minimum lengths. For example, .getPhaseShifted and
.getPrimedExt select for sequences containing key sequences at specific or minimal length positions.
Many of the .getXXX functions also report analyses of the processing. In addition, there are Analysis functions
that are solely for analysis. These are typically called as
RNAset.getGel('A','label RNA with alpha-ATP')
Note that most functions have a comment variable. This is simply for your use. Anything entered there will be printed above the results of the function (if this variable is set to "quiet", output will be suppressed.
Finally, output can be sent to a file by bracketing commands with StartCaptureToFile() and WriteCaptureToFile('_extra'). That file will be created in the Output folder, one level above the code.
What is ?
For example: newSet = RNAset.getSpecificLengths(8,10,'return 8mers, 9mers, and 10mers only')
Specifically,
xx.filename is the name of the file from which these data were read
xx.tseq = the expected RNA transcript sequence
xx.SeqsUsed is a python dictionary containing (5' to 3') sequences used in the experiment. Examples include:
- xx.SeqsUsed['Tmplt'] - sequence of the DNA template strand (remember: 5' to 3')
- xx.SeqsUsed['NTmpl'] - sequence of the DNA nontemplate strand
- xx.SeqsUsed['5Prmr'] - the entire sequence of the 5' adaptor (not just the barcode)
- xx.SeqsUsed['3Prmr'] - the entire sequence of the 3' adaptor
- xx.SeqsUsed['AlignSeq'] - an internal sequence used to align sequences, typically a subset of tseq
- xx.SeqsUsed['Toehld'] - the inverse complement of the template toehold if present (in other words, the expected RNA sequence, were the toehold to be transcribed)
- xx.SeqsUsed['T2a'] - in toehold synthesis, the upstream (promoter) template strand (remember: 5' to 3')
- xx.SeqsUsed['T2b'] - in toehold synthesis, the transcribed template strand (remember: 5' to 3')
- xx.SeqsUsed['Target'] - in the 'jump" experiment, the target template strand (remember: 5' to 3')
xx.dData is a python dictionary containing misc info
For the :
class NucleicSet(object):
def __init__(self, dData, seqreclist, setnotes, history, tseq, exptinfo, istemplate, SeqsUsed, filename, xtra):
self.dData = dData # Dictionary with all sorts of data (currently unused, but in the future...)
self.seqreclist = seqreclist # list of SeqRecord objects, one for each sequence
self.setnotes = setnotes # string - notes about this set
self.history = history # the history of this data (aka pnotes)
self.tseq = tseq # string - expected sequence
self.exptinfo = exptinfo # Exptinfo object - experimental details
self.istemplate = istemplate # boolean - is template DNA, not RNA
self.SeqsUsed = SeqsUsed # Dictionary of DNA (5' to 3') used in expt - 'fullseq' is full sequence, with 5' and 3' primers
self.filename = filename # original source file name (if any)
self.xtra = xtra # variable of class Xtra, for holding a variety of things analytical
Quick useful data:
- xx.count() - the number of sequences in this set
- xx.maxlength() - the length of the longest RNA in this set
Filter ("get") Functions
These functions return a subset of the sequences, depending on specific criteria. They do not modify the sequences returned, they just remove some.
Interesting uses:
- RNAset.getSpecificLengths(2,8,'get only short (abortive) RNAs')
- RNAset.getSpecificLengths(9,10000,'remove short (abortive) RNAs')
- RNAset.getSpecificLengths(len(RNAset.tseq,10000),len(RNAset.tseq,10000),'get only expected length (full length) RNAs')
pos = 0 allows the sequence to be found anywhere
pos > 0 returns a match only if the sequence is found at position 'pos'
If keyseq is a sequence preceded by '-' (e.g. '-GGACTTA' or '-5Prmr'), returns the sequences that do NOT contain the key sequence
Interesting uses:
- RNAset.getWithMatched(RNAset.tseq[:2],1,'get only RNAs with correct dinucleotide starting sequence')
- RNAset.getWithMatched(RNAset.SeqsUsed["AlignSeq"],0,'get only RNAs with originally specified align sequence (anywhere)')
- RNAset.getWithMatched(False,0,'get only RNAs with originally specified align sequence (anywhere)')
- RNAset.getWithMatched(False,False,'get only RNAs with originally specified align sequence at the expected position')
- RNAset.getWithMatched(RNAset.tseq[-4:],0,'get only RNAs with the last four encoded bases, at any position')
Returns the entire sequence
Think polymerase might have jumped to some new sequence, perhaps sloppily? Use this to search.
This is particulary useful for looking for strand jumping (to the nontemplate DNA or to something else)
Interesting uses:
- Get only RNAs with at least 7 bases matching any of the first 22 bases in the nontemplate strand:
- RNAset.getWithMatchWndw('AATTAATACGACTCACTATAGG',7,False,False,'')
- RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],7,False,False,'')
- Get only RNAs with at least 7 bases matching any part of the reverse complement of the first 22 bases in the nontemplate strand:
- RNAset.getWithMatchWndw('CCTATAGTGAGTCGTATTAATT',7,False,False,'')
- RNAset.getWithMatchWndw(revcompl('AATTAATACGACTCACTATAGG'),7,False,False,'')
- RNAset.getWithMatchWndw(revcompl(RNAset.SeqsUsed["NTmpl"][:22]),7,False,False,'')
Interesting uses:
- Look for primed, or loopback transcription:
- RNAset.getOccurrences('',0,7,False,'')
- RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],0,7,False,'')
- Get only RNAs with at least 7 bases matching any of the first 22 bases in the nontemplate strand:
- RNAset.getOccurrences('AATTAATACGACTCACTATAGG',0,7,False,'')
- RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],0,7,False,'')
- lmin - minimum position to plot (default = 1)
- lmax - maximum position to plot (default = max length)
- stackedcolor - True/False - display bars in base-specific color segments (default = True)
- fAddr - a string to add to the PDF file name (default = '')
- descr - a string description for the bar chart (default = '')
- width - relative width of bars (0-1) (default = 0.8)
- mrkr - comment to go with output (default = '')
- noHistory - leave history off of plt (default = False; history IS plotted)
- lmin - minimum position to plot (default = 1)
- lmax - maximum position to plot (default = max length)
- fAddr - a string to add to the PDF file name (default = '')
- descr - a string description for the bar chart (default = '')
- width - relative width of bars (0-1) (default = 0.8)
- mrkr - comment to go with output (default = '')
- noHistory - leave history off of plt (default = False; history IS plotted)
- inclCounts - put count at each position above the bar (default = True)
- RNAset.count() - returns the number of RNAs in the set
- RNAset.maxlength() - returns the lenght of the longest RNA in the set
- RNAset.tseq - returns the expected sequence
- RNAset.alignSeq - returns the stored internal alignment sequence
- RNAset.istemplate - returns True if template seqs, False if encoded RNA
- RNAset.filename - returns the original data set file name
- RNAset.setnotes - returns a string with notes about this data set
- RNAset.history - returns a string describing how this data set has been manipulated/filtered
- RNAset.expectedlength() - returns the lenght of the expected sequence
- RNAset.info() - returns information about the set
- RNAset.infoFull() - returns information about the set, incl adapter stats
- RNAset.exptDescr() - returns a string with count, max length, & reaction conditions
- RNAset.adapterStats(adpt5,adpt3,mrkr) - returns a string with info on adapters statistics
- RNAset.exptinfo - returns a Python object with experimental info, as entered originally
- RNAset.exptinfo.rxnconditions() - returns a string with information about the reaction conditions
- RNAset.SeqsUsed - returns a Python dictionary object with sequences, as entered originally
- RNAset.SeqsUsed['xxx'] - returns the dictionary element xxx from SeqsUsed
- RNAset.dData - returns another dictionary with any user defined elements
Filter and Manipulate ("get") Functions
These functions return a subset of the sequences or modified sequences, depending on specific criteria. They DO modify the sequences returned, so use with care.
RNAset.maskSeqs([[5,3],[12,2]],'') = returns a masked set
where in each sequence, bases at positions 5-8 and 12-13 are replaced by 'NNN'
for example, 'GGAGTAGCTACGT' is replaced by 'GGAGNNNCTACNN'
if maskList is False, it masks according to tseq
RNAset.maskSeqs(False,'') applies the masking present in RNAset.tseq
which might look like GGAGNNNACTTACNNAAGGACCA
any of the IUPAC standard base heterogeneity codes are allowed, but all are currently converted to N
RNAset.maskSeqs('GAGTA',3,'') = returns sequences masking 3 bases immediately following GAGTA
masked bases are replaced by 'N'
for example, 'GGAGTAGCTACGT' is replaced by 'GGAGTANNNACGT'
if the key sequence is not found, the unmasked sequence is returned
any of the IUPAC standard base heterogeneity codes are allowed, but all are currently converted to N
To find the most common sequences from 11 to 15, one might call: newset = getSubseqByPos(tmpset,11,15,'just past abortive')
Note that newset.tseq is now adjusted to contain only the new subsegment of the expected sequence. tmpWTSub = RNAset.getSubSeqByPos(12,20,'testing sub-seq by position') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubSeqByPos(12,20,'') testing sub-seq by position Extract the subsequence 12 to 20 from all sequences. >> 21492 of 21736 (98.9%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubSeqByPos(12,20,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s GTCGACGCN 21492 ....!.... 4627 21.5% GTCGACGCC 3666 17.1% GTCGACGCT 3154 14.7% GTCGACGCA 2868 13.3% GTCGACGCG 1209 5.6% GTCGACGC 775 3.6% GTCGAC 605 2.8% GTC 510 2.4% GTCGACG 467 2.2% GTCGACC
Finds sequences around a key sequence tmpWTSub = RNAset.getSubseqBySeq('ACGTCGACG',6,4,'testing sub-seq by key sequence') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'') testing sub-seq by key sequence Return only seqs containing ACGTCGACG, 6 before and 4 after. >> 21736 of 21736 (100.0%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s TACGTACGTC 21736 ....!....1 16541 76.1% TACGTACGTC 5042 23.2% * Primer Dimer *
Extracts a subsegment a fixed distance away from a found sequence
For example, getSubseqByRelativeSeq('TGACCA', 8, 13, '')
|------>|--->|
AGATGACCAGGACGACTAGACAAACTTGAACGCCTATACG
returns ACGACT
tmpWTSub = RNAset.getSubseqByRelativeSeq('ACGTCGACG',-10,-6,'testing sub-seq relative to key sequence') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'') testing sub-seq by key sequence Return only seqs containing ACGTCGACG, 6 before and 4 after. >> 21736 of 21736 (100.0%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s TACGTACGTC 21736 ....!....1 16541 76.1% TACGTACGTC 5042 23.2% * Primer Dimer *
More generally, this also corrects for any slippage or skipping that might occur internally before the key sequence. In that way, the 3' ends are more appropriately aligned to visually detect n-1 and n+1 products.
WARNING: please be aware that unlike, most functions, this ALTERS the returned sequence, but only by introducing white space at the 5' end. alignedSet = RNAset.getAlignedToSeq('ACGTCGACG','align to the first occurrence of GGACCCT') alignedSet..printSampleSeqs(10) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getAlignedToSeq('ACGTCGACG','') align to the sequence 'ACGTCGACG' GGNNNNNNNNTACGTCGACGCATTTA Align to sequence ACGTCGACG from all sequences. >> 15841 of 244036 (6.5%) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getAlignedToSeq('ACGTCGACG','').printSampleSeqs(10) GGNNNNNNNNTACGTCGACGCATTTA ....'....|....'....|....'....|....'....|....'....|....'. GGAATATGATTACGTCGACGCATTTACGACGTAATCA GGTACAACATTACGTCGACGCATT GGTCTTTGGATACGTCGACGCATT GTTCGGCAGTTAGATACGTCGACGCATTTTC GGCGATACTGTACGTCGACGCATT GGCACGTTGATACGTCGACGCATT GGAAATAACGTACGTCGACGCATTTAC TTCAGGGATAAAGGATACGTCGACGCATTTAG GGTATGTAAATACGTCGACGCATTTACATACGTCGACGCATT GGGTCATAAATACGTCGACGCGGCGCAnalysis Functions
The above functions will often report back on their successes, but for real analysis of sequence data, use the following. In general, don't use these functions to manipulate sequences, only analyze. Therefore a common usage would look like:
RNAset.RNAkeyseqPosAnal('GAAGGA',"report back on positions of GAAGGA");
RNAset remains unchanged.
Note that this reports on only the first occurence in each sequence tmpset.RNAkeyseqPosAnal('GTCGACG', '') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').RNAkeyseqPosAnal(GTCGACG, '') Look for key seq GTCGACG Position found distribution 0 0.0% 106 10 0.1% 168 20 0.1% 144 30 0.0% 76 1 0.0% 92 11 0.1% 139 21 0.0% 97 31 0.0% 67 2 0.1% 129 12 0.3% 615 22 0.0% 72 32 0.0% 85 3 0.0% 71 13 4.5% 10985 23 0.0% 86 33 0.0% 80 4 0.1% 173 14 0.2% 468 24 0.0% 117 34 0.0% 35 5 0.1% 318 15 0.1% 215 25 0.0% 87 35 0.0% 14 6 0.1% 350 16 0.0% 120 26 0.0% 93 36 0.0% 29 7 0.1% 333 17 0.0% 101 27 0.0% 102 37 0.0% 8 8 0.1% 223 18 0.0% 105 28 0.0% 117 9 0.1% 192 19 0.0% 111 29 0.0% 87
Given that there are 16 different dinucleotides possible, for totally random behavior, one would expect 100\16=6.25% occurrence of each dinucleotide step. For well-behaved transcription, one would expect 100% for the encoded step and 0% for everything else.
Expected steps are indicated by a small superscript 'o'.
Note that this analysis is sensitive to frame-shifted or completely 'bad' sequences in the mix. They will make the statistics at each position muddy. So pre-processing your data set can be helpful, BUT be careful that your processing doesn't skew your results.
'Count' is the number of RNAs that terminate at the given length (and so contributed to the statistics in that row). For gamma-GTP labeling, this is also proportional to expected gel intensity. But we typically use alpha-ATP labeling, with longer transcripts incorporating more radioactivity. 'Gel Int' tried to adjust for that and includes the numbers of A's in each transcript. 'Gel Int' is then reported as a percentage of the total.
'%off' is a number widely used in analyzing abortives. It is the percentage of RNAs that have made it to position n, and have then fallen off. Or if you think about the competition between falling off and adding the next nucleotide, it is the percentage that fall off. tmpset.termDiNucAnal('test') Percentage RNA’s of length N TERMINATED w/ indicated dinucleotide step N AA CA GA TA AC CC GC TC AG CG GG TG AT CT GT TT Count Gel Int (%off) 2 GG 0 0 0 0 0 0 0 0 0 0 100° 0 0 0 0 0 93 0.00 ( 0.1%) 3 GN 0 0 6 0 0 0 69 0 0 0 1 0 0 0 24 0 4770 0.13 ( 3.4%) 4 NN 1 6 2 7 4 24 7 18 0 2 0 3 2 15 2 7 15457 1.58 ( 11.3%) 5 NN 3 3 2 1 15 20 8 28 0 1 1 0 3 9 3 3 45969 13.22 ( 37.9%) 6 NN 2 5 2 3 11 15 7 18 0 3 1 3 5 10 3 11 25180 10.65 ( 33.5%) 7 NN 2 5 3 5 8 12 6 17 1 3 1 3 5 10 4 14 9139 4.95 ( 18.2%) 8 NN 3 5 3 4 10 10 6 18 1 3 2 4 5 8 4 14 3833 2.56 ( 9.4%) 9 NN 3 5 3 5 12 10 6 19 1 3 2 5 4 7 3 12 2674 2.06 ( 7.2%) 10 NN 4 5 3 4 11 11 7 17 1 3 2 4 4 6 4 14 2278 2.08 ( 6.6%) 11 NT 1 2 1 1 3 7 2 5 0 1 0 1 10 24 5 36 4273 4.77 ( 13.3%) 12 TA 1 3 0 56° 5 4 1 12 0 1 0 2 2 5 1 6 2418 3.24 ( 8.7%) 13 AC 1 1 0 2 83° 3 1 1 0 3 0 0 1 1 0 1 3060 4.10 ( 12.0%) 14 CG 1 6 0 1 3 11 1 1 0 71° 0 0 1 2 1 1 3018 4.06 ( 13.5%) 15 GT 1 2 3 1 4 5 16 6 1 6 2 1 1 1 48° 2 1099 1.56 ( 5.7%) 16 TC 0 1 0 1 3 11 1 75° 0 1 0 1 0 0 2 2 1862 2.69 ( 10.2%) 17 CG 1 8 3 1 4 29 2 8 1 32° 0 1 1 3 1 5 686 1.15 ( 4.2%) 18 GA 1 2 57° 3 6 7 8 3 1 2 1 0 1 1 3 4 828 1.52 ( 5.3%) 19 AC 1 1 1 1 82° 5 2 1 0 2 0 0 2 0 0 2 1608 2.99 ( 10.8%) 20 CG 0 6 1 1 6 12 5 1 0 63° 0 0 1 2 0 2 1107 2.15 ( 8.3%) 21 GC 1 5 2 2 6 5 70° 1 0 2 1 0 0 0 3 1 932 1.90 ( 7.6%) 22 CA 1 69° 0 2 2 13 1 1 0 3 0 0 2 4 0 1 1016 2.40 ( 9.0%) 23 AT 3 4 0 2 13 5 2 3 2 1 0 1 49° 1 1 12 532 1.27 ( 5.2%) 24 TT 1 1 0 3 2 1 0 10 0 1 0 2 1 1 0 76° 1428 3.47 ( 14.7%) 25 TT 0 1 0 10 1 2 0 12 0 0 0 1 1 1 0 70° 1260 3.11 ( 15.2%) 26 TA 0 5 0 85° 2 1 0 1 1 0 0 1 0 0 0 3 2540 7.23 ( 36.2%) 27 A 11 0 1 3 63 1 1 1 15 0 0 0 3 0 0 0 1766 5.19 ( 39.4%) 28 3 6 19 1 8 24 10 2 2 16 2 1 1 2 2 1 766 2.35 ( 28.2%) 29 6 6 17 7 10 19 6 3 1 12 0 1 3 4 4 3 265 0.83 ( 13.6%) 30 3 7 2 3 10 22 5 11 3 9 1 1 3 4 7 7 149 0.47 ( 8.8%) 31 3 6 1 10 7 26 5 15 1 4 1 0 6 1 9 5 144 0.47 ( 9.4%) 32 4 6 2 7 5 40 2 13 2 4 1 1 2 3 2 6 170 0.61 ( 12.2%) 33 4 7 7 8 12 19 7 11 3 3 2 1 4 4 3 4 118 0.43 ( 9.6%)
This version takes both an RNAset as input and also a similar set (RefSet) derived from sequencing of the DNA template used in that experiment. The idea is that for randomized regions of a template we might want to achieve 25% of each nucleotide at each position, but we know that the oligo synthesis does not always provide that. If the template, instead of having 25/25/25/25 at a specific position has 35/20/25/20, then if polymerase shows no bias, the resulting transcripts should show 35/20/25/20. This function corrects for this by calculating a " Z-Score " comparing the two populations. A positive Z-score reflects steps that occur in the sequence more often than expected from the template. A negative Z-score reflects steps that occur less frequently than expected from the template.
Statistically, the "null hypothesis" is that the transcribed RNA correctly reflects the template, in other words, the probability of abortively dissociating at a particular position is independent of the sequence of the terminal RNA dinucleotide step. Therefore a large positive value indicates a step that leads to more abortive dissociation and a negative value reflects a step that has reduced abortive dissociation. tmpset.termDiNucAnalScore(Refset,'test') Results here
Given that there are 16 different dinucleotides possible, for totally random behavior, one would expect 100\16=6.25% occurrence of each dinucleotide step. For well-behaved transcription, one would expect 100% for the encoded step and 0% for everything else.
Expected steps are indicated by a small superscript 'o'.
Note that this analysis is sensitive to frame-shifted or completely 'bad' sequences in the mix. They will make the statistics at each position muddy. So pre-processing your data set can be helpful, BUT be careful that your processing doesn't skew your results.
'Count' shows the number of RNAs analyzed and decreases progressively as shorter RNAs are left out of the analysis. tmpWTset.internalDiNucAnal('test') Percentage of INTERNAL dinucleotide steps at each position in the RNAs N AA CA GA TA AC CC GC TC AG CG GG TG AT CT GT TT Count (%Tot) 2 GG 0 0 0 0 0 0 0 0 0 0 100° 0 0 0 0 0 141453 ( 99.9%) 3 GN 0 0 28 0 0 0 32 0 0 0 21 0 0 0 19 0 136683 ( 96.6%) 4 NN 7 10 9 6 11 4 4 5 5 6 6 3 7 10 4 3 121226 ( 85.6%) 5 NN 9 5 8 5 11 4 4 6 6 4 6 4 12 6 6 4 75257 ( 53.2%) 6 NN 8 6 9 7 9 4 4 5 6 5 5 6 9 6 6 6 50077 ( 35.4%) 7 NN 9 6 9 8 8 3 4 5 6 5 4 6 9 6 6 6 40938 ( 28.9%) 8 NN 10 5 8 8 8 3 4 5 6 5 4 6 8 6 6 7 37105 ( 26.2%) 9 NN 11 6 8 9 7 3 4 5 6 4 4 6 8 7 6 7 34431 ( 24.3%) 10 NN 9 5 6 7 10 3 4 7 6 4 4 6 10 6 6 8 32153 ( 22.7%) 11 NT 1 1 1 3 2 0 0 1 1 1 1 1 25 20 20 22 27880 ( 19.7%) 12 TA 1 0 1 86° 3 0 0 1 1 1 0 1 1 1 1 1 25462 ( 18.0%) 13 AC 1 0 1 3 85° 0 1 1 1 3 0 0 1 1 1 1 22402 ( 15.8%) 14 CG 1 1 1 2 3 1 1 1 1 84° 0 0 1 1 3 1 19384 ( 13.7%) 15 GT 1 1 1 1 2 0 1 3 0 3 0 0 1 1 84° 0 18285 ( 12.9%) 16 TC 0 1 1 1 2 0 1 84° 0 4 0 0 1 1 2 1 16423 ( 11.6%) 17 CG 0 1 3 1 1 0 1 3 1 85° 0 0 1 1 1 1 15737 ( 11.1%) 18 GA 0 1 85° 1 3 0 1 1 0 3 1 0 1 0 1 1 14909 ( 10.5%) 19 AC 0 1 2 1 85° 0 0 1 1 4 0 0 1 0 1 1 13301 ( 9.4%) 20 CG 0 1 1 1 3 1 3 1 1 85° 0 1 1 1 1 1 12194 ( 8.6%) 21 GC 0 3 1 1 1 0 84° 1 0 3 1 0 1 1 2 1 11262 ( 8.0%) 22 CA 0 83° 1 2 1 1 2 1 0 1 0 0 4 1 1 1 10246 ( 7.2%) 23 AT 1 3 1 1 1 0 1 1 1 1 1 0 84° 1 1 3 9714 ( 6.9%) 24 TT 1 1 1 1 1 0 1 1 1 1 1 1 3 1 1 85° 8286 ( 5.9%) 25 TT 1 1 1 3 1 0 0 3 1 1 1 1 1 0 1 83° 7026 ( 5.0%) 26 TA 1 1 1 76° 1 0 1 2 2 2 1 2 2 1 2 4 4486 ( 3.2%) 27 A 7 2 3 5 25 1 2 3 35 3 3 1 4 1 2 3 2720 ( 1.9%) 28 4 3 24 4 6 3 6 4 5 14 10 1 4 4 4 3 1954 ( 1.4%)
This version takes both an RNAset as input and also a similar set (RefSet) derived from sequencing of the DNA template used in that experiment. The idea is that for randomized regions of a template we might want to achieve 25% of each nucleotide at each position, but we know that the oligo synthesis does not always provide that. If the template, instead of having 25/25/25/25 at a specific position has 35/20/25/20, then if polymerase shows no bias, the resulting transcripts should show 35/20/25/20. This function corrects for this by calculating a " Z-Score " comparing the two populations. A positive Z-score reflects steps that occur in the sequence more often than expected from the template. A negative Z-score reflects steps that occur less frequently than expected from the template.
Statistically, the "null hypothesis" is that the transcribed RNA correctly reflects the template. There are two potential reasons this might not occur: 1) if the polymerase aborts more at some sequences (at some positions), then they will be underrepresented (at those positions) in longer transcripts. Therefore, and in contrast to .termDiNucAnalScore, a large negative value would be expected correlate with more abortive dissociation at that position and sequence. 2) if the polymerase misincorporates at a particular position and then continues to extend past the misincorporation, a positive value might indicate a preference for the base that is misincorporated (and a negative value might reflect the intended base?). tmpset.internalDiNucAnalScore(Refset,'test') Results here
Graphing/Plotting Functions
The default is to depict the last base by color (for B/W, pass 'stackedcolor':False)
pDict Dictionary can contains these optional definitions
Example: plotLengthBarChart({'lmin':0.1, 'fAddr':'_special', 'descr':'This is a test'})
plotLengthBarChart({})
>>< pDict Dictionary can contains these optional definitions Example: plotMisIncorpBarChart({'lmin':0.1, 'fAddr':'_special', 'descr':'This is a test'})
plotMisIncorpBarChart({})
>>< (note that importrawdataset and importRNAdataset are outdated (legacy) versions of this)
einfo2 = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA')
Expt2 = Seqsetup('MG_S9_L001_R1_001.fastq.gz', 'GGATCCCGACTGGCGAGAGCCAGGTAACGAATGGATCC',
'TCAACT', 'TGGAA', einfo2, 'MG Aptamer (Encoded toehold CCACTCCTCA)', False,
{ 'Tmplt':'GGATCCATTCGTTACCTGGCTCTCGCCAGTCGGGATCCTGAGGAGTGG',
'T2a':'AGTGAGTCGTATTAATTTC',
'NTmpl':'GAAATTAATACGACTCACTATTCCTAGCCGACTGGCGAGAGCCAGGTAACGAATGGATCC',
'5Prmr': 'GTTCAGAGTTCTACAGTCCGACGATCTCAACT',
'3Prmr':'ATGGAATTCTCGGGTGCCAAGG',
'AlignSeq': 'ACTGGCGAGAGCCAGGTAAC',
'Toehld':'CCACTCCTCA'} )
NewSet = Expt2.importdataset()
>>importrawdataset('MG_S9_L001_R1_001.fastq.gz')
================ WT Enz, randomized IT +3 to +10 ==================
ITWT_S1_L001_R1_001_Aug.fastq.gz, Trgt=GGNNNNNNNNTACGTCGACGCATTTA (26mer)
8/8/16 Aruni [Enz] = 0.50 uM, [DNA] = 2.00 uM, for 5.0 min at T=37.0 C
>337631 << Imported
Note that an earlier version of this, trim_adaptors, has been deprecated.
Note also that this does not convert T's to U's (so think of RNA as having T!)
einfo = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA')
Expt1 = Seqsetup('ITWT_S1_L001_R1_001_Aug.fastq.gz', 'GGNNNNNNNNTACGTCGACGCATTTA',
'TAATCA', 'TGGAA', einfo, 'WT Enz, randomized IT +3 to +10', False,
'AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATCTAATCAGGNNNNNNNNUACGUCGACGCAUUUAATGGAATTCTCGGGTGCCAAGG' )
rawset = Expt1.importdataset()
trimmedSet = rawset.trimAdaptors(Expt1.adptr5,Expt1.adptr3)
>>importrawdataset('ITWT_S1_L001_R1_001_Aug.fastq.gz')
Many functions above use a parameter called keyseq. At its simplest, it is a DNA or RNA sequence, passed as 'AGCGGAG' that is used for searching sequences for matches. But it can do much more. Specifying False in place of this parameter replaces it with the default value:
the value of SeqsUsed{"AlignSeq"} defined in the initial experimental setup. You can also pass along the value of any
SeqsUsed variable set up for the data set and it will use the sequence that corresponds to that variable. ======== Note that for most of the "get" functions, which return only a subset of the data passed, you can
access the data that was NOT "gotten" by immediately accessing the variable config.dumpedSet. For example,
in the following
NewSet will contain sequences containing the match There is a set of . Do not use them as they are no longer supported. More generally, this also corrects for any slippage or skipping that might occur internally before the
key sequence. In that way, the 3' ends are more appropriately aligned to visually detect n-1 and n+1 products.
The above functions will often report back on their successes, but for real analysis of sequence
data, use the following. In general, don't use these functions to manipulate sequences, only analyze.
Therefore a common usage would look like: RNAset remains unchanged. A strength of this tool is that you can easily run the same analysis on a number of sequence data sets.
An easy way to do this is by defining a list (array) of sequence identifiers. So instead of calling Seqsetup
once for one variable, setup a dictionary collection of experiments.
Uses self.tseq as the standard - any other base is deemed misincorporated
WARNING: a frameshift in a sequence will show almost everything downstream as misincorporated
Utility Functions
The input is a SeqSetup variable, that has within it the name of the file, etc
This can read either .fastq or .fastq.gz files
If an adaptor is passed as None, it uses an adaptor stored with the data set
If an adaptor is passed as '', it does not look for or require that adaptor
If an adaptor is required and is not found in a sequence, it throws out that sequence
If sequence was tagged as 'isTemplate' it returns the reverse complement after trimming
A note about keyseq
Wonky Stuff
The following assumes that RNAset is a NucleicSet variable
Python Dictionaries
NewSet = RNAset.getWithMatched('AGGCT',0,")
BadSet = config.dumpedSet
BadSet will contain sequences that do NOT meet the criteria
Be careful in nested calls. config.dumpedSet will only refer to the LAST function in the nest.
General Utilities
WARNING: these sequences have no statistical significance. Use this only for simple testing, not analysis.
WTset = Expts["ITWT_Aug"].importrawdataset()
WTset.printSampleSeqs(5)
>><
Advanced functions
DEPRECATED (outdated) functions
Analysis Functions
RNAset.RNAkeyseqPosAnal('GAAGGA',"report back on positions of GAAGGA");
Advanced topics
UserSq = {'Tmplt': 'TAAATCTTCACCTCTACTGCTTCCTATAGTGAGTCGTATTAATT',
'NTmpl': 'AATTAATACGACTCACTATAGG',
'5Prmr': 'GTTCAGAGTTCTACAGTCCGACGATCTAATCA',
'3Prmr': 'ATGGAATTCTCGGGTGCCAAGG',
'AlignSeq': 'GGAAGCAG',
'Index1': ''},
einfo5 = Exptinfo('07/02/19', 'Yasaman',2.11, 2, 240, 37,'','AATTAATACGACTCACTATAGG')
Expts = {} # define an initially empty dictionary
Expts['U9'] = Seqsetup('U9_S1_L001_R1_001.fastq.gz', 'GGAAGCAGTAGAGGTGAAGATTTA',
'TAATCA', 'TGGAA', einfo5,
'Pseudo U in stem-loop region, Pseudo U at position +9,Transcriptopn with UTP', False,
UserSq,
{'Keywords': 'PseudoU, UTP',
'Description': 'psU in stem-loop +9, UTP',
'QCode': 'U9',
'PF': 3.14159,
'Run Date': '07/02/2019'}
)
Expts['U7'] = Seqsetup('U7_S1_L001_R1_001.fastq.gz', 'GGAAGCAGTAGAGGTGAAGATTTA',
'TAATCA', 'TGGAA', einfo5,
'Pseudo U in stem-loop region, Pseudo U at position +9,Transcriptopn with pseudoUTP', False,
UserSq,
{'Keywords': 'PseudoU, UTP',
'Description': 'psU in stem-loop +9, pseudoUTP',
'QCode': 'U9pse',
'PF': 3.14159,
'Run Date': '07/02/2019'}
)
for myData in ['U9','U7']:
StartCaptureToFile()
Dset = Expts[myData].import_dataset().trimAdaptors(None, None).toRNASet()
Rset.getWithMatched([11,11],6,'Strip 5\' hetero seqs').endAnalysis(10,5, '')
Rset.printMostCommon(0.1,"5% and higher","")
WriteCaptureToFile(Rset.dData['QCode'])
exit()