This documents a set of tools, written for use in Python and using extensively the tools from the BioPython suite. It is intended not for genomic studies, but rather, for characterizing relatively short complete or RACE sequences that are expected to be based on an "expected" sequence but that nevertheless have significant sequence and/or length heterogeneity. We use the tools to study in vitro transcribed RNAs.
(another tool you may find useful is "Oligo Design")

Author: Craig Martin
Documentation last updated: Mar 09, 2020 (v5.3)

Introduction

This document describes a suite of Python tools for analysis of in vitro RNA-Seq data (not intended for genomic analyses). The basic usage follows an object oriented model. You start by defining a variable to contain information about a specific experimental data set.

myExptSetUp = Seqsetup(
)

Click

Parameters above, in order (noting how you can reference each in programming):

RNAset.filename - Illumina file name: fastq format (gzipped, or not)
RNAset.tseq - Expected (encoded) sequence (can be in DNA or RNA format)
RNAset.adptr5 - 3' end of the 5' adapter (default used in trimming; can be overridden at trimming)
RNAset.adptr3 - 5' end of the 3' adapter (default used in trimming; can be overridden at trimming)
RNAset.exptinfo - (einfo) special variable - see below - contains info on the transcription experiment
RNAset.pnotes - one line description of the experiment - what's it about?
RNAset.istemplate - True/False - is this a DNA sequence (typical use: TRUE if seq of template DNA)?
RNAset.SeqsUsed - dictionary variable - see below - contains sequences relative to the experiment
RNAset.dData - dictionary variable - see below - contains any other info to include

The default location for sequence input files is a directory called 'Data' one level up from the Python source code.
The default for output is in a directory called 'Output" one level up. These locations can be changed as follows:

config.Options["InputDir"] = "../Data"
config.Options["OutputDir"] = "../Output"

Note that here and below, all sequences should be written 5' to 3'

The following variables might be defined once (or twice, or three times) and then used in the definition of multiple sequencing data sets by passing the parameter as shown above.

einfo is a parameter set containing information on the experimental details. It can be setup using the following function:

einfo = Exptinfo(
)

SeqsUsed is a parameter set containing info about the DNA constructs used. All parameters are optional. You can create your own. It can be setup using the following syntax:

SeqsUsed = {
}

AlignSeq is used as a default by some functions for aligning sequences that might have been "frameshifted." So for example, if transcription started at +2, but you are interested in the 3' end of the transcript, that end will be shifted left by one. Referencing a sequence in the middle of the transcript, one can align all sequences by their common internal alignment sequences - now the 3' ends line up regardless of initiation variability. The value passed here is the default and typically can be overridden.

Routines that use this as a default (set keyseq to False): endAnalyses, getAlignedToSeq, getSubseqBySeq, getSubseqByRelativeSeq, getWithMatched, getWithMatchWndw, RNAkeyseqPosAnal,

In your own programming, you can access these as: newvar = RNAset.SeqsUsed["Ntmpl"]

dData is a parameter set containing other information. All parameters are optional. It can be setup using the following syntax:

dData = {
}

In your own programming, you can access these as: newvar = RNAset.dData["Run Date"]

Click

After defining your new variable myExptSetUp using Seqsetup, you then use yExptSetUp to load (import) sequence data into a new object DNAset (you can have multiple objects for different sequences; this is the object oriented approach). The new variable (object) will then include both the raw sequence data and all of the information that was provided in the definition of the experimental data set.

DNAset = myExptSetUp.import_dataset();

places all of the Illumina sequences (in DNA format by default) into a new (object) variable called DNAset

DNAsetTrimmed = DNAset.trimAdaptors(None, None);

returns a new (object) variable containing sequences with the 5' and 3' adapters removed (trimmed). The two parameters specify the 5' and 3' adapter sequences used for trimming (specifying None or False for each says to use the sequences in the above definition of 'myExptSetUp'). Note that primer dimers (5' and 3' adapters directly ligated, with no intervening DNA) and sequences that do not contain both 5' and 3' adapter sequences are NOT returned - the console will report on these statistics.

Note that the sequences are still DNA format at this point.

RNAset = DNAsetTrimmed.toRNASet();

converts the trimmed DNA sequences into RNA (replaces T by U, and flags the set as RNA)

Finally, all of the above can be condensed into one line by chaining the objects together:

RNAset = myExptSetUp.import_dataset().trimAdaptors(None, None).toRNASet();

At this point, RNAset includes all of the RNA sequences, trimmed to remove adapter sequences. It also includes all of the information that was specified in the initial definition of 'myExptSetUp. This approach illustrates what you will be doing next.

For example, to filter the data, returning only sequences of length 10 bases or longer:

RNAsetNew = RNAset.getSpecificLengths(10,10000,'remove abortives');

Click

Typical usage involves first setting up an "experiment" by calling Seqsetup. This identifies the .fasta or .fastq file containing the sequences. It also sets various parameters, including the 5' and 3' adapter sequences and the expected RNA(DNA) sequence to be found between those adapters. Data files are expected to be found in the Data folder, one level above the code.

After defining the experiment with Seqsetup, the data are loaded into a NucleicSet object using the .import_dataset command below (this command is an object property of the Seqsetup variable and so is called as
newNucleicSet = SeqSet.import_dataset().

The generated NucleicSet variable can then be further processed using any of the 'getXxxx" commands below. These operate on the NucleicSet and return a new NucleicSet. Unless the primary data have already been processed, the first call is typically something like RNAset = newNucleicSet.trimAdaptors(None,None). For this function, the first two parameters are the 5' and 3' adaptor sequences to use in trimming. Passing None for each of these defaults to using the adaptors defined in the Seqsetup step.

The trimmed data can now be processed to, for example, extract only post-abortive sequences using a call like
postAbrtvSet = newNucleicSet.getSpecificLengths(9,2000,'post-abortives')
Note also that many other functions implicitely filter for minimum lengths. For example, .getPhaseShifted and .getPrimedExt select for sequences containing key sequences at specific or minimal length positions.

Many of the .getXXX functions also report analyses of the processing. In addition, there are Analysis functions that are solely for analysis. These are typically called as
RNAset.getGel('A','label RNA with alpha-ATP')

Note that most functions have a comment variable. This is simply for your use. Anything entered there will be printed above the results of the function (if this variable is set to "quiet", output will be suppressed.

Finally, output can be sent to a file by bracketing commands with StartCaptureToFile() and WriteCaptureToFile('_extra'). That file will be created in the Output folder, one level above the code.

What is ?

In the descriptors below, RNAset refers to a python object (a variable) that contains a set of sequencing data. This class of object contains a set of DNA/RNA sequences, but also contains a variety of information on those data. The 'get' functions below operate on one RNAset and return a new one

For example: newSet = RNAset.getSpecificLengths(8,10,'return 8mers, 9mers, and 10mers only')

Specifically,
xx.filename is the name of the file from which these data were read
xx.tseq = the expected RNA transcript sequence
xx.SeqsUsed is a python dictionary containing (5' to 3') sequences used in the experiment. Examples include:

xx.SeqsUsed['Tmplt'] - sequence of the DNA template strand (remember: 5' to 3')
xx.SeqsUsed['NTmpl'] - sequence of the DNA nontemplate strand
xx.SeqsUsed['5Prmr'] - the entire sequence of the 5' adaptor (not just the barcode)
xx.SeqsUsed['3Prmr'] - the entire sequence of the 3' adaptor
xx.SeqsUsed['AlignSeq'] - an internal sequence used to align sequences, typically a subset of tseq
xx.SeqsUsed['Toehld'] - the inverse complement of the template toehold if present (in other words, the expected RNA sequence, were the toehold to be transcribed)
xx.SeqsUsed['T2a'] - in toehold synthesis, the upstream (promoter) template strand (remember: 5' to 3')
xx.SeqsUsed['T2b'] - in toehold synthesis, the transcribed template strand (remember: 5' to 3')
xx.SeqsUsed['Target'] - in the 'jump" experiment, the target template strand (remember: 5' to 3')

xx.dData is a python dictionary containing misc info

For the :


    class NucleicSet(object):
    def __init__(self, dData, seqreclist, setnotes, history, tseq, exptinfo, istemplate, SeqsUsed, filename, xtra):
        self.dData = dData             # Dictionary with all sorts of data (currently unused, but in the future...)
        self.seqreclist = seqreclist   # list of SeqRecord objects, one for each sequence
        self.setnotes = setnotes       # string - notes about this set
        self.history = history         # the history of this data  (aka pnotes)
        self.tseq = tseq               # string - expected sequence
        self.exptinfo = exptinfo       # Exptinfo object - experimental details
        self.istemplate = istemplate   # boolean - is template DNA, not RNA
        self.SeqsUsed = SeqsUsed       # Dictionary of DNA (5' to 3') used in expt - 'fullseq' is full sequence, with 5' and 3' primers
        self.filename = filename       # original source file name (if any)
        self.xtra = xtra               # variable of class Xtra, for holding a variety of things analytical

Quick useful data:

xx.count() - the number of sequences in this set
xx.maxlength() - the length of the longest RNA in this set

Filter ("get") Functions

These functions return a subset of the sequences, depending on specific criteria. They do not modify the sequences returned, they just remove some.

get only specific lengths of RNA. For example, to look at transcripts past the abortive phase you might call: getSpecificLengths(tmpWTset, 9, 1000, 'post-abortives')

Interesting uses:

RNAset.getSpecificLengths(2,8,'get only short (abortive) RNAs')
RNAset.getSpecificLengths(9,10000,'remove short (abortive) RNAs')
RNAset.getSpecificLengths(len(RNAset.tseq,10000),len(RNAset.tseq,10000),'get only expected length (full length) RNAs')

tmptmp = RNAset.getSpecificLengths(12,20,'testing specific lengths') tmptmp.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSpecificLengths(12,20,'') testing specific lengths Return only seqs of length 12 to 20. >> 9473 of 21736 (43.6%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSpecificLengths(12,20,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA 9473 ....!....1....!....2 1364 14.4% GGACTACGTACGTCGACGCC 1200 12.7% GGACTACGTACGTCGACGC 762 8.0% GGACTACGTACGTCGAC 749 7.9% GGACTACGTACGTCGACGCT 598 6.3% GGACTACGTACGTC 504 5.3% GGACTACGTACGTCGACG 464 4.9% GGACTACGTACGTCGACC 315 3.3% GGACTACGTACGTCGACGCA 274 2.9% GGACTACGTACGTCGACGCG 234 2.5% GGACTACGTACGTCGACGT 232 2.4% GGACTACGTACGTCGA 223 2.4% GGACTACGTACG 192 2.0% GGACTACGTACGTCGATC

Tests each sequence for the occurence of the key sequence; when matched, returns the entire sequence
pos = 0 allows the sequence to be found anywhere
pos > 0 returns a match only if the sequence is found at position 'pos'

If keyseq is a sequence preceded by '-' (e.g. '-GGACTTA' or '-5Prmr'), returns the sequences that do NOT contain the key sequence

Interesting uses:

RNAset.getWithMatched(RNAset.tseq[:2],1,'get only RNAs with correct dinucleotide starting sequence')
RNAset.getWithMatched(RNAset.SeqsUsed["AlignSeq"],0,'get only RNAs with originally specified align sequence (anywhere)')
RNAset.getWithMatched(False,0,'get only RNAs with originally specified align sequence (anywhere)')
RNAset.getWithMatched(False,False,'get only RNAs with originally specified align sequence at the expected position')
RNAset.getWithMatched(RNAset.tseq[-4:],0,'get only RNAs with the last four encoded bases, at any position')

tmpWTset = RNAset.getWithMatched('TCGAC',12,'testing get with Matched specific') tmpWTset.printMostCommon(2.0,'Most common','testing print most common sequences') tmpWTset.printSampleSeqs(10) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'') testing get with Matched specific Filtering only sequences that contain TCGAC at position 12. >> 173 of 27608 (0.6%) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'').printMostCommon(2.0,'Most common','') testing print most common sequences Most common GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA 173 ....!....1....!.. 4 2.3% GACTACGTACGTCGACG >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'').printSampleSeqs(10) GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA ....'....|....'....|....'....|....'....|....' GACTACGTACGTCGACG GGACTAGTACGTCGACGCTCCAGCTGCGTCGAGTATTT GGACACGTACGTCGACGCTGTC GGACTACGACGTCGACGCGACAAGTGCC GACTACGTACGTCGACGCTCGCAGC GGACACGTACGTCGACGCCGCGAGTGC GGCTACGTACGTCGAC GGACTACTACGTCGACGCCGATTCTGCG GGACTACGTCGTCGACGCGGA GACTACGTACGTCGACGCTGCC

Tests each sequence for the occurence of any 'n' length subsequence within the key sequence
Returns the entire sequence
Think polymerase might have jumped to some new sequence, perhaps sloppily? Use this to search.
This is particulary useful for looking for strand jumping (to the nontemplate DNA or to something else)

Interesting uses:

Get only RNAs with at least 7 bases matching any of the first 22 bases in the nontemplate strand:

RNAset.getWithMatchWndw('AATTAATACGACTCACTATAGG',7,False,False,'')
RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],7,False,False,'')

Get only RNAs with at least 7 bases matching any part of the reverse complement of the first 22 bases in the nontemplate strand:

RNAset.getWithMatchWndw('CCTATAGTGAGTCGTATTAATT',7,False,False,'')
RNAset.getWithMatchWndw(revcompl('AATTAATACGACTCACTATAGG'),7,False,False,'')
RNAset.getWithMatchWndw(revcompl(RNAset.SeqsUsed["NTmpl"][:22]),7,False,False,'')

tmpWTset = RNAset.getWithMatchWndw('TCGTACCAGTAACCGAC',7,False,False,'testing get with Matched specific') tmpWTset.printMostCommon(2.0,'Most common','testing print most common sequences') tmpWTset.printSampleSeqs(10) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatchWndw('TCGTACCAGTAACCGAC',7,False,False,'') testing get with Matched specific Filtering only sequences that contain TCGAC at position 12. >> 173 of 27608 (0.6%) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatchWndw('TCGTACCAGTAACCGAC',7,False,False,'').printMostCommon(2.0,'Most common','') testing print most common sequences Most common GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA 173 ....!....1....!.. 4 2.3% GACTACGTACGTCGACG >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatchWndw('TCGTACCAGTAACCGAC',7,False,False,'').printSampleSeqs(10) GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA ....'....|....'....|....'....|....'....|....' GACTACGTACGTCGACG GGACTAGTACGTCGACGCTCCAGCTGCGTCGAGTATTT GGACACGTACGTCGACGCTGTC GGACTACGACGTCGACGCGACAAGTGCC GACTACGTACGTCGACGCTCGCAGC GGACACGTACGTCGACGCCGCGAGTGC GGCTACGTACGTCGAC GGACTACTACGTCGACGCCGATTCTGCG GGACTACGTCGTCGACGCGGA GACTACGTACGTCGACGCTGCC

This function looks for RNA products that might arise from either internal or trans priming by another RNA or DNA. One can input a sequence (keyseq) which one might have expected to have served as a template; the anaylsis will look both for RNAs with that sequence and for its reverse complement (the expectation from loopback transcription is that the post-priming RNA will be the inverse complement of some part of the original RNA sequence. For RNA priming on a DNA template, the sequence will be a repeat of the original sequence (not the inverse complement). This function scans and tries to find both kinds of events. The function uses a 'sliding window" approach (like getWithMatchedWndw), looking for sub-sgements of the key sequence: a smaller (nWindow) window might pick up false positives, a larger (nWindow) might miss something. This function now takes on the roles of earlier routines .getPrimedExt and .getRepeats

Interesting uses:

Look for primed, or loopback transcription:

RNAset.getOccurrences('',0,7,False,'')
RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],0,7,False,'')

Get only RNAs with at least 7 bases matching any of the first 22 bases in the nontemplate strand:

RNAset.getOccurrences('AATTAATACGACTCACTATAGG',0,7,False,'')
RNAset.getWithMatchWndw(RNAset.SeqsUsed["NTmpl"][:22],0,7,False,'')

Filter and Manipulate ("get") Functions

These functions return a subset of the sequences or modified sequences, depending on specific criteria. They DO modify the sequences returned, so use with care.

Sometimes we want to ignore certain regions of the sequence.

RNAset.maskSeqs([[5,3],[12,2]],'') = returns a masked set
where in each sequence, bases at positions 5-8 and 12-13 are replaced by 'NNN'

for example, 'GGAGTAGCTACGT' is replaced by 'GGAGNNNCTACNN'

if maskList is False, it masks according to tseq

RNAset.maskSeqs(False,'') applies the masking present in RNAset.tseq
which might look like GGAGNNNACTTACNNAAGGACCA

any of the IUPAC standard base heterogeneity codes are allowed, but all are currently converted to N

Sometimes we want to ignore certain regions of the sequence.

RNAset.maskSeqs('GAGTA',3,'') = returns sequences masking 3 bases immediately following GAGTA
masked bases are replaced by 'N'

for example, 'GGAGTAGCTACGT' is replaced by 'GGAGTANNNACGT'

if the key sequence is not found, the unmasked sequence is returned

any of the IUPAC standard base heterogeneity codes are allowed, but all are currently converted to N

Extracts a subsegment, based on position, from each sequence

To find the most common sequences from 11 to 15, one might call: newset = getSubseqByPos(tmpset,11,15,'just past abortive')

Note that newset.tseq is now adjusted to contain only the new subsegment of the expected sequence. tmpWTSub = RNAset.getSubSeqByPos(12,20,'testing sub-seq by position') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubSeqByPos(12,20,'') testing sub-seq by position Extract the subsequence 12 to 20 from all sequences. >> 21492 of 21736 (98.9%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubSeqByPos(12,20,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s GTCGACGCN 21492 ....!.... 4627 21.5% GTCGACGCC 3666 17.1% GTCGACGCT 3154 14.7% GTCGACGCA 2868 13.3% GTCGACGCG 1209 5.6% GTCGACGC 775 3.6% GTCGAC 605 2.8% GTC 510 2.4% GTCGACG 467 2.2% GTCGACC

Extracts a subsegment, based on sequence, from each sequence

Finds sequences around a key sequence tmpWTSub = RNAset.getSubseqBySeq('ACGTCGACG',6,4,'testing sub-seq by key sequence') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'') testing sub-seq by key sequence Return only seqs containing ACGTCGACG, 6 before and 4 after. >> 21736 of 21736 (100.0%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s TACGTACGTC 21736 ....!....1 16541 76.1% TACGTACGTC 5042 23.2% * Primer Dimer *

Extracts a subsegment a fixed distance away from a found sequence
For example, getSubseqByRelativeSeq('TGACCA', 8, 13, '')

|------>|--->|
AGATGACCAGGACGACTAGACAAACTTGAACGCCTATACG

returns ACGACT

tmpWTSub = RNAset.getSubseqByRelativeSeq('ACGTCGACG',-10,-6,'testing sub-seq relative to key sequence') tmpWTSub.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'') testing sub-seq by key sequence Return only seqs containing ACGTCGACG, 6 before and 4 after. >> 21736 of 21736 (100.0%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSubseqBySeq('ACGTCGACG',6,4,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s TACGTACGTC 21736 ....!....1 16541 76.1% TACGTACGTC 5042 23.2% * Primer Dimer *

Looks for sequences containing (both of) two key sequences, and then from those, extracts sub-sequences flanked by these two sequences (return includes flanking sequences also)

Same as getSubseqFlanked, but looks in tseq for NNN, then picks nBefore bases before and nAfter bases after to use as the flanking search sequences. For example, if tseq='GGAGCGGANNNNNNNCCTAAAGCGT', then a call of RNAset.getSubseqFlankedRandom(5,4,'') will call getSubseqFlanked('GCGGA', 'CCTA', '')

This function returns only sequences containing that key sequence, and it then adds varying amounts of white space to each sequence, in order that all of the (internal) key sequences line up. Why do this? If some transcripts start +1 or -1 relative to the expected start site, the subsequent sequences will be phase shifted, complicating comparisons. When one then looks at sequences aligned with this function, it becomes obvious which sequences started phase shifted.

More generally, this also corrects for any slippage or skipping that might occur internally before the key sequence. In that way, the 3' ends are more appropriately aligned to visually detect n-1 and n+1 products.

WARNING: please be aware that unlike, most functions, this ALTERS the returned sequence, but only by introducing white space at the 5' end.

alignedSet = RNAset.getAlignedToSeq('ACGTCGACG','align to the first occurrence of GGACCCT') alignedSet..printSampleSeqs(10) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getAlignedToSeq('ACGTCGACG','') align to the sequence 'ACGTCGACG' GGNNNNNNNNTACGTCGACGCATTTA Align to sequence ACGTCGACG from all sequences. >> 15841 of 244036 (6.5%) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getAlignedToSeq('ACGTCGACG','').printSampleSeqs(10) GGNNNNNNNNTACGTCGACGCATTTA ....'....|....'....|....'....|....'....|....'....|....'. GGAATATGATTACGTCGACGCATTTACGACGTAATCA GGTACAACATTACGTCGACGCATT GGTCTTTGGATACGTCGACGCATT GTTCGGCAGTTAGATACGTCGACGCATTTTC GGCGATACTGTACGTCGACGCATT GGCACGTTGATACGTCGACGCATT GGAAATAACGTACGTCGACGCATTTAC TTCAGGGATAAAGGATACGTCGACGCATTTAG GGTATGTAAATACGTCGACGCATTTACATACGTCGACGCATT GGGTCATAAATACGTCGACGCGGCGC

Analysis Functions

The above functions will often report back on their successes, but for real analysis of sequence data, use the following. In general, don't use these functions to manipulate sequences, only analyze. Therefore a common usage would look like:

RNAset.RNAkeyseqPosAnal('GAAGGA',"report back on positions of GAAGGA");

RNAset remains unchanged.

This function starts off expecting nothing. It takes a sliding window look at all sequences and then returns the most common occurences of those window-length segments, whereever they are in each sequence. Good for exploratory looks, but probably boring for well-behaved sequences, as it will return expected results. newSet = RNAset.MostCommonWindowAnal(8,0.5,'Looking for commonalities') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').MostCommonWindowAnal(8,0.8,'') Most common sub-sequences of length 8 - Looking for commonalities Count Prcnt GGNNNNNNNNTACGTCGACGCATTTA 35650 2.1% CGATCTAA ....!....1....!....2....!. 35527 2.1% GATCTAAT ....!....1....!....2....!. 34495 2.1% ACGATCTA ....!....1....!....2....!. 33855 2.0% GACGATCT ....!....1....!....2....!. 32193 1.9% ATCTAATC ....!....1....!....2....!. 31944 1.9% CGACGATC ....!....1....!....2....!. 31256 1.9% CCGACGAT ....!....1....!....2....!. 30898 1.8% TCTAATCA ....!....1....!....2....!. 30116 1.8% TCCGACGA ....!....1....!....2....!. 29940 1.8% GTCCGACG ....!....1....!....2....!. 29410 1.8% AGTCCGAC ....!....1....!....2....!. 27847 1.7% GTTCAGAG ....!....1....!....2....!. 27733 1.7% CAGTCCGA ....!....1....!....2....!. 27181 1.6% AGAGTTCT ....!....1....!....2....!. 27041 1.6% CAGAGTTC ....!....1....!....2....!. 26733 1.6% GAGTTCTA ....!....1....!....2....!. 26610 1.6% TTCAGAGT ....!....1....!....2....!. 26602 1.6% ACAGTCCG ....!....1....!....2....!. 26570 1.6% AGTTCTAC ....!....1....!....2....!. 26334 1.6% TCAGAGTT ....!....1....!....2....!. 26244 1.6% GTTCTACA ....!....1....!....2....!. 26105 1.6% TACAGTCC ....!....1....!....2....!. 25903 1.5% CTAATCAG ....!....1....!....2....!. 25736 1.5% CTACAGTC ....!....1....!....2....!. 25489 1.5% TCTACAGT ....!....1....!....2....!. 25439 1.5% TTCTACAG ....!....1....!....2....!. 22131 1.3% TAATCAGG ....!....1....!....2....!. 18675 1.1% TACGTCGA ....!....1>...!....2....!. 18093 1.1% ACGTCGAC ....!....1.>..!....2....!. 16240 1.0% CGTCGACG ....!....1..>.!....<....!. 14813 0.9% GTCGACGC ....!....1...>!....2....!. 13595 0.8% TCGACGCA ....!....1....>....2....!. 12437 0.7% CGACGCAT ....!....1....!>...2....!.

Returns the number of sequences of each length. Note that termDiNucAnal also provides this, and much more. RNAset.RNAlengthAnal(0.8,1.5,'Lengths') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').RNAlengthAnal(0.50,2.00) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').RNAlengthAnal(0.80,1.50) Length distributions 244036 Length Report counts of 0.8% or higher 48910 20.0% 5 28320 11.6% 6 18062 7.4% 4 12043 4.9% 7 8503 3.5% 3 6680 2.7% 11 6221 2.5% 8 5570 2.3% 14 5359 2.2% 13 5004 2.1% 26° 4935 2.0% 9 4899 2.0% 19 4829 2.0% 16 4813 2.0% 12 4366 1.8% 1 4305 1.8% 10 4305 1.8% 27 4292 1.8% 18 4269 1.7% 17 3945 1.6% 20 3913 1.6% 25 3859 1.6% 24 3838 1.6% 2 3736 1.5% 15 3712 1.5% 22 3655 1.5% 21 3148 1.3% 28 2994 1.2% 23 2864 1.2% 30 2600 1.1% 29 2214 0.9% 32 2200 0.9% 31 1996 0.8% 33 Return only seqs of length 5 to 5. > 48910 of 244036 (20.0%) 5mers only GGNNNNNNNNTACGTCGACGCATTTA 48910 ....! 6606 13.5% GGCTC 5117 10.5% GGACC 4253 8.7% GGATC 4127 8.4% GGCAC 2824 5.8% GGACT 2339 4.8% GGTCC 1614 3.3% GGGTC 1561 3.2% GGCGC 1519 3.1% GGAAC 1205 2.5% GGAGC 1040 2.1% GGCCC 790 1.6% GGGAA 738 1.5% GGGAC Return only seqs of length 6 to 6. > 28320 of 244036 (11.6%) 6mers only GGNNNNNNNNTACGTCGACGCATTTA 28320 ....!. 910 3.2% GGAACC 817 2.9% GGGATT 789 2.8% GGGATC 755 2.7% GGCAAC 743 2.6% GGTATC 620 2.2% GGGAAT 594 2.1% GGCATT 593 2.1% GGTACT 584 2.1% GGGACT 561 2.0% GGTATT 454 1.6% GGGACC 435 1.5% GGGCTC 432 1.5% GGGGTC Return only seqs of length 4 to 4. > 18062 of 244036 (7.4%) 4mers only GGNNNNNNNNTACGTCGACGCATTTA 18062 .... 3682 20.4% GGCC 2775 15.4% GGTC 2318 12.8% GGCT 1120 6.2% GGTT 1113 6.2% GGTA 1031 5.7% GGGC 932 5.2% GGCA 694 3.8% GGAC 394 2.2% GGTG 329 1.8% GGCG 271 1.5% GGGA Return only seqs of length 7 to 7. > 12043 of 244036 (4.9%) 7mers only GGNNNNNNNNTACGTCGACGCATTTA 12043 ....!.. 495 4.1% GTTCAGA Return only seqs of length 3 to 3. > 8503 of 244036 (3.5%) 3mers only GGNNNNNNNNTACGTCGACGCATTTA 8503 ... 3307 38.9% GGC 1130 13.3% GGT 928 10.9% GCC 300 3.5% GGA 221 2.6% ACC 210 2.5% CAC 167 2.0% CCC 158 1.9% GTT 141 1.7% CAA 134 1.6% TCA Return only seqs of length 11 to 11. > 6680 of 244036 (2.7%) 11mers only GGNNNNNNNNTACGTCGACGCATTTA 6680 ....!....1. 278 4.2% GTTCAGAGTTC

Returns the number of sequences of each length. Note that termDiNucAnal also provides this, and much more. This is simple and compact RNAset.RNAlengthAnalShort('test') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').RNAlengthAnalShort('test') Length analysis Length Distribution 0 0.0% 0 12 2.0% 4813 24 1.6% 3859 36 0.6% 1527 1 1.8% 4366 13 2.2% 5359 25 1.6% 3913 37 0.6% 1374 2 1.6% 3838 14 2.3% 5570 26° 2.1% 5004 38 0.5% 1227 3 3.5% 8503 15 1.5% 3736 27 1.8% 4305 39 0.4% 1086 4 7.4% 18062 16 2.0% 4829 28 1.3% 3148 40 0.4% 956 5 20.0% 48910 17 1.7% 4269 29 1.1% 2600 41 0.4% 920 6 11.6% 28320 18 1.8% 4292 30 1.2% 2864 42 0.3% 827 7 4.9% 12043 19 2.0% 4899 31 0.9% 2200 43 0.3% 796 8 2.5% 6221 20 1.6% 3945 32 0.9% 2214 44 0.3% 743 9 2.0% 4935 21 1.5% 3655 33 0.8% 1996 10 1.8% 4305 22 1.5% 3712 34 0.8% 1859 11 2.7% 6680 23 1.2% 2994 35 0.7% 1645

Returns the number of sequences of each length. Note that termDiNucAnal also provides this, and much more. st = RNAset.getGel('A','testing gel, labeling with alpha-ATP') st = RNAset.getGel(None, 'testing gel, full output') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getGel('A','') testing gel, labeling with alpha-ATP n alpha-A 45 1.19 44 1.21 43 1.25 42 1.28 41 1.40 40 1.40 39 1.56 38 1.71 37 1.86 36 2.01 35 2.11 34 2.33 33 2.42 32 2.59 31 2.47 30 3.06 29 2.70 28 3.14 27 4.07 26 4.58 25 3.37 24 3.18 23 2.40 22 2.80 21 2.60 20 2.68 19 3.10 18 2.68 17 2.48 16 2.54 15 1.88 14 2.49 13 2.32 12 2.03 11 2.41 10 1.39 9 1.36 8 1.48 7 2.26 6 3.95 5 4.58 4 0.87 3 0.37 2 0.32 1 0.14 >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getGel('None','') testing gel, full output n count nA nC nG nT %A %C %G %T 45 - 717 - 9166 8159 6423 8517 % 1.2 1.1 0.8 1.2 44 - 743 - 9296 8283 6508 8605 % 1.2 1.1 0.8 1.2 43 - 796 - 9570 8645 6895 9118 % 1.2 1.1 0.8 1.2 42 - 827 - 9846 8870 6886 9132 % 1.3 1.2 0.8 1.2 41 - 920 - 10729 9573 7600 9818 % 1.4 1.2 0.9 1.3 40 - 956 - 10779 9560 7625 10276 % 1.4 1.2 0.9 1.4 39 - 1086 - 11961 10420 8675 11298 % 1.6 1.4 1.0 1.5 38 - 1227 - 13140 11637 9508 12341 % 1.7 1.5 1.1 1.7 37 - 1374 - 14272 12656 10388 13522 % 1.9 1.6 1.2 1.8 36 - 1527 - 15416 13750 11298 14508 % 2.0 1.8 1.3 2.0 35 - 1645 - 16235 14396 11979 14965 % 2.1 1.9 1.4 2.0 34 - 1859 - 17863 15879 12980 16484 % 2.3 2.1 1.5 2.2 33 - 1996 - 18598 16645 13507 17118 % 2.4 2.2 1.6 2.3 32 - 2214 - 19882 17914 14472 18580 % 2.6 2.3 1.7 2.5 31 - 2200 - 18943 17441 13947 17869 % 2.5 2.3 1.6 2.4 30 - 2864 - 23493 21975 17421 23031 % 3.1 2.9 2.0 3.1 29 - 2600 - 20777 19513 15593 19517 % 2.7 2.5 1.8 2.7 28 - 3148 - 24122 22530 18776 22716 % 3.1 2.9 2.2 3.1 27 - 4305 - 31268 29236 25627 30104 % 4.1 3.8 3.0 4.1 26 - 5004 - 35150 31062 29569 34323 % 4.6 4.0 3.5 4.7 25 - 3913 - 25889 24432 22220 25284 % 3.4 3.2 2.6 3.4 24 - 3859 - 24406 23369 21388 23453 % 3.2 3.0 2.5 3.2 23 - 2994 - 18463 17696 15463 17240 % 2.4 2.3 1.8 2.4 22 - 3712 - 21537 20961 19164 20002 % 2.8 2.7 2.2 2.7 21 - 3655 - 19977 19575 18158 19045 % 2.6 2.5 2.1 2.6 20 - 3945 - 20565 19718 19026 19591 % 2.7 2.6 2.2 2.7 19 - 4899 - 23827 23355 22518 23381 % 3.1 3.0 2.6 3.2 18 - 4292 - 20562 18173 18400 20121 % 2.7 2.4 2.2 2.7 17 - 4269 - 19074 16809 17240 19450 % 2.5 2.2 2.0 2.7 16 - 4829 - 19488 18690 19254 19832 % 2.5 2.4 2.3 2.7 15 - 3736 - 14428 12748 14130 14734 % 1.9 1.7 1.7 2.0 14 - 5570 - 19124 17724 22574 18558 % 2.5 2.3 2.6 2.5 13 - 5359 - 17819 16079 19280 16489 % 2.3 2.1 2.3 2.2 12 - 4813 - 15587 11579 16505 14085 % 2.0 1.5 1.9 1.9 11 - 6680 - 18505 14097 21831 19047 % 2.4 1.8 2.6 2.6 10 - 4305 - 10697 9252 13200 9901 % 1.4 1.2 1.5 1.4 9 - 4935 - 10431 9800 14355 9829 % 1.4 1.3 1.7 1.3 8 - 6221 - 11339 10593 18103 9733 % 1.5 1.4 2.1 1.3 7 -12043 - 17360 17616 33840 15485 % 2.3 2.3 4.0 2.1 6 -28320 - 30377 36719 74436 28387 % 4.0 4.8 8.7 3.9 5 -48910 - 35154 68990 109017 31388 % 4.6 9.0 12.8 4.3 4 -18062 - 6659 18977 35168 11444 % 0.9 2.5 4.1 1.6 3 - 8503 - 2861 8148 11499 3001 % 0.4 1.1 1.3 0.4 2 - 3838 - 2478 2943 941 1314 % 0.3 0.4 0.1 0.2 1 - 4366 - 1071 2661 321 313 % 0.1 0.3 0.0 0.0

This is another function that can be interesting for analyzing frame shifts. If GTCGACG is expected to occur at position 18, you will likely a high count for position 13, but perhaps also a non-zero count of at 12 and 14.

Note that this reports on only the first occurence in each sequence tmpset.RNAkeyseqPosAnal('GTCGACG', '') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').RNAkeyseqPosAnal(GTCGACG, '') Look for key seq GTCGACG Position found distribution 0 0.0% 106 10 0.1% 168 20 0.1% 144 30 0.0% 76 1 0.0% 92 11 0.1% 139 21 0.0% 97 31 0.0% 67 2 0.1% 129 12 0.3% 615 22 0.0% 72 32 0.0% 85 3 0.0% 71 13 4.5% 10985 23 0.0% 86 33 0.0% 80 4 0.1% 173 14 0.2% 468 24 0.0% 117 34 0.0% 35 5 0.1% 318 15 0.1% 215 25 0.0% 87 35 0.0% 14 6 0.1% 350 16 0.0% 120 26 0.0% 93 36 0.0% 29 7 0.1% 333 17 0.0% 101 27 0.0% 102 37 0.0% 8 8 0.1% 223 18 0.0% 105 28 0.0% 117 9 0.1% 192 19 0.0% 111 29 0.0% 87

Useful for looking at sequence dependence of abortive transcripts, or the ends of any transcripts. Specifically, it looks at the occurence of the last two bases (dinucleotide) of each RNA, broken down by lengths of RNAs.

Given that there are 16 different dinucleotides possible, for totally random behavior, one would expect 100\16=6.25% occurrence of each dinucleotide step. For well-behaved transcription, one would expect 100% for the encoded step and 0% for everything else.

Expected steps are indicated by a small superscript 'o'.

Note that this analysis is sensitive to frame-shifted or completely 'bad' sequences in the mix. They will make the statistics at each position muddy. So pre-processing your data set can be helpful, BUT be careful that your processing doesn't skew your results.

'Count' is the number of RNAs that terminate at the given length (and so contributed to the statistics in that row). For gamma-GTP labeling, this is also proportional to expected gel intensity. But we typically use alpha-ATP labeling, with longer transcripts incorporating more radioactivity. 'Gel Int' tried to adjust for that and includes the numbers of A's in each transcript. 'Gel Int' is then reported as a percentage of the total.

'%off' is a number widely used in analyzing abortives. It is the percentage of RNAs that have made it to position n, and have then fallen off. Or if you think about the competition between falling off and adding the next nucleotide, it is the percentage that fall off. tmpset.termDiNucAnal('test') Percentage RNA’s of length N TERMINATED w/ indicated dinucleotide step N AA CA GA TA AC CC GC TC AG CG GG TG AT CT GT TT Count Gel Int (%off) 2 GG 0 0 0 0 0 0 0 0 0 0 100° 0 0 0 0 0 93 0.00 ( 0.1%) 3 GN 0 0 6 0 0 0 69 0 0 0 1 0 0 0 24 0 4770 0.13 ( 3.4%) 4 NN 1 6 2 7 4 24 7 18 0 2 0 3 2 15 2 7 15457 1.58 ( 11.3%) 5 NN 3 3 2 1 15 20 8 28 0 1 1 0 3 9 3 3 45969 13.22 ( 37.9%) 6 NN 2 5 2 3 11 15 7 18 0 3 1 3 5 10 3 11 25180 10.65 ( 33.5%) 7 NN 2 5 3 5 8 12 6 17 1 3 1 3 5 10 4 14 9139 4.95 ( 18.2%) 8 NN 3 5 3 4 10 10 6 18 1 3 2 4 5 8 4 14 3833 2.56 ( 9.4%) 9 NN 3 5 3 5 12 10 6 19 1 3 2 5 4 7 3 12 2674 2.06 ( 7.2%) 10 NN 4 5 3 4 11 11 7 17 1 3 2 4 4 6 4 14 2278 2.08 ( 6.6%) 11 NT 1 2 1 1 3 7 2 5 0 1 0 1 10 24 5 36 4273 4.77 ( 13.3%) 12 TA 1 3 0 56° 5 4 1 12 0 1 0 2 2 5 1 6 2418 3.24 ( 8.7%) 13 AC 1 1 0 2 83° 3 1 1 0 3 0 0 1 1 0 1 3060 4.10 ( 12.0%) 14 CG 1 6 0 1 3 11 1 1 0 71° 0 0 1 2 1 1 3018 4.06 ( 13.5%) 15 GT 1 2 3 1 4 5 16 6 1 6 2 1 1 1 48° 2 1099 1.56 ( 5.7%) 16 TC 0 1 0 1 3 11 1 75° 0 1 0 1 0 0 2 2 1862 2.69 ( 10.2%) 17 CG 1 8 3 1 4 29 2 8 1 32° 0 1 1 3 1 5 686 1.15 ( 4.2%) 18 GA 1 2 57° 3 6 7 8 3 1 2 1 0 1 1 3 4 828 1.52 ( 5.3%) 19 AC 1 1 1 1 82° 5 2 1 0 2 0 0 2 0 0 2 1608 2.99 ( 10.8%) 20 CG 0 6 1 1 6 12 5 1 0 63° 0 0 1 2 0 2 1107 2.15 ( 8.3%) 21 GC 1 5 2 2 6 5 70° 1 0 2 1 0 0 0 3 1 932 1.90 ( 7.6%) 22 CA 1 69° 0 2 2 13 1 1 0 3 0 0 2 4 0 1 1016 2.40 ( 9.0%) 23 AT 3 4 0 2 13 5 2 3 2 1 0 1 49° 1 1 12 532 1.27 ( 5.2%) 24 TT 1 1 0 3 2 1 0 10 0 1 0 2 1 1 0 76° 1428 3.47 ( 14.7%) 25 TT 0 1 0 10 1 2 0 12 0 0 0 1 1 1 0 70° 1260 3.11 ( 15.2%) 26 TA 0 5 0 85° 2 1 0 1 1 0 0 1 0 0 0 3 2540 7.23 ( 36.2%) 27 A 11 0 1 3 63 1 1 1 15 0 0 0 3 0 0 0 1766 5.19 ( 39.4%) 28 3 6 19 1 8 24 10 2 2 16 2 1 1 2 2 1 766 2.35 ( 28.2%) 29 6 6 17 7 10 19 6 3 1 12 0 1 3 4 4 3 265 0.83 ( 13.6%) 30 3 7 2 3 10 22 5 11 3 9 1 1 3 4 7 7 149 0.47 ( 8.8%) 31 3 6 1 10 7 26 5 15 1 4 1 0 6 1 9 5 144 0.47 ( 9.4%) 32 4 6 2 7 5 40 2 13 2 4 1 1 2 3 2 6 170 0.61 ( 12.2%) 33 4 7 7 8 12 19 7 11 3 3 2 1 4 4 3 4 118 0.43 ( 9.6%)

See termDiNucAnal for a basic introduction to this function.

This version takes both an RNAset as input and also a similar set (RefSet) derived from sequencing of the DNA template used in that experiment. The idea is that for randomized regions of a template we might want to achieve 25% of each nucleotide at each position, but we know that the oligo synthesis does not always provide that. If the template, instead of having 25/25/25/25 at a specific position has 35/20/25/20, then if polymerase shows no bias, the resulting transcripts should show 35/20/25/20. This function corrects for this by calculating a " Z-Score " comparing the two populations. A positive Z-score reflects steps that occur in the sequence more often than expected from the template. A negative Z-score reflects steps that occur less frequently than expected from the template.

Statistically, the "null hypothesis" is that the transcribed RNA correctly reflects the template, in other words, the probability of abortively dissociating at a particular position is independent of the sequence of the terminal RNA dinucleotide step. Therefore a large positive value indicates a step that leads to more abortive dissociation and a negative value reflects a step that has reduced abortive dissociation. tmpset.termDiNucAnalScore(Refset,'test') Results here

Useful for looking at internal fidelity of transcription - how well did polymerase incorporate the expected base at each position. Specifically, it looks at the occurence of two base (dinucleotide) steps.

Expected steps are indicated by a small superscript 'o'.

'Count' shows the number of RNAs analyzed and decreases progressively as shorter RNAs are left out of the analysis. tmpWTset.internalDiNucAnal('test') Percentage of INTERNAL dinucleotide steps at each position in the RNAs N AA CA GA TA AC CC GC TC AG CG GG TG AT CT GT TT Count (%Tot) 2 GG 0 0 0 0 0 0 0 0 0 0 100° 0 0 0 0 0 141453 ( 99.9%) 3 GN 0 0 28 0 0 0 32 0 0 0 21 0 0 0 19 0 136683 ( 96.6%) 4 NN 7 10 9 6 11 4 4 5 5 6 6 3 7 10 4 3 121226 ( 85.6%) 5 NN 9 5 8 5 11 4 4 6 6 4 6 4 12 6 6 4 75257 ( 53.2%) 6 NN 8 6 9 7 9 4 4 5 6 5 5 6 9 6 6 6 50077 ( 35.4%) 7 NN 9 6 9 8 8 3 4 5 6 5 4 6 9 6 6 6 40938 ( 28.9%) 8 NN 10 5 8 8 8 3 4 5 6 5 4 6 8 6 6 7 37105 ( 26.2%) 9 NN 11 6 8 9 7 3 4 5 6 4 4 6 8 7 6 7 34431 ( 24.3%) 10 NN 9 5 6 7 10 3 4 7 6 4 4 6 10 6 6 8 32153 ( 22.7%) 11 NT 1 1 1 3 2 0 0 1 1 1 1 1 25 20 20 22 27880 ( 19.7%) 12 TA 1 0 1 86° 3 0 0 1 1 1 0 1 1 1 1 1 25462 ( 18.0%) 13 AC 1 0 1 3 85° 0 1 1 1 3 0 0 1 1 1 1 22402 ( 15.8%) 14 CG 1 1 1 2 3 1 1 1 1 84° 0 0 1 1 3 1 19384 ( 13.7%) 15 GT 1 1 1 1 2 0 1 3 0 3 0 0 1 1 84° 0 18285 ( 12.9%) 16 TC 0 1 1 1 2 0 1 84° 0 4 0 0 1 1 2 1 16423 ( 11.6%) 17 CG 0 1 3 1 1 0 1 3 1 85° 0 0 1 1 1 1 15737 ( 11.1%) 18 GA 0 1 85° 1 3 0 1 1 0 3 1 0 1 0 1 1 14909 ( 10.5%) 19 AC 0 1 2 1 85° 0 0 1 1 4 0 0 1 0 1 1 13301 ( 9.4%) 20 CG 0 1 1 1 3 1 3 1 1 85° 0 1 1 1 1 1 12194 ( 8.6%) 21 GC 0 3 1 1 1 0 84° 1 0 3 1 0 1 1 2 1 11262 ( 8.0%) 22 CA 0 83° 1 2 1 1 2 1 0 1 0 0 4 1 1 1 10246 ( 7.2%) 23 AT 1 3 1 1 1 0 1 1 1 1 1 0 84° 1 1 3 9714 ( 6.9%) 24 TT 1 1 1 1 1 0 1 1 1 1 1 1 3 1 1 85° 8286 ( 5.9%) 25 TT 1 1 1 3 1 0 0 3 1 1 1 1 1 0 1 83° 7026 ( 5.0%) 26 TA 1 1 1 76° 1 0 1 2 2 2 1 2 2 1 2 4 4486 ( 3.2%) 27 A 7 2 3 5 25 1 2 3 35 3 3 1 4 1 2 3 2720 ( 1.9%) 28 4 3 24 4 6 3 6 4 5 14 10 1 4 4 4 3 1954 ( 1.4%)

See internalDiNucAnal for a basic introduction to this function.

Statistically, the "null hypothesis" is that the transcribed RNA correctly reflects the template. There are two potential reasons this might not occur: 1) if the polymerase aborts more at some sequences (at some positions), then they will be underrepresented (at those positions) in longer transcripts. Therefore, and in contrast to .termDiNucAnalScore, a large negative value would be expected correlate with more abortive dissociation at that position and sequence. 2) if the polymerase misincorporates at a particular position and then continues to extend past the misincorporation, a positive value might indicate a preference for the base that is misincorporated (and a negative value might reflect the intended base?). tmpset.internalDiNucAnalScore(Refset,'test') Results here

This section needs documentation update. tmpWTset.AnalyzeRunoff(4, 2, 8, 'test') Terminal sequence analysis: 1016 CGCATTTA 532 GCATTTA 1428 CATTTA 1260 ATTTA 2540 TTTA 1766 TTA 766 TA 681 (67%) CGCA 258 (48%) GCAT 1036 (73%) CATT 838 (67%) ATTT 2155 (85%) TTTA 1111 (63%) TTAC 174 (23%) TACC 98 (10%) CGCC 60 (11%) GCAC 134 ( 9%) CATC 131 (10%) ATTC 109 ( 4%) TTCA 270 (15%) TTAG 146 (19%) TAGA 35 ( 3%) CGCT 50 ( 9%) CATT 40 ( 3%) ATTT 88 ( 7%) TTTA 51 ( 2%) TTAC 181 (10%) TTAA 115 (15%) TACG 26 ( 3%) CGCG 14 ( 3%) GCAA 31 ( 2%) CATG 30 ( 2%) CATT 27 ( 1%) TTTT 57 ( 3%) TTAT 69 ( 9%) TAGC 11 ( 1%) CGTA 12 ( 2%) GCCA 22 ( 2%) CATA 29 ( 2%) ATTA 23 ( 1%) ATTT 42 ( 2%) TTTA 35 ( 5%) TACA 11 ( 1%) TTAC 10 ( 2%) CATC 13 ( 1%) TTTA 12 ( 1%) ATTG 22 ( 1%) TTTC 8 ( 0%) TAGA 30 ( 4%) TAAC 11 ( 1%) TACC 8 ( 2%) CGCA 12 ( 1%) TTAC 6 ( 0%) TTAC 14 ( 1%) TTAG 5 ( 0%) TACG 21 ( 3%) TTAC 11 ( 1%) GCAT 8 ( 2%) TACC 7 ( 0%) GCCA 5 ( 0%) ATCT 11 ( 0%) TTTG 5 ( 0%) TTGC 18 ( 2%) TAGG

Aligns all sequences to the 'keyseq' key sequence. Then analyzes 5' and 3' end heterogeneities (separately). Useful for analyzing positional mis-initiation and for analyzing runoff things like n-1, n+1, or primed extensions. Passing 'keyseq' as None tells it to use the AlignSeq stored in dData as the alignment sequence.

Analyzes register shift (slippage additions or omissions from skipping an encoded base) Reports back how many are in correct register, how many slipped by 1, by 2, etc AnalyzeRegister(tmpWTset, 11, 8, 0.4, '') >><>.importrawdataset().AnalyzeRegister('GG',1).AnalyzeRegister(tmpWTset, 11, 8, 0.4, '') Register shift analysis. Key: TACGTCGA at +11 14238 GGNNNNNNNNTACGTCGACGCATTTA 12999 91.3% 0 5 GGACTATCTTTACGTCGAC 305 2.1% -1 2 GGCGAAATTTACGTCGAC 277 1.9% 1 2 GGTAAAAAAAATACGTCGAC 99 0.7% -6 18 GGTTTACGTCGAC 80 0.6% -8 80 GGTACGTCGAC 78 0.5% -7 33 GGTTACGTCGAC

Graphing/Plotting Functions

Plots (to a PDF file) a nice bar chart showing amounts of each length of RNA
The default is to depict the last base by color (for B/W, pass 'stackedcolor':False)

pDict Dictionary can contains these optional definitions

lmin - minimum position to plot (default = 1)
lmax - maximum position to plot (default = max length)
stackedcolor - True/False - display bars in base-specific color segments (default = True)
fAddr - a string to add to the PDF file name (default = '')
descr - a string description for the bar chart (default = '')
width - relative width of bars (0-1) (default = 0.8)
mrkr - comment to go with output (default = '')
noHistory - leave history off of plt (default = False; history IS plotted)

Example: plotLengthBarChart({'lmin':0.1, 'fAddr':'_special', 'descr':'This is a test'}) plotLengthBarChart({}) >><>.importrawdataset().plotLengthBarChart(pDict).plotLengthBarChart({'lmax':45}) output gets stored in a PDF, also on screen if your Python environment sports graphics

Plots (to a PDF file) a nice bar chart showing (internal and end) misincorporation at each position
Uses self.tseq as the standard - any other base is deemed misincorporated
WARNING: a frameshift in a sequence will show almost everything downstream as misincorporated

pDict Dictionary can contains these optional definitions

lmin - minimum position to plot (default = 1)
lmax - maximum position to plot (default = max length)
fAddr - a string to add to the PDF file name (default = '')
descr - a string description for the bar chart (default = '')
width - relative width of bars (0-1) (default = 0.8)
mrkr - comment to go with output (default = '')
noHistory - leave history off of plt (default = False; history IS plotted)
inclCounts - put count at each position above the bar (default = True)

Example: plotMisIncorpBarChart({'lmin':0.1, 'fAddr':'_special', 'descr':'This is a test'}) plotMisIncorpBarChart({}) >><>.importrawdataset().plotMisIncorpBarChart(pDict).plotMisIncorpBarChart({'lmax':45}) output gets stored in a PDF, also on screen if your Python environment sports graphics

Utility Functions

This function is essential for setting up a particular experiment. It tells the import_dataset from what file to read the data, but it also tells it other key things like the expected sequence, what adapters to use in trimming the raw data, and general experimental information that will then be stored with the resulting sequence as it is processed. einfo2 = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA') Expt2 = Seqsetup('MG_S9_L001_R1_001.fastq.gz', 'GGATCCCGACTGGCGAGAGCCAGGTAACGAATGGATCC', 'TCAACT', 'TGGAA', einfo2, 'MG Aptamer (Encoded toehold CCACTCCTCA)', False, { 'Tmplt':'GGATCCATTCGTTACCTGGCTCTCGCCAGTCGGGATCCTGAGGAGTGG', 'T2a':'AGTGAGTCGTATTAATTTC', 'NTmpl':'GAAATTAATACGACTCACTATTCCTAGCCGACTGGCGAGAGCCAGGTAACGAATGGATCC', '5Prmr': 'GTTCAGAGTTCTACAGTCCGACGATCTCAACT', '3Prmr':'ATGGAATTCTCGGGTGCCAAGG', 'AlignSeq': 'ACTGGCGAGAGCCAGGTAAC', 'Toehld':'CCACTCCTCA'} ) NewSet = Expt2.importdataset() [ no results ]

Import data from an Illumina sequence file, or a file written by 'writedataset' below
The input is a SeqSetup variable, that has within it the name of the file, etc
This can read either .fastq or .fastq.gz files

(note that importrawdataset and importRNAdataset are outdated (legacy) versions of this) einfo2 = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA') Expt2 = Seqsetup('MG_S9_L001_R1_001.fastq.gz', 'GGATCCCGACTGGCGAGAGCCAGGTAACGAATGGATCC', 'TCAACT', 'TGGAA', einfo2, 'MG Aptamer (Encoded toehold CCACTCCTCA)', False, { 'Tmplt':'GGATCCATTCGTTACCTGGCTCTCGCCAGTCGGGATCCTGAGGAGTGG', 'T2a':'AGTGAGTCGTATTAATTTC', 'NTmpl':'GAAATTAATACGACTCACTATTCCTAGCCGACTGGCGAGAGCCAGGTAACGAATGGATCC', '5Prmr': 'GTTCAGAGTTCTACAGTCCGACGATCTCAACT', '3Prmr':'ATGGAATTCTCGGGTGCCAAGG', 'AlignSeq': 'ACTGGCGAGAGCCAGGTAAC', 'Toehld':'CCACTCCTCA'} ) NewSet = Expt2.importdataset() >>importrawdataset('MG_S9_L001_R1_001.fastq.gz') ================ WT Enz, randomized IT +3 to +10 ================== ITWT_S1_L001_R1_001_Aug.fastq.gz, Trgt=GGNNNNNNNNTACGTCGACGCATTTA (26mer) 8/8/16 Aruni [Enz] = 0.50 uM, [DNA] = 2.00 uM, for 5.0 min at T=37.0 C >337631 << Imported

Returns a NucleicSet object after trimming adaptors off of each sequence
If an adaptor is passed as None, it uses an adaptor stored with the data set
If an adaptor is passed as '', it does not look for or require that adaptor
If an adaptor is required and is not found in a sequence, it throws out that sequence
If sequence was tagged as 'isTemplate' it returns the reverse complement after trimming

Note that an earlier version of this, trim_adaptors, has been deprecated.

Note also that this does not convert T's to U's (so think of RNA as having T!) einfo = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA') Expt1 = Seqsetup('ITWT_S1_L001_R1_001_Aug.fastq.gz', 'GGNNNNNNNNTACGTCGACGCATTTA', 'TAATCA', 'TGGAA', einfo, 'WT Enz, randomized IT +3 to +10', False, 'AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATCTAATCAGGNNNNNNNNUACGUCGACGCAUUUAATGGAATTCTCGGGTGCCAAGG' ) rawset = Expt1.importdataset() trimmedSet = rawset.trimAdaptors(Expt1.adptr5,Expt1.adptr3) >>importrawdataset('ITWT_S1_L001_R1_001_Aug.fastq.gz')

Returns a NucleicSet object, converting all T's to U's newRNASet = rawset.trimAdaptors(None,None).toRNASet() needs example

This section needs documentation update. RNAset.writedataset('_Craig1','../Output') RNAset.writedataset('_trial1',None) Results here

Use this to define a variable that contains information on the transcription reaction. einfo = Exptinfo('8/8/16', 'Aruni',0.5, 2.0, 5, 37,'','AATTAATACGACTCACTATA') [ no results ]

This section needs documentation update. newSet = RNAset.getReverseComplement()

This section needs documentation update. newSet = RNAset.getMostCommon(25,'comment') Results here

This section needs documentation update. RNAset.printMostCommon(0.8,'heading','comment') Results here

A note about keyseq

Many functions above use a parameter called keyseq.

At its simplest, it is a DNA or RNA sequence, passed as 'AGCGGAG' that is used for searching sequences for matches.

But it can do much more. Specifying False in place of this parameter replaces it with the default value: the value of SeqsUsed{"AlignSeq"} defined in the initial experimental setup. You can also pass along the value of any SeqsUsed variable set up for the data set and it will use the sequence that corresponds to that variable.

Wonky Stuff

The following assumes that RNAset is a NucleicSet variable

RNAset.count() - returns the number of RNAs in the set
RNAset.maxlength() - returns the lenght of the longest RNA in the set
RNAset.tseq - returns the expected sequence
RNAset.alignSeq - returns the stored internal alignment sequence
RNAset.istemplate - returns True if template seqs, False if encoded RNA
RNAset.filename - returns the original data set file name
RNAset.setnotes - returns a string with notes about this data set
RNAset.history - returns a string describing how this data set has been manipulated/filtered
RNAset.expectedlength() - returns the lenght of the expected sequence
RNAset.info() - returns information about the set
RNAset.infoFull() - returns information about the set, incl adapter stats
RNAset.exptDescr() - returns a string with count, max length, & reaction conditions
RNAset.adapterStats(adpt5,adpt3,mrkr) - returns a string with info on adapters statistics

Python Dictionaries

RNAset.exptinfo - returns a Python object with experimental info, as entered originally
RNAset.exptinfo.rxnconditions() - returns a string with information about the reaction conditions

RNAset.SeqsUsed - returns a Python dictionary object with sequences, as entered originally
RNAset.SeqsUsed['xxx'] - returns the dictionary element xxx from SeqsUsed

RNAset.dData - returns another dictionary with any user defined elements

========

Note that for most of the "get" functions, which return only a subset of the data passed, you can access the data that was NOT "gotten" by immediately accessing the variable config.dumpedSet. For example, in the following

   NewSet = RNAset.getWithMatched('AGGCT',0,")

   BadSet = config.dumpedSet

NewSet will contain sequences containing the match
BadSet will contain sequences that do NOT meet the criteria
Be careful in nested calls. config.dumpedSet will only refer to the LAST function in the nest.

General Utilities

Always use this for anything that you might want captured to a file. In fact, just always use printc('test') instead of print('test') printc('This is a test') This is a test

Typically, bracket your analysis by StartCaptureToFile() and WriteCaptureToFile(fi) StartCaptureToFile()

Typically, bracket your analysis by StartCaptureToFile() and WriteCaptureToFile(fi) WriteCaptureToFile('output.txt') [ no output ]

This section needs documentation update. ResumeCaptureToFile()

Print some sample sequences from the data set.
WARNING: these sequences have no statistical significance. Use this only for simple testing, not analysis. WTset = Expts["ITWT_Aug"].importrawdataset() WTset.printSampleSeqs(5) >><>.importrawdataset() ============ WT Enz, randomized IT +3 to +10 =============== WT Enz, randomized IT +3 to +10 8/8/16 Aruni [Enz] = 0.50 uM, [DNA] = 2.00 uM, for 5.0 min at T=37.0 C >337631 << Imported TAATCATACAGTCCGACGATCTAATGTTCTACAGTCCGACGATCTAATCAGGCGTC TAATCAGGGCTTCCTCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTA TAATGGACCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCGTA TAATCAAGAGTTCTACAGTCAGACGATCTAATCATTCTACAGTCCGACGATCTAAT TAATCAGGAGCCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTC

Advanced functions

This function is called by .termDiNucAnal, .internalDiNucAnal, .termDiNucAnalScore, and .internalDiNucAnalScore. It collects abundancies of n-nucleotide steps at each position (either at every position along the transcript (internal) or only at the ends of each transcript (terminal). If no reference set is provided (None), it reports back the percent of each step found at each position. If a reference NucleicSet is provided (expected transcripts from direct sequencing of the DNA template), the percent of each step at each position for the primary experiment is compared to the percent of each step in the template derived set. A "Z-Score" is calculated, which effectively corrects for variations in the template (eg, if the template has a higher than random fraction of CG at position 4, then one would expect a higher than random fraction in the experiment - the score is then scaled appropriately. ResumeCaptureToFile()

There is a set of . Do not use them as they are no longer supported.

DEPRECATED (outdated) functions

Aligns all sequences to the 'tseq' subsequence starting at 'keyposition' and 'keylength' long. Then analyzes 5' and 3' end heterogeneities (separately). Useful for analyzing positional mis-initiation and for analyzing runoff things like n-1, n+1, or primed extensions. Passing 'keyposition' as None tells it to use the AlignSeq stored in dData as the alignment sequence.

This function looks for RNA products that might arise from internal priming by another RNA (or DNA). The expectation from such an event is that the post-priming RNA will be the inverse complement of some part of the original RNA sequence. This scans and tries to find those events. The function uses a 'sliding window" approach: a smaller window might pick up false positives, a larger window might miss something. alignedSet = RNAset.getPrimedExt(7,5,'testing get primed extensions','InvCompl_Seqs') >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getPrimedExt(7,5,'','InvCompl_Seqs') testing get primed extensions Return only seqs of length 7 to 10000. >> 132037 of 244036 (54.1%) Analyzing 132037 sequences 6- 0 ( 6) GGNNNNNNNNTACGTCGACGCATTTA 18-24 RvTmplt GTTCAGAGTTCTACAAGGCTGAACATTACGTTCAG <<<<<<............++++++........... 7- 1 ( 6) GGNNNNNNNNTACGTCGACGCATTTA 21-27 Tmplate CGTACGTCGACGCATTTACCTGTACGTCGACGCATT .>>>>>>..............++++++......... 23-17 ( 6) GGNNNNNNNNTACGTCGACGCATTTA 32-38 RvTmplt GTTCAGAGTTCTACGTAATCACTCACTAATGTAGTGATA .................<<<<<<.........++++++. 18-10 ( 8) GGNNNNNNNNTACGTCGACGCATTTA 14-22 RvTmplt GGAAACTAAATACGTCGACGTAAA ..........<<<<----++++.. .... 26-20 ( 6) GGNNNNNNNNTACGTCGACGCATTTA 39-45 RvTmplt GTTCTACAGTCCTACGATCTAATCAGTTCAGACAGGGCGCTGATT ....................<<<<<<.............++++++ 13- 7 ( 6) GGNNNNNNNNTACGTCGACGCATTTA 25-31 Tmplate CGACGATTAATCAAGTCCGACGATCTAATCAGTTCAGAGT .......>>>>>>............++++++......... 20 12/3945 0.3% 21 6/3655 0.2% 22 20/3712 0.5% 23 13/2994 0.4% 24 43/3859 1.1% 25 34/3913 0.9% 26 29/5004 0.6% 27 20/4305 0.5% 28 22/3148 0.7% 29 22/2600 0.8% 30 20/2864 0.7% 31 30/2200 1.4% 32 59/2214 2.7% 33 21/1996 1.1% 34 25/1859 1.3% 35 18/1645 1.1% 36 40/1527 2.6% 37 24/1374 1.7% 38 21/1227 1.7% 39 36/1086 3.3% 40 20/ 956 2.1% 41 25/ 920 2.7% 42 22/ 827 2.7% 43 19/ 796 2.4% 44 19/ 743 2.6% 45 17/ 717 2.4% get only extended RNAs (5 base window) at or beyond 7. >> 637 of 244036 (0.3%)

This function selects a key sequence from within the expected sequence (of length keylen, starting at position keypos). It returns only sequences containing that key sequence, and it then adds varying amounts of white space to each sequence, in order that all of the (internal) key sequences line up. Why do this? If some transcripts start +1 or -1 relative to the expected start site, the subsequent sequences will be phase shifted, complicating comparisons. When one then looks at sequences aligned with this function, it becomes obvious which sequences started phase shifted.

WARNING: please be aware that unlike, most functions, this ALTERS the returned sequence, but only by introducing white space at the 5' end. alignedSet = RNAset.getPhaseShifted(12,9,'align to position 12 in expected sequence') alignedSet..printSampleSeqs(10) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getPhaseShifted(12,9,'') align to position 12 in expected sequence GGNNNNNNNNTACGTCGACGCATTTA Phase shift seq with key at 12, 9mer ACGTCGACG from all sequences. >> 15841 of 244036 (6.5%) >[ITWT_S1_L001_R1_001_Aug].importrawdataset().trimAdaptors('TAATCA','TGGAA').getPhaseShifted(12,9,'').printSampleSeqs(10) GGNNNNNNNNTACGTCGACGCATTTA ....'....|....'....|....'....|....'....|....'....|....'. GGAATATGATTACGTCGACGCATTTACGACGTAATCA GGTACAACATTACGTCGACGCATT GGTCTTTGGATACGTCGACGCATT GTTCGGCAGTTAGATACGTCGACGCATTTTC GGCGATACTGTACGTCGACGCATT GGCACGTTGATACGTCGACGCATT GGAAATAACGTACGTCGACGCATTTAC TTCAGGGATAAAGGATACGTCGACGCATTTAG GGTATGTAAATACGTCGACGCATTTACATACGTCGACGCATT GGGTCATAAATACGTCGACGCGGCGC

Analysis Functions

RNAset.RNAkeyseqPosAnal('GAAGGA',"report back on positions of GAAGGA");

RNAset remains unchanged.

Advanced topics

A strength of this tool is that you can easily run the same analysis on a number of sequence data sets. An easy way to do this is by defining a list (array) of sequence identifiers. So instead of calling Seqsetup once for one variable, setup a dictionary collection of experiments.

UserSq = {'Tmplt': 'TAAATCTTCACCTCTACTGCTTCCTATAGTGAGTCGTATTAATT',
   'NTmpl': 'AATTAATACGACTCACTATAGG',
   '5Prmr': 'GTTCAGAGTTCTACAGTCCGACGATCTAATCA',
   '3Prmr': 'ATGGAATTCTCGGGTGCCAAGG',
   'AlignSeq': 'GGAAGCAG',
   'Index1': ''},

einfo5 = Exptinfo('07/02/19', 'Yasaman',2.11, 2, 240, 37,'','AATTAATACGACTCACTATAGG')

Expts = {} # define an initially empty dictionary

Expts['U9'] = Seqsetup('U9_S1_L001_R1_001.fastq.gz', 'GGAAGCAGTAGAGGTGAAGATTTA',
   'TAATCA', 'TGGAA', einfo5,
   'Pseudo U in stem-loop region, Pseudo U at position +9,Transcriptopn with UTP', False,
   UserSq,
   {'Keywords': 'PseudoU, UTP',
   'Description': 'psU in stem-loop +9, UTP',
   'QCode': 'U9',
   'PF': 3.14159,
   'Run Date': '07/02/2019'}
   )

Expts['U7'] = Seqsetup('U7_S1_L001_R1_001.fastq.gz', 'GGAAGCAGTAGAGGTGAAGATTTA',
   'TAATCA', 'TGGAA', einfo5,
   'Pseudo U in stem-loop region, Pseudo U at position +9,Transcriptopn with pseudoUTP', False,
   UserSq,
   {'Keywords': 'PseudoU, UTP',
   'Description': 'psU in stem-loop +9, pseudoUTP',
   'QCode': 'U9pse',
   'PF': 3.14159,
   'Run Date': '07/02/2019'}
   )

for myData in ['U9','U7']:
      StartCaptureToFile()

      Dset = Expts[myData].import_dataset().trimAdaptors(None, None).toRNASet()

      Rset.getWithMatched([11,11],6,'Strip 5\' hetero seqs').endAnalysis(10,5, '')
      Rset.printMostCommon(0.1,"5% and higher","")

      WriteCaptureToFile(Rset.dData['QCode'])

exit()

Python Sequence Analysis Tools

Introduction

Filter ("get") Functions

Filter and Manipulate ("get") Functions

Analysis Functions

Graphing/Plotting Functions

Utility Functions

A note about keyseq

Wonky Stuff

Python Dictionaries

General Utilities

Advanced functions

DEPRECATED (outdated) functions

Analysis Functions

Advanced topics