PatScan

This tutorial will explain the use of PatScan patterns using biologically relevant examples. If you want to follow along, download the input files for example 1 and example 2.

Identifying Iron-Response Elements in mRNA
Identifying Cytochrome P450 Cysteine Heme-Iron Ligand Signatures
1. Any-Of and Not-Any-Of Patterns

Example 1: Identifying Iron-Response Elements in mRNA.

Besides its catalytic function in the TCA cycle, the Aconitase enzyme has an important regulatory function in both Eukaryotes and Prokaryotes. When active as a regulator, Aconitase binds specific structural motifs in mRNA, so-called Iron-Response Elements (IREs). In Eukaryotes, this regulatory function is highly sensitive to iron deprivation and oxidative stress. In Bacteria, the role of Aconitase seems to be similar, though the recognition sequence is less conserved. Aconitase recognises the RNA stem loop structure of the IRE, which consists of a C bulge, a stem of 6 base pairs, and a loop of 5 or 6 nucleotides. In Eukaryotes, the loop sequence is always CAGUG; in Bacteria, more variations are possible.
In this example, we will build a set of patterns to identify putative IREs in the 5’ UTR sequences of the bacterium Streptomyces coelicolor. Starting with a relatively basic pattern, we will extend it step by step to cover more advanced patterns and use cases.

1. String and Complement Patterns

The basic IRE pattern is CN₁N₂N₃N₄N₅N₆CAGUGN'₆N'₅N'₄N'₃N'₂N'₁, where N_x is any nucleotide, and N'_x is its complement. An initial attempt at finding such a pattern using PatScan would use a number of string patterns and a complement pattern. Just using a string pattern with CNNNNNNCAGUGNNNNNN would fail to ensure the second set of Ns would be the reverse complement of the first, so instead we need to break the pattern up a bit.
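
To make the logic of this pattern concrete, here is a minimal Python sketch of what the combination of string and complement patterns checks. It is not PatScan's implementation; the function names `reverse_complement` and `find_basic_ire` are made up for this illustration, and the sketch assumes plain A/C/G/U (or T) input.

```python
# Watson-Crick pairs for RNA; any T in the input is treated as U.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_basic_ire(seq, stem_len=6, loop="CAGUG"):
    """Yield (start, match) for C + stem + loop + reverse complement of the stem."""
    seq = seq.upper().replace("T", "U")
    total = 1 + stem_len + len(loop) + stem_len
    for start in range(len(seq) - total + 1):
        window = seq[start:start + total]
        if window[0] != "C":                      # the C bulge
            continue
        stem = window[1:1 + stem_len]             # the named pattern "p1"
        if any(base not in COMPLEMENT for base in stem):
            continue                              # skip ambiguous bases
        if window[1 + stem_len:1 + stem_len + len(loop)] != loop:
            continue
        if window[-stem_len:] == reverse_complement(stem):
            yield start, window


for hit in find_basic_ire("GGCAACGUCCAGUGGACGUUAA"):
    print(hit)
```

The stem is captured first and only then compared against the closing region, which mirrors why the second string pattern has to be named before the complement pattern can refer to it.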

On the start page, hit the "Browse" button to upload the example1_input.fa file.

Fig 1: Start page.

You will end up on an empty pattern input screen.

Fig 2: Empty pattern input screen.

In order to add patterns, we use the pattern selector, as shown in Fig. 3. Patterns can be added to the work area by dragging them from the selector, or simply by clicking on them. If you need to rearrange patterns, you can simply move them around using drag and drop in the work area.

Fig 3: Pattern selector.

Select three string patterns and one complement pattern. For the first string pattern, enter C. For the second string pattern, enter NNNNNN. For the third string pattern, enter CAGUG. Click the "unnamed" button next to the second string pattern. The button text will change to "p1", the identifier of this pattern. In the complement pattern, you can now select the "p1" identifier in the "complement of" dropdown. Once you are done, your work area should look like Fig. 4.

Fig 4: Using string and complement patterns.

2. Variations and Range patterns

If we run the pattern we created in Example 1.1, we unfortunately don't get any hits in the input data. The reason for this is that the IRE motif in Bacteria is not as conserved. Notably, the stem region does not need to have six perfect matches. To allow for a mismatch in the stem, hit the "Variations" button on the complement pattern. A number of new fields appear allowing us to specify how many mismatches, insertions or deletions this pattern can tolerate. Put a "1" in the "mismatches" field and ignore the other fields to just allow a single mismatch.
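
As an illustration of what a mismatch tolerance means for the complement pattern, the following Python sketch counts the positions at which the closing region deviates from the exact reverse complement of the stem. The helper names are hypothetical, and the sketch covers mismatches only, not insertions or deletions.

```python
# Watson-Crick pairs for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def count_stem_mismatches(stem, closing):
    """Count positions where `closing` differs from the reverse complement of `stem`."""
    expected = [COMPLEMENT.get(base) for base in reversed(stem)]
    return sum(1 for want, got in zip(expected, closing) if want != got)


def closes_stem(stem, closing, max_mismatches=1):
    """True if `closing` pairs with `stem`, tolerating up to `max_mismatches`."""
    if len(stem) != len(closing):
        return False
    return count_stem_mismatches(stem, closing) <= max_mismatches


print(closes_stem("AACGUC", "GACGUU"))  # perfect reverse complement
print(closes_stem("AACGUC", "GACGAU"))  # one mismatch, still accepted
print(closes_stem("AACGUC", "CACGAU"))  # two mismatches, rejected
```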

Fig 5: Allowing one mismatch.

There is one more detail about the IRE pattern that we don't cover yet. The loop region can actually contain an optional sixth base after the CAGUG pattern. Let's add support for this by adding a range pattern. Drag and drop the range pattern field so that it sits between the string pattern for the loop region and the complement pattern for the reverse part of the stem loop. Set the range pattern to cover from "0" to "1" bases.

Fig 6: The loop range pattern in place.

Another variation that is common in bacterial IREs is that the stem of the stem loop structure is only five bases long, not six. We can use another range pattern to support this. Hit the "Remove" button on the NNNNNN string pattern, and drag another range pattern into the same location instead. Set this range pattern to cover from "5" to "6" bases. Remember to hit the "unnamed" button again so it says "p1" as before. You also need to select that identifier from the "Complement of" dropdown of the complement pattern again.
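
The effect of the two range patterns can be sketched in Python as two nested loops, one over the allowed stem lengths and one over the optional extra loop base. This is only an illustration under the assumptions of this tutorial (stem of 5 to 6 bases, 0 to 1 extra loop bases, one mismatch allowed); `find_ire_with_ranges` is a hypothetical name, not part of PatScan.

```python
# Watson-Crick pairs for RNA; any T in the input is treated as U.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def find_ire_with_ranges(seq, stem_range=(5, 6), loop="CAGUG",
                         extra_range=(0, 1), max_mismatches=1):
    """Return (start, match) hits for a C bulge, a variable-length stem,
    the loop, an optional extra loop base, and the closing stem."""
    seq = seq.upper().replace("T", "U")
    hits = []
    for start, base in enumerate(seq):
        if base != "C":                                   # the C bulge
            continue
        for stem_len in range(stem_range[0], stem_range[1] + 1):
            loop_start = start + 1 + stem_len
            if seq[loop_start:loop_start + len(loop)] != loop:
                continue
            stem = seq[start + 1:loop_start]
            expected = [COMPLEMENT.get(b) for b in reversed(stem)]
            for extra in range(extra_range[0], extra_range[1] + 1):
                close_start = loop_start + len(loop) + extra
                closing = seq[close_start:close_start + stem_len]
                if len(closing) < stem_len:               # ran off the end
                    continue
                mismatches = sum(e != c for e, c in zip(expected, closing))
                if mismatches <= max_mismatches:
                    hits.append((start, seq[start:close_start + stem_len]))
    return hits


# Five-base stem, CAGUG loop plus one extra base, perfect closing stem.
print(find_ire_with_ranges("CACGUCCAGUGAGACGU"))
```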

Fig 7: Replacing the N string pattern with a range pattern.

3. Alternative Patterns

Not only the stem region but also the loop region can have variations in Bacteria. Notably, the loop sequence GAGAG has also been reported to be functional. To support searching for CAGUG or GAGAG at the same time, we can use the alternative pattern. It is basically a container that can hold any two other patterns, one of which has to match. To add it, drag it into position next to the CAGUG string pattern, and then drag the CAGUG pattern into the alternative pattern. Drag a new string pattern into the remaining free slot of the alternative pattern, and set that sequence to GAGAG. It does not matter what order the two string patterns have inside the alternative pattern.
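
For readers who know regular expressions, the alternative pattern behaves much like the `|` operator. The short Python sketch below is an analogy only; the lookahead is used so that overlapping occurrences are reported as well, and `find_loops` is a hypothetical helper name.

```python
import re

# Either loop sequence is accepted; the order of the alternatives does not matter.
LOOP = re.compile(r"(?=(CAGUG|GAGAG))")


def find_loops(seq):
    """Return (position, loop) for every occurrence of either loop sequence."""
    seq = seq.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in LOOP.finditer(seq)]


print(find_loops("AACAGUGUUGAGAGCC"))
```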

Fig 8: Using the alternative pattern.

4. Alternative Complementation Rules

By default, PatScan uses the regular DNA complementation rules to determine which bases can be in a reverse complement. IREs are RNA sequences, so additional GU/UG pairs are possible. To support this, you can create custom alternative complementation rules. Note that due to constraints in PatScan, the alternative complementation rules need to be specified before they can be used in a complement pattern. Personally, I prefer having all alternative complementation rules at the start of my pattern set.
To add a complement pair, hit the "Add Complement Pair" button. You need to add a pair for every "direction" you want to allow complementation in. That is, adding e.g. a GU pair does not set up pairing rules for UG.
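
The directional nature of complement pairs can be pictured as a set of ordered (base, partner) tuples. The Python sketch below is an illustration of that idea, not of PatScan's internals; `can_pair` and the rule sets are names chosen for this example.

```python
# Each rule is an ordered pair: (base in the stem, base in the closing region).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}   # both directions must be listed explicitly


def can_pair(stem, closing, rules):
    """True if every stem base pairs with its partner in the reversed closing region."""
    if len(stem) != len(closing):
        return False
    return all((a, b) in rules for a, b in zip(stem, reversed(closing)))


print(can_pair("GGACU", "AGUCC", WATSON_CRICK))            # perfect stem
print(can_pair("GGACU", "AGUUC", WATSON_CRICK))            # GU pair not allowed
print(can_pair("GGACU", "AGUUC", WATSON_CRICK | WOBBLE))   # GU pair allowed
```

Adding only `("G", "U")` to the rule set would accept a G in the stem opposite a U in the closing region, but not the reverse, which is why both directions are added here.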

Fig 9: Using alternative complement rules.

5. Weight Patterns

If we wanted to explore further alternatives to the loop pattern sequence, we could put alternative patterns into alternative patterns to create a deeper logical branching. That approach does get a bit unwieldy, though. A better option is to instead use a weight pattern. A weight pattern is a kind of position-specific probability matrix that includes a cut-off value.
Weights can be added by clicking the "Add Weight" button. Every weight expects the probability values, in percent, for encountering a specific base at a given position. If the probability is zero, no value needs to be entered. The score of a candidate sequence is the sum of the values for the base found at each position.
If we wanted to build a weight pattern to capture the loop sequences CAGUG and GAGAG but also want to allow for a G base in the fourth position, we would use the following approach:
First, remove the old alternative pattern by clicking its "Remove" button. Then, drag and drop a weight pattern into that place instead. Hit the "Add Weight" button five times to add weights for all positions of our desired loop sequence. Each weight is a column with one row per base, in the order A, C, G, T.
We want the first base to be C or G, so in the first column, we fill in "50" in the second and third rows. Zeroes can be left out, so there is no need to fill out the other rows. The second base should always be an A, so we put "100" into the first row of the second column and leave the rest blank. The third base should always be a G, so "100" goes into the third row of the third column.
The fourth base could be A, G, or T (actually U, but for weight matrices those bases are equivalent). The weights are a bit arbitrary for this one, as long as C gets a zero. As there is no published evidence for G actually being valid, I personally like to score it a bit lower than the experimentally validated options. So, for column four, let's put "40" into the first and last rows, and "20" into the third row. For the last base, we again always want a G, so the fifth column gets a "100" in the third row.
To calculate the weight cut-off, we use the minimal score a valid pattern can obtain. There are three bases at "100", one base at "50" and one where the lowest valid score is "20", so "370" is the minimum score we expect a good pattern to hit.
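
The arithmetic behind the cut-off can be checked with a few lines of Python. This sketch uses the weights from the walkthrough above; treating the cut-off as inclusive (a score of exactly 370 counts as a hit) is an assumption based on the wording "minimum score", and the function names are made up for this example.

```python
# One dict per loop position; bases that are left blank score zero.
LOOP_WEIGHTS = [
    {"C": 50, "G": 50},            # position 1: C or G
    {"A": 100},                    # position 2: always A
    {"G": 100},                    # position 3: always G
    {"A": 40, "G": 20, "U": 40},   # position 4: A or U, G scored lower
    {"G": 100},                    # position 5: always G
]
CUTOFF = 370  # assumed inclusive


def weight_score(seq, weights=LOOP_WEIGHTS):
    """Sum the per-position weights for `seq` (T is treated as U)."""
    seq = seq.upper().replace("T", "U")
    return sum(column.get(base, 0) for column, base in zip(weights, seq))


def loop_matches(seq, weights=LOOP_WEIGHTS, cutoff=CUTOFF):
    """True if `seq` has the right length and reaches the cut-off."""
    return len(seq) == len(weights) and weight_score(seq, weights) >= cutoff


for loop in ("CAGUG", "GAGAG", "CAGGG", "CAGCG", "AAGUG"):
    print(loop, weight_score(loop), loop_matches(loop))
```

The two validated loops score 390, the G variant scores exactly 370, and any sequence with a disallowed base at one position falls below the cut-off.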

Fig 10: Using a weight pattern.

Example 2: Identifying Cytochrome P450 Cysteine Heme-Iron Ligand Signatures (PS00086)

Cytochrome P450 proteins utilise heme cofactors to perform oxidation reactions and are found all over the tree of life. The heme cofactor binds to a cysteine in the active site of the enzyme. In this example, we will build a set of patterns to identify the active site cysteine of cytochrome P450 enzymes found in the proteome of the bacterium Streptomyces coelicolor. We assume that you have already gone over example 1 and are familiar with the basic pattern creation.

According to the PROSITE database entry PS00086, the pattern for identifying the active site cysteine is [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]. If you are not familiar with the PROSITE pattern notation: amino acid letters wrapped in square brackets ([ ]) are alternatives, amino acid letters wrapped in curly braces ({ }) must not be present at a given position, and x means any arbitrary amino acid.

1. Any-Of and Not-Any-Of Patterns

PatScan has two protein-specific patterns that allow us to search for patterns like this. The any-of pattern lets you specify a list of alternatives, matching the square brackets ([ ]) in the PROSITE format. The not-any-of pattern lets you give a list of amino acids that should not be present, matching the curly braces ({ }) in the PROSITE format. Note that neither the any-of nor the not-any-of pattern takes a location; instead, each needs to be inserted into your overall pattern at the position where you want to allow or disallow multiple amino acids. So if you wanted to match the sequences MAGICHAT and MAGICCAT, you would want to allow both H and C at position 6 in the sequence. In order to build this pattern in PatScan, you need to break it up into three parts: a string pattern for MAGIC, an any-of pattern for CH, and another string pattern for AT, as shown in figure 11. The PROSITE pattern for this would be MAGIC-[CH]-AT, so you can usually use the dashes (-) in the PROSITE pattern to figure out where you need different pattern elements.
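
One way to picture how these elements combine is to treat every position of the pattern as a set of allowed amino acids. The Python sketch below follows that idea for the MAGIC-[CH]-AT example; the helper names (`string`, `any_of`, `not_any_of`, `matches`) are chosen to echo the pattern names and are not a PatScan API.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def string(text):
    """A string pattern: one single-letter set per position."""
    return [{letter} for letter in text]


def any_of(letters):
    """An any-of pattern: one position, any of the listed letters."""
    return [set(letters)]


def not_any_of(letters):
    """A not-any-of pattern: one position, anything except the listed letters."""
    return [AMINO_ACIDS - set(letters)]


def matches(pattern, seq):
    """True if `seq` has the pattern's length and every letter is allowed."""
    return len(seq) == len(pattern) and all(
        letter in allowed for allowed, letter in zip(pattern, seq)
    )


magic = string("MAGIC") + any_of("CH") + string("AT")
for word in ("MAGICHAT", "MAGICCAT", "MAGICKAT"):
    print(word, matches(magic, word))
```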

Fig 11: Example use of the any-of pattern.

With the help of the any-of and not-any-of patterns, we can now build a PatScan pattern for [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]. We will need ten pattern elements. The fastest way to add them is to click on the respective elements in the selector. In order, you will need two any-of patterns, a range pattern, an any-of pattern, a not-any-of pattern, an any-of pattern, a not-any-of pattern, a string pattern, and two any-of patterns. The first two any-of patterns take FW and SGNH, respectively. There is a single x in the PROSITE pattern, which translates to a range pattern from 1 to 1. The next any-of pattern takes GD. The first not-any-of pattern takes F. The next any-of pattern takes RKHPT, and the second not-any-of pattern takes P. The single C goes into the string pattern. The last two any-of patterns take LIVMFAP and GAD, respectively. See figure 12 for the example. Clicking "Submit" sends off the request, yielding eleven hits on the Streptomyces coelicolor proteome.
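
The element-by-element translation described above can also be sketched as a conversion from PROSITE notation to a Python regular expression. The converter below is a minimal illustration that handles only the notation used in this tutorial (literal letters, x, [ ] and { }); repeat counts and anchors are not supported, and the test peptide is invented for the example rather than taken from a real protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # range pattern from 1 to 1
        elif element.startswith("[") and element.endswith("]"):
            parts.append(element)                      # any-of pattern
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # not-any-of pattern
        else:
            parts.append(re.escape(element))           # string pattern
    return "".join(parts)


PS00086 = "[FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]"
regex = re.compile(prosite_to_regex(PS00086))
print(regex.pattern)

# Hypothetical peptide built to satisfy every position of the pattern.
print(bool(regex.search("MKFGAGARACLGTT")))
```

Each dash-separated PROSITE element becomes exactly one regular expression element, the same one-to-one mapping used when placing the ten pattern elements in the work area.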

Fig 12: The cytochrome P450 active site pattern.