Blast Matrix

You don’t need to know any details about matrices to use abYsis productively. For those who want more background on matrices and how they work, read on.

A key element in generating and scoring a pairwise sequence alignment is the 20x20 substitution matrix, which assigns a score to each possible pair of residues. BLOSUM-62 is the most widely used matrix and the default for the BLAST program; changing it is only rarely beneficial.
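As a minimal sketch of how a substitution matrix scores aligned residues, the Python below sums pair scores over an ungapped alignment. The `SCORES` dictionary is a small hand-copied excerpt intended to match BLOSUM-62, and the names `pair_score` and `ungapped_score` are hypothetical; check the values against the full published matrix before relying on them.

```python
# Excerpt of pair scores intended to match BLOSUM-62 (assumption: hand-copied,
# verify against the published matrix). Keys are stored in sorted order.
SCORES = {
    ("A", "A"): 4,
    ("W", "W"): 11,
    ("L", "L"): 4,
    ("I", "L"): 2,
    ("D", "E"): 2,
}


def pair_score(x, y):
    """Look up the score for one residue pair; the matrix is symmetric."""
    return SCORES[tuple(sorted((x, y)))]


def ungapped_score(seq1, seq2):
    """Sum the pair scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


if __name__ == "__main__":
    # Identical residues (A, W) score highly; conservative swaps (L/I, D/E)
    # still score positively, which is the point of a substitution matrix.
    print(ungapped_score("AWLD", "AWIE"))
```

A real search uses the full 20x20 table; the excerpt only covers the pairs that appear in this example.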

Detail

Different substitution matrices are tailored to detecting similarities among sequences that have diverged to differing degrees [1-3]. In practice, however, a single matrix may nevertheless be efficient over a broad range of divergence [1-3].

Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting relatively weak protein similarities. For particularly long and weak alignments, which you will probably not encounter when aligning antibodies, the BLOSUM-45 matrix may prove superior.

A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need a higher percentage of matching residues to rise above background noise and are more easily detected using a matrix with a higher "relative entropy" [1] than BLOSUM-62. As short query sequences can only produce short alignments, they might benefit from something closer to an identity matrix.

The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, but older PAM matrices [5,6] may be used instead if required.

For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)
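The table can be expressed as a simple lookup. This is only a sketch: the function name `recommended_scoring` is hypothetical, and because the table lists 50 and 85 in two rows each, the code assumes those boundary lengths belong to the shorter-query row.

```python
def recommended_scoring(query_length):
    """Return (matrix name, (gap existence cost, per-residue cost)).

    Assumption: lengths of exactly 50 and 85 are assigned to the
    shorter-query row, since the table leaves the boundaries ambiguous.
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return ("PAM-30", (9, 1))
    if query_length <= 50:
        return ("PAM-70", (10, 1))
    if query_length <= 85:
        return ("BLOSUM-80", (10, 1))
    return ("BLOSUM-62", (11, 1))


if __name__ == "__main__":
    for length in (20, 40, 70, 120):
        print(length, recommended_scoring(length))
```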

Gap Costs

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST (and PSI-BLAST, which is not used in abYsis) uses "affine gap costs", which charge the score -a for the existence of a gap and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
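The affine gap formula can be checked with a few lines of Python. This is an illustrative sketch; the function names `gap_score` and `raw_score` are hypothetical, and the pair scores passed in the example are made-up inputs rather than values from a real alignment.

```python
def gap_score(k, a, b):
    """Score of one gap of k residues under affine gap costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


def raw_score(pair_scores, gap_lengths, a, b):
    """Raw alignment score: sum of residue-pair scores plus all gap scores."""
    return sum(pair_scores) + sum(gap_score(k, a, b) for k in gap_lengths)


if __name__ == "__main__":
    # With gap costs (11,1): a gap of length 1 costs 12, a gap of length 3 costs 14.
    print(gap_score(1, 11, 1), gap_score(3, 11, 1))
    # Hypothetical alignment: four aligned pairs and one gap of three residues.
    print(raw_score([4, 11, 2, 2], [3], 11, 1))
```

Because the existence cost a is charged once per gap, one long gap is cheaper than several short gaps covering the same number of residues.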

Lambda Ratio

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9].
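The conversion is straightforward to compute. In the sketch below the function name `bit_score` is hypothetical, and the lambda and K values in the example are assumptions of the kind reported for BLOSUM-62 with gap costs (11,1); take the actual values from your own BLAST output.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Assumed parameters for illustration only (not taken from this document).
    LAMBDA_GAPPED = 0.267
    K_GAPPED = 0.041
    print(round(bit_score(100, LAMBDA_GAPPED, K_GAPPED), 2))
```

Because lambda and K are folded into S', bit scores from searches that use different scoring systems can be compared directly, which raw scores cannot.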

For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, but with infinite gap costs [8].

This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
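A short sketch of the lambda ratio check follows. The function names are hypothetical, and the two lambda values in the example are assumptions (a gapped and an ungapped lambda of the kind reported for BLOSUM-62), not figures taken from this document.

```python
def lambda_ratio(lambda_gapped, lambda_ungapped):
    """Ratio of the gapped lambda to the lambda with infinite gap costs."""
    if lambda_gapped <= 0 or lambda_ungapped <= 0:
        raise ValueError("lambda values must be positive")
    return lambda_gapped / lambda_ungapped


def in_effective_range(ratio, low=0.8, high=0.9):
    """True if the ratio falls in the empirically effective 0.8 to 0.9 range."""
    return low <= ratio <= high


if __name__ == "__main__":
    # Assumed values for illustration only.
    ratio = lambda_ratio(0.267, 0.318)
    print(round(ratio, 3), in_effective_range(ratio))
```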

References

[1] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565.

[2] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.

[3] Altschul, S.F. (1993) "A protein alignment scoring system sensitive at all evolutionary distances." J. Mol. Evol. 36:290-300.

[4] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.

[5] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found., Washington, DC.

[6] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found., Washington, DC.

[7] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.

[8] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480.

[9] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.