1/17/2017
BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
Python Programming Language
Script Language
General-purpose script language
Broad applications
(web, bioinformatics, network programming, graphics, software engineering)
Features
Object-oriented
Extension with modules
Database integration
Embeddable
Web frameworks / Web modules
Getting Started
Download & Installation
http://www.python.org/download/ (the most recent version: Python 3.3)
Edit & Run
Create a file named test.py
Edit the code
# This is a test.
dna = 'ATCGATGA'
print(dna, '\n')
Run the code
> python test.py
Primitives
Primitive Data Types
Numbers or Strings
num = 1234
st = '1234'
num_1 = num + int(st)
st_1 = str(num) + st
Substring
dna1 = 'ACGTGAACT'
dna2 = dna1[0:4]
length = len(dna2)
Reversing
dna1 = 'ACGTGAACT'
dna2 = dna1[::-1]
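To make the slicing rules concrete, the short sketch below prints what each expression evaluates to. The name reversed_dna is introduced here only for illustration; the other names follow the slide.

```python
# Slicing sketch: dna1[start:end] includes index start and excludes index end.
dna1 = 'ACGTGAACT'

dna2 = dna1[0:4]           # characters at indices 0, 1, 2, 3
length = len(dna2)         # number of characters in the substring
reversed_dna = dna1[::-1]  # a step of -1 walks the string backwards

print(dna2)          # ACGT
print(length)        # 4
print(reversed_dna)  # TCAAGTGCA
```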
Lists
List Variables
A list of comma-separated values
lst1 = ['A', 'C', 'G']
lst2 = ['T']
lst1 = lst1 + lst2
Variable-length list
Insert, Delete, Append, Reverse, and Sort
# Using list methods
lst = ['A', 'T', 'G']
lst.insert(1, 'C')
del lst[2]
lst.append('T')
lst.extend(['A', 'C'])
lst.reverse()
lst.sort()
# Using slice notation: replace, insert, delete, append, extend, reversed copy
lst = ['A', 'T', 'G']
lst[1:2] = 'C'
lst[1:1] = 'T'
lst[2:3] = ''
lst[len(lst):len(lst)] = 'T'
lst[len(lst):len(lst)] = ['A', 'C']
lst[::-1]
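As a rough illustration of how list methods and slice assignment relate, the sketch below applies the same sequence of changes both ways and records the list after each step. The names methods and slices are chosen here for illustration and do not come from the slide.

```python
# Same edits performed with list methods ...
methods = ['A', 'T', 'G']
methods.insert(1, 'C')       # ['A', 'C', 'T', 'G']
del methods[2]               # ['A', 'C', 'G']
methods.append('T')          # ['A', 'C', 'G', 'T']
methods.extend(['A', 'C'])   # ['A', 'C', 'G', 'T', 'A', 'C']
methods.reverse()            # ['C', 'A', 'T', 'G', 'C', 'A']
methods.sort()               # ['A', 'A', 'C', 'C', 'G', 'T']

# ... and with slice notation.
slices = ['A', 'T', 'G']
slices[1:1] = ['C']              # insert at index 1
slices[2:3] = []                 # delete the item at index 2
slices[len(slices):] = ['T']     # append one item
slices[len(slices):] = ['A', 'C']  # extend with several items
slices = slices[::-1]            # [::-1] builds a reversed copy
slices = sorted(slices)          # sorted() builds a sorted copy

print(methods == slices)  # True
```

Note that reverse() and sort() change the list in place, while [::-1] and sorted() return new lists and leave the original untouched.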
Sets
Set Variables
DNAbases = {'A', 'C', 'G', 'T'}
RNAbases = {'A', 'C', 'G', 'U'}
DNAbases | RNAbases
DNAbases & RNAbases
DNAbases - RNAbases
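The three operators are union, intersection, and difference. A small sketch of what each one returns (the result names union, common, and dna_only are introduced here for illustration):

```python
DNAbases = {'A', 'C', 'G', 'T'}
RNAbases = {'A', 'C', 'G', 'U'}

union = DNAbases | RNAbases      # bases found in DNA or RNA
common = DNAbases & RNAbases     # bases found in both
dna_only = DNAbases - RNAbases   # bases found in DNA but not in RNA

# Sets are unordered, so sort before printing for a stable display.
print(sorted(union))     # ['A', 'C', 'G', 'T', 'U']
print(sorted(common))    # ['A', 'C', 'G']
print(sorted(dna_only))  # ['T']
```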
Add and Remove
bases = {'A', 'D', 'G'}
bases.add('T')
bases.remove('D')
Dictionaries
Initialization
# Literal initialization
d = {
    'key1': 'value1',
    'key2': 'value2',
    'key3': 'value3'
}
# Step-by-step initialization
d = dict()
d['key1'] = 'value1'
k2, v2 = 'key2', 'value2'
d[k2] = v2
Mapping
d['key1']
d.get('key1')
d.keys()
d.values()
Delete
del d['key1']
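One practical difference between d[key] and d.get(key) is that get returns None (or a supplied default) for a missing key instead of raising KeyError. The sketch below uses that to count bases in a sequence; the sequence and the name counts are assumptions made for this example.

```python
seq = 'ATCGATGA'

# Count how often each base occurs; get(base, 0) supplies 0 for unseen bases.
counts = {}
for base in seq:
    counts[base] = counts.get(base, 0) + 1

print(counts['A'])            # 3
print(counts.get('N'))        # None (missing key, no exception)
print(counts.get('N', 0))     # 0
print(sorted(counts.keys()))  # ['A', 'C', 'G', 'T']
```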
Input / Output
Standard Input
import sys
data = sys.stdin.readline().replace('\n', ' ')
Reading Files
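# Option 1: a fixed file name
name = 'myfilename.txt'
with open(name) as file:
    data = file.read()
# Option 2: a file name typed on standard input (strip the trailing newline)
name = sys.stdin.readline().strip()
with open(name) as file:
    data = file.read()
# Option 3: a file name given as the first command-line argument
name = sys.argv[1]
with open(name) as file:
    data = file.read()
Writing Files
name = 'output.txt'
with open(name, 'w') as file:
    file.write('ATCGATG')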
Functions
Types
Built-in system functions
User-defined functions
Defining Function
def function_name(parameter_list):
    statement
    statement
    return value
Function Call
Iteration
Iterative Process
def find_max(lst):
    max_so_far = lst[0]
    for item in lst[1:]:
        if item > max_so_far:
            max_so_far = item
    return max_so_far
lst1 = [3, 5, 10, 4, 6]
maximum = find_max(lst1)
Recursion
Recursive Call
def print_tree(tree, level):
    print(' ' * 4 * level, tree[0])
    for subtree in tree[1:]:
        print_tree(subtree, level+1)
t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
print_tree(t1, 0)
Modules
Module
A collection of functions
Modules are Python (.py) files in a library directory
Module Call
import random
seq = 'ATCGATAGCTA'
random_base = seq[random.randint(0,len(seq)-1)]
from random import *
seq = 'ATCGATAGCTA'
random_base = seq[randint(0,len(seq)-1)]
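For picking one random element, the standard library's random.choice does the indexing itself. The sketch below shows it next to an explicit single-name import, which avoids pulling every name from the module into the current namespace; the variable names are chosen here for illustration.

```python
import random
from random import randint  # import only the name that is needed

seq = 'ATCGATAGCTA'

# random.choice picks one element of a non-empty sequence directly.
random_base = random.choice(seq)

# Equivalent index-based form; randint includes both end points.
other_base = seq[randint(0, len(seq) - 1)]

print(random_base, other_base)
```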
Regular Expressions (1)
Special Languages
Metacharacters (characters having special meanings):
. (any character), \n, \t, \s (whitespace), \w (any alphabetic or numeric character, or underscore), \W, \d (decimal digit), \D
Quantifiers
e.g., 'ct.*g', 'ct.+g', 'ct.?g', 'ct{2}g', 'ct{2,5}g'
Grouping and back-reference
e.g., '(.)(.)aa\1\2'
Alternatives
e.g., '(ct|ca)'
Character set
e.g., '[acgt]', '[a-zA-Z]'
Anchors: ^ (the start of the string), $ (the end of the string)
e.g., '^tata', 'aa$'
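The sketch below tries several of these patterns with Python's re module on short made-up strings, to show what each construct matches. The test strings are assumptions chosen for this example. The back-reference pattern is written as a raw string (r'...') because in an ordinary Python string '\1' would be read as a control character rather than a back-reference.

```python
import re

# Quantifiers: * and + are greedy, so they run to the last possible 'g'.
star = re.search('ct.*g', 'acttgcg').group()
print(star)      # cttgcg
exact = re.search('ct{2}g', 'acttgcg').group()
print(exact)     # cttg

# Grouping and back-reference: \1 and \2 repeat what groups 1 and 2 matched.
backref = re.search(r'(.)(.)aa\1\2', 'ttgcaagctt').group()
print(backref)   # gcaagc

# Alternatives: either 'ct' or 'ca'.
alts = re.findall('ct|ca', 'ctgcat')
print(alts)      # ['ct', 'ca']

# Character set: fullmatch succeeds only if the whole string fits.
valid = re.fullmatch('[acgt]+', 'acgt') is not None
invalid = re.fullmatch('[acgt]+', 'acgtn') is not None
print(valid, invalid)  # True False

# Anchors: ^ ties the match to the start, $ to the end.
starts = re.search('^tata', 'tataaa') is not None
ends = re.search('aa$', 'tataaa') is not None
print(starts, ends)    # True True
```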
Regular Expressions (2)
Usage
search: finds the first match of the pattern in a string and returns it as a MatchObject instance (or None if there is no match)
import re
pos = re.search('TATA.*AA', seq)
print(pos.start())
findall: finds all non-overlapping matches of the pattern in a string and returns a list of the matches
import re
matches = re.findall('TATA.*AA', seq)
print(matches)
finditer: finds all non-overlapping matches of the pattern in a string and returns an iterator that yields MatchObject instances
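finditer is useful when both the matched text and its position are needed. A minimal sketch follows, using a made-up sequence; the non-greedy pattern 'TATA.*?AA' is a variation introduced here (the ? after * stops at the nearest 'AA'), so that the sequence yields more than one match.

```python
import re

seq = 'GGTATACCAAGGTATAGAACC'  # hypothetical sequence for illustration

# Each item produced by finditer is a match object with start(), end(), group().
hits = [(m.start(), m.end(), m.group()) for m in re.finditer('TATA.*?AA', seq)]
for start, end, text in hits:
    print(start, end, text)
# 2 10 TATACCAA
# 12 19 TATAGAA

# The greedy form runs to the last 'AA', so it yields a single long match.
greedy = re.findall('TATA.*AA', seq)
print(greedy)  # ['TATACCAAGGTATAGAA']
```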
Biological Applications
Parsing Sequences
Sequence Validation
Motif Search
Sequence Transformation
DNA Replication
Transcription from DNA to RNA
Translating RNA into Protein
DNA Sequence Mutation
Parsing Sequences (1)
Single Sequence in FASTA Format
>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
YTIIGQMASILYFSIILAFLPIAGXIENY
Parsing
Make a function to return the sequence from the FASTA format
def read_FASTA_seq(filename):
    with open(filename) as f:
        # Drop the header line, then join the remaining lines
        return f.read().partition('\n')[2].replace('\n', '')
Parsing Sequences (2)
Multiple Sequences in FASTA Format
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Parsing ?
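One possible approach to the multi-sequence case is sketched below: walk the text line by line, start a new entry at every line beginning with '>', and append all other lines to the current entry. The function names read_FASTA_entries and read_FASTA_file and the short sample text are assumptions made for this sketch, not part of the lecture.

```python
def read_FASTA_entries(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    entries = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith('>'):
            header = line[1:]         # drop the leading '>'
            entries[header] = ''
        elif header is not None:
            entries[header] += line   # sequence lines are concatenated
    return entries


def read_FASTA_file(filename):
    """Read a FASTA file and parse all of its entries."""
    with open(filename) as f:
        return read_FASTA_entries(f.read())


# Shortened, made-up sample in the same layout as the slide.
sample = '>SEQUENCE_1\nMTEITAAM\nVKELRE\n>SEQUENCE_2\nSATVSEI\n'
seqs = read_FASTA_entries(sample)
print(seqs['SEQUENCE_1'])  # MTEITAAMVKELRE
print(seqs['SEQUENCE_2'])  # SATVSEI
```

A dictionary keyed by header is only one design choice; a list of (header, sequence) pairs would also preserve duplicate headers.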
Sequence Validation (1)
DNA Sequence Validation
Make a function to check that the sequence consists of 'A', 'T', 'C', and 'G' only
# Version 1: check each base in turn
def validate_dna(base_sequence):
    seq = base_sequence.upper()
    for base in seq:
        if base not in 'ACGT':
            return False
    return True
# Version 2: compare the length with the total count of valid bases
def validate_dna(base_sequence):
    seq = base_sequence.upper()
    return len(seq) == (seq.count('T') + seq.count('C') +
                        seq.count('A') + seq.count('G'))
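The same check can also be phrased with a set or a regular expression. The two variants below are sketches with hypothetical names (validate_dna_set, validate_dna_re); like the versions above, they treat an empty string as valid.

```python
import re


def validate_dna_set(base_sequence):
    # Valid if the set of characters used is a subset of the four bases.
    return set(base_sequence.upper()) <= set('ACGT')


def validate_dna_re(base_sequence):
    # fullmatch requires the whole string to consist of A, C, G, or T.
    return re.fullmatch('[ACGT]*', base_sequence.upper()) is not None


print(validate_dna_set('atcgATGA'))  # True
print(validate_dna_set('ATCGXTGA'))  # False
print(validate_dna_re('ATCGNTGA'))   # False
```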
Sequence Validation (2)
Counting Base Frequency
Make a function to calculate the percent of ‘C’ and ‘G’ in a DNA sequence
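# Both versions return a fraction between 0 and 1;
# multiply by 100 for a percentage
# Version 1: count matching bases in a loop
def percent_of_GC(base_sequence):
    seq = base_sequence.upper()
    count = 0
    for base in seq:
        if base in 'CG':
            count += 1
    return float(count) / len(seq)
# Version 2: use count()
def percent_of_GC(base_sequence):
    seq = base_sequence.upper()
    return float(seq.count('G') + seq.count('C')) / len(seq)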
Motif Search
Searching Substring
Make a function to take a sequence and a motif and return the position(s)
of matching in the sequence
# Returns the position of the first match, or -1 if there is none
def motif_search(seq, motif):
    return seq.find(motif)
# Returns the positions of all non-overlapping matches
def all_motif_search(seq, motif):
    pos = []
    idx = seq.find(motif)
    while idx >= 0:
        pos.append(idx)
        # Resume the search just after the current match
        idx = seq.find(motif, idx + len(motif))
    return pos
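Motifs can overlap themselves (for example 'ATA' in 'ATATATA'). Since str.find accepts a start offset, a search can resume one position after each hit and report overlapping occurrences too. The function name all_motif_search_overlapping is a hypothetical one used for this sketch.

```python
def all_motif_search_overlapping(seq, motif):
    """Return the start positions of all matches, including overlapping ones."""
    pos = []
    idx = seq.find(motif)
    while idx != -1:
        pos.append(idx)
        # Move forward by one base only, so overlapping matches are found.
        idx = seq.find(motif, idx + 1)
    return pos


print(all_motif_search_overlapping('ATATATA', 'ATA'))  # [0, 2, 4]
print(all_motif_search_overlapping('ATATATA', 'GG'))   # []
```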
Transcription
Simulating Transcription
Make a function to transcribe a DNA into an RNA
def transcription(dna):
    return dna.replace('T', 'U')
Translation (1)
Making Genetic Code
Make a function to translate a codon to an amino acid
def codon2aa(codon):
    genetic_code = { 'UUU': 'F', 'UUC': 'F',
                     'UUA': 'L', …… }
    if codon in genetic_code.keys():
        return genetic_code[codon]
    else:
        return 'Error'
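Typing all 64 entries by hand is error-prone. One common shortcut, sketched here under the assumption that the standard genetic code is wanted, builds the table from a 64-letter string listing the amino acids in U, C, A, G codon order. The names BASES, AMINO_ACIDS, GENETIC_CODE, and codon2aa_full are chosen for this sketch, and writing stop codons as '*' is a convention adopted here, not something taken from the slide.

```python
from itertools import product

BASES = 'UCAG'
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
# '*' marks the three stop codons (UAA, UAG, UGA).
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'
               'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR'
               'VVVVAAAADDEEGGGG')

# product() varies the last position fastest, matching the string's order.
GENETIC_CODE = {''.join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon2aa_full(codon):
    # Unknown or incomplete codons map to 'Error', as in the slide's version.
    return GENETIC_CODE.get(codon, 'Error')


print(len(GENETIC_CODE))     # 64
print(codon2aa_full('AUG'))  # M
print(codon2aa_full('UAA'))  # *
print(codon2aa_full('AU'))   # Error
```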
Translation (2)
Simulating Translation
Make a function to translate an RNA into a protein sequence
def translation(rna):
    protein = ''
    for n in range(0, len(rna), 3):
        protein += codon2aa(rna[n:n+3])
    return protein
Translation (3)
Simulating Translation (cont'd)
Make a generator function which returns values from a series it computes
def aa_generator(rna):
    return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))
def translation(rna):
    gen = aa_generator(rna)
    protein = ''
    # next(gen, None) returns None once the generator is exhausted,
    # instead of raising StopIteration
    aa = next(gen, None)
    while aa:
        protein += aa
        aa = next(gen, None)
    return protein
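A generator can also be consumed without calling next by hand: a for loop or str.join pulls values until the series ends. The sketch below shows the join form. To keep it self-contained, codon2aa is redefined with a tiny stand-in table of four codons, which is an assumption for illustration only.

```python
def codon2aa(codon):
    # Tiny stand-in table for illustration; the real table has 64 codons.
    genetic_code = {'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L'}
    return genetic_code.get(codon, 'Error')


def aa_generator(rna):
    # Generator expression: amino acids are computed one at a time, on demand.
    return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))


def translation(rna):
    # join() iterates over the generator until it is exhausted.
    return ''.join(aa_generator(rna))


print(translation('AUGUUUUUA'))  # MFL
```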
Mutation
Simulating Mutation
Make a function to simulate single point mutations in a DNA sequence
import random
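def mutation(dna):
    position = random.randint(0, len(dna) - 1)
    # Exclude the current base so the mutation always changes the sequence
    bases = 'ACGT'.replace(dna[position], '')
    new_base = bases[random.randint(0, 2)]
    # Strings are immutable, so build a new string instead of assigning to a slice
    return dna[:position] + new_base + dna[position+1:]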
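One way to sanity-check a point-mutation function is to count the positions at which the original and mutated sequences differ; for a single point mutation that count should be exactly 1. The sketch below is self-contained: point_mutation is a variant written for this example, and count_differences is a hypothetical helper.

```python
import random


def point_mutation(dna):
    """Return a copy of dna with one randomly chosen base replaced."""
    position = random.randint(0, len(dna) - 1)
    # Choose among the three bases that differ from the current one.
    choices = 'ACGT'.replace(dna[position], '')
    new_base = random.choice(choices)
    # Strings are immutable, so assemble a new string around the new base.
    return dna[:position] + new_base + dna[position + 1:]


def count_differences(seq1, seq2):
    """Count positions where two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


original = 'ATCGATGA'
mutant = point_mutation(original)
print(original, mutant)
print(count_differences(original, mutant))  # 1
```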
Questions?
Lecture Slides are found on the Course Website,
web.ecs.baylor.edu/faculty/cho/3360