Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Programming for Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
ms@mpi-cbg.de
Biotec
The module…
will teach students basic programming skills relevant to
bioinformatics, which will enable them to actively develop
bioinformatics tools.
will take a problem-driven approach.
will present bioinformatics problems and show how to solve
them using existing online tools and how to implement such
tools.
will revisit some of the problems and databases discussed in
applied bioinformatics.
will be very practical and hands-on approach to basic
computer science tools such as using command line operating
systems, programming in Python, and using relational
databases.
Objectives
Students will have an understanding of different operating
systems
Students will be able to automate simple repetitive
information retrieval tasks
Students will be able to write simple programs in Python
Students will be able to work with relational databases
Students will appreciate the principles, limits, and
possibilities of programming
Students will be able to formulate biological questions as
information processing problems
Students will understand when and how programming can
help to automate bioinformatics problems
Module Structure
Introduction
Databases
Introduction to SQL
A Little Exercise
A Little Science
Introduction to Python
Data types and loops
Sequences and lists
Patterns and functions
Dictionaries
Advanced topics
More Python
Dynamic programming
Clustering
Revision Class
Books
You will need two books for the module: a
reference book on MySQL and a book on Python
Books: Python
We will follow a number of online
resources.
(see course web page)
Further, we look in Python in a Nutshell,
Alex Martelli, O’Reilly
Wesley Chun's Core Python
Programming
Python Cookbook (O’Reilly)
The publisher O’Reilly has many
general programming books on linux,
python, etc.
They allow you to read all books for 2
weeks online for free. This is very nice
to decide what to buy and what not.
You can also buy electronic copies of
the book.
Books: MySQL
There are many, many
books on MySQL
The following two are just
sugestions, as there are
many other books covering
the same material
MySQL Cookbook by Paul
DuBois, O'Reilly or
MySQL by Paul DuBois,
Michael Widenius, O'Reilly
Structure of Labs
Databases
Lab 1,2: Simple SQL
Lab 3,4: SQL to answer interesting scientific questions
Python
Lab 5: Data types and loops, accessing a DB from Python
Lab 6: Sequences and lists
Lab 7: Patterns and Functions
Lab 8: Dictionaries
Lab 9: BioPython
Lab 10: Python & PyMOL
More Python:
Lab 11: Dynamic programming revisited
Lab 12: Clustering revisited
Lab 13: Revision
Assessment
Lab
Exercises:
Each week during the lab you get exercises which you have
to do during the lab and finish on your own during the week
These exercises need to be handed in on paper at the next
lecture
Results are discussed during the labs and as part of the
assessment you will have to present a solution once
Doing the exercises is compulsory, but there are no marks
Project
You will demonstrate your programming skills by implementing
and presenting a software project
Exam
Pen and paper exam on material covered in lecture
Programming Project
Goal: Demonstrate ability to use SQL and
programming
Goal 2: Produce science movie for Long Night of
Science
You will work in a team and get a biological problem.
Part 1: Programming: You have to implement some
workflows, which integrate data from various sites and
use various tools programmatically. This includes an
animation of your target protein in PyMol.
Part 2: Make a movie. Tell the story about your protein
based on the data collected and analysis carried out.
Create a story board and turn all material and Pymol
animations into a movie.
Motivation: Databases
In the last term,
we accessed most information online via the web
we interacted directly and manually with databases and tools
we had to manually submit queries, interpret results. select interesting
results, cut&paste them, and submit queries again,…
Pro:
Reasonably easy to get hold of information
Con:
Not possible to ask many queries
Queries limited by interface provided by web page
Difficult/impossible to integrate information from different sites
In this term, we will look at the databases underlying the online
front ends
How is the data internally stored?
How can we - and more important computer programs - directly interact
with the underlying data, so that we can ask more powerful queries, large
queries, and integrate different systems
What actually happens
You are limited by what web server
allows you to ask:
Example CATH:
•PDB ID,
•CATH code, or
•General text
But you cannot ask:
•In how many different PDB
structures is there a P-loop domain?
•Is there a PDB entry with a P-loop
and a DNA-binding domain
•How many different superfamilies
does the largest structure in PDB
have?
•With direct access to the underlying
database you could answer all these
questions (and many more)
Motivation: SCOP as Relational Database
We worked with SCOP, the Structural Classification of Proteins
Family: >30% sequence identity
Superfamily: Similar structure and function (possibly lower 30% sequence
identity)
Family
Same
Superfamily,
But not family
30%
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Motivation: Databases
We wish to answer the following questions:
How many families and superfamilies are there?
Do all superfamilies roughly have the same number of families?
How many families does the immunoglobulin superfamily have?
Which superfamily has the most families and how many?
How many percent of superfamilies have only one family?
Which PDB structure has the largest number of distinct
superfamilies?
How many percent of PDB structures have only one type of
superfamily, how many percent have at least two?
Which is the most popular superfamily?
Are all superfamilies equally likely to co-occur or do they have
preferences?
Which superfamily has the most co-occurrence partners?
Is the number of co-occurrence partners and the frequency of the
superfamily correlated?
What is a Database
SCOP contains relevant information, but we cannot answer the
above questions through the web-interface of SCOP
The problem is that we do not have access to the underlying
database
What is a database anyway?
A database provides…
Logical organization of data
data models, schema design, dictionaries
Physical organization of data
Fast retrieval, indexing, compact storage of data
Relational Database
Central Idea: Data as relations in a table
E.g. Employee
+-------+------+---------+---------+
| id
| name | salary | role
|
+-------+------+---------+---------+
| 46457 | pete | 50.000 | director|
| 46458 | jane | 60.000 | nurse
|
| 46459 | asif | 70.000 | driver |
+-------+------+---------+---------+
Relational Database
Central Idea: Data as relations in a table
E.g. SCOP, Structural Classification of Proteins
+-------+------+---------+---------+--------------------------------------+
| id
| type | sccs
| sid
| description
|
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf
| a.1
| | Globin-like
|
| 46458 | sf
| a.1.1
| | Globin-like
|
| 46459 | fa
| a.1.1.1 | | Truncated hemoglobin
|
| 46460 | dm
| a.1.1.1 | | Truncated hemoglobin
|
| 46461 | sp
| a.1.1.1 | | Ciliate (Paramecium caudatum)
|
| 14982 | px
| a.1.1.1 | d1dlwa_ | 1dlw A:
|
| 46462 | sp
| a.1.1.1 | | Green alga (Chlamydomonas eugametos) |
| 14983 | px
| a.1.1.1 | d1dlya_ | 1dly A:
|
| 63437 | sp
| a.1.1.1 | | Mycobacterium tuberculosis
|
| 62301 | px
| a.1.1.1 | d1idra_ | 1idr A:
|
+-------+------+---------+---------+--------------------------------------+
SCOP Tables
mysql> select * from cla limit 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid
| pdb_id | sccs
| cl
| cf
| sf
| fa
| dm
| sp
| px
|
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw
| a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> select * from des limit 1;
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
mysql> select * from astral limit 1;
+---------+---------+-----------------------------------------------------------+
| sid
| sccs
| seq
|
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+
mysql> select * from subchain limit 1;
+----+-------+----------+-------+------+
| id | px
| chain_id | begin | end |
+----+-------+----------+-------+------+
| 1 | 14982 | A
|
|
|
+----+-------+----------+-------+------+
SCOP Tables
mysql> select * from cla limit 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid
| pdb_id | sccs
| cl
| cf
| sf
| fa
| dm
| sp
| px
|
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw
| a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> select * from des limit 1;
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
mysql> select * from astral limit 1;
+---------+---------+-----------------------------------------------------------+
| sid
| sccs
| seq
|
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+
mysql> select * from subchain limit 1;
+----+-------+----------+-------+------+
| id | px
| chain_id | begin | end |
+----+-------+----------+-------+------+
| 1 | 14982 | A
|
|
|
+----+-------+----------+-------+------+
Querying Relational Databases
SQL = Structured Query
Language
Select Which attributes?
from Which tables? where
Which conditions?
Select … from … where …
Distinct
Like
Union/intersect
Join
Count/average/sum/min/m
ax
Group by
Having
Show tables
Show databases
Use
Create database
Create table … as
Drop table
Load data
Insert into
Databases
Given SCOP as relational database, we can answer all
the questions raised above using the SQL constructs
of the previous slide!
Programming
We will use Python (Guido van Rossum, named after Monty
Python) as a convenient extension to the operating system
Easy to write quick programs
More than just a scripting language
Interpreted, interactive, indented
Supports string processing well
Widely used in bioinformatics
Object oriented, general purpose
Many nice libraries for database access, Graphics, Web, GUI, R…
Scientific orientation: Numerical Python (math), Scientific Python,
Biopython
Beware: Python is inefficient, but computationally expensive parts
can be included as C-libraries
Motivation: Families and Identity
We said that SCOP families share >30% identity
What does that mean?
Any two structures in a family >30%?
At least one other member in family with >30%?
What is the average sequence similarity within a
family? Within a superfamily?
Given a sequence and that we know already which
superfamily it belongs to. Can we find the
superfamily’s family best suited for the sequence
Two approaches: Blast vs. DIY
We can answer the above easily:
We use SCOP database and run database queries
from a Python script
For a given superfamily select all corresponding
sequences from the astral table
For all pairs of selected sequences
Call Blast and record the sequence identity
Or run your own dynamic programming algorithm and
record the sequence identity
For second problem: Compare sequence to all family
sequences and assign it to the family which shares
the highest (must be >30%) similarity with the
sequence
Motivation: Sequence vs. Structure
Can we verify the plot below?
Can we create a similar plot for specific
superfamilies? E.g. DNA-binding domains?
Family
Same
Superfamily,
But not family
30%
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Motivation: Sequence vs. Structure
Again: select the relevant sequences from the
astral table and besides computing the
sequence identity, we compute structural
similarity to the relevant structure using an
algorithm like Dali or CE
Then plot the two similarities against each other in
a scatter plot
Motivation:
Amino Acid Composition of Families
Can we characterise the amino acid composition of
different families/superfamilies?
Again: select the relevant sequences from astral
and count the frequencies of amino acids
Is the amino acid composition at the interface of a
domain different from the rest of the domain?
Motivation:
Let’s rebuild SCOP families
Given a SCOP superfamily and its sequences, how
can we divide it into families?
First, we need dynamic programming to determine
the sequence similarity
Then we do the following:
For all pairs of sequences, call the sequence
similarity algorithm and record the similarity into a
distance matrix
Next, run hierarchical clustering to cluster the
sequences.
What’s needed…
…programming in Python
Python Programming Constructs
Variables, strings,
For/while Loops
If statements
File I/O
Regular expressions
Data structures: Lists, Hashes
Code Structure: Objects, classes, modules
Hello World in Python
Given a file helloworld.py
Open a shell and type at the command prompt
helloworld.py
The shell then executes your programme
In the first line, it realises that the python interpreter
needs to be loaded and that what follows is a python
program
The line below prints a message
File: helloworld.py
print "Hello World"
Read a text file in python
The command open opens a text file and creates
“r” as second argument after the filename indicates that file is
read (this is default, ie. can be left out)
“w” as second argument indicates that file is written to
“a” as second argument indicates that file is appended to
The for-loop reads all lines of the file one by one (requires python
>2.2)
The body of the loop prints them on the screen (note that print
adds a new line automatically, avoid that with adding a ”,” )
File: fileIO.py
data = open("seq.txt“, “r”)
File: seq.txt for line in data:
print "Line:”, line,
acgt
gggt
Output
Line: acgt
Line: gggt
Variables in Python
The = symbol is used to assign values to variables
The + symbol is also used to concatenate strings
File: fileIO.pl
lineNo = 1
File: seq.txt for line in open(“seq.txt”): Output
1: acgt
acgt
print lineNo+”: ”+line,
2: gggt
gggt
lineNo = lineNo+1
File: seqcomp.py
If-thenelse and
strings in
Python
data = open("seq.txt")
line1 = data.readline().rstrip()
line2 = data.readline().rstrip()
len1=len(line1)
len2=len(line2)
if len1 < len2:
minLen = len1
else:
minLen = len2
line3 = ""
for i in range(minLen):
if line1[i] == line2[i]:
line3=line3+"*"
else:
line3=line3+" "
print
print
print
print
"Sequence comparison"
line1
line2
line3
File: seq.txt
acgt
gggt
Output
Sequence comparison
acgt
gggt
**
Programming
Example