Let my dataset change your mindset. (Hans Rosling)
This site accompanies the 2nd edition of my book entitled Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R, published by Springer in 2013.
Buy it in the iTunes Store (iBook)
Buy it at Amazon
Buy it at Springer (wooden or eBook)
Read why I wrote this book (German): Author Interview
This greatly expanded 2nd edition provides a practical introduction to
- data processing with Linux tools and the programming languages AWK and Perl
- data management with the relational database system MySQL, and
- data analysis and visualization with the statistical computing environment R
for students and practitioners in the life sciences. Although written for beginners, experienced researchers in areas involving bioinformatics and computational biology may benefit from numerous tips and tricks that help to process, filter and format large datasets. Learning by doing is the basic concept of this book. Worked examples illustrate how to employ data processing and analysis techniques, e.g. for
- finding proteins potentially causing pathogenicity in bacteria,
- supporting the significance of BLAST with homology modeling, or
- detecting, based on their structure, candidate proteins that may be redox-regulated.
All software tools and datasets used are freely available. One section is devoted to explaining the setup and maintenance of Linux as an operating-system-independent virtual machine. The author’s experiences and knowledge gained from working and teaching in both academia and industry constitute the foundation for this practical approach.
From the Forewords
I am convinced that this book should be the required reading for every molecular biologist. It will of course be particularly helpful for those dealing with genomics data, but even if genomics is currently not on your experimental agenda, handling large datasets and doing proper statistics is a basic qualification that cannot be underestimated in our discipline today.
Prof. Dr. Diethard Tautz, Director at the Max-Planck Institute for Evolutionary Biology, Germany
There are many things I liked about this book. First, the material on Unix/Linux is presented in a no-nonsense manner that would be familiar and appealing to any Unix/Linux programmer. It’s clear the author has internalized the powerful Unix/Linux building-block approach to problem solving. Second, the book is written in a lively and engaging style. It is not a turgid user manual. Finally, throughout the book the author admonishes the reader to write programs continuously as he or she reads the material. This cannot be overemphasized – it is well known that the only way to learn how to program effectively is by writing and running programs.
Prof. Alfred V. Aho, PhD, Lawrence Gussman Professor, Department of Computer Science, Columbia University in the City of New York, USA
- Page 184
- the 1st line in Program 27 has to be deleted
- Page 243
- Program 40 should be named sort-array-elements.awk
- Page 336
- Do the following replacements in Terminal 238 [thanks to Sean A. Rogers]
- replace ecolik12 by allk12 in lines 3 and 5
- replace ecolio157 by allo157 in lines 4 and 7
- Page 339
- The “magic command” should read
R --slave --vanilla < script.r > /dev/null [thanks to Sean A. Rogers]
Part I - Whetting Your Appetite
01 - Introduction
This chapter gives a short historical introduction to bioinformatics and computational biology and compares both disciplines.
02 - Content of This Book
This chapter sketches the main topics of this book, namely shell programming, Sed, AWK, Perl, R, and MySQL. It also describes the conventions and prerequisites for working with this book.
Part II - Computer and Operating Systems
03 - Unix/Linux
This chapter’s intention is to give you an idea about what operating systems in general, and Linux in particular, are. It is not absolutely necessary to know all this. However, since you are going to work with these operating systems you are taking part in their history! I feel that you should have heard of (and learned to appreciate) the outline of this history. Furthermore, it gives you some background on the system we are going to restrain.
Part III - Working with Linux
04 - The First Touch
In this chapter, you will learn the basics in order to work on a Unix-based computer: login, execute commands, logout. It also explains how to set up and maintain an Ubuntu Linux-based virtual machine on top of any operating system (e.g. Windows or MacOS X). This makes the examples shown in this book independent of your preferred operating system.
05 - Working with Files
This chapter introduces working with files and directories in the Linux world.
06 - Remote Connections
How to connect to a remote Unix/Linux server? How to exchange files with this remote server? How to download Web content via the command line? How to create backups? This chapter provides you with the proper knowledge.
07 - Playing with Text and Data Files
How to look at gigabyte-sized text files? How to extract relevant or unique lines? How to compare text files? And how to edit them? This chapter presents basic command-line tools and introduces you to the powerful Vim text editor.
08 - Using the Shell
The power of Linux lies in its shell. Pipes, redirections, aliases, batch jobs, scheduled commands? Playing with processes? This chapter introduces you to the power of the Bash shell.
09 - Installing BLAST and ClustalW
In this chapter, you will learn how to install small programs. As examples, we are using BLAST (Basic Local Alignment Search Tool) and ClustalW. BLAST is a powerful tool to find sequences in a database. ClustalW is a general-purpose multiple sequence alignment program for nucleic acid or protein sequences.
10 - Shell Programming
With this chapter we enter a new world. Before, you processed files by executing a number of commands. Now, we are going to put the commands together, blend them with some logic, and write our first shell scripts this way. IF you work through this chapter, THEN you know basic programming techniques. Hello World!
- add2path.sh / archive-pwd-i.sh / case-cp.sh / chg-pwd.sh / chmod_files.sh / con-sh-exe.sh / date-set.sh / date-vx.sh / dna-test.sh / for-ls.sh / for-num.sh / for-par.sh / for1.sh / grep_file_seq.sh / if-ls.sh / para.sh / select.sh / space-convert.sh / test-par.sh / text.sh / time-signal.sh / trap.sh / triplet-stop.sh / triplet-until.sh / triplet.sh
11 - Regular Expressions
Regular expressions are EXTREMELY useful and I cannot overemphasize the word extremely. Regular expressions provide a brief, flexible and comprehensive way to specify or recognize text strings, e.g. character patterns. They are available in virtually all Linux tools and programming languages. A must-read chapter.
12 - Sed
Sed (stream editor) is a non-interactive, line-oriented editor. Need to exchange decimal delimiters or replace text strings in a gigabyte-sized file? Sed is your solution.
Part IV - Programming
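The decimal-delimiter task mentioned above can be sketched in a single Sed substitution. This is only a minimal illustration, not an excerpt from the book: the function name `comma_to_point` is hypothetical, and the sketch assumes that decimal commas always sit directly between two digits and that fields are separated by something other than a comma.

```shell
# Hypothetical helper: convert decimal commas to decimal points.
# Reads text on standard input and writes the result to standard output.
# Only a comma that sits directly between two digits is replaced, so
# ordinary punctuation such as "a, b" is left alone.
comma_to_point() {
    sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
}
```

Because Sed processes its input line by line, a call such as `comma_to_point < measurements.csv > measurements_points.csv` (file names are placeholders) does not need to load the whole file into memory.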
13 - AWK
AWK is a great language for both learning programming and processing text-based data files. In 99% of cases you will work with text-based files, be it data tables, omics data, or whatever. Apart from being simple to learn and having a clear syntax, AWK provides you with the possibility to construct your own commands. Thus, the language may grow with you as you grow with the language. Hello World, again. This chapter closes with an example of how to measure sequence distances by dynamic programming. With the basic knowledge of AWK presented in this chapter you are well prepared to master large-scale data processing and to learn any other programming language, if you feel like doing so.
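To give a flavor of what measuring sequence distances by dynamic programming can look like in AWK, here is a small sketch. It is an assumption that the Levenshtein (edit) distance is a suitable stand-in; the chapter's own program may use a different distance or scoring scheme. The function name `edit_distance` is hypothetical.

```shell
# Hypothetical helper: Levenshtein (edit) distance between two sequences,
# computed by dynamic programming in AWK.
# Usage: edit_distance SEQ1 SEQ2
edit_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a); m = length(b)
        # d[i,j] holds the distance between the first i characters of a
        # and the first j characters of b.
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                best = d[i-1, j-1] + cost                         # match or substitution
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1    # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1    # insertion
                d[i, j] = best
            }
        }
        print d[n, m]
    }'
}
```

For example, `edit_distance ACGT AGT` prints 1, because deleting the C turns the first sequence into the second.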
14 - Perl
Perl was initially developed to integrate features of Sed and AWK within the framework provided by the shell. As you learned before, AWK is a programming language with powerful string manipulation commands and regular expressions that facilitate file formatting and analysis. The stream editor Sed nicely complements AWK. This chapter gives a short introduction to Perl.
15 - Other Programming Languages
Every book needs a short chapter—this is mine. It presents an example of how to proceed with programming.
Part V - Advanced Data Analysis
16 - Relational Databases with MySQL
This chapter will introduce you to the heart of data management, i.e., databases. In particular, it will show you how to bring your own data into a relational MySQL database. “Relational” means essentially that your data are organized in tables that are related to each other. The database engine we are going to use is the freely available, open source MySQL. It is available for all operating systems.
17 - The Statistics Suite R
Data analysis commonly involves visualization of data. This includes both plotting the data themselves and plotting properties of the dataset like frequencies of certain numbers. R is a well-established platform for scientists in general and computational biologists in particular. Besides being a programming environment for statistical computing, R is also a data visualization tool. Several topic-specific packages are available to analyze and visualize experimental data, e.g., for evolutionary biology, the evaluation of biochemical assays, nucleotide and amino acid sequence analysis, microarray data interpretation, and more. This chapter provides a basic introduction, exemplifies statistical data analysis, demonstrates installation of additional packages, and shows how to retrieve data from a MySQL database.
Part VI - Worked Examples
18 - Genomic Analysis of the Pathogenicity Factors from E. coli Strain O157:H7 and EHEC Strain O104:H4
This guided exercise will show you how to use BLAST to compare two genomes. The objective is to find open reading frames (ORFs) that are unique to one genome. This exercise uses a local installation of BLAST+, AWK, MySQL, and R.
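The core idea, finding ORFs of one genome that have no convincing BLAST hit in the other, can be sketched with AWK alone. This is not the workflow of the chapter, which also involves MySQL and R. The sketch assumes that BLAST+ was run with tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 the E-value); the function name `unique_orfs`, the file arguments, and the default cutoff of 1e-5 are hypothetical.

```shell
# Hypothetical helper: print the ORF IDs that have no BLAST hit at or
# below an E-value cutoff.
#   $1  file with one ORF ID per line (all ORFs of the query genome)
#   $2  BLAST+ tabular output (-outfmt 6) of these ORFs against the other genome
#   $3  E-value cutoff (assumed default: 1e-5)
unique_orfs() {
    awk -v hits="$2" -v cutoff="${3:-1e-5}" '
    BEGIN {
        # Remember every query ID with at least one significant hit.
        while ((getline line < hits) > 0) {
            split(line, f, "\t")
            if (f[11] + 0 <= cutoff + 0) hit[f[1]] = 1
        }
        close(hits)
    }
    # Print only those ORF IDs that were never seen with a significant hit.
    !($1 in hit)
    ' "$1"
}
```

A call might look like `unique_orfs orf_ids.txt blast_hits.tsv 1e-10`, where both file names are placeholders.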
19 - Limits of BLAST & Homology Modeling
Very often it is difficult to find homologous sequences just by using BLAST. This is because bad expectation values (E-values) do not necessarily mean bad results. This project will teach you that it can help to look at the secondary and tertiary structure (homology modeling) as well and to apply a priori knowledge. This exercise uses BLAST, ClustalW, Jpred, SWISS-MODEL, Jmol, and AWK.
20 - Virtual Sequencing of pUC18c
The objective of this exercise is to sequence the pUC18c cloning vector. It is divided into the following major parts: a) downloading the cloning vector, b) running a virtual sequencing program, and c) assembling the virtual sequences obtained from sequencing with the TIGR Assembler. The quality of a particular assembly is visualized with a dot plot. The ultimate goal is to understand the effect of sequence coverage and sequence fragment length on the assembly process. This exercise uses a virtual sequencer, the TIGR Assembler, Dotter, R, and shell programming.
21 - Querying for Potential Redox-Regulated Enzymes
The scientific background of this example is biochemistry. The activity of some enzymes is regulated by formation or cleavage of disulfide bonds near the protein surface. This can be catalyzed by a group of proteins known as thioredoxins. Comparison of primary structures reveals that there is no consensus motif present in most of the thioredoxin-regulated target enzymes. In order to identify potential proteins that could be targets for thioredoxin, one can investigate protein structure data. We just need to search for cysteine sulfur atoms that are no further than 3 Å apart and close to the surface. But how? This exercise uses Jmol, Surface Racer, AWK, and shell programming.
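The distance part of this search can be sketched in AWK as follows. The sketch covers only the 3 Å criterion. It does not test whether the atoms are close to the surface, for which the exercise uses Surface Racer. It assumes a structure file in standard PDB format with fixed columns (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, coordinates in 31-54); the function name `close_cys_pairs` is hypothetical.

```shell
# Hypothetical helper: list pairs of cysteine sulfur (SG) atoms that are
# no further apart than a given distance (default 3.0 Angstrom).
# Reads PDB-format text on standard input.
# Output: one line per pair, "<chain><resnum> <chain><resnum> <distance>".
close_cys_pairs() {
    awk -v max="${1:-3.0}" '
    # Keep only ATOM records describing the SG atom of a CYS residue.
    substr($0, 1, 6) == "ATOM  " && substr($0, 13, 4) == " SG " && substr($0, 18, 3) == "CYS" {
        n++
        resnum = substr($0, 23, 4) + 0
        id[n] = substr($0, 22, 1) resnum
        x[n] = substr($0, 31, 8) + 0
        y[n] = substr($0, 39, 8) + 0
        z[n] = substr($0, 47, 8) + 0
    }
    END {
        # Compare every SG atom with every later one.
        for (i = 1; i < n; i++) {
            for (j = i + 1; j <= n; j++) {
                dx = x[i] - x[j]; dy = y[i] - y[j]; dz = z[i] - z[j]
                dist = sqrt(dx*dx + dy*dy + dz*dz)
                if (dist <= max + 0) printf "%s %s %.2f\n", id[i], id[j], dist
            }
        }
    }'
}
```

A call might look like `close_cys_pairs 3.0 < structure.pdb`, where the file name is a placeholder. The pairwise comparison is quadratic in the number of cysteines, which is unproblematic for single proteins.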