Let my dataset change your mindset. (Hans Rosling)


This site accompanies the 2nd edition of my book Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R, published by Springer in 2013.

Buy it in the iTunes Store (iBook)
Buy it at Amazon
Buy it at Springer (print or eBook)

Read why I wrote this book (in German): Author Interview

The Book

This largely expanded 2nd edition provides a practical introduction to the Linux command line, shell programming, Sed, AWK, Perl, relational databases with MySQL, and statistical computing with R for students and practitioners in the life sciences. Although written for beginners, experienced researchers in areas involving bioinformatics and computational biology may benefit from numerous tips and tricks that help to process, filter, and format large datasets. Learning by doing is the basic concept of this book. Worked examples illustrate how to employ data processing and analysis techniques, e.g. for the comparison of bacterial genomes, homology modeling, sequence assembly, and the analysis of protein structure data (see Part VI).

All software tools and datasets used are freely available. One section is devoted to explaining the setup and maintenance of Linux as an operating-system-independent virtual machine. The author's experience and knowledge gained from working and teaching in both academia and industry form the foundation of this practical approach.

From the Forewords

Corrections

Chapters

Part I - Whetting Your Appetite

01 - Introduction

This chapter gives a short historical introduction to bioinformatics and computational biology and compares both disciplines.

02 - Content of This Book

This chapter sketches the main topics of this book, namely shell programming, Sed, AWK, Perl, R, and MySQL. It also describes the conventions and prerequisites for working with this book.

Part II - Computer and Operating Systems

03 - Unix/Linux

This chapter’s intention is to give you an idea of what operating systems in general, and Linux in particular, are. It is not absolutely necessary to know all this. However, since you are going to work with these operating systems, you are taking part in their history! I feel that you should have heard of (and learned to appreciate) the outline of this history. Furthermore, it gives you some background on the system we are going to work with.

Part III - Working with Linux

04 - The First Touch

In this chapter, you will learn the basics needed to work on a Unix-based computer: logging in, executing commands, and logging out. It also explains how to set up and maintain an Ubuntu Linux-based virtual machine on top of any operating system (e.g. Windows or Mac OS X). This makes the examples shown in this book independent of your preferred operating system.

05 - Working with Files

This chapter introduces working with files and directories in the Linux world.
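
A flavour of the commands covered; the file and directory names below are just placeholders:

    pwd                      # where am I?
    ls -l                    # list the current directory in detail
    mkdir project            # create a new directory
    cp data.txt project/     # copy a file into it
    rm project/data.txt      # remove the copy again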

06 - Remote Connections

How to connect to a remote Unix/Linux server? How to exchange files with this remote server? How to download Web content via the command line? How to create backups? This chapter provides you with the proper knowledge.
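
A few examples of the kind of commands involved; host, user, and file names are placeholders:

    ssh student@server.example.org                       # open a remote shell
    scp results.txt student@server.example.org:~/data/   # copy a file to the server
    wget https://example.org/genome.fasta                # download web content from the command line
    tar -czf backup.tar.gz ~/project                     # pack a directory into a compressed backup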

07 - Playing with Text and Data Files

How to look at gigabyte-sized text files? How to extract relevant or unique lines? How to compare text files? And how to edit them? This chapter covers basic command-line tools and introduces you to the powerful Vim text editor.
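
To give you a first taste (the file names are placeholders):

    less huge_table.txt                # page through a gigabyte-sized file
    grep "kinase" huge_table.txt       # extract the relevant lines
    sort huge_table.txt | uniq         # reduce the file to unique lines
    diff old_table.txt new_table.txt   # compare two text files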

08 - Using the Shell

The power of Linux lies in its shell. Pipes, redirections, aliases, batch jobs, scheduled commands? Playing with processes? This chapter introduces you to all of this with the Bash shell.
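
A small sample of what that looks like in practice (script and file names are placeholders):

    grep -c ">" sequences.fasta > counts.txt    # count FASTA headers and redirect the result into a file
    sort data.txt | uniq -c | sort -rn | head   # chain commands with pipes: most frequent lines first
    alias ll='ls -lh'                           # define an alias
    nohup ./long_job.sh &                       # run a batch job in the background
    crontab -e                                  # edit your scheduled commands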

09 - Installing BLAST and ClustalW

In this chapter, you will learn how to install small programs. As examples, we are using BLAST (Basic Local Alignment Search Tool) and ClustalW. BLAST is a powerful tool to find similar sequences in a database. ClustalW is a general-purpose multiple sequence alignment program for nucleic acid or protein sequences.
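
On an Ubuntu-based system, for example, both tools can usually also be installed from the package repositories (the package names below are those of recent Ubuntu releases and may differ from the manual installation described in the book):

    sudo apt-get update
    sudo apt-get install ncbi-blast+ clustalw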

10 - Shell Programming

With this chapter we enter a new world. Before, you processed files by executing a number of commands. Now, we are going to put the commands together, blend them with some logic, and write our first shell scripts this way. IF you work through this chapter, THEN you know basic programming techniques. Hello World!
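
A minimal sketch of what such a first script might look like (not taken verbatim from the book):

    #!/bin/bash
    # hello.sh - a first shell script with a touch of logic
    name="World"
    if [ -n "$name" ]; then
        echo "Hello, $name!"
    fi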

11 - Regular Expressions

Regular expressions are EXTREMELY useful, and I cannot overemphasize the word extremely. Regular expressions provide a brief, flexible, and comprehensive way to specify or recognize text strings, e.g. character patterns. They are available in virtually all Linux tools and programming languages. A must-read chapter.
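
Two small examples of the flavour of patterns you will learn to write (the file names are placeholders):

    grep -E "GAATTC" sequences.txt    # lines containing an EcoRI recognition site
    grep -E "^>" sequences.fasta      # lines starting with '>' (FASTA headers)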

12 - Sed

Sed (the stream editor) is a non-interactive, line-oriented editor. Need to exchange decimal delimiters or replace text strings in a gigabyte-sized file? Sed is your solution.
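
A taste of Sed in action; the file names are placeholders, and the first command assumes commas occur only as decimal delimiters:

    sed 's/,/./g' values.txt > values_dot.txt         # decimal commas to decimal points
    sed 's/contig/scaffold/g' report.txt > fixed.txt  # replace a text string throughout a file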

Part IV Programming

13 - AWK

AWK is a great language both for learning programming and for processing text-based data files. In 99% of cases you will work with text-based files, be it data tables, omics data, or whatever. Apart from being simple to learn and having a clear syntax, AWK gives you the possibility to construct your own commands. Thus, the language may grow with you as you grow with the language. Hello World, again. This chapter closes with an example of how to measure sequence distances by dynamic programming. With the basic knowledge of AWK presented in this chapter, you are well prepared to master large-scale data processing and to learn any other programming language, should you feel like doing so.
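
A one-liner in that spirit, working on a hypothetical tab-separated gene table (columns: name, start, end, chromosome):

    awk -F '\t' '$4 == "chr1" {print $1, $3 - $2}' genes.tsv   # name and length of all genes on chr1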

14 - Perl

Perl was initially developed to integrate features of Sed and AWK within the framework provided by the shell. As you learned before, AWK is a programming language with powerful string manipulation commands and regular expressions that facilitate file formatting and analysis. The stream editor Sed nicely complements AWK. This chapter gives a short introduction to Perl.
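
For a flavour, here is a Perl one-liner used like Sed or AWK straight from the command line (the FASTA file name is a placeholder):

    perl -ne 'chomp; print length($_), "\n" unless /^>/' sequences.fasta   # length of every sequence line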

15 - Other Programming Languages

Every book needs a short chapter—this is mine. It presents an example of how to proceed with programming.

Part V - Advanced Data Analysis

16 - Relational Databases with MySQL

This chapter will introduce you to the heart of data management, i.e., databases. In particular, it will show you how to bring your own data into a relational MySQL database. “Relational” means essentially that your data are organized in tables that are related to each other. The database engine we are going to use is the freely available, open source MySQL. It is available for all operating systems.
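
As a small, hypothetical sketch (database name, user, table, and values are placeholders), SQL statements can even be fed to the MySQL client straight from the shell:

    mysql -u student -p biodb <<'SQL'
    CREATE TABLE genes (name VARCHAR(30), chromosome VARCHAR(10), length INT);
    INSERT INTO genes VALUES ('geneA', 'chr1', 1500);
    SELECT name, length FROM genes WHERE chromosome = 'chr1';
    SQL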

17 - The Statistics Suite R

Data analysis commonly involves visualization of data. This includes both plotting the data themselves and plotting properties of the dataset like frequencies of certain numbers. R is a well-established platform for scientists in general and computational biologists in particular. Besides being a programming environment for statistical computing, R is also a data visualization tool. Several topic-specific packages are available to analyze and visualize experimental data, e.g., for evolutionary biology, the evaluation of biochemical assays, nucleotide and amino acid sequence analysis, microarray data interpretation, and more. This chapter provides a basic introduction, exemplifies statistical data analysis, demonstrates installation of additional packages, and shows how to retrieve data from a MySQL database.
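
A small, hypothetical session started from the shell (file and column names are placeholders):

    R --quiet --no-save <<'RCODE'
    d <- read.table("lengths.tsv", header = TRUE, sep = "\t")   # read a tab-separated table
    summary(d$length)                                           # basic descriptive statistics
    pdf("lengths.pdf"); hist(d$length); dev.off()               # histogram written to a PDF file
    RCODE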

Part VI - Worked Examples

18 - Genomic Analysis of the Pathogenicity Factors from E. coli Strain O157:H7 and EHEC Strain O104:H4

This guided exercise will show you how to use BLAST to compare two genomes. The objective is to find open reading frames (ORFs) that are unique to one genome. This exercise uses a local installation of BLAST+, AWK, MySQL, and R.
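
The core BLAST steps look roughly like this (file names are placeholders; the exercise itself adds AWK, MySQL, and R for the downstream filtering):

    makeblastdb -in EHEC_O104H4.fasta -dbtype nucl                                  # build a nucleotide database from one genome
    blastn -query O157H7_orfs.fasta -db EHEC_O104H4.fasta -outfmt 6 -out hits.tsv   # search it with the other genome's ORFs
    # ORFs without a hit in hits.tsv are candidates for being unique to one genome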

19 - Limits of BLAST & Homology Modeling

Very often it is difficult to find homologous sequences just by using BLAST. This is because poor expectation values (E-values) do not necessarily mean poor results. This project will teach you that it can help to look at the secondary and tertiary structure as well (homology modeling) and to apply a priori knowledge. This exercise uses BLAST, ClustalW, Jpred, SWISS-MODEL, Jmol, and AWK.

20 - Virtual Sequencing of pUC18c

The objective of this exercise is to virtually sequence the pUC18c cloning vector. It is divided into the following major parts: a) downloading the cloning vector, b) running a virtual sequencing program, and c) assembling the resulting virtual sequence reads with the TIGR Assembler. The quality of a particular assembly is visualized with a dot plot. The ultimate goal is to understand the effect of sequence coverage and sequence fragment length on the assembly process. This exercise uses a virtual sequencer, the TIGR Assembler, Dotter, R, and shell programming.

21 - Querying for Potential Redox-Regulated Enzymes

The scientific background of this example is biochemistry. The activity of some enzymes is regulated by the formation or cleavage of disulfide bonds near the protein surface. This can be catalyzed by a group of proteins known as thioredoxins. Comparison of primary structures reveals that there is no consensus motif present in most of the thioredoxin-regulated target enzymes. In order to identify proteins that could be targets for thioredoxin, one can investigate protein structure data. We just need to search for cysteine sulfur atoms that are no further than 3 Å apart and close to the surface. But how? This exercise uses Jmol, Surface Racer, AWK, and shell programming.
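
The distance part of this query can be sketched with a few lines of AWK working on the fixed columns of a PDB file (surface accessibility, handled by Surface Racer, is ignored here; the file name is a placeholder):

    # print pairs of cysteine SG atoms that are no more than 3 Angstrom apart
    awk 'substr($0,1,4)=="ATOM" && substr($0,18,3)=="CYS" && substr($0,13,4) ~ "SG" {
           n++
           x[n] = substr($0,31,8); y[n] = substr($0,39,8); z[n] = substr($0,47,8)
           id[n] = substr($0,18,10)                       # residue name, chain, and number
         }
         END {
           for (i = 1; i < n; i++)
             for (j = i+1; j <= n; j++) {
               d = sqrt((x[i]-x[j])^2 + (y[i]-y[j])^2 + (z[i]-z[j])^2)
               if (d <= 3) print id[i], "<->", id[j], d
             }
         }' protein.pdb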