ModularBioSQL

NEW!!!

  • Step-by-step description of the mBioSQL design (here)
  • Detailed (graphical) database schema with notes (mainly for developers; here)
  • Updates on the CFTR R domain project (here)
  • Future plans

Still in my life's Spring
was I, and I wandered out,
and the happy dances of youth
I left in my Father's house.

All my earthly goods, all my possessions
I threw away in happy faith,
and with a light pilgrim's staff
I set out with childish hope.

[...]

Mountains lay in my path,
Rivers checked my feet,
Over sheer abysses I climbed,
Bridges during wild floods.

[...]

Hence to a great sea
where I explored the play of the waves;
Before me lay a vast emptiness,
I have never been nearer my goal.

Ah, no landing stage will then lead,
Ah, the Heaven above me
will never resemble the Earth,
And that there is never as here!

(Schiller: The Pilgrim)

What is ModBioSQL?

Short version: See the picture above.
Long version: Relational Database (RDB) schemas and a collection of Python scripts to handle RDB connections between biological data and analysis tools.
Longer version: During my work I had to do a lot of analysis on sequence and pattern databases available on the web, and also on sequences and data generated in our lab. I realized early that I am very bad at handling files (finding the right one, which may be several months old), and that running analysis programs a thousand times on the same sequence with different parameters wastes time. Moreover, I sometimes have problems that can be solved by simple scripting, given a handy, tractable database background. Therefore I started to use a Relational Database Management System (RDBMS) as my main system for handling data. Since my setup worked well on a range of problems compared to other kinds of databases (e.g. flat files), I wrapped it into a package to show the possibilities and features of biological relational database management systems. I do not claim that ModBioSQL is a perfect solution or that my RDB schemas are the best representation of the biological data (I only think this :-)), but the features and concepts that distinguish ModBioSQL from other solutions should be taken into consideration.

Features and concepts


Acknowledgement

Warnings

Future plans:


Contents

General considerations ModBioSQL

Why local databases?

Notes:

Why relational databases?

RDBMS were developed to handle large amounts of data, providing e.g. consistency and redundancy checks (important in annotations), simpler maintenance, and easier data queries compared to flat file databases (see this paper). These features make RDBMS important basic systems for handling biological data; however, they have some drawbacks:

Notes:

Why ModBioSQL?

I think high-level fine tuning of a biological RDBMS is possible if each individual database has its own schema. BioSQL is an excellent and stable schema that allows data from different sequence databases to be loaded into one schema, but it also has some disadvantages: indexes already exist after table creation, which results in extremely long loads from large flat files; and if you load several databases into one schema, your queries may run longer than they would with one biological database per schema.

Notes:

Why PostgreSQL? (MySQL is also supported)

I prefer PgSQL because it has real RDBMS features, neglected earlier by the MySQL developers, that make programmers' lives easier: e.g. its SQL is closer to the standard; comments can be defined on tables; tables have owners, which prevents biouser from deleting bioroot tables (see the docs); etc. The stable version of MySQL does not support some very basic standard SQL statements, like subqueries, 'DROP USER user', etc. At the moment I use PgSQL, as development with it is easier and faster.

Notes:

Why Python?

It is a scripting language that allows much faster development than an 'application'/'compiled' language like C/C++. Although scripting languages are slower than compiled languages, I think that in most cases with a biological RDBMS the bottleneck is something else (e.g. the run time of the query, or communication between the script and the RDBMS). (Moreover, I do not have time for development, but I do have time to wait for results while doing my wet experiments.) I did find Python iteration and screen output to be slow. My opinion: Python has a cleaner object-oriented syntax and is much easier for biologists to learn than Perl.

Note:

Maintenance

mbs_init.py is used to initialize UniProt, BioLocal, and BioRes databases (arrow a in the top Fig).

UniProt

I split the initialization into 3 steps. In the pre-phase, tables are created without indexes and constraints, and keywlist.txt, dbxref.txt, cclist.txt, and ftlist.txt are loaded into '*_ref' tables. These 'fixed value arrays' make the tables and the indexes smaller and the performance better. In the loading-phase, an adjustable number of records is inserted between each pair of "TRANSACTION" and "COMMIT" statements. In the post-phase, indexes and constraints are created.
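The three phases can be illustrated with a small sketch. This is not the code of mbs_init.py: it uses Python's built-in sqlite3 module instead of PostgreSQL so that it runs anywhere, and the table names (keyword_ref, entry_keyword), the function name, and the batch_size parameter are hypothetical. SQLite cannot add constraints to an existing table, so a unique index stands in for the post-phase constraints.

```python
import sqlite3


def load_in_phases(records, keywords, batch_size=1000):
    """Load (accession, keyword) pairs in three phases and return the connection.

    All names here are hypothetical; they only illustrate the idea.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Pre-phase: create bare tables (no indexes, no constraints) and
    # fill the small '*_ref' table with the fixed list of values.
    cur.execute("CREATE TABLE keyword_ref (id INTEGER, name TEXT)")
    cur.execute("CREATE TABLE entry_keyword (accession TEXT, keyword_id INTEGER)")
    cur.executemany(
        "INSERT INTO keyword_ref VALUES (?, ?)",
        list(enumerate(keywords, start=1)),
    )
    conn.commit()
    keyword_id = {name: i for i, name in enumerate(keywords, start=1)}

    # Loading-phase: insert the records in batches, one transaction per batch.
    # The large table stores only the integer id, not the keyword text.
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        cur.executemany(
            "INSERT INTO entry_keyword VALUES (?, ?)",
            [(accession, keyword_id[name]) for accession, name in batch],
        )
        conn.commit()

    # Post-phase: build the indexes only after all rows are in place.
    cur.execute("CREATE UNIQUE INDEX keyword_ref_id ON keyword_ref (id)")
    cur.execute("CREATE INDEX entry_keyword_acc ON entry_keyword (accession)")
    conn.commit()
    return conn
```

Calling load_in_phases(records, keywords, batch_size=5000) with a list of (accession, keyword) tuples would commit every 5000 rows; the best batch size depends on the RDBMS and the hardware and has to be found by trial.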

Notes:

BioLocal

I created 3 tables for storing my nucleic acid sequences, my protein sequences, and patterns. The structures of these tables are not mature yet!!! (But they work fine in my everyday life.)
For smaller databases you should try BioSQL, which has a better schema, instead of BioLocal.

BioRes

At the beginning this database contains only the 'db_info' table.


Analysis I

If you use EMBOSS, you can configure it to retrieve sequences from the RDBMS by connectrorom via USA (Uniform Sequence Addresses; see docs). With mbs_query.py you can access sequences by IDs, or by complex SQL queries (arrow b in the top Fig).
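A minimal sketch of the two access modes (by IDs, or by an arbitrary SQL query) is given below. It does not reproduce the interface of mbs_query.py, which is described in the docs: the protein table, its columns, and the function names are assumptions made for illustration, and sqlite3 replaces PostgreSQL to keep the example self-contained. The rows are formatted as FASTA text, a format that sequence analysis programs such as EMBOSS can read.

```python
import sqlite3


def fetch_by_ids(conn, ids):
    """Return (accession, sequence) rows for the given accessions."""
    ids = list(ids)
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT accession, sequence FROM protein "
        f"WHERE accession IN ({placeholders}) ORDER BY accession"
    )
    return conn.execute(sql, ids).fetchall()


def fetch_by_query(conn, sql, params=()):
    """Run any SELECT that yields (identifier, sequence) rows."""
    return conn.execute(sql, params).fetchall()


def to_fasta(rows, width=60):
    """Format (identifier, sequence) rows as FASTA text."""
    lines = []
    for identifier, sequence in rows:
        lines.append(f">{identifier}")
        # Break the sequence into fixed-width lines.
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)
```

For example, to_fasta(fetch_by_query(conn, "SELECT accession, sequence FROM protein WHERE length(sequence) > ?", (1000,))) would return all long sequences as one FASTA string that can be written to a file or piped into another program.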

Notes:

Analysis II

It is common in informatics and other (non-biological) sciences to store analysis results in an RDBMS (arrows d and e in the top Fig) for further analysis. I found only a few papers in PubMed that use this possibility for complex result sets. I think it is worth storing some types of analysis results (not only the very complex ones) in an RDBMS in order to analyze them more deeply and in more detail (arrows f and g in the top Fig). I implemented two different, very simple types:

Notes:
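As a general illustration of storing analysis results in an RDBMS and querying them later, here is a minimal sketch. It is not one of the ModBioSQL result types: the hit table, its columns, and the function names are hypothetical, and sqlite3 is used only so that the example is self-contained.

```python
import sqlite3


def create_result_table(conn):
    """Create a hypothetical table holding one row per reported hit."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hit ("
        "accession TEXT, program TEXT, parameters TEXT, "
        "start_pos INTEGER, end_pos INTEGER, score REAL)"
    )


def store_hits(conn, program, parameters, hits):
    """Store (accession, start, end, score) tuples from one program run."""
    conn.executemany(
        "INSERT INTO hit VALUES (?, ?, ?, ?, ?, ?)",
        [(acc, program, parameters, start, end, score)
         for acc, start, end, score in hits],
    )
    conn.commit()


def best_score_per_sequence(conn, program):
    """Return (accession, best score) over all stored runs of a program."""
    return conn.execute(
        "SELECT accession, MAX(score) FROM hit "
        "WHERE program = ? GROUP BY accession ORDER BY accession",
        (program,),
    ).fetchall()
```

Because the program name and its parameters are stored with every hit, runs with different parameters on the same sequence can be compared with a single query instead of rerunning the program or searching through old output files.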