ModBioSQL documentation


REQUIREMENTS

Obligatory:
  1. Python (I used 2.3.3)
  2. MySQL or PgSQL (I used 4.0.20 and 7.4.2, respectively)
    See tuning of kernel and RDB parameters.
  3. DB driver for Python; currently PyPgSQL, psycopg, MySQLdb are supported.
Optional:
  1. Biopython (+ my Bio.SwissProt.SProt patch)
    You need this if you want to use UniProt, since MBS uses its parser to parse the uniprot*.dat files. The original Bio.SwissProt.SProt parser cannot handle the newest version of SwissProt (the original one may be OK for the UniProt released in May).
  2. EMBOSS
    If you want to try MBS with EMBOSS, you need the newest version, 2.9.0, or a patched earlier one (see tuning).
Strong hardware helps. A full UniProt installation with indexes needs approximately 3.5 GB of disk space. I do not think I used any Linux-specific functions in my Python scripts, so MBS may also work under Windows and other operating systems.


GENERAL/RANDOM NOTES


TIPS FOR TUNING RDBMS AND EMBOSS

See my system parameters for comparison on the SemiBenchmarks page.

Kernel

On some systems (including Linux) the default shared memory setting is low. See this page for details; it applies not only to PgSQL. Raising this parameter (from 32M to 256M or 512M; I set mine to 512M) made the RDB 3-4 times faster. If you set it too high relative to your total memory, your computer may slow down because it runs out of memory.

PgSQL

"The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE."
Be sure to run initdb with --locale="C" if you want good performance with 'LIKE'!

In the postgresql.conf file you should set the following parameters:
# RESOURCE USAGE - Memory
shared_buffers = n1    # 8KB each; pg default: 1000; set it lower than the shared memory; I used n1=30000
sort_mem = n2          # size in KB; pg default: 1024; I used n2=4096
vacuum_mem = n3        # size in KB; I used n3=65536
# QUERY TUNING
enable_nestloop = false
enable_seqscan = false
# These settings do not forbid sequential scans or nested loops; the planner just avoids them whenever possible.
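
The "set it lower than the shared memory" rule is simple arithmetic, and a small Python sketch can check it. The function names are illustrative only; the 8 KB buffer size, n1=30000 and the 512M kernel limit are the values quoted above. PostgreSQL needs some shared memory beyond the buffers themselves, so treat the result as a rough lower bound.

```python
# Sketch: does a given shared_buffers value fit under the kernel
# shared memory limit? Assumes 8 KB per buffer, as noted above.

BUFFER_SIZE_KB = 8


def shared_buffers_mb(n_buffers):
    """Return the memory taken by n_buffers shared buffers, in MB."""
    return n_buffers * BUFFER_SIZE_KB / 1024.0


def fits_in_shared_memory(n_buffers, shm_limit_mb):
    """True if the buffers alone need less memory than the kernel limit."""
    return shared_buffers_mb(n_buffers) < shm_limit_mb


if __name__ == "__main__":
    n1 = 30000     # shared_buffers value quoted above
    shm_mb = 512   # kernel shared memory setting quoted above
    print("shared_buffers needs about %.0f MB" % shared_buffers_mb(n1))
    print("fits under %d MB: %s" % (shm_mb, fits_in_shared_memory(n1, shm_mb)))
```

With the default 32M kernel limit the same n1 would not fit, which is why the kernel setting has to be raised first.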

MySQL

My my.cnf file contained the following:
[mysqld]

# Running mysqld for the first time with this setup creates a 1G InnoDB
# file filled with empty data - it takes time

innodb_data_home_dir=/home/mysql/data/idbdata
innodb_data_file_path=ibdata1:1G:autoextend
innodb_log_group_home_dir=/home/mysql/data/idblog

# Set buffer pool size to 50-80% of your computer's memory
set-variable = innodb_buffer_pool_size=256M
set-variable = innodb_additional_mem_pool_size=50M
set-variable = thread_stack=4M

# Set the log file size to about 25% of the buffer pool size
set-variable = innodb_log_file_size=60M
set-variable = innodb_log_buffer_size=20M

innodb_flush_log_at_trx_commit=1
skip-external-locking
set-variable = max_connections=50
set-variable = read_buffer_size=1M

# You may increase 'bulk_insert_buffer_size' to speed up uploading
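
The two rules of thumb in the comments above (buffer pool at 50-80% of memory, log file at about 25% of the buffer pool) can be turned into a small Python calculation. This is only a sketch: the function name and the 512 MB machine in the example are hypothetical, not a description of the system used here.

```python
# Sketch: derive InnoDB sizes from the rules of thumb in my.cnf above.
# The function name and the example memory size are illustrative.

def suggest_innodb_sizes(total_memory_mb, pool_fraction=0.5, log_fraction=0.25):
    """Return (buffer_pool_mb, log_file_mb) as whole megabytes."""
    if not 0.5 <= pool_fraction <= 0.8:
        raise ValueError("pool_fraction should be between 0.5 and 0.8")
    buffer_pool_mb = int(total_memory_mb * pool_fraction)
    log_file_mb = int(buffer_pool_mb * log_fraction)
    return buffer_pool_mb, log_file_mb


if __name__ == "__main__":
    pool, log = suggest_innodb_sizes(512)  # hypothetical 512 MB machine
    print("set-variable = innodb_buffer_pool_size=%dM" % pool)
    print("set-variable = innodb_log_file_size=%dM" % log)
```

For a 256M buffer pool the 25% rule gives 64M; the 60M in the file above is close to that.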

EMBOSS

!!! You need the newest EMBOSS, 2.9.0, to use external applications (like connectorom) for accessing an RDBMS without problems.
Or, according to Peter Rice (EMBOSS developer), comment out the following line in emboss_source_dir/ajax/ajfile.c, in the function ajFileNewInPipe (line 189 in EMBOSS 2.8.0): while(wait(&status) != pid); then recompile and reinstall.

My emboss.default configuration file contained the following lines:

# UP: UniProt; MN: MyNucs; MP: MyProts
# If you do not add $MODBIOSQL/bin to your $PATH, use the full path of connectorom

DB UP [
method: "app"
format: "fasta"
type: "P"
app: "connectorom -s UniProt \%s"
comment: "UniProtSQL"
]

DB MN [
method: "app"
format: "fasta"
type: "N"
app: "connectorom -s BioLocal -t mynucs \%s"
comment: "BioLocal.MyNucs"
]

DB MP [
method: "app"
format: "fasta"
type: "P"
app: "connectorom -s BioLocal -t myprots \%s"
comment: "BioLocal.MyProts"
]


DOWNLOAD

modbiosql-teta-0.52.tgz (size: 220K; date: 2005.06.15)
Tested with UniProt 5.2 and EMBOSS 2.10.


INSTALL

See also the step-by-step description of the mBioSQL design (here)

Short version:

tar -xvzf modbiosql-teta-0.52.tgz
cd modbiosql-teta-0.52
path_to_python/python install.py

Longer instructions:

You need an RDBMS superuser name and password to create the databases. The install script can create two logical users for you: bioroot and biouser. I recommend running mbs_init.py with the bioroot account, and querying and analyzing your data with the biouser account.
Tip: You may set the authentication method for biouser to 'trust' in the pg_hba.conf file, or give an empty password in the case of MySQL.

Run the install script with the appropriate python:
path_to_your_python/python install.py

Enter the destination directory [/your-path/modbiosql-teta-0.12]:
If you do not specify a new directory, this will be your MBS_dir.

Enter the name of BIOROOT [bioroot]:
Enter bioroot password:
Confirm:
Enter the name of BIOUSER [biouser]:
Enter biouser password:
Confirm:
You can define any other names. I would not give biouser a password. With PgSQL I use 'trust' authentication for biouser to avoid entering a password. If the password confirmation does not match, the script asks for the password again.

Which database driver do you want to use [1]?
(1) psycopg (2) pgsql (3) mysqldb
Choose a number:
You can define which driver to use.

Do you want install UniProt [yes]? (yes/no)
Generally in MBS, if you want to say 'yes', you have to type 'yes' (case insensitive).
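
The 'yes' rule can be sketched in Python as follows. This is not the code of the install script; the function name is hypothetical, and treating an empty answer as the bracketed default is an assumption based on the prompts shown here.

```python
# Sketch of the yes/no rule: only the full word 'yes' (any case) counts.
# Assumption: an empty answer falls back to the default shown in brackets.

def is_yes(answer, default="yes"):
    """Return True only if the answer (or the default) is exactly 'yes'."""
    answer = answer.strip()
    if not answer:
        answer = default
    return answer.lower() == "yes"


if __name__ == "__main__":
    for reply in ("yes", "YES", "y", "no", ""):
        print("%r -> %s" % (reply, is_yes(reply)))
```

Note that a short 'y' does not count as 'yes' under this rule.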

Enter the name of the UniProt database [uniprot]:
Here you can give the real RDB name of your UniProt database.

Do you want install BioLocal [yes]? (yes/no)
Enter the name of the BioLocal database [biolocal]:

Do you want install BioRes [yes]? (yes/no)
Enter the name of the BioRes database [biores]:

Do you want to CREATE DATABASE, USERS in your RDB? (yes/no)
The script can create the databases, users, and privileges for you. If you do not want that, then create the databases manually and grant select, insert, index, update, create, drop, with grant option, to the bioroot user on the databases. After running mbs_init*, grant create and select on the UniProt tables, and select, insert, create, index, drop on the other databases, to the 'biouser' user.
MyNote: With MySQL you should grant insert, create, drop to 'biouser' if you want to allow this user to store result sets in UniProt. The tables do not have owners, which means biouser can delete the result tables of other users.

After all of this, the script creates a config file, copies the Python scripts and lib files into the destination directory, and creates the other dirs (log, examples...).

!!! You have to define the $MODBIOSQL variable pointing to your MBS install directory.
!!! You may append $MODBIOSQL/bin to your $PATH.
Since this install is not a real Python install, if you want to browse my libraries with pydoc, you have to define the $PYTHONPATH variable pointing to $MODBIOSQL/lib.

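For those happy people who have not used environment variables before (for bash-like systems; replace the path with your own MBS_dir):
export MODBIOSQL=/your-path/MBS_dir
export PATH=$PATH:$MODBIOSQL/bin
export PYTHONPATH=$MODBIOSQL/lib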

DIRECTORIES

$MODBIOSQL/bin

Location of the scripts. You may add this dir to your $PATH.
These scripts mainly do option and argument parsing and error checking, and call the functions in the modules found in the lib dir.

Valid, case insensitive Symbolic DB names:

General options for the scripts:
  -h, --help                          show the help message and exit
  -cCONFIG_FILE, --conf=CONFIG_FILE   specify an alternative config file
  -uUSER, --user=USER                 define the name of the db_user
  -p, --pwd                           prompt for the password of the db_user
  -PPWDP, --Pwd=PWDP                  specify the password of the db_user
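
For readers who want to see how such an option set hangs together, here is a minimal sketch using Python's standard argparse module. It is not the code of the MBS scripts: the function names, the destination names, and the rule that -P takes precedence over -p are assumptions made for illustration.

```python
# Sketch: the general options expressed with argparse.
# Not the MBS implementation; names and precedence are assumptions.
import argparse
import getpass


def build_parser():
    """Build a parser for the general options (-h is added by argparse)."""
    parser = argparse.ArgumentParser(description="general MBS-style options")
    parser.add_argument("-c", "--conf", dest="config_file",
                        metavar="CONFIG_FILE",
                        help="specify an alternative config file")
    parser.add_argument("-u", "--user", dest="user", metavar="USER",
                        help="define the name of the db_user")
    parser.add_argument("-p", "--pwd", dest="prompt_pwd", action="store_true",
                        help="prompt for the password of the db_user")
    parser.add_argument("-P", "--Pwd", dest="pwdp", metavar="PWDP",
                        help="specify the password of the db_user")
    return parser


def resolve_password(options, prompt=getpass.getpass):
    """Return the password from -P, else prompt if -p was given, else None."""
    if options.pwdp is not None:
        return options.pwdp
    if options.prompt_pwd:
        return prompt("Password for %s: " % options.user)
    return None


if __name__ == "__main__":
    opts = build_parser().parse_args()
    password = resolve_password(opts)
    print("user=%s config=%s" % (opts.user, opts.config_file))
```

Prompting with -p keeps the password out of the shell history and the process list, which -P cannot do.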

mbs_init.py
mbs_uninstall.py
mbs_clean.py
mbs_info.py
mbs_query.py
bl_load.py
bl_delete.py


connectorom
br_load.py
br_drop.py
br_anal2.py

$MODBIOSQL/biolocal

Files (like PDF-formatted map files) for BioLocal.mynucs can be stored here. Accessing them from scripts is not implemented yet.


$MODBIOSQL/etc

Location of the main config file. You can define an alternative config file with the '-c' option. Scripts try to read a config file in the following sequence: $HOME/.modbiosql.etc; $MODBIOSQL/etc/modbiosql.etc; and modbiosql.etc in the working directory. Only one config file is processed.
For details see the config file.
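
The lookup sequence can be sketched in Python as below. This is an illustration, not the MBS code: the function names are hypothetical, and it assumes that the first existing file in the sequence is the one processed and that a file given with '-c' bypasses the search.

```python
# Sketch of the config file lookup described above.
# Assumptions: the first existing file wins; '-c' overrides the search.
import os

CONFIG_NAME = "modbiosql.etc"


def candidate_config_files(environ=None, cwd=None):
    """Return the config file candidates in the documented order."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    candidates = []
    if "HOME" in environ:
        candidates.append(os.path.join(environ["HOME"], "." + CONFIG_NAME))
    if "MODBIOSQL" in environ:
        candidates.append(os.path.join(environ["MODBIOSQL"], "etc", CONFIG_NAME))
    candidates.append(os.path.join(cwd, CONFIG_NAME))
    return candidates


def find_config_file(explicit=None, environ=None, cwd=None):
    """Return the config file to process, or None if none exists."""
    if explicit is not None:
        return explicit
    for path in candidate_config_files(environ, cwd):
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_config_file())
```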


$MODBIOSQL/lib

Location of the modules containing the core of the code. You can browse them with pydoc. Since these Python modules are not installed into your Python, you may set $PYTHONPATH to this directory. (My scripts in $MODBIOSQL/bin find them by sys.path.append( os.path.expandvars( '$MODBIOSQL/lib')))
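
One detail of the expandvars approach: os.path.expandvars leaves unknown variables unchanged, so if $MODBIOSQL is not set, the literal string '$MODBIOSQL/lib' ends up on sys.path and the imports fail later with a less obvious error. A minimal sketch of a stricter variant follows; the function name is hypothetical and this is not how the MBS scripts are written.

```python
# Sketch: add $MODBIOSQL/lib to sys.path, failing early if the
# variable is missing. The function name is illustrative only.
import os
import sys


def add_mbs_lib_to_path(environ=None):
    """Append $MODBIOSQL/lib to sys.path and return the directory."""
    environ = os.environ if environ is None else environ
    if "MODBIOSQL" not in environ:
        raise RuntimeError("$MODBIOSQL is not set; see the INSTALL section")
    lib_dir = os.path.join(environ["MODBIOSQL"], "lib")
    if lib_dir not in sys.path:
        sys.path.append(lib_dir)
    return lib_dir


if __name__ == "__main__":
    print("modules will be searched in %s" % add_mbs_lib_to_path())
```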


$MODBIOSQL/log

Log files are written here unless another 'log' location is defined in the config file.
Only programs/functions affecting the RDBMS write logs.


$MODBIOSQL/uniprot

UniProt files are stored here unless another 'dir' is defined in the config file.
Constant files: You have to download the following files to this directory (ftp://ftp.ebi.ac.uk/pub/databases/uniprot/):

source_dir/docs

There are doc files here.


source_dir/examples

Some example files are here (e.g. if you do not have EMBOSS, you can use them to get a taste of MBS).

BUG REPORT

Send me the following things:

LICENSE

ModBioSQL is open source, free software.

I am not a programmer; I am a wet-lab biologist who likes programming but does it without deep knowledge or expertise. I cannot be held responsible for any problem arising from the informatics or biological parts of this package. ModBioSQL is in an early (and maybe the final) developmental stage, and is intended to show the possibilities and future of biological relational database systems.



Author: Tamas Hegedus
email: hegedus.tamas@mayo.edu
web: http://www.biomembrane.hu/~hegedus/modbiosql
date: 2004.07.15.