As of MySQL Version 3.23.6, you can choose between three basic
table formats (ISAM, HEAP, and MyISAM). Newer
MySQL versions may support additional table types (BDB,
GEMINI, or InnoDB), depending on how you compile it.
When you create a new table, you can tell MySQL which table
type it should use for the table. MySQL will always create a
.frm
file to hold the table and column definitions. Depending on
the table type, the index and data will be stored in other files.
Note that to use InnoDB
tables you have to use at least
the innodb_data_file_path
startup option. See section 8.7.2 InnoDB startup options.
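As a minimal sketch (the data file name and size here are only placeholders; a fuller configuration example appears in the InnoDB section below), the option goes in the [mysqld] section of `my.cnf':
[mysqld]
innodb_data_file_path = ibdata1:100M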
The default table type in MySQL is MyISAM
. If you are
trying to use a table type that is not compiled-in or activated,
MySQL will instead create a table of type MyISAM
. This
is a very useful feature when you want to copy tables between different
SQL servers that support different table types (like copying tables to
a slave that is optimized for speed by not having transactional tables).
This automatic table type changing can, however, also be very confusing for new
MySQL users. We plan to address this in MySQL 4.0 by issuing a warning
when a table type is automatically changed.
You can convert tables between different types with the ALTER
TABLE
statement. See section 7.8 ALTER TABLE
Syntax.
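For example, assuming an existing table named t1 (the name is only an illustration), you could convert it to a transaction-safe type and back again:
mysql> ALTER TABLE t1 TYPE = BDB;
mysql> ALTER TABLE t1 TYPE = MYISAM;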
Note that MySQL supports two different kinds of
tables. Transaction-safe tables (BDB
, InnoDB
or
GEMINI
) and not transaction-safe tables (HEAP
, ISAM
,
MERGE
, and MyISAM
).
Advantages of transaction-safe tables (TST):
You can combine many statements and accept them all in one go with the COMMIT command.
You can execute ROLLBACK to ignore your changes (if you are not running in auto commit mode).
Advantages of not transaction-safe tables (NTST):
Much faster as there is no transaction overhead.
Will use less disk space as there is no overhead of transactions.
Less memory required to do updates.
You can combine TST and NTST tables in the same statements to get the best of both worlds.
MyISAM
is the default table type in MySQL Version 3.23. It's
based on the ISAM
code and has a lot of useful extensions.
The index is stored in a file with the .MYI
(MYIndex) extension,
and the data is stored in a file with the .MYD
(MYData) extension.
You can check/repair MyISAM
tables with the myisamchk
utility. See section 16.5 Using myisamchk
for Crash Recovery. You can compress MyISAM
tables with
myisampack
to take up much less space. See section 15.12 The MySQL Compressed Read-only Table Generator.
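As a quick sketch (the table name and data directory path are only examples), a MyISAM table can be checked, repaired, and compressed from the shell while the server is not using it:
shell> myisamchk /usr/local/mysql/data/test/mytable.MYI
shell> myisamchk --recover /usr/local/mysql/data/test/mytable.MYI
shell> myisampack /usr/local/mysql/data/test/mytable.MYI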
The following is new in MyISAM
:
There is a flag in the MyISAM file that indicates whether or not
the table was closed correctly. If mysqld
is started with
--myisam-recover
, MyISAM
tables will automatically be
checked and/or repaired on open if the table wasn't closed properly.
You can INSERT
new rows in a table without deleted rows,
while other threads are reading from the table.
There is internal handling of one AUTO_INCREMENT
column. MyISAM
will automatically update this on INSERT/UPDATE
. The
AUTO_INCREMENT
value can be reset with myisamchk
. This
will make AUTO_INCREMENT
columns faster (at least 10 %) and old
numbers will not be reused as with the old ISAM
. Note that when an
AUTO_INCREMENT
is defined on the end of a multi-part-key the old
behavior is still present.
When inserting in sorted order (as when you are using an AUTO_INCREMENT
column) the key tree will be split so that the high node only contains one
key. This will improve the space utilization in the key tree.
BLOB
and TEXT
columns can be indexed.
NULL
values are allowed in indexed columns. This takes 0-1
bytes/key.
myisamchk
.
myisamchk
will mark tables as checked if one runs it with
--update-state
. myisamchk --fast
will only check those
tables that don't have this mark.
myisamchk -a
stores statistics for key parts (and not only for
whole keys as in ISAM
).
myisampack
can pack BLOB
and VARCHAR
columns.
You can put the data file and index file on different directories to get more speed (with the DATA/INDEX DIRECTORY="path"
option to
CREATE TABLE
). See section 7.7 CREATE TABLE
Syntax.
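A sketch of the DATA/INDEX DIRECTORY option mentioned above (the table name and directory paths are purely illustrative):
mysql> CREATE TABLE big_log (id INT NOT NULL, msg CHAR(200)) TYPE=MYISAM DATA DIRECTORY='/disk2/data' INDEX DIRECTORY='/disk3/index';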
MyISAM
also supports the following things, which MySQL
will be able to use in the near future:
A true VARCHAR
type; A VARCHAR
column starts
with a length stored in 2 bytes.
Tables with VARCHAR
may have fixed or dynamic record length.
VARCHAR
and CHAR
may be up to 64K.
All key segments have their own language definition. This will enable
MySQL to have different language definitions per column.
A hashed computed index can be used for UNIQUE
. This will allow
you to have UNIQUE
on any combination of columns in a table. (You
can't search on a UNIQUE
computed index, however.)
Note that index files are usually much smaller with MyISAM
than with
ISAM
. This means that MyISAM
will normally use less
system resources than ISAM
, but will need more CPU when inserting
data into a compressed index.
The following options to mysqld
can be used to change the behavior of
MyISAM
tables. See section 7.28.4 SHOW VARIABLES
.
Option | Meaning |
--myisam-recover=# | Automatic recover of crashed tables. |
-O myisam_sort_buffer_size=# | Buffer used when recovering tables. |
--delay-key-write-for-all-tables | Don't flush key buffers between writes for any MyISAM table. |
-O myisam_max_extra_sort_file_size=# | Used to help MySQL to decide when to use the slow but safe key cache index create method. NOTE that this parameter is given in megabytes! |
-O myisam_max_sort_file_size=# | Don't use the fast sort index method to create an index if the temporary file would get bigger than this. NOTE that this parameter is given in megabytes! |
The automatic recovery is activated if you start mysqld
with
--myisam-recover=#
. See section 4.16.4 mysqld Command-line Options.
On open, the table is checked to see whether it's marked as crashed, or whether the open
count variable for the table is not 0 and you are running with
--skip-locking. If either of the above is true, the table is checked and repaired as needed.
If the recovery wouldn't be able to recover all rows from a previously
completed statement and you didn't specify FORCE
as an option to
myisam-recover
, then the automatic repair will abort with an error
message in the error file:
Error: Couldn't repair table: test.g00pages
If you had used the FORCE option in this case, you would instead have gotten
a warning in the error file:
Warning: Found 344 of 354 rows when repairing ./test/g00pages
Note that if you run automatic recovery with the BACKUP option,
you should have a cron script that automatically moves files with names
like `tablename-datetime.BAK' from the database directories to a
backup medium.
See section 4.16.4 mysqld Command-line Options.
MySQL can support different index types, but the normal type is
ISAM or MyISAM. These use a B-tree index, and you can roughly calculate
the size for the index file as (key_length+4)/0.67
, summed over
all keys. (This is for the worst case when all keys are inserted in
sorted order and we don't have any compressed keys.)
String indexes are space compressed. If the first index part is a
string, it will also be prefix compressed. Space compression makes the
index file smaller than the above figures if the string column has a lot
of trailing space or is a VARCHAR
column that is not always used
to the full length. Prefix compression is used on keys that start
with a string. Prefix compression helps if there are many strings
with an identical prefix.
In MyISAM
tables, you can also prefix compress numbers by specifying
PACK_KEYS=1
when you create the table. This helps when you have
many integer keys that have an identical prefix when the numbers are stored
high-byte first.
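For instance, a sketch of such a table (the table and column names are only examples):
mysql> CREATE TABLE lookup (id BIGINT NOT NULL, name CHAR(30), PRIMARY KEY (id)) TYPE=MYISAM PACK_KEYS=1;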
MyISAM supports three different table formats. Two of them are chosen
automatically depending on the type of columns you are using. The third,
compressed tables, can only be created with the myisampack
tool.
This is the default format. It's used when the table contains no
VARCHAR
, BLOB
, or TEXT
columns.
This format is the simplest and most secure format. It is also the fastest of the on-disk formats. The speed comes from the easy way data can be found on disk: when looking up something with an index in a static-format table, you simply multiply the row number by the row length.
Also, when scanning a table, it is very easy to read a constant number of records with each disk read.
The security shows if your computer crashes while writing to a fixed-size MyISAM file: myisamchk can easily figure out where each row starts and ends, so it can usually reclaim all records except the partially written one. Note that in MySQL all indexes can always be reconstructed. Static-format tables have these characteristics:
CHAR
, NUMERIC
, and DECIMAL
columns are space-padded
to the column width.
Doesn't need to be reorganized (with myisamchk) unless a huge number of
records are deleted and you want to return free disk space to the operating
system.
This format is used if the table contains any VARCHAR
, BLOB
,
or TEXT
columns or if the table was created with
ROW_FORMAT=dynamic
.
This format is a little more complex because each row has to have a header that says how long it is. A record can also end up in more than one location when an update makes it longer.
You can use OPTIMIZE table
or myisamchk
to defragment a
table. If you have static data that you access/change a lot in the same
table as some VARCHAR
or BLOB
columns, it might be a good
idea to move the dynamic columns to other tables just to avoid
fragmentation.
Each record is preceded by a bitmap indicating which columns are empty ('') for string columns, or zero for numeric columns. (This isn't
the same as columns containing NULL
values.) If a string column
has a length of zero after removal of trailing spaces, or a numeric
column has a value of zero, it is marked in the bit map and not saved to
disk. Non-empty strings are saved as a length byte plus the string
contents.
Expect to run myisamchk -r
from time to time to get better performance. Use myisamchk -ei
tbl_name
for some statistics.
The expected row length for dynamic-sized records is:
3
+ (number of columns + 7) / 8
+ (number of char columns)
+ packed size of numeric columns
+ length of strings
+ (number of NULL columns + 7) / 8
There is a penalty of 6 bytes for each link. A dynamic record is linked whenever an update causes an enlargement of the record. Each new link will be at least 20 bytes, so the next enlargement will probably go in the same link. If not, there will be another link. You may check how many links there are with
myisamchk -ed
. All links may be removed with myisamchk -r
.
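As a sketch of defragmenting a dynamic-format table (the table name and path are only examples), either of the following can be used:
mysql> OPTIMIZE TABLE log_data;
shell> myisamchk -r /usr/local/mysql/data/test/log_data.MYI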
This is a read-only type that is generated with the optional
myisampack
tool (pack_isam
for ISAM
tables):
myisampack
.
Numbers with a value of 0
are stored using 1 bit.
If the values in an integer column have a small range, the column is stored using the smallest possible type. For example, a BIGINT
column (8 bytes) may
be stored as a TINYINT
column (1 byte) if all values are in the range
0
to 255
.
If a column has only a small set of possible values, the column type is converted to ENUM.
Can handle fixed-length or dynamic-length records, but not BLOB
or TEXT
columns.
Can be uncompressed with myisamchk.
The file format that MySQL uses to store data has been extensively tested, but there are always circumstances that may cause database tables to become corrupted.
Even though the MyISAM table format is very reliable (all changes to a table are written before the SQL statement returns), you can still get corrupted tables if any of the following happens:
The mysqld process is killed in the middle of a write.
Typical symptoms of a corrupt table are:
You get the following error while selecting data from the table:
Incorrect key file for table: '...'. Try to repair it
You can check if a table is ok with the command CHECK
TABLE
. See section 7.12 CHECK TABLE
Syntax.
You can repair a corrupted table with REPAIR TABLE
. See section 7.16 REPAIR TABLE
Syntax.
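For example (the table name is illustrative only):
mysql> CHECK TABLE orders;
mysql> REPAIR TABLE orders;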
You can also repair a table with the myisamchk command when mysqld is not running. See the section on myisamchk invocation syntax.
If your tables get corrupted a lot you should try to find the reason for this! See section 21.2 What to Do if MySQL Keeps Crashing.
In this case the most important thing to know is whether the table got corrupted because mysqld died (one can easily verify this by checking whether there is a recent restarted mysqld row in the mysqld error file). If this isn't the case, then you should try to make a test case of this. See section I.1.6 Making a test case when you experience table corruption.
Each MyISAM
.MYI
file has in the header a counter that can
be used to check if a table has been closed properly.
If you get the following warning from CHECK TABLE
or myisamchk
:
# clients is using or hasn't closed the table properly
this means that this counter has come out of sync. This doesn't mean that the table is corrupted, but means that you should at least do a check on the table to verify that it's ok.
The counter works as follows:
The first time a table is updated in MySQL, a counter in the header of the index file is incremented. The counter is not changed during further updates.
When the last instance of a table is closed (because of a FLUSH or
because there isn't room in the table cache) the counter is
decremented if the table has been updated at any point.
In other words, the only ways this can go out of sync are:
MyISAM
tables are copied without a LOCK
and
FLUSH TABLES
.
Someone has done a myisamchk --repair
or myisamchk
--update-state
on a table that was in use by mysqld
.
Many mysqld
servers are using the table and one has done a
REPAIR
or CHECK
of the table while it was in use by
another server. In this setup the CHECK
is safe to do (even if
you will get the warning from other servers), but REPAIR
should
be avoided as it currently replaces the data file with a new one, which
is not signaled to the other servers.
MERGE
tables are new in MySQL Version 3.23.25. The code
is still in gamma, but should be reasonably stable.
A MERGE
table is a collection of identical MyISAM
tables
that can be used as one. You can only SELECT
, DELETE
, and
UPDATE
from the collection of tables. If you DROP
the
MERGE
table, you are only dropping the MERGE
specification.
Note that DELETE FROM merge_table
used without a WHERE
will only clear the mapping for the table, not delete everything in the
mapped tables. (We plan to fix this in 4.0).
By identical tables we mean that all tables are created with identical
column and key information. You can't put a MERGE over tables where the
columns are packed differently or that don't have exactly the same columns.
Some of the tables can however be compressed with myisampack
.
See section 15.12 The MySQL Compressed Read-only Table Generator.
When you create a MERGE
table, you will get a .frm
table
definition file and a .MRG
table list file. The .MRG
just
contains a list of the index files (.MYI
files) that should
be used as one.
For the moment you need to have SELECT
, UPDATE
, and
DELETE
privileges on the tables you map to a MERGE
table.
MERGE
tables can help you solve the following problems:
Easily manage a set of log tables. For example, you can put data from different months into separate files, compress some of them with myisampack
, and then create a MERGE
to use these as one.
MERGE
table on this could be much faster than using
the big table. (You can, of course, also use a RAID to get the same
kind of benefits.)
MERGE
tables active, with possible overlapping files.
Do more efficient repairs. It's easier to repair the individual files that are mapped to a MERGE
file than trying to repair a real big file.
A MERGE
table uses the
index of the individual tables. It doesn't need to maintain an index of
its own. This makes MERGE
table collections VERY fast to make or
remap. Note that you must specify the key definitions when you create
a MERGE
table!
If you have a set of tables that you join to a big table on demand or batch, you should instead create a MERGE
table on them on demand.
This is much faster and will save a lot of disk space.
The disadvantages with MERGE
tables are:
You can't use INSERT
on MERGE
tables, as MySQL
can't know in which of the tables we should insert the row.
You can only use identical MyISAM
tables for a MERGE
table.
MERGE
tables use more file descriptors. If you are using a
MERGE that maps over 10 tables and 10 users are using this, you
are using 10*10 + 10 file descriptors. (10 data files for 10 users
and 10 shared index files.)
When you read a key, the MERGE
handler will need to issue a read on all underlying tables to check
which one most closely matches the given key. If you then do a 'read-next',
the MERGE table handler will need to search the read buffers
to find the next key. Only when one key buffer is used up will the handler
need to read the next key block. This makes MERGE
keys much slower
on eq_ref
searches, but not much slower on ref
searches.
See section 7.29 EXPLAIN
Syntax (Get Information About a SELECT
).
You can't do DROP TABLE
, ALTER TABLE
or DELETE FROM
table_name
without a WHERE
clause on any of the tables that are
mapped by a MERGE
table that is 'open'. If you do this, the
MERGE
table may still refer to the original table and you will
get unexpected results.
The following example shows you how to use MERGE
tables:
CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, message CHAR(20));
CREATE TABLE t2 (a INT AUTO_INCREMENT PRIMARY KEY, message CHAR(20));
INSERT INTO t1 (message) VALUES ("Testing"),("table"),("t1");
INSERT INTO t2 (message) VALUES ("Testing"),("table"),("t2");
CREATE TABLE total (a INT NOT NULL, message CHAR(20), KEY(a)) TYPE=MERGE UNION=(t1,t2);
Note that we didn't create a UNIQUE
or PRIMARY KEY
in the
total
table as the key isn't going to be unique in the total
table.
Note that you can also manipulate the .MRG
file directly from
the outside of the MySQL server:
shell> cd /mysql-data-directory/current-database
shell> ls -1 t1.MYI t2.MYI > total.MRG
shell> mysqladmin flush-tables
Now you can do things like:
mysql> select * from total;
+---+---------+
| a | message |
+---+---------+
| 1 | Testing |
| 2 | table   |
| 3 | t1      |
| 1 | Testing |
| 2 | table   |
| 3 | t2      |
+---+---------+
To remap a MERGE
table you can do one of the following:
DROP
the table and re-create it
Use ALTER TABLE table_name UNION=(...)
Change the .MRG
file and issue a FLUSH TABLE
on the
MERGE
table and all underlying tables to force the handler to
read the new definition file.
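For example, using the ALTER TABLE form listed above, a hypothetical third table t3 could be added to the total table from the earlier example like this:
mysql> ALTER TABLE total UNION=(t1,t2,t3);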
You can also use the deprecated ISAM table type. This will disappear
rather soon because MyISAM
is a better implementation of the same
thing. ISAM uses a B-tree
index. The index is stored in a file
with the .ISM
extension, and the data is stored in a file with the
.ISD
extension. You can check/repair ISAM tables with the
isamchk
utility. See section 16.5 Using myisamchk
for Crash Recovery.
ISAM
has the following features/properties:
Most of the things true for MyISAM
tables are also true for ISAM
tables. See section 8.1 MyISAM Tables. The major differences compared to MyISAM
tables are:
ISAM tables are compressed with pack_isam
rather than with myisampack
.
If you want to convert an ISAM
table to a MyISAM
table so
that you can use utilities such as mysqlcheck
, use an ALTER
TABLE
statement:
mysql> ALTER TABLE tbl_name TYPE = MYISAM;
HEAP
tables use a hashed index and are stored in memory. This
makes them very fast, but if MySQL crashes you will lose all
data stored in them. HEAP
is very useful for temporary tables!
The MySQL internal HEAP tables use 100% dynamic hashing
without overflow areas. There is no extra space needed for free lists.
HEAP
tables also don't have problems with delete + inserts, which
normally is common with hashed tables:
mysql> CREATE TABLE test TYPE=HEAP SELECT ip,SUM(downloads) as down FROM log_table GROUP BY ip;
mysql> SELECT COUNT(ip),AVG(down) FROM test;
mysql> DROP TABLE test;
Here are some things you should consider when you use HEAP
tables:
You should always specify MAX_ROWS
in the CREATE
statement
to ensure that you don't accidentally use all memory (see the sketch after this list).
Indexes will only be used with =
and <=>
(but are VERY fast).
HEAP
tables can only use whole keys to search for a row; compare this
to MyISAM
tables where any prefix of the key can be used to find rows.
HEAP
tables use a fixed record length format.
HEAP
doesn't support BLOB
/TEXT
columns.
HEAP
doesn't support AUTO_INCREMENT
columns.
HEAP
doesn't support an index on a NULL
column.
You can have non-unique keys in a HEAP
table (this isn't common for
hashed tables).
HEAP
tables are shared between all clients (just like any other
table).
You can't search for the next entry in order (that is, to use the index to do an ORDER BY).
HEAP
tables are allocated in small blocks. The tables
are 100% dynamic (on inserting). No overflow areas and no extra key
space are needed. Deleted rows are put in a linked list and are
reused when you insert new data into the table.
To free memory, you should execute DELETE FROM heap_table
,
TRUNCATE heap_table
or DROP TABLE heap_table
.
MyISAM
table to a HEAP
table.
You can't create HEAP
tables bigger than max_heap_table_size
.
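A sketch of the MAX_ROWS hint mentioned in the first item above (the table and column names are only examples):
mysql> CREATE TABLE visitors (ip CHAR(15), hits INT) TYPE=HEAP MAX_ROWS=10000;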
The memory needed for one row in a HEAP
table is:
SUM_OVER_ALL_KEYS(max_length_of_key + sizeof(char*) * 2) + ALIGN(length_of_row+1, sizeof(char*))
sizeof(char*)
is 4 on 32-bit machines and 8 on 64-bit machines.
Support for BDB tables is included in the MySQL source distribution starting from Version 3.23.34 and is activated in the MySQL-Max binary.
BerkeleyDB, available at http://www.sleepycat.com/, has provided
MySQL with a transactional table handler. By using BerkeleyDB
tables, your tables may have a greater chance of surviving crashes, and you also
get COMMIT
and ROLLBACK
on transactions. The
MySQL source distribution comes with a BDB distribution that has a
couple of small patches to make it work more smoothly with MySQL.
You can't use a non-patched BDB
version with MySQL.
We at MySQL AB are working in close cooperation with Sleepycat to keep the quality of the MySQL/BDB interface high.
When it comes to supporting BDB tables, we are committed to helping our users locate the problem and to helping create a reproducible test case for any problems involving BDB tables. Any such test case will be forwarded to Sleepycat, who in turn will help us find and fix the problem. As this is a two-stage operation, any problems with BDB tables may take a little longer for us to fix than for other table handlers. However, as the BerkeleyDB code itself has been used by many applications other than MySQL, we don't envision any big problems with this. See section 3.5.6 Support for other table handlers.
If you have downloaded a binary version of MySQL that includes support for BerkeleyDB, simply follow the instructions for installing a binary version of MySQL. See section 4.6 Installing a MySQL Binary Distribution. See section 15.2 mysqld-max, An extended mysqld server.
To compile MySQL with Berkeley DB support, download MySQL
Version 3.23.34 or newer and configure MySQL
with the
--with-berkeley-db
option. See section 4.7 Installing a MySQL Source Distribution.
cd /path/to/source/of/mysql-3.23.34
./configure --with-berkeley-db
Please refer to the manual provided with the BDB
distribution for
more up-to-date information.
Even though Berkeley DB itself is very well tested and reliable, the MySQL interface is still considered beta quality. We are actively improving and optimizing it to get it stable very soon.
If you are running with AUTOCOMMIT=0
then your changes in BDB
tables will not be updated until you execute COMMIT
. Instead of commit
you can execute ROLLBACK
to forget your changes. See section 7.31 BEGIN/COMMIT/ROLLBACK
Syntax.
If you are running with AUTOCOMMIT=1
(the default), your changes
will be committed immediately. You can start an extended transaction with
the BEGIN WORK
SQL command, after which your changes will not be
committed until you execute COMMIT
(or decide to ROLLBACK
the changes).
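A minimal sketch of such a transaction on a BDB table (the table, columns, and values are purely illustrative; executing ROLLBACK instead of COMMIT would undo both changes):
mysql> BEGIN WORK;
mysql> INSERT INTO accounts VALUES (1, 100);
mysql> UPDATE accounts SET balance = balance - 10 WHERE id = 1;
mysql> COMMIT;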
The following options to mysqld
can be used to change the behavior of
BDB tables:
Option | Meaning |
--bdb-home=directory | Base directory for BDB tables. This should be the same directory you use for --datadir. |
--bdb-lock-detect=# | Berkeley lock detect. One of (DEFAULT, OLDEST, RANDOM, or YOUNGEST). |
--bdb-logdir=directory | Berkeley DB log file directory. |
--bdb-no-sync | Don't synchronously flush logs. |
--bdb-no-recover | Don't start Berkeley DB in recover mode. |
--bdb-shared-data | Start Berkeley DB in multi-process mode (don't use DB_PRIVATE when initializing Berkeley DB). |
--bdb-tmpdir=directory | Berkeley DB tempfile name. |
--skip-bdb | Don't use Berkeley DB. |
-O bdb_max_lock=1000 | Set the maximum number of locks possible. See section 7.28.4 SHOW VARIABLES. |
If you use --skip-bdb
, MySQL will not initialize the
Berkeley DB library and this will save a lot of memory. Of course,
you cannot use BDB
tables if you are using this option.
Normally you should start mysqld
without --bdb-no-recover
if you
intend to use BDB tables. This may, however, give you problems when you
try to start mysqld
if the BDB log files are corrupted. See section 4.16.2 Problems Starting the MySQL Server.
With bdb_max_lock
you can specify the maximum number of locks
(10000 by default) you can have active on a BDB table. You should
increase this if you get errors of type bdb: Lock table is out of
available locks
or Got error 12 from ...
when you have long
transactions or when mysqld
has to examine a lot of rows to
calculate the query.
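For example, as a sketch, the limit could be raised at server startup (the value is only an example):
shell> mysqld -O bdb_max_lock=10000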
You may also want to change binlog_cache_size
and
max_binlog_cache_size
if you are using big multi-line transactions.
See section 7.31 BEGIN/COMMIT/ROLLBACK
Syntax.
Here are some things you should know about BDB tables:
To be able to roll back transactions, BDB maintains log files. For maximum performance you should place these on another disk than your databases by using the --bdb-logdir option.
MySQL performs a checkpoint each time a new BDB log file is started, and you can also run FLUSH LOGS at any time
to checkpoint the Berkeley DB tables.
For disaster recovery, one should use table backups plus
MySQL's binary log. See section 22.2 Database Backups.
Warning: If you delete old log files that are in use, BDB will
not be able to do recovery at all and you may lose data if something
goes wrong.
MySQL requires a PRIMARY KEY
in each BDB table to be
able to refer to previously read rows. If you don't create one,
MySQL will create and maintain a hidden PRIMARY KEY
for
you. The hidden key has a length of 5 bytes and is incremented for each
insert attempt.
If all columns you access in a BDB
table are part of the same index or
part of the primary key, then MySQL can execute the query
without having to access the actual row. In a MyISAM
table the
above holds only if the columns are part of the same index.
The PRIMARY KEY
will be faster than any other key, as the
PRIMARY KEY
is stored together with the row data. As the other keys are
stored as the key data + the PRIMARY KEY
, it's important to keep the
PRIMARY KEY
as short as possible to save disk and get better speed.
LOCK TABLES
works on BDB
tables as with other tables. If
you don't use LOCK TABLE
, MySQL will issue an internal
multiple-write lock on the table to ensure that the table will be
properly locked if another thread issues a table lock.
Internal locking in BDB
tables is done at the page level.
SELECT COUNT(*) FROM table_name
is slow as BDB
tables don't
maintain a count of the number of rows in the table.
Sequential scanning is slower than with MyISAM
tables as the data in BDB
tables is stored in B-trees and not in a separate data file.
The application must always be prepared to handle cases where any change of a BDB
table may cause an automatic rollback and any
read may fail with a deadlock error.
Keys are not compressed to previous keys as with ISAM or MyISAM tables. In other words, the key information will take a little more space in BDB
tables compared to MyISAM tables which don't use
PACK_KEYS=0
.
MySQL maintains only an approximate row count for BDB tables. If you don't do a lot of DELETE
or ROLLBACK
:s this number should be
accurate enough for the MySQL optimizer, but as MySQL
only store the number on close, it may be wrong if MySQL dies
unexpectedly. It should not be fatal even if this number is not 100 %
correct. One can update the number of rows by executing ANALYZE
TABLE
or OPTIMIZE TABLE
. See section 7.15 ANALYZE TABLE
Syntax . See section 7.11 OPTIMIZE TABLE
Syntax.
If you run out of disk space while adding data to a BDB
table, you will get an error
(probably error 28) and the transaction should roll back. This is in
contrast with MyISAM
and ISAM
tables where mysqld
will
wait for enough free disk before continuing.
Opening many BDB tables at the same time may be quite slow. If you are going to use BDB tables, you should not have a very big table cache and you should use --no-auto-rehash
with the mysql
client. We plan to partly fix this in 4.0.
SHOW TABLE STATUS
doesn't yet provide that much information for BDB
tables.
If, after having built MySQL with support for BDB tables, you get
the following error in the log file when you start mysqld
:
bdb: architecture lacks fast mutexes: applications cannot be threaded Can't init dtabases
This means that BDB
tables are not supported for your architecture.
In this case you have to rebuild MySQL without BDB table support.
NOTE: The following list is not complete; we will update it as we get more information.
Currently we know that BDB tables work with the following operating systems:
They do not work with the following operating systems:
If you get the following error in the hostname.err log
when
starting mysqld
:
bdb: Ignoring log file: .../log.XXXXXXXXXX: unsupported log version #
it means that the new
BDB
version doesn't support the old log
file format. In this case you have to delete all BDB
log files
from your database directory (the files that have the format
log.XXXXXXXXXX
) and restart mysqld
. We would also
recommend that you do a mysqldump --opt
of your old BDB
tables, delete the old tables, and restore the dump.
If you are not running in auto_commit
mode and delete a table that is
used by another thread, you may get the following error messages in
the MySQL error file:
001119 23:43:56 bdb: Missing log fileid entry
001119 23:43:56 bdb: txn_abort: Log undo failed for LSN: 1 3644744: Invalid
This is not fatal but we don't recommend that you delete tables if you are not in
auto_commit
mode, until this problem is fixed (the fix is
not trivial).
GEMINI
is a transaction-safe table handler for MySQL. It
provides row-level locking, robust transaction support and reliable
crash recovery. It is targeted for databases that need to handle heavy
multi-user updates typical of transaction processing applications while
still providing excellent performance for read-intensive operations. The
GEMINI
table type is developed and supported by NuSphere
Corporation (see http://www.nusphere.com).
GEMINI
provides full ACID transaction properties (Atomic,
Consistent, Isolated, and Durable) with a programming model that
includes support for statement atomicity and all four standard isolation
levels (Read Uncommitted, Read Committed, Repeatable Read, and
Serializable) defined in the SQL standard.
The GEMINI
tables support row-level and table-level locking to
increase concurrency in applications and allow reading of tables without
locking for maximum concurrency in a heavy update environment. The
transaction, locking, and recovery mechanisms are tightly integrated to
eliminate unnecessary administration overhead.
In general, if GEMINI
tables are selected for an application, it
is recommended that all tables updated in the application be
GEMINI
tables to provide well-defined system behavior. If
non-GEMINI
tables are mixed into the application, then ACID
transaction properties cannot be maintained. While there are clearly
cases where mixing table types is appropriate, it should always be done
with careful consideration of the impact on transaction consistency and
recoverability needs of the application and underlying database.
The GEMINI
table type is derived from a successful commercial
database and uses the storage kernel technology tightly integrated with
MySQL server. The basic GEMINI
technology is in use by
millions of users worldwide in production environments today. This
maturity allows GEMINI
tables to provide a solution for those
users who require transaction-based behavior as part of their
applications.
The GEMINI
table handler supports a configurable data cache that
allows a significant portion of any database to be maintained in memory
while still allowing durable updates.
The following summarizes the major features provided by GEMINI
tables.
SELECT
COUNT(*) FROM
table-name always returns an answer immediately.
GEMINI
uses block based I/O for better performance. There is no
performance penalty for using VARCHAR
fields. The maximum record size is
currently 32K.
The maximum number of rows in a GEMINI
table can be 4 quintillion
(full use of 64 bits).
This section highlights some of the important concepts behind
GEMINI
and the GEMINI
programming model, including:
These features are described below.
ACID in the context of transactions is an acronym which stands for Atomicity, Consistency, Isolation, Durability.
ATTRIBUTE | DESCRIPTION |
Atomicity @tab A transaction allows for the grouping of one or more changes to tables and rows in the database to form an atomic or indivisible operation. That is, either all of the changes occur or none of them do. If for any reason the transaction cannot be completed, everything this transaction changed can be restored to the state it was in prior to the start of the transaction via a rollback operation. | |
Consistency @tab Transactions always operate on a consistent view of the data and when they end always leave the data in a consistent state. Data may be said to be consistent as long as it conforms to a set of invariants, such as no two rows in the customer table have the same customer ID and all orders have an associated customer row. While a transaction executes, these invariants may be violated, but no other transaction will be allowed to see these inconsistencies, and all such inconsistencies will have been eliminated by the time the transaction ends. | |
Isolation @tab To a given transaction, it should appear as though it is running all by itself on the database. The effects of concurrently running transactions are invisible to this transaction, and the effects of this transaction are invisible to others until the transaction is committed. | |
Durability @tab Once a transaction is committed, its effects are guaranteed to persist even in the event of subsequent system failures. Until the transaction commits, not only are any changes made by that transaction not durable, but they are guaranteed not to persist in the face of a system failure, as crash recovery will roll back their effects. |
As stated above, a transaction is a group of work being done to data. Unless otherwise directed, MySQL considers each statement a transaction in itself. Multiple updates can be accomplished by placing them in a single statement; however, they are limited to a single table.
Applications tend to require more robust use of transaction
concepts. Take, for example, a system that processes an order: A row may
be inserted in an order table, additional rows may be added to an
order-line table, updates may be made to inventory tables, etc. It is
important that if the order completes, all the changes are made to all
the tables involved; likewise if the order fails, none of the changes to
the tables must occur. To facilitate this requirement, MySQL
has syntax to start a transaction called BEGIN WORK
. All
statements that occur after the BEGIN WORK
statement are grouped
into a single transaction. The end of this transaction occurs when a
COMMIT
or ROLLBACK
statement is encountered. After the
COMMIT
or ROLLBACK
the system returns back to the behavior
before the BEGIN WORK
statement was encountered where every
statement is a transaction.
To permanently turn off the behavior where every statement is a
transaction, MySQL added a variable called
AUTOCOMMIT
. The AUTOCOMMIT
variable can have two values,
1
and 0
. The mode where every statement is a transaction
is when AUTOCOMMIT
is set to 1
(AUTOCOMMIT=1
). When
AUTOCOMMIT
is set to 0
(AUTOCOMMIT=0
), then every
statement is part of the same transaction until the transaction end by
either COMMIT
or ROLLBACK
. Once a transaction completes, a
new transaction is immediately started and the process repeats.
Here is an example of the SQL statements that you may find in a typical order:
BEGIN WORK;
INSERT INTO order VALUES ...;
INSERT INTO order-lines VALUES ...;
INSERT INTO order-lines VALUES ...;
INSERT INTO order-lines VALUES ...;
UPDATE inventory WHERE ...;
COMMIT;
This example shows how to use the BEGIN WORK
statement to start a
transaction. If the variable AUTOCOMMIT
is set to 0
, then
a transaction would have been started already. In this case, the
BEGIN WORK
commits the current transaction and starts a new one.
As mentioned above, when running with AUTOCOMMIT
set to 1
,
each statement executes as a single transaction. When a statement has an
error, then all changes made by the statement must be
undone. Transactions support this behavior. Non-transaction-safe table
handlers would have a partial statement update where some of the changes
from the statement would be contained in the database and other changes
from the statement would not. Work would need to be done to manually
recover from the error.
Transactions are the basis for database recovery. Recovery is what supports the Durability attribute of the ACID transaction.
GEMINI
uses a separate file called the Recovery Log located in
the $DATADIR
directory named gemini.rl
. This file
maintains the integrity of all the GEMINI
tables. GEMINI
cannot recover any data from non-GEMINI
tables. In addition, the
gemini.rl
file is used to rollback transactions in support of the
ROLLBACK
statement.
In the event of a system failure, the next time the MySQL
server is started, GEMINI
will automatically go through its
crash recovery process. The result of crash recovery is that all the
GEMINI
tables will contain the latest changes made to them, and
all transactions that were open at the time of the crash will have been
rolled back.
The GEMINI
Recovery Log reuses space when it can. Space can be
reused when information in the Recovery Log is no longer needed for
crash recovery or rollback.
There are four isolation levels supported by GEMINI
:
These isolation levels apply only to shared locks obtained by select statements, excluding select for update. Statements that get exclusive locks always retain those locks until the transaction commits or rolls back.
By default, GEMINI
operates at the READ COMMITTED
level. You can override the default using the following command:
SET [GLOBAL | SESSION] TRANSACTION ISOLATION LEVEL [READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE ]
If the SESSION
qualifier is used, the specified isolation level
persists for the entire session. If the GLOBAL
qualifier is used,
the specified isolation level is applied to all new connections from
this point forward. Note that the specified isolation level will not
change the behavior for existing connections including the connection
that executes the SET GLOBAL TRANSACTION ISOLATION LEVEL
statement.
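For example, to change the isolation level for the current session only:
mysql> SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;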
ISOLATION LEVEL | DESCRIPTION |
READ UNCOMMITTED
@tab Does not obtain any locks when reading rows. This means that if a
row is locked by another process in a transaction that has a more strict
isolation level, the READ UNCOMMITTED query will not wait until
the locks are released before reading the row. You will get an error if
you attempt any updates while running at this isolation level.
| |
READ COMMITTED
@tab Locks the requested rows long enough to copy the row from the
database block to the client row buffer. If a READ COMMITTED
query finds that a row is locked exclusively by another process, it will
wait until either the row has been released, or the lock timeout value
has expired.
| |
REPEATABLE READ
@tab Locks all the rows needed to satisfy the query. These locks are
held until the transaction ends (commits or rolls back). If a
REPEATABLE READ query finds that a row is locked exclusively by
another process, it will wait until either the row has been released, or
the lock timeout value has expired.
| |
SERIALIZABLE
@tab Locks the table that contains the rows needed to satisfy the
query. This lock is held until the transaction ends (commits or rolls
back). If a SERIALIZABLE query finds that a row is exclusively
locked by another process, it will wait until either the row has been
released, or the lock timeout value has expired.
|
The statements that get exclusive locks are INSERT
,
UPDATE
, DELETE
and SELECT ... FOR UPDATE
. Select
statements without the FOR UPDATE
qualifier get shared locks
which allow other non-FOR UPDATE select statements to read the same
rows but block anyone trying to update the row from accessing it. Rows
or tables with exclusive locks block all access to the row from other
transactions until the transaction ends.
In general terms, the higher the Isolation level the more likelihood of
having concurrent locks and therefore lock conflicts. In such cases,
adjust the -O gemini_lock_table_size
accordingly.
GEMINI
uses row locks, which allows high concurrency for requests
on the same table.
In order to avoid lock table overflow, SQL statements that require applying locks to a large number of rows should either be run at the serializable isolation level or should be covered by a lock table statement.
Memory must be pre-allocated for the lock table. The mysqld server
startup option -O gemini_lock_table_size
can be used to adjust
the number of concurrent locks.
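As a sketch (the value is only an example), the lock table could be enlarged at server startup like this:
shell> mysqld -O gemini_lock_table_size=8192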
The following limitations are in effect for the current version of
GEMINI
:
DROP DATABASE
does not work with GEMINI
tables; instead,
drop all the tables in the database first, then drop the database.
The maximum number of GEMINI
tables is 1012.
The maximum number of GEMINI
files a server can manage is 1012. Each
table consumes one file; an additional file is consumed if the table has
any indexes defined on it.
FULLTEXT
indexes are not supported with GEMINI
tables.
GEMINI does not support AUTO_INCREMENT
fields
that provide alternating values at the component level. If you try to
create such a field, GEMINI
will refuse.
TEMPORARY TABLES
are not supported by GEMINI
. The
statement CREATE TEMPORARY TABLE ... TYPE=GEMINI
will generate
the response: ERROR 1005: Can't create table '/tmp/#sqlxxxxx'
(errno: 0)
.
FLUSH TABLES
has not been implemented with GEMINI
tables.
This section explains the various startup options you can use with
GEMINI
tables, how to backup GEMINI
tables, some
performance considerations and sample configurations, and a brief
discussion of when to use GEMINI
tables.
Specifically, the topics covered in this section are:
Startup options for GEMINI tables
Creating GEMINI tables
Backing up GEMINI tables
Restoring GEMINI tables
Using auto_increment columns with GEMINI tables
Performance considerations
Sample configurations
When to use GEMINI tables
The table below lists options to mysqld that can be used to change the
behavior of GEMINI
tables.
OPTION | DESCRIPTION |
--default-table-type=gemini
@tab Sets the default table handler to be GEMINI . All create
table statements will create GEMINI tables unless otherwise
specified with TYPE=table-type . As noted above, there is
currently a limitation with TEMPORARY tables using GEMINI .
| |
--gemini-flush-log-at-commit
@tab Forces the recovery log buffers to be flushed after every
commit. This can have a serious performance penalty, so use with
caution.
| |
--gemini-recovery=FULL | NONE | FORCE
@tab Sets the recovery mode. Default is FULL . NONE is
useful for performing repeatable batch operations because the updates
are not recorded in the recovery log. FORCE skips crash recovery
upon startup; this corrupts the database, and should be used in
emergencies only.
| |
--gemini-unbuffered-io
@tab All database writes bypass the OS cache. This can provide a
performance boost on heavily updated systems where most of the dataset
being worked on is cached in memory with the gemini_buffer_cache
parameter.
| |
--O gemini_buffer_cache=size
@tab Amount of memory to allocate for database buffers, including Index
and Record information. It is recommended that this number be 10% of the
total size of all GEMINI tables. Do not exceed amount of memory
on the system!
| |
--O gemini_connection_limit=#
@tab Maximum number of connections to GEMINI ; default is
100 . Each connection consumes about 1K of memory.
| |
--O gemini_io_threads=#
@tab Number of background I/O threads; default is 2 . Increase the
number when using --gemini-unbuffered-io
| |
--O gemini_lock_table_size=#
@tab Sets the maximum number of concurrent locks; default is 4096. Using
SET [ GLOBAL | SESSION ] TRANSACTION ISOLATION = ... will
determine how long a program will hold row locks.
| |
--O gemini_lock_wait_timeout=seconds
@tab Number of seconds to wait for record locks when performing queries;
default is 10 seconds. Using SET [ GLOBAL | SESSION ] TRANSACTION
ISOLATION = ... will determine how long a program will hold row locks.
| |
--skip-gemini
@tab Do not use GEMINI . If you use --skip-gemini , MySQL
will not initialize the GEMINI table handler, saving memory; you
cannot use GEMINI tables if you use --skip-gemini .
| |
--transaction-isolation=READ-UNCOMMITTED | READ-COMMITTED | REPEATABLE-READ | SERIALIZABLE
@tab Sets the GLOBAL transaction isolation level for all users that
connect to the server; can be overridden with the SET ISOLATION LEVEL
statement.
|
GEMINI
tables can be created by either using the CREATE
TABLE
syntax or the ALTER TABLE
syntax.
The syntax for creating a GEMINI
table is:
CREATE TABLE table-name (....) TYPE=GEMINI;
The syntax for converting an existing table to GEMINI
is:
ALTER TABLE table-name TYPE=GEMINI;
See section 9 MySQL Tutorial, for more information on how to create and use
MySQL
tables.
GEMINI
supports both BACKUP TABLE
and RESTORE TABLE
syntax. To learn more about how to use BACKUP
and RESTORE
,
see section 7.13 BACKUP TABLE
Syntax and section 7.14 RESTORE TABLE
Syntax.
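For example (the table name and backup directory are illustrative only):
mysql> BACKUP TABLE account TO '/backup/mysql';
mysql> RESTORE TABLE account FROM '/backup/mysql';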
To backup GEMINI
tables outside of the MySQL
environment,
you must first shut down the MySQL
server. Once the server is
shut down, you can copy the files associated with GEMINI
to a
different location. The files that make up the GEMINI
table
handler are:
.gmd
extension below the
$DATADIR
directory. Such files include table.gmd
,
table.gmi
, and table.frm
gemini.db
in the $DATADIR
directory
gemini.rl
in the $DATADIR
directory
gemini.lg
in the $DATADIR
directory
All the GEMINI
files must be copied together. You can not copy
just the .gmi
and .gmd
files to a different
$DATADIR
and have them become part of a new database. You can
copy an entire $DATADIR
directory to another location and start a
MySQL server using the new $DATADIR
.
To restore GEMINI
tables outside of the MySQL
environment,
you must first shut down the MySQL
server. Once the server is
shut down, you can remove all GEMINI
files in the target
$DATADIR
and then copy the files previously backed up into the
$DATADIR
directory.
As mentioned above, the files that make up the GEMINI
table
handler are:
.gmd
extension below the
$DATADIR
directory. Such files include table.gmd
,
table.gmi
, and table.frm
gemini.db
in the $DATADIR
directory
gemini.rl
in the $DATADIR
directory
gemini.lg
in the $DATADIR
directory
When restoring a table, all the GEMINI
files must be copied
together. You can not restore just the .gmi
and .gmd
files.
As mentioned previously, GEMINI
tables support row-level and
table-level locking to increase concurrency in applications and to allow
reading of tables without locking for maximum concurrency in heavy
update environments. This feature has several implications when working
with auto_increment
tables.
In MySQL
, when a column is defined as an auto_increment
column, and a row is inserted into the table with a NULL
for the
column, the auto_increment
column is updated to be 1 higher than
the highest value in the column.
With MyISAM
tables, the auto_increment
function is
implemented by looking in the index and finding the highest value and
adding 1 to it. This is possible because the entire ISAM
table is
locked during the update period and the increment value is therefore
guaranteed to not be changing.
With GEMINI
tables, the auto_increment
function is
implemented by maintaining a counter in a separate location from the
table data. Instead of looking at the highest value in the table index,
GEMINI
tables look at this separately maintained counter. This
means that in a transactional model, unlike the bottleneck inherent in
the MyISAM
approach, GEMINI
users do not have to wait
until the transaction that added the last value either commits or
rolls back before looking at the value.
Two side-effects of the GEMINI
implementation are:
When an insert is done where the value for the auto_increment
column is
specified, and this specified value is the highest value, MyISAM
uses it as its auto_increment
value, and every subsequent insert
is based on this. By contrast, GEMINI
does not use this value,
but instead uses the value maintained in the separate GEMINI
counter location.
The counter value can be set with SET
insert_id=#
before inserting a new row in the table. However, as a general
rule, values should not be inserted into an auto_increment
column; the database manager should be maintaining this field, not the
application. SET insert_id
is a recovery mechanism that should be
used in case of error only.
Note that if you delete the row containing the maximum value for an
auto_increment
column, the value will be reused with a
GEMINI
table but not with a MyISAM
table.
See section 7.7 CREATE TABLE
Syntax for more information about creating
auto_increment
columns.
In addition to designing the best possible application, configuration of the data and the server startup parameters need to be considered. How the hardware is being used can have a dramatic effect on how fast the system will respond to queries. Disk drives and memory must both be considered.
Disk Drives
For best performance, you want to spread the data out over as many disks
as possible. Using RAID 10 striping works very well. If there are a lot of
updates then the recovery log (gemini.rl
) should be on a
relatively quiet disk drive.
To spread the data out without using RAID 10, you can do the following:
After converting the tables to GEMINI
by using the
ALTER TABLE <name> TYPE=GEMINI
statements, move (mv
) the
.gmd
and .gmi
files to a different disk drive and link
(ln -s
) them back to the original directory where the .frm
file resides.
Move the gemini.rl
file to its quiet disk location and link
the file back to the $DATADIR
directory.
Memory
The more data that can be placed in memory the faster the access to the
data. Figure out how large the GEMINI
data is by adding up the
.gmd
and .gmi
file sizes. If you can, put at least 10% of
the data into memory. You allocate memory for the rows and indexes by
using the gemini_buffer_cache
startup parameter. For example:
mysqld -O gemini_buffer_cache=800M
would allocate 800 MB of memory for the GEMINI
buffer cache.
Based on the performance considerations above, we can look at some
examples for how to get the best performance out of the system when
using GEMINI
tables.
HARDWARE | CONFIGURATION |
One CPU, 128MB memory, one disk drive
@tab Allocate 80MB of memory for reading and updating GEMINI
tables by starting the mysqld server with the following option:
-O gemini_buffer_cache=80M | |
Two CPUs, 512MB memory, four disk drives
@tab Use RAID 10 to stripe the data across all available disks, or use
the method described in the performance considerations section,
above. Allocate 450MB of memory for reading/updating GEMINI
tables:
-O gemini_buffer_cache=450M |
Because the GEMINI
table handler provides crash recovery and
transaction support, there is extra overhead that is not found in other
non-transaction safe table handlers. Here are some general guidelines
for when to employ GEMINI
and when to use other non-transaction
safe tables (NTST
).
ACCESS TRENDS | TABLE TYPE | REASON |
Read-only
@tab NTST
@tab Less overhead and faster
| ||
Critical data
@tab GEMINI
@tab Crash recovery protection
| ||
High concurrency
@tab GEMINI
@tab Row-level locking
| ||
Heavy update
@tab GEMINI
@tab Row-level locking
|
The table below shows how a typical application schema could be defined.
TABLE | CONTENTS | TABLE TYPE | REASON |
account
@tab Customer account data
@tab GEMINI
@tab Critical data, heavy update
| |||
order
@tab Orders for a customer
@tab GEMINI
@tab Critical data, heavy update
| |||
orderline
@tab Orderline detail for an order
@tab GEMINI
@tab Critical data, heavy update
| |||
invdesc
@tab Inventory description
@tab NTST
@tab Read-only, frequent access
| |||
salesrep
@tab Sales rep information
@tab NTST
@tab Infrequent update
| |||
inventory
@tab Inventory information
@tab GEMINI
@tab High concurrency, critical data
| |||
config
@tab System configuration
@tab NTST
@tab Read-only
|
InnoDB tables are included in the MySQL source distribution starting from 3.23.34a and are activated in the MySQL-Max binary.
If you have downloaded a binary version of MySQL that includes support for InnoDB (mysqld-max), simply follow the instructions for installing a binary version of MySQL. See section 4.6 Installing a MySQL Binary Distribution. See section 15.2 mysqld-max, An extended mysqld server.
To compile MySQL with InnoDB support, download MySQL-3.23.37 or newer
and configure MySQL
with the --with-innodb
option.
See section 4.7 Installing a MySQL Source Distribution.
cd /path/to/source/of/mysql-3.23.37
./configure --with-innodb
InnoDB provides MySQL with a transaction-safe table handler with
commit, rollback, and crash recovery capabilities. InnoDB does
locking on row level, and also provides an Oracle-style consistent
non-locking read in SELECTS
, which increases transaction
concurrency. There is no need for lock escalation in InnoDB,
because row-level locks in InnoDB fit in very little space.
InnoDB has been designed for maximum performance when processing large data volumes. Its CPU efficiency is probably not matched by any other disk-based relational database engine.
You can find the latest information about InnoDB at http://www.innodb.com. The most up-to-date version of the InnoDB manual is always placed there, and you can also order commercial support for InnoDB.
Technically, InnoDB is a database backend placed under MySQL. InnoDB
has its own buffer pool for caching data and indexes in main
memory. InnoDB stores its tables and indexes in a tablespace, which
may consist of several files. This is different from, for example,
MyISAM
tables where each table is stored as a separate file.
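Assuming InnoDB support is compiled in and configured, creating an InnoDB table is just a matter of specifying the table type; for example (the table and column names are illustrative):
mysql> CREATE TABLE customers (id INT NOT NULL PRIMARY KEY, name CHAR(40)) TYPE = InnoDB;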
InnoDB is distributed under the GNU GPL License Version 2 (of June 1991). In the source distribution of MySQL, InnoDB appears as a subdirectory.
Beginning from MySQL-3.23.37 the prefix of the options is changed
from innobase_...
to innodb_...
.
To use InnoDB tables you MUST specify configuration parameters
in the MySQL configuration file in the [mysqld]
section of
the configuration file `my.cnf'. See section 4.16.5 Option Files.
The only required parameter to use InnoDB is innodb_data_file_path
,
but you should set others if you want to get better performance.
Suppose you have a Windows NT machine with 128 MB RAM and a single 10 GB hard disk. Below is an example of possible configuration parameters in `my.cnf' for InnoDB:
innodb_data_file_path = ibdata1:2000M;ibdata2:2000M
innodb_data_home_dir = c:\ibdata
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = c:\iblogs
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=30M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = c:\iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=80M
set-variable = innodb_additional_mem_pool_size=10M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
Suppose you have a Linux machine with 512 MB RAM and three 20 GB hard disks (at directory paths `/', `/dr2' and `/dr3'). Below is an example of possible configuration parameters in `my.cnf' for InnoDB:
innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M
innodb_data_home_dir = /
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = /dr3
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=50M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = /dr3/iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=400M
set-variable = innodb_additional_mem_pool_size=20M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
Note that we have placed the two data files on different disks.
The reason for the name innodb_data_file_path
is that
you can also specify paths to your data files, and
innodb_data_home_dir
is just textually catenated
before your data file paths, adding a possible slash or
backslash in between. InnoDB will fill the tablespace
formed by the data files from bottom up. In some cases it will
improve the performance of the database if all data is not placed
on the same physical disk. Putting log files on a different disk from
data is very often beneficial for performance.
The meanings of the configuration parameters are the following:
innodb_data_home_dir
    The common part of the directory path for all InnoDB data files.

innodb_data_file_path
    Paths to individual data files and their sizes. The full directory path
    to each data file is acquired by concatenating innodb_data_home_dir to
    the paths specified here. The file sizes are specified in megabytes,
    hence the 'M' after the size specification above. Do not set a file size
    bigger than 4000M, and on most operating systems not bigger than 2000M.
    InnoDB also understands the abbreviation 'G', 1G meaning 1024M. The sum
    of the sizes of the files must be at least 10 MB.

innodb_mirrored_log_groups
    Number of identical copies of log groups we keep for the database.
    Currently this should be set to 1.

innodb_log_group_home_dir
    Directory path to InnoDB log files.

innodb_log_files_in_group
    Number of log files in the log group. InnoDB writes to the files in a
    circular fashion. Value 3 is recommended here.

innodb_log_file_size
    Size of each log file in a log group in megabytes. Sensible values range
    from 1M to the size of the buffer pool specified below. The bigger the
    value, the less checkpoint flush activity is needed in the buffer pool,
    saving disk i/o. But bigger log files also mean that recovery will be
    slower in case of a crash. The file size restriction is the same as for
    a data file.

innodb_log_buffer_size
    The size of the buffer which InnoDB uses to write log to the log files
    on disk. Sensible values range from 1M to half the combined size of the
    log files. A big log buffer allows large transactions to run without a
    need to write the log to disk until the transaction commit. Thus, if you
    have big transactions, making the log buffer big will save disk i/o.

innodb_flush_log_at_trx_commit
    Normally this is set to 1, meaning that at a transaction commit the log
    is flushed to disk, the modifications made by the transaction become
    permanent, and they survive a database crash. If you are willing to
    compromise this safety, and you are running small transactions, you may
    set this to 0 to reduce disk i/o to the logs.

innodb_log_arch_dir
    The directory where fully written log files would be archived if we used
    log archiving. The value of this parameter should currently be set the
    same as innodb_log_group_home_dir.

innodb_log_archive
    This value should currently be set to 0. As recovery from a backup is
    done by MySQL using its own log files, there is currently no need to
    archive InnoDB log files.

innodb_buffer_pool_size
    The size of the memory buffer InnoDB uses to cache data and indexes of
    its tables. The bigger you set this, the less disk i/o is needed to
    access data in tables. On a dedicated database server you may set this
    parameter up to 90 % of the machine physical memory size. Do not set it
    too large, though, because competition for physical memory may cause
    paging in the operating system.

innodb_additional_mem_pool_size
    Size of a memory pool InnoDB uses to store data dictionary information
    and other internal data structures. A sensible value for this might be
    2M, but the more tables you have in your application, the more you will
    need to allocate here. If InnoDB runs out of memory in this pool, it
    will start to allocate memory from the operating system and write
    warning messages to the MySQL error log.

innodb_file_io_threads
    Number of file i/o threads in InnoDB. Normally this should be 4, but on
    Windows NT disk i/o may benefit from a larger number.

innodb_lock_wait_timeout
    Timeout in seconds an InnoDB transaction may wait for a lock before
    being rolled back. InnoDB automatically detects transaction deadlocks
    in its own lock table and rolls back the transaction. If you use the
    LOCK TABLES command, or transaction-safe table handlers other than
    InnoDB, in the same transaction, then a deadlock may arise which InnoDB
    cannot notice. In cases like this the timeout is useful to resolve the
    situation.

innodb_unix_file_flush_method
    (Available from 3.23.39 up.) The default value for this is fdatasync.
    Another option is O_DSYNC. The options littlesync and nosync have the
    risk that in an operating system crash or a power outage you may easily
    end up with a half-written database page, and you have to do a recovery
    from a backup. See the section "InnoDB performance tuning", item 6,
    below for tips on how to set this parameter. If you are happy with your
    database performance it is wisest not to specify this parameter at all,
    in which case it will get the default value.
Suppose you have installed MySQL and have edited `my.cnf' so that it contains the necessary InnoDB configuration parameters. Before starting MySQL you should check that the directories you have specified for InnoDB data files and log files exist and that you have access rights to those directories. InnoDB cannot create directories, only files. Check also you have enough disk space for the data and log files.
When you now start MySQL, InnoDB will start creating your data files and log files. InnoDB will print something like the following:
~/mysqlm/sql > mysqld
InnoDB: The first specified data file /home/heikki/data/ibdata1 did not exist:
InnoDB: a new database to be created!
InnoDB: Setting file /home/heikki/data/ibdata1 size to 134217728
InnoDB: Database physically writes the file full: wait...
InnoDB: Data file /home/heikki/data/ibdata2 did not exist: new to be created
InnoDB: Setting file /home/heikki/data/ibdata2 size to 262144000
InnoDB: Database physically writes the file full: wait...
InnoDB: Log file /home/heikki/data/logs/ib_logfile0 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile0 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile1 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile1 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile2 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile2 size to 5242880
InnoDB: Started
mysqld: ready for connections
A new InnoDB database has now been created. You can connect to the MySQL
server with the usual MySQL client programs like mysql
.
When you shut down the MySQL server with `mysqladmin shutdown',
InnoDB output will be like the following:
010321 18:33:34  mysqld: Normal shutdown
010321 18:33:34  mysqld: Shutdown Complete
InnoDB: Starting shutdown...
InnoDB: Shutdown completed
You can now look at the data files and logs directories and you will see the files created. The log directory will also contain a small file named `ib_arch_log_0000000000'. That file resulted from the database creation, after which InnoDB switched off log archiving. When MySQL is again started, the output will be like the following:
~/mysqlm/sql > mysqld
InnoDB: Started
mysqld: ready for connections
If something goes wrong in an InnoDB database creation, you should delete all files created by InnoDB. This means all data files, all log files, the small archived log file, and in the case you already did create some InnoDB tables, delete also the corresponding `.frm' files for these tables from the MySQL database directories. Then you can try the InnoDB database creation again.
Suppose you have started the MySQL client with the command
mysql test
.
To create a table in the InnoDB format you must specify
TYPE = InnoDB
in the table creation SQL command:
CREATE TABLE CUSTOMER (A INT, B CHAR (20), INDEX (A)) TYPE = InnoDB;
This SQL command will create a table and an index on column A
in the InnoDB tablespace consisting of the data files you specified
in `my.cnf'. In addition, MySQL will create a file
`CUSTOMER.frm' in the MySQL database directory `test'.
Internally, InnoDB will add to its own data dictionary an entry
for table 'test/CUSTOMER'
. Thus you can create a table
of the same name CUSTOMER
in another database of MySQL, and
the table names will not collide inside InnoDB.
You can query the amount of free space in the InnoDB tablespace
by issuing the table status command of MySQL for any table you have
created with TYPE = InnoDB
. Then the amount of free
space in the tablespace appears in the table comment section in the
output of SHOW
. An example:
SHOW TABLE STATUS FROM test LIKE 'CUSTOMER'
Note that the statistics SHOW
gives about InnoDB tables
are only approximate: they are used in SQL optimization. Table and
index reserved sizes in bytes are accurate, though.
NOTE: DROP DATABASE
does not currently work for InnoDB tables!
You must drop the tables individually. Also take care not to delete or
add `.frm' files to your InnoDB database manually: use
CREATE TABLE
and DROP TABLE
commands.
InnoDB has its own internal data dictionary, and you will get problems
if the MySQL `.frm' files are out of 'sync' with the InnoDB
internal data dictionary.
InnoDB does not have a special optimization for separate index creation.
Therefore it does not pay to export and import the table and create indexes
afterwards.
The fastest way to alter a table to InnoDB is to do the inserts
directly to an InnoDB table, that is, use ALTER TABLE ... TYPE=INNODB
,
or create an empty InnoDB table with identical definitions and insert
the rows with INSERT INTO ... SELECT * FROM ...
.
To get better control over the insertion process, it may be good to insert big tables in pieces:
INSERT INTO newtable SELECT * FROM oldtable WHERE yourkey > something AND yourkey <= somethingelse;
After all data has been inserted you can rename the tables.
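As a rough sketch, converting a hypothetical table oldtable in pieces could look like the following; the column definitions and the key ranges are only examples and have to match your actual table:
CREATE TABLE newtable (ID INT NOT NULL, NAME CHAR(20), PRIMARY KEY (ID)) TYPE = InnoDB;
INSERT INTO newtable SELECT * FROM oldtable WHERE ID <= 100000;
INSERT INTO newtable SELECT * FROM oldtable WHERE ID > 100000;
ALTER TABLE oldtable RENAME oldtable_old;
ALTER TABLE newtable RENAME oldtable;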
During the conversion of big tables you should set the InnoDB buffer pool size to a large value to reduce disk i/o, but not bigger than 80 % of the physical memory. You should also make the InnoDB log files and the log buffer large.
Make sure you do not run out of tablespace: InnoDB tables take a lot
more space than MyISAM tables. If an ALTER TABLE
runs out
of space, it will start a rollback, and that can take hours if it is
disk-bound.
In inserts InnoDB uses the insert buffer to merge secondary index records
to indexes in batches. That saves a lot of disk i/o. In rollback no such
mechanism is used, and the rollback can take 30 times longer than the
insertion.
In the case of a runaway rollback, if you do not have valuable data in your database, it is better that you kill the database process and delete all InnoDB data and log files and all InnoDB table `.frm' files, and start your job again, rather than wait for millions of disk i/os to complete.
You cannot increase the size of an InnoDB data file. To add more into
your tablespace you have to add a new data file. To do this you have to
shut down your MySQL database, edit the `my.cnf' file, adding a
new file to innodb_data_file_path
, and then start MySQL
again.
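As an illustration, if the tablespace of the example at the beginning of this section fills up and we add a hypothetical third file ibdata3 of 1000 megabytes, the edited line could look like the following:
innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M;ibdata/ibdata3:1000M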
Currently you cannot remove a data file from InnoDB. To decrease the
size of your database you have to use mysqldump
to dump
all your tables, create a new database, and import your tables to the
new database.
If you want to change the number or the size of your InnoDB log files, you have to shut down MySQL and make sure that it shuts down without errors. Then copy the old log files into a safe place, just in case something went wrong in the shutdown and you need them to recover the database. Then delete the old log files from the log file directory, edit `my.cnf', and start MySQL again. InnoDB will tell you at startup that it is creating new log files.
The key to safe database management is taking regular backups. To take a 'binary' backup of your database you have to do the following: shut down your MySQL server and make sure it shuts down without errors, and then copy all your data files, your InnoDB log files, your `my.cnf' configuration file(s), and the `.frm' files of your InnoDB tables to a safe place.
There is currently no on-line or incremental backup tool available for InnoDB, though they are in the TODO list.
In addition to taking the binary backups described above, you should also regularly take dumps of your tables with `mysqldump'. The reason for this is that a binary file may be corrupted without you noticing it. Dumped tables are stored in text files which are human-readable and much simpler than database binary files. Seeing table corruption in dumped files is easier, and since their format is simpler, the chance of serious data corruption in them is smaller.
A good idea is to take the dumps at the same time you take a binary backup of your database. You have to shut out all clients from your database to get a consistent snapshot of all your tables into your dumps. Then you can take the binary backup, and you will then have a consistent snapshot of your database in two formats.
To be able to recover your InnoDB database to the present from the binary backup described above, you have to run your MySQL database with the general logging and log archiving of MySQL switched on. Here by the general logging we mean the logging mechanism of the MySQL server which is independent of InnoDB logs.
To recover from a crash of your MySQL server process, the only thing you have to do is to restart it. InnoDB will automatically check the logs and perform a roll-forward of the database to the present. InnoDB will automatically roll back uncommitted transactions which were present at the time of the crash. During recovery, InnoDB will print out something like the following:
~/mysqlm/sql > mysqld
InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 0 13674004
InnoDB: Doing recovery: scanned up to log sequence number 0 13739520
InnoDB: Doing recovery: scanned up to log sequence number 0 13805056
InnoDB: Doing recovery: scanned up to log sequence number 0 13870592
InnoDB: Doing recovery: scanned up to log sequence number 0 13936128
...
InnoDB: Doing recovery: scanned up to log sequence number 0 20555264
InnoDB: Doing recovery: scanned up to log sequence number 0 20620800
InnoDB: Doing recovery: scanned up to log sequence number 0 20664692
InnoDB: 1 uncommitted transaction(s) which must be rolled back
InnoDB: Starting rollback of uncommitted transactions
InnoDB: Rolling back trx no 16745
InnoDB: Rolling back of trx no 16745 completed
InnoDB: Rollback of uncommitted transactions completed
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Apply batch completed
InnoDB: Started
mysqld: ready for connections
If your database gets corrupted or your disk fails, you have to do the recovery from a backup. In the case of corruption, you should first find a backup which is not corrupted. From a backup do the recovery from the general log files of MySQL according to instructions in the MySQL manual.
InnoDB implements a checkpoint mechanism called a fuzzy checkpoint. InnoDB flushes modified database pages from the buffer pool in small batches; there is no need to flush the buffer pool in one single batch, which would in practice stop the processing of user SQL statements for a while.
In crash recovery InnoDB looks for a checkpoint label written to the log files. It knows that all modifications to the database before the label are already present on the disk image of the database. Then InnoDB scans the log files forward from the place of the checkpoint applying the logged modifications to the database.
InnoDB writes to the log files in a circular fashion. All committed modifications which make the database pages in the buffer pool different from the images on disk must be available in the log files in case InnoDB has to do a recovery. This means that when InnoDB starts to reuse a log file in the circular fashion, it has to make sure that the database page images on disk already contain the modifications logged in the log file InnoDB is going to reuse. In other words, InnoDB has to make a checkpoint and often this involves flushing of modified database pages to disk.
The above explains why making your log files very big may save disk i/o in checkpointing. It can make sense to set the total size of the log files as big as the buffer pool or even bigger. The drawback in big log files is that crash recovery can last longer because there will be more log to apply to the database.
InnoDB data and log files are binary-compatible on all platforms
if the floating point number format on the machines is the same.
You can move an InnoDB database simply by copying all the relevant
files, which we already listed in the previous section on backing up
a database. If the floating point formats on the machines are
different but you have not used FLOAT
or DOUBLE
data types in your tables then the procedure is the same: just copy
the relevant files. If the formats are different and your tables
contain floating point data, you have to use `mysqldump'
and `mysqlimport' to move those tables.
A performance tip is to switch off the auto commit when you import data into your database, assuming your tablespace has enough space for the big rollback segment the big import transaction will generate. Do the commit only after importing a whole table or a segment of a table.
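A minimal sketch of this, assuming the import is a series of plain INSERT statements into a hypothetical table yourtable:
set autocommit=0;
INSERT INTO yourtable VALUES (1, 2);
INSERT INTO yourtable VALUES (5, 5);
# ... more inserts for the same table ...
COMMIT;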
In the InnoDB transaction model the goal has been to combine the best properties of a multiversioning database with traditional two-phase locking. InnoDB does locking on the row level and runs queries by default as non-locking consistent reads, in the style of Oracle. The lock table in InnoDB is stored so space-efficiently that lock escalation is not needed: typically several users are allowed to lock every row in the database, or any random subset of the rows, without InnoDB running out of memory.
In InnoDB all user activity happens inside transactions. If the
auto commit mode is used in MySQL, then each SQL statement
will form a single transaction. If the auto commit mode is
switched off, then we can think that a user always has a transaction
open. If he issues
the SQL COMMIT
or ROLLBACK
statement, that
ends the current transaction, and a new one starts. Both statements
will release all InnoDB locks that were set during the
current transaction. A COMMIT
means that the
changes made in the current transaction are made permanent
and become visible to other users. A ROLLBACK
on the other hand cancels all modifications made by the current
transaction.
A consistent read means that InnoDB uses its multiversioning to present to a query a snapshot of the database at a point in time. The query will see the changes made by exactly those transactions that committed before that point of time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query will see the changes made by the transaction itself which issues the query.
When a transaction issues its first consistent read, InnoDB assigns the snapshot, or the point of time, which all consistent reads in the same transaction will use. In the snapshot are all transactions that committed before assigning the snapshot. Thus the consistent reads within the same transaction will also be consistent with respect to each other. You can get a fresher snapshot for your queries by committing the current transaction and after that issuing new queries.
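For example, with a hypothetical table T and the auto commit mode switched off, the following makes the last SELECT use a fresh snapshot:
SELECT * FROM T;   # uses the snapshot assigned to the current transaction
COMMIT;            # ends the transaction and discards its snapshot
SELECT * FROM T;   # runs in a new transaction and gets a fresh snapshot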
Consistent read is the default mode in which InnoDB processes
SELECT
statements. A consistent read does not set any locks
on the tables it accesses, and therefore other users are free to
modify those tables at the same time a consistent read is being performed
on the table.
A consistent read is not convenient in some circumstances.
Suppose you want to add a new row into your table CHILD
,
and make sure that the child already has a parent in table
PARENT
.
Suppose you use a consistent read to read the table PARENT
and indeed see the parent of the child in the table. Can you now safely
add the child row to table CHILD
? No, because it may
happen that meanwhile some other user has deleted the parent row
from the table PARENT
, and you are not aware of that.
The solution is to perform the SELECT
in a locking
mode, LOCK IN SHARE MODE
.
SELECT * FROM PARENT WHERE NAME = 'Jones' LOCK IN SHARE MODE;
Performing a read in share mode means that we read the latest
available data, and set a shared mode lock on the rows we read.
If the latest data belongs to a yet uncommitted transaction of another
user, we will wait until that transaction commits.
A shared mode lock prevents others from updating or deleting
the row we have read. After we see that the above query returns
the parent 'Jones'
, we can safely add his child
to table CHILD
, and commit our transaction.
This example shows how to implement referential
integrity in your application code.
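Put together as one transaction, the whole sequence could look like the following sketch; the columns of CHILD are only assumed here for illustration:
set autocommit=0;
SELECT * FROM PARENT WHERE NAME = 'Jones' LOCK IN SHARE MODE;
# if the parent row was returned, the shared lock now keeps other users
# from deleting or updating it until we commit
INSERT INTO CHILD (NAME, PARENT_NAME) VALUES ('Jones junior', 'Jones');
COMMIT;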
Let us look at another example: we have an integer counter field in
a table CHILD_CODES
which we use to assign
a unique identifier to each child we add to table CHILD
.
Obviously, using a consistent read or a shared mode read
to read the present value of the counter is not a good idea, since
then two users of the database may see the same value for the
counter, and we will get a duplicate key error when we add
the two children with the same identifier to the table.
In this case there are two good ways to implement the
reading and incrementing of the counter: (1) update the counter
first by incrementing it by 1 and only after that read it,
or (2) read the counter first with
a lock mode FOR UPDATE
, and increment after that:
SELECT COUNTER_FIELD FROM CHILD_CODES FOR UPDATE;
UPDATE CHILD_CODES SET COUNTER_FIELD = COUNTER_FIELD + 1;
A SELECT ... FOR UPDATE
will read the latest
available data setting exclusive locks on each row it reads.
Thus it sets the same locks a searched SQL UPDATE
would set
on the rows.
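For comparison, approach (1) can be sketched as follows; because a transaction sees its own modifications, the SELECT after the UPDATE returns the incremented value, and the exclusive lock set by the UPDATE makes other users who follow the same procedure wait until the transaction commits, so each of them gets a distinct value:
set autocommit=0;
UPDATE CHILD_CODES SET COUNTER_FIELD = COUNTER_FIELD + 1;
SELECT COUNTER_FIELD FROM CHILD_CODES;
COMMIT;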
In row level locking InnoDB uses an algorithm called next-key locking. InnoDB does the row level locking so that when it searches or scans an index of a table, it sets shared or exclusive locks on the index records it encounters. Thus the row level locks are more precisely called index record locks.
The locks InnoDB sets on index records also affect the 'gap'
before that index record. If a user has a shared or exclusive
lock on record R in an index, then another user cannot insert
a new index record immediately before R in the index order.
This locking of gaps is done to prevent the so-called phantom
problem. Suppose I want to read and lock all children with identifier
bigger than 100 from table CHILD
,
and update some field in the selected rows.
SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;
Suppose there is an index on table CHILD
on column
ID
. Our query will scan that index starting from
the first record where ID
is bigger than 100.
Now, if the locks set on the index records did not lock out
inserts made in the gaps, a new child might meanwhile be
inserted into the table. If I now execute in my transaction
SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;
again, I will see a new child in the result set the query returns. This is against the isolation principle of transactions: a transaction should be able to run so that the data it has read does not change during the transaction. If we regard a set of rows as a data item, then the new 'phantom' child would break this isolation principle.
When InnoDB scans an index it can also lock the gap
after the last record in the index. Just that happens in the previous
example: the locks set by InnoDB will prevent any insert to
the table where ID
would be bigger than 100.
You can use the next-key locking to implement a uniqueness check in your application: if you read your data in share mode and do not see a duplicate for a row you are going to insert, then you can safely insert your row and know that the next-key lock set on the successor of your row during the read will prevent anyone meanwhile inserting a duplicate for your row. Thus the next-key locking allows you to 'lock' the non-existence of something in your table.
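A sketch of this uniqueness check, using the CHILD table of the earlier example and an arbitrary identifier value 150 (assuming for simplicity that ID is the only column we need to set):
set autocommit=0;
SELECT * FROM CHILD WHERE ID = 150 LOCK IN SHARE MODE;
# if no row was returned, the next-key lock set on the successor record
# keeps anyone else from inserting ID = 150 in the meantime
INSERT INTO CHILD (ID) VALUES (150);
COMMIT;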
SELECT ... FROM ...
: this is a consistent read, reading a
snapshot of the database and setting no locks.
SELECT ... FROM ... LOCK IN SHARE MODE
: sets shared next-key locks
on all index records the read encounters.
SELECT ... FROM ... FOR UPDATE
: sets exclusive next-key locks
on all index records the read encounters.
INSERT INTO ... VALUES (...)
: sets an exclusive lock
on the inserted row; note that this lock is not a next-key lock
and does not prevent other users from inserting to the gap before the
inserted row. If a duplicate key error occurs, sets a shared lock
on the duplicate index record.
INSERT INTO T SELECT ... FROM S WHERE ...
sets an exclusive
(non-next-key) lock on each row inserted into T
. Does
the search on S
as a consistent read, but sets shared next-key
locks on S
if the MySQL logging is on. InnoDB has to set
locks in the latter case because in roll-forward recovery from a
backup every SQL statement has to be executed in exactly the same
way as it was done originally.
CREATE TABLE ... SELECT ...
performs the SELECT
as a consistent read or with shared locks, like in the previous
item.
REPLACE
is done like an insert if there is no collision
on a unique key. Otherwise, an exclusive next-key lock is placed
on the row which has to be updated.
UPDATE ... SET ... WHERE ...
: sets an exclusive next-key
lock on every record the search encounters.
DELETE FROM ... WHERE ...
: sets an exclusive next-key
lock on every record the search encounters.
LOCK TABLES ...
: sets table locks. In the implementation
the MySQL layer of code sets these locks. The automatic deadlock detection
of InnoDB cannot detect deadlocks where such table locks are involved:
see the next section below. See also section 13 'InnoDB restrictions'
about the following: since MySQL does not know about row level locks,
it is possible that you
get a table lock on a table where another user currently has row level
locks. But that does not put transaction integrity in danger.
InnoDB automatically detects a deadlock of transactions and rolls
back the transaction whose lock request was the last one to build
a deadlock, that is, a cycle in the waits-for graph of transactions.
InnoDB cannot detect deadlocks where a lock set by a MySQL
LOCK TABLES
statement is involved, or if a lock set
in a table handler other than InnoDB is involved. You have to resolve
these situations using innodb_lock_wait_timeout
set in
`my.cnf'.
When InnoDB performs a complete rollback of a transaction, all the locks of the transaction are released. However, if just a single SQL statement is rolled back as a result of an error, some of the locks set by the SQL statement may be preserved. This is because InnoDB stores row locks in a format where it cannot afterwards know which was set by which SQL statement.
1. If the Unix `top' or the Windows `Task Manager' shows that the CPU usage percentage with your workload is less than 70 %, your workload is probably disk-bound. Maybe you are making too many transaction commits, or the buffer pool is too small. Making the buffer pool bigger can help, but do not set it bigger than 80 % of physical memory.
2. Wrap several modifications into one transaction. InnoDB must flush the log to disk at each transaction commit, if that transaction made modifications to the database. Since the rotation speed of a disk is typically at most 167 revolutions/second, that constrains the number of commits to the same 167/second if the disk does not fool the operating system.
3.
If you can afford the loss of some latest committed transactions, you can
set the `my.cnf' parameter innodb_flush_log_at_trx_commit
to zero. InnoDB tries to flush the log anyway once in a second,
though the flush is not guaranteed.
4. Make your log files big, even as big as the buffer pool. When InnoDB has written the log files full, it has to write the modified contents of the buffer pool to disk in a checkpoint. Small log files will cause many unnecessary disk writes. The drawback in big log files is that recovery time will be longer.
5. Also the log buffer should be quite big, say 8 MB.
6. (Relevant from 3.23.39 up.)
In some versions of Linux and Unix, flushing files to disk with the Unix
fdatasync
and other similar methods is surprisingly slow.
The default method InnoDB uses is the fdatasync
function.
If you are not satisfied with the database write performance, you may
try setting innodb_unix_file_flush_method
in `my.cnf'
to O_DSYNC
, though O_DSYNC seems to be slower on most systems.
You can also try setting it to littlesync
, which means that
InnoDB does not call the file flush for every write it does to a
file, but only
in log flush at transaction commits and data file flush at a checkpoint.
The drawback in littlesync
is that if the operating system
crashes, you can easily end up with a half-written database page,
and you have to
do a recovery from a backup. With nosync
you have even less safety:
InnoDB will only flush the database files to disk at database shutdown.
7. In importing data to InnoDB, make sure that MySQL does not have
autocommit=1 on; if it does, every insert requires a log flush to disk.
Put the line set autocommit=0; at the beginning of your plain SQL import
file, and the line commit; at its end.
If you use the `mysqldump' option --opt, you will get dump
files which are also fast to import into an InnoDB table, even without
wrapping them in the set autocommit=0; ... commit; lines described above.
8. Beware of big rollbacks of mass inserts: InnoDB uses the insert buffer to save disk i/o in inserts, but in a corresponding rollback no such mechanism is used. A disk-bound rollback can take 30 times the time of the corresponding insert. Killing the database process will not help because the rollback will start again at the database startup. The only way to get rid of a runaway rollback is to increase the buffer pool so that the rollback becomes CPU-bound and runs fast, or delete the whole InnoDB database.
9.
Beware also of other big disk-bound operations.
Use DROP TABLE
or TRUNCATE
(from MySQL-4.0 up) to empty a table, not
DELETE FROM yourtable
.
10.
Use the multi-line INSERT
to reduce
communication overhead between the client and the server if you need
to insert many rows:
INSERT INTO yourtable VALUES (1, 2), (5, 5);
This tip is of course valid for inserts into any table type, not just InnoDB.
Since InnoDB is a multiversioned database, it must keep information of old versions of rows in the tablespace. This information is stored in a data structure we call a rollback segment after an analogous data structure in Oracle.
InnoDB internally adds two fields to each row stored in the database. A 6-byte field tells the transaction identifier for the last transaction which inserted or updated the row. Also a deletion is internally treated as an update where a special bit in the row is set to mark it as deleted. Each row also contains a 7-byte field called the roll pointer. The roll pointer points to an undo log record written to the rollback segment. If the row was updated, then the undo log record contains the information necessary to rebuild the content of the row before it was updated.
InnoDB uses the information in the rollback segment to perform the undo operations needed in a transaction rollback. It also uses the information to build earlier versions of a row for a consistent read.
Undo logs in the rollback segment are divided into insert and update undo logs. Insert undo logs are only needed in transaction rollback and can be discarded as soon as the transaction commits. Update undo logs are used also in consistent reads, and they can be discarded only after there is no transaction present for which InnoDB has assigned a snapshot that in a consistent read could need the information in the update undo log to build an earlier version of a database row.
You must remember to commit your transactions regularly. Otherwise InnoDB cannot discard data from the update undo logs, and the rollback segment may grow too big, filling up your tablespace.
The physical size of an undo log record in the rollback segment is typically smaller than the corresponding inserted or updated row. You can use this information to calculate the space need for your rollback segment.
In our multiversioning scheme a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement which did the deletion.
Every InnoDB table has a special index called the clustered index
where the data of the rows is stored. If you define a
PRIMARY KEY
on your table, then the index of the primary key
will be the clustered index.
If you do not define a primary key for your table, InnoDB will internally generate a clustered index where the rows are ordered by the row id InnoDB assigns to the rows in such a table. The row id is a 6-byte field which monotonically increases as new rows are inserted. Thus the rows ordered by the row id will be physically in the insertion order.
Accessing a row through the clustered index is fast, because the row data will be on the same page where the index search leads us. In many databases the data is traditionally stored on a different page from the index record. If a table is large, the clustered index architecture often saves a disk i/o when compared to the traditional solution.
The records in non-clustered indexes (we also call them secondary indexes) in InnoDB contain the primary key value for the row. InnoDB uses this primary key value to search for the row in the clustered index. Note that if the primary key is long, the secondary indexes will use more space.
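To make the structure concrete, consider the following sketch of a table definition; the table and column names are only examples:
CREATE TABLE T (ID INT NOT NULL, NAME CHAR(20), PRIMARY KEY (ID), INDEX (NAME)) TYPE = InnoDB;
# the rows are stored in the clustered index, ordered by ID;
# each record in the secondary index on NAME also contains the ID value,
# which InnoDB uses to find the row in the clustered index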
All indexes in InnoDB are B-trees where the index records are stored in the leaf pages of the tree. The default size of an index page is 16 kB. When new records are inserted, InnoDB tries to leave 1 / 16 of the page free for future insertions and updates of the index records.
If index records are inserted in a sequential (ascending or descending) order, the resulting index pages will be about 15/16 full. If records are inserted in a random order, then the pages will be 1/2 - 15/16 full. If the fillfactor of an index page drops below 1/2, InnoDB will try to contract the index tree to free the page.
It is a common situation in a database application that the primary key is a unique identifier and new rows are inserted in the ascending order of the primary key. Thus the insertions to the clustered index do not require random reads from a disk.
On the other hand, secondary indexes are usually non-unique and insertions happen in a relatively random order into secondary indexes. This would cause a lot of random disk i/o's without a special mechanism used in InnoDB.
If an index record should be inserted to a non-unique secondary index, InnoDB checks if the secondary index page is already in the buffer pool. If that is the case, InnoDB will do the insertion directly to the index page. But, if the index page is not found from the buffer pool, InnoDB inserts the record to a special insert buffer structure. The insert buffer is kept so small that it entirely fits in the buffer pool, and insertions can be made to it very fast.
The insert buffer is periodically merged to the secondary index trees in the database. Often we can merge several insertions on the same page of the index tree, and hence save disk i/o's. It has been measured that the insert buffer can speed up insertions to a table up to 15 times.
If a database fits almost entirely in main memory, then the fastest way to perform queries on it is to use hash indexes. InnoDB has an automatic mechanism which monitors index searches made to the indexes defined for a table, and if InnoDB notices that queries could benefit from building of a hash index, such an index is automatically built.
But note that the hash index is always built based on an existing B-tree index on the table. InnoDB can build a hash index on a prefix of any length of the key defined for the B-tree, depending on what search pattern InnoDB observes on the B-tree index. A hash index can be partial: it is not required that the whole B-tree index is cached in the buffer pool. InnoDB will build hash indexes on demand to those pages of the index which are often accessed.
In a sense, through the adaptive hash index mechanism InnoDB adapts itself to ample main memory, coming closer to the architecture of main memory databases.
After a database startup, when a user first does an insert to a
table T
where an auto-increment column has been defined, and the user does not provide
an explicit value for the column, then InnoDB executes SELECT
MAX(auto-inc-column) FROM T
, and assigns that value incremented
by one to the column and the auto-increment counter of the table.
We say that
the auto-increment counter for table T
has been initialized.
InnoDB follows the same procedure in initializing the auto-increment counter for a freshly created table.
Note that if the user specifies in an insert the value 0 to the auto-increment column, then InnoDB treats the row like the value would not have been specified.
After the auto-increment counter has been initialized, if a user inserts a row where he explicitly specifies the column value, and the value is bigger than the current counter value, then the counter is set to the specified column value. If the user does not explicitly specify a value, then InnoDB increments the counter by one and assigns its new value to the column.
The auto-increment mechanism, when assigning values from the counter, bypasses locking and transaction handling. Therefore you may also get gaps in the number sequence if you roll back transactions which have got numbers from the counter.
The behavior of auto-increment is not defined if a user gives a negative value to the column or if the value becomes bigger than the maximum integer that can be stored in the specified integer type.
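The following sketch illustrates these rules for a hypothetical table T with an auto-increment primary key:
CREATE TABLE T (ID INT NOT NULL AUTO_INCREMENT, NAME CHAR(20), PRIMARY KEY (ID)) TYPE = InnoDB;
INSERT INTO T (NAME) VALUES ('a');          # counter initialized from SELECT MAX(ID) FROM T; ID becomes 1
INSERT INTO T (ID, NAME) VALUES (0, 'b');   # 0 is treated as if no value had been specified; ID becomes 2
INSERT INTO T (ID, NAME) VALUES (10, 'c');  # explicit value bigger than the counter; counter is set to 10
INSERT INTO T (NAME) VALUES ('d');          # ID becomes 11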
In disk i/o InnoDB uses asynchronous i/o. On Windows NT it uses the native asynchronous i/o provided by the operating system. On Unix, InnoDB uses simulated asynchronous i/o built into InnoDB: InnoDB creates a number of i/o threads to take care of i/o operations, such as read-ahead. In a future version we will add support for simulated aio on Windows NT and native aio on those versions of Unix which have one.
On Windows NT InnoDB uses non-buffered i/o. That means that the disk pages InnoDB reads or writes are not buffered in the operating system file cache. This saves some memory bandwidth.
You can also use a raw disk in InnoDB, though this has not been tested yet: just define the raw disk in place of a data file in `my.cnf'. You must give the exact size in bytes of the raw disk in `my.cnf', because at startup InnoDB checks that the size of the file is the same as specified in the configuration file. Using a raw disk you can on some versions of Unix perform non-buffered i/o.
There are two read-ahead heuristics in InnoDB: sequential read-ahead and random read-ahead. In sequential read-ahead InnoDB notices that the access pattern to a segment in the tablespace is sequential. Then InnoDB will post in advance a batch of reads of database pages to the i/o system. In random read-ahead InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool. Then InnoDB posts the remaining reads to the i/o system.
The data files you define in the configuration file form the tablespace of InnoDB. The files are simply catenated to form the tablespace, there is no striping in use. Currently you cannot directly instruct where the space is allocated for your tables, except by using the following fact: from a newly created tablespace InnoDB will allocate space starting from the low end.
The tablespace consists of database pages whose default size is 16 kB. The pages are grouped into extents of 64 consecutive pages. The 'files' inside a tablespace are called segments in InnoDB. The name of the rollback segment is somewhat misleading because it actually contains many segments in the tablespace.
For each index in InnoDB we allocate two segments: one is for non-leaf nodes of the B-tree, the other is for the leaf nodes. The idea here is to achieve better sequentiality for the leaf nodes, which contain the data.
When a segment grows inside the tablespace, InnoDB allocates the first 32 pages to it individually. After that InnoDB starts to allocate whole extents to the segment. InnoDB can add to a large segment up to 4 extents at a time to ensure good sequentiality of data.
Some pages in the tablespace contain bitmaps of other pages, and therefore a few extents in an InnoDB tablespace cannot be allocated to segments as a whole, but only as individual pages.
When you issue a query SHOW TABLE STATUS FROM ... LIKE ...
to ask for available free space in the tablespace, InnoDB will
report you the space which is certainly usable in totally free extents
of the tablespace. InnoDB always reserves some extents for
clean-up and other internal purposes; these reserved extents are not
included in the free space.
When you delete data from a table, InnoDB will contract the corresponding B-tree indexes. It depends on the pattern of deletes if that frees individual pages or extents to the tablespace, so that the freed space is available for other users. Dropping a table or deleting all rows from it is guaranteed to release the space to other users, but remember that deleted rows can be physically removed only in a purge operation after they are no longer needed in transaction rollback or consistent read.
If there are random insertions or deletions in the indexes of a table, the indexes may become fragmented. By fragmentation we mean that the physical ordering of the index pages on the disk is not close to the alphabetical ordering of the records on the pages, or that there are many unused pages in the 64-page blocks which were allocated to the index.
It can speed up index scans if you
periodically use mysqldump
to dump the table to
a text file, drop the table, and reload it from the dump.
Another way to do the defragmenting is to ALTER
the table type to
MyISAM
and back to InnoDB
again.
Note that a MyISAM
table must fit in a single file
on your operating system.
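For a hypothetical table yourtable, the latter method is simply the following:
ALTER TABLE yourtable TYPE = MyISAM;
ALTER TABLE yourtable TYPE = InnoDB;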
If the insertions to an index are always ascending and records are deleted only from the end, then the file space management algorithm of InnoDB guarantees that fragmentation in the index will not occur.
The error handling in InnoDB is not always the same as specified in the ANSI SQL standards. According to the ANSI standard, any error during an SQL statement should cause the rollback of that statement. InnoDB sometimes rolls back only part of the statement. The following list specifies the error handling of InnoDB.
If you run out of file space in the tablespace, you will get the MySQL
'Table is full' error
and InnoDB rolls back the SQL statement.
A transaction deadlock or a timeout in a lock wait gives
'Table handler error 1000000' and InnoDB rolls back
the SQL statement.
A duplicate key error rolls back only the insert of that particular row,
even in a statement like INSERT INTO ... SELECT ....
This will probably change so that the SQL statement will be rolled
back if you have not specified the IGNORE option in your
statement.
SHOW TABLE STATUS
does not give accurate statistics
on InnoDB tables, except for the physical size reserved by the table.
The row count is only a rough estimate used in SQL optimization.
You cannot create a UNIQUE index on a prefix of a column; thus the following table creation will not work:
CREATE TABLE T (A CHAR(20), B INT, UNIQUE (A(5))) TYPE = InnoDB;
If you create a non-unique index on a prefix of a column, InnoDB will create an index over the whole column.
INSERT DELAYED
is not supported for InnoDB tables.
LOCK TABLES
operation does not know of InnoDB
row level locks set in already completed SQL statements: this means that
you can get a table lock on a table even if there still exist transactions
of other users which have row level locks on the same table. Thus
your operations on the table may have to wait if they collide with
these locks of other users. Also a deadlock is possible. However,
this does not endanger transaction integrity, because the row level
locks set by InnoDB will always take care of the integrity.
Also, a table lock prevents other transactions from acquiring more
row level locks (in a conflicting lock mode) on the table.
You cannot have a key on a BLOB
or TEXT
column.
DELETE FROM TABLE
does not regenerate the table but instead
deletes all rows, one by one, which is not that fast. In future versions
of MySQL you can use TRUNCATE
which is fast.
The maximum row length in InnoDB is restricted, and this restriction also
limits the size of BLOB
and TEXT
type
columns. The restriction on the size of BLOB
and
TEXT
columns will be removed by June 2001 in a future version of
InnoDB.
Contact information of Innobase Oy, producer of the InnoDB engine. Website: http://www.innodb.com. Email: Heikki.Tuuri@innodb.com
phone: 358-9-6969 3250 (office) 358-40-5617367 (mobile)
InnoDB Oy Inc.
World Trade Center Helsinki
Aleksanterinkatu 17
P.O.Box 800
00101 Helsinki
Finland