Archive for the 'Oracle EE' Category

Recovery of deleted files on linux

Thursday, May 17th, 2007

A recent thread on the oracle-l maillist (http://www.freelists.org/list/oracle-l) called “REDO LOG Concepts” caught my attention. Someone asked about the transaction state of the (oracle) database when the online redologfiles have been deleted prior to a commit.

Most people know that on linux and unixes a file keeps on living (=available) in the context of a process as long as the file is kept open by the process, despite deletion by another process. This file is gone for all processes that try to access it after the deletion, but the deleted files will only get really disposed once all processes that have the file opened close the the connection (file descriptor) to it.

I guess everybody who is doing/has done linux/unix administration has encountered the situation at least once where a filesystem gets filled with some kind of logfile or tracefile which grows rapidly and keeps on growing at a very fast pace, and deleting it does not release any space (because the process which is filling the file is not stopped, so the file is held open, so the file still exists, despite deletion)

Some suggestions where made for recovery and protecting against it (mirroring online redologfiles). I guess most people know recovery fundamentals, so I will not dive further into that matter.

A fellow oracle-l member (stefan knecht) however suggested:

I’ve run into this situation before as well. It happens, people accidentally delete files.

What would be interesting to research, is if you still can somehow recover those files. As they’re still there, and locked by the process accessing them (in your case LGWR), there might be a way to “steal” the filehandle from that process and just write the file to another location before it’s being closed. Though you would have to intercept the syscall closing the file.

That made me think: linux has the proc filesystem, which gives system information, but also process information. This process information is quite extensive. One of the sections (directory actually) is a list of all the file descriptors for a process. These file descriptors are shown as symlinks:


[oracle@centos50-oracle10203 bin]$ ./sqlplus '/as sysdba'

SQL*Plus: Release 10.2.0.3.0 - Production on Thu May 17 16:24:01 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 167772160 bytes
Fixed Size 1260696 bytes
Variable Size 130024296 bytes
Database Buffers 33554432 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options
[oracle@centos50-oracle10203 bin]$ ps -ef | grep pmon
oracle 8634 1 0 16:24 ? 00:00:00 ora_pmon_frits
oracle 8680 2642 0 16:24 pts/1 00:00:00 grep pmon
[oracle@centos50-oracle10203 bin]$ cd /proc/8634/fd
[oracle@centos50-oracle10203 fd]$ ls -ls
total 0
0 lr-x------ 1 oracle oracle 64 May 17 16:24 0 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 1 -> /dev/null
0 lrwx------ 1 oracle oracle 64 May 17 16:24 10 -> /oracle/db/10.2.0.3/dbs/hc_frits.dat
0 lrwx------ 1 oracle oracle 64 May 17 16:24 11 -> /oracle/db/10.2.0.3/admin/frits/adump/ora_8632.aud
0 lr-x------ 1 oracle oracle 64 May 17 16:24 12 -> /dev/zero
0 lr-x------ 1 oracle oracle 64 May 17 16:24 13 -> /dev/zero
0 lr-x------ 1 oracle oracle 64 May 17 16:24 14 -> /oracle/db/10.2.0.3/rdbms/mesg/oraus.msb
0 lrwx------ 1 oracle oracle 64 May 17 16:24 15 -> /oracle/db/10.2.0.3/dbs/hc_frits.dat
0 lrwx------ 1 oracle oracle 64 May 17 16:24 16 -> socket:[33865]
0 lrwx------ 1 oracle oracle 64 May 17 16:24 17 -> /oracle/db/10.2.0.3/dbs/lkFRITS
0 lr-x------ 1 oracle oracle 64 May 17 16:24 18 -> /oracle/db/10.2.0.3/rdbms/mesg/oraus.msb
0 lr-x------ 1 oracle oracle 64 May 17 16:24 2 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 3 -> /dev/null
0 lr-x------ 1 oracle oracle 64 May 17 16:24 4 -> /dev/null
0 l-wx------ 1 oracle oracle 64 May 17 16:24 5 -> /oracle/db/10.2.0.3/admin/frits/udump/frits_ora_8632.trc
0 l-wx------ 1 oracle oracle 64 May 17 16:24 6 -> /oracle/db/10.2.0.3/admin/frits/bdump/alert_frits.log
0 lrwx------ 1 oracle oracle 64 May 17 16:24 7 -> /oracle/db/10.2.0.3/dbs/lkinstfrits (deleted)
0 l-wx------ 1 oracle oracle 64 May 17 16:24 8 -> /oracle/db/10.2.0.3/admin/frits/bdump/alert_frits.log
0 l-wx------ 1 oracle oracle 64 May 17 16:24 9 -> pipe:[11466]

As you can see, these are the open files of the pmon process of an oracle 10.2.0.3 database just after startup.
(if you want to use ltrace, strace or gdb on a process, this is the place to find the mapping of file descriptor to a file, socket or pipe!)

Can I use a file descriptor to get data out of the file? Probably, because it’s a symlink which is perfectly normal to use. Well then, back to the suggestion made:

“As they’re still there, and locked by the process accessing them (in your case LGWR), there might be a way to “steal” the filehandle from that process and just write the file to another location before it’s being closed.”

A quick test on a test system (centos 4.4, oracle 10.2.0.3):


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------------------------------------------------------------------------ ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Let’s see which log group is the current one:


SQL> select * from v$log;

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# IRST_TIM
---------- ---------- ---------- ---------- ---------- --- ----------------------------- ---------
1 1 38 52428800 2 NO INACTIVE 1198861 10-APR-07
2 1 39 52428800 2 NO INACTIVE 1238951 10-APR-07
3 1 40 52428800 2 NO CURRENT 1274534 17-MAY-07

SQL>

Log group 3 is the current one. Well, lets delete one of the files of group 1 (I have two members in every log group, lets pretend we have only one member):


[oracle@filetest bin]$ rm /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log

Now look at the file descriptors of the logwriter:


...
1 lrwx------ 1 root root 64 May 17 17:07 21 -> /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log (deleted)
...

And lets make a copy of the file using the file descriptor:


[root@filetest ~]# cat /proc/4703/fd/21 > o1_mf_1_31l7knbs_.log
[root@filetest ~]# ls -ls o1_mf_1_31l7knbs_.log
51260 -rw-r--r-- 1 root root 52429312 May 17 17:11 o1_mf_1_31l7knbs_.log

It looks like we’ve made a copy of a deleted file, using the file descriptor in the proc filesystem. Now let’s shutdown the database, and start it up. This will go wrong, because a member of the online redolog group 1 can not be identified:


SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 167772160 bytes
Fixed Size 1260696 bytes
Variable Size 109052776 bytes
Database Buffers 54525952 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.

SQL>

Excuse me? Shouldn’t we get an error message saying an online logfile could not be identified? Let’s investigate further:


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------- ----------------------------------------------------------------- ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 INVALID ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Well, apparently not….I recall situations where a lost online redologfile would cause the startup to fail, and needing to copy an existing one in the group to the missing one. I guess this is altered in some version of the database. Anybody know which version?

Well, it makes sense. The status of the file is set to ‘INVALID’ and the surviving file(s) are used. A message is printed in the alert.log:


Thu May 17 17:43:13 2007
Errors in file /oracle/db/10.2.0.3/admin/frits/bdump/frits_lgwr_13960.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu May 17 17:43:13 2007
Errors in file /oracle/db/10.2.0.3/admin/frits/bdump/frits_lgwr_13960.trc:
ORA-00321: log 1 of thread 1, cannot update log file header
ORA-00312: online log 1 thread 1: '/oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log'

Now, let’s replace the missing one with our copy from the file descriptor:


[root@filetest ~]# cat o1_mf_1_31l7knbs_.log > /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log
[root@filetest ~]# chown oracle:oinstall /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log

Now switch from logfile until we reach the log group in which the invalid file resides (log group 1)…


SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------------------------------------------------------------------------ ---
3 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_3_31l7kp85_.log NO
3 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_3_31l7kqsl_.log YES
2 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_2_31l7ko7c_.log NO
2 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_2_31l7koq3_.log YES
1 ONLINE /oracle/frits/data/FRITS/onlinelog/o1_mf_1_31l7knbs_.log NO
1 ONLINE /oracle/frits/flashreco/FRITS/onlinelog/o1_mf_1_31l7knrh_.log YES

6 rows selected.

SQL>

Look! The file is valid again!

Once a logfile is invalid for whatever reason, oracle continues to work as long as there are valid members in the group. During a logswitch, the files in the log group where oracle switches to are identified, even if they are invalid, and reset to a valid state if they are usable again.

Also, it is possible to make a copy of an already deleted file (as root on centos 4.4 (centos 4.4 is a free version of RHEL4) and as the owner of the file or root on centos 5 (RHEL5)) using the filedescriptor in proc.

How tuning can work out the wrong way

Saturday, April 28th, 2007

(…and always check all direct involved components yourself)

Recently I’ve been asked to look into a performance issue where an import already was taking a week.

Oracle version: 9.2.0.7 64-bit EE
Operating system: SunOS 5.8 Generic_117350-26 sun4u sparc SUNW, Sun-Fire-880
The hardware has 8 CPU’s, 32G RAM and a sun T2000 storage cabinet attached.

There already was a database present on the machine, which was a testdatabase which wasn’t doing anything. Also, it was sized rather small for this kind of machine (25M buffercache; 25M pga (automatic)).

The import took place on a newly created database. Because this database instance configuration was also tremendously small (25M buffercache; 25M pga (automatic)), I decided to kill the current import and set sensible values for some key memory parts to take advantage of the hardware (10G buffercache; 5G pga). Also I asked the unix administrator to mount the oracle data filesystem (one for both databases) with the “forcedirectio” and “noatime” options.

The “forcedirectio” option enables direct IO on the filesystem, to avoid double buffering (rough and short: meaning reads are not buffered by the operating system buffercache), “noatime” disables updating the Inodes on the filesystem on access.

So, with a decent configured database and filesystem configured for optimal IO, we started the import again.

Next, it appeared the testdatabase wasn’t really doing nothing. In fact, it was continuously used and because of the tuning actions, it was performing worse. A look in the performance repository revealed a big fall back in physical IO:
Physical IO
(rightclick, “view image”)

Also, the import still performed the same as before. Taking an enormous time…

What has happened, and what went wrong?

1.
Always check all the compontents involved in a tuning action. The first thing which did go wrong, was a database which was reported to nothing, did approximately 1500 physical IO’s per second (from the database point of view).

2.
Because the testdatabase (which was presumed not to do anything) was sized very small, the oracle buffercache couldn’t cache all the data of its transactions, but it’s IO was buffered on the operating system level (the operating system had plenty of space in it’s memory for that). That means that the IO seen as physical by the database, was in fact all buffered on the operating system level.

3.
By forcing IO to go direct, it removed the ability to cache on the operating system level, which means the IO was really done by the storage cabinet now. Apparently (when looking at the graph), the maximum speed of the IO done by this cabinet was approximately 550 IO’s per second.

4.
I didn’t check what the import was waiting for. A look at the waits of the import session revealed mainly ‘direct path write’ waits, and the ’seconds_in_wait’ column showed waittimes going as high as 500 (seconds!). Also, the alertlog showed: ‘WARNING: aiowait timed out 1 times’.

Both the ‘direct path write’ wait and the warning in the alertlog point us to IO contention. Some investigation more revealed that this behavior is known on sun (and is in fact CPU contention; we do not get enough priority to be able to write) and solved by both patch 112254-01 and setting the operating system kernel parameter ts_sleep_promote=1 (default 0). In short, this patch alters the behavior of the operating system which could prevent certain processes from getting runnable (and thus waiting for a very long time, making the import last longer than a week).

Okay, how to solve this issues?

1. To reduce the number of IO’s of the testdatabase, scale up the buffercache. Plenty of space is available.
2. Because the number of IO’s is reduced, and the priorisation of runnable processes is altered (ts_sleep_promote) both are able to run perfectly alongside each other.

So, lesson learned: always check all the components involved.

OCFS version 1 odd behavior

Saturday, November 25th, 2006

Last week I’ve been involved in troubleshooting and getting a 2-node RAC database to work for a customer. (linux, RHEL3, oracle 10.2.0.2)

Among other things, I’ve found corruptions on an OCFS filesystem. In order to try to fix that, I wanted to run fsck. But before that I wanted a backup (always preserve the way back!). The client’s backup suite is Tivoli Storage Manager.

It appeared Tivoli Storage Manager (TSM) only was able to do the backup at the speed of about 5 megabyte per second. This immediatly reminded me of the O_DIRECT behavior of OCFS, but it had become night, the client wanted to work the next day and not too much time was left; meaning: though luck, that is what we have.

In order to see if the filesystem was corrupt from a database’s perspective, I decided to run db verify (dbv). During the db verify (on the other node of the RAC database cluster), I decided to tar a databasefile and see what my speed was. That appeared to be around 50 megabyte per second. That’s odd…that’s ten times faster than the TSM backup did. Little later, I saw the performance of my tar session decrease to the same 5 megabyte per second as TSM did. Little later I saw the db verify be ready….

That made me think: could it be db verify influences the IO done by another session on another file? Seems unlikely, but because db verify is non-intrusive, I decided to try to run it again. The outcome struck me: I got the “old” 50 megabyte per second performance back. ???

Little later I decided to find out which database files already were backuped by TSM and do a db verify on that node too. IO performance increase of ten times makes us wait much less! The same happened there: TSM backup performance increased to around 50 megabyte per second!!

The next day I thought about how this could be. Because of the lack of time, I didn’t fetch the code (OCFS is open source) and read through it. These are my findings:

- OCFS is optimised to deliver high speed just for the oracle database, using O_DIRECT calls
- OCFS is never meant to be a “normal” filesystem
- The documentation clearly states that only O_DIRECT enabled tools should be used to do file maintenance. There are even O_DIRECT versions of tar, cp, md5sum, dd, etc. on the website
- The ability to do IO using normal/non-O_DIRECT tools seems to more of a “workaround” than normal behavior
- If some O_DIRECT tool opens a file on the OCFS filesystem, it locks THAT file, but seems to provide other non-O_DIRECT processes the ability to better optimised IO

Database horror: reanimating a fool backup

Wednesday, August 2nd, 2006

Sometimes life isn’t pretty. Last week I’ve been asked to reanimate/save a database of which only a backup exists which was done without the database being in backup mode. Yes, the infamous ‘fool backup’ (the opposite of the ‘full backup’, an online backup done with the database in backup mode). For the record: Windows 2003, Database oracle EE 9.2.0.7.

Can it be done? Yes.
Do you want to do it? In almost all cases: NO.
Do you loose data? Yes, depending on activity at that time.

For the record & other interested readers: oracle offers an undocumented parameter which especially is designed to bring up databases of which not all datafiles are consistent, or of which the oracle engine knows are ‘fuzzy’ (the fuzzy bit is a number in the fileheader of a datafile which oracle uses to put the state of a datafile in. Because the fuzzy bit gets set in ‘online backup mode’, oracle knows it is put in backup mode, and can be recovered). This parameter is “_allow_resetlog_corruption”. Oracle’s (and my) advice is: only use this parameter under direct guidance of oracle support.

Originally, this was about a few datafiles which went missing. This led to the first action: only restoring the (wrongly backuped) datafiles, and try to force open the database. (the database was three days older than the restored datafiles)
This action led to a ORA-1555 on opening of the database (which also blew up my instance). Investigation turned out this is due to the locally managed datafiles. (the tracefile showed a query of the database itself on uet$ (used extents data dictionary view), which originally was a (clustered) table in the SYSTEM tablespace, but with locally managed datafiles this data is actually stored inside the datafile in the bitmap area). So, an unfindable block of a space bitmap image of the datafile’s space managed broke the instance. Bummer.

Next action: restore the total fool backup.
This action led to a running instance! Interesting!
Following up (of course), was the action to export the entire database. (to save the data. this database is -absolutely- inconsistent and NOT USABLE!)
Also we are aware we can have inconsistencies and corrupt blocks. For this, we also set an undocumented event at startup, so oracle doesn’t break, but skips corrupt blocks (of tables, an export doesn’t read indexes, but gathers the information to create the an index out of the data dictionary). This event is 10231, and can be set in the init.ora:

event = '10231 trace name context forever, level 10'

We succeeded to export most of the tables, only 4 gave an ORA-1555.
The ORA-1555 means the before image of a transaction in a block is not present at the designated UBA (undo block address).
Every index and data block has a variable list of transactions in the variable transaction header of each block. This list is called ‘ITL’, which means Interested Transaction List. Once a transaction starts, this list arranges consistent reads for other transactions. If a read actions needs to obey the ITL entry (most likely because a transaction is pending on a row), it searches at the UBA for the data consistent with its read. If that data cannot be found at the UBA, the ORA-1555 is triggered.

So, we got ourselves a nice ‘logical corruption’. Oracle perfectly understands the data in the block (so it’s not physically corrupt), but can also not fulfill my wish because it cannot get the old data needed to give me my data consistently.

We choose to ‘get what we could’, by marking the block corrupt (so the 10231 event skips the block). Rough? Yes, but it allowed us to get an export of (most of) the data (and an export of the table). We did this by using the BBED.

The BBED is the ‘block browser and editor’. It is a commandline tool, which is present (as executable) in windows, and can be linked to an executable on linux/unixes. It is password protected.
The BBED can be ’set’ at a block, and by using the command ‘corrupt’ it can corrupt a block (meaning it is marked as ‘media corrupt’, which means the 10231 event will skip that block).

We additionally set the following event to get the DBA (data block address):

event='1555 trace name errorstack level 10'

so we know which block should be marked.

Well, this led to a situation where we were able to get most of the data out. Additionaly I’ve did some research on the rows inside the blocks (you’ll need the format of the table which is stored in the block), but the BBED is made for browsing and editting a block, not for unloading data. (tools like oracle’s DUL, ORA600’s DUDE/jDUL are made for this purpose).

I know this could be done several other ways. But only in this situation we have both as much data as possible, and as little risk as possible (with a potentially corrupted database:-) for inconsistent rows (because those blocks we skipped).