Friday, May 29, 2015

RDBMS -- java.sql.SQLException: Io exception: The Network Adapter could not establish the connection

I think it is worth writing about a problem regarding a SQL Server connection.
The issue was encountered in custom code that was supposed to run on an EBS 12.2 Application Tier. It was thought to be a SQL Server connection issue when it was first escalated to me.

I requested the code block that the developer was complaining about.
The code was something like the following.
Note that, because we are in the Oracle world and working on Oracle Applications in this story, Java is used for this kind of work.

This method was in a custom class stored on the Application Server.

public Connection getMSQLServerConnection()
throws Exception
{
    // Load the jTDS JDBC driver (the jTDS jar must be on the classpath;
    // the class imports java.sql.Connection and java.sql.DriverManager).
    Class.forName("net.sourceforge.jtds.jdbc.Driver");
    // Connect to the SQL Server instance; the host, database, user, and password shown here are placeholders.
    return DriverManager.getConnection("jdbc:jtds:sqlserver://SOME_IP:1433/SOME_DB_NAME;", "erman", "password123");
}

So the method was correct, and I could reach the SQL Server's port using telnet:
telnet sqlserver_ip sqlserver_port
There weren't any classpath issues, either.
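As a side note, the same reachability check can also be done from Java; here is a minimal sketch, with placeholder host and port values (not the real ones):

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder values -- replace with the SQL Server's actual host and port.
        String host = "SOME_IP";
        int port = 1433;
        try (Socket socket = new Socket())
        {
            // Behaves like telnet: throws an exception if the port cannot be reached.
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Port is reachable");
        }
    }
}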

The error produced was "java.sql.SQLException: Io exception: The Network Adapter could not establish the connection", and it was weird. According to this error, there had to be a connection problem somewhere.

With this in mind, I requested the full error stack and spotted the problem when I saw this:

java.sql.SQLException: Io exception: The Network Adapter could not establish the connection
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:124)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:161)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:273)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:327)
at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:360)
at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:150)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:31)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:595)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at erman.oracle.apps.datatransfer.ErmanMSSQLViewDataTransfer.getOracleConnection(ERMANSSQLViewDataTransfer.java:48)
at erman.oracle.apps.datatransfer.ErmanMSSQLViewDataTransfer.transferData(MSSQLViewDataTransfer.java:75)
at erman.oracle.apps.datatransfer.ErmanMSSQLViewDataTransfer.main(ErmanMSSQLViewDataTransfer.java:298)
The issue was right there, and it was not what it seemed.
The driver in the stack was OracleDriver, so it was not the SQL Server driver at all.
If it had been the SQL Server driver, the stack would have shown something like ..jtds.jdbc.Driver.. instead.

So, the problem was actually an Oracle connection problem.
It later turned out that there was another method in the same class, and that method was used to connect to an Oracle database.
The class was a data transfer class, so it was designed to create connections to both the Oracle and the SQL Server databases; it created the connections and then transferred the data between the databases using them.
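The Oracle-side method was not part of the code I was initially given, but it would have looked roughly like the following. This is only a sketch; the connection details, user, and password are illustrative placeholders, not the actual code:

public Connection getOracleConnection()
throws Exception
{
    // Load the Oracle thin JDBC driver -- the driver that actually appears in the error stack.
    Class.forName("oracle.jdbc.driver.OracleDriver");
    // A wrong host/IP in a URL like this is exactly what produces
    // "Io exception: The Network Adapter could not establish the connection".
    return DriverManager.getConnection("jdbc:oracle:thin:@SOME_WRONG_IP:1521:SOME_SID", "some_user", "some_password");
}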

As for the solution;

When I analyzed the Oracle connection method, I saw that the IP address of the Oracle Database Server was wrong. That was the cause of the "java.sql.SQLException: Io exception: The Network Adapter could not establish the connection" error.

Changing the IP address in the Oracle connection string fixed the problem.

I am sharing this with you because it was an enjoyable analysis and the cause was interesting.
Once more, it showed the importance of checking things yourself.

At the end of the day, the key message of this blog post becomes: "While investigating such abnormal issues, do not trust information passed to you based on someone else's analysis! Do your own analysis, check the problem as a whole, prove your finding, and then offer your solution."

Thursday, May 28, 2015

Linux -- Attached USB disk not recognized, Linux on VM

I recently encountered an issue with a USB disk. It was an ESX environment, and one of our databases was running on a Linux guest virtual machine (Oracle Linux 5.9, 64-bit).

The ESX admin mapped the disk using his Virtual Machine Administration console tool, and notified us.

We saw the activity for the SCSI disk in the kernel ring buffer (using dmesg).

"dmesg" output:

"sd 7:0:0:0: [sdc] Attached SCSI removable disk"

dmesg was saying that it saw the disk and that the disk was associated with the device name sdc.

So far so good..

After seeing this in dmesg, we used fdisk, expecting to see the disk as usual.

The strange thing was that fdisk did not display the disk, and so we began to think there was a problem.

"fdisk -l"

does not display any disk with the name /dev/sdc 

This was not normal, because we had already seen in dmesg that the disk was attached and associated with sdc.

Also, using ls, we saw that /dev/sdc was not there at all.


At this point, since we were on a modern system like Oracle Linux, we checked the symbolic links under the /dev/disk/by-id directory.
/dev/disk/by-id shows:

[root@oracledb by-id]# pwd

/dev/disk/by-id
[root@ermandb by-id]# ls

lrwxrwxrwx 1 root root 9 May 27 11:35 usb-Multiple_Card_Reader_058F63666433 -> ../../sdc

The serial name was there, but the device was obviously seen as a "Card Reader", not a USB disk.

To cross-check, we looked at the /proc/scsi/scsi file, and it showed the Vendor as "Card Reader" as well.
/proc/scsi/scsi shows:

Host: scsi7 Channel: 00 Id: 00 Lun: 00
  Vendor: Multiple Model: Card  Reader     Rev: 1.00
  Type:   Direct-Access                    ANSI  SCSI revision: 00


So the problem was obvious: the USB disk was recognized, but it was recognized as a Card Reader, which is why we could not see it in the fdisk output and thus could not operate on it.

For the solution, we forwarded this analysis to the ESX admin and he corrected the virtual USB disk mapping. That is, the USB controller was attached to the virtual machine, but the USB device itself was not :)

Once this was corrected, Linux saw the USB disk properly, and we could proceed.

[root@ermandb ~]#cd /dev/disk/by-id/
[root@ermandb by-id]# ls -al
lrwxrwxrwx 1 root root 9 May 28 16:26 usb-Seagate_Backup+_BK_NA518PVD -> ../../sdc

Reference: VMware
Procedure


1 In the vSphere Client inventory, right-click the virtual machine and select Edit Settings.

2 Click the Hardware tab and click Add.

3 Select USB Controller and click Next.

4 Review the device information and click Next.
5 Click Finish. New USB Controller (adding) appears in the hardware list as Present.

6 Click OK to save your changes and close the dialog box.

When you reopen the Properties Editor, the controller appears on the Hardware tab.
What to do next:

Add one or more USB devices to the virtual machine. -> This was the missing step in our case.

EBS 12.2 -- Notification Mailer, workflow validation error occurred // after cloning an IMAP-SSL enabled environment

If you have an IMAP-SSL enabled EBS 12.2, then you may hit this problem.

The problem may arise after a successful clone operation, exactly when you try to start the Workflow Notification Mailer in your newly cloned environment.
No matter what you do from OAM, you will end up with this error.
You may disable the IMAP and SSL checkboxes, but OAM will still report the validation error.
The error shown on the OAM pages will be "workflow validation error occurred, please check indicated fields below", and the indicated field will be the SMTP/outbound server's textbox, which is actually unrelated.

The cause of this error is the Workflow SSL configuration that comes from the source environment.

So if you have an IMAP-SSL enabled EBS, you had better watch out for this and add the action plan given below to your cloning documents.

A similar approach is documented for 11.5.10.2, and I can say that it applies to EBS 12.2 and 12.1, too:

Unable To Perform The Workflow Mailer Configuration As It Keeps Giving "Validation Error Occured" (Doc ID 603954.1)

Anyway, the cause of the issue is the Workflow Mailer's IMAP-SSL related parameters, and they must be modified from the backend in order to fix this.

The parameters to be modified are:

SSL Trust store -> none

Inbound SSL Enabled -> N 

The utility to be used for this modification is "afsvcpup.sql"

So, for the solution;

Stop the Workflow Services (Agent Listeners and Mailer)
From the Application Tier operating system, as the application owner, run:

sqlplus apps/apps @$FND_TOP/sql/afsvcpup.sql

Choose 10006 for the Workflow Mailer.
Choose "10243 SSL Trust Store" and set it to "NONE".
Choose "10240 Inbound SSL Enabled" and set it to "N".
If your outbound connection is also SSL enabled:
Choose "10244 Outbound SSL Enabled" and set it to "N".

Start the workflow services..

Note that setting the SSL Trust Store from OAM is not possible; it can be set only by using the backend script $FND_TOP/sql/afsvcpup.sql.
It must be set before setting up the Mailer; otherwise, the IMAP server setup may not pass the validation checks made in OAM. Ref: How to Set Mailer Parameter MAILER_SSL_TRUSTSTORE ? (Doc ID 444946.1)

Wednesday, May 27, 2015

AIX -- extending Logical Volumes online

In this post, we will extend a logical volume online on an AIX system.
The AIX system used in this example is AIX 5.3.
The operation is online, and its effects can be seen immediately.
We use commands like lsvg, lslv, chfs, chlv, and df in this example.
Here is some general information about them:

lsvg: Displays information about volume groups
lslv: Displays information about a logical volume
chfs: Changes attributes of a file system
chlv: Changes only the characteristics of a logical volume
df: Reports information about space on file systems

Also, here is general information about the terms used in this example:

LV: Logical Volume
LP: Logical Partition 
PP: Physical Partition
PPSIZE : Physical Partition Size
JFS: journaled file system
JFS2: Enhanced journaled file system. JFS2 is designed to accommodate a 64-bit kernel and larger files

So, let's start.
First, we check the volume groups on this system and then get information about the relevant volume group:

ermansrv1: / > lsvg
rootvg
oravg --> this is the volume group we will focus on

ermanrv1: / > lsvg oravg

VOLUME GROUP:       oravg                    VG IDENTIFIER:  00c6265000004c0000000116a075433a
VG STATE:           active                   PP SIZE:        256 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      1596 (408576 megabytes)
MAX LVs:            256                      FREE PPs:       143 (36608 megabytes)
LVs:                5                        USED PPs:       1453 (371968 megabytes)
OPEN LVs:           5                        QUORUM:         3
TOTAL PVs:          4                        VG DESCRIPTORS: 4
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         4                        AUTO ON:        yes
MAX PPs per VG:     32512
MAX PPs per PV:     1016                     MAX PVs:        32
LTG size (Dynamic): 1024 kilobyte(s)         AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable

So, "There are 143 PPs, almost 35 GB free area in the volume group."
Okay.. We have free PPs to grow, but do quota to grow?
I mean, If we are not in the limit of max allocations; we can just extent the filesystem.

ermanrv1: / > chfs -a size=+1G /orahome
Filesystem size changed to 44040192
ermandbsrv1: / > df

Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv02     44040192   2096832   96%    39495    15% /orahome  (Enlarged)

But if we have reached the maximum allocation for a logical volume, we will get an error while using chfs:

ermanrv1: / > chfs -a size=+20G /data
0516-787 extendlv: Maximum allocation for logical volume fslv00  is 1000.

If that's the case, we need to increase the maximum allocation (MAX LPs) for that logical volume to a reasonable value.
To find the correct limit to set for the LV, we use the lsvg and lslv commands.
(The LV is made up of LPs, and each LP corresponds to one or more PPs; more than one in the case of mirroring.)
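Since MAX LPs multiplied by the PP size gives the maximum size the logical volume can grow to, the target limit can be worked out up front. Here is a small illustrative calculation using the numbers that appear later in this post (the class and variable names are arbitrary):

public class LvSizeMath
{
    public static void main(String[] args)
    {
        int ppSizeMb = 256;  // PP SIZE reported by lsvg
        int maxLps = 1100;   // MAX LPs we plan to set with chlv -x

        // With no mirroring, each LP maps to one PP, so max LV size = MAX LPs * PP size.
        double maxSizeGb = maxLps * ppSizeMb / 1024.0;  // 1100 * 256 / 1024 = 275 GB
        System.out.println("MAX LPs " + maxLps + " allows an LV of up to " + maxSizeGb + " GB");

        // Conversely, the MAX LPs needed for a target size:
        int targetGb = 272;  // the current 252 GB /data plus the 20 GB we want to add
        int neededLps = (int) Math.ceil(targetGb * 1024.0 / ppSizeMb);  // 1088 LPs
        System.out.println("A " + targetGb + " GB LV needs at least " + neededLps + " LPs");
    }
}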

lsvg shows us the PP SIZE:

 lsvg oravg
VOLUME GROUP:       oravg                    VG IDENTIFIER:  00c6265000004c0000000116a075433a
VG STATE:           active                   PP SIZE:        256 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      1596 (408576 megabytes)
MAX LVs:            256                      FREE PPs:       39 (9984 megabytes)
LVs:                5                        USED PPs:       1557 (398592 megabytes)
OPEN LVs:           5                        QUORUM:         3
TOTAL PVs:          4                        VG DESCRIPTORS: 4
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         4                        AUTO ON:        yes
MAX PPs per VG:     32512
MAX PPs per PV:     1016                     MAX PVs:        32
LTG size (Dynamic): 1024 kilobyte(s)         AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable

lslv shows us how many LPs and PPs the logical volume uses and what our limit is.

Note that, while listing the logical volume information, we need to pass the logical volume name as input. The logical volume name can be derived using the lsvg command:

ermansrv1: / > lsvg -l oravg    (what logical volumes do we have in our  volume group named oravg)
oravg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
loglv00             jfs2log    1     1     1    open/syncd    N/A
fslv00              jfs2       1008  1008  4    open/syncd    /data  --> here we have the logical volume named fslv00 for the /data mount point, so we will use fslv00 as the input for the lslv and chlv commands.
fslv01              jfs2       336   336   4    open/syncd    /index
fslv02              jfs2       104   104   3    open/syncd    /orahome
fslv03              jfs2       108   108   1    open/syncd    /appl_top

Then we use the lslv command to see the current allocation and quota for our logical volume..

ermansrv1: / > lslv fslv00
LOGICAL VOLUME:     fslv00                 VOLUME GROUP:   oravg
LV IDENTIFIER:      00c6265000004c0000000116a075433a.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            1000                   PP SIZE:        256 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                960                   PPs:            960
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        /data                  LABEL:          /data
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO

From this output, we see the PP size, MAX LPs, and current LPs.
So we are now ready to execute chlv to increase our MAX LPs limit accordingly:

ermansrv1: / > chlv -x 1100 fslv00    (we raise the LP limit (logical partition limit); this allows 256 MB * 1100 / 1024 = 275 GB)
Then we execute lslv again:

ermansrv1: / > lslv fslv00
LOGICAL VOLUME:     fslv00                 VOLUME GROUP:   oravg
LV IDENTIFIER:      00c6265000004c0000000116a075433a.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            1100                   PP SIZE:        256 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                1008                   PPs:            1008
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        /data                  LABEL:          /data
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO

So, the MAX LPs value is increased, as you can see. Good.
At this point, we can execute chfs again:

ermansrv1: / > chfs -a size=+20G /data
Filesystem size changed to 528482304

ermansrv1: / > df -g
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00      252.00     20.61   92%     2213     1% /data (Enlarged)

That's all. Note that this is an online operation, so your Oracle Database, your Oracle Applications, or any other application may keep running during it.

Note that the filesystems used in this example are JFS2. With JFS2 we can even decrease the filesystem size, but with JFS the filesystem size cannot be reduced.

Tuesday, May 26, 2015

EBS -- Workflow Notification Mailer -- IMAP emails to be deleted

I am writing this blog post about the IMAP emails that sit in the mailboxes of the Workflow Notification Mailer's account on the IMAP email server.
As you may know, we require an IMAP account, an INBOX folder, a DISCARD folder, and a PROCESS folder in order to make the Workflow Notification Mailer's inbound (IMAP) processing run properly.

When users respond to the notification emails from their mail clients, these IMAP folders get populated.
The Notification Mailer regularly checks the INBOX of its associated IMAP account to see if there are any incoming emails. If it finds an incoming email, it moves it to the PROCESS folder and processes it. If it sees an email that is not related to its workflow services, it moves that email to the DISCARD folder.
So, basically, it does not delete anything.
That's why, if there is a quota defined by the mail server admin, this may become a problem.
In this context, these IMAP folders should be cleared regularly by the mail server administrators.
This can be a manual operation, or it can be automated using rules.
That is, mail server admins can define mailbox rules to manage these emails automatically.
They can use such rules to archive or delete the emails sitting in the Notification Mailer's IMAP account automatically.
Of course, emails that are still needed should not be deleted during this process.
The retention policy should be defined by the business users and the project members.
This approach is necessary because Oracle Workflow Services does not have a standard method for dealing with this.
There is a comprehensive e-mail response management system called Email Center that can be used in EBS, but even Email Center does not do this job.
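Just to illustrate what such an automated cleanup could look like, here is a minimal JavaMail sketch that deletes DISCARD-folder messages older than a retention period. This is only an example; server-side rules are the preferred way in practice, and the host, account, password, folder name, and retention period below are hypothetical:

import java.util.Date;
import java.util.Properties;
import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;

public class MailerDiscardCleanup
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical retention period: keep the last 30 days.
        Date cutoff = new Date(System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000);

        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("imaps");  // use "imap" if SSL is not enabled
        store.connect("imap.example.com", "wf_mailer_account", "password");  // hypothetical values

        Folder discard = store.getFolder("DISCARD");
        discard.open(Folder.READ_WRITE);
        for (Message message : discard.getMessages())
        {
            Date received = message.getReceivedDate();
            if (received != null && received.before(cutoff))
            {
                message.setFlag(Flags.Flag.DELETED, true);  // mark old mails for deletion
            }
        }
        discard.close(true);  // true = expunge the messages marked as deleted
        store.close();
    }
}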

Ref: Oracle

Email Center does not delete emails from the Oracle Processed folder on the Email Server. Purging the Oracle Processed folder needs to be performed by the Email Server Administrator and can be achieved via server side rules.

Friday, May 22, 2015

Linux / Oracle RDBMS -- High Load after VM migration

After migrating a 10gR2 EBS database from a 32-bit Linux machine to a new 64-bit Linux machine, performance started to degrade.
The load average was always somewhere between 20 and 30, and end users were facing slowness in the EBS OAF and Forms screens.
Although the users were facing this slowness, they were also saying that their Discoverer reports completed faster than they had on the old 32-bit Linux machine.
The major differences between the old and new machines are as follows:

Old Machine Linux version: Red Hat 5.5
Old Machine Arch: 32-bit
Old Machine SGA: 8 GB (VLM mode implemented with an in-memory filesystem (shmfs/tmpfs, ramfs))
Old Machine PGA: 2 GB
Old Machine Total Memory: 31 GB
Old Machine CPU core count: 16 physical cores (no hyper-threading)

New Machine Linux version: Red Hat 5.9
New Machine Arch: 64-bit
New Machine SGA: 18 GB
New Machine PGA: 10 GB
New Machine Total Memory: 48 GB (hugepages enabled for the SGA)
New Machine CPU core count: 16 cores (16 virtual sockets, one core each; no hyper-threading)

So, the new machine was 64-bit, had 48 GB of memory (hugepages were also enabled), and had the same core count as the old server, but it was a VM (VMware ESX).
The important difference was that the new machine's CPU cores were virtual.

When we examined the AWR reports, we saw CPU waits all the time.
Also, in the top command, we saw 96% user-space CPU usage all the time, and the load average was constantly high.
Discoverer report sessions were running all the time; they were always in the "R" state and always consuming 99% CPU.
So they were obviously the main cause of the slowness, but tuning the Discoverer reports was not an option, because the very same reports had also been used by the end users on the old system.

AWR gathered for a 30-minute interval representing the problem:

Event                          Waits      Time(s)   Avg Wait(ms)   % Total Call Time   Wait Class
CPU time                                  19,769                   74.4
db file sequential read        209,680    850       4              3.2                 User I/O
SQL*Net more data to client    644,830    546       1              2.1                 Network
latch: cache buffers chains    4,576      389       85             1.5                 Concurrency
TCP Socket (KGAS)              6,313      317       50             1.2                 Network


Okay we had 19769 seconds CPU time..
On the other hand we had 16 vm cores in this servers. So for a 30 minutes interval , we had 16 * 30  * 60 seconds CPU time available ..
This made 28800, so we still had 9031 seconds CPU time available..
So, we had a high CPU load but we were not in a hang situation. This high DB CPU usage meant that the database was spending a lot of time in CPU (processing) compared to I/O and other events.
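The headroom arithmetic, as a small illustrative snippet (the figures come from the AWR report above; the class name is arbitrary):

public class CpuHeadroom
{
    public static void main(String[] args)
    {
        int cpuCount = 16;             // virtual cores presented to the server
        int intervalMinutes = 30;      // AWR snapshot interval
        double dbCpuSeconds = 19769;   // "CPU time" from the Top 5 Timed Events

        double availableCpuSeconds = cpuCount * intervalMinutes * 60;  // 28,800 seconds
        double headroomSeconds = availableCpuSeconds - dbCpuSeconds;   // ~9,031 seconds
        double utilizationPct = 100.0 * dbCpuSeconds / availableCpuSeconds;

        System.out.printf("Available: %.0f s, used: %.0f s, headroom: %.0f s (%.1f%% of CPU capacity)%n",
                availableCpuSeconds, dbCpuSeconds, headroomSeconds, utilizationPct);
    }
}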

This high CPU usage was definitely being caused by the Discoverer reports.

CPU Time (s)   Elapsed Time (s)   Executions   CPU per Exec (s)   % Total DB Time   SQL Id          SQL Module   SQL Text
2,190          2,685              63,231       0.03               10.10             6j1rnhpxu1bry   Disco10,     SELECT COUNT(T.NO) FROM XXIC_T...
1,994          2,493              2            996.84             9.38              79h2r4xdn8g44   Disco10,     SELECT o481137.CUSTOMER_NAME ...
1,827          2,202              2            913.39             8.29              93bkbks2j66q5   Disco10,     SELECT DECODE((decode(o310037...

The design of these reports was not optimized; for example, one subquery was running 63,231 times in 30 minutes, and it was a CPU-bound query. On the other hand, it was not the time to tune these SQLs or modify these reports, because they were working exactly the same way as they had on the old system.
Even the execution plans and the SQL statements in these reports were the same; they had not changed.
In addition, this new system was 64-bit and had a bigger SGA and more physical memory than the old one, so we were not expecting such a load on the new database server.
That is, the cause of the high CPU time had to be something else.

db file sequential read was listed in second place in the Top 5 Timed Events. This wait indicates single-block (typically index) reads, and its average was 4 ms, which was acceptable for this environment. In addition, we were using the same disk devices, and even the same filesystems, as the old server, so I/O was not the problem.

It was the CPU. It would be better if I could share the CPU model information here, but the idea in this post is to prove the slowness using AWR.

In the Latch Activity section, we saw that cache buffers chains had the highest number of Get Requests:

Latch Name             Get Requests   Pct Get Miss   Avg Slps/Miss   Wait Time (s)   NoWait Requests   Pct NoWait Miss
cache buffers chains   810,825,589    0.02           0.03            389             1,140,617         0.03

The high cache buffers chains activity would normally direct us to find the hot buffers, find the sessions waiting on latch contention, intersect the objects related to the hot buffers with the objects touched by the queries waiting for the latch, identify the problematic SQLs, and do SQL tuning.
On the other hand, I was not inclined to go that way, as this database was running exactly the same workload as on the old database server.

Knowing that latches are acquired when a process needs them and released when the process finishes its work in the relevant memory structure, we decided that "the time spent between acquiring and releasing the latches" is also very important.

That is, if our processes could do their work more quickly, the latches would become available sooner.
In other words, the time between acquiring and releasing the latches would be shorter, and that would decrease the CPU time instance-wide, as the processes would spend less time on the CPU waiting for the latches.

At this point, we changed the CPUs; that is, we changed the hardware underneath the virtual machine that was running the database.
Note that the disks stayed the same and the memory stayed the same.

After the CPU change, the cache buffers chains get count was again very high:

Latch Name             Get Requests    Pct Get Miss   Avg Slps/Miss   Wait Time (s)   NoWait Requests   Pct NoWait Miss
cache buffers chains   3,439,451,402   2.09           0.00            0               823,885           0.04
With the new CPUs, there were almost no waits on latch: cache buffers chains:
Event                          Waits    %Time -outs   Total Wait Time (s)   Avg wait (ms)   Waits/txn
latch: cache buffers chains    17,424   0.00          0                     0               0.57

These values were as follows on the old system:

Event                          Waits    %Time -outs   Total Wait Time (s)   Avg wait (ms)   Waits/txn
latch: cache buffers chains    4,576    0.00          389                   85              0.34

So, even though more latch gets were requested with the new CPUs, there were almost no wait times.
Look: we had more db file sequential reads and more cache buffers chains gets, but our CPU time was lower :)

Note that this report was taken for a 1-hour interval. Even though it covered a longer period, the CPU time was lower than the value seen with the old CPUs.

Event                          Waits       Time(s)   Avg Wait(ms)   % Total Call Time   Wait Class
CPU time                                   10,730                   96.0
db file sequential read        213,556     165       1              1.5                 User I/O
TCP Socket (KGAS)              11,950      162       14             1.4                 Network
SQL*Net more data to client    1,359,692   93        0              0.8                 Network
log file parallel write        28,298      47        2              0.4                 System I/O

We had almost the same load on the Oracle database. Note that these values are per second and per transaction.

Load Profile (with new CPUs) -- 1-hour AWR

                   Per Second    Per Transaction
Redo size:         58,524.70     6,768.44
Logical reads:     509,961.79    58,977.58
Block changes:     338.90        39.19
Physical reads:    171.86        19.88
Physical writes:   24.92         2.88
User calls:        2,476.00      286.35
Parses:            170.12        19.67
Hard parses:       4.22          0.49
Sorts:             296.35        34.27
Logons:            0.96          0.11
Executes:          7,877.99      911.10
Transactions:      8.65

Load Profile (with OLD CPUs) -- 30-minute AWR

                   Per Second    Per Transaction
Redo size:         38,658.42     6,072.38
Logical reads:     212,494.62    33,378.16
Block changes:     195.19        30.66
Physical reads:    294.80        46.31
Physical writes:   22.43         3.52
User calls:        1,471.26      231.10
Parses:            94.94         14.91
Hard parses:       4.28          0.67
Sorts:             157.24        24.70
Logons:            0.84          0.13
Executes:          7,478.48      1,174.70
Transactions:      6.37

We had more redo, more logical reads, more block changes, more physical reads, more physical writes, almost the same hard parse rate, more sorts, more executions, and more transactions on the new-CPU system, but the CPU time was lower :)
Almost the same SQLs were running and the same Discoverer reports were at the top, but this time the CPU time was not so high instance-wide.

So this showed that the approach was correct: the CPUs were the bottleneck, and the new CPUs fixed the problem. The new CPUs were again virtual, but the CPU family of the new hardware was different; there were major differences in the CPU specs, such as hyper-threading.

Okay, so far so good. But we had one more question: how was this system dealing with this load earlier? This database had been running on a 32-bit server with an 8 GB SGA, and we were not seeing these CPU times in the AWRs. So the question is: why?

Let's start with CPU utilization.
High CPU utilization may not actually be a problem: if you are not waiting on other resources, you end up either running on the CPU or waiting for it.
Especially if your queries are CPU-bound, your CPU time will increase.

CPU time in AWR refers to the amount of CPU time that the sessions actually used. Oracle gets this information from the OS using getrusage() or similar OS calls, so this CPU time value does not include time spent waiting on CPU run queues.

High CPU utilization may not necessarily indicate that there is a problem; it could just mean that the system is being well utilized.
Time shown as "DB CPU" or "CPU time" in AWR and similar reports refers to the amount of CPU time that the session/s used and is usually based on information from the OS itself (eg: using getrusage() or similar OS calls on Unix). This measure typically does not include time spent waiting on CPU run queues.
If we had ASH, it would be even better, because the time shown as "CPU + Wait for CPU" in ASH reports could give us an idea of the time that sessions were running on, or waiting to run on, the CPU.

Anyway, the bottom line is that if you are not waiting on other things, you will be on the CPU and your CPU time will increase. Many sessions will use the CPUs in parallel, the load on the server will rise, and you will start to feel performance degradation system-wide.

Let's have a look at the AWR report gathered from the old 32-bit server, running the same database, the same queries, and the same reports, with almost the same number of sessions connected to it.

Do not bother with the LNS wait; this system had a standby database when this AWR snapshot was taken.

Load Profile (with OLD CPUs -- 64-bit)

                   Per Second    Per Transaction
Redo size:         38,658.42     6,072.38
Logical reads:     212,494.62    33,378.16
Block changes:     195.19        30.66
Physical reads:    294.80        46.31
Physical writes:   22.43         3.52
User calls:        1,471.26      231.10
Parses:            94.94         14.91
Hard parses:       4.28          0.67
Sorts:             157.24        24.70
Logons:            0.84          0.13
Executes:          7,478.48      1,174.70
Transactions:      6.37

Event                          Waits      Time(s)   Avg Wait(ms)   % Total Call Time   Wait Class
CPU time                                  19,769                   74.4
db file sequential read        209,680    850       4              3.2                 User I/O
SQL*Net more data to client    644,830    546       1              2.1                 Network
latch: cache buffers chains    4,576      389       85             1.5                 Concurrency
TCP Socket (KGAS)              6,313      317       50             1.2                 Network

Load Profile (with OLD Physical Server -- 32-bit)

                   Per Second    Per Transaction
Redo size:         38,685.46     5,267.77
Logical reads:     348,930.51    47,513.61
Block changes:     218.78        29.79
Physical reads:    1,026.41      139.77
Physical writes:   282.69        38.49
User calls:        2,255.70      307.16
Parses:            138.65        18.88
Hard parses:       3.75          0.51
Sorts:             205.22        27.95
Logons:            0.86          0.12
Executes:          6,003.52      817.50
Transactions:      7.34

Event                      Waits     Time(s)   Avg Wait(ms)   % Total Call Time   Wait Class
CPU time                             15,466                   93.5
db file sequential read    755,438   389       1              2.4                 User I/O
db file scattered read     302,776   290       1              1.8                 User I/O
LNS wait on SENDREQ        389       98        253            0.6                 Network
log file sync              17,380    93        5              0.6                 Commit

So the old 32-bit system had more db file sequential reads, more db file scattered reads, and a significant amount of CPU time.
Let's take a look at the Operating System Statistics:

Operating System Statistics (OLD Physical Server -- 32-bit)

Note that this AWR was for a 60-minute interval.

Statistic                 Total
BUSY_TIME                 1,868,106
IDLE_TIME                 3,894,329
IOWAIT_TIME               65,502
NICE_TIME                 48
SYS_TIME                  222,197
USER_TIME                 1,625,270
LOAD                      5
RSRC_MGR_CPU_WAIT_TIME    0
PHYSICAL_MEMORY_BYTES     34,076,270,592
NUM_CPUS                  16
NUM_CPU_SOCKETS           4

• Total Buffer Gets: 1,262,389,050

Latch Name                      Get Requests    Pct Get Miss   Avg Slps/Miss   Wait Time (s)   NoWait Requests   Pct NoWait Miss
cache buffers chains            2,406,314,930   0.28           0.00            0               5,506,188         0.00
library cache                   13,180,259      0.04           0.01            0               112,143           0.09
library cache load lock         10,348          0.00                           0               0
library cache lock              4,375,448       0.00           0.00            0               0
library cache lock allocation   181,145         0.00                           0               0
library cache pin               5,453,561       0.01           0.00            0               0
library cache pin allocation    69,931          0.00                           0               0

Operating System Statistics (OLD CPUs -- 64-bit)

Note that this AWR was for a 30-minute interval.

Statistic                 Total
BUSY_TIME                 1,674,495
IDLE_TIME                 571,810
IOWAIT_TIME               39,970
NICE_TIME                 0
SYS_TIME                  296,542
USER_TIME                 1,247,579
LOAD                      18
RSRC_MGR_CPU_WAIT_TIME    0
PHYSICAL_MEMORY_BYTES     50,750,922,752
NUM_CPUS                  16
NUM_CPU_SOCKETS           4

• Total Buffer Gets: 444,663,900

Latch Name                      Get Requests   Pct Get Miss   Avg Slps/Miss   Wait Time (s)   NoWait Requests   Pct NoWait Miss
cache buffers chains            810,825,589    0.02           0.03            389             1,140,617         0.03
library cache                   7,053,564      0.12           0.24            99              409               0.24
library cache load lock         5,011          0.00                           0               7                 0.00
library cache lock              1,622,604      0.01           0.31            0               0
library cache lock allocation   62,053         0.00           1.00            0               0
library cache pin               2,719,841      0.02           0.30            4               3                 0.00
library cache pin allocation    31,816         0.00                           0               0

As we can see, the new 64-bit system with the old (problematic) CPUs had a load of 18, while the old 32-bit system, which had physical CPUs, had a load of 5.
The new 64-bit system had a BUSY_TIME of 1,674,495 measured over 30 minutes, whereas the old 32-bit system with physical CPUs had a BUSY_TIME of 1,868,106 measured over 1 hour.
Also, when we look at cache buffers chains, we see that there were almost no waits for that latch on the old 32-bit system. Note that the Total Buffer Gets figure is much higher (about 3 times more) on the old 32-bit server, and even so, it had almost no waits for cache buffers chains.
The situation was almost the same for the library cache latches.
The bottom line is that, although the old 32-bit server was doing more work, its CPU busy rate was lower (roughly half) than that of the new 64-bit server with the old CPUs.

So this was proof of the CPU problem on the 64-bit server. Of course, there is probably a more detailed explanation for this, but since we knew the SQLs were the same, the load was the same, and the storage was the same, we requested a CPU change on the newly migrated 64-bit server, and after the CPU change the performance problem went away.

I would appreciate your comments on this post, including whether you accept or reject the reasoning described above.