Thursday, May 24, 2018

EBS 12.2 -- Things that can be done for debugging WLS Managed Server performance and stability

Weblogic (a FMW component) is an important component in EBS 12.2.

FMW plays an important role in EBS 12.2, as EBS 12.2 delivers HTTP services, OAF and Forms services through FMW.

That's why, from time to time, real diagnostics are required, especially for analyzing weird performance and hang issues on EBS OAF pages.


In this post, I will go through the things that can be done for debugging the Weblogic side, especially the managed server performance and stability.

Of course, when dealing with Weblogic inside EBS, we directly check the managed server logs, admin server logs, heap size configurations, managed server counts (whether they are aligned with the concurrent user count or not), connection pool limits and so on. On the other hand, the debugging activities that I will give you in this blog post are a little more advanced. It is also needless to say that these debugging activities require advanced skills in Weblogic and EBS administration.

Note that, I won't give the full instructions for these diagnostics activities. In other words, I will explain them very briefly.
Also note that, these activities are not fully documented, and that's why they are not fully supported -- the risk is yours.

Garbage Collector Debug: for getting more elaborate GC info and checking the time spent on each GC event.
We can get this debug info using the -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps JVM arguments.
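
For example, arguments like the following could be appended to the related managed server's JVM start arguments (the GC log file location below is just an illustration, not a required path):

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/oacore_server1_gc.log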

Running the technology stack inventory report: to collect the list of patches applied to all middle tier homes (besides Weblogic). The output of this script may be used to identify unapplied performance patches.

$ADPERLPRG $FND_TOP/patch/115/bin/TXKScript.pl -script=$FND_TOP/patch/115/bin/txkInventory.pl -txktop=$APPLTMP -contextfile=$CONTEXT_FILE -appspass=<appspassword> -outfile=$APPLTMP/Report_App_Inventory.html

Diagnosing Connection Leaks: for getting connection leak related diagnostic info, we use "How To Detect a Connection Leak Using Diagnostic JDBC Dumps (Doc ID 1502054.1)".

Create heap dumps & thread dumps: especially for getting info about out-of-memory problems.
These diagnostics are enabled by using the necessary command line arguments in the server start arguments section of the related managed server (using the WLS console).

The related arguments are specified using the server start arguments section ->

Connect WLS console
Navigate to servers under EBS_domain_<SID> environment
Click on the managed server (ex:oacore_server1)
Click  on  Lock & Edit in Change Center
Click on Server start
Edit arguments (such as  -XX:HeapDumpOnCtrlBreak)

So, once the necessary argument is given to a managed server, we restart the managed server and use OS kill commands to generate these dumps. (ex: kill -3 os_pid -- kill -3 sends SIGQUIT, which is like Ctrl-C but with a core dump; a HotSpot JVM handles it by writing a dump)
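
A quick sketch of the flow (the grep pattern and the output file name are illustrative; the kill -3 thread dump normally goes to the managed server's .out file):

$ ps -ef | grep oacore_server1 | grep -v grep        -- find the OS pid of the managed server JVM
$ kill -3 <os_pid>                                   -- SIGQUIT; the JVM writes a thread dump to its stdout/.out file
$ jstack <os_pid> > /tmp/oacore_server1_threads.txt  -- an alternative, if the JDK's jstack utility is preferred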

--review -> How to create a Java stack trace on e-Business Suite ? (Doc ID 833913.1)

Once the error is reproduced, we review the FMW logs -> "12.2 Ebusiness Suite - Collecting Fusion Middleware Log Files" (Doc ID 1362900.1).

Consider increasing Stuck Thread timeouts: in case we have stuck threads, we can increase the Stuck Thread Max Time using the Weblogic console.

Connection Debugging: For JDBC connection debugging, we use Oracle E-Business Suite 12.2 Data Source Connection Pool Diagnostics (Doc ID 1940996.1).

DB level trace: We enable trace at the db level -> "alter system set events '10046 trace name context forever, level 12';"
We reproduce the issue and then turn it off -> "alter system set events '10046 trace name context off';"

We check the traces (we find the relevant trace files using "grep MODULE *.trc" and/or "grep ACTION *.trc").
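
Once the relevant trace file is located, it can be formatted with tkprof; a quick sketch (the trace file name below is just a template):

$ tkprof <SID>_ora_<ospid>.trc trace_report.txt sys=no sort=exeela,fchela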

Tracing Managed Server sessions: for diagnosing managed server related db activity, and for diagnosing inactive (not closed) managed server sessions.

Reference: On E-Business Suite 12.2 V$SESSION.PROCESS incorrectly reports EBS Client Process ID as '1234' (Doc ID 1958352.1)

Connect to Weblogic Console and then do the following;
Services > Data Sources > EBSDataSource > Configuration > Connection Pool
Set "System Property" as below

v$session.program=weblogic.Name [Take note of the initial value one is changing as one will need to reset it once the fix is delivered and applied.]

Lastly we restart oacore managed servers and monitor the database using a query like;

SQL> select program, process, machine, sql_id, status, last_call_et from v$session where program like 'oacore_server%';

Tuesday, May 22, 2018

EXADATA -- Unique Articles Worth Reading ( imaging, upgrade, installation, configuration and so on)

Nowadays, my context has completely switched. That is, I have started to work more on Exadata and ECM/OCM migrations. As a result of that, I produce more content in these areas.

Until last month, I was more focused on Exadata. But nowadays, I'm not only focused on Exadata, but also on Exadata Cloud machines and cloud migration projects.

Of course, I documented the critical things that we have done on Exadata machines one by one and produced the following articles for sharing with you.

Monday, May 21, 2018

Exadata -- Cisco Switch Firmware upgrade

In this post, I will explain upgrading the firmware of the Cisco switch, which is delivered --built-in-- with Exadata machines.
For explaining the process, I will go through a real life case, which was done in an Exadata X3-2 environment.

The Cisco switch that I use for demonstrating this upgrade is a Catalyst 4948E, which is the ethernet switch delivered with Exadata X3-2 machines. (In Exadata X7, we see Cisco Nexus switches..)

In Exadata environments, these Cisco switches are used only for the systems management network interfaces. (ethernet based management network, ssh connections, ILOM and so on.)

So, during such an upgrade, no production traffic is affected, just consoles and node management...


The requirement for upgrading the firmware of these switches may arise after a security scan, which is usually performed regularly by the security teams in customer environments (enterprise customers..) 

Following is a list of vulnerabilities that were discovered in a customer environment. These vulnerabilities were discovered on the Cisco switch that was delivered with an Exadata X3-2. (the Cisco firmware version was: cat4500e-IPBASEK9-M Version 15.1(1)SG)

• Cisco IOS Cluster Management Protocol Telnet Option Handling 
• Cisco IOS IKEv2 Fragmentation DoS 
• Cisco IOS IKEv1 Fragmentation DoS 
• Cisco IOS Software DHCP Version 6 Server Denial of Service Vulnerability 
• Cisco IOS Software DHCP Denial of Service Vulnerability 
• Cisco IOS EnergyWise DoS 
• Cisco IOS Software Internet Key Exchange Version 2 (IKEv2) Denial of Service 
• Cisco IOS Software Smart Install Denial of Service Vulnerability 
• Cisco IOS Software RSVP DoS 
• Cisco IOS Multicast Routing Multiple DoS 
• Cisco IOS Multiple OpenSSL Vulnerabilities 
• Cisco IOS Software TFTP DoS 
• Cisco IOS Software DHCP Denial of Service Vulnerability 

These vulnerabilities are fixed in Cisco firmware version "cat4500e-ipbasek9-mz.152-2.E8", and here is the list of things that we did for upgrading to this 15.2.2E8 target release;

  • First, we connect to the Cisco switch using telnet from db node 1 and check the current firmware version;

[oracle@exanode1~]$ telnet <cisco_switch_ip_address>
exaswc0>show version

Cisco IOS Software, Catalyst 4500 L3 Switch Software (cat4500e-IPBASEK9-M), Version 15.1(1)SG, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2012 by Cisco Systems, Inc.
Compiled Sun 15-Apr-12 02:55 by prod_rel_team

ROM: 12.2(44r)SG11
fbadmswc0 uptime is 4 years, 37 weeks, 2 days, 23 hours, 55 minutes
System returned to ROM by power-on
System restarted at 15:15:39 GDT Tue Jul 2 2013
System image file is "bootflash:cat4500e-ipbasek9-mz.151-1.SG.bin"
Hobgoblin Revision 21, Fortooine Revision 1.40

  • Then, we download the new switch software from cisco -

https://software.cisco.com/download/release.html?mdfid=283027810&softwareid=280805680&release=15.2.2E8&flowid=3592
(Choose "IP Base Image" line from 15.2.2E8(MD) version.
File name : cat4500e-ipbasek9-tar.152-2.E8.tar)

  • After downloading the new switch software, we create a tftp server and put the new Cisco software bin (which comes out of the tar file) into a tftp directory like /tftpboot/switch_image. (a minimal tftp server setup sketch is given right after the listing below)

[root@acs-vmmachine~]# mkdir /tftpboot/switch_image

[root@acs-vmmachine ~]# chmod 777 /tftpboot/switch_image/

[root@acs-vmmachine ~]# ls -l /tftpboot/switch_image/
total 0

-rwxrwxrwx 1 root root 0 Mar 19 09:16 new_image.bin
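
As a side note, a minimal xinetd-based tftp server setup on such a Linux machine could look like the following (the package names and the /tftpboot root are the usual defaults; adjust them for your distribution):

[root@acs-vmmachine ~]# yum install tftp-server xinetd
[root@acs-vmmachine ~]# vi /etc/xinetd.d/tftp        -- set "disable = no" and "server_args = -s /tftpboot"
[root@acs-vmmachine ~]# service xinetd restart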

  • Then again, on the Cisco switch, we list the files in the bootflash directory and check the free space;

exaswc0>enable
Password: 

exaswc0#dir bootflash:
Directory of bootflash:/
    6  -rw-    25213107  Mar 19 2013 14:46:08 +04:00  cat4500e-ipbase-mz.150-2.SG2.bin
    7  -rw-    32288280   Jun 5 2013 20:04:54 +04:00  cat4500e-ipbasek9-mz.151-1.SG.bin
  
exaswc0>show file systems 
File Systems: 

Size(b) Free(b) Type Flags Prefixes 
* 60817408 45204152 flash rw bootflash:   --------> There are about 45 MB free space in bootflash. (Min 20 MB required.)
  • We configure our Cisco switch to boot from a specific firmware file.

exaswc0#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
exaswc0(config)#no boot system
exaswc0(config)#boot system bootflash:cat4500e-ipbasek9-mz.151-1.SG.bin (current)

  • Then, we save the running config and name it with the suffix "before-upgrade"

exaswc0#copy running-config startup-config all 
exaswc0#copy running-config bootflash:cat4500e-ipbasek9-mz.151-1.SG-before-upgrade
  • Next, we copy this file to our tftp server. (for backup) -- we answer the prompts for the tftp-server name and the destination filename..

exaswc0#copy bootflash:cisco4948-ip-confg-before-upgrade tftp:
  • After copying our running config to our tftp server (installed earlier on our client machine), we copy the new image from the tftp server to our Cisco switch by executing the following command on the switch.

copy tftp: bootflash:
Address or name of remote host []? acs-vmmachine
Source filename []? switch_image/new_image.bin
Destination filename [new_image.bin]?
cat4500e-ipbasek9-mz.152-2.E8.bin

...
....
exaswc0# 
exaswc0# dir bootflash: 
Directory of bootflash:/
    6  -rw-    25213107  Mar 19 2013 14:46:08 +04:00  cat4500e-ipbase-mz.150-2.SG2.bin
    7  -rw-    32288280   Jun 5 2013 20:04:54 +04:00  cat4500e-ipbasek9-mz.151-1.SG.bin
25  -rw-    38791882  Mar 20 2018 15:24:24 +04:00  cat4500e-ipbasek9-mz.152-2.E8.bin -- this is the firmware that we are upgrading to.

  • We verify the new image file;

exaswc0-ip#verify bootflash:cat4500e-ipbasek9-mz.152-2.E8.bin
File system hash verification successful.

  • After our new image file is verified, we configure our Cisco switch's boot system to use the new image bin and save the configuration into NVRAM.

exaswc0#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
exaswc0(config)#config-register 0x2102
exaswc0(config)#no boot system
exaswc0(config)#boot system bootflash:cat4500e-ipbasek9-mz.152-2.E8.bin
exaswc0(config)#
exaswc0(config)# (type <control-z> here to end)
exaswc0#show run | include boot
boot-start-marker
boot system bootflash:cat4500e-ipbasek9-mz.152-2.E8.bin
boot-end-marker

exaswc0# copy running-config startup-config all
exaswc0#write memory 


Note that: 0x2102 instructs the boot process to ignore any breaks, sets the baud rate to 9600 and boots into ROM if the main boot process fails for some reason.
  • Lastly, we reboot our Cisco switch with the new firmware and save the running config.

exaswc0# reload 
exaswc0-#copy running-config startup-config all 
exaswc0#copy running-config bootflash:cat4500e-ipbasek9-mz.152-2.E8-after-upgrade
exaswc0#write memory 

  • At this point, we can continue with enabling SSH access and disabling telnet access. (although this action is optional, it is highly recommended. Check the references below for the instructions.)

References:

Upgrading firmware / Configuring SSH on Cisco Catalyst 4948 Ethernet Switch (Doc ID 1415044.1)
How To Update Exadata Management Network Switch Firmware (Doc ID 1593004.1)

Thursday, May 17, 2018

RDBMS -- Interesting error on Duplicate From Active Database -> ORA-19845, ORA-17628, ORA-19571, ORA-19660

Recently, I encountered an interesting problem in an RMAN duplicate session.
We were trying to duplicate a database from an active database using RMAN, and although we did everything right, we ended up with the following error stack.

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/03/2018 00:15:53
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script

ORA-19845: error in backupArchivedLog while communicating with remote database server
ORA-17628: Oracle error 19571 returned by remote Oracle server
ORA-19571: RECID STAMP not found in control file
ORA-19660: some files in the backup set could not be verified
ORA-19662: archived log thread 1 sequence 7643 could not be verified
ORA-19845: error in backupArchivedLog while communicating with remote database server
ORA-17628: Oracle error 19571 returned by remote Oracle server
ORA-19571: RECID STAMP not found in control file


As the error ORA-17628 suggests, RMAN couldn't communicate with the remote server.
The remote server that is mentioned here was actually the auxiliary instance, which was the new database instance that we were creating from the active database.

This problem was closely related to the service_names parameter of this auxiliary instance.

As you may already know, when we duplicate from an active database, RMAN restores the spfile from the source instance and updates it according to the parameter settings that we use in our duplicate command.

In the case of service_names and other similar types of parameters, RMAN restores the spfile and updates it according to the value that we set for "SPFILE PARAMETER_VALUE_CONVERT" in our RMAN duplicate command.

However, what we discovered in this case was that RMAN couldn't do that update properly. (at least for the service_names parameter and at least in our case..)

So, although we set the correct value for the SPFILE PARAMETER_VALUE_CONVERT clause, RMAN couldn't update the service_names parameter of the auxiliary instance properly.

As a result, we encountered "ORA-17628: Oracle error 19571 returned by remote Oracle server" error during our duplicate session.

I must admit that this was weird, and it was probably a bug.
Fortunately, we found a workaround.

As for the workaround, we did the following;
  • we created an init.ora for the auxiliary instance and made the changes in that init.ora (changes for the desired values)
db_unique_name='ERM'
db_name='ERM'
instance_name='ERM1'
instance_number='1'
db_create_file_dest='+DATA'
db_recovery_file_dest_size='40G'
db_recovery_file_dest='+RECO'
control_files='+DATA','+RECO'
db_create_online_log_dest_1='+DATA'
db_create_online_log_dest_2='+RECO'
diagnostic_dest='/u01/app/oracle'
audit_file_dest='/u01/app/oracle/product/12.1.0.2/dbhome_3/rdbms/audit'
log_archive_dest_1='location=USE_DB_RECOVERY_FILE_DEST'
log_archive_dest=''
local_listener=''
cluster_database='FALSE'

  • Then, we connected to the auxiliary instance and created the spfile from the pfile
SQL> CREATE SPFILE FROM PFILE='location of destination pfile'; ----this is the pfile created earlier.
SQL> STARTUP NOMOUNT;

  • Lastly, we ran our duplicate command without the SPFILE clause. (without SPFILE PARAMETER_VALUE_CONVERT)
In brief, we set our desired parameters for the auxiliary database in a pfile (init.ora), then created an spfile from that init.ora (pfile) and started up the auxiliary database in nomount mode using that spfile.
After that, we ran our RMAN duplicate command without specifying the SPFILE clause, as sketched below.
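
A simplified sketch of such a duplicate command (the connection strings and the NOFILENAMECHECK clause are illustrative here; this is not our exact command):

RMAN> connect target sys@<source_db_tns>
RMAN> connect auxiliary sys@<aux_instance_tns>
RMAN> DUPLICATE TARGET DATABASE TO ERM FROM ACTIVE DATABASE NOFILENAMECHECK;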

By doing this, we started up the auxiliary instance with the desired parameters and bypassed the automatic spfile update that is done from the source instance to the auxiliary by RMAN. (this automatic update is done when we use the SPFILE PARAMETER_VALUE_CONVERT clause and when the auxiliary instance is started up using a pfile)

This workaround saved the day, so I wanted to share it with you.

Note that: starting up the auxiliary database directly with the spfile (filled with the desired parameters) is actually a good thing to do. So we are considering using this approach in our next duplicate sessions as well.

Wednesday, May 16, 2018

EBS -- Upgrading EBS 12.1.3 to 12.2 -- the general steps

In this blog post, I want to give you a list which includes the general steps that can be included in the project plan of an EBS 12.1.3 to 12.2 upgrade project.

By taking the following phases and the related steps into account, you may calculate your effort and do your project plan accordingly.

I wanted to use EBS 12.1.3 as the source version because it is a very common version in EBS customer environments. While giving the steps, I also wanted to highlight the teams responsible for completing those steps. (apps DBA team, Functional team, core business users etc..).

Of course, some of these steps, like the "upgrade database" step, are optional. (if your db release is already up-to-date enough)

Phase 1      
  • Upgrade the database on the existing EBS 12.1: apps DBA team      
  • Execute a functional test: EBS functional team
Phase 2      
  • Install all application pre-upgrade patches: apps DBA team      
  • Verify the instance: EBS functional team
Phase 3      
  • Execute all functional pre-upgrade tasks including customizations: functional team      
  • Perform a full system backup: System and apps DBA team   
 Phase 4      
  • Apply localization and 12.2 pre-upgrade patches: apps DBA team      
  • Upgrade to 12.2.0: apps DBA team      
  • Enable online patching: apps DBA team      
  • Apply tech stack patches: apps DBA team      
  • Upgrade to 12.2.6/12.2.7: apps DBA team      
  • Perform all post-upgrade tasks: apps DBA and functional teams      
  • Application function test cases: core business users       

Friday, April 27, 2018

Weblogic -- Performance problem - Forms & Reports environment -- Unable to load performance pack / libmuxer.so

Recently, I analyzed a problem on a Weblogic instance.

There was a Forms & Reports - based custom program, running on Weblogic 10.3.6, and the customer was complaining about the performance.

Every single form screen was working slowly.
The problem was obviously on the application tier, as there was no real database activity.

The customer said that this program was previously running with 1500 users on an Oracle Application Server 10g environment, and there were no performance problems encountered there.

Weblogic instance was running on a Solaris OS.

I first checked the configuration from the Weblogic console and concluded that it was all fine.
Just in case, I increased the heap sizes of the managed servers and restarted them. However, this action didn't solve the problem.

So I jumped into the log files.

While analyzing the WLS_FORMS.log, I saw a strange error.
The Weblogic managed server was complaining about the performance pack. It was saying "Unable to load performance pack".
When this happens, Weblogic starts to use Java I/O, rather than the native one.

These kinds of problems are usually caused by the LD_LIBRARY_PATH setting.

In order to solve this, I first found the shared library that is supposed to be used for enabling the native I/O performance pack.

The name of the library is libmuxer.so, and it was located in the directory "/app01/weblogic/wlserver_10.3/server/native/solaris/sparc64/".

I modified the <DOMAIN_HOME>/bin/setDomainEnv.sh file and made the LD_LIBRARY_PATH (for the Forms managed server) include the location of the libmuxer.so file.
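
The change was essentially something like the following (the library path is the one from this environment; adjust it to your own WebLogic home):

LD_LIBRARY_PATH=/app01/weblogic/wlserver_10.3/server/native/solaris/sparc64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH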

After this modification, I restarted the Weblogic managed server and checked the WLS_FORMS.log file once again to see whether the error had disappeared or not.

The error wasn't there. So, I checked the application and saw that those waits were gone. The forms screens were working perfectly fine.. :)

This was like a surgical operation, so I liked the work that I did, and wanted to share it with you :)

Hope you will find it useful.

Friday, April 13, 2018

RDBMS / RAC / EXADATA-- switching Scan names -- removing the need for modifying the connection strings after a migration

Recently, I started a big "ECM (Exadata Cloud Machine) - OCM (Oracle Cloud Machine)" migration project.

In this exciting project, I will play the lead consultant role for both database and application migrations..


The databases in scope are Oracle 11gR2 databases, and they will all be migrated to ECMs.

The applications, on the other hand, are running on WebSphere, and they will all be migrated to WebLogic instances running on OCMs.

My team is responsible for the project as a whole; from analysis to planning, and from planning to execution.

As you may guess, the downtime is very important..

In other words, the methods that we use for the migration of these databases should require minimum downtime.

The platforms of the databases that are in the scope of this project vary.

That is, some of these Oracle databases are running on AIX-IBM Power platforms and some of them on LINUX-INTEL Platforms.

As a result of these varying platforms & the need for minimum downtime, we are working on several migration plans.

For Linux-Intel Platform, we are mainly focused on the Dataguard based migration strategies, and this blog post will be based on a little method that we used while doing one of our migration POCs for proving Oracle's Dataguard technology.

Most of you may already know that, in order to migrate a database using Dataguard switchover or failover methods, we first create a standby database in the target site.

Then we use Oracle's managed recovery (Dataguard) to keep it in sync with the primary.

Once the standby database is in sync with the primary, we schedule a small downtime and do the switchover operation in this planned maintenance window.

After a successful switchover operation, our new primary starts running in the target site. (read-write)

This new primary will be running on a different platform, with different "management ip addresses, hostnames, virtual hostnames, virtual ip addresses, scan ip addresses and scan names".

As a result, we normally tell our clients & application owners to change their connection strings (java jdbc urls, tnsnames.ora files, etc.) accordingly.

However, what we have done in one of our POCs, was a little different..

I mean, we did a switchover and activated the primary database on the target site. But after that, we didn't tell our clients and application owners to change their connection strings..

There was no need for that..

Why? Because, we switched the scan names between the target and the source platform.
Here is the time flow of this operation:

t0: exax6's scan name: exaxxx & exax7's scan name: exax7
t1: exax6's scan name: exax6old & exax7's scan name: exax7
t2: exax6's scan name: exax6old & exax7's scan name: exaxxx

So the new primary (its scan listeners actually) started listening on the same scan name as the old primary. (although the scan IP addresses of the new primary are different from the old primary's)

By doing this, we gained time & we didn't spend any extra effort on changing the application connection strings..

Well... Let's look at the technical side of this operation;

Our source platform was an Exadata X6-2, and our target platform was an Exadata X7-2.

So, after the Dataguard-based switchover operation, we did the following ;

***We first changed the scan name of the source platform (Exa X6-2) and made it exax6old. (it was exaxxx before this change)

In order to do this, we took the following actions;

--Stopped the scan listeners and scan itself using Grid user (in our case it was oracle)

$GRID_HOME/bin/srvctl stop scan_listener
$GRID_HOME/bin/srvctl stop scan

--We asked the DNS team to update the scan-related DNS definitions of Exa X6-2 and make them exax6old.

--Then, we checked our servers (Exadata X6's db nodes) and confirmed that these new hostnames could be resolved.

-- modified the scan name as root user

$GRID_HOME/bin/srvctl modify scan -n exax6old

--made the modification for the scan resources & confirmed the change, -- as the grid user.

$GRID_HOME/bin/srvctl modify scan_listener -u
$GRID_HOME/bin/srvctl start scan_listener
$GRID_HOME/bin/srvctl config scan
$GRID_HOME/bin/srvctl config scan_listener

*** After changing the scan name of our source platform, the old scan name of the source platform became available to be used for our target platform.

--So, this time on the target (Exa X7-2), we stopped the scan listener and scan itself using the Grid user (in our case it was oracle)

$GRID_HOME/bin/srvctl stop scan_listener
$GRID_HOME/bin/srvctl stop scan

--We asked the DNS team to add the former scan-related DNS definitions of the source platform and make them resolve to the target platform's scan IP addresses.

--Then, we checked our servers (Exadata X7's db nodes) and confirmed that these new hostnames (scan names) could be resolved properly. We needed to clear the OS DNS cache (even a reboot may be required) to make the target platform resolve its scan IP addresses using the newly mapped scan name.
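
A quick way to confirm the resolution from the db nodes (the scan names here are the ones used in this example):

# nslookup exaxxx      -- should now return the Exadata X7-2 scan IP addresses
# nslookup exax6old    -- should return the Exadata X6-2 scan IP addresses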

-- modified the scan name as root user

$GRID_HOME/bin/srvctl modify scan -n exaxxx

--Lastly, made the modification for the scan resources & confirmed the change, -- as the grid user.

$GRID_HOME/bin/srvctl modify scan_listener -u
$GRID_HOME/bin/srvctl start scan_listener
$GRID_HOME/bin/srvctl config scan
$GRID_HOME/bin/srvctl config scan_listener

Reference for this operation: How to Modify SCAN Setting or SCAN Listener Port after Installation (Doc ID 972500.1)

Note that, this approach is applicable in environments where all the clients and applications are using scan names (scan listeners) to connect to the databases.

Thursday, April 12, 2018

RDBMS -- ORA-38753 -- The effect of flashback_mode_clause (tablespace) to restore points & snapshot standby operations

Last year, I wrote a blog post about using guaranteed restore points. In that blog post, I did a demo to show you the concept and tried to explain Guaranteed Restore Points with or without Flashback Logging (at the database level), along with the prerequisites and restrictions.

Here is the link of that blog post -> http://ermanarslan.blogspot.com.tr/2017/08/rdbms-flashback-feature-demo-guranteed.html)

Today, I'm here to write about a very specific but also very important thing that you may face while restoring to a guaranteed restore point.

Although this blog post seems to be related to guaranteed restore points only, it actually is not. That is, since the snapshot standby technology relies on guaranteed restore points, this blog post is also related to the snapshot standby technology. So, you may face this issue while converting your snapshot standby to a physical standby, as well.

I hope you read this blog post before facing that thing, because it is a little shocking :)

In spite of the name "Guaranteed Restore Point", you need to be aware of the following fact in order to be able to restore to a guaranteed restore point! ->

You must not have any tablespaces which have Flashback_on set to NO.

If you have FLASHBACK_ON set to OFF for a tablespace -> then you may end up with the following error stack while converting a snapshot standby to a physical standby, or while doing a flashback to a restore point ->

ORA-38753: Cannot flashback data file XX; no flashback log data.
ORA-01110: data file YYY: 'XXX' 


Although FLASHBACK_ON is set to YES by default, it can be changed to NO. So if you do this, you won't be able to restore to a guaranteed restore point.

When FLASHBACK_ON is set to YES for a tablespace, Oracle Database will save Flashback log data for that tablespace and thus, the tablespace can participate in a FLASHBACK DATABASE operation.

However, when FLASHBACK_ON is set to OFF for a tablespace, then Oracle Database will not save any Flashback log data for that tablespace. That's why, if FLASHBACK_ON is set to OFF for a tablespace, you must take the datafiles of this tablespace offline (or put the tablespace offline) or drop them prior to any subsequent FLASHBACK DATABASE operation.

Relevant commands for disabling/enabling flashback for a tablespace; 

alter tablespace XXX flashback off; 
alter tablespace XXX flashback on;

So if you are planning to use guaranteed restore points, or snapshot standby technology, it is better to check v$tablespace to ensure that all the critical tablespaces are flashback enabled.

FLASHBACK_ON column in v$tablespace -> Indicates whether the tablespace participates in FLASHBACK DATABASE operations (YES) or not (NO)
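
A simple way to check this from sqlplus:

SQL> select name, flashback_on from v$tablespace;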

Interesting, right? Although the database itself does not have to be in flashback mode, all the critical tablespaces must be in flashback mode in order to be able to restore to a guaranteed restore point (or to be able to convert a snapshot standby to a physical standby).

So if, for any reason, you need to flashback your database to a restore point in a database environment where there are tablespaces for which the flashback modes are set to OFF, you need to follow the action plan documented in the following MOS note:

Flashback Database fails with ORA-38753 ORA-01110 with Tablespaces having Flashback off & RESETLOGS (Doc ID 1588027.1)

Saturday, March 31, 2018

Exadata X7-2 -- Installation & applyElasticConfig.sh

I have recently done a POC with an Exadata X7-2 1/8 machine, and here is the list of information that I gathered about the deployment of this machine. Note that, I find this information very important, as it is gathered from real field experience. There is also one very important note about using the applyElasticConfig.sh script for the initial installation of the machine, and that's why you are seeing the word applyElasticConfig.sh in the title of this post.

1) Currently, there are some problems with PXE boot based installation (imaging) of the Exadata X7-2 machine. This is what I heard from the guys @ Oracle. That's why, if you are planning to image an Exadata X7-2 environment, go with USB boot. (again, I didn't try the PXE boot, but the guys in Oracle told me so, so I'm here to warn you about it.)

2) X7-2 nodes come with 25Gbit SFP support. The SFP devices are SFP28. So if you are going to purchase SFP transceiver modules, choose them accordingly. Also note that, you can't see the green light on the SFP cards until you activate the OS interfaces mapped to them. So if you plug in a fiber cable and don't see the lights activated on the SFP cards, don't panic :)

3) The admin network is based on a Cisco Nexus switch (rather than the Cisco Catalyst that we have seen in earlier Exadata generations). That's why the configuration of the admin switch is a little different than before.

4) If the image version of the Exadata X7-2 is up-to-date, then there is no need to reimage it. (at least while doing a POC)

Currently the most up-to-date image version is 18.1.4.0.0.

This information is actually not from Oracle, but I have seen it work well. It is maybe a little bit risky, but it can be used for POCs, as it saves us time in the initial deployment of the machine.

So if the image version that comes with the newly purchased Exadata X7-2 is up-to-date, we may do the following for the installation;

  • We cable and power on the machine.
  • We configure Cisco.
  • We configure Infiniband switches & PDUs.
  • We run OEDA and put its output to db node1.
  • We run applyElasticConfig.sh, giving the OEDA xml output as the input argument. This applyElasticConfig.sh script is actually for elastic configurations, but I have seen it work in standard installations too. It is used for reconfiguring an Exadata that is delivered with the default IP addresses and hostnames. As you may already know, Exadata is delivered with default IP addresses and hostnames, and when it arrives at the customer environment, it already has the OS installed on it. So applyElasticConfig.sh can be used to reconfigure the network interfaces, IP addresses, hostnames and everything else based on what is written in the OEDA xml output. (imaging does the same, but this script can reconfigure the machine without imaging) So, once applyElasticConfig.sh is executed successfully, we run onecommand and finish our work. (a short usage sketch is given right after this list)
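
For reference, running the script usually looks something like the following (the location and the -cf switch are from my notes, and the xml file name is just an example; check Doc ID 1953915.1 for the exact syntax of your image version):

# cd /opt/oracle.SupportTools/onecommand/linux-x64
# ./applyElasticConfig.sh -cf customer-exax7.xml
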
Pretty interesting, right? It is actually documented for elastic configurations, but as I mentioned earlier, I saw it work even for standard installations as well. Again, I don't recommend this way of installation, but it may still be used for shortening the deployment time, especially during POCs.

Reference: Elastic Configuration on Exadata (Doc ID 1953915.1)

Monday, March 12, 2018

Exadata Patching-- Upgrading Exadata Software versions / Image upgrade

Recently, we completed an upgrade work in a critical Exadata environment.
The platform was an Exadata X6-2 quarter rack, and our job was to upgrade the image versions of the InfiniBand switches, cell nodes and database nodes. (this is actually called patching Exadata)

We did this work in 2 iterations; firstly in DR and secondly in PROD.
The upgrade was done with the rolling method.

We needed to upgrade the Image version of Exadata to 12.2.1.1.4. (It was 12.1.2.3.2, before the upgrade)


Well.. Our action plan was to upgrade the nodes in the following order:

InfiniBand Switches
Exadata Storage Servers(Cell nodes)
Database nodes (Compute nodes)

We started the work by gathering info about the environment.

Gathering INFO about the environment:
------------------------------------------

Current image info: we gathered this info by running imageinfo -v on each node, including the cells. We expected to see the same image versions on all nodes.

Example command:

root>dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imageinfo | grep 'Image version'"   --> for db nodes
root>dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "imageinfo | grep 'Image version'"  --> for cell nodes

In addition, we could check the image history using imagehistory command as well..

DB Home and GRID Home patch levels: We gathered opatch lsinventory outputs. (just in case)

SSH equivalency: We checked the ssh equivalency from db node 1 to all the cells, from db node 1 to all the InfiniBand switches, and from db node 2 to db node 1. (we used dcli to check this)

Example check:

with root user>
dcli -g cell_group -l root 'hostname -i'

ASM Diskgroup repair times: We checked whether the repair times were lower than 24h, and noted them down to be increased to 24h. (just before the upgrade of the cell nodes)

We used v$asm_diskgroup & v$asm_attribute

Query for checking:
SELECT dg.name,a.value FROM v$asm_diskgroup dg, v$asm_attribute a WHERE dg.group_number=a.group_number AND a.name='disk_repair_time';

Setting the attribute:
before the upgrade:
ALTER DISKGROUP diskgroup_name SET ATTRIBUTE 'disk_repair_time'='24h';
after the upgrade:
ALTER DISKGROUP diskgroup_name SET ATTRIBUTE 'disk_repair_time'='3.6h';

ILOM connectivity: We checked ILOM connectivity using ssh from the db nodes to the ILOMs. We checked using start /SP/console. (again, not web based, but over SSH)

Profile files (.bash_profile etc..): We checked the .bash_profile and .profile files, and removed the custom lines from those files. (before the upgrade)

After gathering the necessary info, we executed Exachk and concentrated on its findings:

Running EXACHK:
------------------------------------------

We first checked our exachk version using "exachk -v" and checked whether it was the most up-to-date version. In our case, it wasn't; so we downloaded the latest exachk using the link given in the document named: "Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1)"

In order to run exachk, we unzipped the downloaded exachk.zip file. We put it under the /opt/oracle.SupportTools/exachk directory.

After downloading and unzipping, we ran exachk using the "exachk -a" command as the root user. ("-a" means perform the best practice check and the recommended patch check. This is the default option; if no options are specified, exachk runs with -a)


Then we checked the output of exachk and took the corrective actions where necessary.
After the exachk, we continued with downloading the image files.

Downloading the new Image files:
------------------------------------------
All the image versions and links to the patches were documented in "Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)"

So we opened the document 888828.1 and checked the table for "Exadata 12.2". (as our target image version was 12.2.1.1.4)
We downloaded the patches documented there..

In our case, following patches were downloaded;

Patch 27032747 - Storage server and InfiniBand switch software (12.2.1.1.4.171128)   : This is for Cells and Infiniband switches.

Patch 27103625 - x86-64 Database server bare metal / domU ULN exadata_dbserver_12.2.1.1.4_x86_64_base OL6 channel ISO image (12.2.1.1.4.171128)  : This is for DB nodes.

The Cell & InfiniBand patch was downloaded to DB node 1 and unzipped there. (SSH equivalency is required between DB node 1 and all the cells + all the InfiniBand switches) (it can be unzipped in any location)

The Database Server patch was downloaded to DB node 1 and DB node 2 (if the Exa is a 1/4 or 1/8) and unzipped there. (it can be unzipped in any location)


Note: the downloaded and unzipped patch files should belong to the root user.

After downloading and unzipping the Image patches, we created the group files..

Creating the group files specifically for the image upgrade:
------------------------------------------

In order to execute patchmgr, which is the tool that does the image upgrade, we created the files dbs_group, cell_group and ibswitches.lst.
We placed these files on db node 1 and db node 2.

cell_group files : contains the hostnames of all the cells.
ibswitches.lst files : contains the hostnames of all the infiniband switches.
dbs_group file on DB node 1: contains the hostname of only DB node2

dbs_group file on DB node 2: contains the hostname of only DB node1
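
For illustration, the files looked something like this (the hostnames below are made up):

# cat /root/cell_group
exa01cel01
exa01cel02
exa01cel03
# cat /root/ibswitches.lst
exa01sw-iba01
exa01sw-ibb01
# cat /root/dbs_group        -- on db node 1; it contains only db node 2
exa01dbadm02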

At this point, we were at an important stage, as our upgrade was almost beginning. However, we still had one important thing to do, and that was the precheck.

Running Patchmgr Precheck (first for Cells, then for Dbs, lastly for Infiniband Switches -- actually, there was no need to follow an exact sequence for this): 
------------------------------------------

In this phase, we ran the patchmgr utility with its precheck arguments to check the environment before the patchmgr based image upgrade.
We used the patchmgr utility that comes with the downloaded patches.
We ran these checks using the root account.

Cell Storage Precheck: (we run it from dbnode1; it then connects to all the cells and does the checks)

Approx Duration : 5 mins total

# df -h (check disk size, 5gb free for / is okay.)
# unzip p27032747_122110_Linux-x86-64.zip
# cd patch_12.2.1.1.4.171128/
# ./patchmgr -cells cell_group -reset_force
# ./patchmgr -cells cell_group -cleanup

# ./patchmgr -cells cell_group -patch_check_prereq -rolling

Database Nodes Precheck: (we run it from dbnode1 and dbnode2, so each db node is checked separately. This is because our dbs_group file on each node contains only one db node name.)

Approx Duration : 10 mins per db.

        # df -h (check disk size, 5gb free for / is okay.)
# unzip p27032747_122110_Linux-x86-64.zip
# cd patch_12.2.1.1.4.171128/
# ./patchmgr -dbnodes dbs_group -precheck -nomodify_at_prereq -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip

Infiniband Switches: # ./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck

Note that, while doing the database precheck, we used the -nomodify_at_prereq argument to make patchmgr not delete the custom rpms automatically during its run.

So, when we used -nomodify_at_prereq, patchmgr created a script to delete the custom rpms. This script was named /var/log/cellos/nomodify*.. We could later (just before the upgrade) run this script to delete the custom rpms. (we actually didn't use this script, but deleted the rpms manually one by one :)

Well.. We reviewed the patchmgr precheck logs. (note that we ignored custom rpm related errors, as we planned to remove them just before the upgrade)

The cell precheck output files were all clean. We only saw an LVM related error in the database node precheck outputs.

In the precheck.log file of db node 1, we had ->

ERROR: Inactive lvm (/dev/mapper/VGExaDb-LVDbSys2) (30G) not equal to active lvm /dev/mapper/VGExaDb-LVDbSys1 (36G). Backups will fail. Re-create it with proper size.

As for the solution: we implemented the actions documented in the following note. (we simply resized the lvm)

Exadata YUM Pre-Checks Fails with ERROR: Inactive lvm not equal to active lvm. Backups will fail. (Doc ID 1988429.1)

So, after the precheck, we were almost there :) just... we had to do one more thing;

Discovering additional environmental configurations and taking notes for disabling them before the DB image upgrade:
------------------------------------------

We checked the existence of customer's NFS shares and disabled them before db image upgrade.
We also checked the existence of customer's crontab settings and disabled them before db image upgrade.

These were the final things to do , before the upgrade commands..
So, at this point, we actually started executing the upgrade commands;

Running "Patchmgr for the upgrade" (first for infiniband switches, then for Cells,  lastly for Dbs)
------------------------------------------

Upgrading InfiniBand switches: (we run it from dbnode1; it then connects to all the InfiniBand switches and does the upgrade; the job is done in a rolling fashion)

Note: InfiniBand image versions are normally different from the cell & db image versions. This is because the InfiniBand switch is a different type of device, so its versioning differs from the cells and db nodes.
Note: We could get a list of the InfiniBand switches using the ibswitches command (we run it from the db nodes as root)

We connected to Db node 1 ILOM (using ssh)
We run command start /SP/console
Then, with root user -> we changed our current working directory to the directory where we unzipped the Cell Image patch.

Lastly we run (with root) -> # ./patchmgr -ibswitches ibswitches.lst -upgrade (approx : 50 mins total)

Upgrading Cells/Storage Servers: (we run it from dbnode1; it then connects to all the cell nodes and does the upgrade. The job is done in a rolling fashion)

We connected to Db node 1 ILOM (using ssh)
We run command start /SP/console
Then, with root user -> we changed our current working directory to the directory where we unzipped the Cell Image patch.
Lastly we run (using root account) ->

# ./patchmgr -cells cell_group -patch -rolling   (approx : 90 mins total)

This command was run from the DB node 1 and it upgraded all the cells in one go.. Rebooted them one by one , etc.. There was no downtime in the database layer.. All the databases were running during this operation.

After this command completed successfully, we cleaned up the temporary file with the command :
# ./patchmgr -cells cell_group -cleanup

We checked the new image version using imageinfo & imagehistory commands on cells and continued with upgrading the database nodes.
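
For example, a check like the following can be run from db node 1 (grep -i is used here just to catch the version lines regardless of their exact case):

# dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "imageinfo | grep -i 'image version'"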

Upgrading Database Nodes: (this must be executed from node 1 for upgrading node 2, and from node 2 for upgrading node 1, so it is done in 2 iterations -- we actually chose this method..)

During these upgrades, the database nodes are rebooted automatically. In our case, once the upgrade was done, the databases and all other services were automatically started.

We first deleted the custom rpms (note that, we needed to reinstall them after the upgrade)

We disabled the custom crontab settings.
We unmounted the custom nfs shares. (we also disabled nfs-mount-related lines in the relevant configuration files , for ex: /etc/fstab, /etc/auto.direct)

--upgrading image of db node 2

We connected to Db node 1 ILOM (using ssh)
We run command start /SP/console
Then, with root user -> we changed our current working directory to the directory where we unzipped the Database Image patch.

Important note: Before running the below command, we modified the dbs_group.. At this phase, dbs_group should only include db node 2's hostname. (as we upgraded nodes one by one and we were upgrading the db node 2 first -- rolling)

Next, we run (with root) ->

# ./patchmgr -dbnodes dbs_group -upgrade -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip   (approx: 1 hour)

Once this command completed successfully, we could say that, Image upgrade of db node 2 was finished.

--upgrading image of db node 1

We connected to Db node 2 ILOM (using ssh)
We run command start /SP/console
Then, with root user -> we changed our current working directory to the directory where we unzipped the Database Image patch.

Important note: Before running the below command, we modified the dbs_group.. At this phase, dbs_group should only include db node 1's hostname. (as we were upgrading nodes one by one and as we already upgraded db node 2 and this time, we were upgrading db node 1. -- rolling)

Next we run (with root) ->

# ./patchmgr -dbnodes dbs_group -upgrade -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip (approx: 1 hour)

Once this command completed successfully, we could say that, Image upgrade of db node 1 was finished.

At this point, our upgrade was finished!!

We re-enabled the crontabs, remounted the NFS shares, reinstalled the custom rpms and started testing our databases.

Some good references:
Oracle Exadata Database Machine Maintenance Guide, Oracle.
Exadata Patching Deep Dive, Enkitec.