Various Information Gained from Experiences


25 comments :

  1. EBS 11.5.10.2 Server Migration --- dependencies..
    -----------------------------------------------
    Last night I migrated an EBS 11.5.10.2 environment (app + db) to a new server.
    The old and the new servers were running the same operating system (Solaris 10). The directory structure was the same. Even the operating system usernames and the target paths were the same.
    After copying the datafiles and application files, configuring the environment and opening EBS, there was no problem at all. Everything was normal and the system was healthy.

    Still, I was expecting a problem related to some dependency between the software and the OS, or maybe the hardware. There must be a dependency (there shouldn't be, actually, but there always is :) ). Finally that little problem showed itself. There was a jar file used to upload SQL*Loader datafiles to the server, in order to prepare data for SQL*Loader. This SQL*Loader process was triggered from a Concurrent Program for loading data into EBS, and the Concurrent Set that contains this Concurrent Program was completing with errors.

    The problem was the operating system account passwords: they were not the same on the new server. This Java program was using an operating system account to log in to Solaris and upload the text files.
    I changed the passwords on the new server to match the old server, and the problem disappeared.

  2. Linux Redirection on Backup scripts
    -------------------------------------

    Last night I realized that some bash scripts used on production systems to back up the databases have logging lines such as:

    cp * location >> logfile;

    When checking the log files of these backups, as expected, no errors can be seen.
    But that does not mean that no errors were produced during these backups.
    With ">>", the script only redirects stdout, appending it to the file; stderr never reaches the log.
    To log both the errors and the output generated while the script runs, " cp * location >>logfile 2>&1 " must be used.

    Starting with bash 4, " cp * location &>>logfile " can also be used for this operation, but this form does not work in older bash versions.
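
    A minimal sketch of the difference, using a temporary directory and a deliberately missing source file (both are assumptions for the demonstration, not part of the original backup scripts):

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
stdout_only_log="$workdir/stdout_only.log"
combined_log="$workdir/combined.log"

# cp reports the missing file on stderr and fails on purpose;
# "|| true" only keeps the demo going if it is run under "set -e".

# Only stdout is redirected: the error goes to the terminal and the log stays empty.
cp "$workdir/no_such_file" "$workdir/copy" >> "$stdout_only_log" || true

# stdout and stderr are both appended: the error message ends up in the log.
cp "$workdir/no_such_file" "$workdir/copy" >> "$combined_log" 2>&1 || true

echo "stdout-only log: $(wc -c < "$stdout_only_log") bytes"
echo "combined log:    $(wc -c < "$combined_log") bytes"
```

    The first log stays empty even though the copy failed, which is exactly why such backup logs look clean.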

  3. ORA-00600 [kewrose_1] ORA-00600 [ktsplbfmb-dblfree] on 11.2.0.3
    ------------------------

    Today, ORA-00600 [kewrose_1] and [ktsplbfmb-dblfree] errors were produced at a customer using 11.2.0.3 on Exadata X2-2.
    After investigating the incident, I found that the following SQL was the cause of the errors:

    {INSERT INTO wrh$_sql_plan sp (snap_id, dbid, sql_id, plan_hash_value, id, operation, options, ....

    I searched Oracle Support and found a few documents that address this bug. One of the documents actually matches our situation, but it is written for 11.2.0.1 and does not include any fix or workaround.

    As a workaround, I suggested that my colleague (who is responsible for this site) rebuild the AWR repository, because the wrh$.. table is an AWR table.

    We will see if rebuilding the AWR fixes the error.

  4. Yes, rebuilding the AWR repository fixed the issue.

  5. Enterprise Manager Tablespace Usage Threshold
    ---------------------------------------------------
    If the Warning Threshold value is bigger than the Critical Threshold, Enterprise Manager resets the default tablespace thresholds to "not defined". This is because the thresholds are logically wrong: the Warning Threshold value cannot be bigger than the Critical one.

    This behaviour of Enterprise Manager cost us some time, as we thought that the mail infrastructure of Enterprise Manager was not working properly.
    We then noticed the "not defined" thresholds on the tablespaces, and fixed the issue by setting a lower value for the Warning Threshold.

  6. Oracle EBS "I" and "l" character conversion
    --------------------------------------------

    The OM Debug File Retrieval diagnostic was failing with the error "cat: 0652-050 Cannot open /usr/tmp/I0001064.dbg.", although the file was in /usr/tmp.
    Actually, the requested file name and the existing file name were different. The actual file name started with a lowercase "l" (the letter L), but the requested name started with an uppercase "I" (the letter i).

    The debug note already mentions this (How to generate a debug file in OM [ID 121054.1]):

    To retrieve the debug file, navigate to Order Management > Run Requests. Then choose:
    Diagnostics: OM Debug File Retrieval.
    Parameters: Give the name of the debug file noted in Step 3 above. Only give the file name, not the directory path. Remember the first letter in the file name is lower case 'L'

    So the application user had entered the file name incorrectly (confusing the "I" and "l" characters). That is why the error was produced.
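
    A small sketch for spotting this kind of mismatch: it lists files whose name matches the requested one except for the first character. The function name is hypothetical, and the "?" glob is a swapped-in technique, not something the diagnostic itself does.

```shell
# List files in a directory whose name matches the given name
# except for the first character ("?" matches exactly one character).
find_first_char_variants() {
    local dir=$1 name=$2
    local rest=${name#?}          # the name without its first character
    local f
    for f in "$dir"/?"$rest"; do
        [ -e "$f" ] && basename "$f"
    done
    return 0
}
```

    For the case above, find_first_char_variants /usr/tmp I0001064.dbg would list the lowercase-L file if it exists.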

  7. Something like Concurrent Manager Internals
    -----------------------------------------

    Today I faced a strange behaviour in the Concurrent Manager shutdown process.
    At one of my customers, the Concurrent Managers were not shutting down using adcmctl.sh.
    I tried adcmctl.sh, even with abort, but all the concurrent managers and their OS processes just remained running.
    After spending some time, I realized that the Concurrent Manager shutdown is implemented through a concurrent request (named Shutdown).
    This Shutdown concurrent request was sitting in the queue of the Internal Concurrent Manager (Phase: Pending, Status: Standby).

    After some investigation, I saw that the concurrent managers actually did shut down, but only after 10-15 minutes. So the Shutdown concurrent request was processed eventually, and the question was obvious: what took so much time?

    At this point, I decided to look into the database. I investigated the sessions and saw that while the Shutdown request was in Standby, the ICM was waiting on enq-tx, which is a lock wait.
    The blocker was FNDSCH. I killed FNDSCH just to be sure (it was a lab environment).
    After FNDSCH was killed, the Shutdown request was processed in 10 seconds.

    So FNDSCH was doing nothing, but its transaction was open and it was blocking the ICM (on fnd_concurrent_request or the queues).

    I'll keep it short. The problem was in the FNDSCH concurrent manager: its sleep time was high (5 minutes).
    This left the ICM unable to process the Shutdown request.

    The solution was to decrease the sleep time of FNDSCH. I set it to 30 seconds.
    After this setting, the problem was solved. All subsequent concurrent manager shutdowns finished in 1 or 2 minutes.

  8. While patching from EBS 12.0.6 to 12.1.3 RUP5, I could not find the NLS translated version of a patch. This was my second iteration, so I decided to check the server that I used for the first iteration. The weird thing was that the NLS translated patch had already been applied in the first iteration, but the patch could not be found anywhere (the server's filesystem, Oracle Support, etc.).

    I analyzed the ad_bugs table and saw that the creation_date of this NLS translation patch was almost the same as (just after) the creation_date of another patch.

    So this patch was bundled in another patch. That is why I couldn't find it in Oracle Support. That was the story.

    The patch number was 13626800. There is a Turkish version of it.
    On the other hand, patch 14586882 includes the Turkish version of 13626800.

  9. I was installing Discoverer Desktop and Admin 11g on a Windows 7 64-bit laptop.
    I unzipped the installation files and found a subdirectory named Win64 in the Discoverer Setup folder. Naturally, I used the setup.exe in the Win64 folder. The installer started, but then closed automatically without any trace.

    As a workaround, I installed Discoverer Desktop & Admin from the subdirectory named Win32 in the Discoverer Setup directory, and the installer completed successfully.

  10. Because of a power cut on a Linux system, a disk became physically corrupted.
    This disk was mounted on the /home directory, so after power-up the home directories of all users were lost.
    This server was hosting Apache services.
    After the power cut, Apache could not start and gave up with the following error:
    ----------------------------------------------------------------------------------
    05/29/13-11:47:23 Starting Apache Web Server Listener (dedicated PLSQL)
    Syntax error on line 1362 of httpd_pls.conf
    Invalid command 'include', perhaps mis-spelled or defined by a module not included in the server configuration
    apachectl start: httpd could not be started
    -----------------------------------------------------------------------------------
    So the invalid command was 'include'. Is that possible? Of course not; include is a key statement, part of the main syntax.
    After analyzing the situation, the problem was solved by unsetting the LANG environment variable for the OS user that starts the Apache services
    (unset LANG in the user's .bash_profile).
    The root cause of the problem was the power cut,
    because it made the system lose the user homes, and with them the user's unset LANG setting.
    In addition, the server's default setting was Turkish (tr_TR.UTF8) in /etc/sysconfig/i18n, which is the default setting file for the system language.
    So the OS user starting Apache was using the Turkish LANG value.
    This made Apache fail to recognize the "i" character (under the Turkish locale the case conversion of "i" and "I" differs from English), so the invalid command error was produced.
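
    A minimal sketch of a guard that could sit in the start-up profile of the OS user that starts Apache. The function name and the message are hypothetical; the only action taken is the same unset LANG described above.

```shell
# Succeeds when the given locale name is a Turkish one (for example tr_TR.UTF-8).
is_turkish_locale() {
    case "$1" in
        tr_*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Hypothetical pre-start check: drop a Turkish LANG before Apache is started.
if is_turkish_locale "${LANG:-}"; then
    echo "LANG=$LANG is a Turkish locale; unsetting it before starting Apache"
    unset LANG
fi
```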

  11. One customer was encountering REP-3000 errors in some E-Business Suite concurrent programs (Oracle Reports).
    REP-3000 is a common error for EBS systems on Linux/Unix, and the usual solution is to disable access control for X sessions using the xhost + command. The command should be run on the server itself, from a local connection. Before running xhost +, the DISPLAY environment variable should be set to the DISPLAY used by the EBS concurrent managers.
    The display used by the concurrent managers of the customer's EBS system was hostname:0.0.
    The problem was that, after setting the DISPLAY variable to hostname:0.0, the xhost + command encountered errors.

    The command was:

    export DISPLAY=exatest:0.0
    xhost +

    The error was;

    xhost: unable to open display "exatest:0.0"


    After some analysis, I found that an X server was actually listening on display :0, but its authorization file had to be set before running the xhost + command.
    I found the authorization file of the relevant X server process using the proc filesystem, and set it before running xhost +. That fixed the issue.

    The solution was:

    [root@exatest ~]# export XAUTHORITY=/var/gdm/:0.Xauth
    [root@exatest ~]# export DISPLAY=exatest:0.0
    [root@exatest ~]# xhost +
    access control disabled, clients can connect from any host
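
    A sketch of how the authorization file can be read from the proc filesystem, assuming the X server was started with an -auth argument and runs under the process name X or Xorg (both are assumptions; the helper name is hypothetical):

```shell
#!/usr/bin/env bash
# Print the value that follows "-auth" in an X server command line given as one string.
xauth_from_cmdline() {
    local -a words
    local i
    read -r -a words <<< "$1"
    for ((i = 0; i < ${#words[@]} - 1; i++)); do
        if [[ ${words[i]} == "-auth" ]]; then
            printf '%s\n' "${words[i+1]}"
            return 0
        fi
    done
    return 1
}

# /proc/<pid>/cmdline separates arguments with NUL bytes; turn them into spaces first.
for pid in $(pgrep -x X 2>/dev/null; pgrep -x Xorg 2>/dev/null); do
    cmdline=$(tr '\0' ' ' < "/proc/$pid/cmdline")
    echo "pid $pid: $(xauth_from_cmdline "$cmdline" || echo 'no -auth argument')"
done
```

    The printed path is the value to export as XAUTHORITY before running xhost +.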

  12. One of my customers was questioning the premature archivelogs generated by their production Oracle database system.

    After analysis, the issue was rejected, as this is designed behaviour. It is explained in the following blog post:

    http://ermanarslan.blogspot.com/2013/06/database-premature-archivelogs-log.html

  13. The following performance issue was encountered at a customer site, on an EBS 11.5.0.2 production system.

    At certain times (especially in the evenings), the production system was suffering from slow query performance.
    After some analysis, I found that parallel query servers were running and generating a high load. These parallel servers were fetching data for a full table scan operation.

    The source of this parallel query operation was the following PL/SQL:

    BEGIN WF_EVENT.LISTEN ( p_agent_name => :1, p_wait => :2, p_correlation => :3, p_deq_condition => null, p_message_count => :4, p_max_error_count => :5 ); END;

    This PL/SQL was using the query below to scan the workflow queue (wf_notification_in):

    select tab.rowid, tab.msgid, tab.corrid, tab.priority, tab.delay, tab.expiration, tab.retry_count, tab.exception_qschema, tab.exception_queue, tab.chain_no, tab.local_order_no, tab.enq_time, tab.time_manager_info, tab.state, tab.enq_tid, tab.step_no, tab.sender_name, tab.sender_address, tab.sender_protocol, tab.dequeue_msgid, tab.user_prop, tab.user_data from "APPLSYS"."WF_NOTIFICATION_IN" tab where msgid = :1

    Plan -> full table scan and high cost
    SELECT STATEMENT CHOOSE Cost: 146.456 Bytes: 909 Cardinality: 1
    1 TABLE ACCESS FULL TABLE APPLSYS.WF_NOTIFICATION_IN Cost: 146.456 Bytes: 909 Cardinality: 1


    So the WF_NOTIFICATION_IN table was oversized.
    As a solution, I recommended recreating the queue by following this note: Workflow Queues Creation Scripts [ID 398412.1]

    But as the subject is workflow, it is a dangerous thing to do, so we have not rebuilt the queue yet. I will update this post with the final actions.

  14. Analyzing unexpected eof in bash script
    -----------------------------------------
    A backup script was producing an unexpected EOF error when I tried to run it.
    Here is the approach I followed to analyze and correct the error.

    With sh -x, I traced the bash script to find the exact line that was producing the error.

    After that I used cat -vet to see the nonprinting characters.

    From the cat man page:
    -e                      equivalent to -vE
    -E, --show-ends         display $ at end of each line
    -v, --show-nonprinting  use ^ and M- notation, except for LFD and TAB
    -t                      equivalent to -vT
    -T, --show-tabs         display TAB characters as ^I

    In my case, the source of the problem was a missing ";" when calling a PHP script.
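
    A short sketch of this kind of check on a deliberately broken script. The script content is hypothetical, and bash -n (parse without executing) is used here as a complement to the sh -x trace described above.

```shell
#!/usr/bin/env bash
# Hypothetical broken script: the double quote on the echo line is never closed.
script=$(mktemp)
printf 'echo "backup started\n' > "$script"

# bash -n parses the script without executing it and reports syntax errors.
if bash -n "$script" 2> "$script.err"; then
    echo "syntax OK"
else
    echo "syntax error reported by bash -n:"
    cat "$script.err"
fi

# cat -vet makes tabs (^I), carriage returns (^M) and line ends ($) visible.
printf 'a\tb\r\n' | cat -vet
```

    A stray ^M at the end of lines in the cat -vet output usually means the script was saved with Windows line endings.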


  15. HPUX printer Turkish character problem
    ------------------------------------------
    One of my customers was not able to print Turkish characters like "ç,ş,i" etc. through a newly configured printer.
    "This problematic printer was configured just like the other printers in the system," said my customer.
    After taking a look at the printer configurations using SAM, it seemed the printer was indeed configured just like the other remote printers.
    Then I decided to take a deep look through the HP-UX printer configuration files.

    The key files were under the /etc/lp/interface directory.

    Every printer has a configuration file in that directory.

    While searching through the configuration files of the printers that were able to print Turkish characters, the following lines attracted my attention:
    ---
    #Following line added to this model script by HP Support
    #to Convert TR.iso88599 to ISO
    /usr/bin/turkceyap <"$1">/tmp/myprn.$$
    mv /tmp/myprn.$$ $1
    ---
    :) /usr/bin/turkceyap was a binary. "turkceyap" actually means "make it Turkish".
    So this was a post-processing action, and it was missing from the configuration file of the problematic printer. I added the relevant lines to that printer's configuration file and restarted the spooler.

    The problem disappeared, as expected.

  16. EBS Datafile block corrupted, actually not corrupted..
    -----------------------------------------------------
    One of my clients reported a corruption issue last week.
    The EBS version is 11i, and a concurrent program was encountering the following error: ORA-01578: ORACLE data block corrupted (file # 28, block # 33523) ORA-01110: data file 28:

    I checked the datafile with dbv, and dbv found the corruption.
    After that I found the object that the corrupted block belonged to. It was an index block.
    So as a solution, we recreated the index.

    The next day, the client reported the issue again. The corruption reported was exactly the same: same block, same datafile.
    I checked the datafile with dbv and with dbms_repair.check_object, and this time the tools did not report any corruption.
    It was strange, but whatever. To fix it, I dropped the index; no luck.
    I created a table in the problematic tablespace and extended it up to the problematic block (to reformat the block properly). Still no luck.
    In the end, we understood that no new error was being raised at all.
    The error numbers were written to the output of the concurrent program because its query was selecting the error code columns of all the records. As a result, the corruption errors that we had already fixed were displayed in the output.
    Looking at the output, you would think that these errors were encountered during the query. That was the story.
    As a solution (to stop displaying these error lines), I put the concurrent request into debug mode and found the SQL statements used in the concurrent program. I found the table and deleted the corruption error strings in the error_message column.

    After that, we ran the concurrent program again, and the output was produced as expected.

    It was an interesting case, because it made me think that there must be a bug and that the corruption would never disappear.


  17. EBS-OID authentication error
    -------------------------------------
    At a customer site, where we deployed OID 11g and SSO to authenticate EBS users, we faced a strange problem.
    The problem was strange because it appeared only occasionally.
    More specifically, when the problem appeared, users could not log in to EBS on their first try. Most of the time, they could log in on their third or fourth try.

    After analyzing the situation with the LDAP admins, we found that the customer has 8 LDAP servers. These LDAP servers were load balanced, and the connections were distributed round robin. Unluckily, the user accounts had expired on some of these LDAP servers. So when a login request landed on one of these problematic LDAP servers, the user could not log in.

    Solution: the LDAP team corrected the accounts in question and the problem disappeared.

  18. AIX slibclean..
    ------------------------
    While applying a PSU to Oracle Database 11.2.0.3 on the AIX platform, opatch encountered errors like "cannot copy file to the destination..". We checked the permissions, file paths and environment, and did not see any problems.
    After analyzing the situation, we saw that the error was produced while copying libraries and related files. As a solution, we ran a loop that executed the AIX slibclean command every 2 seconds during the patch application. With this method, we successfully applied the PSU.
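
    A sketch of such a loop, written as a generic helper so it can be tried on any system. The function name is hypothetical, and the exact loop used at the time is not shown in this comment.

```shell
#!/usr/bin/env bash
# Run the given command every <interval> seconds until the loop is killed.
run_periodically() {
    local interval=$1
    shift
    while true; do
        "$@"
        sleep "$interval"
    done
}

# Intended use on AIX, as root (slibclean exists only there):
#   run_periodically 2 slibclean &
#   loop_pid=$!
#   ... apply the PSU with opatch in the same session ...
#   kill "$loop_pid"
```

    Running the loop in the background and killing it by PID keeps it from outliving the patch session.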

  19. Oracle Application Server/Single Sign on installation on AIX -- port conflict..

    Using the Oracle Installer over VNC, my customer was trying to install SSO 10g on an AIX server.
    In the OPMN start phase of the installation, a port conflict error (unable to bind, port already in use) was encountered.
    After analysis, I found that the vncserver on display :3 was blocking OPMN, because that vncserver was using port 6003, which was actually the OPMN port. I killed that vncserver, completed the installation on display :2, and the problem disappeared.
    So it seems VNC on AIX uses the ports starting from 6000. On Linux the VNC ports start from 5900, and that is why I think the same error would not be produced on Linux.
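
    The port arithmetic can be sketched as below, assuming the conventional numbering where display :N uses TCP port 6000 + N for the X11 protocol and 5900 + N for VNC (RFB) clients. The function name is hypothetical.

```shell
# Print the conventional TCP ports associated with X/VNC display number N.
display_ports() {
    local n=$1
    echo "rfb=$((5900 + n)) x11=$((6000 + n))"
}

display_ports 3   # display :3 from the case above

# To see whether something already listens on a port before installing:
#   netstat -an | grep 6003
```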

  20. Being able to open a wallet with OWM does not mean that the wallet can be used for the SSL connection. What matters is the content of the wallet.
    And remember, cwallet.sso is used as a file-based credential store. This file is created by OWM to provide the auto-login mechanism.

  21. If you define a printer using Samba on Linux, you need to supply a username and password defined in the Windows environment (Active Directory) to access the printer over the network.

    When supplying the username, you need to include the domain prefix.
    For example: ERMAN.ORG/ermanuser

    You can also check access to the printer using smbclient.
    For example:

    smbclient -L //dns_name_of_the_server -I ip_of_the_server -U ERMAN.ORG/ermanuser

  22. During an Exadata POC, we had to create an EBS clone environment. We used a full compressed RMAN backup for this. The RMAN duplicate commands were executed, but there was a problem: the duplicate (restore) was running very slowly on the Linux machine. The machine had 2 Intel CPUs, each with 4 cores. The server's physical memory was 48 GB, and nothing was running on the server except the RMAN restore job.
    When I started the analysis, I suspected network latency, because the RMAN backup files were on an NFS filesystem. But after some digging, I found the server was swapping, although it had 48 GB of memory. The kswapd processes were active, and RMAN was running very slowly. This was increasing the latency of the restore operation.

    Then I put on my Linux admin hat, analyzed the memory on the Linux server, and found that HugePages were configured on the server. The hugepage count was above 30000, so there were almost no 4k pages available on the server. Also note that hugepages are never swapped out.
    This made it difficult for RMAN to write into normal (4k) memory pages; RMAN needed swap-out operations to get memory to write into.

    As a solution, I reduced the hugepage count, activated the change dynamically, and everything went back to normal. kswapd became idle as expected, and the RMAN restore job accelerated significantly.
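
    A sketch of the kind of check that exposes this situation, reading the HugePages fields from /proc/meminfo. The function name and the target value in the final comment are hypothetical.

```shell
#!/usr/bin/env bash
# Print the memory reserved for HugePages, in MB, from a meminfo-style file.
hugepages_reserved_mb() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { printf "%d\n", total * size_kb / 1024 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "HugePages reserve: $(hugepages_reserved_mb) MB"
    grep -E '^(MemTotal|MemFree|HugePages_Total|HugePages_Free|Hugepagesize):' /proc/meminfo
fi

# Lowering the reservation at runtime needs root; 1024 is a hypothetical target:
#   sysctl -w vm.nr_hugepages=1024
```

    Comparing the reserve against MemTotal shows how little memory is left for ordinary 4k pages.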


    Replies
    1. the solution came from the least expected part :) pretty interesting.

    2. yes Mehmet, it was interesting..
      Thanks to this incident, that is not the least expected part anymore :)

  23. In one of my customer's EBS 11i clone environments (which was on HP-UX, by the way), Apache could not start.
    When we tried to start it, it produced the error below:

    adapcctl.sh version 115.47
    [Thu Jan 2 16:33:58 2014] [error] OPM: PRIV: Error resolving host name: localhost
    /u01/klon/klonora/iAS/Apache/Apache/bin/apachectl start: httpd could not be started
    [Thu Jan 2 16:34:01 2014] [error] OPM: PRIV: Error resolving host name: localhost
    /u01/klon/klonora/iAS/Apache/Apache/bin/apachectl start: httpd could not be started
    adapcctl.sh: exiting with status 3


    Looking at the error, you can easily say that Apache cannot resolve the name localhost. But I checked the /etc/hosts file and everything was proper.

    I mean,
    in /etc/hosts, there was already a line for localhost:

    127.0.0.1 localhost.localdomain localhost

    Then I did some investigation :) actually I worked for 2 minutes and found that the server's hostname was set to "-a" :)

    So that was the problem: although Apache said that it could not resolve localhost, it was actually trying to get the server's hostname from its IP address.

    Anyway, the fatal error was a human error, as someone had accidentally set the hostname of the server to "-a".

    After setting the proper hostname, Apache could start without problems.
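
    A quick sanity check that would have flagged this hostname, as a sketch. The function name is hypothetical and the rule is a simplified one: dot-separated labels of letters, digits and hyphens, with no label starting or ending with a hyphen.

```shell
#!/usr/bin/env bash
# Succeeds when the name looks like a valid hostname under the simplified rule above.
is_valid_hostname() {
    local label='[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'
    local re="^${label}(\\.${label})*\$"
    [[ $1 =~ $re ]]
}

# uname -n prints the node name the system is currently using.
current=$(uname -n)
if is_valid_hostname "$current"; then
    echo "hostname looks sane: $current"
else
    echo "suspicious hostname: '$current'"
fi
```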
