Gavin Soorma


Oracle GoldenGate 11gR2 high availability with Oracle 11gR2 Clusterware

This note discusses how to achieve high availability in an Oracle 11g GoldenGate environment using Oracle 11gR2 Grid Infrastructure Clusterware services.

The GoldenGate Manager process has configuration parameters such as AUTOSTART and AUTORESTART which help prevent outages by ensuring that the GoldenGate Extract and Replicat processes are restarted in the event of a failure. Note, however, that for this to happen the Manager process itself needs to be up and running.

But what happens if the Manager process itself fails as a result of a server crash, OS crash, network failure or database crash?

So let us look at how to provide high availability for the manager process in a clustered environment.

In this example we are using GoldenGate in a two-node 11gR2 RAC cluster with ASM storage configured.

In a two-node RAC cluster configuration, Oracle GoldenGate runs on only one server at any given time. If that server goes down, GoldenGate is restarted automatically by CRS on the other available node.

While we can install GoldenGate on both nodes, we need to ensure that the checkpoint files (located in the dirchk directory) and the trail files reside in a location that is accessible from either node, and that the parameter files are identical on both nodes of the cluster.

So in our case we are using ACFS, the Automatic Storage Management (ASM) Cluster File System, which acts as a shared storage location. We need to install GoldenGate just once in this shared location, and because the ACFS is mounted on both nodes of the cluster it can then be accessed from either node.

In addition to configuring the ACFS, we also need to obtain an unused IP address on the public subnet which is registered in DNS – this is the VIP, or Virtual IP address. In the event of a node failure, Oracle Clusterware will migrate the VIP to a surviving node in the cluster.

 

Have a read of the Oracle white paper on the subject – I have followed that in setting up the GoldenGate clusterware.

Oracle White Paper—Oracle GoldenGate high availability with Oracle Clusterware
 

Let us now take a look at the steps involved.

 

Install the GoldenGate software on the shared location

The first thing we do here is set up and configure ACFS and install the 11gR2 GoldenGate software in this shared location.

In our case, we have a mountpoint /goldengate which is the ASM Cluster File System and this file system is the shared location for the GoldenGate software mounted on both nodes in the cluster.

In short: create the ASM disk group, then the volume, and finally the ASM Cluster File System.

Then install the GoldenGate 11gR2 software in this location. As this is a shared location, we need to install GoldenGate just once.
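As a rough illustration of those three steps, the command sequence might look like the sketch below. This is not runnable outside a live Grid Infrastructure cluster, and the disk group name GGDATA, volume name ggvol, disk path, size and volume device name are all hypothetical placeholders for your own values:

```shell
# Hypothetical sketch -- substitute your own names, sizes and devices.

# 1. Create the ASM disk group (as the grid user, against the ASM instance):
sqlplus / as sysasm <<'EOF'
CREATE DISKGROUP GGDATA EXTERNAL REDUNDANCY
  DISK '/dev/oracleasm/disks/GGDISK1';
EOF

# 2. Create an ADVM volume inside the disk group:
asmcmd volcreate -G GGDATA -s 10G ggvol
asmcmd volinfo -G GGDATA ggvol     # note the Volume Device, e.g. /dev/asm/ggvol-123

# 3. Create and mount the ACFS on the volume device (as root):
mkfs -t acfs /dev/asm/ggvol-123
mkdir -p /goldengate
mount -t acfs /dev/asm/ggvol-123 /goldengate

# Register the file system with Clusterware so it is mounted on both nodes:
srvctl add filesystem -d /dev/asm/ggvol-123 -g GGDATA -v ggvol -m /goldengate -u oracle
srvctl start filesystem -d /dev/asm/ggvol-123
```

With the file system mounted cluster-wide, a single GoldenGate installation under /goldengate is visible from both nodes.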

Create the application VIP

 

From the GRID_HOME/bin directory run the appvipcfg command to create the application VIP. Oracle Clusterware assigns this VIP to a physical server in the cluster and will migrate the VIP to a surviving node in the cluster in the event of a server failure.

As root:

[root@mycorp-racnode1 bin]# ./appvipcfg create -network=1 -ip=10.50.20.52 -vipname=mycorp-oragg-vip -user=root
Production Copyright 2007, 2008, Oracle.All rights reserved
2012-11-28 03:21:48: Skipping type creation
2012-11-28 03:21:48: Create the Resource
2012-11-28 03:21:48: Executing /u01/app/11.2.0.3/grid/bin/crsctl add resource mycorp-oragg-vip -type app.appvip_net1.type -attr "USR_ORA_VIP=10.6.20.52,START_DEPENDENCIES=hard(ora.net1.network) pullup(ora.net1.network),STOP_DEPENDENCIES=hard(ora.net1.network),ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:root:r-x',HOSTING_MEMBERS=mycorp-racnode1,APPSVIP_FAILBACK="
2012-11-28 03:21:48: Executing cmd: /u01/app/11.2.0.3/grid/bin/crsctl add resource mycorp-oragg-vip -type app.appvip_net1.type -attr "USR_ORA_VIP=10.6.20.52,START_DEP
We also have to grant permissions to the Grid Infrastructure owner (grid) and the Oracle database and GoldenGate software owner (oracle) to run the script which starts the VIP.

As root:

[root@mycorp-racnode1 bin]# ./crsctl setperm resource mycorp-oragg-vip -u user:oracle:r-x
[root@mycorp-racnode1 bin]# ./crsctl setperm resource mycorp-oragg-vip -u user:grid:r-x

As oracle, start the application VIP via the crsctl start resource command from the GRID_HOME/bin directory.

[oracle@mycorp-racnode1 bin]$ ./crsctl start resource mycorp-oragg-vip
CRS-2672: Attempting to start 'mycorp-oragg-vip' on 'mycorp-racnode2'
CRS-2676: Start of 'mycorp-oragg-vip' on 'mycorp-racnode2' succeeded

We can now verify whether the VIP is running, and on which node, via the crsctl status resource command.

[oracle@mycorp-racnode1 bin]$ ./crsctl status resource mycorp-oragg-vip
NAME=mycorp-oragg-vip
TYPE=app.appvip_net1.type
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode2

From another node in the cluster we should now be able to ping the VIP’s IP address.

 

Create the Agent script

The agent script is used by Oracle Clusterware to check if the manager process is running and also to stop and start the manager as the case may be.

There is a sample script available in the appendix of the Oracle white paper (linked above) which can be modified accordingly.

Here is the script I have used in my case. You will need to change the locations of GGS_HOME, CRS_HOME, ORACLE_HOME and LD_LIBRARY_PATH and, if you are using ASM, the ASMPASSWORD value which is embedded in the agent script.

 

This script is called 11gr2_gg_action.sh and is copied to the /goldengate directory location so that it is accessible from either node in the cluster.

 

11gr2_gg_action.sh
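To give an idea of its shape, here is a minimal, hypothetical sketch of the start/stop/check/clean logic Clusterware expects from an action script. This is NOT the full script from the white paper (which also sets up the Oracle environment and handles the ASM login), and the /goldengate path is an assumption:

```shell
#!/bin/sh
# Hypothetical minimal action-script sketch -- not the complete
# 11gr2_gg_action.sh. GGS_HOME defaults to our shared ACFS location.
GGS_HOME=${GGS_HOME:-/goldengate}
export LD_LIBRARY_PATH="$GGS_HOME:$LD_LIBRARY_PATH"

# Return 0 if ggsci reports the Manager as RUNNING, 1 otherwise.
check_mgr() {
    echo "info manager" | "$GGS_HOME/ggsci" | grep -q "RUNNING"
}

# Start the Manager, then confirm it is running.
start_mgr() {
    echo "start manager" | "$GGS_HOME/ggsci" > /dev/null
    check_mgr
}

# Stop the Manager ("!" suppresses the confirmation prompt).
stop_mgr() {
    printf 'stop manager!\n' | "$GGS_HOME/ggsci" > /dev/null
}

# Oracle Clusterware invokes the action script with start|stop|check|clean.
case "$1" in
    start)      start_mgr ;;
    stop|clean) stop_mgr  ;;
    check)      check_mgr ;;
esac
```

The exit code of each branch is what Clusterware uses to decide whether the resource is healthy, needs restarting, or has failed over.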

 

Register a resource in Oracle Clusterware

 

We run the crsctl add resource command to register Oracle GoldenGate as a resource in Oracle Clusterware. The resource name in this case is ggate.

 

[oracle@mycorp-racnode1 bin]$ ./crsctl add resource ggate -type cluster_resource \
-attr "ACTION_SCRIPT=/goldengate/11gr2_gg_action.sh,CHECK_INTERVAL=30, \
START_DEPENDENCIES='hard(mycorp-oragg-vip,ora.asm) pullup(mycorp-oragg-vip)', STOP_DEPENDENCIES='hard(mycorp-oragg-vip)'"

The START_DEPENDENCIES and STOP_DEPENDENCIES attributes indicate that the VIP and the ggate resource should always start and stop together.

Since in our case the Oracle GoldenGate software owner is not the same as the Grid Infrastructure owner, we need to run the crsctl setperm command to set the ownership of the resource to the GoldenGate software owner.

As root:
[root@mycorp-racnode1 bin]# ./crsctl setperm resource ggate -o oracle

Start the Application

We can now use Oracle Clusterware to start GoldenGate. We connect as oracle and run the crsctl start resource command from the GRID_HOME/bin directory.

[oracle@mycorp-racnode1 bin]$ ./crsctl start resource ggate
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode1'
CRS-2676: Start of 'ggate' on 'mycorp-racnode1' succeeded

We can use the crsctl status resource command to verify the state of the resource and which node in the cluster it is currently running on.

[oracle@mycorp-racnode1 bin]$ ./crsctl status resource ggate
NAME=ggate
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode1

If we connect to GoldenGate, we can see that the manager process has been started up automatically.
GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER RUNNING 
I found that if the Manager process was already running and we then issued the crsctl start resource command, we got the following error message:
 
[oracle@mycorp-racnode1 bin]$ ./crsctl start resource ggate
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode1'
CRS-2674: Start of 'ggate' on 'mycorp-racnode1' failed
CRS-2679: Attempting to clean 'ggate' on 'mycorp-racnode1'
CRS-2681: Clean of 'ggate' on 'mycorp-racnode1' succeeded
CRS-2527: Unable to start 'ggate' because it has a 'hard' dependency on 'mycorp-oragg-vip'
CRS-2525: All instances of the resource 'mycorp-oragg-vip' are already running; relocate is not allowed because the force option was not specified
CRS-4000: Command Start failed, or completed with errors.

What happens if we manually stop the manager process?
Oracle Clusterware starts it up!

GGSCI (mycorp-racnode1) 2> stop manager
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y

Sending STOP request to MANAGER ...
Request processed.
Manager stopped.

GGSCI (mycorp-racnode1) 3> quit
[oracle@mycorp-racnode1 goldengate]$ ./ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.3 14400833 OGGCORE_11.2.1.0.3_PLATFORMS_120823.1258_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Aug 23 2012 20:20:21

Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.

GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING

While GoldenGate is running we can test the failover process by relocating it to another node in the cluster via the crsctl relocate resource command.

Note that we have to use the -f (force) option because it is already running when we are trying to manually relocate it to another node.

[oracle@mycorp-racnode1 bin]$ ./crsctl relocate resource ggate -f
CRS-2673: Attempting to stop 'ggate' on 'mycorp-racnode1'
CRS-2677: Stop of 'ggate' on 'mycorp-racnode1' succeeded
CRS-2673: Attempting to stop 'mycorp-oragg-vip' on 'mycorp-racnode1'
CRS-2677: Stop of 'mycorp-oragg-vip' on 'mycorp-racnode1' succeeded
CRS-2672: Attempting to start 'mycorp-oragg-vip' on 'mycorp-racnode2'
CRS-2676: Start of 'mycorp-oragg-vip' on 'mycorp-racnode2' succeeded
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode2'
CRS-2676: Start of 'ggate' on 'mycorp-racnode2' succeeded

We can now see that the Manager process has been started on the second node, and via the crsctl status resource command we can verify that the ggate resource is now running on the second node in the cluster (mycorp-racnode2).

GGSCI (mycorp-racnode2) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING

[oracle@mycorp-racnode1 bin]$ ./crsctl status resource ggate
NAME=ggate
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode2
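If you want to script this kind of check, the hosting node can be pulled out of the crsctl output with a small helper. This is a hypothetical convenience of my own, not part of the white paper's procedure; it simply parses the STATE line shown above:

```shell
# Hypothetical helper: extract the hosting node from
# "crsctl status resource <name>" output ("STATE=ONLINE on <node>").
hosting_node() {
    sed -n 's/^STATE=ONLINE on \(.*\)$/\1/p'
}

# On a live cluster you would pipe the real command, e.g.:
#   $GRID_HOME/bin/crsctl status resource ggate | hosting_node
# Here we feed the captured output from above:
printf 'NAME=ggate\nTYPE=cluster_resource\nTARGET=ONLINE\nSTATE=ONLINE on mycorp-racnode2\n' | hosting_node
# -> mycorp-racnode2
```

An empty result means the resource is not ONLINE anywhere, which is itself useful in a monitoring script.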

What happens if we kill the Manager process?

[grid@mycorp-racnode1 goldengate]$ ps -ef |grep mgr
grid     11796     1  0 05:41 ?        00:00:00 ./mgr PARAMFILE /goldengate/dirprm/mgr.prm REPORTFILE /goldengate/dirrpt/MGR.rpt PROCESSID MGR PORT 7809
grid     12763 12669  0 05:50 pts/0    00:00:00 grep mgr

[grid@mycorp-racnode1 goldengate]$ kill -9 11796

As soon as I killed the Manager process from the OS, I quickly checked the status of the Manager as well as the Extract process which was running at the time.

Note the status of both processes and how quickly it changes.

GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
EXTRACT     RUNNING     EXT1        00:00:00      00:00:02

GGSCI (mycorp-racnode1) 2> !
info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     ABENDED     EXT1        00:00:00      00:00:06

GGSCI (mycorp-racnode1) 3> !
info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT1        00:00:00      00:00:10
