Sudhakar Davuluri's Blog: July 2014

Sunday, July 27, 2014

What is Exadata ? Architecture, Components and Four Main Features of Exadata Storage Server

Now a days Exadata is a buzz word in Database world. So, I decide to simplify What is Exadata It's components and what are four main features of Exadata ?

To understand Exadata, Let's start with Traditional Database System having High Availability (RAC).

Current RAC Environment Setup:

To setup a RAC database Below are the Hardware and Software requirements.

1. Database Nodes or Database Server.
2. Storage Server.
3. Network equipments.

DBA has got Hardware. Next step is to setup Software. Database Administrator has to do following:

1. Eanble SSH between Database Nodes and Storage Server.
2. Real Application Cluster Installation.
3. ASM Installation.
4. Presenting Disks to ASM and creating Disk groups etc.
5. Installing Database Software.

Finally, System is ready for RAC database creation, but during all this process DBA has to work with Different teams like Hardware, Networking and storage and there are different-2 vendors involved for each of these components. Remote DBA also responsible to make sure that each component is compatible with each other and finally compatible with Oracle RAC, ASM and Database Software. So it has became a tedious task for DBA and takes a lot of time to set up a RAC environment.

To solve all these challenges Oracle has come up with a new product call Exadata, Which is a bundle of Hardware from Oracle SUN and Installed S/W Like RAC, ASM, Database Software. Exadata Machine is ready to use within few hours of Delivery. but Exadata is not about providing bundled Hardware and Software to customer but as an Hardware with Intelligence.

To Know Hardware with Intelligence, Let's take another scenario, Remote DBA has set up a RAC environment which is working fine. In RAC env Database is stored at Shared Storage Server and to get output of any query data has to transfer from Shared Storage server to Database Node and then DB node will process data and provide results.

Limitations of Current RAC and Storage Servers

As we know, the database for organizations are increasing day by day and we have limited network bandwidth between DB node and Storage Server (Limitations comes at Hardware level). So after a certain limit we can't optimize our speed of processing data because of Hardware limitation of database nodes, storage Server and Network equipments.

Suppose a simple query Select * from order where Order_type='new' which returns only 1 Million rows is executed on the database machine and the order table has 100 Million rows. So, in that case whole 100 Million rows will come from Storage server to DB node for processing and after processing by DB node DBA will get query output. This whole process take lot of time and resources from all RAC components.

How about if Storage Server only return 1 Million rows which were requested in where clause of the query ?

This is what Exadata Software does i.e. Filtering at Storage Level. Of course this is one of the mail feature of Exadata Software. A DBA can easily estimate the impact of this change. Technically this term is called as Exadata Smart Scan.

So, Here is What is Exadata ?

1. A bundle of Hardware and Software from Oracle Itself and ready to use immediately.

2. An Intelligent Storage Server to optimize Database Performance. In order to achieve this intelligent a software called Exadata is installed on all storage servers.

In Exadata Terms, A single storage server is called Cell or Cell server. So I will be referring storage server and cell server interchangeably in this article. I might be referring DB Server, DB Node for Database Server.

Architecture of Exadata Database Machine:

In above diagram, there are three DB nodes, which are connect with two cell servers. Each storage server has it components.

Component of Exadata Storage Server:

It's another storage device, which has CUP, Memory, Disks, network cards, an Operating System Oracle Linux and Most important past Exdata Storage Server Software. There are mainly three services which run on cell server for processing, CellSRV (Cell Service), MS (Management Server) and RS (Restart Server).

Cell Service (CellSRV): This is the primary component of Exdata Cell which provides all Exadata Storage services. This process communicate with Database server for providing database blocks.

Management Server (MS): MS service provide an interface to DBA for communicating or Managing Cell Server. CellCLI is the Command Line tool which Exadata DBA used for performing Exadata Administration services.

Restart Server (RS): This service is to make sure the functioning of Exadata Server. RS monitors CellSRV and MS for liveliness and restart them whenever required.

Four Main Features of Exadata Storage Server.

1. Smart Flash Cache Intelligent Caching: Storage servers usually have two kind of storage Hard disk and Flash Cache. Flash Cache are fast storage device used to keep most frequently accessed data for faster access, whenever there is a requirement of a block it is first looked into Flash Cache and if available returned to the server without going to Hard disk for reading the block. This is a common feature in Storage servers.

Now with Exadata Smart Flash Cache Inetlligence, Flash Disks are optimized to work according to
Oracle Database requirements. Exadata Software Directs Flash Cache to keep following objects into it.

    A. Frequently access data and index blocks
    B. Control Files read and write.
    C. File headers read and write.

Following are not kept in Flash Cache.

   a. Backup Related I/O
   b. Data Pump I/O
   c. Full Table Scan.

Database Administrator can also influence caching of Database Objects by pinning them into Flash Cache by setting CELL_FLASH_CACHE parameter. Here is a better way to know Exadata Smart Flash Cache.

2. Hybrid Columnar Compression (HCC): Compression is used for reducing storage consumption for large databases. In this technique a logical unit called Compression unit is used to store Hybrid columnar Compressed Rows. At time of data loading Column values are detached from rows, grouped and then compressed. After compression this is fit into compression unit.

Database Smart scan processing is also supported at cell level i.e. Data blocks need not to move from cell server to database server for processing. The amount of saving storage space using HCC is around 10 to 15 times. Suppose you have 100 Tera Bytes of data, by using HCC DBA can store this Data in to 10 TB of Storage space.

3. Smart Scan Processing: In Exadata storage server has got intelligence to filter data at storage level, rather than transferring it to database server.

Exadata Smart Scan works filtering for following kind of queries:

1. Predicate Filter: In a query with where clause only blocks with satisfy where clause condition will only go to database server not the entire table.

2. Column Filtering: Suppose in a select query like "select name, age from employee" is execute on DB server by DBA, then only columns name and age data will send to Database Server not all columns of tables.

3. Join Processing: All join processing are done at Storage level itself, so only filtered data is sent to DB server.

4. I/O Resource Management: In traditional Database environment, If you have more than one database running on a Shared storage server. Large queries from one database can use more resource and cause performance issue for other databases. Another case a batch job has started in one DB by a DBA which case performance trouble into OLTP Database.

To avoid all these challenges Exadata has introduced Exadata Storage Server I/O Resource Management (IORM). Till now we have Database Resource Manager (DBRM) to manage database resource within the database but now with help of IORM DBA can manager resource across databases.

Let's take an example: You have two database Production and testing running on same Cell Server. Now Database Administrator can configure Resource plan which gives priority to Production Database over testing DB. Here, Whenever Testing Database load will affect Production Database IORM will manage rerource request such that it doesn't impact production Database performance. Thiis means Test Database requests has to wait until they can execute without disturbing production database performance.

You can also find few more features of Exadata Machine, but above are the main features of it, other features are either extension or part of these these features.

To know more about Exadata visit http://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-134575.pdf.
You can aslo see a 3D demo of Exadata Machnie https://blogs.oracle.com/databaseinsider/entry/oracle_exadata_database_machin
Tutorials for Exadata.

Please share you views about this article.

Oracle Database Basic Architecture Concepts

What is An Oracle Database?

Basically, there are two main components of Oracle database –– instance and database itself. An instance consists of some memory structures and the background processes, whereas a database refers to the disk resources. Figure 1 will show you the relationship.

Instance(SGA+PGA+BACKGROUND PROCESSES)

The memory structures and background processes contitute an instance. The memory structure itself consists of System Global Area (SGA), Program Global Area (PGA), and an optional area . In the other hand, the mandatory background processes are Database Writer (DBWn), Log Writer (LGWR), Checkpoint (CKPT), System Monitor (SMON), and Process Monitor (PMON). And another optional background processes are Archiver (ARCn), Recoverer (RECO), etc. FiGURE will illustrate the relationship for those components on an instance.

System Global Area(SGA)

SGA is the primary memory structures. When Oracle DBAs talk about memory, they usually mean the SGA. This area is broken into a few of part memory –– Buffer Cache, Shared Pool, Redo Log Buffer, Large Pool, and Java Pool.

Buffer Cache

Buffer cache is used to stores the copies of data block that retrieved from datafiles. That is, when user retrieves data from database, the data will be stored in buffer cache. Its size can be manipulated via DB_CACHE_SIZE parameter in init.ora initialization parameter file.

Shared Pool

Shared pool is broken into two small part memories –– Library Cache and Dictionary Cache. The library cache is used to stores information about the commonly used SQL and PL/SQL statements; and is managed by a Least Recently Used (LRU) algorithm. It is also enables the sharing those statemens among users. In the other hand, dictionary cache is used to stores information about object definitions in the database, such as columns, tables, indexes, users, privileges, etc.

The shared pool size can be set via SHARED_POOL_SIZE parameter in init.ora initialization parameter file.

Redo Log Buffer

Each DML statement (insert, update, and delete) executed by users will generates the redo entry. What is a redo entry? It is an information about all data changes made by users. That redo entry is stored in redo log buffer before it is written into the redo log files. To manipulate the size of redo log buffer, you can use the LOG_BUFFER parameter in init.ora initialization parameter file.

Large Pool

Large pool is an optional area of memory in the SGA. It is used to reviews the burden place on the shared pool. It is also used for I/O processes. The large pool size can be set by LARGE_POOL_SIZE parameter in init.ora initialization parameter file.

Java Pool

As its name, Java pool is used to services parsing of the Java commands. Its size can be set by JAVA_POOL_SIZE parameter in init.ora initialization parameter file.

Program Global Area(PGA)

Although the result of SQL statemen parsing is stored in library cache, but the value of binding variable will be stored in PGA. Why? Because it must be private or not be shared among users. The PGA is also used for sort area.

Software Area Code

Software area code is a location in memory where the Oracle application software resides.

Oracle Background Processes

Oracle background processes is the processes behind the scene that work together with the memories.

DBWn

Database writer (DBWn) process is used to write data from buffer cache into the datafiles. Historically, the database writer is named DBWR. But since some of Oracle version allows us to have more than one database writer, the name is changed to DBWn, where n value is a number 0 to 9.

LGWR

Log writer (LGWR) process is similar to DBWn. It writes the redo entries from redo log buffer into the redo log files.

CKPT

Checkpoint (CKPT) is a process to give a signal to DBWn to writes data in the buffer cache into datafiles. It will also updates datafiles and control files header when log file switch occurs.

SMON
System Monitor (SMON) process is used to recover the system crach or instance failure by applying the entries in the redo log files to the datafiles.

PMON

Process Monitor (PMON) process is used to clean up work after failed processes by rolling back the transactions and releasing other resources.

Database

The database refers to disk resources, and is broken into two main structures –– Logical structures and Physical structures.

Logical Structures

Oracle database is divided into smaller logical units to manage, store, and retrieve data effeciently. The logical units are tablespace, segment, extent, and data block. Figure 3 will illustrate the relationships between those units.

The relationships between the Oracle logical structures

Tablespace

A Tablespace is a grouping logical database objects. A database must have one or more tablespaces. In the Figure 3, we have three tablespaces –– SYSTEM tablespace, Tablespace 1, and Tablespace 2. Tablespace is composed by one or more datafiles.

Segment

A Tablespace is further broken into segments. A segment is used to stores same type of objects. That is, every table in the database will store into a specific segment (named Data Segment) and every index in the database will also store in its own segment (named Index Segment). The other segment types are Temporary Segment and Rollback Segment.

Extent

A segment is further broken into extents. An extent consists of one or more data block. When the database object is enlarged, an extent will be allocated. Unlike a tablespace or a segment, an extent cannot be named.

Data Block

A data block is the smallest unit of storage in the Oracle database. The data block size is a specific number of bytes within tablespace and it has the same number of bytes.

Physical Structures

The physical structures are structures of an Oracle database (in this case the disk files) that are not directly manipulated by users. The physical structure consists of datafiles, redo log files, and control files.

Datafiles

A datafile is a file that correspondens with a tablespace. One datafile can be used by one tablespace, but one tablespace can has more than one datafiles.

Redo Log Files

Redo log files are the files that store the redo entries generated by DML statements. It can be used for recovery processes.

Control Files

Control files are used to store information about physical structure of database, such as datafiles size and location, redo log files location, etc.

Images of Oracle db architecture for clear understanding.....

Wednesday, July 23, 2014

How I Investigate Node Reboot or Database System Restart in 10 Minutes

Database System Restart, Node Eviction or Node Reboot are day to day investigation by a DBA. As most of the time DBA's are the owner of these machines So they are first to start find reason for Database System Restart, Node Eviction or Node Reboot. Here, I am talking about cases when either this is a single instance database or Real Application Cluster DB. In both case Node reboot happens though there could be different reasons for both.

Here, I will discuss approach to investigate Database System Restart or Node Reboot. This approach applies on both single instance DB or RAC.

For each and every case of Node Reboot or Database System Restart, Fist check /var/log/messages file to verify if this is a Self suside or Node eviction. Let's see by an example in Linux env.

System Restart due to Node Eviction: First Let's see How Node eviction message looks like.

Feb 18 17:20:42 db01 kernel: SysRq : Resetting
Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
Feb 18 17:24:26 db01 syslogd 1.4.1: restart.
Feb 18 17:24:26 db01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 18 17:24:26 db01 kernel: Linux version 2.6.18-238.12.2.0.2.el5 (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Tue Jun 28 05:21:19 EDT 2011

In above logs important point to notice is "kernel: SysRq : Resetting". When ever you see SysRq command issued. This shows this node is evicted by some other node from cluster. So you should investigate in Node eviction direction by collecting diagcollection from evicted and other nodes. Since, Node eviction is a huge topic to write on So, I will not write about that in this post.

Node Reboot/System Restart due to Memory Pressure: There could be case when system restart due to high load on it. Message file may look like this in that case.

Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
Feb 12 07:32:42 db02 kernel: Free swap:        97972kB
Feb 12 07:32:42 db02 kernel: 25690112 pages of RAM
Feb 12 07:32:42 db02 kernel: 978057 reserved pages
Feb 12 07:32:42 db02 kernel: 175190262 pages shared
Feb 12 07:32:42 db02 kernel: 45130 pages swap cached
Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
Feb 12 07:57:57 db02 syslogd 1.4.1: restart.
Feb 12 07:57:57 db02 kernel: klogd 1.4.1, log source = /proc/kmsg started.

In above case, No SysRq is reported. This shows a case of self suicide, So don't look at cluster side to investigate. Though, there could be many possible reason for system self suicide but most command are high load cause memory pressure on the system and system got reboot. To investigate high load start looking at OS Watcher Top command output at system restart time or prior to that.

Node Reboot/System Restart due to Linux kernel Bug: Another, Reason for self suicide could be some Linux kernel bug which cause system panic and system got reboot. Log files may look like this.

---[ end trace 288cce3e7b8bd8ba ]---
Kernel panic - not syncing: Fatal exception
Pid: 6381, comm: Thread-13686 Tainted: G      D    2.6.32-300.4.1.el5uek #1
Call Trace:
[] panic+0xa5/0x162
[] ? native_cpu_up+0x8af/0xa5d
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? _spin_unlock_irqrestore+0x16/0x18
[] ? release_console_sem+0x194/0x19d
[] ? console_unblank+0x6a/0x6f
[] ? print_oops_end_marker+0x23/0x25
[] oops_end+0xb7/0xc7
[] die+0x5a/0x63
[] do_trap+0x115/0x124
[] do_invalid_op+0x9c/0xa5
[] ? do_exit+0x67d/0x696
[] ? __dequeue_entity+0x33/0x38
[] ? pick_next_task_fair+0xa5/0xb1
[] invalid_op+0x1b/0x20
[] ? do_exit+0x67d/0x696
[] ? do_exit+0x67d/0x696
[] complete_and_exit+0x0/0x23
[] system_call_fastpath+0x16/0x1b
May 30 09:02:57 db008 syslogd 1.4.1: restart.

In Panic situation, you will see "panic" keyword on message file few line above restart. most likely this is because of Linux bug. So take help from Linux team to investigate further.

I have see case when Node restart due to memory pressure or kernel panic but DBA starts investigation from Cluster Alert logs. So, I hope this post will help readers to start right direction of investigation in case of Node Reboot or Database System Restart.

Please share, If you have more cases for Node Reboot or System Restart.

Sunday, July 13, 2014

How To Add, Remove or Relocate Online Redo Log Files and Groups

How To Add, Remove or Relocate Online Redo Log Files and Groups

To protect against a failure involving the redo log itself, Oracle Database allows a multiplexed redo log, meaning that two or more identical copies of the redo log can be automatically maintained in separate locations. For the most benefit, these locations should be on separate disks. Even if all copies of the redo log are on the same disk, however, the redundancy can help protect against I/O errors, file corruption, and so on. When redo log files are multiplexed, LGWR concurrently writes the same redo log information to multiple identical redo log files, thereby eliminating a single point of redo log failure.
Multiplexing is implemented by creating groups of redo log files. A group consists of a redo log file and its multiplexed copies. Each identical copy is said to be a member of the group. Each number defines a redo log group. Such as group 1, group 2, and so on. ( More information on finding online redolog files )
1. Adding A New redo log Group

sql> ALTER DATABASE ADD LOGFILE('/oracledata/test/redo3a.log','/test_multiplex/redo3b.log') SIZE 52429312;

The command above will add a new redo log group to your database. The group will have two members with size 52429312. A group number will automatically be assigned to your new group. If you want to set it explicitly you may use command below.

sql> ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 15 ('/oracledata/test/redo3a.log','/test_multiplex/redo3b.log') SIZE 52429312;

In a RAC (Real Application Clusters) environment you have to specify which instance the redo log group will belong to by setting thread number. Every instance in a RAC database has a thread number. For ex: the command below will add the redo log group to instance with thread number 1.

sql> ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 15 ('/oracledata/test/redo3a.log','/test_multiplex/redo3b.log') SIZE 52429312;

2. Dropping A redo log Group
Before dropping a redo log group, make sure that at the moment, logwriter process is not writing to the redo log file you plan to drop.

sql> SELECT GROUP#, THREAD#,ARCHIVED, STATUS FROM V$LOG ;

GROUP#	THREAD#	ARC	STATUS
1	1	YES	INACTIVE
2	1	NO	CURRENT
3	1	YES	INACTIVE

The value for "ARCHIVED" column must be "YES" and STATUS column must be "INACTIVE".
If the status is "CURRENT" it means that logwriter is currently using this group to store redo records. After the redo files in this group is filled it will switch to next group. You may wait till next log switch operation or you may manually trigger it by command below:

SQL> alter system switch logfile ;

If status is "ACTIVE" it means that there are some redo log records, which have not been applied to datafiles yet. The status is supposed to turn to INACTIVE soon after the next checkpoint operation. If you do not want to wait till then you may manually trigger checkpoint by command below.

sql> alter system checkpoint ;

If the status has turned to INACTIVE you may drop it by specifying the group number.

sql> ALTER DATABASE DROP LOGFILE GROUP 6;

This command will drop the redo log group, which is identified by number 6.
Logwriter writes to redo log files in a circular fashion. When a group is filled it moves to next group. After all groups in database are filled it returns back to first redo log group. Because of that, there has to be at least 2 groups in an instance. Oracle won't permit dropping a group if there are only two groups in the instance.
After dropping a group, you may delete relevant files from operating system.

3. Adding A New Member To An Existing Group
As mentioned at the beginning of article it is recommended to multiplex redo log members for fault tolerance. You can accomplish this by adding new members to an existing group.

sql> ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/mydb/redo02.log' TO GROUP 2;

The above command adds a new redo log file to redo log group with number 2.

4. Dropping A Member From An Existing Group
As stated in subject 3 (Dropping A redo log Group) again the group, whose member you want to drop should be inactive.

sql> ALTER DATABASE DROP LOGFILE MEMBER '/u03/oradata/mydb/redo03.log'

5. Relocating A redo log Group
You may want to relocate your existing redo log group and change the path of redo log files. Here are the steps to accomplish it
Step 1: Shutdown database

sql> shutdown immediate ;

Step 2: Backup your existing control file and redo log file at operating system level.
Step 3: Move your redo log member to a new location. You may also change the file name if you want.

# mv /u02/oradata/mydb/redo02.log /u03/oradata/mydb/redo03.log

Step 4: Startup in mount mode

sql> startup mount;

Step 5: Path of redo log files is kept in control file. You have to update that record by command below:

sql> ALTER DATABASE RENAME FILE '/u02/oradata/mydb/redo02.log' TO '/u03/oradata/mydb/redo03.log'

Control file is modified and the path of redo log member is updated. Now you can open your database.

sql> alter database open;

As you see this way of relocating online redo log files requires downtime. You may also change the path of your redo log files by adding a new group at desired location and dropping the ones at former location. This approach requires no downtime.