AN INTERFACE BETWEEN THE SAS SYSTEM AND THE
INFORMIX DATABASE
Kevin Kane, John B Brown, Dr. Krystyna Kelly - AMGEN Ltd.
ABSTRACT : AMGEN Ltd., a pharmaceutical company, uses an Informix based
relational database to store its clinical trials data, and performs statistical analysis
using the SAS system. This creates the problem of transferring data between the two
systems and ensuring that the transfer is exact.
A set of generic Informix-4GL programs was written to transfer the data automatically
from any Informix database to a series of SAS datasets by interrogating the database
system catalogue. The first 4GL program dumps the data from each database table into
formatted ASCII files. The next program creates SAS code which reads the ASCII files
into SAS datasets. The generated SAS code is executed, creating a SAS dataset
corresponding to each database table.
The procedure to ensure that the SAS datasets exactly match the data held in the
Informix tables is to re-transfer the data from SAS into Informix and to compare this set
of data with the original copy.
This new methodology has proved to be reliable and has saved many hours of
programming time.
AMGEN Ltd. is a biopharmaceutical company based in Thousand Oaks, California. Clinical
trials, which are part of Amgen's drug development programme, are carried out globally, and
the data centre for Europe is based in Cambridge, England. The company is relatively small,
with around 1500 employees world wide, 30 of these being involved in the data centre at
Cambridge. The activities within the data centre include database development, data entry
and the analysis and reporting of clinical trials.
Currently, Amgen's two main products are growth factors. One of these increases the
number of red blood cells, and the other increases the number of one type of white cell.
Because clinical trials of these drugs tend to be carried out over long periods of time and
measurements have to be recorded frequently, large amounts of data are generated.
Typically, one patient generates 10,000 data points. The data centre processes this data using
around 35 databases every year, and each of these databases has approximately 1000
variables.
The clinical trials data are processed using a Sun 4/490 Sparc Server which runs under Unix.
The data is stored in a relational database developed using the Informix database management
system. Statistical analysis is carried out using the SAS® system. The use of these two
different packages creates the problem of transferring data from the Informix database to the
SAS system. It also raises the issue of whether to do any data manipulation before or after
the data are transferred.
In the past, analysis variables were derived using Informix tools, and then written to ASCII
(or text) files. The statistician had to specify the exact layout of these ASCII files. SAS code
was then used to read in the data and do any required analysis. Many problems were
encountered with this system because of the large number of steps required - opportunity for
error was high and reproducibility was difficult. Additionally, because the database tools
being used were not primarily designed for the mathematical data manipulations required, the
programming was more difficult and very labour intensive. This system was not
satisfactory, and a better method of data transfer and manipulation was sought.
The requirements for our data transfer process were that the transfer had to be exact so that
we could have confidence that any end results produced were correct, and it had to be fast
and easy so that we could have an efficient process. The new method that was developed
consisted of three automatic steps. Initially, ASCII files were produced of the contents of the
database. SAS code was then generated to read in these ASCII files and then the SAS
statements were run.
The first step involved writing a program that would, for any Informix database, produce
standard column formatted ASCII files. One file is produced for every database table and
each file has one column per variable. The next step is the automatic generation of SAS
statements that will read the ASCII files into SAS datasets. Both of these steps require
information about the contents of the database. In a relational database, this information is
stored in the database system catalogue (see figure 1). There are a number of different tables
of data, each table holding data that differs in some way. Each table has a number of
columns or variables of different data types. For each table, the system catalogue holds
information such as the table name, a unique identification number and the number of
columns in the table. Similar information is held for each column - the column name, the
identification number of the table the column is located in, a column number, the type of data
that is held in this column (e.g. numeric or text), and the length of the data allowed to be
stored in the column. From the system catalogue we can extract information about both the
tables and columns in the database for use in our programs.
Database System Catalogue
System table information: table name, table ID number, number of columns, etc.
System column information: column name, table ID number, column number, column type, column length
Figure 1 : The Database System Catalogue
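The catalogue lookup described above can be sketched in Python. This is only an illustration, not the authors' 4GL program: an in-memory SQLite database stands in for Informix so that the example runs anywhere. The two stand-in tables borrow the names systables and syscolumns and their column names from the Informix system catalogue. The sample table vitals, its columns, and the helper describe_tables are hypothetical, and the column type is stored as readable text here, whereas Informix stores a numeric type code.

```python
import sqlite3

# Stand-in for the Informix system catalogue (assumption: SQLite used for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE systables  (tabname TEXT, tabid INTEGER, ncols INTEGER);
    CREATE TABLE syscolumns (colname TEXT, tabid INTEGER, colno INTEGER,
                             coltype TEXT, collength INTEGER);
    -- Hypothetical sample table with three columns
    INSERT INTO systables  VALUES ('vitals', 100, 3);
    INSERT INTO syscolumns VALUES ('vs_patient',     100, 1, 'integer', 4);
    INSERT INTO syscolumns VALUES ('vs_visit_date',  100, 2, 'date',    4);
    INSERT INTO syscolumns VALUES ('vs_temperature', 100, 3, 'decimal', 8);
""")


def describe_tables(connection):
    """Return {table name: [(column name, type, length), ...]} in column order."""
    rows = connection.execute("""
        SELECT t.tabname, c.colname, c.coltype, c.collength
        FROM systables t
        JOIN syscolumns c ON c.tabid = t.tabid
        ORDER BY t.tabname, c.colno
    """)
    layout = {}
    for tabname, colname, coltype, collength in rows:
        layout.setdefault(tabname, []).append((colname, coltype, collength))
    return layout


catalogue = describe_tables(conn)
for column in catalogue["vitals"]:
    print(column)
```

Joining on the table identification number is what lets one generic program serve any database: nothing about the tables is hard-coded.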
There are some problems caused by differences in the method of handling data between SAS
and the Informix database. The two main differences are the length of variable names, and
the different data types that are used in each system. In the SAS system, variable names can
only be up to a maximum of 8 characters long. However, in the Informix database, they can
be up to 18. To truncate the variable would generate problems if any two variables existed in
one table that had names which only differed after the eighth character. Additionally, our
databases are developed using a two character prefix and an underscore in front of each
column name to indicate which table it is from, and to make it unique in the database. To
preserve this prefix for our SAS names would leave only five available characters to represent
meaningful names. To overcome these problems, we developed a method of translation for
the variable names. The translator would strip off the two letter prefix and underscore from
column names, and truncate the remaining text at eight characters. If this name had already
been used, then a number was substituted for the last character to make the name unique. For
example, if there were two variable names in one table that were called tl_temperature_before
and tl_temperature_after, then the resulting SAS names would be temperat and tempera1.
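The translation rule can be sketched as a short Python function. This is an illustrative reading of the description above, not the authors' code. The function name translate_names and its parameters are hypothetical, and the handling of more than nine collisions (using a longer numeric suffix) is an assumption, since the text does not cover that case.

```python
def translate_names(column_names, prefix_len=3, max_len=8):
    """Map Informix column names to unique SAS names of at most max_len characters.

    prefix_len=3 covers the two-character table prefix plus the underscore.
    """
    used = set()
    translated = {}
    for column in column_names:
        # Strip the prefix, then truncate what remains to the SAS limit.
        stem = column[prefix_len:][:max_len]
        name = stem
        counter = 0
        while name in used:
            counter += 1
            suffix = str(counter)
            # Overwrite the tail of a full-length stem with the counter
            # (a shorter stem simply has the counter appended).
            name = stem[:max_len - len(suffix)] + suffix
        used.add(name)
        translated[column] = name
    return translated


names = translate_names(["tl_temperature_before", "tl_temperature_after"])
print(names)
```

Both columns reduce to the stem temperat, so the second one has its last character replaced by a number.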
The other main problem encountered in transferring the data from Informix to SAS was that
the types of data storage available in each system differed. In Informix, a number of data
types are available which are predefined when the database is built. SAS only uses two data
types - numeric or character, but has the ability to format these in a variety of ways. The
program which generates the SAS statements reads the system catalogue to determine the data
type and length of the column. An equivalent SAS format is then constructed and used in the
SAS input statement. For example, a date type in the database would be mapped onto a
ddmmyy8. format, and an Informix data type decimal(8,3) would have a numeric 10.3
format.
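A minimal Python sketch of such a type mapping follows. Only the date mapping to ddmmyy8. and the 10.3 result for a decimal come from the text. Everything else is an assumption made for illustration: the function name sas_informat, the width rule for decimals (number of digits plus two, for a sign and a decimal point, which gives 10.3 for eight digits with three decimals), and the widths chosen for char, integer and smallint.

```python
def sas_informat(coltype, length=None, precision=None, scale=None):
    """Return a SAS informat string for an Informix column type (illustrative mapping)."""
    if coltype == "date":
        return "ddmmyy8."                 # from the text
    if coltype == "decimal":
        width = precision + 2             # assumed: room for sign and decimal point
        return f"{width}.{scale}"
    if coltype == "char":
        return f"${length}."              # assumed: character informat of the column length
    if coltype == "integer":
        return "11."                      # assumed: sign plus up to 10 digits
    if coltype == "smallint":
        return "6."                       # assumed: sign plus up to 5 digits
    raise ValueError(f"no SAS informat defined for Informix type {coltype!r}")


print(sas_informat("date"))
print(sas_informat("decimal", precision=8, scale=3))
print(sas_informat("char", length=20))
```

Raising an error for an unknown type, rather than guessing, fits the requirement that the transfer be exact.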
When the SAS input statements have been generated, using the translated SAS names and
formats, the code can then be run. The SAS input statements are generated for column input,
using the same columns as the program which produces the ASCII files. The SAS code
reads the ASCII files and produces one dataset for every database table.
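The key point is that the program writing the ASCII file and the program generating the SAS input statement share one column layout. The Python sketch below illustrates that idea under stated assumptions. The function names, the sample columns and the file name vitals.dat are hypothetical. The generated SAS uses column pointers (@n) followed by an informat, which is one valid way to read fixed columns; the authors' generated code may have looked different.

```python
def column_layout(columns):
    """columns: list of (sas_name, informat, width).
    Return a list of (sas_name, informat, start, end) with 1-based column positions."""
    layout, start = [], 1
    for name, informat, width in columns:
        layout.append((name, informat, start, start + width - 1))
        start += width
    return layout


def format_row(values, layout):
    """Render one row as a fixed-width line; refuse values that would overflow."""
    fields = []
    for value, (name, _, start, end) in zip(values, layout):
        width = end - start + 1
        text = str(value)
        if len(text) > width:
            raise ValueError(f"{name}: {text!r} does not fit in {width} columns")
        fields.append(text.ljust(width))
    return "".join(fields)


def sas_data_step(dataset, ascii_path, layout):
    """Generate a SAS DATA step that reads the file using the same column positions."""
    lines = [f"data {dataset};", f"  infile '{ascii_path}';", "  input"]
    for name, informat, start, _ in layout:
        lines.append(f"    @{start} {name} {informat}")
    lines += ["  ;", "run;"]
    return "\n".join(lines)


layout = column_layout([("patient", "6.", 6),
                        ("visitdt", "ddmmyy8.", 8),
                        ("temperat", "10.3", 10)])
# Decimal values are written with an explicit decimal point.
row = format_row([101, "25/12/93", "37.500"], layout)
code = sas_data_step("vitals", "vitals.dat", layout)
print(row)
print(code)
```

Because both outputs are derived from the same layout list, a change to a column width cannot leave the file and the input statement out of step.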
The advantages of the new data transfer system are numerous. Because all data in the
database are transferred to SAS datasets, the statistician has access to all data. This is an
advantage because the statistical analysis of clinical trials is usually a dynamic process, that
can only be planned to a certain extent. The process also removes the tedious work of
specifying column formatted text files, writing code to produce these, and creating SAS input
statements to read them in. Another major implication of the new system is that the data
manipulation is carried out in SAS as opposed to using database tools. Complicated data
manipulation is clumsy using the database's manipulation facilities, as they are designed
primarily for data extraction. A further advantage is that the new data transfer process
allows you to take a 'snapshot' of a live database at any time, so that analyses can be carried
out at any point, or a small amount of actual data can be used as a basis for testing statistical
programs. The only disadvantage of the new system is that large amounts of data are stored
twice. For an increasing volume of data that has to be analysed, this has implications for the
amount of disk space required.
In the pharmaceutical industry, there is a need to be able to prove that any results or
conclusions made are genuine. Regulatory agencies must be confident that data collected
during clinical trials has not been corrupted from the original information recorded in the
patients' notes at the hospitals. It is therefore incumbent on the pharmaceutical company to
verify all steps of data collection and transfer. In particular, the use of electronic systems
must be validated.
To satisfy the validation requirement, a plan for the quality control of the proposed data
transfer system was devised (see figure 2). The basis of the plan was to transfer all the data
in the SAS datasets back into an Informix database and compare this with the original. To
transfer the data from SAS to Informix, a SAS program was written which would output the
data in the style of Informix 'unload' files. Unload is an Informix utility which allows
you to dump the contents of a table to an ASCII file which is free format, and has separators
between the values of the variables. Commands exist to load and unload the data easily
between unload files and the database. The unload files that are generated from the SAS
datasets are then loaded into an empty copy of the database and unloaded once again into
unload files. This step is necessary to ensure that these unload files are formatted in exactly
the same way as the files that have been unloaded from the original database. The two sets of
unload files are compared (using the Unix 'diff' utility) and an error report is generated of any
differences. Because all of the computer programs used are generic, the process can be
repeated for any Informix database.
Box labels: Informix original database; Informix style unload file; copy of Informix database; Informix unload file (two boxes); Error Report
Figure 2 : Plan for the Quality Control of the New Data Transfer System
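The final comparison step of the quality control plan can be sketched in Python. This is an illustration only: the paper used the Unix diff utility on real unload files, whereas this sketch uses Python's difflib on in-memory lines. The function names, the sample rows and the file labels are hypothetical, and the layout of a line (each field followed by a '|' separator, with a missing value written as an empty field) is an assumption about the unload style.

```python
import difflib


def to_unload_line(row, delimiter="|"):
    """Render one row in an unload-style layout: every field is followed by the
    delimiter, and a missing value (None) becomes an empty field."""
    return "".join(("" if value is None else str(value)) + delimiter for value in row)


def compare_unloads(original_lines, roundtrip_lines):
    """Return a unified diff of the two unload files; an empty list means they match."""
    return list(difflib.unified_diff(original_lines, roundtrip_lines,
                                     fromfile="original.unl",
                                     tofile="roundtrip.unl",
                                     lineterm=""))


original = [to_unload_line(r) for r in [(101, "25/12/93", "37.500"),
                                        (102, None, "36.900")]]
# A faithful round trip reproduces the file exactly.
clean_report = compare_unloads(original, list(original))
# A corrupted value in the second row shows up in the error report.
corrupted = [original[0], to_unload_line((102, None, "36.909"))]
error_report = compare_unloads(original, corrupted)
print(clean_report)
print("\n".join(error_report))
```

An empty report is the evidence that the SAS datasets match the original tables; any non-empty report pinpoints the rows that differ.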
Future improvements to the system include adding a user interface, and updating the
programs so they can handle any new data types available in further releases of the Informix
software.
In summary, the development of the new data transfer system was successful, with the time
to analyse clinical trials being reduced from over 100 person/working hours to around 20
person/working hours. The system has been adopted for use in the analysis of all European
clinical trials. The system is also well validated, so we can be confident in our results and
satisfy any requirements from the regulatory agencies.
SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.