BLOGBUSINESS INTELLIGENCESAP HANA
Rutger Heijmerikx

SAP HANA and R: How to integrate - the Inside-Out Approach

In the blogpost ‘SAP HANA and R: how to integrate – the Outside-In Approach’ we discussed the Outside-In Approach: connecting to a SAP HANA database as a ‘data source’ from within an R instance, by using an ODBC connection.

In this blogpost, we would like to discuss the Inside-Out Approach: embedding R Script within your HANA model’s SQL code to call the R Server as a calculation engine from within the SAP HANA environment.

As an example, we want to predict the Sales Value for two years, for which data is stored in the ‘R_ACTUALS’-table, having the following structure:

I.e.: data for 2 Companies (1234 & 5678), 2 categories (A & B) and a range of periods per Company, not really clearly identified besides a 1 or 2-digit number. The prediction, only calculated for Company 1234 will eventually be stored in table ‘R_TARGET’.

A general rule of thumb is to move as little of data to the R Server as possible. For our prediction, wanting to use the HoltWinters time series model, we only potentially need the Period (24 periods, e.g. assuming it to be the whole of 2015 and 2016 starting from Period = 1) and Value columns for Company 1234, Category A.

Both the selection of the fewer columns needed (i.e. only Period and Value), as well as the filtering (i.e. only Company 1234 and Category A) can be done at the HANA side itself, before send any data to the R Server:

To generate the following usable view for our R functions instead of the R_ACTUALS table:

(note: only the 24 ‘Company 1234’-rows of the initial 35 rows remain)

When creating a procedure to use the R Script, we can also define output table(s), for example to store the prediction or parameters used in the model. Without going into any details, we can for example store the prediction target values for the selected Company. Please note that a lot of information can be used as variable input (by having dynamic input tables) as well as stored (e.g. for verification purposes) with the procedures.

Not using the above created view for the R_ACTUALS table, we simply use the table itself as input (note: changing this would be a simple switch between R_ACTUALS to R_ACTUALS_VIEW in the below line of coding):

Within the structure of the ‘R_example’-procedure, we can start using R script by stating the following begin/end boundaries:

Which can be preceded (or, followed up) by SQL-script, within the overall procedure.

Below all the steps are described individually (i.e. the steps of the R_example procedure, between the BEGIN and END-commands of the LANGUAGE RLANG AS statement) and their output shown within a local R environment, to verify if the results are the same as those stored as an end-result in the HANA target table (R_TARGET).

1.  To start using pre-defined R script functions, you first have to initialize the packages required to execute them. For the HoltWinters-model, we can use the ‘forecast’-package:

2.  Not having used the R_ACTUALS_VIEW, we as such still need to isolate the 1234 Company for only the first 24 periods of data available, to properly use the HoltWinters model:

Note that you can combine several selection criteria (e.g. ==, & and <) via the subset() command over your ‘data’ input (which in this case points to the original R_ACTUALS table as input for the R_example procedure).

3.  The above command only created a subset of 24 rows for Company 1234, but did not drop any of the columns that are not required (i.e. as you can see that there are still 4 variables):

These can be dropped in a similar way, for example via:

Which selects the third and fourth column of the ‘actuals_subset’-table (note: via [rows, column], using only [,4] for example would select only the fourth column, which is sufficient given the nature of the data we are using):

and:

As you may notice, there are multiple ways to create subsets of data (e.g. both using the name as well as the technical orders of the columns within the table), for which your requirements may determine which suits your needs best. Personally, we would advise not to use the [row, column] identifier as this can quickly lead to errors if something in your source data or temporary/overall structures changes (i.e. if the VALUE-column suddenly no longer exists, in the worst-case scenario you will not get a warning, but simply the next column will be picked up).

4.  To start using the HoltWinters model, we first need to transform the data into a ‘time-series’-structure, for the R script time-series functions. The exact command to use depends on your source data. In our example, we assume that we use monthly data, starting at January 2015 (because there were no other identifiers).

Creating the time-series structure can then be done via:

Which creates:

With the following content:

5.  With this time-series structure, we can estimate the HoltWinters-parameters, and, execute the forecast model based on those parameters (in the below example, for the next 24 periods):

In which table ‘estimate’ stores the parameters and coefficients identified:

And table ‘estimate2’ stores the forecast based on these parameters and coefficients:

6.  These outputs can be combined into the ‘results’-structure to be stored back into the R_TARGET table. For example, only using the (normal) forecast values of ‘estimate2’ and some generic data of the input table for this:

Creates the following output table in the R environment:

For which you can additionally lose the time-series nature of the ‘results’-table via the ‘as.data.frame(results)’ command (note: this may additionally need the timeSeries package to be initialized as well).

This leads to the following R_TARGET output table in HANA:

With regards to business use and reporting, you most likely want to combine the R_ACTUALS and R_TARGET (including the Lo/Hi confidence intervals) information into one model/table, to be able to plot a more detailed and easier to understand forecast. A plotted forecast like this may look similar to the below graph, generated via the ‘plot(estimate2)’-statement in the R environment itself (note: for h = 6, instead of the above-mentioned 24):

(note: the different Lo/Hi confidence levels are identified by the different colours).

When you want to proceed with the next steps, whether this will be creating your own R-procedures or setting up a proof of concept automated model, McCoy's consultants are specialized in Predictive Analytics and are more than willing to help you on the road of Predictive Analytics.

Please feel free to contact us for more information.