Snowflake and R: How to integrate

Are you searching for a way to connect Snowflake with R (Studio)? Then you’ve arrived at the right blog! In this blog, a ‘how to’ is given on an easy connectivity possibility between R (Studio) and Snowflake; being able to read data from Snowflake and writing back a result set after a script has been executed in R Studio.

In case you want to try out Snowflake, you can quite easily set up a trial account yourself. For a brief explanation about what Snowflake is and what its benefits might be for your organization to start using it, you can watch our latest episodes of ‘McCoy TV’ by our colleague Joran de Vries.

Use Case

To illustrate the connection you have made after following the steps of this blog, a dataset from the OpenSAP course on SAP Predictive Analytics is used. This dataset is about clustering stores based on certain KPI’s using kmeans algorithm. More details on these tables below:

  • The ‘R_TABLE_KMEANS_INPUT’ table (‘input’ from a R Studio point of view) stores a set of a 150 ‘Stores’ with certain related KPI’s (Turnover, Size, Staff and Margin):

  • The ‘R_TABLE_KMEANS_OUTPUT’ table stores the ‘Cluster Number’ for these 150 stores that is calculated within R Studio. The result will be 5 Clusters in total for the set of Stores.

In this blog, the following steps will be explained:

  • Setting up a connection between Snowflake and R (Studio).

  • Loading data from Snowflake to R (Studio) and performing a simple algorithm in R (Studio).

  • Writing back the result to Snowflake. The end result will be in a Snowflake view combining the IN- and OUTPUT tables on a Store level, to keep the 2 tables in Snowflake separate from each other.

Pre-requisites (1): Setting up the Snowflake ODBC driver

To be able to load data between the two systems, for example an ‘ODBC’ connection can be used. For this, the Snowflake ODBC driver needs to be setup. A guide on how to do this for Windows for example can be found here.

Parameters that are required as input for the driver are:

  • Data Source: a name for the configured driver. Please note that when using the driver in R Studio to establish the connection, the type of driver is the required input (in this case a ‘SnowflakeDSIIDriver’), not the Data Source description given in this step).

  • User and password: credentials of the user used to connect to Snowflake. The password is not stored in the driver, but you can enter it to test if the driver is working as expected.

  • Server: the part of the Snowflake link which identifies the server, i.e. <your account name>.<cloud platform>.snowflakecomputing.com

  • The Database, Schema and Warehouse within Snowflake. In this blog these values are used:

    • Database: ‘SNOWFLAKE_R_CONNECT’

    • Schema: ‘PUBLIC’

    • Warehouse: ‘COMPUTE_WH’

Once the driver has been downloaded and installed, the connectively can be tested, e.g. for:

 

(NOTE: it might be the case that the connection fails due to proxy settings. How to solve this is described here. And, even if the connection works when testing, still the same proxy-related error might occur when attempting to use the connection within R Studio).

Pre-requisites (2): Required R Packages

The relevant ‘R Packages’ need to be made available in R Studio via the install.packages(“…”) command within R Studio:

  • dbplyr

  • DBI

  • odbc

  • tidyverse

(NOTE: if you get errors on the ‘R Tools’ version (or lack thereof), please have a look at the solution here).

Setting up the connection

After having installed the relevant Packages, they have to be ‘initialized’ via the library(…) command:

Then, a connection can be defined for the Snowflake ODBC driver:

If the connection is made successfully, you will see it in the ‘Connections’-tab within R Studio:

Loading data from Snowflake to R (Studio)

Once the connection has been established, the data can be loaded from Snowflake towards R Studio:

Afterwards, you can check the content in R Studio to verify that the data has been loaded successfully:

Performing a simple algorithm

Now, basic R computations can be executed on the loaded R_TABLE_KMEANS_INPUT (e.g. changing the format from a ‘lazy’ tibble to a data frame) so that the kmeans function can be used:

Writing back the result to Snowflake

Lastly, the result of R_TABLE_KMEANS_OUTPUT can be checked and loaded back to Snowflake. Where the result after the ‘kmeans’ algorithm in R Studio is:

 

And to append the data to the existing table in Snowflake, the dbWriteTable function within the DBI package is used, to prevent a temporary table not being properly loaded in Snowflake:

End result in Snowflake

Now that the data should have been loaded successfully towards Snowflake, a check can be done if this is actually the case! As mentioned in the ‘Use Case’ section, a view combining both the INPUT and OUTPUT table will be used for this:

 

And many more…

If you would like to know more about Snowflake, please contact Roel van Bommel (+31622698392) or Nico van der Hoeven (+31651528656).

Ben jij al een Friend of McCoy?

Als innovatiepartner willen wij graag blijven inspireren. Daarom delen wij graag onze meest relevante content, evenementen, webinars en andere waardevolle updates met jou.