Curious about the ‘File Space’ feature within the SAP Datasphere Object Store, which uses ‘SAP HANA Data Lake Files’ as its Storage Type? Wondering why it might be beneficial, even in scenarios where SAP S/4HANA is the only source system? Then this blog is for you.
In this blog, we walk through how to set up a File Space and use the Change Document Items table (CDPOS) as a practical example to show how it can save storage compared to a ‘standard’ Space in SAP Datasphere.
Your company wants to retain all Change Document Items (CDPOS) in its Business Intelligence landscape, in a way that is both (1) readily accessible for different use cases and (2) requires minimal customization per use case. As the CDPOS table grows quickly in storage size (MiB), the challenge becomes storing all that data efficiently: ideally in a way that is more storage-optimized than a standard Space in SAP Datasphere (disk or in-memory), while maintaining ease of access.
To that end, this blog explores the new File Space in the Object Store, which uses SAP HANA Data Lake Files as its underlying storage, and evaluates how it affects disk usage (in MiB) in a comparative scenario.
*Disclaimer: SAP may position the File Space in the Object Store primarily for other use cases. This blog shows that it can also be relevant for more traditional ones.
According to SAP’s latest architectural vision for SAP Datasphere, the File Space resides in the Object Store and is intended for storing large data volumes:
When creating a new space, you’ll now see the option to create a space with the Storage Type being ‘SAP HANA Data Lake Files’:
Note: if the Storage Type option is grayed out, it may not yet be enabled in your environment. In that case, SAP can enable it via a support ticket.
Creating the File Space involves several automatic steps, which might take some time:
Provisioning an Apache Spark instance
Setting up the file container (e.g., ‘...-hdlf’)
Running instance and file configuration checks
Once completed, the File Space includes configuration options such as:
vCPU / memory and the application setup of the Apache Spark application (adjustable):
Task Assignment for Object Types within the File Space, related to the Spark Applications:
Currently supported object types within the File Space include:
Folder
Local Table
Replication Flow
Transformation Flow
Task Chain
To test the impact of data growth on disk usage (in MiB), the following setup is used:
S/4HANA Source: a basic, extraction-enabled CDS view on the CDPOS table. It includes all columns and applies no filters.
Replication Flow: loading the same CDPOS data as an initial load into two different Spaces:
One File Space (‘Load 1’).
One Standard Space (‘Load 2’).
Data Doubling Procedure: after the initial load (~2.2 million records), a Transformation Flow doubles the dataset on each execution:
Adding a new unique key to the duplicated records using scripting logic (a PySpark sketch of this idea follows the task chain steps below).
Syncing the transformations:
Create a ‘Meta’ Task Chain to ensure consistent record duplication across Spaces:
Load 1.
Merge Table of Load 1.
Optimize Table of Load 1.
Load 2.
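To make the doubling step more tangible, below is a minimal PySpark sketch of the idea. The storage path and the surrogate key column ROW_KEY are hypothetical, and this is not the exact script used in the Transformation Flow; it only illustrates duplicating records while keeping keys unique.

```python
# Minimal, illustrative PySpark sketch of one doubling run.
# The path and the key column "ROW_KEY" are placeholders, not the real names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdpos_doubling").getOrCreate()

# Read the current snapshot of the local table (placeholder path).
cdpos = spark.read.format("delta").load("/data/cdpos_load1")

# Duplicate every record, deriving a fresh unique key for each copy so the
# appended rows do not collide with the originals.
copies = cdpos.withColumn(
    "ROW_KEY",
    F.concat(F.col("ROW_KEY"), F.lit("_"),
             F.monotonically_increasing_id().cast("string")),
)

# Append the copies: the table now holds twice as many records.
copies.write.format("delta").mode("append").save("/data/cdpos_load1")
```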
Note: in the Standard Space, the merge and optimize steps run automatically in the background.
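For context on what such a merge/optimize step does: local tables in a File Space are persisted on SAP HANA Data Lake Files in an open table format, and an Optimize task is conceptually a file compaction. The snippet below shows the equivalent operation with the open-source delta-spark API, purely as an analogy (assuming a Delta Lake table underneath); within Datasphere you schedule these as tasks rather than writing this code.

```python
# Conceptual equivalent of the Merge/Optimize tasks, using the open-source
# delta-spark API (an analogy; not how Datasphere itself is invoked).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table = DeltaTable.forPath(spark, "/data/cdpos_load1")  # placeholder path

# Compact the many small files produced by repeated appends into fewer,
# larger ones; this is what reduces storage and read overhead after a load.
table.optimize().executeCompaction()

# Clean up files no longer referenced by the table (default retention applies).
table.vacuum()
```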
The following is compared between the File Space and the Standard Space:
Record count after each run in the Local Table.
Disk usage in MiB after all processing steps of the ‘Meta’ Task Chain.
File Space disk usage as a % of Standard Space usage.
% growth in MiB for both Spaces after each load cycle.
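The last two metrics are simple ratios; a plain-Python sketch of how they are computed is shown below. The MiB inputs would come from the Space monitoring pages, and the example numbers are made up purely to show the shape.

```python
# Hypothetical helper functions for the two ratio metrics above; the MiB
# inputs are placeholders, not measurements from the actual test runs.
def file_vs_standard_pct(file_mib: float, standard_mib: float) -> float:
    """File Space disk usage as a percentage of Standard Space usage."""
    return 100.0 * file_mib / standard_mib


def growth_pct(previous_mib: float, current_mib: float) -> float:
    """Percentage growth in MiB between two consecutive load cycles."""
    return 100.0 * (current_mib - previous_mib) / previous_mib


# Made-up example values, for shape only:
print(file_vs_standard_pct(600.0, 1000.0))  # 60.0
print(growth_pct(500.0, 600.0))             # 20.0 (the table grew by 20%)
```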
When the results are plotted on a ‘Log2 scale’ (record count in millions on the x-axis) against the relative disk space used (%) on the y-axis, the following is observed:
At the final Local Table size of 564.2 million records, the File Space used only 59.85% of the disk space (in MiB) of the Standard Space, compared to 97.7% at the initial load of 2.2 million records.
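As a sketch of that chart, the matplotlib snippet below sets up the same axes using only the two data points quoted above; the intermediate doubling runs are left out because their exact values are not listed in this text.

```python
# Recreate the Log2 chart described above (matplotlib assumed available).
# Only the two values quoted in the text are plotted; intermediate runs omitted.
import matplotlib.pyplot as plt

records_millions = [2.2, 564.2]        # record count per run (x-axis)
file_pct_of_standard = [97.7, 59.85]   # File Space MiB as % of Standard Space

fig, ax = plt.subplots()
ax.plot(records_millions, file_pct_of_standard, marker="o")
ax.set_xscale("log", base=2)           # one axis step per doubling run
ax.set_xlabel("Record count (millions, log2 scale)")
ax.set_ylabel("File Space disk usage (% of Standard Space)")
ax.set_title("File Space vs. Standard Space disk usage")
plt.show()
```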
One observation: while loading into the Standard Space, disk usage temporarily spiked (e.g., above 30,836 MiB while growing from 282.1 to 564.2 million records), even though the final disk usage ended up much lower.
This spike was caused by the Delta Table merge, which temporarily locked the Space because its size limits were exceeded.
Keep in mind that results may vary depending on table characteristics such as:
Number of columns.
Column data types.
Cardinality of values.
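The cardinality point in particular is easy to demonstrate outside Datasphere. As an illustrative assumption, File Space storage relies on a columnar file format where encoding and compression behave much like Parquet; the pyarrow snippet below writes a low-cardinality and a high-cardinality column of one million rows each and compares the resulting file sizes.

```python
# Toy experiment: identical row counts, very different file sizes,
# purely because of column cardinality. Uses the open-source pyarrow
# library; this is an analogy, not Datasphere's actual storage code.
import os
import random
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
# Three distinct values -> dictionary encoding compresses extremely well.
low_card = pa.table({"v": [random.choice(["A", "B", "C"]) for _ in range(n)]})
# Mostly unique values -> far less redundancy for the encoder to exploit.
high_card = pa.table({"v": [f"val_{random.randrange(n)}" for _ in range(n)]})

pq.write_table(low_card, "low_card.parquet")
pq.write_table(high_card, "high_card.parquet")

for name in ("low_card.parquet", "high_card.parquet"):
    print(name, os.path.getsize(name) // 1024, "KiB")
```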
A File Space using SAP HANA Data Lake Files can significantly reduce disk usage (in MiB), especially as data volumes grow.
This feature offers benefits beyond the typical large non-SAP datasets, even for standard S/4HANA tables like CDPOS, or the STXL extraction explained in our previous blog.
Do keep in mind that at some point there is a tradeoff between the increased complexity of the Space, modelling, and processing setup on the one hand, and the reduced disk usage on the other.
If you would like to know more about using the File Space in the Object Store in SAP Datasphere, please contact Nico van der Hoeven (+31651528656).