In this tutorial, we will help you build your own Security Data Lake for the most common AWS sources in about an hour.

Need for Security Data Lake

Security incidents and data breaches have risen sharply across multiple industries, making it ever more pertinent to adopt and deploy robust, intelligent security solutions with comprehensive coverage. Organizations must collect, analyze and investigate data from as many sources and endpoints as possible to achieve a safety net against diverse threats. There is a need to centralize data across cloud, on-premise and SaaS environments to perform advanced analytics and deploy the solutions required to counter the pace and scale of sophisticated attacks.

The Challenges

Even though organizations can do all of the above, they need a dedicated team, a considerable budget and significant setup effort to get started. At the same time, the ever-increasing volume of data generated by a large number of endpoints, tools and environments makes attaining this desired state a cumbersome, time-consuming and expensive problem that traditional SIEMs are not equipped to handle.

[Image: Legacy SIEM limitations]

This can be solved by building a Cloud Data Lake in Snowflake.

Prerequisites:

- An AWS account with access to the AWS console
- An active Snowflake account, or a 30-day Snowflake free trial

What You'll Learn:

- How to set up log data collection for common AWS sources (VPC, IAM, S3, EC2 and GuardDuty)
- How to ingest and map that data in Snowflake
- How to run downstream analytics on the resulting log data

Approach

[Image: Solution architecture]

Data extraction/collection in AWS

We use a combination of AWS services such as AWS Lambda, S3, CloudWatch, etc., to set up the data collection process for five common security sources (VPC, IAM, S3, EC2 and GuardDuty) in AWS. As a result, all the logs are collected in an AWS S3 bucket that Snowflake can consume.

In AWS, you will be creating:

- An S3 bucket to store the compressed log files
- A CloudFormation stack per service that deploys the collection pipeline (a Lambda function, an Amazon EventBridge rule and a scheduled task)

For your convenience, we have packaged most of the required configuration into an easily deployable CloudFormation template. You will find the details on page 4.

Data Ingestion and Data Mapping in Snowflake

We have created a complete mechanism using various Snowflake components to consume the data logs directly from the AWS S3 bucket. Furthermore, we have encapsulated the whole pipeline in an easily deployable SQL file that sets up everything for you. As a result, you will access the logs through a Snowflake view built on the raw logs using a standard Open Data Model (ODM).

Downstream Data Analytics

In this step, we will showcase a few capabilities that can be readily built on the log data for effective analytical monitoring.
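For instance, a minimal sketch of one such capability is shown below. It assumes the ingestion step (covered later) exposes VPC flow log records through a view named vpc_flow_logs with src_ip, action and event_time columns; these names are illustrative, so adjust them to whatever ODM views your deployment actually creates. The query surfaces the source IPs generating the most rejected traffic over the last 24 hours:

-- Hypothetical monitoring query; view and column names depend on your ODM mapping.
select src_ip,
       count(*) as rejected_connections
from   vpc_flow_logs
where  action = 'REJECT'
  and  event_time >= dateadd('hour', -24, current_timestamp())
group by src_ip
order by rejected_connections desc
limit 20;

Similar queries can drive alerting and dashboards on the IAM, S3, EC2 and GuardDuty logs once they land in their respective views.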

You will need an active Snowflake account to build the security data lake. You can either use your own Snowflake account or create a 30-day Snowflake free trial within a couple of minutes. The trial can easily be converted to an enterprise account if desired.

Note:

You can now move to the next step, where you will set up log data collection.

The first step towards creating the data lake is setting up the data collection process from the various vendors/tools. In this codelab, we document how this can be done for multiple AWS sources. The overall workflow is to deploy a CloudFormation stack in AWS for each service that writes the data logs as compressed raw files to an S3 bucket.

To ease the process of stack deployment, we have published CloudFormation templates, accessible from Git, that simplify the whole process and require only a few parameters from your side. The process for each service is similar, with parameter changes for some services.

The steps below show how to set up collection for the AWS VPC (Virtual Private Cloud) source. Amazon Virtual Private Cloud is a commercial cloud computing service that provides users with a virtual private cloud by "provisioning a logically isolated section of Amazon Web Services Cloud", and the captured logs contain all the activity on that cloud. First, we will set up the data collection to collect and store the compressed raw files in an S3 bucket.

You can follow the steps below to set up the data collection:

Please log in to your AWS console to start the process.

[Screenshot: AWS console login]

Step 1: Creating an S3 bucket to store log files

You can search for "S3" in the search box or use the on-screen options to navigate to the S3 console.

[Screenshot: Navigating to the S3 console]

Please click on the "Create bucket" button to start creating the bucket.

[Screenshot: "Create bucket" button]

Here you can select the bucket name and region of your choice (our recommended bucket name: "cf-sd-aws") and use the other standard options to create the bucket.

[Screenshot: Bucket name and region selection]

Step 2: CloudFormation stack to collect data

We need to create a CloudFormation stack for the whole collection pipeline in AWS. The stack creates a Lambda function, a rule in Amazon EventBridge, a scheduled task, etc. To make this easy for you, the template below automatically deploys everything and sets up a task that writes log files to the selected S3 bucket every 10 minutes (change the frequency as required). On this page, we showcase the process for VPC, so we will use the VPC template. Please follow the steps below:

Navigate to the CloudFormation console.

[Screenshot: CloudFormation console]

Choose the option to create a stack.

[Screenshot: Create stack option]

Choose to create the stack using the "Upload a template file" option and attach this Template file.

[Screenshot: Uploading the template file]

You will be asked for multiple parameters (prepopulated by the template) that you need to review or change. You can change the name of the S3 bucket (to the one created above) and set ScheduleExportJobInterval to the desired data collection frequency.

Note: The region of the S3 bucket and the stack should be the same. All services ask for three parameters except VPC, where you additionally need to enter the ID (VPCResourceId) of the VPC you want to monitor. Please follow the instructions below to get the VPC ID.

[Screenshot: Stack parameters]

For VPCResourceId, we need to navigate to the VPC service page.

[Screenshot: VPC service page]

Click on VPCs (see all regions).

[Screenshot: VPCs list]

You can use the VPC resource ID (as highlighted below) for the VPCResourceId parameter in CloudFormation.

[Screenshot: VPC resource ID]

Now proceed through the remaining CloudFormation steps using the default options. On the final screen, select the acknowledgement checkbox to create the stack.

[Screenshot: Acknowledgement checkbox]

This will initiate the deployment of the stack, which sets up and activates the whole collection pipeline. Once deployed, it will start writing the data logs for the above VPC resource to the specified S3 bucket every 10 minutes. It will probably take a couple of minutes for the setup to complete.

[Screenshot: Stack creation in progress]

As per the schedule chosen (10 minutes by default), you will be able to see the data in the S3 bucket.

In the bucket, you will see individual folders for each service.

[Screenshot: Service folders in the S3 bucket]

Similarly, you can follow the above process with the respective templates to enable data collection for the other sources. The log data will be written to the same S3 bucket, with a subfolder created automatically for each deployed service.

This page covers all the components needed in Snowflake to operationalize the pipeline.

Storage Integration between Snowflake and AWS

Snowflake provides comprehensive documentation for setting up this storage integration. Please follow the steps in the link below.

Snowflake Storage Integration to AWS S3

Note: You can use a command like the one below to create the integration. You do not need to select a warehouse or database to create it. Replace the storage_aws_role_arn and storage_allowed_locations values with your own AWS role ARN and S3 bucket location, as per the Snowflake documentation.

create or replace storage integration AMZ_AWS_SF
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::4444444444:role/cf-trial-role'
  storage_allowed_locations = ('s3://cf-demo/');
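
Once the integration exists, the standard Snowflake flow is to describe it and copy the generated IAM values into the trust policy of the AWS role referenced above (the integration name here matches the example command):

-- Returns STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID, which you add
-- to the trust relationship of the IAM role in AWS.
desc integration AMZ_AWS_SF;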

Snowflake Data Ingestion Pipeline

So now we have our logs sitting in the S3 bucket, and we have also created an integration between Snowflake and that bucket. The next step is to set up the data ingestion pipeline in Snowflake to store and parse all the data.

You can use this SQL file: copy all of its contents into a new worksheet and change two things (the name of the storage integration and the path to the S3 bucket) in the create stage command in the file. Then execute all the commands by selecting them and running them together (Ctrl + Enter). At the end, you will have ODM views created and showing the data generated from AWS.
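
For orientation, here is a minimal sketch of the kind of objects such a pipeline sets up. It is not the exact contents of the SQL file; the object names, folder layout and JSON field names are assumptions, so rely on the provided file for the real pipeline and full ODM mapping.

-- Illustrative sketch only; run the provided SQL file for the actual pipeline.
create or replace stage aws_logs_stage
  url = 's3://cf-sd-aws/'                    -- bucket created earlier
  storage_integration = AMZ_AWS_SF           -- integration created earlier
  file_format = (type = json);

create or replace table raw_vpc_logs (record variant);

copy into raw_vpc_logs
  from @aws_logs_stage/vpc/;                 -- assumes a vpc/ subfolder

create or replace view vpc_flow_logs as
select record:srcaddr::string               as src_ip,
       record:dstaddr::string               as dst_ip,
       record:action::string                as action,
       to_timestamp(record:"start"::number) as event_time
from   raw_vpc_logs;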

Note: It takes a few minutes for the data to load and show up in the view after all the commands have finished executing.
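
If the views stay empty for longer than expected, a quick sanity check (again using the illustrative names from the sketch above; substitute the objects your SQL file actually created) is to confirm that Snowflake can see files on the stage and that rows are landing in the raw table:

-- List the log files visible through the stage and integration.
list @aws_logs_stage/vpc/;

-- Confirm that rows have been loaded.
select count(*) from raw_vpc_logs;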

[Screenshot: AWS VPC data in the ODM view]

Elysium Analytics is an advanced analytics solution built on the same ODM-structured data in Snowflake. We make building a semantic security data lake easy with our open data model for contextual and deep analytics. We also provide the prebuilt search and analytics applications you need to get immediate value from your data. Our cloud-native semantic data lake with an open data model enables organizations to perform contextual analytics and full-text search across all data sources. Our data model connects the dots across all telemetry, allowing deep analytics, alerting and visualization in a single pane of glass for better detection and investigation productivity. We onboard your telemetry and handle the data mapping for you.

Let's demonstrate how an enterprise organization achieves better detection through the adoption of modern platforms such as Snowflake. In this example, an enterprise is running a mission-critical AWS environment for e-commerce transactions. To keep the application secure, the enterprise has deployed Elysium Analytics to collect the data with prebuilt data connectors. The demo video shows an analyst receiving an email from Elysium Analytics and beginning to investigate the alert notification regarding an unusual remote desktop app installed on a laptop.

To book a demo, you can access the client form here: Fill the form

For Elysium Sales related inquiries: sales@elysiumanalytics.ai

You can watch various Elysium product demo videos here: Elysium Analytics Walkthrough Demo