Creating a New Cluster
Grid allows you to create a cluster inside your own infrastructure. This approach lets you control where your data sits and keep that data secure.
Request access to this feature! Send us a message in our community Slack or an email to [email protected].
Grid creates clusters inside your own cloud account, allowing you to keep complete control of the resources that you need. We'll guide you through the setup process for each of the supported cloud providers.

Amazon Web Services (AWS)

Requirements

Grid will create clusters designed for large AI workloads. In order to do so, your AWS account needs to have the right permissions and quotas. We'll cover both optional and required configurations below. If your cluster is small, or you only require a few instance types, the default quotas should work for you. Still, we recommend requesting extra quotas from AWS as your needs grow.
Configuration                                          Recommendation
Auto Scaling groups per region                         800
Launch configurations per region                       800
EC2 Spot (instance family you are interested in)       1000+
EC2 On-demand (instance family you are interested in)  1000+

Requesting Quotas

All AWS accounts have "service quotas". These are limits on how much of each AWS service you can use. In order to increase your quotas, you have to submit a quota increase request for the specific service. That will open a ticket with AWS support. You may need to follow up on the ticket in order for the quota to be granted.
You can request a quota increase in the console as follows (an AWS CLI alternative is sketched after this list):
    1. Log in to your AWS console
    2. Search for "Service Quotas" and click on the result
    3. Click on the area of the service (e.g. "Amazon Elastic Compute Cloud (Amazon EC2)")
    4. Use the search filter to find the quota that you are looking for
    5. Make a quota request
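If you prefer the command line, the same request can be made through the Service Quotas API. The snippet below is a minimal sketch: the quota code and desired value are placeholders, so look up the exact code for the instance family or limit you need first.

# list EC2 quotas to find the code of the limit you want to raise
# (use --service-code autoscaling for the Auto Scaling group limits)
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[].{Name:QuotaName,Code:QuotaCode,Value:Value}" --output table

# request an increase (quota code and value below are placeholders)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code <quota code> --desired-value 1000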

Step 1: Get AWS Credentials

A: Login to AWS and search for IAM
Log in to your AWS account, then use the search bar to find "IAM" (user management).
B: Click on "Users"
Click on the "Users" panel. You will be able to see a list of users. If you already have a user, click on your user name. If you don't, move to the next step to create a new user.
C: Create New User (optional)
If you don't have a user available and would like to create one, on the "Users" page click on "Add user". Fill in the user name of your preference and make sure to check "Programmatic access" (this allows you to use AWS keys).
Click on "Next: Permissions".
The user should have IAMFullAccess privileges.
Click on "Next: Tags" > "Next: Review" > "Create user".
D: Create New AWS Keys
    1. Navigate to the "Users" page
    2. Click on your user name
    3. Click on the tab "Security Credentials"
    4. Click on "Create access key"
    5. Copy both the "Access key ID" and the "Secret access key" values
The "Secret access key" value will only be shown once. Make sure you copy that value and store it in a safe location.
Make sure that your user has the right policies attached in order to use Grid correctly. Refer to the section Adding Grid AWS Policies & Roles for more details.
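If you already have working CLI credentials for this account, the same key pair can also be created from the command line; the user name below is a placeholder.

# create a new access key pair for the user (the secret is shown only once, store it safely)
aws iam create-access-key --user-name <your user name>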

Step 2: Add IAM permissions to your account

The user you just created, and fetched credentials for, should have IAMFullAccess privileges.
Reach out to us via Slack or email if you have any issues creating the following AWS roles and policies. We're happy to help!
A: Add Policies to Your Account
The final step is to add all the Grid policies to your account. That means that your AWS keys will now be able to perform the operations required by Grid.
    1. First, log in to AWS and navigate to IAM
    2. Click on "Users"
    3. On the Users page, find your user name and click on it
    4. Click on "Add permissions"
    5. Click on "Attach existing policies directly"
Granting permissions to a user.
    1. Search for the policy IAMFullAccess
    2. Click the check box to the left of IAMFullAccess
    3. Click on "Next: Review"
    4. Click on "Add permissions"
Now that you have added the right permissions to your user, you can use that user's AWS API keys with Grid.
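The console steps above can also be done with the AWS CLI, assuming the credentials you are currently using are allowed to manage IAM. A minimal sketch:

# attach the managed IAMFullAccess policy to the user (user name is a placeholder)
aws iam attach-user-policy \
  --user-name <your user name> \
  --policy-arn arn:aws:iam::aws:policy/IAMFullAccess

# confirm the policy is attached
aws iam list-attached-user-policies --user-name <your user name>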

Step 3: Create the Role & Policy Grid Requires

In this step you're going to create the role that Grid will assume into your account. For this you'll be using terraform. Make sure you have git, terraform, jq, and the AWS CLI installed on your machine; installation instructions for these tools are in the Installing 3rd Party Tools section below.
If you're familiar with terraform, we recommend checking the terraform module we'll be using to create the necessary roles & policies: https://github.com/gridai/terraform-aws-gridbyoc
The module is also published on the official terraform registry for convenience: https://registry.terraform.io/modules/gridai/gridbyoc/aws/latest
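If you manage your own terraform root module, you can also call the registry module directly instead of the quick-start wrapper. The sketch below is unverified: it assumes the module needs no required inputs and exposes the same outputs as the quick-start (role_arn, external_id), so check the registry page for the actual variables and the latest version before using it.

# main.tf -- a minimal, unverified sketch; inputs and outputs are assumptions
provider "aws" {
  region = "us-east-1" # pick your region
}

module "gridbyoc" {
  source = "gridai/gridbyoc/aws"
  # version = "x.y.z"  # pin a release listed on the registry page
}

output "role_arn" {
  value = module.gridbyoc.role_arn
}

output "external_id" {
  value     = module.gridbyoc.external_id
  sensitive = true
}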
The script needs the following permissions:
    "eks:*",
    "ecr:*",
    "events:*",
    "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
    "arn:aws:iam::aws:policy/AmazonGuardDutyFullAccess",
    "arn:aws:iam::aws:policy/AmazonRoute53ResolverFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonSNSFullAccess",
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonVPCFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
    "arn:aws:iam::aws:policy/IAMFullAccess",
For a quick start:
    Clone the repo

git clone https://github.com/gridai/terraform-aws-gridbyoc.git
cd terraform-aws-gridbyoc/quick-start

    Configure the AWS CLI with the access keys created in Step 1

unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset AWS_SESSION_TOKEN

aws configure

# prompt and example entries below

AWS Access Key ID [None]: xxxxxxxxx
AWS Secret Access Key [None]: xxxxxxxxx
Default region name [None]:
Default output format [None]:
    Verify AWS Access Key

aws sts get-caller-identity

# example entries below should match the above steps
{
  "UserId": "xxxxxxxxx",
  "Account": "xxxxxxxxx",
  "Arn": "arn:aws:iam::xxxxxxxxx:user/xxxxxxxxx"
}
    Run the Terraform script and enter the AWS Region when prompted. The region where the VPC is located is entered in a later step.

terraform init
terraform apply

# enter provider.aws.region
provider.aws.region
  The region where AWS operations will take place. Examples
  are us-east-1, us-west-2, etc.

  Enter a value: <us-east-1>

# long list of actions truncated; the final prompt
Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
    Get the output from terraform. By default, terraform hides the sensitive output values.

terraform output -json | jq
From the last command you'll get the following output:

{
  "external_id": {
    "sensitive": true,
    "type": "string",
    "value": "<example-id>"
  },
  "role_arn": {
    "sensitive": false,
    "type": "string",
    "value": "<arn:aws:iam::000000000000:role/example-role>"
  },
  "role_name": {
    "sensitive": false,
    "type": "string",
    "value": "example-role"
  }
}
Export the role ARN and external ID for the next step:

export EXTERNAL_ID=$(terraform output -json | jq -r '.external_id.value')
export ROLE_ARN=$(terraform output -json | jq -r '.role_arn.value')

Step 4: Register Your Role in Grid

Currently, Grid Sessions and Runs are spun up in Availability Zone "a" by default. Specify only the AWS region, not the AZ, in the --region argument.
    Log in to Grid. Refer to the detailed login steps as required.

pip install lightning_grid --upgrade
grid login --username <Grid user name> --key <Grid API Key>
    Create a cluster in the default region with default instance types.
    The cluster name must consist of lower case alphanumeric characters; '-' or '.' is allowed but not '_', and the name must start and end with an alphanumeric character.

grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID <cluster name>
    Create a cluster in the us-west-2 region with default instance types. The defaults give you a broad selection of commonly used instance types, but if you already know which ones you'll be using, specify them explicitly.

grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region us-west-2 <cluster name>
    Create a cluster in the us-west-2 region with t2.medium and t2.large instance types.

grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region us-west-2 --instance-types t2.medium,t2.large <cluster name>
    Launch a cluster in cost-savings mode, using the --cost-savings flag. See the Cost saving mode section below for what cost savings actually implies.

grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region us-west-2 --cost-savings --instance-types t2.medium,t2.large <cluster name>
    Launch a cluster and edit advanced options before submitting it for creation.

grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region us-west-2 --edit-before-creation --instance-types t2.medium,t2.large <cluster name>

Step 5: Wait for cluster to be provisioned

grid clusters

Then wait for your cluster status to be running:
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ id                 ┃ name               ┃ status  ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ grid-cloud-prod    │ grid-cloud-prod    │ running │
│ <cluster name>     │ <cluster name>     │ running │
└────────────────────┴────────────────────┴─────────┘
It can take some time to provision a new cluster (roughly 20-30 minutes).
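If you'd like to keep an eye on it, a simple polling loop works; the interval below is arbitrary.

# re-run `grid clusters` every 60 seconds until the new cluster shows "running"
# (on macOS, `watch` is available via `brew install watch`)
watch -n 60 grid clusters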

Step 6: Run your workloads in your new cluster

grid run --cluster <cluster name>
grid session create --cluster <cluster name>
Or, if you're using a config file, set the .compute.provider.cluster field to the name of the cluster you've just provisioned.
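For example, the relevant fragment of a config file would look something like this; any other fields in your existing config stay as they are.

# grid config fragment -- only the cluster field matters here
compute:
  provider:
    cluster: <cluster name>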

Step 7: Enjoy!

Your cluster will be available for use on Grid, so use it (or any other cluster) as you wish.

Editing and Deleting Clusters

Use grid edit to see the instance types available and update them as necessary. You can also switch between the cost-savings and default modes of operation.

grid edit cluster <cluster name>
Use grid delete to delete a cluster. Deleting a cluster will delete its resources, including running resources. Use with care!
Grid attempts to delete all cluster resources when a delete operation is initiated. However, sometimes there are dangling resources left behind. Make sure to inspect your account for dangling resources and delete them manually if that is the case. Reach out to support if you have any issues -- we are happy to help!

grid delete cluster <cluster name>
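When checking for dangling resources, one place to start is the standard Kubernetes ownership tag that EKS-managed resources typically carry. This is an assumption about how the resources are tagged, so still review the account manually.

# list EC2 instances still tagged as belonging to the cluster (tag key is an assumption)
aws ec2 describe-instances \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster name>" \
  --query "Reservations[].Instances[].InstanceId"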

Cost saving mode

There are two cluster management modes you can pick from, depending on your expected cluster size and your latency/cost preferences. They are easily switched using the --cost-savings flag when creating the cluster.
    default (performance):

"performance_profile": "CLUSTER_PERFORMANCE_PROFILE_DEFAULT",

    cost saving:

"performance_profile": "CLUSTER_PERFORMANCE_PROFILE_COST_SAVING",
In the cost savings mode you're trading startup latency for lower cost. Grid runs some background components:
    VPC / EKS cluster / ELBs / CloudWatch Logs
which are the same in both modes. Others are variable:
    EC2 instance types & count used for management ("skeleton crew") purposes.
In cost-savings mode we run the management workloads on a single server, while some components are scaled down to 0 replicas and only booted when needed. In performance (default) mode we run the management nodes in an HA (highly available) configuration, and certain components run persistently to improve start-up latency. Depending on the region, these background costs are around ~$10/day in cost-savings mode, compared to ~$50/day in the default mode.

Trade-offs

Equivalent

    In both modes the session start time is equivalent
    Experiment runtime speed is equivalent
    Tensorboard runtime speed is equivalent
    In both cases the Kubernetes API control plane is managed by AWS in an HA manner, and is thus unaffected

Degraded performance

    Experiments may start slower.
    Tensorboard may start slower.
    Datastores may take longer to be optimized.
    Experiment logs are optimized for smaller query volumes compared to default mode.

Operational risks

    There's a small but non-negligible risk of cluster malfunction, because the single management node is a single point of failure. This node runs the gridlet agent & cluster-autoscaler responsible for dynamically scaling nodes up and down.
    The maximum concurrent experiment/session count is smaller. This means the cluster could experience issues at bigger node counts, especially with workload scheduling and with scaling nodes up & down, mostly due to the resource constraints imposed on gridlet & cluster-autoscaler.
By the way, you can also overprovision certain instance types so that experiments & sessions start even faster on those instances:

"instance_types": [
  {
    "name": "t2.medium",

    # Number of extra warm instances that should be available to speed things up
    "overprovisioned_ondemand_count": 3
  }
],

Be warned: you're paying for this spare capacity even though it is unused most of the time. Use grid edit cluster <cluster name> or grid clusters aws --edit-before-creation <cluster name> to access these advanced options.

Installing 3rd Party Tools

Cluster setup requires the following tools, so make sure you have them installed.

MacOS

brew and pip3 are used in this example.
brew install git
brew install terraform
brew install jq
pip3 install awscli --upgrade --user

Linux (Debian/Ubuntu)

Grid Session SSH can be used to run the example below. apt-get and repository configuration are used in this example.

# add the hashicorp repo
sudo apt-get install gpg
sudo apt-get install software-properties-common
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com $(lsb_release -cs) main"

# refresh the package lists so the new repo is visible
sudo apt-get update

# install the tools
sudo apt-get install git
sudo apt-get install terraform
sudo apt-get install jq
sudo apt-get install awscli