Overview

This page describes BYOC cluster creation in Grid-Managed infrastructure mode. In this mode, Grid manages and provisions AWS infrastructure on your behalf in a fully automated fashion. To achieve this, Grid requires more permissions (including IAMFullAccess) than the self-managed BYOC mode requires.

This documentation assumes you have followed our prerequisite installation steps.

Deploying Grid-Managed Bring Your Own Cluster (BYOC) Mode

note

Request access to this feature! Send us a message in our community Slack or email support@grid.ai.

Grid creates clusters inside your own cloud account, allowing you to maintain complete control over the resources that you need. We'll guide you through the setup process for each of the supported cloud providers.

Amazon Web Services (AWS)

Requirements

Grid creates clusters designed for large AI workloads. To support them, your AWS account needs the right permissions and service quotas. We cover the recommended and required configurations below.

| Configuration | Recommendation |
| --- | --- |
| Auto Scaling groups per region | 800 |
| Launch configurations per region | 600 |
| EC2 Spot (instance family you are interested in) | 1000+ |
| EC2 On-Demand (instance family you are interested in) | 1000+ |

Grid will create a number of AWS resources in order to provision your BYOC cluster, as shown in the table below. If creating these resources would exceed your quotas, the BYOC cluster creation process will fail. To address this, either delete existing unused resources or increase your AWS quotas.

| Resource | Required Quota |
| --- | --- |
| AWS IAM roles | 15 |
| AWS IAM policies | 15 |
| VPCs | 5 |
| S3 buckets | 5 |
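
If you want to check how close your account already is to these limits, the AWS CLI can report current usage. This is a quick sketch covering only the resources in the table above:

# current IAM role and customer-managed policy counts for the account
aws iam get-account-summary --query 'SummaryMap.{Roles:Roles,Policies:Policies}'

# VPCs in the current region
aws ec2 describe-vpcs --query 'length(Vpcs)'

# S3 buckets (bucket names are global, so this count is account-wide)
aws s3api list-buckets --query 'length(Buckets)'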

AWS STS regional endpoints have to be enabled in the target region. Go to your AWS account settings and verify that the regional endpoint is activated. In most cases your region already has the AWS STS regional endpoint enabled; see the IAM User Guide.

note

Skipping this step will cause issues that are difficult to debug: the kubelet will be unable to authenticate against the Kubernetes API server and nothing will work.
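
One quick way to confirm the regional endpoint responds is to call it directly with the AWS CLI. This is just a sketch; substitute the region you plan to deploy to:

# call the STS regional endpoint directly; an error here suggests the
# regional endpoint is not activated for that region
aws sts get-caller-identity --endpoint-url https://sts.us-east-1.amazonaws.com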

Requesting Quotas

All AWS accounts have "service quotas": limits on how much of each AWS service you can use. To increase a quota, you submit a quota increase request for the specific service, which opens a ticket with AWS support. You may need to follow up on the ticket in order for the quota to be granted.

You can request a quota increase by doing the following:

  1. Log in to your AWS console
  2. Search for "Service Quotas" and click on the result
  3. Click on the area of the service (e.g. "Amazon Elastic Compute Cloud (Amazon EC2)")
  4. Use the search filter to find the quota that you are looking for
  5. Make a quota request
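
If you prefer the command line, the same request can be made through the Service Quotas API. This is a minimal sketch; the quota code below is a placeholder, so look up the real code for the quota you need first:

# find the quota code for the quota you want to raise (EC2 shown as an example)
aws service-quotas list-service-quotas --service-code ec2 \
  --query 'Quotas[].{Name:QuotaName,Code:QuotaCode,Value:Value}'

# submit the increase request (replace L-XXXXXXXX with the real quota code)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-XXXXXXXX --desired-value 1000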

Step 1: Get AWS Credentials

A: Log in to AWS and search for IAM

Log in to your AWS account. You will then use the search bar to find "IAM" (user management).

B: Click on "Users"

Click on the "Users" panel. You will be able to see a list of users. If you already have a user, click on your user name. If you don't, move to the next step to create a new user.

C: Create New User (optional)

If you don't have a user available and would like to create one, on the "Users" page click on "Add user". Fill in the username of your preference and make sure to check "Programmatic access" (this allows you to use AWS keys).

Click on "Next: Permissions".

The user should have IAMFullAccess privileges.

Click on "Next: Tags" > "Next: Review" > "Create user".

D: Create New AWS Keys

  1. Navigate to the "Users" page
  2. Click on your user name
  3. Click on the tab "Security Credentials"
  4. Click on "Create access key"
  5. Copy both the "Access key ID" and the "Secret access key" values
note

The "Secret access key" value will only be shown once. Make sure you copy that value and store it in a safe location.

Make sure that your username has the right policies attached in order to use Grid correctly. Refer to the section Adding Grid AWS Policies & Roles for more details.

Step 2: Add IAM permissions to your account

The user you created and fetched credentials for should have IAMFullAccess privileges.

note

Reach out to us via Slack or email if you have any issues creating the following AWS roles and policies. We're happy to help!

A: Add Policies to Your Account

The final step is to add all the Grid policies to your account. That means that your AWS keys will now be able to perform the operations required by Grid.

  1. First, log in to AWS and navigate to IAM
  2. Click on "Users"
  3. On the user's page, find your user name and click on it
  4. Click on "Add permissions"
  5. Click on "Attach existing policies directly"

Granting permissions to a user.

  1. Search for the policy IAMFullAccess:
  2. Click the Check Box to the left of IAMFullAccess
  3. Click on "Next: Review"
  4. Click on "Add permissions"

Now that you have added the right permissions to your username, you can use the user's AWS API keys with Grid.
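
For reference, the console clicks above are equivalent to attaching the managed policy with the CLI (placeholder username again):

# attach the IAMFullAccess managed policy to the user
aws iam attach-user-policy --user-name grid-byoc-user \
  --policy-arn arn:aws:iam::aws:policy/IAMFullAccess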

Step 3: Create the Role & Policy Grid Requires

In this step you will create the role that Grid will assume. For this you'll use Terraform. Make sure you have git, Terraform, jq, and the AWS CLI installed on your machine. Installation instructions for these tools are available here. If you're familiar with Terraform, we recommend you review the Terraform module we'll be using to create the necessary roles & policies: https://github.com/gridai/terraform-aws-gridbyoc. This module is also published on the official Terraform registry for your convenience: https://registry.terraform.io/modules/gridai/gridbyoc/aws/latest.

note

The script needs the following permissions and managed policies:

  • eks:*
  • ecr:*
  • events:*
  • arn:aws:iam::aws:policy/AmazonEC2FullAccess
  • arn:aws:iam::aws:policy/AmazonGuardDutyFullAccess
  • arn:aws:iam::aws:policy/AmazonRoute53ResolverFullAccess
  • arn:aws:iam::aws:policy/AmazonS3FullAccess
  • arn:aws:iam::aws:policy/AmazonSNSFullAccess
  • arn:aws:iam::aws:policy/AmazonSQSFullAccess
  • arn:aws:iam::aws:policy/AmazonVPCFullAccess
  • arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
  • arn:aws:iam::aws:policy/IAMFullAccess

For a quick start:

  • Clone the repo
git clone https://github.com/gridai/terraform-aws-gridbyoc.git
cd terraform-aws-gridbyoc/quick-start
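
# clear any AWS credentials already set in this shell so the keys you enter
# in `aws configure` below are the ones actually used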
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset AWS_SESSION_TOKEN

aws configure

# prompt and example entries below

AWS Access Key ID [None]: xxxxxxxxx
AWS Secret Access Key [None]: xxxxxxxxx
Default region name [None]:
Default output format [None]:
  • Verify AWS Access Key
aws sts get-caller-identity

# example entries below should match the above steps
{
  "UserId": "xxxxxxxxx",
  "Account": "xxxxxxxxx",
  "Arn": "arn:aws:iam::xxxxxxxxx:user/xxxxxxxxx"
}
  • Run the Terraform script and enter the AWS Region when prompted. The region where the VPC is located is entered in a later step.
terraform init
terraform apply

# enter provider.aws.region
provider.aws.region
The region where AWS operations will take place. Examples
are us-east-1, us-west-2, etc.

Enter a value: <us-east-1>

# long list of actions truncated and the final prompt

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value: yes
  • Get the output from Terraform. By default, Terraform hides sensitive output values.
terraform output -json | jq

From the previous command, you should get the following output:

{
  "external_id": {
    "sensitive": true,
    "type": "string",
    "value": "<example-id>"
  },
  "role_arn": {
    "sensitive": false,
    "type": "string",
    "value": "<arn:aws:iam::000000000000:role/example-role>"
  },
  "role_name": {
    "sensitive": false,
    "type": "string",
    "value": "example-role"
  }
}
export EXTERNAL_ID=$(terraform output -json | jq -r '.external_id.value')
export ROLE_ARN=$(terraform output -json | jq -r '.role_arn.value')
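
Before registering the role with Grid, you can optionally sanity-check that it exists in your account. A minimal sketch using the role_name output from above:

# confirm the role created by Terraform is visible in your account
aws iam get-role --role-name "$(terraform output -json | jq -r '.role_name.value')"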

Step 4: Register Your Role in Grid

By default, Grid Sessions and Runs are spun up in Availability Zone a. Only specify the AWS region and not the AZ in the --region argument.

  • Log in to Grid. Please reference the detailed steps as required.
pip install lightning-grid --upgrade
grid login --username <Grid user name> --key <Grid API Key>
  • Create cluster in default region with default instance types.
  • Cluster names must consist of lower case alphanumeric characters; '-' and '.' are allowed but '_' is not; and the name must start and end with an alphanumeric character
grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID <cluster name>
  • Create cluster in us-west-2 region with default instance types
grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region us-west-2 <cluster name>
  • Create cluster in the eu-west-2 region with t2.medium and t2.xlarge instance types
grid clusters aws --role-arn $ROLE_ARN --external-id $EXTERNAL_ID --region eu-west-2 --instance-types t2.medium,t2.xlarge <cluster name>

Step 5: Wait for cluster to be provisioned

After submitting the cluster creation request, you can check the cluster state by running:

grid clusters

Then wait for your cluster status to be running:

┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ id              ┃ name            ┃ type       ┃ status  ┃ created     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━┩
│ grid-cloud-prod │ grid-prod-cloud │ grid-cloud │ running │ 2 days ago  │
│ <cluster name>  │ <cluster name>  │ byoc       │ running │ an hour ago │
└─────────────────┴─────────────────┴────────────┴─────────┴─────────────┘

Provisioning a new cluster can take ~30-50 minutes. Optionally, you can add the --wait flag to the cluster creation command, and the Grid CLI will wait until the cluster is running.

Step 6: Run your workloads in your new cluster

grid run --cluster <cluster name>
grid session create --cluster <cluster name>

Or, if you're using a config file, set the .compute.provider.cluster field to the name of the cluster you've just provisioned.
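
For example, here is a minimal sketch of setting that field; the file name and command are illustrative, and only the .compute.provider.cluster path comes from this guide:

# write a config fragment selecting the new cluster (file name is an example)
cat > grid-config.yml <<'EOF'
compute:
  provider:
    cluster: <cluster name>
EOF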

Step 7: Enjoy!

Your cluster will be available for use on Grid, so use it (or any other cluster) as you wish.

Editing Clusters

Use grid edit to see instance types available and update as necessary.

grid edit cluster <cluster name>

An editor in your command line will show the JSON configuration for the cluster, like the one below (we have omitted some attributes with ellipses ... to make this section easier to understand).

{
  "cluster_type": "CLUSTER_TYPE_BYOC",
  "cost_factor": "",
  "desired_state": "CLUSTER_STATE_RUNNING",
  "driver": {
    "external": null,
    "kubernetes": {
      "aws": {
        ...
        "instance_types": [
          {
            "name": "g4dn.xlarge",
            "overprovisioned_ondemand_count": 0
          },
          {
            "name": "m5ad.xlarge",
            "overprovisioned_ondemand_count": 0
          }
        ],
        ...
      }
    }
  },
  ...
  "performance_profile": "CLUSTER_PERFORMANCE_PROFILE_DEFAULT"
}

Some important attributes you can change:

  • instance_types: Here you can add or remove instance types following AWS naming, but at the moment only amd64-compatible instances can be used. You can also change overprovisioned_ondemand_count for an instance if you want to pre-allocate instances for faster startup, but that will also incur extra costs.
  • performance_profile: You can change the performance profile for the cluster. It can be either:
    • CLUSTER_PERFORMANCE_PROFILE_DEFAULT, which adds extra nodes for larger clusters along with metrics and monitoring capabilities
    • CLUSTER_PERFORMANCE_PROFILE_COST_SAVING, which targets smaller clusters, omits metrics and monitoring, and is less expensive to run

Deleting Clusters

Use grid delete to delete a cluster. Deleting a cluster will delete its resources, including running resources, so use it with care! The deletion takes ~20-30 minutes. The --wait flag is also available here; if used, the Grid CLI will wait until the cluster is deleted.

note

Grid attempts to delete all cluster resources when a delete operation is initiated. However, sometimes there are dangling resources left behind. Make sure to inspect your account for dangling resources and delete them manually if that is the case. Reach out to support if you have any issues -- we are happy to help!

grid delete cluster <cluster name>
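
After the deletion finishes, one way to hunt for dangling resources is to query by tag with the Resource Groups Tagging API. This is only a sketch; the tag key and value below are placeholders, so check which tags Grid actually applied to the cluster's resources in your account:

# list resource ARNs carrying a given tag
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=<tag key>,Values=<cluster name> \
  --query 'ResourceTagMappingList[].ResourceARN'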

Next, use terraform to delete the AWS resources you created as part of the install process.

terraform destroy

Next Steps

Now that you have gotten a feel for deploying Grid-Managed BYOC mode, we would like to show you the enterprise-ready mode called Self-Managed BYOC Mode.