How to do it...

We proceed with running distributed TensorFlow on Google CloudML:

  1. The first step is simply to download the example code:

git clone
cd cloudml-dist-mnist-example

  2. Then we download the data and save it in a GCP storage bucket:
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
# BUCKET is not set in the original listing; this name is an assumed example
# (bucket names must be globally unique)
BUCKET="${PROJECT_ID}-ml"
gsutil mb -c regional -l us-central1 gs://${BUCKET}
gsutil cp /tmp/data/train.tfrecords gs://${BUCKET}/data/
gsutil cp /tmp/data/test.tfrecords gs://${BUCKET}/data/
  3. Submitting a training job is very easy: we simply invoke the training step with CloudML Engine. In this example, the trainer code runs for 10000 steps (the value passed as --train_steps) in the region us-central1. The input data is read from the storage bucket, and the output is written to a job-specific path in the same bucket. Note the bare -- separator: arguments before it are consumed by gcloud, and arguments after it are passed to the trainer module.
JOB_NAME="job_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training ${JOB_NAME} \
  --package-path trainer \
  --module-name trainer.task \
  --staging-bucket gs://${BUCKET} \
  --job-dir gs://${BUCKET}/${JOB_NAME} \
  --runtime-version 1.2 \
  --region us-central1 \
  --config config/config.yaml \
  -- \
  --data_dir gs://${BUCKET}/data \
  --output_dir gs://${BUCKET}/${JOB_NAME} \
  --train_steps 10000
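The cluster used for distributed training is described by the file passed to --config. The contents of config/config.yaml are not shown in this recipe, so the following is only a sketch of what such a file might look like, assuming a custom cluster with one master, two workers, and one parameter server. The helper name write_cluster_config and the machine types and counts are illustrative assumptions, not the values used by the example repository.

```shell
# Hypothetical helper: write an illustrative CloudML Engine cluster config
# to the path given as $1. The values below are assumptions for illustration.
write_cluster_config() {
  cat > "$1" <<'EOF'
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  parameterServerType: standard
  workerCount: 2
  parameterServerCount: 1
EOF
}
```

Writing the file to a separate path, for example with write_cluster_config /tmp/my_config.yaml, and passing that path to --config leaves the repository's own configuration untouched.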
  4. If you want, you can monitor the training process from the CloudML console.
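The job can also be followed from the command line. The sketch below polls the job state with gcloud ml-engine jobs describe until the job reaches a terminal state. The function name wait_for_job and the POLL_SECONDS variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: block until the training job named $1 finishes.
# Returns 0 on SUCCEEDED, 1 on FAILED or CANCELLED, 2 if gcloud itself fails.
wait_for_job() {
  job="$1"
  while true; do
    state=$(gcloud ml-engine jobs describe "$job" --format "value(state)") || return 2
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED | CANCELLED) return 1 ;;
    esac
    # Any other state (for example QUEUED or RUNNING): wait and poll again.
    sleep "${POLL_SECONDS:-30}"
  done
}
```

A typical call would be wait_for_job "${JOB_NAME}" right after submitting the job. To watch the log output as it is produced, gcloud ml-engine jobs stream-logs ${JOB_NAME} is an alternative.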
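  5. Once the training is concluded, it is possible to serve the model directly from CloudML. We create a model, register the exported SavedModel as a version of it, and make that version the default:
# MODEL_NAME and VERSION_NAME are not set in the original listing; these values are assumed examples
MODEL_NAME=MNIST
VERSION_NAME=v1
gcloud ml-engine models create --regions us-central1 ${MODEL_NAME}
# Pick the most recent export directory written by the training job
ORIGIN=$(gsutil ls gs://${BUCKET}/${JOB_NAME}/export/Servo | tail -1)
gcloud ml-engine versions create \
  --origin ${ORIGIN} \
  --model ${MODEL_NAME} \
  ${VERSION_NAME}
gcloud ml-engine versions set-default --model ${MODEL_NAME} ${VERSION_NAME}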
  6. Once the model is served online, it is possible to access the server and make a prediction. The request.json file is created by a script that reads data from MNIST, performs a one-hot encoding, and then writes the features in a well-formed JSON schema, with one JSON instance per line.
gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
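To make the one-hot encoding mentioned in the last step concrete, here is a minimal sketch that prints a digit label as a one-hot JSON array. The function name one_hot is hypothetical, and the sketch covers only the encoding itself: the field names that the exported model expects in request.json are not given in this recipe, so they are not reproduced here.

```shell
# Hypothetical helper: print label $1 as a one-hot JSON array
# with $2 classes (defaults to 10, the number of MNIST digits).
one_hot() {
  label="$1"
  classes="${2:-10}"
  i=0
  out=""
  while [ "$i" -lt "$classes" ]; do
    if [ "$i" -eq "$label" ]; then bit=1; else bit=0; fi
    # Join the bits with commas, without a leading comma.
    if [ -z "$out" ]; then out="$bit"; else out="$out,$bit"; fi
    i=$((i + 1))
  done
  printf '[%s]\n' "$out"
}
```

For example, one_hot 3 prints [0,0,0,1,0,0,0,0,0,0].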
