mltoolbox.regression.dnn

This module contains functions for regression problems modeled as a fully connected feedforward deep neural network.

Every function can run locally or use Google Cloud Platform.

mltoolbox.regression.dnn.analyze(output_dir, dataset, cloud=False, project_id=None)[source]

Blocking version of analyze_async. See documentation of analyze_async.

mltoolbox.regression.dnn.analyze_async(output_dir, dataset, cloud=False, project_id=None)[source]

Analyze data locally or in the cloud with BigQuery.

Produces analysis used by training. Cloud analysis can take a while, even for small datasets; for small datasets, it may be faster to use local_analysis.

Parameters:
  • output_dir – The output directory to use.
  • dataset – only CsvDataSet is supported currently.
  • cloud – If False, runs analysis locally with Pandas. If True, runs analysis in the cloud with BigQuery.
  • project_id – Uses BigQuery with this project id. Default is datalab’s default project id.
Returns:

A google.datalab.utils.Job object that can be used to query state or wait for completion.
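
A minimal sketch of a local run (the CsvDataSet constructor, schema list, and paths below are illustrative assumptions, not part of this module's contract):

    import google.datalab.ml as ml
    from mltoolbox.regression import dnn

    # Illustrative schema; names and types must match the csv files.
    schema = [
        {'name': 'col_key', 'type': 'INTEGER'},
        {'name': 'col_target', 'type': 'FLOAT'},
        {'name': 'col_A', 'type': 'FLOAT'},
    ]
    dataset = ml.CsvDataSet(file_pattern='./data/train*.csv', schema=schema)

    # cloud=False (the default) runs the analysis locally with Pandas.
    dnn.analyze(output_dir='./analysis', dataset=dataset)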

mltoolbox.regression.dnn.batch_predict(training_dir, prediction_input_file, output_dir, mode, batch_size=16, shard_files=True, output_format='csv', cloud=False)[source]

Blocking version of batch_predict_async.

See documentation of batch_predict_async.

mltoolbox.regression.dnn.batch_predict_async(training_dir, prediction_input_file, output_dir, mode, batch_size=16, shard_files=True, output_format='csv', cloud=False)[source]

Local and cloud batch prediction.

Parameters:
  • training_dir – The output folder of training.
  • prediction_input_file – csv file pattern. Files must be on GCS if running cloud prediction.
  • output_dir – Output location to save the results. Must be a GCS path if running cloud prediction.
  • mode – ‘evaluation’ or ‘prediction’. If ‘evaluation’, the input data must contain a target column. If ‘prediction’, the input data must not contain a target column.
  • batch_size – Int. How many instances to run in memory at once. Larger values mean better performance but more memory consumed.
  • shard_files – If False, the output files are not sharded.
  • output_format – csv or json. JSON files are newline-delimited.
  • cloud – If True, runs batch prediction in the cloud. If False, runs batch prediction locally.
Returns:

A google.datalab.utils.Job object that can be used to query state or wait for completion.
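
A local batch prediction sketch in ‘evaluation’ mode (paths are illustrative; job.wait() assumes the returned Job object supports waiting, as described above):

    from mltoolbox.regression import dnn

    # 'evaluation' mode: the input csv must contain the target column.
    job = dnn.batch_predict_async(
        training_dir='./training',        # output_dir of a finished train() run
        prediction_input_file='./data/eval*.csv',
        output_dir='./batch_predict',
        mode='evaluation',
        batch_size=64,
        shard_files=False,                # write unsharded output files
        output_format='csv')
    job.wait()                            # block until prediction finishes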

mltoolbox.regression.dnn.predict(data, training_dir=None, model_name=None, model_version=None, cloud=False)[source]

Runs prediction locally or on the cloud.

Parameters:
  • data – List of csv strings or a Pandas DataFrame that match the model schema.
  • training_dir – local path to the trained output folder.
  • model_name – deployed model name
  • model_version – deployed model version
  • cloud – bool. If False, runs local prediction; data and training_dir must be set. If True, runs cloud prediction; data, model_name, and model_version must be set.

For cloud prediction, the model must first be created. This can be done by running two gcloud commands:

1) gcloud beta ml models create NAME
2) gcloud beta ml versions create VERSION --model NAME --origin gs://BUCKET/training_dir/model

or these Datalab commands:

    import google.datalab as datalab
    model = datalab.ml.ModelVersions(MODEL_NAME)
    model.deploy(version_name=VERSION, path='gs://BUCKET/training_dir/model')

Note that the model must be on GCS.

Returns:

Pandas DataFrame.
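
A local prediction sketch (the csv strings are illustrative; they must match the training schema, minus the target column):

    from mltoolbox.regression import dnn

    # Each string is one instance; column order follows the schema used
    # during analysis, without the target column.
    instances = ['15,1.2',
                 '16,3.4']
    df = dnn.predict(data=instances, training_dir='./training')
    print(df.head())
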
mltoolbox.regression.dnn.train(train_dataset, eval_dataset, analysis_dir, output_dir, features, layer_sizes, max_steps=5000, num_epochs=None, train_batch_size=100, eval_batch_size=16, min_eval_frequency=100, learning_rate=0.01, epsilon=0.0005, job_name=None, cloud=None)[source]

Blocking version of train_async. See documentation for train_async.

mltoolbox.regression.dnn.train_async(train_dataset, eval_dataset, analysis_dir, output_dir, features, layer_sizes, max_steps=5000, num_epochs=None, train_batch_size=100, eval_batch_size=16, min_eval_frequency=100, learning_rate=0.01, epsilon=0.0005, job_name=None, cloud=None)[source]

Train model locally or in the cloud.

Local Training:

Parameters:
  • train_dataset – CsvDataSet
  • eval_dataset – CsvDataSet
  • analysis_dir – The output directory from local_analysis
  • output_dir – Output directory of training.
  • features – File path or features object. Example:

        {
          "col_A": {"transform": "scale", "default": 0.0},
          "col_B": {"transform": "scale", "value": 4},
          # Note col_C is missing, so the default transform is used.
          "col_D": {"transform": "hash_one_hot", "hash_bucket_size": 4},
          "col_target": {"transform": "target"},
          "col_key": {"transform": "key"}
        }

    The keys correspond to the columns in the input files as defined by the schema file during preprocessing. Some notes:

    1. The "key" and "target" transforms are required.
    2. Default values are optional. They are used if the input data has missing values during training and prediction. If not supplied for a column, the default value for a numerical column is that column’s mean value, and for a categorical column the empty string is used.
    3. For numerical columns, the following transforms are supported:
       i) {"transform": "identity"}: does nothing to the number. (default)
       ii) {"transform": "scale"}: scales the column values to [-1, 1].
       iii) {"transform": "scale", "value": a}: scales the column values to [-a, a].
    4. For categorical columns, the following transforms are supported:
       i) {"transform": "one_hot"}: a one-hot vector using the full vocabulary is used. (default)
       ii) {"transform": "embedding", "embedding_dim": d}: each label is embedded into a d-dimensional space.
  • max_steps – Int. Number of training steps to perform.
  • num_epochs – Maximum number of training data epochs on which to train. The training job will run for max_steps or num_epochs, whichever occurs first.
  • train_batch_size – number of rows to train on in one step.
  • eval_batch_size – number of rows to eval in one step. One pass of the eval dataset is done. If eval_batch_size does not perfectly divide the number of eval instances, the last fractional batch is not used.
  • min_eval_frequency – Minimum number of training steps between evaluations.
  • layer_sizes – List. Represents the layers in the fully connected DNN. If the model type is DNN, this must be set. Example: [10, 3, 2] creates three DNN layers where the first layer has 10 nodes, the middle layer has 3 nodes, and the last layer has 2 nodes.
  • learning_rate – tf.train.AdamOptimizer’s learning rate.
  • epsilon – tf.train.AdamOptimizer’s epsilon value.

Cloud Training:

All local training arguments are valid for cloud training. Cloud training takes two additional arguments:

Parameters:
  • cloud – A CloudTrainingConfig object.
  • job_name – Training job name. A default will be picked if None.
Returns:

A google.datalab.utils.Job object that can be used to query state or wait for completion.
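
A local training sketch tying the pieces together (datasets, paths, and column names are the illustrative ones from the earlier examples; the features dict follows the format described above):

    import google.datalab.ml as ml
    from mltoolbox.regression import dnn

    schema = [
        {'name': 'col_key', 'type': 'INTEGER'},
        {'name': 'col_target', 'type': 'FLOAT'},
        {'name': 'col_A', 'type': 'FLOAT'},
    ]
    train_csv = ml.CsvDataSet(file_pattern='./data/train*.csv', schema=schema)
    eval_csv = ml.CsvDataSet(file_pattern='./data/eval*.csv', schema=schema)

    features = {
        'col_key': {'transform': 'key'},
        'col_target': {'transform': 'target'},
        'col_A': {'transform': 'scale'},
    }

    # Train a two-layer DNN locally; analysis_dir is the output of analyze().
    dnn.train(train_dataset=train_csv,
              eval_dataset=eval_csv,
              analysis_dir='./analysis',
              output_dir='./training',
              features=features,
              layer_sizes=[10, 4],
              max_steps=2000)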