google.datalab.ml Module

CloudML Helper Library.

class google.datalab.ml.Job(name, context=None)[source]

Represents a Cloud ML job.

Initializes an instance of a CloudML Job.

Parameters:
  • name – the name of the job. It can be an operation full name (“projects/[project_id]/jobs/[operation_name]”) or just [operation_name].
  • context – an optional Context object providing project_id and credentials.
static submit_training(job_request, job_id=None)[source]

Submit a training job.

Parameters:
  • job_request –

    the arguments of the training job in a dict. For example,

    {
      'package_uris': 'gs://my-bucket/iris/trainer-0.1.tar.gz',
      'python_module': 'trainer.task',
      'scale_tier': 'BASIC',
      'region': 'us-central1',
      'args': {
        'train_data_paths': ['gs://mybucket/data/features_train'],
        'eval_data_paths': ['gs://mybucket/data/features_eval'],
        'metadata_path': 'gs://mybucket/data/metadata.yaml',
        'output_path': 'gs://mybucket/data/mymodel/',
      }
    }

    If ‘args’ is present in job_request and is a dict, it will be expanded to --key value or --key list_item_0 --key list_item_1, ...

  • job_id – id for the training job. If None, an id based on a timestamp will be generated.
Returns:

A Job object representing the cloud training job.
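The ‘args’ expansion described for submit_training can be sketched as a small helper (expand_args is a hypothetical name; the real expansion happens inside the library):

```python
def expand_args(args):
    """Expand an 'args' dict into a flat --key value argument list.

    A list value repeats the flag once per item:
    {'k': ['a', 'b']} becomes ['--k', 'a', '--k', 'b'].
    """
    flat = []
    for key, value in args.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            flat.extend(['--' + key, str(item)])
    return flat
```

For example, expand_args({'max_steps': 1000}) returns ['--max_steps', '1000'].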

class google.datalab.ml.Jobs(filter=None)[source]

Represents a list of Cloud ML jobs for a project.

Initializes an instance of a CloudML Job list that is iterable (“for job in jobs()”).

Parameters:
  • filter – filter string for retrieving jobs, such as “state=FAILED”
  • context – an optional Context object providing project_id and credentials.
  • api – an optional CloudML API client.
get_iterator()[source]

Get iterator of jobs so it can be used as “for job in Jobs().get_iterator()”.

class google.datalab.ml.Summary(paths)[source]

Represents TensorFlow summary events from files under specified directories.

Initializes an instance of a Summary.

Parameters:paths – a list of paths to directories which hold TensorFlow events files. Can be local or GCS paths. Wildcards are allowed.
get_events(event_names)[source]

Get all events as pandas DataFrames given a list of names.

Parameters:event_names – A list of events to get.
Returns:
A list with the same length as event_names. Each element is a dictionary
{dir1: DataFrame1, dir2: DataFrame2, ...}. Multiple directories may contain events with the same name, but they are different events (e.g. ‘loss’ under train_set/, and ‘loss’ under eval_set/).
list_events()[source]

List all scalar events in the directory.

Returns:A dictionary. Key is the name of an event. Value is a set of dirs that contain that event.
plot(event_names, x_axis='step')[source]

Plots a list of events. Each event (a dir + event_name) is represented as a line in the graph.
Parameters:
  • event_names – A list of events to plot. Each event_name may correspond to multiple events, each in a different directory.
  • x_axis – whether to use “step” or “time” as the x-axis.
class google.datalab.ml.TensorBoard[source]

Start, shutdown, and list TensorBoard instances.

static list()[source]

List running TensorBoard instances.

static start(logdir)[source]

Start a TensorBoard instance.

Parameters:logdir – the logdir to run TensorBoard on.
Raises:Exception if the instance cannot be started.
static stop(pid)[source]

Shut down a specific process.

Parameters:pid – the pid of the process to shutdown.
class google.datalab.ml.CsvDataSet(file_pattern, schema=None, schema_file=None)[source]

DataSet based on CSV files and schema.

Parameters:
  • file_pattern – A list of CSV files, or a string. Can contain wildcards in file names. Can be local or GCS paths.
  • schema – A BigQuery schema object in the form of [{‘name’: ‘col1’, ‘type’: ‘STRING’}, {‘name’: ‘col2’, ‘type’: ‘INTEGER’}], or a single string of the form ‘col1:STRING,col2:INTEGER,col3:FLOAT’.
  • schema_file – A JSON serialized schema file. If schema is None, the schema is loaded from schema_file if it is not None.
Raises:
ValueError if both schema and schema_file are None.
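The two accepted schema forms are interchangeable; a minimal sketch of the string-to-list conversion (parse_schema_string is a hypothetical helper, not the library's own parser):

```python
def parse_schema_string(schema_str):
    """Convert 'col1:STRING,col2:INTEGER' into a BigQuery-style schema list."""
    schema = []
    for field in schema_str.split(','):
        name, type_name = field.strip().split(':')
        schema.append({'name': name, 'type': type_name})
    return schema
```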
input_files

Returns the file list that was given to this class without globbing files.

sample(n)[source]

Samples data into a Pandas DataFrame.

Parameters:n – the number of rows to sample.

Returns:A dataframe containing sampled data.
Raises:Exception if n is larger than number of rows.
size

The size of the data. If the underlying data source changes, it may be outdated.

class google.datalab.ml.BigQueryDataSet(sql=None, table=None)[source]

DataSet based on BigQuery table or query.

Parameters:
  • sql – A SQL query string, or a SQL Query module defined with ‘%%bq query --name [query_name]’
  • table – A table name in the form of ‘dataset.table’ or ‘project.dataset.table’.
Raises:

ValueError if both sql and table are set, or both are None.
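The mutual-exclusion rule between sql and table can be sketched as follows (check_bq_dataset_args is a hypothetical helper mirroring the documented ValueError, not library code):

```python
def check_bq_dataset_args(sql=None, table=None):
    """Require exactly one of sql/table, as the constructor documents."""
    if (sql is None) == (table is None):
        raise ValueError('Exactly one of sql and table must be set.')
    return sql if sql is not None else table
```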

sample(n)[source]

Samples data into a Pandas DataFrame. Note that it calls BigQuery, so it will incur cost.

Parameters:n – the number of rows to sample. Note that the number of rows returned is approximate.
Returns:A dataframe containing sampled data.
Raises:Exception if n is larger than number of rows.
size

The size of the data. If the underlying data source changes, it may be outdated.

class google.datalab.ml.Models(project_id=None)[source]

Represents a list of Cloud ML models for a project.

Parameters:project_id – project_id of the models. If not provided, default project_id will be used.
create(model_name)[source]

Create a model.

Parameters:model_name – the short name of the model, such as “iris”.
Returns:If successful, returns information about the model, such as {u’regions’: [u’us-central1’], u’name’: u’projects/myproject/models/mymodel’}
Raises:Exception if the model creation fails.
delete(model_name)[source]

Delete a model.

Parameters:model_name – the name of the model. It can be a model full name (“projects/[project_id]/models/[model_name]”) or just [model_name].
describe(model_name)[source]

Print information of a specified model.

Parameters:model_name – the name of the model to print details on.
get_iterator()[source]

Get iterator of models so it can be used as “for model in Models().get_iterator()”.

get_model_details(model_name)[source]

Get details of the specified model from CloudML Service.

Parameters:model_name – the name of the model. It can be a model full name (“projects/[project_id]/models/[model_name]”) or just [model_name].
Returns:A dictionary of the model details.
list(count=10)[source]

List models under the current project in a table view.

Parameters:count – upper limit of the number of models to list.
Raises:Exception if it is called in a non-IPython environment.
class google.datalab.ml.ModelVersions(model_name, project_id=None)[source]

Represents a list of versions for a Cloud ML model.

Parameters:
  • model_name – the name of the model. It can be a model full name (“projects/[project_id]/models/[model_name]”) or just [model_name].
  • project_id – project_id of the models. If not provided and model_name is not a full name (not including project_id), default project_id will be used.
delete(version_name)[source]

Delete a version of model.

Parameters:version_name – the name of the version in short form, such as “v1”.
deploy(version_name, path)[source]

Deploy a model version to the cloud.

Parameters:
  • version_name – the name of the version in short form, such as “v1”.
  • path – the Google Cloud Storage path (gs://...) which contains the model files.
Raises:Exception if the path is invalid or does not contain expected files, or if the service returns an invalid response.
describe(version_name)[source]

Print information of a specified model version.

Parameters:version_name – the name of the version in short form, such as “v1”.
get_iterator()[source]

Get iterator of versions so it can be used as “for v in ModelVersions(model_name).get_iterator()”.

get_version_details(version_name)[source]

Get details of a version.

Parameters:version_name – the name of the version in short form, such as “v1”.

Returns: a dictionary containing the version details.

list()[source]

List versions under the current model in a table view.

Raises:Exception if it is called in a non-IPython environment.
predict(version_name, data)[source]

Get prediction results from features instances.

Parameters:
  • version_name – the name of the version used for prediction.
  • data – typically a list of instances to be submitted for prediction. The format of an instance depends on the model. For example, a structured data model may require a CSV line for each instance. Note that online prediction only works on models that take one placeholder value, such as a string encoding a CSV line.
Returns:

A list of prediction results for the given instances. Each element is a dictionary representing the output mapping from the graph. For example:

[{“predictions”: 1, “score”: [0.00078, 0.71406, 0.28515]},
 {“predictions”: 1, “score”: [0.00244, 0.99634, 0.00121]}]
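For a model whose single placeholder takes a CSV line per instance, the data list can be built with the stdlib csv module (a sketch; rows_to_csv_instances is a hypothetical helper, and the exact instance format depends on your model):

```python
import csv
import io

def rows_to_csv_instances(rows):
    """Encode each row (a list of values) as one CSV-line string."""
    instances = []
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf).writerow(row)
        instances.append(buf.getvalue().rstrip('\r\n'))
    return instances
```

The result, e.g. ['5.1,3.5,1.4,0.2'], can then be passed as data to predict(version_name, data).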

class google.datalab.ml.ConfusionMatrix(cm, labels)[source]

Represents a confusion matrix.

Parameters:
  • cm – a 2-dimensional matrix with row index being target, column index being predicted, and values being count.
  • labels – the labels whose order matches the row/column indexes.
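The cm argument is a count matrix with rows indexed by target and columns by predicted label; a sketch of building one from label lists without the BigQuery/CSV helpers (counts_matrix is a hypothetical name):

```python
def counts_matrix(targets, predicted, labels):
    """Build a confusion matrix: rows are targets, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(targets, predicted):
        cm[index[t]][index[p]] += 1
    return cm
```

ConfusionMatrix(counts_matrix(targets, predicted, labels), labels) would then be plottable, assuming the library is installed.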
static from_bigquery(sql)[source]

Create a ConfusionMatrix from a BigQuery table or query.

Parameters:sql – Can be one of: a SQL query string, a BigQuery table string, or a Query object defined with ‘%%bq query --name [query_name]’. The query results or table must include “target” and “predicted” columns.
Returns:

A ConfusionMatrix that can be plotted.

Raises:

ValueError if query results or table does not include ‘target’ or ‘predicted’ columns.

static from_csv(input_csv, headers=None, schema_file=None)[source]

Create a ConfusionMatrix from a csv file.

Parameters:
  • input_csv – Path to a CSV file (with no header). Can be a local or GCS path.
  • headers – CSV headers. If present, it must include ‘target’ and ‘predicted’.
  • schema_file – Path to a JSON file containing BigQuery schema. Used if “headers” is None. If present, it must include ‘target’ and ‘predicted’ columns.
Returns:

A ConfusionMatrix that can be plotted.

Raises:

ValueError if both headers and schema_file are None, or if they do not include ‘target’ or ‘predicted’ columns.

plot()[source]

Plot the confusion matrix.

class google.datalab.ml.FeatureSliceView[source]

Represents a feature slice view.

plot(data)[source]

Plots a feature slice view on given data.

Parameters:data – Can be one of:
  • A string of SQL query.
  • A SQL query module defined by “%%sql --module module_name”.
  • A pandas DataFrame.

Regardless of data type, it must include the following columns:
  • “feature”: identifies a slice of features. For example: “petal_length:4.0-4.2”.
  • “count”: number of instances in that slice of features.

All other columns are viewed as metrics for the feature slice. At least one is required.
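A sketch of aggregating raw per-instance results into the required “feature” and “count” columns plus one metric (slice_accuracy and its record layout are assumptions; the view itself accepts SQL or a DataFrame):

```python
def slice_accuracy(rows):
    """Aggregate (feature_slice, is_correct) pairs into slice records.

    Each record carries 'feature', 'count', and an 'accuracy' metric,
    matching the columns FeatureSliceView.plot expects.
    """
    totals, correct = {}, {}
    for feature, is_correct in rows:
        totals[feature] = totals.get(feature, 0) + 1
        correct[feature] = correct.get(feature, 0) + int(is_correct)
    return [{'feature': f, 'count': totals[f], 'accuracy': correct[f] / totals[f]}
            for f in sorted(totals)]
```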

class google.datalab.ml.CloudTrainingConfig[source]

A config namedtuple containing cloud specific configurations for CloudML training.

Fields:
  • region: the region of the training job to be submitted. For example, “us-central1”. Run “gcloud compute regions list” to get a list of regions.
  • scale_tier: specifies the machine types and the number of replicas for workers and parameter servers. For example, “STANDARD_1”. See https://cloud.google.com/ml/reference/rest/v1beta1/projects.jobs#scaletier for a list of accepted values.
  • master_type: specifies the type of virtual machine to use for your training job’s master worker. Must be set when scale_tier is set to CUSTOM. See the link in “scale_tier”.
  • worker_type: specifies the type of virtual machine to use for your training job’s worker nodes. Must be set when scale_tier is set to CUSTOM.
  • parameter_server_type: specifies the type of virtual machine to use for your training job’s parameter server. Must be set when scale_tier is set to CUSTOM.
  • worker_count: the number of worker replicas to use for the training job. Each replica in the cluster will be of the type specified in “worker_type”. Must be set when scale_tier is set to CUSTOM.
  • parameter_server_count: the number of parameter server replicas to use. Each replica in the cluster will be of the type specified in “parameter_server_type”. Must be set when scale_tier is set to CUSTOM.

Create new instance of CloudTrainingConfig(region, scale_tier, master_type, worker_type, parameter_server_type, worker_count, parameter_server_count)
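Because CloudTrainingConfig is a namedtuple, instances are immutable and can be built positionally or by keyword. A sketch with an equivalent stand-in definition and a CUSTOM-tier example (the machine types and replica counts are illustrative values, not defaults):

```python
import collections

# Stand-in mirroring the documented fields; the real class lives in google.datalab.ml.
CloudTrainingConfig = collections.namedtuple(
    'CloudTrainingConfig',
    ['region', 'scale_tier', 'master_type', 'worker_type',
     'parameter_server_type', 'worker_count', 'parameter_server_count'])

# The CUSTOM scale tier requires explicit machine types and replica counts.
config = CloudTrainingConfig(
    region='us-central1',
    scale_tier='CUSTOM',
    master_type='complex_model_m',
    worker_type='complex_model_m',
    parameter_server_type='large_model',
    worker_count=4,
    parameter_server_count=2)
```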