datalab.bigquery Module

Google Cloud Platform library - BigQuery Functionality.

datalab.bigquery.wait_all(jobs, timeout=None)[source]

Return when all of the specified jobs have completed or timeout expires.

Parameters:
  • jobs – a single Job or list of Jobs to wait on.
  • timeout – a timeout in seconds to wait for. None (the default) means no timeout.
Returns:

A list of completed Jobs. If the call timed out this will be shorter than the list of jobs supplied as a parameter.

datalab.bigquery.wait_any(jobs, timeout=None)[source]

Return when at least one of the specified jobs has completed or timeout expires.

Parameters:
  • jobs – a list of Jobs to wait on.
  • timeout – a timeout in seconds to wait for. None (the default) means no timeout.
Returns:

Once at least one job completes, a list of all completed jobs. If the call times out then an empty list will be returned.
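
As an illustration, these functions pair naturally with execute_async (documented under Query below). A minimal sketch, assuming two independent queries against hypothetical tables:

  import datalab.bigquery as bq

  # Start two independent queries without blocking (table names are hypothetical).
  job1 = bq.Query('SELECT COUNT(*) FROM [my_dataset.logs_a]').execute_async()
  job2 = bq.Query('SELECT COUNT(*) FROM [my_dataset.logs_b]').execute_async()

  # Block until both jobs finish, or give up after five minutes.
  completed = bq.wait_all([job1, job2], timeout=300)
  if len(completed) < 2:
      print('Timed out waiting for one or more jobs')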

class datalab.bigquery.CSVOptions(delimiter=u',', skip_leading_rows=0, encoding=u'utf-8', quote=u'"', allow_quoted_newlines=False, allow_jagged_rows=False)[source]

Initialize an instance of CSV options.

Parameters:
  • delimiter – The separator for fields in a CSV file. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data as raw binary (default ‘,’).
  • skip_leading_rows – A number of rows at the top of a CSV file to skip (default 0).
  • encoding – The character encoding of the data, either ‘utf-8’ (the default) or ‘iso-8859-1’.
  • quote – The value used to quote data sections in a CSV file (default ‘"’). If your data does not contain quoted sections, set the property value to an empty string. If your data contains quoted newline characters, you must also enable allow_quoted_newlines.
  • allow_quoted_newlines – If True, allow quoted data sections in CSV files that contain newline characters (default False).
  • allow_jagged_rows – If True, accept rows in CSV files that are missing trailing optional columns; the missing values are treated as nulls (default False).
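
As an illustration, a CSVOptions instance is typically passed to Table.load or FederatedTable.from_storage (both documented below). A sketch only; the bucket path and table name are hypothetical:

  import datalab.bigquery as bq

  # Hypothetical pipe-delimited files with a single header row.
  options = bq.CSVOptions(delimiter='|', skip_leading_rows=1)

  table = bq.Table('my_dataset.events')
  # Load from GCS using the CSV options; returns the completed load Job, or None.
  job = table.load('gs://my-bucket/events/*.csv', mode='append',
                   source_format='csv', csv_options=options)
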
class datalab.bigquery.Dataset(name, context=None)[source]

Represents a list of BigQuery tables in a dataset.

Initializes an instance of a Dataset.

Parameters:
  • name – the name of the dataset, as a string or (project_id, dataset_id) tuple.
  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
Raises:

Exception if the name is invalid.

create(friendly_name=None, description=None)[source]

Creates the Dataset with the specified friendly name and description.

Parameters:
  • friendly_name – (optional) the friendly name for the dataset if it is being created.
  • description – (optional) a description for the dataset if it is being created.
Returns:

The Dataset.

Raises:

Exception if the Dataset could not be created.
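
For example, a typical create-if-missing pattern looks like the sketch below; the dataset name and descriptions are hypothetical:

  import datalab.bigquery as bq

  ds = bq.Dataset('my_dataset')  # hypothetical dataset name
  if not ds.exists():
      ds.create(friendly_name='My dataset', description='Scratch tables for analysis')

  # Enumerate the tables currently in the dataset.
  for table in ds.tables():
      print(table.name)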

delete(delete_contents=False)[source]

Issues a request to delete the dataset.

Parameters:delete_contents – if True, any tables and views in the dataset will be deleted. If False and the dataset is non-empty, an exception will be raised.
Returns:None on success.
Raises:Exception if the delete fails (including if the dataset did not exist).
description

The description of the dataset, if any.

Raises:Exception if the dataset exists but the metadata for the dataset could not be retrieved.
exists()[source]

Checks if the dataset exists.

Returns:True if the dataset exists; False otherwise.
Raises:Exception if the dataset exists but the metadata for the dataset could not be retrieved.
friendly_name

The friendly name of the dataset, if any.

Raises:Exception if the dataset exists but the metadata for the dataset could not be retrieved.
name

The DatasetName named tuple (project_id, dataset_id) for the dataset.

tables()[source]

Returns an iterator for iterating through the Tables in the dataset.

update(friendly_name=None, description=None)[source]

Selectively updates Dataset information.

Parameters:
  • friendly_name – if not None, the new friendly name.
  • description – if not None, the new description.

Returns:

views()[source]

Returns an iterator for iterating through the Views in the dataset.

class datalab.bigquery.DatasetName(project_id, dataset_id)

A namedtuple for Dataset names.

Parameters:
  • project_id – the project id for the dataset.
  • dataset_id – the dataset id for the dataset.
dataset_id

Alias for field number 1

project_id

Alias for field number 0

class datalab.bigquery.Datasets(project_id=None, context=None)[source]

Iterator class for enumerating the datasets in a project.

Initialize the Datasets object.

Parameters:
  • project_id – the ID of the project whose datasets you want to list. If None defaults to the project in the context.
  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
class datalab.bigquery.FederatedTable[source]

Create an external table reference. Do not call this directly; use factory method(s).

static from_storage(source, source_format=u'csv', csv_options=None, ignore_unknown_values=False, max_bad_records=0, compressed=False, schema=None)[source]

Create an external table for a GCS object.

Parameters:
  • source – the URL of the source object(s). Can include a wildcard ‘*’ at the end of the item name. Can be a single source or a list.
  • source_format – the format of the data, ‘csv’ or ‘json’; default ‘csv’.
  • csv_options – For CSV files, the options such as quote character and delimiter.
  • ignore_unknown_values – If True, accept rows that contain values that do not match the schema; the unknown values are ignored (default False).
  • max_bad_records – The maximum number of bad records that are allowed (and ignored) before returning an ‘invalid’ error in the Job result (default 0).
  • compressed – whether the data is GZ compressed or not (default False). Note that compressed data can be used as a federated table but cannot be loaded into a BQ Table.
  • schema – the schema of the data. This is required for this table to be used as a federated table or to be loaded using a Table object that itself has no schema (default None).
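
As a sketch, an external table is usually constructed here and then handed to a Query via its data_sources argument (see Query below). The bucket path, field names, and the name ‘scores’ bound in data_sources are all hypothetical:

  import datalab.bigquery as bq

  # A schema is required so the CSV can be queried as a federated table.
  schema = bq.Schema.from_data([{'name': 'alice', 'score': 10}])
  options = bq.CSVOptions(skip_leading_rows=1)

  ext = bq.FederatedTable.from_storage('gs://my-bucket/scores/*.csv',
                                       source_format='csv',
                                       csv_options=options,
                                       schema=schema)

  # The key in data_sources is the name the SQL uses to refer to the external data.
  q = bq.Query('SELECT name, score FROM scores WHERE score > 5',
               data_sources={'scores': ext})
  results = q.results()
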
class datalab.bigquery.Job(job_id, context)[source]

Represents a BigQuery Job.

Initializes an instance of a Job.

Parameters:
  • job_id – the BigQuery job ID corresponding to this job.
  • context – a Context object providing project_id and credentials.
class datalab.bigquery.Query(sql, context=None, values=None, udfs=None, data_sources=None, **kwargs)[source]

Represents a Query object that encapsulates a BigQuery SQL query.

This object can be used to execute SQL queries and retrieve results.

Initializes an instance of a Query object.
Note that either values or kwargs may be used, but not both.
Parameters:
  • sql

    the BigQuery SQL query string to execute, or a SqlStatement object. The latter will have any variable references replaced before being associated with the Query (i.e. once constructed the SQL associated with a Query is static).

    It is possible to have variable references in a query string too provided the variables are passed as keyword arguments to this constructor.

  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
  • values – a dictionary used to expand variables if passed a SqlStatement or a string with variable references.
  • udfs – array of UDFs referenced in the SQL.
  • data_sources – dictionary of federated (external) tables referenced in the SQL.
  • kwargs – arguments to use when expanding the variables if passed a SqlStatement or a string with variable references.
Raises:

Exception if expansion of any variables failed.

execute(table_name=None, table_mode=u'create', use_cache=True, priority=u'interactive', allow_large_results=False, dialect=None, billing_tier=None)[source]

Initiate the query, blocking until complete and then return the results.

Parameters:
  • table_name – the result table name as a string or TableName; if None (the default), then a temporary table will be used.
  • table_mode – one of ‘create’, ‘overwrite’ or ‘append’. If ‘create’ (the default), the request will fail if the table exists.
  • use_cache – whether to use past query results or ignore cache. Has no effect if destination is specified (default True).
  • priority – one of ‘batch’ or ‘interactive’ (default). ‘interactive’ jobs should be scheduled to run quickly but are subject to rate limits; ‘batch’ jobs could be delayed by as much as three hours but are not rate-limited.
  • allow_large_results – whether to allow large results; i.e. compressed data over 100MB. This is slower and requires a table_name to be specified (default False).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

The QueryResultsTable for the query.

Raises:

Exception if query could not be executed.
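
For example, a minimal synchronous run that materializes the results into a named table might look like the sketch below; the destination table name is hypothetical and legacy SQL syntax is assumed, using a well-known public sample table:

  import datalab.bigquery as bq

  q = bq.Query('SELECT word, corpus FROM [publicdata:samples.shakespeare] LIMIT 10')

  # Run synchronously, writing the results to a table and overwriting it if present.
  results = q.execute(table_name='my_dataset.shakespeare_sample',
                      table_mode='overwrite')
  df = results.to_dataframe()  # QueryResultsTable supports to_dataframe()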

execute_async(table_name=None, table_mode=u'create', use_cache=True, priority=u'interactive', allow_large_results=False, dialect=None, billing_tier=None)[source]

Initiate the query and return a QueryJob.

Parameters:
  • table_name – the result table name as a string or TableName; if None (the default), then a temporary table will be used.
  • table_mode – one of ‘create’, ‘overwrite’ or ‘append’. If ‘create’ (the default), the request will fail if the table exists.
  • use_cache – whether to use past query results or ignore cache. Has no effect if destination is specified (default True).
  • priority – one of ‘batch’ or ‘interactive’ (default). ‘interactive’ jobs should be scheduled to run quickly but are subject to rate limits; ‘batch’ jobs could be delayed by as much as three hours but are not rate-limited.
  • allow_large_results – whether to allow large results; i.e. compressed data over 100MB. This is slower and requires a table_name to be specified (default False).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryJob.

Raises:

Exception if query could not be executed.

execute_dry_run(dialect=None, billing_tier=None)[source]

Dry run a query, to check the validity of the query and return some useful statistics.

Parameters:
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A dict with ‘cacheHit’ and ‘totalBytesProcessed’ fields.

Raises:

An exception if the query was malformed.
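
As an illustration, a dry run is a cheap way to estimate cost before executing; the table name below is hypothetical:

  import datalab.bigquery as bq

  q = bq.Query('SELECT COUNT(*) FROM [my_dataset.logs]')
  stats = q.execute_dry_run()

  # The returned dict exposes 'cacheHit' and 'totalBytesProcessed'.
  gb = int(stats['totalBytesProcessed']) / float(1024 ** 3)
  print('Would process %.2f GB (cache hit: %s)' % (gb, stats['cacheHit']))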

extract(storage_uris, format=u'csv', csv_delimiter=u',', csv_header=True, compress=False, use_cache=True, dialect=None, billing_tier=None)[source]

Exports the query results to GCS.

Parameters:
  • storage_uris – the destination URI(s). Can be a single URI or a list.
  • format – the format to use for the exported data; one of ‘csv’, ‘json’, or ‘avro’ (default ‘csv’).
  • csv_delimiter – for csv exports, the field delimiter to use (default ‘,’).
  • csv_header – for csv exports, whether to include an initial header line (default True).
  • compress – whether to compress the data on export. Compression is not supported for AVRO format (default False).
  • use_cache – whether to use cached results or not (default True).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A Job object for the export Job if it was completed successfully; else None.

Raises:

An Exception if the query or extract failed.

extract_async[source]

Exports the query results to GCS. Returns a Job immediately.

Note that there are two jobs that may need to be run sequentially, one to run the query, and the second to extract the resulting table. These are wrapped by a single outer Job.

If the query has already been executed and you would prefer to get a Job just for the extract, you can call extract_async on the QueryResultsTable instead; i.e.:

query.results().extract_async(...)
Parameters:
  • storage_uris – the destination URI(s). Can be a single URI or a list.
  • format – the format to use for the exported data; one of ‘csv’, ‘json’, or ‘avro’ (default ‘csv’).
  • csv_delimiter – for CSV exports, the field delimiter to use (default ‘,’).
  • csv_header – for CSV exports, whether to include an initial header line (default True).
  • compress – whether to compress the data on export. Compression is not supported for AVRO format (default False).
  • use_cache – whether to use cached results or not (default True).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A Job for the combined (execute, extract) task that will in turn return the Job object for the completed extract task when done; else None.

Raises:

An Exception if the query failed.

results(use_cache=True, dialect=None, billing_tier=None)[source]

Retrieves table of results for the query. May block if the query must be executed first.

Parameters:
  • use_cache – whether to use cached results or not. Ignored if append is specified.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryResultsTable containing the result set.

Raises:

Exception if the query could not be executed or query response was malformed.

sample(count=5, fields=None, sampling=None, use_cache=True, dialect=None, billing_tier=None)[source]

Retrieves a sampling of rows for the query.

Parameters:
  • count – an optional count of rows to retrieve which is used if a specific sampling is not specified (default 5).
  • fields – the list of fields to sample (default None implies all).
  • sampling – an optional sampling strategy to apply to the table.
  • use_cache – whether to use cached results or not (default True).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryResultsTable containing a sampling of the result set.

Raises:

Exception if the query could not be executed or query response was malformed.

static sampling_query(sql, context, fields=None, count=5, sampling=None, udfs=None, data_sources=None)[source]

Returns a sampling Query for the SQL object.

Parameters:
  • sql – the SQL statement (string) or Query object to sample.
  • context – a Context object providing project_id and credentials.
  • fields – an optional list of field names to retrieve.
  • count – an optional count of rows to retrieve which is used if a specific sampling is not specified.
  • sampling – an optional sampling strategy to apply to the table.
  • udfs – array of UDFs referenced in the SQL.
  • data_sources – dictionary of federated (external) tables referenced in the SQL.
Returns:

A Query object for sampling the table.

scripts

Get the code for any Javascript UDFs used in the query.

sql

Get the SQL for the query.

to_dataframe(start_row=0, max_rows=None, use_cache=True, dialect=None, billing_tier=None)[source]

Exports the query results to a Pandas dataframe.

Parameters:
  • start_row – the row of the table at which to start the export (default 0).
  • max_rows – an upper limit on the number of rows to export (default None).
  • use_cache – whether to use cached results or not (default True).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A Pandas dataframe containing the table data.
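
For example, pulling a bounded slice of results straight into Pandas; the query below is hypothetical:

  import datalab.bigquery as bq

  q = bq.Query('SELECT name, value FROM [my_dataset.metrics]')
  # Fetch at most 1000 rows of the result set as a dataframe.
  df = q.to_dataframe(max_rows=1000)
  print(df.describe())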

to_file(path, format=u'csv', csv_delimiter=u',', csv_header=True, use_cache=True, dialect=None, billing_tier=None)[source]

Save the results to a local file in CSV format.

Parameters:
  • path – path on the local filesystem for the saved results.
  • format – the format to use for the exported data; currently only ‘csv’ is supported.
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
  • use_cache – whether to use cached results or not.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

The path to the local file.

Raises:

An Exception if the operation failed.

to_file_async[source]

Save the results to a local file in CSV format. Returns a Job immediately.

Parameters:
  • path – path on the local filesystem for the saved results.
  • format – the format to use for the exported data; currently only ‘csv’ is supported.
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
  • use_cache – whether to use cached results or not.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A Job for the save that returns the path to the local file on completion.

Raises:

An Exception if the operation failed.

to_view(view_name)[source]

Create a View from this Query.

Parameters:view_name – the name of the View either as a string or a 3-part tuple (projectid, datasetid, name).
Returns:A View for the Query.
class datalab.bigquery.QueryJob(job_id, table_name, sql, context)[source]

Represents a BigQuery Query Job.

Initializes a QueryJob object.

Parameters:
  • job_id – the ID of the query job.
  • table_name – the name of the table where the query results will be stored.
  • sql – the SQL statement that was executed for the query.
  • context – the Context object providing project_id and credentials that was used when executing the query.
bytes_processed

The number of bytes processed, or None if the job is not complete.

cache_hit

Whether the query results were obtained from the cache or not, or None if not complete.

results

Get the table used for the results of the query. If the query is incomplete, this blocks.

Raises:Exception if we timed out waiting for results or the query failed.
sql

The SQL statement that was executed for the query.

total_rows

The total number of rows in the result, or None if not complete.

wait(timeout=None)[source]

Wait for the job to complete, or a timeout to happen.

This is more efficient than the version in the base Job class, in that we can use a call that blocks for the poll duration rather than a sleep. That means we shouldn’t block unnecessarily long and can also poll less.
Parameters:timeout – how long to wait (in seconds) before giving up; default None which means no timeout.
Returns:The QueryJob
class datalab.bigquery.QueryResultsTable(name, context, job, is_temporary=False)[source]

A subclass of Table specifically for Query results.

The primary differences are the additional properties job_id and sql.

Initializes an instance of a Table object.

Parameters:
  • name – the name of the table either as a string or a 3-part tuple (projectid, datasetid, name).
  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
  • job – the QueryJob associated with these results.
  • is_temporary – if True, this is a short-lived table for intermediate results (default False).
is_temporary

Whether this is a short-lived table or not.

job

The QueryJob object that caused the table to be populated.

job_id

The ID of the query job that caused the table to be populated.

sql

The SQL statement for the query that populated the table.

class datalab.bigquery.QueryStats(total_bytes, is_cached)[source]

A wrapper for statistics returned by a dry run query. Useful so we can get an HTML representation in a notebook.

class datalab.bigquery.Sampling[source]

Provides common sampling strategies.

Sampling strategies can be used for sampling tables or queries.

They are implemented as functions that take in a SQL statement representing the table or query that should be sampled, and return a new SQL statement that limits the result set in some manner.

static default(fields=None, count=5)[source]

Provides a simple default sampling strategy which limits the result set by a count.

Parameters:
  • fields – an optional list of field names to retrieve.
  • count – optional number of rows to limit the sampled results to.
Returns:

A sampling function that can be applied to get a random sampling.

static hashed(field_name, percent, fields=None, count=0)[source]

Provides a sampling strategy based on hashing and selecting a percentage of data.

Parameters:
  • field_name – the name of the field to hash.
  • percent – the percentage of the resulting hashes to select.
  • fields – an optional list of field names to retrieve.
  • count – optional maximum count of rows to pick.
Returns:

A sampling function that can be applied to get a hash-based sampling.

static random(percent, fields=None, count=0)[source]

Provides a sampling strategy that picks a semi-random set of rows.

Parameters:
  • percent – the percentage of the resulting hashes to select.
  • fields – an optional list of field names to retrieve.
  • count – maximum number of rows to limit the sampled results to (default 5).
Returns:

A sampling function that can be applied to get some random rows. In order for this to provide a good random sample, percent should be chosen to be ~count/#rows where #rows is the number of rows in the object (query, view or table) being sampled. The rows will be returned in order; i.e. the order itself is not randomized.

static sampling_query(sql, fields=None, count=5, sampling=None)[source]

Returns a sampling query for the SQL object.

Parameters:
  • sql – the SQL object to sample.
  • fields – an optional list of field names to retrieve.
  • count – an optional count of rows to retrieve which is used if a specific sampling is not specified.
  • sampling – an optional sampling strategy to apply to the table.
Returns:

A SQL query string for sampling the input sql.

static sorted(field_name, ascending=True, fields=None, count=5)[source]

Provides a sampling strategy that picks from an ordered set of rows.

Parameters:
  • field_name – the name of the field to sort the rows by.
  • ascending – whether to sort in ascending direction or not.
  • fields – an optional list of field names to retrieve.
  • count – optional number of rows to limit the sampled results to.
Returns:

A sampling function that can be applied to get the initial few rows.
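
As a sketch, sampling strategies are passed to the sample methods on Table, View, or Query; the table and field names below are hypothetical:

  import datalab.bigquery as bq

  table = bq.Table('my_dataset.events')  # hypothetical table

  # Roughly 1% of rows, selected by hashing user_id, capped at 100 rows.
  hashed = bq.Sampling.hashed('user_id', percent=1, count=100)
  sample = table.sample(sampling=hashed)

  # The 20 most recent rows, ordered by a timestamp field.
  newest = table.sample(sampling=bq.Sampling.sorted('timestamp', ascending=False, count=20))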

class datalab.bigquery.Schema(definition=None)[source]

Represents the schema of a BigQuery table as a flattened list of objects representing fields.

Each field object has name, data_type, mode and description properties. Nested fields get flattened with their fully-qualified names. So a Schema that has an object A with nested field B will be represented as [(name: ‘A’, ...), (name: ‘A.B’, ...)].

Initializes a Schema from its raw JSON representation, a Pandas Dataframe, or a list.

Parameters:definition – a definition of the schema as a list of dictionaries with ‘name’ and ‘type’ entries and possibly ‘mode’ and ‘description’ entries. Only used if no data argument was provided. ‘mode’ can be ‘NULLABLE’, ‘REQUIRED’ or ‘REPEATED’. For the allowed types, see: https://cloud.google.com/bigquery/preparing-data-for-bigquery#datatypes
class Field(name, data_type, mode=u'NULLABLE', description=u'')[source]

Represents a single field in a Table schema.

This has the properties:

  • name: the flattened, fully-qualified name of the field.
  • data_type: the type of the field as a string (‘INTEGER’, ‘BOOLEAN’, ‘FLOAT’, ‘STRING’ or ‘TIMESTAMP’).
  • mode: the mode of the field; ‘NULLABLE’ by default.
  • description: a description of the field, if known; empty string by default.
Schema.find(name)[source]

Get the index of a field in the flattened list given its (fully-qualified) name.

Parameters:name – the fully-qualified name of the field.
Returns:The index of the field, if found; else -1.
static Schema.from_data(source)[source]

Infers a table/view schema from its JSON representation, a list of records, or a Pandas dataframe.

Parameters:source

the Pandas Dataframe, a dictionary representing a record, a list of heterogeneous data (record) or homogeneous data (list of records) from which to infer the schema, or a definition of the schema as a list of dictionaries with ‘name’ and ‘type’ entries and possibly ‘mode’ and ‘description’ entries. Only used if no data argument was provided. ‘mode’ can be ‘NULLABLE’, ‘REQUIRED’ or ‘REPEATED’. For the allowed types, see: https://cloud.google.com/bigquery/preparing-data-for-bigquery#datatypes

Note that there is potential ambiguity when passing a list of lists or a list of dicts between whether that should be treated as a list of records or a single record that is a list. The heuristic used is to check the length of the entries in the list; if they are equal then a list of records is assumed. To avoid this ambiguity you can instead use the Schema.from_record method which assumes a single record, in either list of values or dictionary of key-values form.

Returns:A Schema for the data.
static Schema.from_dataframe(dataframe, default_type=u'STRING')[source]

Infer a BigQuery table schema from a Pandas dataframe. Note that if you don’t explicitly set the types of the columns in the dataframe, they may be of a type that forces coercion to STRING, so even though the fields in the dataframe themselves may be numeric, the type in the derived schema may not be. Hence it is prudent to make sure the Pandas dataframe is typed correctly.
Parameters:
  • dataframe – The DataFrame.
  • default_type – The default big query type in case the type of the column does not exist in the schema. Defaults to ‘STRING’.
Returns:

A Schema.

static Schema.from_record(source)[source]

Infers a table/view schema from a single record that can contain a list of fields or a dictionary of fields. The type of the elements is used for the types in the schema. For a dict, key names are used for column names while for a list, the field names are simply named ‘Column1’, ‘Column2’, etc. Note that if using a dict you may want to use an OrderedDict to ensure column ordering is deterministic.

Parameters:source – The list of field values or dictionary of key/values.
Returns:A Schema for the data.
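
For example, the two inference helpers above can be used as in the sketch below; the field names are hypothetical:

  import collections
  import datalab.bigquery as bq

  # Infer a schema from a homogeneous list of records.
  records = [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 25}]
  schema = bq.Schema.from_data(records)

  # Infer from a single record; an OrderedDict keeps the column order deterministic.
  record = collections.OrderedDict([('name', 'alice'), ('age', 30)])
  schema2 = bq.Schema.from_record(record)
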
class datalab.bigquery.Table(name, context=None)[source]

Represents a Table object referencing a BigQuery table.

Initializes an instance of a Table object. The Table need not exist yet.

Parameters:
  • name – the name of the table either as a string or a 3-part tuple (projectid, datasetid, name). If a string, it must have the form ‘<project>:<dataset>.<table>’ or ‘<dataset>.<table>’.
  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
Raises:

Exception if the name is invalid.

create(schema, overwrite=False)[source]

Create the table with the specified schema.

Parameters:
  • schema – the schema to use to create the table. Should be a list of dictionaries, each containing at least a pair of entries, ‘name’ and ‘type’. See https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
  • overwrite – if True, delete the table first if it exists. If False and the table exists, creation will fail and raise an Exception.
Returns:

The Table instance.

Raises:

Exception if the table couldn’t be created, or already exists and overwrite was False.

delete()[source]

Delete the table.

Returns:True if the Table no longer exists; False otherwise.
exists()[source]

Checks if the table exists.

Returns:True if the table exists; False otherwise.
Raises:Exception if there was an error requesting information about the table.
extract(destination, format=u'csv', csv_delimiter=u',', csv_header=True, compress=False)[source]

Exports the table to GCS; blocks until complete.

Parameters:
  • destination – the destination URI(s). Can be a single URI or a list.
  • format – the format to use for the exported data; one of ‘csv’, ‘json’, or ‘avro’ (default ‘csv’).
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
  • compress – whether to compress the data on export. Compression is not supported for AVRO format. Defaults to False.
Returns:

A Job object for the completed export Job if it was started successfully; else None.

extract_async(destination, format=u'csv', csv_delimiter=u',', csv_header=True, compress=False)[source]

Starts a job to export the table to GCS.

Parameters:
  • destination – the destination URI(s). Can be a single URI or a list.
  • format – the format to use for the exported data; one of ‘csv’, ‘json’, or ‘avro’ (default ‘csv’).
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
  • compress – whether to compress the data on export. Compression is not supported for AVRO format. Defaults to False.
Returns:

A Job object for the export Job if it was started successfully; else None.

insert_data(data, include_index=False, index_name=None)[source]

Insert the contents of a Pandas DataFrame or a list of dictionaries into the table.

The insertion will be performed using at most 500 rows per POST, and at most 10 POSTs per second, as BigQuery has some limits on streaming rates.

Parameters:
  • data – the DataFrame or list to insert.
  • include_index – whether to include the DataFrame or list index as a column in the BQ table.
  • index_name – for a list, if include_index is True, this should be the name for the index. If not specified, ‘Index’ will be used.
Returns:

The table.

Raises:

Exception if the table doesn’t exist, the table’s schema differs from the data’s schema, or the insert failed.
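
Putting create and insert_data together, a minimal sketch (with hypothetical dataset, table, and field names) might be:

  import datalab.bigquery as bq

  schema = [
      {'name': 'name', 'type': 'STRING'},
      {'name': 'score', 'type': 'INTEGER'},
  ]
  table = bq.Table('my_dataset.scores').create(schema, overwrite=True)

  # Streaming insert of a few records (subject to the batching limits described above).
  table.insert_data([
      {'name': 'alice', 'score': 10},
      {'name': 'bob', 'score': 7},
  ])
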
is_temporary

Whether this is a short-lived table or not. Always False for non-QueryResultsTables.

job

For tables resulting from executing queries, the job that created the table.

Default is None for a Table object; this is overridden by QueryResultsTable.

length

Get the length of the table (number of rows). We don’t use __len__ as this may return -1 for ‘unknown’.

load(source, mode=u'create', source_format=u'csv', csv_options=None, ignore_unknown_values=False, max_bad_records=0)[source]

Load the table from GCS.

Parameters:
  • source – the URL of the source object(s). Can include a wildcard ‘*’ at the end of the item name. Can be a single source or a list.
  • mode – one of ‘create’, ‘append’, or ‘overwrite’. ‘append’ or ‘overwrite’ will fail if the table does not already exist, while ‘create’ will fail if it does. The default is ‘create’. If ‘create’, the schema will be inferred if necessary.
  • source_format – the format of the data, ‘csv’ or ‘json’; default ‘csv’.
  • csv_options – if source format is ‘csv’, additional options as a CSVOptions object.
  • ignore_unknown_values – if True, accept rows that contain values that do not match the schema; the unknown values are ignored (default False).
  • max_bad_records – the maximum number of bad records that are allowed (and ignored) before returning an ‘invalid’ error in the Job result (default 0).
Returns:

A Job object for the completed load Job if it was started successfully; else None.

load_async(source, mode=u'create', source_format=u'csv', csv_options=None, ignore_unknown_values=False, max_bad_records=0)[source]

Starts importing a table from GCS and returns a Job.

Parameters:
  • source – the URL of the source object(s). Can include a wildcard ‘*’ at the end of the item name. Can be a single source or a list.
  • mode – one of ‘create’, ‘append’, or ‘overwrite’. ‘append’ or ‘overwrite’ will fail if the table does not already exist, while ‘create’ will fail if it does. The default is ‘create’. If ‘create’, the schema will be inferred if necessary.
  • source_format – the format of the data, ‘csv’ or ‘json’; default ‘csv’.
  • csv_options – if source format is ‘csv’, additional options as a CSVOptions object.
  • ignore_unknown_values – If True, accept rows that contain values that do not match the schema; the unknown values are ignored (default False).
  • max_bad_records – the maximum number of bad records that are allowed (and ignored) before returning an ‘invalid’ error in the Job result (default 0).
Returns:

A Job object for the import if it was started successfully or None if not.

Raises:

Exception if the load job failed to be started or invalid arguments were supplied.

metadata

Retrieves metadata about the table.

Returns:A TableMetadata object.
Raises:Exception if the request could not be executed or the response was malformed.
name

The TableName named tuple (project_id, dataset_id, table_id, decorator) for the table.

range(start_row=0, max_rows=None)[source]

Get an iterator to iterate through a set of table rows.

Parameters:
  • start_row – the row of the table at which to start the iteration (default 0)
  • max_rows – an upper limit on the number of rows to iterate through (default None)
Returns:

A row iterator.

sample(fields=None, count=5, sampling=None, use_cache=True, dialect=None, billing_tier=None)[source]

Retrieves a sampling of data from the table.

Parameters:
  • fields – an optional list of field names to retrieve.
  • count – an optional count of rows to retrieve which is used if a specific sampling is not specified.
  • sampling – an optional sampling strategy to apply to the table.
  • use_cache – whether to use cached results or not.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryResultsTable object containing the resulting data.

Raises:

Exception if the sample query could not be executed or query response was malformed.

schema

Retrieves the schema of the table.

Returns:A Schema object containing a list of schema fields and associated metadata.
Raises:Exception if the request could not be executed or the response was malformed.
snapshot(at)[source]

Return a new Table which is a snapshot of this table at the specified time.

Parameters:at

the time of the snapshot. This can be a Python datetime (absolute) or timedelta (relative to current time). The result must be after the table was created and no more than seven days in the past. Passing None will get a reference to the oldest snapshot.

Note that using a datetime will get a snapshot at an absolute point in time, while a timedelta will provide a varying snapshot; any queries issued against such a Table will be done against a snapshot that has an age relative to the execution time of the query.

Returns:A new Table object referencing the snapshot.
Raises:An exception if this Table is already decorated, or if the time specified is invalid.
to_dataframe(start_row=0, max_rows=None)[source]

Exports the table to a Pandas dataframe.

Parameters:
  • start_row – the row of the table at which to start the export (default 0)
  • max_rows – an upper limit on the number of rows to export (default None)
Returns:

A Pandas dataframe containing the table data.

to_file(destination, format=u'csv', csv_delimiter=u',', csv_header=True)[source]

Save the results to a local file in CSV format.

Parameters:
  • destination – path on the local filesystem for the saved results.
  • format – the format to use for the exported data; currently only ‘csv’ is supported.
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
Raises:

An Exception if the operation failed.

to_file_async[source]

Start saving the results to a local file in CSV format and return a Job for completion.

Parameters:
  • destination – path on the local filesystem for the saved results.
  • format – the format to use for the exported data; currently only ‘csv’ is supported.
  • csv_delimiter – for CSV exports, the field delimiter to use. Defaults to ‘,’
  • csv_header – for CSV exports, whether to include an initial header line. Default true.
Returns:

A Job for the async save operation.

Raises:

An Exception if the operation failed.

to_query(fields=None)[source]

Return a Query for this Table.

Parameters:fields – the fields to return. If None, all fields will be returned. This can be a string which will be injected into the Query after SELECT, or a list of field names.
Returns:A Query object that will return the specified fields from the records in the Table.
update(friendly_name=None, description=None, expiry=None, schema=None)[source]

Selectively updates Table information.

Any parameters that are omitted or None are not updated.

Parameters:
  • friendly_name – if not None, the new friendly name.
  • description – if not None, the new description.
  • expiry – if not None, the new expiry time, either as a DateTime or milliseconds since epoch.
  • schema – if not None, the new schema: either a list of dictionaries or a Schema.
window(begin, end=None)[source]

Return a new Table limited to the rows added to this Table during the specified time range.

Parameters:
  • begin

    the start time of the window. This can be a Python datetime (absolute) or timedelta (relative to current time). The result must be after the table was created and no more than seven days in the past.

    Note that using a relative value will provide a varying snapshot, not a fixed snapshot; any queries issued against such a Table will be done against a snapshot that has an age relative to the execution time of the query.

  • end – the end time of the window; if None, then the current time is used. The types and interpretation of values are as for begin.
Returns:

A new Table object referencing the window.

Raises:

An exception if this Table is already decorated, or if the time specified is invalid.
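
As an illustration of the snapshot and window decorators, the sketch below uses absolute datetimes to avoid any ambiguity about relative offsets; the table name is hypothetical:

  import datetime
  import datalab.bigquery as bq

  table = bq.Table('my_dataset.events')  # hypothetical table

  # A snapshot of the table as it was one hour ago (an absolute point in time).
  one_hour_ago = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
  snap = table.snapshot(one_hour_ago)

  # Only the rows added during the last hour (end=None means 'now').
  recent = table.window(one_hour_ago)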

class datalab.bigquery.TableMetadata(table, info)[source]

Represents metadata about a BigQuery table.

Initializes a TableMetadata instance.

Parameters:
  • table – the Table object this belongs to.
  • info – The BigQuery information about this table as a Python dictionary.
created_on

The creation timestamp.

description

The description of the table if it exists.

expires_on

The timestamp for when the table will expire, or None if unknown.

friendly_name

The friendly name of the table if it exists.

modified_on

The timestamp for when the table was last modified.

refresh()[source]

Refresh the metadata.

rows

The number of rows within the table, or -1 if unknown.

size

The size of the table in bytes, or -1 if unknown.

class datalab.bigquery.TableName(project_id, dataset_id, table_id, decorator)

A namedtuple for Table names.

Parameters:
  • project_id – the project id for the table.
  • dataset_id – the dataset id for the table.
  • table_id – the table id for the table.
  • decorator – the optional decorator for the table (for windowing/snapshot-ing).
dataset_id

Alias for field number 1

decorator

Alias for field number 3

project_id

Alias for field number 0

table_id

Alias for field number 2

class datalab.bigquery.UDF(inputs, outputs, name, implementation, support_code=None, imports=None)[source]

Represents a BigQuery UDF declaration.

Initializes a Function object from its pieces.

Parameters:
  • inputs – a list of string field names representing the schema of input.
  • outputs – a list of name/type tuples representing the schema of the output.
  • name – the name of the javascript function.
  • implementation – a javascript function implementing the logic.
  • support_code – additional javascript code that the function can use.
  • imports – a list of GCS URLs or files containing further support code.
Raises:

Exception if the name is invalid.
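
A sketch of declaring a UDF in code and referencing it from a legacy-SQL query follows. The field names, table, and JavaScript body are hypothetical, and it is assumed here that implementation is supplied as a JavaScript source string:

  import datalab.bigquery as bq

  # Hypothetical row-wise transform: emit each word together with its length.
  implementation = '''
  function(row, emit) {
    emit({word: row.word, length: row.word.length});
  }
  '''

  word_length = bq.UDF(inputs=['word'],
                       outputs=[('word', 'string'), ('length', 'integer')],
                       name='word_length',
                       implementation=implementation)

  # In legacy SQL a row-wise UDF wraps an inner query; the UDF object is passed via udfs.
  q = bq.Query('SELECT word, length FROM word_length(SELECT word FROM [my_dataset.words])',
               udfs=[word_length])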

class datalab.bigquery.View(name, context=None)[source]

An implementation of a BigQuery View.

Initializes an instance of a View object.

Parameters:
  • name – the name of the view either as a string or a 3-part tuple (projectid, datasetid, name). If a string, it must have the form ‘<project>:<dataset>.<view>’ or ‘<dataset>.<view>’.
  • context – an optional Context object providing project_id and credentials. If a specific project id or credentials are unspecified, the default ones configured at the global level are used.
Raises:

Exception if the name is invalid.

create(query)[source]

Creates the view with the specified query.

Parameters:query – the query to use for the View; either a string containing a SQL query or a Query object.
Returns:The View instance.
Raises:Exception if the view couldn’t be created or already exists and overwrite was False.
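
For example, a create-if-missing pattern for a view, with hypothetical names:

  import datalab.bigquery as bq

  view = bq.View('my_dataset.big_scores')
  if not view.exists():
      view.create('SELECT name, score FROM [my_dataset.scores] WHERE score > 100')

  # A view can then be sampled or materialized much like a table.
  df = view.results().to_dataframe()
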
delete()[source]

Removes the view if it exists.

description

The description of the view if it exists.

execute(table_name=None, table_mode=u'create', use_cache=True, priority=u'high', allow_large_results=False, dialect=None, billing_tier=None)[source]

Materialize the View synchronously.

Parameters:
  • table_name – the result table name; if None, then a temporary table will be used.
  • table_mode – one of ‘create’, ‘overwrite’ or ‘append’. If ‘create’ (the default), the request will fail if the table exists.
  • use_cache – whether to use past query results or ignore cache. Has no effect if destination is specified (default True).
  • priority – one of ‘low’ or ‘high’ (default). Note that ‘high’ is more expensive, but is better suited to exploratory analysis.
  • allow_large_results – whether to allow large results; i.e. compressed data over 100MB. This is slower and requires a table_name to be specified (default False).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryJob for the materialization

Raises:

Exception (KeyError) if View could not be materialized.

execute_async(table_name=None, table_mode=u'create', use_cache=True, priority=u'high', allow_large_results=False, dialect=None, billing_tier=None)[source]

Materialize the View asynchronously.

Parameters:
  • table_name – the result table name; if None, then a temporary table will be used.
  • table_mode – one of ‘create’, ‘overwrite’ or ‘append’. If ‘create’ (the default), the request will fail if the table exists.
  • use_cache – whether to use past query results or ignore cache. Has no effect if destination is specified (default True).
  • priority – one of ‘low’ or ‘high’ (default). Note that ‘high’ is more expensive, but is better suited to exploratory analysis.
  • allow_large_results – whether to allow large results; i.e. compressed data over 100MB. This is slower and requires a table_name to be specified (default False).
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryJob for the materialization

Raises:

Exception (KeyError) if View could not be materialized.

exists()[source]

Whether the view’s Query has been executed and the view is available or not.

friendly_name

The friendly name of the view if it exists.

name

The name for the view as a named tuple.

query

The Query that defines the view.

results(use_cache=True, dialect=None, billing_tier=None)[source]

Materialize the view synchronously.

If you require more control over the execution, use execute() or execute_async().

Parameters:
  • use_cache – whether to use cached results or not.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryResultsTable containing the result set.

Raises:

Exception if the query could not be executed or query response was malformed.

sample(fields=None, count=5, sampling=None, use_cache=True, dialect=None, billing_tier=None)[source]

Retrieves a sampling of data from the view.

Parameters:
  • fields – an optional list of field names to retrieve.
  • count – an optional count of rows to retrieve which is used if a specific sampling is not specified.
  • sampling – an optional sampling strategy to apply to the view.
  • use_cache – whether to use cached results or not.
  • dialect – one of ‘legacy’ (the default) or ‘standard’. ‘legacy’ uses BigQuery’s legacy SQL dialect; ‘standard’ uses BigQuery’s standard SQL (beta), which is compliant with the SQL 2011 standard.
  • billing_tier – Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default. This can also be used to override your project-wide default billing tier on a per-query basis.
Returns:

A QueryResultsTable object containing the resulting data.

Raises:

Exception if the sample query could not be executed or the query response was malformed.

schema

Retrieves the schema of the view.

Returns:A Schema object containing a list of schema fields and associated metadata.
Raises:Exception if the request could not be executed or the response was malformed.
update(friendly_name=None, description=None, query=None)[source]

Selectively updates View information.

Any parameters that are None (the default) are not applied in the update.

Parameters:
  • friendly_name – if not None, the new friendly name.
  • description – if not None, the new description.
  • query – if not None, a new query string for the View.