datalab.data Module

Google Cloud Platform library - Generic SQL Helpers.

class datalab.data.Csv(path, delimiter=', ')

Represents a CSV file, stored in Google Cloud Storage (GCS) or locally, with a consistent schema.

Initializes a Csv instance.

Parameters:
  • path – path of the CSV file.
  • delimiter – the separator used to parse a CSV line.

browse(max_lines=None, headers=None)

Try reading the specified number of lines from the CSV object.

Parameters:
  • max_lines – maximum number of lines to read. If None, the whole file is read.
  • headers – a list of strings to use as column names. If None, the names “col0, col1...” are used.

Returns:

A pandas DataFrame with the schema inferred from the data.

Raises:

Exception if the CSV object cannot be read, if there are not enough lines to read, or if the number of headers does not match the number of columns.
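The header handling described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `browse_csv` is a hypothetical helper that works on in-memory text, uses only the standard `csv` module, and returns plain dictionaries rather than a pandas DataFrame, so no schema inference takes place.

```python
import csv
import io
from itertools import islice


def browse_csv(text, delimiter=",", max_lines=None, headers=None):
    """Read up to max_lines rows from CSV text and return a list of dict rows.

    Hypothetical stand-in for Csv.browse: plain dicts, no type inference.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # islice(reader, None) consumes the whole input, mirroring max_lines=None.
    rows = list(islice(reader, max_lines))
    if not rows:
        return []
    width = len(rows[0])
    if headers is None:
        # Default column names follow the "col0, col1..." pattern.
        headers = ["col%d" % i for i in range(width)]
    elif len(headers) != width:
        raise ValueError("headers size does not match columns size")
    return [dict(zip(headers, row)) for row in rows]


sample = "1,alice\n2,bob\n3,carol\n"
preview = browse_csv(sample, max_lines=2)
```

Here `preview` holds the first two rows keyed by the generated names `col0` and `col1`; passing a `headers` list of the wrong length raises an error instead.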

sample_to(count, skip_header_rows, strategy, target)

Sample rows from a GCS or local file and save the results to a target file.

Parameters:
  • count – number of rows to sample. If strategy is “BIGQUERY”, it is treated as an approximate number.
  • skip_header_rows – whether to skip the first row when reading from the source.
  • strategy – either “LOCAL” or “BIGQUERY”. With “LOCAL”, sampling happens in local memory and the number of resulting rows matches count. With “BIGQUERY”, sampling is done by BigQuery in the cloud and the number of resulting rows is approximately count.
  • target – the target file path; can be a GCS or local path.
Raises:

Exception if strategy is “BIGQUERY” but source is not a GCS path.
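As a rough illustration of the “LOCAL” strategy only, the sketch below samples rows in memory and writes exactly the requested number to a target file. The helper name `sample_local`, the `seed` parameter, and the temporary file paths are assumptions made for the example; the BigQuery strategy and GCS paths are not modelled.

```python
import csv
import os
import random
import tempfile


def sample_local(source, target, count, skip_header_rows=False, seed=None):
    """Copy `count` randomly chosen rows from a source CSV to a target CSV.

    Hypothetical helper illustrating in-memory ("LOCAL") sampling only.
    """
    with open(source, newline="") as f:
        rows = list(csv.reader(f))
    if skip_header_rows:
        rows = rows[1:]  # drop the first row before sampling
    rng = random.Random(seed)
    # Never request more rows than the file contains.
    picked = rng.sample(rows, min(count, len(rows)))
    with open(target, "w", newline="") as f:
        csv.writer(f).writerows(picked)
    return len(picked)


# Build a small local file with one header row and ten data rows.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.csv")
dst = os.path.join(workdir, "sampled.csv")
with open(src, "w", newline="") as f:
    csv.writer(f).writerows([["id", "name"]] + [[i, "row%d" % i] for i in range(10)])

written = sample_local(src, dst, count=3, skip_header_rows=True, seed=0)
```

Because the whole file is loaded before sampling, this approach only suits files that fit in memory, which is consistent with the description of the local strategy above.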

class datalab.data.SqlModule

A container for SqlStatements defined together and able to reference each other.

static expand(sql, args=None)

Expand a SqlStatement, query string or SqlModule with a set of arguments.

Parameters:
  • sql – a SqlStatement, %%sql module, or string containing a query.
  • args – a string of command line arguments or a dictionary of values. If a string, it is passed to the argument parser for the SqlModule associated with the SqlStatement or SqlModule. If a dictionary, it is used to override any default arguments from the argument parser. If the sql argument is a string, then args must be None or a dictionary, because in this case there is no associated argument parser.
Returns:

The expanded SQL, list of referenced scripts, and list of referenced external tables.
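The two forms of `args` described above can be sketched with the standard `argparse` module. This is an illustration under assumptions, not the library's code: `resolve_args` is a hypothetical helper, and the parser with its `--limit` and `--table` options stands in for whatever argument parser a SqlModule carries.

```python
import argparse
import shlex


def resolve_args(parser, args=None):
    """Return a dict of variable values from a parser plus string or dict args.

    Hypothetical helper; the parser is a stand-in for a module's own parser.
    """
    if isinstance(args, str):
        # A string is handed to the argument parser like a command line.
        return vars(parser.parse_args(shlex.split(args)))
    # Start from the parser defaults, then let dictionary entries override them.
    env = vars(parser.parse_args([]))
    env.update(args or {})
    return env


parser = argparse.ArgumentParser(prog="query")
parser.add_argument("--limit", type=int, default=10)
parser.add_argument("--table", default="logs")

from_string = resolve_args(parser, "--limit 5")
from_dict = resolve_args(parser, {"table": "events"})
```

In both cases the result is a dictionary of values; only the route differs, with the string going through the parser and the dictionary overriding the parser's defaults.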

static get_default_query_from_module(module)

Given a %%sql module return the default (last) query for the module.

Parameters:
  • module – the %%sql module.
Returns:

The default query associated with this module.

static get_sql_statement_with_environment(item, args=None)

Given a SqlStatement, string or module plus command line args or a dictionary, return a SqlStatement and final dictionary for variable resolution.

Parameters:
  • item – a SqlStatement, %%sql module, or string containing a query.
  • args – a string of command line arguments or a dictionary of values.
Returns:

A SqlStatement for the query or module, plus a dictionary of variable values to use.

class datalab.data.SqlStatement(sql, module=None)

A helper class for wrapping and manipulating SQL statements.

Initializes the SqlStatement.

Parameters:
  • sql – a string containing a SQL query with optional variable references.
  • module – if defined in a %%sql cell, the parent SqlModule object for the SqlStatement.

format(sql, args=None)

Resolve variable references in a query within an environment.

This computes and resolves the transitive dependencies in the query and raises an exception if that fails due to either undefined or circular references.

Parameters:
  • sql – query to format.
  • args – a dictionary of values to use in variable expansion.
Returns:

The resolved SQL text with variables expanded.

Raises:

Exception on failure.
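To make the idea of resolving transitive dependencies concrete, here is a small self-contained sketch. It is not the library's implementation: the `resolve` function is hypothetical, the `$name` reference syntax is an assumption made for the example, and it raises `ValueError` where the documented method raises a generic Exception.

```python
import re

# Assumed reference syntax for this sketch: a dollar sign followed by a name.
_VAR = re.compile(r"\$(\w+)")


def resolve(sql, args):
    """Expand $name references in sql, following references between values.

    Hypothetical sketch; raises ValueError on undefined or circular references.
    """
    resolved = {}

    def value_of(name, stack):
        if name in resolved:
            return resolved[name]
        if name in stack:
            raise ValueError("circular reference: " + " -> ".join(stack + [name]))
        if name not in args:
            raise ValueError("undefined variable: " + name)
        raw = args[name]
        if isinstance(raw, str):
            # A string value may itself reference other variables.
            raw = _VAR.sub(lambda m: value_of(m.group(1), stack + [name]), raw)
        resolved[name] = str(raw)
        return resolved[name]

    return _VAR.sub(lambda m: value_of(m.group(1), []), sql)


query = resolve(
    "SELECT * FROM $source LIMIT $limit",
    {"source": "(SELECT id FROM $table)", "table": "logs", "limit": 5},
)
```

The `stack` argument tracks the chain of names currently being expanded, which is what allows a cycle such as `a` referring to `b` and `b` referring back to `a` to be reported rather than recursing forever.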

module

The parent SqlModule for the SqlStatement, if any.

sql

The (unexpanded) SQL for the SqlStatement.