google.datalab.data Module

Google Cloud Platform library - Generic SQL Helpers.

class google.datalab.data.CsvFile(path, delimiter=', ')

Represents a CSV file, stored in GCS or locally, whose rows share the same schema.

Initializes a CsvFile instance.

Parameters:
  • path – path of the CSV file.
  • delimiter – the separator used to parse a CSV line.

browse(max_lines=None, headers=None)

Tries to read the specified number of lines from the CSV object.

Parameters:
  • max_lines – maximum number of lines to read. If None, the whole file is read.
  • headers – a list of strings to use as column names. If None, the names “col0, col1...” are used.

Returns:

A pandas DataFrame with the schema inferred from the data.

Raises:

Exception if the CSV object cannot be read, if there are not enough lines to read, or if the headers size does not match the columns size.
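
The header handling described for browse can be illustrated with a small standard-library sketch. This is not the library's implementation: browse_rows is a hypothetical helper, it reads from an in-memory string rather than a GCS or local path, it returns a list of dicts instead of a pandas DataFrame, and it raises ValueError (a subclass of Exception) on a header mismatch. It only shows the default “col0, col1...” naming and the header-size check.

```python
import csv
import io
from itertools import islice


def browse_rows(text, max_lines=None, headers=None, delimiter=","):
    """Hypothetical stand-in for browse(); returns a list of dicts, not a DataFrame."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # islice with a stop of None consumes every row, mirroring max_lines=None.
    rows = list(islice(reader, max_lines))
    if not rows:
        return []
    width = len(rows[0])
    if headers is None:
        # Default column names: col0, col1, ...
        headers = ["col%d" % i for i in range(width)]
    if len(headers) != width:
        raise ValueError("expected %d headers, got %d" % (width, len(headers)))
    return [dict(zip(headers, row)) for row in rows]


sample = "1,apple\n2,pear\n3,plum\n"
first_two = browse_rows(sample, max_lines=2)
named = browse_rows(sample, headers=["id", "fruit"])
```

Here first_two holds the first two rows keyed by the generated names, while named holds all three rows keyed by the supplied headers. Passing a headers list whose length differs from the number of columns raises an error.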

sample_to(count, skip_header_rows, strategy, target)

Sample rows from GCS or local file and save results to target file.

Parameters:
  • count – number of rows to sample. If strategy is “BIGQUERY”, count is treated as an approximate number.
  • skip_header_rows – whether to skip the first row when reading from the source.
  • strategy – either “LOCAL” or “BIGQUERY”. With “LOCAL”, sampling happens in local memory and the number of resulting rows matches count. With “BIGQUERY”, sampling is done by BigQuery in the cloud and the number of resulting rows is approximately count.
  • target – the target file path, which can be a GCS or local path.
Raises:

Exception if strategy is “BIGQUERY” but source is not a GCS path.
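
The contract above can be sketched for the “LOCAL” case using only the standard library. This is an illustration under stated assumptions, not the library's code: sample_csv and its seed argument are hypothetical, it handles local files only, it raises ValueError (a subclass of Exception) for a “BIGQUERY” request on a non-GCS source, and it caps the sample at the number of available rows, which the documentation above does not specify.

```python
import csv
import os
import random
import tempfile


def sample_csv(source, target, count, skip_header_rows, strategy="LOCAL", seed=None):
    """Hypothetical illustration of the documented contract, local files only."""
    if strategy == "BIGQUERY":
        if not source.startswith("gs://"):
            raise ValueError("BIGQUERY strategy requires a GCS source path")
        raise NotImplementedError("BigQuery sampling is outside this sketch")
    if strategy != "LOCAL":
        raise ValueError("strategy must be 'LOCAL' or 'BIGQUERY'")
    with open(source, newline="") as f:
        rows = list(csv.reader(f))
    if skip_header_rows:
        rows = rows[1:]
    # In-memory sampling without replacement; assumption: cap at the rows available.
    picked = random.Random(seed).sample(rows, min(count, len(rows)))
    with open(target, "w", newline="") as f:
        csv.writer(f).writerows(picked)
    return len(picked)


# Build a small local source file with one header row and ten data rows.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.csv")
dst = os.path.join(workdir, "sampled.csv")
with open(src, "w", newline="") as f:
    csv.writer(f).writerows([["id", "value"]] + [[i, i * i] for i in range(10)])

written = sample_csv(src, dst, count=4, skip_header_rows=True, seed=0)
```

With skip_header_rows set, the header row is excluded before sampling, so the target file contains exactly count data rows when the source has at least that many.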