My sister recently asked me for help running a comparison job between some large files spanning several GB of text. The most efficient solution I could find was BigQuery.
So I tried some sample jobs by manually copying the data onto a Google Compute Engine instance, then into Google Storage, and finally loading it into a BigQuery dataset. I had to use Compute Engine because the large file lived on a remote server with SSH-only access. Instead of asking my sister to install gsutil to upload to Google Storage, I found it easier to just have her scp the file to a GCE instance.
The runtimes were surprisingly fast:
Query complete (9.4s elapsed, 6.81 GB processed)
For the first few iterations, I used the BigQuery UI and manually sent her the output. Eventually I built an automated Google Colaboratory notebook so my sister could do this on her own.
Setting up Google Compute Engine to sync a directory to Google Storage
Set up a shell script to sync the directory to the Google Storage bucket every x seconds.
https://cloud.google.com/storage/docs/gsutil/commands/rsync
#!/bin/sh
while true
do
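# -m runs transfers in parallel, -r recurses into subdirectories, -d deletes bucket objects that no longer exist locally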
gsutil -m rsync -r -d /home/tusharm567/didi_data_src/ gs://didi_data_src > /tmp/sync.log 2>&1
sleep 5
done
Then set it to run every time the GCE instance boots up.
TODO…
https://stackoverflow.com/questions/12973777/how-to-run-a-shell-script-at-startup
We also need to provide access to the user who’ll be using the notebook.
gsutil iam ch user:tusharm57123@gmail.com:objectCreator,objectViewer gs://didi_data_src
PS: The user needs to log in when we call:
from google.colab import auth
auth.authenticate_user()
Creating a table and adding data from Google Storage
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
Creating a table can be done simply using the BigQuery Python client.
table_ref = dataset_ref.table(table_name)
table = bigquery.Table(table_ref)
table = client.create_table(table)
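This assumes a client and a dataset_ref were set up earlier in the notebook (and that the dataset already exists). A minimal sketch of that setup, with placeholder project and dataset names:
from google.cloud import bigquery

project_id = 'my-project'        # placeholder: your GCP project ID
dataset_name = 'didi_dataset'    # placeholder: an existing BigQuery dataset

client = bigquery.Client(project=project_id)
dataset_ref = client.dataset(dataset_name)  # reference used by the snippets here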
To load data from a Google Storage CSV file, I needed to create a job with some configurations.
The user running this should have the **“BigQuery Data Owner”** role assigned to them. Search for IAM in the Google Cloud Console, add the user (by email ID), and assign the above role.
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
load_job = client.load_table_from_uri(uri, dataset_ref.table(table_name), job_config=job_config)
autodetect -> Infer the schema from the file.
skip_leading_rows -> Skip the first row, assuming it is a header.
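The load runs as an asynchronous job, so it is worth waiting for it to finish and sanity-checking the result. A small sketch using the objects above:
load_job.result()  # blocks until the load job completes (raises on failure)
table = client.get_table(dataset_ref.table(table_name))
print('Loaded {} rows into {}.'.format(table.num_rows, table_name))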
The user running this should have the **“BigQuery Job User”** role assigned to them. Search for IAM in the Google Cloud Console, add the user (by email ID), and assign the above role.
For more info: https://cloud.google.com/bigquery/docs/access-control#bigquery.dataOwner
Querying the table
I tried running the SQL query directly through the BigQuery Python client, but it doesn’t support duplicate column names, and the query I was using was an inner join with conflicting column names.
So I read the query output in a Pandas dataframe instead.
df = pd.io.gbq.read_gbq("""select * from {0}.{1};""".format(dataset_name, table_name), project_id=project_id, verbose=False)
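From there it is plain pandas. For example, a sketch of writing the result to a CSV (the file name is just a placeholder) so it can be handed back via files.download in the next section:
output_file_path = 'comparison_output.csv'  # placeholder output file name
df.to_csv(output_file_path, index=False)    # dump the query result for download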
User I/O
For file I/O, I used google.colab.files:
from google.colab import files
uploaded = files.upload()
files.download(output_file_path)
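files.upload() returns a dict mapping each chosen file name to its contents as bytes, so the uploads can be written to the Colab VM’s disk for later cells to read:
for name, data in uploaded.items():
    # save each uploaded file locally under its original name
    with open(name, 'wb') as f:
        f.write(data)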
Google Colab supports a forms feature, which is quite simple to use.
table_name = 'sampler' #@param {type:"string"}
You can also get a checkbox quite easily:
skip_header = False #@param {type: "boolean"}
The good thing is that updates made in the form fields are also reflected in the code.
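For example, the checkbox above could feed straight into the load job configuration from earlier (a sketch, assuming the same job_config object):
job_config.skip_leading_rows = 1 if skip_header else 0  # honour the form checkbox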
For more info: https://colab.research.google.com/notebooks/forms.ipynb