tutorial: Exporting files to different locations from Google Cloud in Python

By Julien Hernandez Lallement, 2021-01-03, in category Tutorial

cloud, python

Google Cloud Platform (GCP)

I already discussed some functionalities of GCP in another post, which will guide you through automating the execution of simple code chunks.

One thing that might come in handy is not only executing code on GCP, but also exporting the result of that code as a file somewhere you can access it.

In this post, I will discuss three different approaches that will allow you to export the result(s) of your code execution to:

  • 1) Cloud Bucket (with an option to BigQuery tables): BigQuery tables will be useful if you need to perform additional operations on GCP, such as regular ML predictions on certain data tables. They can also be accessed by third-party software such as Tableau or Power BI if you want to fetch data for visualization.

  • 2) Google Drive: Google Drive export would allow you to provide easy and readable access to many users, if the data is to be shared on a regular basis for further personal use.

  • 3) FTP server: your organization might be using an FTP server to exchange and/or store data.

Executing code on the cloud

I personally enjoy the notebook interface for testing and development. I already introduced how to create a notebook instance in my previous post, so I invite you to read through that if you are not sure how to proceed. I will assume from now on that you have a notebook instance with your code up and running, producing a file (whatever that might be, in whatever extension you require) to be exported.

Let's now have a look at how to export these files. All code is Python-based.

Caution:
All these procedures are my own findings resulting from trial and error. There might be better, more elegant and efficient ways to proceed. Do your own research as well and think critically about what I did here :)

1. Exporting to a Cloud Bucket (with an option to BigQuery tables)

1) Create GCP Bucket

To export a file to a BigQuery table, you should first export your data to a GCP bucket. The storage page will display all currently existing buckets and give you the opportunity to create one.

In [8]:
from IPython.display import Image
PATH = "C:\\Users\\jhernandez-lallement\\Documents\\GCP_post\\exporting_files\\"
Image(filename = PATH + "create_bucket.JPG", width=1200, height=800)
Out[8]:

Go to the Cloud Storage page, and click on Create a Bucket. See the documentation to configure the different parameters of your bucket.

Once created, your bucket will be accessible under the name you gave it, in my case my_test_bucket.
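
If you prefer to stay in Python, you can also create the bucket programmatically. Below is a minimal sketch using the google-cloud-storage client; the project ID, bucket name, and location are placeholders, and the console works just as well.

In [ ]:
from google.cloud import storage

# Minimal sketch (placeholder project/bucket/location): create the bucket programmatically
client = storage.Client(project='my_project_id')
bucket = client.create_bucket('my_test_bucket', location='EU')
print('Created bucket: ' + bucket.name)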

2) Create Service Account Key

I recommend you use a service account key to monitor and control the access to your bucket. You can create a service account key here, which will allow you to link the key to your bucket. Prior identification using the key will be required to download and upload files to the bucket.

In [11]:
from IPython.display import Image
PATH = "C:\\Users\\jhernandez-lallement\\Documents\\GCP_post\\exporting_files\\"
Image(filename = PATH + "create_service_account.JPG", width=1200, height=800)
Out[11]:

The key will be a .json file with the following structure:

In [ ]:
{
  "type": "service_account",
  "project_id": "my_project_id",
  "private_key_id": "my_private_key_id",
  "private_key": "-----BEGIN PRIVATE KEY-----KEY-----END PRIVATE KEY-----\n",
  "client_email": "email_adress",
  "client_id": "client_id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": ""
}

3) Enter the Python code snippets in your code file

Once your destination bucket and service account key are created, you can proceed with copying the file(s) to the bucket using Python code.

  • First, we import the required libraries. I used the following versions:
In [ ]:
import json
from google.cloud import storage
from google.oauth2 import service_account

google-api-core 1.22.4
google-api-python-client 1.12.3
google-auth 1.22.0
google-auth-httplib2 0.0.4
google-auth-oauthlib 0.4.1
google-cloud-bigquery 1.26.1
google-cloud-bigquery-storage 1.0.0
google-cloud-bigtable 1.4.0
google-cloud-core 1.3.0
google-cloud-dataproc 1.1.1
google-cloud-datastore 1.12.0
google-cloud-firestore 1.8.1
google-cloud-kms 1.4.0
google-cloud-language 1.3.0
google-cloud-logging 1.15.1
google-cloud-monitoring 1.1.0
google-cloud-pubsub 1.7.0
google-cloud-scheduler 1.3.0
google-cloud-spanner 1.17.1
google-cloud-speech 1.3.2
google-cloud-storage 1.30.0
google-cloud-tasks 1.5.0
google-cloud-translate 2.0.2
google-cloud-videointelligence 1.15.0
google-cloud-vision 1.0.0
google-crc32c 1.0.0
google-resumable-media 1.1.0
googleapis-common-protos 1.52.0
json5 0.9.5

  • We then define the project ID under which you are working, as well as the path to the service account token you previously generated.
In [ ]:
project_id = 'fd-io-exc-orsay'
path_to_token = './bucket_access_key.json'
  • We can now proceed with the identification using the token:
In [ ]:
storage_credentials = service_account.Credentials.from_service_account_file(path_to_token)
storage_client = storage.Client(project=project_id, credentials=storage_credentials)
  • Now that we are identified, we need to (i) define the destination bucket (the one you created at the beginning of the tutorial) and (ii) define the drop zone, that is, where the files will be copied within the bucket. In my case, I copy the file to a subfolder /tmp (there are no real subfolders in a bucket, but it is displayed as such):
In [ ]:
# Identify Destination Bucket
destination_bucket_name = 'my_bucket_name'
destination_bucket = storage_client.bucket(destination_bucket_name)

# Define Final Drop zone
destination_blob_name = 'tmp/my_file_to_upload.csv'
blob = destination_bucket.blob(destination_blob_name)
  • Finally, upload the file to the bucket:
In [ ]:
# Upload file on Bucket
source_file_name = 'my_file_to_upload.csv'
blob.upload_from_filename(source_file_name)

If you now access your bucket, you should see the file copied in a subfolder /tmp. Depending on the rights that your account has been given, you might not be able to create a subfolder within the bucket if that folder is not already present. Similarly, you might not be able to create a file, but only to edit it. If you run into issues with the steps described above, try copying to the bucket's root and/or uploading a file with the same filename manually. I am not 100% sure these concerns are valid though.
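
A quick way to check the upload from code is to list the blobs in the bucket, reusing the storage_client defined above:

In [ ]:
# Optional check: list all blobs stored under the tmp/ "subfolder" of the bucket
for blob_item in storage_client.list_blobs(destination_bucket_name, prefix='tmp/'):
    print(blob_item.name)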

4) Optional: Export the file to a BigQuery table

We just exported the data as a .csv file to a Cloud Bucket. You might however want to keep the data in a tabular format on the cloud. That could be useful if you want to launch an ML pipeline on a table, or connect third-party software (e.g. Tableau) to the data.

Since these steps are well described in the Google documentation, I won't go into detail here. I would however recommend, given the low cost of storing data on GCP, keeping this cloud copy of your data, so that you can always access it and automate its use.
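
For reference, here is a minimal sketch of what such a load could look like with the google-cloud-bigquery client; the dataset and table names are placeholders, and the official documentation remains the reference.

In [ ]:
from google.cloud import bigquery
from google.oauth2 import service_account

# Minimal sketch (placeholder dataset/table names): load the CSV from the bucket into a BigQuery table
bq_credentials = service_account.Credentials.from_service_account_file(path_to_token)
bq_client = bigquery.Client(project=project_id, credentials=bq_credentials)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # skip the header row
job_config.autodetect = True      # let BigQuery infer the schema

uri = 'gs://my_bucket_name/tmp/my_file_to_upload.csv'
load_job = bq_client.load_table_from_uri(uri, 'my_dataset.my_table', job_config=job_config)
load_job.result()  # wait for the load job to complete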

2. Exporting to Google Drive

Exporting a file to Google Drive can have quite some advantages, in particular if users in your organization are not familiar with cloud technology and prefer to navigate using good old spreadsheets. You can then save your data as an .xlsx or .csv and export the file to a drive to which users have access.

We will be using PyDrive to control the upload and download of files to a Google Drive. You will need to obtain a client_secrets.json file to link your code to the account you want to copy the file to. You can create that file easily by following the steps described here.

I followed the steps described here to copy a file to a Google Drive. The nice thing with that solution is that the authentication to the Google account is only required once, which creates a file mycreds.txt. That file is then used in further authentication processes to bypass manual authentication.

In [ ]:
def export_file_on_google_drive_no_id(source_file, destination_file):
    """
    Exports a file to the Google Drive account linked to the mycreds.txt file present in the folder
    source_file: filename of the local file to copy to GDrive
    destination_file: filename of the destination file on GDrive
    """
    import os
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    # Move to the folder containing mycreds.txt (placeholder path)
    os.chdir('/path/to/folder/containing/mycreds')

    # Authentication steps
    gauth = GoogleAuth()
    gauth.LoadCredentialsFile("mycreds.txt")
    if gauth.credentials is None:
        # Authenticate if they're not there
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        # Refresh them if expired
        gauth.Refresh()
    else:
        # Initialize the saved creds
        gauth.Authorize()
    # Save the current credentials to a file
    gauth.SaveCredentialsFile("mycreds.txt")
    drive = GoogleDrive(gauth)

    # List files in Google Drive
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

    # Find the Drive file to update
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
        if file1['title'] == destination_file:
            file_of_interest = file1

    # Upload the content of the local source file into the Drive file
    file_of_interest.SetContentFile(source_file)
    file_of_interest.Upload()
    return source_file + ' was exported to Google Drive as ' + destination_file
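
Note that, as written, the function updates an existing file on the Drive rather than creating a new one, so the destination file should already be present there. A hypothetical call, uploading the CSV produced earlier under the same name, would look like this:

In [ ]:
# Hypothetical usage: update the Drive file 'my_file_to_upload.csv' with the local file of the same name
export_file_on_google_drive_no_id('my_file_to_upload.csv', 'my_file_to_upload.csv')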

Executing this function for the first time will open a Google authentication step in your browser. Upon successful authentication, further executions of the export will use the mycreds.txt file to access the drive directly and copy the data.

As you may have noticed, this solution can also be used on other platforms than GCP, or simply locally. To use it in GCP, simply authenticate once to generate a mycreds.txt file. If you execute your code manually, then make sure the file is present in your code repo. If you use Docker to execute your code regularly and automatically, then make sure to copy that file in your Dockerfile.

That's it!

Oh, by the way, it is equally simple to fetch data from Google Drive into GCP, using the same logic. See the function below.

In [ ]:
def fetch_file_from_google_drive_no_id(source_file, destination_file):
    """
    Fetches file(s) from the Google Drive account linked to the mycreds.txt file present in the folder
    source_file: filename of the file to copy from GDrive
    destination_file: filename of the destination file on your machine/cloud
    """
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    # Authentication steps
    gauth = GoogleAuth()
    gauth.LoadCredentialsFile("mycreds.txt")
    if gauth.credentials is None:
        # Authenticate if they're not there
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        # Refresh them if expired
        gauth.Refresh()
    else:
        # Initialize the saved creds
        gauth.Authorize()
    # Save the current credentials to a file
    gauth.SaveCredentialsFile("mycreds.txt")
    drive = GoogleDrive(gauth)

    # List files in Google Drive
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

    # Find the file to fetch
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
        if file1['title'] == source_file:
            file_of_interest = file1

    file_of_interest.GetContentFile(destination_file)
    return source_file + ' was fetched from Google Drive as ' + destination_file
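
As before, a hypothetical call would look like this:

In [ ]:
# Hypothetical usage: download the Drive file and save it locally under the same name
fetch_file_from_google_drive_no_id('my_file_to_upload.csv', 'my_file_to_upload.csv')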

3. Exporting to an FTP server

As mentioned previously, you might want to export some data to an FTP server for further sharing. For example, I had to set up a daily export of data for a team of data analysts, which had to be delivered to an FTP server they had access to. The data would be fetched through an API call, munged, and exported to the FTP server every day.

I use pysftp, version 0.2.9.

You will simply need the hostname, username, and password for the FTP server, which should be provided by your organization. You can use the function below to send a file filename located in localpath to the FTP server.

In [ ]:
import os
import pysftp

def copy_file_on_ftp(myHostname, myUsername, myPassword, filename, localpath):

    remotepath = '/Export/' + str(filename)
    os.chdir(localpath)
    print('')
    print(localpath)
    print('')
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None  # disable host key checking (use with care)

    with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword, cnopts=cnopts) as sftp:
        print("Connection successfully established ... ")
        print('Exporting...')
        with pysftp.cd(localpath):
            sftp.put(localpath=localpath + filename, remotepath=remotepath)
    # the with block closes the connection automatically
    print('Export to sFTP successful!')
    print('')
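
A hypothetical call, with placeholder credentials and paths, would be:

In [ ]:
# Hypothetical usage: placeholder host, credentials and local path
copy_file_on_ftp(myHostname='ftp.example.com',
                 myUsername='my_user',
                 myPassword='my_password',
                 filename='my_file_to_upload.csv',
                 localpath='/home/jupyter/')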

Similarly to the GDrive case, you can use the same library to get a file from an FTP server into GCP for further use.

The function below has one additional parameter, ftp_subfolder, which specifies, if applicable, the subfolder in which the file of interest is located.

In [ ]:
import pysftp

def import_from_sftp(myHostname, myUsername, myPassword, filename, ftp_subfolder, localPath):
    """
    Import data from an FTP server
    :param myHostname (str), ftp server host that needs to be reached
            myUsername (str), username used to log in to the ftp server
            myPassword (str), password used to log in
            filename (str), file to be fetched
            ftp_subfolder (str), subfolder location on the ftp server (optional)
            localPath (str), local destination of the fetched file
    :return: None (the file is copied locally)
    """
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None  # disable host key checking (use with care)
    # If file is in the FTP root, do not use that input
    if ftp_subfolder == '':
        remoteFilePath = '/' + str(filename)
    else:
        remoteFilePath = '/'+str(ftp_subfolder)+'/'+str(filename)

    print(remoteFilePath)
    
    localFilePath = localPath+ str(filename)

    with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword, cnopts=cnopts) as sftp:
        print("Connection successfully established. Copying file... ")
        print('Downloading...')
        sftp.get(remoteFilePath, localFilePath)  # get a remote file
    # the with block closes the connection automatically
    print('fetched ' + str(remoteFilePath) + ' from sFTP server.')
    print('')
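
Again, a hypothetical call with placeholder values would be:

In [ ]:
# Hypothetical usage: fetch a file from the Export subfolder of the FTP server
import_from_sftp(myHostname='ftp.example.com',
                 myUsername='my_user',
                 myPassword='my_password',
                 filename='my_file_to_upload.csv',
                 ftp_subfolder='Export',
                 localPath='/home/jupyter/')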

4. Final words

That's it! Let me know if you have questions or comments :)