# Setting up the development environment This document explains all the necessary steps to set up your environment and work with this project correctly. Perhaps you want to set up the environment to help us out, or to learn how we work, or because you want to set up a similar workflow. In any case, we appreciate the time you are taking here 😀. - [Python](#python) - [Install project library](#install-project-library) - [Set environment variables](#set-environment-variables) - [Configuration file](#configuration-file) - [Secrets file](#secrets-file) - [Questions?](#questions) ## Python This project uses Python for most of its processes. We have tested the code in Python 3.9 and 3.10. We recommend creating a [virtual environment](https://docs.python.org/3/library/venv.html) and installing all dependencies there. Something like: ```bash # Create python -m venv venv # Activate . venv/bin/activate ``` ## Download the project First thing is to download the project. If you just want to run the code, clone it from the official repository: ```bash git clone https://github.com/owid/covid-19-data.git ``` Note that the project is quite significant in size, so you may want to use a [shallow clone](https://git-scm.com/docs/git-clone>): ```bash git clone --depth 1 --no-single-branch https://github.com/owid/covid-19-data.git ``` If you want to **contribute** consider [forking the repository](https://docs.github.com/en/get-started/quickstart/fork-a-repo) instead. ## Install project library This project is built around the python library `cowidev`, which we are developing to help us maintain and improve our COVID-19 data pipeline. We recommend using `pip` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#editable-installs). For this, you need to be in `scripts/` folder, next to the `setup.py` file: ```bash cd scripts pip install -e . ``` If the installation went well, running `cowid` in your terminal will execute but raise an `EnvironmentError` error. ## Set environment variables To run the pipeline, you need to create three environment variables: `OWID_COVID_PROJECT_DIR`, `OWID_COVID_CONFIG` and `OWID_COVID_SECRETS`. The last two variables point to files that we will create in the following sections. | Variable | Description | |----------|-------------| | `OWID_COVID_PROJECT_DIR` | Path to the local project directory, e.g. `/Users/username/projects/covid-19-data/` | | `OWID_COVID_CONFIG` | Path to the data pipeline [configuration file](#configuration-file). This file provides the default configuration values for the pipeline. Our team uses [`config.yaml`](https://github.com/owid/covid-19-data/blob/master/scripts/config.yaml), which you can use and adapt to your needs. | | `OWID_COVID_SECRETS` | Path to the data pipeline [secrets file](#secrets-file). | You need to add these variables to your shell config file (i.e. `.bashrc`, `.bash_profile` or `.zshrc`), e.g.: ``` export OWID_COVID_PROJECT_DIR=/Users/username/projects/covid-19-data export OWID_COVID_CONFIG=${OWID_COVID_PROJECT_DIR}/scripts/config.yaml export OWID_COVID_SECRETS=${OWID_COVID_PROJECT_DIR}/scripts/secrets.yaml ``` Note that this is an example and you are free to choose other paths as long as they point to the correct files. More on the `config.yaml` and `secrets.yaml` file below. ## Configuration file The configuration file is required to run the COVID-19 vaccination and testing data pipelines (might be extended to other pipelines). Please find below a sample with its structure. You can also check [the one we use](https://github.com/owid/covid-19-data/blob/master/scripts/config.yaml). Note that all fields are required, even if left empty. ```yaml execution: parallel: # Use parallelization (bool) njobs: # Number of threads when parallel=True (int) pipeline: # Vaccination data pipeline vaccinations: # Get step get: countries: # Countries to collect data for (list) skip_countries: # Countries to skip data collection for (list) # Process step process: skip_complete: # Countries to skip data processing (list) skip_monotonic_check: skip_anomaly_check: # Skip anomaly checks for these countries, dates and metrics (dict) Australia: # Country name, Australia left as an example (list) - date: # Date to avoid check for (str YYYY-MM-DD) metrics: # Metric to avoid check for (str) # Generate step generate: # Export step export: # Testing data pipeline testing: # Get step get: countries: # Countries to collect data for (list) skip_countries: # Countries to skip data collection for (list) # Process step process: # Generate step generate: # Export step export: # Hospitalization data pipeline hospitalizations: # Generate step generate: # Countries to include countries: # Countries to skip skip_countries: ``` ## Secrets file We use the secrets file to update internal flows with the pipeline's output (fields `vax` and `test`). While there are many fields, **contributors may only need set one field: `google.clients_secrets`**, which is needed to interact with Google Drive / Google Sheets based sources (more on how to get it [here](#how-can-i-get-the-google-client-secrets-json-file)). Note that this file is not shared, as it may contain sensitive data. ```yaml # Google configuration (dict) google: client_secrets: # Path to google client_secrets.json file mail: # Email (str), OPTIONAL scraperapi: token: # Token for https://www.scraperapi.com/ services (free plan) slack: token: # Token to send messages to slack # Vaccination configuration (dict), OPTIONAL vaccinations: post: # OWID Vaccination internal post link (str) sheet_id: # OWID Vaccination internal spredsheet ID, where manual imports happen (str) # Testing configuration (dict), OPTIONAL testing: post: # OWID Testing internal post link (str) sheet_id: # OWID Testing internal spredsheet ID, where manual imports happen (str) sheet_id_attempted: # OWID Extra Testing internal spredsheet ID, where attempted countries are listed (str) # Twitter configuration (dict), OPTIONAL twitter: consumer_key: # Consumer key (str) consumer_secret: # Consumer secret (str) access_secret: # Access secret (str) access_token: # Acces token (str) ``` ### How can I get the google `client_secrets.json` file? The value of `google.client_secrets` should point to the JSON file downloaded from Google Cloud Platform that contains your personal Google credentials. To obtain it, you can follow [`gsheets` documentation](https://gsheets.readthedocs.io/en/stable/#quickstart): > Log into the [Google Developers Console](https://console.developers.google.com/) with the Google account whose > spreadsheets you want to access. Create (or select) a project and enable the **Drive API** and **Sheets API** (under > Google Apps APIs). > > Go to the Credentials for your project and create **New credentials** > **OAuth client ID** > of type **Other**. In > the list of your **OAuth 2.0 client IDs** click **Download JSON** for the Client ID you just created. We recommend saving the downloaded file in a safe directory, with a simplified name, e.g. `~/.config/owid/client_secrets.json`. ### What is `scraperapi.token`? Scraper API is a service with a friendly proxy that allows you to access any HTML without being blocked. Four our pipeline you need to [register](https://www.scraperapi.com/) and get their TOKEN. The free plan should be OK! ## Verify installation Once you have installed the library, configured the configuration and secrets files accordingly along with the environment variables you should be able to run: ``` ~ cowid --help Usage: cowid [OPTIONS] COMMAND [ARGS]... COVID-19 Data pipeline tool by Our World in Data. Options: --parallel / --no-parallel Parallelize process. [default: parallel] --n-jobs INTEGER Number of threads to use. [default: -2] -S, --server / --no-server Only critical log and final message to slack. [default: no-server] --help Show this message and exit. Commands: megafile COVID-19 data integration pipeline (former megafile) test COVID-19 Testing data pipeline. vax COVID-19 Vaccination data pipeline. hosp COVID-19 Hospitalization data pipeline. jhu COVID-19 Cases/Deaths data pipeline. variants COVID-19 Variants data pipeline. xm COVID-19 Excess Mortality data pipeline. gmobility Google Mobility data pipeline. oxcgrt COVID-19 stringency index (by OxCGRT) data pipeline. decoupling COVID-19 Decoupling data pipeline. sweden COVID-19 Sweden data pipeline. uk-nations COVID-19 UK Nations data pipeline. check COVID-19 data pipeline checks. ``` ## Questions? Raise an [issue](https://github.com/owid/covid-19-data/issues), we are happy to help!