# Data pipeline To produce [our dataset](../dataset) we are constantly developing our dedicated library [cowidev](../cowidev/index). This library provides us with the command tool [`cowid`](../cowidev/cowid-api) which eases: 1. Running several _sub-processes_ (or pipelines) that generate _intermediate datasets_. 2. Jointly processing and merging all these intermediate datasets into the final and complete dataset. Consequently, the dataset is updated multiple times a day (_at least_ at 06:00 and 18:00 UTC), using the latest generated intermediate datasets. ## Overview The dataset pipeline is built from several pipelines, which are executed independently and whose outputs are combined in a final step. The complexity of the pipelines varies. For instance, for vaccinations, testing and hospitalization we are responsible for collecting, processing and publishing the data but for cases/deaths we leave the collection step to [Johns Hopkins Coronavirus Resource Center](https://coronavirus.jhu.edu/map.html) and then transform and publish the data. Note that on 23 June 2022, we stopped adding new data points to our COVID-19 testing dataset ([read more)](https://github.com/owid/covid-19-data/discussions/2667)). The table below lists all the constituent pipelines, along with their execution frequencies, and what are the pipelines' tasks. | **Pipeline** | **Frequency** | **Tasks** | |---------------------------|------------------------------|------------------------------------------| | [Vaccinations](#vaccinations-pipeline) | every weekday at 12:00 UTC | {abbr}`Collection (Scraping primary sources (e.g. country governmental sites) and extracting relevant datapoints.)`, {abbr}`transformation (Transforming and cleaning the downloaded data into a human-readable format.)`, {abbr}`presentation (Presenting the cleaned data to the public (e.g. charts, dataset files, etc.).)` | | [Testing](#testing-pipeline) | Phased out ([read more](https://github.com/owid/covid-19-data/discussions/2667)) | Collection, transformation, presentation | | [Hospitalization & ICU](#hospitalization-icu-pipeline) | daily at 06:00 and 18:00 UTC | Collection, transformation, presentation | | [Cases & Deaths (JHU)](#cases-deaths-jhu-pipeline) | daily at 04:00, 10:00, 16:00 and 22:00 UTC | Transformation, presentation | | [Excess mortality](#excess-mortality-pipeline) | weekly | Transformation, presentation | | [Variants](#variants-pipeline) | daily at 20:00 UTC | Transformation, presentation | | [Reproduction rate](#reproduction-rate-pipeline) | daily | Presentation | | [Policy responses (OxCGRT)](#policy-responses-oxcgrt-pipeline) | daily | Transformation, presentation | | [Public monitor (YouGov)](#public-monitor-yougov-pipeline) | weekly | Transformation, presentation | You can find all the automation details [in this file](https://github.com/owid/covid-19-data/blob/master/scripts/scripts/autoupdate.sh). ## Vaccinations pipeline The vaccination pipeline is probably the most complete one, where we scrape and extract data for each country in the dataset. The pipeline is executed manually, by [@edomt](https://github.com/edomt) or [@lucasrodes](https://github.com/lucasrodes) every weekday (i.e. Monday until Friday) before 12 UTC. ### Execution steps ``` # Download/scrape data cowid vax get # Proces/check data cowid vax process # Generate dataset cowid vax generate # Integrate into full dataset cowid vax export ``` ```{seealso} [Intermediate dataset](https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/), including per-country files and data technical details. ``` ## Testing pipeline We scrape and process data for multiple countries, similarly to the vaccinations pipeline. The pipeline is executed manually, by [@camapel](https://github.com/camapel) on Mondays and Fridays. :::{warning} On 23 June 2022, we stopped adding new datapoints to our COVID-19 testing dataset. We continue to update all other metrics in our COVID-19 dataset. You can read more [here](https://github.com/owid/covid-19-data/discussions/2667). ::: ### Execution steps ``` # Download/scrape data cowid testing get ``` ```{seealso} [Intermediate datasets](https://github.com/owid/covid-19-data/tree/master/public/data/testing) ``` ## Hospitalization & ICU pipeline We scrape and process the data similarly as to what we do for testing and vaccinations. The pipeline is run daily. ### Execution steps ``` # Download data & generate dataset cowid hosp generate # Update Grapher-ready files cowid hosp grapher-io # Update Grapher database cowid hosp grapher-db ``` ```{seealso} [Intermediate dataset and data technical details](https://github.com/owid/covid-19-data/tree/master/public/data/hospitalizations). ``` ## Cases & Deaths (JHU) pipeline We source cases and death figures from the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19). We transform some of the variables and re-publish the dataset. ### Execution steps ``` # Download data cowid jhu download # Generate dataset cowid jhu generate # Update Grapher database cowid jhu grapher-db ``` ```{seealso} [Intermediate dataset](https://github.com/owid/covid-19-data/tree/master/public/data/jhu). ``` ## Excess Mortality pipeline The pipeline is manually executed once a week. The reported all-cause mortality data is from the [Human Mortality Database](https://www.mortality.org/) (HMD) Short-term Mortality Fluctuations project and the [World Mortality Dataset](https://github.com/akarlinsky/world_mortality) (WMD). Both sources are updated weekly. We also present estimates of excess deaths globally that are [published by _The Economist_](https://github.com/TheEconomist/covid-19-the-economist-global-excess-deaths-model). ### Execution steps ``` # Download data and generate dataset cowid xm generate ``` ```{seealso} [Intermediate dataset and data technical details](https://github.com/owid/covid-19-data/tree/master/public/data/excess_mortality). ``` ## Variants pipeline We run this pipeline daily. ### Execution steps ``` # Download data and generate dataset cowid variants generate # Update Grapher-ready files cowid variants grapher-io ``` ```{note} The data on variants and sequencing is indeed no longer available to download. It is published by GISAID under a license that doesn't allow us to redistribute it. Please visit [the data publisher's website](https://www.gisaid.org/) for more details. You may want to register an account there if you're really interested in using this data. ``` ## Reproduction rate pipeline We source the data from [crondonm/TrackingR/](https://github.com/crondonm/TrackingR/). ```{seealso} [_Tracking R of COVID-19 A New Real-Time Estimation Using the Kalman Filter_](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244474), by Francisco Arroyo, Francisco Bullano, Simas Kucinskas, and Carlos Rondón-Moreno ``` ## Policy responses (OxCGRT) pipeline ``` # Get the data cowid oxcgrt get # Update Grapher files cowid oxcgrt grapher-io # Upload data to database cowid oxcgrt grapher-db ``` ## Public monitor (YouGov) pipeline :::{warning} The YouGov pipeline is under construction. :::