Révision | d7d1d1aea8bf51989975e4c3205c635f9a49c881 |
---|---|
Taille | 6,477 octets |
l'heure | 2022-02-16 06:48:30 |
Auteur | Lorenzo Isella |
Message de Log | A real code to generate a flipbook of one of my scripts. |
---
title: "TAM Data Preparation"
subtitle: "A Step-by-step Guide"
author: "Lorenzo Isella"
output:
xaringan::moon_reader:
lib_dir: libs
css: [default, hygge, ninjutsu]
nature:
ratio: 16:10
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
```{r setup, include = FALSE}
library(flipbookr)
library(tidyverse)
knitr::opts_chunk$set(fig.width = 6, message = FALSE,
warning = FALSE, comment = "",
cache = F)
```
```{css, eval = TRUE, echo = FALSE}
.remark-code{line-height: 1.5; font-size: 80%}
@media print {
.has-continuation {
display: block;
}
}
```
```{r covid, include = FALSE}
library(tidyverse)
library(janitor)
library(openxlsx)
library(stringr)
library(lubridate)
source("/home/lorenzo/myprojects-hg/R-codes/stat_lib.R")
covid_data_ini <- read_excel("SA-Covid19.xlsx") #BREAK
covid_data <- covid_data_ini %>%
clean_data() %>%
select(case_reference) %>%
distinct %>%
mutate(is_covid_case="Yes")
covid_data
```
```{r proc, include = FALSE}
codes <- read_csv("procedure_codes.csv")
codes
```
---
First we load some libraries and the file containing the covid identifier.
---
`r chunk_reveal(chunk_name = "covid" , break_type = "user" ,
title = "### Set up the libraries and read the covid identifier file")`
---
Then we read the file which contains the list of the procedures
---
`r chunk_reveal(chunk_name = "proc" ,
title = "### Read the file with the procedure list")`
```{r tam_raw, include = FALSE}
df_ini <- read_all_csv_to_char(pattern="(tam).*\\.csv$")
df_ini
```
---
Then we read the TAM data as extracted by R3 (with a parser we have
already converted the Excel files to csv)
---
`r chunk_reveal(chunk_name = "tam_raw" ,
title = "### Read the TAM data extracted by Kurt and Praveen already converted to csv format")`
```{r tam_clean, include = FALSE}
df <- df_ini %>% clean_names() %>%
distinct(aid_award_reference, .keep_all=T) %>%
filter(beneficiary_country %!in% c("Spain", "Poland", "Romania")) %>%
clean_data() %>%
mutate(across(contains("aid_absolute_eur"), ~as.numeric(.x))) %>%
mutate(across(contains("date"), ~mdy(.x))) %>%
mutate(year=year(aid_award_granted_date)) %>%
mutate(lower_bound=str_extract_before(granted_range_eur,"-")) %>%
mutate(upper_bound=str_extract_after(granted_range_eur,"-")) %>%
mutate(lower_bound=as.numeric(lower_bound),
upper_bound=as.numeric(upper_bound)) %>%
mutate(estimated_value=(lower_bound+upper_bound)/2) %>% pattern_to_na(0) %>%
mutate(granted_value_extended_eur = case_when(
!is.na(granted_aid_absolute_eur) ~ granted_aid_absolute_eur,
is.na(granted_aid_absolute_eur) & !is.na(estimated_value) ~estimated_value,
is.na(granted_aid_absolute_eur) & is.na(estimated_value) ~ nominal_aid_absolute_eur)) %>%
mutate(nominal_value_extended_eur=
case_when(!is.na(nominal_aid_absolute_eur) ~ nominal_aid_absolute_eur,
is.na(nominal_aid_absolute_eur)~granted_value_extended_eur )) %>%
select(-c(lower_bound, upper_bound, estimated_value)) %>%
mutate(is_covid_case=if_else(case_reference %in% covid_data$case_reference,
"Yes", "No")) %>%
mutate(granted_value_extended_eur=if_else(is_covid_case=="Yes",
NA_real_,granted_value_extended_eur ))
df %>% glimpse
```
---
The crucial part of the code: I try to infer the nominal amounts and
aid elements when they are not given explicitly in TAM.
If the aid element is not given directly, but a range is available, I
use the mid-point to estimate the aid element. If neither is
available, then I assume that the aid element is equal to the nominal
value.
Then I deal with the nominal value.
When the nominal amount is not
available, I replace it with the aid element I calculated before.
Finally, I turn again to the aid element and,
for all the cases with the covid identifier equal to "yes", I set the
aid element as not available (since in that case the
amounts are always nominal and if they are given as aid elements, then
it is a mistake in the TAM encodying or in the info provided by the
MS).
On top of that, I make sure that the beneficiary country is not one of
Spain, Poland or Romania (which do not upload their data in our
system).
I make sure the dates are treated correctly and I create a new
variable "year" which is the year when the aid was awarded.
---
`r chunk_reveal(chunk_name = "tam_clean" ,
title = "### Bulk of the code")`
```{r ttnew, include = FALSE}
write_csv(df, "tam_updated.csv.gz")
```
---
I save the TAM database as a compressed csv file and read a file with
the NACE codes
---
`r chunk_reveal(chunk_name = "ttnew" ,
title = "### Save the TAM database as a compressed csv")`
---
I also read the NACE codes and descriptions from a file.
---
```{r nace2, include = FALSE}
nace <- read_excel("nace_codes.xlsx") %>%
mutate(code=substrLeft(description,1))
nace
```
`r chunk_reveal(chunk_name = "nace2" ,
title = "### Get the NACE codes and descriptions")`
---
I perform some minor modifications on the tam database and I extract a
single-letter NACE code, which I call macro sector. I then join the
TAM database with the file containing the description of the single
letter NACE code.
Finally, I save the data in the R friendly RDS format
---
```{r conclusion, include = FALSE}
df.out <- df %>%
mutate(name_of_beneficiary=if_else(!is.na(beneficiary_name_english),
beneficiary_name_english, beneficiary_name)) %>%
mutate(aid_award_ga=if_else(!is.na(aid_award_ga_english),
aid_award_ga_english, aid_award_ga_original)) %>%
rename("instrument_type"="aid_award_instrument") %>%
mutate(macro_sector=substrLeft(beneficiary_sector, 1)) %>%
left_join(y=nace, by=c("macro_sector"="code")) %>%
select(-macro_sector) %>%
rename("macro_sector"="description") %>%
rename("case_title"="case_title_original") %>%
left_join(y=codes, by=c("main_procedure_type_code"="code")) %>%
mutate(main_procedure_type_code=meaning) %>%
select(-meaning)
df.out %>% glimpse
saveRDS(df.out,"TAM_cleaned_for_shiny.RDS")
```
`r chunk_reveal(chunk_name = "conclusion" ,
title = "### Last Steps and I save the TAM database for the Shiny App")`