• R/O
  • SSH

Tags
Aucun tag

Frequently used words (click to add to your profile)

javac++androidlinuxc#windowsobjective-ccocoa誰得qtpythonphprubygameguibathyscaphec計画中(planning stage)翻訳omegatframeworktwitterdomtestvb.netdirectxゲームエンジンbtronarduinopreviewer

File Info

Révision d7d1d1aea8bf51989975e4c3205c635f9a49c881
Taille 6,477 octets
l'heure 2022-02-16 06:48:30
Auteur Lorenzo Isella
Message de Log

A real code to generate a flipbook of one of my scripts.

Content

---
title: "TAM Data Preparation"
subtitle: "A Step-by-step Guide"
author: "Lorenzo Isella"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: [default, hygge, ninjutsu]
    nature:
      ratio: 16:10
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include = FALSE}
library(flipbookr)
library(tidyverse)
knitr::opts_chunk$set(fig.width = 6, message = FALSE, 
                      warning = FALSE, comment = "", 
                      cache = F)
```

```{css, eval = TRUE, echo = FALSE}
.remark-code{line-height: 1.5; font-size: 80%}
@media print {
  .has-continuation {
    display: block;
  }
}
```



```{r covid, include = FALSE}
library(tidyverse)
library(janitor)
library(openxlsx)
library(stringr)
library(lubridate)


source("/home/lorenzo/myprojects-hg/R-codes/stat_lib.R")

covid_data_ini <- read_excel("SA-Covid19.xlsx") #BREAK


covid_data <- covid_data_ini %>%
    clean_data()  %>%
    select(case_reference) %>%
    distinct %>%
    mutate(is_covid_case="Yes")

covid_data

```

```{r proc, include = FALSE}
codes <- read_csv("procedure_codes.csv")
codes
```
---
First we load some libraries and the file containing the covid identifier.
---
`r chunk_reveal(chunk_name = "covid" , break_type = "user" ,
              title =  "###  Set up the libraries and read the covid identifier file")`


---
Then we read the file which contains the list of the procedures
---
`r chunk_reveal(chunk_name = "proc" ,
              title = "### Read the file with the procedure list")`

```{r tam_raw, include = FALSE}
df_ini <- read_all_csv_to_char(pattern="(tam).*\\.csv$")
df_ini
```
---
Then we read the TAM data as extracted by R3 (with a parser we have
already converted the Excel files to csv)
---
`r chunk_reveal(chunk_name = "tam_raw" ,
              title = "### Read the TAM data extracted by Kurt and Praveen already converted to csv format")`


```{r tam_clean, include = FALSE}
df <- 	df_ini %>% clean_names() %>%
    distinct(aid_award_reference, .keep_all=T) %>% 
    filter(beneficiary_country %!in% c("Spain", "Poland", "Romania")) %>% 
    clean_data() %>% 
    mutate(across(contains("aid_absolute_eur"), ~as.numeric(.x))) %>% 
    mutate(across(contains("date"), ~mdy(.x))) %>%
    mutate(year=year(aid_award_granted_date)) %>% 
    mutate(lower_bound=str_extract_before(granted_range_eur,"-")) %>%
    mutate(upper_bound=str_extract_after(granted_range_eur,"-")) %>%
    mutate(lower_bound=as.numeric(lower_bound),
           upper_bound=as.numeric(upper_bound)) %>%
    mutate(estimated_value=(lower_bound+upper_bound)/2) %>% pattern_to_na(0) %>% 
    mutate(granted_value_extended_eur = case_when(
    !is.na(granted_aid_absolute_eur) ~ granted_aid_absolute_eur,
    is.na(granted_aid_absolute_eur) & !is.na(estimated_value)  ~estimated_value,
   is.na(granted_aid_absolute_eur) & is.na(estimated_value)  ~ nominal_aid_absolute_eur)) %>% 
    mutate(nominal_value_extended_eur=
      case_when(!is.na(nominal_aid_absolute_eur) ~ nominal_aid_absolute_eur, 
    is.na(nominal_aid_absolute_eur)~granted_value_extended_eur   )) %>% 
    select(-c(lower_bound, upper_bound, estimated_value)) %>%
    mutate(is_covid_case=if_else(case_reference %in% covid_data$case_reference,
                                "Yes", "No")) %>%
    mutate(granted_value_extended_eur=if_else(is_covid_case=="Yes",
          NA_real_,granted_value_extended_eur ))

df %>% glimpse
```
---
The crucial part of the code: I try to infer the nominal amounts and
aid elements when they are not given explicitly in TAM.
If the aid element is not given directly, but a range is available, I
use the mid-point to estimate the aid element. If neither is
available, then I assume that the aid element is equal to the nominal
value. 
Then I deal with the nominal value.
When the nominal amount is not
available, I replace it with the aid element I calculated before.

Finally, I turn again to the aid element and,
for all the cases with the covid identifier equal to "yes", I set the
aid element as not available (since in that case the
amounts are always nominal and if they are given as aid elements, then
it is a mistake in the TAM encodying or in the info provided by the
MS). 

On top of that, I make sure that the beneficiary country is not one of
Spain, Poland or Romania (which do not upload their data in our
system). 
I make sure the dates are treated correctly and I create a new
variable "year" which is the year when the aid was awarded.
---
`r chunk_reveal(chunk_name = "tam_clean" ,
              title = "### Bulk of the code")`


```{r ttnew, include = FALSE}
write_csv(df, "tam_updated.csv.gz")

```
---
I save the TAM database as a compressed csv file and read a file with
the NACE codes
---
`r chunk_reveal(chunk_name = "ttnew" ,
              title = "### Save the TAM database as a compressed csv")`
---
I also read the NACE codes and descriptions from a file.
---
```{r nace2, include = FALSE}
nace <- read_excel("nace_codes.xlsx") %>%
    mutate(code=substrLeft(description,1))
nace
```
`r chunk_reveal(chunk_name = "nace2" ,
              title = "### Get the NACE codes and descriptions")`
---
I perform some minor modifications on the tam database and I extract a
single-letter NACE code, which I call macro sector. I then join the
TAM database with the file containing the description of the single
letter NACE code. 

Finally, I save the data in the R friendly RDS format
---
```{r conclusion, include = FALSE}
df.out <- df %>%
    mutate(name_of_beneficiary=if_else(!is.na(beneficiary_name_english),
                                              beneficiary_name_english, beneficiary_name)) %>%
    mutate(aid_award_ga=if_else(!is.na(aid_award_ga_english),
                                aid_award_ga_english, aid_award_ga_original)) %>%
    rename("instrument_type"="aid_award_instrument") %>% 
    mutate(macro_sector=substrLeft(beneficiary_sector, 1)) %>%
    left_join(y=nace, by=c("macro_sector"="code")) %>%
    select(-macro_sector) %>%
    rename("macro_sector"="description")  %>%
    rename("case_title"="case_title_original") %>%
    left_join(y=codes, by=c("main_procedure_type_code"="code")) %>%
    mutate(main_procedure_type_code=meaning) %>%
    select(-meaning) 

df.out %>% glimpse

saveRDS(df.out,"TAM_cleaned_for_shiny.RDS")

```
`r chunk_reveal(chunk_name = "conclusion" ,
              title = "### Last Steps and I save the TAM database for the Shiny App")`