# Fault analysis

Regulators must ensure that revenue caps are neither too high nor too low, so they are particularly concerned that the probability of failure is not systematically under- or overestimated, and that the model is not overfitted by adding too many explanatory variables. Data scientists and researchers also need to estimate parameters for new variables that might only be relevant to a few climates or geographical areas. Parameter estimation is likewise important for defining new models for asset types not currently covered by the CNAIM standard.

The fault statistics dataset, transformer_11kv_faults, is provided when the main package is loaded.

When identifying new parameters, it is reasonable to assume that C = 1, since it is constant for all parameters in the 2017 CNAIM specification. The shape of the curve is, in fact, exclusively defined by H.
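For context, the role of each constant can be read off the probability-of-failure curve. The expression below is a sketch that assumes the cubic form used in the CNAIM methodology, with $$K$$ a scaling constant, $$C$$ the shape constant and $$H$$ the health score; these symbol meanings are taken from that methodology rather than defined on this page.

$$PoF = K\left[1 + CH + \frac{(CH)^2}{2!} + \frac{(CH)^3}{3!}\right]$$

With $$C = 1$$ this becomes

$$PoF = K\left[1 + H + \frac{H^2}{2} + \frac{H^3}{6}\right],$$

so $$K$$ only rescales the curve, while $$H$$ alone determines its shape.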

In its general form, the Weibull distribution is a lifetime distribution with two parameters: the scale parameter $$\alpha$$ and the shape parameter $$\gamma$$. With these two parameters, the Weibull distribution has pdf

$$f(t, \gamma, \alpha) = \frac{\gamma}{t}\left(\frac{t}{\alpha}\right)^\gamma e^{-\left(\frac{t}{\alpha}\right)^\gamma},$$

mean

$$\alpha \Gamma\left(1+\frac{1}{\gamma}\right),$$

and variance

$$\alpha^2 \Gamma\left(1+\frac{2}{\gamma}\right) - \left[ \alpha\Gamma\left(1+\frac{1}{\gamma}\right)\right]^2.$$
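The mean and variance expressions can be evaluated directly in base R. The following is a minimal sketch, not part of the CNAIM package: the helper names `weibull_mean` and `weibull_var` and the example values are hypothetical, and the closed forms are cross-checked by numerically integrating R's built-in Weibull density `dweibull`.

```r
# Closed-form Weibull mean and variance.
# alpha = scale parameter, shape = shape parameter (gamma in the text).
weibull_mean <- function(alpha, shape) {
  alpha * gamma(1 + 1 / shape)
}

weibull_var <- function(alpha, shape) {
  alpha^2 * gamma(1 + 2 / shape) - weibull_mean(alpha, shape)^2
}

# Illustrative values only (not estimated from transformer_11kv_faults).
alpha_example <- 40
gamma_example <- 1.5

# Cross-check the closed forms by integrating the Weibull density.
num_mean <- integrate(
  function(t) t * dweibull(t, shape = gamma_example, scale = alpha_example),
  lower = 0, upper = Inf
)$value
num_second_moment <- integrate(
  function(t) t^2 * dweibull(t, shape = gamma_example, scale = alpha_example),
  lower = 0, upper = Inf
)$value
num_var <- num_second_moment - num_mean^2

print(c(closed_form = weibull_mean(alpha_example, gamma_example), numerical = num_mean))
print(c(closed_form = weibull_var(alpha_example, gamma_example), numerical = num_var))

# With shape = 1 the mean equals alpha and the variance equals alpha^2.
print(weibull_mean(alpha_example, 1))
print(weibull_var(alpha_example, 1))
```

The arguments are named `alpha` and `shape` rather than `scale` and `gamma` to avoid masking the base R functions of those names.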

For a first analysis, we set $$\gamma = 1$$, which reduces the Weibull distribution to an exponential distribution with pdf $$f(t, \alpha) = \frac{1}{\alpha}e^{-\frac{t}{\alpha}}$$, mean lifetime $$\alpha$$, and variance $$\alpha^2$$. First, the parameter $$\alpha$$ is fitted to the data using multilinear regression. Then the variance of the resulting distribution is compared with the variance estimated from the data. If the exponential distribution turns out to be a poor fit, the analysis can be extended with the shape parameter $$\gamma$$.
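The reduction to the exponential case can be verified by substituting $$\gamma = 1$$ into the general Weibull expressions and using $$\Gamma(2) = 1$$ and $$\Gamma(3) = 2$$:

$$f(t, 1, \alpha) = \frac{1}{t}\left(\frac{t}{\alpha}\right) e^{-\frac{t}{\alpha}} = \frac{1}{\alpha}e^{-\frac{t}{\alpha}},$$

$$\alpha\Gamma(2) = \alpha, \qquad \alpha^2\Gamma(3) - \left[\alpha\Gamma(2)\right]^2 = 2\alpha^2 - \alpha^2 = \alpha^2.$$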

In our use case, the parameter $$\alpha$$ denotes the mean (or expected) lifetime of a transformer, which depends on several variables describing location and environmental factors. This dependence is represented by the following multilinear model for $$\alpha$$: $$\alpha = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_n x_n,$$ with $$x_1,\dots,x_n$$ the values of the explanatory variables, and $$\alpha_0,\alpha_1,\dots,\alpha_n$$ the coefficients to be determined by regression.
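To make the two-step procedure concrete, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the variables `x1` and `x2`, the coefficients in `alpha_true`, and the sample size are invented and do not come from `transformer_11kv_faults`. Lifetimes are drawn from an exponential distribution whose mean follows the multilinear model, the coefficients are recovered with `lm`, and the variance implied by the exponential model ($$\alpha^2$$) is compared with the residual variance.

```r
set.seed(1)
n <- 20000

# Hypothetical explanatory variables (not columns of the real dataset).
x1 <- runif(n, min = 0, max = 100)         # continuous covariate
x2 <- rbinom(n, size = 1, prob = 0.5)      # binary covariate

# Assumed "true" coefficients alpha_0, alpha_1, alpha_2.
alpha_true <- c(35, -0.05, -4)
alpha <- alpha_true[1] + alpha_true[2] * x1 + alpha_true[3] * x2

# Exponential lifetimes with mean alpha (the gamma = 1 case).
age <- rexp(n, rate = 1 / alpha)

# Step 1: fit the multilinear model for the mean lifetime.
fit <- lm(age ~ x1 + x2)
alpha_hat <- coef(fit)
print(alpha_hat)

# Step 2: compare the variance implied by the exponential model,
# alpha^2 averaged over the sample, with the residual variance.
implied_var <- mean(fitted(fit)^2)
residual_var <- mean(residuals(fit)^2)
variance_ratio <- residual_var / implied_var
print(variance_ratio)
```

Because the synthetic lifetimes really are exponential, the ratio should come out close to 1. A ratio far from 1 on real data would suggest that the exponential assumption is too restrictive and that the shape parameter $$\gamma$$ should be estimated as well.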

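library(CNAIM)
library(dplyr)
linear_model <- lm(formula = age ~ ., data = transformer_11kv_faults %>% select(-dead, -transformer_id, -pof))
summary(linear_model)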

Call:
lm(formula = age ~ ., data = transformer_11kv_faults %>% select(-dead,
-transformer_id, -pof))

Residuals:
Min      1Q  Median      3Q     Max
-59.593 -12.203  -0.642  12.276  57.339

Coefficients:
                                         Estimate Std. Error  t value Pr(>|t|)
(Intercept)                             3.390e+01  2.553e-01  132.803  < 2e-16 ***
utilisation_pct                         8.098e-04  7.942e-04    1.020 0.307860
placementOutdoor                       -3.726e+00  1.147e-01  -32.473  < 2e-16 ***
altitude_m                             -1.825e-03  2.601e-04   -7.016 2.29e-12 ***
distance_from_coast_km                 -2.319e-02  3.979e-03   -5.829 5.59e-09 ***
corrosion_category_index2              -2.062e-01  1.822e-01   -1.132 0.257619
corrosion_category_index3              -3.537e-02  1.822e-01   -0.194 0.846057
corrosion_category_index4              -1.182e+00  1.820e-01   -6.493 8.47e-11 ***
corrosion_category_index5              -2.897e+00  1.810e-01  -16.007  < 2e-16 ***
partial_dischargeHigh (Not Confirmed)   8.628e+00  1.611e-01   53.548  < 2e-16 ***
partial_dischargeLow                    2.729e+01  1.623e-01  168.095  < 2e-16 ***
partial_dischargeMedium                 2.686e+01  1.623e-01  165.522  < 2e-16 ***
oil_acidity                            -8.740e-01  1.541e-01   -5.673 1.41e-08 ***
temperature_readingVery High           -1.370e+01  1.402e-01  -97.740  < 2e-16 ***
observed_conditionGood                 -1.262e-01  1.823e-01   -0.692 0.489016
observed_conditionPoor                 -6.006e-01  1.818e-01   -3.303 0.000956 ***
observed_conditionSlight Deterioration -2.002e-01  1.817e-01   -1.102 0.270612
observed_conditionVery Poor            -1.995e+01  1.812e-01 -110.128  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.02 on 98685 degrees of freedom
Multiple R-squared:  0.4322,    Adjusted R-squared:  0.4321
F-statistic:  4173 on 18 and 98685 DF,  p-value: < 2.2e-16

Most parameter estimates are highly statistically significant, but the $$R^2$$ value of 0.43 shows that more than half of the variance in age is not explained by the linear model. The next step is to pass the residuals of the fitted model through a Weibull analysis function:

CNAIM_weibull(linear_model)

$k
0.0085

$H
4.7