Causal Inference by Independent Component Analysis: Theory and Applications

Structural vector‐autoregressive models are potentially very useful tools for guiding both macro‐ and microeconomic policy. In this study, we present a recently developed method for estimating such models, which uses non‐normality to recover the causal structure underlying the observations. We show how the method can be applied to both microeconomic data (to study the processes of firm growth and firm performance) and macroeconomic data (to analyse the effects of monetary policy).


I. Introduction
Structural vector autoregressions (SVAR models) are among the most prevalent tools in empirical economics to analyse dynamic phenomena. Their basic model is the vector autoregression (VAR model), in which a system of variables is formalized as driven by their past values and a vector of random disturbances. This reduced form representation is typically used for the sake of estimation and forecasting. The VAR form, however, is not sufficient to perform economic policy analysis: it does not provide enough information to study the causal influence of various shocks on the key economic variables, nor to use the estimated parameters to predict the effect of an intervention. Thus, SVAR models are meant to furnish the VAR with structural information so that one can recover the causal relationships existing among the variables under investigation, and trace out how economically interpreted random shocks affect the system.
Where does this structural information come from? The common approach is that it must be derived from economic theory or from institutional knowledge related to the data generating mechanism (Stock and Watson, 2001). In other words, external restrictions (not derived from the data) are used to constrain the SVAR model sufficiently to yield identifiability of the remaining parameters from the data. Such an approach is extremely useful when applicable, but depends on the availability of reliable prior economic theory. The problem is that often competing restrictions, imposing alternative causal orderings, are equally reconcilable with background knowledge. As Demiralp and Hoover (2003) pointed out, this is quite ironic, because the main idea of the VAR approach was to let the data speak and to eschew incredible a priori restrictions (Sims, 1980), but if the SVAR method relies on preconceived stories about simultaneous causal relationships it is at risk of losing empirical plausibility.
Alternatively, one can take a more data-driven approach. Under certain general statistical assumptions (not related to economic theory), it is possible to infer aspects of the SVAR model on the basis of the statistical distribution of the estimated VAR residuals. This approach builds on techniques developed in the graphical causal model literature (Pearl, 2000; Spirtes et al., 2000), which are mostly based on conditional independence tests on the residuals (Swanson and Granger, 1997; Bessler and Lee, 2002; Bessler and Yang, 2003; Demiralp and Hoover, 2003; Moneta, 2004, 2008; Demiralp et al., 2008). Unfortunately, the information obtained from this approach is generally not sufficient to provide full identification of the SVAR model, so typically some additional background knowledge is still needed.
Fortunately, if the data are non-Gaussian, which is not uncommon in many econometric studies (Lanne and Saikkonen, 2009; Lanne and Lütkepohl, 2010), one can exploit higher-order statistics of the VAR residuals to obtain stronger identification results (Shimizu et al., 2006; Hyvärinen et al., 2008). Under further reasonable assumptions (independence of shocks, and causal acyclicity), full identification of the SVAR model can be obtained by applying Independent Component Analysis. In this contribution, we present this novel family of methods to the econometrics community and discuss its relationship to other methods for estimating SVAR models. Furthermore, we give two extended examples of applications to micro- and macroeconomic datasets; these examples demonstrate both the power and some limitations of the procedure. Our hope is that this will further the adoption of these novel methods within the field of econometrics.
The rest of the paper is structured as follows. In section II we describe the SVAR model and the problem of identification we propose to solve. In section III we present our identification algorithm and discuss its assumptions. In section IV we illustrate how the proposed method performs in a microeconomic application aimed at studying the coevolution of firm growth, firm performance, and R&D expenditure. In section V we present an application to a macroeconomic dataset to analyse the effects of monetary policy. Conclusions are provided in section VI.

II. VAR and SVAR models
The basic model
We consider data in the form of a set of K scalar variables y_1, …, y_K measured at regular time intervals, and denote by the vector y_t = (y_1t, …, y_Kt)^T the values of these variables at time t. The data are modelled by expressing the value of each variable y_i, at each time instant t, as a linear combination of the earlier values (up to lag p) of all variables as well as the contemporaneous values of the other variables. In vector form we thus have

y_t = B y_t + Γ_1 y_{t−1} + · · · + Γ_p y_{t−p} + ε_t    (1)

where B and the Γ_j (j = 1, …, p) are K × K matrices denoting the contemporaneous and lagged coefficients (respectively), B has a zero diagonal (by definition), and ε_t is a K × 1 vector of error terms. This model can equivalently be expressed in the standard SVAR form

Γ_0 y_t = Γ_1 y_{t−1} + · · · + Γ_p y_{t−p} + ε_t    (2)

with the notation Γ_0 = I − B. In the standard SVAR model it is assumed that the covariance matrix Σ_ε = E{ε_t ε_t^T} is diagonal (Levtchenkova et al., 1998). We make the stronger assumption that the structural error terms ε_1, …, ε_K are mutually statistically independent (a sufficient but not a necessary condition for uncorrelatedness). We discuss this assumption in more detail in section III. The error terms also have the property of being zero-mean white noise processes (i.e. no correlations across time).
A simple illustration of such a model is given in Figure 1. In this two-variable (K = 2) and one-lag (p = 1) example, the coefficient matrices B and Γ_1 shown in the figure constitute the parameterization we refer to below as (3). Note that the error terms are not explicitly drawn in the figure.
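To make the notation concrete, a two-variable, one-lag model of this kind can be simulated in a few lines. The coefficient values below are hypothetical stand-ins (Figure 1 uses its own numbers, which are not reproduced here), chosen only to illustrate the mapping from the structural form (1)–(2) to observable data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients (NOT the values of Figure 1):
B = np.array([[0.0, 0.0],
              [0.8, 0.0]])      # contemporaneous effect y1 -> y2; zero diagonal
G1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])     # lagged coefficients Gamma_1
G0 = np.eye(2) - B              # Gamma_0 = I - B

T = 20_000
eps = rng.uniform(-1.0, 1.0, size=(T, 2))   # independent, non-Gaussian shocks
y = np.zeros((T, 2))
for t in range(1, T):
    # Structural form Gamma_0 y_t = Gamma_1 y_{t-1} + eps_t, solved for y_t
    y[t] = np.linalg.solve(G0, G1 @ y[t - 1] + eps[t])

# Reduced-form quantities implied by the structural model:
A1 = np.linalg.solve(G0, G1)        # A_1 = Gamma_0^{-1} Gamma_1
u = eps @ np.linalg.inv(G0).T       # u_t = Gamma_0^{-1} eps_t
Sigma_u = np.cov(u.T)
print(np.round(Sigma_u, 3))         # covariance of u_t is generally non-diagonal
```

Note how the contemporaneous effect in B leaks into every entry of the reduced-form quantities: A_1 and the covariance of u_t mix lagged and contemporaneous structure together.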

The identification problem
Since the variables y_1t, …, y_Kt are endogenous in equations (1) and (2), these models cannot be directly estimated without biases. However, from a SVAR model one can derive the reduced form VAR model

y_t = A_1 y_{t−1} + · · · + A_p y_{t−p} + u_t    (4)

with A_j = Γ_0^{−1} Γ_j and u_t = Γ_0^{−1} ε_t, where u_t is a vector of zero-mean white noise processes with covariance matrix Σ_u = E{u_t u_t^T}, which in general will not be diagonal. For our example in Figure 1, the reduced form VAR parameters follow directly from this mapping. While the VAR parameters of equation (4) can easily be directly estimated from the observed data y_t (t = 1, …, T) (see e.g. Canova, 1995), this is not sufficient to recover the parameters of equation (2), whose number is larger than the number of parameters of equation (4). This is a problem since the reduced-form VAR model is only adequate for time-series forecasting, and not sufficient for policy analysis. To see this, consider the Wold Moving Average (MA) representation of the VAR model

y_t = Φ_0 u_t + Φ_1 u_{t−1} + Φ_2 u_{t−2} + · · ·    (5)

where the Φ_j (j = 0, 1, 2, …) are the MA coefficient matrices, obtained recursively from Φ_0 = I and Φ_j = Σ_{i=1}^{j} A_i Φ_{j−i}, where A_j = 0 for j > p. The matrices

Θ_j = Φ_j Γ_0^{−1}  (j = 0, 1, 2, …)    (6)

represent the impulse responses of the elements of y_t to the shocks ε_{t−j}. The impulse responses Θ_j can therefore be obtained from the reduced-form VAR parameters A_j only if we know the SVAR coefficient matrix Γ_0 = I − B.
As the impulse responses are crucial for policy analysis, it is clear that we need to recover the SVAR representation for this purpose. The problem is that any invertible unit-diagonal matrix Γ_0, as we see from equation (6), is compatible with the coefficient matrices that we obtain by estimating the VAR reduced form (4). It is crucial then to find the correct Γ_0 which produces the right transformation Γ_0 u_t = ε_t of the VAR error terms u_t. A common approach is to require the contemporaneous connections to be acyclic, which implies that Γ_0 is lower triangular when the variables are arranged in an appropriate order. For any fixed variable order, this condition is sufficient to uniquely identify Γ_0 (obtainable from a Cholesky factorization of Σ_u). Hence, a standard approach for SVAR identification is to use economic theory to select an appropriate contemporaneous ordering, and then let the data identify the remaining parameters. Unfortunately, there does not always exist sufficient theory to unambiguously order the observed variables.
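Under a fixed (assumed) causal ordering, the Cholesky route can be sketched numerically: build Σ_u from a known lower-triangular Γ_0 and diagonal shock covariance, then recover the unit-diagonal Γ_0^{−1} and the shock variances from the Cholesky factor. The numbers below are illustrative only:

```python
import numpy as np

# Illustrative lower-triangular structural matrix (unit diagonal) and shock variances
G0 = np.array([[1.0, 0.0],
               [-0.8, 1.0]])               # Gamma_0 = I - B with B[1,0] = 0.8
G0_inv = np.linalg.inv(G0)
Sigma_eps = np.diag([0.5, 2.0])            # diagonal shock covariance

# Reduced-form residual covariance: Sigma_u = Gamma_0^{-1} Sigma_eps Gamma_0^{-T}
Sigma_u = G0_inv @ Sigma_eps @ G0_inv.T

# Cholesky factor C (lower triangular): Sigma_u = C C^T
C = np.linalg.cholesky(Sigma_u)
d = np.diag(C)
G0_inv_hat = C / d                 # normalize each column -> unit-diagonal Gamma_0^{-1}
Sigma_eps_hat = np.diag(d**2)      # recovered shock variances

print(np.round(G0_inv_hat, 6))
```

The factorization recovers Γ_0^{−1} exactly here because the assumed ordering matches the true one; with the wrong ordering, the Cholesky factor would still exist but would not correspond to the structural model.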
To illustrate, for the example in Figure 1, if we could not use economic theory to dictate whether there should be a contemporaneous connection from y_1 to y_2, or vice versa, we would not (using covariance information alone) be able to distinguish between the model of Figure 1 and the alternative parameterization (11), in which the contemporaneous connection is reversed. (In section III, using the same example, we will show how non-normality of the error terms can be exploited to fully identify the SVAR model.) Graphical-model applications to SVAR identification seek to recover the matrix B, and consequently Γ_0, starting from tests on conditional independence relations among the elements of u_t. Conditional independence relations among u_1t, …, u_Kt imply, under general assumptions, the presence of some causal relations and the absence of some others. Search algorithms, such as those proposed by Pearl (2000) and Spirtes et al. (2000), exploit this information to find the class of admissible structures among u_1t, …, u_Kt (see also Bryant et al., 2009). In most cases there are several structures belonging to this class, so the outcome of the search procedure is not a unique B. Bessler and Lee (2002), Demiralp and Hoover (2003) and Moneta (2008) use partial correlation as a measure of conditional dependence, based on the assumption that the error terms are normally distributed.
In this study, we similarly apply a graph-theoretic search algorithm aimed at recovering the matrix B, but here we do not use conditional independence tests or partial correlations (and hence the assumptions required for identifiability are different). The (unique) causal structure among the elements of u t is captured by identifying their independent components through an analysis which exploits any non-Gaussian structure in the error terms.

The assumption of non-normality
While we present the search algorithm in detail in the next section, we should first ask whether the assumption of non-normal error terms introduces some specificities in the way a VAR model is formalized and estimated. Several authors suggest testing for non-normality after the estimation of the reduced form VAR model, as a model-checking procedure. For example, Lütkepohl (2006) maintains that '[a]lthough normality is not a necessary condition for the validity of many of the statistical procedures related to VAR models, deviations from the normality assumption may indicate that model improvements are possible' (p. 491). However, economic shocks may well deviate from normality, even after the inclusion of new variables and adjustments to the lag structure. While it is true that one should be prepared to change the model if some basic statistical assumptions are not satisfied, we suggest that in the case of non-Gaussianity one can alternatively exploit this characteristic of the data for the sake of identification. Lanne and Saikkonen (2009) and Lanne and Lütkepohl (2010) have also recently shown that non-Gaussianity can be useful for the identification of the structural shocks in a SVAR model. In contrast with these studies, however, we do not make specific distributional assumptions but rather allow any form of non-Gaussianity (thus, we use an essentially semi-parametric approach).
In terms of VAR estimation, non-normality of the error terms entails a loss in asymptotic efficiency. In particular, least squares estimation is not identical to maximum likelihood estimation, unlike in the Gaussian setting. Dasgupta and Mishra (2004) have shown that the least absolute deviation (LAD) estimation method may perform better than least squares when u_t is non-normally distributed.
Under the assumption of non-Gaussianity, the fact that the ε_t are white noise does not imply that they are serially independent. This is a subtle but important aspect of a non-Gaussian VAR, because even if corr(ε_it, ε_i,t+1) = 0 (i = 1, …, K), it is in principle possible that ε_i,t+1 is statistically dependent on ε_it. In this latter case ε_i,t+1 would not be an actual innovation term, since it can be partly predicted from y_t. This point, as noted by Lanne and Saikkonen (2009), is closely related to the issue of non-fundamental representations of structural models formulated in moving average terms. Non-fundamentalness arises if the MA polynomial has roots lying inside the unit circle (for a survey on this topic see Alessi et al., 2011). In such cases, VAR modelling is not appropriate. We do not provide solutions to the non-fundamentalness problem here, so it must be taken into account as a limitation of our work, to be addressed in future research.
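A minimal numerical illustration of the white-noise-versus-independence point (a toy construction, not from the model above): the product process ε_t = z_t z_{t−1}, with z_t i.i.d. standard normal, is zero-mean and serially uncorrelated, yet ε_t² is clearly predictable from ε_{t−1}²:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(200_000)
eps = z[1:] * z[:-1]            # eps_t = z_t * z_{t-1}: zero-mean white noise

corr = np.corrcoef(eps[1:], eps[:-1])[0, 1]           # serial correlation ~ 0
corr_sq = np.corrcoef(eps[1:]**2, eps[:-1]**2)[0, 1]  # but eps^2 is autocorrelated
print(round(corr, 3), round(corr_sq, 3))
```

Here the population serial correlation of ε_t is exactly zero while that of ε_t² is 0.25, so ε_t is white noise but not an innovation sequence in the strict sense.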

III. SVAR identification
We obtain full identification of the model given in equations (1) and (2) under the following conditions: (i) the shocks ε_1t, …, ε_Kt are non-normally distributed; (ii) the shocks ε_1t, …, ε_Kt are statistically independent; (iii) the contemporaneous causal structure among y_1t, …, y_Kt is acyclic, that is, there exists an ordering of the variables such that Γ_0 is lower triangular; the appropriate ordering of the variables, however, is not known to the researcher a priori. We describe this result, and an appropriate estimation algorithm, in this section. The first assumption is in general testable, by applying normality tests to the estimated residuals. The result of such tests, however, may depend on the method used to estimate the model. For example, it is possible that the normality of residuals is more often rejected when we apply the LAD estimator instead of OLS, since a least squares estimator tends, by construction, to make the residuals look normal even when the underlying shocks are not. To control for this issue, in our macroeconomic application, in which two of the residuals look close to normal, we check whether the results are robust across different estimation methods. Another related issue is that in small samples, the non-normality that our approach exploits might arise from just a few outlying observations. We suggest tackling this potential problem by investigating the robustness of the observed causal order across multiple bootstrap samples or different subsamples, as shown in sections IV and V, respectively. The third assumption is a restriction of the causal search to recursive structures, commonly made in the literature. The implications of this assumption are spelled out at the end of this section. We turn now to discuss the second assumption.

The assumption of independent shocks
For non-Gaussian random variables, statistical independence is a much stronger requirement than (linear) uncorrelatedness: while uncorrelatedness of the shocks ε_t only requires that the covariance matrix E{ε_t ε_t^T} is diagonal, full statistical independence requires that the joint probability density equals the product of its marginals, that is, P(ε_1t, ε_2t, …, ε_Kt) = P(ε_1t) P(ε_2t) · · · P(ε_Kt). The assumption of statistically independent ε_t is justified in the case where the structural error terms represent distinct exogenous economic shocks, corresponding to independent sources, affecting the system at each time period. There is no direct method to statistically test that the underlying shocks are independent, since we observe and estimate only mixtures of the shocks. Whether this assumption is valid has to be decided on the basis of background knowledge for each case. The choice of the variables plays a critical role, because there might well be cases in which the assumption that the variables are ultimately the result of a linear combination of shocks that are both independent and economically interpretable is not valid. Notice that an implicit but important assumption of the SVAR model, which we also use here, is that the number of structural shocks is equal to the number of variables (for a possible way to relax this assumption see Bernanke et al., 2005).
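The gap between uncorrelatedness and full independence is easy to exhibit numerically (a generic toy example, not specific to SVAR shocks): for a symmetric non-Gaussian variable x, the pair (x, x² − E[x²]) is uncorrelated yet the second coordinate is fully determined by the first:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=100_000)
w = x**2 - np.mean(x**2)    # uncorrelated with x (odd moments vanish), yet a function of x

corr = np.corrcoef(x, w)[0, 1]
print(round(corr, 3))       # ~ 0, although w is perfectly predictable from x
```

For Gaussian variables, by contrast, zero correlation does imply independence, which is exactly why the Gaussian case is unidentifiable here.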

Illustrative example
Let us consider again the two-variable example introduced in section II (Figure 1). As already discussed, even if we assume acyclicity of the contemporaneous structure, we cannot, based on uncorrelatedness alone, distinguish between the true parameterization (3) and the alternative parameterization (11) without reference to some external prior knowledge. This identification problem is unavoidable when the shocks ε_t are normally distributed. To illustrate, we simulated data from the model in Figure 1 and plot in Figure 2 the corresponding estimated error terms. In panel (a) we show samples of the reduced form VAR error terms u_t, while panels (b) and (c) plot the corresponding samples of ε_t for parameterizations (3) and (11) respectively. Note that for both parameterizations we obtain linearly uncorrelated ε_t, which in the Gaussian case is equivalent to full statistical independence.
However, if the shocks ε_t exhibit non-Gaussian distributions, the models can be distinguished. This is illustrated in Figure 3. In this illustrative case, we take the uniform distribution as an example of a non-Gaussian distribution, although we could equally well have taken any other non-Gaussian distribution. Again, in panel (a), we display samples of the u_t, with (b) and (c) showing the corresponding ε_t for models (3) and (11). In particular, note that while the estimated shocks in (b) are statistically independent, this is not the case in (c); for instance, knowing that ε_1t = 2 (marked by a vertical line) allows us to constrain the value of ε_2t (illustrated by the horizontal dashed lines). Thus, while the ε_t in (c) are linearly uncorrelated, they are not statistically independent. Hence a simple approach, in the two-variable case, would be to derive both parameterizations and infer the appropriate model based on some test for statistical dependence (utilizing higher-order statistics) between ε_1t and ε_2t. Such an approach could be extended to an arbitrary number K of variables by deriving parameterizations (using the Cholesky decomposition as mentioned in section II) for all K! possible contemporaneous variable orderings and, in each case, testing for dependence among the components of ε_t. In addition to being computationally expensive due to the factorial number of variable orderings, such an approach may not be statistically reliable due to the large number of tests involved. In the remainder of this section we present a procedure which can be seen as a computationally efficient, and statistically reliable, alternative to exhaustive search.
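For the two-variable case, the exhaustive approach just described can be coded directly: derive a Cholesky parameterization for each of the K! = 2 orderings and score the implied shocks with a simple fourth-order dependence statistic (for independent standardized shocks, E[ε₁²ε₂²] ≈ E[ε₁²]E[ε₂²]). The data-generating coefficients below are hypothetical:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
T = 100_000

# True acyclic model (hypothetical): u1 = e1, u2 = 0.8*u1 + e2, e_i uniform
e = rng.uniform(-1.0, 1.0, size=(T, 2))
u = np.empty_like(e)
u[:, 0] = e[:, 0]
u[:, 1] = 0.8 * u[:, 0] + e[:, 1]

def dependence_score(order):
    """Cholesky-identify shocks under a given variable ordering, then
    measure fourth-order dependence of the standardized shocks."""
    perm = list(order)
    Sigma = np.cov(u[:, perm].T)
    C = np.linalg.cholesky(Sigma)
    shocks = np.linalg.solve(C, u[:, perm].T)   # standardized, uncorrelated
    s1, s2 = shocks
    return abs(np.mean(s1**2 * s2**2) - np.mean(s1**2) * np.mean(s2**2))

scores = {order: dependence_score(order) for order in permutations(range(2))}
best = min(scores, key=scores.get)
print(best)     # the correct ordering, (0, 1), yields near-independent shocks
```

Both orderings produce uncorrelated shocks by construction; only the higher-order statistic separates them, which is precisely the information a Gaussian analysis throws away.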

Independent component analysis
The SVAR identification procedure we describe in the next subsection relies on a statistical technique termed 'independent component analysis' (ICA) (Comon, 1994; Hyvärinen et al., 2001; Bonhomme and Robin, 2009), both in terms of guarantees of identifiability and in terms of the actual algorithm employed. The technique can perhaps best be understood in relation to the well-known method of principal component analysis (PCA): while PCA gives a transformation of the original space such that the computed latent components are (linearly) uncorrelated, ICA goes further and attempts to minimize all statistical dependencies between the resulting components.
Specifically, in the SVAR context described in section II, the goal is to find a representation u_t = Γ_0^{−1} ε_t of the VAR error terms u_t, such that the ε_t are mutually statistically independent. While a matrix Γ_0^{−1} which yields uncorrelated ε_t can always be found, for an arbitrary random vector u_t there may exist no linear representation with statistically independent ε_t. Nevertheless, one can show that if there exists a representation with non-Gaussian statistically independent components ε_t, then the representation is essentially unique (up to permutation, sign and scaling) (Comon, 1994), and there exist a number of computationally efficient algorithms for consistent estimation (Hyvärinen et al., 2001).
We illustrate the basic distinction between uncorrelatedness and statistical independence in Figure 4. Consider a density for ε_t uniform in the square [−1, 1] × [−1, 1], as shown in panel (a). This (non-Gaussian) joint density factorizes (trivially), so ε_1t and ε_2t are mutually independent. An arbitrary invertible linear transformation Γ_0^{−1} yields a density for u_t = Γ_0^{−1} ε_t which is uniform inside a parallelogram, as given in (b). Using PCA to rotate and re-scale the space yields ε̃_t which are uncorrelated but statistically dependent, shown in panel (c). Finally, the original components (up to permutation, sign, and scaling) are obtained by searching for an orthogonal transformation which yields statistically independent ε_t, in (d). Panels (e)–(g) illustrate that this final step of identifying the original components is not possible for Gaussian random variables, because of the spherical symmetry of the joint distribution.
Figure 4. (a) Density of ε_t, uniform inside the square. (b) The density (uniform inside the parallelogram) after a linear transformation of the space; note that here the variables are linearly dependent. (c) PCA, by first rotating and then rescaling the space, yields uncorrelated but statistically dependent samples (uniform inside the rotated square); the original components are not yet recovered. (d) ICA performs an additional rotation of the space to minimize statistical dependencies, and is here able to orient the space to obtain statistical independence; the original components are recovered, although with an arbitrary permutation and arbitrary signs. (e) When the original independent random variables are Gaussian (normal), the joint density is spherically symmetric (for standardized variables). In this case, from the mixed data of panel (f), PCA already yields independent components, but the components are mixtures of the originals (g). Due to spherical symmetry, any rotation of the space yields independent components, so there is no further information for ICA to use to find the original basis. Note that the Gaussian is the only density for which independent standardized variables yield a spherically symmetric density.
The main power of ICA algorithms lies in determining this final orthogonal transformation. There exist several approaches for estimating the independent components: among others, maximum likelihood estimation, maximizing the non-Gaussianity of the estimated components, or minimizing the mutual information among them. The different methods are intimately related. For instance, non-Gaussianity can be measured by the kurtosis or the negentropy, and maximizing such measures for the estimated components is related to reducing their mutual dependencies, since the densities of additive mixtures of independent random variables are typically closer to Gaussian than the densities of the original variables. The motivation for minimizing mutual information is immediate, since mutual information equals zero if and only if the variables are independent. A further distinction between the methods can be made in how the resulting optimization problems are solved. Empirically, fixed-point algorithms have proven to be very efficient. One well-known representative of fixed-point ICA algorithms is the 'FastICA' algorithm (Hyvärinen et al., 2001), based on maximum likelihood estimation. Since usually the densities of the underlying independent components are not known, maximum likelihood estimation of the ICA model is essentially a semi-parametric estimation problem. However, it is known that to obtain consistent estimates of the ICA mixing matrix it is not necessary to have very accurate estimates of the component densities. Rather, it is sufficient that the estimation algorithm can handle both super-Gaussian and sub-Gaussian components. While some early ICA algorithms were not sufficiently flexible in this respect (Bell and Sejnowski, 1995), more recent algorithms, such as FastICA, are adaptive in this sense. For the interested reader, a number of excellent tutorials on ICA are available (see e.g. Cardoso, 1998; Hyvärinen and Oja, 2000).
For a more thorough exposition, the reader is referred to the textbook by Hyvärinen et al. (2001).
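The fixed-point idea can be sketched in a few lines of NumPy. The snippet below is a bare-bones, kurtosis-based variant in the spirit of FastICA (whitening followed by a symmetric fixed-point iteration with the cubic nonlinearity); it is meant for intuition only, and production use should rely on a mature implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Independent non-Gaussian (uniform) sources, mixed by a hypothetical matrix
S = rng.uniform(-1.0, 1.0, size=(2, 20_000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

# Whitening: rotate and rescale so that cov(Z) = I
X = X - X.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(np.cov(X))
V = eigvec @ np.diag(eigval**-0.5) @ eigvec.T
Z = V @ X

# Symmetric fixed-point iteration with the cubic nonlinearity g(u) = u^3
W = np.linalg.qr(rng.standard_normal((2, 2)))[0]
for _ in range(100):
    Y = W @ Z
    W_new = (Y**3) @ Z.T / Z.shape[1] - 3.0 * W   # E[g(y) z^T] - E[g'(y)] W
    # Symmetric orthogonalization: W <- (W W^T)^{-1/2} W
    val, vec = np.linalg.eigh(W_new @ W_new.T)
    W = vec @ np.diag(val**-0.5) @ vec.T @ W_new

S_hat = W @ Z     # recovered sources, up to permutation, sign and scaling
```

Each row of S_hat should line up (up to sign) with one row of S, which is exactly the "essentially unique" recovery guaranteed for non-Gaussian sources.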

Identification of acyclic linear causal structure
Given that the SVAR identification problem posed in section II consists of finding the appropriate Γ_0 matrix that relates the independent shocks to the VAR error terms through u_t = Γ_0^{−1} ε_t, it would seem that ICA as described in the previous subsection provides an immediate solution. However, as already noted, the original components are only found up to permutation, sign and scaling. In the original SVAR model each shock ε_it is tied to a given variable y_it, but this connection is lost in the ICA estimation process.
Thus, we use the common assumption of an acyclic contemporaneous causal structure among the variables y_t. As discussed in section II, this implies that for some ordering of the variables the matrix Γ_0^{−1} (and hence also Γ_0) is lower triangular, and there is no contemporaneous feedback. The contemporaneous structure is then equivalently represented by the matrix B = I − Γ_0, where the diagonal elements of B are zero. The restriction to acyclic systems guarantees that we can avoid the permutation, sign and scaling indeterminacies of ICA, for two reasons. First, the lower-triangularity of B (for an appropriate ordering of the variables) ensures that there is only one permutation of the columns of Γ_0^{−1} which makes all of the diagonal entries of Γ_0^{−1} non-zero (used in step 3 of Algorithm 1 below). Second, the zero diagonal in B fixes the scaling indeterminacy, since it forces the diagonal entries of Γ_0^{−1}, and hence also Γ_0, to unity, determining the scaling and signs uniquely (see step 4 of Algorithm 1 below). In essence, acyclicity allows us to tie the components of ε_t to the components of u_t in a one-to-one relationship. The resulting method, termed LiNGAM (for linear, non-Gaussian, acyclic model), was introduced by Shimizu et al. (2006). Adapting the procedure to the identification of the SVAR model, the resulting VAR-LiNGAM algorithm is provided in Algorithm 1. Further details and discussion can be found in Hyvärinen et al. (2008).
Having identified B and hence Γ_0^{−1}, the correct interpretation of the SVAR model is obtained by studying B and the Γ_j (and the Θ_j) rather than the A_j, for j = 1, …, p. In effect, the reduced form VAR coefficient matrices A_j mix together the lagged causal effects (the Γ_j) with the contemporaneous effects (B). In our application examples we show how this distinction improves the interpretation of the results.

Algorithm 1 VAR-LiNGAM
1. Estimate the reduced form VAR model of equation (4), obtaining estimates Â_τ of the matrices A_τ for τ = 1, …, p. Denote by Û the K × T matrix of the corresponding estimated VAR error terms, that is, each column of Û is û_t ≡ (û_1t, …, û_Kt)^T (t = 1, …, T). Check whether the û_it (for all rows i) indeed are non-Gaussian, and proceed only if this is so.
2. Use FastICA or any other suitable ICA algorithm (Hyvärinen et al., 2001) to obtain a decomposition Û = P̂Ê, where P̂ is K × K and Ê is K × T, such that the rows of Ê are the estimated independent components of Û. Then validate non-Gaussianity and (at least approximate) statistical independence of the components before proceeding.
3. Let Γ̃_0 = P̂^{−1}. Find Γ̄_0, the row-permuted version of Γ̃_0 which minimizes Σ_i 1/|(Γ̄_0)_ii| with respect to the permutation. Note that this is a linear matching problem which can be easily solved even for high K (Shimizu et al., 2006).
4. Divide each row of Γ̄_0 by its diagonal element, to obtain a matrix Γ̂_0 with all ones on the diagonal.
5. Let B̂ = I − Γ̂_0.
6. Find the permutation matrix Z which makes ZB̂Z^T as close as possible to strictly lower triangular. This can be formalized as minimizing the sum of squares of the permuted upper-triangular elements, which can be done using a heuristic procedure (Shimizu et al., 2006). Set the upper-triangular elements to zero, and permute back to obtain B̂, which now contains the acyclic contemporaneous structure. (Note that it is useful to check that ZB̂Z^T indeed is close to strictly lower-triangular.)
7. B̂ now contains K(K − 1)/2 non-zero elements, some of which may be very small (and statistically insignificant). For improved interpretation and visualization, it may be desirable to prune out (set to zero) small elements at this stage, for instance using a bootstrap approach. See Shimizu et al. (2006) for details.
8. Finally, calculate estimates Γ̂_τ (τ = 1, …, p) of the lagged effects using Γ̂_τ = (I − B̂)Â_τ.
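Steps 3–6 of Algorithm 1, which undo ICA's permutation, scaling and sign indeterminacies, can be sketched as follows for small K, using brute-force search over permutations (a toy stand-in for the matching heuristics of Shimizu et al., 2006). The input simulates an ICA output by randomly permuting and rescaling the rows of a known, hypothetical Γ_0:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(5)

# True acyclic structure (hypothetical), with unit-diagonal Gamma_0 = I - B
B_true = np.array([[0.0, 0.0, 0.0],
                   [0.7, 0.0, 0.0],
                   [0.3, 0.5, 0.0]])
G0_true = np.eye(3) - B_true

# Simulate an ICA output: rows permuted and rescaled (with sign flips) arbitrarily
P_rows = np.eye(3)[rng.permutation(3)]
D = np.diag([1.7, -0.6, 2.3])
G0_tilde = D @ P_rows @ G0_true

# Step 3: find the row permutation minimizing sum_i 1/|diagonal entry|
def diag_cost(p):
    d = np.abs(np.diag(G0_tilde[list(p)]))
    return np.sum(1.0 / np.maximum(d, 1e-12))   # guard against zero diagonals
best = min(permutations(range(3)), key=diag_cost)
G0_bar = G0_tilde[list(best)]

# Step 4: normalize each row by its diagonal element (fixes scale and sign)
G0_hat = G0_bar / np.diag(G0_bar)[:, None]

# Step 5: contemporaneous coefficients
B_hat = np.eye(3) - G0_hat

# Step 6: find the variable ordering making B_hat (here, exactly) lower triangular
def upper_mass(p):
    Bp = B_hat[np.ix_(list(p), list(p))]
    return np.sum(np.triu(Bp, k=1) ** 2)
order = min(permutations(range(3)), key=upper_mass)
```

In this noise-free setting the true B and the causal order are recovered exactly; with estimated residuals, steps 6–7 additionally involve thresholding and significance checks.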

IV. Microeconomic application: firm growth
Background and data
To demonstrate how the VAR-LiNGAM technique might be used in a microeconomic data application, we here apply it to analyse the dynamics of different aspects of firm growth. In particular, we examine the relationships among the growth rates of employment, sales, research and development (R&D) expenditure, and operating income.
Previous attempts to investigate the processes of firm growth and R&D investment have been hampered by difficulties in establishing the causal relations between the statistical series. Growth rates series are characteristically erratic and idiosyncratic, which discourages the application of those microeconometric techniques usually applied for addressing causality such as instrumental variables and System GMM (Arellano and Bond, 1991;Blundell and Bond, 1998). In addition, the use of Gaussian estimators is inappropriate because growth rates and residuals typically display non-Gaussian, fat-tailed distributions.
We base our analysis on the Compustat database, which essentially covers US firms listed on the stock exchange. We restrict ourselves to the manufacturing sector (SIC classes 2000-3999) for the years 1973-2004; the reason for starting from 1973 is that the disclosure of R&D expenditure was made compulsory for US firms in 1972. Since most firms do not report data for each year, we have an unbalanced panel dataset.
The variables of interest are Employees, Total Sales, R&D expenditure, and Operating Income (sometimes referred to as 'profits' in the rest of this article). We replace operating income and R&D with 0 if the company has declared the relevant amount to be 'insignificant'. In order to avoid generating missing values when taking logarithms and ratios, we retain only those firms with strictly positive values for operating income, R&D expenditure, and employees in each year. This leads to a loss of observations, especially for our growth of operating income variable (where about 15% of observations are dropped).
Growth rates were calculated as log-differences of size for firm i in year t, that is g_it = log(y_it) − log(y_i,t−1), where y is any one of employment, sales, R&D expenditure, or operating income, and g_it is the corresponding growth rate of that variable for that firm and that year. Any time-invariant firm-specific components have thus been removed in the process of taking differences (i.e. growth rates) rather than focusing on size levels. While the dynamics of firm size levels display a high degree of persistence (some authors relate the dynamics of firm size to a unit root process, see e.g. Goddard et al., 2002), growth rates have a low degree of persistence, with the within-firm variance being observed to be higher than the between-firm variance (Geroski and Gugler, 2004). In our regressions, firms are pooled together under the standard panel-data assumption that different firms undergo similar structural patterns in their growth process.
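As a minimal numerical check of the log-difference definition (hypothetical firm sizes, not Compustat data):

```python
import numpy as np

# Hypothetical sales levels for one firm across four years
sales = np.array([100.0, 110.0, 121.0, 133.1])

# Growth as log-difference: g_t = log(y_t) - log(y_{t-1})
growth = np.diff(np.log(sales))
print(np.round(growth, 4))   # constant ~9.53% log growth, since sales rise 10% each year
```

Note that the log-difference is invariant to any time-invariant multiplicative firm effect, which is why such components drop out of the growth-rate series.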
All growth rate series are positively correlated with each other, although the correlation is far from perfect. The highest correlation is between sales growth and employment growth (0.63), while the lowest correlation is between R&D growth and operating income growth (0.06). Thus, each of the four variables reflects a different facet of firm growth and firm behaviour. Although we do not expect the residuals from the reduced-form VAR (i.e. the u_t) to be independent (in fact, a closer inspection reveals they are highly significantly correlated), we consider it appropriate to assume that the structural shocks ε_t are statistically independent. More details concerning the dataset, as well as summary statistics, can be found in Coad and Rao (2010).

Results
As outlined in section III, our procedure consists of first estimating a reduced-form VAR model, then analysing the statistical dependencies between the resulting residuals, and finally obtaining corrected estimates of the lagged effects. Table 1a shows the results of LAD estimation of a 2-lag reduced-form VAR. Results of a 1-lag model are very similar and hence omitted for the remainder of this section. Although most of the coefficients are statistically significant (at a significance level of 0.01), the strongest coefficients relate growth of employment and sales at time t to growth of all variables at time t + 1. Additionally, operating income displays a strong negative autocorrelation in its annual growth rates. These results are essentially identical to those obtained by Coad and Rao (2010).
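As a stylized illustration (not the authors' code), least-absolute-deviations estimation of one VAR equation can be cast as a linear programme: minimise the sum of absolute residuals by introducing auxiliary variables bounding each residual from above. The sketch below uses SciPy; all variable names and the synthetic data are ours.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """LAD regression: min_beta sum_i |y_i - x_i' beta|.

    Cast as an LP with auxiliary variables t_i >= |residual_i|.
    LAD is robust to the fat-tailed residuals typical of growth-rate series.
    """
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(n)])   # minimise sum of t_i
    A_ub = np.block([[X, -np.eye(n)],               #  X beta - t <= y
                     [-X, -np.eye(n)]])             # -X beta - t <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * n   # beta free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]

# One VAR equation: current growth regressed on lagged growth of all variables
rng = np.random.default_rng(0)
lags = rng.normal(size=(200, 4))                 # hypothetical lagged regressors
beta_true = np.array([0.6, 0.1, 0.05, -0.2])
y = lags @ beta_true                             # noise-free, for illustration
beta_hat = lad_fit(lags, y)
```

In a full VAR, this fit would be repeated once per equation, stacking the same lagged regressors on the right-hand side each time.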
Next, we investigated the statistical structure of the residuals. Figure 5 presents histograms (with overlaid Gaussian densities) and quantile-quantile plots against the Gaussian benchmark for the empirical distributions of the residuals. Both the histograms and the q-q plots lead us to reject the hypothesis of Gaussian residuals. Furthermore, Shapiro-Wilk normality tests clearly reject the null hypothesis of Gaussian distributions (P < 10^-40 for all four residuals).
Notes to Table 1a: The number of observations was 28,538. The coefficients in bold are significantly different from zero using a t-test at significance level 0.01. (The table is read column to row so, for instance, the 1-lag VAR coefficient from sales growth to employment growth is 0.1017.)
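A minimal illustration of such a normality check with SciPy (synthetic residuals, purely for illustration, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
gaussian_resid = rng.normal(size=2000)     # what the test should not reject
fat_tailed_resid = rng.laplace(size=2000)  # stand-in for fat-tailed VAR residuals

# Shapiro-Wilk: small P-value -> reject the null of Gaussianity
_, p_gauss = stats.shapiro(gaussian_resid)
_, p_fat = stats.shapiro(fat_tailed_resid)
```

A tiny `p_fat` mirrors the rejections reported above; non-Gaussianity is not a nuisance here but the very property that makes the identification strategy work.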
Contemporaneous causal effects are then estimated using steps 2-7 of Algorithm 1 given in section III. The instantaneous effects returned by VAR-LiNGAM form a fully connected DAG (directed acyclic graph). Using a bootstrap approach (with 100 repetitions) to test the stability of the results, the variables were in the majority of cases ordered by VAR-LiNGAM as sales growth first, then employment growth, R&D growth, and operating income growth last. The coefficients obtained for this variable ordering are shown in Table 2. Testing the structural residuals (ε_t = (I − B)u_t) for non-Gaussianity with a Shapiro-Wilk test yields P-values smaller than 10^-50 for all four variables. Histograms and q-q plots of the structural residuals look similar to those of the VAR residuals shown in Figure 5.
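The mapping between reduced-form and structural residuals can be illustrated numerically. The matrix B below is hypothetical, chosen only to mimic the estimated ordering (sales → employment → R&D → profits); with B strictly lower triangular under that ordering, ε_t = (I − B)u_t recovers the independent shocks exactly.

```python
import numpy as np

# Hypothetical instantaneous-effects matrix B, strictly lower triangular
# under the ordering: sales, employment, R&D, profits
B = np.array([
    [0.0,  0.0,  0.0, 0.0],   # sales
    [0.5,  0.0,  0.0, 0.0],   # employment <- sales
    [0.2,  0.3,  0.0, 0.0],   # R&D <- sales, employment
    [0.6, -0.2, -0.1, 0.0],   # profits <- sales, employment, R&D
])

rng = np.random.default_rng(1)
eps = rng.laplace(size=(4, 1000))        # independent non-Gaussian shocks
u = np.linalg.solve(np.eye(4) - B, eps)  # reduced-form residuals u = (I-B)^{-1} eps

# Recover the structural shocks from the reduced-form residuals
eps_hat = (np.eye(4) - B) @ u
```

The reduced-form residuals u are correlated (each mixes several shocks), while the recovered ε are independent, matching the distinction drawn in the text.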
Finally, we can obtain the corrected lagged effects as given by step 8 of Algorithm 1. These are shown for the 2-lag model in Table 1b. Figure 6 shows the final estimated VAR-LiNGAM models graphically, displaying both the contemporaneous effects B̂ and the lagged effects (solid arrows denote positive effects, while dashed arrows indicate negative effects).
Notes to Table 1b: The number of observations was 28,538. The coefficients in bold are significantly different from zero using a t-test at significance level 0.01.
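In VAR-LiNGAM the correction amounts to premultiplying each reduced-form lag matrix by (I − B). A two-variable numerical sketch (all numbers hypothetical):

```python
import numpy as np

# Hypothetical reduced-form lag matrix A1 and instantaneous-effects matrix B
A1 = np.array([[0.3, 0.1],
               [0.2, 0.4]])
B = np.array([[0.0, 0.0],
              [0.5, 0.0]])   # strictly lower triangular (acyclic ordering)

# Corrected lagged effects: route out the part of each lagged association
# that is mediated by contemporaneous effects
Gamma1 = (np.eye(2) - B) @ A1
```

This is why the corrected coefficients can differ in sign from the reduced-form ones: a lagged association can be entirely mediated by a within-period effect.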

Discussion
First, consider the contemporaneous (within-period) effects. Growth in sales has a strong positive effect on growth in all other variables (and profits in particular), while growth in employment has a positive effect on R&D but a negative effect on profits. These results make economic sense, as sales are often seen as the driving factor for growth in theoretical work, and much of research and development costs are employment costs. Furthermore, growth of R&D expenditure has a negative instantaneous effect on profits, as under US tax law R&D expenditure is treated as an operating expense and is deducted from operating income as a cost (since profit = revenue − cost). One finding of potential policy interest is that growth of employment and sales are significant determinants of both instantaneous and subsequent growth of R&D expenditure, but that growth of operating income has no major effect on R&D growth (although a small positive effect can be detected at the first lag).
The VAR-LiNGAM estimates of lagged effects are generally similar to the reduced-form VAR estimates, but there are nonetheless some large differences (see Table 1) that mainly concern the contribution of employment growth to growth in the other variables. First, the autocorrelation coefficient for employment growth changed from being small and positive to rather large and negative. Second, the contribution of employment growth (and also sales growth) to subsequent growth of R&D expenditure decreases considerably in magnitude, although growth of employment and sales still have a much larger impact on subsequent R&D growth than does growth of operating income. Third, the VAR estimates suggest that employment growth is positively associated with (subsequent) growth of profits, while the corresponding VAR-LiNGAM coefficient is strongly negative.

Figure 6. Plot of results from VAR-LiNGAM estimates with two time lags. Solid arrows indicate positive effects, dashed arrows negative ones. Thick lines correspond to strong effects, thin ones to weak effects.
The VAR result is rather simplistic, because it does not separate the direct negative effect of employment on profits (because employment is a cost) from the indirect positive effect (employment at time t increases profits at t + 1 via increased sales at t + 1). We therefore prefer the VAR-LiNGAM estimates because they go further than naive associations to shed light on the underlying causal relationships.

V. Macroeconomic application: the effects of monetary policy

Background and data
As a second empirical application we show how VAR-LiNGAM might be applied to analyse the effects of changes in monetary policy on macroeconomic variables. Structural VARs are often applied to describe the dynamic interaction between monetary policy indicators and aggregate economic variables such as income (GDP) and the price level. Results are then used both for policy evaluation and for judging between competing theoretical models. Unfortunately, modeling causal links between central bank decisions and the status of the economy has encountered major problems: it is not clear which time series variable best captures changes in monetary policy, and there is no agreement on the method to identify the structural VAR. As explained in section II, the choice of the 'right rotation' of the model relies on the identification of the contemporaneous causal structure. We show how the VAR-LiNGAM method offers one solution to the latter problem. Once the model is identified, this helps to answer questions about the appropriate indicator of monetary policy changes and the measurement of its effects on the economy.
Our study is based on Bernanke and Mihov's (1998) data set, which consists of six monthly US time series (1965:1-1996:12), three of which are policy variables: TR_t, total bank reserves (normalized by a 36-month moving average of total reserves); NBR_t, non-borrowed reserves and extended credit (same normalization); and FFR_t, the federal funds rate. The other three variables are non-policy macroeconomic variables: GDP_t, real GDP (log); PGDP_t, the GDP deflator (log); and PSCCOM_t, the Dow-Jones index of spot commodity prices (log). An important underlying assumption in the structural VAR model is that each variable is affected by an independent shock. Since non-borrowed reserves are part of total reserves, it is likely that a shock affecting NBR_t is correlated with a shock affecting TR_t. To render our independence assumption more plausible, we replace TR_t with BR_t ≡ TR_t − NBR_t.
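A hedged sketch of this variable construction (synthetic series and invented column names; following the text, we assume both reserve series are scaled by a 36-month moving average of total reserves):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.period_range("1965-01", "1996-12", freq="M")
reserves = pd.DataFrame({
    "TR": 50 + rng.normal(0, 0.1, len(idx)).cumsum(),   # hypothetical total reserves
    "NBR": 40 + rng.normal(0, 0.1, len(idx)).cumsum(),  # hypothetical non-borrowed reserves
}, index=idx)

# Normalization: divide by a 36-month moving average of total reserves
ma36 = reserves["TR"].rolling(36).mean()
reserves["TR_n"] = reserves["TR"] / ma36
reserves["NBR_n"] = reserves["NBR"] / ma36

# Borrowed reserves BR_t = TR_t - NBR_t replace total reserves, so that
# the structural shocks are more plausibly independent
reserves["BR_n"] = reserves["TR_n"] - reserves["NBR_n"]
```

Working with BR rather than TR removes the mechanical overlap between the two reserve series, which would otherwise violate the independent-shock assumption by construction.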
Phillips-Perron tests do not reject the hypothesis of a unit root for any of the six series considered. We therefore estimate the model as a system of cointegrated variables (a vector error correction model), using Johansen and Juselius' (1990) procedure. Although this procedure is based on maximum likelihood estimation and assumes normal errors, it is robust to non-normality, as demonstrated by Silvapulle and Podivinsky (2000). Nevertheless, we check the robustness of our results across different estimation methods. We select the number of lags (seven) using Akaike's information criterion.

Results
The histograms of the six VAR residuals û_t are displayed in Figure 7, together with the respective q-q plots. These suggest departures from normality for each of the six residuals, although much less markedly for the GDP and PGDP residuals. The P-values of the Shapiro-Wilk normality test are 0.0104, 0.0215, 1.8e-05, 2.4e-15, 6.3e-20 and 1.9e-07 for the residuals referring to GDP, PGDP, NBR, BR, FFR and PSCCOM, respectively. The Shapiro-Francia test produces similar results, except that normality for the first two residuals is more clearly rejected, with the following P-values (same order): 0.0044, 0.0097, 1.3e-05, 6.8e-14, 9.5e-18 and 1.9e-07. The Jarque-Bera tests yield analogous results. All these numbers support the hypothesis of non-normality for all residuals and permit us to apply the VAR-LiNGAM procedure described in section III.
Table 3a displays the estimates of the VAR model in levels, y_t = A_1 y_{t-1} + ... + A_7 y_{t-7} + u_t. For reasons of space, we report the estimates of A_1 and A_2 only. Table 3b presents the estimates of Γ_1 and Γ_2 from the structural equation (identified through the VAR-LiNGAM method): Γ_0 y_t = Γ_1 y_{t-1} + ... + Γ_7 y_{t-7} + ε_t. The estimates of the instantaneous effects B = (I − Γ_0) are displayed in Table 4. Figure 8 shows the contemporaneous and lagged (up to 2 lags) effects.
These results provide useful information both about the mechanism operating in the market for bank reserves and about the mutual influences between policymakers' actions and the state of the economy. As regards the market for bank reserves, a useful starting point for reviewing the results is the mechanism operating among the policy variables NBR, BR and FFR. Among these variables the contemporaneous causal structure is FFR_t ← BR_t → NBR_t. BR measures the portion of reserves that banks choose to borrow at the discount window.
This variable is usually assumed to depend on FFR, which is the rate at which a bank, in turn, can lend the borrowed reserves to another bank. Our results suggest that FFR takes more than one month to influence BR, since the (4,5) entry of the matrix B̂ is zero (see Table 4), while the (4,5) entry of the matrix Γ̂_1 is positive (see Table 3). NBR measures all the bank reserves which do not come from the discount window and is, as expected, correlated with BR, which positively influences NBR with a one-month lag. FFR is probably the variable most representative of the target pursued by the Fed, as the impulse response functions analysed below suggest and as argued by Bernanke and Blinder (1992). If this is true, our results indicate that the Fed observes and responds to changes in the demand for (non-borrowed and borrowed) reserves within the period, although only the coefficient describing the contemporaneous influence of BR_t on FFR_t (and not of NBR_t on FFR_t) is significant. Notice that FFR responds positively to BR within the period, but negatively in subsequent periods, probably to compensate for the fact that if FFR continued to rise, banks would have an incentive to borrow more reserves from the discount window and lend them again to other banks.

Coefficients of the first two lagged-effect matrices from VECM estimates and VAR-LiNGAM estimates in a seventh-order model, including standard errors from 100 bootstrap samples.
Concerning the relationships between the policy variables (BR, NBR and FFR) and the variables describing the state of the economy (GDP, PGDP and PSCCOM), our results suggest that within the period the Fed observes and reacts to macroeconomic variables, but that policy actions have significant effects on the economy only with lags. Regarding significant lagged effects, we see that GDP is affected positively by NBR only with a two-month lag (and also with 4-, 6- and 7-month lags, which are not displayed in the table).
Figure 9 displays the impulse response functions of GDP, PGDP and FFR to one-standard-deviation shocks to NBR, BR and FFR, with 99% confidence bands. The responses to NBR shocks are shown in the first column of the figure, while the responses to BR and FFR shocks are displayed in the second and third columns, respectively. Qualitatively, the dynamic responses to BR and FFR are quite similar: in both cases output falls and the federal funds rate rises, especially in the first months. The price level responds quite slowly, but in the case of the BR shock the price level rises, while in the case of the FFR shock the price level eventually falls. After an NBR shock, output rises, but only between the second and fourth month after the shock, after which it falls again. The price level rises quite rapidly and FFR responds very slowly. As discussed at more length at the end of this section, these results largely confirm the interpretation of the NBR innovation term as an expansionary policy shock and the interpretation of BR and FFR as contractionary policy shocks. However, they suggest that the FFR shock is a better indicator of the monetary policy shock, since its responses conform better to the 'stylized facts' established in the literature.
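Mechanically, impulse responses follow from the estimated matrices: in a one-lag VAR, the horizon-h response to a structural shock is A_1^h (I − B)^{-1} applied to the shock vector (the paper's seven-lag VECM works analogously via its companion form). A toy two-variable sketch, with all numbers hypothetical:

```python
import numpy as np

# Hypothetical one-lag reduced-form VAR coefficient matrix
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
# Hypothetical instantaneous-effects matrix; (I - B)^{-1} maps structural
# shocks into contemporaneous responses of the observed variables
B = np.array([[0.0, 0.0],
              [0.3, 0.0]])
impact = np.linalg.inv(np.eye(2) - B)

# Horizon-h response to a one-unit structural shock to variable 1
shock = np.array([1.0, 0.0])
irf = [np.linalg.matrix_power(A1, h) @ impact @ shock for h in range(12)]
```

Because the eigenvalues of A1 lie inside the unit circle here, the responses decay toward zero; confidence bands such as those in Figure 9 would come from repeating this on bootstrap replications.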

We also estimated the model using a series of different estimation methods, namely OLS, LAD and FM-LAD (fully modified least absolute deviations, proposed by Phillips, 1995). All these methods deliver the same contemporaneous causal order, and the coefficients of the respective B matrices of instantaneous effects have quite similar magnitudes and the same signs (except for the influence of NBR_t on FFR_t, which turns out to be negative in the FM-LAD case, but is positive, although statistically insignificant, in all the other cases).

Discussion
Shocks to NBR, BR and FFR represent the sources of variation in central bank policy instruments which are not due to systematic responses to variations in the state of the economy. Shocks to policy variables can be interpreted as exogenous shocks to the preferences of the monetary authority (preferences about the weight to be given to growth and inflation, for example), as variations in monetary policy generated by the influence that some of the stochastic expectations of private agents (independent of macroeconomic conditions) may have on the Fed's decisions, and, finally, as measurement errors (see Christiano et al., 1999, pp. 71-72). There is a large debate as to which variable, among NBR, BR and FFR, best reflects policy actions (cf. Bernanke and Mihov, 1998, and references therein). While there is no consensus about the right measure of the policy shock, there is considerable agreement about the qualitative effects of a monetary policy shock. As argued by Christiano et al. (1999, p. 69), 'the nature of this agreement is as follows: after a contractionary monetary policy shock, short term interest rates rise, aggregate output, employment, profits and various monetary aggregates fall, the aggregate price level responds very slowly, and various measures of wages fall, albeit by very modest amounts.' As regards the full sample 1965-1996, the FFR shock is the one which best conforms to this pattern. However, in the first subsample analysed (1965-1979) the NBR shock (with the opposite sign) is the most consistent with the stylized facts of Christiano et al. (1999). In the subsequent subsamples both the BR and FFR shocks conform quite well.
In sum, this suggests that the Fed may have changed its policy instrument across years: in previous years NBR was the variable which corresponded more closely to policy shocks, and in the subsequent years this role has been taken by BR and FFR.

VI. Conclusion
We have described a new approach to the identification of a structural VAR, applicable when the reduced-form VAR residuals are non-Gaussian. This approach is based on LiNGAM, a technique for causal inference recently developed in the machine-learning community. The technique exploits non-Gaussian structure in the residuals to identify the independent components corresponding to the unobserved structural innovation terms. Assuming a recursive structure among the contemporaneous variables of the VAR, the technique is able to uncover the causal dependencies among the relevant variables. This permits us to analyse how an innovation term is propagated through the system over time.
We have applied this method to two different databases. In the first application we have analysed the relationship between firm performance and R&D investment. We find that sales growth has a relatively strong influence on growth of all other variables: employment, R&D expenditure and operating income. Employment growth also has a strong influence on subsequent sales growth. Growth of operating income has little effect on growth of any of the other variables, however.
In the second application we have examined the mutual effects between monetary policy and macroeconomic performance and have addressed the issue of the appropriate indicator of the monetary policy shock. We find that within the period of one month the central bank monitors the conditions of the economy, but the economy responds to central bank policy with lags. In line with the consensus existing in the literature about the qualitative effect of a monetary policy shock, we find that the shock to the federal funds rate is the shock that most appropriately reflects innovations in monetary policy. However, taking into account some sub-samples, we find evidence that in some periods the non-borrowed reserve shock and the borrowed reserve shock are better indicators of the exogenous policy shock. This suggests that the Fed has changed the target of its policy several times between 1965 and 1995.
The method proposed has the advantage of being data-driven: identification of the model is achieved without appealing to theoretical intuitions about causal dependencies. Our approach, however, is based on some assumptions, which may be seen as limitations. One possible drawback is the underlying assumption of recursiveness (acyclicity). Although the recursiveness assumption is quite common in the literature on structural VARs, in principle it is possible that there are directed causal cycles among variables within the measured period. Lacerda et al. (2008) generalized the ICA-based approach to causal discovery by relaxing the assumption that the underlying causal structure has to be acyclic. This method is yet to be applied to the SVAR framework. Other assumptions which might be relaxed in future research are causal sufficiency (allowing for the possibility of confounding latent variables) and the related assumption that the number of independent components is equal to the number of observed variables.