Linear Fixed Effects Models on Undirected Networks

Rastin Seysan
5 min read · Nov 5, 2021


The fixed effects (or within) estimator is widely used in economics and in many other areas of empirical research that seek to make causal inferences. With growing interest in network data, studies are increasingly applying fixed effects models to networks in order to absorb idiosyncratic factors associated with each agent, represented by a vertex (node) in the corresponding graph.

Here I will explain why using econometric packages out of the box to implement a model with vertex fixed effects for data from an undirected network is not ideal and why you may still do it despite the issues.

Take the canonical example of a social network, where vertices represent people and edges the connections between them. For instance, if we want to study whether becoming Facebook friends causes two people to attend more of the same events, we would naturally want to be sure that our findings are not influenced by the fact that some people tend to be more gregarious than others, or that a particular person is invited to more events. To control for all unobservable factors that would cause a person to make more connections or go to more events, we need to include person fixed effects in our model.

Network-based panel datasets are often arranged as observations of edges recorded through time, where each edge is identified by the vertices it connects. In the case of our example, the dataset records person 1, person 2 and time indices for each observation, so we know that Alice and Bob were not Facebook friends in 2015, when they attended 10 events together, but in 2020, when they became Facebook friends, they attended 3 events together. To account for individual unobservable factors, we would like to introduce dummies representing each individual, estimating a least squares dummy variable (LSDV) model. However, since our dataset might be prohibitively large, we may opt for the within estimator, which can be shown to be equivalent to LSDV.

The proof of this equivalence makes use of the Frisch–Waugh–Lovell theorem to partial out the dummies in the first stage, before estimating the second stage equation for the regressors of interest. Briefly, the proof states that when regressing on the dummies for the first stage, because the model is being estimated on a dummy for each group, the residuals are in effect the same as the demeaned values calculated for the within estimator (for a more detailed explanation refer to this post or this one). However, in the case of our example, there would be two dummies that turn on for each observation, representing the ends of the focal graph edge. This would no longer result in residuals that are the same as demeaned values, and would break the equivalence between LSDV and the fixed effects estimator.
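To make the argument concrete, here is a minimal numerical sketch, assuming a tiny made-up edge list and numpy only. The function name `residualize` and the toy data are my own illustration, not part of any econometric package. It checks that residuals from a regression on one dummy per observation match group demeaning, while residuals from vertex dummies (two switched on per observation) do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy edge list: each row is (person_1, person_2) for one observation.
edges = np.array([[0, 1], [0, 2], [1, 2], [2, 3], [3, 0], [1, 3]])
n_obs, n_people = len(edges), edges.max() + 1
x = rng.normal(size=n_obs)  # any variable we want to partial the dummies out of


def residualize(values, dummies):
    """Residuals from an OLS regression of `values` on the dummy columns."""
    coef, *_ = np.linalg.lstsq(dummies, values, rcond=None)
    return values - dummies @ coef


# Ordinary within transformation: demean x inside each person_1 group.
group_means = np.array([x[edges[:, 0] == g].mean() for g in edges[:, 0]])
demeaned = x - group_means

# Case 1: exactly one dummy is switched on per observation (person_1 only).
one_way = np.zeros((n_obs, n_people))
one_way[np.arange(n_obs), edges[:, 0]] = 1.0
same_one_way = np.allclose(residualize(x, one_way), demeaned)

# Case 2: vertex dummies, so two dummies are switched on per observation.
vertex = one_way.copy()
vertex[np.arange(n_obs), edges[:, 1]] = 1.0
same_vertex = np.allclose(residualize(x, vertex), demeaned)

print("one dummy per observation matches demeaning:", same_one_way)
print("two dummies per observation matches demeaning:", same_vertex)
```

The comparison in the second case is against demeaning by the person 1 column only; it is meant to show that the simple "residuals equal demeaned values" step of the proof no longer goes through, not to rule out every possible transformation.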

To consider this from another angle, econometric packages estimating fixed effects models are designed for demeaning based on the assumption that each column contains a different “class” of objects. The problem that arises when Person 1 and Person 2 fixed effects are specified in an econometric package is that, for undirected, non-bipartite graphs such as our example, the edge list data structure makes it impossible to record observations so that each vertex appears exclusively under either vertex 1 or vertex 2. For instance, Alice may appear under the Person 1 column when her connection to Bob is observed, and under Person 2 when her connection with Alex is observed. As a result, simply implementing Person 1 and Person 2 fixed effects would group the edges where Alice appears as Person 2 separately from those where she appears as Person 1, so the groupings depend on the order in which the ends of each edge are recorded.
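The splitting can be seen directly on a small made-up edge list. The sketch below uses only the Python standard library; the names and the edge list are hypothetical. It groups row indices the way column-based fixed effects would, and the way vertex-based grouping would.

```python
from collections import defaultdict

# Hypothetical undirected edge list; the order within each pair is arbitrary.
edges = [("Alice", "Bob"), ("Alex", "Alice"), ("Alice", "Carol"), ("Bob", "Carol")]

by_column = {"person_1": defaultdict(list), "person_2": defaultdict(list)}
by_vertex = defaultdict(list)

for row, (p1, p2) in enumerate(edges):
    # Column-based grouping: what Person 1 / Person 2 fixed effects see.
    by_column["person_1"][p1].append(row)
    by_column["person_2"][p2].append(row)
    # Vertex-based grouping: what a dummy per person (LSDV) sees.
    by_vertex[p1].append(row)
    by_vertex[p2].append(row)

alice_as_p1 = by_column["person_1"]["Alice"]
alice_as_p2 = by_column["person_2"]["Alice"]
alice_all = by_vertex["Alice"]

print("Alice as person 1:", alice_as_p1)
print("Alice as person 2:", alice_as_p2)
print("Alice, all edges: ", alice_all)
```

Alice's three edges form one group under vertex dummies but are spread over two smaller groups under column-based fixed effects.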

Given this issue, we cannot expect our LSDV specification designed to absorb the idiosyncrasies of each vertex/person to be the same as a fixed effects model with person 1 and person 2 fixed effects. However, the LSDV model is prohibitively expensive to estimate for larger networks, so we may not have the option of estimating our model that way. Now the main concern is of course that the fixed effects model would result in biased coefficients. To check whether this would be the case, we use Monte Carlo simulations to compare the two specifications.

Simulations

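Assume the following data generating process, written to match the simulation steps described below:

$$y_{ij} = c_{ij} + 3\,d_{ij} + u_i + v_j + \varepsilon_{ij}$$

where “c” and “d” are normally distributed, “u” and “v” are the unobservable effects associated with persons i and j respectively, and the error term ε is also normally distributed.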

Our simulated data consists of a graph of up to 25 vertices with a maximum of 200 randomly generated edges, to simulate a panel with a relatively large number of small groups. The ends of each edge are drawn uniformly at random from the 25 vertices, and loops and repeated edges are removed. This is most similar to an Erdős–Rényi random graph, but we can also test our simulations using graphs generated from the Barabási–Albert model to get a dataset more akin to real-world scale-free networks.

Next we generate the outcome variable “y” as “c” plus 3 times “d”, and add our normally distributed error term ε. Finally, to add the individual specific factors, we loop through the vertices and add a normally distributed random amount to “y” for the edges connected to each vertex. At this point, we have our simulated panel dataset prepared for regressions.
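A minimal sketch of this procedure in Python with numpy might look as follows. The function name `simulate_edge_data`, the standard normal scale of every draw, and the vectorised way of adding the vertex effects (instead of an explicit loop over vertices) are my assumptions; the steps themselves follow the description above.

```python
import numpy as np


def simulate_edge_data(rng, n_vertices=25, n_draws=200, beta_d=3.0):
    """Simulate one edge-level dataset (hypothetical helper, standard normal draws)."""
    # Draw both ends of each edge uniformly at random from the vertices.
    ends = rng.integers(0, n_vertices, size=(n_draws, 2))
    # Remove loops (edges from a vertex to itself).
    ends = ends[ends[:, 0] != ends[:, 1]]
    # Remove repeated edges, treating (i, j) and (j, i) as the same edge,
    # while keeping the order in which the first occurrence was recorded.
    _, first = np.unique(np.sort(ends, axis=1), axis=0, return_index=True)
    ends = ends[np.sort(first)]

    n = len(ends)
    c = rng.normal(size=n)
    d = rng.normal(size=n)
    y = c + beta_d * d + rng.normal(size=n)  # regressors plus error term

    # One random effect per vertex, added to every edge touching that vertex.
    effects = rng.normal(size=n_vertices)
    y = y + effects[ends[:, 0]] + effects[ends[:, 1]]
    return ends, c, d, y


ends, c, d, y = simulate_edge_data(np.random.default_rng(0))
print("edges kept:", len(ends))
```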

Findings

We generate 10,000 datasets using the above procedure. On each dataset we estimate both the LSDV model, with dummies turning on for both ends of each edge, and the fixed effects model absorbing person 1 and person 2 effects. We find that the fixed effects (FE) estimates are not biased relative to those of the LSDV specification, but the FE model yields systematically inflated standard errors, both with conventional standard errors and with standard errors clustered on the fixed effects.
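The comparison can be sketched with plain numpy as below. This is an illustration under several assumptions rather than the code behind the reported results: it uses 200 replications instead of 10,000, conventional (non-clustered) standard errors only, and it stands in for the FE estimator with an explicit regression on separate person 1 and person 2 dummy sets, which is what absorbing those two effects amounts to. All function names are hypothetical.

```python
import numpy as np


def simulate(rng, n_vertices=25, n_draws=200):
    """One simulated dataset: random edges without loops or repeats."""
    ends = rng.integers(0, n_vertices, size=(n_draws, 2))
    ends = ends[ends[:, 0] != ends[:, 1]]
    _, first = np.unique(np.sort(ends, axis=1), axis=0, return_index=True)
    ends = ends[np.sort(first)]
    n = len(ends)
    c, d = rng.normal(size=n), rng.normal(size=n)
    effects = rng.normal(size=n_vertices)  # one effect per vertex
    y = c + 3.0 * d + effects[ends[:, 0]] + effects[ends[:, 1]] + rng.normal(size=n)
    return ends, c, d, y


def one_hot(index, width):
    out = np.zeros((len(index), width))
    out[np.arange(len(index)), index] = 1.0
    return out


def estimate_d(y, c, d, dummies):
    """OLS coefficient on d and its conventional standard error."""
    X = np.column_stack([c, d, dummies])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - np.linalg.matrix_rank(X)  # dummies may be collinear
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.pinv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])


def run_monte_carlo(n_reps=200, n_vertices=25, seed=0):
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_reps):
        ends, c, d, y = simulate(rng, n_vertices)
        p1 = one_hot(ends[:, 0], n_vertices)
        p2 = one_hot(ends[:, 1], n_vertices)
        lsdv = estimate_d(y, c, d, p1 + p2)                   # one dummy per vertex
        fe = estimate_d(y, c, d, np.column_stack([p1, p2]))   # separate column effects
        rows.append(lsdv + fe)
    return np.array(rows)  # columns: LSDV coef, LSDV SE, FE coef, FE SE


results = run_monte_carlo()
means = results.mean(axis=0)
print("mean LSDV estimate and SE:", means[0], means[1])
print("mean FE estimate and SE:  ", means[2], means[3])
```

The vertex dummies are the sum of the person 1 and person 2 dummies, so the LSDV column space is nested in the FE one. That is why the FE specification can still recover the coefficient while spending more degrees of freedom.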

Testing the difference in the means of the estimates from FE compared to those from LSDV, we find that the t-test fails to reject the null hypothesis that the difference in means is zero, while the Kolmogorov–Smirnov test also fails to reject the null hypothesis that the estimates from the two models come from the same distribution.
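For readers who want to reproduce these checks without a statistics package, here is a hedged numpy-only sketch of the two test statistics. The post does not say which variant of the t-test was used, so the Welch form below is an assumption. The arrays are placeholders drawn from a random generator, not the estimates from the simulations, and 1.358 is the usual large-sample 5% constant for the two-sample Kolmogorov–Smirnov test.

```python
import numpy as np


def welch_t_statistic(a, b):
    """Two-sample t statistic allowing unequal variances (assumed variant)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se


def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def ks_critical_5pct(n, m):
    """Large-sample 5% critical value for the two-sample KS statistic."""
    return 1.358 * np.sqrt((n + m) / (n * m))


# Placeholder arrays standing in for the FE and LSDV estimates.
rng = np.random.default_rng(0)
fe_estimates = rng.normal(3.0, 0.1, size=500)
lsdv_estimates = rng.normal(3.0, 0.1, size=500)

t_stat = welch_t_statistic(fe_estimates, lsdv_estimates)
ks_stat = ks_statistic(fe_estimates, lsdv_estimates)
print("t statistic:", t_stat)
print("KS statistic:", ks_stat, "5% critical value:", ks_critical_5pct(500, 500))
```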

Conclusion

We have established that using vertex 1 and vertex 2 fixed effects for data based on undirected networks is not strictly appropriate, since the equivalence with the corresponding LSDV estimator breaks down for undirected graphs. However, in our simulations the FE estimator yields unbiased estimates with inflated standard errors. This is intuitive: by using two sets of fixed effects, we potentially split each actual group into two separate groups with fewer observations each.

The FE specification does not match the data generating process for undirected networks, but its estimates are not biased. So if an LSDV model is too expensive to estimate, we can use the FE model with vertex 1 and vertex 2 fixed effects instead. Our standard errors will be biased upwards, which makes inference more conservative and so is not detrimental to making causal inferences.
