School Leader Collaboration and Gender: A Social Network Analysis using ERGMs

Introduction

As stated in Carolan’s (2013) chapter on Network Data and Statistical Models, using statistical techniques to model social network data are important as they allow you to employ social network analysis in ways that “move beyond description toward explanation.” These inferential statistics can be used to make predictions from the data and test propositions about network-related properties. As Carolan noted, the statistical approach is concerned on the reproducibility of the network data and understanding the probablistic realizations of an underlying true tendency.

In order to analyze this probability in network studies, simulations are used. Exponential random graph models, or ERGMs, provide a way to determine whether network properties occur by chance as a result of other network properties or whether the observed properties are unlikely given other parameters in the network.

So….ERGMs - what are they, and what are they used for? 🤔 I’m still not totally sure either, but we will both find out together by diving into some data.

The Data

In Daly’s Network of School District Leaders data, information regarding the leadership network were collected at two school districts over 3 consecutive years. The particular dataset being used here describes the collaboration network of 43 school leaders in year 3 of the three-year study. It is a directed valued (weighted) network using a 5 point scale from 0-4 with higher values indicating more frequent collaborations. It also contains attribute data gender (male, (1=yes, 0=no)), efficacy (on a 9-point Likert scale), trust (on a 7-point Likert scale), and whether or not the school leader is at the school or district level (1=yes, 0=no).

Summary Statistics for Daly’s School Leader Network

Purpose

In order to more fully understand how ERGMs work, we will try to answer to the question:

Is a school leader’s gender predictive of the number of ties he/she has in the year 3 collaborative network?

We can compare our findings with that of the analysis conducted in Carolan (2013) which showed that gender is not significantly related to either sending or receiving a tie for the confidential exchange network. Again, with the use of ERGMs we can ascertain whether gender is significant in explaining the ties in the collaboration network of the school leaders and predictive of the outcomes.

Methods

In order to get the data ready for ERGM analysis, we will use the following packages - readxl, tidyverse, igraph, tidygraph, ggraph - and do the following:

  1. Dichotomize the matrix data
  2. Create Network Graph Object
  3. Look at some descriptive statistics on our network
  4. Visualize our network

For the ERGM analysis, we will use the statnet package to do the following:

  1. Create network object for ERGM analysis
  2. Run summaries of our model
  3. Model using ERGM
  4. Look at Goodness of Fit
  5. Examine Monte-Carlo Markov Chain (MCMC) Diagnostics to see if model is good representation of our network

Code

Read In Data

First we read in the data using readxl which is take from Carolan’s (2013) student resource’s companion site. The first data set contains node information about the school leaders and their attributes and the second is a square adjacency matrix that contains information about the ties in the network.

# read in data nodes
leader_collab_nodes <- read_excel("school leader attributes.xlsx", col_types = c("text", 
    "numeric", "numeric", "numeric", "numeric"))

# read in data leader matrix
leader_collab_matrix <- read_excel("school leader collab adj matrix.xlsx", col_names = FALSE)

Adjacency Matrix & Graph Object

Next, we dichotomize the information (make it either 0 or 1 for ease of analysis) in the leader collaboration matrix.

# dichotomize matrix
leader_matrix <- leader_collab_matrix %>%
    as.matrix()

leader_matrix[leader_matrix <= 2] <- 0

leader_matrix[leader_matrix >= 3] <- 1

Then we add the row names from the node data into the leader matrix

# add row names
rownames(leader_matrix) <- leader_collab_nodes$ID
colnames(leader_matrix) <- leader_collab_nodes$ID

Then create the adjacency matrix and convert to a standard edge-list for analysis

# get edges
adjacency_matrix <- graph.adjacency(leader_matrix, diag = FALSE)

# convert matrix to standard edge-list
leader_edges <- get.data.frame(adjacency_matrix)

However, in order to calculate our descriptive statistics and plot our network, we will use the tidygraph function tbl_graph to create a network graph object for calculating our descriptive statistics and plotting our network.

# create network graph object
leader_graph <- tbl_graph(edges = leader_edges, nodes = leader_collab_nodes)

Descriptive Stats

Let’s look at some descriptive statistics of our network. In order to do that, we will calculate degrees of centrality, in- and out-degree, to see who sent and received the most ties.

# Node degree
leader_measures <- leader_graph %>%
    activate(nodes) %>%
    mutate(in_degree = centrality_degree(mode = "in")) %>%
    mutate(out_degree = centrality_degree(mode = "out"))
Leader Measures
ID EFFICACY TRUST DISTRICT/SITE MALE in_degree out_degree
1 6.06 4.00 1 0 12 6
2 6.56 5.63 1 0 2 1
3 7.39 4.63 1 0 13 10
4 4.89 4.00 1 0 5 7
5 6.06 5.75 0 1 9 1
6 7.39 4.38 0 0 6 1
7 5.56 3.63 0 1 6 2
8 7.50 5.63 1 1 20 24
9 7.67 5.25 0 0 10 5
10 6.64 4.78 0 0 9 17
11 5.06 5.88 0 1 7 23
12 7.44 3.75 0 1 7 1
13 8.28 5.13 0 0 6 8
14 5.67 5.50 0 0 8 15
15 6.72 5.63 0 1 12 12
16 4.61 3.75 1 0 8 1
17 7.17 5.13 0 0 6 2
18 5.00 4.13 1 1 12 12
19 7.89 3.88 0 1 9 1
20 7.17 5.75 0 1 6 7
21 7.44 5.38 0 1 9 5
22 4.94 3.75 1 0 9 7
23 8.50 5.50 0 0 7 3
24 5.22 5.13 1 0 7 6
25 6.94 4.63 1 0 6 8
26 7.00 4.88 0 0 9 2
27 5.06 4.50 1 0 12 8
28 7.61 4.50 1 1 11 42
29 5.67 4.13 1 1 4 2
30 5.50 4.88 0 1 5 0
31 6.78 4.25 0 1 6 1
32 5.88 4.13 1 0 8 3
33 5.89 5.13 0 0 8 7
34 5.12 4.25 1 1 12 17
35 7.44 3.88 0 1 7 1
36 7.72 5.75 1 0 8 12
37 6.64 4.78 1 1 11 32
38 7.56 5.75 1 1 14 29
39 8.00 5.63 0 0 6 0
40 7.33 5.38 0 0 6 0
41 8.17 4.13 0 0 7 4
42 8.06 5.25 0 1 9 7
43 6.72 3.88 0 0 8 10

Looks like leader 8 is on 🔥

Here is a summary of the node measure we have just calculated:

# node measures
node_measures <- leader_measures %>%
    activate(nodes) %>%
    data.frame()
ID EFFICACY TRUST DISTRICT.SITE MALE in_degree out_degree
Length:43 Min. :4.610 Min. :3.630 Min. :0.0000 Min. :0.0000 Min. : 2.000 Min. : 0.000
Class :character 1st Qu.:5.670 1st Qu.:4.130 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000 1st Qu.: 1.500
Mode :character Median :6.780 Median :4.780 Median :0.0000 Median :0.0000 Median : 8.000 Median : 6.000
NA Mean :6.649 Mean :4.783 Mean :0.4186 Mean :0.4419 Mean : 8.419 Mean : 8.419
NA 3rd Qu.:7.470 3rd Qu.:5.440 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 9.500 3rd Qu.:11.000
NA Max. :8.500 Max. :5.880 Max. :1.0000 Max. :1.0000 Max. :20.000 Max. :42.000

Click on the tabs to find summaries of the following measures grouped by whether a leader is at the school (0) or district level (1):

In and out degree

Summary Statistics, In-degree
DISTRICT.SITE n mean sd
0 25 7.520000 1.661325
1 18 9.666667 4.228753
Summary Statistics, Out-degree
DISTRICT.SITE n mean sd
0 25 5.40000 5.993051
1 18 12.61111 11.712934

Looks like school leaders at the district level sent out way more ties than at the school level but the results were pretty comparable for receiving ties between school and district.

Trust

Summary Statistics, Trust
DISTRICT.SITE n mean sd
0 25 4.9044 0.7191666
1 18 4.6150 0.6880514

Efficacy

Summary Statistics, Efficacy
DISTRICT.SITE n mean sd
0 25 7.022000 0.9534149
1 18 6.131667 1.1151906

Plot

This is the fun part - let’s plot our network using ggraph! 💹

You can see the density of the network/number of ties in the sociogram. It seems like there might be some correlation of gender to the number of ties that are sent out. We can explore further using ERGMs.

# Visualization
ggraph(leader_measures, layout = "kk") + geom_node_point(aes(size = in_degree, alpha = out_degree, 
    colour = factor(`DISTRICT/SITE`), shape = factor(MALE))) + geom_node_text(aes(label = ID), 
    repel = TRUE) + geom_edge_link(color = "grey", arrow = arrow(type = "closed", 
    angle = 15, length = unit(0.5, "mm")), show.legend = FALSE) + theme_graph()

ERGM

Now it is time for the pièce de résistance, the ERGM.

We’ll first have to create a network object to run our summaries and models. The statnet package command as.network will take care of this for us:

# create network object
leader_network <- as.network(leader_edges, vertices = leader_collab_nodes)

As stated in the vignette for the package, it is good practice to run summary stats on the network measures before running the model so you understand (🤔) what you are looking at. We’ll use the summary function:

# Estimate ERGM Model, good practice to run summary first
summary(leader_network ~ edges + mutual) %>%
    kable(caption = "Summary of Leader Network: Edges and Tie Reciprocation") %>%
    kable_paper("hover", full_width = T, fixed_thead = T)
Summary of Leader Network: Edges and Tie Reciprocation
x
edges 362
mutual 91

As you can see, there are 362 edges, and 91 were mutual dyad pairs.

The first model we will run will look at just the network structure, the edges and reciprocity (or mutual). We will set a specific seed so we can reproduce our results:

The results of the model is shown below:

summary(ergm_mod_1)
Call:
ergm(formula = leader_network ~ edges + mutual)

Monte Carlo Maximum Likelihood Results:

       Estimate Std. Error MCMC % z value Pr(>|z|)    
edges  -1.95952    0.08525      0  -22.99   <1e-04 ***
mutual  1.95741    0.19500      0   10.04   <1e-04 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2504  on 1806  degrees of freedom
 Residual Deviance: 1699  on 1804  degrees of freedom
 
AIC: 1703  BIC: 1714  (Smaller is better. MC Std. Err. = 1.206)

For this network, it seems like the parameter ‘mutual’ or reciprocated ties between school leaders occurs more than would be expected by chance, very similar to results found in the confidential network analysis. The negative estimate of the number of edges implies the probably of a collaborative tie in year 3 is low, again much like the confidential network analysis.

Now let’s look at our specific research question, is gender predictive of the number of ties a school leader has in the year 3 collaborative network. To figure this out, we’ll add the co-variate male using the nodefactor specification.

First, we run our summary statistics on the model:

summary(leader_network ~ edges + mutual + gwesp(0.25, fixed = T) + nodefactor("MALE"))
            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
         362.0000           91.0000          431.5217          395.0000 

Then add the co-variate information using the nodefactor specification:

ergm_3 <- ergm(leader_network ~ edges + mutual + gwesp(0.25, fixed = T) + nodefactor("MALE"))

Here are the results of the model:

summary(ergm_3)
Call:
ergm(formula = leader_network ~ edges + mutual + gwesp(0.25, 
    fixed = T) + nodefactor("MALE"))

Monte Carlo Maximum Likelihood Results:

                  Estimate Std. Error MCMC % z value Pr(>|z|)    
edges             -5.22419    0.35661      0 -14.650  < 1e-04 ***
mutual             1.42892    0.19020      0   7.513  < 1e-04 ***
gwesp.fixed.0.25   2.20071    0.27991      0   7.862  < 1e-04 ***
nodefactor.MALE.1  0.15007    0.04992      0   3.006  0.00265 ** 
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2504  on 1806  degrees of freedom
 Residual Deviance: 1546  on 1802  degrees of freedom
 
AIC: 1554  BIC: 1576  (Smaller is better. MC Std. Err. = 0.9141)

In comparison with the confidential network analysis, there seems to be a slight, but this time moderately statistically significant, tendency towards male leaders being more likely to send or receive ties in the year 3 data.

Goodness of Fit

Before we leave the world of ERGMs, we want to make sure the model we created is a “good fit” of the data, or “how well it reproduces the observed global network properties that are not in the model.” To do this, we will use the gof() function.

ergm_3_gof <- gof(ergm_3)

The black line, which is our actual observed network measure, more or less follows the blue aggregate measures generated from the simulations. The model could probably be a little better, but hey we’ll take it. 🤷🏿

MCMC Diagnostics

Last but not least, we should look at the Monte-Carlo Markov Chain (MCMC) diagnostics to really make sure our model simulation would actually produce a similar network to the one that was observed. We do this by running the mcmc.diagnostics() function.

mcmc.diagnostics(ergm_3)
Sample statistics summary:

Iterations = 1843200:7561216
Thinning interval = 4096 
Number of chains = 1 
Sample size per chain = 1397 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                    Mean     SD Naive SE Time-series SE
edges             1.2341 31.314   0.8378         1.9608
mutual            0.1875  9.477   0.2536         0.5102
gwesp.fixed.0.25  1.8149 41.423   1.1083         2.5004
nodefactor.MALE.1 1.8604 29.277   0.7833         1.3837

2. Quantiles for each variable:

                    2.5%    25%   50%   75% 97.5%
edges             -67.00 -18.00 4.000 24.00 56.00
mutual            -18.00  -6.00 0.000  6.00 18.00
gwesp.fixed.0.25  -87.25 -23.18 4.815 31.98 74.89
nodefactor.MALE.1 -59.00 -17.00 3.000 22.00 57.00


Are sample statistics significantly different from observed?
               edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
diff.      1.2340730 0.1875447        1.8149324         1.8604152
test stat. 0.6293765 0.3676094        0.7258652         1.3444751
P-val.     0.5291026 0.7131645        0.4679214         0.1787948
           Overall (Chi^2)
diff.                   NA
test stat.       7.0079517
P-val.           0.1383314

Sample statistics cross-correlations:
                      edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
edges             1.0000000 0.8219976        0.9955156         0.7550109
mutual            0.8219976 1.0000000        0.8312809         0.6410802
gwesp.fixed.0.25  0.9955156 0.8312809        1.0000000         0.7669329
nodefactor.MALE.1 0.7550109 0.6410802        0.7669329         1.0000000

Sample statistics auto-correlation:
Chain 1 
              edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
Lag 0     1.0000000 1.0000000        1.0000000        1.00000000
Lag 4096  0.6388993 0.4456455        0.6038312        0.45171927
Lag 8192  0.4632409 0.3405171        0.4365858        0.26873912
Lag 12288 0.3470160 0.2346855        0.3231989        0.17792970
Lag 16384 0.2438624 0.1818677        0.2249331        0.10618711
Lag 20480 0.1765139 0.1210351        0.1604500        0.07718246

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
           -1.408            -1.893            -1.457            -1.165 

Individual P-values (lower = worse):
            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
        0.1589864         0.0583848         0.1451023         0.2440729 
Joint P-value (lower = worse):  0.6055984 .


MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

The charts look roughly balanced, so we’ll call this a day - our model did not fail to converge.

Conclusion

By using ERGMs, we have established a slight correlation between gender and the number of ties in the year-3 collaborative network of school leaders. This differs from the year-3 confidential ties network which showed no significant correlation. These results warrant further investigation into what role gender may play in the formation of these networks and how these relationships effect the learning environment within these schools and districts.

Final Thoughts

This is the last assignment in completion of the of the Learning Analytics graduate certificate program at North Carolina State University. 🥳🎉I want to thank our Professor @sbkellogg for such a great learning experience and all the great colleagues/classmates I’ve had throughout the program that helped me to be better in #RStats.

Por último, pero no menos importante, les quisera agraceder a mis niñas pequeñitas por su paciencia cuando mami hacia su ‘trabajo por la computadora.’ Las quiero mucho ❤️.

About Me

By day, I am Assistant Director for the Global Education Office at Duke University. By (late) night, I mess around with data. Follow me on Twitter @sorayaworldwide