School Leader Collaboration and Gender: A Social Network Analysis using ERGMs

Introduction

As stated in Carolan’s (2013) chapter on Network Data and Statistical Models, using statistical techniques to model social network data are important as they allow you to employ social network analysis in ways that “move beyond description toward explanation.” These inferential statistics can be used to make predictions from the data and test propositions about network-related properties. As Carolan noted, the statistical approach is concerned on the reproducibility of the network data and understanding the probablistic realizations of an underlying true tendency.

In order to analyze this probability in network studies, simulations are used. Exponential random graph models, or ERGMs, provide a way to determine whether network properties occur by chance as a result of other network properties or whether the observed properties are unlikely given other parameters in the network.

So….ERGMs - what are they, and what are they used for? 🤔 I’m still not totally sure either, but we will both find out together by diving into some data.

The Data

In Daly’s Network of School District Leaders data, information regarding the leadership network were collected at two school districts over 3 consecutive years. The particular dataset being used here describes the collaboration network of 43 school leaders in year 3 of the three-year study. It is a directed valued (weighted) network using a 5 point scale from 0-4 with higher values indicating more frequent collaborations. It also contains attribute data gender (male, (1=yes, 0=no)), efficacy (on a 9-point Likert scale), trust (on a 7-point Likert scale), and whether or not the school leader is at the school or district level (1=yes, 0=no).

Summary Statistics for Daly’s School Leader Network

Purpose

In order to more fully understand how ERGMs work, we will try to answer to the question:

Is a school leader’s gender predictive of the number of ties he/she has in the year 3 collaborative network?

We can compare our findings with that of the analysis conducted in Carolan (2013) which showed that gender is not significantly related to either sending or receiving a tie for the confidential exchange network. Again, with the use of ERGMs we can ascertain whether gender is significant in explaining the ties in the collaboration network of the school leaders and predictive of the outcomes.

Methods

In order to get the data ready for ERGM analysis, we will use the following packages - readxl, tidyverse, igraph, tidygraph, ggraph - and do the following:

Dichotomize the matrix data
Create Network Graph Object
Look at some descriptive statistics on our network
Visualize our network

For the ERGM analysis, we will use the statnet package to do the following:

Create network object for ERGM analysis
Run summaries of our model
Model using ERGM
Look at Goodness of Fit
Examine Monte-Carlo Markov Chain (MCMC) Diagnostics to see if model is good representation of our network

Code

Read In Data

First we read in the data using readxl which is take from Carolan’s (2013) student resource’s companion site. The first data set contains node information about the school leaders and their attributes and the second is a square adjacency matrix that contains information about the ties in the network.

# read in data nodes
leader_collab_nodes <- read_excel("school leader attributes.xlsx", col_types = c("text", 
    "numeric", "numeric", "numeric", "numeric"))

# read in data leader matrix
leader_collab_matrix <- read_excel("school leader collab adj matrix.xlsx", col_names = FALSE)

Adjacency Matrix & Graph Object

Next, we dichotomize the information (make it either 0 or 1 for ease of analysis) in the leader collaboration matrix.

# dichotomize matrix
leader_matrix <- leader_collab_matrix %>%
    as.matrix()

leader_matrix[leader_matrix <= 2] <- 0

leader_matrix[leader_matrix >= 3] <- 1

Then we add the row names from the node data into the leader matrix

# add row names
rownames(leader_matrix) <- leader_collab_nodes$ID
colnames(leader_matrix) <- leader_collab_nodes$ID

Then create the adjacency matrix and convert to a standard edge-list for analysis

# get edges
adjacency_matrix <- graph.adjacency(leader_matrix, diag = FALSE)

# convert matrix to standard edge-list
leader_edges <- get.data.frame(adjacency_matrix)

However, in order to calculate our descriptive statistics and plot our network, we will use the tidygraph function tbl_graph to create a network graph object for calculating our descriptive statistics and plotting our network.

# create network graph object
leader_graph <- tbl_graph(edges = leader_edges, nodes = leader_collab_nodes)

Descriptive Stats

Let’s look at some descriptive statistics of our network. In order to do that, we will calculate degrees of centrality, in- and out-degree, to see who sent and received the most ties.

# Node degree
leader_measures <- leader_graph %>%
    activate(nodes) %>%
    mutate(in_degree = centrality_degree(mode = "in")) %>%
    mutate(out_degree = centrality_degree(mode = "out"))

Leader Measures
ID	EFFICACY	TRUST	DISTRICT/SITE	MALE	in_degree	out_degree
1	6.06	4.00	1	0	12	6
2	6.56	5.63	1	0	2	1
3	7.39	4.63	1	0	13	10
4	4.89	4.00	1	0	5	7
5	6.06	5.75	0	1	9	1
6	7.39	4.38	0	0	6	1
7	5.56	3.63	0	1	6	2
8	7.50	5.63	1	1	20	24
9	7.67	5.25	0	0	10	5
10	6.64	4.78	0	0	9	17
11	5.06	5.88	0	1	7	23
12	7.44	3.75	0	1	7	1
13	8.28	5.13	0	0	6	8
14	5.67	5.50	0	0	8	15
15	6.72	5.63	0	1	12	12
16	4.61	3.75	1	0	8	1
17	7.17	5.13	0	0	6	2
18	5.00	4.13	1	1	12	12
19	7.89	3.88	0	1	9	1
20	7.17	5.75	0	1	6	7
21	7.44	5.38	0	1	9	5
22	4.94	3.75	1	0	9	7
23	8.50	5.50	0	0	7	3
24	5.22	5.13	1	0	7	6
25	6.94	4.63	1	0	6	8
26	7.00	4.88	0	0	9	2
27	5.06	4.50	1	0	12	8
28	7.61	4.50	1	1	11	42
29	5.67	4.13	1	1	4	2
30	5.50	4.88	0	1	5	0
31	6.78	4.25	0	1	6	1
32	5.88	4.13	1	0	8	3
33	5.89	5.13	0	0	8	7
34	5.12	4.25	1	1	12	17
35	7.44	3.88	0	1	7	1
36	7.72	5.75	1	0	8	12
37	6.64	4.78	1	1	11	32
38	7.56	5.75	1	1	14	29
39	8.00	5.63	0	0	6	0
40	7.33	5.38	0	0	6	0
41	8.17	4.13	0	0	7	4
42	8.06	5.25	0	1	9	7
43	6.72	3.88	0	0	8	10

Looks like leader 8 is on 🔥

Here is a summary of the node measure we have just calculated:

# node measures
node_measures <- leader_measures %>%
    activate(nodes) %>%
    data.frame()

ID	EFFICACY	TRUST	DISTRICT.SITE	MALE	in_degree	out_degree
Length:43	Min. :4.610	Min. :3.630	Min. :0.0000	Min. :0.0000	Min. : 2.000	Min. : 0.000
Class :character	1st Qu.:5.670	1st Qu.:4.130	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.: 6.000	1st Qu.: 1.500
Mode :character	Median :6.780	Median :4.780	Median :0.0000	Median :0.0000	Median : 8.000	Median : 6.000
NA	Mean :6.649	Mean :4.783	Mean :0.4186	Mean :0.4419	Mean : 8.419	Mean : 8.419
NA	3rd Qu.:7.470	3rd Qu.:5.440	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.: 9.500	3rd Qu.:11.000
NA	Max. :8.500	Max. :5.880	Max. :1.0000	Max. :1.0000	Max. :20.000	Max. :42.000

Click on the tabs to find summaries of the following measures grouped by whether a leader is at the school (0) or district level (1):

In and out degree

Summary Statistics, In-degree
DISTRICT.SITE	n	mean	sd
0	25	7.520000	1.661325
1	18	9.666667	4.228753

Summary Statistics, Out-degree
DISTRICT.SITE	n	mean	sd
0	25	5.40000	5.993051
1	18	12.61111	11.712934

Looks like school leaders at the district level sent out way more ties than at the school level but the results were pretty comparable for receiving ties between school and district.

Trust

Summary Statistics, Trust
DISTRICT.SITE	n	mean	sd
0	25	4.9044	0.7191666
1	18	4.6150	0.6880514

Efficacy

Summary Statistics, Efficacy
DISTRICT.SITE	n	mean	sd
0	25	7.022000	0.9534149
1	18	6.131667	1.1151906

Plot

This is the fun part - let’s plot our network using ggraph! 💹

You can see the density of the network/number of ties in the sociogram. It seems like there might be some correlation of gender to the number of ties that are sent out. We can explore further using ERGMs.

# Visualization
ggraph(leader_measures, layout = "kk") + geom_node_point(aes(size = in_degree, alpha = out_degree, 
    colour = factor(`DISTRICT/SITE`), shape = factor(MALE))) + geom_node_text(aes(label = ID), 
    repel = TRUE) + geom_edge_link(color = "grey", arrow = arrow(type = "closed", 
    angle = 15, length = unit(0.5, "mm")), show.legend = FALSE) + theme_graph()

ERGM

Now it is time for the pièce de résistance, the ERGM.

We’ll first have to create a network object to run our summaries and models. The statnet package command as.network will take care of this for us:

# create network object
leader_network <- as.network(leader_edges, vertices = leader_collab_nodes)

As stated in the vignette for the package, it is good practice to run summary stats on the network measures before running the model so you understand (🤔) what you are looking at. We’ll use the summary function:

# Estimate ERGM Model, good practice to run summary first
summary(leader_network ~ edges + mutual) %>%
    kable(caption = "Summary of Leader Network: Edges and Tie Reciprocation") %>%
    kable_paper("hover", full_width = T, fixed_thead = T)

Summary of Leader Network: Edges and Tie Reciprocation
	x
edges	362
mutual	91

As you can see, there are 362 edges, and 91 were mutual dyad pairs.

The first model we will run will look at just the network structure, the edges and reciprocity (or mutual). We will set a specific seed so we can reproduce our results:

The results of the model is shown below:

summary(ergm_mod_1)
Call:
ergm(formula = leader_network ~ edges + mutual)

Monte Carlo Maximum Likelihood Results:

       Estimate Std. Error MCMC % z value Pr(>|z|)    
edges  -1.95952    0.08525      0  -22.99   <1e-04 ***
mutual  1.95741    0.19500      0   10.04   <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2504  on 1806  degrees of freedom
 Residual Deviance: 1699  on 1804  degrees of freedom
 
AIC: 1703  BIC: 1714  (Smaller is better. MC Std. Err. = 1.206)

For this network, it seems like the parameter ‘mutual’ or reciprocated ties between school leaders occurs more than would be expected by chance, very similar to results found in the confidential network analysis. The negative estimate of the number of edges implies the probably of a collaborative tie in year 3 is low, again much like the confidential network analysis.

Now let’s look at our specific research question, is gender predictive of the number of ties a school leader has in the year 3 collaborative network. To figure this out, we’ll add the co-variate male using the nodefactor specification.

First, we run our summary statistics on the model:

summary(leader_network ~ edges + mutual + gwesp(0.25, fixed = T) + nodefactor("MALE"))

            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
         362.0000           91.0000          431.5217          395.0000

Then add the co-variate information using the nodefactor specification:

ergm_3 <- ergm(leader_network ~ edges + mutual + gwesp(0.25, fixed = T) + nodefactor("MALE"))

Here are the results of the model:

summary(ergm_3)
Call:
ergm(formula = leader_network ~ edges + mutual + gwesp(0.25, 
    fixed = T) + nodefactor("MALE"))

Monte Carlo Maximum Likelihood Results:

                  Estimate Std. Error MCMC % z value Pr(>|z|)    
edges             -5.22419    0.35661      0 -14.650  < 1e-04 ***
mutual             1.42892    0.19020      0   7.513  < 1e-04 ***
gwesp.fixed.0.25   2.20071    0.27991      0   7.862  < 1e-04 ***
nodefactor.MALE.1  0.15007    0.04992      0   3.006  0.00265 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2504  on 1806  degrees of freedom
 Residual Deviance: 1546  on 1802  degrees of freedom
 
AIC: 1554  BIC: 1576  (Smaller is better. MC Std. Err. = 0.9141)

In comparison with the confidential network analysis, there seems to be a slight, but this time moderately statistically significant, tendency towards male leaders being more likely to send or receive ties in the year 3 data.

Goodness of Fit

Before we leave the world of ERGMs, we want to make sure the model we created is a “good fit” of the data, or “how well it reproduces the observed global network properties that are not in the model.” To do this, we will use the gof() function.

ergm_3_gof <- gof(ergm_3)

The black line, which is our actual observed network measure, more or less follows the blue aggregate measures generated from the simulations. The model could probably be a little better, but hey we’ll take it. 🤷🏿

MCMC Diagnostics

Last but not least, we should look at the Monte-Carlo Markov Chain (MCMC) diagnostics to really make sure our model simulation would actually produce a similar network to the one that was observed. We do this by running the mcmc.diagnostics() function.

mcmc.diagnostics(ergm_3)

Sample statistics summary:

Iterations = 1843200:7561216
Thinning interval = 4096 
Number of chains = 1 
Sample size per chain = 1397 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                    Mean     SD Naive SE Time-series SE
edges             1.2341 31.314   0.8378         1.9608
mutual            0.1875  9.477   0.2536         0.5102
gwesp.fixed.0.25  1.8149 41.423   1.1083         2.5004
nodefactor.MALE.1 1.8604 29.277   0.7833         1.3837

2. Quantiles for each variable:

                    2.5%    25%   50%   75% 97.5%
edges             -67.00 -18.00 4.000 24.00 56.00
mutual            -18.00  -6.00 0.000  6.00 18.00
gwesp.fixed.0.25  -87.25 -23.18 4.815 31.98 74.89
nodefactor.MALE.1 -59.00 -17.00 3.000 22.00 57.00


Are sample statistics significantly different from observed?
               edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
diff.      1.2340730 0.1875447        1.8149324         1.8604152
test stat. 0.6293765 0.3676094        0.7258652         1.3444751
P-val.     0.5291026 0.7131645        0.4679214         0.1787948
           Overall (Chi^2)
diff.                   NA
test stat.       7.0079517
P-val.           0.1383314

Sample statistics cross-correlations:
                      edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
edges             1.0000000 0.8219976        0.9955156         0.7550109
mutual            0.8219976 1.0000000        0.8312809         0.6410802
gwesp.fixed.0.25  0.9955156 0.8312809        1.0000000         0.7669329
nodefactor.MALE.1 0.7550109 0.6410802        0.7669329         1.0000000

Sample statistics auto-correlation:
Chain 1 
              edges    mutual gwesp.fixed.0.25 nodefactor.MALE.1
Lag 0     1.0000000 1.0000000        1.0000000        1.00000000
Lag 4096  0.6388993 0.4456455        0.6038312        0.45171927
Lag 8192  0.4632409 0.3405171        0.4365858        0.26873912
Lag 12288 0.3470160 0.2346855        0.3231989        0.17792970
Lag 16384 0.2438624 0.1818677        0.2249331        0.10618711
Lag 20480 0.1765139 0.1210351        0.1604500        0.07718246

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
           -1.408            -1.893            -1.457            -1.165 

Individual P-values (lower = worse):
            edges            mutual  gwesp.fixed.0.25 nodefactor.MALE.1 
        0.1589864         0.0583848         0.1451023         0.2440729 
Joint P-value (lower = worse):  0.6055984 .


MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

The charts look roughly balanced, so we’ll call this a day - our model did not fail to converge.

Conclusion

By using ERGMs, we have established a slight correlation between gender and the number of ties in the year-3 collaborative network of school leaders. This differs from the year-3 confidential ties network which showed no significant correlation. These results warrant further investigation into what role gender may play in the formation of these networks and how these relationships effect the learning environment within these schools and districts.

Final Thoughts

This is the last assignment in completion of the of the Learning Analytics graduate certificate program at North Carolina State University. 🥳🎉I want to thank our Professor @sbkellogg for such a great learning experience and all the great colleagues/classmates I’ve had throughout the program that helped me to be better in #RStats.

Por último, pero no menos importante, les quisera agraceder a mis niñas pequeñitas por su paciencia cuando mami hacia su ‘trabajo por la computadora.’ Las quiero mucho ❤️.

About Me

By day, I am Assistant Director for the Global Education Office at Duke University. By (late) night, I mess around with data. Follow me on Twitter @sorayaworldwide