On Quantifying Knowledge Segregation in Society

With the rapid increase in online information consumption, especially via social media sites, there have been concerns about whether people are getting selective exposure to a biased subset of the information space, where a user receives more of what she already knows and thereby potentially gets trapped in echo chambers or filter bubbles. Even though such concerns have been debated for some time, it is not clear how to quantify such an echo chamber effect. In this position paper, we introduce Information Segregation (or Informational Segregation) measures, which follow the long line of work on residential segregation. We believe that information segregation nicely captures the notion of exposure to different information by different populations in a society, and would help in quantifying the extent to which social media sites offer selective (or diverse) information to their users.


INTRODUCTION
As an increasing number of users consume information online, often via social media sites like Facebook and Twitter, there have been concerns regarding content quality [6] and the possibility of biases in the information people are exposed to [3,4,5,7]. On such sites, people tend to be connected with other like-minded users out of homophily [1], and thus individual users can have selective exposure to information that closely matches their own views, and may not have enough exposure to differing views. There have been further concerns over the effect of such echo chambers [7] on the polarization of society [8,13].
Interestingly, in past works, two competing theories of opinion polarization have been proposed [12]. One school of thought assumes that opinions are reinforced when like-minded individuals interact with each other [8,13]. Other researchers have argued that exposure to differing views, and their subsequent rejection, leads to polarization [2]. Polarization can be thought of as a measure of the ideological state of the population in a society, which is difficult to quantify in general. Also, it is not clear what constitutes the ideal depolarized state of a society.
In this position paper, we argue that an alternative option is to consider the access to different types of information by members of a society. For example, within a population with multiple parties operating, it is only natural that political opinion would be fragmented. However, it is highly desirable that the entire population has access to the same information and knowledge, and that people make informed decisions to follow different paths. In other words, the bigger issue here is whether different groups of people have access to similar kinds of information, where groups may be formed based on predefined demographics (e.g., gender, race, age, income level) or derived features (e.g., political leaning) of people. To investigate this issue, we borrow ideas from the past literature on residential segregation. A large body of research has considered the bipartite matching between different groups of people and the urban units where they reside (as shown in Figure 1), and proposed different measures to quantify the geographical segregation of different groups [9,10]. Massey and Denton [14] identified five distinct dimensions of residential segregation: (i) Evenness is the degree to which groups are distributed proportionately across areal units in an urban area. (ii) Exposure is the extent to which members of different groups share common residential areas. (iii) Concentration refers to the degree of a group's agglomeration in urban space. (iv) Centralization is the extent to which group members reside towards the center of an urban area. (v) Clustering measures the degree to which different groups are located adjacent to one another.
Then, they grouped different segregation measures along these five dimensions. Note that some segregation measures are relative between two groups, whereas others are absolute measures of the segregation of one particular group.
Following this line of work, in this paper, we present the notion of Information (or Informational) Segregation. Similar to Figure 1, we consider another bipartite matching between different groups of people and the information units they have access to (shown in Figure 2). Utilizing this mapping, we can then compute information segregation to measure whether different groups in a society have access to similar kinds of information.
However, there are two primary aspects in which the mapping between people and information units differs from the mapping between people and residential units: (i) residential segregation is computed over a two-dimensional geographical space, whereas information segregation needs to be computed over an n-dimensional topic space (n = 1 in Figure 2, but in general, n ≥ 1), and (ii) one person may have access to multiple information units, which needs to be accounted for while computing information segregation, whereas one person is considered to reside permanently in only one residential unit. To account for people accessing different information units, we use the notion of fractional personhood [15]. For an information unit i, we assign a personhood of 1 to everyone who has access to only i, a personhood of 1/2 to those who have access to i and one other information unit, and so on.
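The fractional personhood rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `personhood_sums`, the toy users, and the unit labels are hypothetical, and each person's weight is assumed to be split equally (1/k) across the k units they access.

```python
from collections import defaultdict


def personhood_sums(access):
    """Sum fractional personhoods per information unit.

    `access` maps a person id to the set of information units that person
    has access to. A person with access to k units contributes 1/k to each
    of them (assumed equal split), so every person counts exactly once.
    """
    totals = defaultdict(float)
    for units in access.values():
        if not units:
            continue  # a person with no access contributes to no unit
        share = 1.0 / len(units)
        for unit in units:
            totals[unit] += share
    return dict(totals)


# Hypothetical toy population: u1 follows only L, u2 follows L and C,
# and u3 follows C, M, and L.
access = {"u1": {"L"}, "u2": {"L", "C"}, "u3": {"C", "M", "L"}}
sums = personhood_sums(access)
```

Because each person's shares add up to 1, the per-unit sums add up to the number of people, which is what lets these sums stand in for residential head counts.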
In this paper, we propose five measures of information segregation analogous to the residential segregation measures discussed earlier, by considering the fractional personhoods of people from different groups. Then, as a proof of concept, we measure the information segregation of US-based Facebook users, as evident from how they follow different news media pages on Facebook. Our investigation reveals that Hispanic users access information more evenly across the political spectrum, whereas Asian Americans have the highest information segregation among all racial groups. Similarly, we also looked at how users with different political leanings access contrary views. We found that moderately conservative leaning users tend to get information more evenly across the spectrum, whereas extremely conservative leaning users are the most segregated.
The information segregation measures proposed in this paper can also be used to evaluate the role of search / recommender systems in exposing different types of information to a large population. We believe that, in the future, greater emphasis should be put on designing more responsible search / recommender systems which keep information segregation within acceptable limits.

INFORMATION SEGREGATION MEASURES
In this section, we introduce different measures of information segregation, considering the five distinct dimensions identified by Massey and Denton [14] for residential segregation.

I. Evenness
The evenness measure of information segregation captures how uniformly members of a particular group have access to different units in the n-dimensional information space. Figure 3 shows an example scenario where members of the Yellow group have access to all four information units, whereas members of the Purple group have access to only two units. Therefore, the Yellow group in Figure 3 has more even information access than the Purple group. Massey and Denton [14] discussed five different measures of residential evenness (including both relative and absolute measures). For brevity, we define only one measure of the absolute evenness of a group, which is the complement of the Gini coefficient [9].
The Gini coefficient G_A measures the unevenness of a particular group A by capturing the mean absolute difference between the personhoods of A having access to different information units. Information Evenness is then its complement, IE_A = 1 − G_A, where G_A is computed from a_i, the sum of personhoods belonging to group A who get information i; a_total, the size of group A in the overall population; m, the number of information units; and a'_total, the number of people in the overall population who do not belong to group A. IE_A varies between 0 and 1; the higher the value, the more even the group's information access.
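As a rough illustration of evenness as the complement of a Gini coefficient, the sketch below uses the textbook mean-absolute-difference form of Gini over the per-unit personhood sums a_i. This normalization is an assumption: it does not involve a'_total and so may differ from the exact expression intended here. The function names `gini` and `information_evenness` are hypothetical.

```python
def gini(values):
    """Textbook Gini coefficient: sum of |x_i - x_j| over all ordered pairs,
    divided by 2 * n * sum(x). Returns 0.0 for empty or all-zero input."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    abs_diff = sum(abs(x - y) for x in values for y in values)
    return abs_diff / (2.0 * n * total)


def information_evenness(a):
    """Complement of Gini over the per-unit personhood sums of one group."""
    return 1.0 - gini(a)


# Toy scenario in the spirit of Figure 3: four information units,
# Yellow spread over all four, Purple over only two (values are made up).
yellow = [1.0, 1.0, 1.0, 1.0]
purple = [2.0, 2.0, 0.0, 0.0]
ie_yellow = information_evenness(yellow)
ie_purple = information_evenness(purple)
```

With these made-up inputs the perfectly uniform group scores 1 and the group restricted to half of the units scores lower, matching the qualitative reading of Figure 3.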

II. Joint Exposure
Joint exposure quantifies the extent to which members of two groups get jointly exposed to the same information. In Figure 4, members of the Purple and Yellow groups are jointly exposed to three out of four information units, whereas members of the Purple and Pink groups are jointly exposed to only one unit. Therefore, in Figure 4, the Purple and Yellow groups have higher joint exposure than the Purple and Pink groups.
Again using the notion of personhoods, the joint information exposure between groups A and B, JIE_AB, is computed from a_i, a_total, and m as defined earlier, together with b_i, the sum of personhoods belonging to B who get information i, and total_i, the sum of all personhoods having access to information i. JIE_AB varies between 0 and 1; the higher the value, the more common exposure A and B have.
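One plausible reading of joint exposure, assuming it mirrors the interaction index from the residential segregation literature, is the sum over units of (a_i / a_total) · (b_i / total_i). The sketch below implements that assumed form; the function name `joint_information_exposure` and the toy numbers are hypothetical, and a_total is taken to be the sum of the a_i (which holds when every member of A accesses at least one unit).

```python
def joint_information_exposure(a, b, total):
    """Assumed interaction-index form of joint exposure between A and B.

    a[i], b[i]: personhood sums of groups A and B for unit i.
    total[i]:   personhood sum of everyone with access to unit i.
    Units with total[i] == 0 are skipped to avoid division by zero.
    """
    a_total = sum(a)
    if a_total == 0:
        return 0.0
    return sum(
        (a_i / a_total) * (b_i / t_i)
        for a_i, b_i, t_i in zip(a, b, total)
        if t_i > 0
    )


# Made-up example with four units and only two groups in the population.
a = [2.0, 2.0, 0.0, 0.0]
b = [1.0, 1.0, 1.0, 1.0]
total = [x + y for x, y in zip(a, b)]
jie_ab = joint_information_exposure(a, b, total)
```

Under this form the measure is asymmetric in general (JIE_AB need not equal JIE_BA), and it is 0 when the two groups share no information unit.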

III. Concentration
Concentration of a group A refers to the relative amount of topical space that A has access to. Information units may not all have similar topical density (or number of information sources, etc.), with some units having more topics mapping into them than others. For example, in Figure 5, the red and blue units consist of a higher number of topics than the blueish grey and reddish grey units. Therefore, even though the Yellow and Purple groups have access to the same number of units (and hence have the same evenness), the Yellow group would be considered more concentrated (i.e., more segregated) as it has access to fewer topics. Information concentration is captured by the metric Delta [11], computed from a_i, a_total, and m as already defined, together with n_i, the number of topics in information unit i, and n_total, the number of topics overall.
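A common form of the Delta index in the residential setting is half the sum of |x_i / X − area_i / Area|. Assuming the same shape carries over with topic counts in place of land area, concentration could be sketched as below. The function name `delta_concentration`, the topic counts, and the assignment of groups to units are illustrative assumptions only.

```python
def delta_concentration(a, n_topics):
    """Assumed Delta form: 0.5 * sum_i |a_i / a_total - n_i / n_total|.

    a[i]:        personhood sum of group A for unit i.
    n_topics[i]: number of topics mapped to unit i.
    """
    a_total = sum(a)
    n_total = sum(n_topics)
    return 0.5 * sum(
        abs(a_i / a_total - n_i / n_total)
        for a_i, n_i in zip(a, n_topics)
    )


# Hypothetical layout loosely following Figure 5: the two outer units hold
# more topics than the two inner grey units.
n_topics = [4, 1, 1, 4]
yellow = [0.0, 1.0, 1.0, 0.0]  # assumed: access only to the topic-poor units
purple = [1.0, 0.0, 0.0, 1.0]  # assumed: access only to the topic-rich units
delta_yellow = delta_concentration(yellow, n_topics)
delta_purple = delta_concentration(purple, n_topics)
```

In this toy setup Yellow obtains the larger Delta, in line with the statement that a group confined to fewer topics is more concentrated.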

IV. Centralization
Compared to the geographical context, identifying the center of an information space is tricky, and may not always be possible. Centrality may be computed by considering centroids in a dimension-reduced topical space, or by measuring it over networks induced by information units and their topical or preference similarity. In scenarios where the notion of an information center is defined, centralization between two groups A and B refers to how the information units that A and B have access to are distributed around the center. For example, in Figure 6, if we assume the blueish grey unit to be the center, then although the Yellow and Purple groups have the same evenness and concentration measures, the Purple group is more centralized than the Yellow group. Formally, the Centralization Index [10] is computed over the information units sorted by their distance from the center, with a_i, b_i, and m as defined earlier. CI_AB varies between −1 and 1, with a positive value indicating that A is more centralized than B.
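For intuition, the sketch below assumes the Centralization Index takes the shape of the relative centralization index from the residential literature: with units sorted by distance from the center and A_i, B_i denoting cumulative proportions of each group through unit i, CI_AB = Σ A_{i−1} B_i − Σ A_i B_{i−1}. Both this form and the function name `centralization_index` are assumptions made for illustration.

```python
from itertools import accumulate


def centralization_index(a, b):
    """Assumed relative centralization form.

    a, b: per-unit personhood sums of groups A and B, with units already
    sorted by increasing distance from the information center.
    Returns a value in [-1, 1]; positive means A sits closer to the center.
    """
    a_total, b_total = sum(a), sum(b)
    # Cumulative proportions, with a leading 0 so index 0 means "before unit 1".
    cum_a = [0.0] + [x / a_total for x in accumulate(a)]
    cum_b = [0.0] + [y / b_total for y in accumulate(b)]
    m = len(a)
    first = sum(cum_a[i - 1] * cum_b[i] for i in range(1, m + 1))
    second = sum(cum_a[i] * cum_b[i - 1] for i in range(1, m + 1))
    return first - second


# Extreme toy case: A is entirely in the central unit, B entirely in the outer one.
ci_ab = centralization_index([1.0, 0.0], [0.0, 1.0])
```

Swapping the two groups flips the sign, and two groups with identical distributions around the center score 0.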

V. Clustering
The final dimension of information segregation is the degree to which members of a group A have access to information clusters, i.e., whether the different types of information received by A are close to each other in the information space. In Figure 7, both the Purple and Yellow groups have access to two information units, and have the same evenness and concentration scores. However, as the information units the Purple group has access to are close to each other, it is more segregated than the Yellow group according to the clustering measure. We can formally define information clustering as
$$IC_A = \frac{\left(\sum_{i=1}^{m} \frac{a_i}{a_{total}} \sum_{j=1}^{m} e^{-d_{ij}}\, a_j\right) - \left(\frac{a_{total}}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} e^{-d_{ij}}\right)}{\left(\sum_{i=1}^{m} \frac{a_i}{a_{total}} \sum_{j=1}^{m} e^{-d_{ij}}\, total_j\right) - \left(\frac{a_{total}}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} e^{-d_{ij}}\right)}$$
where a_i, a_total, total_j, and m are as defined earlier, and d_ij is the distance between information units i and j. IC_A varies from 0 to 1.
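The clustering measure can be sketched numerically as follows. The code assumes the ratio form of the absolute clustering index: a proximity-weighted sum over the group's own personhoods in the numerator, the same sum over total personhoods in the denominator, each reduced by (a_total / m²) Σ_i Σ_j e^{−d_ij}. The function name `information_clustering`, the one-dimensional layout, and the personhood values are hypothetical.

```python
import math


def information_clustering(a, total, dist):
    """Assumed absolute-clustering form of IC_A.

    a[i]:       personhood sum of group A for unit i.
    total[i]:   personhood sum of everyone with access to unit i.
    dist[i][j]: distance between information units i and j.
    """
    m = len(a)
    a_total = sum(a)
    # Proximity weights decay exponentially with distance.
    c = [[math.exp(-dist[i][j]) for j in range(m)] for i in range(m)]
    baseline = a_total / m**2 * sum(sum(row) for row in c)
    own = sum(
        a[i] / a_total * sum(c[i][j] * a[j] for j in range(m)) for i in range(m)
    )
    everyone = sum(
        a[i] / a_total * sum(c[i][j] * total[j] for j in range(m)) for i in range(m)
    )
    return (own - baseline) / (everyone - baseline)


# Toy layout in the spirit of Figure 7: four units on a line, one step apart.
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
purple = [0.0, 1.0, 1.0, 0.0]  # assumed: two adjacent units
yellow = [1.0, 0.0, 0.0, 1.0]  # assumed: two far-apart units
total = [p + y for p, y in zip(purple, yellow)]
ic_purple = information_clustering(purple, total, dist)
ic_yellow = information_clustering(yellow, total, dist)
```

In this toy layout the group with adjacent units (Purple) gets the higher score. The sketch does not guard against a zero denominator, which can occur in degenerate inputs.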
... to gather the cumulative number of followers for a particular unit. However, Facebook doesn't allow us to get the follower size for a combination of more than 400 Facebook pages. Therefore, we randomly select 400 pages from the set of 2.5K+ news media pages, map them to their corresponding units, and gather the demographics of the followers of pages belonging to every information unit.
As some users may follow Facebook pages belonging to multiple units (for example, follow both conservative and liberal leaning pages), we need to accurately account for these overlaps in information access. As mentioned earlier, we use the notion of fractional personhood in this regard. Therefore, instead of considering the number of followers of pages in a particular unit, we consider the sum of personhoods for pages in every information unit.
For every unit i, the sum of personhoods N*_i is computed as N*_i = [N(S) − N(S \ i)] + …, where S is the set of all information units {VC, C, M, L, VL} and N(x) gives the number of followers of pages in unit(s) x. The first term counts the users who follow pages in unit i only.

Information Segregation among Racial Groups
The Facebook ad interface returns four racial categories for users: Caucasian, African American, Asian American, and Hispanic.
For every information unit, we compute the personhoods belonging to each race, and then measure information segregation among them. Figure 8(a) shows the evenness of the different racial groups. We can see in Figure 8(a) that Hispanics have the most even access to different political information units, whereas Asian Americans have the most uneven access.

Information Segregation between Political Groups
Similar to the racial categories, we also computed the personhoods with respect to different political leanings for every information unit, and then measured the information segregation among these groups. Figure 8(b) shows that conservative leaning users tend to get information evenly from the information units, whereas very conservative leaning users have the most uneven access to different units. Then, to measure how much access to information units very conservative leaning users have in common with others, we plot their joint exposure with other groups in Figure 8(c). We observe that very conservative leaning users have the highest joint exposure with conservatives, denoting that the two groups are exposed to multiple information units together. In contrast, they have the least joint exposure with very liberal leaning users, implying that these two groups have access to very different information units.

CONCLUSION
In this position paper, we proposed five measures of information segregation motivated by the residential segregation measures proposed in the literature. Then, using these measures, we computed information segregation among US-based Facebook users. Our future work lies in evaluating how search / recommender systems expose information to different groups of users, and in proposing mechanisms to keep information segregation within acceptable limits.

Figure 1: Basis for computing residential segregation: bipartite matching between people and residential units in a city.

Figure 2: Basis for computing information segregation: bipartite matching between people and information units.

Figure 3: Yellow group gets information more evenly than Purple group.

Figure 4: Joint exposure between Purple and Yellow group is higher compared to Purple and Pink group.

Figure 5: Yellow group is more concentrated than Purple group.

Figure 6: Purple group is more centralized than Yellow group.

Figure 7: Purple group is more clustered than Yellow group.

Figure 8: Information segregation between different groups along two dimensions: evenness of (a) different racial groups and (b) different political groups, and (c) joint exposure of very conservative leaning people (VC) with other political groups.