Endorsement of Ranked Choice Voting in the United States¶

Introduction¶

Ranked choice voting lets voters rank candidates by preference instead of choosing just one. This topic matters because voting systems affect how people view elections and whether they feel their vote matters. Since many people are not familiar with ranked choice voting, it is important to find out who supports it and who does not.

This project uses survey data to answer two questions: who supports ranked choice voting, and what factors predict less support? I will look at things like political views, party, education, trust in elections, and age, since these likely shape opinions about changing the voting process. This is important because a university lab trying to build support for ranked choice voting needs to know which groups to focus on to create better education and outreach efforts.
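As a concrete illustration of what ranking candidates can mean in practice, here is a minimal sketch of one common way ranked ballots are counted, instant-runoff. The survey question does not specify a counting rule, so the rule itself, the function name `instant_runoff`, the alphabetical tie-break, and the toy ballots are all assumptions made for illustration.

```python
from collections import Counter

def instant_runoff(ballots):
    """Return the winner of a simple instant-runoff count.

    Each ballot is a list of candidate names, most preferred first.
    """
    remaining = {c for ballot in ballots for c in ballot}
    while remaining:
        # Count each ballot for its highest-ranked candidate still in the race.
        tally = Counter({c: 0 for c in remaining})
        for ballot in ballots:
            for choice in ballot:
                if choice in remaining:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        leader, votes = max(tally.items(), key=lambda kv: kv[1])
        # A strict majority of counted ballots wins; so does the last candidate standing.
        if votes * 2 > total or len(remaining) == 1:
            return leader
        # Eliminate the lowest candidate (ties broken alphabetically, an arbitrary choice).
        loser = min(sorted(remaining), key=lambda c: tally[c])
        remaining.remove(loser)
    return None

# Toy ballots (made up): A leads on first choices but lacks a majority.
ballots = [["A", "B", "C"]] * 4 + [["B", "C", "A"]] * 3 + [["C", "B", "A"]] * 2
winner = instant_runoff(ballots)
print(winner)
```

In this toy election C is eliminated first, C's two ballots transfer to B, and B then holds five of the nine ballots.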

In [2]:
# import required packages
import numpy as np
import datascience as ds

# These lines do some fancy plotting magic
import matplotlib
# Required to view plots in a notebook
%matplotlib inline
import matplotlib.pyplot as plt
# This is just to make the plots look a certain way
plt.style.use('fivethirtyeight')

# import datascience techniques
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

Data Description¶

The table below gives a brief overview of the data, which come from the Pulse of the Nation survey. The main outcome variable is support for ranked choice voting: respondents were asked whether they would support a voting system in which people could choose their first-choice candidate, second-choice candidate, third-choice candidate, and so on. Before beginning the analysis, I removed unclear responses such as DK/REF and Other so that the results would be easier to interpret, which reduced the sample from 800 to 486 respondents.

The main variables I looked at were political leaning, political affiliation, education, confidence in fair elections, and age. These variables may shape opinions about ranked choice voting because it is tied to elections, politics, and how people view the voting process.

In [3]:
cah = ds.Table.read_table("CAH_PulseoftheNation_FinalProject.csv")
cah.show(5)
Gender | Age | Race | Education | Political Affiliation | Political Leaning | Trump | Finances | Fair Elections | Ranked Choice | Woman President | Universal Healthcare
Female | 81 | White | Some college | Democrat | Liberal | Strongly Disapprove | Not Very Often | No | No | Yes | Yes
Male | 80 | Asian | Some college | Democrat | Moderate | Somewhat Disapprove | Somewhat Often | Yes, somewhat confident | Yes | Yes | No
Female | 65 | Black | High school or less | Democrat | Moderate | Strongly Disapprove | Somewhat Often | Yes, somewhat confident | Yes | Yes | No
Male | 24 | Asian | College degree | Independent | Moderate | Strongly Disapprove | Not Very Often | Yes, somewhat confident | Yes | Yes | No
Male | 74 | White | Graduate degree | Democrat | Liberal | Strongly Disapprove | Not Very Often | Yes, very confident | Yes | Yes | Yes

... (795 rows omitted)

In [4]:
# Remove unclear responses ("DK/REF" or "Other") so every remaining row has a definitive answer.
cah.num_rows
cah.labels
cah_clean = cah.where("Ranked Choice", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Gender", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Gender", ds.are.not_equal_to("Other"))
cah_clean = cah_clean.where("Race", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Race", ds.are.not_equal_to("Other"))
cah_clean = cah_clean.where("Education", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Education", ds.are.not_equal_to("Other"))
cah_clean = cah_clean.where("Political Affiliation", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Political Leaning", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Trump", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Finances", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Fair Elections", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Woman President", ds.are.not_equal_to("DK/REF"))
cah_clean = cah_clean.where("Universal Healthcare", ds.are.not_equal_to("DK/REF"))

cah_clean.num_rows
Out[4]:
486
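The cleaning step above removes 314 of the 800 rows, so it can help to see which columns account for the loss. The sketch below is one hedged way to do that with plain NumPy on a dictionary of column arrays. The helper names `clear_mask` and `unclear_counts` and the toy data are hypothetical, and unlike the cell above it treats both "DK/REF" and "Other" as unclear in every column it is given.

```python
import numpy as np

UNCLEAR = ("DK/REF", "Other")  # labels treated as unclear (assumption: applied to every column)

def clear_mask(columns, unclear=UNCLEAR):
    """Boolean mask that is True for rows with no unclear answer in any column."""
    n_rows = len(next(iter(columns.values())))
    mask = np.ones(n_rows, dtype=bool)
    for values in columns.values():
        mask &= ~np.isin(np.asarray(values), unclear)
    return mask

def unclear_counts(columns, unclear=UNCLEAR):
    """Number of unclear answers in each column, to show where rows are lost."""
    return {name: int(np.isin(np.asarray(values), unclear).sum())
            for name, values in columns.items()}

# Toy data (made up), laid out like two of the survey columns.
toy = {
    "Ranked Choice": np.array(["Yes", "No", "DK/REF", "Yes", "No"]),
    "Education": np.array(["Some college", "Other", "College degree",
                           "Graduate degree", "DK/REF"]),
}

mask = clear_mask(toy)
counts = unclear_counts(toy)
print(int(mask.sum()), "of", mask.size, "toy rows kept")
print(counts)
```

Running the same count on the real columns would show whether most of the lost rows come from predictors used later or from columns such as Trump or Finances that the models never use.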
In [5]:
# Create 0/1 support variables, which are easier to interpret and average
support = np.where(cah_clean.column("Ranked Choice") == "Yes", 1, 0)
no_support = np.where(cah_clean.column("Ranked Choice") == "No", 1, 0)

cah_clean = cah_clean.with_columns(
    "Support", support,
    "No Support", no_support
)

cah_clean.show(5)
Gender | Age | Race | Education | Political Affiliation | Political Leaning | Trump | Finances | Fair Elections | Ranked Choice | Woman President | Universal Healthcare | Support | No Support
Female | 81 | White | Some college | Democrat | Liberal | Strongly Disapprove | Not Very Often | No | No | Yes | Yes | 0 | 1
Male | 80 | Asian | Some college | Democrat | Moderate | Somewhat Disapprove | Somewhat Often | Yes, somewhat confident | Yes | Yes | No | 1 | 0
Female | 65 | Black | High school or less | Democrat | Moderate | Strongly Disapprove | Somewhat Often | Yes, somewhat confident | Yes | Yes | No | 1 | 0
Male | 24 | Asian | College degree | Independent | Moderate | Strongly Disapprove | Not Very Often | Yes, somewhat confident | Yes | Yes | No | 1 | 0
Male | 74 | White | Graduate degree | Democrat | Liberal | Strongly Disapprove | Not Very Often | Yes, very confident | Yes | Yes | Yes | 1 | 0

... (481 rows omitted)

In [7]:
# Summary table for main data description including the number of people and percentage of support or no support

original_rows = cah.num_rows
cleaned_rows = cah_clean.num_rows

ranked_choice_yes = cah_clean.where("Ranked Choice", "Yes").num_rows
ranked_choice_no = cah_clean.where("Ranked Choice", "No").num_rows

support_rate = np.mean(cah_clean.column("Support"))
no_support_rate = np.mean(cah_clean.column("No Support"))

summary_table = ds.Table().with_columns(
    "Description", ds.make_array(
        "Original dataset",
        "Cleaned dataset",
        "Supports ranked choice voting",
        "Does not support ranked choice voting",
        "Support rate",
        "No support rate"),
    "Value", ds.make_array(
        original_rows,
        cleaned_rows,
        ranked_choice_yes,
        ranked_choice_no,
        support_rate,
        no_support_rate))

summary_table
Out[7]:
Description | Value
Original dataset | 800
Cleaned dataset | 486
Supports ranked choice voting | 250
Does not support ranked choice voting | 236
Support rate | 0.514403
No support rate | 0.485597
In [9]:
#This table shows a simplified group count
political_leaning_counts = cah_clean.group("Political Leaning")
political_affiliation_counts = cah_clean.group("Political Affiliation")
education_counts = cah_clean.group("Education")
fair_elections_counts = cah_clean.group("Fair Elections")

# To make arrays
group_variable = ds.make_array()
group_category = ds.make_array()
group_count = ds.make_array()

# Political Leaning
for row in political_leaning_counts.rows:
    group_variable = np.append(group_variable, "Political Leaning")
    group_category = np.append(group_category, row.item("Political Leaning"))
    group_count = np.append(group_count, row.item("count"))

# Political Affiliation
for row in political_affiliation_counts.rows:
    group_variable = np.append(group_variable, "Political Affiliation")
    group_category = np.append(group_category, row.item("Political Affiliation"))
    group_count = np.append(group_count, row.item("count"))

# Education
for row in education_counts.rows:
    group_variable = np.append(group_variable, "Education")
    group_category = np.append(group_category, row.item("Education"))
    group_count = np.append(group_count, row.item("count"))

# Fair Elections
for row in fair_elections_counts.rows:
    group_variable = np.append(group_variable, "Fair Elections")
    group_category = np.append(group_category, row.item("Fair Elections"))
    group_count = np.append(group_count, row.item("count"))

# For the combined table
combined_group_counts = ds.Table().with_columns(
    "Variable", group_variable,
    "Group", group_category,
    "Count", group_count
)

combined_group_counts
Out[9]:
Variable | Group | Count
Political Leaning | Conservative | 144
Political Leaning | Liberal | 111
Political Leaning | Moderate | 231
Political Affiliation | Democrat | 192
Political Affiliation | Independent | 156
Political Affiliation | Republican | 138
Education | College degree | 153
Education | Graduate degree | 105
Education | High school or less | 121
Education | Some college | 107

... (3 rows omitted)

In [11]:
# Summary statistics for age
age = cah_clean.column("Age")
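# print() is needed here because only the last expression in a cell is displayed automatically
print(np.mean(age), np.median(age), np.min(age), np.max(age))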
# Average age by ranked choice voting support
cah_clean.select("Ranked Choice", "Age").group("Ranked Choice", np.mean)
Out[11]:
Ranked Choice | Age mean
No | 61.7881
Yes | 53.772

Data Summary¶

The cleaned dataset contains the 486 respondents used in the rest of this project. The tables above show how many fall into each main response group: 250 respondents (51.4%) support ranked choice voting and 236 (48.6%) do not. The remaining groups classify respondents by political leaning, political affiliation, education level, and confidence in fair elections.

Building on this, the age summary covers the average, median, minimum, and maximum ages of the respondents. I also compare the average age by support for ranked choice voting: supporters average about 53.8 years and non-supporters about 61.8, so age may be related to whether someone supports changing the voting system.

In [12]:
overall = cah_clean.group("Ranked Choice")

plt.bar(overall.column("Ranked Choice"), overall.column("count"))
plt.title("Overall Support for Ranked Choice Voting")
plt.xlabel("Response")
plt.ylabel("Number of Respondents")
plt.show()

The graph above shows how many respondents said Yes or No to supporting ranked choice voting. The sample is split almost evenly (250 Yes, 236 No), which gives a basic starting point for understanding current support.
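To give a rough sense of how much the overall support rate could vary from sample to sample, the sketch below computes a normal-approximation 95% interval from the Yes and No counts in the summary table. It assumes the cleaned data behave like a simple random sample with no survey weights, which the notebook does not confirm, so it should be read as a back-of-the-envelope check only.

```python
import math

yes, no = 250, 236            # counts from the summary table above
n = yes + no
p_hat = yes / n               # sample support rate

# Standard error of a proportion and a 95% normal-approximation margin.
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se
low, high = p_hat - margin, p_hat + margin

print(f"support = {p_hat:.3f}, approximate 95% interval = ({low:.3f}, {high:.3f})")
```

Under these assumptions the interval includes 0.5, so the sample alone does not clearly show whether a majority supports ranked choice voting.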

In [13]:
support_by_leaning = cah_clean.select("Political Leaning", "Support").group("Political Leaning", np.mean)
support_by_leaning
plt.bar(support_by_leaning.column("Political Leaning"), support_by_leaning.column("Support mean"))
plt.title("Support for Ranked Choice Voting by Political Leaning")
plt.xlabel("Political Leaning")
plt.ylabel("Proportion Supporting")
plt.show()

This graph shows support for ranked choice voting by political leaning. This is important because ranked choice voting is related to elections, so political leaning may be connected to whether someone supports it.

In [14]:
support_by_party = cah_clean.select("Political Affiliation", "Support").group("Political Affiliation", np.mean)
support_by_party
plt.bar(support_by_party.column("Political Affiliation"), support_by_party.column("Support mean"))
plt.title("Support for Ranked Choice Voting by Political Affiliation")
plt.xlabel("Political Affiliation")
plt.ylabel("Proportion Supporting")
plt.show()

This graph compares support by political affiliation. This helps show whether Democrats, Republicans, and Independents differ in their support for ranked choice voting.

In [21]:
support_by_education = cah_clean.select("Education", "Support").group("Education", np.mean)
support_by_education
plt.bar(support_by_education.column("Education"), support_by_education.column("Support mean"))
plt.title("Support for Ranked Choice Voting by Education")
plt.xlabel("Education")
plt.ylabel("Proportion Supporting")
plt.xticks(rotation = 30)
plt.show()

This graph shows support for ranked choice voting by education level, which could reveal whether support differs among people with different levels of education.

In [19]:
support_by_fair = cah_clean.select("Fair Elections", "Support").group("Fair Elections", np.mean)
support_by_fair
plt.bar(support_by_fair.column("Fair Elections"), support_by_fair.column("Support mean"))
plt.title("Support by Confidence in Fair Elections")
plt.xlabel("Confidence in Fair Elections")
plt.ylabel("Proportion Supporting")
plt.xticks(rotation = 30)
plt.show()

Finally, this last graph compares support based on confidence in fair elections. This is useful because people who feel differently about election fairness may also feel differently about changing the voting system.

Inference: Hypothesis Test or Confidence Interval¶

For the inference section, I am using a bootstrap confidence interval to compare support for ranked choice voting between liberals and conservatives. I chose political leaning as my focus because ranked choice voting is directly tied to elections, so it is reasonable to expect that political views play a role in whether someone supports it. The statistic I am examining is the difference in support rates between the two groups. The bootstrap method works by repeatedly resampling from the original data and recalculating that difference each time, which produces a range of plausible values for the true difference in the broader population. If the confidence interval does not contain zero, that is evidence of a real difference in support between liberals and conservatives, not just random variation in the sample.

In [22]:
#This shows the observed difference
liberal_group = cah_clean.where("Political Leaning", "Liberal")
conservative_group = cah_clean.where("Political Leaning", "Conservative")

liberal_support = np.mean(liberal_group.column("Support"))
conservative_support = np.mean(conservative_group.column("Support"))

observed_difference = liberal_support - conservative_support

liberal_support, conservative_support, observed_difference
Out[22]:
(0.63963963963963966, 0.36805555555555558, 0.27158408408408408)
In [27]:
bootstrap_differences = ds.make_array()

for i in np.arange(1000):
    liberal_sample = liberal_group.sample(liberal_group.num_rows, with_replacement = True)
    conservative_sample = conservative_group.sample(conservative_group.num_rows, with_replacement = True)
    
    liberal_sample_support = np.mean(liberal_sample.column("Support"))
    conservative_sample_support = np.mean(conservative_sample.column("Support"))
    
    difference = liberal_sample_support - conservative_sample_support
    bootstrap_differences = np.append(bootstrap_differences, difference)

left_bound = np.percentile(bootstrap_differences, 2.5)
right_bound = np.percentile(bootstrap_differences, 97.5)

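# Print the interval explicitly; a bare expression is only displayed when it is the last line of a cell
print(left_bound, right_bound)
# Plot the bootstrap distribution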
plt.hist(bootstrap_differences)
plt.title("Bootstrap Differences in Support")
plt.xlabel("Liberal Support - Conservative Support")
plt.ylabel("Frequency")
plt.show()

The bootstrap confidence interval gives a range of plausible values for the true difference in support between liberals and conservatives. In the sample, about 64.0% of liberals and 36.8% of conservatives support ranked choice voting, a difference of roughly 27 percentage points. Since the interval does not include zero, the difference is unlikely to be due to chance alone. Based on the data, liberals in this sample support ranked choice voting at a higher rate than conservatives.
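As a rough cross-check on the bootstrap, the same comparison can be approximated with the standard normal-approximation interval for a difference in proportions. This is a swapped-in technique, not the method used above. The counts of 71 liberal and 53 conservative supporters are inferred from the reported rates and group sizes (0.6396 of 111 and 0.3681 of 144), and the function name `diff_in_proportions_ci` is hypothetical.

```python
import math

def diff_in_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation interval for p1 - p2 from two independent groups."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Liberal supporters / liberals, conservative supporters / conservatives (inferred counts).
diff, low, high = diff_in_proportions_ci(71, 111, 53, 144)
print(f"difference = {diff:.3f}, approximate 95% interval = ({low:.3f}, {high:.3f})")
```

If the bootstrap bounds printed by the notebook land close to this interval, that is some reassurance that the resampling code behaves as intended.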

Prediction¶

My goal is to identify respondents who may not support ranked choice voting. I set No Support as the positive case, meaning No Support = 1 represents someone who does not support it. This framing makes sense because the campaign's priority is finding people who might need more education or outreach about ranked choice voting. I used five predictors: age, political affiliation, political leaning, education, and confidence in fair elections. I then compared two models, KNN and a decision tree, and evaluated them using recall as the main metric. Recall is the right choice here because what matters most is how well the model actually catches people in the No Support group, not just overall accuracy.

In [28]:
#This shows a prediction table
prediction_table = cah_clean.select(
    "Age",
    "Political Affiliation",
    "Political Leaning",
    "Education",
    "Fair Elections",
    "No Support")

prediction_table.show(5)
Age | Political Affiliation | Political Leaning | Education | Fair Elections | No Support
81 | Democrat | Liberal | Some college | No | 1
80 | Democrat | Moderate | Some college | Yes, somewhat confident | 0
65 | Democrat | Moderate | High school or less | Yes, somewhat confident | 0
24 | Independent | Moderate | College degree | Yes, somewhat confident | 0
74 | Democrat | Liberal | Graduate degree | Yes, very confident | 0

... (481 rows omitted)

In [30]:
#Converting categories into numbers
democrat = np.where(prediction_table.column("Political Affiliation") == "Democrat", 1, 0)
republican = np.where(prediction_table.column("Political Affiliation") == "Republican", 1, 0)

liberal = np.where(prediction_table.column("Political Leaning") == "Liberal", 1, 0)
conservative = np.where(prediction_table.column("Political Leaning") == "Conservative", 1, 0)

college_degree = np.where(prediction_table.column("Education") == "College degree", 1, 0)
graduate_degree = np.where(prediction_table.column("Education") == "Graduate degree", 1, 0)

fair_elections_yes = np.where(prediction_table.column("Fair Elections") != "No", 1, 0)
ml_table = ds.Table().with_columns(
    "Age", prediction_table.column("Age"),
    "Democrat", democrat,
    "Republican", republican,
    "Liberal", liberal,
    "Conservative", conservative,
    "College Degree", college_degree,
    "Graduate Degree", graduate_degree,
    "Fair Elections Yes", fair_elections_yes,
    "No Support", prediction_table.column("No Support"))

ml_table.show(5)
Age | Democrat | Republican | Liberal | Conservative | College Degree | Graduate Degree | Fair Elections Yes | No Support
81 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1
80 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0
65 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0
24 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
74 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0

... (481 rows omitted)

In [31]:
#To split the data into train and test.
rows_to_take = int(ml_table.num_rows * 0.8)

shuffled = ml_table.sample(with_replacement = False)

train = shuffled.take(np.arange(rows_to_take))
test = shuffled.take(np.arange(rows_to_take, ml_table.num_rows))

# Print the split sizes; a bare expression in the middle of a cell is not displayed
print(train.num_rows, test.num_rows)
predictors = train.drop("No Support").rows
outcome = train.column("No Support")

test_predictors = test.drop("No Support").rows
expected = test.column("No Support")
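Because the shuffle above is unseeded, the train and test rows, and therefore the reported metrics, will change each time the notebook is rerun. One hedged way to make the split repeatable is to shuffle row indices with a seeded NumPy generator. The helper `split_indices` and the seed value are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) from a reproducible shuffle of row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    cut = int(n_rows * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 486 is the number of rows in the cleaned table.
train_idx, test_idx = split_indices(486)
print(len(train_idx), len(test_idx))
```

With a `datascience` table, these index arrays could then be passed to `take`, as in `ml_table.take(train_idx)`.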
In [32]:
# For the KNN model
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X = predictors, y = outcome)

knn_predicted = knn.predict(test_predictors)
knn_accuracy = accuracy_score(expected, knn_predicted)
knn_precision = precision_score(expected, knn_predicted)
knn_recall = recall_score(expected, knn_predicted)

knn_accuracy, knn_precision, knn_recall
Out[32]:
(0.59183673469387754, 0.58823529411764708, 0.61224489795918369)
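One caveat worth checking with KNN is feature scale. The model compares respondents by distance, and Age ranges over decades while the other predictors are 0/1 indicators, so age differences can dominate the neighbor search. The sketch below uses made-up rows and a hypothetical `standardize` helper to show the effect. In a real pipeline, scikit-learn's `StandardScaler`, fit on the training rows only, would be the usual tool.

```python
import numpy as np

# Toy rows (made up): Age followed by four 0/1 indicators, similar in layout to ml_table.
X = np.array([
    [81.0, 1, 0, 1, 0],
    [24.0, 1, 0, 1, 0],   # same politics as row 0, very different age
    [80.0, 0, 1, 0, 1],   # opposite politics, nearly the same age
])

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # avoid dividing by zero for constant columns
    return (X - mean) / std

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

Z = standardize(X)

raw_same_politics, raw_same_age = dist(X[0], X[1]), dist(X[0], X[2])
scaled_same_politics, scaled_same_age = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print("raw:   ", raw_same_politics, raw_same_age)
print("scaled:", scaled_same_politics, scaled_same_age)
```

On the raw values the respondent with opposite politics but a similar age is the nearer neighbor. After standardizing, the respondent with matching politics is nearer. Whether scaling improves recall on the survey data would need to be tested.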
In [33]:
#For the Decision Tree
tree = DecisionTreeClassifier(max_depth = 4)
tree.fit(X = predictors, y = outcome)

tree_predicted = tree.predict(test_predictors)
tree_accuracy = accuracy_score(expected, tree_predicted)
tree_precision = precision_score(expected, tree_predicted)
tree_recall = recall_score(expected, tree_predicted)

tree_accuracy, tree_precision, tree_recall
Out[33]:
(0.58163265306122447, 0.59090909090909094, 0.53061224489795922)
In [35]:
# Compare the two models side by side
model_results = ds.Table().with_columns(
    "Model", ds.make_array("KNN", "Decision Tree"),
    "Accuracy", ds.make_array(knn_accuracy, tree_accuracy),
    "Precision", ds.make_array(knn_precision, tree_precision),
    "Recall", ds.make_array(knn_recall, tree_recall))

model_results
Out[35]:
Model | Accuracy | Precision | Recall
KNN | 0.591837 | 0.588235 | 0.612245
Decision Tree | 0.581633 | 0.590909 | 0.530612
In [36]:
plt.bar(model_results.column("Model"), model_results.column("Recall"))
plt.title("Model Comparison by Recall")
plt.xlabel("Model")
plt.ylabel("Recall for No Support")
plt.show()
In [37]:
model_results.sort("Recall", descending = True)
Out[37]:
Model | Accuracy | Precision | Recall
KNN | 0.591837 | 0.588235 | 0.612245
Decision Tree | 0.581633 | 0.590909 | 0.530612
In [38]:
confusion_matrix(expected, knn_predicted)
Out[38]:
array([[28, 21],
       [19, 30]])

The model comparison table shows accuracy, precision, and recall for both the KNN and decision tree models. Since the goal is to identify people who do not support ranked choice voting, recall is the most important metric because it measures the share of actual No Support respondents the model catches. On that basis, I chose KNN as the final model. It had the higher recall at 0.612, along with an accuracy of 0.592 and a precision of 0.588. The confusion matrix shows what this means on the 98-respondent test set: of the 49 actual non-supporters, KNN correctly flagged 30 and missed 19, while 21 of the 49 supporters were incorrectly flagged. The decision tree trailed on recall at 0.531, with an accuracy of 0.582 and a precision of 0.591. Both models are only modestly better than the 50% accuracy expected from guessing on this evenly split test set, so the predictions should be treated as a rough guide.
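The KNN metrics reported above can be reproduced by hand from the confusion matrix, which makes the definitions of recall and precision concrete. The sketch assumes scikit-learn's default layout for binary labels 0 and 1, where rows are actual classes and columns are predicted classes, and treats No Support = 1 as the positive class.

```python
import numpy as np

# KNN confusion matrix from the cell above: rows = actual (0, 1), columns = predicted (0, 1).
cm = np.array([[28, 21],
               [19, 30]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)         # share of actual non-supporters that were flagged
precision = tp / (tp + fp)      # share of flagged respondents who are non-supporters
accuracy = (tp + tn) / cm.sum() # share of all predictions that were correct

print(f"recall = {recall:.3f}, precision = {precision:.3f}, accuracy = {accuracy:.3f}")
```

Raising recall usually means flagging more people, which tends to lower precision, so the right balance depends on how costly it is for the campaign to contact someone who already supports ranked choice voting.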

Conclusion¶

Overall, the results suggest that support for ranked choice voting varies meaningfully across groups. Political leaning stood out as one of the factors most strongly associated with support: liberals in this sample supported ranked choice voting at a noticeably higher rate than conservatives (about 64% versus 37%). The descriptive statistics helped reveal these broad patterns, while the bootstrap confidence interval gave a more direct comparison between the two groups.

For prediction, I treated non-supporters as the positive case since the whole point is to find people who may need more information before getting on board with ranked choice voting. The KNN model performed best on recall, making it the more useful tool for that goal. In practice, a campaign could use these findings to focus its outreach on groups with lower predicted support, whether that means sample ballots, short explainer videos, or clear messaging that addresses common concerns about fairness and how the system actually works.