
I am trying to plot a histogram using ggplot2 with percentage on the y-axis and numerical values on the x-axis.

A sample of my data and my script are shown below; the full data set has about 100,000 rows (or more).

A      B
0.2    x
1      y
0.995  x
0.5    x
0.5    x
0.2    y

ggplot(data, aes(A, colour = B)) + geom_bar() + stat_bin(breaks = seq(0, 1, by = 0.05)) + scale_y_continuous(labels = percent)

I want to know the percentage of B values distributed in each bin of A value, instead of the number of B values per A value.

The code as it is now gives me a y-axis with ymax as 15000. The y-axis is supposed to be in percentages (0-100).

1 Answer

Is this what you want? I assume your data frame is called df:

# calculate the proportion of all rows that fall in each combination of A and B
df2 <- as.data.frame(with(df, prop.table(table(A, B))))
df2
#       A B      Freq
# 1   0.2 x 0.1666667
# 2   0.5 x 0.3333333
# 3 0.995 x 0.1666667
# 4     1 x 0.0000000
# 5   0.2 y 0.1666667
# 6   0.5 y 0.0000000
# 7 0.995 y 0.0000000
# 8     1 y 0.1666667

ggplot(data = df2, aes(x = A, y = Freq, fill = B)) +
  geom_bar(stat = "identity", position = position_dodge())
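
Note that prop.table(table(A, B)) divides every count by the grand total, so Freq is each A/B combination's share of all rows. If what you want is the share of each B value within each bin of A, on a 0-100 scale, something along these lines might be closer. This is only a sketch under assumptions: the bins reuse the seq(0, 1, by = 0.05) breaks from the question, and the names df3, bin and Percent are made up for illustration.

```r
library(ggplot2)

# Sample data from the question (assumed to live in a data frame called df)
df <- data.frame(
  A = c(0.2, 1, 0.995, 0.5, 0.5, 0.2),
  B = c("x", "y", "x", "x", "x", "y")
)

# Bin A into intervals of width 0.05 on [0, 1]
df$bin <- cut(df$A, breaks = seq(0, 1, by = 0.05), include.lowest = TRUE)

# Counts per bin and B, then proportions within each bin
# (margin = 1 makes every row, i.e. every bin, sum to 1)
counts <- table(df$bin, df$B, dnn = c("bin", "B"))
df3 <- as.data.frame(prop.table(counts, margin = 1))

# Empty bins produce NaN (0 / 0); drop them and convert to percent
df3 <- df3[!is.nan(df3$Freq), ]
df3$Percent <- 100 * df3$Freq

ggplot(df3, aes(x = bin, y = Percent, fill = B)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

With margin = 1 the bars in each bin add up to 100; use margin = 2 instead if you want the percentage of each B value that falls into each bin.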

3 Comments

Yes! However, when I try to add a frequency column using the first line, my data gets shortened and some values of B are missing.
@Mengll, sorry, but I don't quite understand what you mean. The table of frequencies, which is converted to a data frame, is an aggregated version of your original data frame, so yes, your data will be "shortened". Say you have 500 rows with A = 0.5 and B = y. These will boil down to a single row giving the proportion of y in 'bin' 0.5.
I did not understand that, but it makes sense now. My resulting plot looks strange, but that's probably because of my own dataset. Thank you!
