2

I have a data frame with two columns, 'host' and 'date', which describes a series of cyber attacks against a number of different servers on specific dates over a seven-month period.

Here's what the data looks like,

> china_atks %>% head(100)
                 host       date
1      groucho-oregon 2013-03-03
2      groucho-oregon 2013-03-03
...
46  groucho-singapore 2013-03-03
48  groucho-singapore 2013-03-04
...

Here 'groucho-oregon', 'groucho-singapore', etc., are the hostnames of the servers targeted by the attacks.

There are around 190,000 records, spanning 03/03/2013 to 08/09/2013, e.g.

> unique(china_atks$date)
 [1] "2013-03-03" "2013-03-04" "2013-03-05" "2013-03-06" "2013-03-07" "2013-03-08" "2013-03-09"
 [8] "2013-03-10" "2013-03-11" "2013-03-12" "2013-03-13" "2013-03-14" "2013-03-15" "2013-03-16"
[15] "2013-03-17" "2013-03-18" "2013-03-19" "2013-03-20" "2013-03-21" "2013-03-22" "2013-03-23"
...

I'd like to create a multi-line time series chart that visualises how many attacks each individual server received each day over the range of dates, but I can't figure out how to pass the data to ggplot to achieve this. There are nine unique hostnames, and so the chart would show nine lines.

Thanks!

6
  • You don't have the number of attacks in the data; if you do, where is it? Commented Apr 1, 2018 at 19:43
  • Each row appears to be a separate observation, so the number of attacks could be calculated by summarising this. Commented Apr 1, 2018 at 19:47
  • No, the number of attacks will have to be counted in the actual data frame itself. Should this be my first step? Commented Apr 1, 2018 at 19:48
  • stackoverflow.com/q/22767893/7347699 Commented Apr 1, 2018 at 19:49
  • Please add more dates to the sample data above. Commented Apr 1, 2018 at 19:52

2 Answers

3

Here's one way to do this.

First, count the number of attacks per host and date.

library(plyr)
# Count the rows for each host/date combination; the result gains a freq column
df <- plyr::count(da, c("host", "date"))

Then plot the counts.

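library(ggplot2)
# Note: group = 1 joins all points into a single path; use group = host for one line per host
ggplot(data = df, aes(x = date, y = freq, group = 1)) +
  geom_line(aes(color = host))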

Data

da <- structure(list(
  host = structure(1:4, .Label = c("groucho-eu", "groucho-oregon",
    "groucho-singapore", "groucho-tokyo"), class = "factor"),
  date = structure(c(1L, 1L, 1L, 1L), .Label = "2013-03-03", class = "factor"),
  freq = c(1L, 4L, 2L, 1L)),
  .Names = c("host", "date", "freq"),
  row.names = c(NA, -4L), class = "data.frame")
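
For comparison, the two steps can be sketched with dplyr::count instead of plyr. This is only a sketch under assumptions: the `atks` data frame below is made-up toy data in the same one-row-per-attack shape as the question, and the date column is converted with as.Date so that the x axis is treated as a time scale rather than as discrete labels.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (not from the question): one row per attack
atks <- data.frame(
  host = c("groucho-oregon", "groucho-oregon", "groucho-singapore",
           "groucho-oregon", "groucho-singapore", "groucho-singapore"),
  date = c("2013-03-03", "2013-03-03", "2013-03-03",
           "2013-03-04", "2013-03-04", "2013-03-04"),
  stringsAsFactors = FALSE
)

# Count attacks per host per day, storing the count in a column named freq
daily <- atks %>%
  mutate(date = as.Date(date)) %>%
  count(host, date, name = "freq")

# group = host draws one line per server
p <- ggplot(daily, aes(x = date, y = freq, colour = host, group = host)) +
  geom_line()
print(p)
```

With the real data, `atks` would be replaced by the full data frame; days on which a host received no attacks have no row in `daily`, so the line is simply interpolated across them.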

5 Comments

This is perfect, thanks! I had to change 'group=1' to 'group=host' for a smooth line between the points, otherwise it comes out as a sort of bar chart.
Good one. You can use dplyr, which is the next iteration of plyr.
@MKR Thanks! Could have done dc <- da %>% group_by(host,date) %>% dplyr::summarise(freq = n()) but I think it would have been a bit longer.
Absolutely correct! But I meant you could have used dplyr::count.
@MKR makes sense :)
3

ggplot2 can compute summary statistics itself, so one option is to let ggplot handle the counting. This draws one line per group:

# df is assumed to be the raw data, one row per attack; stat = "count" tallies the rows per date
ggplot(df, aes(x = date, colour = host, group = host)) +
  geom_line(stat = "count")

Note: make sure host is a discrete variable (a factor or character column) so that each line gets its own colour.
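
A minimal, self-contained sketch of this approach, assuming made-up toy data (the `raw` data frame is hypothetical) with the date column already converted to Date. layer_data() is used only to inspect the counts that ggplot computes for the layer.

```r
library(ggplot2)

# Hypothetical toy data (not from the question): one row per attack
raw <- data.frame(
  host = factor(c("groucho-oregon", "groucho-oregon", "groucho-singapore",
                  "groucho-oregon", "groucho-singapore", "groucho-singapore")),
  date = as.Date(c("2013-03-03", "2013-03-03", "2013-03-03",
                   "2013-03-04", "2013-03-04", "2013-03-04"))
)

# No y aesthetic is supplied: stat = "count" tallies rows per date within each group
p2 <- ggplot(raw, aes(x = date, colour = host, group = host)) +
  geom_line(stat = "count")
print(p2)

# Inspect the per-group, per-date counts computed for the layer
counts <- layer_data(p2)
```

This avoids building an intermediate summary table, at the cost of the counts not being available for other uses unless they are extracted as above.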
