2
$\begingroup$

I am looking for a dataset of log files with labeled cybersecurity issues. I am trying to build a cybersecurity log analysis model, so I have no preference about the type of log, but I do want the data to contain known cybersecurity issues.

So far, the only log datasets I have found (HDFS, BGL) contain anomalies that are execution-flow errors rather than cybersecurity issues. I have also found plenty of network data, such as at https://vizsec.org/data/, but it consists of network traffic rather than logs. Finally, I have found log datasets that do contain cybersecurity issues, but too few of them to train a model on.

It would also be helpful to know how such a dataset could be generated in large quantities.

$\endgroup$

3 Answers

0
$\begingroup$

With the small amount of data you have found, either augment it or apply cross-validation on top of it.

Otherwise, look for the data you need at https://datasetsearch.research.google.com/
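
As a rough illustration of the cross-validation suggestion, here is a minimal sketch of a stratified k-fold split in plain Python. It assumes each log window already has a binary label; the toy labels and the function name `stratified_k_fold` are made up for the example, and in practice a library such as scikit-learn would normally do this for you.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class ratios per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so that rare classes are spread as evenly as possible.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, test_idx


# Hypothetical toy labels: 1 = attack-related log window, 0 = benign.
labels = [0] * 20 + [1] * 5
splits = list(stratified_k_fold(labels, k=5))
for train_idx, test_idx in splits:
    attacks_in_test = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), attacks_in_test)
```

Stratifying matters here because attack samples are rare: a plain random split could easily leave some folds with no attack examples at all.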

$\endgroup$
0
$\begingroup$

See if this helps: Publicly Available Datasets

You can also use the SMOTE technique if you have insufficient data.
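
To give an idea of what SMOTE does, here is a simplified sketch of its core step in plain Python: a new minority sample is created by interpolating between an existing minority sample and one of its nearest minority neighbours. This is not the implementation from any particular library; the function name `smote_sketch` and the feature columns are hypothetical. It also assumes the logs have already been turned into numeric feature vectors, since the interpolation does not apply to raw log text.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


# Hypothetical features per log window:
# [failed_logins, distinct_source_ips, request_count]
attack_windows = [
    [12.0, 3.0, 40.0],
    [15.0, 4.0, 55.0],
    [9.0, 2.0, 38.0],
    [20.0, 6.0, 70.0],
]
new_points = smote_sketch(attack_windows, n_new=6)
for point in new_points:
    print([round(v, 2) for v in point])
```

Every synthetic point lies between two real attack samples in feature space, so the result is only as meaningful as the features themselves.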

$\endgroup$
1
  • $\begingroup$ Thank you for the answer. The problem I see with synthetic generation techniques in this case is that log data is not robust to noise: changing even a single character, or slightly changing the order of the log lines, could potentially be a security issue in itself. $\endgroup$ Commented Sep 15, 2020 at 20:53
0
$\begingroup$

Up-to-date, log-based public datasets that include labels for new attacks are hard to find. However, there are some older log-based datasets covering known attacks (e.g., SQL injection, XSS) in web logs or HTTP requests, from the context of Web-server Log Anomaly Detection (WLAD), if that fits your needs.

Please see Table II in this paper:

Majd, Mehryar, et al. "A Comprehensive Review of Anomaly Detection in Web Logs." 2022 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT). IEEE, 2022.

Context: Web-server Log Anomaly Detection (WLAD)

Here the authors collected recent works from the cybersecurity literature they reviewed, including the datasets of web logs or HTTPS requests those works used. As the table shows, one of the most recent papers, from Amazon, used HTTP CSIC 2010 and ISCX IDS 2012 in its approach; these are the old public datasets I mentioned.

I would also like to share a conversation I saw on RG a long time ago that you might look at:

There are also old posts at https://security.stackexchange.com/:

Some related GitHub repos:

A recent survey:

$\endgroup$
