Log analysis dataset with labeled cybersecurity issues

Question

I am looking for a dataset of log files with labeled cybersecurity issues. Since I am trying to build a cybersecurity log analysis model, I have no preference for the type of log, but the data should contain known cybersecurity issues.

So far, all I have been able to find are log datasets (HDFS, BGL) whose anomalies are execution flow errors rather than cybersecurity issues. I have also found a large amount of network data, such as at https://vizsec.org/data/, but it contains network traffic instead of logs. Finally, I have found log datasets that do contain cybersecurity issues, but too few of them to train a model on.

It would also be helpful to know how such a dataset could be generated in large quantities.

Durga K · Accepted Answer · 2020-09-15 16:50:35Z

Given how little data you have found, either augment it or apply cross-validation on top of it.

Otherwise, look for the data you need at https://datasetsearch.research.google.com/
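The cross-validation suggestion can be sketched as follows. This is a minimal illustration, assuming a small labeled set where security issues are the rare class; the `stratified_k_fold` helper and the toy labels are hypothetical and use only the standard library. Stratifying keeps at least some positive samples in every fold, which matters when there are only a handful of them.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold keeps the class ratio roughly constant."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold receives some of them
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx

# Hypothetical toy labels: 1 = log sequence with a security issue, 0 = benign
labels = [0] * 20 + [1] * 5
splits = list(stratified_k_fold(labels, k=5))
for train_idx, test_idx in splits:
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

With 20 benign and 5 malicious samples, each of the 5 test folds ends up with 4 benign samples and 1 malicious one.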

Madhur Yadav · Accepted Answer · 2020-09-15 16:56:50Z


See if this helps: Publicly Available Datasets

You can also use the SMOTE technique if you have insufficient data.
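For readers unfamiliar with SMOTE, here is a minimal sketch of its core idea: a synthetic minority sample is created by interpolating between an existing minority sample and one of its nearest minority neighbours. The `smote_like` function and the feature values are hypothetical and only illustrate the mechanism; in practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would be used. Note that the technique operates on numeric feature vectors, so raw log lines would first have to be turned into features.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical numeric features per log window, e.g. [failed_logins, distinct_ips]
minority = [[8.0, 1.0], [9.0, 2.0], [12.0, 1.0], [10.0, 3.0]]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real minority samples, so it never leaves the region already covered by the minority class.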


Thank you for the answer. The problem I see with synthetic generation techniques in this case is that log data is not robust to noise: changing even a single character, or slightly changing the order of the logs, could potentially be a security issue.

– jsbc, commented Sep 15, 2020 at 20:53

Mario · Accepted Answer · 2024-11-25 20:00:49Z

Up-to-date public log datasets with labels for new attacks are hard to find, but there are some older log-based datasets for known attacks (e.g., SQL injection, XSS) within weblogs or HTTP requests, in the context of Web-server Log Anomaly Detection (WLAD), if that fits your needs.
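To make concrete what labeled weblog entries for such attacks can look like, here is a minimal sketch that assigns rough labels to access-log lines with a few signatures. Everything in it is an assumption for illustration: the Common Log Format style sample lines, the `label_request` helper, and the tiny pattern lists are hypothetical, and labels produced this way are noisy rather than ground truth.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical, deliberately small signature lists; real rule sets are far larger
SQLI = re.compile(r"\bunion\s+select\b|\bor\s+1\s*=\s*1\b", re.IGNORECASE)
XSS = re.compile(r"<script\b|\bonerror\s*=|javascript:", re.IGNORECASE)
REQUEST = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) (\S+)')

def label_request(line):
    """Return a rough label for one access-log line based on its request target."""
    match = REQUEST.search(line)
    if not match:
        return "unparsed"
    # Decode percent-encoding so that encoded payloads become visible
    target = unquote_plus(match.group(1))
    if SQLI.search(target):
        return "sqli"
    if XSS.search(target):
        return "xss"
    return "benign"

# Hypothetical sample lines in a Common Log Format style
lines = [
    '10.0.0.5 - - [15/Sep/2020:16:50:35 +0000] "GET /item?id=1%20UNION%20SELECT%20password%20FROM%20users HTTP/1.1" 200 512',
    '10.0.0.6 - - [15/Sep/2020:16:51:02 +0000] "GET /search?q=%3Cscript%3Ealert(1)%3C/script%3E HTTP/1.1" 200 128',
    '10.0.0.7 - - [15/Sep/2020:16:52:10 +0000] "GET /index.html HTTP/1.1" 200 1024',
]
log_labels = [label_request(line) for line in lines]
print(log_labels)
```

The three sample lines are labeled `sqli`, `xss`, and `benign` respectively. Signature matching like this misses anything outside its patterns, so it is better suited to inspecting a dataset than to serving as its only source of labels.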

Please see Table II in this paper:

Majd, Mehryar, et al. "A Comprehensive Review of Anomaly Detection in Web Logs." 2022 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT). IEEE, 2022.

Context: Web-server Log Anomaly Detection (WLAD)

Here the authors collected recent work, including the weblog or HTTPS request datasets used, from the cybersecurity literature they reviewed. As you can see in this table, one of the most recent papers, from Amazon, used HTTP CSIC 2010 and ISCX IDS 2012 in its approach; as I mentioned, both are old public datasets.

I would also like to mention a conversation I saw on RG a long time ago that you might look at:

There are also old posts at https://security.stackexchange.com/:

Some related GitHub repos:

Recent surveys and evaluations:

Landauer, Max, et al. "Deep learning for anomaly detection in log data: A survey." Machine Learning with Applications 12 (2023): 100470.
Le, Van-Hoang, and Hongyu Zhang. "Log-based anomaly detection with deep learning: How far are we?" Proceedings of the 44th International Conference on Software Engineering (2022): 1356-1367.
- https://github.com/LogIntelligence/LogADEmpirical
