Data Ingestion & Testing #35
Comments
Good idea! We have the toy datasets, and we can add another folder instead of "toy_1". @skopula, feel free to add whatever sets of data you want in a branch. We have the toy_1 datasets sitting in Hadoop as well, and we are simply reading them in PySpark. The data sources are provided in the scheme.json file: OpenUBA/core/storage/scheme.json, lines 1 to 16 in 5636a7b.
@kaiiyer and jed may have ideas on which other datasets to add, but these are enough for now until we finish pipelining. We also have these files being sent to a local Elastic cluster, and we are reading the Elastic data in Python. Will push that up. Also, the DataSourceFileType enum in process.py defines the file type of a data source (lines 38 to 42 in 5636a7b).
The LogSourceType in dataset.py defines the location from which we fetch the data (lines 88 to 91 in 5636a7b).
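For readers following along, here is a minimal sketch of how a scheme entry, a file type, and a source location could fit together. It is not the code from the repo: the enum members, the `SCHEME` fragment, its keys, and the `read_records` helper are all hypothetical, and the real scheme.json, process.py, and dataset.py may look quite different.

```python
import csv
import io
import json
from enum import Enum


class DataSourceFileType(Enum):
    # Hypothetical members; the real enum in process.py may differ.
    CSV = "csv"
    JSON_LINES = "jsonl"


class LogSourceType(Enum):
    # Hypothetical members; the real definition in dataset.py may differ.
    LOCAL = "local"
    HDFS = "hdfs"
    ELASTIC = "elastic"


# Hypothetical scheme fragment; this is NOT the content of scheme.json.
SCHEME = json.loads("""
{
  "data_sources": [
    {"name": "proxy", "file_type": "csv",
     "location_type": "local", "path": "proxy.csv"}
  ]
}
""")


def read_records(source, open_local):
    """Read one data source into a list of dicts.

    `open_local` is a callable that takes a path and returns a text
    file object, so the sketch can run against in-memory data.
    """
    file_type = DataSourceFileType(source["file_type"])
    location = LogSourceType(source["location_type"])
    if location is not LogSourceType.LOCAL:
        # HDFS and Elastic readers are out of scope for this sketch.
        raise NotImplementedError(f"no reader for {location.value}")
    with open_local(source["path"]) as handle:
        if file_type is DataSourceFileType.CSV:
            return list(csv.DictReader(handle))
        # Otherwise treat the file as one JSON object per line.
        return [json.loads(line) for line in handle if line.strip()]


def fake_open(path):
    # Stand-in for open(); returns made-up sample rows.
    return io.StringIO("user,url\nuser1,example.com\n")


rows = read_records(SCHEME["data_sources"][0], fake_open)
```

Passing the opener in as an argument keeps the sketch testable without files on disk; a real pipeline would swap in `open` or a PySpark/Elastic reader depending on the location type.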
What is the status of this issue? I am interested in the data for UBA and was poking around the toy_1 data folder, but it is not clear to me how this data can be used for any machine learning task, because the data does not have labels (e.g. True/False), which we would need to build a classifier. I can take up this issue, but I would like to understand how any existing dataset can be used for an ML task, so any guidelines would be appreciated.
Hey @anupamme, sorry for the very late response. I just returned from the grave. I got separated from the team for a long time. Looking back at this issue now. Thanks for bumping it up.
Hey @jedwafu, just checking whether this issue is being looked at, and whether there is any progress or a timeline? P.S. I also just came back from a trip to the grave :).
The links below are great for getting sample log data (proxy, web log, DNS log, etc.):
https://www.secrepo.com/
https://log-sharing.dreamhosters.com/
We can start testing with sample user data (user1, user2, etc.) and sample proxy log data. I will create a new "data" folder and populate it with the sample proxy log data and user data.
Let me know your thoughts. Or is it too early for data ingestion and testing?