Update file README.md

Commit f28896c7 authored 1 year ago by Svoboda, Jan
--- a/README.md
+++ b/README.md
@@ -4,45 +4,14 @@

    Part 1: 
        Basic exploration work on URLs from  https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset 
-        Data description: 6,2 mil URLs - 4 classes -> benign,malware,phishing,defacement  
+        Data description: 600k URLs - 4 classes -> benign,malware,phishing,defacement  
    
    Goal:   Explore malware classification of URL data using Clustering methods
            Find "good" clustering for this purpose and explore how to evaluate what is "good" in this case
-            Try to approach more sophisticated data (XSS,SQLi,DGA etc.) and discriminate between types of malicious URLs
-
-    Notebooks:
-        1 - Module "prototype.ipynb"  
-            - Taking random sample of data 
-            - Calculating basic features (either hand chosen, vectorized data or mix) 
-            - Clustering  
-            - Visualising results to 2D and plotting 
-            - Computing metrics from the clustered data and visualising over range of parameters (k)
-            Using sklearn classes for 
-                K-Means (minibatch,k-means++), 
-                vectorization of URLs (tf-idf transformer with count vectorizer)
-                PCA to 2D for visualisation (TruncatedSVD for sparse matrices)
-                Basic metrics (Rand index, Silhouette score, Davies-Bouldin index, Homogeneity)
-        2 - Module "with_vec.ipynb"
-            - Based on "prototype.ipynb" 
-            - PLUS:
-                - different ways to split data
-                    - strip "https://" etc.
-                    - Leave whole
-                    - Separate hostnames from path
-                - Way to SAVE the clusters to .json with important constants to re-initialize vectorizer and PCA
-                - Way to visually read samples from the different clusters (print 10 URLs at a time)
-                - Way to manually annotate clusters and then save the descriptions with the data for later exploration and clustering
-                - Elbow method for visually finding the most advantageous k (most clustering benefit for the least added computational cost)
-        3 - Module "classifier.ipynb"
-            - Loads the .json 
-            - Takes the new URLs as input
-            - Calculates the features for the new data based on saved settings
-            - Finds the closest cluster center (predicts cluster for new data)
-            - Shows the descriptions saved for the assigned clusters
-        4 - Module for testing the classification
-            - Work In Progress
-
-    TODO:   Refactor code into classes to streamline work and separate functionality
-            Building "framework" for evaluation classification results
-            Experiment to find "good" classification
+            Try to approach more sophisticated data (XSS,SQLi,DGA etc.) and discriminate between types of malicious URLs. 

+python 3.8.10
+jupyter 1.0.0
+scikit-learn 1.2.2
+numpy 1.23.2
+matplotlib 3.7.4
\ No newline at end of file
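
For reference, the workflow described in the removed "Notebooks" section (vectorize URLs, cluster with K-Means, project to 2D, compute metrics, assign new URLs to the closest cluster) can be sketched roughly as below. This is a minimal sketch and not the notebooks' actual code: the sample URLs and labels are made up, the character n-gram settings are an assumption, and the names `strip_scheme`, `cluster_urls` and `predict_cluster` are hypothetical.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import (
    davies_bouldin_score,
    homogeneity_score,
    rand_score,
    silhouette_score,
)
from sklearn.pipeline import make_pipeline

# Made-up sample for illustration only; the notebooks sample the Kaggle dataset.
urls = [
    "https://example.com/index.html",
    "https://example.com/about.html",
    "https://example.org/contact.html",
    "https://example.org/index.html",
    "http://198.51.100.7/wp-admin/login.php?id=1",
    "http://198.51.100.9/wp-admin/login.php?id=2",
    "http://203.0.113.5/wp-admin/update.php?id=3",
    "http://203.0.113.8/wp-admin/login.php?id=4",
]
labels = ["benign"] * 4 + ["phishing"] * 4


def strip_scheme(url):
    """Drop a leading "http://" or "https://" (one of the split options listed)."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def cluster_urls(urls, true_labels, k, seed=0):
    """Vectorize, cluster, project to 2D and score one clustering run."""
    # Count vectorizer followed by a tf-idf transformer.
    # Character 2-4 grams are an assumption, not a setting taken from the README.
    vectorizer = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 4)),
        TfidfTransformer(),
    )
    X = vectorizer.fit_transform([strip_scheme(u) for u in urls])

    model = MiniBatchKMeans(
        n_clusters=k, init="k-means++", n_init=3, random_state=seed
    )
    pred = model.fit_predict(X)

    # TruncatedSVD works directly on the sparse matrix, unlike plain PCA.
    coords = TruncatedSVD(n_components=2, random_state=seed).fit_transform(X)

    metrics = {
        "rand": rand_score(true_labels, pred),
        "silhouette": silhouette_score(X, pred),
        # davies_bouldin_score needs a dense array; fine for a small sample.
        "davies_bouldin": davies_bouldin_score(X.toarray(), pred),
        "homogeneity": homogeneity_score(true_labels, pred),
        # Inertia per k is the quantity plotted for the elbow method.
        "inertia": model.inertia_,
    }
    return {
        "vectorizer": vectorizer,
        "model": model,
        "pred": pred,
        "coords": coords,
        "metrics": metrics,
    }


def predict_cluster(result, url):
    """Assign a new URL to the closest cluster center of a fitted run."""
    X_new = result["vectorizer"].transform([strip_scheme(url)])
    return int(result["model"].predict(X_new)[0])
```

Calling `cluster_urls(urls, labels, k)` over a range of `k` and plotting `metrics["inertia"]` against `k` gives the elbow curve; `coords` can be passed to a matplotlib scatter plot coloured by `pred`.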