Ransomware is a special type of malware that can lock victims’ screen and/or encrypt their files to obtain ransoms, resulting in great damage to users. Mapping ransomware into families is useful for identifying the variants of a known ransomware sample and for reducing analysts’ workload. However, ransomware that can fingerprint the environment can evade the precious work of dynamic analysis. To the best of our knowledge, to overcome this shortcoming, we are the first to propose an approach based on static analysis to classifying ransomware. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, Term frequency-Inverse document frequency (TF-IDF) is calculated for each N-gram to select feature N-grams so that these N-grams exhibit better discrimination between families. Finally, we treat the vectors composed of the TF values of the feature N-grams as the feature vectors and subsequently feed them to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Thorough experiments performed using real datasets demonstrate that our approach can achieve the best Accuracy of 91.43%. Furthermore, the average F1-measure of the “wannacry” ransomware family is up to 99%, and the Accuracy of binary classification is up to 99.3%. The proposed method can detect and classify ransomware that can fingerprint the environment. In addition, we discover that different feature dimensions are required for achieving similar classifier performance with feature N-grams of diverse lengths.

Classification of ransomware families with machine learning based on N-gram of opcodes

Mercaldo F.;
2019-01-01

Abstract

Ransomware is a special type of malware that can lock victims’ screen and/or encrypt their files to obtain ransoms, resulting in great damage to users. Mapping ransomware into families is useful for identifying the variants of a known ransomware sample and for reducing analysts’ workload. However, ransomware that can fingerprint the environment can evade the precious work of dynamic analysis. To the best of our knowledge, to overcome this shortcoming, we are the first to propose an approach based on static analysis to classifying ransomware. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, Term frequency-Inverse document frequency (TF-IDF) is calculated for each N-gram to select feature N-grams so that these N-grams exhibit better discrimination between families. Finally, we treat the vectors composed of the TF values of the feature N-grams as the feature vectors and subsequently feed them to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Thorough experiments performed using real datasets demonstrate that our approach can achieve the best Accuracy of 91.43%. Furthermore, the average F1-measure of the “wannacry” ransomware family is up to 99%, and the Accuracy of binary classification is up to 99.3%. The proposed method can detect and classify ransomware that can fingerprint the environment. In addition, we discover that different feature dimensions are required for achieving similar classifier performance with feature N-grams of diverse lengths.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11695/115610
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 192
  • ???jsp.display-item.citation.isi??? ND
social impact