AI Based Antivirus: Detecting Android Malware Variants With a Deep Learning System
Thomas Lei Wang
@thomaslwang
About me
•  My first (boring) job was as a virus analyst in 2004.
•  I had a dream…
Virus Analysis VS Image Recognition
Image provided by the MNIST handwritten database
An experienced virus analyst is sometimes doing image recognition!
Sample increase VS signature efficiency decrease
Number of malicious Android apps
Dowgin: an Android adware family rich in variants
New Dowgin samples vs. average Dowgin samples hit per signature
Malicious apps, Dowgin samples and Dowgin signatures are counted from our database.
Our evolution
Signature based rules → Behavioral based rules → Opcode based rules → AI based deep learning system
Training
Feature Extraction
•  Structural type
•  Statistical type
•  Empirical type
•  Continuous value
•  0-1 value
Feature Normalization
•  Standard score normalization
•  Cutting technique
•  Quantile normalization
Training in Deep Neural Network
•  PaddlePaddle platform
•  Residual layer
•  AutoEncoder
•  Configuration tunings
Models
•  Malware model
•  PUA model
Prediction
Input APK → features → Model → Output
Feature extraction
Numeralization of the APK (N=1235)
Structural features
•  Number of uses-permission entries in AndroidManifest
•  Number of picture files in /res
•  Size of /res
•  Number of classes starting with Lcom/
•  Number of classes starting with Ljava/
•  Number of fields of type boolean
•  Number of methods with more than 20 parameters
Statistical features
•  Count certificate fields across samples to get 100 strings with discriminative info, e.g. [email protected] with malicious/benign = 52
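The slide gives only the idea (count certificate strings, keep 100 discriminative ones, judge them by a malicious/benign ratio). A minimal sketch of that idea follows. The function names, the additive smoothing and the ranking rule are assumptions for illustration, not the speaker's implementation.

```python
from collections import Counter


def rank_certificate_strings(malicious, benign, top_n=100, smoothing=1.0):
    """Rank certificate strings by their malicious/benign occurrence ratio.

    malicious, benign: lists of samples, each a set of strings taken
    from certificate fields. Returns up to top_n (string, ratio) pairs.
    """
    mal_counts = Counter(s for sample in malicious for s in set(sample))
    ben_counts = Counter(s for sample in benign for s in set(sample))
    ratios = {}
    for s in set(mal_counts) | set(ben_counts):
        # Additive smoothing (an assumption) avoids division by zero.
        ratios[s] = (mal_counts[s] + smoothing) / (ben_counts[s] + smoothing)
    # A ratio far from 1 in either direction is treated as discriminative.
    ranked = sorted(ratios.items(),
                    key=lambda kv: max(kv[1], 1.0 / kv[1]),
                    reverse=True)
    return ranked[:top_n]


def to_binary_features(sample, selected):
    """Encode one sample as 0-1 values, one per selected string."""
    return [1 if s in sample else 0 for s, _ in selected]
```

Each selected string then contributes one 0-1 value to the feature vector of an APK.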
Empirical features
•  Has executable file in /res
•  Has apk file in /assets
•  Registers DEVICE_ADMIN_ENABLED broadcast and has sendSMSMessage permission
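Some of the structural and empirical features above can be read straight from the APK archive, since an APK is a ZIP file. The sketch below covers four of them with the Python standard library. The feature names, the list of picture extensions and the ELF-magic check for "executable" are assumptions; the manifest and class-level features need a real APK parser and are left out.

```python
import io
import zipfile

PICTURE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")  # assumed list
ELF_MAGIC = b"\x7fELF"  # assumed definition of "executable file"


def extract_features(apk_bytes):
    """Return a few structural and empirical features of one APK."""
    features = {
        "num_pictures_in_res": 0,    # structural, continuous value
        "size_of_res": 0,            # structural, uncompressed bytes
        "has_executable_in_res": 0,  # empirical, 0-1 value
        "has_apk_in_assets": 0,      # empirical, 0-1 value
    }
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for info in apk.infolist():
            if info.is_dir():
                continue
            name = info.filename
            if name.startswith("res/"):
                features["size_of_res"] += info.file_size
                if name.lower().endswith(PICTURE_EXTENSIONS):
                    features["num_pictures_in_res"] += 1
                # Peek at the first bytes to spot an ELF binary.
                with apk.open(info) as member:
                    if member.read(4) == ELF_MAGIC:
                        features["has_executable_in_res"] = 1
            elif name.startswith("assets/") and name.lower().endswith(".apk"):
                features["has_apk_in_assets"] = 1
    return features
```

Calling `extract_features(open(path, "rb").read())` on an APK file returns the dictionary; `path` is whatever file the caller supplies.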
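[Figure: example feature vector for one APK. Continuous values (N=571), e.g. 205, 3, 34.5, 143234, 285, 68, 296, 7, 13850, 157, 11218, 847, 1.23e+9, 422, 1004, 177, 0, 398, 13.333, 125. 0-1 values (N=664), e.g. 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0.]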
To make features more discriminative
Precision increased by 9%
Feature normalization
Continuous value, normalized to [-1,1]
•  Gaussian distribution: standard score normalization
•  Noise problem: cutting technique
•  Multimodal distribution: cutting technique
•  Long-tailed distribution: quantile normalization
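The three normalization techniques named on this slide can be sketched as follows. This is one possible reading, not the speaker's code: "cutting" is interpreted here as clipping values to a range, and the quantile mapping uses mid-ranks rescaled to [-1, 1]. Both choices are assumptions.

```python
import bisect
import statistics


def standard_score(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def cut(values, low, high):
    """Clip values to [low, high] (assumed meaning of 'cutting')."""
    return [min(max(v, low), high) for v in values]


def quantile_normalize(values):
    """Map each value to its empirical quantile, rescaled to [-1, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for v in values:
        # Mid-rank of v among all values; equal values share a rank.
        lo = bisect.bisect_left(ordered, v)
        hi = bisect.bisect_right(ordered, v)
        quantile = (lo + hi) / (2 * n)
        result.append(2 * quantile - 1)
    return result
```

Quantile normalization depends only on the order of the values, which is why it suits long-tailed features: a file size of 1.23e+9 lands near 1 instead of dwarfing every other value.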
Training in deep neural network
Network Architecture
[Figure: network architecture. Input layer: normalized continuous value (n=571) and 0-1 value (n=664). AutoEncoder, Hidden layer 1, Hidden layer 2, Hidden layer 3 and a Residual layer with an identity connection; layer width n=256; Tanh and ReLU activations. Output layer: Softmax (0 / 1).]
Configurations:
•  Hidden layer activation function: Tanh and ReLU
•  Cost function: Multi-class cross entropy
•  Learning method: ADADELTA
•  Final layer activation function: Softmax
•  Passes: 20–30
Trained on PaddlePaddle platform with 15M+ samples
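To make the configuration concrete, here is a small forward pass in plain Python with Tanh and ReLU hidden layers, one identity (residual) shortcut, a Softmax output and a cross-entropy cost. The exact wiring of the slide's network cannot be recovered from the transcript, so the layer order, the position of the shortcut and the weight initialization are assumptions, and the AutoEncoder is left out. The original system was trained on PaddlePaddle, not with code like this.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def tanh(v):
    return [math.tanh(a) for a in v]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Multi-class cross entropy for a single sample."""
    return -math.log(probs[label])


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_params(n_in, n_hidden=256, n_out=2, seed=0):
    rng = random.Random(seed)
    return {
        "hidden1": init_layer(n_in, n_hidden, rng),
        "hidden2": init_layer(n_hidden, n_hidden, rng),
        "hidden3": init_layer(n_hidden, n_hidden, rng),
        "output": init_layer(n_hidden, n_out, rng),
    }


def forward(x, params):
    h1 = tanh(dense(x, *params["hidden1"]))
    h2 = relu(dense(h1, *params["hidden2"]))
    h3 = tanh(dense(h2, *params["hidden3"]))
    # Residual step: add the identity shortcut from h1 (assumed wiring).
    res = [a + b for a, b in zip(h3, h1)]
    return softmax(dense(res, *params["output"]))
```

With the slide's sizes the input would have 571 + 664 = 1235 values, i.e. `build_params(n_in=1235)`; a smaller `n_in` and `n_hidden` keep experiments fast.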
Prediction & Evaluation
Apk features → Models → Prediction
Production deployment
•  Extract apk features on the phone
•  Send features to the cloud
•  Predict in the cloud
•  Return prediction to the phone
•  Perf: 140ms/apk
•  Traffic: 1kB/apk
Detection performance
[Figure: Detection performance as ROC curve. x-axis: False Positive Rate (0 to 0.1); y-axis: True Positive Rate (0.95 to 0.995).]
The ROC curve was tested against AV-TEST's July samples: 7613 Android malware, 3020 legitimate Android apps, total 10633.
Model lifetime
[Figure: The lifetime of the model trained in Jan 2016. x-axis: Jan 2016, Mar 2016, May 2016, Jul 2016; y-axis: Recall (0.88 to 0.97).]
The model was trained in Jan 2016 and tested against AV-TEST's Jan, Mar, May and July samples. The recall rate dropped by 7.6% in 6 months.
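For readers who want to reproduce this kind of evaluation on their own labeled test set, the points of a ROC curve and the recall (true positive rate) at a threshold can be computed as below. This is a generic sketch; the function names and the convention that label 1 means malware are assumptions, and no numbers from the slides are reproduced here.

```python
def confusion_rates(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at one threshold.

    scores: model outputs, higher means more likely malware.
    labels: 1 for malware, 0 for a legitimate app (assumed convention).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    tpr = tp / (tp + fn)  # also called recall
    fpr = fp / (fp + tn)
    return tpr, fpr


def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per distinct score used as threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = confusion_rates(scores, labels, threshold)
        points.append((fpr, tpr))
    return points
```

Running the same function on test sets collected in later months gives the recall values needed for a model lifetime plot.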
Limitations
•  Can't provide explanations for its detection results.
•  Can't understand the meaning of code.
•  Built on static analysis; lacks dynamic inspection.
•  Can't learn by itself; needs continuous training with labeled data.
Advantages
•  More difficult to evade
•  Fixed-size
Conclusion
•  Feature extraction is the key step
•  Virus analyst experience can help to find valuable features.
•  An AutoEncoder neural network can be used to extract the most valuable
features from a large number of features.
•  This system is designed to detect Android malware, but these
methods can also be used to detect malware on other
platforms.
•  Our system learns the way image recognition does. It is effective only at
detecting malware variants.
Thank you
•  Feel free to contact me
•  Twitter: @thomaslwang
•  Email: [email protected]
•  We welcome cooperation and partnership
•  Acknowledgement
•  Baidu IDL: Lyv Qin, Xiao Zhou, Jie Zhou, Errui Ding, Yuanqing Lin,
Andrew Ng
•  Partner: Liuping Hou, Jinke Liu, Zhijun Jia, Yanyan Ji
•  PaddlePaddle platform http://paddlepaddle.org