
轮奸案混合DNA分析的关键技术基础研究及分离软件研发

The Research of Key Technology in Gang Rape Mixed DNA Analysis and Separation Software Development

【Author】 Hu Na (胡娜)

【Supervisor】 Cong Bin (丛斌)

【Author information】 Hebei Medical University, Forensic Medicine, 2014, PhD

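【Abstract (Chinese version)】 Mixed DNA (DNA mixture) contains genetic information from more than one contributor. How to type mixed-stain biological evidence correctly and interpret the results scientifically is a theoretical and technical problem that forensic DNA analysis urgently needs to solve. This study built batch experimental models of mixed DNA from gang rape cases and used the scientifically validated experimental data to evaluate mixed-DNA parameters and mine constraint conditions. It then built mixed-DNA separation models based on STR typing data and compared them with a foreign mathematical model (the mixsep software package), converted the separation models into software, and used simulated mixed-DNA typing data to analyze the efficacy of the software and validate its application, thereby providing DNA examiners with a preliminary automated expert system for individual identification in gang rape mixed DNA. The methods and models are not restricted by the source type of the mixed sample (for example mixed semen stains, mixed bloodstains, or mixed shed epithelial cells), so they apply to mixed DNA from different case types.

Part I: Constructing an experimental model of gang rape mixed DNA
Objective: To take two-male mixed DNA and three-person mixed DNA (1 male + 1 male + 1 female) as models of gang rape mixed DNA; to build simulated mixed samples with the ABI 7500 real-time fluorescence quantitative PCR instrument, covering different contributors and different mixing gradients; and to use the scientifically validated experimental data for parameter evaluation and for developing the separation model.
Method: DNA was extracted from 50 human whole blood samples from the Hebei Blood Center and quantified on the ABI 7500. Single-source DNA stock solutions were grouped on the criterion that their DNA concentrations were very close, and served as source samples for the simulated two-male and three-person mixed DNA, so that different mixing gradients could be prepared by adjusting the volumes of the DNA solutions. To avoid overfitting of the later separation model caused by overly uniform sample types and insufficient sample size, several types of simulated mixed DNA from different contributors were built, each with several mixing gradients, so that the data objectively reflect the influence of the mixed DNA profile and the mixture proportion (Mx) on later analysis. The concentration of the simulated mixed DNA stock was adjusted to the ideal range of 0.5-1.25 ng/μl to meet the template requirement of the DNA typing kit. The deviation D between measured Mx and theoretical Mx, and the Mx value (alpha) estimated by the mixsep package, were then used as scientific validation indexes, and the data quality of the experimental model was evaluated through data mining and statistical analysis.
Results: With a DNA concentration difference of no more than 0.5 ng/μl as the grouping criterion, 22 single-source DNA samples met the standard for two-male mixed DNA and formed 11 groups, and 12 met the standard for three-person mixed DNA and formed 4 groups. For two-male mixed DNA, the deviation D of Mx before and after PCR amplification was ≤0.1 within the 95% confidence interval, with small fluctuation, indicating good data quality and a sound data basis for accurate separation of mixed DNA profiles. There is no general formula for D in three-person mixed DNA, so it was not evaluated. In the Identifiler (ID) profiles of two-male mixed DNA, measured alpha values with larger root mean square error (RMSE) were scattered among the 11 groups; except for gradient 1:1, whose RMSE was >0.02, the RMSE of the other 8 gradients lay between 0.01 and 0.02. That is, the RMSE between measured and theoretical Mx in the two-male ID profiles did not exceed 0.02 (except for gradient 1:1), and the experimental model provides a good data basis for sound mixed DNA analysis. Because many allelic drop-outs occurred in the three-person ID profiles, mixsep could not be relied on to estimate alpha accurately, so these were not evaluated. In the Yfiler profiles of two-male mixed DNA, measured alpha values with larger RMSE were scattered among the 11 groups; except for gradients 1:3 and 1:4, whose RMSE was >0.02 but <0.3, the RMSE of the other gradients lay between 0.01 and 0.02. That is, the RMSE between measured and theoretical Mx in the two-male Yfiler profiles did not exceed 0.03, and the Yfiler data serve only as a supplementary basis.
Conclusion: Using the ABI 7500 and the scientific validation of the experimental model, this part established 297 simulated two-male mixed DNA samples and 264 simulated three-person mixed DNA samples (the latter with many allelic drop-outs). Besides supporting construction of the separation model and development of the analysis software, the ID profiles of the 297 two-male samples provide data for evaluating mixed-DNA parameters (such as average allele peak height/area, mixture proportion, heterozygote balance, allelic drop-out, and inter-locus balance) and mining their regularities.

Part II: Parameter evaluation of two-male mixed DNA and validation of the mixsep software
Objective: To evaluate and analyze the parameters of mixed DNA and the correlations among them, in order to clarify the constraint conditions of mixed DNA analysis and mine their regularities; and to validate mixsep with simulated mixed DNA typing data, identifying its strengths and weaknesses to provide a reference and an efficacy comparison for the separation model under development.
Method: For the correlation between peak height (PH) and peak area (PA), a generalized additive model was used for curve fitting and least squares regression for the regression coefficients, to see whether the two kinds of quantitative information differ in efficacy. For the correlation between average peak height (APH) and heterozygote balance (Hb), locally weighted regression and the Kruskal-Wallis rank-sum test were used; these methods suit non-normally distributed data and allow the Hb distribution to be analyzed across the 16 STR loci and the 9 mixing gradients. Differences in fluorescence sensitivity among the STR loci of each channel were analyzed through the effect of channel fluorescence sensitivity on APH and through the inter-locus balance (Ci) parameter, to determine whether the STR loci differ in efficacy, with multiple comparisons by Tukey's honestly significant difference method. All statistical charts in this part were drawn with the ggplot2 package (version 0.9.3) of R (version 3.0.1).
Results: 1 PH and PA: Among the 16 STR loci, PH and PA showed a good linear relation at loci D19S433, D3S1358, D5S818 and D8S1179 and a strong linear relation at the other 12 loci, largely consistent with Tvedebrink's conclusion. PH and PA are thus linearly related, both can be used in mixed DNA analysis, and their efficacy differs little.
2 APH and Hb: By the Kruskal-Wallis rank-sum test, the p value for the Hb distribution across loci was 0.0063 < 0.05, and that across mixing gradients was 0.02257 < 0.05, so Hb differs statistically across both loci and gradients; that is, Hb is affected jointly by STR locus and mixing gradient. When APH < 1250 rfu, the Hb value increased markedly; when APH ≥ 1250 rfu, Hb was essentially stable, with a mean of 0.878. In these data, observations with APH < 2500 rfu and Hb above the 0.6 threshold accounted for 92.74%. Loci CSF1PO, D19S433, D21S11, D2S1338 and vWA had more observations with both high Hb and high APH than other loci; as the mixing gradient became more imbalanced (from 1:5 to 1:9), observations with both low Hb and low APH increased.
3 APH and drop-out: When the mixing gradient was fairly balanced (1:1 to 1:3), allelic drop-outs (ADO) were few; when it was very imbalanced (1:7 to 1:9), the ADO count rose sharply, so the ADO count is related to the mixing gradient. As the ADO count increased, the sample APH gradually decreased.
4 Effect of fluorescence sensitivity on APH: To test whether mean APH differs among the four channels (blue, green, yellow and red), multiple comparisons were made with Tukey's honest significant difference method. Blue and green did not differ (p = 0.446), nor did yellow and red (p = 0.530); for the other four pairs (blue vs. yellow, blue vs. red, green vs. yellow, green vs. red) the p values were all 3.95E-08, far below 0.05. Blue and green fluorescence therefore differ significantly from yellow and red; that is, the ABI 3130xl genetic analyzer is indeed more sensitive to blue and green fluorescence. The median APH of locus D8S1179 was the highest in the blue channel; in the green channel, loci D3S1358, TH01 and D13S317 had higher median APH than the others; in the yellow and red channels, loci D18S51 and FGA had the lowest median APH. This matches the ordering of STR loci by fragment size in the ABI Identifiler kit: the smaller loci D8S1179, D21S11, D3S1358, TH01, D13S317, D19S433, vWA, AMEL and D5S818 all had higher median APH. APH is thus affected jointly by the fluorescence sensitivity of the analyzer and the molecular size of the STR locus.
5 Inter-locus balance (Ci): The Pearson correlation coefficients (r) of the mean and median Ci with the ADO count were -0.7179 and -0.7065, with p values of 1.736E-3 and 2.215E-3 (both < 0.05), so the mean and median Ci are significantly negatively correlated with the ADO count. The locus with the highest median Ci was D8S1179. The distribution of Ci across the 16 STR loci largely follows the analyzer's sensitivity pattern across the four channels: sensitivity is higher in the blue and green channels, and the corresponding 8 loci D8S1179, D21S11, CSF1PO, D3S1358, TH01, D13S317, D16S539 and D2S1338 (D7S820 being the exception) have generally higher Ci; sensitivity is lower in the yellow and red channels, and the corresponding 6 loci D19S433, vWA, TPOX, D18S51, AMEL and FGA (D5S818 being the exception) have generally lower Ci.
6 Horizontal analysis of mixsep: The correlation between mixing gradient and locus separation accuracy was r = -0.7121 with p = 0.03139 < 0.05, a negative linear correlation; the correlation between mixing gradient and ADO count was r = -0.4244 with p = 0.2549 > 0.05, indicating no marked correlation. Gradient 1:1 had the lowest accuracy; as the gradient became more imbalanced, accuracy first rose and then fell, with gradients 1:2, 1:3 and 1:4 higher and gradients 1:1 and 1:9 lower and more variable. Accuracy after removing ADO was slightly higher than before, showing that allelic drop-out reduces the efficacy of mixsep.
7 Vertical analysis of mixsep: Loci D5S818, D8S1179 and FGA had higher accuracy (>88%), while loci D19S433, D2S1338 and D7S820 had lower accuracy (≤80%). Loci AMEL, D5S818 and D8S1179 had the fewest ADO, while loci D18S51, D19S433, FGA, TPOX and vWA had more (>15); these 5 loci all lie in the yellow and red channels and have lower APH, consistent with the lower sensitivity of the ABI 3130xl to yellow and red fluorescence. At gradient 1:1, all loci except AMEL and D3S1358 had accuracy ≤70%, and the outliers in the lower area of the box plot are the data of this gradient. At gradients 1:2, 1:3, 1:4 and 1:5, accuracy was higher at every locus, and at gradient 1:3 every locus reached its highest accuracy (≥90%). At gradients 1:8 and 1:9, accuracy fluctuated widely and was low on average.
Conclusion: From the correlation analyses of Hb, allelic drop-out, fluorescence sensitivity and Ci against APH, STR locus and mixing gradient, and from the efficacy analysis of mixsep, this study concludes the following for the 16 STR loci of the ABI ID kit. In genotype separation of mixed DNA, if the APH of the profile is greater than 1250 rfu and the mixing gradient lies within 1:1 to 1:5 (excluding 1:1), we preferentially trust the separation results of loci D8S1179, D21S11 and CSF1PO in the blue channel, D3S1358, TH01 and D13S317 in the green channel, D19S433, vWA and TPOX in the yellow channel, and AMEL and D5S818 in the red channel (11 loci in total). The 16 STR loci thus differ in separation efficacy, and their evidential strength differs accordingly. If the APH is low (less than 1000 rfu) and the mixing gradient is extremely imbalanced (beyond 1:6), and allelic drop-out is uncertain or no known reference sample is available, hasty software analysis is not recommended, because misclassification is very likely. In addition, a mixed DNA profile with gradient 1:1 cannot be separated into genotypes or used for individual identification. Even with a complete separation model and analysis software, the manual judgment of DNA examiners is still needed before and after genotype separation, and an expert conclusion cannot rest on the software alone.

Part III: Construction and efficacy analysis of separation models for gang rape mixed DNA
Objective: To build a scientifically sound and conservative mixed DNA separation model from batches of simulated mixed DNA STR profiles, validate its efficacy, and compare it with mixsep, to demonstrate the robustness and conservativeness of the model.
Method: 1 Naive Bayesian model: The allele peak height h_a is assumed to follow the normal distribution N(Bα + C, Hτ²). For convenience, the prior of the mixture proportion α is also assumed normal, N(m, A); the variance parameter τ remains a single parameter, and α is independent of τ, from which the marginal distribution of the peak height is derived. For the variance hyperparameter A of the prior, when the experimental data are fairly precise, α differs between loci by ≤0.05; taking this as a three-standard-deviation interval, the prior variance of α is about A = 0.0167² ≈ 0.00028 (from the data experience of this laboratory). A is then very small, so B²A is negligible relative to the original variance, and the marginal distribution simplifies to h_a | τ ~ N(Bm + C, Hτ²). With m obtained from the prior, all genotypes of each locus are traversed, and maximizing the marginal likelihood gives the optimal match and its parameters. Suboptimal genotypes can be selected by manual judgment; in this laboratory's experience, genotypes whose likelihood differs from that of the optimal match by a factor of 1.5 or more are not considered.
2 Constrained single locus analysis model: When the initial mixture proportion is known, the fluctuation range of the mixture proportion is empirically constrained, all genotypes of each locus are traversed, and the mixture proportion α and the variance parameter are solved by maximizing the likelihood function. The normal assumption for the mean and variance of the allele peak heights is retained, with the constraints α ∈ [a, b] and τ ∈ [ε, M], where ε is close to 0 and M is usually large. Under these constraints the genotypes are traversed and the likelihood is maximized; if the solved α reaches or exceeds the upper or lower bound, the genotype is flagged or excluded even if its likelihood is the largest. In addition, the formula for τ² shows that the better the peak height fit, the smaller the variance parameter; when τ² approaches the lower bound ε, the fit is best. In this laboratory's extensive mixed DNA data, the estimated mixture proportion fluctuates around the prior mixture proportion from the naive Bayesian model within a range of ≤0.08, and within ≤0.05 for two-allele loci; if the estimate is close to a bound, the locus is probably abnormal, and the optimal match can be rejected in favor of the suboptimal one on the basis of experience.
Results: 1 Both separation models, the naive Bayesian model and the constrained single locus analysis model, produced a separation error at locus AMEL of profile NAN3-1-9-B (the same result as mixsep); the misjudged genotype combination was X,X and X,Y. In forensic DNA analysis, locus AMEL plays an important role in determining the sex of suspects. Inferring directly from the peak heights at AMEL alone, without considering other factors, whether the mixed DNA comes from several males or from males and a female is not conservative enough; it may cause the separation model to misjudge the sex of the suspects and misdirect the investigation.
2 Among the factors behind peak height degradation, when molecular weight is the main factor, peak height adjustment can correct a misjudged locus; for loci where molecular weight is not the main factor (for example locus vWA of NAN3-1-5-B), the separation result is unchanged after adjustment. The degradation coefficient estimated from the mixed effect model is thus relatively conservative, and in real mixed DNA profiles degradation caused by molecular weight is sometimes not the main factor, so peak height adjustment is effective only for some STR loci.
Conclusion: Starting from three ideas (the global consistency problem, the naive Bayesian approach, and single locus solving), this part built the naive Bayesian model (Bayer for short) and the constrained single locus analysis model (Iter for short) and applied them to 4 mixed DNA profiles for which mixsep gave unsatisfactory results. Setting aside separation errors caused by peak height degradation, the combined use of Bayer and Iter gives better results for the optimal matching genotype. In addition, the mixed effect model handles peak height degradation conservatively: when molecular weight is the main factor, peak height adjustment corrects misjudged loci; for loci where it is not, the result is unchanged. Peak height adjustment is therefore effective only for some STR loci and serves only as an optional correction.

Part IV: Development and application validation of sepDNA, a separation software for mixed STR profiles
Objective: To take STR typing data compatible with the Chinese forensic science DNA database as input, combine the two separation models of Part III (Bayer and Iter), develop the mixed DNA separation software sepDNA, and validate its conservativeness and reliability with experimental data.
Method: The separation models built in Part III were converted into source code in the R language, the source code of the sepDNA user interface was added, and the whole was packaged as the sepDNA software; its analytical efficacy was then validated.
Results: sepDNA comprises two separation models and several small modules. The Bayes model finds the prior mixture proportion, reduces the problem to a normal distribution whose mean peak height depends only on the genotype, and seeks the optimal matching genotype by maximizing the marginal likelihood. The Iter model replaces joint analysis with separate analysis of each locus, empirically constrains the fluctuation range of the mixture proportion, solves the mixture proportion and variance parameter of each locus by maximizing the likelihood, and traverses each single locus under the parameter constraints. The two models separate mixed DNA genotypes from two different modeling ideas, global optimization and local optimization. Although the results at loci D3S1358 and D7S820 lowered the overall accuracy of the Iter model, considering sex inference at locus AMEL and the conservativeness of the separation models, the two models need to be used together: results produced by both models are taken as reliable, and results on which they disagree require further manual judgment, to ensure conservative and reliable separation.
Conclusion: The Bayes and Iter models in sepDNA must be used together, with results produced by both models taken as reliable, to ensure conservative and reliable separation. The software has no module for handling drop-out. It reports "sample average peak height" and "mixture proportion"; if the average peak height is too low or the mixture proportion is extremely imbalanced, allelic drop-out may have occurred, and the final conclusion must combine this parameter information with the manual judgment of DNA examiners. In addition, the three-person mixed DNA separation module has a "set fixed genotype" option that can moderately improve separation accuracy for three-person mixed DNA; the efficacy of this module needs further validation with more three-person mixed DNA data.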

【Abstract】 In this study, experimental models of mixed DNA in gang-rape cases were constructed, and the scientifically verified experimental data were used to evaluate mixed-DNA parameters and explore constraints. Separation models for mixed DNA were then constructed based on STR genotyping data and compared with the mixsep software developed abroad. The models were then converted into a software package whose efficacy and applicability were verified using genotyping data from simulated mixed DNA. This study puts forward a basic expert system for individual identification in mixed DNA.

Part I: Constructing an Experimental Model of Mixed DNA
Objective: Two-male mixed DNA and three-person mixed DNA (two males + one female) were used to simulate the mixed DNA samples found in gang-rape cases. An ABI 7500 real-time PCR analyzer was used to construct the simulated mixed DNA, including sample preparation with different contributors and different mixed ratios. The scientifically verified experimental data were used to evaluate mixed-DNA parameters and to develop the separation model. The deviation D between the measured Mx and the theoretical Mx, and the Mx value (alpha) estimated by the mixsep software, were taken as the scientific verification indexes for the experimental model. The data quality of the experimental model was evaluated through data mining and statistical analysis.
Method: DNA extracted from 50 whole blood samples was quantified on the ABI 7500 analyzer.
To construct the simulated two-male and three-person mixed DNA, single-source DNA samples were grouped as contributors on the criterion that their DNA concentrations were very close, so that different mixed ratios could be prepared by adjusting the volumes of the DNA solutions. To avoid the overfitting that overly uniform sample types and insufficient sample size might cause when constructing the separation model, multiple types of mixed DNA from different contributors were constructed, and each mixed DNA covered multiple mixed ratios, so that the data objectively reflected the impact of mixed DNA profiles and mixture proportion (Mx) on the analysis. In addition, the concentration of the mixed DNA stock solution was adjusted to the recommended range of 0.5-1.25 ng/μl to meet the template requirement of the PCR amplification kit.
Results: With a DNA concentration difference of no more than 0.5 ng/μl as the grouping standard, 22 single-source DNA samples met the standard for two-male mixed DNA and formed 11 groups, and 12 single-source DNA samples met the standard for three-person mixed DNA and formed 4 groups. For two-male mixed DNA, the deviation D of Mx before and after PCR amplification was ≤0.1 within the 95% confidence interval, with relatively small fluctuation, which indicated that the data used to construct the two-male experimental model were of good quality and provided a favorable data basis for accurate separation of mixed DNA genotypes.
In the mixed Identifiler profiles, measured alpha values with relatively large root mean square errors (RMSEs) were scattered among the 11 groups of samples; except for the 1:1 ratio, whose RMSE was >0.02, the RMSEs of the remaining 8 ratios were within the range of 0.01-0.02.
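The RMSE check described above can be sketched in a few lines. This is a minimal illustration, not the thesis code: it assumes that the theoretical Mx of a ratio such as 1:3 is the share of the first-listed contributor (0.25) and that RMSE is the usual root mean square deviation of the measured alpha values from that theoretical value. The function names and the sample values are hypothetical.

```python
import math


def theoretical_mx(first_parts: float, second_parts: float) -> float:
    """Share of the first contributor for a ratio first:second (assumed definition)."""
    return first_parts / (first_parts + second_parts)


def rmse(measured, theoretical: float) -> float:
    """Root mean square error of measured values against one theoretical value."""
    if not measured:
        raise ValueError("need at least one measured value")
    return math.sqrt(sum((m - theoretical) ** 2 for m in measured) / len(measured))


# Hypothetical alpha estimates for the 1:3 ratio from four sample groups.
measured_alpha = [0.26, 0.24, 0.27, 0.23]
target = theoretical_mx(1, 3)
error = rmse(measured_alpha, target)
print(f"theoretical Mx = {target:.3f}, RMSE = {error:.4f}")
```

Computing one RMSE per mixed ratio in this way gives a table that can be compared directly with the 0.01-0.02 range reported in the abstract.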
That is, the RMSE between the measured Mx and the theoretical Mx was no more than 0.02 in the Identifiler profiles of the simulated two-male mixed DNA (except for the 1:1 ratio). This experimental model could provide a favorable data basis for scientific analysis of mixed DNA.
In the mixed Yfiler profiles, measured alpha values with relatively large RMSEs were scattered among the 11 groups of samples; except for the 1:3 and 1:4 ratios, whose RMSEs were >0.02 but <0.3, the RMSEs of the remaining ratios were within the range of 0.01-0.02. That is, in the Yfiler profiles of the two-male mixed DNA constructed in this experiment, the RMSE between the measured Mx and the theoretical Mx was no more than 0.03.
Conclusion: In this part, using the ABI 7500 analyzer and the scientific verification of the experimental model, 297 simulated two-male mixed DNA samples and 264 simulated three-person mixed DNA samples were established to simulate the mixed DNA in gang-rape cases. Besides supporting construction of the mixed DNA separation model and development of the separation software, the Identifiler STR profiles of the 297 simulated two-male mixed DNA samples also provide data support for evaluating the parameters of mixed DNA (such as average allele peak height/area, mixture proportion, heterozygote balance ratio, allelic drop-out, and inter-locus balance) and mining their regularities.

Part II: Parameter Estimation and Mixsep Software Verification for the Simulated Two-male Mixed DNA
Objective: To clarify the constraints in mixed DNA analysis and find their regularities by evaluating and analyzing the parameters of mixed DNA and the correlations among them.
By applying the simulated mixed DNA profile data to the mixsep software, the study also aimed to identify its advantages and disadvantages, providing a reference and an efficacy comparison for the development of our mixed DNA separation model.
Methods: For the correlation analysis between peak height (PH) and peak area (PA) of mixed DNA profiles, a generalized additive model was used for curve fitting, and least squares regression was used to compute the regression coefficients, in order to observe whether the two kinds of quantitative information differ in efficacy in mixed DNA analysis.
For the correlation analysis between average peak height (APH) and heterozygote balance (Hb), locally weighted regression and the Kruskal-Wallis rank-sum test were adopted because they suit non-normally distributed data, and the Hb distributions across the 16 STR loci and the 9 mixed ratios were analyzed separately.
Differences in fluorescence sensitivity among the STR loci of each channel were analyzed through the effect of channel fluorescence sensitivity on APH and through the inter-locus balance (Ci) parameter, so as to determine whether the STR loci differ in efficacy in mixed DNA analysis; multiple comparisons were performed with Tukey's honestly significant difference method.
All statistical charts in this part were drawn with the ggplot2 package (version 0.9.3) of the R software (version 3.0.1).
Results: 1 Correlation analysis of PH and PA: Among the 16 STR loci, PH and PA showed a good linear relation at loci D19S433, D3S1358, D5S818, and D8S1179 and a strong linear relation at the remaining 12 loci. This was largely consistent with the conclusion of Tvedebrink, i.e., PH and PA have a good linear relation, and both kinds of quantitative information can be used in mixed DNA analysis with little difference in analytical efficacy.
2 Correlation analysis of APH and Hb: By the Kruskal-Wallis rank-sum test, the p value for the Hb distribution across loci was 0.0063, which was less than 0.05, indicating that the Hb distributions of the loci were statistically different; the p value for the Hb distribution across mixed ratios was 0.02257, also less than 0.05, indicating that the Hb distributions of the mixed ratios were statistically different as well. That is, the Hb distribution is affected by both STR locus and mixed ratio. When APH < 1250 rfu, the Hb value increased markedly (from 0.75 to around 0.87); when APH ≥ 1250 rfu, the Hb value was almost constant, with a mean of 0.878. In the experimental data, observations with APH ≥ 1250 rfu and Hb above the 0.6 threshold accounted for 92.74%.
At loci CSF1PO, D19S433, D21S11, D2S1338, and vWA, there were more observations with both high Hb and high APH than at the other loci; as the mixed ratio became more imbalanced (from 1:5 to 1:9), observations with both low Hb and low APH became more frequent.
3 Correlation analysis of APH and drop-out: For relatively balanced ratios (1:1 to 1:3), there were fewer allelic drop-outs (ADO); for very imbalanced ratios (1:7 to 1:9), the ADO count increased rapidly, suggesting that it was correlated with the mixed ratio. As the ADO count increased, the sample APH gradually decreased.
4 Impact of fluorescence sensitivity on APH: To test whether the mean APH differed statistically among the four channels (blue, green, yellow, and red), multiple comparisons were performed with Tukey's honest significant difference method. There was no difference in fluorescence sensitivity between the blue and green channels (p = 0.446), nor between the yellow and red channels (p = 0.530).
For the other 4 pairs (blue vs. yellow, blue vs. red, green vs. yellow, and green vs. red), the corresponding p values were all 3.95E-08, far below 0.05. The sensitivity to blue and green fluorescence therefore differed significantly from that to yellow and red; that is, the ABI 3130xl Genetic Analyzer was indeed more sensitive to blue and green fluorescence than to the other two.
The median APH of locus D8S1179 was the highest in the blue channel. The median APHs of loci D3S1358, TH01, and D13S317 were higher than those of the other loci in the green channel, and the median APHs of loci D18S51 and FGA were the lowest in the yellow and red channels. This matches the ordering of the STR loci by fragment size in the ABI Identifiler kit: the median APH values of the loci with small molecular sizes (D8S1179, D21S11, D3S1358, TH01, D13S317, D19S433, vWA, AMEL, and D5S818) were relatively high. APH is thus affected by both the fluorescence sensitivity of the genetic analyzer and the molecular size of the STR locus.
5 Analysis of parameter Ci: The Pearson correlation coefficients (r) of the mean and median Ci with the ADO count were -0.7179 and -0.7065, respectively, with p values of 1.736E-3 and 2.215E-3, both below 0.05; the mean and median Ci were thus significantly negatively correlated with the ADO count.
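For readers who want to reproduce this kind of check, a Pearson correlation coefficient can be computed with the standard library alone. The sketch below is illustrative only: `pearson_r` is a hypothetical helper, and the ADO counts and mean Ci values are made-up numbers, not data from the thesis. It does not compute the p value.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-ratio summaries: ADO count and mean Ci.
ado_counts = [0, 2, 5, 9, 14]
mean_ci = [0.070, 0.066, 0.061, 0.055, 0.048]
r = pearson_r(ado_counts, mean_ci)
print(f"r = {r:.4f}")  # a negative r means Ci falls as drop-outs increase
```

A coefficient of this kind is bounded by -1 and 1; a negative value cannot be a squared quantity, which is why it is written here as r.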
The locus with the highest median Ci was D8S1179.
The distribution of Ci values across the 16 STR loci was generally consistent with the fluorescence sensitivity of the ABI 3130xl Genetic Analyzer in the four channels: the ABI 3130xl had higher fluorescence sensitivity in the blue and green channels, and the corresponding 8 loci D8S1179, D21S11, CSF1PO, D3S1358, TH01, D13S317, D16S539, and D2S1338 (with D7S820 as an exception) had generally higher Ci values, whereas it had lower fluorescence sensitivity in the yellow and red channels, and the corresponding 6 loci D19S433, vWA, TPOX, D18S51, AMEL, and FGA (with D5S818 as an exception) had generally lower Ci values.
6 Horizontal analysis of mixsep: Correlation analysis between mixed ratio and locus separation accuracy gave a correlation coefficient r = -0.7121 with p = 0.03139 < 0.05, a negative linear correlation. Correlation analysis between mixed ratio and ADO count gave r = -0.4244 with p = 0.2549 > 0.05, suggesting no marked correlation. The 1:1 ratio had the lowest accuracy, and as the mixed ratio became more imbalanced, the accuracy first rose and then fell.
Ratios 1:2, 1:3, and 1:4 had higher accuracy, while ratios 1:1 and 1:9 had lower accuracy and greater variation. The locus separation accuracy after removing ADO was slightly higher than before, meaning that allelic drop-out impairs the analytical efficacy of mixsep.
7 Vertical analysis of mixsep: Loci D5S818, D8S1179, and FGA had higher accuracy (>88%), while loci D19S433, D2S1338, and D7S820 had lower accuracy (≤80%). Loci AMEL, D5S818, and D8S1179 had the fewest drop-outs, while loci D18S51, D19S433, FGA, TPOX, and vWA had more (>15); these 5 loci all lie in the yellow and red channels and have lower APHs, consistent with the lower sensitivity of the ABI 3130xl Genetic Analyzer to yellow and red fluorescence.
For the 1:1 ratio, the accuracies of all loci except AMEL and D3S1358 were ≤70%, and the outliers in the lower area of the box plot were the data of this ratio. For ratios 1:2, 1:3, 1:4, and 1:5, the accuracies of all loci were higher, and at ratio 1:3 every locus reached the highest accuracy (≥90%); for ratios 1:8 and 1:9, the locus separation accuracies fluctuated more and had lower mean values.
Conclusion: Based on the correlation analyses of Hb, allelic drop-out, fluorescence sensitivity, and Ci against the APH of DNA profiles, STR loci, and mixed ratios, together with the efficacy analysis of the mixsep software, this study suggests the following for the 16 STR loci of the ABI Identifiler kit. During genotype separation of mixed DNA, if the APH of the profile is greater than 1250 rfu and the mixed ratio is within 1:1 to 1:5 (excluding 1:1), we prefer the genotype separation results of loci D8S1179, D21S11, and CSF1PO in the blue channel, loci D3S1358, TH01, and D13S317 in the green channel, loci D19S433, vWA, and TPOX in the yellow channel, and loci AMEL and D5S818 in the red channel (11 loci in total). That is, the 16 STR loci differ in genotype separation efficacy in mixed DNA analysis, and their evidence strength differs accordingly.
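The preference rule stated in this conclusion can be written as a small filter. The sketch below is one possible encoding under stated assumptions: the mixed ratio is expressed as 1:`major_parts`, the amelogenin locus is keyed as "AMEL", and the function name and the example genotype calls are hypothetical. It is not part of sepDNA or mixsep.

```python
# The 11 loci the abstract prefers when APH > 1250 rfu and the ratio is
# within 1:1 to 1:5, excluding 1:1.
PREFERRED_LOCI = frozenset({
    "D8S1179", "D21S11", "CSF1PO",   # blue channel
    "D3S1358", "TH01", "D13S317",    # green channel
    "D19S433", "vWA", "TPOX",        # yellow channel
    "AMEL", "D5S818",                # red channel
})


def preferentially_trusted(aph_rfu: float, major_parts: float, calls: dict) -> dict:
    """Return the separation calls at preferred loci, or {} if the profile
    falls outside the APH / mixed-ratio window (ratio given as 1:major_parts)."""
    in_window = aph_rfu > 1250 and 1 < major_parts <= 5
    if not in_window:
        return {}
    return {locus: call for locus, call in calls.items() if locus in PREFERRED_LOCI}


# Hypothetical separation calls for three loci.
calls = {"D8S1179": "13,14 / 10,15", "TH01": "6,9 / 7,9.3", "FGA": "21,22 / 19,24"}
print(preferentially_trusted(1800, 3, calls))  # FGA is not among the preferred loci
```

Returning an empty mapping outside the window is a design choice of this sketch; it only signals that the preference rule does not apply and says nothing about whether the calls are wrong.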
If the APH of the mixed DNA profile is low (less than 1000 rfu) and the mixed ratio is extremely imbalanced (more imbalanced than 1:6), and allelic drop-out is uncertain or no known reference sample is available, hasty software analysis of the mixed DNA is not recommended, because it easily leads to misclassification. Moreover, mixed DNA profiles with a mixed ratio close to 1:1 cannot undergo genotype separation and individual identification. That is, even with a complete mixed DNA separation model and analytical software, the manual judgment of DNA examiners is still needed before and after genotype separation, and an expert conclusion cannot be drawn from the software report alone.

Part III: Separation Model Construction and Efficacy Analysis in Mixed DNA
Objective: To construct a scientific and conservative mixed DNA separation model based on a large number of simulated mixed DNA profiles, to verify the efficacy of the separation model, and to compare it with the mixsep software, so as to demonstrate the robustness of the constructed model.
Method: 1 Naive Bayesian model: The allele peak height h_a was assumed to follow a normal distribution. For convenience, the prior distribution of the mixture proportion α was also assumed to be normal, N(m, A); the variance parameter τ remained a single parameter, and α was independent of τ, from which the marginal distribution of h_a was derived. For the variance hyperparameter A of the prior distribution, when the experimental data were relatively accurate, the difference in α between loci was ≤0.05; taking this as a three-standard-deviation range for interval estimation, the prior variance of α was about A = 0.0167² ≈ 0.00028 (obtained from the data experience of this lab).
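The marginal distribution referred to here follows from a standard property of normal distributions. The derivation below is a reconstruction under stated assumptions, not the formula printed in the thesis: the conditional variance of the peak height is written generically as $\sigma_\tau^2$ (the abstract expresses it through $\tau^2$), and $B$ and $C$ are the genotype-dependent coefficients of the linear mean.

```latex
% Assumed model: peak height linear in alpha with normal noise, normal prior on alpha.
\begin{aligned}
h_a \mid \alpha, \tau &\sim N\!\left(B\alpha + C,\ \sigma_\tau^2\right),
\qquad \alpha \sim N(m, A),\\[4pt]
\mathrm{E}[h_a \mid \tau] &= B\,\mathrm{E}[\alpha] + C = Bm + C,\\
\mathrm{Var}[h_a \mid \tau] &= \sigma_\tau^2 + B^2\,\mathrm{Var}[\alpha] = \sigma_\tau^2 + B^2 A,\\[4pt]
h_a \mid \tau &\sim N\!\left(Bm + C,\ \sigma_\tau^2 + B^2 A\right).
\end{aligned}

% Prior variance: a spread of 0.05 read as three standard deviations.
A = \left(\tfrac{0.05}{3}\right)^2 \approx 0.0167^2 \approx 2.8 \times 10^{-4}
```

With $A$ this small, the term $B^2 A$ contributes little to the variance whenever it is much smaller than $\sigma_\tau^2$, so the marginal distribution is approximately $N(Bm + C,\ \sigma_\tau^2)$.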
Since A was very small, the term B²A could be ignored relative to the original variance, and the marginal distribution of h_a could be simplified to h_a | τ ~ N(Bm + C, τ²), in which m was obtained from the prior distribution. When traversing all the genotypes of each locus, the optimal match and its parameters were obtained by maximizing the marginal likelihood; suboptimal matched genotypes could be selected by manual judgment. In our experience, genotypes whose likelihood differs from that of the optimal match by a factor of more than 1.5 are not considered.
2 Constrained single locus analysis model: Given the initial mixture proportion, an empirical constraint was placed on the fluctuation range of the mixture proportion α, and then, for all the genotypes of each locus, α and the variance parameter were solved by maximizing the likelihood function. The normal distribution assumption still held for the mean and variance of the allele peak heights, with the mixture proportion constrained to α ∈ [a, b] and the variance parameter to τ ∈ [ε, M], where ε is close to 0 and M is usually large. Under these parameter constraints, the genotypes were traversed and the likelihood function was maximized; if the solved α reached or exceeded the upper or lower limit of the constraint, the genotype was flagged or excluded even if its likelihood was the maximum. In addition, according to the formula of the variance parameter τ², the better the peak height fit, the smaller the variance parameter; when τ² approaches the lower limit ε, the peak height fit is at its best.
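The constrained single locus search can be illustrated with a simplified two-person sketch. It rests on assumptions that are not taken from the thesis: the expected height of an allele is the locus total multiplied by α times half the allele's copy number in contributor 1 plus (1 - α) times half its copy number in contributor 2; all alleles share one variance, so maximizing the normal likelihood is equivalent to minimizing the mean squared residual (used here as the estimate of τ²); drop-out and drop-in are ignored; and α is searched on a grid. The function names and the peak data are hypothetical. The default window half-width of 0.08 matches the fluctuation range the abstract reports around the prior mixture proportion.

```python
from itertools import combinations_with_replacement, product


def expected_heights(alleles, g1, g2, alpha, total):
    """Assumed mean model: share of the locus total by contributor and copy number."""
    return [total * (alpha * g1.count(a) / 2 + (1 - alpha) * g2.count(a) / 2)
            for a in alleles]


def fit_locus(peaks, alpha_prior, half_width=0.08, steps=161):
    """Traverse genotype pairs, fit alpha inside [a, b], and flag fits at a bound.

    peaks: dict mapping allele name to peak height (rfu).
    Returns candidate fits sorted from best (smallest tau2) to worst.
    """
    alleles = sorted(peaks)
    total = sum(peaks.values())
    lo = max(0.0, alpha_prior - half_width)   # lower constraint a
    hi = min(1.0, alpha_prior + half_width)   # upper constraint b
    genotypes = list(combinations_with_replacement(alleles, 2))
    fits = []
    for g1, g2 in product(genotypes, repeat=2):
        if set(g1) | set(g2) != set(alleles):
            continue  # every observed allele must be explained
        best = None
        for i in range(steps):
            alpha = lo + (hi - lo) * i / (steps - 1)
            exp = expected_heights(alleles, g1, g2, alpha, total)
            tau2 = sum((peaks[a] - e) ** 2 for a, e in zip(alleles, exp)) / len(alleles)
            if best is None or tau2 < best[0]:
                best = (tau2, alpha, i)
        tau2, alpha, i = best
        fits.append({"g1": g1, "g2": g2, "alpha": alpha, "tau2": tau2,
                     "at_bound": i in (0, steps - 1)})  # warn or exclude these
    return sorted(fits, key=lambda f: f["tau2"])


# Hypothetical four-allele locus from a 1:3 mixture, prior alpha 0.25.
peaks = {"12": 500, "13": 1500, "14": 500, "15": 1500}
best = fit_locus(peaks, alpha_prior=0.25)[0]
print(best["g1"], best["g2"], round(best["alpha"], 3), best["at_bound"])
```

In this toy example every competing genotype pair can only improve its fit by pushing α against the edge of the window, which is the situation the abstract says should trigger a warning or exclusion.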
Based on the large body of experimental mixed DNA data generated in our lab, the estimated α fluctuates around the prior α obtained from the Naive Bayesian model within a range of ≤0.08, and within ≤0.05 for two-allele loci. If the estimated α approaches the upper or lower limit of the constraint, the locus is very likely abnormal, and its optimal match can be replaced with the suboptimal one according to experience.
Results: 1 The two separation models constructed in this part, the Naive Bayesian model (Bayer for short) and the constrained single locus analysis model (Iter for short), both produced a separation error at locus AMEL of DNA profile NAN3-1-9-B (the same result as mixsep); the misjudged genotype combination was X,X and X,Y. In forensic DNA testing, locus AMEL plays an important role in inferring the sex of suspects. When other factors affecting the mixed DNA profile are not considered, inferring directly from the peak heights of this locus alone whether the mixed DNA comes from several males or from males and a female is not conservative enough; it could cause the separation model to misjudge the sex of the suspects and thus misdirect the investigation.
2 Among the factors influencing peak height degradation, when molecular weight is the major factor, the misjudged locus can be corrected through peak height adjustment; for loci where molecular weight is not the major cause of peak height degradation (such as locus vWA of sample NAN3-1-5-B), the separation result remains the same after peak height adjustment.
In other words, the peak height degradation coefficient estimated from the mixed effect model is relatively conservative, and in actual mixed DNA profiles molecular weight is sometimes not the major cause of peak height degradation. Peak height adjustment is therefore effective only for some STR loci in mixed DNA profiles.
Conclusion: This part started from three ideas: the global consistency problem, the Naive Bayesian approach, and single locus solving. Through the Bayer model and the Iter model, genotype separation was performed on 5 mixed DNA profiles for which mixsep did not give ideal results. Setting aside separation errors caused by peak height degradation, the combined use of Bayer and Iter gave better results in the analysis of best-matched genotypes. In addition, the constructed mixed effect model handles peak height degradation conservatively: when molecular weight is the major factor contributing to peak height degradation, the misjudged loci can be corrected through peak height adjustment; that is, peak height adjustment is effective only for some STR loci and serves only as an optional correction.

Part IV: Development of sepDNA Software in Mixed DNA Analysis and Case Application
Objective: To select STR typing data compatible with the Chinese forensic science DNA database as input, to combine the two separation models developed in this study into the mixed DNA separation software sepDNA, and to verify the robustness and reliability of sepDNA with experimental data.
Methods: The sepDNA software package was developed in the R language by converting the mixed DNA separation models constructed in Part III into source code and adding the source code of the sepDNA user interface; the analytical efficacy of the software was then verified.
Results: The sepDNA software contains two separation models and several small modules.
The Bayes model finds the prior mixture proportion, reduces the problem to a normal distribution whose mean peak height depends only on the genotype, and finds the best-matched genotype by maximizing the marginal likelihood function. The Iter model replaces joint analysis with separate analysis of each locus, places an empirical constraint on the fluctuation range of the mixture proportion, solves the mixture proportion and variance parameter of each locus by maximizing the likelihood function, and traverses the candidate solutions of each single locus under the parameter constraints.
The two models complete genotype separation of mixed DNA from two different modeling ideas, global optimization and local optimization. Although the separation results of loci D3S1358 and D7S820 lowered the overall separation accuracy of the Iter model, considering sex inference at locus AMEL and the robustness of the separation model, the two models need to be used together: a separation result produced by both models is taken as reliable, and results on which the two models disagree require further manual judgment, to ensure the robustness and reliability of the mixed DNA separation report.
Conclusion: The sepDNA software developed in this study includes two separation models, the Bayer model and the Iter model. The two models should be used together, and a separation result that appears in both models is considered reliable, in order to ensure robustness and reliability. The software has no module for handling allelic drop-out. It reports the parameters "average peak height" and "mixture proportion"; if the average peak height is too low or the mixed ratio is extremely unbalanced, this indicates that allelic drop-out may have occurred in the mixed DNA profile, and the final conclusion of the sepDNA analysis report then needs to combine the parameter information with manual judgment.
The three-person mixed DNA separation module of this software has a function called "set up fixed genotype", which can moderately increase the separation accuracy for three-person mixed DNA, but the efficacy of this module still needs further verification with more three-person mixed DNA data.
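The rule of accepting only results on which both models agree can be sketched as follows. This is an illustration of the decision rule described in the abstract, not code from sepDNA (which the abstract says is written in R); the function name, the data layout, and the example genotype calls are hypothetical.

```python
def combine_calls(bayes_calls: dict, iter_calls: dict) -> dict:
    """Mark a locus reliable only when both models return the same genotype pair;
    otherwise route it to manual review."""
    report = {}
    for locus in sorted(set(bayes_calls) | set(iter_calls)):
        b = bayes_calls.get(locus)
        i = iter_calls.get(locus)
        if b is not None and b == i:
            report[locus] = {"status": "reliable", "genotypes": b}
        else:
            report[locus] = {"status": "manual review", "bayes": b, "iter": i}
    return report


# Hypothetical per-locus calls: ((contributor 1 genotype), (contributor 2 genotype)).
bayes_calls = {"TH01": (("6", "9"), ("7", "9.3")), "AMEL": (("X", "X"), ("X", "Y"))}
iter_calls = {"TH01": (("6", "9"), ("7", "9.3")), "AMEL": (("X", "Y"), ("X", "Y"))}

for locus, entry in combine_calls(bayes_calls, iter_calls).items():
    print(locus, entry["status"])
```

A locus reported by only one model is also sent to manual review in this sketch, which keeps the combined report on the conservative side.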
