Data Mining Techniques for the Life Sciences (DOI: 10.1007/978-1-60327-241-4)
Data Mining Techniques for the Life Sciences (DOI: 10.1007/978-1-4939-3572-7)
Cited as:
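See the "Cite as" entry of each chapter in the online edition.
This book belongs to the Methods in Molecular Biology series. As of July 2021, the series comprised 2,323 volumes covering a wide range of topics in biology and biomedicine.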
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
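Data Mining Techniques for the Life Sciences currently exists in two editions, published in December 2009 and April 2016, both edited by Oliviero Carugo and Frank Eisenhaber.
Oliviero Carugo's research focuses on the structural chemistry of macromolecules, with particular emphasis on the analysis, computation, and bioinformatic prediction of the tertiary and quaternary structure of globular proteins.
Frank Eisenhaber's research interests center on discovering new biomolecular mechanisms from biological and medical data and on functionally characterizing as-yet uncharacterized genes and pathways. Because mechanistic understanding drives biotechnological, biomedical, and clinical applications, this work feeds into a variety of applied research. Frank Eisenhaber is one of the scientists who discovered the SET domain methyltransferases, ATGL, kleisins, and many new protein domain functions (for example in the GPI lipid anchor biosynthesis pathway), and he has developed accurate prediction tools for posttranslational modifications and subcellular localization as well as algorithms for omics data analysis.
The book is divided into three parts: life-science databases, data techniques, and database applications.
Part I: Databases
Part II: Data Techniques
Part III: Database Applications (Annotation and Prediction)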
TPA:experimental: Annotation of sequence data is supported by peer-reviewed wet-lab experimental evidence. (TPA: Third Party Annotation)
TPA:inferential: Annotation of sequence data by inference (where the source molecule or its product(s) have not been the subject of direct experimentation)
TPA:assembly: Assembly or reassembly of sequence data for which the generation, whether it is purely computational or informed by experimentation, has been subject to peer review. Feature annotation is not required to be part of the peer review for this TPA type. (Examples of such assemblies include complete viruses, mitochondria, or named biosynthetic gene clusters)
GenBank: An archival database of primary nucleotide sequences that were directly sequenced by the submitter.
RefSeq: A curated, non-redundant database that includes genomic DNA, transcript (RNA), and protein products, for major organisms. The sequence data are derived from GenBank primary data, and the annotation is computational, from published literature, or from domain experts.
(Retrieved from https://www.ncbi.nlm.nih.gov/genbank/tpa/ on 2021-07-16.)
The data in the ID system are stored in Abstract Syntax Notation One (ASN.1) format, a standard language for describing structured information. NCBI has adopted ASN.1 to describe biological sequences and all related information (taxonomic, bibliographic) in a structured way. Many NCBI users think of the GenBank flatfile as the archetypal sequence data format. However, within NCBI, and especially within the ID internal data flow system, ASN.1 is considered the original format from which reports such as the GenBank flatfile are generated. As an object-oriented structured language, ASN.1 is easily transformed into other formats and languages such as XML, C, and C++. The NCBI Toolkit provides the converters between the data structures. Entrez display options allow the data to be viewed in various text formats, including ASN.1, XML, and GenBank flatfiles.
(For more information, please refer to https://www.ncbi.nlm.nih.gov/Structure/asn1.html.)
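As a note to self, the idea of "one record, several report formats" can be tried out through the EFetch E-utility. The sketch below only builds request URLs and sends nothing. The rettype/retmode pairs are the ones I understand EFetch to accept for the nucleotide database; the ASN.1 view is deliberately left out because I have not confirmed its parameter values. The names FORMATS and efetch_url and the accession placeholder are my own and purely illustrative.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# rettype/retmode pairs for the nucleotide database (nuccore).
# Assumption: these pairs follow my reading of the EFetch documentation.
# The ASN.1 view is omitted on purpose (parameter values not confirmed here).
FORMATS = {
    "genbank_flatfile": {"rettype": "gb", "retmode": "text"},
    "fasta": {"rettype": "fasta", "retmode": "text"},
    "genbank_xml": {"rettype": "gb", "retmode": "xml"},
}


def efetch_url(accession, view):
    """Build (but do not send) an EFetch URL for one record in one format."""
    params = {"db": "nuccore", "id": accession, **FORMATS[view]}
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    # ACCESSION_PLACEHOLDER is hypothetical; substitute a real accession.
    for view in FORMATS:
        print(view, efetch_url("ACCESSION_PLACEHOLDER", view))
```

The same accession yields a different report depending only on the two format parameters, which is the point the paragraph above makes about flatfiles being generated views.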
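An Entrez node is a collection of data that is grouped and indexed together. Every node shares some common routines and formats: the term lists and posting files (that is, the retrieval engine) used for Boolean queries, the links within and between nodes, and the summary format used to list search results, in which each record is called a DocSum. Each Entrez node is searched independently.
Links between nodes connect, for example, genome sequences with genome projects, sequences with literature, and nucleotide sequences with protein sequences. Links within a node connect, for example, sequences to one another by degree of similarity, and articles to one another by term-frequency statistics; the latter appear as Related Articles.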
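Links within and between nodes can be requested programmatically through the ELink E-utility. Here is a minimal sketch that only constructs the URLs. The function name elink_url and the UID 12345 are hypothetical placeholders of mine; the parameters dbfrom, db, and id are the ones ELink takes.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ELink endpoint.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(dbfrom, db, uid):
    """Build (but do not send) an ELink URL from one node to another."""
    return f"{ELINK}?{urlencode({'dbfrom': dbfrom, 'db': db, 'id': uid})}"


if __name__ == "__main__":
    # Between nodes: nucleotide record -> linked protein records.
    # The UID 12345 is a made-up placeholder, not a real record.
    print(elink_url("nuccore", "protein", "12345"))
    # Within a node: PubMed -> PubMed, the kind of link behind Related Articles.
    print(elink_url("pubmed", "pubmed", "12345"))
```

Setting dbfrom and db to the same database asks for links inside one node; different values ask for links across nodes.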
To learn eUtils, see https://www.ncbi.nlm.nih.gov/books/NBK25501/
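As a first eUtils exercise, an Entrez query can be turned into an ESearch request. The sketch below only builds the URL so that the encoding of the field brackets is visible; it does not contact NCBI. The helper name esearch_url and the default retmax of 20 are my own choices, and whether a given database name is still served by NCBI should be checked against the eUtils documentation.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for one Entrez node.

    urlencode percent-encodes the square brackets of field tags and
    turns spaces into '+', so the query can be passed as typed.
    """
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


if __name__ == "__main__":
    print(esearch_url("genome", "chromosome[replicon type]"))
```

Because each Entrez node is searched independently, the db parameter is mandatory in practice: the same term sent to two nodes is two separate searches.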
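The primary genome sequences of all species are archived in public repositories that provide reliable, free, and stable access to sequence information. NCBI offers a variety of genome biology tools and online resources, including group-specific and organism-specific pages that contain links to many related websites and databases.
Trace repositories hold the output of whole genome shotgun sequencing, that is, the raw sequencing data. A trace is a random short fragment. Examples: Trace Archive (capillary-based sequencing technology); Short Read Archive (parallel sequencing technology); GenBank, the primary sequence database.
The Entrez family of databases forms an integrated information system that links heterogeneous biomedical and bibliographic data. Below are three example Entrez databases, which hold information on genome projects, genome sequences, and the protein sequences encoded by complete microbial genomes.
Entrez Genome provides records and formats for the major taxonomic groups, precomputed data, and online tools that assist searching. It covers the genomes of viruses and organelles as well as complete genomes of bacteria and eukaryotes. Each entry in Genome represents a single replicon, such as a chromosome, organelle, or plasmid. Available tools include multiple alignment of complete viral genomes, GenePlot, TaxPlot, and gMap.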
Microbial genome sequencing has come a long way since the first H. influenzae project. As of February 2008, the public collection contains more than 600 complete genomes and close to 500 draft genome assemblies.
Query example: Find all the chromosomes of Haemophilus influenzae
Haemophilus influenzae [organism] AND chromosome[replicon type]
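The query above is two field-tagged terms joined by a Boolean AND. A small sketch of how such a query string could be assembled in code follows; fielded and all_of are hypothetical helper names of mine, not part of any NCBI library. The helper writes the tag directly after the term, without the space that appears before [organism] in the example.

```python
def fielded(term, field):
    """Attach an Entrez field tag to a search term, e.g. fungi[ORGN]."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Join clauses with Boolean AND so every clause must match."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    query = all_of(
        fielded("Haemophilus influenzae", "organism"),
        fielded("chromosome", "replicon type"),
    )
    print(query)
```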
“A project is defined as a collection of INSDC database records originating from a single organization, or from a consortium of coordinated organizations. The collective database records from a project make up a complete genome or metagenome and may contain genomic sequence, EST libraries and any other sequences that contribute to the assembly and annotation of the genome or metagenome. Projects group records either from single organism studies or from metagenomic studies comprising communities of organisms.”
As of January 2008, the Genome Project database contains 80 metagenomics projects.
Query examples
Find all complete fungal genome projects.
fungi[ORGN] AND complete[SEQSTAT]
Find all projects that correspond to pathogens that can infect humans.
human[HOST]
Find all metagenomic projects
type_environmental[All Fields]
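Each of these queries has the shape term[FIELD], optionally chained with AND. To make that structure explicit, here is a minimal sketch that splits a query into (term, field) pairs. It is my own simplification: it handles only AND between fielded clauses, not OR, NOT, parentheses, or untagged terms, and parse_query is a hypothetical name.

```python
import re

# One clause: some text, then a field tag in square brackets.
CLAUSE = re.compile(r"^\s*(?P<term>.+?)\s*\[(?P<field>[^\]]+)\]\s*$")


def parse_query(query):
    """Split 'a[F1] AND b[F2]' into [('a', 'F1'), ('b', 'F2')]."""
    pairs = []
    for clause in re.split(r"\s+AND\s+", query):
        match = CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"not a fielded clause: {clause!r}")
        pairs.append((match.group("term"), match.group("field")))
    return pairs


if __name__ == "__main__":
    for q in (
        "fungi[ORGN] AND complete[SEQSTAT]",
        "human[HOST]",
        "type_environmental[All Fields]",
    ):
        print(parse_query(q))
```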
As of January 2008, the Entrez Protein Clusters database contains 1.4 million proteins that compose 6,043 curated clusters and more than 200,000 automatic clusters.
Query examples
Retrieve all clusters containing the protein beta galactosidase:
beta galactosidase [Protein Name]
Find all clusters associated with Escherichia coli:
Escherichia coli[Organism]
https://www.ncbi.nlm.nih.gov/genome/gdv/
NCBI’s Genome Data Viewer – Getting Started (Oct 27, 2017)
https://www.youtube.com/watch?v=iPSq0VfU19c (the introduction is very brief; I still did not understand it)
Oh no, the notes I wrote for 1.3 were not saved... (2021-07-16)