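A URL-based PUG REST request is made up of four parts: a fixed prefix, the input, the operation, and the output.
Its general format is:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input specification>/<operation specification>/<output specification>
The input is in turn made up of three parameters:
<input specification> = <domain>/<namespace>/<identifiers>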
compound domain
substance domain
assay domain
Example (the input for the compound whose CID is 2244):
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/
Searching by a specific ID:
assay->aid
compound->cid
substance->sid
A comma-separated list of IDs can also be used, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight,CanonicalSMILES/CSV
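Since every request follows the same prefix/input/operation/output pattern, the URL can be assembled by a small helper. The following is only a sketch: the function name `pug_rest_url` and its parameters are made up for illustration and are not part of PubChem.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(domain, namespace, identifiers, operation=None, output=None, **options):
    """Join the input, operation, and output parts into one request URL.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Identifiers are joined with commas, which must stay unescaped in the path.
    ids = ",".join(str(i) for i in identifiers)
    parts = [BASE, domain, namespace, quote(ids, safe=",")]
    if operation:
        parts.append(operation)
    if output:
        parts.append(output)
    url = "/".join(parts)
    if options:
        # Optional settings go into the query string.
        url += "?" + urlencode(options)
    return url


url = pug_rest_url(
    "compound", "cid", [1, 2, 3, 4, 5],
    operation="property/MolecularFormula,MolecularWeight,CanonicalSMILES",
    output="CSV",
)
print(url)
```

Called this way, the helper reproduces the comma-separated request shown above.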
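Searching by name. A partial name can also be used, but in that case the search type must be narrowed to single-word matching with name_type=word, for example: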
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/myxalamid/cids/XML?name_type=word
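Names typed by a user may contain spaces or slashes, so it is safer to percent-encode them before placing them in the path. A minimal sketch, assuming a hypothetical helper `name_search_url` whose `partial` flag simply adds the name_type=word option:

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_search_url(name, partial=False, output="XML"):
    """Build a compound-by-name CID lookup URL (illustrative helper)."""
    # Encode everything, including "/" so a name cannot add path segments.
    url = f"{BASE}/compound/name/{quote(name, safe='')}/cids/{output}"
    if partial:
        # Partial names need single-word matching.
        url += "?" + urlencode({"name_type": "word"})
    return url


print(name_search_url("myxalamid", partial=True))
print(name_search_url("acetic acid", output="TXT"))
```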
Searching by a structure descriptor such as SDF, SMILES, or InChI, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCC/cids/TXT
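Structure search. The available search types are substructure, superstructure, similarity, and identity. Because the query is matched against the whole PubChem database of millions of molecules, it takes a long time, so the response contains a "ListKey" instead of the results, for example: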
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/C1CCCCCC1/XML
The "ListKey" is then used in a follow-up request to fetch the results, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/12345678910/cids/TXT
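The two-step flow can be wrapped in a polling loop. The sketch below assumes that a ListKey request which is still running answers with HTTP 202, as the status table in this document suggests. `poll_listkey` and the caller-supplied `fetch` function are hypothetical names, and the fake responses stand in for a real HTTP client.

```python
import time

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def listkey_url(listkey, operation="cids", output="TXT"):
    """URL that reads the results stored under a ListKey."""
    return f"{BASE}/compound/listkey/{listkey}/{operation}/{output}"


def poll_listkey(listkey, fetch, max_tries=10, delay=2.0):
    """Call fetch(url) until it stops answering 202, then return the body.

    fetch is supplied by the caller and must return (status_code, body).
    """
    url = listkey_url(listkey)
    for _ in range(max_tries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"request failed with HTTP {status}")
        time.sleep(delay)  # still pending, wait before asking again
    raise TimeoutError(f"ListKey {listkey} not ready after {max_tries} tries")


# Stand-in for a real HTTP call: pending twice, then placeholder CIDs.
responses = iter([(202, ""), (202, ""), (200, "1\n2\n3\n")])


def fake_fetch(url):
    return next(responses)


cids = poll_listkey("12345678910", fake_fetch, delay=0).split()
print(cids)
```

Passing `fetch` in as a parameter keeps the loop testable without network access; in real use it would wrap an HTTP library call.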
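Because the operation above is so slow, a fast structure search was developed later. It runs synchronously: no "ListKey" is returned, and a single call gives the results directly. It is used as follows: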
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/5793/cids/TXT?identity_type=same_connectivity
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/cid/2244/cids/XML?StripHydrogen=true
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/2244/property/MolecularWeight,MolecularFormula,RotatableBondCount/XML?Threshold=99
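The input can also be an identifier from another database rather than a PubChem identifier. The accepted cross-reference types are listed in the table below: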
Cross-reference | Meaning |
---|---|
RegistryID | external registry identifier |
RN | registry number |
PubMedID | NCBI PubMed identifier |
MMDBID | NCBI MMDB identifier |
DBURL | external database home page URL |
SBURL | external database substance URL |
ProteinGI | NCBI protein GI |
NucleotideGI | NCBI nucleotide GI |
TaxonomyID | NCBI taxonomy identifier |
MIMID | NCBI MIM identifier |
GeneID | NCBI gene identifier |
ProbeID | NCBI probe identifier |
PatentID | patent identifier |
SourceName | external depositor name |
SourceCategory | depositor category(ies) |
It is used as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/xref/PatentID/US20050159403A1/sids/JSON
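A typo in the cross-reference type only shows up as an error from the server, so it can help to check it locally first. This is a minimal sketch; `xref_url` is a hypothetical helper, and the set of names is copied from the cross-reference table in this document.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Cross-reference types taken from the table in this document.
XREF_TYPES = {
    "RegistryID", "RN", "PubMedID", "MMDBID", "DBURL", "SBURL",
    "ProteinGI", "NucleotideGI", "TaxonomyID", "MIMID", "GeneID",
    "ProbeID", "PatentID", "SourceName", "SourceCategory",
}


def xref_url(domain, xref_type, value, operation, output):
    """Build a lookup URL that uses an external identifier as input."""
    if xref_type not in XREF_TYPES:
        raise ValueError(f"unknown cross-reference type: {xref_type}")
    return f"{BASE}/{domain}/xref/{xref_type}/{quote(str(value), safe='')}/{operation}/{output}"


print(xref_url("substance", "PatentID", "US20050159403A1", "sids", "JSON"))
```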
(In other words, the operation selects which data to retrieve for the given input.)
compound domain
substance domain
assay domain
target_type = {ProteinGI, ProteinName, GeneID, GeneSymbol}
By default, when no operation is given, the full record for the input is returned. The applicable output formats are ASN.1 (NCBI's native format), XML, SDF, and JSON. For example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/XML
Records for several identifiers can be retrieved in one request (as a comma-separated list), for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/1,2,3,4,5/SDF
To get an image of a molecule, leave out the operation and request PNG output. Only one image can be returned per request. For example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lipitor/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCCC=O/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/RZJQGNCSTQAWON-UHFFFAOYSA-N/PNG
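Because each request returns a single image, fetching pictures for several compounds means one request per identifier. A sketch under that assumption, with hypothetical helper names `image_urls` and `save_images`:

```python
import os
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def image_urls(cids, **options):
    """Return one PNG URL per CID; options become the query string."""
    suffix = "?" + urlencode(options) if options else ""
    return [f"{BASE}/compound/cid/{cid}/PNG{suffix}" for cid in cids]


def save_images(cids, folder=".", **options):
    """Download each image to <folder>/<cid>.png (needs network access)."""
    for cid, url in zip(cids, image_urls(cids, **options)):
        with urlopen(url) as resp, open(os.path.join(folder, f"{cid}.png"), "wb") as out:
            out.write(resp.read())


for url in image_urls([2244, 5793], image_size="small"):
    print(url)
```

`save_images` is not called here so that the example runs offline.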
All compound properties are listed in the table below:
Property | Notes |
---|---|
MolecularFormula | Molecular formula. |
MolecularWeight | The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location. |
CanonicalSMILES | Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a "canonicalization" algorithm. |
IsomericSMILES | Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications. |
InChI | Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string. |
InChIKey | Hashed version of the full standard InChI, consisting of 27 characters. |
IUPACName | Chemical name systematically determined according to the IUPAC nomenclatures. |
XLogP | Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule. |
ExactMass | The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. |
MonoisotopicMass | The mass of a molecule, calculated using the mass of the most abundant isotope of each element. |
TPSA | Topological polar surface area, computed by the algorithm described in the paper by Ertl et al. |
Complexity | The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula. |
Charge | The total (or net) charge of a molecule. |
HBondDonorCount | Number of hydrogen-bond donors in the structure. |
HBondAcceptorCount | Number of hydrogen-bond acceptors in the structure. |
RotatableBondCount | Number of rotatable bonds. |
HeavyAtomCount | Number of non-hydrogen atoms. |
IsotopeAtomCount | Number of atoms with enriched isotope(s). |
AtomStereoCount | Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration]. |
DefinedAtomStereoCount | Number of atoms with defined tetrahedral (sp3) stereo. |
UndefinedAtomStereoCount | Number of atoms with undefined tetrahedral (sp3) stereo. |
BondStereoCount | Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration]. |
DefinedBondStereoCount | Number of bonds with defined planar (sp2) stereo. |
UndefinedBondStereoCount | Number of bonds with undefined planar (sp2) stereo. |
CovalentUnitCount | Number of covalently bound units. |
Volume3D | Analytic volume of the first diverse conformer (default conformer) for a compound. |
XStericQuadrupole3D | The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound. |
YStericQuadrupole3D | The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound. |
ZStericQuadrupole3D | The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound. |
FeatureCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D). |
FeatureAcceptorCount3D | Number of hydrogen-bond acceptors of a conformer. |
FeatureDonorCount3D | Number of hydrogen-bond donors of a conformer. |
FeatureAnionCount3D | Number of anionic centers (at pH 7) of a conformer. |
FeatureCationCount3D | Number of cationic centers (at pH 7) of a conformer. |
FeatureRingCount3D | Number of rings of a conformer. |
FeatureHydrophobeCount3D | Number of hydrophobes of a conformer. |
ConformerModelRMSD3D | Conformer sampling RMSD in Å. |
EffectiveRotorCount3D | Effective rotor count of a compound, a measure of its conformational flexibility. |
ConformerCount3D | The number of conformers in the conformer model for a compound. |
Fingerprint2D | Base64-encoded PubChem Substructure Fingerprint of a molecule. |
Usage examples:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/IBM/5F1CA2B314D35F28C7F94168627B29E3/ASNT
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF?record_type=3d
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG?record_type=3d&image_size=small
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BPGDAMSIGCZZLK-UHFFFAOYSA-N/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/XML
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/CSV?sid=26736081,26736082,26736083
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/concise/CSV
Likewise, several properties can be retrieved in one request, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularWeight,MolecularFormula,HBondDonorCount,HBondAcceptorCount,InChIKey,InChI/CSV
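CSV output from a property request is convenient to load into a dictionary keyed by CID. The sketch below assumes the response has a header row with a `CID` column followed by the requested property names. The sample text is hand-written in that assumed layout (aspirin, CID 2244) and is not copied from a real response.

```python
import csv
import io


def parse_property_csv(text):
    """Map each CID to its row of property values (all values stay strings)."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["CID"]): row for row in reader}


# Hand-written sample in the assumed layout, not real server output.
sample = '"CID","MolecularFormula","MolecularWeight"\n2244,"C9H8O4",180.16\n'

rows = parse_property_csv(sample)
print(rows[2244]["MolecularFormula"], float(rows[2244]["MolecularWeight"]))
```

Numeric columns are left as strings on purpose; the caller decides which ones to convert.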
Getting all names (synonyms) of a compound or substance:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/synonyms/XML
Retrieving cross-references to other databases. The available cross-reference types are the ones listed earlier:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/xrefs/MMDBID/XML
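In PubChem an assay consists of two parts: the Assay Description and the Assay Data.
The former includes the authorship, a general description, the protocol, and the definitions of the data readout columns.
The latter contains the data measured in the assay.
To retrieve only the descriptive part of an assay, use the description operation.
Its valid output formats are XML, JSON(P), and ASNT/B, for example: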
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/description/XML
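There is also a summary form. Its description is less detailed than the one above, but it includes the associated targets and counts of active and inactive SIDs and CIDs, for example: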
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/summary/JSON
Assay Data: the data of an assay can be retrieved as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/CSV
An assay can involve many SIDs (substances). To get results for specific substances only, name them explicitly.
First, look up which SIDs the assay involves:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/TXT
Then specify the SIDs of interest:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/XML?sid=104169547,109967232
Getting information about the targets of an assay:
Valid output formats are XML, JSON(P), ASNT/B, and TXT. The valid target types are listed in the table below:
Target Type | Notes |
---|---|
ProteinGI | NCBI GI of a protein sequence |
ProteinName | protein name |
GeneID | NCBI Gene database identifier |
GeneSymbol | gene symbol |
Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/490,1000/targets/ProteinGI,ProteinName,GeneID,GeneSymbol/XML
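Both the AIDs and the target types are comma-separated lists, so the request can be generated from Python sequences. A minimal sketch with a hypothetical helper `targets_url` that rejects target types not listed in the table above:

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Target types from the table in this document.
TARGET_TYPES = ("ProteinGI", "ProteinName", "GeneID", "GeneSymbol")


def targets_url(aids, target_types=TARGET_TYPES, output="XML"):
    """Build the targets request for one or more assays."""
    unknown = [t for t in target_types if t not in TARGET_TYPES]
    if unknown:
        raise ValueError(f"unknown target type(s): {', '.join(unknown)}")
    aid_list = ",".join(str(a) for a in aids)
    return f"{BASE}/assay/aid/{aid_list}/targets/{','.join(target_types)}/{output}"


print(targets_url([490, 1000]))
print(targets_url([1000], target_types=["GeneSymbol"], output="TXT"))
```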
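Not every assay has a defined protein or gene target.
Conversely, assays can be looked up by a specific target, for example to find the assays that were run against the gene target USP2: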
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/USP2/aids/TXT
Searching by the name of an activity value measured in the assay, for example to find all assays that report an EC50:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON
Getting the assay summary for a compound or substance:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/104234342/assaysummary/XML
Getting dose-response results for substances or compounds:
Dose-response data for at most 1000 SIDs is returned per AID (assay). Valid output formats are XML, JSON(P), ASNT/B, and CSV:
Option | Allowed Values | Meaning |
---|---|---|
sid | listkey, or comma-separated integers | SID rows to retrieve for an assay |
listkey | valid SID listkey | listkey containing SIDs, if using sid=listkey |
Examples:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/XML
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/CSV?sid=104169547,109967232
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/XML (with “aid=504526&sid=104169547,109967232” in the POST body)
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/sids/XML?sids_type=doseresponse&list_return=listkey
followed by
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/doseresponse/CSV?sid=listkey&listkey=xxxxxx&listkey_count=100 (where ‘xxxxxx’ is the listkey returned by the previous URL)
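Given the limit of 1000 SIDs, a longer SID list has to be split into batches, and the POST form keeps long lists out of the URL. The sketch below only prepares the (URL, body) pairs in the POST style shown in the examples above. `doseresponse_posts` is a hypothetical helper, and sending the requests is left to whichever HTTP client is in use.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
MAX_SIDS = 1000  # limit stated in this document


def doseresponse_posts(aid, sids, output="CSV"):
    """Yield (url, post_body) pairs, each covering at most MAX_SIDS SIDs."""
    url = f"{BASE}/assay/aid/doseresponse/{output}"
    for start in range(0, len(sids), MAX_SIDS):
        batch = sids[start:start + MAX_SIDS]
        # safe="," keeps the commas between SIDs unescaped in the body.
        body = urlencode({"aid": aid, "sid": ",".join(str(s) for s in batch)}, safe=",")
        yield url, body


for url, body in doseresponse_posts(504526, [104169547, 109967232], output="XML"):
    print(url)
    print(body)
```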
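Choosing the output format:
The "ListKey" is not limited to structure searches. When the output is a large list of identifiers, it can be kept on the server through an operation option (list_return=listkey) instead of being returned directly, as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/XML?list_return=listkey
The returned "ListKey" is 1757602293094779987 (AID 640 is a very large assay that involves many substances (SIDs)).
The stored list can then be read back through the "ListKey", with the listkey_start and listkey_count options limiting how much is returned so that the request does not take too long.
Using the "ListKey" above: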
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/CSV?sid=listkey&listkey=1757602293094779987&listkey_start=0&listkey_count=1000
This returns 1000 rows of data.
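Reading a large stored list page by page means stepping listkey_start forward by listkey_count. A sketch, assuming the caller already knows the total number of SIDs behind the ListKey (`total` and the helper name `listkey_pages` are illustrative):

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def listkey_pages(aid, listkey, total, page_size=1000, output="CSV"):
    """Yield one assay-data URL per page of SIDs stored under the ListKey."""
    for start in range(0, total, page_size):
        query = urlencode({
            "sid": "listkey",
            "listkey": listkey,
            "listkey_start": start,
            "listkey_count": page_size,
        })
        yield f"{BASE}/assay/aid/{aid}/{output}?{query}"


# total=2500 is a made-up figure used only to show three pages.
pages = list(listkey_pages(640, "1757602293094779987", total=2500))
for page in pages:
    print(page)
```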
As long as listkey_start and listkey_count are set sensibly, a "ListKey" can even serve as the input itself. Using the "ListKey" above as an example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/listkey/1757602293094779987/synonyms/XML?&listkey_start=0&listkey_count=10
In summary, the input, operation, and output choices can be combined in many ways to retrieve specific data from PubChem, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularFormula/JSON
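All of the above are examples of retrieving information through the URL-based API.
invalid input: the input is not valid
nothing was found for the given query: no record matches the given input
the request was too broad and took too long to complete: the result set was too large, so the search could not finish in time (PUG REST limits each web service request to 30 seconds)
A specific status code is returned. The codes and their meanings are listed in the table below: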
HTTP Status | Error Code | General Error Category |
---|---|---|
200 | (none) | Success |
202 | (none) | Accepted (asynchronous operation pending) |
400 | PUGREST.BadRequest | Request is improperly formed (syntax error in the URL, POST body, etc.) |
404 | PUGREST.NotFound | The input record was not found (e.g. invalid CID) |
405 | PUGREST.NotAllowed | Request not allowed (such as invalid MIME type in the HTTP Accept header) |
504 | PUGREST.Timeout | The request timed out, from server overload or too broad a request |
503 | PUGREST.ServerBusy | Too many requests or server is busy, retry later |
501 | PUGREST.Unimplemented | The requested operation has not (yet) been implemented by the server |
500 | PUGREST.ServerError | Some problem on the server side (such as a database server down, etc.) |
500 | PUGREST.Unknown | An unknown error occurred |
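A client can turn the table into a lookup and decide what to do next from the HTTP status alone. The policy below (poll on 202, retry on 503, give up otherwise) is an assumption based on the wording of the table, not documented PubChem behavior, and `next_action` is a hypothetical name.

```python
# HTTP status -> error code(s), copied from the table above.
STATUS_CODES = {
    200: "(none)",
    202: "(none)",
    400: "PUGREST.BadRequest",
    404: "PUGREST.NotFound",
    405: "PUGREST.NotAllowed",
    500: "PUGREST.ServerError or PUGREST.Unknown",
    501: "PUGREST.Unimplemented",
    503: "PUGREST.ServerBusy",
    504: "PUGREST.Timeout",
}


def next_action(status):
    """Suggest what a client might do after receiving this HTTP status."""
    if status == 200:
        return "done"
    if status == 202:
        return "poll"   # asynchronous operation still pending
    if status == 503:
        return "retry"  # the table says to retry later
    # Everything else, including 504, is treated as final here because
    # repeating an overly broad request is likely to time out again.
    return "fail"


for code in (200, 202, 503, 504):
    print(code, STATUS_CODES[code], next_action(code))
```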