Welcome to TiddlyWiki created by Jeremy Ruston, Copyright © 2007 UnaMesa Association
''__Schedule of Lectures__''
*[[Lecture 01]] on Tuesday, January 15
*[[Lecture 02]] on Thursday, January 17
*[[Lecture 03]] on Tuesday, January 22
*[[Lecture 04]] on Thursday, January 24
*[[Lecture 05]] on Tuesday, January 29
*[[Lecture 06]] on Thursday, January 31
*[[Lecture 07]] on Tuesday, February 5
*[[Lecture 08]] on Thursday, February 7
*[[Lecture 09]] on Tuesday, February 12
*[[Lecture 10]] on Thursday, February 14
*[[Lecture 11]] on Tuesday, February 19
*[[Lecture 12]] on Thursday, February 21
*[[Lecture 13]] on Tuesday, February 26
*[[Lecture 14]] on Thursday, February 28
*[[Lecture 15]] on Tuesday, March 4 ''Exam 1 (in class)''
*[[Lecture 16]] on Thursday, March 6
*no lectures during Spring Break
*[[Lecture 17]] on Tuesday, March 18
*[[Lecture 18]] on Thursday, March 20
*[[Lecture 19]] on Tuesday, March 25
*[[Lecture 20]] on Thursday, March 27
*[[Lecture 21]] on Tuesday, April 1
*[[Lecture 22]] on Thursday, April 3
*[[Lecture 23]] on Tuesday, April 8 ''no lecture due to ACS conflict''
*[[Lecture 24]] on Thursday, April 10
*[[Lecture 25]] on Tuesday, April 15
*[[Lecture 26]] on Thursday, April 17
*[[Lecture 27]] on Tuesday, April 22
*[[Lecture 28]] on Thursday, April 24
*[[Final Exam]] on Thursday, May 1
This is an abbreviation used frequently in the drug discovery field or pharmacokinetics for ''A''bsorption, ''D''istribution, ''M''etabolism, and ''E''xcretion. Sometimes, ''Tox''icology is added to that for ADME/Tox or ADMET. These criteria are very important when considering the success or failure of a potential drug compound in the drug discovery process.
''Absorption'' of a drug molecule is what has to happen after you take a drug, whether by pill, injection, patch, or inhalation. Once in the system, it has to be taken up by the body (into cells, for example), or it must be able to cross the blood-brain barrier.
''Distribution'' is the ability of the drug to get where it needs to go once it's absorbed - most commonly via the bloodstream.
''Metabolism'' of the drug can start almost immediately, depending upon the type of chemical and place of administration. Most metabolism is carried out in the liver by enzymes important in redox reactions, namely cytochrome P450 enzymes. The parent compound is converted to daughter or metabolite compounds; these metabolites can range from inert to compounds even more powerful than the parent drug.
''Excretion'' (sometimes ''Elimination'') is the final stage of the process, where the body gets rid of the compound and/or its metabolites. Mainly, you lose drugs through the urine (kidney) or feces (biliary), but sometimes also through the lungs, as with certain inhalants or anesthetic gases.
''Toxicology'' is important as well, because you don't want the drug or any of its metabolites to be toxic to the body at any of these four stages!
The following describes a three-layer, fully-connected, feed-forward computational neural network.
[img[CNN|images/cnn.jpg]]
The input layer consists of as many neurons as there are model [[descriptors|Structural Descriptor]]. Input values are first scaled to roughly the 0-1 range (0.05-0.95) to avoid "blowing up" the results during the non-linear transformation. Each input neuron value is multiplied by a weight (randomly initialized and optimized throughout training) and passed to every hidden layer neuron (the number of hidden neurons is determined experimentally). At each hidden layer neuron, the weighted terms are summed and a bias term is added (again, the initial bias is randomly assigned and optimized throughout training). The resulting sum is then passed through the non-linear transformation (seen in the bottom half of the enlarged neuron). This value is sent to the output layer neuron (the predicted value of interest), again weighted and biased. The output layer value is then transformed back to the original range and compared with the actual value. A [[BFGS]] algorithm is used to minimize the cost (the error between predicted and actual values) with respect to the weights and biases; these are adjusted and the whole process repeats until the predicted value is as close to the actual value as possible.
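The forward pass described above can be sketched in a few lines of Python. This is only a minimal illustration, not the course software: the logistic (sigmoid) function is an assumed choice of non-linear transformation, applying it at the output neuron as well is also an assumption, and the descriptor values, ranges, and the names {{{scale}}}, {{{unscale}}}, and {{{forward}}} are made up for the example. Training (the BFGS step) is not shown.

```python
import math
import random


def scale(value, lo, hi, new_lo=0.05, new_hi=0.95):
    """Linearly map value from [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


def unscale(value, lo, hi, new_lo=0.05, new_hi=0.95):
    """Inverse of scale(): map a network output back to the original range."""
    return lo + (value - new_lo) * (hi - lo) / (new_hi - new_lo)


def sigmoid(x):
    """Assumed non-linear transformation (logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One pass through a fully-connected input -> hidden -> output network."""
    # Each hidden neuron: weighted sum of all inputs, plus bias, then sigmoid.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output neuron: weighted sum of hidden values, plus bias, then sigmoid
    # (the sigmoid here is an assumption; it keeps the output inside 0-1).
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Randomly initialized weights and biases (these are what training would adjust).
random.seed(0)
n_inputs, n_hidden = 3, 2
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [random.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [random.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = random.uniform(-1, 1)

# Hypothetical descriptor values, each assumed to lie in the range 0-10.
descriptors = [scale(v, 0.0, 10.0) for v in (2.0, 5.0, 9.0)]
scaled_prediction = forward(descriptors, hidden_weights, hidden_biases,
                            output_weights, output_bias)
# Hypothetical property range of 50-250 for the value being predicted.
prediction = unscale(scaled_prediction, 50.0, 250.0)
```

With untrained random weights the prediction is meaningless; the point is only the order of operations: scale, weight, sum, add bias, transform, and scale back.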
A range of values surrounding the mean (sample mean) which is assumed to contain the population mean with a particular probability. Depending on the scope of the data being studied, one might choose the 95% or 99% confidence interval.
The upper and lower values that bound a confidence interval are called the ''confidence limits''.
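As a rough illustration, the confidence limits around a sample mean can be computed as below. This sketch assumes a normal approximation (z = 1.96 for about 95%, z = 2.576 for about 99%); for small samples a t-distribution value would be more appropriate. The data values and the function name {{{confidence_interval}}} are invented for the example.

```python
import math
import statistics


def confidence_interval(sample, z=1.96):
    """Return (lower, upper) confidence limits around the sample mean.

    Uses a normal approximation: mean +/- z * (standard error of the mean).
    """
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_err, mean + z * std_err


# Made-up measurements, for illustration only.
measurements = [120.5, 118.2, 121.7, 119.9, 120.1, 122.3, 117.8, 120.6]
lower_95, upper_95 = confidence_interval(measurements)
lower_99, upper_99 = confidence_interval(measurements, z=2.576)
```

The 99% interval is always wider than the 95% interval for the same data, since a larger range is needed to contain the population mean with higher probability.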
A tabular representation of correlation coefficients of multiple variables.
[img[correlation matrix|images/corrmatx.jpg]]
The diagonal elements all have r = 1.00, as the values are perfectly correlated with themselves, and upper and lower triangles are symmetric. If total variables = k, the total cells = k^^2^^. Number of non-diagonal cells = k^^2^^ - k. Unique non-diagonal cells = (k^^2^^-k)/2.
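The cell counts above can be checked with a small sketch that builds a correlation matrix from the Pearson correlation coefficient. The three variables below are made-up numbers, and {{{pearson_r}}} and {{{correlation_matrix}}} are illustrative names only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """k x k table of r values for k variables (one list per variable)."""
    return [[pearson_r(a, b) for b in columns] for a in columns]


# Three made-up variables, five observations each.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
matrix = correlation_matrix(columns)

k = len(columns)
total_cells = k ** 2                        # 9
off_diagonal_cells = k ** 2 - k             # 6
unique_off_diagonal = (k ** 2 - k) // 2     # 3
```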
<div class='title' macro='view title'></div>
<div class='toolbar' macro='toolbar +saveTiddler -cancelTiddler deleteTiddler'></div>
<div class='editor' macro='edit title'></div>
<div class='editor' macro='edit text'></div>
<div class='editor' macro='edit tags'></div><div class='editorFooter'><span macro='message views.editor.tagPrompt'></span><span macro='tagChooser'></span></div>
''Electronic descriptors'', as with geometric descriptors, depend upon a reliable, low-energy three-dimensional geometry. Single-atom energy values are based upon their neighbor interactions, so even a slight tweak in geometry can alter these values. These values, such as electronegativities and dipoles, give us an idea of the electronic environment of the whole molecule.
Emacs is a good editor for files in the Linux system. There are other editors available, which you can read about elsewhere (see Linux sites in [[External Resources]]). The following is a list of common commands that should let you do what you need to do for a QSAR study.
Commands in the editor are usually given by typing specific keys in combination with the Control (''Ctrl'') key or the Meta key (''Alt'' or ''Esc''). For this guide, commands will be written with the following short notation:
*For Control key commands: ''C-x'' means //hold down the Ctrl key while pressing x//
*For Meta key commands: ''M-x'' means //hold down the Alt key while pressing x//, or //press and release the Esc key, then press x//
*Some commands require two keystrokes in sequence. The notation ''C-x C-c'' means that while holding down the Ctrl key, //type x followed by c//. Similarly, ''C-x u'' means hold the Ctrl key while pressing x, //release the Ctrl key//, then press 'u'.
|File Manipulation|c
|Command|Keystrokes|Description|h
|find-file|C-x C-f|after issuing this command, type in the first few letters of a filename and press ''Tab''|
|save-file|C-x C-s|saves changes under current file name; does not close editing buffer|
|save-as-file|C-x C-w|(aka write-file); saves information in new file name (prompted)|
|exit-file|C-x C-c|exits Emacs; prompts you to save any unsaved changes first|
|Buffer Navigation|c
|Command|Keystrokes|Description|h
|cursor forward|C-f|similar to →|
|cursor backward|C-b|similar to ←|
|cursor previous-line|C-p|similar to ↑|
|cursor next-line|C-n|similar to ↓|
|cursor forward-one-word|M-f|moves cursor forward to the end of the current or next word|
|cursor backward-one-word|M-b|moves cursor back to the start of the current or previous word|
|cursor beginning-of-line|C-a|similar to Home|
|cursor end-of-line|C-e|similar to End|
|cursor back-one-sentence|M-a|moves back to the beginning of the sentence|
|cursor forward-one-sentence|M-e|moves forward to the end of the sentence|
|cursor forward-one-paragraph|M-}|moves forward to the end of the paragraph|
|cursor backward-one-paragraph|M-{|moves back to the beginning of the paragraph|
|cursor forward-one-screen|C-v|similar to Page Down|
|cursor backward-one-screen|M-v|similar to Page Up|
|cursor to line-number|M-x, goto-line|after command, type Enter; enter line number; type Enter|
|cursor move n-lines|M-n command|e.g., to move 500 lines forward: M-500, C-n|
|cursor beginning-of-buffer|M-<|moves to beginning of first line in buffer|
|cursor end-of-buffer|M->|moves to the last line of the buffer|
|Buffer Editing|c
|Command|Keystrokes|Description|h
|undo-command|C-x u|undoes last command; can be repeated|
|delete-character|C-d|erases one character at a time; similar to Del|
|delete-to-end-of-word|M-d|erases all characters from cursor to the end of a word|
|delete-to-end-of-line|C-k|erases all characters from cursor to the end of the line|
|restore-deletion (yank)|C-y|restores the text most recently erased with M-d or C-k (single characters erased with C-d are not saved)|
|cancel-command|C-g|clears buffer command line when you mistype a command (before execution)|
The Euclidean distance is the straight-line (shortest) distance between two points in //n//-dimensional space. For most applications here, this means the x,y,z Cartesian coordinate space, where the distance between two points (atoms in a molecule, for example) is given by:
d^^2^^ = (x~~2~~ - x~~1~~)^^2^^ + (y~~2~~ - y~~1~~)^^2^^ + (z~~2~~ - z~~1~~)^^2^^
For any number of dimensions, this simply expands to the sum of the squared differences between the two points along each dimension; the distance is the square root of that sum.
For a more complete explanation, check out Wolfram ~MathWorld's page [[here|http://mathworld.wolfram.com/Distance.html]]
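A minimal sketch of the formula in Python, written so that it works for any number of dimensions. The coordinates and the function name {{{euclidean_distance}}} are made up for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences along each dimension, then take the root.
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


# Two hypothetical atoms in x,y,z Cartesian space.
atom_1 = (0.0, 0.0, 0.0)
atom_2 = (1.0, 2.0, 2.0)
d = euclidean_distance(atom_1, atom_2)  # sqrt(1 + 4 + 4) = 3.0
```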
Here is a list (ever growing!) of external links that you may find interesting, helpful, or downright crucial to your enjoyment & success in ~CHEM-681.
__''Computers & Programming''__
*[[www.python.org | http://www.python.org]] - Homepage of the Python programming language
*[[PuTTY: A Free Telnet/SSH Client|http://www.chiark.greenend.org.uk/~sgtatham/putty/]] - see also PuTTY
*[[Linux Online|http://www.linux.org/]] - a good resource for most things Linux
*[[Linux Knowledge Base|http://www.linux-tutorial.info/]] - repository of info and tutorials on the Linux operating system
__''Journals''__
*[[Journal of Chemical Information and Modeling| http://pubs.acs.org/journals/jcisd8/index.html]] - an [[ACS| http://www.chemistry.org]] publication
*[[Journal of Medicinal Chemistry| http://pubs.acs.org/journals/jmcmar/index.html]] - an [[ACS| http://www.chemistry.org]] publication
*[[Chemical Research in Toxicology| http://pubs.acs.org/journals/crtoec/index.html]] - an [[ACS| http://www.chemistry.org]] publication
*[[Journal of Mathematical Chemistry| http://springerlink.metapress.com/content/101749/]] - a Springer publication
*[[Journal of Molecular Graphics and Modelling| http://www.sciencedirect.com/science/journal/10933263]] - an [[Elsevier| http://www.elsevier.com]] publication
*[[QSAR & Combinatorial Science| http://www3.interscience.wiley.com/cgi-bin/jhome/104557877?CRETRY=1&SRETRY=0]] - a Wiley Interscience publication
*[[Journal of Computer-Aided Molecular Design|http://www.springerlink.com/content/102928/]] - a Springer publication
*[[SAR and QSAR in Environmental Research| http://www.informaworld.com/smpp/title~content=t716100694]] - a Taylor & Francis publication
__''Organized Groups''__
*[[The Cheminformatics and QSAR Society| http://www.ndsu.nodak.edu/qsar_soc/]]
*[[CINF|http://www.acscinf.org/]] - the Chemical Information Division of the [[ACS|http://www.chemistry.org]]
*[[COMP|http://membership.acs.org/C/Comp/newsletters/index.html]] - the Computers in Chemistry Division of the [[ACS|http://www.chemistry.org]]
*[[Cheminformatics.org|http://www.cheminformatics.org/]] - links to cheminformatics programs and other resources
__''Math & Statistics''__
*[[MathWorld|http://mathworld.wolfram.com/]] by Wolfram Research
*[[StatSoft Electronic Textbook|http://www.statsoft.com/textbook/stathome.html]] by ~StatSoft Inc.
Formatting of foreign letters. Click 'view' to see how to display special letters.
|''&_grave;''|''&_acute;''|''&_circ;''|''&_uml;''|''&_tilde;''|''&_ring;''|''&_slash;''|''&_cedil;''|
| À | Á | Â | Ä | Ã | Å | Ø | Ç |
| à | á | â | ä | ã | å | ø | ç |
| È | É | Ê | Ë | Õ | | | |
| è | é | ê | ë | õ | | | |
| Ì | Í | Î | Ï | Ñ | | | |
| ì | í | î | ï | ñ | | | |
| Ò | Ó | Ô | Ö | | | | |
| ò | ó | ô | ö | | | | |
| Ù | Ú | Û | Ü | | | | |
| ù | ú | û | ü | | | | |
| | Ý | | Ÿ | | | | |
| | ý | | ÿ | | | | |
''Geometric descriptors'' relay information about the shape and size of a molecule, and therefore depend upon a reliable representation of the molecule in three-dimensional space. Finding a reliable geometry is often a point of contention between different groups, because geometric (and electronic and hybridized) descriptor values can differ within the same molecule depending upon how the geometry was calculated. There are a host of applications available to calculate molecule conformations. Examples of geometric descriptors include solvent accessible surface areas and volume.
[[ADME]] (or [[ADME/Tox|ADME]])
[[ANOVA|Analysis of Variance]] or [[Analysis of Variance]]
[[Bayesian Neural Network]]
[[Computational Neural Network]] (or [[CNN|Computational Neural Network]])
[[Confidence Interval]]
[[Degrees of Freedom]]
[[Dependent Variable]]
[[False Negative]]
[[False Positive]]
[[Heteroscedastic]]
[[Homoscedastic]]
[[Independent Variable]]
[[Inductive Learning|Machine Learning]]
[[k-Nearest Neighbor]] (or [[kNN|k-Nearest Neighbor]])
[[Linear Discriminant Analysis]] (or [[LDA|Linear Discriminant Analysis]])
[[Machine Learning]]
[[Multiple Linear Regression]] (or [[MLR|Multiple Linear Regression]])
[[Null Hypothesis]]
[[Quantitative Structure-Activity Relationship|QSAR]] (or [[QSAR]])
[[Structural Descriptor]] (or [[Molecular Descriptor|Structural Descriptor]])
[[Supervised Learning]]
[[Support Vector Machine]] (or [[SVM|Support Vector Machine]])
[[Training Set]]
[[Unsupervised Learning]]
|List of Greek letters|c
|''letter''|''upper''|''lower''||''letter''|''upper''|''lower''| |''letter''|''upper''|''lower''| |''letter''|''upper''|''lower''|h
|alpha| Α | α |bgcolor(#ffffaf):|eta| Η | η |bgcolor(#ffffaf):|nu| Ν | ν |bgcolor(#ffffaf):|tau| Τ | τ |
|beta| Β | β |bgcolor(#ffffaf):|theta| Θ | θ |bgcolor(#ffffaf):|xi| Ξ | ξ |bgcolor(#ffffaf):|upsilon| Υ | υ |
|gamma| Γ | γ |bgcolor(#ffffaf):|iota| Ι | ι |bgcolor(#ffffaf):|omicron| Ο | ο |bgcolor(#ffffaf):|phi| Φ | φ |
|delta| Δ | δ |bgcolor(#ffffaf):|kappa| Κ | κ |bgcolor(#ffffaf):|pi| Π | π |bgcolor(#ffffaf):|chi| Χ | χ |
|epsilon | Ε | ε |bgcolor(#ffffaf):|lambda| Λ | λ |bgcolor(#ffffaf):|rho| Ρ | ρ |bgcolor(#ffffaf):|psi| Ψ | ψ |
|zeta| Ζ | ζ |bgcolor(#ffffaf):|mu| Μ | μ |bgcolor(#ffffaf):|sigma| Σ | σ |bgcolor(#ffffaf):|omega| Ω | ω |
''Hybridized descriptors'' combine two or more of the previous three descriptor types. They provide quantitative values such as charged partial surface areas and hydrogen bonding information, and are quite useful in QSAR models.
!!!Tuesday, January 15
*Introduction to the course
*Review of course [[syllabus|docs/syllabus_CHEM-681.pdf]]
*Review of course [[website-wiki|http://nsm1.nsm.iup.edu/nate/courses/chem681]]
*Outline of topics to be covered this term
*Some historical background of the field
**Discussed work by Hammett (articles 8, 9 on [[Reading List]])
**Discussed work by Hansch (articles 10, 11 on [[Reading List]])
*Continuation of historical notes: Wiener and chemical descriptors (#12, 13 on [[Reading List]])
*Started in-depth outline of cheminformatics (see #6 on [[Reading List]]), including:
**Representation of chemical compounds
**Line notations
**Graph notations & matrices
**Connection tables
*Lecture notes can be found [[here|docs/lect02.pdf]] as a PDF file (right click to save or click to open).
* Continuation, completion of in-depth outline of cheminformatics (see #6 on [[Reading List]]), including:
**more on how to represent molecules
**compound libraries and searches
**three-dimensional structure considerations & surfaces
**a few words about applications
*Lecture notes can be found [[here|docs/lect03.pdf]] as a PDF file (right click to save or click to open).
*Met in the computer room, Weyandt 144, to get everyone set up on Saison
*Went over handout on using Saison including:
**logging in
**basic navigation
**basic file and directory manipulation
*There are no lecture notes for this meeting
*Introduction to basic statistics that are needed for chemometrics & chemoinformatics.
*Lecture notes can be found [[here|docs/lect05.pdf]] as a PDF file (right click to save or click to open).
*Lecture was canceled due to illness.
*Met in the computer room, Weyandt 144
*Went over editing files using Emacs
*Went over the "Hello World" program in Python
*Looked at some command line computer programs
*There are no lecture notes for this meeting
*Covered material on how compounds are described, including:
**IUPAC, common, and systematic names
**[[SMILES]] notation (see ref #14 in [[Reading List]])
**adjacency matrices
*Lecture notes can be found [[here|docs/lect08.pdf]] as a PDF file (right click to save or click to open).
There was no lecture on February 12. All IUP classes and activities were canceled after 2pm because of inclement weather.
*Continued our discussion on how molecules are represented, including:
**[[adjacency|Adjacency Matrix]], [[distance|Distance Matrix]], and [[Bond-Electron(BE)|Bond-Electron Matrix]] matrices
**[[connection tables|Connection Table]]
*Covered fragment based representation and [[fingerprints|Fingerprints]].
*Lecture notes can be found [[here|docs/lect10.pdf]] as a PDF file (right click to save or click to open).
*Set Exam 1 date for Tuesday, March 4; a [[study guide|Exam 1 Study Guide]] will be posted before that time
*Continued our discussion on how molecules are represented, including:
**[[hash codes|Hashcode]]
**comparison of advantages/disadvantages of the different methods
*Discussed how stereochemistry is portrayed in SMILES and in sketch
*Discussed how 3-D structural information is stored, including [[z matrices|Z Matrix]]
*Discussed different types and importance of [[surface area|Surface Area]]
*Lecture notes can be found [[here|docs/lect11.pdf]] as a PDF file (right click to save or click to open).
*Slides on surface area can be found [[here|docs/lect11a.pdf]] as a PDF file (right click to save or click to open).
*Discussed an overview of [[QSAR]] and how data, knowledge, and structure are intertwined
*Discussed some data pre-processing steps
*Lecture notes can be found [[here|docs/lect12.pdf]] as a PDF file (right click to save or click to open).
*Went over materials that are 'fair game' for the exam next Tuesday
*Talked more about pre-processing of data and splitting the dataset into subsets
*Lecture notes can be found [[here|docs/lect13.pdf]] as a PDF file (right click to save or click to open).
*This class met in the computer room, Weyandt 144
*Each student was assigned a list of compounds from the [[Bergstrom dataset of melting points|docs/bergstrom.xls]]
*Students worked on finding structures of their list in order to sketch later
*Exam 1 was given in class; there are no lecture notes.
*Started the discussion on molecular descriptors including:
**physicochemical descriptors such as logP
**branching descriptors
**molecular connectivity descriptors
*Lecture notes can be found [[here|docs/lect16.pdf]] as a PDF file (right click to save or click to open).
*Continued the discussion on molecular descriptors, including:
**Kier and Hall's [[Electrotopological State Indices]]
**summary of advantages/disadvantages of [[topological descriptors|Topological Descriptor]]
**geometry optimization of structures
**[[geometric descriptors|Geometric Descriptor]] like shadow areas, moments of inertia, and [[Euclidean Distance]]
*Lecture notes can be found [[here|docs/lect17.pdf]] as a PDF file (right click to save or click to open)
*Continued the discussion on molecular descriptors, including:
**molecular surface and volume
**CPSA descriptors
*Lecture notes can be found [[here|docs/lect18.pdf]] as a PDF file (right click to save or click to open)
*We met in Weyandt 144 computer lab to start work on the collated Bergstrom data set.
*No lecture notes.
Lecture canceled because of an instructor scheduling conflict.
*Began the discussion on machine learning, including:
**supervised and unsupervised learning
**decision trees
**confusion matrix
*Lecture notes can be found [[here|docs/lect21.pdf]] as a PDF file (right click to save or click to open)
*Continued our discussion on machine learning, including:
**k-Nearest Neighbor
**Linear Discriminant Analysis
**Pearson correlation coefficient
**Multiple Linear Regression Analysis
*Lecture notes can be found [[here|docs/lect22.pdf]] as a PDF file (right click to save or click to open)
There was no lecture held on this day. Dr. ~McElroy was presenting research at the 235th American Chemical Society Meeting in New Orleans, LA.
*Continued our discussion on machine learning methods, including:
**Root Mean Square Error (RMSE)
**Genetic Algorithm
**Simulated Annealing
**MLRA statistics (T-values, F-statistic, outlier tests)
*Lecture notes can be found [[here|docs/lect24.pdf]] as a PDF file (right click to save or click to open)
*Continued our discussion on machine learning, including:
**computational neural network
**Bayesian neural network
*Lecture notes can be found [[here|docs/lect25.pdf]] as a PDF file (right click to save or click to open)
In order to successfully navigate through your work area, manipulate files, and run programs on the Linux workstation, you must be familiar with some basic commands and conventions.
For all of the workstation programs and file manipulations, you will be working in a terminal environment. That is, rather than a bunch of fancy icons or the typical Windows-type interface, you will be entering commands at a ~command-line prompt. For this section and others, when asked to enter a command at the prompt, an example will be given such as:
//>> command name//, where ''>>'' represents the computer prompt shown in your terminal. You will not actually type '>>'.
Some basic commands in Linux that you should know include:
*''ls'' - this command shows you a listing of the files and directories in your current location on the workstation
*''pwd'' - this command shows your exact location in the workstation directory hierarchy. Typically, when you log on to the system, you will start in your home directory: ''/home/username'', where //username// is your user identification assigned to you by Dr. ~McElroy.
*''mkdir'' - this command creates a directory when followed by a directory name. For example, if you wanted to create the directory ''chem'' in your ''/home/username'' directory, then at the prompt type //>>mkdir chem//. Now, if you type the command //>> ls//, you should see ''chem'' listed in the home directory.
*''cd'' - this command allows you to ''c''hange ''d''irectory when followed by an existing directory. From /home/username, try the command //>>cd chem//. If the /chem directory was created, you should be able to change into it. If successful, you may type //>>pwd// and see that you are there. Type //>>ls//, and you will see that there are no files existing in that directory.
*''cd ..'' - By typing the command //>>cd ..// (that's cd followed by two periods), you will move back up the directory by one level. If in /home/username/chem, issue the command //>>cd ..//, then type //>>pwd// and/or //>>ls//. You should see that you are back in your home directory /home/username. From anywhere in the system, if you simply type //>>cd// with nothing after it, you will automatically be transferred to your /home/username directory.
*''touch'' - If you type this command followed by a word, it will create a file by that name at your current location. In your /home/username directory, type //>> touch a.txt//. After this, type //>>ls//, and you should see the file 'a.txt' in the directory. Once the file exists, you could edit it using [[Emacs]] or other editing programs.
*''cp'' - this command allows you to copy a file. In the /home/username directory, type //>>cp a.txt b.txt//, then //>>ls//. You should now see that both a.txt and b.txt exist, where b.txt is a copy of a.txt.
*''mv'' - this command allows you to move a file (not copy) from one place to another. In /home/username, type //>>mv b.txt chem//. If the directory chem exists in your /home/username directory, b.txt will be moved into it (if chem does not exist, b.txt is simply renamed to chem). Use the proper commands to see if this worked.
*''rm'' - this command deletes a file (BE CAREFUL). Once a file or directory is removed, you can't get it back (i.e., no UNDO). In your /home/username/chem directory, type the command //>>rm b.txt//. Once completed, that file should no longer exist.
*''cat'' - this command allows you to scroll the contents of a file on your screen. In your /home/username directory, issue the command //>>cat a.txt//. Most likely nothing will appear on the screen because a.txt is empty. If you had a list of books in that file, then all of the contents would scroll down your screen.
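For comparison, the same sequence of steps can be sketched with Python's standard library; the comment on each line names the shell command it mirrors. This is only an illustration: a temporary directory stands in for /home/username so that nothing in a real home directory is touched, and the variable names are arbitrary.

```python
import shutil
import tempfile
from pathlib import Path

# A throwaway directory stands in for /home/username (an assumption for safety).
home = Path(tempfile.mkdtemp())

chem = home / "chem"
chem.mkdir()                                            # mkdir chem
(home / "a.txt").touch()                                # touch a.txt
shutil.copy(home / "a.txt", home / "b.txt")             # cp a.txt b.txt
listing_home = sorted(p.name for p in home.iterdir())   # ls

shutil.move(str(home / "b.txt"), str(chem))             # mv b.txt chem
listing_chem = sorted(p.name for p in chem.iterdir())   # cd chem, then ls

(chem / "b.txt").unlink()                               # rm b.txt (no undo!)
listing_after_rm = sorted(p.name for p in chem.iterdir())

contents = (home / "a.txt").read_text()                 # cat a.txt (empty file)

shutil.rmtree(home)                                     # remove the scratch directory
```

As with ''rm'', {{{unlink}}} and {{{shutil.rmtree}}} delete permanently, so the same caution applies.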
Mitchell^^1^^ defines machine learning as //"the study of computer algorithms that improve automatically through experience"//. This field has roots in many disciplines, including computer science, statistics, and cognitive science.
''Inductive Learning'' is a term used to describe a learning system that uses sample data to build a model which can be used to analyze other data. Learning, in terms of machines, means that the system (such as a neural network) changes its structure in order to improve its performance.
Learning then requires a data set, which is usually split into two subsets: a [[training set|Training Set]] and a [[test set|Prediction Set]] or [[prediction set|Prediction Set]]. The training set data is used to build a model, which one hopes will then be able to predict some property of the test set (called [[generalization|Generalization]]).
There are two types of inductive learning: [[unsupervised learning|Unsupervised Learning]] and [[supervised learning|Supervised Learning]].
^^1^^. Mitchell, //Machine Learning//, ~McGraw-Hill ''1997''.
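Splitting a data set into a training set and a prediction set can be sketched as follows. This is a minimal example assuming a simple random split; the 20% prediction fraction, the fixed seed, the compound names, and the function name {{{split_dataset}}} are all arbitrary choices for illustration.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=42):
    """Randomly split records into (training_set, prediction_set)."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Twenty placeholder compound names.
compounds = [f"compound_{i:02d}" for i in range(1, 21)]
training_set, prediction_set = split_dataset(compounds)
```

The model is built using only the training set; the prediction set is held back so that generalization can be judged on compounds the model has never seen.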
[[Syllabus]]
[[Schedule|2008 Schedule]]
[[Reading List]]
[[Special Project]]
[[External Resources]]
[[Glossary]]
[[Open Babel|http://openbabel.sourceforge.net/wiki/Main_Page]] is an open source (OS) toolkit that allows chemical information in one format to be translated to another format.
This is particularly handy when dealing with the different data formats used by programs such as [[HyperChem]] and [[MOPAC]] and by notations such as [[SMILES]], so that data can be traded between programs without losing vital information.
<!--{{{-->
<div id='header'>
</div>
<div id='sidebar'>
<div id='titleLine'></div>
<span id='siteTitle' refresh='content' tiddler='SiteTitle'></span>- <span id='siteSubtitle' refresh='content' tiddler='SiteSubtitle'></span>
<div id='mainMenu' refresh='content' tiddler='MainMenu'></div>
<div id='sidebarOptions' refresh='content' tiddler='SideBarOptions'></div>
<div id='sidebarTabs' refresh='content' force='true' tiddler='SideBarTabs'></div>
</div>
<div id='displayArea'>
<div id='messageArea'></div>
<div id='tiddlerDisplay'></div>
</div>
<!--}}}-->
Inspired by Jeremy Ruston's version at [[TiddlyWiki|http://www.tiddlywiki.com]].
|| !1 | !2 |!| !3 | !4 | !5 | !6 | !7 | !8 | !9 | !10 | !11 | !12 | !13 | !14 | !15 | !16 | !17 | !18 |
|!1|bgcolor(#a0ffa0): @@color(red):H@@ |>|>|>|>|>|>|>|>|>|>|>|>|>|>|>|>||bgcolor(#c0ffff): @@color(red):He@@ |
|!2|bgcolor(#ff6666): Li |bgcolor(#ffdead): Be |>|>|>|>|>|>|>|>|>|>||bgcolor(#cccc99): B |bgcolor(#a0ffa0): C |bgcolor(#a0ffa0): @@color(red):N@@ |bgcolor(#a0ffa0): @@color(red):O@@ |bgcolor(#ffff99): @@color(red):F@@ |bgcolor(#c0ffff): @@color(red):Ne@@ |
|!3|bgcolor(#ff6666): Na |bgcolor(#ffdead): Mg |>|>|>|>|>|>|>|>|>|>||bgcolor(#cccccc): Al |bgcolor(#cccc99): Si |bgcolor(#a0ffa0): P |bgcolor(#a0ffa0): S |bgcolor(#ffff99): @@color(red):Cl@@ |bgcolor(#c0ffff): @@color(red):Ar@@ |
|!4|bgcolor(#ff6666): K |bgcolor(#ffdead): Ca ||bgcolor(#ffc0c0): Sc |bgcolor(#ffc0c0): Ti |bgcolor(#ffc0c0): V |bgcolor(#ffc0c0): Cr |bgcolor(#ffc0c0): Mn |bgcolor(#ffc0c0): Fe |bgcolor(#ffc0c0): Co |bgcolor(#ffc0c0): Ni |bgcolor(#ffc0c0): Cu |bgcolor(#ffc0c0): Zn |bgcolor(#cccccc): Ga |bgcolor(#cccc99): Ge |bgcolor(#cccc99): As |bgcolor(#a0ffa0): Se |bgcolor(#ffff99): @@color(green):Br@@ |bgcolor(#c0ffff): @@color(red):Kr@@ |
|!5|bgcolor(#ff6666): Rb |bgcolor(#ffdead): Sr ||bgcolor(#ffc0c0): Y |bgcolor(#ffc0c0): Zr |bgcolor(#ffc0c0): Nb |bgcolor(#ffc0c0): Mo |bgcolor(#ffc0c0): Tc |bgcolor(#ffc0c0): Ru |bgcolor(#ffc0c0): Rh |bgcolor(#ffc0c0): Pd |bgcolor(#ffc0c0): Ag |bgcolor(#ffc0c0): Cd |bgcolor(#cccccc): In |bgcolor(#cccccc): Sn |bgcolor(#cccc99): Sb |bgcolor(#cccc99): Te |bgcolor(#ffff99): I |bgcolor(#c0ffff): @@color(red):Xe@@ |
|!6|bgcolor(#ff6666): Cs |bgcolor(#ffdead): Ba |bgcolor(#ffbfff):^^*1^^|bgcolor(#ffc0c0): Lu |bgcolor(#ffc0c0): Hf |bgcolor(#ffc0c0): Ta |bgcolor(#ffc0c0): W |bgcolor(#ffc0c0): Re |bgcolor(#ffc0c0): Os |bgcolor(#ffc0c0): Ir |bgcolor(#ffc0c0): Pt |bgcolor(#ffc0c0): Au |bgcolor(#ffc0c0): @@color(green):Hg@@ |bgcolor(#cccccc): Tl |bgcolor(#cccccc): Pb |bgcolor(#cccccc): Bi |bgcolor(#cccc99): Po |bgcolor(#ffff99): At |bgcolor(#c0ffff): @@color(red):Rn@@ |
|!7|bgcolor(#ff6666): Fr |bgcolor(#ffdead): Ra |bgcolor(#ff99cc):^^*2^^|bgcolor(#ffc0c0): Lr |bgcolor(#ffc0c0): Rf |bgcolor(#ffc0c0): Db |bgcolor(#ffc0c0): Sg |bgcolor(#ffc0c0): Bh |bgcolor(#ffc0c0): Hs |bgcolor(#ffc0c0): Mt |bgcolor(#ffc0c0): Ds |bgcolor(#ffc0c0): Rg |bgcolor(#ffc0c0): @@color(green):Uub@@ |bgcolor(#cccccc): Uut |bgcolor(#cccccc): Uuq |bgcolor(#cccccc): Uup |bgcolor(#cccccc): Uuh |bgcolor(#fcfecc): @@color(#cccccc):Uus@@ |bgcolor(#ecfefc): @@color(#cccccc):Uuo@@ |
| !Lanthanides^^*1^^|bgcolor(#ffbfff): La |bgcolor(#ffbfff): Ce |bgcolor(#ffbfff): Pr |bgcolor(#ffbfff): Nd |bgcolor(#ffbfff): Pm |bgcolor(#ffbfff): Sm |bgcolor(#ffbfff): Eu |bgcolor(#ffbfff): Gd |bgcolor(#ffbfff): Tb |bgcolor(#ffbfff): Dy |bgcolor(#ffbfff): Ho |bgcolor(#ffbfff): Er |bgcolor(#ffbfff): Tm |bgcolor(#ffbfff): Yb |
| !Actinides^^*2^^|bgcolor(#ff99cc): Ac |bgcolor(#ff99cc): Th |bgcolor(#ff99cc): Pa |bgcolor(#ff99cc): U |bgcolor(#ff99cc): Np |bgcolor(#ff99cc): Pu |bgcolor(#ff99cc): Am |bgcolor(#ff99cc): Cm |bgcolor(#ff99cc): Bk |bgcolor(#ff99cc): Cf |bgcolor(#ff99cc): Es |bgcolor(#ff99cc): Fm |bgcolor(#ff99cc): Md |bgcolor(#ff99cc): No |
*Chemical Series of the Periodic Table
**@@bgcolor(#ff6666): Alkali metals@@
**@@bgcolor(#ffdead): Alkaline earth metals@@
**@@bgcolor(#ffbfff): Lanthanides@@
**@@bgcolor(#ff99cc): Actinides@@
**@@bgcolor(#ffc0c0): Transition metals@@
**@@bgcolor(#cccccc): Poor metals@@
**@@bgcolor(#cccc99): Metalloids@@
**@@bgcolor(#a0ffa0): Nonmetals@@
**@@bgcolor(#ffff99): Halogens@@
**@@bgcolor(#c0ffff): Noble gases@@
*State at standard temperature and pressure
**those in @@color(red):red@@ are gases
**those in @@color(green):green@@ are liquids
**those in black are solids
~PuTTY is a free application that provides a secure (SSH) connection to a remote host computer. For this course, you will be connecting to Dr. ~McElroy's Linux workstation from an IUP computer.
The executable program (putty.exe) can be saved to your computer (easiest to put it on the Desktop) and run only when needed. There are two options for getting this software:
*1. Go to the [[PuTTY website| http://www.chiark.greenend.org.uk/~sgtatham/putty/]] and follow download instructions for your particular computer and operating system.
*2. Right click on this -> [[putty.exe|utilities/putty.exe]] and save the file to your computer (ONLY if a Windows computer running XP or Vista).
To access [[Saison]]:
*1. activate the program [[PuTTY]]
*2. Under "host name or IP address", enter: ''saison.nsm.iup.edu'' (Port 22 should be default)
*3. the protocol SSH should be selected
*4. you may save this information by giving it a name under "Saved Session", and then click "save"
*5. at the bottom, choose "open"
*6. a Linux terminal window will open, and you will see ''login as:'' At this prompt, type the user name you were given in class
*7. enter your password
*8. you are now connected to [[Saison]] in your [[home directory|Home Directory]]
Here is the recommended reading list for this course. When possible, documents will be in Adobe PDF format. If your computer is not equipped to read PDF files, please go [[here|http://www.adobe.com/products/acrobat/readstep2.html]] to get the latest reader. The following lists are ''not'' meant to be read in order; they're just in the order that I added them to the list and numbered so that I can reference them for you.
I will suggest which ones to read at different times throughout the semester, so don't feel as though you have to read all of these before the first week is done!
[[Go here|Syllabus]] for the class syllabus.
__''Books On Reserve''__
The following books are on hold at the IUP library:
#//Chemometrics: Experimental Design// by Ed Morgan, ''1991'', John Wiley & Sons, Chichester.
#//Cheminformatics Developments: History, Reviews, and Current Research// edited by Jan Noordik, ''2004'', IOS Press, Amsterdam.
#//Linux in a Nutshell// by Siever, et al, ''2005'', O'Reilly, New York.
#//Handbook of Molecular Descriptors// by Roberto Todeschini and Viviana Consonni, ''2000'', ~Wiley-VCH, Weinheim.
#//Learning GNU Emacs// by Debra Cameron, et al. ''2005'', O'Reilly, Sebastopol, CA.
__''Electronic Files''__
The following articles are available as PDF files:
# [[Recent Advances in Chemoinformatics|docs/ci700059g.pdf]] by Dimitris Agrafiotis, et al. //J. Chem. Inf. Model.// ''2007'', //47(4)//, 1279-1293.
# [[Rational Drug Discovery Revisited|docs/ddt2001-6-p989.pdf]]: Interfacing Experimental Programs with Bio- and Chemoinformatics by Jürgen Bajorath, //Drug Discovery Today// ''2001'', //6(19)//, 989-995.
# [[The Emerging Importance of Predictive ADME Simulation in Drug Discovery|docs/ddt2002-7-p109.pdf]] by Alan Beresford, et al. //Drug Discovery Today// ''2002'', //7(2)//, 109-116.
# [[Trends and Plot Methods in MLR Studies|docs/ci6004959.pdf]] by Emili Besalú, et al. //J. Chem. Inf. Model.// ''2007'', //47(3)//, 751-760.
# [[Predicting ADME Properties in silico: Methods and Models|docs/ddt2002-7-pS83.pdf]] by Darko Butina, et al. //Drug Discovery Today// ''2002'', //7(11) supplement//, ~S83-S88.
# [[Basic Overview of Chemoinformatics|docs/ci600234z.pdf]] by Thomas Engel //J. Chem. Inf. Model.// ''2006'', //46(6)//, 2267-2277.
# [[Chemoinformatics: Past, Present, and Future|docs/ci060016u.pdf]] by William Chen //J. Chem. Inf. Model.// ''2006'', //46(6)//, 2230-2255.
# [[Some Relations Between Reaction Rates and Equilibrium Constants|docs/cr60056a010.pdf]] by Louis Hammett //Chem. Rev.// ''1935'',//17(1)//, 125-136.
# [[The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives|docs/ja01280a022.pdf]] by Louis Hammett //J. Am. Chem. Soc.// ''1937'', //59(1)//, 96-103.
# [[The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients|docs/ja00901a033.pdf]] by Corwin Hansch, et al. //J. Amer. Chem. Soc.// ''1963'', //85(18)//, 2817-2824.
#ρ-σ-π [[Analysis. A Method for the Correlation of Biological Activity and Chemical Structure|docs/ja01062a035.pdf]] by Corwin Hansch & Toshio Fujita //J. Amer. Chem. Soc.// ''1964'', //86(8)//, 1616-1626.
#[[Structural Determination of Paraffin Boiling Points|docs/ja01193a005.pdf]] by Harry Wiener //J. Am. Chem. Soc.// ''1947'', //69(1)//, 17-20.
#[[Relation of the Physical Properties of the Isomeric Alkanes to Molecular Structure|docs/j150462a018.pdf]] by Harry Wiener //J. Phys. Chem.// ''1948'', //52(6)//, 1082-1089.
#[[SMILES, A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules|docs/ci00057a005.pdf]] by David Weininger //J. Chem. Inf. Comput. Sci.// ''1988'', //28(1)//, 31-36.
#[[Development and Use of Charged Partial Surface Area Structural Descriptors in Computer-Assisted Quantitative Structure-Property Relationship Studies|docs/ac00220a013.pdf]] by David Stanton and Peter Jurs //Anal. Chem.// ''1990'', //62(21)//, 2323-2329.
#[[Development and Use of Hydrophobic Surface Area (HSA) Descriptors for Computer-Assisted Quantitative Structure-Activity and Structure-Property Relationship Studies|docs/ci034284t.pdf]] by David Stanton, et al. //J. Chem. Inf. Comput. Sci.//, ''2004'', //44(3)//, 1010-1023.
#[[Electrotopological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information|docs/ci00028a014.pdf]] by Hall & Kier //J. Chem. Inf. Comput. Sci.//, ''1995'', //35(6)//, 1039-1045.
#[[The Electrotopological State: Structure Information at the Atomic Level for Molecular Graphs|docs/ci00001a012.pdf]] by Hall, et al. //J. Chem. Inf. Comput. Sci.//, ''1991'', //31(1)//, 76-82.
#[[A Simple Method for the Representation, Quantification, and Comparison of the Volumes and Shapes of Chemical Compounds|docs/ci00049a002.pdf]] by Stouch and Jurs //J. Chem. Inf. Comput. Sci.//, ''1986'', //26(1)//, 4-12.
#[[Prediction of Physicochemical Parameters by Atomic Contributions|docs/ci990307l.pdf]] by Wildman and Crippen //J. Chem. Inf. Comput. Sci.//, ''1999'', //39(5)//, 868-873.
__''Tablet PC Journal Files''__
The following are available as Adobe PDF files and require the Adobe Acrobat Reader.
#[[Lecture 02 notes (right click to save)|docs/lect02.pdf]]
#[[Lecture 03 notes (right click to save)|docs/lect03.pdf]]
#[[Lecture 05 notes (right click to save)|docs/lect05.pdf]]
#[[Lecture 08 notes (right click to save)|docs/lect08.pdf]]
#[[Lecture 10 notes (right click to save)|docs/lect10.pdf]]
#[[Lecture 11 notes (right click to save)|docs/lect11.pdf]] and [[surface area notes|docs/lect11a.pdf]]
#[[Lecture 12 notes (right click to save)|docs/lect12.pdf]]
#[[Lecture 13 notes (right click to save)|docs/lect13.pdf]]
#[[Lecture 15 notes (right click to save)|docs/lect15.pdf]]
#[[Lecture 16 notes (right click to save)|docs/lect16.pdf]]
#[[Lecture 17 notes (right click to save)|docs/lect17.pdf]]
#[[Lecture 18 notes (right click to save)|docs/lect18.pdf]]
#[[Lecture 21 notes (right click to save)|docs/lect21.pdf]]
#[[Lecture 22 notes (right click to save)|docs/lect22.pdf]]
#[[Lecture 24 notes (right click to save)|docs/lect24.pdf]]
#[[Lecture 25 notes (right click to save)|docs/lect25.pdf]]
SMILES = ''S''implified ''M''olecular ''I''nput ''L''ine ''E''ntry ''S''pecification
The original work was created by Arthur and David Weininger (see David Weininger's 1988 paper in the [[Reading List]]), and has been expanded since. This system allows an unambiguous description of a chemical structure within a single string (list of characters). SMILES are used for information storage and retrieval in many software packages, and are quite commonly used in publications.
A good explanation of SMILES notation can be found at the [[Daylight Chemical Information Systems website|http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html]].
For our work here, we use SMILES for database storage and dataset processing, and the SMILES notation is created from [[HyperChem]] [[hin file|HIN File]] information by conversion via [[Open Babel]] software.
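To give a feel for how much information a single SMILES string carries, here is a small Python sketch that counts the non-hydrogen atoms in a string. It is only a toy under stated assumptions: it recognizes the unbracketed organic subset (B, C, N, O, P, S, F, Cl, Br, I), lowercase aromatic atoms, and the element symbol inside a bracket atom, and it ignores bonds, branches, and ring closures entirely. The function name {{{count_heavy_atoms}}} is hypothetical, and this is not how [[Open Babel]] reads SMILES.

```python
import re
from collections import Counter

# Assumption: a simplified atom pattern, not a full SMILES grammar.
# Bracket atoms come first; two-letter symbols (Cl, Br) are tried
# before one-letter symbols so "Cl" is not read as carbon.
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Za-z][a-z]?)[^\]]*\]"  # bracket atom, e.g. [Na+] or [nH]
    r"|Cl|Br"                         # two-letter organic-subset atoms
    r"|[BCNOPSFI]"                    # one-letter organic-subset atoms
    r"|[bcnops]"                      # aromatic atoms
)

def count_heavy_atoms(smiles):
    """Return a dict of element symbol -> count, skipping hydrogen."""
    counts = Counter()
    for match in ATOM_PATTERN.finditer(smiles):
        symbol = match.group(1) or match.group(0)
        symbol = symbol.capitalize()  # aromatic 'c' is counted as 'C'
        if symbol != "H":
            counts[symbol] += 1
    return dict(counts)
```

For example, {{{count_heavy_atoms("CC(=O)O")}}} (acetic acid) gives two carbons and two oxygens, and {{{count_heavy_atoms("c1ccccc1")}}} (benzene) gives six carbons.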
<<tag formatting>>
[[Corwin Hansch]]
[[Harry Wiener]]
[[Hammett]]
[[Hammett Equation]]
[[Naive Bayes Classifier]]
Chemometrics & Cheminformatics at [[Indiana University of Pennsylvania|http://www.iup.edu]]
Sparklines [[were invented|http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR&topic_id=1]] by Edward Tufte, author of a number of thoughtful and inspiring books on the presentation of visual information.
Sparklines are described by Tufte as "small, intense, wordlike graphics". They are designed to be used inline with ordinary text. For example, this <<sparkline 163 218 231 236 232 266 176 249 289 1041 1835 2285 3098 2101 1755 3283 3353 3335 2898 2224 1404 1354 1825 1839 2142 1942 1784 1145 979 1328 1611>> shows one measure of activity on www.tiddlywiki.com during the month of April 2005.
Creating a sparkline is easy using the new [[Macros]] feature:
{{{
<<sparkline 163 218 ... 1328 1611>>
}}}
The cunning thing about these sparklines is that they are created inline without requiring any graphics or other ~ServerSide support.
Some special characters and mathematical symbols
|''symbol''|''description''| |''symbol''|''description''| |''symbol''|''description''|
| ∂ |partial differential| | ∧ |logical and| | ⊂ |subset of|
| ∃ |there exists| | ∨ |logical or| | ⊃ |superset of|
| ∅ |empty or null set| | ∩ |intersection| | ⊄ |not a subset of|
| ∇ |nabla or backward difference| | ∪ |union| | ⊆ |subset of or equal to|
| ∈ |element of| | ∫ |integral| | ⊇ |superset of or equal to|
| ∉ |not an element of| | ∴ |therefore| | ∀ |for all|
| ∋ |contains as member| | ∼ |similar to| | ⇔ |double arrow|
| ∏ |product sign| | ≅ |approximately equal to| | ↔ |small double arrow|
| ∑ |summation sign| | ≈ |almost equal to| | ƒ |function|
| √ |square root| | ≠ |not equal to| | ⊥ |orthogonal to|
| ∝ |proportional to| | ≡ |identical to| | ⋅ |dot operator|
| ∞ |infinity| | ≤ |less than or equal to| | ⊕ |direct sum|
| ∠ |angle| | ≥ |greater than or equal to| | ⊗ |direct product|
Details will soon appear regarding the [[CHEM-681|Welcome to CHEM 681]] student projects.
You will be given an account on Dr. ~McElroy's Linux workstation (saison.nsm.iup.edu) which you will be able to access from any ~IUP-connected computer via [[PuTTY]]. We will have at least one lecture dedicated to working on the Linux workstation from the computer room in Weyandt 144, where you will learn some basic [[Linux commands|Linux in CHEM 681]] and [[editing commands|Emacs]].
For the ''project'', please purchase a dedicated notebook such as a simple [[Mead composition notebook|images/mead.jpg]], available for between $1 and $3, depending on where you buy it. This notebook should be separate from the notebook you use in class, and should contain all information recorded during your computer sessions, etc. I will most likely collect the project notebook at least once during the semester to look over your work.
In terms of the work presented here, a structural (or molecular) descriptor is simply a mathematical value which is the result of some calculation that encodes a particular feature of a molecule. This can be a simple and easily interpretable value, such as the molecular weight of a compound, or it can be a very complex and highly discriminant value whose interpretation is limited (in terms of some facet of the molecule). This mathematical representation, however, must be unchanged by specified mathematical or physical operations or transformations to the molecule's size and/or number of atoms - otherwise, there would be no way to build statistical or neural network models.
The two main factors that determine the information held within a molecular descriptor are: 1) how the molecule is represented (two-dimensional? three-dimensional? hydrogen-suppressed?); and 2) the mathematical algorithm that is used to carry out the descriptor calculation (is it rigorous? is it feasible?).
Descriptors here are generally categorized into one of four types:
#[[topological|Topological Descriptor]]
#[[geometric|Geometric Descriptor]]
#[[electronic|Electronic Descriptor]]
#[[hybrid or hybridized|Hybrid Descriptor]]
Topological descriptors can be calculated from a simple two-dimensional [[connectivity matrix|Connection Table]] or graph representation, where only the types of atoms and their connections are relevant. These descriptors give information about the relative size and content of the molecule, and are quickly calculated.
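As a minimal sketch of a topological calculation, the Python function below computes the Wiener index (the sum of the shortest-path bond counts between every pair of atoms) from nothing more than a list of bonds in a hydrogen-suppressed graph. The function name {{{wiener_index}}} and the bond-list input format are my own assumptions for illustration; they are not taken from any particular descriptor package.

```python
from collections import deque

def wiener_index(n_atoms, bonds):
    """Sum of shortest-path distances over all atom pairs.

    n_atoms: number of heavy atoms, numbered 0 .. n_atoms-1
    bonds:   list of (i, j) pairs; bond order is ignored (assumption)
    """
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    total = 0
    for start in range(n_atoms):
        # Breadth-first search gives the bond count to every other atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nxt in neighbors[atom]:
                if nxt not in dist:
                    dist[nxt] = dist[atom] + 1
                    queue.append(nxt)
        total += sum(dist.values())

    # Each pair was counted twice (once from each end).
    return total // 2
```

For n-butane, {{{wiener_index(4, [(0, 1), (1, 2), (2, 3)])}}} returns 10, while for isobutane, {{{wiener_index(4, [(0, 1), (0, 2), (0, 3)])}}} returns 9, so the value distinguishes the two isomers without any three-dimensional information.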
Geometric descriptors are calculated using [[connectivity|Connection Table]] information too, but they also rely on correct low-energy three-dimensional optimization prior to calculation, in order to capture the relative positions of atoms within the molecule. There are many different software packages available that will find low-energy, three-dimensional geometries using many different methods such as: //ab initio//, semi-empirical, or molecular mechanics. I use the semi-empirical molecular orbital package [[MOPAC]] to calculate geometries before finding these descriptor values.
Electronic descriptors are also dependent upon correct structure optimization. Here, however, rather than looking at the atom positions, we're interested in the partial atomic charges that result from geometry optimizations. So again, we use the results from the [[MOPAC]] optimization.
Hybrid descriptors involve calculations that may make use of information from topological, geometric, and/or electronic descriptor information in order to more fully capture aspects of a molecule. One particularly useful set of descriptors from this class are [[Charged Partial Surface Area (CPSA)|Charged Partial Surface Areas]] descriptors.
/***
http://tiddlystyles.com/#theme:TiddlyPedia
***/
/*{{{*/
body{
background: #f9f9f9 url(headbg.jpg) no-repeat top left;
}
#titleLine{
display: block;
background: transparent url(wiki.png) no-repeat 18px -7px;
_background: transparent;
height: 120px;
_height: 135px;
width: 150px;
color: #000;
border: 1px;
padding: 0;
margin: 0;
}
* html #titleLine{
filter:progid:DXImageTransform.Microsoft.AlphaImageLoader(src='wiki.png',sizingMethod='scale');
}
#contentWrapper #siteTitle a{
display: inline;
font-weight: bold;
color: #000;
font-size: 16px;
}
#siteSubtitle{
padding: 0;
}
#siteTitle, #mainMenu{
position: static;
}
#contentWrapper #sidebar{
top: 0;
left: 0;
}
#displayArea {
margin: 0 0 0 15em;
}
#messageArea{
position: fixed;
top: 0;
right: 0;
font-size: 10px;
border: 1px solid #aaa;
background: #fff;
z-index: 25;
}
#messageArea a:link{
color: #002bb8;
text-decoration: none;
}
#messageArea a:hover{
text-decoration: underline;
}
.viewer{
background: #fff;
border: 1px solid #aaa;
padding: 1em;
margin: 0;
}
.body{
padding: 1px;
}
.title{
background: #fff;
border: 1px solid #aaa;
display: inline;
margin-left: .5em;
padding: 2px .5em;
border-bottom: 0;
font-weight: bold;
color: #000;
font-size: 1.2em;
}
.toolbar{
visibility: visible;
display: inline;
padding: 0;
font-family: sans-serif;
}
.toolbar a.button:link,.toolbar a.button:visited{
background: #fff;
border: 1px solid #aaa;
color:#002bb8;
font-size: 11px;
padding-bottom: 0;
margin-right: .25em;
}
/* TiddlyPedia was Created by Clinton Checketts based on the Monobook skin of Wikipedia */
#contentWrapper .toolbar .button:hover{
border-bottom: 1px solid #fff;
background: #fff;
color:#002bb8;
}
.toolbar a.button:hover{
border-bottom: 1px solid #fff;
background: #fff;
color:#000;
}
#displayArea .viewer a,a.button:link,a.button:visited,
a.tiddlyLink:link,a.tiddlyLink:visited,
#sidebarOptions .sliderPanel a{
color:#002bb8;
background: transparent;
border: 0;
}
.viewer a:hover,a.button:hover,a.button:active,
a.tiddlyLink:hover,a.tiddlyLink:active,
.viewer a.button:hover,
#sidebarOptions .sliderPanel a:hover{
color:#002bb8;
background: transparent;
text-decoration: underline;
}
#mainMenu{
font-family: sans-serif;
text-align: left;
font-size: x-small;
width: 100%;
margin: 0;
padding: 0;
}
#mainMenu h1{
font-size: 11px;
font-weight: normal;
padding: 0;
margin: 0;
background: transparent;
}
#mainMenu ul{
font-size: 11px;
border: 1px solid #aaa;
padding: .25em 0;
margin: 0;
list-style-type: square;
list-style-image: url(bullet.gif);
background: #fff;
width: 100%;
}
#mainMenu li{
margin: 0 0 0 2em;
padding: 0;
}
#contentWrapper #mainMenu a:link,#contentWrapper #mainMenu a:visited{
color:#002bb8;
padding: 0;
margin: 0;
background: transparent;
}
#mainMenu .externalLink {
text-decoration: none;
}
#mainMenu .externalLink:hover {
text-decoration: underline;
}
#sidebar{
padding: .5em;
font-family: sans-serif;
}
#sidebarOptions{
border: 1px solid #aaa;
background: #fff;
margin-top: .5em;
width: 100%;
}
#sidebar .sliderPanel{
margin: 0;
}
#contentWrapper #sidebarOptions .button,#contentWrapper #sidebarOptions .button:hover{
color:#002bb8;
padding: .1em 0 .1em 2em;
background: transparent url(bullet.gif) 10px -2px no-repeat;
}
#sidebarOptions input{
width: 80%;
margin: 0 .5em;
}
#sidebarTabs{
background: #fff;
margin-top: .5em;
width: 100%;
}
#sidebarTabs .tabContents,#sidebarTabs .tabContents .tabContents{
border: 1px solid #aaa;
background: #fff;
}
#sidebarTabs .tabSelected,#sidebarTabs .tabContents .tabSelected {
background: #fff;
border: 1px solid #aaa;
border-bottom: 0;
cursor: default;
padding-bottom: 3px;
color: #000;
}
#sidebarTabs .tabUnselected,#sidebarTabs .tabContents .tabUnselected{
background: #aaa;
padding-bottom: 0;
color: #000;
}
#contentWrapper #sidebarTabs .tiddlyLink,#contentWrapper #sidebarTabs .button,
#contentWrapper #sidebarTabs a.tiddlyLink:hover,#contentWrapper #sidebarTabs a.button:hover{
background: transparent;
color: #002bb8;
}
.footer{
margin: -1em 0 1em 0;
}
.footer .button:hover,.editorFooter .button:hover{
background: transparent;
color: #002bb8;
border-bottom: 1px solid #002bb8;
}
#popup{
background: #e9e9e9;
color: #000;
}
#popup hr{
border-color: #aaa;
background-color: #aaa;
}
#popup a{
color: #000;
}
#popup a:hover,#contentWrapper #sidebarTabs #popup a:hover{
background: #666;
color: #fff;
text-decoration: none;
}
#displayArea .tiddler a.tiddlyLinkNonExisting{
color: #ba0000;
}
#displayArea .tiddler a.externalLink{
text-decoration: none;
color:#002bb8;
padding-right: 1em;
background: transparent url(external.png) 100% 50% no-repeat;
}
#displayArea .tiddler a.externalLink:hover{
text-decoration: underline;
}
.viewer pre{
background: #e9e9e9;
border: 1px solid #666;
}
.viewer h1, .viewer h2, .viewer h3, .viewer h4, .viewer h5, .viewer h6{
background: transparent;
border-bottom: .2em solid #aaa;
}
#sidebar .sliderPanel{
background: #e9e9e9;
}
#sidebar .sliderPanel input{width: auto;}
.tagged, .tagging, .listTitle{
float: none;
display: inline;
}
.tagged li, .tagging li,
.tagged ul, .tagging ul{
display: inline;
}
/*}}}*/
Supervised learning, as opposed to [[unsupervised learning|Unsupervised Learning]], is used to make a system learn to associate input data (such as a set of organic compounds) with some output value (such as the IC~~50~~ or enzyme inhibition values of each compound). Data used to build the system or model is called the [[training set|Training Set]]. Once the system learns how to associate each input with an output, it is hoped that when a new input ([[prediction set|Prediction Set]]) is placed in the system, the system will produce a reliable output ([[generalization|Generalization]]).
Supervised learning requires a set of data (input) with corresponding known target values (output). During training, the system can check its generalizations with the target values, then make adjustments in order to minimize the error between actual target values and predicted or calculated target values.
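The idea of adjusting a model to minimize the error between known and calculated target values can be sketched with the simplest possible case: a one-descriptor least-squares line. This is only an illustration; the function names ({{{fit_line}}}, {{{predict}}}, {{{rms_error}}}) are hypothetical and the numbers in the usage note are made-up toy values, not data from this course.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Calculated target value for a new input x."""
    slope, intercept = model
    return slope * x + intercept

def rms_error(model, xs, ys):
    """Root-mean-square error between known and calculated targets."""
    squared = [(predict(model, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))
```

A model would be built with {{{model = fit_line(train_x, train_y)}}} using training set values, and then checked with {{{rms_error(model, new_x, new_y)}}} on compounds that were held out of training. With the toy training values x = 1, 2, 3, 4 and y = 3, 5, 7, 9, the fitted line is y = 2x + 1.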
Models from supervised learning can be used for classification and for modeling and prediction.
Examples of this type would be [[decision trees|Decision Trees]], [[CNNs|Computational Neural Networks]], and [[genetic algorithms|Genetic Algorithm]].
This is the online version of Dr. ~McElroy's [[CHEM-681|Welcome to CHEM 681]] Chemometrics & Chemoinformatics Spring 2008 syllabus. For an electronic copy of the syllabus handed out in class, go [[here|docs/syllabus_CHEM-681.pdf]].
|''Course Name'' |CHEM 681: Chemometrics & Chemoinformatics |
|''Location'' |Tuesday, Thursday from 18:00-19:15 in Weyandt 240 |
|''Instructor'' |Dr. Nathan ~McElroy |
|''Office'' |148C Weyandt |
|''Office Hours'' |MWF 9:00a-10:00a; Thurs. 9:00a-11:00a |
|''Email'' |nathan.mcelroy@iup.edu |
|''Phone'' |+1 724 357 4829 |
|''Text'' |None |
|''Web site'' |[[http://nsm1.nsm.iup.edu/nate/courses/chem681|http://nsm1.nsm.iup.edu/nate/courses/chem681]] |
''Goals'' This course will examine the role that mathematical and statistical methods play in optimizing and solving chemistry problems. Topics will include pattern recognition and classification, multiple linear regression, neural networks, chemical structure representation, and structural descriptor generation. Examples of applications in areas such as drug discovery will be reviewed.
''Evaluation'' You will be evaluated by three exams (each 25% of the total grade), two during the semester and one given during the final exam period. The final 25% of your grade will come from a [[semester project|Special Project]] that will involve a literature review and an application of chemometrics to a data set.
''Schedule'' For the schedule of Spring 2008 lectures, see [[2008 Schedule]].
''Topological descriptors'' are taken from the connectivity information of a structure. In other words, they do not rely on a good geometry of a molecule, only on how atoms are connected to each other. These descriptors include molecular weight, atom counts, path counts, path lengths, and connectivity indices. Topological descriptors are easy to calculate, but have less discriminating power.
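Path counts are a good example of how little information is needed. The Python sketch below counts the paths of each length (in bonds) in a hydrogen-suppressed graph, given only a list of bonds. The function name {{{count_paths}}} and the input format are assumptions made for this illustration.

```python
def count_paths(bonds, max_length):
    """Count simple paths of length 1 .. max_length (in bonds).

    bonds: list of (i, j) atom-index pairs for the heavy-atom graph.
    Returns a dict mapping path length -> number of distinct paths.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    counts = [0] * (max_length + 1)

    def extend(path):
        length = len(path) - 1
        if length > 0:
            counts[length] += 1
        if length == max_length:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # a simple path never revisits an atom
                extend(path + [nxt])

    for atom in neighbors:
        extend([atom])

    # Every path was walked once from each end, so halve the totals.
    return {k: counts[k] // 2 for k in range(1, max_length + 1)}
```

For n-butane, {{{count_paths([(0, 1), (1, 2), (2, 3)], 3)}}} gives three paths of length 1, two of length 2, and one of length 3; isobutane has three paths of length 1, three of length 2, and none of length 3.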
A training set is a large portion of data samples taken from a global data set (for example, 90% of data points in a large set). The training set data is used in [[machine learning|Machine Learning]] to build some type of predictive model that can be used for [[generalization|Generalization]].
In a [[Quantitative Structure-Activity Relationship]], for example, there are some common conventions used when selecting the training set compounds from a large data set of organic compounds. Depending on the number of compounds in a database, the type of model algorithm that one will use for training, and the goals of a study, the training set comprises anywhere from 60% to 90% of the total number of dataset compounds. The remaining compounds will be placed into a [[test/prediction set|Prediction Set]].
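A random split of this kind can be sketched in a few lines of Python. The function name {{{split_dataset}}}, the default fraction of 0.8, and the fixed seed are assumptions chosen for illustration; a real study may select training compounds by other criteria than a simple shuffle.

```python
import random

def split_dataset(compounds, training_fraction=0.8, seed=0):
    """Randomly split compounds into (training_set, prediction_set)."""
    if not 0.0 < training_fraction < 1.0:
        raise ValueError("training_fraction must be between 0 and 1")
    shuffled = list(compounds)             # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * training_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With 10 compounds and the default fraction, 8 go to the training set and the remaining 2 go to the prediction set; no compound appears in both.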
In [[machine learning|Machine Learning]], unsupervised learning is used to find some sort of structure or representation of data in a large set. In other words, the user or the data is not suggesting any pattern or trying to bias the system toward one result over another. Common tasks for unsupervised learning include the detection of outliers, data compression, and clustering.
A popular way to achieve unsupervised learning is to use a [[Kohonen network|Self-Organizing Maps]], also called a [[Self-Organizing Map]] (SOM), which is a type of [[Artificial Neural Network]] (ANN).
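A full SOM is too long to show here, but the flavor of unsupervised clustering can be sketched with a much simpler, different technique: one-dimensional k-means. Note that this is a stand-in chosen for brevity, not a Kohonen network; the function name {{{k_means_1d}}} and the starting centers are assumptions. No target values are supplied, and the groups emerge from the data alone.

```python
def k_means_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    values:  list of numbers to cluster
    centers: initial guesses for the cluster centers
    Returns (final_centers, clusters).
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its old center).
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters
```

Given the toy values 0.8, 1.0, 1.2, 4.8, 5.0, 5.2 and starting centers 0.0 and 6.0, the values separate into a low group and a high group with centers near 1.0 and 5.0.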
<!--{{{-->
<div class='title' macro='view title'></div>
<div class='toolbar' macro='toolbar -closeTiddler closeOthers +editTiddler permalink references jump'></div>
<div class='tagging' macro='tagging'></div>
<div class='tagged' macro='tags'></div>
<div class='viewer' macro='view text wikified'></div>
<div class='tagClear'></div>
<!--}}}-->
This web page will serve as the electronic hub for information for ''Dr. ~McElroy's ~CHEM-681: Chemometrics & Cheminformatics'' for Spring 2008.
''The default font size may be small for your browser/preference. You can increase the font size on the browser menu.''
The formatting of this site is based on [[Jeremy Ruston's TiddlyWiki| http://www.tiddlywiki.com]], a type of special .html page that allows some nonlinear linking between topics. In order to navigate the page, simply click on hyperlinks as you would any other web page. The formatting and styles for this particular design were taken from [[Tiddly Themes|http://www.tiddlythemese.com]].
However, you will notice that if the link is active, a new 'Tiddler', or sub-window, will appear on the main screen. Until you close the window, it will remain on the main page. It would be possible to open up every single item in this Wiki and have one long web page - but that gets a little cluttered!
See the menu to the left for the main areas of interest:
*[[Syllabus]] (the whats and wherefores of the course)
*[[Schedule|2008 Schedule]] (a list of lecture dates, etc)
*[[Reading List]] (where you'll find all the PDF files and other documents mentioned in class)
*[[Special Project]] (a main page that deals with everything related to the special project)
*[[External Resources]] (links to non-IUP resources that you may find useful)
*[[Glossary]] (some common terms & acronyms used in this course)