PMF-AMS Analysis Guide

From Jimenez Group Wiki
Revision as of 10:04, 13 May 2009 by DonnaS (talk | contribs) (A little more detail)

Introduction

The PMF Evaluation Panel consists of 3 Igor procedure files (ipfs) called PMF_Execution, PMF_ViewResults, and PMF_Scatter. This wiki serves as the help and documentation for the software. To run PMF with the panel, the PMF executable and associated files, accessed separately, are required (see Section 3, Installing PMF with Igor). The PMF executable is compiled only for Windows/DOS. It is possible to execute PMF on a Windows computer and then analyze the experiment on a Macintosh.

The ipfs were written by Ingrid Ulbrich and Donna Sueper (Jimenez Group, University of Colorado, Boulder) and Greg Brinkman (Hannigan Group, University of Colorado, Boulder). Questions about this code can be addressed to Ingrid or Donna at ulbrich@colorado.edu or dsueper@colorado.edu.

PMF (Positive Matrix Factorization) was developed by Dr. P. Paatero (Dept. of Physics, University of Helsinki). One of the original papers describing this method is P. Paatero, Least squares formulation of robust non-negative factor analysis, Chemometrics and Intelligent Laboratory Systems 37 (1997), pp. 23–35. First-time users of PMF are encouraged to first read Paatero's documentation for PMF (see Section 9, Other Resources).

This Igor toolkit was intended for use in analyzing AMS data, but the toolkit makes only a few assumptions that are specific to AMS-type data. Non-AMS users of this software can skip Section 4, Creating the Organics and Error Matrices.

A Message to Contributors

We want to encourage active participation by all users in the evolution of the information contained within this wiki and welcome the addition of content that is beneficial to the community as a whole. However, please DO NOT delete any content from this page!! Significant time, effort, and deliberation have gone into the information contained in this page. Rather than deleting content, please feel free to voice your concerns by posting a comment to the discussion page where others can contribute (please be sure to include a topic to be referenced in responses).

Installing PMF with Igor

Setting up PMF on Your Computer

  1. Download the BareBones PMF Starter Kit
  2. Create a new folder on your computer where you'll store the PMF files and all files output by PMF (this should NOT be the folder where you store your Igor experiments). This folder must contain:
    • PMF2wtst.exe (obtained from P. Paatero or the BareBones Starter Kit)
    • imupmf.ini (obtained from the BareBones Starter Kit)
    • pmf2key.key (obtained from P. Paatero, U. Helsinki)
  3. Start a new Igor experiment and load the following files from the BareBones Starter Kit:
    • DataAndErrorForBareBonesPMF.itx
    • BareBonesPMFExecution_1_00B.ipf
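
Before running the test case, it can help to confirm from within Igor that the folder created in step 2 really contains the three required files. The following is a minimal sketch and is not part of the BareBones Starter Kit; the function name SketchCheckPMFFolder and the example folder are hypothetical. It uses Igor's built-in GetFileFolderInfo operation.

```igor
// Hypothetical helper (not part of the starter kit): reports which of the
// required PMF files are missing from the given folder.
Function SketchCheckPMFFolder(folderPath)
	String folderPath	// Igor-style path ending in a colon, e.g. "C:PMF:" (hypothetical)

	String required = "PMF2wtst.exe;imupmf.ini;pmf2key.key"
	String fileName, fullPath
	Variable i, numMissing = 0

	for (i = 0; i < ItemsInList(required); i += 1)
		fileName = StringFromList(i, required)
		fullPath = folderPath + fileName
		GetFileFolderInfo/Q/Z fullPath	// V_flag is 0 when the item exists
		if (V_flag != 0 || V_isFile == 0)
			Printf "Missing: %s\r", fileName
			numMissing += 1
		endif
	endfor

	return numMissing	// 0 means all three files were found
End
```

For example, executing print SketchCheckPMFFolder("C:PMF:") from the command line prints the names of any missing files followed by the number of files missing.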


Running the PMF Test Case

  1. Start the Panel from the BareBonesPMF menu on the menu bar.
    File:BareBonesPanel.png

  2. Use the first button and select the path to the folder you created with the PMF2wtst.exe file.
  3. Use the second button to execute PMF.

You should see a black DOS window pop up, scroll a lot of output, and then close. Igor will then tell you, "The PMF barebones analysis was successfully completed within X seconds."

X should be > 0!

If the window goes away immediately and the Igor message says that the analysis was completed within 0 seconds, the execution was not successful. Try the following steps to solve the problem.


If PMF does not Run Properly

1. Look in the folder you created with the PMF2wtst.exe file for the existence of the files

 MATRIX.DAT and STD_DEV.DAT
If these files exist, Igor was able to access the correct folder. Continue with Step 2.
If these files do not exist, Igor was not able to access the correct folder. Go back to the first button on the panel and check that you've given the correct path to the folder with the PMF2wtst.exe file. Run PMF again by pressing the second button.

2. Look in the folder you created with the PMF2wtst.exe file for the existence of the file

 PMF2.LOG
If this file does not exist, PMF was not run in this folder. Go to step 3.
If this file does exist, PMF attempted to run in this folder. Open the file PMF2.LOG (it is a text file).
Glance down the contents of the file and look for many lines of sequential numbered output, such as
 1 rank1 step chi2=   9282.6     Penalty=  1.5287E+04 Flags GF
 2 rank1 step chi2=   7411.7     Penalty=  1.4084E+04 Flags GF
If these lines are in the file, PMF ran successfully on your computer, which must be fast enough to run this data in less than 1 second. You're done!
If the lines of sequential numerical output are not in the file, you should find the following lines within this file (note that not every line is included here, but the lines selected here are in the order in which they appear in the file):

2a)

 ##PMF2 .ini file for: IMUPMF.INI  --- BareBonesPMF
 Successfully read task initialization file imupmf.ini
 titled:  ##PMF2 .ini file for: IMUPMF.INI  --- BareBonesPMF
If these lines appear in the file, PMF found the .ini file. Continue with Step 2b.
If these lines do not appear, you should see a message about not finding an appropriate .ini file.
  • Make sure that this folder contains the file imupmf.ini, provided with the BareBones PMF Starter Kit. If it does not, copy the file to this folder, delete the file PMF2.LOG, and press the second button in the panel again to see whether PMF runs successfully.
If the file imupmf.ini already exists in the same folder as the PMF2wtst.exe file, continue with Step 3.

2b)

 Successfully opened input file      30
 with name MATRIX.DAT
 Successfully opened input file      31
 with name STD_DEV.DAT
If these lines appear in the file, everything should have run correctly. Look at the rest of the PMF2.LOG file to see whether other errors are reported. If you still encounter difficulty, contact Ingrid Ulbrich at <Ingrid dot Ulbrich at colorado dot edu> for assistance and attach the PMF2.LOG file to your email.
If these lines do not appear in the file, you will see a message about PMF not being able to access one of these files. Check that the files are not being used by other programs, delete the file PMF2.LOG, and press the second button on the panel again to see whether PMF runs successfully.

3. Look in the C:\ directory of your computer for the existence of the file

 runpmf.bat
If this file exists, Igor was able to write to your C:\ drive. Continue with Step 3a.
If this file does not appear, Igor was not able to write this file to your C:\ drive. This might be due to high security settings on your computer. You should create a text file with this name (NOT runpmf.bat.txt) and choose to edit it (not open it) with a text editor (e.g., Notepad, WordPad, Emacs, etc.). The file should contain the lines
 cd C:\Documents and Settings\Ingrid Ulbrich\Desktop\
 pmf2wtst imupmf  
NOTE that you must change the path in this example to the path to the folder where you have put the file PMF2wtst.exe!

3a) Execute runpmf.bat by double-clicking on its icon. You should see the black DOS window pop up, scroll output, and close again.

If this happens, PMF has run successfully from the batch. In the Igor experiment, press the second button on the panel to see whether PMF runs successfully.
If the black window does not show scrolled output, continue with Step 3b.

3b) You now need to execute PMF from the command line.

To access the DOS command line, launch the Command Prompt from your Windows Start menu, located in
 Start -> Programs -> Accessories -> Command Prompt
Change from the path displayed in the prompt to the folder where you put the PMF2wtst.exe file. Use the command
 cd
to change directories (e.g.,
 cd Desktop\PMF
). You can change one directory at a time, or several at a time as shown in the example. To move up 1 directory, use
 cd ..	
Be sure that the folder contains the three files
 PMF2wtst.exe (obtained from Paatero or the BareBones Starter Kit)
 imupmf.ini  (obtained from the BareBones Starter Kit)
 pmf2key.key (obtained from P. Paatero, U. Helsinki)
by listing the contents of the current directory, using the command
 dir	
Then type this command at the prompt to run PMF:
 pmf2wtst imupmf
You should see scrolling output in the command window. The window will not disappear and you can scroll back through the output (which is also saved in the file PMF2.LOG).
If the output contains many lines of sequential numbered output, such as
 1 rank1 step chi2=   9282.6     Penalty=  1.5287E+04 Flags GF
 2 rank1 step chi2=   7411.7     Penalty=  1.4084E+04 Flags GF
PMF ran successfully here. Delete the file PMF2.LOG and, in the Igor experiment, press the second button on the panel to see whether PMF runs successfully from Igor.
If the output does not contain these lines, go back to Step 2 of this section to examine errors that might be reported by PMF in the file PMF2.LOG.

Creating the Organics and Error Matrices (Step 0)

Creating the Matrices, by Instrument (Software) type

In the Q-AMS Software (James')

Be sure to use v1.41 or later.

spikes

smoothing

In SQUIRREL

  • The Igor AMS PMF tool needs four inputs: two 2-dimensional matrices of the same size and two 1-dimensional waves that index the rows and columns of those matrices. One matrix is the organics matrix (PMF is typically performed only on organics); the other is the error of the organics matrix. The two 1-dimensional waves are the time series wave (corresponding to the rows) and an m/z wave (corresponding to the columns).
  • In the Corrections tab, Errors sub-tab, check the calc MS errors checkbox. Make sure you don't intend to do any other calculations in the corrections tab and then press the Do Corrections button. This will generate the matrix MSSDiff_p_err in the root folder.
  • In the MS tab, average mass spectra section, enter (only) Org as a species, then press the Export Matricies button. This will generate the MSSD_xxx_Mat_Org and MSSD_xxx_Mat_Org_err matrices that you need (where xxx is the name of the todo wave). Both of these matrices reside in the root folder and will have zeros in the m/z columns where no organics contribution is identified (as given in the frag_organics wave) and blanks in the rows for run numbers not included in the todo wave. Note that MSSD_xxx_Mat_Org_err will not be calculated unless MSSDiff_p_err has first been created.
  • Now you have the two most important basic pieces - the Org matrix and the Org error matrix.
  • The two 1-dimensional waves corresponding to the rows and columns of the matrix are the time series wave and a simple wave corresponding to m/z. In SQUIRREL the time series wave is root:index:t_series. There is an m/z wave, root:diagnostics:amus, whose values are simply column numbers: 1, 2, 3, ..., 1000. This is the wave that is plotted as the x axis when one generates an average mass spectrum. For simplicity, the maximum value is always 1000, regardless of the maximum m/z value in the data. Users can either redimension this amus wave to the correct size (the same number of points as the data matrix has columns) or create a new wave from the command line.
  • There is still more work that needs to be done: removing columns with zeros/nans, and perhaps doing some editing on the error matrix.
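
The m/z wave described above can be created with a single Make command. The following is a minimal sketch; the function name SketchMakeMzWave and the output wave name amus_short are hypothetical, and it assumes, as stated above for the amus wave, that the m/z values are simply the column numbers starting at 1.

```igor
// Hypothetical helper: makes an m/z wave with one point per column of the
// organics matrix, holding the values 1, 2, 3, ...
Function SketchMakeMzWave(dataMx)
	Wave dataMx		// organics matrix: rows = time points, columns = m/z

	Variable numCols = DimSize(dataMx, 1)
	Make/O/N=(numCols) amus_short = p + 1	// p is the point index, starting at 0
End
```

The new wave is created in the current data folder and leaves root:diagnostics:amus untouched.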

In PIKA

Deleting NaNs/zeros for all Instruments

The matrices produced in the previous step have NaNs in all rows from bad runs and 0 values in columns with good runs that have no organic fragments. All of these rows and columns need to be removed before running PMF. This is a two-step process.

1. First, change the columns with 0's to NaNs. You can do this by changing all 0's to NaNs, e.g.

OrganicMx = OrganicMx[p][q] == 0 ? NaN : OrganicMx[p][q]

or you could make an organic fragments mask wave (= 1 for organic fragments, = NaN for others) and multiply your matrix by the wave.

Note that the latter method is safer! The function used in step 2 chooses the first row and column of the data that contain a mix of NaNs and values and uses these as templates to delete rows or columns (respectively) that contain NaNs. Therefore, if an actual good data value in the first row was 0 (unlikely but possible) and you have replaced zeros with NaNs, a column may be deleted inadvertently!
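
The mask-wave approach can be sketched as follows. This is an illustration only; the function name SketchApplyOrgMask and the wave name orgMask are hypothetical, and the mask is assumed to be a 1-D wave with one point per matrix column.

```igor
// Hypothetical helper: multiplies every column of the data and error matrices
// by the matching point of a 1-D mask wave (1 = organic fragment, NaN = other).
Function SketchApplyOrgMask(dataMx, errMx, orgMask)
	Wave dataMx, errMx	// 2-D matrices: rows = time points, columns = m/z
	Wave orgMask		// 1-D wave with one point per matrix column

	if (numpnts(orgMask) != DimSize(dataMx, 1))
		Abort "orgMask must have one point per matrix column"
	endif

	dataMx *= orgMask[q]	// q is the column index in a wave assignment
	errMx *= orgMask[q]	// value * NaN = NaN, value * 1 = value
End
```

Because only whole columns are set to NaN, a genuine 0 in an organic column is left as it is.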


2. In the PMF_Execution_XX.ipf, use the function

pmf_ams_deleteNaNs_mxWvs(dataMx, errMx, rowDescrWv, colDescrWv)

where

dataMx and errMx are your data and error matrices,
rowDescrWv is a 1-D wave that gives the indices for the rows (usually t_series), and
colDescrWv is a 1-D wave that gives the indices for the columns (usually amus).

The function creates new versions of the input waves with names noNaNs_amus, noNaNs_t_series, etc. from which the NaN rows and/or columns have been deleted. Global strings called NaNsList_amus and NaNsList_tseries have been created and are used by related functions to delete items from other waves or reinsert the original NaNs into waves.

Here is a piece of code that can remove columns with zeros. This approach presumes you have already used the pmf_ams_deleteNaNs_mxWvs function above. (Copy and paste these lines into a procedure window and execute from the command line.)

// A function for removing columns in a matrix that have only zeros
// For example, if a user has an "Org" matrix, it would remove columns
// corresponding to m/z 1 - 11, 14, etc.
// sample usage: RemoveZeroCols(noNaNs_mx, noNaNs_mxerr, noNaNs_amus, NaNsList_amus) // where noNaNs_mx and noNaNs_mxerr are replaced with your wave names
Function RemoveZeroCols(noNaNsMx, noNaNsMxErr, NoNansAmuWave, NoNansAmuList)
	wave noNaNsMx, noNaNsMxErr, NoNansAmuWave
	string NoNansAmuList	// passed by value: additions made below are not returned to the caller

	variable idex, numRow, numCol

	numRow = dimsize(noNaNsMx,0)
	numCol = dimsize(noNaNsMx,1)

	make/o/n=(numRow) tempCol

	// start at the last valid column (numCol-1) and work 'backwards' so that we don't mess up the counting
	for (idex = numCol-1; idex>=0; idex-=1)
		tempCol = noNaNsMx[p][idex]
		wavestats/q/m=1 tempCol
		if (V_min == 0 && V_max == 0)
			deletepoints/m=1 idex, 1, noNaNsMx, noNaNsMxErr	// delete the column from both matrices
			deletepoints idex, 1, NoNansAmuWave	// 1-D wave: delete the matching point
			NoNansAmuList = num2str(idex+1) + ";" + NoNansAmuList	// keep adding at the front to keep in numerical order
		endif
	endfor

	killwaves/z tempCol
End

Related Functions

Also found in PMF_Execution_XX.ipf:

  • pmf_ams_deleteNaNs_Wvs
  • pmf_ams_insertNaNs_Mxs
  • pmf_ams_insertNaNs_Wvs

Recommended Practice: Downweight "Weak" Variables

Recommended Practice: Set a Minimum Error

Recommended Practice: Downweight Peaks Related to m/z 44 in Frag Table

Perform PMF Analysis *Step 1*

Additional File for PMF Executable Directory

Be sure that the file

 mypmft.ini

(provided on the PMF data page) is located in the directory with the PMF Executable.

Recommended Practice: Organizing Your Folders

 + root
    + TemplateData          to copy for new versions of analysis
    + Variation1            e.g., your basis case
    + Variation2            e.g., with different error estimates
    + External_MassSpectra  for use with the Scatter Panel
    + External_Tseries      for use with the Scatter Panel

NOTE: Data for running PMF can be in root: or in a subfolder of root:, but not in any lower-level folder.

The Executable Panel

File:ExecutePanel.png

  1. Set the path to the PMF executable
  2. Provide the data and error matrices information
    • Choose the folder (must be root: or a directory in root:)
    • Choose the Data and Error matrices (use noNaNs_ versions)
    • Choose model error (PMF increases the errors provided by newError = oldError + modelError*dataValue)
  3. Choose the type of PMF analysis
    • Exploration will run PMF for a range of number of factors and FPEAKs or SEEDs. This is the typical use for exploring a dataset and comparing many solutions.
    • Bootstrapping explores the uncertainty of one solution (i.e., one number of factors at one fpeak for one seed). This is usually a final step run only on the solution you have selected from the exploratory analysis.
  4. Choose a range for number of factors.
    • When checking to make sure that everything runs properly, you may want to run just one case (Min p = 2, Max p = 2).
    • Recommended Practice: Run cases with 1 factor to have a context for the meaning of the 2-factor solution.
    • In the Bootstrapping mode, only the "min p" is read; the "max p" is ignored.
  5. Choose FPEAK or SEED values
    • "FPEAK" is a tool used to explore rotations of the solutions of a given number of factors. Note that FPEAK does not explore all possible rotations of a solution. FPEAK = 0 does not apply any rotational forcing. Non-zero values of FPEAK create near-zero values in the factor profiles (mass spectra) or time series. More information about FPEAK can be found in the PMF Users Manual Part 1 (pp. 9,12,14,21) and Part 2 (p. 24), and in several papers by P. Paatero (see Other Resources).
      • A good first set of FPEAK values is -1.0 to +1.0 with a delta value of 0.1 or 0.2. For a full analysis, a wide enough range of FPEAKs to achieve Q/Qexp of at least 1% above the minimum value is recommended.
      • In Exploration mode when varying the fpeaks, Seed = 0.
      • In Bootstrapping mode, only the "min fpeak" is read; the "max fpeak" is ignored.
    • "SEED" is a tool used to choose different random starts (initial values) for the PMF algorithm. Using different seeds may lead to solutions in different local minima (Q/Qexp) in the solution space. One set of solutions may have more physical meaning than another, or multiple sets may make physical sense. It is impossible to test all start values, but testing many seeds may give an indication of local minima for your dataset. More information about seeds can be found in the PMF Users Manual Part 1 (p. 11) and Part 2 (p. 16).
      • Run seeds from 0 to your preferred maximum with a delta value of 1.
      • In Exploration mode when varying the Seed, fpeak = 0.
      • In Bootstrapping mode, Seed = 0.
  6. Select checkboxes
    • Run PMF in background: Each execution of PMF (see Exploration Mode Summary) creates a black DOS window that pops up. If the box is not checked, this window "grabs the focus" and makes itself the top window. This makes it hard to use the computer for anything else. If the box is checked, the window will not grab focus, but Igor and your computer's CPU will be busy.
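
The model error setting in item 2 changes the errors according to newError = oldError + modelError*dataValue. For example, with oldError = 0.10, modelError = 0.05, and dataValue = 2.0, the new error is 0.10 + 0.05*2.0 = 0.20. You do not need to apply this yourself, but the effect can be inspected in Igor with a sketch such as the one below. The function name SketchApplyModelError and the output wave name errMx_withModel are hypothetical.

```igor
// Hypothetical helper: computes newError = oldError + modelError*dataValue
// element by element, for inspection only.
Function SketchApplyModelError(dataMx, errMx, modelError)
	Wave dataMx, errMx	// data and error matrices with identical dimensions
	Variable modelError	// e.g. 0.05

	Duplicate/O errMx, errMx_withModel	// output wave; the inputs are left unchanged
	errMx_withModel = errMx[p][q] + modelError * dataMx[p][q]
End
```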


What the Software Does When You Press the Button...

Exploration Mode Summary

The software will execute PMF once for every combination of number of factors and FPEAK/seed. So if you run 1-5 factors and 5 FPEAKs, PMF will run 5x5=25 times. Each run starts a new black DOS window that will close when the run is completed. The duration of each run is printed in the history at the end of each run. In general, runs which solve for more factors and runs with FPEAK farther from 0 take longer. The code runs all of the FPEAKS or seeds for one number of factors, then advances to the next number of factors (e.g., run 1 factor with each of 5 FPEAK values, then 2 factors with each of 5 FPEAK values, etc.).

A little more detail

The software writes the files

C:\delete_log.bat    C:\runPMF.bat

and writes your DataMatrix and ErrorMatrix as MATRIX.DAT and STD_DEV.DAT, respectively, to the folder with your PMF Executable. The software also writes a file to that folder called STD_DEV_PROP.DAT, which has the same number of points as the DataMatrix and in which every element is equal to the ModelError.


The software then enters a pair of nested loops in which the following steps occur:

  • for each number of factors
    • for each FPEAK or SEED
      • Use the file mypmft.ini as a template to create the file imupmf.ini, which is used as the control file for PMF.
      • Delete the old PMF2.LOG file by running delete_log.bat.
      • Execute PMF by running runPMF.bat.
      • Wait for PMF to complete its run.
      • Load the PMF output (including the log file and factors).
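
The order of the runs and their total number can be sketched with the same nested-loop structure. This is only an illustration of the loop order described above, not the toolkit's code; the function name SketchExplorationOrder and its parameter names are hypothetical, and the steps performed for each run are represented by a single Printf.

```igor
// Hypothetical helper: prints the order in which the runs would occur
// and returns the total number of PMF runs.
Function SketchExplorationOrder(minFac, maxFac, fpeakMin, fpeakMax, fpeakDelta)
	Variable minFac, maxFac, fpeakMin, fpeakMax, fpeakDelta

	Variable numFpeaks = round((fpeakMax - fpeakMin) / fpeakDelta) + 1
	Variable nFac, i, fpeak, numRuns = 0

	for (nFac = minFac; nFac <= maxFac; nFac += 1)	// outer loop: number of factors
		for (i = 0; i < numFpeaks; i += 1)		// inner loop: FPEAK (or SEED) values
			fpeak = fpeakMin + i * fpeakDelta
			Printf "factors = %d, FPEAK = %g\r", nFac, fpeak
			numRuns += 1
		endfor
	endfor

	Printf "total PMF runs: %d\r", numRuns
	return numRuns
End
```

Calling SketchExplorationOrder(1, 5, -0.4, 0.4, 0.2) lists all five FPEAK values for 1 factor, then for 2 factors, and so on, for a total of 5x5=25 runs.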

At the completion of the loops, the software calculates some statistics from the output and then creates a panel to select data for viewing.

Bootstrapping (added in v2.02)

The bootstrapping mode was developed following the method described in the EPA PMF v3.0 Users Manual, Sect. 6.4 (see Other Resources). The bootstrapping method is used to estimate the uncertainty in both the factor mass spectra and time series. This is achieved by running PMF on the full dataset once, then making a series of variations of the dataset (the number is specified by the user when selecting the bootstrapping mode) in which a subset of the original rows (mass spectra) is randomly replaced by other rows from the original matrix, and running PMF on each variation.

For each new PMF case ("bootstrapped case"), each resultant factor is compared to the factors from the original dataset and assigned as "matching" the original factor with which it has the highest correlation. Bootstrapped cases in which each bootstrapped factor was matched to exactly one of the original factors (i.e., there is a one-to-one mapping between original factors and those from the individual bootstrapped case) are retained for calculation of the average mass spectrum and time series of bootstrapped factors. Plots of the original factors and the average bootstrap factors with 1-sigma variation bars are produced automatically.

The EPA PMF Users Manual recommends doing 100 bootstrap runs for final results.

A little more detail

All output from the bootstrapping runs is saved in folder root:pmf_bootstrap:.

Row (mass spectrum) replacement is performed by using the StatsResample function in Igor to select rows for replacement. The row values are then sorted in increasing order as a convenience.

  • The 2d wave RowsToBeReplaced records the rows to be used in each bootstrapped case. Each column represents one bootstrap case. Each column lists the rows of the original matrix included in that bootstrapped case.
  • The 2d wave ReplacementHistogram counts the number of times that each original matrix row was used in a bootstrap case. Each column represents one bootstrap case. Summing the rows of this matrix gives the total number of times that each original matrix row was used in the bootstrapping cases.
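
The row replacement for a single bootstrapped case can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit's actual code: it assumes that the rows are drawn with replacement from the full set of row indices, and the function name SketchBootstrapRows and the wave names rowIndex and bootMx are hypothetical. StatsResample writes its result to the wave W_Resampled.

```igor
// Hypothetical helper: builds one bootstrapped version of the data matrix
// by drawing row indices with replacement.
Function SketchBootstrapRows(dataMx)
	Wave dataMx		// rows = mass spectra (time points), columns = m/z

	Variable numRows = DimSize(dataMx, 0)
	Make/O/N=(numRows) rowIndex = p		// 0, 1, ..., numRows-1
	StatsResample/N=(numRows) rowIndex	// random draw with replacement into W_Resampled
	Wave W_Resampled
	Sort W_Resampled, W_Resampled		// increasing order, as a convenience

	Duplicate/O dataMx, bootMx		// output matrix, same size as the original
	bootMx = dataMx[W_Resampled[p]][q]	// row p of bootMx is original row W_Resampled[p]
End
```

Some original rows appear more than once in bootMx and others not at all, which is the information that the ReplacementHistogram wave summarizes. The error matrix would need to be resampled with the same row indices.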

The assignment of bootstrapped factors to the factors from the original case is made by Pearson R correlation. Factors are assigned only on the basis of mass spectral comparison, and each factor is assigned to one of the original factors.

  • Note that this is different than in EPA PMF. Our code does not have a criterion for the lowest allowable correlation between bootstrapped and original factors. In EPA PMF, factors that fall below this limit are "unmapped"; no factors are "unmapped" in our code.
  • Note that if the original case has factors that are very similar to each other, the assignment of the bootstrapped factors may be incorrect or ambiguous. No current work has been done to give guidance as to what "very similar" means. No sanity checks are made in the code for this type of situation.
  • The 3d wave FactorProfile_Rval stores the correlation between the bootstrapped and original factors. Rows represent the factors from the original case, columns represent the factors from the bootstrapped cases, and layers represent each bootstrapped case.
  • The 2d wave FactorProfileSort contains the number of the original factor to which each bootstrapped factor (row) has been matched in the bootstrapped case (column). Columns which contain (e.g., in a case with three factors) "0, 1, 2" have factors which were uniquely matched to the columns in the original case; these will be included in the averages of mapped factors. Columns which contain duplicate entries (e.g., "0, 1, 0") have multiple factors that were matched to the same original factor; these cases will not be included in the averages of the mapped factors.
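
The matching rule described above (assign each bootstrapped factor to the original factor with the highest Pearson R, then keep the case only if the mapping is one-to-one) can be sketched as follows. This is an illustration rather than the toolkit's code; the function name SketchMatchFactors, the assumed matrix layout (rows = m/z, columns = factors), and the wave names matchIdx, timesUsed, tmpOrig, and tmpBoot are all hypothetical.

```igor
// Hypothetical helper: returns 1 if every bootstrapped factor maps to a
// different original factor, 0 otherwise. matchIdx holds the assignments.
Function SketchMatchFactors(origMS, bootMS)
	Wave origMS, bootMS	// assumed layout: rows = m/z, columns = factors

	Variable nFac = DimSize(origMS, 1)
	Variable nMz = DimSize(origMS, 0)
	Variable i, j, r, bestR

	Make/O/N=(nFac) matchIdx = NaN	// original factor matched to each bootstrapped factor
	Make/O/N=(nFac) timesUsed = 0	// how often each original factor was chosen
	Make/O/N=(nMz) tmpOrig, tmpBoot

	for (i = 0; i < nFac; i += 1)		// each bootstrapped factor
		tmpBoot = bootMS[p][i]
		bestR = -Inf
		for (j = 0; j < nFac; j += 1)	// each original factor
			tmpOrig = origMS[p][j]
			r = StatsCorrelation(tmpOrig, tmpBoot)	// Pearson R
			if (r > bestR)
				bestR = r
				matchIdx[i] = j
			endif
		endfor
		timesUsed[matchIdx[i]] += 1
	endfor

	WaveStats/Q/M=1 timesUsed
	KillWaves/Z tmpOrig, tmpBoot
	return (V_min == 1 && V_max == 1)	// one-to-one only if every original factor is used exactly once
End
```

A case with matchIdx = {0, 1, 2} returns 1 and would be kept for the averages; a case with matchIdx = {0, 1, 0} returns 0 and would be rejected.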

Note that the criterion for including bootstrapped cases in calculations of the average bootstrapped factors is different in our code than in EPA PMF. Bootstrapped cases in which two or more factors are mapped to the same original factor are not included in the averages in our code. In these instances, the whole bootstrapping case is rejected before calculating averages. In EPA PMF, all bootstrapped factors mapped to the same original factor are included in the average.

Running Two Simultaneous Analyses on Dual-Processor Computers

View PMF Analysis Results *Step 2*

Compare PMF Results with External Factors *Step 3*

Considerations for Choosing a Solution

Other Resources