Open-source tools for generating and analyzing large materials data sets
Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA
ACS Spring, April 2017
Slides (already) posted to http://www.slideshare.net/anubhavster (link is also listed at end of talk)

“Civilization advances by extending the number of important operations which we can perform without thinking about them.” - Alfred North Whitehead

We don’t work on catalysis, but we do write software
•  We don’t do research into heterogeneous catalysis
•  We do build software to:
  –  execute millions of calculations on supercomputing centers
  –  make it more straightforward to run density functional theory calculations (mostly VASP, some Gaussian/Q-Chem)
  –  perform structural manipulations
  –  analyze the results of calculations

Software technologies that we contribute to
[Diagram: the packages below, grouped into "Base packages" and "Derived packages"]
•  atomate (automatic materials science workflows)
•  custodian (calculation error recovery)
•  pymatgen (materials analysis framework)
•  FireWorks (workflow definition & execution)
•  matminer (materials data mining)
These are all open-source:
•  FireWorks, atomate, and matminer are led by our group
•  pymatgen and custodian are led by Prof. Ong’s group (UC San Diego)
•  All are developed in coordination with the Persson group (UC Berkeley)

Applications: The Materials Project database
The Materials Project (http://www.materialsproject.org) is free and open
•  ~30,000 registered users around the world
•  >65,000 compounds calculated
•  >75 million CPU-hours invested = massive scale!
Data includes:
•  thermodynamic properties
•  electronic band structure
•  aqueous stability (E-pH)
•  elasticity tensors
•  piezoelectric tensors
Jain*, Ong*, Hautier, Chen, Richards, Dacek, Cholia, Gunter, Skinner, Ceder, and Persson, APL Mater., 2013, 1, 011002. (*equal contributions)

Applications: The Electrolyte Genome
•  Data on ~22,000 molecules (mainly geometry + IP/EA via full adiabatic calcs)
•  Also deployed on the Materials Project web site
L. Cheng, R.S. Assary, X. Qu, A. Jain, S.P. Ong, N.N. Rajput, et al., J. Phys. Chem. Lett. 6 (2015) 283–291.
X. Qu, A. Jain, N.N. Rajput, L. Cheng, Y. Zhang, S.P. Ong, et al., Comput. Mater. Sci. 103 (2015) 56–67.

Applications: Crystalium (Ong / Persson)
http://crystalium.materialsvirtuallab.org
•  Surface energies for 142 polymorphs of 72 elements + rotatable Wulff shapes
•  Certainly applicable to catalysis
•  Computed & maintained by the Ong group (UC San Diego) with support from the Persson group (UC Berkeley)
R. Tran, Z. Xu, B. Radhakrishnan, D. Winston, W. Sun, K. A. Persson, and S. P. Ong, Sci. Data, 2016, 3, 160080.

Applications: Rapid data generation
•  >4500 elastic tensors
•  >900 piezoelectric tensors
•  >48000 Seebeck coefficients + cRTA transport
M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)

Let’s revisit the libraries
[Diagram: the packages below, grouped into "Base packages" and "Derived packages"]
•  atomate (automatic materials science workflows)
•  custodian (calculation error recovery)
•  pymatgen (materials analysis framework)
•  FireWorks (workflow definition & execution)
•  matminer (materials data mining)
These are all open-source:
•  FireWorks, atomate, and matminer are led by our group
•  pymatgen and custodian are led by Prof. Ong’s group (UC San Diego)
•  All are developed in coordination with the Persson group (UC Berkeley)

pymatgen – object-oriented materials analysis
www.pymatgen.org
Ong, S. P., Richards, W. D., Jain, A., Hautier, G., Kocher, M., Cholia, S., Gunter, D., Chevrier, V. L., Persson, K. A. & Ceder, G. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).

pymatgen – examples of analyses
•  phase diagrams
•  Pourbaix diagrams
•  diffusivity from MD
•  band structure analysis

pymatgen – many useful tools made accessible
•  Structure Matcher: analyzes whether two periodic structures are equivalent, even if they are in different settings or have minor distortions
•  Order-disorder: resolves partial or mixed occupancies into a fully ordered crystal structure (e.g., a mixed oxide-fluoride site into separate oxygen/fluorine sites)
Many other tools, such as:
•  Automatic surface slab generator
•  Bond-valence sums to determine valence
•  Voronoi coordination as well as 3D coordination polyhedron analysis
•  Automatically find and insert interstitial sites
•  Powder diffraction pattern generation
•  Simple cost and materials availability estimators
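
To make the order-disorder idea concrete, here is a minimal standard-library sketch. It is not pymatgen's implementation or API: it simply brute-forces every way to place oxygen and fluorine on a set of half-occupied sites, with no symmetry reduction or energy ranking. The site labels and the function name `enumerate_orderings` are hypothetical.

```python
from itertools import combinations


def enumerate_orderings(site_labels, species_a, species_b, n_a):
    """List every assignment that puts n_a atoms of species_a on the sites
    and fills the remaining sites with species_b (brute force, no symmetry)."""
    orderings = []
    for chosen in combinations(range(len(site_labels)), n_a):
        assignment = {
            label: (species_a if index in chosen else species_b)
            for index, label in enumerate(site_labels)
        }
        orderings.append(assignment)
    return orderings


# Four hypothetical anion sites, each half O / half F, so 2 O and 2 F in total.
sites = ["X1", "X2", "X3", "X4"]
orderings = enumerate_orderings(sites, "O", "F", n_a=2)
for ordering in orderings:
    print(ordering)
```

A production tool would additionally discard symmetry-equivalent orderings and rank the survivors, which is what makes the problem hard for larger cells.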

custodian – fixing job errors
•  Custodian can wrap around an executable (e.g., VASP)
  –  i.e., run custodian instead of directly running VASP
•  During execution, custodian monitors output files and detects errors / problems
  –  if an error is detected, it can change input files and rerun the job
  –  e.g., if a ZPOTRF error is detected, rerun with ISYM=0
  –  ever-expanding library of fixes
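
The detect-fix-rerun loop described above can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not custodian's actual API: `run_with_recovery`, `fake_vasp`, and the handler table are hypothetical names, and the "executable" is a stand-in function rather than a real VASP run.

```python
def run_with_recovery(run_job, handlers, params, max_attempts=3):
    """Run a job, scan its output for known error strings, apply the
    matching input fixes, and rerun until the output is clean."""
    params = dict(params)  # copy so the caller's inputs are left untouched
    for attempt in range(1, max_attempts + 1):
        output = run_job(params)
        fixes = [fix for pattern, fix in handlers.items() if pattern in output]
        if not fixes:
            return params, attempt  # no known error found: done
        for fix in fixes:
            params.update(fix)  # change the inputs before the next attempt
    raise RuntimeError(f"job still failing after {max_attempts} attempts")


def fake_vasp(params):
    """Stand-in for the wrapped executable (assumption: fails unless ISYM=0)."""
    if params.get("ISYM") != 0:
        return "LAPACK: Routine ZPOTRF failed"
    return "run completed"


# One rule from the slide: a ZPOTRF error is retried with ISYM=0.
handlers = {"ZPOTRF": {"ISYM": 0}}
final_params, attempts = run_with_recovery(fake_vasp, handlers, {"ISYM": 2})
print(final_params, attempts)
```

Keeping the error patterns and their fixes in a table is what lets the library of fixes grow without touching the run loop.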

FireWorks – scientific workflow software
•  FireWorks is open-source scientific workflow software
•  Materials Project, JCESR, and other projects manage their runs with FireWorks
  –  >1 million jobs
  –  >100 million CPU-hours
  –  multiple computing clusters
•  You can write any kind of workflow
  –  e.g., FireWorks is used for graphics processing, machine learning, document processing, and protein folding
  –  #1 Google hit for “Python workflow software”, top 5 for general scientific workflow software
•  Detailed tutorials are available: www.pythonhosted.org/FireWorks
Jain, A., Ong, S. P., Chen, W., Medasani, B., Qu, X., Kocher, M., Brafman, M., Petretto, G., Rignanese, G.-M., Hautier, G., Gunter, D. & Persson, K. A. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
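
As a rough picture of what "workflow definition & execution" means, the sketch below runs a small dependency graph of tasks in order, passing each task the results of its parents. It deliberately does not use the FireWorks API (which also handles queues, databases, and failures); it relies only on Python's standard-library `graphlib` (Python 3.9+), and the task names are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, depends_on):
    """Execute tasks in dependency order; each task receives a dict of
    its parents' results."""
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        parent_results = {p: results[p] for p in depends_on.get(name, ())}
        results[name] = tasks[name](parent_results)
    return order, results


# Hypothetical three-step chain: relax -> static -> band_structure.
tasks = {
    "relax": lambda parents: "relaxed",
    "static": lambda parents: parents["relax"] + " -> static",
    "band_structure": lambda parents: parents["static"] + " -> bands",
}
depends_on = {
    "relax": [],
    "static": ["relax"],
    "band_structure": ["static"],
}

order, results = run_workflow(tasks, depends_on)
print(order)
print(results["band_structure"])
```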

FireWorks – screenshot of job status
Live version at http://fireworks.dash.materialsproject.org

atomate – our newest code (currently in beta)
•  Redesigns an older, clunkier code (MPWorks)
•  Translates minimal specifications into well-defined FireWorks workflows (FireWorks handles all the execution and job management details)
•  Example of a minimal specification: “What is the GGA-PBE elastic tensor of GaAs?”
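
A toy version of "minimal specification in, well-defined workflow out" might look like the following. It is a sketch under stated assumptions and not atomate's API: the step names in `STEP_TEMPLATES`, the settings table, and `build_workflow` are all hypothetical, and a real hybrid-functional setup needs more than the single `LHFCALC` flag shown here.

```python
# Hypothetical step templates: which calculations a requested property expands into.
STEP_TEMPLATES = {
    "elastic tensor": [
        "structure optimization",
        "deformed-structure calculations",
        "tensor fit",
    ],
}

# Hypothetical per-functional settings, kept in one place so no step can forget them.
FUNCTIONAL_SETTINGS = {
    "GGA-PBE": {"LHFCALC": False},
    "hybrid": {"LHFCALC": True},
}


def build_workflow(formula, property_name, functional="GGA-PBE"):
    """Expand a minimal request into an explicit list of steps, each
    carrying the same functional-specific settings."""
    if property_name not in STEP_TEMPLATES:
        raise ValueError(f"no workflow template for {property_name!r}")
    settings = FUNCTIONAL_SETTINGS[functional]
    return [
        {"name": f"{formula}: {step}", "settings": dict(settings)}
        for step in STEP_TEMPLATES[property_name]
    ]


# "What is the GGA-PBE elastic tensor of GaAs?"
workflow = build_workflow("GaAs", "elastic tensor", functional="GGA-PBE")
for step in workflow:
    print(step)
```

Because the settings are derived from the request rather than copied and hand-edited from an earlier run, every step gets the same functional by construction.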

Advantages – reduce specialization
Because of the steep learning curve for computational methods, there is often a single group member assigned to a technique:
•  “Alice knows how to do charged defect calculations.”
•  “Bob is the one who can properly converge GW runs.”
•  “Olga has all the scripts for phonon calculations.”

Advantages – reduce errors
Let’s take a look at two alternate universes:
•  Universe 1: researcher has coffee → copies files from previous simulation → edits 5 lines → runs simulation, creates report
•  Universe 2: researcher forgets coffee → copies files from previous simulation → edits 4 lines, forgets LHFCALC=F → creates report, looks fine at first → in a month, discovers it used the wrong functional
Automation reduces your chances of being caught in universe #2!

atomate – what’s available?
•  band structure
•  spin-orbit coupling
•  hybrid functional calcs
•  elastic tensor
•  piezoelectric tensor
•  Raman spectra
•  NEB
•  GIBBS method
•  QH thermal expansion
•  AIMD
•  FEFF method
•  LAMMPS MD
All past and present knowledge, from everyone in the group, everyone previously in the group, and outside collaborators, about how to run calculations
Contributors: K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang

matminer (still in alpha)
matminer’s goal: help enable data mining studies in materials science

matminer usage
•  Examples of usage are on the GitHub page:
  –  https://github.com/hackingmaterials/matminer
•  Coming next: new types of crystal structure descriptors based on local environment
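
For readers new to descriptors, the sketch below shows the general idea of turning a material into numbers that a data mining model can consume. It is not matminer code: `parse_formula`, `featurize`, and the tiny atomic-number table are illustrative stand-ins, and the two composition-based features are far simpler than the local-environment descriptors mentioned above.

```python
import re

# Atomic numbers for the few elements used in this illustration.
ATOMIC_NUMBER = {"O": 8, "Na": 11, "Cl": 17, "Ti": 22, "Ga": 31, "As": 33}


def parse_formula(formula):
    """Split a simple formula such as 'TiO2' into {'Ti': 1.0, 'O': 2.0}.
    Parentheses and charges are not supported in this sketch."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def featurize(formula):
    """Return two toy descriptors: the composition-weighted mean atomic
    number and the spread between the heaviest and lightest element."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    numbers = [ATOMIC_NUMBER[symbol] for symbol in counts]
    mean_z = sum(ATOMIC_NUMBER[s] * n / total for s, n in counts.items())
    return {"mean_Z": mean_z, "range_Z": max(numbers) - min(numbers)}


features = featurize("GaAs")
print(features)
```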

Some lessons learned (1)
•  In the beginning, strong central coordination from authority was needed to develop these
  –  require that people contribute to common code, e.g. pymatgen, and not write their own detached scripts
•  Once a code was “established”, less authority was needed
  –  people voluntarily contributed improvements rather than writing their own code because this benefited them
•  Today the process is almost completely decentralized
  –  culture has changed
  –  even for new codes, people rally around them rather than build independent things

Some lessons learned (2)
•  It is helpful to have a strong BDFL (benevolent dictator for life) for each codebase
•  Requirements for the BDFL:
  –  very detail-oriented
  –  cares about the code itself, not just the application
  –  cares more about the code quality than about offending teammates, i.e., will not accept poor-quality contributions
  –  at the same time, able to rally support from people and convince them to contribute or clean up code
  –  willing to work overtime to do things like write detailed docs, answer questions from users, advocate for the code, review commits, etc.
  –  derives joy from building and deploying things!

Some lessons learned (3)
•  Computer scientists are useful for staying up to date in the fast-moving world of software
  –  2006: I took a graduate class in databases at a top CS university; all SQL, not a single mention of “NoSQL”
  –  2007: We use SQL to build a precursor to Materials Project
  –  2011: We are designing the framework for Materials Project; I have lots of experience with SQL and am confident this is the way to go; a computer scientist casually mentions NoSQL, its growing prominence, and its potential applicability to our problem
  –  2017: We do almost everything in NoSQL
•  Lesson: software moves fast! Much faster than materials science knowledge or methods. Don’t use “up to date” data from 5 years ago to inform your decision.

Further resources
•  The GitHub sites
  –  www.github.com/materialsproject
  –  www.github.com/hackingmaterials
•  Software Carpentry
  –  https://software-carpentry.org

Acknowledgements
•  Research group of Prof. Shyue Ping Ong
•  Research group of Prof. Kristin Persson
•  Funding: US Dept of Energy, Materials Science Division
…and all extended collaborators for these various projects!
Slides (already) posted to http://www.slideshare.net/anubhavster
