[00:02:39] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[00:07:05] (PS3) ArielGlenn: Clean up temp files from page content dumps before retry [dumps] - https://gerrit.wikimedia.org/r/336849
[00:07:19] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[00:25:19] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:29:39] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:38:19] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:53:20] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[00:57:39] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[01:07:19] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[02:01:19] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:48] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 07m 53s)
[02:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 20 02:25:07 UTC 2017 (duration 5m 19s)
[02:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:19] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:07:49] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[03:14:39] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:29] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:20:49] PROBLEM - DPKG on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:49] PROBLEM - Disk space on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:49] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:49] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:49] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:50] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:50] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:51] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:51] PROBLEM - salt-minion processes on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.44 seconds
[03:21:39] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full
[03:21:40] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set
[03:21:40] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational
[03:21:40] RECOVERY - Disk space on bast3001 is OK: DISK OK
[03:21:40] RECOVERY - DPKG on bast3001 is OK: All packages OK
[03:21:40] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient
[03:21:40] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:21:41] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up
[03:21:41] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures
[03:22:19] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[03:29:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.28 seconds
[03:29:29] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:36:49] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:42:40] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:57:29] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[04:17:19] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[04:45:19] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:54:39] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:34:39] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:52:40] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:03:39] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:45:19] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:14:19] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:20:00] (PS1) Marostegui: db-codfw.php: Depool db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338707 (https://phabricator.wikimedia.org/T132416)
[07:23:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.179 second response time
[07:25:29] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338707 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:27:09] (Merged) jenkins-bot: db-codfw.php: Depool db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338707 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:27:18] (CR) jenkins-bot: db-codfw.php: Depool db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338707 (https://phabricator.wikimedia.org/T132416) 
(owner: Marostegui)
[07:28:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2048 - T132416 (duration: 00m 41s)
[07:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:41] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:29:58] !log Deploy alter table on db2048 enwiki.revision - T132416
[07:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:42] (PS1) Marostegui: db-codfw.php: Update ticket for db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338709
[07:36:36] (CR) Marostegui: [C: 2] db-codfw.php: Update ticket for db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338709 (owner: Marostegui)
[07:37:42] (Merged) jenkins-bot: db-codfw.php: Update ticket for db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338709 (owner: Marostegui)
[07:37:50] (CR) jenkins-bot: db-codfw.php: Update ticket for db2048 [mediawiki-config] - https://gerrit.wikimedia.org/r/338709 (owner: Marostegui)
[07:39:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Update ticket number for db2048 depool reason (duration: 00m 44s)
[07:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.165 second response time
[08:16:19] (CR) ArielGlenn: [C: 2] Clean up temp files from page content dumps before retry [dumps] - https://gerrit.wikimedia.org/r/336849 (owner: ArielGlenn)
[08:17:08] !log ariel@tin Started deploy [dumps/dumps@d50e129]: cleanup tmp files before checkpoint file rerun
[08:17:10] !log ariel@tin Finished deploy [dumps/dumps@d50e129]: cleanup tmp files before checkpoint file rerun (duration: 00m 02s)
[08:17:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:39] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:41:30] !log Increase 100G dbstore1002 lv /dev/mapper/tank-data
[08:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:09] !log restarting diamond on wdqs1002 after initial data import
[08:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:29] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:02:37] (PS4) Marostegui: mariadb: Add gtid_domain_id to s6 [puppet] - https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418)
[09:02:39] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:10:13] (PS1) Muehlenhoff: Blacklist kernel modules for DCCP protocol [puppet] - https://gerrit.wikimedia.org/r/338720
[09:15:33] (CR) Jcrespo: [C: 1] mariadb: Add gtid_domain_id to s6 [puppet] - https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:20:03] (CR) Marostegui: "Compiles fine after the file path change: https://puppet-compiler.wmflabs.org/5506/" [puppet] - https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:20:04] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - MD RAID on restbase-dev1001 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused Filippo Giunchedi raid degraded - T157425
[09:20:04] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Filippo Giunchedi raid degraded - T157425
[09:20:05] ACKNOWLEDGEMENT - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Filippo Giunchedi raid degraded - T157425
[09:20:05] ACKNOWLEDGEMENT - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 15 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Service[cassandra-b],Service[cassandra-a],File[/srv/log/restbase/syslog.log] Filippo Giunchedi raid degraded - T157425
[09:21:29] (CR) Hashar: [C: -1] "I would instead point to Wikitech which has far more informations: https://wikitech.wikimedia.org/wiki/Jouncebot" (1 comment) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/338700 (owner: Zppix)
[09:25:29] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:25:52] (CR) Marostegui: [C: 2] mariadb: Add gtid_domain_id to s6 [puppet] - https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:28:26] (PS1) Jcrespo: Add python3-mysql for the mariadb client servers [puppet/mariadb] - https://gerrit.wikimedia.org/r/338721
[09:31:32] (PS2) Jcrespo: Add python3-pymysql for the mariadb client servers [puppet/mariadb] - https://gerrit.wikimedia.org/r/338721
[09:32:38] (CR) Marostegui: [C: 1] Add python3-pymysql for the mariadb client servers [puppet/mariadb] - https://gerrit.wikimedia.org/r/338721 (owner: Jcrespo)
[09:32:58] (CR) Jcrespo: [C: 2] Add python3-pymysql for the mariadb client servers [puppet/mariadb] - https://gerrit.wikimedia.org/r/338721 (owner: Jcrespo)
[09:33:13] !log Manually deploy gtid_domain_id on s6 hosts - T149418
[09:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:18] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418
[09:35:02] (PS1) Jcrespo: Rebase mariadb module to the latest version [puppet] - https://gerrit.wikimedia.org/r/338722
[09:37:04] (CR) Jcrespo: [C: 2] Rebase mariadb module to the latest version [puppet] - https://gerrit.wikimedia.org/r/338722 (owner: Jcrespo)
[09:40:57] (CR) Gehel: [C: 1] "It very much makes sense to not lint external code..." 
[puppet] - https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: Hashar)
[09:41:48] (CR) Hashar: "And I have another patch to let us run the syntax checks with Puppet 4.x :-}" [puppet] - https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: Hashar)
[09:46:40] !log upgrading mediawiki servers in codfw to HHVM 3.12.14
[09:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:10] (CR) Zfilipin: [C: 1] build: allow usage of a different puppet version [puppet] - https://gerrit.wikimedia.org/r/338633 (owner: Hashar)
[09:52:33] :}
[10:00:05] (PS4) Hashar: syntax: ignore stdlib Puppet 4 manifests [puppet] - https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894)
[10:00:07] (PS3) Hashar: build: allow usage of a different puppet version [puppet] - https://gerrit.wikimedia.org/r/338633
[10:01:13] (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/338726 (https://phabricator.wikimedia.org/T158432)
[10:01:40] (CR) Hashar: "I made it so the PuppetSyntax ignores are only set for Puppet below 4. With the follow up change https://gerrit.wikimedia.org/r/#/c/3386" [puppet] - https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: Hashar)
[10:01:58] hashar, can you deploy 338726 please?
[10:02:08] (PS1) ArielGlenn: fix prefetch setup for retries of content file dump steps [dumps] - https://gerrit.wikimedia.org/r/338728
[10:02:10] Or anybody else
[10:02:17] Urbanecm: url ? 
and not right now sorry
[10:02:21] in a meeting
[10:02:31] https://gerrit.wikimedia.org/r/338726
[10:02:42] It's a last-minute throttle rule for T158432
[10:02:42] T158432: Lift IP registration cap for an event on 2017-02-20 [IP address currently unknown] - https://phabricator.wikimedia.org/T158432
[10:02:44] hashar, ^
[10:03:57] Oh, sorry, I didn't noticed the "not right". I'll try to find anyone else now...
[10:04:04] *someone
[10:04:31] ah easy
[10:05:58] So would you? Or what should I do?
[10:06:02] If something
[10:08:00] maybe people should face que consecuences of not being diligent enough to request those IP cap lifts in due time
[10:08:12] just sayi'n
[10:08:38] s/que/the
[10:11:17] (PS1) Muehlenhoff: Removed LDAP access for siddarth11 [puppet] - https://gerrit.wikimedia.org/r/338731
[10:11:20] (PS2) ArielGlenn: fix prefetch setup for retries of content file dump steps [dumps] - https://gerrit.wikimedia.org/r/338728
[10:11:57] tabbycat, yeah, they should. They should give IPs in the due time at least. But how to tell it them... 
[10:12:23] (PS1) Ema: Release 4.1.5-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/338732
[10:13:01] (CR) jerkins-bot: [V: -1] Release 4.1.5-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/338732 (owner: Ema)
[10:15:58] hashar: mmh there seems to be something wrong with the debian-glue jenkins job ^
[10:16:01] https://integration.wikimedia.org/ci/job/debian-glue/620/consoleText
[10:16:09] E: Unknown operation: --buildresult
[10:16:53] (PS1) Filippo Giunchedi: Revert "diamond: switch to graphite2001" [puppet] - https://gerrit.wikimedia.org/r/338733 (https://phabricator.wikimedia.org/T157022)
[10:18:57] (CR) Hashar: [C: 2] Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/338726 (https://phabricator.wikimedia.org/T158432) (owner: Urbanecm)
[10:19:20] (CR) Muehlenhoff: [C: 2] Update to 4.4.49 [debs/linux44] - https://gerrit.wikimedia.org/r/338358 (owner: Muehlenhoff)
[10:20:26] (Merged) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/338726 (https://phabricator.wikimedia.org/T158432) (owner: Urbanecm)
[10:20:34] (CR) jenkins-bot: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/338726 (https://phabricator.wikimedia.org/T158432) (owner: Urbanecm)
[10:24:04] !log hashar@tin Synchronized wmf-config/throttle.php: Add new throttle rule - T158432 (duration: 00m 49s)
[10:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:11] T158432: Lift IP registration cap for an event on 2017-02-20 [IP address currently unknown] - https://phabricator.wikimedia.org/T158432
[10:24:21] Urbanecm: done ! 
[10:24:47] (CR) ArielGlenn: [C: 2] fix prefetch setup for retries of content file dump steps [dumps] - https://gerrit.wikimedia.org/r/338728 (owner: ArielGlenn)
[10:25:56] (CR) Muehlenhoff: [C: 2] Removed LDAP access for siddarth11 [puppet] - https://gerrit.wikimedia.org/r/338731 (owner: Muehlenhoff)
[10:26:31] !log ariel@tin Started deploy [dumps/dumps@dee43ca]: fix prefetch on retries of partially complete page content dumps
[10:26:33] !log ariel@tin Finished deploy [dumps/dumps@dee43ca]: fix prefetch on retries of partially complete page content dumps (duration: 00m 02s)
[10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:10] (PS1) Marostegui: mariadb: Add gtid_domain_id to s2 [puppet] - https://gerrit.wikimedia.org/r/338734
[10:28:22] (CR) Gehel: [C: 1] "makes sense to me..." [puppet] - https://gerrit.wikimedia.org/r/338633 (owner: Hashar)
[10:28:35] (CR) Marostegui: [C: -1] "Wait a few days till we are sure s6 is fine" [puppet] - https://gerrit.wikimedia.org/r/338734 (owner: Marostegui)
[10:36:22] (PS2) Marostegui: mariadb: Add gtid_domain_id to s2 [puppet] - https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418)
[10:38:36] (CR) Marostegui: [C: -1] "Compiles fine and only changes s2 hosts https://puppet-compiler.wmflabs.org/5507/" [puppet] - https://gerrit.wikimedia.org/r/338734 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[10:43:40] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg]
[10:54:05] !log rolling restart of nginx on remaining mediawiki servers in eqiad to pick up openssl update
[10:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:12] (PS2) Filippo Giunchedi: Revert "diamond: switch to graphite2001" [puppet] - https://gerrit.wikimedia.org/r/338733 (https://phabricator.wikimedia.org/T157022)
[10:59:37] (CR) Filippo Giunchedi: [C: 2] Revert "diamond: switch to graphite2001" [puppet] - https://gerrit.wikimedia.org/r/338733 (https://phabricator.wikimedia.org/T157022) (owner: Filippo Giunchedi)
[11:00:11] !log switch diamond traffic to graphite1001 - T157022
[11:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:16] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022
[11:01:33] (CR) Volans: [C: -1] "A small bug and a couple of other comments inline." (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: Giuseppe Lavagetto)
[11:04:49] (PS2) Ema: Release 4.1.5-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/338732
[11:04:58] (CR) Hashar: build: allow usage of a different puppet version (1 comment) [puppet] - https://gerrit.wikimedia.org/r/338633 (owner: Hashar)
[11:05:07] (PS1) Addshore: Enable TwoColConflict extension on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/338738 (https://phabricator.wikimedia.org/T158493)
[11:11:39] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[11:13:01] (PS21) Volans: Cumin: allow connection to the targets [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[11:15:20] Urbanecm: simply "declined - can't be done, please provide the data with at least 10 days in advance"
[11:15:22] 
;)
[11:23:36] (PS1) Filippo Giunchedi: cache: move graphite/performance to graphite1001 [puppet] - https://gerrit.wikimedia.org/r/338745 (https://phabricator.wikimedia.org/T157022)
[11:32:48] (PS1) ArielGlenn: make empty list check shorter and clearer [dumps] - https://gerrit.wikimedia.org/r/338747
[11:39:17] (CR) Harej: [C: 1] Bump timeout to 1 minute [puppet] - https://gerrit.wikimedia.org/r/338473 (https://phabricator.wikimedia.org/T158184) (owner: Smalyshev)
[11:49:48] (CR) Hashar: "recheck" [software/cumin] - https://gerrit.wikimedia.org/r/338382 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[11:50:48] (CR) Hashar: "recheck" [software/cumin] (debian) - https://gerrit.wikimedia.org/r/338374 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[11:52:24] (CR) Zfilipin: [C: 1] build: allow usage of a different puppet version (1 comment) [puppet] - https://gerrit.wikimedia.org/r/338633 (owner: Hashar)
[12:00:47] (PS2) Filippo Giunchedi: cache: move graphite/performance to graphite1001 [puppet] - https://gerrit.wikimedia.org/r/338745 (https://phabricator.wikimedia.org/T157022)
[12:00:49] (PS1) Filippo Giunchedi: udpmirror: encode line before sending [puppet] - https://gerrit.wikimedia.org/r/338750
[12:03:04] (PS2) Filippo Giunchedi: udpmirror: encode line before sending [puppet] - https://gerrit.wikimedia.org/r/338750
[12:05:58] (CR) Filippo Giunchedi: [C: 2] udpmirror: encode line before sending [puppet] - https://gerrit.wikimedia.org/r/338750 (owner: Filippo Giunchedi)
[12:07:23] (PS1) MarcoAurelio: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516)
[12:11:23] (CR) Giuseppe Lavagetto: [C: 1] Blacklist kernel modules for DCCP protocol [puppet] - https://gerrit.wikimedia.org/r/338720 (owner: Muehlenhoff)
[12:12:27] (PS2) Muehlenhoff: 
ldap::client::utils: Move to require_package [puppet] - https://gerrit.wikimedia.org/r/338320
[12:12:58] (PS2) Muehlenhoff: Blacklist kernel modules for DCCP protocol [puppet] - https://gerrit.wikimedia.org/r/338720
[12:14:59] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 918 MB (2% inode=97%)
[12:15:37] godog: FYI ^^^
[12:16:54] sigh, thanks volans
[12:17:02] 21G daemon.log
[12:17:03] 21G syslog
[12:17:08] :(
[12:17:16] it's full :/
[12:17:28] delete?
[12:17:39] seems Too many open files godog
[12:17:51] yeah I'm looking too
[12:18:23] (CR) Muehlenhoff: [C: 2] Blacklist kernel modules for DCCP protocol [puppet] - https://gerrit.wikimedia.org/r/338720 (owner: Muehlenhoff)
[12:19:59] RECOVERY - Disk space on graphite1001 is OK: DISK OK
[12:20:56] !log remove syslog from graphite1001, bump max open files for carbon-c-relay
[12:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:19] (PS3) Filippo Giunchedi: cache: move graphite/performance to graphite1001 [puppet] - https://gerrit.wikimedia.org/r/338745 (https://phabricator.wikimedia.org/T157022)
[12:26:21] (PS1) Filippo Giunchedi: graphite: increase maximum open files for frontend carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/338753
[12:26:27] (PS1) ArielGlenn: for page content dumps, for each numbered part do either ranges or whole dump [dumps] - https://gerrit.wikimedia.org/r/338754 (https://phabricator.wikimedia.org/T158517)
[12:30:26] (CR) Filippo Giunchedi: [V: 2 C: 2] graphite: increase maximum open files for frontend carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/338753 (owner: Filippo Giunchedi)
[12:31:15] (CR) jerkins-bot: [V: -1] for page content dumps, for each numbered part do either ranges or whole dump [dumps] - https://gerrit.wikimedia.org/r/338754 (https://phabricator.wikimedia.org/T158517) (owner: ArielGlenn)
[12:32:09] PROBLEM - Disk space on 
elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 60807 MB (12% inode=99%)
[12:37:11] (PS1) DCausse: Elastic 5.2.1 plugins [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/338756
[12:39:05] (CR) Faidon Liambotis: [C: 2] Add debian/ directory for packaging (1 comment) [software/cumin] (debian) - https://gerrit.wikimedia.org/r/338374 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[12:40:51] (CR) DCausse: [C: -1] Elastic 5.2.1 plugins [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/338756 (owner: DCausse)
[12:41:54] (Merged) jenkins-bot: Add debian/ directory for packaging [software/cumin] (debian) - https://gerrit.wikimedia.org/r/338374 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[12:45:09] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 62276 MB (12% inode=99%)
[12:45:31] godog: ^?
[12:53:17] (PS2) ArielGlenn: for page content dumps, for each numbered part do either ranges or whole dump [dumps] - https://gerrit.wikimedia.org/r/338754 (https://phabricator.wikimedia.org/T158517)
[12:56:03] (CR) Faidon Liambotis: [C: -1] "This code is ugly, as it lookupvars() the facts twice. You should integrate the check better with the rest of the code." 
[puppet] - https://gerrit.wikimedia.org/r/308882 (owner: Hashar)
[12:56:29] (PS2) Faidon Liambotis: Move the Diamond NTP collector to ntp::daemon [puppet] - https://gerrit.wikimedia.org/r/338333 (owner: Muehlenhoff)
[12:56:42] (CR) Faidon Liambotis: [C: 2] Move the Diamond NTP collector to ntp::daemon [puppet] - https://gerrit.wikimedia.org/r/338333 (owner: Muehlenhoff)
[12:59:18] (PS2) Faidon Liambotis: mirrors: update archvsync to 20170204 [puppet] - https://gerrit.wikimedia.org/r/338383
[13:00:47] (CR) Faidon Liambotis: [V: 2 C: 2] mirrors: update archvsync to 20170204 [puppet] - https://gerrit.wikimedia.org/r/338383 (owner: Faidon Liambotis)
[13:03:53] jouncebot, next
[13:03:53] No deployments scheduled for the forseeable future!
[13:04:50] !log installing jasper security updates
[13:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:52] (PS2) Faidon Liambotis: apt.w.o: redirect / to wikitech article [puppet] - https://gerrit.wikimedia.org/r/330140 (owner: Ema)
[13:05:58] (CR) Faidon Liambotis: [V: 2 C: 2] apt.w.o: redirect / to wikitech article [puppet] - https://gerrit.wikimedia.org/r/330140 (owner: Ema)
[13:06:26] (CR) Volans: Add schema support (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: Giuseppe Lavagetto)
[13:08:09] RECOVERY - Disk space on elastic1018 is OK: DISK OK
[13:11:59] (PS6) Faidon Liambotis: Linting changes for docker/etcd/kubernetes profiles [puppet] - https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[13:12:14] (CR) jerkins-bot: [V: -1] Linting changes for docker/etcd/kubernetes profiles [puppet] - https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[13:13:59] (CR) Volans: "recheck" [software/cumin] (debian) - 
https://gerrit.wikimedia.org/r/338374 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[13:14:45] (CR) Giuseppe Lavagetto: Add schema support (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: Giuseppe Lavagetto)
[13:15:02] (PS7) Faidon Liambotis: Linting changes for docker/etcd/kubernetes profiles [puppet] - https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[13:15:49] (PS1) Gehel: elasticsearch - reimage elastic10(25|28|29|30) to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/338761 (https://phabricator.wikimedia.org/T151326)
[13:16:49] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:17:40] (CR) Gehel: [C: 2] elasticsearch - reimage elastic10(25|28|29|30) to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/338761 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[13:17:46] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(25|28|29|30).eqiad.wmnet
[13:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:05] Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040538 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1025.eqiad.wmnet'] ``` The...
[13:21:27] Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040539 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1028.eqiad.wmnet'] ``` The...
[13:21:31] Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040540 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1029.eqiad.wmnet'] ``` The...
[13:21:52] Operations, ops-eqiad, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040541 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1030.eqiad.wmnet'] ``` The...
[13:24:29] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:26:30] (PS1) Faidon Liambotis: aptrepo: remove most external sources from precise [puppet] - https://gerrit.wikimedia.org/r/338762
[13:27:00] (CR) Faidon Liambotis: [V: 2 C: 2] aptrepo: remove most external sources from precise [puppet] - https://gerrit.wikimedia.org/r/338762 (owner: Faidon Liambotis)
[13:28:13] (PS2) Hashar: [throttle] New rule [mediawiki-config] - https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312) (owner: Urbanecm)
[13:29:07] Urbanecm: can you check my rebase / conflict fix on https://gerrit.wikimedia.org/r/338128 please ?
[13:29:09] and I will deploy it
[13:30:02] Yep, working on it.
[13:30:04] Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3040563 (MoritzMuehlenhoff)
[13:31:19] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 60914 MB (12% inode=99%)
[13:33:03] Urbanecm: triple checked and it looks good to me
[13:34:43] (CR) Faidon Liambotis: "LGTM to me, would love to see a CR from Luca/Andrew." 
[puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:35:01] Yeah, looks good. [13:35:06] (03CR) 10Hashar: [C: 032] [throttle] New rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312) (owner: 10Urbanecm) [13:35:10] !log Transferring dbstore1001:/srv/backups (the last 2 backups) to dbstore2001:/srv/backup/dbstore1001 - T153768 [13:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:16] T153768: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768 [13:36:42] (03CR) 10Faidon Liambotis: [C: 031] "LGTM, anyone else?" [puppet] - 10https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:36:57] (03Merged) 10jenkins-bot: [throttle] New rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312) (owner: 10Urbanecm) [13:37:42] (03CR) 10jenkins-bot: [throttle] New rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338128 (https://phabricator.wikimedia.org/T158312) (owner: 10Urbanecm) [13:40:38] tested on mwdebug1001 [13:40:48] hashar, throttle rules are testable? [13:41:02] well [13:41:08] just making sure that the site is not fatalling out :} [13:41:11] !log hashar@tin Synchronized wmf-config/throttle.php: [throttle] New rule - T158312 (duration: 00m 42s) [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:17] T158312: Lift registration cap from an IP on en.wp for event on 2017-03-08 - https://phabricator.wikimedia.org/T158312 [13:41:23] Understand. [13:41:23] Urbanecm: deployed! [13:41:28] Thank for your deploy! 
[13:43:49] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:45:39] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:45:49] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused [13:46:40] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational [13:46:49] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mail.wikimedia.org will expire on Mon 23 Oct 2017 06:01:00 PM UTC. [13:48:29] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:49:29] !log installing remaining lcms security updates [13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:08] (03CR) 10DCausse: "PS9 compiler output: https://puppet-compiler.wmflabs.org/5509" [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [13:52:54] !log resetting ownership of new .wsp files for wdqs1002 on graphite[12]001 [13:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:05] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040642 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1028.eqiad.wmnet'] ``` and were **ALL** successful. 
[13:53:29] godog: ^ I found 2 files with strange ownership on graphite servers (owned by root instead of _graphite) [13:54:07] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040655 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1030.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic1030.eqi... [13:54:09] hashar: are there any swats going on today? [13:54:21] (sorry for the direct ping, but you seem active ;) ) [13:54:34] phuedx: yeah I was talking about it in -releng [13:54:44] supposedly there is no deployment on a US holiday [13:54:49] ah [13:55:01] no worries [13:55:02] but I guess if they are very trivial we can do it :) [13:55:02] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040673 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1025.eqiad.wmnet'] ``` and were **ALL** successful. [13:55:09] godog: I assume this is linked to the disk space issue (those metrics were created around that time) [13:55:14] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3040676 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1029.eqiad.wmnet'] ``` and were **ALL** successful. [13:55:25] the one I pushed was a throttling rule which is well covered with tests and imho can be done at anytime [13:55:39] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:52] phuedx: so depends on the patch you wanna push and how confident we will feel about pushing it now :} [13:56:15] hashar: ahhh, is that why there is no deployment calendar?
[13:56:22] I guess [13:56:26] hashar: i've just come back off of holiday and the folk responsible for the change are on holiday today ;) [13:56:36] i also work a short monday [13:56:41] I think usually greg updates the [[Deployments]] page on Friday [13:56:48] so, those things considered, i'll hold off until tomo [13:57:14] phuedx: sounds better :) and we can pair it together tomorrow [13:57:32] (03CR) 10Faidon Liambotis: [C: 04-1] "Yeah, I had a closer look: this won't work. GnuTLS expects a cipher string different than OpenSSL's, so our cipher list won't work. The ma" [puppet] - 10https://gerrit.wikimedia.org/r/335232 (owner: 10BBlack) [13:57:34] \o/ [13:58:38] (03CR) 10Faidon Liambotis: [C: 031] Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:59:05] (03CR) 10Ema: [C: 032] Release 4.1.5-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/338732 (owner: 10Ema) [13:59:11] (03CR) 10Ema: [V: 032 C: 032] Release 4.1.5-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/338732 (owner: 10Ema) [14:00:07] (03PS3) 10Muehlenhoff: ldap::client::utils: Move to require_package [puppet] - 10https://gerrit.wikimedia.org/r/338320 [14:03:33] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:03:33] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.200 second response time [14:06:58] (03CR) 10Muehlenhoff: [C: 032] ldap::client::utils: Move to require_package [puppet] - 10https://gerrit.wikimedia.org/r/338320 (owner: 10Muehlenhoff) [14:08:23] RECOVERY - Disk space on elastic1023 is OK: DISK OK [14:10:16] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(25|28|29|30).eqiad.wmnet [14:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:33] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:25:28] (03PS1) 10Gehel: elasticsearch - reimage elastic10(26|31|36|40) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338768 (https://phabricator.wikimedia.org/T151326) [14:25:56] (03Abandoned) 10Gehel: jessie installs: adding rootdelay=90 to kernel options [puppet] - 10https://gerrit.wikimedia.org/r/337804 (https://phabricator.wikimedia.org/T149845) (owner: 10Gehel) [14:28:08] (03PS1) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [14:30:06] !log varnish 4.1.5-1wm1 uploaded to apt.w.o [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.232 second response time [14:30:40] (03CR) 10Volans: "From a quick look around seems with few lines is possible to get the equivalence of the ciphers based on the hex IDs." 
[puppet] - 10https://gerrit.wikimedia.org/r/335232 (owner: 10BBlack) [14:31:33] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:31:33] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:31:47] (03PS10) 10Gehel: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [14:32:00] !log upgrading pinkunicorn to varnish 4.1.5-1wm1 [14:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:38] (03CR) 10Gehel: [C: 031] Elastic 5.2.1 plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/338756 (owner: 10DCausse) [14:39:47] (03CR) 10Gehel: [C: 032] Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [14:41:24] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#3040850 (10Gehel) @Cmjohnson any news on that disk? [14:42:30] (03PS2) 10Hashar: ontint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [14:43:23] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:43:33] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 34 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:43:43] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:43:43] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:44:11] (03PS1) 10Gehel: elasticsearch: relforge is still on elasticsearch 2.x for a few days [puppet] - 10https://gerrit.wikimedia.org/r/338774 [14:44:23] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:44:32] puppet failures above are mine... checking [14:44:43] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:44:53] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:45:09] gehel: shouldn't be linked no, what files? [14:45:43] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:45:45] (03PS3) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [14:46:12] (03PS1) 10Gehel: elasticsearch: do not manage plugin directory yet [puppet] - 10https://gerrit.wikimedia.org/r/338775 [14:46:23] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:46:43] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:46:51] godog: just a sec, fixing my crap... [14:47:01] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch: do not manage plugin directory yet [puppet] - 10https://gerrit.wikimedia.org/r/338775 (owner: 10Gehel) [14:47:10] (03CR) 10Hashar: [C: 04-1] "Cherry picked PS3 on the CI puppet master. Play testing it on the instance saucelabs-01.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/338770 (owner: 10Hashar) [14:47:33] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:47:43] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:47:43] PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:48:34] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:48:35] (03PS1) 10Gehel: elasticsearch: do not manage plugin dir yet [puppet] - 10https://gerrit.wikimedia.org/r/338776 [14:48:44] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch: do not manage plugin dir yet [puppet] - 10https://gerrit.wikimedia.org/r/338776 (owner: 10Gehel) [14:49:19] !log cp2002, cp4008: libssl1.1 upgraded to 1.1.0e-1+wmf1 and libevent-2.0-5 upgraded to 2.0.21-stable-2+deb8u1 [14:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:23] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:49:24] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:49:34] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:49:53] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:50:23] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/plugins] [14:50:43] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:50:44] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:50:53] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:51:23] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:51:27] (03PS2) 10Gehel: elasticsearch: relforge is still on elasticsearch 2.x for a few days [puppet] - 10https://gerrit.wikimedia.org/r/338774 [14:51:33] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:51:52] (03PS4) 10Filippo Giunchedi: cache: move graphite/performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/338745 (https://phabricator.wikimedia.org/T157022) [14:53:19] godog: the graphite .wsp ownership was probably a mistake during the move of those metrics last week. It was fine on other servers, and as this one was being reimaged (actually importing data), I did not check it until today [14:55:05] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1002.eqiad.wmnet [14:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:09] gehel: ok! let me know what metric/file is affected if you see the same again [14:56:23] godog: yep! [14:56:35] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] cache: move graphite/performance to graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/338745 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [14:56:58] godog: I'm confident that this was me doing an error while moving those. No need to dig further. [14:58:08] gehel: ack, FWIW to avoid similar things I usually su -s /bin/bash _graphite [14:58:36] godog: yeah, that's a solution... 
[14:58:58] (03PS3) 10Gehel: elasticsearch: relforge is still on elasticsearch 2.x for a few days [puppet] - 10https://gerrit.wikimedia.org/r/338774 [14:59:35] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3040931 (10fgiunchedi) [15:00:33] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:02:06] (03PS1) 10Gehel: elasticsearch: force the creation of the plugins directory symlink [puppet] - 10https://gerrit.wikimedia.org/r/338781 [15:03:30] (03CR) 10Gehel: [C: 032] elasticsearch: relforge is still on elasticsearch 2.x for a few days [puppet] - 10https://gerrit.wikimedia.org/r/338774 (owner: 10Gehel) [15:05:34] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:09:23] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:10:43] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:11:23] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:11:53] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:12:23] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:12:33] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:12:43] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:12:43] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:14:23] RECOVERY - puppet last run on 
elastic1018 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:14:43] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:15:33] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:15:43] RECOVERY - puppet last run on elastic2033 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:15:43] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:16:23] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:17:23] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:17:24] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:17:33] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:18:43] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:19:43] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:20:23] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:30:24] 06Operations, 10DBA, 10procurement: New DBs purchase: codfw and eqiad final figures - https://phabricator.wikimedia.org/T158580#3040989 (10Marostegui) [15:41:23] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:05:18] (03PS4) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [16:18:23] PROBLEM - puppet 
last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:06] (03PS13) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [16:28:27] (03PS5) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [16:28:37] (03PS1) 10KartikMistry: Deploy Compact Language Links in Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338790 (https://phabricator.wikimedia.org/T157114) [16:31:53] (03PS1) 10Reedy: Disable DisableAccount on two wikis were no disabled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) [16:36:14] (03Abandoned) 10Ema: Allow misc directors to specify url path conditions as well as Host conditions [puppet] - 10https://gerrit.wikimedia.org/r/322964 (owner: 10Ottomata) [16:39:54] 06Operations, 10DBA: Puppetize grants for mysql backups on dbstore hosts - https://phabricator.wikimedia.org/T111929#3041133 (10jcrespo) [16:39:57] (03PS2) 10Reedy: Disable DisableAccount on wikis where there are no disabled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) [16:40:32] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(26|31|36|40).eqiad.wmnet [16:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:37] (03PS2) 10Gehel: elasticsearch - reimage elastic10(26|31|36|40) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338768 (https://phabricator.wikimedia.org/T151326) [16:41:08] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3041135 (10MoritzMuehlenhoff) [16:47:12] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic10(26|31|36|40) to jessie and move data to /srv [puppet] - 
10https://gerrit.wikimedia.org/r/338768 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [16:47:23] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:47:50] (03PS1) 10Jcrespo: Add python3-tabulate package and labsdb password for clients-only [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338793 (https://phabricator.wikimedia.org/T146149) [16:56:36] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041166 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1040.eqiad.wmnet'] ``` The... [16:57:33] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [17:04:10] (03PS1) 10Jcrespo: mariadb: rebase module to the latest version and separate labs pass [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) [17:04:57] (03PS2) 10Jcrespo: Add python3-tabulate package and labsdb password for clients-only [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338793 (https://phabricator.wikimedia.org/T146149) [17:06:31] (03PS2) 10Jcrespo: mariadb: rebase module to the latest version and separate labs pass [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) [17:07:32] (03CR) 10Marostegui: [C: 031] Add python3-tabulate package and labsdb password for clients-only [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338793 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [17:16:10] (03CR) 10Marostegui: [C: 031] "Looks good: https://puppet-compiler.wmflabs.org/5510/" [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [17:16:12] (03CR) 10Jcrespo: [V: 032 C: 032] Add 
python3-tabulate package and labsdb password for clients-only [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/338793 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [17:16:28] (03PS3) 10Jcrespo: mariadb: rebase module to the latest version and separate labs pass [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) [17:18:19] (03PS4) 10Jcrespo: mariadb: rebase module to the latest version and separate labs pass [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) [17:20:17] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041210 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1040.eqiad.wmnet'] ``` and were **ALL** successful. [17:22:46] (03PS1) 10Jcrespo: Add labsdb root pass fake string to make puppet compiler work [labs/private] - 10https://gerrit.wikimedia.org/r/338800 (https://phabricator.wikimedia.org/T104900) [17:24:44] (03CR) 10Marostegui: [C: 031] Add labsdb root pass fake string to make puppet compiler work [labs/private] - 10https://gerrit.wikimedia.org/r/338800 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [17:25:26] (03CR) 10Jcrespo: [V: 032 C: 032] Add labsdb root pass fake string to make puppet compiler work [labs/private] - 10https://gerrit.wikimedia.org/r/338800 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [17:27:03] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041214 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1036.eqiad.wmnet'] ``` The... 
[17:27:07] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041215 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1031.eqiad.wmnet'] ``` The... [17:29:27] (03CR) 10Jcrespo: [C: 031] "It removes the socket on all servers https://puppet-compiler.wmflabs.org/5511/db2034.codfw.wmnet/ , but I think that is something we want," [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [17:30:34] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: rebase module to the latest version and separate labs pass [puppet] - 10https://gerrit.wikimedia.org/r/338797 (https://phabricator.wikimedia.org/T104900) (owner: 10Jcrespo) [17:36:54] (03PS1) 10Filippo Giunchedi: role: install apache mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/338803 [17:36:56] (03PS1) 10Filippo Giunchedi: uwsgi: parametrize service settings [puppet] - 10https://gerrit.wikimedia.org/r/338804 [17:36:58] (03PS1) 10Filippo Giunchedi: coal: disable uwsgi autoload [puppet] - 10https://gerrit.wikimedia.org/r/338805 [17:42:45] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 78 bytes in 0.073 second response time [17:47:29] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041300 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1026.eqiad.wmnet'] ``` The... [17:49:39] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041304 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1031.eqiad.wmnet'] ``` and were **ALL** successful. 
[17:52:46] !log update change-prop to 30873ebd5 [17:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:00] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041321 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1036.eqiad.wmnet'] ``` and were **ALL** successful. [17:54:28] !log ppchelko@tin Started deploy [changeprop/deploy@30873eb]: Update change-prop to 30873ebd5: enabling DNS caching for T158338 [17:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:34] T158338: Set up DNS caching for node services - https://phabricator.wikimedia.org/T158338 [17:56:10] !log ppchelko@tin Finished deploy [changeprop/deploy@30873eb]: Update change-prop to 30873ebd5: enabling DNS caching for T158338 (duration: 01m 41s) [17:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:00] (03Abandoned) 10Paladox: Up max_execution to 15 from 10 in phabricator/php.ini.erb [puppet] - 10https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: 10Paladox) [17:57:03] (03Abandoned) 10Paladox: Up post_max_size to 50M in phabricator's php.ini file [puppet] - 10https://gerrit.wikimedia.org/r/335717 (owner: 10Paladox) [17:57:25] (03PS1) 10Volans: Improvements in the metadata and package setup [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) [17:57:49] (03PS14) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [17:57:52] (03PS14) 10Paladox: Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 [17:58:44] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5514/" [puppet] - 10https://gerrit.wikimedia.org/r/338805 (owner: 
10Filippo Giunchedi) [17:59:02] (03PS2) 10Filippo Giunchedi: role: install apache mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/338803 [17:59:10] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] role: install apache mod_proxy_http [puppet] - 10https://gerrit.wikimedia.org/r/338803 (owner: 10Filippo Giunchedi) [18:04:34] (03CR) 10Giuseppe Lavagetto: [C: 031] syntax: ignore stdlib Puppet 4 manifests [puppet] - 10https://gerrit.wikimedia.org/r/338143 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [18:04:41] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [18:08:57] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3041361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1026.eqiad.wmnet'] ``` and were **ALL** successful. [18:14:36] (03PS1) 10Jcrespo: [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 [18:15:13] (03CR) 10Jcrespo: [C: 04-2] "Not intended for puppet deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/338809 (owner: 10Jcrespo) [18:15:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 (owner: 10Jcrespo) [18:17:21] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 61935 MB (12% inode=99%) [18:18:49] (03PS2) 10Jcrespo: [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 [18:29:04] (03PS1) 10Gehel: elasticsearch - reimage elastic10(27|32|37|41) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/338811 (https://phabricator.wikimedia.org/T151326) [18:29:24] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(26|31|36|40).eqiad.wmnet [18:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:11] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(27|32|37|41).eqiad.wmnet [18:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:25] RECOVERY - Disk space on elastic1023 is OK: DISK OK [18:30:49] Reedy: looks like that disableaccount sh*t is going to give us a headache [18:31:00] (03CR) 10Jcrespo: [C: 04-2] "This simplifies this problem (at the same time that enforces TLS usage):" [puppet] - 10https://gerrit.wikimedia.org/r/338809 (owner: 10Jcrespo) [18:31:06] iirc what the extension did is to remove the user credentials [18:32:15] I think it still does [18:32:25] Was the group adding done later? [18:32:48] // While we're not actually turning the user into a "system" user, it [18:32:48] // has the same end result: all passwords and other authentication [18:32:48] // credentials removed or set to something invalid, email blanked, [18:32:48] // token invalidated, and existing sessions dropped. So let's just use [18:32:48] // that if possible instead of duplicating all the code. 
[18:33:39] Reedy: if you disable an account with that extension, the extension added the inactive group (which is default for those on private.dblist) [18:33:55] the inactive group existed way before though [18:34:08] and we've been adding blocked accounts to that group too [18:34:28] I guess there's no problems removing the users from that group? [18:34:37] Probably not [18:34:40] it does not have any rights attached [18:35:00] Can just do it from the db if we don't care about the log entries [18:35:17] maybe for when the extension gets removed [18:35:32] Yeah, certainly makes sense there [18:35:50] so we avoid things like https://phabricator.wikimedia.org/T158413 reedy [18:36:43] It's a mess [18:36:51] All the more reason to get it removed [18:36:56] (03PS6) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [18:38:10] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3041423 (10faidon) p:05Triage>03High [18:38:26] (03CR) 10Hashar: [V: 031 C: 031] "Cherry picked against tip of production branch and on the CI Puppet master. I have provisioned the instances saucelabs-01 02 and 03 with t" [puppet] - 10https://gerrit.wikimedia.org/r/338770 (owner: 10Hashar) [18:43:56] Reedy: since I've got some time, I'll manually remove the 'inactive' flag for those accounts [18:44:16] so the script does not run twice on those accounts [18:44:37] well, on a second thought, no [18:44:44] too much work xD [18:45:31] // Try to update block if user is already blocked. Otherwise, attempt to insert a new one. [18:45:31] $success = $alreadyBlocked ? 
$block->update() : $block->insert(); [18:47:01] The script does remove people from the group when they've been "migrated" [18:50:04] (03CR) 10ArielGlenn: [C: 032] make empty list check shorter and clearer [dumps] - 10https://gerrit.wikimedia.org/r/338747 (owner: 10ArielGlenn) [18:51:34] (03CR) 10ArielGlenn: [C: 032] for page content dumps, for each numbered part do either ranges or whole dump [dumps] - 10https://gerrit.wikimedia.org/r/338754 (https://phabricator.wikimedia.org/T158517) (owner: 10ArielGlenn) [19:06:03] (03PS1) 10Tim Starling: Route PHP warnings from the handler into logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) [19:06:41] 06Operations, 10Traffic, 07Mobile: Samsung Internet's desktop mode getting redirected to mobile site - https://phabricator.wikimedia.org/T158599#3041524 (10MaxSem) [19:07:20] !log ariel@tin Started deploy [dumps/dumps@9757356]: fix retries of page content dumps with checkpoint, no dup ranges [19:07:22] !log ariel@tin Finished deploy [dumps/dumps@9757356]: fix retries of page content dumps with checkpoint, no dup ranges (duration: 00m 02s) [19:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:24] (03CR) 10BryanDavis: "Can we start with this just going to the udp2log aggregator until we had an idea of what the actual volume is?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [19:15:32] (03PS2) 10MarcoAurelio: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) [19:15:52] (03CR) 10Gergő Tisza: "Is there any reason to use the -json channel? I thought we were trying to abandon them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) (owner: 10Tim Starling) [19:22:13] (03PS2) 10Tim Starling: Route PHP warnings from the handler into logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338820 (https://phabricator.wikimedia.org/T45086) [19:22:35] (03PS2) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:22:44] (03CR) 10jerkins-bot: [V: 04-1] Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:24:18] (03PS3) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:24:27] (03CR) 10jerkins-bot: [V: 04-1] Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:24:50] ... [19:29:09] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:29:17] (03CR) 10jerkins-bot: [V: 04-1] Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:29:40] hashar: maybe because the change it depends-on is still being processed? 
[19:29:44] tabbycat: that one is a glitch in the matrix [19:29:56] ah yeah there is a depends-on [19:30:02] that does not quite make any sense [19:30:06] just chain the patches in gerrit [19:30:17] git checkout master [19:30:22] git reset --hard origin/master [19:30:23] git-review -x 338751 [19:30:33] git-review -x 338632 [19:30:38] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:30:40] amend the commit message to remove the Depends-On header [19:30:44] and send back [19:30:46] (03CR) 10jerkins-bot: [V: 04-1] Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:30:53] okay [19:31:01] Gerrit has built-in support for patches that depend on each other [19:31:04] just have to send them in the same chain [19:31:14] (03PS4) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:31:21] eg: (master) -> 338751 --> 338632 [19:31:35] but that only works when the patches are all in the same repository [19:31:43] the depends-on: hack is for when changes are in different repos [19:31:57] I've added the depends-on tag in the past w/o problems [19:32:03] not sure why it fails now [19:32:14] I am not sure what is going on in Zuul, but it seems it ignores the depends-on [19:32:26] + noob dev here, not complicated stuff please :) [19:32:33] yeah :} [19:33:32] what are you trying to achieve ? [19:34:29] hashar the zuul plugin will make it easier for gerrit to understand depends-on:. [19:34:48] by that i mean server side but also client side too [19:34:49] paladox: a Gerrit zuul plugin ?
[19:34:55] hashar yep, let me find it [19:35:02] hashar: well, not that it really matters much, but I feel I should have the change I marked as depends-on deployed first, then merge this one that was failing [19:35:20] hashar https://gerrit.googlesource.com/plugins/zuul/ [19:35:25] https://gerrit.googlesource.com/plugins/zuul/+/master/src/main/resources/Documentation/about.md [19:35:26] tabbycat: so the easiest is to have them one after the other in your local git repo [19:35:46] tabbycat: send that to Gerrit and Gerrit will notice one depends on another automatically [19:36:06] paladox: fun. Is openstack using it ? [19:36:22] hashar maybe, but not sure. It was developed by zaro. [19:36:42] (03PS3) 10MarcoAurelio: Configuration changes for wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338751 (https://phabricator.wikimedia.org/T158516) [19:36:56] hashar actually nope, their gerrit version is too old to support the plugin [19:37:01] requires gerrit 2.13+ [19:37:01] paladox: yeah he was sponsored by HPE to work on openstack iirc [19:37:06] oh [19:37:20] they are planning to update to gerrit 2.13 i think. [19:37:33] i did convert zuul to bazel format so should hopefully work on gerrit 2.14 too [19:37:35] maybe we should give it a try eventually [19:37:43] * tabbycat testing again [19:37:48] then I am already swamped in a lot of various stuff :( [19:37:51] Yep [19:38:17] (03PS5) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:38:25] (03CR) 10jerkins-bot: [V: 04-1] Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) (owner: 10MarcoAurelio) [19:38:41] okay so it was not that the dependent change was un-rebased [19:38:43] rv.
[19:38:48] 06Operations, 10Traffic, 07Mobile: Samsung Internet's desktop mode getting redirected to mobile site - https://phabricator.wikimedia.org/T158599#3041524 (10revi) Interesting, my Samsung Galaxy A7 (2016)'s bundled Samsung Internet correctly handles 'request desktop version'. Maybe it's for Google Play version... [19:39:01] anyway I gotta escape [19:39:02] (03PS6) 10MarcoAurelio: Removing the 'shellmanagers' group from Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338632 (https://phabricator.wikimedia.org/T158482) [19:39:03] and hunt for some food [19:39:17] see you tomorrow! [19:39:28] bye! [19:48:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:48:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:19:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [20:20:00] !log reducing concurrent recoveries / relocations to 4 on elasticsearch eqiad [20:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [20:27:18] (03PS1) 10Papaul: Add mgmt and production DNS for ms-be2028-msbe2039 Bug: T158337 [dns] - 10https://gerrit.wikimedia.org/r/338824 (https://phabricator.wikimedia.org/T158337) [20:31:21] !log taking threaddumps and restarting elastic1017 (high load) [20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:55] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [20:36:57] 06Operations, 10Continuous-Integration-Infrastructure, 10netops: jsduck publish error: index-pack 
died of signal 15 - https://phabricator.wikimedia.org/T158601#3041635 (10hashar) The jenkins jobs triggered by Zuul clones the repo from the zuul-merger instances on contint1001 / contint2001. They are being ser... [20:37:41] 06Operations, 10Continuous-Integration-Infrastructure, 10netops: git clone over EQIAD (wmflabs) CODFW timeout due to low bandwidth (~250 KiB/s) - https://phabricator.wikimedia.org/T158601#3041638 (10hashar) [20:39:45] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:21] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10netops: git clone over EQIAD (wmflabs) CODFW timeout due to low bandwidth (~250 KiB/s) - https://phabricator.wikimedia.org/T158601#3041644 (10Paladox) [20:41:11] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 10netops: git clone over EQIAD (wmflabs) CODFW timeout due to low bandwidth (~250 KiB/s) - https://phabricator.wikimedia.org/T158601#3041581 (10Paladox) Is this happening to any other repos? Should we set this as normal or high priority? [20:41:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:41:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite2001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:46:35] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.029 second response time [20:50:04] 06Operations, 10RESTBase, 06Services (later): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1641285 (10Pchelolo) I think we have to proceed on this. Right now, without local logging, if something breaks with Logstash (see T158602) we're losing all the logs completely, which... 
[21:07:45] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:10:36] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:14:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:14:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:32:51] (03PS3) 10Zppix: Update the realname from github repo url --> phabricator diffusion [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 [21:32:56] (03CR) 10Zppix: [C: 031] Update the realname from github repo url --> phabricator diffusion [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [21:38:35] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:40:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:40:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite2001 is OK: OK: Less than 20.00% above the threshold [500.0] [22:05:52] (03CR) 10Hashar: [C: 031] Update the realname from github repo url --> phabricator diffusion [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/338700 (owner: 10Zppix) [22:22:25] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:45:45] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [22:46:45] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3851141 keys, up 112 days 14 hours - replication_delay is 0 [22:46:55] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:47:55] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3850292 keys, up 112 days 14 hours - replication_delay is 0 [22:51:25] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [23:18:55] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.284 second response time [23:24:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.042 second response time [23:25:04] (03PS1) 10ArielGlenn: add api job handler, config file in yaml, siteinfo props jobs [dumps] - 10https://gerrit.wikimedia.org/r/338899 (https://phabricator.wikimedia.org/T38178) [23:38:25] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:38:35] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is OK: OK - nfs-exportd is active [23:41:35] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating