[00:00:47] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [00:17:52] 06Operations, 10Traffic, 07HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (10Platonides) >>! In T92002#1879207, @BBlack wrote: > The issue here is that for PKP to assert validity, it's not enough that we're signed by a CA that's on our list... [00:20:30] (03PS5) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [00:21:39] (03CR) 10jerkins-bot: [V: 04-1] Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (owner: 10Andrew Bogott) [00:23:30] (03PS6) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 [00:28:37] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:33:53] (03PS7) 10Andrew Bogott: Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (https://phabricator.wikimedia.org/T150091) [00:35:57] (03CR) 10Andrew Bogott: [C: 032] Keystonehooks: Sync ldap project groups with keystone project membership [puppet] - 10https://gerrit.wikimedia.org/r/338918 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [00:39:58] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054564 (10Dzahn) I was able to install jessie on the-server-formerly-known-as-hooft as "bast3002". It did not work over http and over tftp it was still very slow but it did work. ``` Debian... 
[00:49:57] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1802.310592 Seconds [00:50:57] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 31.588395 Seconds [00:53:35] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054590 (10Dzahn) >>! In T151798#3053857, @dpatrick wrote: > I've reviewed both content and technical implementation Thank you! Way more detailed than exp... [00:56:27] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [00:56:32] (03PS1) 10Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) [00:56:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [00:57:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4008488 keys, up 116 days 16 hours - replication_delay is 0 [00:58:11] (03PS2) 10Krinkle: [WIP] mediawiki: Add cache-warmup to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/339802 (https://phabricator.wikimedia.org/T156922) [01:01:41] (03PS1) 10Dzahn: annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) [01:01:55] (03CR) 10jerkins-bot: [V: 04-1] annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:03:23] (03PS2) 10Dzahn: annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) [01:04:33] (03CR) 10Dzahn: 
"https://geekflare.com/secure-apache-from-clickjacking-with-x-frame-options/" [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:05:08] (03CR) 10Dzahn: [C: 032] annualreport: add X-Frame-Options header to Apache config [puppet] - 10https://gerrit.wikimedia.org/r/339803 (https://phabricator.wikimedia.org/T151798) (owner: 10Dzahn) [01:11:35] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054621 (10Dzahn) >>! In T151798#3053857, @dpatrick wrote: > * [[ https://www.owasp.org/index.php/OWASP_Secure_Headers_Project#X-Frame-Options | X-Frame-Op... [01:35:07] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1801.311485 Seconds [01:36:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 42.38049 Seconds [01:39:30] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054637 (10Dzahn) a:05ZMcCune>03Dzahn [01:43:26] !log bast3002 - sign puppet cert, initial run with basic "bastion" role, to replace broken bast3001, but WIP, ganglia/prometheus roles not moved yet (T156506) [01:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:32] T156506: Replace bast3001 - https://phabricator.wikimedia.org/T156506 [01:48:43] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3054655 (10Dzahn) Old index redirect is cached but that's known and ok that way until Monday. 
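The Postgres replication-lag alerts above flip from CRITICAL (~1800 seconds) back to OK (tens of seconds) within a minute. A minimal sketch of the threshold comparison such a check might apply — the 1800-second critical and 300-second warning thresholds here are assumptions read off the alert text, not the actual check configuration:

```python
def rep_lag_status(delay_seconds, warn=300.0, crit=1800.0):
    """Map a replication delay in seconds to a Nagios-style state line.

    The warn/crit thresholds are illustrative guesses, not the real config.
    """
    if delay_seconds >= crit:
        return "CRITICAL - Rep Delay is: %f Seconds" % delay_seconds
    if delay_seconds >= warn:
        return "WARNING - Rep Delay is: %f Seconds" % delay_seconds
    return "OK - Rep Delay is: %f Seconds" % delay_seconds

print(rep_lag_status(1802.310592))  # the maps-test2002 spike above
print(rep_lag_status(31.588395))    # the recovery one minute later
```

The quick flap suggests a single delayed WAL apply catching up, which is why the recovery lands a minute after the problem.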
[01:56:04] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3048354 (10Dzahn) @Muehlenhoff ^ fyi,re: sudo permissions not appearing in the data.yaml [01:57:21] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054667 (10Dzahn) ``` [bast3002:~] $ gen_fingerprints +---------+---------+-------------------------------------------------+ | Cipher | Algo | Fingerprint... [02:00:49] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3054668 (10Dzahn) Next is https://gerrit.wikimedia.org/r/#/c/339684/ and moving the roles: installserver::tftp prometheus::ops ganglia::monitor::aggregator from 3001 to 3002, then shutt... [02:19:49] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 07m 20s) [02:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:01] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1800.474118 Seconds [02:21:01] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 21.463512 Seconds [02:25:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Feb 25 02:25:10 UTC 2017 (duration 5m 21s) [02:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [02:42:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [02:53:01] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:22:01] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [03:23:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.10 seconds [03:26:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.11 seconds [04:05:56] (03CR) 10Tim Landscheidt: "On Precise Labs instances, this gives:" [puppet] - 10https://gerrit.wikimedia.org/r/339231 (owner: 10Faidon Liambotis) [05:00:09] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:41:09] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:07:09] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:09:09] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:31:29] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:36:09] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:59:29] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:29:49] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:57:49] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:23:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.129 second response time [08:50:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.144 second response time [09:05:43] (03Abandoned) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [09:06:20] (03Abandoned) 10Giuseppe Lavagetto: realm: convert main_ipaddress and site into facts [puppet] - 10https://gerrit.wikimedia.org/r/311223 (https://phabricator.wikimedia.org/T85459) (owner: 10Giuseppe Lavagetto) [10:19:59] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1809.867048 Seconds [10:20:59] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 28.270515 Seconds [10:36:59] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:59] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:10:39] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:23:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.166 second response time [11:34:29] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 376.26 seconds [11:38:39] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:50:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.185 second response time [12:09:09] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:09] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:29] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 34.21 seconds [12:37:09] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:43:04] Reedy: ping [12:48:09] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:49:28] Krinkle: around? [12:57:29] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:01:19] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:01:19] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
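The citoid CHECK_NRPE socket timeouts above boil down to a TCP connect with a deadline. A minimal, self-contained sketch of that probe logic — the message wording and error handling are illustrative, not NRPE's actual implementation; the 10-second default matches the "Socket timeout after 10 seconds" text in the alerts:

```python
import socket

def check_tcp(host, port, timeout=10.0):
    """Probe a TCP endpoint NRPE-style: OK on connect, CRITICAL otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except socket.timeout:
        return "CRITICAL: Socket timeout after %d seconds." % int(timeout)
    except OSError:
        return "CRITICAL: connection failed"

# Demonstrate against a port that is almost certainly closed:
# bind to an ephemeral port, note its number, release it, then probe it.
s = socket.socket()
s.bind(("127.0.0.1", 0))
free_port = s.getsockname()[1]
s.close()
print(check_tcp("127.0.0.1", free_port, timeout=2.0))
```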
[13:01:39] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:00] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:09] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:09] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [13:02:12] maybe somebody could *dry-run* namespaceDupes.php for ext.wikipedia to check a thing? [13:02:19] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:02:21] tnx [13:02:29] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.006 second response time [13:02:59] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:02:59] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [13:27:29] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:15:05] 06Operations, 10Phabricator, 06Release-Engineering-Team: Update file phab_epipe.py to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159043#3055204 (10Paladox) [14:21:51] 06Operations, 10Phabricator: Update phabricator.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055228 (10Paladox) [14:26:00] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:33:09] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:46:19] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:47:59] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
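Process checks like the thumbor1001 dhclient one above report counts such as "PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion". A toy sketch of that counting logic over a snapshot of command lines — the process list below is fabricated for illustration:

```python
import re

def count_procs(cmdlines, pattern):
    """Count command lines whose arguments match a regex, check_procs-style."""
    rx = re.compile(pattern)
    return sum(1 for cmd in cmdlines if rx.search(cmd))

# Fabricated snapshot of a host's process command lines.
procs = [
    "/usr/bin/python /usr/bin/salt-minion",
    "/usr/sbin/sshd -D",
]

n = count_procs(procs, r"^/usr/bin/python /usr/bin/salt-minion")
print("PROCS OK: %d process with regex args "
      "^/usr/bin/python /usr/bin/salt-minion" % n)
```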
[14:48:19] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:48:49] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [14:49:09] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:54:00] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:01:09] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:01:59] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:14:19] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:16:09] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:59] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:38:29] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:06:29] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:06:49] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:17:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 24 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:22:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 13 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:30:08] (03PS3) 10Tim Landscheidt: Tools: Outfactor jobkill script to toollabs::node::all [puppet] - 10https://gerrit.wikimedia.org/r/335755 [16:32:43] (03PS4) 10Tim Landscheidt: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [16:35:49] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:36:08] 06Operations: reset admin password for Wikimania-l - https://phabricator.wikimedia.org/T159048#3055305 (10Paladox) [16:37:01] (03PS2) 10Tim Landscheidt: puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) [16:47:49] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:05] (03CR) 10Paladox: [C: 031] puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/338540 (https://phabricator.wikimedia.org/T95158) (owner: 10Tim Landscheidt) [16:50:49] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:49] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:09] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:19] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:51:19] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:52:59] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [16:53:20] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up [16:53:20] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set [16:53:49] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:39] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient [16:55:39] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [16:55:39] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [16:55:39] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [16:56:19] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:24:19] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:41:39] (03PS1) 10Volans: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) [17:41:41] (03PS1) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [17:48:02] (03PS2) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [17:51:21] (03PS2) 10Volans: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) [17:52:57] (03CR) 10Volans: [C: 032] Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [17:54:08] (03Merged) 10jenkins-bot: Fix additional minor issues reported by codacy [software/cumin] - 10https://gerrit.wikimedia.org/r/339833 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:02:01] 06Operations, 10Phabricator, 07Technical-Debt: Update phabricator.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055365 (10Aklapper) [18:02:08] 06Operations, 10Phabricator, 06Release-Engineering-Team, 07Technical-Debt: Update file phab_epipe.py to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159043#3055367 (10Aklapper) [18:02:15] (03PS3) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [18:03:13] 06Operations, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - 
https://phabricator.wikimedia.org/T159045#3055371 (10Paladox) [18:06:22] (03PS4) 10Volans: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) [18:08:42] (03CR) 10Volans: [C: 032] Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:09:18] (03Merged) 10jenkins-bot: Make docstring pep257 compliant [software/cumin] - 10https://gerrit.wikimedia.org/r/339834 (https://phabricator.wikimedia.org/T158967) (owner: 10Volans) [18:11:36] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055379 (10Volans) a:03Volans [18:17:57] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update wmf_auto_reimage.py file to use maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055383 (10Volans) p:05Triage>03Normal [18:19:32] paladox: seems that createtask is also frozen and will be deprecated [18:19:43] Yep [18:19:58] then I need another task for raid_handler.py... let me open it so I don't forget [18:20:09] Ok [18:20:23] volans you could just edit the one I created to say replace all deprecated code [18:20:47] for those 2 yeah, makes sense, given that they're both 'mine' :) [18:21:00] Yep :) [18:21:17] volans I've been trying to figure out how to replace it in its-phabricator (gerrit) [18:21:35] I came up with a solution like [18:21:43] https://gerrit-review.googlesource.com/#/c/98576/ [18:22:09] but it doesn't build locally, it fails, so I need to fix that too :) [18:22:09] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:23:00] lol, ok I will take a look when doing it, I guess it's not that urgent given that it's now frozen, then it will be deprecated and then removed [18:25:50] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update Puppet repo code that uses maniphest.edit conduit api - https://phabricator.wikimedia.org/T159045#3055400 (10Volans) [18:27:16] 06Operations, 06Operations-Software-Development, 10Phabricator, 07Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045#3055404 (10Paladox) [18:27:43] thanks for the news [18:29:50] You're welcome :) [18:43:59] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:09] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:58:27] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3055435 (10Halfak) I'd thought that maybe we could partition requests to limit our need for replication. E.g. even rev_ids go to eqiad and odd rev_ids go to codfw. That way, w... [19:10:59] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:17:29] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:32:09] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:59] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:36:59] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:09] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:19] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:39] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:49] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: 
cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:37:59] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:00] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:00] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:01] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:01] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:07] any ops ^^? [19:38:09] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2017_v4, cp2017_v6 [19:38:09] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:38:09] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:38:19] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [19:39:06] (03PS5) 10Tim Landscheidt: Tools: Outfactor the configuration for outgoing HBA connections [puppet] - 10https://gerrit.wikimedia.org/r/267832 [19:39:19] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:39] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [19:47:29] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:54:05] !log powercycled cp2017, mgmt console stuck [19:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] RECOVERY - Host cp2017 
is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [19:55:39] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [19:55:39] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [19:55:40] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [19:55:49] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [19:55:59] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [19:55:59] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [19:56:00] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [19:56:00] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [19:56:01] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [19:56:01] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [19:56:02] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [19:56:19] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [19:56:19] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [19:56:39] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [19:57:32] the host seems up and running, but I'm not sure what's best (pooled=no for investigation, or leave it running) [19:59:26] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cp2017 - host frozen, not seeing anything weird in metrics before the stop [20:01:58] varnishlog shows 200s [20:03:49] I'd be tempted to execute the depool command (that basically 
calls confctl) [20:03:58] ema: any chance you are there? [20:04:02] (or bblack) [20:06:46] !log depooled cp2017 (via local sudo -i depool command) since the host froze (it got back after a powercycle) [20:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:29] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [20:08:39] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:08:49] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [20:09:09] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:31] <_joe_> some network issue? [20:10:34] <_joe_> ok no [20:10:39] 06Operations, 10Traffic: cp2017 froze and stopped serving traffic - https://phabricator.wikimedia.org/T159056#3055512 (10elukey) [20:10:47] <_joe_> it's just that etcdmirror failed because of my experiments [20:11:03] :) [20:11:07] Ciao _joe_ [20:11:30] <_joe_> something/someone tried to depool a server including the non-active services [20:11:35] <_joe_> oh IT WAS YOU [20:11:47] * _joe_ blames elukey [20:11:53] I ran depool! :P [20:12:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:12:41] <_joe_> :P [20:12:45] <_joe_> let me ack that [20:14:15] ACKNOWLEDGEMENT - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. 
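Per the discussion above, the local `depool` wrapper "basically calls confctl". A hypothetical sketch of the invocation it might build — the `name=...` selector and `set/pooled=no` action mirror documented confctl usage, but the wrapper's real internals and the optional service filter are assumptions:

```python
def depool_command(host, service=None):
    """Build a confctl invocation that sets pooled=no for a host.

    The selector/action syntax follows documented confctl usage; the
    optional service filter is an illustrative assumption.
    """
    selector = "name=%s" % host
    if service:
        selector += ",service=%s" % service
    return ["confctl", "select", selector, "set/pooled=no"]

print(" ".join(depool_command("cp2017.codfw.wmnet")))
```

Depooling this way takes the frozen-then-recovered cache host out of rotation for investigation without shutting it down.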
[20:14:15] ACKNOWLEDGEMENT - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. [20:14:15] ACKNOWLEDGEMENT - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed Giuseppe Lavagetto Etcd replica is broken because of my experiments with conftool. [20:14:32] <_joe_> ok, cool [20:37:10] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:41:39] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:04:18] (03PS7) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935 [21:07:17] (03PS4) 10Tim Landscheidt: postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 [21:53:29] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.199 second response time [22:13:29] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:20:29] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.185 second response time [22:38:29] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:41:29] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [23:06:29] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [23:38:28] (03CR) 10Tim Landscheidt: "(Did test it some time ago, works fine.)" [puppet] - 10https://gerrit.wikimedia.org/r/326892 (owner: 10Tim Landscheidt) [23:39:20] (03PS7) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [23:39:22] (03PS5) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030