[00:06:45] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:10:45] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[00:16:34] (03PS1) 10Dzahn: Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103
[00:17:28] (03CR) 10Dzahn: "doesn't work as expected but found another work-around that doesn't need a gerrit change and just involves web ui. http://www.htmlgraphic." [puppet] - 10https://gerrit.wikimedia.org/r/341103 (owner: 10Dzahn)
[00:18:11] (03PS2) 10Dzahn: Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103
[00:21:30] (03CR) 10Dzahn: [C: 032] Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103 (owner: 10Dzahn)
[00:22:43] (03CR) 10Dzahn: "i did this instead to make the 2 special hosts appear as UP: http://www.htmlgraphic.com/nagios-check-host-without-ping/" [puppet] - 10https://gerrit.wikimedia.org/r/341037 (owner: 10Dzahn)
[00:24:53] ACKNOWLEDGEMENT - Check systemd state on graphite2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@a service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@b service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@c service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@d service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@e service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@f service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:54] ACKNOWLEDGEMENT - carbon-cache@g service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:54] ACKNOWLEDGEMENT - carbon-cache@h service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:55] ACKNOWLEDGEMENT - carbon-frontend-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:55] ACKNOWLEDGEMENT - carbon-local-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
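
The workaround Dzahn references at 00:22 (making the two special hosts show as UP without pinging them) is not shown in the log itself; the linked page describes pointing the host check at the stock check_dummy plugin so the host check always returns OK. A minimal sketch of that technique, assuming a plain Icinga/Nagios object file layout (file paths and the command name here are illustrative, not the actual WMF puppetized configuration):

```
# Sketch only: a host check that always returns OK, so hosts that do not
# answer ICMP are still shown as UP. check_dummy ships with the standard
# monitoring plugins; 0 = OK.
cat >> /etc/icinga/objects/commands.cfg <<'EOF'
define command{
    command_name    check-host-alive-fake
    command_line    /usr/lib/nagios/plugins/check_dummy 0 "assumed up (no ping)"
}
EOF

# The affected host definitions would then use:
#   check_command    check-host-alive-fake
# followed by an Icinga reload to pick up the change:
service icinga reload
```
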
[00:27:20] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2993161 (10Dzahn) carbon-cache alerts on graphite2001 - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=carbon-cache saw puppet is disabled there with link to...
[00:35:45] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[00:36:05] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (10Dzahn) icinga said "CRITICAL - degraded: The system is operational but one or more units failed." on `conf2002.codfw.wmnet` looking at the check_command...
[00:40:14] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#3072486 (10Dzahn) This is running: etcdmirror--eqiad-wmnet.service loaded active running Etcd mirrormaker But t...
[00:50:55] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[00:52:25] !log conf2002 - ran "systemctl reset-failed" to fix Icinga alert about broken systemd state due to formerly existing but failed service etcdmirror-eqiad-wmnet. turns out you need this to remove missing units. found on http://serverfault.com/questions/606520/how-to-remove-missing-systemd-units (T131959)
[00:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:30] T131959: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959
[00:55:01] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#3072515 (10Dzahn) The fix was `systemctl reset-failed` to get rid of the removed and now missing unit. ``` < icinga-wm> RECOVERY - Check systemd state on conf2002...
[01:08:48] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3072558 (10Dzahn) p:05Triage>03High
[01:45:41] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3072589 (10dr0ptp4kt) I wanted to note I haven't forgotten about this. Got sick and have been doing annual budgeting...
[01:46:05] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:01:15] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 656 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4151125 keys, up 123 days 17 hours - replication_delay is 656
[02:05:15] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:08:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4137180 keys, up 123 days 17 hours - replication_delay is 0
[02:14:05] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
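
For reference, the `systemctl reset-failed` fix logged at 00:52 clears the remembered "failed" state of a unit whose file has since been removed, which is exactly what the "Check systemd state" Icinga check trips on. A minimal sketch of the sequence (unit name taken from the task comments above; the exact invocation on conf2002 may have differed):

```
# List units systemd still considers failed, even if their files are gone
systemctl --state=failed --all

# Drop the failed state for the removed unit so the degraded state clears
systemctl reset-failed etcdmirror-eqiad-wmnet.service

# Confirm the overall system state is back to "running"
systemctl is-system-running
```
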
[02:15:45] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:29:15] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4138768 keys, up 123 days 17 hours - replication_delay is 626
[02:29:16] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4138558 keys, up 123 days 18 hours - replication_delay is 626
[02:31:07] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 12m 10s)
[02:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:15] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:36:15] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4139748 keys, up 123 days 18 hours - replication_delay is 0
[02:36:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Mar 4 02:36:25 UTC 2017 (duration 5m 19s)
[02:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4139732 keys, up 123 days 18 hours - replication_delay is 0
[02:44:45] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:54:15] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 603 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4140421 keys, up 123 days 18 hours - replication_delay is 603
[02:56:11] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3072673 (10Krinkle) >>! In T156924#3072056, @tstarling wrote: > [..] the APC cache entry would h...
[02:56:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4140977 keys, up 123 days 18 hours - replication_delay is 0
[03:00:45] !log planet2001 - reinstalling once more (T159432)
[03:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:51] T159432: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432
[03:05:55] !log planet2001 - and this time it just worked and i can't reproduce the issue. install finished. re-adding to puppet, signing certs...
[03:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:07] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072695 (10Dzahn) repeated the install today, could not reproduce the problem of yesterday. this time it just worked. re-signed puppet cert and salt keys. reinstalled. no backports are activated. sources.list l...
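
The "re-adding to puppet, signing certs" step logged at 03:05 is not spelled out in the log; on a Puppet 3-era puppetmaster plus a salt master it generally looks something like the sketch below (the FQDN and where each command runs are assumptions for illustration, not a transcript of what was actually typed):

```
# On the puppetmaster: drop the old certificate for the reinstalled host
puppet cert clean planet2001.codfw.wmnet

# On the reinstalled host: first agent run requests a new certificate
puppet agent --test

# Back on the puppetmaster: sign the new request
puppet cert list
puppet cert sign planet2001.codfw.wmnet

# On the salt master: replace the stale key
salt-key -d planet2001.codfw.wmnet -y
salt-key -a planet2001.codfw.wmnet -y
```
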
[03:16:40] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072696 (10Dzahn) 05Open>03Resolved
[03:20:20] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072697 (10Dzahn) well, there is `/etc/apt/sources.list.d/debian-backports.list` with ``` deb http://mirrors.wikimedia.org/debian/ jessie-backports main contrib non-free deb-src http://mirrors.wikimedia.org/de...
[03:22:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.93 seconds
[03:28:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 241.86 seconds
[03:28:46] !log pausing refreshLinks.php run due to increase in job queue
[03:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:36] (03CR) 10BryanDavis: [C: 031] toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999)
[04:14:41] PROBLEM - MariaDB disk space on labsdb1005 is CRITICAL: DISK CRITICAL - free space: / 2023 MB (5% inode=97%)
[04:18:19] PROBLEM - Disk space on labsdb1005 is CRITICAL: DISK CRITICAL - free space: / 1279 MB (3% inode=97%)
[04:48:19] RECOVERY - Disk space on labsdb1005 is OK: DISK OK
[04:48:41] RECOVERY - MariaDB disk space on labsdb1005 is OK: DISK OK
[05:03:19] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[05:04:19] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4147816 keys, up 123 days 20 hours - replication_delay is 0
[05:16:19] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[05:17:19] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4148246 keys, up 123 days 20 hours - replication_delay is 0
[05:28:29] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:34:29] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.01 seconds
[05:36:29] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 13.91 seconds
[05:56:29] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:35:59] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:47:59] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:02:59] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
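
The "MariaDB Slave Lag" alerts above (dbstore1002 s1 at 03:22, db1047 s2 at 05:34) come from a replication-lag check; a manual spot check on the replica is just SHOW SLAVE STATUS. A sketch against the alerting host, not the actual Icinga plugin (hostname domain and the use of multi-source `SHOW ALL SLAVES STATUS` here are assumptions):

```
# Manual spot check of replication lag on the alerting replica
mysql -h dbstore1002.eqiad.wmnet \
  -e "SHOW ALL SLAVES STATUS\G" \
  | grep -E "Connection_name|Slave_(IO|SQL)_Running|Seconds_Behind_Master"
```
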
[07:11:59] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:14:59] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:39:59] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:52:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:08:39] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[08:09:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[08:12:59] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:17:09] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3072811 (10Marostegui) It finished its rebuilt - so we can go ahead and replace #7: ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Device(Encl-32 Slot-4) is not in rebuild process Exit...
[08:20:19] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:34:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:34:39] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[08:39:59] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:40:59] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:07:59] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:19:29] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[09:19:39] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
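
The megacli output quoted in the db1060 task at 08:17 checks rebuild progress for one slot before swapping the next disk. For context, the quoted command plus a related read-only query on the same controller (the second command is taken verbatim from the task comment; the first is a standard state listing, not something shown in the log):

```
# Overall state of all physical drives on adapter 0
megacli -PDList -aALL | grep -E "Enclosure Device ID|Slot Number|Firmware state"

# Rebuild progress for the drive in enclosure 32, slot 4 (as quoted in T158193)
megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL
```
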
[09:28:19] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:47:29] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:57:19] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[10:07:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[10:07:39] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[11:23:14] (03PS1) 10Addshore: Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241)
[11:24:48] (03PS4) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183)
[11:24:51] (03PS1) 10Addshore: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183)
[11:25:16] (03PS4) 10Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183)
[11:25:23] (03PS4) 10Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183)
[11:25:31] (03PS4) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036
[11:28:06] (03PS3) 10Urbanecm: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy)
[11:29:56] (03CR) 10Urbanecm: [C: 031] "@DatGuy Seems you didn't have commited them. You may have them in your local PC but if you add new file you must run git add or" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy)
[11:32:15] (03PS1) 10Addshore: Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241)
[11:35:42] (03PS2) 10Addshore: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183)
[11:41:39] (03PS1) 10Addshore: Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241)
[11:45:22] (03CR) 10Addshore: [C: 04-2] "To be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[11:45:29] (03CR) 10Addshore: [C: 04-2] "To be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[11:45:50] (03CR) 10Addshore: [C: 04-2] "Requires DB table creation first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore)
[12:22:49] (03PS1) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[12:22:59] (03PS2) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[12:23:44] (03CR) 10Addshore: [C: 04-2] "Waiting for the dep to be deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[13:56:29] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:24:29] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:44:18] (03PS1) 10Marostegui: Add extra space [puppet] - 10https://gerrit.wikimedia.org/r/341131
[14:50:54] (03Abandoned) 10Marostegui: Add extra space [puppet] - 10https://gerrit.wikimedia.org/r/341131 (owner: 10Marostegui)
[14:58:29] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:01:00] (03PS1) 10Marostegui: db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341132 (https://phabricator.wikimedia.org/T159414)
[15:09:29] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:27:29] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[15:37:29] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[15:54:29] PROBLEM - Nginx local proxy to apache on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:29] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:56:19] RECOVERY - Nginx local proxy to apache on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.061 second response time
[15:56:19] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 73425 bytes in 0.195 second response time
[15:59:29] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:59:29] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:00:19] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.724 second response time
[16:00:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73427 bytes in 4.833 second response time
[16:14:59] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[16:15:39] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:35:22] !log Manually generating some more captchas T159581
[16:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:29] T159581: The same CAPTCHA image is always used across platforms and refresh - https://phabricator.wikimedia.org/T159581
[16:36:57] Reedy: > PM :)
[16:38:39] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[16:38:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[16:43:25] !log Manually generating even more captchas (going upto 10k total) in screen as reedy on terbium T159581
[16:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:30] T159581: The same CAPTCHA image is always used across platforms and refresh - https://phabricator.wikimedia.org/T159581
[17:05:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1253 MB (3% inode=52%)
[17:10:19] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:10:20] 06Operations, 10Wikimedia-General-or-Unknown, 07Easy: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073129 (10Reedy)
[17:16:43] (03PS1) 10MarcoAurelio: Create 'flood' flag for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134
[17:27:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1342 MB (3% inode=52%)
[17:29:12] Any SWAT person can give me a simple summary of how SWAT operates/how to submit a Wikimedia-Site-Requests patch?
[17:36:55] DatGuy: https://wikitech.wikimedia.org/wiki/SWAT_deploys
[17:37:14] Do users add the patches at the table, and then deployers deploy them?
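
The labstore1005 alerts at 16:14 are the generic systemd-state check again, this time for the maintain-dbusers unit (it recovered on its own by 16:38). The usual way to see why a unit like that failed would be something along these lines (a sketch; the `.service` suffix and the one-hour window are just illustrative):

```
# Show the unit's last result and most recent log lines
systemctl status maintain-dbusers.service
journalctl -u maintain-dbusers.service --since "1 hour ago" --no-pager

# If it was a one-off crash, a restart clears the degraded system state
systemctl restart maintain-dbusers.service
```
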
[17:37:41] users
[17:38:19] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:42:31] alright, I'll do it per the instructions. Tell me if I mess up please ;)
[17:47:28] uh nevermind
[17:47:29] seems like there are 2 tasks of the same thing and 2 patches
[18:05:09] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3073270 (10Addshore) *poke @aaron again* can this be closed?
[18:23:09] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:29:19] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:35:08] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3073283 (10Halfak)
[18:35:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1109 MB (3% inode=52%)
[18:43:29] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:43:39] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:44:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73371 bytes in 3.483 second response time
[18:44:29] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 73371 bytes in 3.285 second response time
[18:49:06] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3073311 (10Halfak) Hey folks, I figured we should have a task specifically for identifying the option we'd like to pursue. I've created {T159615} so we ca...
[18:50:50] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3073283 (10Halfak)
[18:52:09] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[18:52:39] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:52:59] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[18:58:19] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[18:59:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1275 MB (3% inode=53%)
[19:09:39] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[19:09:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
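
The rdb200x "Redis replication status" checks that fire in this stretch (and again at 19:38 below) alert on a growing replication_delay. A manual look from the replica side is just INFO replication; a sketch against the instance named in the alert (authentication is omitted here and would be required against the production instances):

```
# Ask the codfw replica how its link to the master looks and how far behind it is
redis-cli -h 10.192.48.44 -p 6479 info replication \
  | grep -E "role|master_link_status|master_last_io_seconds_ago|slave_repl_offset"
```
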
[19:12:59] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:13:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1184 MB (3% inode=53%)
[19:15:52] 06Operations, 10MediaWiki-JobQueue: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (10Legoktm)
[19:18:55] 06Operations, 10MediaWiki-JobQueue: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073342 (10Legoktm) wikidatawiki has 2,728,526 htmlCacheUpdate jobs queued.
[19:28:23] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073343 (10Legoktm) p:05Triage>03Unbreak!
[19:34:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1039 MB (3% inode=52%)
[19:38:29] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4220401 keys, up 124 days 11 hours - replication_delay is 605
[19:38:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 608 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4220429 keys, up 124 days 11 hours - replication_delay is 608
[19:39:59] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[19:45:09] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:46:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1243 MB (3% inode=53%)
[19:51:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1123 MB (3% inode=52%)
[20:14:09] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[20:33:39] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:38:29] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4194792 keys, up 124 days 12 hours - replication_delay is 36
[20:48:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4194792 keys, up 124 days 12 hours - replication_delay is 636
[20:59:19] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:02:39] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[22:05:14] 06Operations, 10Wikimedia-General-or-Unknown: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073422 (10Aklapper) @Reedy: #easy tasks are self-contained, non-controversial issues with a clear approach and should be well-described with pointers to help the new...
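
For the job-queue spike tracked in T159618 (the wikidatawiki htmlCacheUpdate backlog noted at 19:18), the per-type counts can be pulled with the stock showJobs.php maintenance script; a sketch using the mwscript wrapper, which may not be exactly how the numbers in the task were produced:

```
# Break the wikidatawiki job queue down by job type
mwscript showJobs.php --wiki=wikidatawiki --group

# Quick total for the same wiki
mwscript showJobs.php --wiki=wikidatawiki
```
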
[22:27:09] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:48:39] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4201262 keys, up 124 days 14 hours - replication_delay is 41
[22:55:09] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[22:55:29] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4201272 keys, up 124 days 14 hours - replication_delay is 31
[23:27:29] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:27:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:28:20] PROBLEM - Nginx local proxy to apache on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:29:19] RECOVERY - Nginx local proxy to apache on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 8.125 second response time
[23:29:19] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.465 second response time
[23:29:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73431 bytes in 6.115 second response time
[23:33:39] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:39] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 73431 bytes in 6.886 second response time
[23:55:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73343 bytes in 7.871 second response time
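
The mw1197/mw1189 "HHVM rendering", "Apache HTTP" and "Nginx local proxy to apache" checks that flap at the end of the log are HTTP probes against the app server itself. A rough manual equivalent for spot-checking one host (the URL, port and Host header here are assumptions, not the exact Icinga check definition):

```
# Time a page render straight through the app server, bypassing the caches
curl -s -o /dev/null \
  -w 'status=%{http_code} time=%{time_total}s bytes=%{size_download}\n' \
  -H 'Host: en.wikipedia.org' \
  'http://mw1197.eqiad.wmnet/wiki/Main_Page'
```
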