[00:06:45] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:10:45] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[00:16:34] (03PS1) 10Dzahn: Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103
[00:17:28] (03CR) 10Dzahn: "doesn't work as expected but found another work-around that doesn't need a gerrit change and just involves web ui. http://www.htmlgraphic." [puppet] - 10https://gerrit.wikimedia.org/r/341103 (owner: 10Dzahn)
[00:18:11] (03PS2) 10Dzahn: Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103
[00:21:30] (03CR) 10Dzahn: [C: 032] Revert "icinga: set IP for benefactorevents/eventdonations to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/341103 (owner: 10Dzahn)
[00:22:43] (03CR) 10Dzahn: "i did this instead to make the 2 special hosts appear as UP: http://www.htmlgraphic.com/nagios-check-host-without-ping/" [puppet] - 10https://gerrit.wikimedia.org/r/341037 (owner: 10Dzahn)
[00:24:53] ACKNOWLEDGEMENT - Check systemd state on graphite2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@a service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@b service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@c service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@d service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@e service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:53] ACKNOWLEDGEMENT - carbon-cache@f service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:54] ACKNOWLEDGEMENT - carbon-cache@g service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:54] ACKNOWLEDGEMENT - carbon-cache@h service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:55] ACKNOWLEDGEMENT - carbon-frontend-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
[00:24:55] ACKNOWLEDGEMENT - carbon-local-relay service on graphite2001 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is failed daniel_zahn https://phabricator.wikimedia.org/T157022#3045883
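
The workaround Dzahn references at 00:22 (making the two special hosts show as UP without pinging them) is not shown in the log itself; the linked page describes pointing the host check at the stock check_dummy plugin so the host check always returns OK. A minimal sketch of that technique, assuming a plain Icinga/Nagios object file layout (file paths and the command name here are illustrative, not the actual WMF puppetized configuration):

```
# Sketch only: a host check that always returns OK, so hosts that do not
# answer ICMP are still shown as UP. check_dummy ships with the standard
# monitoring plugins; 0 = OK.
cat >> /etc/icinga/objects/commands.cfg <<'EOF'
define command{
    command_name    check-host-alive-fake
    command_line    /usr/lib/nagios/plugins/check_dummy 0 "assumed up (no ping)"
}
EOF

# The affected host definitions would then use:
#   check_command    check-host-alive-fake
# followed by an Icinga reload to pick up the change:
service icinga reload
```
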
[00:27:20] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2993161 (10Dzahn) carbon-cache alerts on graphite2001 - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=carbon-cache saw puppet is disabled there with link to...
[00:35:45] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[00:36:05] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (10Dzahn) icinga said "CRITICAL - degraded: The system is operational but one or more units failed." on `conf2002.codfw.wmnet` looking at the check_command...
[00:40:14] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#3072486 (10Dzahn) This is running: etcdmirror--eqiad-wmnet.service loaded active running Etcd mirrormaker But t...
[00:50:55] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[00:52:25] !log conf2002 - ran "systemctl reset-failed" to fix Icinga alert about broken systemd state due to formerly existing but failed service etcdmirror-eqiad-wmnet. turns out you need this to remove missing units. found on http://serverfault.com/questions/606520/how-to-remove-missing-systemd-units (T131959)
[00:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:30] T131959: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959
[00:55:01] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#3072515 (10Dzahn) The fix was `systemctl reset-failed` to get rid of the removed and now missing unit. ``` < icinga-wm> RECOVERY - Check systemd state on conf2002...
[01:08:48] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3072558 (10Dzahn) p:05Triage>03High
[01:45:41] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3072589 (10dr0ptp4kt) I wanted to note I haven't forgotten about this. Got sick and have been doing annual budgeting...
[01:46:05] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:01:15] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 656 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4151125 keys, up 123 days 17 hours - replication_delay is 656
[02:05:15] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:08:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4137180 keys, up 123 days 17 hours - replication_delay is 0
[02:14:05] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
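
For reference, the `systemctl reset-failed` fix logged at 00:52 clears the remembered "failed" state of a unit whose file has since been removed, which is exactly what the "Check systemd state" Icinga check trips on. A minimal sketch of the sequence (unit name taken from the task comments above; the exact invocation on conf2002 may have differed):

```
# List units systemd still considers failed, even if their files are gone
systemctl --state=failed --all

# Drop the failed state for the removed unit so the degraded state clears
systemctl reset-failed etcdmirror-eqiad-wmnet.service

# Confirm the overall system state is back to "running"
systemctl is-system-running
```
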
[02:15:45] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:29:15] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4138768 keys, up 123 days 17 hours - replication_delay is 626
[02:29:16] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4138558 keys, up 123 days 18 hours - replication_delay is 626
[02:31:07] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 12m 10s)
[02:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:15] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[02:36:15] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4139748 keys, up 123 days 18 hours - replication_delay is 0
[02:36:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Mar 4 02:36:25 UTC 2017 (duration 5m 19s)
[02:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4139732 keys, up 123 days 18 hours - replication_delay is 0
[02:44:45] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:54:15] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 603 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4140421 keys, up 123 days 18 hours - replication_delay is 603
[02:56:11] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3072673 (10Krinkle) >>! In T156924#3072056, @tstarling wrote: > [..] the APC cache entry would h...
[02:56:15] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4140977 keys, up 123 days 18 hours - replication_delay is 0
[03:00:45] !log planet2001 - reinstalling once more (T159432)
[03:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:51] T159432: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432
[03:05:55] !log planet2001 - and this time it just worked and i can't reproduce the issue. install finished. re-adding to puppet, signing certs...
[03:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:07] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072695 (10Dzahn) repeated the install today, could not reproduce the problem of yesterday. this time it just worked. re-signed puppet cert and salt keys. reinstalled. no backports are activated. sources.list l...
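
The "re-adding to puppet, signing certs" step logged at 03:05 is not spelled out in the log; on a Puppet 3-era puppetmaster plus a salt master it generally looks something like the sketch below (the FQDN and where each command runs are assumptions for illustration, not a transcript of what was actually typed):

```
# On the puppetmaster: drop the old certificate for the reinstalled host
puppet cert clean planet2001.codfw.wmnet

# On the reinstalled host: first agent run requests a new certificate
puppet agent --test

# Back on the puppetmaster: sign the new request
puppet cert list
puppet cert sign planet2001.codfw.wmnet

# On the salt master: replace the stale key
salt-key -d planet2001.codfw.wmnet -y
salt-key -a planet2001.codfw.wmnet -y
```
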
[03:16:40] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072696 (10Dzahn) 05Open>03Resolved
[03:20:20] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3072697 (10Dzahn) well, there is `/etc/apt/sources.list.d/debian-backports.list` with ``` deb http://mirrors.wikimedia.org/debian/ jessie-backports main contrib non-free deb-src http://mirrors.wikimedia.org/de...
[03:22:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.93 seconds
[03:28:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 241.86 seconds
[03:28:46] !log pausing refreshLinks.php run due to increase in job queue
[03:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:36] (03CR) 10BryanDavis: [C: 031] toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999)
[04:14:41] PROBLEM - MariaDB disk space on labsdb1005 is CRITICAL: DISK CRITICAL - free space: / 2023 MB (5% inode=97%)
[04:18:19] PROBLEM - Disk space on labsdb1005 is CRITICAL: DISK CRITICAL - free space: / 1279 MB (3% inode=97%)
[04:48:19] RECOVERY - Disk space on labsdb1005 is OK: DISK OK
[04:48:41] RECOVERY - MariaDB disk space on labsdb1005 is OK: DISK OK
[05:03:19] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[05:04:19] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4147816 keys, up 123 days 20 hours - replication_delay is 0
[05:16:19] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[05:17:19] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4148246 keys, up 123 days 20 hours - replication_delay is 0
[05:28:29] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:34:29] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.01 seconds
[05:36:29] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 13.91 seconds
[05:56:29] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:35:59] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:47:59] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:02:59] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
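
The "MariaDB Slave Lag" alerts above (dbstore1002 s1 at 03:22, db1047 s2 at 05:34) come from a replication-lag check; a manual spot check on the replica is just SHOW SLAVE STATUS. A sketch against the alerting host, not the actual Icinga plugin (hostname domain and the use of multi-source `SHOW ALL SLAVES STATUS` here are assumptions):

```
# Manual spot check of replication lag on the alerting replica
mysql -h dbstore1002.eqiad.wmnet \
  -e "SHOW ALL SLAVES STATUS\G" \
  | grep -E "Connection_name|Slave_(IO|SQL)_Running|Seconds_Behind_Master"
```
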
[07:11:59] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:14:59] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:39:59] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:52:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:08:39] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[08:09:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[08:12:59] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:17:09] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3072811 (10Marostegui) It finished its rebuilt - so we can go ahead and replace #7: ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Device(Encl-32 Slot-4) is not in rebuild process Exit...
[08:20:19] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:34:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:34:39] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[08:39:59] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:40:59] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:07:59] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:19:29] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[09:19:39] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
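
The megacli output quoted in the db1060 task at 08:17 checks rebuild progress for one slot before swapping the next disk. For context, the quoted command plus a related read-only query on the same controller (the second command is taken verbatim from the task comment; the first is a standard state listing, not something shown in the log):

```
# Overall state of all physical drives on adapter 0
megacli -PDList -aALL | grep -E "Enclosure Device ID|Slot Number|Firmware state"

# Rebuild progress for the drive in enclosure 32, slot 4 (as quoted in T158193)
megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL
```
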
[09:28:19] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:47:29] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:57:19] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[10:07:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[10:07:39] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[11:23:14] (03PS1) 10Addshore: Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241)
[11:24:48] (03PS4) 10Addshore: wmgUseInterwikiSorting true for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341032 (https://phabricator.wikimedia.org/T150183)
[11:24:51] (03PS1) 10Addshore: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183)
[11:25:16] (03PS4) 10Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183)
[11:25:23] (03PS4) 10Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183)
[11:25:31] (03PS4) 10Addshore: Use wmgUseInterwikiSorting for labs from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341036
[11:28:06] (03PS3) 10Urbanecm: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy)
[11:29:56] (03CR) 10Urbanecm: [C: 031] "@DatGuy Seems you didn't have commited them. You may have them in your local PC but if you add new file you must run git add or" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy)
[11:32:15] (03PS1) 10Addshore: Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241)
[11:35:42] (03PS2) 10Addshore: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183)
[11:41:39] (03PS1) 10Addshore: Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241)
[11:45:22] (03CR) 10Addshore: [C: 04-2] "To be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[11:45:29] (03CR) 10Addshore: [C: 04-2] "To be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[11:45:50] (03CR) 10Addshore: [C: 04-2] "Requires DB table creation first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore)
[12:22:49] (03PS1) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[12:22:59] (03PS2) 10Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[12:23:44] (03CR) 10Addshore: [C: 04-2] "Waiting for the dep to be deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore)
[13:56:29] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:24:29] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:44:18] (03PS1) 10Marostegui: Add extra space [puppet] - 10https://gerrit.wikimedia.org/r/341131
[14:50:54] (03Abandoned) 10Marostegui: Add extra space [puppet] - 10https://gerrit.wikimedia.org/r/341131 (owner: 10Marostegui)
[14:58:29] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:01:00] (03PS1) 10Marostegui: db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341132 (https://phabricator.wikimedia.org/T159414)
[15:09:29] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:27:29] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[15:37:29] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[15:54:29] PROBLEM - Nginx local proxy to apache on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:29] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:56:19] RECOVERY - Nginx local proxy to apache on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.061 second response time
[15:56:19] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 73425 bytes in 0.195 second response time
[15:59:29] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:59:29] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:00:19] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.724 second response time
[16:00:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73427 bytes in 4.833 second response time
[16:14:59] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[16:15:39] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:35:22] !log Manually generating some more captchas T159581
[16:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:29] T159581: The same CAPTCHA image is always used across platforms and refresh - https://phabricator.wikimedia.org/T159581
[16:36:57] Reedy: > PM :)
[16:38:39] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[16:38:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[16:43:25] !log Manually generating even more captchas (going upto 10k total) in screen as reedy on terbium T159581
[16:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:30] T159581: The same CAPTCHA image is always used across platforms and refresh - https://phabricator.wikimedia.org/T159581
[17:05:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1253 MB (3% inode=52%)
[17:10:19] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:10:20] 06Operations, 10Wikimedia-General-or-Unknown, 07Easy: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073129 (10Reedy)
[17:16:43] (03PS1) 10MarcoAurelio: Create 'flood' flag for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134
[17:27:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1342 MB (3% inode=52%)
[17:29:12] Any SWAT person can give me a simple summary of how SWAT operates/how to submit a Wikimedia-Site-Requests patch?
[17:36:55] DatGuy: https://wikitech.wikimedia.org/wiki/SWAT_deploys
[17:37:14] Do users add the patches at the table, and then deployers deploy them?
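
The labstore1005 alerts at 16:14 are the generic systemd-state check again, this time for the maintain-dbusers unit (it recovered on its own by 16:38). The usual way to see why a unit like that failed would be something along these lines (a sketch; the `.service` suffix and the one-hour window are just illustrative):

```
# Show the unit's last result and most recent log lines
systemctl status maintain-dbusers.service
journalctl -u maintain-dbusers.service --since "1 hour ago" --no-pager

# If it was a one-off crash, a restart clears the degraded system state
systemctl restart maintain-dbusers.service
```
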
[17:37:41] users
[17:38:19] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:42:31] alright, I'll do it per the instructions. Tell me if I mess up please ;)
[17:47:28] uh nevermind
[17:47:29] seems like there are 2 tasks of the same thing and 2 patches
[18:05:09] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3073270 (10Addshore) *poke @aaron again* can this be closed?
[18:23:09] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:29:19] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:35:08] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3073283 (10Halfak)
[18:35:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1109 MB (3% inode=52%)
[18:43:29] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:43:39] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:44:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73371 bytes in 3.483 second response time
[18:44:29] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 73371 bytes in 3.285 second response time
[18:49:06] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3073311 (10Halfak) Hey folks, I figured we should have a task specifically for identifying the option we'd like to pursue. I've created {T159615} so we ca...
[18:50:50] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3073283 (10Halfak)
[18:52:09] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[18:52:39] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:52:59] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[18:58:19] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[18:59:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1275 MB (3% inode=53%)
[19:09:39] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[19:09:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
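
The rdb200x "Redis replication status" checks that fire in this stretch (and again at 19:38 below) alert on a growing replication_delay. A manual look from the replica side is just INFO replication; a sketch against the instance named in the alert (authentication is omitted here and would be required against the production instances):

```
# Ask the codfw replica how its link to the master looks and how far behind it is
redis-cli -h 10.192.48.44 -p 6479 info replication \
  | grep -E "role|master_link_status|master_last_io_seconds_ago|slave_repl_offset"
```
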
[19:12:59] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:13:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1184 MB (3% inode=53%)
[19:15:52] 06Operations, 10MediaWiki-JobQueue: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (10Legoktm)
[19:18:55] 06Operations, 10MediaWiki-JobQueue: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073342 (10Legoktm) wikidatawiki has 2,728,526 htmlCacheUpdate jobs queued.
[19:28:23] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073343 (10Legoktm) p:05Triage>03Unbreak!
[19:34:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1039 MB (3% inode=52%)
[19:38:29] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4220401 keys, up 124 days 11 hours - replication_delay is 605
[19:38:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 608 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4220429 keys, up 124 days 11 hours - replication_delay is 608
[19:39:59] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[19:45:09] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:46:59] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 1243 MB (3% inode=53%)
[19:51:19] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1123 MB (3% inode=52%)
[20:14:09] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[20:33:39] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:38:29] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4194792 keys, up 124 days 12 hours - replication_delay is 36
[20:48:29] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4194792 keys, up 124 days 12 hours - replication_delay is 636
[20:59:19] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:02:39] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[22:05:14] 06Operations, 10Wikimedia-General-or-Unknown: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3073422 (10Aklapper) @Reedy: #easy tasks are self-contained, non-controversial issues with a clear approach and should be well-described with pointers to help the new...
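
For the job-queue spike tracked in T159618 (the wikidatawiki htmlCacheUpdate backlog noted at 19:18), the per-type counts can be pulled with the stock showJobs.php maintenance script; a sketch using the mwscript wrapper, which may not be exactly how the numbers in the task were produced:

```
# Break the wikidatawiki job queue down by job type
mwscript showJobs.php --wiki=wikidatawiki --group

# Quick total for the same wiki
mwscript showJobs.php --wiki=wikidatawiki
```
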
[22:27:09] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:48:39] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4201262 keys, up 124 days 14 hours - replication_delay is 41
[22:55:09] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[22:55:29] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4201272 keys, up 124 days 14 hours - replication_delay is 31
[23:27:29] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:27:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:28:20] PROBLEM - Nginx local proxy to apache on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:29:19] RECOVERY - Nginx local proxy to apache on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 8.125 second response time
[23:29:19] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.465 second response time
[23:29:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73431 bytes in 6.115 second response time
[23:33:39] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:39] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 73431 bytes in 6.886 second response time
[23:55:39] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:29] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 73343 bytes in 7.871 second response time
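
The mw1197/mw1189 "HHVM rendering", "Apache HTTP" and "Nginx local proxy to apache" checks that flap at the end of the log are HTTP probes against the app server itself. A rough manual equivalent for spot-checking one host (the URL, port and Host header here are assumptions, not the exact Icinga check definition):

```
# Time a page render straight through the app server, bypassing the caches
curl -s -o /dev/null \
  -w 'status=%{http_code} time=%{time_total}s bytes=%{size_download}\n' \
  -H 'Host: en.wikipedia.org' \
  'http://mw1197.eqiad.wmnet/wiki/Main_Page'
```
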