[00:01:58] RECOVERY - High load average on labstore1005 is OK: OK: Less than 50.00% above the threshold [16.0] [00:05:58] PROBLEM - High load average on labstore1005 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [00:21:56] (03PS1) 10Niharika29: Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) [00:51:41] PROBLEM - MariaDB disk space on db1028 is CRITICAL: DISK CRITICAL - free space: /srv 100313 MB (5% inode=99%) [01:02:08] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:02:18] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [01:02:18] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498784532 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8774692 keys, up 2 minutes 10 seconds - replication_delay is 1498784532 [01:02:18] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1498784532 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8865794 keys, up 2 minutes 11 seconds - replication_delay is 1498784532 [01:02:29] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1498784546 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8870887 keys, up 2 minutes 24 seconds - replication_delay is 1498784546 [01:02:38] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481 [01:02:38] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480 [01:02:48] PROBLEM - Check health of redis 
instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [01:03:08] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4163040 keys, up 3 minutes 2 seconds - replication_delay is 0 [01:03:18] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4162018 keys, up 3 minutes 8 seconds - replication_delay is 0 [01:03:18] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8772460 keys, up 3 minutes 10 seconds - replication_delay is 0 [01:03:18] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8864134 keys, up 3 minutes 11 seconds - replication_delay is 0 [01:03:28] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4161427 keys, up 3 minutes 19 seconds - replication_delay is 0 [01:03:28] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8868547 keys, up 3 minutes 24 seconds - replication_delay is 0 [01:03:38] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4163850 keys, up 3 minutes 32 seconds - replication_delay is 0 [01:03:48] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8864444 keys, up 3 minutes 37 seconds - replication_delay is 0 [01:23:17] !log fail nfs from labstore1005 to labstore1004 (I failed to log a previous failover to 1004 and back) [01:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:21] !log reboot labstore1005 [01:25:28] PROBLEM - 
puppet last run on labstore1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:26:08] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:27:08] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [01:27:38] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [01:28:18] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 3.094 second response time [01:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:26] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3393614 (10faidon) Likely because of a mismatch of our netboot image and Debian's kernel image. I've updated our netboot image, can you try again? [01:32:46] RoanKattouw: ah shoot, sorry for missing SWAT earlier; i had to step away for a while. [01:40:33] dbrant: That's OK, I just didn't deploy your patch, so you'll have to relist it for a future SWAT window some time [01:42:35] RoanKattouw: sure; just so I understand, will it still be the "1.30.0-wmf.7" branch on the next swat? [01:42:38] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:42:46] Yes [01:42:53] kthx! 
[01:43:13] Normally, wmf.8 would start rolling out on Tuesday, so Monday would still be all wmf.7 [01:43:27] But because next Tuesday is the 4th of July, wmf.8 has been postponed a week [01:43:50] Also, the arrow thingies in places like https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1900 are instructive [01:44:13] (Not sure if it's wrong that that says 7->9 or if we're deliberately skipping 8 for some reason) [01:45:02] nice, i see [01:45:38] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:46:28] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [01:49:08] RECOVERY - High load average on labstore1005 is OK: OK: Less than 50.00% above the threshold [16.0] [01:49:18] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is inactive [01:49:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive [01:49:18] PROBLEM - drbd service on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit drbd is inactive [02:01:12] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393637 (10Andrew) [02:01:36] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10Andrew) I tagged dc-ops because... have y'all ever seen something like this? [02:07:22] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10bd808) http://www.dell.com/support/manuals/us/en/04/dell-opnmang-sw-v8.1/EEMI_13G_v1.2-v1/UEFI-Event-Messages?guid=GUID-823669E3-2D7B-41B5-85F1-AF7A6BC11ACC&lang=en-u... 
[02:12:18] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [02:13:38] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:14:25] !log reboot labstore1005 (5m ago) [02:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:40] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10madhuvishy) We did another reboot to downgrade the kernel back to 4.3 and the error happened again. [02:16:54] 10Operations, 10Labs, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393649 (10Andrew) [02:18:08] PROBLEM - DRBD node status on labstore1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:19] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [02:20:08] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [02:21:25] (03PS2) 10Rush: labstore: secondary cluster set 1004 as primary [puppet] - 10https://gerrit.wikimedia.org/r/362214 [02:23:44] (03CR) 10Madhuvishy: [C: 031] labstore: secondary cluster set 1004 as primary [puppet] - 10https://gerrit.wikimedia.org/r/362214 (owner: 10Rush) [02:24:42] (03CR) 10Rush: [C: 032] labstore: secondary cluster set 1004 as primary [puppet] - 10https://gerrit.wikimedia.org/r/362214 (owner: 10Rush) [02:25:38] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [02:26:28] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [02:26:28] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [02:28:17] 10Operations, 10Labs, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - 
https://phabricator.wikimedia.org/T169287#3393675 (10madhuvishy) [02:29:28] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [02:29:59] !log labstore1005 start drbd [02:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:28] RECOVERY - drbd service on labstore1005 is OK: OK - drbd is active [02:32:58] RECOVERY - DRBD node status on labstore1005 is OK: DRBD Status OK [02:33:55] 10Operations, 10Labs, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393690 (10bd808) [02:36:04] 10Operations, 10DC-Ops, 10Labs: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393692 (10bd808) [02:37:26] 10Operations, 10cloud-services-team: New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS - https://phabricator.wikimedia.org/T169290#3393693 (10Andrew) [02:37:41] 10Operations, 10cloud-services-team: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393706 (10Andrew) [02:38:26] 10Operations, 10cloud-services-team: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393693 (10Andrew) [02:45:41] PROBLEM - MariaDB disk space on db1028 is CRITICAL: DISK CRITICAL - free space: /srv 100080 MB (5% inode=99%) [02:46:04] Please tell me someone else is here to look at that [03:02:32] 10Operations, 10DBA: File space alert for db1028 - https://phabricator.wikimedia.org/T169294#3393763 (10Andrew) [03:32:18] (03PS1) 10KartikMistry: WIP: cg3: New upstream version [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/362334 (https://phabricator.wikimedia.org/T168857) [04:13:58] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2084.00 Read Requests/Sec=2229.30 Write Requests/Sec=2.00 KBytes 
Read/Sec=52432.40 KBytes_Written/Sec=20.40 [04:20:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=46.00 Read Requests/Sec=0.40 Write Requests/Sec=1.00 KBytes Read/Sec=1.60 KBytes_Written/Sec=49.60 [04:42:38] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with 
malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:41] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:41] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:41] PROBLEM - mobileapps endpoints health on scb1002 is 
CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:42:41] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get) [04:47:36] 10Operations, 10DBA: File space alert for db1028 - https://phabricator.wikimedia.org/T169294#3393798 (10Marostegui) p:05Triage>03Normal a:03Marostegui Hi Andrew, Thanks for the ticket. You are indeed correct, this is MySQL usage. The reason for this sudden growth of disk space is the ALTER table going... [05:13:28] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:13:38] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:13:58] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:14:28] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:14:28] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [05:14:48] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [05:37:58] !log Deploy alter table on db1069 (and let it replicate) on s2 - T168661 [05:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:10] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:43:32] !log Deploy alter table on dbstore1001 on s2 - T168661 [05:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:43] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:44:46] !log Deploy alter table on dbstore1002 on s2 - T168661 [05:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:15] !log Deploy alter table on db1021 on s2 - T168661 [05:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362337 (https://phabricator.wikimedia.org/T168661) [05:52:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362337 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:53:59] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362337 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:55:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362337 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [05:55:57] !log 
marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 - T168661 (duration: 00m 42s) [05:56:05] !log Deploy alter table on db1036 on s2 - T168661 [05:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:08] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:27] !log Deploy alter table on db1047 on s2 - T168661 [05:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362338 [05:57:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362338 (owner: 10Marostegui) [05:58:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362338 (owner: 10Marostegui) [05:58:44] (03PS4) 10Jcrespo: mariadb: handle service for systemd -autostart and overrides [puppet] - 10https://gerrit.wikimedia.org/r/362156 (https://phabricator.wikimedia.org/T168356) [05:58:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362338 (owner: 10Marostegui) [05:59:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1036 - T168661 (duration: 00m 43s) [05:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362339 (https://phabricator.wikimedia.org/T168661) [06:01:32] (03CR) 10Jcrespo: [C: 032] mariadb: handle service for systemd -autostart and overrides [puppet] - 10https://gerrit.wikimedia.org/r/362156 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [06:03:18] (03CR) 10Marostegui: [C: 
032] db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362339 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:04:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362339 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:05:18] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [06:05:18] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [06:05:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362339 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [06:05:47] I guess that is the Telia maintenance [06:05:52] (03PS4) 10Jcrespo: mariadb: Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362204 (https://phabricator.wikimedia.org/T168356) [06:06:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T168661 (duration: 00m 42s) [06:07:00] !log Deploy alter table on db1060 on s2 - T168661 [06:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:04] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:07:08] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362340 [06:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:18] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 
0, unused: 0 [06:08:18] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:08:59] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362340 (owner: 10Marostegui) [06:10:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362340 (owner: 10Marostegui) [06:10:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362340 (owner: 10Marostegui) [06:10:25] !log Deploy alter table on db1074 on s2 - T168661 [06:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T168661 (duration: 00m 42s) [06:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:09] !log Deploy alter table on db1076 on s2 - T168661 [06:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:14] !log Deploy alter table on db1090 on s2 - T168661 [06:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:23] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:12:46] !log Deploy alter table on db1018 on s2 - T168661 [06:12:52] and finished with the spam [06:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:12] !log Deploy alter table on s6 all eqiad hosts (primary master not included) - T168661 [06:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:23] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:44:36] <_joe_> !lgo started manually burrow on krypton, could not start due to a stale pidfile [06:45:00] <_joe_> !log 
started manually burrow on krypton, could not start due to a stale pidfile [06:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:18] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational [07:11:39] 10Operations, 10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Services: Mobileapps swagger spec is broken (no pronounciation for Altrincham) - https://phabricator.wikimedia.org/T169299#3393909 (10Joe) [07:12:01] ACKNOWLEDGEMENT - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham 
page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:03] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:03] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section 
of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:04] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:04] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:12:05] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead responds with malformed body (AttributeError: NoneType object has no attribute get): Giuseppe Lavagetto T169299 [07:19:18] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time [07:20:10] !log restart pdfrender on scb1002 [07:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:05] <_joe_> [07:35:43] RECOVERY - MariaDB disk space on db1028 is OK: DISK OK [07:37:04] that is an ongoing issue that happened overnight- it is handled [07:38:28] I did not see the page for the problem until now, it came when I was asleep (sorry) [07:41:57] 10Operations, 10DBA: File space alert for db1028 - https://phabricator.wikimedia.org/T169294#3393942 
(10Marostegui) And the alter for the big table finished and space recovered: ``` root@db1028:~# df -hT /srv/ Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 1.7T 1.4T... [07:45:48] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#3393945 (10fgiunchedi) Indeed seems unlikely that's the root cause, the spam stopped at `Date: Fri, 30 Jun 2017 00:01:31 +0000` btw so I suspect one/multiple bad records y... [07:49:57] 10Operations, 10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Services: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3393952 (10Joe) p:05Triage>03High [08:02:52] (03PS11) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [08:02:54] (03CR) 10Ema: [V: 032 C: 032] Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [08:09:28] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [08:09:38] PROBLEM - salt-minion processes on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:09:48] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [08:10:12] (03PS1) 10Alexandros Kosiaris: Introduce actinium and alcyone [dns] - 10https://gerrit.wikimedia.org/r/362343 (https://phabricator.wikimedia.org/T122134) [08:11:28] RECOVERY - salt-minion processes on thumbor1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:11:58] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/8f3e5aae75d5076ab5e27e8e72b36747983b307ad9a9454e621054d089e8cb50/shm is not accessible: Permission denied [08:13:20] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce actinium and alcyone [dns] - 10https://gerrit.wikimedia.org/r/362343 (https://phabricator.wikimedia.org/T122134) (owner: 10Alexandros Kosiaris) [08:14:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [08:15:23] 10Operations, 10DBA: File space alert for db1028 - https://phabricator.wikimedia.org/T169294#3394003 (10Marostegui) 05Open>03Resolved [08:16:45] !log poweroff labcontrol1003. 
It was in the debian installer [08:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:58] RECOVERY - Disk space on copper is OK: DISK OK [08:19:24] (03CR) 10Elukey: "note for posterity: I assumed (incorrectly) that pinkunicorn handled live traffic, this is why I put a note about analytics data consumpti" [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [08:20:48] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [08:29:36] !log ayounsi@tin Started deploy [librenms/librenms@b10cc7c]: (no justification provided) [08:29:38] !log ayounsi@tin Finished deploy [librenms/librenms@b10cc7c]: (no justification provided) (duration: 00m 02s) [08:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/2d4d407dac61b8487a25ef3d5af9c0281f7c4e6d379a1362ffe5a7c60195ed40/shm is not accessible: Permission denied [08:34:32] !log ayounsi@tin Started deploy [librenms/librenms@3f407a7]: (no justification provided) [08:34:38] !log ayounsi@tin Finished deploy [librenms/librenms@3f407a7]: (no justification provided) (duration: 00m 05s) [08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:58] RECOVERY - Disk space on copper is OK: DISK OK [08:38:14] (03PS8) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [08:42:07] (03PS9) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [08:44:10] 10Operations, 
10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Services: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3393909 (10ArielGlenn) https://en.wikipedia.org/w/index.php?title=Module%3AI... [08:44:22] 10Operations, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3394038 (10akosiaris) 05Open>03Resolved No issue reported in a week, resolving [08:44:48] (03CR) 10Filippo Giunchedi: [C: 032] "NOOP in PCC (storage policies not enabled) https://puppet-compiler.wmflabs.org/6895/" [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [08:45:30] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3394046 (10akosiaris) [08:45:32] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3394044 (10akosiaris) 05Open>03Resolved So tendril uses dbmonitor1001 for the last week, I am guessing we are ok, resolving. Feel fre... [08:47:29] 10Operations: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#2747982 (10akosiaris) We finally have ganeti multirow in both DCs, should we just create 2 VMs in codfw (one per row) and repurpose subra/suhail ? [08:48:17] 10Operations: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#3394052 (10akosiaris) Sigh. I just saw T163892 so the answer is yes ;-) [08:48:59] 10Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2464918 (10akosiaris) Date for decomissioning is July 7th. 
[08:51:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362347 (https://phabricator.wikimedia.org/T168661) [08:51:24] 10Operations, 10Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3394058 (10akosiaris) 05Open>03Resolved a:03akosiaris I am guessing this works fine ? Since then we 've had 0 troubles if I am not mistaken. I am resolving, feel free to reopen [08:52:04] 10Operations, 10Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (10akosiaris) Dependent task T130593 has had no update since Nov 2016, so this is probably solved. I am gonna resolve this, feel free to reopen [08:52:09] 10Operations, 10Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140296 (10akosiaris) [08:52:12] 10Operations, 10Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#3394066 (10akosiaris) 05Open>03Resolved a:03akosiaris [08:53:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362347 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:53:51] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10jcrespo) We should not enable active-active on dbtree (or enable it failing, as it is the current case). Dbtree database backend is db1011, which is only on eqi... 
[08:54:48] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:55:08] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [08:56:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362347 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:56:10] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362347 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:56:38] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3394080 (10jcrespo) [08:56:41] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3394078 (10jcrespo) 05Resolved>03Open Almost finished, we need to delete the garbage left on einst and the codfw hosts. 
[08:58:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1037" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362348 [08:58:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1037 - T168661 (duration: 00m 42s) [08:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:20] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [08:59:39] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1037" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362348 (owner: 10Marostegui) [09:00:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1037" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362348 (owner: 10Marostegui) [09:00:42] 10Operations, 10vm-requests: codfw: VM request for poolcounter2001, poolcounter2002 - https://phabricator.wikimedia.org/T163892#3394101 (10akosiaris) [09:00:48] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1037" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362348 (owner: 10Marostegui) [09:01:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1037 - T168661 (duration: 00m 42s) [09:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:05] (03CR) 10Jcrespo: [C: 032] mariadb: Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362204 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:02:12] (03PS5) 10Jcrespo: mariadb: Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362204 (https://phabricator.wikimedia.org/T168356) [09:05:32] (03PS1) 10Alexandros Kosiaris: Introduce poolcounter200{1,2} [dns] - 10https://gerrit.wikimedia.org/r/362350 (https://phabricator.wikimedia.org/T163892) [09:06:12] godog: ok to merge? 
[09:06:28] jynus: oops, yes thank you [09:06:43] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce poolcounter200{1,2} [dns] - 10https://gerrit.wikimedia.org/r/362350 (https://phabricator.wikimedia.org/T163892) (owner: 10Alexandros Kosiaris) [09:10:18] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [09:10:49] !log Deploy alter table on s5 all eqiad hosts (primary master not included) - T168661 [09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:58] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [09:15:30] (03PS2) 10Jcrespo: mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 [09:15:32] (03PS1) 10Jcrespo: mariadb: Followup to Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362353 (https://phabricator.wikimedia.org/T168356) [09:19:19] (03CR) 10Jcrespo: [C: 032] mariadb: Followup to Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362353 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:19:27] (03PS2) 10Jcrespo: mariadb: Followup to Set default limits for systemd core databases [puppet] - 10https://gerrit.wikimedia.org/r/362353 (https://phabricator.wikimedia.org/T168356) [09:19:58] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [09:20:18] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [09:26:15] "Error: Could not find any hostgroup matching 'cache_canary_eqiad'" 
[09:27:15] I am trying to find a related change... [09:30:38] PROBLEM - salt-minion processes on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:58] PROBLEM - dhclient process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:28] RECOVERY - salt-minion processes on thumbor1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:31:48] RECOVERY - dhclient process on thumbor1003 is OK: PROCS OK: 0 processes with command name dhclient [09:32:46] (03CR) 10Marostegui: [C: 031] mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [09:33:38] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [09:33:58] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:38:36] ok, this seems ema potentially running puppet for the first time on a new kind of hosts [09:41:50] (03CR) 10Daniel Kinzler: [C: 031] "Thank you Krinkle!" [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [09:42:56] (03CR) 10Daniel Kinzler: [C: 031] "by the way... why does the current alias not work? docs/ontology.owl does exist in the Wikibase extension." [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [09:44:10] (03PS1) 10Jcrespo: icinga-cache: Add new hostgroup to unbreak icinga config [puppet] - 10https://gerrit.wikimedia.org/r/362361 [09:45:45] (03PS1) 10Giuseppe Lavagetto: Fix pushing in the build script. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362362 [09:47:31] 10Operations, 10Patch-For-Review, 10codfw-rollout: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#3394197 (10akosiaris) 05stalled>03Open We are finally there. We have in both DCs a multi-row ganeti installation. Moving forward with adding 2 VMs, one per DC [09:47:33] (03PS1) 10Alexandros Kosiaris: Introduce poolcounter200{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/362363 (https://phabricator.wikimedia.org/T163892) [09:47:35] (03PS1) 10Alexandros Kosiaris: Introduce alcyone, actinium [puppet] - 10https://gerrit.wikimedia.org/r/362364 (https://phabricator.wikimedia.org/T122134) [09:47:43] (03CR) 10Jcrespo: "Fix for https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=einsteinium&service=Check+correctness+of+the+icinga+configurat" [puppet] - 10https://gerrit.wikimedia.org/r/362361 (owner: 10Jcrespo) [09:48:41] (03CR) 10Alexandros Kosiaris: [C: 031] icinga-cache: Add new hostgroup to unbreak icinga config [puppet] - 10https://gerrit.wikimedia.org/r/362361 (owner: 10Jcrespo) [09:48:55] (03CR) 10Giuseppe Lavagetto: [C: 031] icinga-cache: Add new hostgroup to unbreak icinga config [puppet] - 10https://gerrit.wikimedia.org/r/362361 (owner: 10Jcrespo) [09:50:50] (03CR) 10Jcrespo: [C: 032] icinga-cache: Add new hostgroup to unbreak icinga config [puppet] - 10https://gerrit.wikimedia.org/r/362361 (owner: 10Jcrespo) [09:51:48] (03PS2) 10Alexandros Kosiaris: Introduce poolcounter200{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/362363 (https://phabricator.wikimedia.org/T163892) [09:51:50] (03PS2) 10Alexandros Kosiaris: Introduce alcyone, actinium [puppet] - 10https://gerrit.wikimedia.org/r/362364 (https://phabricator.wikimedia.org/T122134) [09:52:11] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix pushing in the build script. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362362 (owner: 10Giuseppe Lavagetto) [09:53:30] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce poolcounter200{1,2} [puppet] - 10https://gerrit.wikimedia.org/r/362363 (https://phabricator.wikimedia.org/T163892) (owner: 10Alexandros Kosiaris) [09:53:38] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce alcyone, actinium [puppet] - 10https://gerrit.wikimedia.org/r/362364 (https://phabricator.wikimedia.org/T122134) (owner: 10Alexandros Kosiaris) [09:54:21] !log uploaded kafkatee 0.1.6-1 to reprepro - T151748 [09:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:31] T151748: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748 [09:55:10] godog: I'd deploy the new kafkatee to oxygen, if you have a min to double check with me that everything works after that it would be great :D [09:56:36] (03PS4) 10Alexandros Kosiaris: Bump the TTLs again after renumbering [dns] - 10https://gerrit.wikimedia.org/r/361664 [09:56:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Bump the TTLs again after renumbering [dns] - 10https://gerrit.wikimedia.org/r/361664 (owner: 10Alexandros Kosiaris) [10:00:48] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [10:04:30] (03PS1) 10Giuseppe Lavagetto: profile::docker::storage::loopback: bind mount source dir [puppet] - 10https://gerrit.wikimedia.org/r/362369 [10:20:18] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 27 probes of 453 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [10:23:09] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::storage::loopback: bind mount source dir [puppet] - 10https://gerrit.wikimedia.org/r/362369 (owner: 10Giuseppe Lavagetto) [10:25:18] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 453 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791307/#!map [10:30:08] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/30d840c74a491c4d917f3b76485b7d101a897eda9a0dd519ef455c12d059fe20/shm is not accessible: Permission denied [10:31:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362371 (https://phabricator.wikimedia.org/T168661) [10:32:18] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 43 probes of 453 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [10:33:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:34:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:35:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:36:08] RECOVERY - Disk space on copper is OK: DISK OK [10:37:18] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 453 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [10:38:10] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362372 [10:38:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1026 - T168661 (duration: 00m 42s) [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:54] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:39:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362372 (owner: 10Marostegui) 
[10:41:00] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362372 (owner: 10Marostegui) [10:41:04] (03CR) 10Alexandros Kosiaris: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/362369 (owner: 10Giuseppe Lavagetto) [10:41:09] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1026" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362372 (owner: 10Marostegui) [10:41:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1026 - T168661 (duration: 00m 42s) [10:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:42] 10Operations, 10Graphite, 10Labs, 10Patch-For-Review, 10User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3394343 (10ArielGlenn) Poking @bd808 on this, since it's been an issue for us again in the past week. [10:45:42] elukey: sure, already deployed? [10:52:33] godog: nope, I was waiting you :) [10:54:00] 10Operations, 10Performance-Team, 10Thumbor: Implement poolcounter failover in Thumbor - https://phabricator.wikimedia.org/T169312#3394367 (10fgiunchedi) [10:55:15] (03PS1) 10Volans: Cumin: add cache generator for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) [10:55:21] elukey: hah! 
ok go ahead [10:55:43] !log deploy kafkatee 0.1.6-1 to oxygen [10:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:25] (03CR) 10jerkins-bot: [V: 04-1] Cumin: add cache generator for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) (owner: 10Volans) [10:58:05] done [10:58:52] sampled.json looks ok [10:59:45] 10Operations, 10Performance-Team, 10Thumbor: Investigate poolcounter failure leading to thumbor failing to generate thumbs - https://phabricator.wikimedia.org/T169313#3394393 (10fgiunchedi) [10:59:54] (03PS2) 10Volans: Cumin: add cache generator for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) [11:00:13] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3394410 (10jcrespo) [11:00:16] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3394408 (10jcrespo) 05Open>03Resolved This is in no way a closed issues, but the initial scope is covered- pending tidying up puppet and hiera code. But the support is working, at least as... 
[11:01:13] elukey: try killing the process sending 5xx to logstash, if it restarts then we're golden [11:02:22] yeppa, works [11:03:52] (03PS3) 10Volans: Cumin: add cache generator for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) [11:04:30] elukey: \o/ \o/ [11:08:58] !log removing leftover data on tegmen T149557 [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:10] T149557: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557 [11:15:25] 10Operations, 10Puppet: Use multiple puppetdbs on puppet masters - https://phabricator.wikimedia.org/T169318#3394486 (10fgiunchedi) [11:15:38] (03PS1) 10Addshore: WMDE Summer campaign - Add logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631) [11:17:54] <_joe_> !log purging varnish, varnish-dbg from copper [11:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:19] (03CR) 10Volans: "Replies inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [11:22:40] <_joe_> !log rebooting copper for kernel upgrade [11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:27] thanks volans, we'll fix the rest of the script now that we agree on the solution. Good catch! [11:24:41] elukey: thank you for your patience ;) [11:25:07] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3394529 (10jcrespo) I did all of the above on tegmen except the cron, which I think was handled automatically by the user-deletion proces... 
[11:26:58] (03PS1) 10Elukey: Replace 'invoke-rc.d' with 'service' in logrotate config [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/362382 (https://phabricator.wikimedia.org/T151748) [11:28:01] godog: --^ [11:28:08] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 785 MB (1% inode=54%): /var/lib/docker/devicemapper 785 MB (1% inode=54%) [11:31:32] (03PS1) 10ArielGlenn: actually rotate the cirrusdump log files [puppet] - 10https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688) [11:32:25] 10Operations, 10DC-Ops, 10monitoring: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3394552 (10fgiunchedi) [11:39:48] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2042619 [11:41:55] 10Operations, 10OTRS: Research whether it makes sense to have OTRS installation in an HA setup - https://phabricator.wikimedia.org/T169322#3394581 (10akosiaris) [11:42:14] 10Operations, 10OTRS: Research whether it makes sense to have OTRS installation in an HA setup - https://phabricator.wikimedia.org/T169322#3394593 (10akosiaris) p:05Triage>03Low [11:43:28] 10Operations, 10DC-Ops, 10monitoring: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3394595 (10akosiaris) [11:44:14] 10Operations, 10DC-Ops, 10monitoring: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3394552 (10akosiaris) As far as yesterday's outage is concerned, even `1` would have prevented it. But indeed if we can get 1 we can easily get 3 as well. [11:45:22] PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [11:46:52] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 9 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [11:50:38] 10Operations, 10DC-Ops, 10monitoring: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3394605 (10fgiunchedi) [11:50:52] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2016548 [11:51:52] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:58:05] 10Operations, 10Patch-For-Review, 10codfw-rollout: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#3394610 (10akosiaris) 05Open>03Resolved alcyone and actinium are up and running and capable of being used as url-downloaders. So the service is finally set up mo... [12:02:33] (03PS1) 10Alexandros Kosiaris: Replace subra/suhail with poolcounter200{1,2} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362386 [12:04:03] (03CR) 10Marostegui: "@jcrespo do you have any opinion (in favour or against) about this host becoming the new master for sanitarium for s2 role?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:04:31] (03CR) 10Lydia Pintscher: "If the redirect currently doesn't work I'd say let's kill it. The concerns raised are valid even if the likelihood is small. 
And we should" [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [12:06:22] RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:06:29] (03PS2) 10Alexandros Kosiaris: Replace subra/suhail with poolcounter200{1,2} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362386 (https://phabricator.wikimedia.org/T163892) [12:07:29] !log replace subra and suhail as poolcounters in codfw [12:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:41] (03CR) 10Alexandros Kosiaris: [C: 032] Replace subra/suhail with poolcounter200{1,2} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362386 (https://phabricator.wikimedia.org/T163892) (owner: 10Alexandros Kosiaris) [12:07:52] (03CR) 10jenkins-bot: Replace subra/suhail with poolcounter200{1,2} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362386 (https://phabricator.wikimedia.org/T163892) (owner: 10Alexandros Kosiaris) [12:08:52] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: Replace subra/suhail as poolcounters (duration: 00m 43s) [12:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:15] (03CR) 10Elukey: [C: 032] Replace 'invoke-rc.d' with 'service' in logrotate config [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/362382 (https://phabricator.wikimedia.org/T151748) (owner: 10Elukey) [12:09:28] 10Operations, 10vm-requests, 10Patch-For-Review: codfw: VM request for poolcounter2001, poolcounter2002 - https://phabricator.wikimedia.org/T163892#3394668 (10akosiaris) 05Open>03Resolved a:03akosiaris [12:11:21] (03PS1) 10Elukey: Update kafkatee module sha to the latest change. [puppet] - 10https://gerrit.wikimedia.org/r/362387 [12:13:53] (03CR) 10Elukey: [C: 032] Update kafkatee module sha to the latest change. 
[puppet] - 10https://gerrit.wikimedia.org/r/362387 (owner: 10Elukey) [12:14:01] (03PS1) 10Alexandros Kosiaris: Revert "Fixing several mgmt dns entries in wmnet file...had wrong zone" [dns] - 10https://gerrit.wikimedia.org/r/362389 [12:14:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "Fixing several mgmt dns entries in wmnet file...had wrong zone" [dns] - 10https://gerrit.wikimedia.org/r/362389 (owner: 10Alexandros Kosiaris) [12:15:45] (03CR) 10Jcrespo: "No reason against. You seem to be pooling temporary intermediate ROW hosts with 0 weight, I wonder why (aside from the temporary stop for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:15:45] (03CR) 10Alexandros Kosiaris: "I am not fully clear on why the original patch was submitted in the first place. Just to be clear, I 've uploaded this revert in order to " [dns] - 10https://gerrit.wikimedia.org/r/362389 (owner: 10Alexandros Kosiaris) [12:16:31] 10Operations, 10Patch-For-Review, 10User-Elukey: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#2826635 (10elukey) Let's wait the regular weekly rotates to happen before calling this a win :) [12:17:06] (03CR) 10Jcrespo: "Also, unrelated, but we probably can use one of the future s8 host for sanitarium2 (second host) to have more room and avoid doing complex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:17:40] (03CR) 10Marostegui: "> No reason against. 
You seem to be pooling temporary intermediate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:19:32] (03CR) 10Marostegui: "> Also, unrelated, but we probably can use one of the future s8 host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:19:41] (03CR) 10Jcrespo: "Ok, we should leave them pooled, specially ones like this (for long term)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:20:30] (03CR) 10Marostegui: "> Ok, we should leave them pooled, specially ones like this (for long" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:20:49] 10Operations, 10Patch-For-Review: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#3394727 (10akosiaris) subra and suhail are no longer used, poolcounter2001/poolcounter2002 are now used. We can repurpose subra/suhail [12:21:29] (03CR) 10Jcrespo: "Not necessarily main traffic, it could be from the other roles, like this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:21:43] (03PS1) 10Addshore: Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) [12:21:45] (03PS1) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362394 (https://phabricator.wikimedia.org/T110170) [12:22:05] (03PS2) 10Marostegui: db-eqiad.php: Make db1060 s2 sanitarium2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) [12:22:15] (03CR) 10Addshore: [C: 04-2] "Needs its first branch to be made" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [12:22:22] (03CR) 10Addshore: [C: 04-2] Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362394 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [12:23:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Make db1060 s2 sanitarium2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:25:02] (03Merged) 10jenkins-bot: db-eqiad.php: Make db1060 s2 sanitarium2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:26:05] (03CR) 10jenkins-bot: db-eqiad.php: Make db1060 s2 sanitarium2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:26:23] (03Abandoned) 10Alexandros Kosiaris: Revert "Fixing several mgmt dns entries in wmnet file...had wrong zone" [dns] - 10https://gerrit.wikimedia.org/r/362389 (owner: 10Alexandros Kosiaris) [12:26:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments to db1060 about its future 
usage as a sanitarium master - T153743 (duration: 00m 42s) [12:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:43] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [12:30:22] (03PS3) 10Joal: Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) [12:32:12] RECOVERY - Disk space on copper is OK: DISK OK [12:33:29] (03CR) 10jerkins-bot: [V: 04-1] Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [12:39:37] (03PS4) 10Joal: Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) [12:47:19] !log just upgraded wmf-mariadb101-client on mariadb::client hosts [12:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:10] (03PS1) 10Faidon Liambotis: Move IPMI monitoring check etc. 
to ipmi::monitor [puppet] - 10https://gerrit.wikimedia.org/r/362399 [12:48:12] (03PS1) 10Faidon Liambotis: ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 [12:48:12] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/7bdbc3ad4b260369bec6865d69c1e7637e1b74bee8affd5d55ba4a0a0b704156/shm is not accessible: Permission denied [12:48:14] (03PS1) 10Faidon Liambotis: base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 [12:48:16] (03PS1) 10Faidon Liambotis: base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 [12:48:18] akosiaris: ^ [12:49:20] (03PS1) 10Urbanecm: Limit thanks for new users at pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) [12:49:36] (03PS2) 10Urbanecm: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) [12:50:33] (03CR) 10jerkins-bot: [V: 04-1] base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 (owner: 10Faidon Liambotis) [12:50:43] (03CR) 10jerkins-bot: [V: 04-1] ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 (owner: 10Faidon Liambotis) [12:51:12] (03CR) 10jerkins-bot: [V: 04-1] base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 (owner: 10Faidon Liambotis) [12:52:13] (03PS3) 10Urbanecm: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) [12:54:12] RECOVERY - Disk space on copper is OK: DISK OK [12:55:33] (03PS2) 10Faidon Liambotis: ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 [12:55:35] (03PS2) 10Faidon Liambotis: base: switch 
ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 [12:55:37] (03PS2) 10Faidon Liambotis: base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 [12:56:58] Hi all, recently had some issues with 2FA (T168064 related) and I've now come around to try to log back into wikitech and found Google Auth has removed the account helpfully. Can I get 2FA reset on my wikitech account Samtar? It's linked to Phab (https://wikitech.wikimedia.org/w/index.php?title=User%3ASamtar&type=revision&diff=256765&oldid=174574) and I still have access there to confirm. Can create a task if required [12:56:58] T168064: Possible issue with 2FA tokens - https://phabricator.wikimedia.org/T168064 [12:57:12] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/a38350fa2b73f21b9e335a803d10d331ee3c072b687c9d50547a05a46c6dd10c/shm is not accessible: Permission denied [12:59:38] (03PS1) 10Jcrespo: Fix configurable socket options for new labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/362404 (https://phabricator.wikimedia.org/T148507) [13:00:06] (03CR) 10Marostegui: [C: 031] Fix configurable socket options for new labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/362404 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:01:10] (03CR) 10jerkins-bot: [V: 04-1] Fix configurable socket options for new labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/362404 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:01:23] TheresNoTime: Umm, your wikitech account is not linked to your phab account [13:02:20] bawolff: linked as in I physically providing a link to the account as in that diff above when I had access to it, apologies - didn't mean a technical link [13:02:22] 10Operations, 10DC-Ops, 10monitoring: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3394946 (10faidon) Well, it depends on how we would monitor them. 
Yesterday's issue wasn't caused by management interfaces being unreachable, but by their DNS pointing to the wrong address. I j... [13:02:38] (03PS2) 10Jcrespo: Fix configurable socket options for new labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/362404 (https://phabricator.wikimedia.org/T148507) [13:03:12] RECOVERY - Disk space on copper is OK: DISK OK [13:03:59] your phab account is linked to the SUL [[User:There'sNoTime]], which was renamed from [[User:Samtar]] - https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=&user=&page=User%3ASamtar [13:05:42] (03CR) 10Jcrespo: [C: 032] Fix configurable socket options for new labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/362404 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:05:44] (03CR) 10Alexandros Kosiaris: [C: 031] Move IPMI monitoring check etc. to ipmi::monitor [puppet] - 10https://gerrit.wikimedia.org/r/362399 (owner: 10Faidon Liambotis) [13:06:59] TheresNoTime: Are you in control of [[User:There'sNoTime]] on normal wikis. An edit from that account stating that you own the phab account and want a 2FA reset, would probably be in order [13:07:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "seems fine, minor comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362400 (owner: 10Faidon Liambotis) [13:07:48] Hmm, you are also logged into irc, with a wikipedia cloak, which is a good sign [13:08:10] bawolff: https://en.wikipedia.org/w/index.php?title=User%3AThere%27sNoTime&type=revision&diff=788265155&oldid=788078670 sufficient? :) I was part the way through logging a task, if you need something to reference? [13:08:42] TheresNoTime: You're still going to need to create a task though for this [13:08:54] Not a problem, I'll finish it up now [13:10:05] bawolff: which project do 2FA resets go to? 
o.O [13:10:38] (03CR) 10Alexandros Kosiaris: [C: 031] base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 (owner: 10Faidon Liambotis) [13:10:50] umm, I'm not sure. Just make sure that aklapper is cc'd [13:11:15] (03CR) 10Alexandros Kosiaris: [C: 031] base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 (owner: 10Faidon Liambotis) [13:11:42] T169332 [13:11:43] T169332: 2FA reset for Wikitech account - https://phabricator.wikimedia.org/T169332 [13:12:10] (03PS3) 10Faidon Liambotis: ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 [13:12:12] (03PS3) 10Faidon Liambotis: base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 [13:12:13] Oh, sorry, I was confused, I thought you were talking about your phab account [13:12:15] (03PS3) 10Faidon Liambotis: base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 [13:12:32] which is why I was asking you to prove you owned your SUL account [13:12:45] (03PS1) 10Jcrespo: Labsdb-replica: Fix bug on gerrit:362404 [puppet] - 10https://gerrit.wikimedia.org/r/362408 [13:13:02] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:13:36] (03CR) 10Alexandros Kosiaris: [C: 031] ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 (owner: 10Faidon Liambotis) [13:13:59] akosiaris: hmm, i found a bug [13:14:00] ^that is me [13:14:11] circular dependency :) [13:14:12] I'll fix [13:14:28] (03PS4) 10Faidon Liambotis: ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 [13:14:30] (03PS4) 10Faidon Liambotis: base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 [13:14:34] (03PS4) 10Faidon Liambotis: base: cleanup unneeded ipmi packages/checks [puppet] - 10https://gerrit.wikimedia.org/r/362402 [13:14:39] (03CR) 10Jcrespo: [C: 032] Labsdb-replica: Fix bug on gerrit:362404 [puppet] - 10https://gerrit.wikimedia.org/r/362408 (owner: 10Jcrespo) [13:17:02] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:18:09] (03PS20) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:19:03] volans: --^ [13:19:04] TheresNoTime: Also, using your tool labs account to put some sort of file in a tool you own, might be a good additional step to prove you really are you [13:19:45] (03CR) 10Mforns: "Thanks Volans!" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:19:46] ah good idea! [13:21:07] oh, i just found our official documentation [13:21:17] https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication [13:21:45] so writing a request to reset in your tool labs account is an officially reccomended way to prove you are you [13:21:56] (03CR) 10Faidon Liambotis: [C: 032] "PCC for both physical and VMs says this is OK." 
[puppet] - 10https://gerrit.wikimedia.org/r/362399 (owner: 10Faidon Liambotis) [13:22:05] Awesome, done at http://tools.wmflabs.org/communityguidelines/T169332.txt and I'll add it to the task [13:22:07] (03CR) 10Faidon Liambotis: [C: 032] ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 (owner: 10Faidon Liambotis) [13:22:30] (03PS2) 10Faidon Liambotis: Move IPMI monitoring check etc. to ipmi::monitor [puppet] - 10https://gerrit.wikimedia.org/r/362399 [13:22:46] (03PS5) 10Faidon Liambotis: ipmi: add has_ipmi and ipmi_lan facts [puppet] - 10https://gerrit.wikimedia.org/r/362400 [13:24:23] ok, that should be good enough [13:28:13] TheresNoTime: Ok, I'm going to reset your 2FA [13:28:53] bawolff: Thank you :) [13:28:54] (03PS4) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [13:31:20] volans: ehm we found a bug, PS21 is the best one :P [13:32:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362413 [13:32:49] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362413 [13:32:56] !log Reset 2FA of wikitech [[User:Samtar]] (T169332) [13:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:05] T169332: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332 [13:33:14] TheresNoTime: Should be done now [13:34:45] (03PS21) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:37:56] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362413 (owner: 10Marostegui) [13:39:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool 
db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362413 (owner: 10Marostegui) [13:39:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1028" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362413 (owner: 10Marostegui) [13:39:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1028 - T166208 (duration: 00m 42s) [13:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:09] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [13:41:23] (03PS5) 10Strainu: Set collation for Romanian wikis to uca-ro-u-kn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [13:58:54] (03PS1) 10Giuseppe Lavagetto: Fix lookup for image tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362416 [14:04:59] (03PS22) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [14:06:23] (03PS1) 10BBlack: CAA: add records to all canonicals [dns] - 10https://gerrit.wikimedia.org/r/362419 (https://phabricator.wikimedia.org/T155806) [14:14:29] (03PS1) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/362421 [14:15:01] (03PS15) 10Paladox: Upgrade gerrit to 2.14.2-pre (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [14:16:36] !log reboot cp4021 [14:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:22:44] is gehel around? [14:23:01] jynus: I'm here [14:23:17] maybe a quick look at that alert? 
[14:23:28] yep, having a look [14:23:32] ok [14:28:10] 10Operations, 10DBA: File space alert for db1028 - https://phabricator.wikimedia.org/T169294#3395189 (10Andrew) thanks! [14:28:56] 10Operations, 10Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395192 (10Andrew) Yep, no problems in ages. Thanks for the bug cleanup. [14:33:37] (03Draft1) 10Paladox: Gerrit: Remove NameVirtualHost from apache file [puppet] - 10https://gerrit.wikimedia.org/r/362426 [14:33:39] (03PS2) 10Paladox: Gerrit: Remove NameVirtualHost from apache file [puppet] - 10https://gerrit.wikimedia.org/r/362426 [14:37:22] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:38:11] (03PS1) 10Giuseppe Lavagetto: Add python images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362427 [14:40:42] elukey: ack, I'll take a look, so which PS in the end? :D [14:40:57] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix lookup for image tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362416 (owner: 10Giuseppe Lavagetto) [14:41:18] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add python images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362427 (owner: 10Giuseppe Lavagetto) [14:42:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [14:48:15] volans: last one :) [14:48:21] we are running it now in EL beta [14:48:27] seems to work fine [14:48:36] elukey: ok, I'll probably look at it in ~1h or so [14:48:41] even next week [14:48:50] so you can still fix last minute bugs, if found :-P [14:51:29] !log banning elastic1019 from cluster to move heavy shards around [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:43] 10Operations, 10Traffic, 10netops: codfw row B switch 
upgrade - https://phabricator.wikimedia.org/T169345#3395298 (10ayounsi) [14:53:06] (03PS1) 10Giuseppe Lavagetto: Do not rebuild already built images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362428 [14:53:25] (03CR) 10Joal: [C: 04-1] "I don't think this actually works as is." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362237 (https://phabricator.wikimedia.org/T169248) (owner: 10Nuria) [14:54:51] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Do not rebuild already built images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/362428 (owner: 10Giuseppe Lavagetto) [14:59:12] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/6d3fb98f4daaa160c4886d4595ef430ffeded59ad275ff0addef338aec314b5f/shm is not accessible: Permission denied [15:00:50] 10Operations, 10Goal, 10Kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3395326 (10Joe) [15:00:53] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3395324 (10Joe) 05Open>03Resolved [15:05:12] RECOVERY - Disk space on copper is OK: DISK OK [15:05:22] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:06:14] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395348 (10madhuvishy) @jcrespo Apologies for the delay. Can we start with just labsdb1005 first, and attempt to do it Wednesday July 5, and labsdb1004 on Thursday July 6, provided the... 
[15:07:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [15:08:12] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/8a9f33cbaeeaac278fe2378a95f96e322d03cab748384416615c127391de596d/shm is not accessible: Permission denied [15:09:42] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 16635 [15:12:33] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395356 (10Cmjohnson) @madhuvishy I am out all next week and will be back July 11. [15:13:52] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [15:14:12] RECOVERY - Disk space on copper is OK: DISK OK [15:14:41] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395358 (10madhuvishy) @Cmjohnson Okay thanks for letting me know, I'll schedule the labsdb1001 and 1003 reboots (the ciscos), for after you are back then. When are you in the DC (from... [15:14:52] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [15:16:22] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395385 (10Cmjohnson) @madhuvishy I typically get the DC around 1400UTC (10am EST). 
[15:19:19] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3395386 (10Cmjohnson) [15:26:06] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3395402 (10Cmjohnson) The disk has been replaced and needs a fresh install [15:33:47] (03PS3) 10Nuria: Adding mailto to camus job [puppet] - 10https://gerrit.wikimedia.org/r/362237 (https://phabricator.wikimedia.org/T169248) [15:38:19] 10Operations, 10ops-eqiad, 10Labs: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3395406 (10Cmjohnson) [15:39:38] !log unbanning elastic1019 from cluster and keeping an eye on it [15:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:36] (03PS5) 10Joal: Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) [15:42:36] (03CR) 10jerkins-bot: [V: 04-1] Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [15:44:49] (03PS6) 10Joal: Add cron job dropping webrequest from druid [puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) [15:51:46] 10Operations, 10ops-eqiad, 10Analytics: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3395413 (10elukey) >>! In T168927#3392189, @RobH wrote: > This system is out of warranty, and will require onsite spare disks to be used as replacement. Yes please, do we need approvals... [15:52:14] elukey: you dont need approvals for that, i was just saving chris the step of looking up the warranty info =] [15:52:53] 10Operations, 10ops-eqiad, 10Analytics: Smartctl errors for one kafka1012 disk - https://phabricator.wikimedia.org/T168927#3395415 (10RobH) >>! In T168927#3395413, @elukey wrote: >>>! 
In T168927#3392189, @RobH wrote: >> This system is out of warranty, and will require onsite spare disks to be used as replace... [15:55:16] robh: aahhhh okok just wanted to make sure :) [15:58:25] (03CR) 10Volans: "LGTM, a couple of minor nitpicks, and there were some comment on PS18 that were left unanswered and unchanged, so I'm not sure if they wer" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [16:00:52] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 2452 [16:02:09] (03CR) 10Volans: "Compiler results available:: http://puppet-compiler.wmflabs.org/6898/" [puppet] - 10https://gerrit.wikimedia.org/r/362377 (https://phabricator.wikimedia.org/T169304) (owner: 10Volans) [16:05:15] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3395422 (10RobH) >>! In T165520#3393614, @faidon wrote: > Likely because of a mismatch of our netboot image and Debian's kernel image. I've updated our netboot image, can you try again? Still happening as of 20... [16:10:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:18:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:32:39] (03CR) 10Nuria: [C: 031] "Looks good, let's merge it on monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/362148 (https://phabricator.wikimedia.org/T168614) (owner: 10Joal) [16:40:29] (03PS1) 10Faidon Liambotis: autoinstall: switch stretch's d-i to stable [puppet] - 10https://gerrit.wikimedia.org/r/362435 [16:40:58] (03CR) 10Faidon Liambotis: [V: 032 C: 032] autoinstall: switch stretch's d-i to stable [puppet] - 10https://gerrit.wikimedia.org/r/362435 (owner: 10Faidon Liambotis) [16:47:33] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3395544 (10RobH) a:05Cmjohnson>03RobH mw systems are raid1, so this should be able to rebuild without reimage. I'll take the task to find out whats up with it. [16:49:22] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3395548 (10Cmjohnson) robh: it already hit the installer, a fresh install is required [16:59:49] (03CR) 10Chad: [C: 031] "That's from 2.2 days heh" [puppet] - 10https://gerrit.wikimedia.org/r/362426 (owner: 10Paladox) [17:00:00] Hello, can anybody deploy 362403 as an emergency please? It is due to extensive problematic user at their wiki known as Wikinger who thanks for random edit from newly created account so the final notification number is even three-digit. See T169268 for details. [17:00:00] T169268: Limiting thanks for new users at pl.wikipedia - https://phabricator.wikimedia.org/T169268 [17:00:56] Sorry, I've misread the last comment from the author :). I'll add it to the regular calendar. [17:01:20] !log rebooting mw1196 [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:47] (03CR) 10Framawiki: [C: 031] "Since it's not a code patch but a config one, bureaucracy need a separate and special deployment. See https://wikitech.wikimedia.org/wiki/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [17:02:37] And write faster than reading again :D. 
Its up to you, I really don't know if it should be considered as an emergency. [17:03:52] PROBLEM - Check systemd state on mw1196 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:42] RECOVERY - Check systemd state on mw1196 is OK: OK - running: The system is fully operational [17:07:40] (03PS1) 10Cmjohnson: Adding mgmt dns entries for restbase-dev100[456] [dns] - 10https://gerrit.wikimedia.org/r/362437 [17:09:37] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for restbase-dev100[456] [dns] - 10https://gerrit.wikimedia.org/r/362437 (owner: 10Cmjohnson) [17:09:44] (03PS1) 10BBlack: [WIP] numa_networking: new state "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/362438 [17:09:46] (03PS1) 10BBlack: [WIP] NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 [17:10:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 (owner: 10BBlack) [17:13:09] (03PS2) 10BBlack: [WIP] NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 [17:15:07] (03PS2) 10Herron: Change donate.wikimedia.org SPF to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/361718 (https://phabricator.wikimedia.org/T167704) [17:15:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3395609 (10Cmjohnson) [17:19:06] (03CR) 10Herron: [C: 032] Change donate.wikimedia.org SPF to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/361718 (https://phabricator.wikimedia.org/T167704) (owner: 10Herron) [17:31:31] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3395648 (10RobH) Issue fixed with https://gerrit.wikimedia.org/r/#/c/362435/ [17:36:24] PROBLEM - MegaRAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:36:25] ACKNOWLEDGEMENT 
- MegaRAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T169355 [17:36:31] (03CR) 10Ladsgroup: "This role is used in prod too and I think it might cause down time, handle with cautious." [puppet] - 10https://gerrit.wikimedia.org/r/362097 (owner: 10Awight) [17:37:37] 10Operations, 10ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395652 (10ops-monitoring-bot) [17:40:59] wikibugs just quit for excess flood :( [17:41:52] (03PS1) 10Rush: libvirt: turn off instance stats collection [puppet] - 10https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) [17:42:21] T112032 [17:42:22] T112032: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032 [17:42:46] mutante: nice, thnaks [17:43:12] RECOVERY - dhclient process on wtp1042 is OK: PROCS OK: 0 processes with command name dhclient [17:43:12] RECOVERY - salt-minion processes on wtp1042 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:43:15] although was resolved [17:43:17] :D [17:43:22] RECOVERY - DPKG on wtp1042 is OK: All packages OK [17:43:22] RECOVERY - MD RAID on wtp1042 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:43:27] it used to be way more often [17:43:30] it did improve :p [17:43:42] RECOVERY - Disk space on wtp1042 is OK: DISK OK [17:44:02] RECOVERY - Check systemd state on wtp1042 is OK: OK - running: The system is fully operational [17:44:02] RECOVERY - configured eth on wtp1042 is OK: OK - interfaces up [17:44:32] RECOVERY - puppet last run on wtp1042 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:45:00] (03PS5) 10Faidon Liambotis: base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 [17:45:12] RECOVERY - Check the NTP synchronisation status of timesyncd on wtp1042 is OK: OK: synced at Fri 
2017-06-30 17:45:09 UTC. [17:46:01] (03CR) 10BryanDavis: [C: 031] "We aren't using these metrics and if this applies on the contint instances it will make a crazy number of metrics that are just taking up " [puppet] - 10https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) (owner: 10Rush) [17:46:02] (03PS2) 10Rush: libvirt: turn off instance stats collection [puppet] - 10https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) [17:46:02] RECOVERY - Disk space on wtp1043 is OK: DISK OK [17:46:02] RECOVERY - salt-minion processes on wtp1043 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:46:03] (03CR) 10Faidon Liambotis: [C: 032] base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 (owner: 10Faidon Liambotis) [17:46:05] (03CR) 10Faidon Liambotis: [V: 032 C: 032] base: switch ipmi::monitor inclusion to has_ipmi [puppet] - 10https://gerrit.wikimedia.org/r/362401 (owner: 10Faidon Liambotis) [17:46:12] RECOVERY - MD RAID on wtp1043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:46:32] RECOVERY - configured eth on wtp1043 is OK: OK - interfaces up [17:46:32] RECOVERY - dhclient process on wtp1043 is OK: PROCS OK: 0 processes with command name dhclient [17:46:42] RECOVERY - DPKG on wtp1043 is OK: All packages OK [17:46:45] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:46:52] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational [17:47:07] (03CR) 10Rush: [V: 032 C: 032] libvirt: turn off instance stats collection [puppet] - 10https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) (owner: 10Rush) [17:47:08] (03PS3) 10Rush: libvirt: turn off instance stats collection [puppet] - 10https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) [17:47:16] mutante: also it 
reconnected here but not in -databases for example
[17:48:42] volans: eh, ok, i have no idea why. https://tools.wmflabs.org/?tool=wikibugs
[17:49:04] (CR) Rush: [V: 2 C: 2] libvirt: turn off instance stats collection [puppet] - https://gerrit.wikimedia.org/r/362444 (https://phabricator.wikimedia.org/T143405) (owner: Rush)
[17:49:21] akosiaris: when doing puppet-merge I see the last line is "Host key verification failed."...not sure if a real issue
[17:50:11] (PS5) Faidon Liambotis: base: cleanup unneeded ipmi packages/checks [puppet] - https://gerrit.wikimedia.org/r/362402
[17:50:13] (CR) Faidon Liambotis: [V: 2 C: 2] base: cleanup unneeded ipmi packages/checks [puppet] - https://gerrit.wikimedia.org/r/362402 (owner: Faidon Liambotis)
[17:52:14] mutante volans it will reconnect to channels when there is activity that touches those channels
[17:52:30] ah !
[17:52:35] for example the bot will reconnect to -releng if there is activity touching either the integration projects or repos
[17:52:48] gotcha
[17:53:15] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395711 (Marostegui) @Cmjohnson you still in the DC?
[17:53:21] ok it's lazy
[17:54:30] (CR) Paladox: "thanks." [puppet] - https://gerrit.wikimedia.org/r/362426 (owner: Paladox)
[17:55:55] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395717 (Cmjohnson) @marostegui I am not
[17:57:50] Operations, Graphite, Labs, Patch-For-Review, User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3395718 (bd808) @fgiunchedi The patch from @chasemp should stop new ones from being created once puppet does its thing across all of the VMs....
[17:59:44] (PS1) Faidon Liambotis: openstack/diamond: remove the libvirtkvm collector [puppet] - https://gerrit.wikimedia.org/r/362446
[18:02:23] Operations, Mobile-Content-Service, Services, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3395775 (bearND)
[18:03:18] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395781 (Marostegui) @Cmjohnson if it helps, there are some hosts that are ready to be decommissioned which have 600GB disks which are probably old though: T166486 T164702
[18:04:10] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395785 (jcrespo) p:Triage>High
[18:07:30] Operations, Labs, Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395786 (chasemp) Resolved>Open I don't think this should be closed as long as this stuff exists: > modules/role/manifests/openldap/labs.pp ``` # restart slapd if it uses more...
[18:09:29] Operations, Traffic, netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3395792 (chasemp) We can absorb whatever outage is convenient for you @ayounsi for our things: ```labstore* labtestcontrol* labtestnet* labtestneutron* labtestvirt*``` @madhuvishy we may need to keep...
[18:10:20] (CR) BBlack: [C: 2] CAA: add records to all canonicals [dns] - https://gerrit.wikimedia.org/r/362419 (https://phabricator.wikimedia.org/T155806) (owner: BBlack)
[18:10:59] Operations, Traffic, netops: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3395802 (madhuvishy) @chasemp Cool, thanks for the heads up!
[18:11:44] Operations, ops-codfw, ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3395803 (faidon)
[18:12:37] Operations, ops-codfw, ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3395803 (Cmjohnson) is it okay to power off or do these need to be scheduled?
[18:13:38] RECOVERY - Check the NTP synchronisation status of timesyncd on wtp1043 is OK: OK: synced at Fri 2017-06-30 18:13:29 UTC.
[18:14:59] (PS5) Dzahn: apache: add class for mod_php with PHP 7.0 for stretch [puppet] - https://gerrit.wikimedia.org/r/362119 (https://phabricator.wikimedia.org/T159756)
[18:18:12] (CR) Paladox: [C: 1] apache: add class for mod_php with PHP 7.0 for stretch [puppet] - https://gerrit.wikimedia.org/r/362119 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[18:18:14] (CR) Dzahn: [C: 2] "just adds new class, doesn't include it http://puppet-compiler.wmflabs.org/6904/" [puppet] - https://gerrit.wikimedia.org/r/362119 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[18:18:16] (PS6) Dzahn: apache: add class for mod_php with PHP 7.0 for stretch [puppet] - https://gerrit.wikimedia.org/r/362119 (https://phabricator.wikimedia.org/T159756)
[18:24:10] (CR) Dzahn: [C: 2] librenms: use libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/362123 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[18:27:07] (PS2) Dzahn: librenms: use libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/362123 (https://phabricator.wikimedia.org/T159756)
[18:36:36] (PS3) MarcoAurelio: Fix nowikisource template namespace subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/362272 (https://phabricator.wikimedia.org/T166035)
[18:37:09] (PS3) MarcoAurelio: Add 'WP' namespace alias to ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/362267 (https://phabricator.wikimedia.org/T168164)
[18:45:13]
PROBLEM - salt-minion processes on wtp1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:46:55] PROBLEM - salt-minion processes on wtp1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:51:07] (CR) Dzahn: [C: 2] librenms: use libapache2-mod-php7.0 if on stretch [puppet] - https://gerrit.wikimedia.org/r/362123 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[18:51:56] (CR) Pmiazga: "rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/358415 (https://phabricator.wikimedia.org/T165018) (owner: Pmiazga)
[18:52:28] (PS3) Pmiazga: Remove unused wgPopupsAPIUseRESTBase config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/358415 (https://phabricator.wikimedia.org/T165018)
[18:58:53] RECOVERY - salt-minion processes on wtp1047 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:59:13] RECOVERY - salt-minion processes on wtp1048 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:59:24] Operations, ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3395865 (RobH) a:RobH>akosiaris
[18:59:57] Operations, ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268863 (RobH) These systems are all online with stretch, puppet/salt signed. They have not been added to site.pp specifically, so they are just getting defaults. Assigning to Alex for implementation. This...
[19:15:38] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3395887 (Dzahn) Alright, so: - servermon: one blocker removed: python-django-south has been uploaded to stretch-wikimedia-stretch (T159756#3388460, T159756#3389767) - network::monitor one blocker r...
[19:16:37] (Draft1) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[19:16:39] (PS2) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[19:18:32] (CR) jerkins-bot: [V: -1] servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455 (owner: Paladox)
[19:20:28] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395891 (Cmjohnson) @marostegui the disk has been swapped with the last new spare disk on-site.
[19:21:18] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3395892 (Cmjohnson) Currently rebuilding Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 WWN: 5000C500437173D8 Sequence Number: 11...
[19:21:55] !log mw1182 powering down to due to unresponsive idrac
[19:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:50] PROBLEM - Host mw1182 is DOWN: PING CRITICAL - Packet loss = 100%
[19:24:30] PROBLEM - Host mw1186 is DOWN: PING CRITICAL - Packet loss = 100%
[19:26:21] RECOVERY - Host mw1186 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[19:28:49] (PS1) BryanDavis: Add libicu-dev to nodejs images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/362457 (https://phabricator.wikimedia.org/T169338)
[19:28:49] (PS4) Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[19:29:40] RECOVERY - Host mw1182 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:30:39] (CR) Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[19:31:34]
!log powering off mw1190 to reestablish idrac connection
[19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:40] PROBLEM - Host mw1190 is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:19] (CR) BryanDavis: [C: 2] Add libicu-dev to nodejs images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/362457 (https://phabricator.wikimedia.org/T169338) (owner: BryanDavis)
[19:36:20] RECOVERY - Host mw1190 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[19:36:57] (Merged) jenkins-bot: Add libicu-dev to nodejs images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/362457 (https://phabricator.wikimedia.org/T169338) (owner: BryanDavis)
[19:37:10] !log powering off mw1191 for unresponsive idrac
[19:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:20] PROBLEM - Host mw1191 is DOWN: PING CRITICAL - Packet loss = 100%
[19:41:20] RECOVERY - Host mw1191 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[19:41:45] !log powering off mw1196 for unresponsive idrac
[19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:00] PROBLEM - Host mw1196 is DOWN: PING CRITICAL - Packet loss = 100%
[19:57:58] (PS1) ArielGlenn: batch abstract jobs and do abstracts and stubs in smaller queries [dumps] - https://gerrit.wikimedia.org/r/362462
[20:01:01] Operations, ops-codfw, ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3395989 (Cmjohnson)
[20:02:07] Operations, ops-codfw, ops-eqiad: Unresponsive iDRACs - https://phabricator.wikimedia.org/T169360#3395803 (Cmjohnson) mw1196 did not come back, fans were running but no output on crash cart.
downed it but probably needs decom
[20:02:07] (CR) ArielGlenn: [C: 2] batch abstract jobs and do abstracts and stubs in smaller queries [dumps] - https://gerrit.wikimedia.org/r/362462 (owner: ArielGlenn)
[20:03:42] !log ariel@tin Started deploy [dumps/dumps@02c71bc]: permit batching of abstract jobs, fix a dryrun reporting typo, smaller stub/abstract queries
[20:03:45] !log ariel@tin Finished deploy [dumps/dumps@02c71bc]: permit batching of abstract jobs, fix a dryrun reporting typo, smaller stub/abstract queries (duration: 00m 03s)
[20:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:56] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3396020 (Marostegui) >>! In T169355#3395891, @Cmjohnson wrote: > @marostegui the disk has been swapped with the last new spare disk on-site. Thanks Chris! Should we order more spares or how is this usuall...
[20:19:56] (PS5) ArielGlenn: treat wikidata just like enwiki for dumps [puppet] - https://gerrit.wikimedia.org/r/355100
[20:20:04] (CR) Dereckson: [C: 1] "Emergency deployment would make sense, as the flood has been reported to be ongoing right now."
[mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[20:21:19] (CR) EBernhardson: [C: 1] actually rotate the cirrusdump log files [puppet] - https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688) (owner: ArielGlenn)
[20:25:18] (PS16) Paladox: Zuul: Add systemd script for zuul [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833)
[20:26:07] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3396037 (Dzahn)
[20:26:10] (PS18) Paladox: Zuul: Add systemd script for zuul [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833)
[20:26:13] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077593 (Dzahn)
[20:44:30] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl]
[21:02:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[21:08:20] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:10:08] (PS6) ArielGlenn: treat wikidata just like enwiki for dumps [puppet] - https://gerrit.wikimedia.org/r/355100
[21:12:50] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[21:15:45] (CR) ArielGlenn: "I'm not actually convinced that this is right any more after irc conversation with Ebernhardson.
Prefer to ditch dateext and just use the" [puppet] - https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688) (owner: ArielGlenn)
[21:25:46] misses the bot already
[21:29:15] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/6906/" [puppet] - https://gerrit.wikimedia.org/r/362528 (owner: Dzahn)
[21:29:30] thanks to who fixed it
[21:32:19] Operations, ops-eqiad, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396208 (RobH)
[21:35:26] Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396227 (RobH) a:RobH>chasemp
[21:36:28] Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (RobH) Assigned to @chasemp for service implementation. This task can be resolved once you are aware!
[21:40:38] (PS1) GWicke: Set up grafana dashboard monitoring for services [puppet] - https://gerrit.wikimedia.org/r/362567 (https://phabricator.wikimedia.org/T162765)
[21:45:44] (PS3) ArielGlenn: remove some unused dump command lists [puppet] - https://gerrit.wikimedia.org/r/355148
[21:46:27] RECOVERY - MegaRAID on db1052 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[21:46:38] \o/
[21:46:52] Operations, Patch-For-Review, Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3396273 (GWicke)
[21:47:58] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T169355#3396274 (Volans) Rebuild completed, RAID back to optimal. There are 2 disks with predictive failure that might fail sooner or later ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === Rai...
[21:47:59] (CR) ArielGlenn: [C: 2] remove some unused dump command lists [puppet] - https://gerrit.wikimedia.org/r/355148 (owner: ArielGlenn)
[21:48:45] (PS1) Dzahn: librenms: ensure install_dir exists, add it as required resource [puppet] - https://gerrit.wikimedia.org/r/362590
[21:48:52] Operations, Mobile-Content-Service, Parsoid, Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3396276 (bearND) @ArielGlenn thanks for the the links and the patch! I agree this might...
[21:50:49] (PS2) Dzahn: librenms: ensure install_dir exists, add it as required resource [puppet] - https://gerrit.wikimedia.org/r/362590
[21:51:41] (CR) Paladox: librenms: ensure install_dir exists, add it as required resource (2 comments) [puppet] - https://gerrit.wikimedia.org/r/362590 (owner: Dzahn)
[21:57:31] (CR) Dzahn: librenms: ensure install_dir exists, add it as required resource (2 comments) [puppet] - https://gerrit.wikimedia.org/r/362590 (owner: Dzahn)
[22:01:44] (PS2) ArielGlenn: improve names of dump command lists [puppet] - https://gerrit.wikimedia.org/r/355149
[22:02:17] (CR) ArielGlenn: [C: 2] improve names of dump command lists [puppet] - https://gerrit.wikimedia.org/r/355149 (owner: ArielGlenn)
[22:05:38] (PS1) Dzahn: librenms: add missing Apache headers module [puppet] - https://gerrit.wikimedia.org/r/362591
[22:07:36] (PS2) Dzahn: librenms: add missing Apache headers module [puppet] - https://gerrit.wikimedia.org/r/362591
[22:09:10] (PS2) ArielGlenn: cleanup the dump list commands template syntax [puppet] - https://gerrit.wikimedia.org/r/355151
[22:10:51] (CR) Dzahn: [C: 2] librenms: add missing Apache headers module [puppet] - https://gerrit.wikimedia.org/r/362591 (owner: Dzahn)
[22:12:01] Operations, ops-eqiad, DBA: Degraded RAID on db1052 -
https://phabricator.wikimedia.org/T169355#3396375 (Marostegui) Open>Resolved a:Cmjohnson Great!! Thanks! I will close this for now, and we will check if we need to buy more disks next week! Thanks a lot Chris!
[22:18:38] (PS1) MarcoAurelio: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321)
[22:18:40] (CR) ArielGlenn: [C: 2] cleanup the dump list commands template syntax [puppet] - https://gerrit.wikimedia.org/r/355151 (owner: ArielGlenn)
[22:19:44] (CR) MarcoAurelio: "Note to deployer: requires running a maintenance script after going live on-wiki to finish the task." [mediawiki-config] - https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) (owner: MarcoAurelio)
[22:22:48] (PS3) Dzahn: librenms: ensure install_dir exists [puppet] - https://gerrit.wikimedia.org/r/362590
[22:25:02] (CR) jerkins-bot: [V: -1] librenms: ensure install_dir exists [puppet] - https://gerrit.wikimedia.org/r/362590 (owner: Dzahn)
[22:25:03] (PS4) Dzahn: librenms: ensure install_dir exists [puppet] - https://gerrit.wikimedia.org/r/362590
[22:26:59] (PS5) Dzahn: librenms: ensure install_dir exists [puppet] - https://gerrit.wikimedia.org/r/362590
[22:26:59] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/6911/netmon1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/362590 (owner: Dzahn)
[22:28:07] Operations, Mail, fundraising-tech-ops, Patch-For-Review, Security: Make donate.wikimedia.org SPF more strict - https://phabricator.wikimedia.org/T167704#3396400 (Reedy)
[22:28:10] Operations, Mail, fundraising-tech-ops, Security: Make donate.wikimedia.org SPF more strict - https://phabricator.wikimedia.org/T167704#3341636 (Reedy)
[22:28:12] (PS2) ArielGlenn: ditch dateext and just use normal rotation for cirrusdump logs [puppet] -
https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688)
[22:40:10] (CR) ArielGlenn: [C: 2] ditch dateext and just use normal rotation for cirrusdump logs [puppet] - https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688) (owner: ArielGlenn)
[22:40:11] (PS4) ArielGlenn: ditch dateext and just use normal rotation for cirrusdump logs [puppet] - https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688)
[22:40:15] (CR) ArielGlenn: [C: 2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/362383 (https://phabricator.wikimedia.org/T162688) (owner: ArielGlenn)
[22:42:36] !log green unicorn now running on cubalibre (this means gunicornd used by the librenms role now works on stretch :)
[22:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:07] (PS4) Legoktm: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[22:54:27] (PS1) Dzahn: netmon1002: add librenms role [puppet] - https://gerrit.wikimedia.org/r/362595
[22:54:31] (PS3) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[22:55:15] (PS1) ArielGlenn: keep 22 cirrusdump logs [puppet] - https://gerrit.wikimedia.org/r/362597
[22:55:16] (PS4) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[22:55:18] (PS5) Legoktm: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[22:56:23] (PS2) Dzahn: netmon1002: add librenms role [puppet] - https://gerrit.wikimedia.org/r/362595
[22:56:25] (CR) jerkins-bot: [V: -1] servermon: Add gunicorn.service systemd script [puppet] -
https://gerrit.wikimedia.org/r/362455 (owner: Paladox)
[22:57:13] (CR) Legoktm: [C: 2] Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[22:57:14] (CR) ArielGlenn: [C: 2] keep 22 cirrusdump logs [puppet] - https://gerrit.wikimedia.org/r/362597 (owner: ArielGlenn)
[22:57:17] (PS5) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[22:58:20] (PS3) Dzahn: netmon1002: add librenms role [puppet] - https://gerrit.wikimedia.org/r/362595
[22:59:27] (CR) jerkins-bot: [V: -1] servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455 (owner: Paladox)
[22:59:29] (Merged) jenkins-bot: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[22:59:31] (CR) jenkins-bot: Limit thanks for new users at pl.wikipedia to 3 per day [mediawiki-config] - https://gerrit.wikimedia.org/r/362403 (https://phabricator.wikimedia.org/T169268) (owner: Urbanecm)
[23:01:19] (PS1) Dzahn: servermon: add missing package python-mysqldb [puppet] - https://gerrit.wikimedia.org/r/362598 (https://phabricator.wikimedia.org/T159756)
[23:01:21] (CR) Dzahn: [C: 2] netmon1002: add librenms role [puppet] - https://gerrit.wikimedia.org/r/362595 (owner: Dzahn)
[23:02:00] (PS6) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[23:03:23] (CR) Paladox: [C: 1] servermon: add missing package python-mysqldb [puppet] - https://gerrit.wikimedia.org/r/362598 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[23:03:35] (PS2) Dzahn: servermon: add missing package python-mysqldb [puppet] - https://gerrit.wikimedia.org/r/362598
(https://phabricator.wikimedia.org/T159756)
[23:05:18] !log librenms has been deployed on netmon1002 - works on stretch now - except Letsencrypt part, expected. not switched yet
[23:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:38] legoktm@mwdebug1002:~$ mwscript eval.php --wiki=plwiki
[23:05:38] PHP Fatal error: Class 'Memcached' not found in /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63
[23:05:39] Fatal error: Class 'Memcached' not found in /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 63
[23:05:44] wtf?
[23:06:11] Operations, Mobile-Content-Service, Parsoid, Services, and 2 others: Mobileapps swagger spec is broken (no pronounciation for `page/mobile-sections-lead` endpoints) - https://phabricator.wikimedia.org/T169299#3396503 (ArielGlenn) Happy to help, even if that was a partial patch only :-) Is there...
[23:06:54] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-librenms]
[23:08:26] (CR) Dzahn: [C: 2] netmon1002: disable Letsencrypt cert creation for migration [puppet] - https://gerrit.wikimedia.org/r/362126 (owner: Dzahn)
[23:08:26] (PS2) Dzahn: netmon1002: disable Letsencrypt cert creation for migration [puppet] - https://gerrit.wikimedia.org/r/362126
[23:08:54] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[23:11:46] mw1196.eqiad.wmnet is lagging?
[23:11:52] scap is hanging on it
[23:12:16] (CR) Dzahn: [C: 2] servermon: add missing package python-mysqldb [puppet] - https://gerrit.wikimedia.org/r/362598 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[23:12:16] 19:41 cmjohnson1: powering off mw1196 for unresponsive idrac
[23:12:33] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Limit thanks for new users at pl.wikipedia to 3 per day - T169268 (duration: 02m 49s)
[23:12:41] 23:12:33 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'mw1211.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw2117.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', 'mw1216.eqiad.wmnet'] on mw1196.eqiad.wmnet returned [255]: ssh: connect to host mw1196.eqiad.wmnet port 22: Connection timed out
[23:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:43] T169268: Limiting thanks for new users at pl.wikipedia - https://phabricator.wikimedia.org/T169268
[23:13:07] RainbowSprinkles: do we need to remove a host from the scap list? ^
[23:14:59] legoktm: An opsen needs to remove it via conftool, yeah
[23:15:13] mutante: About?
mw1196 seems unreachable
[23:16:20] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 285 bytes in 0.009 second response time
[23:16:58] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw1196.eqiad.wmnet
[23:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:08] RainbowSprinkles: ^
[23:17:33] ty
[23:18:29] in icinga it has a scheduled downtime
[23:18:39] but should have been depooled, ack
[23:19:22] thanks mutante
[23:30:41] 1196 is downed from https://phabricator.wikimedia.org/T169360
[23:31:05] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3396553 (Dzahn) We have the following blocker for servermon: Using this systemd unit file (https://gerrit.wikimedia.org/r/#/c/362455/) works but: {P5659} This error doesn't happen with python-djang...
[23:35:50] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:36:20] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077593 (Paladox) Here's the migration doc https://docs.djangoproject.com/en/1.11/ref/templates/upgrading/#the-templates-settings and release notes annoucing this change https://github.com/django/djan...
[23:38:10] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:38:20] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:38:40] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[23:39:00] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:39:10] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[23:40:13] (PS7) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455
[23:50:00] PROBLEM - HTTPS on netmon1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
[23:53:09] (Draft1) Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - https://gerrit.wikimedia.org/r/362601
[23:53:11] (PS2) Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - https://gerrit.wikimedia.org/r/362601
[23:54:00] RECOVERY - HTTPS on netmon1002 is OK: SSL OK - Certificate librenms.wikimedia.org valid until 2017-09-28 23:04:24 +0000 (expires in 89 days)
[23:56:10] (PS3) Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - https://gerrit.wikimedia.org/r/362601
[23:57:16] (CR) jerkins-bot: [V: -1] servermon: Make sure /etc/gunicorn.d/ exists [puppet] - https://gerrit.wikimedia.org/r/362601 (owner: Paladox)
[23:58:06] (PS4) Paladox: servermon: Make sure /etc/gunicorn.d/ exists [puppet] - https://gerrit.wikimedia.org/r/362601