[00:08:34] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (Dzahn) The long term fix should still be T178663
[00:11:52] (PS4) Dzahn: Gerrit: Rename error_log to gerrit.log [puppet] - https://gerrit.wikimedia.org/r/510625 (owner: Paladox)
[00:27:25] (PS5) Andrew Bogott: ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[00:28:27] (CR) Andrew Bogott: [C: +2] ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[00:52:31] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[01:16:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:18:39] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:19:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:19:33] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:20:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:21:27] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:26:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:27:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:29:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:49:27] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[03:00:34] (Abandoned) BryanDavis: ldap: disable group member list expansion on Stretch clients [puppet] - https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: BryanDavis)
[04:29:13] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 64676 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[04:49:24] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) So, I am disabling the opcache resets for now, and putting timestamp validation to...
[05:06:29] (PS1) Giuseppe Lavagetto: scap: stop resetting opcache on php7.2 [puppet] - https://gerrit.wikimedia.org/r/513549 (https://phabricator.wikimedia.org/T224491)
[05:06:31] (PS1) Giuseppe Lavagetto: php7: bump opcache, apc on canaries, mw1348 [puppet] - https://gerrit.wikimedia.org/r/513550
[05:41:31] (CR) Giuseppe Lavagetto: [C: +2] scap: stop resetting opcache on php7.2 [puppet] - https://gerrit.wikimedia.org/r/513549 (https://phabricator.wikimedia.org/T224491) (owner: Giuseppe Lavagetto)
[05:57:13] Operations, Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (ArielGlenn)
[05:57:29] Operations, Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (ArielGlenn) Open→Resolved
[06:00:16] (CR) Giuseppe Lavagetto: [C: +2] php7: bump opcache, apc on canaries, mw1348 [puppet] - https://gerrit.wikimedia.org/r/513550 (owner: Giuseppe Lavagetto)
[06:16:29] <_joe_> !log restarting php-fpm on mw1348
[06:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:23] PROBLEM - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:18:41] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:18:45] PROBLEM - php7.2-fpm service on mw1348 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:18:57] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:19:01] PROBLEM - HHVM rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:19:09] PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:20:02] <_joe_> this is me ^^ sorry, the server is depooled
[06:21:22] oh
[06:21:29] I just depooled it :p
[06:21:42] !log depool mw1348
[06:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:18] sorry joe, logging out
[06:26:06] <_joe_> I did depool it already
[06:31:09] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:31:41] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet]
[06:31:49] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81452 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:31:55] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:32:09] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown)
[06:32:37] RECOVERY - Check systemd state on mw1348 is OK: OK - running: The system is fully operational
[06:32:53] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:32:57] RECOVERY - php7.2-fpm service on mw1348 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:33:09] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81450 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:33:49] <_joe_> !log repooled mw1348
[06:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:09] Operations, ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (jcrespo) @Cmjohnson Was this done yesterday? I don't have any rush on this, but if it wasn't, I would need you to switch it on (I cannot access the management interface) s...
[06:55:18] !log upgrade and restart db2058
[06:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:13] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:47] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:13] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:50] (PS1) Jcrespo: mariadb: Depool labsdb1009 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513553
[07:04:09] (PS1) Giuseppe Lavagetto: php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554
[07:04:49] (CR) Jcrespo: [C: +2] mariadb: Depool labsdb1009 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513553 (owner: Jcrespo)
[07:08:38] (CR) Giuseppe Lavagetto: [C: +2] php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554 (owner: Giuseppe Lavagetto)
[07:08:58] (PS2) Giuseppe Lavagetto: php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554
[07:09:25] (CR) Giuseppe Lavagetto: [V: +2 C: +2] php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554 (owner: Giuseppe Lavagetto)
[07:14:23] !log depool labsdb1009 for maintenance
[07:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:55] <_joe_> !log draining mw1348 from traffic
[07:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:08] !log upgrade and restart labsdb1009
[07:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:57] <_joe_> !log repooling mw1348
[07:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:41] Operations, netops: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (elukey) Open→Resolved a:elukey
[07:37:58] (PS3) Jcrespo: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:38:45] Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (Mathew.onipe) On trying to initialize maps2001, It encountered the same disk space issues like maps2004. I hold on now on others. We will have to reimage others (maps200[1-3]) before we can proceed. So...
[07:39:44] (PS1) DCausse: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892)
[07:42:12] (CR) Florianschmidtwelzow: [phabricator] Remove extra comma from footer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/513501 (owner: Florianschmidtwelzow)
[07:42:16] (Abandoned) Florianschmidtwelzow: [phabricator] Remove extra comma from footer [puppet] - https://gerrit.wikimedia.org/r/513501 (owner: Florianschmidtwelzow)
[07:43:52] <_joe_> !log restarting php-fpm on canaries
[07:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:06] (CR) Jcrespo: [C: +2] db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:45:30] <_joe_> ouch
[07:46:00] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:46:14] <_joe_> jynus: please hold a sec
[07:46:40] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:46:53] waiting
[07:47:06] <_joe_> i just killed php7 on the mwdebugs
[07:47:15] <_joe_> I forgot those servers don't have much ram
[07:47:29] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:47:49] PROBLEM - php7.2-fpm service on mwdebug2002 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:47:55] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:47:55] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:47:57] PROBLEM - PHP7 rendering on mwdebug2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:47:57] PROBLEM - php7.2-fpm service on mwdebug1002 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:48:02] (PS1) Jcrespo: Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558
[07:48:03] PROBLEM - php7.2-fpm service on mwdebug1001 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:48:11] PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:48:17] <_joe_> no need to revert jynus
[07:48:23] <_joe_> just wait like 3 minutes
[07:48:23] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:48:27] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:37] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:59] PROBLEM - php7.2-fpm service on mwdebug2001 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:49:05] (PS1) Giuseppe Lavagetto: php7: reduce memory usage on mwdebugs [puppet] - https://gerrit.wikimedia.org/r/513560
[07:49:42] (CR) Giuseppe Lavagetto: [C: +2] php7: reduce memory usage on mwdebugs [puppet] - https://gerrit.wikimedia.org/r/513560 (owner: Giuseppe Lavagetto)
[07:51:26] (PS1) Jcrespo: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561
[07:51:29] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational
[07:51:39] (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[07:51:47] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81521 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:51:51] <_joe_> jynus: you can proceed with your work
[07:51:52] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558 (owner: Jcrespo)
[07:51:53] RECOVERY - php7.2-fpm service on mwdebug2001 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:02] (PS2) Jcrespo: Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558
[07:52:08] ok, doing in 1 minute
[07:52:09] RECOVERY - php7.2-fpm service on mwdebug2002 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:13] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational
[07:52:13] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational
[07:52:15] RECOVERY - php7.2-fpm service on mwdebug1002 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:15] RECOVERY - PHP7 rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 81449 bytes in 1.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:23] RECOVERY - php7.2-fpm service on mwdebug1001 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:31] RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 81453 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:43] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 81521 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:45] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational
[07:54:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099 with low weight (duration: 00m 49s)
[07:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:31] sync-apaches: 100% (ok: 267; fail: 0; left: 0)
[07:54:49] no connections so far
[07:54:53] <_joe_> ok let's see if within a minute the change is picked up by the server
[07:55:08] now I see them
[07:55:11] <_joe_> yep
[07:55:17] <_joe_> within a minute as expected
[07:55:22] <_joe_> ok so this at least works
[07:55:38] do you need me more for this?
[07:55:44] <_joe_> no thanks
[07:55:49] thanks to you!
[07:55:51] and your work
[07:58:16] (PS2) Jcrespo: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561
[08:12:26] (PS1) Jcrespo: labsdb: Depool labsdb1011 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513563
[08:14:27] (CR) Jcrespo: [C: +2] labsdb: Depool labsdb1011 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513563 (owner: Jcrespo)
[08:16:36] !log depool labsdb1011 for maintenance
[08:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:29] (PS1) Cparle: Add 'sms' langcode to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309)
[08:30:23] (CR) jerkins-bot: [V: -1] Add 'sms' langcode to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[08:32:47] (PS1) Jcrespo: Revert "labsdb: Depool labsdb1011 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513566
[08:33:44] !log upgrade and restart db2065
[08:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:19] (CR) Jcrespo: [C: +2] Revert "labsdb: Depool labsdb1011 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513566 (owner: Jcrespo)
[08:37:54] (CR) Cparle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[08:42:53] (PS1) Jcrespo: labsdb: Depool labsdb1010 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513567
[08:43:29] Operations, ops-eqiad, DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (jcrespo) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/513561 is pending.
[08:46:05] (PS1) ArielGlenn: remove orderrev config option, no longer needed [dumps] - https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628)
[08:49:55] (PS1) ArielGlenn: remove orderrevs param from dumps manifests [puppet] - https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628)
[09:03:33] (CR) Jcrespo: [C: +2] labsdb: Depool labsdb1010 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513567 (owner: Jcrespo)
[09:09:51] (PS1) Giuseppe Lavagetto: mw1348: increase opcache revalidation frequency [puppet] - https://gerrit.wikimedia.org/r/513571 (https://phabricator.wikimedia.org/T224491)
[09:11:16] !log stop and upgrade db2095 (s2, s4, s6, s7)
[09:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:06] (CR) Giuseppe Lavagetto: [C: +2] mw1348: increase opcache revalidation frequency [puppet] - https://gerrit.wikimedia.org/r/513571 (https://phabricator.wikimedia.org/T224491) (owner: Giuseppe Lavagetto)
[09:28:40] !log stop and upgrade db2073
[09:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:33] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:38:37] Operations, Analytics: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) I removed a rev of the analytics refinery in /srv/deployment, should be better now. If the following homes could be reduced it would be great: ` 11G piccardi 18G fsalutari 28G dsaez ` @die...
[09:38:55] Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) p:High→Normal a:elukey
[09:41:37] (CR) Elukey: "Will merge on Monday!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: CDanis)
[09:47:48] Operations, Analytics, Traffic, Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) @CDanis I had a chat with my team as promised, and we are +1 to add the new field to varnishkafka but -1 to add it all the way to the webreq...
[09:48:05] Operations, Analytics, Traffic, Patch-For-Review, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) a:Ottomata→elukey
[09:48:34] Operations, ORES, Scoring-platform-team, serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (akosiaris) We might be able to get away with reusing the redis misc servers (rdb1005/rdb1009). That should give us more memory and allow us to use the emp...
[09:51:29] Operations, ops-eqiad, Operations-Software-Development, observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (fgiunchedi) a:Cmjohnson News on this @Cmjohnson ?
[09:52:35] Operations, ops-codfw, decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui)
[09:54:54] (PS1) Marostegui: db2037: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720)
[09:58:50] (CR) Marostegui: "PCC is happy but I won't merge this on a Friday anyways: https://puppet-compiler.wmflabs.org/compiler1002/16820/" [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720) (owner: Marostegui)
[10:08:59] Operations, MediaWiki-Cache, serviceops, Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (elukey)
[10:10:31] (CR) Elukey: "Looks good, I'd also remove the host specific hiera settings :)" [puppet] - https://gerrit.wikimedia.org/r/511973 (https://phabricator.wikimedia.org/T208844) (owner: Effie Mouzeli)
[10:21:38] (PS1) Arturo Borrero Gonzalez: admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576
[10:21:49] Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) One thing that I just met is that kask stops accepting HTTP connections if kask cert/key pair is configure...
[10:29:43] (CR) Jbond: [C: +1] "LGTM confirmed on irc" [puppet] - https://gerrit.wikimedia.org/r/513576 (owner: Arturo Borrero Gonzalez)
[10:31:56] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:32:31] (PS1) Arturo Borrero Gonzalez: passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577
[10:34:01] (PS2) Arturo Borrero Gonzalez: passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577
[10:34:04] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:34:06] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/513310 (owner: Volans)
[10:34:08] (PS2) Arturo Borrero Gonzalez: admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576
[10:39:41] (CR) Arturo Borrero Gonzalez: [C: +2] admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576 (owner: Arturo Borrero Gonzalez)
[10:42:10] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577 (owner: Arturo Borrero Gonzalez)
[10:43:27] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:43:30] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey)
[10:47:17] !log merging multiple commits to labs/private.git. We now require `puppet-merge --labsprivate` and people may not be yet aware of that
[10:47:21] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) FWIW - the percentage of errors coming from php-fpm servers was including timeouts,...
[10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:10] !log depool labsdb1010 for maintenance
[10:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:52] (PS1) Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513578
[10:56:06] (CR) Jcrespo: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/513578 (owner: Jcrespo)
[11:03:55] (PS2) Jbond: firewall logging: enable firewall logging on remaining roles [puppet] - https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011)
[11:04:53] (CR) Jbond: [C: +2] firewall logging: enable firewall logging on remaining roles [puppet] - https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011) (owner: Jbond)
[11:05:27] (CR) Jcrespo: [C: +2] Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:05:46] ^ _joe_
[11:06:03] <_joe_> thanks jynus
[11:06:21] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:07:37] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:09:42] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1099 after maintenance (duration: 00m 48s)
[11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: Jhedden)
[11:14:36] (CR) Jbond: [C: -1] "This will need to be approved in the monday SRE meeting" [puppet] - https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) (owner: Jhedden)
[11:21:02] (CR) Jbond: [C: +1] "31st May LGTM" [puppet] - https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: Dzahn)
[11:23:22] (CR) Jbond: [C: +1] "LGMT" [puppet] - https://gerrit.wikimedia.org/r/511955 (owner: Dzahn)
[11:31:28] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ulogd2]
[11:46:04] !log stop and upgrade db2084
[11:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:28] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:09:56] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:10:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 285 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:16:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:18:02] (PS2) Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513578
[12:21:18] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 35, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:26:23] Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (Volans)
[12:26:32] Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (Volans) p:Triage→High
[12:30:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 145 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:32:31] Operations, Puppet: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938 (Volans) I'm for option 2 as well, 3 breaks the contract that all facts are available at first puppet run and 1 doesn't really solve much in the long run, just the cron spam. At this time...
[12:57:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:59:11] Operations, Analytics, Traffic, Patch-For-Review, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (CDanis) SGTM @elukey, thanks!
[12:59:50] (PS3) DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892)
[13:09:26] (CR) Ema: "One fix needed in the VTC test, otherwise LGTM." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: Santhosh)
[13:12:09] (PS1) Ema: varnish: install libvmod-re2 in Vagrantfile [puppet] - https://gerrit.wikimedia.org/r/513587
[13:13:26] (CR) DCausse: [C: -1] [cirrus] extension registration: don't assume default vars are set (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: DCausse)
[13:15:41] (PS2) BBlack: cache: reimage cp3049 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/513317 (https://phabricator.wikimedia.org/T222937)
[13:16:25] !log depool cp3049 for reimage - T222937
[13:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:31] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[13:16:54] (CR) BBlack: [C: +2] cache: reimage cp3049 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/513317 (https://phabricator.wikimedia.org/T222937) (owner: BBlack)
[13:19:26] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3049.esams.wmnet'] ` The log can be found i...
[13:23:20] (CR) Ema: [C: -1] varnish: ratelimit unusual image sizes (3 comments) [puppet] - https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: Jbond)
[13:35:35] (CR) BBlack: "I actually like the idea of ratelimiting both thumbs and orignals separately. I think we could take a peek at their median/max values in " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: Jbond)
[13:42:26] PROBLEM - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:42:51] bblack: ^^ that's you :) [13:45:24] (03PS3) 10CDanis: varnishkafka webrequest: log Server: in response as 'backend' [puppet] - 10https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) [13:45:33] (03CR) 10CDanis: varnishkafka webrequest: log Server: in response as 'backend' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: 10CDanis) [13:47:05] (03PS13) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) [13:47:07] (03PS1) 10Jbond: varnish: Glbal cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) [13:47:08] paravoid: ? [13:47:28] it cleared, it was alerting about cp3049's Status field [13:47:32] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:47:41] it's still in the reimage script [13:47:45] was set to Active but wasn't in PuppetDB [13:47:46] ah ok [13:47:48] is there something else manual to do? [13:47:52] nah it's ok [13:47:57] this will be fixed very soon [13:48:01] ok :) [13:48:09] by having the reimage script do status changes [13:48:26] and also preventing you to reimage something that is set to "Active" unless you force it [13:48:30] yeah I happened to be re-reading parts of the wikitech doc on lifecycle [13:48:41] and I saw some notes saying things about going into netbox to change status [13:49:17] for a reimage from active-to-active, is there a temporary state it's meant to be in to avoid alerting and/or forcing? 
[13:49:30] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:50:08] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:50:51] (or do we just always use some --force when reimaging an active server back to active again?)
[13:51:50] I dunno yet :) [13:51:54] tobefiguredout [13:52:24] (03CR) 10Jbond: "> Patch Set 12:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [13:52:39] (03PS2) 10Jbond: varnish: Global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) [13:55:49] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: use 10 seconds as revalidate frequency everywhere [puppet] - 10https://gerrit.wikimedia.org/r/513598 (https://phabricator.wikimedia.org/T224491) [13:57:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: use 10 seconds as revalidate frequency everywhere [puppet] - 10https://gerrit.wikimedia.org/r/513598 (https://phabricator.wikimedia.org/T224491) (owner: 10Giuseppe Lavagetto) [13:58:29] (03PS3) 10Jbond: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) [13:59:03] (03PS4) 10Jbond: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) [13:59:17] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Joe) [14:02:22] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Joe) I think #dc-ops should sync with you all on a schedule for this move. According to @Eeva... [14:04:06] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[14:04:24] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:04:45] ^ seen several of those lately on random hosts, I assume it's general and known ("Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle") [14:06:31] bblack: https://phabricator.wikimedia.org/T221784 [14:07:04] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:07:21] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3049.esams.wmnet'] ` and were **ALL** successful. [14:07:34] as volans mentioned on the ticket, there's also other circumstances under which the alert is raised, not only dependency cycles [14:07:39] I forgot which ones though! [14:08:43] ema: there is a patch already +1ed I sent yesterday to change the message [14:09:02] feel free to merge if you want ;) [14:10:19] !log reboot cp3049 - T222937 [14:10:23] ah, I see: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/513310/ [14:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:25] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [14:11:03] (03PS2) 10DCausse: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) [14:11:05] (03PS4) 10DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) [14:11:07] (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 
(https://phabricator.wikimedia.org/T87892) [14:11:08] (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) [14:11:10] (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) [14:11:12] (03CR) 10DCausse: [cirrus] extension registration: don't assume default vars are set (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [14:12:31] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [14:12:47] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [14:13:07] (03PS1) 10Papaul: DNS: Remove mgmt asset tag for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/513608 [14:15:02] PROBLEM - SSH on notebook1003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:17:35] PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:17:58] elukey: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 [14:18:13] elukey: notebook1003 is alerting a lot [14:18:28] (03PS1) 10Cwhite: prometheus: update aggregate metrics to new metric names [puppet] - 10https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825) [14:20:20] <_joe_> !log rolling restart of php-fpm across production to pick up the shorter revalidate 
frequency for T224491 [14:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:26] T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 [14:26:31] (03PS1) 10Cwhite: logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) [14:27:46] ema: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3039&service=Check+Varnish+expiry+mailbox+lag [14:28:22] cdanis: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw1235&service=Long+running+screen%2Ftmux [14:29:06] ooooops [14:29:10] thanks XioNoX [14:29:36] ACKNOWLEDGEMENT - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 445 MB (0% inode=99%): Mathew.onipe yes we know! even we didnt expect it again - https://phabricator.wikimedia.org/T224395 - The acknowledgement expires at: 2019-06-03 14:28:39. https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [14:29:37] * Krinkle preparing to deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/513526/ [14:30:02] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) Email Telia again with another follow up email. [14:30:11] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:30:42] (03PS3) 10Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/513578 [14:31:19] (03CR) 10Jcrespo: [C: 03+2] Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/513578 (owner: 10Jcrespo) [14:31:37] elukey: also WARN: Long running tmux process. 
(user: piccardi) on notebook1003 [14:32:08] !log powercycle notebook1003 - host stuck due to user processes, no ssh available, OOM didn't trigger [14:32:11] !log depool maps2004 (again) - T224395 [14:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:14] XioNoX: thanks for the ping [14:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:18] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [14:33:43] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [14:33:53] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [14:34:25] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [14:34:25] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:34:25] RECOVERY - DPKG on notebook1003 is OK: All packages OK [14:34:33] RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:34:41] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Fri 2019-05-31 14:34:40 UTC. 
[14:37:11] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:38:01] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[14:39:03] elukey: this one is still active https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=notebook1003&service=Long+running+screen%2Ftmux
[14:39:26] (CR) Filippo Giunchedi: [C: +1] prometheus: update aggregate metrics to new metric names [puppet] - https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[14:40:00] XioNoX: sure, I am still triaging the host, the tmux is fine if it shows up in icinga for a couple of hours, nothing on fire :)
[14:40:03] (CR) Cwhite: [C: +2] prometheus: update aggregate metrics to new metric names [puppet] - https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825) (owner: Cwhite)
[14:40:14] (PS2) Cwhite: prometheus: update aggregate metrics to new metric names [puppet] - https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825)
[14:40:36] ah, I rescheduled it and it's gone
[14:44:32] cdanis: can you ack or downtime the tmux alert if it's expected?
[14:44:39] XioNoX: I ended the session
[14:45:01] so should go away soon?
[14:45:05] ah yeah
[14:45:21] I think it's scheduled to run every new moons
[14:45:43] re-scheduling it, it should go away
[14:46:18] (PS2) Ema: varnish: install libvmod-re2 in Vagrantfile [puppet] - https://gerrit.wikimedia.org/r/513587
[14:47:20] (CR) Ema: [C: +2] varnish: install libvmod-re2 in Vagrantfile [puppet] - https://gerrit.wikimedia.org/r/513587 (owner: Ema)
[14:47:57] RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[14:52:07] !log pool cp3049 back into service - T222937
[14:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:12] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[14:52:42] Operations, ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (Papaul) Dear Customer, Our transmission 2nd line team are still investigating this. We will inform you as soon as the issue solved and sorry for any inconvenience caused. Best Regards,
[14:53:20] Operations, ops-eqiad, Cassandra, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Eevans) >>! In T222960#5226150, @Joe wrote: > I think #dc-ops should sync with you all on a s...
[14:58:43] 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) 05Open→03Resolved Server has been fully repooled by Jaime by pushing T221502#5225565 [14:59:26] !log krinkle@deploy1001: git status in php-1.34-wmf.7/ is dirty (extensions/ORES) [14:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] !log krinkle@deploy1001: pulling down 6f91b41 for php-1.34-wmf.7/extensions/ORES (without deploy), commit seems test-only [15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:03:02] 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar), 10User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (10Marostegui) [15:03:33] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 81395 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:05:13] !log cp3039: restart varnish-be for mbox lag (likely induced by 3049's depool for ATS conversion!) 
[15:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:50] (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892) [15:07:52] (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) [15:07:54] (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) [15:09:16] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5225728, @akosiaris wrote: > One thing that I just met is that kask stops accepting HTTP conne... [15:09:18] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [15:17:15] (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) [15:18:03] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10fgiunchedi) >>! In T220401#5226287, @Eevans wrote: >> 1. We use an Exec probe that executes something like `curl http... 
[15:18:23] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [15:31:22] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [15:31:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10jcrespo) However is in charge of this, T224597 should be done before or around the same time. [15:31:31] (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892) [15:31:33] (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) [15:31:35] (03PS4) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) [15:33:17] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman [15:34:33] PROBLEM - HHVM rendering on mw1315 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:35:57] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 81408 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:36:07] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. 
https://wikitech.wikimedia.org/wiki/Mailman [15:36:23] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5226287, @Eevans wrote: >>>! In T220401#5225728, @akosiaris wrote: >> One thing that I just me... [15:37:29] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) >>! In T220401#5226287, @Eevans wrote: >>>! In T220401#5225728, @akosiaris wrote: >> One thing that I just... [15:40:23] PROBLEM - puppet last run on db1101 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [15:43:30] 10Operations, 10Traffic: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis) [15:45:35] 10Operations, 10Traffic: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis) p:05Triage→03Normal [16:04:43] PROBLEM - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [16:10:09] hey it works [16:11:43] neat! 
[16:12:47] RECOVERY - puppet last run on db1101 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:16:17] (03CR) 10CDanis: [C: 03+1] beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:18:04] (03PS1) 10Alexandros Kosiaris: prometheus: Support TLS enabled pods [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401) [16:18:20] (03CR) 10CDanis: "Was the container replicator causing undue load on beta? IIRC it just replicates metadata which should be small and fast" [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:18:32] (03PS1) 10Alexandros Kosiaris: Fix typo in initialize_service.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/513634 [16:18:34] (03PS1) 10Alexandros Kosiaris: kask: Fix TLS certs checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/513635 [16:18:36] (03PS1) 10Alexandros Kosiaris: kask: prometheus scraping over HTTPS if TLS enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/513636 (https://phabricator.wikimedia.org/T220401) [16:18:38] (03PS1) 10Alexandros Kosiaris: Bump kask version to 0.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513637 [16:18:51] (03PS2) 10ArielGlenn: remove orderrev config option, no longer needed [dumps] - 10https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628) [16:19:05] (03CR) 10CDanis: [C: 03+1] swift: hiera-ize object server number of workers [puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:19:36] (03CR) 10CDanis: [C: 03+1] beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:19:59] (03CR) 10ArielGlenn: [C: 03+2] remove orderrev config option, no longer needed 
[dumps] - 10https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628) (owner: 10ArielGlenn) [16:20:42] !log ariel@deploy1001 Started deploy [dumps/dumps@fd6100a]: remove orderrevs config option, unneeded now [16:20:45] !log ariel@deploy1001 Finished deploy [dumps/dumps@fd6100a]: remove orderrevs config option, unneeded now (duration: 00m 03s) [16:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [16:22:50] (03CR) 10CDanis: [C: 03+1] swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:23:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [16:25:22] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) > And this uncovered now that prometheus can't talk to it (cause it expects HTTP I guess?). /me looking in... 
[16:25:35] (03PS2) 10ArielGlenn: remove orderrevs param from dumps manifests [puppet] - 10https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628) [16:25:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix typo in initialize_service.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/513634 (owner: 10Alexandros Kosiaris) [16:26:08] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally and in staging, worked fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/513635 (owner: 10Alexandros Kosiaris) [16:26:28] (03CR) 10ArielGlenn: [C: 03+2] remove orderrevs param from dumps manifests [puppet] - 10https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628) (owner: 10ArielGlenn) [16:26:30] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally and in staging, worked fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/513636 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [16:26:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump kask version to 0.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513637 (owner: 10Alexandros Kosiaris) [16:28:26] (03PS2) 10Alexandros Kosiaris: Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401) [16:30:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) I did some testing of various software and hardware raid configurations and wrote up a summary at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-t... [16:38:28] RECOVERY - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [16:45:53] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10akosiaris) I think so, let's wait for @fsero though [16:50:21] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) One minor question. Given per T220401#5128786 1 kask instance is able to handle ~300req/s, how many instan... [16:52:29] (03CR) 10Herron: [C: 03+1] "Looks good! FWIW commit message seems to have some remnants of an anchor tag" [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite) [16:53:45] (03Restored) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [16:55:44] (03CR) 10Herron: [C: 03+1] "> Looks good! FWIW commit message seems to have some remnants of an" [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite) [16:57:35] 10Operations, 10LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (10ayounsi) 05Open→03Resolved User `greta` removed from the `wmde` LDAP group. 
[16:58:39] (03PS4) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:01:04] (03PS5) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:04:41] (03PS1) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) [17:06:04] (03PS1) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642 [17:06:43] (03PS2) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642 [17:06:51] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox) [17:07:47] (03PS6) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:08:11] (03PS3) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642 [17:08:17] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox) [17:10:16] (03PS4) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642 [17:10:18] (03PS7) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:10:23] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox) [17:10:34] (03CR) 10Arturo Borrero Gonzalez: "LGTM, some comments inline. 
Before merging, please give it a try in the puppet catalog compiler (PCC) here:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: 10Jhedden) [17:15:55] (03PS5) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642 [17:16:01] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox) [17:16:47] (03PS8) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:16:59] 10Operations, 10SRE-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Rmaung) [17:18:05] (03PS2) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) [17:24:26] (03PS9) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:24:31] (03PS1) 10Mholloway: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) [17:25:01] (03CR) 10Mholloway: [C: 04-2] "Hold until Mon 6/3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [17:26:17] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5226531, @akosiaris wrote: > One minor question. Given per T220401#5128786 1 kask instance is... 
[17:26:26] (03CR) 10Cwhite: [C: 03+2] logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite) [17:26:35] (03PS2) 10Cwhite: logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) [17:27:06] (03PS10) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:29:01] !log added jeh to the 'ops' group in ldap [17:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:12] (03PS11) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:33:00] (03CR) 10Elukey: "Sorry for the spam, this finally changes only the canaries leaving the rest untouched: https://puppet-compiler.wmflabs.org/compiler1002/16" [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [17:39:42] (03CR) 10Ori.livneh: [C: 03+1] phabricator: add forensic apache logging and enable on phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/511955 (owner: 10Dzahn) [17:41:07] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10jcrespo) [17:42:15] (03PS3) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) [17:44:18] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10DMccurdy) I am @Rmaung 's manager at WMF and I approve her request for ldap access. Thanks! 
[17:47:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) >>! In T207707#5223912, @Cmjohnson wrote: > @greg @robh I am just plugging these disks into the server... [17:52:37] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10jcrespo) Thanks, @DMccurdy! @Rmaung have you already created a "Developer account" (Wikitech wiki account)? If yes, please provide its name; if not, you can do it now at: https://wik... [17:53:04] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Aklapper) The task summary says `ldap/wmde`. Did you mean `ldap/wmf`? [17:53:05] (03PS4) 10Andrew Bogott: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: 10Jhedden) [17:53:32] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) @Papaul could you have a look at kafka-main2002? It seems to be stuck, at least I'm not able to open a console or power cycle. ` /admin1-> racadm serveraction powercyc... [17:54:03] (03CR) 10Andrew Bogott: [C: 03+2] "This shows as substituting True for True in the puppet compiler. 
/probably/ that's because it's replacing the string 'True' with the bool" [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: 10Jhedden) [17:56:29] (03PS5) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) [18:01:15] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Rmaung) @Aklapper yes, I meant WMF! Sorry about that. I have created a developer account w/Wikitech, and it is the same username, Rmaung. Let me know if there's anything else I need... [18:07:46] (03PS1) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 [18:08:42] (03CR) 10jerkins-bot: [V: 04-1] Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (owner: 10Ayounsi) [18:08:50] (03CR) 10Bstorm: [C: 03+2] cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:11:07] (03PS2) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) [18:13:40] (03CR) 10ArielGlenn: [C: 03+1] "Thanks for this, I must have forgotten to add her here." 
[puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) (owner: 10Ayounsi) [18:14:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10User-greg: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Cmjohnson) a:03greg @greg the disks have been added and assigned to you [18:16:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) a:05greg→03None [18:17:21] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10greg) [18:17:24] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) 05Stalled→03Open [18:18:27] (03CR) 10Ayounsi: [C: 03+2] admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [18:18:36] (03PS4) 10Ayounsi: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [18:20:07] (03CR) 10Ayounsi: [C: 03+2] Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) (owner: 10Ayounsi) [18:20:15] (03PS3) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) [18:24:45] PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:25:22] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (10ayounsi) 05Open→03Resolved a:03ayounsi Everything should be set, please reopen the task if any issue. [18:25:38] (03PS5) 10Ayounsi: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [18:25:54] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10reosarevok) We're waiting until we have control over the domain so that we can release our updated website. Is there any particular reason that makes just tran... [18:30:11] (03PS1) 10Bstorm: cloudstore: scratch is a rw mount [puppet] - 10https://gerrit.wikimedia.org/r/513666 (https://phabricator.wikimedia.org/T209527) [18:31:22] (03CR) 10Bstorm: [C: 03+2] cloudstore: scratch is a rw mount [puppet] - 10https://gerrit.wikimedia.org/r/513666 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:40:59] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (10Cmjohnson) [18:41:03] 10Operations, 10observability, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121 (10Cmjohnson) [18:41:06] 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Cmjohnson) 05Open→03Resolved @jcrespo the server is back on and I am able to reach the mgmt interface. 
[18:42:17] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) [18:42:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [18:43:38] 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Marostegui) Confirmed, I can access mgmt interface as well. I am going to enable puppet and start MySQL. [18:44:07] !log Start MySQL on es1019 - T213422 [18:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:12] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 [18:47:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) Holding this back until Monday in case of any data concerns, but we are now pretty much unblocked here. [18:47:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) p:05Low→03High a:03Bstorm [18:59:12] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10HMarcus) 05Resolved→03Open Hi all, Apologies, but the work for this is not quite finished. After installing the cards, we realized that the current SAS connector cable is not the... 
[19:18:15] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422) [19:22:01] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Aklapper) [19:22:09] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Aklapper) [19:31:06] (03PS1) 10Bstorm: cloudstore: disable alerts for labstore1003 and pass to cloudstores [puppet] - 10https://gerrit.wikimedia.org/r/513678 (https://phabricator.wikimedia.org/T187456) [19:31:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) 05Stalled→03Open [19:32:25] (03CR) 10Bstorm: [C: 03+2] cloudstore: disable alerts for labstore1003 and pass to cloudstores [puppet] - 10https://gerrit.wikimedia.org/r/513678 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [19:39:45] (03PS1) 10Bstorm: cloudstore: enable more monitors on cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/513681 [19:44:48] (03PS1) 10Paladox: Gerrit: Double web session cache to 512 [puppet] - 10https://gerrit.wikimedia.org/r/513682 [19:45:27] (03PS2) 10Paladox: Gerrit: Double web session cache to 512 [puppet] - 10https://gerrit.wikimedia.org/r/513682 [19:52:15] (03PS3) 10Paladox: Gerrit: Double web session cache memory to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/513682 [19:58:43] (03PS4) 10Paladox: Gerrit: Double web session cache memory to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/513682 [19:59:29] (03PS1) 10BBlack: cp3037: remove from cache config [puppet] - 10https://gerrit.wikimedia.org/r/513685 (https://phabricator.wikimedia.org/T222041) [19:59:31] (03PS1) 10BBlack: cache: reimage cp3034 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513686 
(https://phabricator.wikimedia.org/T222937) [20:03:28] (03CR) 10BBlack: [C: 03+2] cp3037: remove from cache config [puppet] - 10https://gerrit.wikimedia.org/r/513685 (https://phabricator.wikimedia.org/T222041) (owner: 10BBlack) [20:04:24] !log cp3034: depool for reimage - T222937 [20:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:30] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [20:04:41] (03CR) 10BBlack: [C: 03+2] cache: reimage cp3034 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513686 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack) [20:06:04] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:06:13] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3034.esams.wmnet'] ` The log can be found i... 
[20:08:06] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:08:40] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:11:58] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:13:26] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:18:17] (03PS1) 10Bstorm: labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) [20:18:25] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:20:05] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:20:15] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:21:19] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:22:51] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:23:17] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:23:32] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/16831/labstore1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [20:23:49] (03PS2) 10Bstorm: labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) [20:24:03] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 36 ESP OK 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:24:17] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:24:47] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:25:27] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:25:36] (03CR) 10Bstorm: [C: 03+2] labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [20:26:40] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:34:40] (03CR) 10Hashar: [C: 03+1] "Looks good now :) Danke!" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [20:35:44] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:36:07] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) >>! In T221507#5219523, @faidon wrote: > - On the device types errors, I can't help but think that we're looking at th... [20:36:18] (03PS1) 10CDanis: conftool: use non-default ports for integration test etcd [software/conftool] - 10https://gerrit.wikimedia.org/r/513694 [20:37:12] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:37:14] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:44:48] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [20:52:03] (03CR) 10EBernhardson: [C: 03+1] "afaict this should all work." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [20:52:09] (03CR) 10EBernhardson: [C: 03+1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [20:52:17] (03CR) 10EBernhardson: [C: 03+1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [20:54:35] (03CR) 10EBernhardson: [C: 03+1] [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [20:54:54] (03CR) 10EBernhardson: [C: 03+1] [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [20:57:01] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3034.esams.wmnet'] ` and were **ALL** successful. [20:59:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10Bstorm) 05Open→03Declined T187456 - No point in fixing at this stage. [20:59:37] (03CR) 10Ayounsi: Bird anycast: add anycast_healthchecker (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [21:01:16] 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) @zhuyifei1999 Isn't this currently being done/this ticket can go away? 
If not, let's ch... [21:02:45] (03PS26) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [21:08:12] PROBLEM - grafana-labs.wikimedia.org on labmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [21:08:43] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [21:08:56] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black Non-redundant power, like many esams hosts. These are all due for replacement Soon... [21:09:04] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) [21:10:06] !log cp3034: repool - T222937 [21:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:11] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [21:10:53] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [21:13:17] 10Operations, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10Bstorm) [21:15:13] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) [21:15:59] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) This is pretty old. We'll have to reboot it again to know if this is still happening. 
I suspect it actually isn't. [21:25:57] RECOVERY - grafana-labs.wikimedia.org on labmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9002 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [21:40:54] (03PS3) 10Aaron Schulz: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 [21:42:34] (03CR) 10Aaron Schulz: [C: 03+2] Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz) [21:43:24] (03Merged) 10jenkins-bot: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz) [21:43:39] (03CR) 10jenkins-bot: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz) [21:46:19] !log aaron@deploy1001 Synchronized wmf-config/db-codfw.php: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs (duration: 00m 50s) [21:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:49] !log aaron@deploy1001 Synchronized wmf-config/db-eqiad.php: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs (duration: 00m 47s) [21:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:32] (03PS1) 10Bstorm: labstore: remove unused hiera yaml [puppet] - 10https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456) [22:09:42] (03CR) 10Bstorm: "Using the regex labstore.*: https://puppet-compiler.wmflabs.org/compiler1002/16833/" [puppet] - 10https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [22:24:23] 10Operations, 10Mail, 10Phabricator: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (10greg) [22:28:38] 
10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10zhuyifei1999) I don't mind which server it is mounting as long as it matches the current scratc... [22:31:38] 10Operations, 10Mail, 10Phabricator: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (10mmodell) from /var/log/exim4/mainlog: `2019-05-31 22:30:20 Warning: No server certificate defined; will use a selfsigned one.` [23:06:33] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [23:33:37] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures