[00:08:34] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (Dzahn) The long term fix should still be T178663
[00:11:52] <wikibugs>	 (PS4) Dzahn: Gerrit: Rename error_log to gerrit.log [puppet] - https://gerrit.wikimedia.org/r/510625 (owner: Paladox)
[00:27:25] <wikibugs>	 (PS5) Andrew Bogott: ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[00:28:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] ldap-admins: add foks, add admin group on labweb hosts [puppet] - https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: Dzahn)
[00:52:31] <icinga-wm>	 PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[01:16:57] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:18:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:19:27] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:19:33] <icinga-wm>	 RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:20:33] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:21:27] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:26:51] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:27:37] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:29:19] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:49:27] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[03:00:34] <wikibugs>	 (Abandoned) BryanDavis: ldap: disable group member list expansion on Stretch clients [puppet] - https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: BryanDavis)
[04:29:13] <icinga-wm>	 PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 64676 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[04:49:24] <wikibugs>	 Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) So, I am disabling the opcache resets for now, and putting timestamp validation to...
[05:06:29] <wikibugs>	 (PS1) Giuseppe Lavagetto: scap: stop resetting opcache on php7.2 [puppet] - https://gerrit.wikimedia.org/r/513549 (https://phabricator.wikimedia.org/T224491)
[05:06:31] <wikibugs>	 (PS1) Giuseppe Lavagetto: php7: bump opcache, apc on canaries, mw1348 [puppet] - https://gerrit.wikimedia.org/r/513550
[05:41:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap: stop resetting opcache on php7.2 [puppet] - https://gerrit.wikimedia.org/r/513549 (https://phabricator.wikimedia.org/T224491) (owner: Giuseppe Lavagetto)
[05:57:13] <wikibugs>	 Operations, Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (ArielGlenn)
[05:57:29] <wikibugs>	 Operations, Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (ArielGlenn) Open→Resolved
[06:00:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] php7: bump opcache, apc on canaries, mw1348 [puppet] - https://gerrit.wikimedia.org/r/513550 (owner: Giuseppe Lavagetto)
[06:16:29] <_joe_>	 !log restarting php-fpm on mw1348
[06:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:23] <icinga-wm>	 PROBLEM - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:18:41] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:18:45] <icinga-wm>	 PROBLEM - php7.2-fpm service on mw1348 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:18:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:19:01] <icinga-wm>	 PROBLEM - HHVM rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:19:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:20:02] <_joe_>	 this is me ^^ sorry, the server is depooled
[06:21:22] <jijiki>	 oh 
[06:21:29] <jijiki>	 I just depooled it :p
[06:21:42] <jijiki>	 !log depool mw1348
[06:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:18] <jijiki>	 sorry joe, logging out
[06:26:06] <_joe_>	 I did depool it already
[06:31:09] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:31:41] <icinga-wm>	 PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet]
[06:31:49] <icinga-wm>	 RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81452 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:31:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:32:09] <icinga-wm>	 PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown)
[06:32:37] <icinga-wm>	 RECOVERY - Check systemd state on mw1348 is OK: OK - running: The system is fully operational
[06:32:53] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:32:57] <icinga-wm>	 RECOVERY - php7.2-fpm service on mw1348 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:33:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81450 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:33:49] <_joe_>	 !log repooled mw1348
[06:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:09] <wikibugs>	 Operations, ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (jcrespo) @Cmjohnson Was this done yesterday? I don't have any rush on this, but if it wasn't, I would need you to switch it on (I cannot access the management interface) s...
[06:55:18] <jynus>	 !log upgrade and restart db2058
[06:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:13] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:47] <icinga-wm>	 RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:13] <icinga-wm>	 RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:03:50] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool labsdb1009 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513553
[07:04:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554
[07:04:49] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Depool labsdb1009 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513553 (owner: Jcrespo)
[07:08:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554 (owner: Giuseppe Lavagetto)
[07:08:58] <wikibugs>	 (PS2) Giuseppe Lavagetto: php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554
[07:09:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] php7: reduce interned strings memory to 96 MB [puppet] - https://gerrit.wikimedia.org/r/513554 (owner: Giuseppe Lavagetto)
[07:14:23] <jynus>	 !log depool labsdb1009 for maintenance
[07:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:55] <_joe_>	 !log draining mw1348 from traffic
[07:15:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:08] <jynus>	 !log upgrade and restart labsdb1009
[07:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:57] <_joe_>	 !log repooling mw1348
[07:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:41] <wikibugs>	 Operations, netops: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (elukey) Open→Resolved a:elukey
[07:37:58] <wikibugs>	 (PS3) Jcrespo: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:38:45] <wikibugs>	 Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (Mathew.onipe) On trying to initialize maps2001, It encountered the same disk space issues like maps2004. I hold on now on others. We will have to reimage others (maps200[1-3]) before we can proceed. So...
[07:39:44] <wikibugs>	 (PS1) DCausse: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892)
[07:42:12] <wikibugs>	 (CR) Florianschmidtwelzow: [phabricator] Remove extra comma from footer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/513501 (owner: Florianschmidtwelzow)
[07:42:16] <wikibugs>	 (Abandoned) Florianschmidtwelzow: [phabricator] Remove extra comma from footer [puppet] - https://gerrit.wikimedia.org/r/513501 (owner: Florianschmidtwelzow)
[07:43:52] <_joe_>	 !log restarting php-fpm on canaries
[07:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:06] <wikibugs>	 (CR) Jcrespo: [C: +2] db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:45:30] <_joe_>	 ouch
[07:46:00] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:46:14] <_joe_>	 jynus: please hold a sec
[07:46:40] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: Marostegui)
[07:46:53] <jynus>	 waiting
[07:47:06] <_joe_>	 i just killed php7 on the mwdebugs
[07:47:15] <_joe_>	 I forgot those servers don't have much ram
[07:47:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:47:49] <icinga-wm>	 PROBLEM - php7.2-fpm service on mwdebug2002 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:47:55] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:47:55] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:47:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:47:57] <icinga-wm>	 PROBLEM - php7.2-fpm service on mwdebug1002 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:48:02] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558
[07:48:03] <icinga-wm>	 PROBLEM - php7.2-fpm service on mwdebug1001 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:48:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:48:17] <_joe_>	 no need to revert jynus
[07:48:23] <_joe_>	 just wait like 3 minutes
[07:48:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:48:27] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:37] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:59] <icinga-wm>	 PROBLEM - php7.2-fpm service on mwdebug2001 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:49:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: php7: reduce memory usage on mwdebugs [puppet] - https://gerrit.wikimedia.org/r/513560
[07:49:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] php7: reduce memory usage on mwdebugs [puppet] - https://gerrit.wikimedia.org/r/513560 (owner: Giuseppe Lavagetto)
[07:51:26] <wikibugs>	 (PS1) Jcrespo: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561
[07:51:29] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational
[07:51:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[07:51:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 81521 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:51:51] <_joe_>	 jynus: you can proceed with your work
[07:51:52] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558 (owner: Jcrespo)
[07:51:53] <icinga-wm>	 RECOVERY - php7.2-fpm service on mwdebug2001 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:02] <wikibugs>	 (PS2) Jcrespo: Revert "mariadb: Depool labsdb1009 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513558
[07:52:08] <jynus>	 ok, doing in 1 minute
[07:52:09] <icinga-wm>	 RECOVERY - php7.2-fpm service on mwdebug2002 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:13] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational
[07:52:13] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational
[07:52:15] <icinga-wm>	 RECOVERY - php7.2-fpm service on mwdebug1002 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:15] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 81449 bytes in 1.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:23] <icinga-wm>	 RECOVERY - php7.2-fpm service on mwdebug1001 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:52:31] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 81453 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 81521 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:52:45] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational
[07:54:19] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099 with low weight (duration: 00m 49s)
[07:54:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:31] <jynus>	 sync-apaches: 100% (ok: 267; fail: 0; left: 0)
[07:54:49] <jynus>	 no connections so far
[07:54:53] <_joe_>	 ok let's see if within a minute the change is picked up by the server
[07:55:08] <jynus>	 now I see them
[07:55:11] <_joe_>	 yep
[07:55:17] <_joe_>	 within a minute as expected
[07:55:22] <_joe_>	 ok so this at least works
[07:55:38] <jynus>	 do you need me more for this?
[07:55:44] <_joe_>	 no thanks
[07:55:49] <jynus>	 thanks to you!
[07:55:51] <jynus>	 and your work
[07:58:16] <wikibugs>	 (PS2) Jcrespo: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561
[08:12:26] <wikibugs>	 (PS1) Jcrespo: labsdb: Depool labsdb1011 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513563
[08:14:27] <wikibugs>	 (CR) Jcrespo: [C: +2] labsdb: Depool labsdb1011 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513563 (owner: Jcrespo)
[08:16:36] <jynus>	 !log depool labsdb1011 for maintenance
[08:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:29] <wikibugs>	 (PS1) Cparle: Add 'sms' langcode to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309)
[08:30:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 'sms' langcode to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[08:32:47] <wikibugs>	 (PS1) Jcrespo: Revert "labsdb: Depool labsdb1011 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513566
[08:33:44] <jynus>	 !log upgrade and restart db2065
[08:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:19] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "labsdb: Depool labsdb1011 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513566 (owner: Jcrespo)
[08:37:54] <wikibugs>	 (CR) Cparle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[08:42:53] <wikibugs>	 (PS1) Jcrespo: labsdb: Depool labsdb1010 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513567
[08:43:29] <wikibugs>	 Operations, ops-eqiad, DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (jcrespo) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/513561 is pending.
[08:46:05] <wikibugs>	 (PS1) ArielGlenn: remove orderrev config option, no longer needed [dumps] - https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628)
[08:49:55] <wikibugs>	 (PS1) ArielGlenn: remove orderrevs param from dumps manifests [puppet] - https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628)
[09:03:33] <wikibugs>	 (CR) Jcrespo: [C: +2] labsdb: Depool labsdb1010 for maintenance [puppet] - https://gerrit.wikimedia.org/r/513567 (owner: Jcrespo)
[09:09:51] <wikibugs>	 (PS1) Giuseppe Lavagetto: mw1348: increase opcache revalidation frequency [puppet] - https://gerrit.wikimedia.org/r/513571 (https://phabricator.wikimedia.org/T224491)
[09:11:16] <jynus>	 !log stop and upgrade db2095 (s2, s4, s6, s7)
[09:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mw1348: increase opcache revalidation frequency [puppet] - https://gerrit.wikimedia.org/r/513571 (https://phabricator.wikimedia.org/T224491) (owner: Giuseppe Lavagetto)
[09:28:40] <jynus>	 !log stop and upgrade db2073
[09:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:33] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:38:37] <wikibugs>	 Operations, Analytics: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) I removed a rev of the analytics refinery in /srv/deployment, should be better now.  If the following homes could be reduced it would be great:  ` 11G piccardi 18G fsalutari 28G dsaez `  @die...
[09:38:55] <wikibugs>	 Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (elukey) p:High→Normal a:elukey
[09:41:37] <wikibugs>	 (CR) Elukey: "Will merge on Monday!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: CDanis)
[09:47:48] <wikibugs>	 Operations, Analytics, Traffic, Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) @CDanis I had a chat with my team as promised, and we are +1 to add the new field to varnishkafka but -1 to add it all the way to the webreq...
[09:48:05] <wikibugs>	 Operations, Analytics, Traffic, Patch-For-Review, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) a:Ottomata→elukey
[09:48:34] <wikibugs>	 Operations, ORES, Scoring-platform-team, serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (akosiaris) We might be able to get away with reusing the redis misc servers (rdb1005/rdb1009). That should give us more memory and allow us to use the emp...
[09:51:29] <wikibugs>	 Operations, ops-eqiad, Operations-Software-Development, observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (fgiunchedi) a:Cmjohnson News on this @Cmjohnson ?
[09:52:35] <wikibugs>	 Operations, ops-codfw, decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui)
[09:54:54] <wikibugs>	 (PS1) Marostegui: db2037: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720)
[09:58:50] <wikibugs>	 (CR) Marostegui: "PCC is happy but I won't merge this on a Friday anyways: https://puppet-compiler.wmflabs.org/compiler1002/16820/" [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720) (owner: Marostegui)
[10:08:59] <wikibugs>	 Operations, MediaWiki-Cache, serviceops, Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (elukey)
[10:10:31] <wikibugs>	 (CR) Elukey: "Looks good, I'd also remove the host specific hiera settings :)" [puppet] - https://gerrit.wikimedia.org/r/511973 (https://phabricator.wikimedia.org/T208844) (owner: Effie Mouzeli)
[10:21:38] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576
[10:21:49] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) One thing that I just met is that kask stops accepting HTTP connections if kask cert/key pair is configure...
[10:29:43] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM confirmed on irc" [puppet] - https://gerrit.wikimedia.org/r/513576 (owner: Arturo Borrero Gonzalez)
[10:31:56] <wikibugs>	 Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:32:31] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577
[10:34:01] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577
[10:34:04] <wikibugs>	 Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:34:06] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/513310 (owner: Volans)
[10:34:08] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576
[10:39:41] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] admin: rotate arturo's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/513576 (owner: Arturo Borrero Gonzalez)
[10:42:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] passwords: rotate arturo's ssh key for cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/513577 (owner: Arturo Borrero Gonzalez)
[10:43:27] <wikibugs>	 Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey)
[10:43:30] <wikibugs>	 Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey)
[10:47:17] <arturo>	 !log merging multiple commits to labs/private.git. We now require `puppet-merge --labsprivate` and people may not be yet aware of that
[10:47:21] <wikibugs>	 Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) FWIW - the percentage of errors coming from php-fpm servers was including timeouts,...
[10:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:10] <jynus>	 !log depool labsdb1010 for maintenance
[10:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:52] <wikibugs>	 (PS1) Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - https://gerrit.wikimedia.org/r/513578
[10:56:06] <wikibugs>	 (CR) Jcrespo: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/513578 (owner: Jcrespo)
[11:03:55] <wikibugs>	 (PS2) Jbond: firewall logging: enable firewall logging on remaining roles [puppet] - https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011)
[11:04:53] <wikibugs>	 (CR) Jbond: [C: +2] firewall logging: enable firewall logging on remaining roles [puppet] - https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011) (owner: Jbond)
[11:05:27] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:05:46] <jynus>	 ^ _joe_
[11:06:03] <_joe_>	 thanks jynus
[11:06:21] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:07:37] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099" [mediawiki-config] - https://gerrit.wikimedia.org/r/513561 (owner: Jcrespo)
[11:09:42] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1099 after maintenance (duration: 00m 48s)
[11:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden)
[11:14:36] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "This will need to be approved in the monday SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden)
[11:21:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "31st May LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn)
[11:23:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/511955 (owner: 10Dzahn)
[11:31:28] <icinga-wm>	 PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ulogd2]
[11:46:04] <jynus>	 !log stop and upgrade db2084
[11:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:28] <icinga-wm>	 RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:09:56] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:10:50] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 285 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:16:14] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:18:02] <wikibugs>	 (03PS2) 10Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/513578
[12:21:18] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 35, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:26:23] <wikibugs>	 10Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (10Volans)
[12:26:32] <wikibugs>	 10Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (10Volans) p:05Triage→03High
[12:30:20] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 145 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:32:31] <wikibugs>	 10Operations, 10Puppet: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938 (10Volans) I'm for option 2 as well, 3 breaks the contract that all facts are available at first puppet run and 1 doesn't really solve much in the long run, just the cron spam.  At this time...
[12:57:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 428 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[12:59:11] <wikibugs>	 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis) SGTM @elukey, thanks!
[12:59:50] <wikibugs>	 (03PS3) 10DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892)
[13:09:26] <wikibugs>	 (03CR) 10Ema: "One fix needed in the VTC test, otherwise LGTM." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh)
[13:12:09] <wikibugs>	 (03PS1) 10Ema: varnish: install libvmod-re2 in Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/513587
[13:13:26] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] [cirrus] extension registration: don't assume default vars are set (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[13:15:41] <wikibugs>	 (03PS2) 10BBlack: cache: reimage cp3049 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513317 (https://phabricator.wikimedia.org/T222937)
[13:16:25] <bblack>	 !log depool cp3049 for reimage - T222937
[13:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:31] <stashbot>	 T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[13:16:54] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cache: reimage cp3049 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513317 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack)
[13:19:26] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3049.esams.wmnet'] ` The log can be found i...
[13:23:20] <wikibugs>	 (03CR) 10Ema: [C: 04-1] varnish: ratelimit unusual image sizes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond)
[13:35:35] <wikibugs>	 (03CR) 10BBlack: "I actually like the idea of ratelimiting both thumbs and originals separately.  I think we could take a peek at their median/max values in " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond)
[13:42:26] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[13:42:51] <paravoid>	 bblack: ^^ that's you :)
[13:45:24] <wikibugs>	 (03PS3) 10CDanis: varnishkafka webrequest: log Server: in response as 'backend' [puppet] - 10https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236)
[13:45:33] <wikibugs>	 (03CR) 10CDanis: varnishkafka webrequest: log Server: in response as 'backend' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: 10CDanis)
[13:47:05] <wikibugs>	 (03PS13) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434)
[13:47:07] <wikibugs>	 (03PS1) 10Jbond: varnish: Global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434)
[13:47:08] <bblack>	 paravoid: ?
[13:47:28] <paravoid>	 it cleared, it was alerting about cp3049's Status field
[13:47:32] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[13:47:41] <bblack>	 it's still in the reimage script
[13:47:45] <paravoid>	 was set to Active but wasn't in PuppetDB
[13:47:46] <paravoid>	 ah ok
[13:47:48] <bblack>	 is there something else manual to do?
[13:47:52] <paravoid>	 nah it's ok
[13:47:57] <paravoid>	 this will be fixed very soon
[13:48:01] <bblack>	 ok :)
[13:48:09] <paravoid>	 by having the reimage script do status changes
[13:48:26] <paravoid>	 and also preventing you from reimaging something that is set to "Active" unless you force it
[13:48:30] <bblack>	 yeah I happened to be re-reading parts of the wikitech doc on lifecycle
[13:48:41] <bblack>	 and I saw some notes saying things about going into netbox to change status
[13:49:17] <bblack>	 for a reimage from active-to-active, is there a temporary state it's meant to be in to avoid alerting and/or forcing?
[13:49:30] <icinga-wm>	 PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] <icinga-wm>	 PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] <icinga-wm>	 PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] <icinga-wm>	 PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[13:49:52] <icinga-wm>	 PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:50:08] <icinga-wm>	 PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:50:51] <bblack>	 (or do we just always use some --force when reimaging an active server back to active again?)
[13:51:50] <paravoid>	 I dunno yet :)
[13:51:54] <paravoid>	 tobefiguredout
[13:52:24] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 12:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond)
[13:52:39] <wikibugs>	 (03PS2) 10Jbond: varnish: Global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434)
[13:55:49] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::php: use 10 seconds as revalidate frequency everywhere [puppet] - 10https://gerrit.wikimedia.org/r/513598 (https://phabricator.wikimedia.org/T224491)
[13:57:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: use 10 seconds as revalidate frequency everywhere [puppet] - 10https://gerrit.wikimedia.org/r/513598 (https://phabricator.wikimedia.org/T224491) (owner: 10Giuseppe Lavagetto)
[13:58:29] <wikibugs>	 (03PS3) 10Jbond: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434)
[13:59:03] <wikibugs>	 (03PS4) 10Jbond: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434)
[13:59:17] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Joe)
[14:02:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Joe) I think #dc-ops should sync with you all on a schedule for this move. According to @Eeva...
[14:04:06] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[14:04:24] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[14:04:45] <bblack>	 ^ seen several of those lately on random hosts, I assume it's general and known ("Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle")
[14:06:31] <ema>	 bblack: https://phabricator.wikimedia.org/T221784
[14:07:04] <icinga-wm>	 PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[14:07:21] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3049.esams.wmnet'] `  and were **ALL** successful.
[14:07:34] <ema>	 as volans mentioned on the ticket, there's also other circumstances under which the alert is raised, not only dependency cycles
[14:07:39] <ema>	 I forgot which ones though!
[14:08:43] <volans|off>	 ema: there is a patch already +1ed I sent yesterday to change the message
[14:09:02] <volans|off>	 feel free to merge if you want ;)
[14:10:19] <bblack>	 !log reboot cp3049 - T222937
[14:10:23] <ema>	 ah, I see: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/513310/
[14:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:25] <stashbot>	 T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[14:11:03] <wikibugs>	 (03PS2) 10DCausse: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892)
[14:11:05] <wikibugs>	 (03PS4) 10DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892)
[14:11:07] <wikibugs>	 (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892)
[14:11:08] <wikibugs>	 (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892)
[14:11:10] <wikibugs>	 (03PS1) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892)
[14:11:12] <wikibugs>	 (03CR) 10DCausse: [cirrus] extension registration: don't assume default vars are set (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[14:12:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[14:12:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[14:13:07] <wikibugs>	 (03PS1) 10Papaul: DNS: Remove mgmt asset tag for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/513608
[14:15:02] <icinga-wm>	 PROBLEM - SSH on notebook1003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:17:35] <icinga-wm>	 PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[14:17:58] <XioNoX>	 elukey: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162
[14:18:13] <XioNoX>	 elukey: notebook1003 is alerting a lot
[14:18:28] <wikibugs>	 (03PS1) 10Cwhite: prometheus: update aggregate metrics to new metric names [puppet] - 10https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825)
[14:20:20] <_joe_>	 !log rolling restart of php-fpm across production to pick up the shorter revalidate frequency for T224491
[14:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:26] <stashbot>	 T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491
[14:26:31] <wikibugs>	 (03PS1) 10Cwhite: logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103)
[14:27:46] <XioNoX>	 ema: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3039&service=Check+Varnish+expiry+mailbox+lag
[14:28:22] <XioNoX>	 cdanis: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw1235&service=Long+running+screen%2Ftmux
[14:29:06] <cdanis>	 ooooops
[14:29:10] <cdanis>	 thanks XioNoX
[14:29:36] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 445 MB (0% inode=99%): Mathew.onipe yes we know! even we didn't expect it again - https://phabricator.wikimedia.org/T224395 - The acknowledgement expires at: 2019-06-03 14:28:39. https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[14:29:37] * Krinkle preparing to deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/513526/
[14:30:02] <wikibugs>	 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) Emailed Telia again with another follow-up email.
[14:30:11] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:30:42] <wikibugs>	 (03PS3) 10Jcrespo: Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/513578
[14:31:19] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "labsdb: Depool labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/513578 (owner: 10Jcrespo)
[14:31:37] <XioNoX>	 elukey: also WARN: Long running tmux process. (user: piccardi) on notebook1003
[14:32:08] <elukey>	 !log powercycle notebook1003 - host stuck due to user processes, no ssh available, OOM didn't trigger
[14:32:11] <onimisionipe>	 !log depool maps2004 (again) - T224395
[14:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:14] <elukey>	 XioNoX: thanks for the ping
[14:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:18] <stashbot>	 T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[14:33:43] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[14:33:53] <icinga-wm>	 RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[14:34:25] <icinga-wm>	 RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[14:34:25] <icinga-wm>	 RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:34:25] <icinga-wm>	 RECOVERY - DPKG on notebook1003 is OK: All packages OK
[14:34:33] <icinga-wm>	 RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:34:41] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Fri 2019-05-31 14:34:40 UTC.
[14:37:11] <icinga-wm>	 RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:38:01] <icinga-wm>	 RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[14:39:03] <XioNoX>	 elukey: this one is still active https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=notebook1003&service=Long+running+screen%2Ftmux
[14:39:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: update aggregate metrics to new metric names [puppet] - 10https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite)
[14:40:00] <elukey>	 XioNoX: sure, I am still triaging the host, the tmux is fine if it shows up in icinga for a couple of hours, nothing on fire :)
[14:40:03] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] prometheus: update aggregate metrics to new metric names [puppet] - 10https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825) (owner: 10Cwhite)
[14:40:14] <wikibugs>	 (03PS2) 10Cwhite: prometheus: update aggregate metrics to new metric names [puppet] - 10https://gerrit.wikimedia.org/r/513609 (https://phabricator.wikimedia.org/T219825)
[14:40:36] <XioNoX>	 ah, I rescheduled it and it's gone
[14:44:32] <XioNoX>	 cdanis: can you ack or downtime the tmux alert if it's expected?
[14:44:39] <cdanis>	 XioNoX: I ended the session
[14:45:01] <cdanis>	 so should go away soon?
[14:45:05] <XioNoX>	 ah yeah
[14:45:21] <XioNoX>	 I think it's scheduled to run every new moon
[14:45:43] <XioNoX>	 re-scheduling it, it should go away
[14:46:18] <wikibugs>	 (03PS2) 10Ema: varnish: install libvmod-re2 in Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/513587
[14:47:20] <wikibugs>	 (03CR) 10Ema: [C: 03+2] varnish: install libvmod-re2 in Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/513587 (owner: 10Ema)
[14:47:57] <icinga-wm>	 RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[14:52:07] <bblack>	 !log pool cp3049 back into service - T222937
[14:52:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:12] <stashbot>	 T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[14:52:42] <wikibugs>	 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) Dear Customer,     Our transmission 2nd line team are still investigating this. We will inform you as soon as the issue solved and sorry for any inconvenience caused.        Best Regards,
[14:53:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) >>! In T222960#5226150, @Joe wrote: > I think #dc-ops should sync with you all on a s...
[14:58:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) 05Open→03Resolved Server has been fully repooled by Jaime by pushing T221502#5225565
[14:59:26] <Krinkle>	 !log krinkle@deploy1001: git status in php-1.34-wmf.7/ is dirty (extensions/ORES)
[14:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:39] <Krinkle>	 !log krinkle@deploy1001: pulling down 6f91b41 for  php-1.34-wmf.7/extensions/ORES (without deploy), commit seems test-only
[15:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:53] <icinga-wm>	 PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:03:02] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10Performance-Team (Radar), 10User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (10Marostegui)
[15:03:33] <icinga-wm>	 RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 81395 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:05:13] <bblack>	 !log cp3039: restart varnish-be for mbox lag (likely induced by 3049's depool for ATS conversion!)
[15:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:50] <wikibugs>	 (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892)
[15:07:52] <wikibugs>	 (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892)
[15:07:54] <wikibugs>	 (03PS2) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892)
[15:09:16] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5225728, @akosiaris wrote: > One thing that I just met is that kask stops accepting HTTP conne...
[15:09:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[15:17:15] <wikibugs>	 (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892)
[15:18:03] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10fgiunchedi) >>! In T220401#5226287, @Eevans wrote: >> 1. We use an Exec probe that executes something like `curl http...
[15:18:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[15:31:22] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[15:31:30] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10jcrespo) Whoever is in charge of this, T224597 should be done before or around the same time.
[15:31:31] <wikibugs>	 (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892)
[15:31:33] <wikibugs>	 (03PS3) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892)
[15:31:35] <wikibugs>	 (03PS4) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892)
[15:33:17] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[15:34:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1315 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:35:57] <icinga-wm>	 RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 81408 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:36:07] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[15:36:23] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5226287, @Eevans wrote: >>>! In T220401#5225728, @akosiaris wrote: >> One thing that I just me...
[15:37:29] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) >>! In T220401#5226287, @Eevans wrote: >>>! In T220401#5225728, @akosiaris wrote: >> One thing that I just...
[15:40:23] <icinga-wm>	 PROBLEM - puppet last run on db1101 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[15:43:30] <wikibugs>	 10Operations, 10Traffic: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis)
[15:45:35] <wikibugs>	 10Operations, 10Traffic: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis) p:05Triage→03Normal
[16:04:43] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[16:10:09] <cdanis>	 hey it works
[16:11:43] <godog>	 neat!
[16:12:47] <icinga-wm>	 RECOVERY - puppet last run on db1101 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:16:17] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar)
[16:18:04] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: prometheus: Support TLS enabled pods [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401)
[16:18:20] <wikibugs>	 (03CR) 10CDanis: "Was the container replicator causing undue load on beta?  IIRC it just replicates metadata which should be small and fast" [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar)
[16:18:32] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix typo in initialize_service.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/513634
[16:18:34] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: kask: Fix TLS certs checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/513635
[16:18:36] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: kask: prometheus scraping over HTTPS if TLS enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/513636 (https://phabricator.wikimedia.org/T220401)
[16:18:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Bump kask version to 0.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513637
[16:18:51] <wikibugs>	 (03PS2) 10ArielGlenn: remove orderrev config option, no longer needed [dumps] - 10https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628)
[16:19:05] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] swift: hiera-ize object server number of workers [puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar)
[16:19:36] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar)
[16:19:59] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] remove orderrev config option, no longer needed [dumps] - 10https://gerrit.wikimedia.org/r/513568 (https://phabricator.wikimedia.org/T207628) (owner: 10ArielGlenn)
[16:20:42] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@fd6100a]: remove orderrevs config option, unneeded now
[16:20:45] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@fd6100a]: remove orderrevs config option, unneeded now (duration: 00m 03s)
[16:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[16:22:50] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar)
[16:23:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/513633 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[16:25:22] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) > And this uncovered now that prometheus can't talk to it (cause it expects HTTP I guess?). /me looking in...
[16:25:35] <wikibugs>	 (03PS2) 10ArielGlenn: remove orderrevs param from dumps manifests [puppet] - 10https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628)
[16:25:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix typo in initialize_service.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/513634 (owner: 10Alexandros Kosiaris)
[16:26:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally and in staging, worked fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/513635 (owner: 10Alexandros Kosiaris)
[16:26:28] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] remove orderrevs param from dumps manifests [puppet] - 10https://gerrit.wikimedia.org/r/513570 (https://phabricator.wikimedia.org/T207628) (owner: 10ArielGlenn)
[16:26:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally and in staging, worked fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/513636 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[16:26:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump kask version to 0.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513637 (owner: 10Alexandros Kosiaris)
[16:28:26] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401)
[16:30:17] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) I did some testing of various software and hardware raid configurations and wrote up a summary at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-t...
[16:38:28] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[16:45:53] <wikibugs>	 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10akosiaris) I think so, let's wait for @fsero though
[16:50:21] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) One minor question. Given per T220401#5128786 1 kask instance is able to handle ~300req/s, how many instan...
[16:52:29] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Looks good!  FWIW commit message seems to have some remnants of an anchor tag" [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite)
[16:53:45] <wikibugs>	 (03Restored) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey)
[16:55:44] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "> Looks good!  FWIW commit message seems to have some remnants of an" [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite)
[16:57:35] <wikibugs>	 Operations, LDAP-Access-Requests: Remove user Greta WMDE from wmde LDAP group - https://phabricator.wikimedia.org/T224507 (ayounsi) Open→Resolved User `greta` removed from the `wmde` LDAP group.
[16:58:39] <wikibugs>	 (03PS4) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:01:04] <wikibugs>	 (03PS5) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:04:41] <wikibugs>	 (03PS1) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740)
[17:06:04] <wikibugs>	 (03PS1) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642
[17:06:43] <wikibugs>	 (03PS2) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642
[17:06:51] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox)
[17:07:47] <wikibugs>	 (03PS6) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:08:11] <wikibugs>	 (03PS3) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642
[17:08:17] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox)
[17:10:16] <wikibugs>	 (03PS4) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642
[17:10:18] <wikibugs>	 (03PS7) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:10:23] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox)
[17:10:34] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "LGTM, some comments inline. Before merging, please give it a try in the puppet catalog compiler (PCC) here:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: Jhedden)
[17:15:55] <wikibugs>	 (03PS5) 10Paladox: Test: Do not merge [puppet] - 10https://gerrit.wikimedia.org/r/513642
[17:16:01] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513642 (owner: 10Paladox)
[17:16:47] <wikibugs>	 (03PS8) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:16:59] <wikibugs>	 10Operations, 10SRE-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Rmaung)
[17:18:05] <wikibugs>	 (03PS2) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740)
[17:24:26] <wikibugs>	 (03PS9) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:24:31] <wikibugs>	 (03PS1) 10Mholloway: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599)
[17:25:01] <wikibugs>	 (03CR) 10Mholloway: [C: 04-2] "Hold until Mon 6/3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway)
[17:26:17] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5226531, @akosiaris wrote: > One minor question. Given per T220401#5128786 1 kask instance is...
[17:26:26] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite)
[17:26:35] <wikibugs>	 (03PS2) 10Cwhite: logstash: add deprecated-input tag to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/513612 (https://phabricator.wikimedia.org/T220103)
[17:27:06] <wikibugs>	 (03PS10) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:29:01] <andrewbogott>	 !log added jeh to the 'ops' group in ldap
[17:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:12] <wikibugs>	 (03PS11) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[17:33:00] <wikibugs>	 (03CR) 10Elukey: "Sorry for the spam, this finally changes only the canaries leaving the rest untouched: https://puppet-compiler.wmflabs.org/compiler1002/16" [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey)
[17:39:42] <wikibugs>	 (03CR) 10Ori.livneh: [C: 03+1] phabricator: add forensic apache logging and enable on phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/511955 (owner: 10Dzahn)
[17:41:07] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10jcrespo)
[17:42:15] <wikibugs>	 (03PS3) 10Jhedden: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740)
[17:44:18] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10DMccurdy) I am @Rmaung 's manager at WMF and I approve her request for ldap access. Thanks!
[17:47:42] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) >>! In T207707#5223912, @Cmjohnson wrote: > @greg @robh I am just plugging these disks into the server...
[17:52:37] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10jcrespo) Thanks, @DMccurdy!  @Rmaung have you already created a "Developer account" (Wikitech wiki account)? If yes, please provide its name; if not, you can do it now at: https://wik...
[17:53:04] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Aklapper) The task summary says `ldap/wmde`. Did you mean `ldap/wmf`?
[17:53:05] <wikibugs>	 (03PS4) 10Andrew Bogott: openstack: keystone: Update service daemon_active [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: 10Jhedden)
[17:53:32] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) @Papaul could you have a look at kafka-main2002?  It seems to be stuck, at least I'm not able to open a console or power cycle.  ` /admin1-> racadm serveraction powercyc...
[17:54:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] "This shows as substituting True for True in the puppet compiler.  /probably/ that's because it's replacing the string 'True' with the bool" [puppet] - 10https://gerrit.wikimedia.org/r/513641 (https://phabricator.wikimedia.org/T224740) (owner: 10Jhedden)
[17:56:29] <wikibugs>	 (03PS5) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527)
[18:01:15] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T224744 (10Rmaung) @Aklapper yes, I meant WMF! Sorry about that.  I have created a developer account w/Wikitech, and it is the same username, Rmaung. Let me know if there's anything else I need...
[18:07:46] <wikibugs>	 (03PS1) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662
[18:08:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add user dartmon [puppet] - https://gerrit.wikimedia.org/r/513662 (owner: Ayounsi)
[18:08:50] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[18:11:07] <wikibugs>	 (03PS2) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788)
[18:13:40] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "Thanks for this, I must have forgotten to add her here." [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) (owner: 10Ayounsi)
[18:14:20] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban), User-greg: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (Cmjohnson) a:greg @greg the disks have been added and assigned to you
[18:16:03] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (greg) a:greg→None
[18:17:21] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10greg)
[18:17:24] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (greg) Stalled→Open
[18:18:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden)
[18:18:36] <wikibugs>	 (03PS4) 10Ayounsi: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden)
[18:20:07] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788) (owner: 10Ayounsi)
[18:20:15] <wikibugs>	 (03PS3) 10Ayounsi: Add user dartmon [puppet] - 10https://gerrit.wikimedia.org/r/513662 (https://phabricator.wikimedia.org/T222788)
[18:24:45] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:25:22] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests, Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (ayounsi) Open→Resolved a:ayounsi Everything should be set, please reopen the task if any issue.
[18:25:38] <wikibugs>	 (03PS5) 10Ayounsi: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden)
[18:25:54] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10reosarevok) We're waiting until we have control over the domain so that we can release our updated website. Is there any particular reason that makes just tran...
[18:30:11] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: scratch is a rw mount [puppet] - 10https://gerrit.wikimedia.org/r/513666 (https://phabricator.wikimedia.org/T209527)
[18:31:22] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: scratch is a rw mount [puppet] - 10https://gerrit.wikimedia.org/r/513666 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[18:40:59] <wikibugs>	 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (10Cmjohnson)
[18:41:03] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121 (10Cmjohnson)
[18:41:06] <wikibugs>	 Operations, ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (Cmjohnson) Open→Resolved @jcrespo the server is back on and I am able to reach the mgmt interface.
[18:42:17] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm)
[18:42:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm)
[18:43:38] <wikibugs>	 10Operations, 10ops-eqiad: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10Marostegui) Confirmed, I can access mgmt interface as well. I am going to enable puppet and start MySQL.
[18:44:07] <marostegui>	 !log Start MySQL on es1019 - T213422
[18:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:12] <stashbot>	 T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422
[18:47:20] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) Holding this back until Monday in case of any data concerns, but we are now pretty much unblocked here.
[18:47:45] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (Bstorm) p:Low→High a:Bstorm
[18:59:12] <wikibugs>	 Operations, ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (HMarcus) Resolved→Open Hi all,  Apologies, but the work for this is not quite finished. After installing the cards, we realized that the current SAS connector cable is not the...
[19:18:15] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422)
[19:22:01] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Aklapper)
[19:22:09] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Aklapper)
[19:31:06] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: disable alerts for labstore1003 and pass to cloudstores [puppet] - 10https://gerrit.wikimedia.org/r/513678 (https://phabricator.wikimedia.org/T187456)
[19:31:25] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (Bstorm) Stalled→Open
[19:32:25] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] cloudstore: disable alerts for labstore1003 and pass to cloudstores [puppet] - 10https://gerrit.wikimedia.org/r/513678 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm)
[19:39:45] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: enable more monitors on cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/513681
[19:44:48] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Double web session cache to 512 [puppet] - 10https://gerrit.wikimedia.org/r/513682
[19:45:27] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Double web session cache to 512 [puppet] - 10https://gerrit.wikimedia.org/r/513682
[19:52:15] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Double web session cache memory to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/513682
[19:58:43] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Double web session cache memory to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/513682
[19:59:29] <wikibugs>	 (03PS1) 10BBlack: cp3037: remove from cache config [puppet] - 10https://gerrit.wikimedia.org/r/513685 (https://phabricator.wikimedia.org/T222041)
[19:59:31] <wikibugs>	 (03PS1) 10BBlack: cache: reimage cp3034 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513686 (https://phabricator.wikimedia.org/T222937)
[20:03:28] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cp3037: remove from cache config [puppet] - 10https://gerrit.wikimedia.org/r/513685 (https://phabricator.wikimedia.org/T222041) (owner: 10BBlack)
[20:04:24] <bblack>	 !log cp3034: depool for reimage - T222937
[20:04:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:30] <stashbot>	 T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[20:04:41] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cache: reimage cp3034 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513686 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack)
[20:06:04] <icinga-wm>	 RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:06:13] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3034.esams.wmnet'] ` The log can be found i...
[20:08:06] <icinga-wm>	 RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:08:40] <icinga-wm>	 RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:11:58] <icinga-wm>	 RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:13:26] <icinga-wm>	 RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:18:17] <wikibugs>	 (03PS1) 10Bstorm: labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456)
[20:18:25] <icinga-wm>	 RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:20:05] <icinga-wm>	 RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:20:15] <icinga-wm>	 RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:21:19] <icinga-wm>	 RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:22:51] <icinga-wm>	 RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:23:17] <icinga-wm>	 RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:23:32] <wikibugs>	 (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1001/16831/labstore1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm)
[20:23:49] <wikibugs>	 (03PS2) 10Bstorm: labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456)
[20:24:03] <icinga-wm>	 RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:24:17] <icinga-wm>	 RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:24:47] <icinga-wm>	 RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:25:27] <icinga-wm>	 RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:25:36] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: remove hieradata for labstore1003/misc and strip down puppet [puppet] - 10https://gerrit.wikimedia.org/r/513690 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm)
[20:26:40] <icinga-wm>	 RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:34:40] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Looks good now :) Danke!" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar)
[20:35:44] <icinga-wm>	 RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:36:07] <wikibugs>	 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) >>! In T221507#5219523, @faidon wrote:  > - On the device types errors, I can't help but think that we're looking at th...
[20:36:18] <wikibugs>	 (03PS1) 10CDanis: conftool: use non-default ports for integration test etcd [software/conftool] - 10https://gerrit.wikimedia.org/r/513694
[20:37:12] <icinga-wm>	 RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 44 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:37:14] <icinga-wm>	 RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:44:48] <icinga-wm>	 RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[20:52:03] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] "afaict this should all work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[20:52:09] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[20:52:17] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[20:54:35] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[20:54:54] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse)
[20:57:01] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3034.esams.wmnet'] `  and were **ALL** successful.
[20:59:35] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (Bstorm) Open→Declined T187456 - No point in fixing at this stage.
[20:59:37] <wikibugs>	 (CR) Ayounsi: Bird anycast: add anycast_healthchecker (3 comments) [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[21:01:16] <wikibugs>	 10Operations, 10Data-Services, 10video2commons, 10cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (10Bstorm) @zhuyifei1999 Isn't this currently being done/this ticket can go away? If not, let's ch...
[21:02:45] <wikibugs>	 (03PS26) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723
[21:08:12] <icinga-wm>	 PROBLEM - grafana-labs.wikimedia.org on labmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[21:08:43] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm)
[21:08:56] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black Non-redundant power, like many esams hosts. These are all due for replacement Soon...
[21:09:04] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm)
[21:10:06] <bblack>	 !log cp3034: repool - T222937
[21:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:11] <stashbot>	 T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[21:10:53] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm)
[21:13:17] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10Bstorm)
[21:15:13] <wikibugs>	 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm)
[21:15:59] <wikibugs>	 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) This is pretty old.  We'll have to reboot it again to know if this is still happening.  I suspect it actually isn't.
[21:25:57] <icinga-wm>	 RECOVERY - grafana-labs.wikimedia.org on labmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9002 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[21:40:54] <wikibugs>	 (03PS3) 10Aaron Schulz: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357
[21:42:34] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 03+2] Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz)
[21:43:24] <wikibugs>	 (03Merged) 10jenkins-bot: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz)
[21:43:39] <wikibugs>	 (03CR) 10jenkins-bot: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 (owner: 10Aaron Schulz)
[21:46:19] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/db-codfw.php: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs (duration: 00m 50s)
[21:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:49] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/db-eqiad.php: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs (duration: 00m 47s)
[21:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:32] <wikibugs>	 (PS1) Bstorm: labstore: remove unused hiera yaml [puppet] - https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456)
[22:09:42] <wikibugs>	 (CR) Bstorm: "Using the regex labstore.*: https://puppet-compiler.wmflabs.org/compiler1002/16833/" [puppet] - https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456) (owner: Bstorm)
[22:24:23] <wikibugs>	 Operations, Mail, Phabricator: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (greg)
[22:28:38] <wikibugs>	 Operations, Data-Services, video2commons, cloud-services-team (Kanban): Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068 (zhuyifei1999) I don't mind which server it is mounting as long as it matches the current scratc...
[22:31:38] <wikibugs>	 Operations, Mail, Phabricator: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (mmodell) from /var/log/exim4/mainlog:  `2019-05-31 22:30:20 Warning: No server certificate defined; will use a selfsigned one.`
[23:06:33] <icinga-wm>	 PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[23:33:37] <icinga-wm>	 RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures