[00:11:53] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [70.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[00:25:13] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[00:37:47] (PS6) Dzahn: quarry::database: Use mariadb instead of mysql module [puppet] - https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[01:09:54] (PS6) Krinkle: Test that all wikis are in one of the section dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: Anomie)
[01:09:59] (CR) Krinkle: [C: 2] Test that all wikis are in one of the section dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: Anomie)
[01:11:30] (Merged) jenkins-bot: Test that all wikis are in one of the section dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: Anomie)
[01:12:55] !log krinkle@deploy1001 Synchronized tests/: I43c79297499 (duration: 00m 51s)
[01:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:05] !log krinkle@deploy1001 Synchronized dblists/s3.dblist: I43c79297499 (duration: 00m 49s)
[01:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:11] Operations, Performance-Team, Wikimedia-Mailing-lists, User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (Krinkle) @Dzahn That's our workboard – The action item for this task is currently blocked, on Ops. The decision itself is...
[01:20:46] (CR) jenkins-bot: Test that all wikis are in one of the section dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: Anomie)
[01:49:10] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/startup/: If26851eac1530f02 (duration: 00m 49s)
[01:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/mediawiki.user.js: I8feecddf0878 - T203275 (duration: 00m 49s)
[01:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:11] T203275: JS crash in mw.user.generateRandomSessionId() - https://phabricator.wikimedia.org/T203275
[02:06:59] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/startup/startup.js: (no justification provided) (duration: 00m 50s)
[02:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:31] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:23:51] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:01] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:02] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:12] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:12] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:21] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:25:02] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[03:26:22] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:29:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.57 seconds
[03:32:01] RECOVERY - DPKG on stat1005 is OK: All packages OK
[03:32:21] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[03:32:31] RECOVERY - Disk space on stat1005 is OK: DISK OK
[03:32:41] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[03:32:51] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[03:32:51] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[03:36:31] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:45:42] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.17 seconds
[03:49:37] (PS1) Legoktm: Enable SkinPerPage extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299)
[05:21:16] Operations, SRE-Access-Requests: Access to restbase servers (including sudo) for Imarlier - https://phabricator.wikimedia.org/T202563 (ArielGlenn) This is waiting for the next SRE meeting for review.
[05:21:59] Operations, SRE-Access-Requests, Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (ArielGlenn) This is waiting for the next SRE meeting for discussion.
[05:23:04] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (ArielGlenn) We now have: Membership of ops group in LDAP and YAML are not identical (from ac...
[05:23:28] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (ArielGlenn) Resolved>Open
[06:37:32] PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops
[07:11:45] (CR) Brian Wolff: [C: 1] "Just a +1 to note that this extension passed security review." [mediawiki-config] - https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299) (owner: Legoktm)
[07:57:41] Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (jcrespo) No, this is not at the moment a goal, but it is ongoing work- recently there was a new 2-beta release, and I am testing if i...
[07:59:55] Operations, DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (jcrespo)
[11:28:32] PROBLEM - Check health of redis instance on 6382 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6382
[11:29:32] RECOVERY - Check health of redis instance on 6382 on rdb1004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6382 has 1 databases (db0) with 7219834 keys, up 59 days 10 hours
[12:45:34] (PS10) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[12:46:44] (CR) Mathew.onipe: Elasticsearch module is coming up. (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[12:48:02] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[12:50:32] (PS2) MarcoAurelio: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769)
[12:50:46] (PS8) MarcoAurelio: Modify gender namespaces for pl.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347)
[12:59:52] Operations, AutoWikiBrowser, Traffic, HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (Mainframe98)
[13:02:03] Operations, AutoWikiBrowser, Traffic, HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (Reedy)
[13:02:16] Operations, AutoWikiBrowser, Traffic, HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (Mainframe98) Related tasks: {T174241}
[13:04:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:08:52] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:09:21] PROBLEM - Host ms-fe2006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:11:01] RECOVERY - Host ms-fe2006 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[13:34:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:38:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:43:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:45:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:55:12] Operations, Commons, Multimedia, media-storage, User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (AlexisJazz) >>! In T124101#4538078, @Raymond wrote: > Original of https://commons.wikimedia.org/wiki/Fil...
[17:56:58] (CR) Gehel: "A few more details to fix..." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[19:15:16] (CR) Mathew.onipe: Elasticsearch module is coming up. (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[19:16:41] (PS11) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[19:17:46] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:03:36] (CR) Gehel: Elasticsearch module is coming up. (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:13:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:18:02] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:26:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:33:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:11:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:13:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:26:38] (Abandoned) Matanya: standard packages: re-add intel-microcode [puppet] - https://gerrit.wikimedia.org/r/312714 (owner: Matanya)
[21:47:15] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/WikimediaMaintenance/: I219882ba09e6a23 - T203154 (duration: 01m 06s)
[21:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:21] T203154: addWiki.php is broken due to "Database selection is disallowed to enable reuse." - https://phabricator.wikimedia.org/T203154
[22:36:01] PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[22:37:11] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 74689 bytes in 0.793 second response time
[23:06:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:09:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:16:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:17:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:19:01] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:44:22] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures