[00:11:53] <icinga-wm>	 PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [70.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[00:25:13] <icinga-wm>	 RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[00:37:47] <wikibugs>	 (03PS6) 10Dzahn: quarry::database: Use mariadb instead of mysql module [puppet] - 10https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999)
[01:09:54] <wikibugs>	 (03PS6) 10Krinkle: Test that all wikis are in one of the section dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: 10Anomie)
[01:09:59] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Test that all wikis are in one of the section dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: 10Anomie)
[01:11:30] <wikibugs>	 (03Merged) 10jenkins-bot: Test that all wikis are in one of the section dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: 10Anomie)
[01:12:55] <logmsgbot>	 !log krinkle@deploy1001 Synchronized tests/: I43c79297499 (duration: 00m 51s)
[01:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:05] <logmsgbot>	 !log krinkle@deploy1001 Synchronized dblists/s3.dblist: I43c79297499 (duration: 00m 49s)
[01:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:11] <wikibugs>	 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Krinkle) @Dzahn That's our workboard – The action item for this task is currently blocked, on Ops. The decision itself is...
[01:20:46] <wikibugs>	 (03CR) 10jenkins-bot: Test that all wikis are in one of the section dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455587 (https://phabricator.wikimedia.org/T202904) (owner: 10Anomie)
[01:49:10] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/startup/: If26851eac1530f02 (duration: 00m 49s)
[01:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:05] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/mediawiki.user.js: I8feecddf0878 - T203275 (duration: 00m 49s)
[01:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:11] <stashbot>	 T203275: JS crash in mw.user.generateRandomSessionId() - https://phabricator.wikimedia.org/T203275
[02:06:59] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/resources/src/startup/startup.js: (no justification provided) (duration: 00m 50s)
[02:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:31] <icinga-wm>	 PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:23:51] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:01] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:02] <icinga-wm>	 PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:12] <icinga-wm>	 PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:12] <icinga-wm>	 PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:24:21] <icinga-wm>	 PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:25:02] <icinga-wm>	 RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[03:26:22] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:29:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.57 seconds
[03:32:01] <icinga-wm>	 RECOVERY - DPKG on stat1005 is OK: All packages OK
[03:32:21] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[03:32:31] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK
[03:32:41] <icinga-wm>	 RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[03:32:51] <icinga-wm>	 RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[03:32:51] <icinga-wm>	 RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[03:36:31] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:45:42] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.17 seconds
[03:49:37] <wikibugs>	 (03PS1) 10Legoktm: Enable SkinPerPage extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299)
[05:21:16] <wikibugs>	 10Operations, 10SRE-Access-Requests: Access to restbase servers (including sudo) for Imarlier - https://phabricator.wikimedia.org/T202563 (10ArielGlenn) This is waiting for the next SRE meeting for review.
[05:21:59] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10ArielGlenn) This is waiting for the next SRE meeting for discussion.
[05:23:04] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (10ArielGlenn) We now have: Membership of ops group in LDAP and YAML are not identical (from ac...
[05:23:28] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (10ArielGlenn) 05Resolved>03Open
[06:37:32] <icinga-wm>	 PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops
[07:11:45] <wikibugs>	 (03CR) 10Brian Wolff: [C: 031] "Just a +1 to note that this extension passed security review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/456780 (https://phabricator.wikimedia.org/T203299) (owner: 10Legoktm)
[07:57:41] <wikibugs>	 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (10jcrespo) No, this is not at the moment a goal, but it is ongoing work- recently there was a new 2-beta release, and I am testing if i...
[07:59:55] <wikibugs>	 10Operations, 10DBA, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (10jcrespo)
[11:28:32] <icinga-wm>	 PROBLEM - Check health of redis instance on 6382 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6382
[11:29:32] <icinga-wm>	 RECOVERY - Check health of redis instance on 6382 on rdb1004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6382 has 1 databases (db0) with 7219834 keys, up 59 days 10 hours
[12:45:34] <wikibugs>	 (03PS10) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[12:46:44] <wikibugs>	 (03CR) 10Mathew.onipe: Elasticsearch module is coming up. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[12:48:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[12:50:32] <wikibugs>	 (03PS2) 10MarcoAurelio: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769)
[12:50:46] <wikibugs>	 (03PS8) 10MarcoAurelio: Modify gender namespaces for pl.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347)
[12:59:52] <wikibugs>	 10Operations, 10AutoWikiBrowser, 10Traffic, 10HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (10Mainframe98)
[13:02:03] <wikibugs>	 10Operations, 10AutoWikiBrowser, 10Traffic, 10HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (10Reedy)
[13:02:16] <wikibugs>	 10Operations, 10AutoWikiBrowser, 10Traffic, 10HTTPS: Check page failed to load on Wikia/Fandom - https://phabricator.wikimedia.org/T203316 (10Mainframe98) Related tasks: {T174241}
[13:04:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:08:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:09:21] <icinga-wm>	 PROBLEM - Host ms-fe2006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:11:01] <icinga-wm>	 RECOVERY - Host ms-fe2006 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[13:34:31] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:38:51] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:43:12] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:45:22] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:55:12] <wikibugs>	 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) >>! In T124101#4538078, @Raymond wrote: > Original of https://commons.wikimedia.org/wiki/Fil...
[17:56:58] <wikibugs>	 (03CR) 10Gehel: "A few more details to fix..." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[19:15:16] <wikibugs>	 (03CR) 10Mathew.onipe: Elasticsearch module is coming up. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[19:16:41] <wikibugs>	 (03PS11) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[19:17:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[20:03:36] <wikibugs>	 (03CR) 10Gehel: Elasticsearch module is coming up. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe)
[20:13:42] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:18:02] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:26:51] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:33:31] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:11:01] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:13:12] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:26:38] <wikibugs>	 (03Abandoned) 10Matanya: standard packages: re-add intel-microcode [puppet] - 10https://gerrit.wikimedia.org/r/312714 (owner: 10Matanya)
[21:47:15] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/WikimediaMaintenance/: I219882ba09e6a23 - T203154 (duration: 01m 06s)
[21:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:21] <stashbot>	 T203154: addWiki.php is broken due to "Database selection is disallowed to enable reuse." - https://phabricator.wikimedia.org/T203154
[22:36:01] <icinga-wm>	 PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[22:37:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 74689 bytes in 0.793 second response time
[23:06:01] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:09:51] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:16:31] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:17:02] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:19:01] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:44:22] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures