[00:05:57] (PS2) Dzahn: acme_chief: add Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/509475 (https://phabricator.wikimedia.org/T197873)
[00:06:10] (PS1) Dzahn: labstore: add Icinga notes_urls [puppet] - https://gerrit.wikimedia.org/r/509545 (https://phabricator.wikimedia.org/T197873)
[00:24:33] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 339 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:28:37] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:52:13] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:56:17] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[00:58:17] (CR) Alex Monk: "Should probably do Ifc7d8290 first" [puppet] - https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[01:00:50] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[01:03:42] (PS7) Alex Monk: base::firewall: Send (almost) all special host groups via single parameter [puppet] - https://gerrit.wikimedia.org/r/505793
[01:06:00] Operations, Traffic, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (Krinkle) >>! In T137979#4118215, @BBlack wrote: > Re-reading above: probably the better blend of options would be to swap gzip for brotli in Varnish one-for-one (without the whole st...
[01:06:11] Operations, Traffic, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (Krinkle) p:Low→Normal
[01:11:02] (PS2) Krinkle: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - https://gerrit.wikimedia.org/r/509357 (owner: Aaron Schulz)
[01:20:56] (PS1) Dzahn: mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873)
[01:20:58] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Reedy)
[01:32:19] PROBLEM - Disk space on ms-be1015 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[01:35:48] (PS1) Dzahn: nrpe: add Icinga notes_url for systemd_unit_state check [puppet] - https://gerrit.wikimedia.org/r/509553 (https://phabricator.wikimedia.org/T197873)
[01:38:18] Operations, ops-codfw, media-storage, Patch-For-Review, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Dzahn)
[01:46:30] Operations, ops-eqiad, media-storage: ms-be1015 - sdb1 failed - https://phabricator.wikimedia.org/T222991 (Dzahn)
[01:46:43] Operations, ops-eqiad, media-storage: ms-be1015 - sdb1 failed - https://phabricator.wikimedia.org/T222991 (Dzahn) p:Triage→High
[01:47:38] ACKNOWLEDGEMENT - Disk space on ms-be1015 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible: Input/output error daniel_zahn https://phabricator.wikimedia.org/T222991 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[01:48:49] Operations, ops-eqiad, DC-Ops, Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Dzahn) Invalid→Open It's back: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC- Cur...
[01:49:23] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 4.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T214283 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops
[01:56:29] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:57:13] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:57:31] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdb1]
[01:58:18] ACKNOWLEDGEMENT - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdb1] daniel_zahn https://phabricator.wikimedia.org/T222991
[02:01:38] !log actinium - low disk space - apt-get clean - gzip /var/log/squid3/access.log.1
[02:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:01] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 253 MB (2% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[02:03:25] RECOVERY - Disk space on actinium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[02:06:08] Operations, ops-codfw, Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (Dzahn) it's fully down now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=elastic2038
[02:06:23] Operations, ops-codfw, Discovery-Search (Current work): elastic2038 DOWN (CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 (Dzahn)
[02:07:28] ACKNOWLEDGEMENT - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T217398
[02:13:15] Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors
[02:21:13] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:21:55] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:25:31] PROBLEM - Device not healthy -SMART- on ms-be2017 is CRITICAL: cluster=swift device=cciss,1 instance=ms-be2017:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops
[02:57:07] (Abandoned) Reedy: Revert "striker: Disable developer account creation" [puppet] - https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844) (owner: Reedy)
[03:24:25] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:26:13] RECOVERY - Device not healthy -SMART- on ms-be2017 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2017&var-datasource=codfw+prometheus/ops
[03:51:15] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:53:16] Operations, Commons, media-storage: Upload fails at Wikimedia Commons "Internal error: Server failed to store temporary file." - https://phabricator.wikimedia.org/T222994 (Peachey88)
[04:06:27] RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[05:12:59] Operations, DBA, Patch-For-Review: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:13:05] Operations, DBA, Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (Marostegui) Open→Resolved It recovered again, needs replacement though as I'm sure it will become critical again soonish Closing for now again unt...
[05:33:15] Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors
[06:30:17] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:31:51] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:37:52] !log restart eventlogging on eventlog1002 - huge kafka consumer lag accumulated (T222941)
[06:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:57] T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941
[06:57:09] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:45] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:16:24] (PS1) Elukey: profile::prometheus::alerts: add EL processors kafka consumer lag alert [puppet] - https://gerrit.wikimedia.org/r/509566 (https://phabricator.wikimedia.org/T222941)
[07:17:37] (CR) Elukey: [C: +2] profile::prometheus::alerts: add EL processors kafka consumer lag alert [puppet] - https://gerrit.wikimedia.org/r/509566 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[07:22:18] (PS1) Elukey: profile::prometheus::alerts: tune EL kafka consumer lag [puppet] - https://gerrit.wikimedia.org/r/509567 (https://phabricator.wikimedia.org/T222941)
[07:23:16] (CR) Elukey: [C: +2] profile::prometheus::alerts: tune EL kafka consumer lag [puppet] - https://gerrit.wikimedia.org/r/509567 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[07:30:08] (PS1) Elukey: profile::prometheus::alerts: fix dashboard URL [puppet] - https://gerrit.wikimedia.org/r/509568 (https://phabricator.wikimedia.org/T222941)
[07:31:01] (CR) Elukey: [C: +2] profile::prometheus::alerts: fix dashboard URL [puppet] - https://gerrit.wikimedia.org/r/509568 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[07:31:33] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:33:37] fixing it --^
[07:34:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:36:35] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:39:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:40:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:40:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[08:00:39] PROBLEM - Disk space on ms-be1014 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[08:11:48] (PS1) ArielGlenn: make tox happy again on some subdirs so changes to others will pass ci [software] - https://gerrit.wikimedia.org/r/509571
[08:12:50] (CR) jerkins-bot: [V: -1] make tox happy again on some subdirs so changes to others will pass ci [software] - https://gerrit.wikimedia.org/r/509571 (owner: ArielGlenn)
[08:12:57] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdl1]
[08:23:01] (PS2) ArielGlenn: make tox happy again on some subdirs so changes to others will pass ci [software] - https://gerrit.wikimedia.org/r/509571
[08:23:27] RECOVERY - Memory correctable errors -EDAC- on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[08:30:57] (CR) ArielGlenn: [C: +2] make tox happy again on some subdirs so changes to others will pass ci [software] - https://gerrit.wikimedia.org/r/509571 (owner: ArielGlenn)
[09:23:15] ̶W̶a̶r̶n̶i̶n̶g Device cr1-codfw.wikimedia.org recovered from Inbound interface errors
[09:29:44] (PS1) ArielGlenn: remove salt-misc dir, scripts no longer used [software] - https://gerrit.wikimedia.org/r/509576
[09:34:58] (CR) Gehel: [C: -1] "as discussed" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) (owner: Dzahn)
[09:55:39] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:58:17] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 77183 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:00:59] (CR) Gehel: [C: -1] "Looks reasonable, see comments inline" (8 comments) [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[10:58:56] (CR) Jcrespo: [C: -1] "I don't think these are appropriate. elukey will have better documentation for eventlogging, and haproxy may and should have a better page" [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[11:17:49] PROBLEM - Disk space on ms-be2013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[11:21:50] (CR) ArielGlenn: [C: +2] remove salt-misc dir, scripts no longer used [software] - https://gerrit.wikimedia.org/r/509576 (owner: ArielGlenn)
[11:24:03] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[mountpoint-/srv/swift-storage/sdb1]
[14:05:40] (Abandoned) Giuseppe Lavagetto: mediawiki::web::beta_sites: convert wikibooks to vhost [puppet] - https://gerrit.wikimedia.org/r/439894 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[14:51:59] Operations, Commons, MediaWiki-File-management, Multimedia, media-storage: Upload fails at Wikimedia Commons "Internal error: Server failed to store temporary file." - https://phabricator.wikimedia.org/T222994 (Reedy)
[15:26:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:27:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[15:28:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:28:51] PROBLEM - HHVM rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:30:05] RECOVERY - HHVM rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77176 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:33:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:33:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[15:35:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[15:38:15] Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors
[16:27:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:28:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:29:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:30:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:30:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[16:30:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:31:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:32:26] all the DCs with 500s?
[16:35:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:36:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:37:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[16:45:28] (PS1) Framawiki: Enable SandboxLink extension on zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/509593 (https://phabricator.wikimedia.org/T223006)
[16:55:08] (CR) Framawiki: [C: +1] "I80e054a2134ca was merged a month ago, I think this patch can be merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) (owner: Varnent)
[17:11:05] (PS1) Framawiki: Enable wmgProofreadPageShowHeaders on pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/509594 (https://phabricator.wikimedia.org/T222740)
[17:28:55] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:29:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:30:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:30:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:30:10] (CR) Framawiki: [C: +1] Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: Ammarpad)
[17:30:19] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:30:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:30:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[17:30:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:31:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:33:03] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:33:04] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:33:21] (CR) Framawiki: [C: +1] Add namespace aliases on zhwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: DannyS712)
[17:33:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:33:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:34:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:34:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:37:34] (CR) Framawiki: [C: -1] "The task desc asks a change for 9 namespaces, there is only five here. If it is wanted, please explain why, thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: DannyS712)
[17:38:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:38:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:39:01] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[18:13:45] (PS1) Alex Monk: deployment-prep: Move to working Mathoid service [puppet] - https://gerrit.wikimedia.org/r/509595 (https://phabricator.wikimedia.org/T221654)
[18:14:18] (PS1) Alex Monk: deployment-prep: Move to working Mathoid service [mediawiki-config] - https://gerrit.wikimedia.org/r/509596 (https://phabricator.wikimedia.org/T221654)
[18:14:59] (PS2) Alex Monk: deployment-prep: Move to working Mathoid service [mediawiki-config] - https://gerrit.wikimedia.org/r/509596 (https://phabricator.wikimedia.org/T221654)
[18:20:34] (CR) Krinkle: [C: +1] Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: Ammarpad)
[18:23:27] (PS1) Reedy: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/509597
[18:23:29] (CR) Reedy: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/509597 (owner: Reedy)
[18:25:05] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/509597 (owner: Reedy)
[18:26:04] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 57s)
[18:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:14] (CR) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/509597 (owner: Reedy)
[18:39:37] (CR) Krinkle: [C: +2] deployment-prep: Move to working Mathoid service [mediawiki-config] - https://gerrit.wikimedia.org/r/509596 (https://phabricator.wikimedia.org/T221654) (owner: Alex Monk)
[18:40:43] (Merged) jenkins-bot: deployment-prep: Move to working Mathoid service [mediawiki-config] - https://gerrit.wikimedia.org/r/509596 (https://phabricator.wikimedia.org/T221654) (owner: Alex Monk)
[18:40:57] (CR) jenkins-bot: deployment-prep: Move to working Mathoid service [mediawiki-config] - https://gerrit.wikimedia.org/r/509596 (https://phabricator.wikimedia.org/T221654) (owner: Alex Monk)
[20:14:07] PROBLEM - MariaDB Slave Lag: s8 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 840.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:48:29] PROBLEM - Check Varnish expiry mailbox lag on cp3035 is CRITICAL: CRITICAL: expiry mailbox lag is 2065004 https://wikitech.wikimedia.org/wiki/Varnish
[21:00:35] (PS42) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[21:00:42] (CR) Mathew.onipe: icinga: create and apply cirrus config check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[21:01:53] (CR) jerkins-bot: [V: -1] icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[21:14:39] (PS43) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[21:38:51] (CR) Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/16474/" [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[21:40:28] (PS8) Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832)
[22:21:47] RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 238285 https://wikitech.wikimedia.org/wiki/Varnish
[22:23:15] RECOVERY - MariaDB Slave Lag: s8 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:53:15] Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors
[22:57:11] (PS1) Framawiki: quarry: nginx conf for custom 50x error pages [puppet] - https://gerrit.wikimedia.org/r/509608 (https://phabricator.wikimedia.org/T223018)
[22:59:57] (PS2) Framawiki: quarry: nginx conf for custom 50x error pages [puppet] - https://gerrit.wikimedia.org/r/509608 (https://phabricator.wikimedia.org/T223018)
[23:29:37] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.