[04:21:17] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[04:21:49] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[04:59:32] Operations, MachineVision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Aklapper) >>! In T240518#5801289, @Mholloway wrote: > Incident report is in review. I assum...
[05:11:02] !log T238305 hardreset cp3051
[05:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:06] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305
[05:13:19] RECOVERY - Host cp3051 is UP: PING OK - Packet loss = 0%, RTA = 83.33 ms
[05:31:49] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[05:32:21] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:25:07] Hi, can someone stop this from being merged?, there's an error https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ConfirmEdit/+/571034/1
[07:48:43] PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 172807.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:49:37] ^ downtime expired, will downtime it a bit more
[07:50:02] Ammarpad: there is a -2 there so it won't get merged
[07:55:45] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Legoktm) p:Medium→High maxlag is intended to tell fully-automated bots to backoff to help serve...
[08:00:47] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:43] (PS2) MarcoAurelio: phabricator: remove firewall holes for port 80 [puppet] - https://gerrit.wikimedia.org/r/569100 (owner: Dzahn)
[16:35:55] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 83157632 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:37:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:40:03] (CR) Krinkle: Throttle override: Editathon in Charolette (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[19:01:11] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[19:01:49] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[19:09:36] Operations, DNS, Traffic, Wikimedia-Apache-configuration: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421 (Krinkle) Open→Resolved a:Krinkle All plain m-dot sub domains under WMF projects now redirect to their www equivalents, including m.wikipedi...
[19:09:52] Operations, DNS, Traffic, Wikimedia-Apache-configuration: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421 (Krinkle) a:Krinkle→None
[19:11:06] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle)
[19:11:24] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle)
[20:21:05] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[20:22:25] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:15:55] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:21:51] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 33 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:41:53] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:42:27] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:19:11] Operations, Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (ayounsi) T244610 too.
[22:19:49] (CR) Ayounsi: "> Patch Set 3:" [homer/public] - https://gerrit.wikimedia.org/r/569636 (owner: Ayounsi)
[22:31:43] (CR) CDanis: [C: +1] "In addition to what Arzhel says, I want to add: this CR is a mitigation for emergency use. Even though it may not be ideal, it is straigh" [homer/public] - https://gerrit.wikimedia.org/r/569636 (owner: Ayounsi)
[22:41:03] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:41:28] Operations, Mail, Phabricator, Regression: Weekly phabricator-reports mail cronjob broken since January 2020 - https://phabricator.wikimedia.org/T244677 (Aklapper)
[22:42:23] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw