[00:05:58] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash missing most messages from mediawiki (Aug 2019) - https://phabricator.wikimedia.org/T230847 (10Krinkle) I've created an initial draft at . Some open questions stil... [00:06:32] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash missing most messages from mediawiki (Aug 2019) - https://phabricator.wikimedia.org/T230847 (10Krinkle) a:03fgiunchedi [00:37:36] 10Operations, 10Wikimedia-Incident: Investigate all-site outage on 2019-12-30 (HTTP 504 error) - https://phabricator.wikimedia.org/T241573 (10Krinkle) 05Open→03Resolved [00:38:08] 10Operations, 10Traffic, 10observability, 10Wikimedia-Incident: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (10Krinkle) [00:38:40] 10Operations, 10Traffic, 10observability, 10Wikimedia-Incident: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (10Krinkle) Looks like follow-up from T241573. [00:49:38] 10Operations, 10ContentTranslation: ContentTranslation: Publishing translation takes much time - https://phabricator.wikimedia.org/T245461 (10Zoranzoki21) @Acamicamacaraca reported on Discord server of Serbian Wikipedia that he tried to translate https://en.wikipedia.org/wiki/European_Film_Awards on Serbian Wi... [00:53:19] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Krinkle) [01:46:23] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:48:27] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:51:51] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:55:13] PROBLEM - Host analytics1065 is DOWN: PING CRITICAL - Packet loss = 100% [02:57:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 32 probes of 527 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:59:07] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1233.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:17:32] (03PS1) 10CDanis: LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/572766 [04:19:00] (03PS2) 10CDanis: LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/572766 [04:20:55] (03PS3) 10CDanis: LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/572766 [04:33:21] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:30:17] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2020-02-14 
05:11:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [05:32:47] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [06:02:55] 10Operations: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10Marostegui) There are lots of references to m4-master on Puppet (as it used to be used by eventlogging) so I believe this is something for #analytics to tackle :) [06:07:25] (03PS1) 10Marostegui: dbproxy1002: Decommission this host. [puppet] - 10https://gerrit.wikimedia.org/r/572770 (https://phabricator.wikimedia.org/T245384) [06:08:27] (03PS1) 10Marostegui: wmnet: Remove production dns for dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/572771 (https://phabricator.wikimedia.org/T245384) [06:08:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:01] (03CR) 10Marostegui: [C: 03+2] dbproxy1002: Decommission this host. 
[puppet] - 10https://gerrit.wikimedia.org/r/572770 (https://phabricator.wikimedia.org/T245384) (owner: 10Marostegui) [06:11:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production dns for dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/572771 (https://phabricator.wikimedia.org/T245384) (owner: 10Marostegui) [06:13:23] (03PS1) 10Marostegui: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/572772 [06:13:32] (03PS2) 10Marostegui: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/572772 [06:14:19] (03CR) 10Marostegui: "I am reverting this patch, the previous one (depooling esams) was merged, so it is waiting for authdns-update to run. I was about to merge" [dns] - 10https://gerrit.wikimedia.org/r/572772 (owner: 10Marostegui) [06:17:11] (03CR) 10Marostegui: [C: 03+2] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/572772 (owner: 10Marostegui) [06:21:13] 10Operations, 10ops-eqiad, 10decommission: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Marostegui) a:05Marostegui→03Jclark-ctr [06:21:25] 10Operations, 10ops-eqiad, 10decommission: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Marostegui) Host ready for on-site steps [06:23:21] (03PS1) 10Marostegui: report_users: Remove dbproxy1002 IP [software] - 10https://gerrit.wikimedia.org/r/572774 (https://phabricator.wikimedia.org/T245384) [06:24:26] (03CR) 10Marostegui: [C: 03+2] report_users: Remove dbproxy1002 IP [software] - 10https://gerrit.wikimedia.org/r/572774 (https://phabricator.wikimedia.org/T245384) (owner: 10Marostegui) [06:25:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1107 with weight 100 and weight 10 in API for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10438 and previous config saved to /var/cache/conftool/dbconfig/20200218-062459-marostegui.json [06:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:25:04] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [06:27:48] !log Stop haproxy on dbproxy1007 - T245385 [06:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:52] T245385: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 [06:31:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/572682 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [06:36:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572667 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [06:38:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1107 100 -> 200 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10439 and previous config saved to /var/cache/conftool/dbconfig/20200218-063819-marostegui.json [06:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:23] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [06:39:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/572707 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [06:39:50] !log Remove wikiadmin2 from pc1007, pc1008, pc1009 and pc1010 T243512 [06:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:54] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 [06:48:43] 10Operations, 10Traffic: Session resumption seems to be broken in ATS for TLSv1.3 - https://phabricator.wikimedia.org/T245419 (10Vgutierrez) This time looks like `openssl s_client -reconnect` is at fault here. It doesn't seem to be working as expected with TLSv1.3: ` vgutierrez@traffic-cache-atstext-buster:~$... 
[06:49:10] 10Operations, 10Traffic: Session resumption seems to be broken in ATS for TLSv1.3 - https://phabricator.wikimedia.org/T245419 (10Vgutierrez) 05Open→03Declined [06:49:16] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) [06:52:02] (03PS1) 10Muehlenhoff: Add library hint for boost1.67 [puppet] - 10https://gerrit.wikimedia.org/r/572775 [06:52:15] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [06:57:07] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for boost1.67 [puppet] - 10https://gerrit.wikimedia.org/r/572775 (owner: 10Muehlenhoff) [07:00:47] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Joe) a:05Joe→03None @RLazarus and @Dzahn can you pick up the task of putting these into rotation? They will eventually replace mw1221-mw1258, and I think we can go 1:1 in... 
[07:05:55] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:07:21] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:10:09] (03Abandoned) 10Muehlenhoff: role::labs::nfs::secondary: Add Ferm rules for DRBD [puppet] - 10https://gerrit.wikimedia.org/r/392430 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [07:13:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:53] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:14:22] (03PS7) 10Muehlenhoff: Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 [07:14:24] (03CR) 10Muehlenhoff: Add script to track OS migrations status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [07:34:19] !log powercycle analytics1065 (crashed hours ago, no mgmt console available, no ssh) [07:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:34] 10Operations, 10Analytics: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10elukey) [07:35:59] 10Operations, 10Analytics: Remove references to m4-master - https://phabricator.wikimedia.org/T245238 (10elukey) My bad, I thought that this task was only for the DNS change :) [07:36:45] RECOVERY - Host analytics1065 is UP: PING OK - Packet loss = 0%, 
RTA = 0.30 ms [07:43:14] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn (superset, turnilo, hue) - https://phabricator.wikimedia.org/T244410 (10elukey) Done! [07:43:17] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn (superset, turnilo, hue) - https://phabricator.wikimedia.org/T244410 (10elukey) 05Open→03Resolved [07:47:08] (03CR) 10Muehlenhoff: Add script to track OS migrations status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [08:06:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10440 and previous config saved to /var/cache/conftool/dbconfig/20200218-080623-marostegui.json [08:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:28] T245489: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 [08:08:15] !log Restart MySQL to pick up optimizer_switch changes - T245489 [08:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10441 and previous config saved to /var/cache/conftool/dbconfig/20200218-080952-marostegui.json [08:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:54] (03PS1) 10Marostegui: production.my.cnf: Disable rowid_filter on 10.4 [puppet] - 10https://gerrit.wikimedia.org/r/572811 (https://phabricator.wikimedia.org/T245489) [08:19:51] (03CR) 10Marostegui: [C: 03+2] "PCC looks as expected: https://puppet-compiler.wmflabs.org/compiler1003/20842/" [puppet] - 10https://gerrit.wikimedia.org/r/572811 
(https://phabricator.wikimedia.org/T245489) (owner: 10Marostegui) [08:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10442 and previous config saved to /var/cache/conftool/dbconfig/20200218-082306-marostegui.json [08:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:12] T245489: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 [08:36:09] (03PS6) 10Ema: cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 [08:46:47] (03PS7) 10Ema: cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 [08:55:52] (03CR) 10Ema: "noop https://puppet-compiler.wmflabs.org/compiler1003/20844/" [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [08:57:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options', diff saved to https://phabricator.wikimedia.org/P10443 and previous config saved to /var/cache/conftool/dbconfig/20200218-085713-marostegui.json [08:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:42] jouncebot: now [08:57:42] No deployments scheduled for the next 3 hour(s) and 2 minute(s) [08:59:29] !log Remove wikiadmin2 grants from es1 [08:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:34] !log Remove wikiadmin2 grants from es1 T243512 [08:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:37] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 [09:00:58] dcausse, gehel: I’m here if you need me for the rewrite rules :) [09:01:10] Lucas_WMDE: thanks! 
:) [09:01:10] Lucas_WMDE: here as well [09:02:29] (03PS5) 10Gehel: Support /entity/ and other Wikidata URLs for Commons [puppet] - 10https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: 10Smalyshev) [09:03:36] Lucas_WMDE: let's move to #wikimedia-discovery, less background noise [09:07:21] !log rm /var/log/exim4/paniclog on cumin1001 to clear OOM from last week error [09:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:06] !log disabling puppet on mw* to deploy apache config change - T222321 [09:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:10] T222321: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 [09:10:50] (03CR) 10Gehel: [C: 03+2] Support /entity/ and other Wikidata URLs for Commons [puppet] - 10https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: 10Smalyshev) [09:13:16] (03PS5) 10Gehel: [wikidata] provide better link to statement information [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: 10DCausse) [09:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10444 and previous config saved to /var/cache/conftool/dbconfig/20200218-091343-marostegui.json [09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:48] T245489: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 [09:16:57] (03CR) 10Gehel: [C: 03+2] [wikidata] provide better link to statement information [puppet] - 10https://gerrit.wikimedia.org/r/552061 (https://phabricator.wikimedia.org/T203397) (owner: 10DCausse) [09:27:56] !log re-enable puppet on mw* - T222321 [09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:28:00] T222321: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 [09:30:05] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7441 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:30:51] hm, that’s a strange spike pattern in those memcached errors [09:31:04] is that related? how could it be? [09:31:13] probably not? [09:31:20] first spike 9:13 UTC [09:31:32] I think it might be effie testing [09:31:33] looks like there were spikes before we started, so probably not [09:31:37] lemme check [09:31:47] (03PS1) 10Ema: tlsproxy: remove lua_support [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) [09:32:03] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&fullscreen&panelId=9 [09:32:05] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 10 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:32:11] it is mwdebug1001 only afaics gehel [09:32:16] all good [09:32:51] that's where we tested our change, could this be wrong? [09:33:13] I can't imagine how those rewrite rules would have an impact on memcached [09:33:46] nah I don't think so, plus I see only mcrouter errors from mwdebug, by now puppet should have already run on some mws no? [09:34:19] it should, yep [09:34:28] next spike already visible in Grafana – happens every five minutes so far [09:34:30] let's keep an eye open... [09:34:38] odd [09:34:57] from logstash it is mwdebug only, so yeah I am 99% sure it is related to the ongoing tests [09:35:05] it correlates with TKOs from mcrouter [09:35:19] elukey: I'll trust you on this one ... 
[09:35:52] gehel: let's keep an eye in general since apache changes are always "fun", if more hosts alarms then let's worry :) [09:36:03] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5542 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:36:06] * gehel is scared of apache [09:38:15] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/20847/" [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [09:39:00] ACKNOWLEDGEMENT - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2020-02-14 05:11:28 Jcrespo known issue, working on it https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:40:01] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.093e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:40:22] (03CR) 10Filippo Giunchedi: "In previous cases where LVS was overwhelmed IIRC we got pages due to unavailable services, do you know if this alert can act as early warn" [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [09:40:51] hm, now the rate in Grafana isn’t going down anymore :S [09:41:59] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 8 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:49:57] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.096e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:51:57] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 19 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:56:19] what's going on? [09:57:04] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash missing most messages from mediawiki (Aug 2019) - https://phabricator.wikimedia.org/T230847 (10fgiunchedi) Thanks @Krinkle, to answer your questions: * Yes logs were in Kafka and fully recovered once the consumers caught up * We did detect i... [09:59:24] <_joe_> marostegui: I would guess effie did some tests? [10:00:19] don't know [10:00:27] see above [10:00:39] it is all from mwdebug1001, matches with TKOs [10:00:47] aaah ok, thanks :) [10:00:53] sorry for panicking [10:00:56] <_joe_> yes [10:01:09] nono please, pings are always good :) [10:02:11] <_joe_> I'll exclude mwdebug1001 from the dashboard [10:04:55] +1 [10:07:26] <_joe_> I can do that in logstash, not in grafana :/ [10:14:14] (03CR) 10Vgutierrez: [C: 03+1] cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [10:15:22] (03CR) 10Jbond: [C: 03+2] profile::ganeti: update the permissions of the users file [puppet] - 10https://gerrit.wikimedia.org/r/572667 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:15:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This goes against our puppet coding rules." 
[puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [10:19:39] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.483e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:21:24] (03PS1) 10Jbond: profile::ganeti: update the permissions of the users file [puppet] - 10https://gerrit.wikimedia.org/r/572829 (https://phabricator.wikimedia.org/T242910) [10:21:37] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 13 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:23:37] (03CR) 10Vgutierrez: "hmm at least for clouds instances this is going to be mainly a NOOP, they're picking the nginx-extras variant, so that's going to result i" [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [10:24:27] (03CR) 10Jbond: [C: 03+2] profile::ci::docker: manage all group membership in data module [puppet] - 10https://gerrit.wikimedia.org/r/572707 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:24:41] (03PS3) 10Jbond: profile::ci::docker: manage all group membership in data module [puppet] - 10https://gerrit.wikimedia.org/r/572707 (https://phabricator.wikimedia.org/T242910) [10:25:08] (03CR) 10Jbond: [C: 03+2] profile::ganeti: update the permissions of the users file [puppet] - 10https://gerrit.wikimedia.org/r/572829 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:25:13] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 109111880 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:25:33] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6877 gt 5000 
https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:27:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 26848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:27:33] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:30:12] (03CR) 10Jbond: [C: 03+2] openstack::clientpackages::mitaka::buster: change notice to warning [puppet] - 10https://gerrit.wikimedia.org/r/572696 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:35:31] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6258 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:36:16] (03PS1) 10Giuseppe Lavagetto: envoy: split base profile out of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/572831 (https://phabricator.wikimedia.org/T244843) [10:36:17] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [10:36:22] (03PS1) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [10:37:29] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 960 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:37:55] (03PS2) 10Filippo Giunchedi: varnish: use journald for varnishlog consumers [puppet] - 
10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) [10:38:23] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1003/20848/restbase1020.eqiad.wmnet/ seems to DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/572831 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:39:43] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:43:25] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6867 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:47:23] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 6 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:48:05] (03CR) 10Jbond: [C: 03+2] profile::prometheus::ops_mysql: change exec to a system timer [puppet] - 10https://gerrit.wikimedia.org/r/572691 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:49:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase API weight for db1107 15 -> 25 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10445 and previous config saved to /var/cache/conftool/dbconfig/20200218-104958-marostegui.json [10:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:03] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [10:54:15] (03PS3) 10Filippo Giunchedi: varnish: use journald for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) [10:56:01] (03CR) 10Ema: [C: 03+1] envoy: split base profile 
out of tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/572831 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:56:34] (03CR) 10Ema: [C: 03+1] varnish: use journald for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [10:58:57] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: use journald for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [10:59:59] (03CR) 10Jbond: [C: 03+1] Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [11:05:47] (03PS8) 10Ema: cache: remove common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 [11:09:20] (03CR) 10Ema: "> If a parameter is common across all of the infrastructure, maybe it" [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [11:10:54] !log temp. disabling prometheus exporter metadata user for prometheus1003 [11:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:04] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus1003 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:25:22] ^^ this is me testing [11:27:37] !log reenabling prometheus exporter metadata user for prometheus1003 [11:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:30] I'm updating cxserver. Any deploy going on right now? 
[11:30:16] I guess not :) [11:30:29] (03CR) 10KartikMistry: [C: 03+2] Add config for OpusMT [deployment-charts] - 10https://gerrit.wikimedia.org/r/563110 (https://phabricator.wikimedia.org/T234194) (owner: 10KartikMistry) [11:31:34] (03PS4) 10KartikMistry: Add config for OpusMT [deployment-charts] - 10https://gerrit.wikimedia.org/r/563110 (https://phabricator.wikimedia.org/T234194) [11:33:24] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1003 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:35:51] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:15] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) Window reserved on the deployment's page [11:38:14] (03CR) 10Jcrespo: "> Patch Set 4: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/572691 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [11:41:22] (03PS1) 10KartikMistry: Update cxserver to 2020-02-13-162638-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/572841 (https://phabricator.wikimedia.org/T234194) [11:41:38] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:19] I need to deploy other patch too! [11:42:28] Will do that right after first one is done. [11:43:44] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[11:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:26] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-02-13-162638-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/572841 (https://phabricator.wikimedia.org/T234194) (owner: 10KartikMistry) [11:45:42] (03Merged) 10jenkins-bot: Update cxserver to 2020-02-13-162638-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/572841 (https://phabricator.wikimedia.org/T234194) (owner: 10KartikMistry) [11:47:05] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:56] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:22] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] jouncebot: next [11:56:25] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T1200) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T1200) [12:00:04] Urbanecm, Amir1, and MatmaRex: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:01:04] o/ [12:01:16] o/ [12:01:34] * Urbanecm here [12:02:11] Amir1: I'll deploy mine and hand over to you then? 
[12:02:57] (03CR) 10Urbanecm: [C: 03+2] Increase Commons linkpurge rate limit for patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572339 (https://phabricator.wikimedia.org/T245214) (owner: 10Gergő Tisza) [12:03:59] (03Merged) 10jenkins-bot: Increase Commons linkpurge rate limit for patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572339 (https://phabricator.wikimedia.org/T245214) (owner: 10Gergő Tisza) [12:04:26] Urbanecm: sure, thanks! [12:06:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 4b193dd: Increase Commons linkpurge rate limit for patrollers (T245214) (duration: 01m 31s) [12:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] T245214: Ratelimit on Commons is way too aggressive/broken - https://phabricator.wikimedia.org/T245214 [12:06:33] Amir1: the floor is yours [12:06:56] thanks! [12:07:37] (03CR) 10Ladsgroup: [C: 03+2] Start reading for the new term store for clients up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572628 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:08:39] (03PS1) 10KartikMistry: Fix typo: OpusMt -> OpusMT [deployment-charts] - 10https://gerrit.wikimedia.org/r/572847 [12:09:10] (03PS3) 10Ladsgroup: Start reading for the new term store for clients up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572628 (https://phabricator.wikimedia.org/T225057) [12:09:30] Amir1: ping me when SWAT is done. 
[12:09:49] (03CR) 10Ladsgroup: [C: 03+2] Start reading for the new term store for clients up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572628 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:09:53] kart_: sure [12:10:48] (03Merged) 10jenkins-bot: Start reading for the new term store for clients up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572628 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:13:48] 10Operations, 10Wikimedia-Mailing-lists: Creation of North Carolina mailing list - https://phabricator.wikimedia.org/T245462 (10jbond) 05Open→03Resolved a:03jbond List has been created list info page: https://lists.wikimedia.org/mailman/listinfo/wikimedia-us-nc admin login: https://lists.wikimedia.org/m... [12:14:38] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:572628|Start reading for the new term store for clients up to Q1000 (T225057)]] (duration: 01m 05s) [12:14:41] (03PS7) 10Ladsgroup: Wikibase: added config variables to configure entity sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:42] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [12:14:55] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:16:33] (03Merged) 10jenkins-bot: Wikibase: added config variables to configure entity sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:17:24] (03CR) 10Jbond: [C: 03+2] authdns: create dns-admin group to allow pushing dns changes [puppet] - 10https://gerrit.wikimedia.org/r/571957 (owner: 10Jbond) 
[12:17:34] (03PS4) 10Jbond: authdns: create dns-admin group to allow pushing dns changes [puppet] - 10https://gerrit.wikimedia.org/r/571957 [12:18:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:569031|Wikibase: added config variables to configure entity sources (T242087)]], Part I (duration: 01m 06s) [12:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:55] T242087: entitysources: Directly create entitySources config for WMF sites instead of using compat layer - https://phabricator.wikimedia.org/T242087 [12:20:22] (03PS4) 10Ayounsi: Add cookbook to control CF BGP advertisements [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 [12:20:27] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:569031|Wikibase: added config variables to configure entity sources (T242087)]], Part I, take II (the cache issue) (duration: 01m 04s) [12:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:55] (03PS1) 10Ayounsi: Revert "Add prepending and TCP-mss clamping to esams" [homer/public] - 10https://gerrit.wikimedia.org/r/572850 [12:22:17] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:569031|Wikibase: added config variables to configure entity sources (T242087)]], Part II (duration: 01m 03s) [12:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:30] MatmaRex is not around [12:22:43] kart_: do you want to deploy? 
I want to wait before next batch [12:22:45] (03PS6) 10Jbond: admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 [12:23:01] (03CR) 10Ayounsi: [C: 03+2] Revert "Add prepending and TCP-mss clamping to esams" [homer/public] - 10https://gerrit.wikimedia.org/r/572850 (owner: 10Ayounsi) [12:24:08] (03PS7) 10Jbond: admin: add fr-tech-admins and allow dns-admin privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 (https://phabricator.wikimedia.org/T244901) [12:27:56] (03CR) 10Jbond: [C: 03+2] admin: add fr-tech-admins and allow dns-admin privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 (https://phabricator.wikimedia.org/T244901) (owner: 10Jbond) [12:29:55] (03PS1) 10Ladsgroup: Increase the read for clients on the new term store up to Q100K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572853 (https://phabricator.wikimedia.org/T225057) [12:30:23] (03PS5) 10Ladsgroup: Beta wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569204 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:30:35] (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569204 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:31:38] (03Merged) 10jenkins-bot: Beta wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569204 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [12:32:48] hi. am i too late for swat? [12:33:38] I don’t think it’s been formally closed yet [12:33:43] Amir1, kart_: what’s the SWAT status? [12:34:05] MatmaRex: nope, Can you deploy your patch? 
[12:34:28] no, i don't have the access :) [12:34:32] Lucas_WMDE: I'm monitoring things before I move to the next one, kart_ is not around apparently [12:34:38] okay, let me review [12:35:01] (03PS2) 10Ladsgroup: Add DiscussionTools to four wikis in hidden mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572731 (https://phabricator.wikimedia.org/T244870) (owner: 10Bartosz Dziewoński) [12:35:08] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572731 (https://phabricator.wikimedia.org/T244870) (owner: 10Bartosz Dziewoński) [12:35:38] thanks [12:36:01] (03Merged) 10jenkins-bot: Add DiscussionTools to four wikis in hidden mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572731 (https://phabricator.wikimedia.org/T244870) (owner: 10Bartosz Dziewoński) [12:36:15] !log push new Junos to cr2-esams:re1 (backup RE, noop) - T243080 [12:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:19] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [12:37:01] Amir1: I'm waiting to SWAT over, need to deploy service (cxserver) patch :) [12:37:29] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) @Jgreen @Dwisehaupt I have created a new group `fr-tech-admins` and given this group permissions to run the `authdns-update` command this should allow y... 
[12:37:48] kart_: I see, okay [12:38:01] (03PS2) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [12:38:03] (03PS2) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [12:42:04] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:45:01] MatmaRex: it's live in mwdebug1002 [12:45:09] can you take a look and then I deploy [12:45:17] yeah, looking [12:45:33] !log remove graceful-switchover and nonstop-routing from cr2-esams - T243080 [12:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [12:47:35] Amir1: are you sure it's deployed? i don't see the expected change [12:48:10] MatmaRex: sorry, I didn't rebase [12:48:31] MatmaRex: check again, sorry [12:48:47] thanks, i see it now [12:50:20] Amir1: [12:50:26] Amir1: looks good [12:50:29] okay, deploying now [12:50:47] 10Operations, 10Traffic: ATS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (10Vgutierrez) [12:51:53] 10Operations, 10Traffic: ATS TLS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (10Vgutierrez) p:05Triage→03Medium [12:52:42] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:572731|Add DiscussionTools to four wikis in hidden mode (T244870)]] (duration: 01m 04s) [12:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:46] T244870: Deploy v1.0 via query string parameter to target wikis - https://phabricator.wikimedia.org/T244870 [12:53:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT:
[[gerrit:572731|Add DiscussionTools to four wikis in hidden mode (T244870)]], take II (duration: 01m 03s) [12:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:14] Amir1: thank you! [12:55:28] !log EU SWAT done [12:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:34] 10Operations, 10Traffic: ATS TLS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (10Vgutierrez) [12:55:38] MatmaRex: thank you for doing the work! [12:55:45] kart_: I think I'm done here [12:56:43] cool. [12:56:53] (03CR) 10Jbond: "not tested but lgtm" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [12:57:10] (03CR) 10KartikMistry: [C: 03+2] Fix typo: OpusMt -> OpusMT [deployment-charts] - 10https://gerrit.wikimedia.org/r/572847 (owner: 10KartikMistry) [12:57:30] (03Merged) 10jenkins-bot: Fix typo: OpusMt -> OpusMT [deployment-charts] - 10https://gerrit.wikimedia.org/r/572847 (owner: 10KartikMistry) [12:58:47] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[12:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] !log fail vrrp master to cr3-esams - T243080 [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:16] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [13:13:03] alright finally installing the OS on re1 (backup RE) [13:14:02] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-02-18 11:45:45 from db1116.eqiad.wmnet:3317 (917 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [13:16:23] \o/ [13:23:06] !log bump cost of eqiad-esams transport - T243080 [13:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:10] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [13:24:15] !log reboot cr2-esams:re1 (backup) - T243080 [13:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:19] Hostname: re1.cr2-esams - Junos: 17.3R3-S7.2 [13:35:30] (03CR) 10CDanis: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [13:35:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:45] waiting for it to fully boot then will disable BGP neighbors then will failover [13:37:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:37:39] !log deactivate peering/transit on cr2-esams - T243080 [13:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:43] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [13:38:19] (03PS1) 10KartikMistry: Adjust MT Threshold for Assamese to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572871 (https://phabricator.wikimedia.org/T245509) [13:39:22] !log cr2-esams - request chassis routing-engine master switch - T243080 [13:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:49] this might make a bit of noise [13:39:55] but not production impacting [13:40:09] 10Operations, 10Puppet, 10PostgreSQL, 10User-jbond: Investigate puppetdb replication lag - https://phabricator.wikimedia.org/T245510 (10jbond) p:05Triage→03Medium [13:40:24] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is now complete! All varnish logging goes through the loggin... 
[13:40:26] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [13:40:32] linecards are booting up [13:42:49] 10Operations, 10Puppet, 10PostgreSQL, 10User-jbond: Investigate puppetdb replication lag - https://phabricator.wikimedia.org/T245510 (10jbond) [13:43:05] interfaces coming up [13:43:33] BGP established [13:44:28] !log installing OS on cr2-esams:re0 - T243080 [13:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:32] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [13:45:05] 10Operations, 10Puppet, 10PostgreSQL, 10User-jbond: Investigate puppetdb replication lag - https://phabricator.wikimedia.org/T245510 (10jbond) [13:47:05] (03CR) 10Ottomata: [C: 03+1] "Thank you Luca!" [puppet] - 10https://gerrit.wikimedia.org/r/572690 (https://phabricator.wikimedia.org/T245414) (owner: 10Elukey) [13:48:29] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [13:48:50] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 44 probes of 523 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:50:36] (03PS2) 10Muehlenhoff: Remove obsolete Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/572275 (https://phabricator.wikimedia.org/T156955) [13:53:39] we're going to get paged [13:53:45] Telia in esams is saturating [13:53:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/572275 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:55:16] Amir1: you still around? can you revert my SWAT patch?
(or help me debug) [13:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase API weight for db1107 25 -> 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10448 and previous config saved to /var/cache/conftool/dbconfig/20200218-135525-marostegui.json [13:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:30] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [13:55:31] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/572731 [13:55:49] somehow it looks like wgDiscussionToolsEnable is true in production, which is wrong [13:56:45] MatmaRex: sorry, I was in a meeting [13:56:47] let me fix it [13:56:52] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr3-esams.wikimedia.org) https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [13:57:52] * apergos peeks in [13:58:00] apergos: router upgrade fallout [13:58:13] * apergos sees it in the backread now [13:58:28] MatmaRex: Which wiki is enabled [13:58:35] https://nl.wikipedia.org/wiki/Overleg:Hoofdpagina/2019 [13:58:47] i see "Antwoorden" links on this page, they should not be there [13:59:21] https://www.irccloud.com/pastebin/6Fr3xDA4/ [13:59:32] so the value for nlwiki is set to false [13:59:46] (03PS1) 10CDanis: depool esams [dns] - 10https://gerrit.wikimedia.org/r/572880 [14:00:10] (03PS1) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/572881 [14:00:38] Amir1: aaaargh, it's running the old version.
how are wikipedias still on wmf.18 [14:01:17] (03CR) 10CDanis: [C: 03+2] depool esams [dns] - 10https://gerrit.wikimedia.org/r/572880 (owner: 10CDanis) [14:01:19] (03CR) 10BBlack: [C: 03+1] depool esams [dns] - 10https://gerrit.wikimedia.org/r/572880 (owner: 10CDanis) [14:01:20] MatmaRex: It might go live later today [14:01:25] !log re-enable cr2-esams BGP group IX4 - T243080 [14:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [14:01:36] (03Abandoned) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/572881 (owner: 10BBlack) [14:01:45] traffic should come back down [14:01:58] Amir1: we could backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/570142 , or revert the config patch, up to you [14:02:09] !log depool esams [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:46] MatmaRex: see https://phabricator.wikimedia.org/T245202 for why group2 is on .18 - still blocked [14:03:07] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [14:03:14] MatmaRex: It looks small enough for backport if you help with monitoring things [14:03:31] yeah, i'm around [14:05:15] okay, the backport is merging [14:05:24] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [14:05:27] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs -
https://phabricator.wikimedia.org/T213899 (10fgiunchedi) 0... [14:05:35] i'm now checking everything else we worked on in the last weeks to see if anything else critical is missing from wmf.19 [14:05:46] i mean, from wmf.18 [14:06:50] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 523 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:06:54] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: no critical LibreNMS alerts matching re.compile((?i)#page, re.IGNORECASE) https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [14:09:26] Amir1: it merged [14:10:41] MatmaRex: it's live in mwdebug1002, can you take a look [14:10:54] 10Operations, 10RESTBase, 10Wikimedia-Logstash, 10observability: Move restrouter to the logging pipeline - https://phabricator.wikimedia.org/T245515 (10fgiunchedi) [14:11:31] Amir1: yes. looks good [14:11:49] let's go live then [14:12:01] the reply links are loading on https://nl.wikipedia.org/wiki/Overleg:Hoofdpagina/2019?dtenable=1 but not on https://nl.wikipedia.org/wiki/Overleg:Hoofdpagina/2019 [14:12:34] 10Operations, 10Mathoid, 10Wikimedia-Logstash, 10observability: Move mathoid to the logging pipeline - https://phabricator.wikimedia.org/T245516 (10fgiunchedi) [14:13:44] MatmaRex: it looks nice, thank you for the work! [14:14:16] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/DiscussionTools: [[gerrit:572882|wmf.18: Add config option and query parameter to control loading]] (duration: 01m 11s) [14:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:22] Amir1: yeahhhh sorry for messing up [14:14:33] i didn't think to check which version is live [14:14:44] MatmaRex: oh no worries, I do it all the time [14:15:12] okay, it's live now [14:15:48] Amir1: thank you.
i hope i won't have any more reasons to ping you :) [14:16:06] nah, don't worry, that's why I'm here [14:16:19] (03CR) 10Jhedden: [C: 03+1] icinga: remove wmflabs.org HTTPS cert check [puppet] - 10https://gerrit.wikimedia.org/r/572665 (https://phabricator.wikimedia.org/T235252) (owner: 10Arturo Borrero Gonzalez) [14:19:47] (03CR) 10Jhedden: [C: 03+1] cloud: refresh names for DNS servers in eqiad1/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/572213 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [14:21:31] 10Operations, 10netops: Upgrade cr2-esams to JTAC recommended - https://phabricator.wikimedia.org/T237027 (10ayounsi) [14:21:34] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) [14:23:10] (03CR) 10Jhedden: [C: 03+1] Keystone: switch from UUID tokens to fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/572507 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [14:26:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:27:12] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 46.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:28:01] ^ expected [14:28:49] 10Operations: Upgrade rpki VMs to buster - https://phabricator.wikimedia.org/T244585 (10ayounsi) 05Open→03Resolved Done. 
[14:29:12] !log re-disable cr2-esams BGP group IX4 - T243080 [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:18] (03CR) 10Jhedden: Switch more WMCS systems to standard Partman recipes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:29:18] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [14:30:37] (03PS1) 10CDanis: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/572890 [14:30:58] (03PS8) 10Muehlenhoff: Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 [14:31:24] !log cr2-esams - request chassis routing-engine master switch - T243080 [14:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] (03CR) 10Muehlenhoff: Add script to track OS migrations status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [14:33:23] !log re-enable cr2-esams BGP transit/peering - T243080 [14:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:43] o_o [14:34:40] !log restore default esams-eqiad link cost - T243080 [14:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:46] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [14:36:22] (03PS6) 10ArielGlenn: properly handle failure of writing of temp stubs for page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) [14:36:26] (03CR) 10Jhedden: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [14:37:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [14:39:09] !log remove cr2-esams VRRP handicap - T243080 [14:39:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) 05Open→03Declined I'm declining this task following with the Space announcement - https://discuss-space.wmflabs.org/t/next-steps-on-wikimedia-sp... [14:42:27] and of course 2/4 of the interfaces between cr2 and cr3 didn't come back up [14:42:28] ... [14:43:10] uh :_) [14:43:21] (03PS2) 10Ottomata: Configure production and canary release for eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/572100 (https://phabricator.wikimedia.org/T245203) [14:43:54] (03CR) 10BBlack: [C: 03+2] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/572890 (owner: 10CDanis) [14:44:39] (03CR) 10Ottomata: [C: 03+2] Configure production and canary release for eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/572100 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:44:50] (03PS1) 10Muehlenhoff: Add profile::prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/572892 [14:45:15] and bouncing them doesn't solve the issue [14:47:46] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [14:47:46] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[14:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:52] (03CR) 10jerkins-bot: [V: 04-1] Add profile::prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/572892 (owner: 10Muehlenhoff) [14:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] 10Operations, 10Pywikibot: WMFTimeoutException on non-existent files - https://phabricator.wikimedia.org/T245374 (10Anomie) > Pywikibot is using POST and not GET, but it fails the same. Most recent attempt resulted in > ` > {"error":{"code":"internal_api_error_WMFTimeoutException","info":"[XknCBQpAAEsAAJ7w@woA... [14:51:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:52:04] <_joe_> XioNoX: ^^ [14:52:10] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:52:12] I think known [14:52:15] yep known [14:52:19] <_joe_> ack [14:52:28] it's the downtime expiring, I'm opening a task for DCops right now [14:53:24] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10ayounsi) p:05Triage→03High [14:53:26] https://phabricator.wikimedia.org/T245520 [14:53:59] !log deploying new 'canary' and 'production' releases for eventgate-analytics. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'analytics' release and nodePort is left as is.) 
- T242861 [14:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:05] T242861: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 [14:54:09] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [14:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Demian) >>! In T226727#5893035, @Qgil wrote: > As a mailing list user and as Discourse user, personally I still think the idea has merit for many reasons,... [14:55:37] (03PS1) 10SBassett: Revert "Also log authevents channel." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572895 [14:56:01] (03CR) 10Muehlenhoff: Switch more WMCS systems to standard Partman recipes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:56:10] !log esams repooled in DNS [14:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:40] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T245520 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:40] ACKNOWLEDGEMENT - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T245520 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:13] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10ayounsi) Can probably be 
combined with T242097. [14:58:52] (03PS3) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [14:58:54] (03PS3) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [14:59:24] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10mark) There are multiple 10G LR optics on-site for sure. Longer distance ones, less so. [15:00:59] (03PS1) 10Ottomata: eventgate-analytics - Bump image version to for readiness probe schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/572897 (https://phabricator.wikimedia.org/T242861) [15:01:08] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:01:25] (03CR) 10jerkins-bot: [V: 04-1] eventgate-analytics - Bump image version to for readiness probe schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/572897 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [15:01:29] (03PS2) 10Ottomata: eventgate-analytics - Bump image version to for readiness probe schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/572897 (https://phabricator.wikimedia.org/T242861) [15:01:43] (03PS2) 10Muehlenhoff: Switch more WMCS systems to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) [15:02:03] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - Bump image version to for readiness probe schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/572897 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [15:02:21] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: 
envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:02:34] (03PS3) 10Andrew Bogott: Keystone: switch from UUID tokens to fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/572507 (https://phabricator.wikimedia.org/T243418) [15:02:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, with one caveat: if we use the same code in deployment-prep, we might not be able to remove hiera after all." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [15:03:31] (03CR) 10CDanis: [C: 03+2] maps.wm.o: reduce TTL from 1D to 10m [dns] - 10https://gerrit.wikimedia.org/r/572274 (owner: 10CDanis) [15:04:00] (03PS3) 10CDanis: maps.wm.o: reduce TTL from 1D to 10m [dns] - 10https://gerrit.wikimedia.org/r/572274 [15:04:08] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: switch from UUID tokens to fernet tokens [puppet] - 10https://gerrit.wikimedia.org/r/572507 (https://phabricator.wikimedia.org/T243418) (owner: 10Andrew Bogott) [15:04:24] 10Operations, 10netops, 10Wikimedia-Incident: Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [15:04:54] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:24] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:06:35] ^ expected [15:06:47] oh codfw, don't worry, we'll use you more soon [15:08:06] !log vgutierrez@puppetmaster1001 conftool action : set/weight=100; selector: dc=eqiad,cluster=cache_text,service=ats-be,name=cp1089.eqiad.wmnet [15:08:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 52.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:17] (03PS1) 10Ottomata: eventgate-analytics Use primary and secondary schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/572900 (https://phabricator.wikimedia.org/T240985) [15:12:22] 10Operations, 10ops-codfw: asw-a-codfw:FPC8 PEM0 flapping - https://phabricator.wikimedia.org/T245458 (10ayounsi) p:05Triage→03High [15:12:34] (03PS4) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [15:12:39] (03PS4) 10Herron: logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) [15:12:55] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics Use primary and secondary schema repos [deployment-charts] - 10https://gerrit.wikimedia.org/r/572900 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [15:13:01] (03CR) 10Jhedden: [C: 03+1] Switch more WMCS systems to standard Partman recipes [puppet] - 
10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [15:14:53] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:23] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1003/20854/" [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:29:03] (03PS9) 10Ema: cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 [15:32:24] (03CR) 10Ema: "> LGTM, with one caveat: if we use the same code in deployment-prep," [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [15:34:18] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:34:19] (03CR) 10Jbond: Add script to track OS migrations status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [15:34:20] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [15:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:41] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [15:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:43] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[15:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:06] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:36:28] !log stopping db1140:s3 instance [15:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:02] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 81.49 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:38:34] (03PS4) 10Giuseppe Lavagetto: profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) [15:38:36] (03PS5) 10Giuseppe Lavagetto: mwdebug: enable envoy-based services proxy [puppet] - 10https://gerrit.wikimedia.org/r/572833 [15:39:08] (03PS1) 10Jcrespo: Revert "backups: Disable s3-eqiad backups until source host is restored" [puppet] - 10https://gerrit.wikimedia.org/r/572901 [15:39:50] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:15] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[15:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:52] (03CR) 10Andrew Bogott: [C: 03+2] nova_fixed_multi: support adding/deleting records in a 'legacy' domain [puppet] - 10https://gerrit.wikimedia.org/r/572122 (https://phabricator.wikimedia.org/T245173) (owner: 10Andrew Bogott) [15:42:02] (03PS3) 10Andrew Bogott: nova_fixed_multi: support adding/deleting records in a 'legacy' domain [puppet] - 10https://gerrit.wikimedia.org/r/572122 (https://phabricator.wikimedia.org/T245173) [15:42:04] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: envoy-based version [puppet] - 10https://gerrit.wikimedia.org/r/572832 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:42:13] (03PS3) 10Andrew Bogott: Designate: start using '.eqiad1.wikimedia.cloud' domain in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/572686 (https://phabricator.wikimedia.org/T245173) [15:44:00] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2057.codfw.wmnet ` The log can... 
[15:45:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [15:47:57] !log dns2001 - stopping bgp to drain service for hw/reimage work - T242017 [15:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:45] !log dns2001 - shutdown for hw/reimage work - T242017 [15:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:38] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:52:56] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:53:12] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash missing most messages from mediawiki (Aug 2019) - https://phabricator.wikimedia.org/T230847 (10Krinkle) 05Open→03Resolved [15:54:46] PROBLEM - Host dns2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:00] (03PS2) 10Ottomata: Configure production and canary release for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/572106 (https://phabricator.wikimedia.org/T245203) [15:57:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdnsrec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:57:44] (03CR) 10Andrew Bogott: [C: 03+2] Designate: start using '.eqiad1.wikimedia.cloud' domain in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/572686 (https://phabricator.wikimedia.org/T245173) (owner: 10Andrew Bogott) [15:59:55] (03CR) 10Ottomata: [C: 03+2] Configure production and canary release for eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/572106 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [16:02:11] !log otto@deploy1001 helmfile [STAGING] Ran 
'apply' command on namespace 'eventgate-main' for release 'canary' . [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] !log deploying new 'canary' and 'production' releases for eventgate-main. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'main' release and nodePort are left as is.) - T242861 [16:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:39] T242861: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 [16:03:14] (03CR) 10Ema: [C: 03+2] cache: consolidate common text/upload hiera [puppet] - 10https://gerrit.wikimedia.org/r/566495 (owner: 10Ema) [16:03:30] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [16:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:23] (03CR) 10Vgutierrez: [C: 03+1] tlsproxy: remove lua_support [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [16:05:06] (03PS2) 10Ema: tlsproxy: remove lua_support [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) [16:07:04] 10Operations, 10Puppet, 10PostgreSQL, 10User-jbond: Investigate puppetdb replication lag - https://phabricator.wikimedia.org/T245510 (10jbond) Currently running the following in tmux on puppetdb2002 ` lang=bash while : ; do printf '%s (%s):\t' "$(date)" $(date +'%s'); psql puppetdb -tc 'select pg_l... [16:08:04] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:42] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . 
[16:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:47] (03CR) 10Ema: [C: 03+2] tlsproxy: remove lua_support [puppet] - 10https://gerrit.wikimedia.org/r/572821 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [16:11:35] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [16:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:45] (03CR) 10Muehlenhoff: Add script to track OS migrations status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [16:12:08] (03PS9) 10Muehlenhoff: Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 [16:12:38] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [16:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:19] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10MarcoAurelio) [16:16:50] (03PS1) 10Ema: fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 [16:18:03] (03PS5) 10Herron: logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) [16:20:37] (03CR) 10Vgutierrez: [C: 03+1] fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 (owner: 10Ema) [16:22:24] (03CR) 10Herron: [C: 03+2] logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:22:37] liw: any news re: the roll-out of the scap changes from last december? 
[16:22:46] (03CR) 10Jbond: [C: 03+1] Add script to track OS migrations status [puppet] - 10https://gerrit.wikimedia.org/r/572251 (owner: 10Muehlenhoff) [16:23:32] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10elukey) Very nice summary, thanks! A couple of questions: > FailoverWithExptimeRoute, where we define how long the values keys updated (eg set/add) dur... [16:23:33] ori, https://phabricator.wikimedia.org/T245530 [16:24:22] (03PS2) 10Ema: fifo-log-tailer: do not convert stdout to io.Writer [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/572905 [16:25:32] (03PS1) 10Muehlenhoff: Add system::role for role::graphite::production [puppet] - 10https://gerrit.wikimedia.org/r/572906 [16:28:13] (03PS1) 10Muehlenhoff: Add system::role for role::kubernetes::worker and role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/572907 [16:29:08] 10Operations, 10ops-codfw: asw-a-codfw:FPC8 PEM0 flapping - https://phabricator.wikimedia.org/T245458 (10Papaul) 05Open→03Resolved PEM 0 has no problem the problem is from PS1-a8 where PEM 0 is plug into. maybe problem from last week upgrade i have no readings on PS1-a8 right now. closing this task [16:31:09] (03PS1) 10Muehlenhoff: Add system::role for role::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/572908 [16:32:13] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:57] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Joe) A few notes: - We cannot really worry too much about stale keys over failovers - we had a system before mcrouter where this was happening regularly... 
[16:33:33] liw: \o/ thank you [16:34:29] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:01] (03PS1) 10Muehlenhoff: Add system::role for role::configcluster [puppet] - 10https://gerrit.wikimedia.org/r/572910 [16:36:27] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson) [16:39:52] (03PS1) 10Muehlenhoff: Add system::role for role::backup::offsite [puppet] - 10https://gerrit.wikimedia.org/r/572911 [16:41:16] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2057.codfw.wmnet'] ` and were **ALL** successful. [16:44:03] (03PS1) 10Jbond: profile::puppetdb::database: add ability to override replication alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [16:44:05] (03PS1) 10Jbond: role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) [16:46:33] (03PS2) 10Jbond: profile::puppetdb::database: add ability to override replication alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [16:46:58] (03PS2) 10Jbond: role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) [16:47:28] (03PS3) 10Jbond: role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) [16:49:30] (03CR) 10Volans: [C: 03+1] "LGTM and I agree with the new values, just a 
nit in the commit message" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [16:49:33] (03CR) 10jerkins-bot: [V: 04-1] profile::puppetdb::database: add ability to override replication alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [16:50:44] (03PS3) 10Jbond: profile::puppetdb::database: add ability to override replication alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [16:54:54] (03PS4) 10Jbond: role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) [16:58:10] (03CR) 10Volans: [C: 03+1] role::puppetmaster::puppetdb: increase replication thresholds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [16:59:55] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add system::role for role::backup::offsite [puppet] - 10https://gerrit.wikimedia.org/r/572911 (owner: 10Muehlenhoff) [17:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT (Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:37] !log resetting ps1-a8-codfw, see T245164 [17:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:41] T245164: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 [17:02:17] (03CR) 10Volans: "I think this might break some WMCS instances as the param defaults are being removed. That aside I agree with the change." 
[puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [17:03:00] PROBLEM - Host dns2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:03:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [17:03:23] (03PS4) 10Jbond: profile::puppetdb::database: add ability to override replication alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [17:03:33] (03Abandoned) 10CRusnov: Separate mgmt interface addresses into appropriately included files [dns] - 10https://gerrit.wikimedia.org/r/532456 (https://phabricator.wikimedia.org/T228387) (owner: 10CRusnov) [17:04:32] (03CR) 10Jcrespo: "Notice: /Stage[main]/Role::Backup::Offsite/System::Role[backup::offsite]/Motd::Script[role-backup::offsite]/File[/etc/update-motd.d/05-rol" [puppet] - 10https://gerrit.wikimedia.org/r/572911 (owner: 10Muehlenhoff) [17:05:05] 10Operations, 10netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10RobH) [17:05:55] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Cmjohnson) eth0 c1a port 6 eth1 c1b port 6 [17:08:13] (03PS1) 10Jbond: add puppetdb passwords [labs/private] - 10https://gerrit.wikimedia.org/r/572918 [17:08:40] ebernhardson: You around? Want someone to monitor Cirrus as we re-roll last week's train to group2. 
[17:08:41] (03CR) 10Jbond: [V: 03+2 C: 03+2] add puppetdb passwords [labs/private] - 10https://gerrit.wikimedia.org/r/572918 (owner: 10Jbond) [17:09:18] (03PS2) 10CRusnov: Fix -extras for netbox 2.7 upgrade [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) [17:09:40] (03CR) 10CRusnov: "> Patch Set 1:" (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) (owner: 10CRusnov) [17:11:05] James_F, ebernhardson: just getting all my monitoring in place. sec [17:11:12] Sure. [17:11:16] Also we should ping SREs. [17:12:38] (03PS4) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [17:12:58] (03PS5) 10Jbond: profile::puppetdb::database: add ability to override replication alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [17:14:16] (03PS6) 10Jbond: profile::puppetdb::database: add ability to override replication alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) [17:14:30] (03PS5) 10Jbond: role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) [17:14:39] (03CR) 10CDanis: [C: 03+2] LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [17:15:56] (03PS3) 10Dzahn: add apt1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/572311 (https://phabricator.wikimedia.org/T244626) [17:17:36] (03CR) 10Dzahn: [C: 03+2] add apt1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/572311 (https://phabricator.wikimedia.org/T244626) (owner: 10Dzahn) [17:18:15] (03CR) 10Volans: [C: 04-1] "LGTM, just a typo" (031 comment) [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) (owner: 10CRusnov) [17:18:41] (03PS2) 10Dzahn: add apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/572377 [17:19:04] (03PS1) 10CDanis: Revert "enable TCP MSS clamping in eqiad/eqord" [homer/public] - 10https://gerrit.wikimedia.org/r/572920 [17:19:12] (03PS3) 10CRusnov: Fix -extras for netbox 2.7 upgrade [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) [17:19:32] (03CR) 10CRusnov: Fix -extras for netbox 2.7 upgrade (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) (owner: 10CRusnov) [17:21:32] (03PS1) 10Ladsgroup: Revert "Beta wikidata: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572921 [17:21:38] (03CR) 10Ladsgroup: [C: 03+2] Revert "Beta wikidata: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572921 (owner: 10Ladsgroup) [17:26:22] (03PS2) 10Ladsgroup: Revert "Beta wikidata: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572921 [17:26:30] (03CR) 10Ladsgroup: [C: 03+2] Revert "Beta wikidata: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572921 (owner: 10Ladsgroup) [17:26:56] (03CR) 10WMDE-leszek: Beta wikidata: Define entity sources configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569204 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [17:27:26] (03Merged) 10jenkins-bot: Revert "Beta wikidata: Define entity sources configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572921 (owner: 10Ladsgroup) [17:27:29] 10Operations, 10ops-esams, 10netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (10RobH) >>! 
In T245520#5893147, @mark wrote: > There are multiple 10G LR optics on-site for sure. Longer distance ones, less so. @Mark, Can you advise where those spare optics are being stored f... [17:27:39] (03CR) 10Dzahn: [C: 03+2] add apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/572377 (owner: 10Dzahn) [17:28:47] !log cp3 (esams edge) - revert GRE MTU mitigations - T232602 [17:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] T232602: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 [17:32:26] (03CR) 10Volans: "One typo and one question, looks good otherwise. Would be helpful to have a compiler result." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [17:32:28] (03CR) 10WMDE-leszek: [C: 04-1] Beta commons: Define entity sources configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569205 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [17:33:24] RECOVERY - MegaRAID on heze is OK: OK: no disks configured for RAID https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:33:30] PROBLEM - Host heze.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:33:48] PROBLEM - Host elastic2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:33:48] PROBLEM - Host elastic2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:34:04] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:34:06] PROBLEM - Host mc2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:34:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:22] PROBLEM - Host mc2021.mgmt is DOWN: 
PING CRITICAL - Packet loss = 100% [17:34:51] [17:35:17] papaul: mgmt switch cable by any chance ?^ [17:35:31] looking which rack that is [17:36:00] there was a log earlier about a codfw psu work [17:36:04] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:36:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:33] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [17:36:45] mutante: https://phabricator.wikimedia.org/T245164 i log the message working on ps1-a8 [17:37:38] thanks bblack and papaul. i missed the log [17:38:12] papaul: ACK :) thx [17:38:37] is gerrit working for you? [17:39:02] and now is back [17:39:07] wfm [17:39:08] I just had that hiccup too [17:39:24] I didn't try gerrit, but enwiki, phab, grafana were all "no route to host" [17:39:34] RECOVERY - Host heze.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [17:39:50] all working now but spooky [17:39:52] RECOVERY - Host elastic2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.68 ms [17:39:52] RECOVERY - Host elastic2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [17:39:58] I could still reach enwiki and phab, only gerrit didn’t work (briefly) [17:40:10] RECOVERY - Host mc2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [17:40:18] RECOVERY - Host mc2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [17:41:35] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for taking care of it" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) (owner: 10CRusnov) [17:44:00] PROBLEM - Check systemd state on netbox1001 is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:22] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01063 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:45:10] pupper failing on the lvses? [17:45:16] *puppet even [17:45:58] We're about to roll the train. No random merges in mw-config, please. :-) [17:46:28] Error while evaluating a Function Call, All exclamation marks in the query parameter must be escaped e.g. \! (file: /etc/puppet/modules/monitoring/manifests/check_prometheus.pp, line: 100, column: 9) [17:47:21] !log re-rolling wmf.19 to all wikis (T233867) with eyes particularly on (T245202) [17:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:26] T245202: RESTBase 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 [17:47:26] T233867: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 [17:47:32] volans: argh, thanks, my fault [17:47:43] cdanis: I was checking if it was related before pinging [17:48:08] yeah it's a check we do in the check_prometheus [17:48:11] your last change has one [17:48:55] bblack: during authdns-update: ssh: connect to host dns2001.wikimedia.org port 22: Connection timed out [17:49:19] (03PS1) 10Dduvall: Revert "Revert "all wikis to 1.35.0-wmf.19"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572926 [17:49:21] (03CR) 10Dduvall: [C: 03+2] Revert "Revert "all wikis to 1.35.0-wmf.19"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572926 (owner: 10Dduvall) [17:49:26] (03PS1) 10CDanis: lvs-cpu-saturation: fix prom query mis-quoting [puppet] - 10https://gerrit.wikimedia.org/r/572927 [17:49:32] mutante: bblack: perhaps the PDU maintenance there? 
[17:50:19] (03Merged) 10jenkins-bot: Revert "Revert "all wikis to 1.35.0-wmf.19"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572926 (owner: 10Dduvall) [17:50:47] cdanis: ah. could be. yea, it's currently 100% packet loss also ICMP [17:51:20] James_F, ebernhardson: syncing wikiversions [17:51:29] (03CR) 10CDanis: [C: 03+2] "PCC verifies fix" [puppet] - 10https://gerrit.wikimedia.org/r/572927 (owner: 10CDanis) [17:51:32] 🤞🏽 [17:51:48] mutante: dns2001 is under hw maint right now [17:51:50] (03CR) 10Herron: "PCC looks good, shows a noop https://puppet-compiler.wmflabs.org/compiler1002/20864/" [puppet] - 10https://gerrit.wikimedia.org/r/571813 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:51:53] as long as the rest complete, you're fine [17:52:11] (03CR) 10jerkins-bot: [V: 04-1] lvs-cpu-saturation: fix prom query mis-quoting [puppet] - 10https://gerrit.wikimedia.org/r/572927 (owner: 10CDanis) [17:52:16] bblack: what's the procedure when it comes back to update it? [17:52:18] bblack: gotcha, ack [17:52:28] just wondering if it's documented and well known or automated [17:52:28] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Re-roll all wikis to 1.35.0-wmf.19 (T233867) [17:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:32] T233867: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 [17:52:35] it will fix itself, there's nothing to do [17:52:41] nice :) [17:52:44] :) [17:53:04] (03PS2) 10CDanis: lvs-cpu-saturation: fix prom query mis-quoting [puppet] - 10https://gerrit.wikimedia.org/r/572927 [17:53:26] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10Aklapper) @Lookd_Up: Hi and welcome! https://phabricator.wikimedia.org/p/Lookd_Up/ is currently not linked to an account created by WMF Office IT, which makes it hard to verify... 
[17:54:10] so far so good [17:54:15] great [17:54:18] (03CR) 10CRusnov: "This PS also fixes the python path for the rq unit." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [17:54:18] [there's some complex hacky stuff in the puppetization actually, that ensure the latest committed dns data is synced to the reimaged server correctly before the gdnsd daemon is allowed to start for the first time, so it cannot accidentally serve stale data due to a situation like this] [17:54:37] ah, good to know. very nice [17:55:44] Oh, yeah, I should backport the array-to-string logspam before I cut the wmf.20 train. [17:55:54] (03CR) 10CDanis: [C: 03+2] lvs-cpu-saturation: fix prom query mis-quoting [puppet] - 10https://gerrit.wikimedia.org/r/572927 (owner: 10CDanis) [17:56:50] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/572025 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [17:57:27] James_F: would be helpful [17:57:38] (03PS1) 10WMDE-leszek: Beta wikidata: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572928 (https://phabricator.wikimedia.org/T242087) [17:57:52] volans: fixed, puppet run on an lvs is successful [17:57:58] marxarelli: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/572930 [17:58:06] ebernhardson, mutante: thanks for helping out with train this morning. lgtm [17:58:14] marxarelli: yup, looks happy on this end [17:58:21] marxarelli: glad it was so smooth. thank you [17:58:28] cdanis: great, are you forcing a run? 
[17:58:35] volans: no [17:58:36] RECOVERY - Host dns2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.96 ms [17:58:37] you may [17:58:42] ack, sgtm [17:58:46] it was just not running [17:58:49] so skipped a run [17:58:52] nothing bad [17:59:11] no need for https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed IMHO this time [18:00:04] cscott, arlolra, subbu, halfak, and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T1800). [18:01:14] !log completed promotion of 1.35.0-wmf.19 to all wikis (T233867) [18:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:18] T233867: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 [18:01:43] !log 1.35.0-wmf.20 was branched at c664b4f1b933d110bd69f074c399695bd6b17d13 for T233868 [18:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:47] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [18:02:03] That's 30 seconds from wmf.19 to wmf.20 steps. ;-) [18:04:38] (03PS4) 10WMDE-leszek: Beta commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569205 (https://phabricator.wikimedia.org/T242087) [18:08:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:39] (03PS1) 10Jforrester: Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 [18:10:07] (03CR) 10Jforrester: [C: 04-1] "Not yet. Target: After 2020-02-19 Z 18:00." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 (owner: 10Jforrester) [18:11:28] (03PS5) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [18:12:29] James_F: does this count as CD? :) [18:12:30] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00598 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:12:35] (03PS2) 10Jforrester: Revert "cirrus: redirect more_like to codfw to rebuild query cache" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572932 [18:12:55] marxarelli: Let's say yes, collect the praise, and go home? :-) [18:13:10] i'm already home [18:13:20] so wfm :) [18:13:23] Well then. [18:13:46] (03PS5) 10WMDE-leszek: Beta commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569205 (https://phabricator.wikimedia.org/T242087) [18:15:12] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10elukey) About the TTL, I'd involve Timo and Aaron. For some keys, that are expensive to generate or that might cause a ton of traffic if regenerated and... [18:15:28] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10Papaul) [18:20:08] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10bd808) >>! In T241961#5890295, @ayounsi wrote: > Any ETA on when it will be fixed? I... 
[18:23:35] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs2002 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2002&var-datasource=codfw+prometheus/ops [18:23:52] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10Jdforrester-WMF) AIUI, options are: A. Heroïc levels of engineering to lash up a pro... [18:25:09] !log Running `scap prep` for 1.35.0-wmf.20 ref. T233868 [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:14] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [18:26:05] (03CR) 10Ayounsi: [C: 03+2] Revert "enable TCP MSS clamping in eqiad/eqord" [homer/public] - 10https://gerrit.wikimedia.org/r/572920 (owner: 10CDanis) [18:26:11] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [18:26:25] (03CR) 10Dzahn: [C: 03+2] ATS: remove commented webperf2001 from backend config [puppet] - 10https://gerrit.wikimedia.org/r/572395 (owner: 10Dzahn) [18:27:34] (03PS2) 10Dzahn: add apt[12]00[12] to partman recipes, flat, standard for VMs [puppet] - 10https://gerrit.wikimedia.org/r/572354 [18:27:48] (03PS3) 10Dzahn: add apt[12]00[12] to partman recipes, flat, standard for VMs [puppet] - 10https://gerrit.wikimedia.org/r/572354 (https://phabricator.wikimedia.org/T244626) [18:28:18] (03CR) 10Dzahn: [C: 
03+2] add apt[12]00[12] to partman recipes, flat, standard for VMs [puppet] - 10https://gerrit.wikimedia.org/r/572354 (https://phabricator.wikimedia.org/T244626) (owner: 10Dzahn) [18:28:43] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3007 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3007&var-datasource=esams+prometheus/ops [18:29:06] ^ already have been mentioned. failure in alert itself [18:33:51] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs4006 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs4006&var-datasource=ulsfo+prometheus/ops [18:34:15] (03CR) 10Dzahn: [C: 04-1] "invalid secret ssl/doc.discovery.wmnet.key" [puppet] - 10https://gerrit.wikimedia.org/r/572378 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [18:34:43] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10Papaul) a:03wiki_willy @wiki_willy ps1-a8 is not stable to still in production. I notice that there is a clicking noise coming from the PDU and the readings are not stabl... 
[18:34:51] (03PS1) 10Jbond: fr-tech: add ability to read contacts file [puppet] - 10https://gerrit.wikimedia.org/r/572936 [18:36:21] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs2003 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2003&var-datasource=codfw+prometheus/ops [18:36:21] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs2005 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2005&var-datasource=codfw+prometheus/ops [18:36:21] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs2001 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [18:37:26] (03PS1) 10Dzahn: add fake key for doc.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/572937 (https://phabricator.wikimedia.org/T210411) [18:38:21] mutante: were you referring to teh cpu core alerts for lvses? [18:38:51] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1014 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1014&var-datasource=eqiad+prometheus/ops [18:39:42] volans: yes, i was. because they were also parse errors [18:40:41] are you fixing them? 
chris is off tomorrow so if noone is fixing them maybe we should revert whatever patch broke them :) [18:41:51] volans: ok, i assumed it was still the fix in progress. looking now [18:42:10] dunno, cdanis still around? [18:42:25] I guess revert is ok too as it's a new check [18:42:59] ack. yea, i'll do that then [18:43:00] (03PS2) 10Jbond: fr-tech: add ability to read contacts file [puppet] - 10https://gerrit.wikimedia.org/r/572936 [18:43:14] it's 2 changes though, one was the follow-up [18:43:57] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3006 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3006&var-datasource=esams+prometheus/ops [18:43:58] (03PS5) 10Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [18:44:12] (03CR) 10Jforrester: "This should now be safe to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [18:44:17] (03CR) 10Dzahn: "once added, these became CRIT with " CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total "" [puppet] - 10https://gerrit.wikimedia.org/r/572766 (owner: 10CDanis) [18:44:37] (03PS1) 10Dzahn: Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 [18:44:59] (03CR) 10jerkins-bot: [V: 04-1] Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 (owner: 10Dzahn) [18:46:29] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs4005 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs4005&var-datasource=ulsfo+prometheus/ops [18:46:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/572936 (owner: 10Jbond) [18:48:22] (03CR) 10Muehlenhoff: add apt[12]00[12] to partman recipes, flat, standard for VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572354 (https://phabricator.wikimedia.org/T244626) (owner: 10Dzahn) [18:48:25] (03CR) 10Jbond: [C: 03+2] fr-tech: add ability to read contacts file [puppet] - 10https://gerrit.wikimedia.org/r/572936 (owner: 10Jbond) [18:48:29] (03PS2) 10Dzahn: Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 [18:49:04] (03PS6) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [18:49:11] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.18 (duration: 15m 29s) [18:49:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:03] (03CR) 10jerkins-bot: [V: 04-1] netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [18:51:36] !log jforrester@deploy1001 Started scap: testwiki to 1.35.0-wmf.20 and re-build l10n cache T233868 [18:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:40] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [18:51:46] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs2004 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2004&var-datasource=codfw+prometheus/ops [18:51:46] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [18:52:01] (03CR) 10jerkins-bot: [V: 04-1] Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 (owner: 10Dzahn) [18:52:37] (03PS7) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [18:55:39] (03PS3) 10Dzahn: Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 [18:56:13] (03PS1) 10Jbond: user:dwisehaupt: add alias [puppet] - 10https://gerrit.wikimedia.org/r/572940 (https://phabricator.wikimedia.org/T244901) [18:57:14] (03CR) 10Dzahn: [C: 03+2] Revert "LVS: add 
alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 (owner: 10Dzahn) [18:57:20] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1016 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1016&var-datasource=eqiad+prometheus/ops [18:57:40] (03PS4) 10Dzahn: Revert "LVS: add alert on CPU saturation, which causes pkt drops" [puppet] - 10https://gerrit.wikimedia.org/r/572938 [18:57:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) >! In T244901#5892528, @jbond wrote: > In relation to the contacts.cfg file, we are currently migrating towards using a new tool as such i would like t... [18:58:44] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5003 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5003&var-datasource=eqsin+prometheus/ops [18:59:48] (03CR) 10Jbond: [C: 03+1] Add system::role for role::kubernetes::worker and role::kubernetes::master [puppet] - 10https://gerrit.wikimedia.org/r/572907 (owner: 10Muehlenhoff) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T1900) [19:00:41] (03PS8) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [19:00:51] (03CR) 10Jbond: [C: 03+1] Add system::role for role::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/572908 (owner: 10Muehlenhoff) [19:01:14] 
(03CR) 10Jbond: [C: 03+1] Add system::role for role::graphite::production [puppet] - 10https://gerrit.wikimedia.org/r/572906 (owner: 10Muehlenhoff) [19:01:40] removing the LVS alerts with parse error above. running puppet on icinga [19:01:48] (03CR) 10Jbond: [C: 03+1] Add system::role for role::configcluster [puppet] - 10https://gerrit.wikimedia.org/r/572910 (owner: 10Muehlenhoff) [19:02:14] (03CR) 10Dzahn: [C: 03+2] Add system::role for role::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/572908 (owner: 10Muehlenhoff) [19:02:54] (03CR) 10Dzahn: [C: 03+2] Add system::role for role::graphite::production [puppet] - 10https://gerrit.wikimedia.org/r/572906 (owner: 10Muehlenhoff) [19:03:39] (03CR) 10Dzahn: [C: 03+2] Add system::role for role::configcluster [puppet] - 10https://gerrit.wikimedia.org/r/572910 (owner: 10Muehlenhoff) [19:05:30] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake key for doc.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/572937 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [19:06:25] (03PS2) 10Dzahn: add fake key for doc.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/572937 (https://phabricator.wikimedia.org/T210411) [19:06:27] (03CR) 10CRusnov: "Compiler output" [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [19:10:58] (03PS1) 10Cmjohnson: Adding dhcpd and partman for snapshot1010 [puppet] - 10https://gerrit.wikimedia.org/r/572945 (https://phabricator.wikimedia.org/T241794) [19:11:38] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs4007 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs4007&var-datasource=ulsfo+prometheus/ops [19:11:43] (03PS2) 10Cmjohnson: Adding dhcpd and partman for snapshot1010 
[puppet] - 10https://gerrit.wikimedia.org/r/572945 (https://phabricator.wikimedia.org/T241794) [19:12:26] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10cscott) Just for planning purposes, option C is "18 months out" (very rough estimate)... [19:13:36] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [19:14:56] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10LGoto) [19:16:09] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake key for doc.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/572937 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [19:16:56] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1015 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1015&var-datasource=eqiad+prometheus/ops [19:16:56] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1013 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [19:17:51] (03CR) 10Dzahn: [C: 03+2] add doc.discovery.wmnet for use in envoy config [dns] - 10https://gerrit.wikimedia.org/r/572380 
(https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [19:17:55] (03PS2) 10Dzahn: add doc.discovery.wmnet for use in envoy config [dns] - 10https://gerrit.wikimedia.org/r/572380 (https://phabricator.wikimedia.org/T210411) [19:18:32] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5001 is CRITICAL: bad_data: parse error at char 43: unknown function with name node_cpu_seconds_total https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5001&var-datasource=eqsin+prometheus/ops [19:18:46] ^ these are getting removed on next puppet run [19:18:52] only a few left [19:20:06] (03PS1) 10BBlack: dns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/572948 (https://phabricator.wikimedia.org/T242017) [19:20:38] (03CR) 10Cmjohnson: [C: 03+2] Adding dhcpd and partman for snapshot1010 [puppet] - 10https://gerrit.wikimedia.org/r/572945 (https://phabricator.wikimedia.org/T241794) (owner: 10Cmjohnson) [19:20:41] (03CR) 10BBlack: [V: 03+2 C: 03+2] dns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/572948 (https://phabricator.wikimedia.org/T242017) (owner: 10BBlack) [19:20:52] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10bd808) >>! In T241961#5894194, @cscott wrote: > Just for planning purposes, option C... [19:20:59] (03PS2) 10BBlack: dns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/572948 (https://phabricator.wikimedia.org/T242017) [19:21:04] (03CR) 10BBlack: [V: 03+2 C: 03+2] dns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/572948 (https://phabricator.wikimedia.org/T242017) (owner: 10BBlack) [19:21:42] cmjohnson1: merged yours too [19:21:57] * James_F sighs, waiting for scap. 
[19:22:52] (03CR) 10Herron: [C: 03+2] logstash: remove defalut value from kafka input type field [puppet] - 10https://gerrit.wikimedia.org/r/571813 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:24:36] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10JoeWalsh) @Joe could MediaWiki use a default value other than 0 if a client do... [19:26:35] (03PS3) 10Herron: logstash::collector7 ingest deprecated logs from kafka [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) [19:30:48] !log Running `foreachwiki sql.php php-1.35.0-wmf.19/maintenance/archives/patch-watchlist_expiry.sql` for T244631 [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:53] T244631: Create `watchlist_expiry` table in production after wmf.19 is available - https://phabricator.wikimedia.org/T244631 [19:32:01] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10Jdforrester-WMF) Note that wikitech never had RESTbase and it was "fine", so option D... [19:32:10] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline. Compiler looks ok!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [19:33:46] (03CR) 10Jbond: [C: 03+2] profile::puppetdb::database: add ability to override replication alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/572912 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [19:33:51] (03CR) 10Jbond: [C: 03+2] role::puppetmaster::puppetdb: increase replication thresholds [puppet] - 10https://gerrit.wikimedia.org/r/572913 (https://phabricator.wikimedia.org/T245510) (owner: 10Jbond) [19:36:38] (03PS9) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [19:37:01] (03CR) 10CRusnov: "Thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [19:41:11] !log shutting down dns2001 for 10G card troubleshooting [19:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:25] (03PS1) 10Cmjohnson: Adding macs for es102[0-5] to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/572953 (https://phabricator.wikimedia.org/T241359) [19:50:29] 10Operations, 10Puppet, 10Patch-For-Review, 10PostgreSQL, 10User-jbond: Investigate puppetdb replication lag - https://phabricator.wikimedia.org/T245510 (10jbond) 05Open→03Resolved a:03jbond have increased the icinga limits as replication seems to be working in acceptable limits and puppet updates... [19:51:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:52:18] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10cscott) OK. I don't think option D is officially on our roadmap, just a "nice to hav... [19:52:37] !log jforrester@deploy1001 Finished scap: testwiki to 1.35.0-wmf.20 and re-build l10n cache T233868 (duration: 61m 01s) [19:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:41] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [19:53:39] Finally. [19:55:12] choo choo [19:55:39] (03PS1) 10Jforrester: Group0 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572956 (https://phabricator.wikimedia.org/T233868) [19:57:41] Reedy: Not for 3 minutes. ;-) [19:58:18] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10wiki_willy) @Papaul - if the spare one in storage is the same one, I think we can try replacing it with that first. Thanks, Willy [19:59:00] (03PS2) 10Cmjohnson: Adding macs for es102[0-5] to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/572953 (https://phabricator.wikimedia.org/T241359) [19:59:08] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10Joe) >>! In T241961#5778315, @cscott wrote: > I *think* this is because wikitech does... [20:00:04] James_F and longma: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200218T2000). [20:00:15] Okie-dokie. [20:00:21] longma: You around for back-up? [20:00:28] present [20:00:36] Let's rock. [20:00:55] (03CR) 10Jforrester: [C: 03+2] Group0 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572956 (https://phabricator.wikimedia.org/T233868) (owner: 10Jforrester) [20:01:39] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572956 (https://phabricator.wikimedia.org/T233868) (owner: 10Jforrester) [20:03:19] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.20 T233868 [20:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:24] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [20:03:43] (03CR) 10Cmjohnson: [C: 03+2] Adding macs for es102[0-5] to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/572953 (https://phabricator.wikimedia.org/T241359) (owner: 10Cmjohnson) [20:04:03] longma: LGTM so far. You? [20:04:21] I don't see anything concerning so far [20:05:25] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Krinkle) The Api classes in MediaWiki also have a way to enable caching by default, via... 
[20:06:25] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/includes/libs/StatusValue.php: T245155 StatusValue: Fix __toString() to not choke on special parameters (duration: 01m 04s) [20:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:29] T245155: PHP Notice: Array to string conversion from libs/StatusValue.php (via ApiClientLogin) - https://phabricator.wikimedia.org/T245155 [20:07:02] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1003/20874/" [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:07:05] (03CR) 10Herron: [C: 03+2] logstash::collector7 ingest deprecated logs from kafka [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:09:23] (03PS1) 10Cmjohnson: updating dhcp file for snapshot1010 [puppet] - 10https://gerrit.wikimedia.org/r/572958 (https://phabricator.wikimedia.org/T241794) [20:10:15] (03PS2) 10Cmjohnson: updating dhcp file for snapshot1010 [puppet] - 10https://gerrit.wikimedia.org/r/572958 (https://phabricator.wikimedia.org/T241794) [20:10:53] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [20:11:12] OK, I'm declaring the train rolled to group0. [20:11:19] Prod clear for anyone that needs to deploy. [20:11:24] (03PS1) 10Ottomata: Add new LVS services for new eventgate-main and eventgate-analytics ports [puppet] - 10https://gerrit.wikimedia.org/r/572960 (https://phabricator.wikimedia.org/T245203) [20:12:12] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. 
- https://phabricator.wikimedia.org/T245033 (10Krinkle) @JoeWalsh There's a separate task about the MW default where I just a... [20:12:13] (03CR) 10Cmjohnson: [C: 03+2] updating dhcp file for snapshot1010 [puppet] - 10https://gerrit.wikimedia.org/r/572958 (https://phabricator.wikimedia.org/T241794) (owner: 10Cmjohnson) [20:12:56] Umm. [20:13:02] Why is test2wiki not in group0? [20:13:06] (03CR) 10Ottomata: "I believe renaming the services like this is ok, but I think it requires some manual intervention? Alex I'm hoping I can get your attenti" [puppet] - 10https://gerrit.wikimedia.org/r/572960 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [20:15:36] Oh, I see, T182326 [20:15:36] T182326: Make one group1 wiki a client of testwikidata (preferably a test wiki) - https://phabricator.wikimedia.org/T182326 [20:15:38] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10BBlack) After the recent merger https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571928/ - I'm having installer failure on `dns2001` (we did its sibling `dns2002` a few... [20:17:30] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) a:05ayounsi→03None [20:21:02] 10Operations, 10MediaWiki-Debug-Logger, 10Traffic: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (10Jdforrester-WMF) [20:22:39] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10ssastry) Based on what @Joe said, if MediaWiki in the production cluster can access t... 
[20:24:16] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Papaul) a:03Papaul [20:25:11] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Lauren Dickinson - https://phabricator.wikimedia.org/T245524 (10Lookd_Up) Hi @Aklapper: Thanks for your quick reply! And apologies for not making this request using my WMF account. Would it be best for me to delete this request and submit... [20:28:22] (03PS2) 10Ottomata: Enable EventStreamConfig on testwiki and configure test.event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571582 (https://phabricator.wikimedia.org/T242122) [20:29:24] (03PS3) 10Ottomata: Enable EventStreamConfig on testwiki and configure test.event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571582 (https://phabricator.wikimedia.org/T242122) [20:29:40] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Anomie) >>! In T244204#5894396, @Krinkle wrote: > Should we set `setCacheMaxAge()` by de... 
[20:31:49] (03PS4) 10Ottomata: Enable EventStreamConfig on testwiki and configure test.event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571582 (https://phabricator.wikimedia.org/T242122) [20:37:45] (03CR) 10Ottomata: [C: 03+2] Enable EventStreamConfig on testwiki and configure test.event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571582 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [20:38:59] 10Operations, 10ops-eqiad, 10DC-Ops: (no need by provided) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10RobH) [20:39:08] 10Operations, 10ops-eqiad, 10DC-Ops: (no need by provided) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10RobH) [20:39:55] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@e2fe8ca]: respect service name in consumer group T244387 [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:59] T244387: Change-Prop consumer group must respect service name - https://phabricator.wikimedia.org/T244387 [20:40:12] (03CR) 10Anomie: [C: 03+1] Raise minimum log level for 'OAuth' from DEBUG to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572737 (https://phabricator.wikimedia.org/T244185) (owner: 10Krinkle) [20:40:26] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) [20:41:25] ottomata: I assume the service is live and able to be pointed at? [20:41:36] yup, actually i just pulled on mwdebug1001 [20:41:37] works there [20:41:46] e.g. 
[20:41:46] curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' 'https://test.wikipedia.org/w/api.php?action=streamconfigs&format=json&all_settings' [20:42:01] was about to sync-file InitializeSettings.php [20:42:08] then if that's ok, proceed to do metawwiki [20:42:12] then i can stop for a while and work on other stuff [20:42:26] (like getting thigns to actually use it :) ) [20:42:52] James_F: ok if I proceed? [20:43:05] WFM. [20:43:13] (03PS1) 10Dzahn: site: fix comments about rack location of app servers [puppet] - 10https://gerrit.wikimedia.org/r/572966 [20:44:57] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enabling EventStreamConfig extension on testwiki - T242122 (duration: 01m 04s) [20:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:06] T242122: Deploy EventStreamConfig extension - https://phabricator.wikimedia.org/T242122 [20:45:41] (03PS1) 10Ottomata: Enable EventStreamConfig on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572967 (https://phabricator.wikimedia.org/T242122) [20:45:48] looks good! 
proceeding for metawiki [20:46:51] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v [20:46:51] /{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:46:54] (03PS2) 10Ottomata: Enable EventStreamConfig on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572967 (https://phabricator.wikimedia.org/T242122) [20:47:26] ^^ I did just sync a config, but it should be very unrelated [20:47:54] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@e2fe8ca]: respect service name in consumer group T244387 (duration: 07m 59s) [20:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:58] T244387: Change-Prop consumer group must respect service name - https://phabricator.wikimedia.org/T244387 [20:48:29] (03CR) 10Ottomata: [C: 03+2] Enable EventStreamConfig on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572967 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [20:48:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:48:58] James_F: fyi there is a small working dir change on deploy1001 [20:49:02] in mediawiki-staging [20:49:17] -php-1.35.0-wmf.19 [20:49:17] \ No newline at end of file [20:49:17] +php-1.35.0-wmf.20 [20:49:29] i'm ignoring it tho, not sure if it should be that way [20:50:00] Hmm, 
yeah. [20:50:08] I'll deal later, thanks for flagging. [20:50:11] k [20:54:01] (03PS1) 10Cmjohnson: updating dhpd file for es102[0-5]. used the wrong eth port [puppet] - 10https://gerrit.wikimedia.org/r/572970 (https://phabricator.wikimedia.org/T241359) [20:54:35] (03PS2) 10Cmjohnson: updating dhpd file for es102[0-5]. used the wrong eth port [puppet] - 10https://gerrit.wikimedia.org/r/572970 (https://phabricator.wikimedia.org/T241359) [20:54:38] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enabling EventStreamConfig extension on metawiki - T242122 (duration: 01m 03s) [20:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:42] T242122: Deploy EventStreamConfig extension - https://phabricator.wikimedia.org/T242122 [20:56:29] (03CR) 10Cmjohnson: [C: 03+2] updating dhpd file for es102[0-5]. used the wrong eth port [puppet] - 10https://gerrit.wikimedia.org/r/572970 (https://phabricator.wikimedia.org/T241359) (owner: 10Cmjohnson) [20:58:11] ottomata: You clear of prod? 
[21:01:18] (03PS1) 10Dzahn: add new eqiad mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572975 (https://phabricator.wikimedia.org/T236437) [21:02:54] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10wiki_willy) a:05wiki_willy→03Papaul [21:07:00] !log power down and set incinga downtime on cloudvirt1022 T241884 [21:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:04] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [21:07:38] !log power down and set incinga downtime on cloudvirt1022 T243536 [21:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:42] T243536: cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 [21:07:59] (03PS2) 10Dzahn: add new eqiad mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572975 (https://phabricator.wikimedia.org/T236437) [21:09:06] (03CR) 10RLazarus: [C: 03+1] add new eqiad mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572975 (https://phabricator.wikimedia.org/T236437) (owner: 10Dzahn) [21:12:19] (03PS1) 10Cmjohnson: updating snapshot1010 to raid1-lvm-ext4 cfg [puppet] - 10https://gerrit.wikimedia.org/r/572978 (https://phabricator.wikimedia.org/T241794) [21:12:51] (03PS1) 10Ottomata: EventStreamConfig - allow eventgate to produce error events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572979 (https://phabricator.wikimedia.org/T233629) [21:13:34] (03CR) 10jerkins-bot: [V: 04-1] updating snapshot1010 to raid1-lvm-ext4 cfg [puppet] - 10https://gerrit.wikimedia.org/r/572978 (https://phabricator.wikimedia.org/T241794) (owner: 10Cmjohnson) [21:14:14] (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - allow eventgate to produce error events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572979 
(https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [21:20:45] (03PS6) 10Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [21:20:47] (03PS1) 10Jforrester: Stop setting wgVectorPrintLogo for back-compat. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572980 [21:20:49] (03PS1) 10Jforrester: Stop setting wgLogoHD for back-compat. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572981 [21:20:51] (03CR) 10Dzahn: [C: 03+2] add new eqiad mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572975 (https://phabricator.wikimedia.org/T236437) (owner: 10Dzahn) [21:24:17] (03PS2) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [21:25:49] (03CR) 10Ottomata: "> Patch Set 1: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [21:26:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10Jclark-ctr) Replaced Failed Dimm [21:26:29] !log rollback tcp-mss clamping in eqiad/eqord [21:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:50] going to wait a bit before each routers as it bounces the interfaces and thus BGP [21:32:56] (03PS1) 10Muehlenhoff: Fix broken dhcpd config [puppet] - 10https://gerrit.wikimedia.org/r/572985 [21:34:05] moritzm: arr.. my bad. 
thanks for fix [21:34:27] (03PS1) 10RLazarus: site: Fix racking comments for MW hosts to match what's in netbox [puppet] - 10https://gerrit.wikimedia.org/r/572986 [21:35:15] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) [21:35:24] no worries, the whole dhcp config is fairly brittle, at some point we should simply designate the partman setup in netbox and have the dhcpd config generated [21:35:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix broken dhcpd config [puppet] - 10https://gerrit.wikimedia.org/r/572985 (owner: 10Muehlenhoff) [21:38:21] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) a:03aaron [21:40:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [21:41:35] (03CR) 10Dzahn: [C: 03+1] site: Fix racking comments for MW hosts to match what's in netbox [puppet] - 10https://gerrit.wikimedia.org/r/572986 (owner: 10RLazarus) [21:42:56] (03CR) 10RLazarus: [C: 03+2] site: Fix racking comments for MW hosts to match what's in netbox [puppet] - 10https://gerrit.wikimedia.org/r/572986 (owner: 10RLazarus) [21:43:03] ACKNOWLEDGEMENT - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 Ayounsi https://phabricator.wikimedia.org/T156955 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:43:03] ACKNOWLEDGEMENT - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 Ayounsi https://phabricator.wikimedia.org/T156955 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:43:09] (03PS2) 10RLazarus: site: Fix racking comments for MW hosts to match what's in netbox [puppet] - 
10https://gerrit.wikimedia.org/r/572986 [21:43:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10JHedden) 05Open→03Resolved Thanks, @Jclark-ctr. I've confirmed the new DIMM is seen by the OS and the memory count is correct. [21:43:51] (03Abandoned) 10Dzahn: site: fix comments about rack location of app servers [puppet] - 10https://gerrit.wikimedia.org/r/572966 (owner: 10Dzahn) [21:45:35] that reminds me that I haven't seen the librenms bot in a while here [21:45:53] what happen [21:46:05] (03CR) 10Dzahn: [C: 03+1] profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [21:46:28] (03PS1) 10Jhedden: nova: add cloudvirt1022 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/572990 (https://phabricator.wikimedia.org/T243536) [21:48:09] (03CR) 10Jhedden: [C: 03+2] nova: add cloudvirt1022 to scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/572990 (https://phabricator.wikimedia.org/T243536) (owner: 10Jhedden) [21:48:19] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) [21:48:49] dunno [21:51:21] the logs looks good on a restart [21:51:38] might need to do more verbose logs [21:52:17] Is anybody available who can SWAT an UBN? https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/572991/ [21:54:26] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [21:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:25] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:56:44] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:41] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10Jclark-ctr) updated dell ticket with new tsr report [21:58:45] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10bd808) >>! In T241961#5894457, @ssastry wrote: > Based on what @Joe said, if MediaWik... [22:00:07] (03PS1) 10RLazarus: Site: Assign mw1349 as mediawiki::appserver [puppet] - 10https://gerrit.wikimedia.org/r/572992 (https://phabricator.wikimedia.org/T236437) [22:00:20] (03PS2) 10RLazarus: site: Assign mw1349 as mediawiki::appserver [puppet] - 10https://gerrit.wikimedia.org/r/572992 (https://phabricator.wikimedia.org/T236437) [22:00:24] (03PS2) 10Ottomata: EventStreamConfig - allow eventgate to produce error events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572979 (https://phabricator.wikimedia.org/T233629) [22:02:06] PROBLEM - Host 2620:0:860:3:b226:28ff:fed9:f3c0 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:38] (03CR) 10Dzahn: [C: 03+1] site: Assign mw1349 as mediawiki::appserver [puppet] - 10https://gerrit.wikimedia.org/r/572992 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [22:02:50] (03CR) 10RLazarus: [C: 03+2] site: Assign mw1349 as mediawiki::appserver [puppet] - 10https://gerrit.wikimedia.org/r/572992 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [22:02:54] (03PS1) 10Dzahn: site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 
(https://phabricator.wikimedia.org/T241852) [22:03:18] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:03:37] (03CR) 10jerkins-bot: [V: 04-1] site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [22:04:22] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10Aroraakhil) 05Resolved→03Open @Dzahn, I am reopening this request, as I am unable to use "hadoop" on the stat machines. @leila... [22:04:26] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10Papaul) I open a request ticket (TICKET NO.1578279) with CY1 to assistance me on unplugging the old PDU and plugging the new one tomorrow the 19th at 10:30 Dallas time [22:08:46] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:10:11] (03PS1) 10Clarakosi: Enable EventBus Run Job API on only jobrunner clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) [22:10:46] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10Ottomata) Hi @Aroraakhil, I just created your kerberos account principal. You should receive an email with instructions. See also h... 
[22:12:04] (03PS1) 10Ottomata: Set krb: present for user aarora [puppet] - 10https://gerrit.wikimedia.org/r/572995 (https://phabricator.wikimedia.org/T241096) [22:15:57] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (10leila) @Nuria if you're still around today, can you please review and approve? If not, @Ottomata who can review while Nuria is not around? Thanks! [22:17:48] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (10Ottomata) I know Nuria is going to be off for the next few days. Do I have powers to approve this? If I do, then I approve! [22:18:06] (03PS1) 10EBernhardson: airflow: Expand sudo rights to analytics-search user [puppet] - 10https://gerrit.wikimedia.org/r/572997 [22:18:44] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10Aroraakhil) Hi @Ottomata, Thank you so much for your prompt response and help. I am now able to use "hadoop" s... [22:21:27] (03PS1) 10Jforrester: [DNM] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [22:26:46] (03CR) 10Jforrester: "We might need to implement a merge strategy for this. Meh." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [22:30:12] !log Upgrading Netbox to 2.7.4 [22:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:21] (03PS2) 10Jforrester: Raise minimum log level for 'OAuth' from DEBUG to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572737 (https://phabricator.wikimedia.org/T244185) (owner: 10Krinkle) [22:30:48] (03CR) 10Jforrester: [C: 03+2] Raise minimum log level for 'OAuth' from DEBUG to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572737 (https://phabricator.wikimedia.org/T244185) (owner: 10Krinkle) [22:31:54] (03CR) 10Krinkle: [DNM] Merge wgMinervaCustomLogos into wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [22:32:02] (03Merged) 10jenkins-bot: Raise minimum log level for 'OAuth' from DEBUG to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572737 (https://phabricator.wikimedia.org/T244185) (owner: 10Krinkle) [22:32:43] (03CR) 10Jforrester: [C: 04-1] "This needs I5c14d6734f08e6beab001c62b69c2f5791e9ee14 to be on group1 in prod (or Incubator will break)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [22:35:52] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10RobH) [22:37:28] 10Operations, 10ops-codfw, 10DC-Ops: (no due date provided) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10RobH) [22:37:34] 10Operations, 10ops-codfw, 10DC-Ops: (no due date provided) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10RobH) [22:37:42] (03PS1) 10RLazarus: Add fake keys for mw13[49-84].eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573002 (https://phabricator.wikimedia.org/T236437) [22:37:46] (03CR) 10CRusnov: [C: 03+2] Fix -extras for netbox 2.7 upgrade [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/572116 (https://phabricator.wikimedia.org/T244281) (owner: 10CRusnov) [22:38:06] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244185 Raise minimum log level for 'OAuth' from DEBUG to INFO (duration: 01m 04s) [22:38:09] 10Operations, 10ops-codfw, 10DC-Ops: (no due date provided) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10RobH) [22:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:11] T244185: OAuth logs getting quite a lot bigger - https://phabricator.wikimedia.org/T244185 [22:38:21] (03CR) 10Krinkle: Merge $wgLogo into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [22:38:42] (03PS1) 10EBernhardson: Enable ores_articletopics field for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573003 (https://phabricator.wikimedia.org/T240550) [22:39:22] 10Operations, 10ops-eqiad, 10DC-Ops: (2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T245567 (10RobH) [22:39:27] (03PS10) 10CRusnov: netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) [22:39:33] 10Operations, 10ops-eqiad, 10DC-Ops: (2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10RobH) [22:39:39] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:55] (03CR) 10Jforrester: [C: 04-1] Merge $wgLogo into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [22:40:44] (03CR) 10CRusnov: [C: 03+2] netbox: Update configuration to support v2.7 [puppet] - 10https://gerrit.wikimedia.org/r/572123 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [22:40:55] (03PS2) 10EBernhardson: Enable ores_articletopics field for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573003 (https://phabricator.wikimedia.org/T240550) [22:41:41] (03CR) 10Krinkle: [C: 04-1] "It looks like ContentTranslationSpecialPage.php is still reading from wgMinervaCustomLogos only, so it's header would likely become blank " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [22:42:02] 10Operations: Anycast for webproxies - https://phabricator.wikimedia.org/T242715 (10Volans) +1 for improving HA of them and I agree that the LVS approach seems the saner one If we don't plan to do this anytime soon though, maybe we could make an intermediate step with geodns. We could have `webproxy.discovery.wm... 
[22:42:21] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Bump Netbox revision to v2.7.4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/572025 (https://phabricator.wikimedia.org/T244291) (owner: 10CRusnov) [22:42:43] (03CR) 10Jforrester: [DNM] Merge wgMinervaCustomLogos into wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [22:43:43] (03CR) 10Jforrester: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [22:43:45] 10Operations: Webproxies are a SPOF - https://phabricator.wikimedia.org/T242715 (10ayounsi) [22:43:54] (03CR) 10Ppchelko: [C: 04-1] "Let's also enable it in beta cluster unconditionally. See CommonSettings-labs.php" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [22:45:05] !log crusnov@deploy1001 Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 [22:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:09] T244291: Upgrade Netbox to 2.7 series - https://phabricator.wikimedia.org/T244291 [22:46:24] !log crusnov@deploy1001 Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (duration: 01m 19s) [22:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:23] !log crusnov@deploy1001 Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2) [22:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:43] !log crusnov@deploy1001 Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2) (duration: 01m 19s) [22:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:38] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - 
https://phabricator.wikimedia.org/T241096 (10leila) 05Open→03Resolved [22:50:53] (03CR) 10EBernhardson: [C: 03+2] "Not user facing, has no effect on active requests or processes. Prepares config for a maint script to run." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573003 (https://phabricator.wikimedia.org/T240550) (owner: 10EBernhardson) [22:51:13] !log crusnov@deploy1001 Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3) [22:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:17] T244291: Upgrade Netbox to 2.7 series - https://phabricator.wikimedia.org/T244291 [22:51:25] !log crusnov@deploy1001 Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3) (duration: 00m 11s) [22:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:48] (03Merged) 10jenkins-bot: Enable ores_articletopics field for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573003 (https://phabricator.wikimedia.org/T240550) (owner: 10EBernhardson) [22:52:34] (03PS10) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [22:52:46] (03PS24) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [22:52:48] (03PS7) 10ArielGlenn: properly handle failure of writing of temp stubs for page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) [22:52:53] !log completed upgrading Netbox to 2.7.4 T244291 [22:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:58] (03CR) 10jerkins-bot: [V: 04-1] write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [22:53:10] bah [22:54:48] 
!log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (duration: 01m 03s) [22:54:48] hmmm looks like notme. maybe ci instead. will recheck tomorrow [22:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:40] (03PS1) 10Jforrester: Bump php pointer from 1.35.0-wmf.19 to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573006 [22:57:08] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` elastic2058.codfw.wmnet ` The log can... [22:58:00] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (extra sync for T236104) (duration: 01m 04s) [22:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:04] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 [22:58:25] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (no date provided) rack/setup/install an-druid1001 and druid1007 - https://phabricator.wikimedia.org/T245569 (10RobH) [22:58:33] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (no date provided) rack/setup/install an-druid1001 and druid1007 - https://phabricator.wikimedia.org/T245569 (10RobH) [22:59:17] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:21] (03CR) 10Dzahn: [C: 03+1] Add fake keys for mw13[49-84].eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573002 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [23:02:41] (03CR) 10RLazarus: [V: 03+2 C: 03+2] Add fake keys for 
mw13[49-84].eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573002 (https://phabricator.wikimedia.org/T236437) (owner: 10RLazarus) [23:03:51] (03PS11) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [23:11:07] 10Operations, 10MediaWiki-Debug-Logger, 10Traffic: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (10Krinkle) This has regressed last month as well and was fixed shortly after. Presumably somehing went wrong in the ATS... [23:12:14] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:19] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [23:12:46] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2150.wmnet [23:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:53] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2151.wmnet [23:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:31] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:48] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:16] ^ new install [23:16:26] rlazarus: there we go.. 
hard to avoid them on new installs [23:16:37] womp womp [23:19:11] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:13] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2058.codfw.wmnet'] ` and were **ALL** successful. [23:19:27] PROBLEM - mediawiki-installation DSH group on mw1349 is CRITICAL: Host mw1349 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:28:09] (03PS2) 10Jforrester: [DNM] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [23:28:19] !log running reindex for wikimedia wikis [23:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:34] !log running reindex on mwmaint1002 - T194448 [23:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:38] T194448: Support negatives for the hastemplate keyword in AdvancedSearch - https://phabricator.wikimedia.org/T194448 [23:34:54] (03PS6) 10VolkerE: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) [23:40:20] (03PS2) 10Dzahn: site: add new codfw mw appservers with spare role [puppet] - 10https://gerrit.wikimedia.org/r/572993 (https://phabricator.wikimedia.org/T241852) [23:42:06] (03CR) 10Dzahn: [C: 03+2] ATS: remove commented blubberoid non-discovery record from backend [puppet] - 10https://gerrit.wikimedia.org/r/572396 (owner: 10Dzahn) [23:42:14] (03PS2) 10Dzahn: ATS: remove commented blubberoid non-discovery record from backend [puppet] - 10https://gerrit.wikimedia.org/r/572396 [23:47:49] (03PS2) 10Dzahn: releases: remove port 80 firewall hole 
[puppet] - 10https://gerrit.wikimedia.org/r/572353 [23:50:54] (03PS3) 10Dzahn: releases: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) [23:51:12] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/20875/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572353 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:54:47] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw1349.eqiad.wmnet [23:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet [23:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:26] !log mw1349 - scap pull [23:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:45] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/20876/doc1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572378 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn)