[00:02:04] (PS1) Jgreen: switch payments.wm.o cname to payments-codfw.wm.o [dns] - https://gerrit.wikimedia.org/r/570970 (https://phabricator.wikimedia.org/T244610)
[00:03:34] (CR) Jgreen: [C: +2] switch payments.wm.o cname to payments-codfw.wm.o [dns] - https://gerrit.wikimedia.org/r/570970 (https://phabricator.wikimedia.org/T244610) (owner: Jgreen)
[00:03:46] Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (AronManning)
[00:05:03] !log switched payments.wikimedia.org to codfw datacenter due to T244610
[00:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:07] T244610: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610
[00:06:30] (PS3) Dzahn: installserver: create new role without HTTP/APT, rename existing role [puppet] - https://gerrit.wikimedia.org/r/570969 (https://phabricator.wikimedia.org/T224576)
[00:06:37] (PS1) Dzahn: introduce new role to install nginx and APT repo without DHCP/TFTP [puppet] - https://gerrit.wikimedia.org/r/570971 (https://phabricator.wikimedia.org/T224576)
[00:11:37] Operations, observability, serviceops, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (Dzahn) VM looks up and running, all green in Icinga. It has the check for grafana.wikimedia.org on it. (Though this means the alert is duplicated kind of since it checks...
[00:15:11] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T244407 (Usmanino)
[00:17:45] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T244407 (Reedy)
[00:18:03] Operations, serviceops-radar, vm-requests: vm requests for APT repo / webserver - https://phabricator.wikimedia.org/T244626 (Dzahn)
[00:20:44] Operations, serviceops: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn)
[00:23:58] Operations, Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (Dzahn)
[00:24:02] Operations, Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (Dzahn)
[00:24:15] Operations, serviceops-radar, vm-requests: vm requests for APT repo / webserver - https://phabricator.wikimedia.org/T244626 (Dzahn)
[00:24:18] Operations, Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (Dzahn)
[00:26:06] Operations, serviceops-radar, vm-requests: vm requests for APT repo / webserver - https://phabricator.wikimedia.org/T244626 (Dzahn)
[00:27:21] Operations, serviceops-radar, vm-requests: vm requests for APT repo / webserver - https://phabricator.wikimedia.org/T244626 (Dzahn) regarding disk requirements: from current install server: 40G wikimedia 31G junos 7.7G firmware 2.0G tftpboot
[00:31:17] (Abandoned) Dzahn: switch apt.wikimedia.org from install1002 to install1003 [dns] - https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[00:35:30] (PS2) Dzahn: wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576)
[00:36:58] (CR) Dzahn: "just like last time over 4 years ago for carbon 9fff7ef9e3beee0db" [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[00:38:17] (CR) jerkins-bot: [V: -1] wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[00:42:43] (PS3) Dzahn: wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576)
[00:45:22] (CR) jerkins-bot: [V: -1] wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[00:52:20] Operations, SRE-Access-Requests: Requesting access to Deployment for Clarakosi - https://phabricator.wikimedia.org/T244381 (Dzahn)
[00:55:32] Operations, SRE-Access-Requests: Requesting access to Deployment for Clarakosi - https://phabricator.wikimedia.org/T244381 (Dzahn)
[01:00:11] (PS4) Dzahn: wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576)
[01:03:10] (CR) jerkins-bot: [V: -1] wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:03:23] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 214609688 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:07] (CR) Dzahn: ""expected ipresolve("install1003.eqiad.wmnet", "6") to have returned "2620:0:861:103:10:64:32:120" instead of "2620::861:103:10:64:32:120"" [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:05:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:19] (PS1) Dzahn: admins: add Clara Andrew-Wani to MW deployers [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381)
[01:10:19] (PS5) Dzahn: wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576)
[01:11:24] (CR) jerkins-bot: [V: -1] admins: add Clara Andrew-Wani to MW deployers [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381) (owner: Dzahn)
[01:12:52] (PS2) Dzahn: admins: add Clara Andrew-Wani to MW deployers [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381)
[01:14:50] (PS1) Bstorm: k8s-resources: resources must be converted to str from Decimal [software/tools-webservice] - https://gerrit.wikimedia.org/r/570974 (https://phabricator.wikimedia.org/T244289)
[01:17:06] (CR) Bstorm: "Good thing I am live testing every change we made in toolsbeta." [software/tools-webservice] - https://gerrit.wikimedia.org/r/570974 (https://phabricator.wikimedia.org/T244289) (owner: Bstorm)
[01:18:07] (CR) Bstorm: "This works on livehack on toolsbeta, btw." [software/tools-webservice] - https://gerrit.wikimedia.org/r/570974 (https://phabricator.wikimedia.org/T244289) (owner: Bstorm)
[01:20:48] (CR) Ppchelko: admins: add Clara Andrew-Wani to MW deployers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381) (owner: Dzahn)
[01:28:15] (PS1) Dzahn: gerrit: make db_pass configurable from (private) Hiera [puppet] - https://gerrit.wikimedia.org/r/570976 (https://phabricator.wikimedia.org/T243800)
[01:31:03] (CR) Dzahn: admins: add Clara Andrew-Wani to MW deployers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381) (owner: Dzahn)
[01:32:00] (PS3) Dzahn: admins: add Clara Andrew-Wani to MW deployers [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381)
[01:33:55] (CR) Ppchelko: [C: +1] admins: add Clara Andrew-Wani to MW deployers [puppet] - https://gerrit.wikimedia.org/r/570973 (https://phabricator.wikimedia.org/T244381) (owner: Dzahn)
[01:45:46] (CR) Jforrester: [C: +1] "Good to go in next SWAT window." [mediawiki-config] - https://gerrit.wikimedia.org/r/566842 (https://phabricator.wikimedia.org/T242122) (owner: Ottomata)
[02:21:12] (CR) Paladox: [C: +1] gerrit: make db_pass configurable from (private) Hiera [puppet] - https://gerrit.wikimedia.org/r/570976 (https://phabricator.wikimedia.org/T243800) (owner: Dzahn)
[02:27:31] Operations, Beta-Cluster-Infrastructure, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF)
[04:31:55] Operations, Performance-Team, serviceops, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (aaron) >>! In T244058#5851362, @Joe wrote: >>>! In T244058#5849290, @aaron wrote: >> Links to old (non-current) versions do not use the parser cache...
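Note on the 01:05:07 ipresolve_spec failure: the two strings in that error message, "2620:0:861:103:10:64:32:120" and "2620::861:103:10:64:32:120", look like two spellings of the same IPv6 address, so a test that compares text rather than parsed addresses would fail even when the lookup itself is right. A minimal Python sketch of that difference follows. The helper name `same_ipv6` is made up for illustration, and this is not the wmflib spec itself (which is presumably an rspec test), only a way to see why the strings differ while the addresses do not.

```python
import ipaddress

# The two spellings quoted in the 01:05:07 review comment.
EXPECTED = "2620:0:861:103:10:64:32:120"
RETURNED = "2620::861:103:10:64:32:120"


def same_ipv6(a: str, b: str) -> bool:
    """Compare two IPv6 strings as addresses, not as text (hypothetical helper)."""
    return ipaddress.IPv6Address(a) == ipaddress.IPv6Address(b)


# A plain string comparison sees two different values...
text_equal = EXPECTED == RETURNED
# ...while parsing both first shows they denote the same address.
address_equal = same_ipv6(EXPECTED, RETURNED)

print("equal as text:     ", text_equal)
print("equal as addresses:", address_equal)
```

Whether the spec should pin one spelling or compare parsed addresses is a choice for the patch author; the sketch only shows that the mismatch is one of notation.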
[04:48:48] (PS1) Ammarpad: Disable MobileFrontend mainpage special casing on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/570982 (https://phabricator.wikimedia.org/T244577)
[05:15:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:17:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:10:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:16:53] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:03:57] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:57] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:29:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:29:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:29:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 39 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:40:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:50:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 39 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:59:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:08:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 38 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:14:41] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:20:13] (CR) Elukey: Add profile::analytics::refinery::job::import_wikidata_entites_dumps (2 comments) [puppet] - https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: Joal)
[09:24:39] (PS13) Elukey: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: Joal)
[09:26:40] (CR) Elukey: "We should be ready to merge :)" [puppet] - https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: Joal)
[09:27:37] (CR) Elukey: [C: +2] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: Joal)
[10:02:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:09:54] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:19:22] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:27:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:34:14] Operations, Performance-Team, serviceops, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (jcrespo) What about tuning HTTP frontend caching? Last revision is very dynamic, but a hardcoded diff maybe could return stale results more often, as...
[10:37:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:42:42] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:42:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:42:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:44:07] they recovered now
[10:46:58] I was writing to #sre, couldn't repro on my vm in ams
[10:52:20] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:52:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:02:06] (PS1) Elukey: profile::swap: add the option to skip nfs dumps mountpoints [puppet] - https://gerrit.wikimedia.org/r/570991
[11:04:07] Operations, Performance-Team, serviceops, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (Joe) >>! In T244058#5861637, @aaron wrote: >>>! In T244058#5851362, @Joe wrote: >>>>! In T244058#5849290, @aaron wrote: >>> Links to old (non-current...
[11:04:08] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:04:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:05:05] (CR) Elukey: [C: +2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570991/" [puppet] - https://gerrit.wikimedia.org/r/570991 (owner: Elukey)
[11:25:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:25:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:43:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:45:03] Operations, Beta-Cluster-Infrastructure, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Krenair) a:Krenair looks like profile::trafficserver::backend::mapping_rules in hieradata/labs.yaml only has support for mediawiki and upload - it...
[11:49:06] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:57:39] Operations, Beta-Cluster-Infrastructure, RESTBase, Traffic, User-Ryasmeen: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Krenair) Open→Resolved
[11:59:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:02:32] Operations, Beta-Cluster-Infrastructure, RESTBase, Traffic, User-Ryasmeen: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Krenair) (I added some corrected hieradata to cache-text05 in horizon)
[12:08:56] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 38 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:10:32] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:10:42] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:27:07] hm, i'm having troubles with gerrit sometimes "server unavailable", probably that network fun
[12:40:04] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 522 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:40:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:50:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:01:17] (CR) Brion VIBBER: "I can run thumbor in the docker image, but can't figure out how to get it to give me thumbnails to copy into the test cases, as the URLs r" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER)
[13:02:06] (PS3) Brion VIBBER: Support MPEG-1 and MPEG-2 video files with .mpg or .mpeg extension [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024)
[13:29:22] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:41:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:10:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 37 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:16:32] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 526 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:19:54] (PS1) Zoranzoki21: Add throttle rules for OSU Editathon, remove expired ones [mediawiki-config] - https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608)
[15:20:59] (CR) jerkins-bot: [V: -1] Add throttle rules for OSU Editathon, remove expired ones [mediawiki-config] - https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608) (owner: Zoranzoki21)
[15:22:12] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) Thanks dpifke. I've seen it do that before too. We probably need to tune it a bit - I recall puppetdb hosts in particular have a hiera set...
[15:22:56] (PS2) Zoranzoki21: Add throttle rules for OSU Editathon, remove expired ones [mediawiki-config] - https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608)
[15:39:54] error creating phab task
[15:39:56] Unable to establish a connection to any database host (while trying "phabricator_file"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address.
[15:40:11] arg, went back and my text is gone
[15:44:50] (CR) Luke081515: [C: +1] Add throttle rules for OSU Editathon, remove expired ones [mediawiki-config] - https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608) (owner: Zoranzoki21)
[15:45:01] Looks like upgrading puppet is a total big ol' PITA
[15:45:37] hauskatze, hard part is done
[15:45:53] (if you were referring to the deployment-prep thing)
[15:46:07] just gotta go through and swap everything to the new puppetmaster and shut the old one down
[15:46:13] Glad to hear. Yep, I was referring to deployment-prep.
[15:47:57] puppet is at least functioning on almost all VMs at the moment, excluding imagescaler01 (which I just made a task for) and the zombie instance that is logstash2
[16:29:23] Daimona: AF question if possible?
[16:29:33] hauskatze: Sure.
[16:29:46] Daimona: any idea why AF#27@meta isn't working?
[16:29:59] (Ping me if I stop answering, I'm dealing with a very complex and annoying rebase)
[16:30:12] this can wait
[16:30:20] deal with the rebase first please :)
[16:30:27] * hauskatze hates manual rebases as well
[16:30:58] Nah, I can multitask :D
[16:31:08] What exactly isn't working?
[16:31:28] hauskatze: do you have an edit that was not hit although it should have been?
[16:31:34] * Urbanecm needs an example for debugging
[16:31:54] Urbanecm: yup, but it was suppressed in the meanwhile :|
[16:32:04] IP edit while logged out
[16:32:14] that's how I noticed it
[16:32:40] What's the edit?
[16:32:43] hauskatze: hmm, are you able to send me an examine link? https://meta.wikimedia.org/wiki/Special:AbuseFilter/examine
[16:32:49] Daimona: you're not going to see it if it's oversighted
[16:33:20] Yeah, but I'm going to feel it by heart
[16:34:02] lol
[16:34:11] well since an edit is suppressed...
[16:34:13] ...https://meta.wikimedia.org/w/index.php?title=Stewards/Elections_2020/Votes/Martin_Urbanec&diff=19786328&oldid=19786285 is here
[16:34:15] Urbanecm: examine link being sent by Royal Mail
[16:34:17] (PS1) Greg Grossmeier: Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004
[16:34:28] thank you
[16:34:37] that is: PM
[16:34:46] Uhm interesting, let me check
[16:35:42] A missing space
[16:35:55] where?
[16:35:57] In the regexp, after "Elections"
[16:36:38] gotcha
[16:36:45] Daimona: shall I add an underscore, or would a space work?
[16:36:54] Space will definitely work
[16:37:02] I'm getting ready to deploy ^^ my throttle patch
[16:37:02] Actually, underscore will not
[16:37:10] hauskatze: already fixed
[16:37:13] https://usercontent.irccloud-cdn.com/file/oUf1UX7r/image.png
[16:37:13] I knew it had to be something simple, but the coffee is not yet in effect
[16:37:32] greg-g: oh, since when do you do config changes? :)
[16:37:47] I'm more than happy, hey
[16:38:09] This kind of silly mistake is what usually drains your time
[16:38:24] hauskatze: wanna +1 it?
[16:38:47] greg-g: looks good to me; just Urbanecm, is the date format okay?
[16:39:00] hauskatze: Daimona: You know what's weird? It apparently worked before, but still no space. Have we used diff format before?
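Note on the 16:35 to 16:37 exchange: the diagnosis was a missing space after "Elections" in the filter's regexp, with the remark that a space would work and an underscore would not. A small Python sketch of that behaviour follows, assuming the filter sees the page title in its display form (spaces) rather than the URL form (underscores). The three patterns are hypothetical stand-ins, not the actual AF#27 rule, and AbuseFilter uses its own rule language with PCRE-style regexes rather than Python's `re`.

```python
import re

# Title in display form, as assumed to be seen by the filter
# (the URL in the log uses underscores instead).
TITLE = "Stewards/Elections 2020/Votes/Martin Urbanec"

# Hypothetical patterns for illustration only.
PATTERNS = {
    "no separator": r"^Stewards/Elections2020/Votes/",
    "underscore": r"^Stewards/Elections_2020/Votes/",
    "space": r"^Stewards/Elections 2020/Votes/",
}


def matches(pattern: str, title: str) -> bool:
    """Return True when the pattern is found in the title."""
    return re.search(pattern, title) is not None


results = {name: matches(pattern, TITLE) for name, pattern in PATTERNS.items()}

for name, hit in results.items():
    print(f"{name:13s} -> {'match' if hit else 'no match'}")
```

Under that assumption only the pattern with a literal space hits the title, which is consistent with what was said in the channel.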
[16:39:01] looking
[16:39:10] Daimona: but I'm a buffoon :P
[16:39:54] Maybe different page titles, yes
[16:39:56] Ahah
[16:40:35] Urbanecm: not good? (it's taking you a while ;))
[16:40:49] greg-g: not sure if 24:00 would work
[16:41:36] greg-g: but seems it's okay
[16:41:36] (CR) MarcoAurelio: [C: +1] "Looks good to me. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[16:41:41] (CR) Urbanecm: [C: +1] Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[16:41:53] lol, +1-conflict
[16:42:13] (PS2) Greg Grossmeier: Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004
[16:42:44] updated the time to 1UTC the next day
[16:42:47] ok?
[16:43:01] But my secret is hidden within me! <-- this is what AF sings when trolling us Daimona
[16:43:34] (CR) Giuseppe Lavagetto: [C: +1] Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[16:43:38] kk
[16:43:46] (CR) Greg Grossmeier: [C: +2] Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[16:44:17] greg-g: irc +1
[16:44:24] Lol I imagine it more as https://www.youtube.com/watch?v=o1eHKf-dMwo
[16:44:46] (Merged) jenkins-bot: Throttle override: Editathon in Charolette [mediawiki-config] - https://gerrit.wikimedia.org/r/571004 (owner: Greg Grossmeier)
[16:45:12] lolol
[16:47:38] !log gjg@deploy1001 Synchronized wmf-config/throttle.php: SWAT: Editathon in Charolette (duration: 00m 58s)
[16:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:24] thanks all ;)
[16:48:35] greg-g: You're very welcome :)
[16:51:01] hauskatze: Urbanecm grant is still reporting it's blocked right now, thoughts?
[16:51:48] greg-g: Sorry, which grant?
[16:52:02] sorry, Grant Ingersoll, our CTO [16:52:16] he's at the editathon [16:52:17] greg-g: ah, known issue [16:52:42] if the rule is added without some time before the event, you need to do something else [16:52:43] greg-g: cache [16:52:45] I'll purge it [16:52:47] damn [16:52:48] it's documented in the wikitech [16:52:50] thanks Urbanecm [16:53:42] !log mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 12.24.27.50 [16:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:46] greg-g: shoiuld work now [16:54:01] that's the trick [16:54:18] thanks both :) [16:54:36] asking them to verify in person [16:54:51] for future reference: https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [16:55:12] gracias :) [16:55:45] de nada [17:02:35] ok, what... [17:02:36] 09:02:08 we are being blocked from editing [17:02:50] greg-g: let me grep our logs [17:03:29] greg-g: can they create accs? [17:03:57] I think most people were able to create accounts [17:03:59] unclear, g.rant said they were all able to create accounts, but they might have used pesonal hotspots :/ [17:04:08] some of us are on hotspots [17:04:32] GrantI: maybe it is an IP block? [17:04:38] <_joe_> !log restarted php7.2-fpm on mw1332 [17:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:40] hauskatze: that's what I tend to thiink [17:05:05] you'll need either an IP block exemption or ask some enwiki admin to turn that block to anonymous only [17:05:18] I don't see anything on https://en.wikipedia.org/wiki/Special:BlockList?wpTarget=12.24.27.50&blockType=&limit=50&wpFormIdentifier=blocklist [17:05:24] +1 [17:05:31] I see some throttle hits from before I cleared the cache, but not after [17:05:41] GrantI: could you please send here the exact error message? [17:06:12] Urbanecm: so in theory our patch worked? 
just worried I messed up the timezone format (sigh) [17:06:30] greg-g: judging from logstash, yes; I did see successful user creations after the maintenance script ran [17:06:42] cdanis: well that's good [17:06:42] If accounts can be created, your patch is working as expected :) [17:06:42] greg-g: definitely worked, there is an account created after I ran the script [17:07:02] <_joe_> cdanis: I think the problem they have now is edit throttling [17:07:08] hauskatze: I wasn't clear if the accounts were created via that IP or a hotspot, hence asking :) [17:07:16] aha, gotcha [17:07:29] https://usercontent.irccloud-cdn.com/file/9PLqearY/image.png [17:08:12] yeah, a block [17:08:17] I think most folks have tethered to cell phones now [17:08:19] it's an autoblock [17:08:22] yep [17:08:22] yes [17:08:30] you need an enwiki admin to turn that off [17:08:48] anyone can edit, unless you're doing it as a group :) [17:08:56] gotcha, thanks! [17:09:16] sorry we couldn't fix it all for you, GrantI [17:09:18] I suppose it's due to a bunch of new people getting accounts [17:09:29] all from the same IP [17:09:54] GrantI: it's because someone vandalized Wikipedia from that account, and the IP got blocked to prevent block bypassing [17:10:08] ah [17:10:10] and that account was on that IP [17:10:17] yes [17:10:46] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:11:14] 09:11:00 greg-g: yes, lifted the autoblock [17:11:26] should be good now ^ [17:11:28] looks like we are good [17:11:29] confirmed, autoblock is gone [17:11:31] thank you! 
[17:12:16] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [17:12:17] _joe_: for the record, there is next to no edit throttling when you have an entry in throttle.php. I've made it count throttles per-user instead of per-IP when you have the entry. [17:12:28] _joe_: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/throttle-analyze.php#L49 [17:12:37] thanks so much for the quick response on a Saturday! [17:12:39] <_joe_> oh ok [17:12:41] (03PS1) 10CDanis: throttle: add comment re: clearing cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571005 [17:12:44] <_joe_> I wasn't aware [17:12:57] GrantI: happy to help! [17:13:16] (03CR) 10Greg Grossmeier: [C: 03+1] "Yes thanks, bit me today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571005 (owner: 10CDanis) [17:13:41] (03CR) 10Urbanecm: [C: 03+1] "good idea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571005 (owner: 10CDanis) [17:17:35] (03CR) 10CDanis: [C: 03+2] throttle: add comment re: clearing cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571005 (owner: 10CDanis) [17:17:45] there's no need to deploy comment-only changes, right? [17:18:22] cdanis: you only +2 and do git fetch to not confuse someone else [17:18:32] (03Merged) 10jenkins-bot: throttle: add comment re: clearing cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571005 (owner: 10CDanis) [17:18:36] got it [17:19:21] well, you can scap it and not leave it hanging around [17:19:44] it's a comment so it's a no-op after all [17:19:44] I think we have an icinga alert on undeployed merged changes, no? 
[17:20:06] greg-g: did that several times and nothing fired on me [17:20:12] hmmm [17:20:23] (maybe we have on +2'ed, but not fetched at the deployment host) [17:20:31] greg-g: i know we do for the puppet repo, no idea about mediawiki-config [17:20:31] maybe that's it [17:20:40] and yes, maybe it's puppet [17:20:41] cdanis: maybe that's what I'm thinking of [17:21:19] anywho, I'm going to go afk now, this saturday morning headache is a pain [17:21:24] feel better! [17:21:32] thanks :) [17:21:39] greg-g: paracetamol and movie [17:21:44] see you later all! [17:25:09] <_joe_> yes we do have an alert on mediawiki too [17:25:14] <_joe_> or at least we used to [17:25:17] <_joe_> I added it :P [17:26:43] well i fetched and merged on deploy1001 and i'll hang out for a while to see if anything complains [17:27:07] afaics, it should be fine [17:27:38] cdanis: the alert was migrated to your cat; if they meow, fear the worst ;-) [17:28:09] 😼 [17:28:34] I have mine sleeping on my knees for some reason [17:29:58] Writing it up here just in case: I'm going to dry-run a maint script on the Beta Cluster [17:39:55] Hello, can someone help me with T244644? I need this done asap because I can't access horizon [17:39:56] T244644: Disable 2FA for Zoranzoki21 on Wikitech - https://phabricator.wikimedia.org/T244644 [18:20:54] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:22:22] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:36:25] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance Issue, 10Wikimedia-database-error: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10russblau) As of to... [19:08:48] (03PS3) 10Zoranzoki21: Add throttle rules for OSU Editathon, remove expired ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608) [19:10:08] (03CR) 10Zoranzoki21: "I scheduled this patch for Morning SWAT at Monday.. In this patch is also removed throttle rule added by Greg, but it will expire before d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608) (owner: 10Zoranzoki21) [19:10:32] (03CR) 10Zoranzoki21: "> I scheduled this patch for Morning SWAT at Monday.. In this patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570998 (https://phabricator.wikimedia.org/T244608) (owner: 10Zoranzoki21) [19:12:40] <_joe_> !log set cpufreq governor to performance on mw1328 [19:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:06] _joe_: I thought we had done so across all the appservers? [20:45:08] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) I've moved the remaining instances over to using the new puppetmaster. Puppet does appear to be struggling on deployment-mwmaint01 and dep... 
[20:52:28] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) a:05Krenair→03None So before closing this task and removing puppetmaster03, someone should address: * puppetdb03 memory usage * puppet... [21:59:12] effie: https://phabricator.wikimedia.org/T244636#5862299 [23:52:52] PROBLEM - Host cp3051 is DOWN: PING CRITICAL - Packet loss = 100%