[01:35:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.1e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[01:36:55] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[02:22:01] (PS3) Andrew Bogott: Openstack: move eqiad1 glance/keystone/nova/neutron to Newton [puppet] - https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302)
[02:22:03] (PS1) Andrew Bogott: Horizon: put in maintenance mode for the mitaka->newton upgrade [puppet] - https://gerrit.wikimedia.org/r/541133 (https://phabricator.wikimedia.org/T212302)
[02:22:05] (PS1) Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - https://gerrit.wikimedia.org/r/541134
[04:20:02] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (Marostegui) >>! In T226782#5536948, @wiki_willy wrote: > @Marostegui - sure, will do. This week is the approval & ordering phase of the procurement cycle, so it shouldn't be an i...
[04:39:39] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (Marostegui) Let us know when the database is created so we can sanitize it on labs hosts
[04:54:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316 for schema change T233135 T234066', diff saved to https://phabricator.wikimedia.org/P9245 and previous config saved to /var/cache/conftool/dbconfig/20191007-045411-marostegui.json
[04:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:18] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[04:54:18] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:02:27] !log Fix replication on labsdb1011:s8
[05:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:38] !log Stop replication on db2076 to modify triggers on db2096:3316 T234704
[05:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:42] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[05:10:39] !log The above was for db2095:3316 T234704
[05:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:27] !log Deploy schema change on db2124 T233135 T234066
[05:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:32] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:25:33] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:45:45] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:14] Operations, ops-codfw, DC-Ops, decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2055.codfw.wmnet` - db2055.codfw.wmnet (**PASS**) - Downtimed host on Ic...
[05:47:56] (PS1) Marostegui: site.pp: Remove references to db2055 [puppet] - https://gerrit.wikimedia.org/r/541144 (https://phabricator.wikimedia.org/T233186)
[05:48:12] (PS1) Marostegui: wmnet: Remove db2055 production entries [dns] - https://gerrit.wikimedia.org/r/541145 (https://phabricator.wikimedia.org/T233186)
[05:48:57] (CR) Marostegui: [C: +2] site.pp: Remove references to db2055 [puppet] - https://gerrit.wikimedia.org/r/541144 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:49:25] (CR) Marostegui: [C: +2] wmnet: Remove db2055 production entries [dns] - https://gerrit.wikimedia.org/r/541145 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:54:22] Operations, ops-codfw, DC-Ops, decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui) a:RobH→Papaul
[05:54:40] Operations, ops-codfw, DC-Ops, decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui) Host ready for @Papaul to decommission and disable switch
[06:00:47] Operations, Core Platform Team, Editing-team, Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Yair_rand) How will this work for languages with a different main page for each language, eg Commons? The main page...
[06:07:01] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:08:18] !log upgrade python-kafka on eventlog1002 to 1.4.7-1 (manually via dpkg -i) - T222941
[06:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:22] T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941
[06:17:55] Operations, Core Platform Team, Editing-team, Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Bawolff) >>! In T120085#5550694, @Yair_rand wrote: > How will this work for projects with a different main page for...
[06:21:23] (PS1) Marostegui: db-eqiad.php: Depool es1011 [mediawiki-config] - https://gerrit.wikimedia.org/r/541148 (https://phabricator.wikimedia.org/T227138)
[06:22:44] (CR) Marostegui: [C: +2] db-eqiad.php: Depool es1011 [mediawiki-config] - https://gerrit.wikimedia.org/r/541148 (https://phabricator.wikimedia.org/T227138) (owner: Marostegui)
[06:23:31] (Merged) jenkins-bot: db-eqiad.php: Depool es1011 [mediawiki-config] - https://gerrit.wikimedia.org/r/541148 (https://phabricator.wikimedia.org/T227138) (owner: Marostegui)
[06:25:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1011 T227138 (duration: 01m 10s)
[06:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:22] T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138
[06:31:35] Operations, ops-eqiad, DC-Ops, Patch-For-Review: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Marostegui) db1074 has a broken PSU and the new PSU is scheduled to arrive the 10th (T233567#5544445), so I will power off this host and will need to be powe...
[06:43:08] PROBLEM - MegaRAID on analytics1049 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:43:10] ACKNOWLEDGEMENT - MegaRAID on analytics1049 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T234785 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:43:13] Operations, ops-eqiad: Degraded RAID on analytics1049 - https://phabricator.wikimedia.org/T234785 (ops-monitoring-bot)
[06:56:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9246 and previous config saved to /var/cache/conftool/dbconfig/20191007-065645-marostegui.json
[06:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:50] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[06:57:11] hello analytics1049
[06:57:39] XDDD
[07:01:32] (PS1) Elukey: Add host overrides for analytics1049 to bypass a broken disk [puppet] - https://gerrit.wikimedia.org/r/541154 (https://phabricator.wikimedia.org/T234785)
[07:01:56] (CR) Elukey: [C: +2] Add host overrides for analytics1049 to bypass a broken disk [puppet] - https://gerrit.wikimedia.org/r/541154 (https://phabricator.wikimedia.org/T234785) (owner: Elukey)
[07:09:58] !log Remove grants for dbproxy1006 on m1 databases - T231280
[07:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:01] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280
[07:17:37] (PS1) Marostegui: mariadb: Decommission dbproxy1006 [puppet] - https://gerrit.wikimedia.org/r/541192 (https://phabricator.wikimedia.org/T233207)
[07:19:12] (CR) Marostegui: [C: +2] mariadb: Decommission dbproxy1006 [puppet] - https://gerrit.wikimedia.org/r/541192 (https://phabricator.wikimedia.org/T233207) (owner: Marostegui)
[07:32:57] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (Marostegui)
[07:34:45] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (Marostegui)
[07:44:35] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (wiki_willy) @Marostegui - it was ordered last Friday morning. We haven't received the tracking number from the vendor yet, but will update that in T233277 once provided. There'...
[07:46:52] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (Marostegui) Ah - thanks, as T233277 wasn't updated I thought it wasn't ordered. Let's see what the ETA is. Thanks for the update
[07:51:13] Operations, serviceops, Beta-Cluster-reproducible, Patch-For-Review, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (Joe) For posterity, what I am doing to run these tests is as follows: * Import etcd to a local daemon from a production backup (that will typically...
[07:55:00] Operations, Traffic: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 (Vgutierrez) Proposed a fix to upstream on https://github.com/apache/trafficserver/pull/5992
[07:55:53] Operations, cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (MoritzMuehlenhoff) @Phamhi: 1. Grafana is installed from an external repository. There's already a config to pull in the new deb package for our Buster repository (C...
[07:56:35] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[07:58:13] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[07:58:23] (CR) Muehlenhoff: [C: +1] "LGTM, it's also covered by misc-nonprod, but a specific one for spares can't hurt." [puppet] - https://gerrit.wikimedia.org/r/540837 (owner: Jbond)
[08:04:15] (CR) Muehlenhoff: [C: -1] profile::kerberos::kdc: add support for bacula backups (2 comments) [puppet] - https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[08:04:26] Operations, serviceops, Beta-Cluster-reproducible, Patch-For-Review, User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (Joe) Varnish caches are a special case, as they already don't declare a prefix on the command line, so they will need no modification and work out o...
[08:08:53] (CR) Elukey: profile::kerberos::kdc: add support for bacula backups (2 comments) [puppet] - https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[08:10:09] (PS1) Elukey: eventlogging::analytics: remove specific python-kafka constraints [puppet] - https://gerrit.wikimedia.org/r/541196 (https://phabricator.wikimedia.org/T222941)
[08:11:09] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:12:05] the avg went up a lot --^
[08:12:45] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:13:49] (PS13) Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209)
[08:15:26] and it seems only appservers related (not api)
[08:15:29] (PS2) Giuseppe Lavagetto: confd: move all prefix declarations to the files [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204)
[08:15:57] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:17:26] <_joe_> uh
[08:19:14] from the memcached side, bytes read increased at the same time
[08:19:15] https://grafana.wikimedia.org/d/000000316/memcache?panelId=44&fullscreen&orgId=1
[08:19:27] but I don't see clear increases in mcrouter
[08:19:29] mmmmm
[08:20:48] there is a strange s5 pattern: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s5&var-role=All
[08:20:49] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:21:17] writes increased 10x
[08:21:25] reads 2x
[08:24:01] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:27:44] (CR) Giuseppe Lavagetto: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1001/18754/cp1077.eqiad.wmnet/ Needs improvements." [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204) (owner: Giuseppe Lavagetto)
[08:34:10] !log gerrit: force reindexing all changes ( gerrit index start changes --force )
[08:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:21] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18755/eventlog1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/541196 (https://phabricator.wikimedia.org/T222941) (owner: Elukey)
[08:44:35] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:46:15] PROBLEM - SSH cp3008.mgmt on cp3008.mgmt is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:48:01] (CR) Jcrespo: [C: +2] backups: Change file owner of bacula storage&director config [puppet] - https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[08:48:12] (PS14) Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209)
[08:48:36] (PS1) Mobrovac: RESTRouter: Add banwiki [deployment-charts] - https://gerrit.wikimedia.org/r/541202 (https://phabricator.wikimedia.org/T234772)
[08:49:26] (CR) Mobrovac: [C: +2] RESTRouter: Add banwiki [deployment-charts] - https://gerrit.wikimedia.org/r/541202 (https://phabricator.wikimedia.org/T234772) (owner: Mobrovac)
[08:49:38] (Merged) jenkins-bot: RESTRouter: Add banwiki [deployment-charts] - https://gerrit.wikimedia.org/r/541202 (https://phabricator.wikimedia.org/T234772) (owner: Mobrovac)
[08:50:26] (PS3) Giuseppe Lavagetto: confd: move all prefix declarations to the files [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204)
[08:53:51] (CR) Giuseppe Lavagetto: [C: +2] "The change results in a noop on the cache servers, and a change everywhere else. I will merge this change with care and run puppet on one " [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204) (owner: Giuseppe Lavagetto)
[08:54:00] (PS4) Giuseppe Lavagetto: confd: move all prefix declarations to the files [puppet] - https://gerrit.wikimedia.org/r/540868 (https://phabricator.wikimedia.org/T147204)
[08:55:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:58:03] Operations, ops-codfw, Cloud-Services: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 (MoritzMuehlenhoff) bdsync was never packaged in Debian, it's an internally packaged tool (originally done by Chase), from a quick gla...
[09:00:01] (CR) Elukey: [C: +2] Release upstream version 1.4.7 [debs/python-kafka] (debian) - https://gerrit.wikimedia.org/r/540809 (owner: Elukey)
[09:10:31] (PS1) Jcrespo: bacula: Add conditional storage device setup [puppet] - https://gerrit.wikimedia.org/r/541205 (https://phabricator.wikimedia.org/T229209)
[09:11:12] (PS2) Jcrespo: bacula: Add conditional storage device setup [puppet] - https://gerrit.wikimedia.org/r/541205 (https://phabricator.wikimedia.org/T229209)
[09:13:02] Operations, Core Platform Team, Editing-team, Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Ladsgroup) Very valid point, I personally would be okay with not turning on the config on wkis that set `$wgForceUIM...
[09:14:53] Operations, DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui)
[09:15:02] Operations, DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (Marostegui) p:Triage→Normal
[09:16:02] (CR) Alexandros Kosiaris: [C: +1] bacula: Add conditional storage device setup [puppet] - https://gerrit.wikimedia.org/r/541205 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[09:16:04] (CR) Muehlenhoff: [C: +1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/540851 (owner: Jbond)
[09:19:21] (PS3) Jcrespo: bacula: Add conditional storage device setup [puppet] - https://gerrit.wikimedia.org/r/541205 (https://phabricator.wikimedia.org/T229209)
[09:25:18] (CR) Jcrespo: [C: +2] bacula: Add conditional storage device setup [puppet] - https://gerrit.wikimedia.org/r/541205 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[09:25:39] (PS1) Jcrespo: bacula: Remove old storage setup layout and increase concurrency [puppet] - https://gerrit.wikimedia.org/r/541209 (https://phabricator.wikimedia.org/T229209)
[09:27:52] (CR) Muehlenhoff: [C: +1] "Nice, that's useful for testing as well. Some comments inline, but LGTM" (3 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/540850 (owner: Jbond)
[09:30:12] (PS1) Giuseppe Lavagetto: scap::dsh: fix logstash "service" tag [puppet] - https://gerrit.wikimedia.org/r/541211
[09:33:57] Operations, Traffic: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (Vgutierrez)
[09:34:08] (PS3) Arturo Borrero Gonzalez: Remove old Toolforge Clush master files [puppet] - https://gerrit.wikimedia.org/r/539685 (owner: Alex Monk)
[09:34:32] (CR) Arturo Borrero Gonzalez: [C: +2] "Confirmed in PCC this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/18756/" [puppet] - https://gerrit.wikimedia.org/r/539685 (owner: Alex Monk)
[09:35:23] (CR) Muehlenhoff: [C: +1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/540849 (owner: Jbond)
[09:37:12] (CR) Alexandros Kosiaris: [C: +1] scap::dsh: fix logstash "service" tag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/541211 (owner: Giuseppe Lavagetto)
[09:41:18] (CR) Giuseppe Lavagetto: [C: +2] scap::dsh: fix logstash "service" tag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/541211 (owner: Giuseppe Lavagetto)
[09:41:27] (PS2) Giuseppe Lavagetto: scap::dsh: fix logstash "service" tag [puppet] - https://gerrit.wikimedia.org/r/541211
[09:47:16] Operations, Traffic: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (Vgutierrez) p:Triage→Normal
[09:52:06] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[09:55:01] !log Deploy schema change on db2129 (s6 codfw master), this will generate lag on s6 codfw - T233135 T234066
[09:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:07] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[09:55:07] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[09:56:48] Operations, Traffic, netops, Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (elukey)
[09:57:02] Operations, netops, Wikimedia-Incident: ospf link-protection - https://phabricator.wikimedia.org/T167306 (elukey)
[09:57:23] Operations, SRE-tools, netops, Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (elukey)
[10:06:10] (PS1) Jbond: puppet: change $::cluster variable to a hiera default [puppet] - https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805)
[10:07:24] Operations, Puppet, Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (jbond)
[10:08:33] (CR) jerkins-bot: [V: -1] puppet: change $::cluster variable to a hiera default [puppet] - https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: Jbond)
[10:10:44] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1798e39]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests - T233127 T234772
[10:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:50] T234772: Add banwiki to restbase - https://phabricator.wikimedia.org/T234772
[10:10:50] T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127
[10:11:24] (CR) Jbond: profile: sanity checks for cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi)
[10:16:42] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1798e39]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests - T233127 T234772 (duration: 05m 58s)
[10:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:48] T234772: Add banwiki to restbase - https://phabricator.wikimedia.org/T234772
[10:16:49] T233127: HTTP 404 error in VE possibly when confronted with an edit conflict - https://phabricator.wikimedia.org/T233127
[10:19:07] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1798e39]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #2
[10:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:03] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1798e39]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #2 (duration: 01m 56s)
[10:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:35] Operations, ops-codfw, Cloud-Services: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 (aborrero) I can do the buster build. Perhaps we should consider uploading this package to Debian, it should be interesting for other...
[10:30:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T1030).
[10:31:09] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/541217 (https://phabricator.wikimedia.org/T128546)
[10:31:50] <_joe_> !log uploading confd 0.16.0 to stretch
[10:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:04] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/541217 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:32:51] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/541217 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:36:18] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:541217| Bumping portals to master (T128546)]] (duration: 00m 53s)
[10:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:22] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:37:10] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:541217| Bumping portals to master (T128546)]] (duration: 00m 51s)
[10:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:08] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #3
[10:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:32] (PS1) Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803)
[10:47:28] (CR) jerkins-bot: [V: -1] ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: Vgutierrez)
[10:48:01] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #3 (duration: 03m 53s)
[10:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:38] (PS2) Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803)
[10:50:04] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #4
[10:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:31] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #4 (duration: 04m 27s)
[10:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:29] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #5
[10:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:43] (PS3) Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803)
[10:59:26] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:59:45] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5321aac]: Skip checking resources on start-up, add banwiki, add metrics/mediarequests end points and log all VE requests, take #5 (duration: 04m 17s)
[10:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T1100).
[11:00:04] tassu, kostajh, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] \o
[11:00:38] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:00:38] Here, but I have no way of testing blocking on nlwiki
[11:00:43] (PS4) Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803)
[11:01:11] o/
[11:01:56] o/
[11:06:06] (PS5) Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803)
[11:07:39] ok, I can do the SWAT
[11:09:40] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/541008 (https://phabricator.wikimedia.org/T234685) (owner: Majavah)
[11:10:33] (Merged) jenkins-bot: Enable partial blocks on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541008 (https://phabricator.wikimedia.org/T234685) (owner: Majavah)
[11:10:43] tassu: are any nlwiki admins online who could test it?
[11:10:59] (I know Sjoerddebruin and apparently he’s an admin, but he doesn’t seem to be online) [11:11:07] None that I know [11:11:56] I suppose I could promote myself via shell access and then remove the rights again when the test is done [11:12:06] (03PS6) 10Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) [11:12:08] but that seems potentially… what’s the word [11:12:13] like someone might get upset ^^ [11:12:25] LOL [11:12:38] "potentially" [11:13:36] https://meta.wikimedia.org/wiki/System_administrators [11:13:43] (03PS1) 10Muehlenhoff: Remove unused/unnecessary passwords::postgres include [puppet] - 10https://gerrit.wikimedia.org/r/541224 [11:14:14] effie: are you User:Effieetsanders by any chance? [11:14:31] no, should I be? [11:14:40] what's wrong? [11:15:11] nah, just looking for any nlwiki sysadmin that might help me test something [11:15:14] and that name looked familiar on the user list [11:15:17] sorry to bother you [11:15:38] you scared me a bit there :p [11:15:43] I’ll just sync the config change and assume nothing blows up… it’s a simple enough code change [11:15:45] sorry :D [11:16:01] heh [11:16:09] "To facilitate these changes being made in a transparent fashion with no need of the steward flag, all system administrators who ask can be added to the 'sysadmin' global group (automatic members lists). This group allows them to set user rights for any user on any wiki, in the same fashion as stewards. So if a system administrator needs to perform [11:16:09] an action restricted to administrators (like editing system messages) on a particular wiki, they can simply grant themselves admin status on that wiki to make the action." [11:16:32] you are not on that group? 
[11:16:50] haha, fixcopyright.wikimedia.org is badly broken, we can't break it even more :D [11:16:57] !log added bdsync 0.11.1-1~wmf1 to buster-wikimedia (T234683) [11:16:58] I don’t think so [11:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] T234683: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 [11:17:09] I’m more part of that “Most of the technical Wikimedia Foundation staff have some form of shell access.” sentence [11:17:27] syncing now [11:17:46] lets hope nothing breaks [11:18:13] I’ll watch the nlwiki village pump for changes [11:18:14] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:541008|Enable partial blocks on nlwiki (T234685)]] (duration: 00m 52s) [11:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:18] T234685: Enable Partial Blocks on Dutch Wikipedia - https://phabricator.wikimedia.org/T234685 [11:19:41] kostajh: your patch is up next [11:19:48] Lucas_WMDE: cool [11:20:10] Lucas_WMDE: thanks and sorry for the hassle [11:20:23] no problem [11:21:39] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18765/" [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [11:24:44] (03CR) 10Volans: [C: 03+1] "LGTM, compiler seems happy too:" [puppet] - 10https://gerrit.wikimedia.org/r/541224 (owner: 10Muehlenhoff) [11:33:08] kostajh: your change should be on mwdebug1002 now, can you test it? [11:33:14] Lucas_WMDE: looking [11:33:55] Lucas_WMDE: let's do it! 
[11:33:58] ok [11:35:33] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/GrowthExperiments/: SWAT: [[gerrit:541139|Homepage: Don't use flexbox for vertical layouts in mobile start module (T234380)]] (duration: 00m 53s) [11:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:37] T234380: [regression- mobile] Homepage - Start model displays broken layout - https://phabricator.wikimedia.org/T234380 [11:35:41] Amir1: your changes look like I should `scap pull` them both to mwdebug1002 together for testing? [11:35:55] (and then still `scap sync` them separately, so they’re done in the correct order [11:35:57] ) [11:36:11] or do you want to do the deployment? [11:36:12] Lucas_WMDE: yes please [11:36:25] Lucas_WMDE: if you do it, it would be great [11:36:28] ok [11:36:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:37:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:37:34] (03CR) 10jerkins-bot: [V: 04-1] Get rid of main page hack for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:38:08] what [11:38:24] uh [11:38:37] I’ll try rebasing them [11:38:43] (03PS2) 10Lucas Werkmeister (WMDE): Set $wgMainPageIsDomainRoot true for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:38:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:39:03] (03PS2) 10Lucas Werkmeister (WMDE): Get rid of main page hack for 
fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:39:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:39:13] Danke schon [11:39:32] strange that it didn’t report any error on the first change… [11:39:51] (03Merged) 10jenkins-bot: Set $wgMainPageIsDomainRoot true for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:39:55] (03Merged) 10jenkins-bot: Get rid of main page hack for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [11:40:28] ok, testing on mwdebug1002 [11:40:51] seems to be working [11:40:58] main page link goes to domain, /wiki/Main_Page redirects to / [11:41:09] yup, that's intended [11:41:15] ok, syncing [11:41:28] !log another hack bites the dust [11:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:40] I was looking for an occasion to say this [11:42:04] lol [11:42:45] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:540678|Set $wgMainPageIsDomainRoot true for fixcopyrightwiki (T120085)]] (duration: 00m 52s) [11:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:49] T120085: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 [11:44:18] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:540679|Get rid of main page hack for fixcopyrightwiki (T120085)]] (duration: 00m 52s) [11:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:49] !log EU SWAT done [11:44:51] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:55] nlwiki village pump isn’t screaming bloody murder so it looks like that also didn’t break anything [11:46:16] (they don’t seem to have an administrators’ noticeboard? at least no page is linked to that wikidata item) [11:51:57] 10Operations, 10Math, 10Patch-For-Review: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) @Joe do we need to test that on other servers as well? I was thinking of merging this along with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538884/... [11:53:18] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (10jijiki) @Joe should I remove php-pear, php-mail, php-mail-mime from the rest of the fleet? [11:54:01] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Ladsgroup) Just want to emphasis that this config variable at this current state redirects `/wiki/Main_Page` to `/`... 
[11:56:04] (03PS2) 10Jbond: puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) [11:56:07] Lucas_WMDE: seems to be working based on block log (https://nl.wikipedia.org/wiki/Special:Log/block) [11:58:06] (03CR) 10jerkins-bot: [V: 04-1] puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [11:58:43] cool [11:58:44] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Patch-For-Review, 10User-jijiki: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (10jijiki) [12:01:42] (03Abandoned) 10Effie Mouzeli: WIP: mediawiki: remove cleanup apache configs from hhvm [puppet] - 10https://gerrit.wikimedia.org/r/539541 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:03:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I double-checked wmf-config for any binaries which mediawiki shells out to (to prevent that some other program got automatical" [puppet] - 10https://gerrit.wikimedia.org/r/540154 (https://phabricator.wikimedia.org/T195847) (owner: 10Giuseppe Lavagetto) [12:41:35] (03PS3) 10Jbond: puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) [12:43:48] (03CR) 10jerkins-bot: [V: 04-1] puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [12:47:13] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [12:49:12] (03CR) 10jerkins-bot: [V: 04-1] puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) 
[12:50:34] (03PS4) 10Jbond: puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) [12:52:31] (03CR) 10jerkins-bot: [V: 04-1] puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [12:54:30] !log upload python-kafka and python3-kafka 1.4.7-1 to stretch-wikimedia - T222941 [12:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:34] T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 [12:54:46] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=restrouter [12:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:09] !log mobrovac@deploy1001 Started deploy [restbase/deploy@fe39197]: Minor tweaks to VE logging [13:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315 after schema change T233625', diff saved to https://phabricator.wikimedia.org/P9247 and previous config saved to /var/cache/conftool/dbconfig/20191007-130317-marostegui.json [13:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:22] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [13:04:16] !log mobrovac@deploy1001 deploy aborted: Minor tweaks to VE logging (duration: 01m 07s) [13:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:27] (03PS3) 10Elukey: profile::kerberos::kdc: add support for bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) [13:04:51] !log mobrovac@deploy1001 Started deploy 
[restbase/deploy@5321aac]: (no justification provided) [13:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:20] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5321aac]: (no justification provided) (duration: 00m 29s) [13:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:29] (03PS4) 10Elukey: profile::kerberos::kdc: add support for bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) [13:06:58] 10Operations, 10ops-codfw, 10Cloud-Services: Build bdsync for Buster, or update block_sync.py script to use rsync --copy-devices - https://phabricator.wikimedia.org/T234683 (10Andrew) @arturo if you wanted to submit the package upstream there's at least one other person who would appreciate it. https://bugs... [13:09:17] !log mobrovac@deploy1001 Started deploy [restbase/deploy@bf72f5c]: Minor tweaks to VE logging [13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] (03PS1) 10Jbond: WIP: migrate profile spec tests to shared spec_healper [puppet] - 10https://gerrit.wikimedia.org/r/541245 [13:11:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the version in Chart.yaml as well" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [13:13:29] (03CR) 10jerkins-bot: [V: 04-1] WIP: migrate profile spec tests to shared spec_healper [puppet] - 10https://gerrit.wikimedia.org/r/541245 (owner: 10Jbond) [13:13:49] !log upload python-kafka and python3-kafka 1.4.7-1 to buster-wikimedia - T222941 [13:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:52] T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 [13:15:01] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:15:03] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:15:17] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:15:55] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:15:59] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:16:18] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@bf72f5c]: Minor tweaks to VE logging (duration: 07m 01s) [13:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:33] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:16:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:16:49] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316 schema change T233135 T234066', diff saved to https://phabricator.wikimedia.org/P9248 and previous config saved to /var/cache/conftool/dbconfig/20191007-131720-marostegui.json [13:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:27] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [13:17:28] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [13:19:51] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1337290]: Minor tweaks to VE logging, v2 [13:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:23] (03CR) 10Muehlenhoff: profile::kerberos::kdc: add support for bacula backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:21:25] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:21:39] (03PS1) 10Ladsgroup: Set all of wikidata to write both for item term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541249 (https://phabricator.wikimedia.org/T225055) [13:22:27] going to deploy this now ^ (coordinated with DBAs) [13:23:07] (03CR) 10Ladsgroup: [C: 03+2] Set all of wikidata to write both for item term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541249 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [13:23:25] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles 
for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:23:41] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles [13:23:41] 016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:23:41] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:23:53] (03Merged) 10jenkins-bot: Set all of wikidata to write both for item term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541249 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [13:23:59] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:23:59] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for 
April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:09] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:25] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:29] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:39] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/p [13:24:39] itle} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from stor [13:24:39] fore a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed 
out before a response was received: /en.wikipedia.org/v1/page/talk/{title} ( https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:24:43] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:24:45] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:24:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:25:35] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:25:35] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:25:37] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:25:43] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:26:23] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:26:25] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:26:27] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} 
(retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mob [13:26:27] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:26:29] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1337290]: Minor tweaks to VE logging, v2 (duration: 06m 38s) [13:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:20] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:541249|Set all of wikidata to write both for item term store (T225055)]] (duration: 00m 54s) [13:27:21] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:27:21] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:24] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [13:27:31] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:27:31] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:27:31] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:27:50] marostegui: done now [13:28:02] Amir1: let's see what happens :) [13:28:11] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:28:12] Amir1: for now the master is fine [13:28:44] (03PS4) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [13:29:43] (03CR) 10Jbond: "Thanks for the comments, I think i have addressed everything except the debian build comment which ill speak with moritz about" (0314 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond) [13:29:53] Amir1: so far everything good with the master and the slaves [13:30:32] that's fishy [13:30:55] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:10] (03CR) 10Jbond: [C: 03+2] cumin: add an alias for spare::system [puppet] - 10https://gerrit.wikimedia.org/r/540837 (owner: 10Jbond) [13:31:19] (03PS2) 10Jbond: cumin: add an alias for spare::system [puppet] - 10https://gerrit.wikimedia.org/r/540837 [13:31:24] Amir1: I would expect more writes, right? 
[13:31:56] yeah, nothing showed up in tables but I know it has to go through some caches [13:32:12] I don't see any significant change on any patterns so far [13:32:16] so I would say let's double check in half an hour, I checked and I rebased the patch [13:32:29] ok - sounds good [13:33:42] the new config var is indeed deployed across the fleet [13:37:45] (03PS2) 10Jbond: debdeploy: Change the `--servers` flag to a global flag [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540849 [13:37:55] (03CR) 10Jbond: debdeploy: Change the `--servers` flag to a global flag (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540849 (owner: 10Jbond) [13:40:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] debdeploy: Change the `--servers` flag to a global flag [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540849 (owner: 10Jbond) [13:41:45] (03PS2) 10Jbond: debdeploy: add support for raw cumin query strings [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540850 [13:42:05] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Nikerabbit) MediaWiki does not HTTP redirect (at least not in translatewiki.net). Wikimedia has rewrites outside Med... 
[13:42:19] (03PS2) 10Jbond: debdeploy: refactor [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540851 [13:44:53] (03PS3) 10Jbond: debdeploy: add support for raw cumin query strings [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540850 [13:47:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] "Thanks for the reviews" (034 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540850 (owner: 10Jbond) [13:47:18] (03PS3) 10Jbond: debdeploy: refactor [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540851 [13:47:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] debdeploy: refactor [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/540851 (owner: 10Jbond) [13:49:00] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:50:02] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:50:09] ^^ sorry this is me merging now [13:51:40] (03PS5) 10Jbond: puppet: change $::cluster variable to a hiera default [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) [13:53:18] (03PS1) 10Marostegui: wikireplica_analytics: Change query killer from 4h to 1h [puppet] - 10https://gerrit.wikimedia.org/r/541257 [13:53:55] (03CR) 10jerkins-bot: [V: 04-1] wikireplica_analytics: Change query killer from 4h to 1h [puppet] - 10https://gerrit.wikimedia.org/r/541257 (owner: 10Marostegui) [13:54:27] (03CR) 10Jbond: "so that was a bigger rabbit whole then i expected. Ready for review now though." 
[puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [13:55:30] 10Operations, 10Puppet, 10Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (10Joe) The reason for the presence of the global variable was that, back in the day, it was the only way to inject the "cluster" in WMCS VMs. Nowadays they can all use hiera so that should... [13:57:13] (03CR) 10Jhedden: [C: 03+1] openstack: drop jessie code [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [13:57:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall LGTM - please verify with WMCS if someone is still using global variables (via ENC) to configure VMs though." [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [13:57:52] (03PS2) 10Andrew Bogott: Horizon: put in maintenance mode for the mitaka->newton upgrade [puppet] - 10https://gerrit.wikimedia.org/r/541133 (https://phabricator.wikimedia.org/T212302) [13:58:09] (03PS4) 10Andrew Bogott: Openstack: move eqiad1 glance/keystone/nova/neutron to Newton [puppet] - 10https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302) [14:00:58] (03CR) 10Jhedden: [C: 03+1] Horizon: put in maintenance mode for the mitaka->newton upgrade [puppet] - 10https://gerrit.wikimedia.org/r/541133 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [14:01:34] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put in maintenance mode for the mitaka->newton upgrade [puppet] - 10https://gerrit.wikimedia.org/r/541133 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [14:04:55] (03CR) 10Andrew Bogott: [C: 03+2] Openstack: move eqiad1 glance/keystone/nova/neutron to Newton [puppet] - 10https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [14:05:53] (03PS2) 
10Marostegui: wikireplica_analytics: Change query killer from 4h to 1h [puppet] - 10https://gerrit.wikimedia.org/r/541257 (https://phabricator.wikimedia.org/T233986) [14:06:14] (03PS1) 10Herron: add dns entries for logstash100[123] & logstash201[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 [14:06:22] (03PS1) 10Herron: add dhcp and netboot entries for logstash100[123] & logstash201[012] [puppet] - 10https://gerrit.wikimedia.org/r/541260 [14:06:39] (03CR) 10Jbond: "Andrew or Arturo could you confirm If WMCS's relies on `$::cluster` being a global variable." [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [14:06:45] (03CR) 10jerkins-bot: [V: 04-1] add dns entries for logstash100[123] & logstash201[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 (owner: 10Herron) [14:08:27] 10Operations, 10Puppet, 10Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (10jbond) >>! In T234805#5552137, @Joe wrote: > The reason for the presence of the global variable was that, back in the day, it was the only way to inject the "cluster" in WMCS VMs. Nowada... 
[14:15:29] 10Operations, 10Puppet, 10Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (10jbond) p:05Triage→03Normal [14:15:51] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (10jbond) [14:17:00] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01012 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:17:40] !log Deploy schema change on db1139:3316 - T233135 T234066 [14:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [14:17:46] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [14:25:09] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [14:25:16] !log upgrading openstack in CloudVPS. 
Some IRC bots and related stuff may be unavailable (T212302) [14:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] T212302: CloudVPS: upgrade: jessie -> stretch & mitaka -> newton - https://phabricator.wikimedia.org/T212302 [14:25:24] (03PS2) 10Herron: add dns entries for logstash100[123] & logstash201[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 [14:25:28] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) 05Open→03Resolved Complete [14:28:20] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01012 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:30:02] (03CR) 10Jcrespo: [C: 04-2] "Only to be deployed as part of the deprecation of the old hosts." [puppet] - 10https://gerrit.wikimedia.org/r/541209 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [14:30:13] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Ladsgroup) >>! In T120085#5552102, @Nikerabbit wrote: > MediaWiki does not HTTP redirect (at least not in translatew... 
[14:31:26] (03PS2) 10Alexandros Kosiaris: Fully remove scap-helm [puppet] - 10https://gerrit.wikimedia.org/r/540843 (https://phabricator.wikimedia.org/T212130) [14:31:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fully remove scap-helm [puppet] - 10https://gerrit.wikimedia.org/r/540843 (https://phabricator.wikimedia.org/T212130) (owner: 10Alexandros Kosiaris) [14:32:12] (03PS2) 10Ottomata: reportupdater::jobs::mysql.pp: Absent jobs affected by migration [puppet] - 10https://gerrit.wikimedia.org/r/540658 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [14:32:56] (03CR) 10Ottomata: [V: 03+2 C: 03+2] reportupdater::jobs::mysql.pp: Absent jobs affected by migration [puppet] - 10https://gerrit.wikimedia.org/r/540658 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [14:34:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/3; [edit interfaces interface-range disabled] mem... 
[14:35:03] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (10Papaul) [14:36:14] (03CR) 10Elukey: [C: 03+1] Set up presto single node on analytics1030 in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/540968 (owner: 10Ottomata) [14:39:23] (03PS4) 10Ottomata: ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [14:41:10] (03CR) 10Elukey: profile::kerberos::kdc: add support for bacula backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [14:54:24] (03CR) 10Ottomata: [C: 03+2] ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [14:56:26] (03PS1) 10Giuseppe Lavagetto: echostore: add LVS service IPs [dns] - 10https://gerrit.wikimedia.org/r/541275 (https://phabricator.wikimedia.org/T234464) [14:56:28] (03PS1) 10Giuseppe Lavagetto: echostore: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/541276 (https://phabricator.wikimedia.org/T234464) [14:59:54] (03PS1) 10Alexandros Kosiaris: restrouter: Kadelmia should listen on all IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/541278 (https://phabricator.wikimedia.org/T223953) [15:03:39] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:03] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:39] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:05:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:06:29] (03PS2) 10Alexandros Kosiaris: restrouter: Kademlia should listen on all IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/541278 (https://phabricator.wikimedia.org/T223953) [15:06:36] (03PS3) 10Jforrester: Remove defunct VisualEditorEnableNewMobileContext config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540972 (owner: 10Esanders) [15:06:48] (03CR) 10Jforrester: [C: 03+2] Remove defunct VisualEditorEnableNewMobileContext config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540972 (owner: 10Esanders) [15:07:47] (03Merged) 10jenkins-bot: Remove defunct VisualEditorEnableNewMobileContext config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540972 (owner: 10Esanders) [15:08:16] Amir1: everything keeps looking good [15:11:05] marostegui: now I see new writes on the table [15:11:07] that's good [15:12:18] Amir1: that's good, I would have expected to see more writes though, but so far no significant change [15:12:52] Q40Mio up to Q70Mio is not very big chunk of items [15:13:00] it's mostly new items that write [15:13:27] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop reading wmgVisualEditorEnableNewMobileContext (duration: 00m 52s) [15:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:35] (03PS3) 10Alexandros Kosiaris: Fully remove scap-helm [puppet] - 10https://gerrit.wikimedia.org/r/540843 
(https://phabricator.wikimedia.org/T212130) [15:14:49] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop writing wmgVisualEditorEnableNewMobileContext (duration: 00m 51s) [15:14:53] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) [15:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:35] (03PS2) 10Herron: add dhcp and netboot entries for logstash[12]02[012] [puppet] - 10https://gerrit.wikimedia.org/r/541260 [15:15:36] (03PS3) 10Herron: add forward/reverse dns entries for logstash[12]02[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 [15:16:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fully remove scap-helm [puppet] - 10https://gerrit.wikimedia.org/r/540843 (https://phabricator.wikimedia.org/T212130) (owner: 10Alexandros Kosiaris) [15:16:44] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) [15:17:06] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) [15:25:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Tested locally, works fine, merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/541278 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:25:55] (03Merged) 10jenkins-bot: restrouter: Kademlia should listen on all IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/541278 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:27:28] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . 
[15:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:59] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@334e809]: Update mobileapps to 16cb9ae [15:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:10] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005785 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:29:23] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . [15:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' . [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:28] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@334e809]: Update mobileapps to 16cb9ae (duration: 06m 28s) [15:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:05] (03PS5) 10Elukey: profile::kerberos::kdc: add support for bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/540832 (https://phabricator.wikimedia.org/T226089) [15:37:04] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) https://phabricator.wikimedia.org/P9250 [15:42:08] Bye bots o/ [15:44:32] fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/core/': Could not resolve host: gerrit.wikimedia.org [15:44:44] Always a good sign. [15:45:14] It failed once for me now works [15:45:44] had similar errors in some CI builds :/ [15:45:58] Yeah, it killed all running jobs in CI, whatever it was. [15:46:01] * James_F sighs. 
[15:46:47] James_F that'll be the openstack upgrade [15:47:04] ah, that would explain it, if dns failed more people would be angry now [15:47:20] as in production dns vs openstack resolution [15:47:52] Should I file a task or is it going to get fixed soon? [15:48:37] ask/monitor on cloud channel [15:48:45] they did mention network hiccups in the email [15:48:46] https://lists.wikimedia.org/pipermail/cloud-announce/2019-October/000215.html [15:50:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315 after schema change T233625', diff saved to https://phabricator.wikimedia.org/P9251 and previous config saved to /var/cache/conftool/dbconfig/20191007-155038-marostegui.json [15:51:02] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:53:06] our openstack upgrado should have nothing to do with the .wikimedia.org domain [15:53:13] upgrade* [15:53:41] still nothing all of my jenkins jobs work [15:55:17] if jenkins is running in CloudVPS (I believe at least part of it is) then that should be related indeed [15:57:30] arturo: Yeah, all of CI is. [15:58:42] (Also the bots died for the same reason.) 
[16:02:16] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:23:35] (03PS3) 10Andrew Bogott: nova: don't install nova libvirt originals on cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/541295 [16:23:37] (03PS4) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [16:39:15] (03PS4) 10Andrew Bogott: nova: don't install nova libvirt originals on cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/541295 [16:39:17] (03PS5) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [16:41:12] sorry about the network issues, we had some incorrect originating IPs. Things should recover shortly. [16:41:17] ^ cloudwise, I mean [16:42:11] (03CR) 10Andrew Bogott: [C: 03+2] nova: don't install nova libvirt originals on cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/541295 (owner: 10Andrew Bogott) [16:55:28] (03PS2) 10Eevans: [WIP]: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [16:55:50] (03CR) 10Eevans: [WIP]: cassandra config updates for 3.11.4 upgrade (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [16:57:10] (03CR) 10Eevans: [C: 04-1] "> Left a couple of comments. 
Otherwise, there are quite a number of places where this patch introduces trailing white space to lines, I th" [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [16:58:07] (03PS1) 10Jhedden: openstack: pin mitaka version on jessie openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/541310 (https://phabricator.wikimedia.org/T212302) [16:59:58] (03PS1) 10SBassett: Beta Cluster cross-wiki login request would be blocked by CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) [17:00:04] gehel and onimisionipe: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T1700). [17:01:16] jouncebot: no deploy [17:06:18] (03CR) 10Andrew Bogott: [C: 03+1] "this is what I would do, for the short run" [puppet] - 10https://gerrit.wikimedia.org/r/541310 (https://phabricator.wikimedia.org/T212302) (owner: 10Jhedden) [17:08:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "We don't have Openstack Newton packages available for Debian Jessie. LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/541310 (https://phabricator.wikimedia.org/T212302) (owner: 10Jhedden) [17:08:44] (03CR) 10Jforrester: Beta Cluster cross-wiki login request would be blocked by CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [17:10:00] (03PS6) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [17:10:02] (03PS1) 10Andrew Bogott: nova-api: allow forcing a host when creating a VM [puppet] - 10https://gerrit.wikimedia.org/r/541314 [17:11:45] (03PS2) 10Andrew Bogott: nova-api: allow forcing a host when creating a VM [puppet] - 10https://gerrit.wikimedia.org/r/541314 [17:11:47] (03PS7) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [17:13:01] (03CR) 10Andrew Bogott: [C: 03+2] nova-api: allow forcing a host when creating a VM [puppet] - 10https://gerrit.wikimedia.org/r/541314 (owner: 10Andrew Bogott) [17:22:10] !log add BGP route damping on IX sessions - eqsin - T222424 [17:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:13] T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 [17:22:18] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) [17:24:46] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:24:56] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:24:56] PROBLEM - Disk space on 
stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [17:25:10] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:25:12] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001 - https://phabricator.wikimedia.org/T232137 (10Jgreen) [17:25:16] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:25:32] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:25:39] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [17:25:50] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:25:52] PROBLEM - SSH on stat1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:26:02] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [17:26:06] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:26:08] (03PS9) 10Ayounsi: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 
(https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [17:26:22] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Jgreen) [17:27:08] !log add BGP route damping on IX sessions - esams - T222424 [17:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:30] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:19] !log add BGP route damping on IX sessions - eqiad - T222424 [17:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:04] T222424: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 [17:29:22] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) 05Open→03Resolved All done! [17:30:48] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for frban2001 [dns] - 10https://gerrit.wikimedia.org/r/541317 [17:32:44] (03CR) 10Ayounsi: [C: 03+2] "There is a bug on (at least some of) the SRXs where the commit check times out and never validates the commit check." 
(032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [17:33:12] (03CR) 10SBassett: Beta Cluster cross-wiki login request would be blocked by CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [17:33:50] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) [17:34:34] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Papaul) @Jgreen you can take this task after you review the mgmt and production DNS [17:41:40] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [17:46:16] !log stat1007 is unresponsive, can't login via mgmt either. powercycling. 
[17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:00] PROBLEM - Host stat1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:44] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:49:50] RECOVERY - Host stat1007 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:49:56] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:50:12] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:50:16] RECOVERY - SSH on stat1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:28] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:50:40] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:52] (03PS8) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [17:50:55] (03PS1) 10Andrew Bogott: neutron: set rpc workers to 25 [puppet] - 10https://gerrit.wikimedia.org/r/541323 [17:50:56] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:50:58] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [17:51:12] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:51:14] (03PS1) 10Mforns: :reportupdater:manifests:job.pp: fix typo in config-file param [puppet] - 10https://gerrit.wikimedia.org/r/541324 (https://phabricator.wikimedia.org/T223414) [17:51:53] (03CR) 10Andrew Bogott: [C: 03+2] neutron: set rpc workers to 25 [puppet] - 10https://gerrit.wikimedia.org/r/541323 (owner: 10Andrew Bogott) [17:52:34] def some OOM, last syslogs are about nrpe not being able to allocate memory [17:52:36] ls [17:52:56] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:58:02] (03CR) 10EBernhardson: [C: 03+1] "+1 in general, although i'm not familiar with completion using DB again. Is there a reference ticket we can link?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539117 (owner: 10DCausse) [17:59:22] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T1800). [18:00:04] MatmaRex: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:09] (03PS3) 10Filippo Giunchedi: logstash: parse nested json from mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) [18:00:20] hi [18:00:22] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: parse nested json from mmkubernetes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [18:00:30] MatmaRex: I can SWAT today [18:00:54] thanks! [18:01:25] MatmaRex: from the deployment calendar: "Changes in the submodule must be merged first: 541296, 541299", that seems done already, right? [18:01:52] yeah [18:01:59] they weren't merged when i wrote that :) [18:02:08] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:02:11] Ok, just wanted to make sure there's not a typo or anything [18:02:12] thanks [18:02:26] Urbanecm: Yeah, I landed them when CI came back up. [18:03:03] thx James_F [18:03:38] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/541303 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/541305 +2'ed [18:03:42] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:53] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10fgiunchedi) I'm +1 on turning on rebase if necessary and see how things play... 
[18:06:34] (03PS2) 10Urbanecm: Enable NewUserMessage on sq.wikipedia and sq.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541009 (https://phabricator.wikimedia.org/T234499) [18:06:39] (03CR) 10Urbanecm: [C: 03+2] Enable NewUserMessage on sq.wikipedia and sq.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541009 (https://phabricator.wikimedia.org/T234499) (owner: 10Urbanecm) [18:07:40] (03Merged) 10jenkins-bot: Enable NewUserMessage on sq.wikipedia and sq.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541009 (https://phabricator.wikimedia.org/T234499) (owner: 10Urbanecm) [18:08:57] (03PS4) 10Filippo Giunchedi: logstash: parse nested json from mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) [18:09:02] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: parse nested json from mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [18:10:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: f434ae3: Enable NewUserMessage on sq.wikipedia and sq.wikiquote (T234499) (duration: 00m 52s) [18:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:17] T234499: Enable NewUserMessage on sq.wikipedia and sq.wikiquote - https://phabricator.wikimedia.org/T234499 [18:12:04] !log start swiftrepl eqiad -> codfw (no deletes) [18:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:10] !log apt dist-upgrade on all cloudvirts (for nova upgrades) [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:18] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Mon 2019-10-07 18:12:17 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [18:12:35] (03PS5) 10Phedenskog: Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) [18:12:48] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:13:16] (03CR) 10Eevans: [V: 03+2 C: 03+2] Updated list of RESTBase hosts [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/522218 (https://phabricator.wikimedia.org/T222960) (owner: 10Eevans) [18:13:34] (03CR) 10Jgreen: [C: 03+2] DNS: Add mgmt and production DNS for frban2001 [dns] - 10https://gerrit.wikimedia.org/r/541317 (owner: 10Papaul) [18:13:42] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm) [18:14:01] (03CR) 10Jhedden: [C: 03+2] openstack: pin mitaka version on jessie openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/541310 (https://phabricator.wikimedia.org/T212302) (owner: 10Jhedden) [18:14:13] (03PS2) 10Jhedden: openstack: pin mitaka version on jessie openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/541310 (https://phabricator.wikimedia.org/T212302) [18:15:16] (03PS2) 10Jforrester: [Beta Cluster] Let cross-wiki login requests work when CSP is switched on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [18:15:18] (03PS1) 10Jforrester: CommonSettings: Run Labs config after CSP config so it can change it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541339 [18:16:03] !log roll-restart logstash to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/539978 [18:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:30] (03CR) 
10Jforrester: "How's this? My only concern is that this replaces rather than extends img-src and so if we ever start setting that it'll break… but it kee" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [18:17:58] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.07664 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:18:49] James_F: after applying git submodule update extensions/VisualEditor, /srv/mediawiki-staging/php-1.34.0-wmf.25 is still dirty, could you please help? [18:19:29] * James_F looks. [18:19:55] thank you [18:20:09] Urbanecm: Did you do a submodule update inside MW-VE to pick up VE itself? [18:20:34] James_F: good point, thanks [18:20:37] now it seems to be okay [18:20:42] Cool. [18:20:53] MatmaRex: both patches should be ready at mwdebug1002, could you have a look? [18:20:54] And yes, VE and Wikibase make things more complex with submodules. :-) [18:21:30] :) [18:21:41] looking [18:22:45] Urbanecm: both look good. 
thanks [18:22:52] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002169 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:22:53] MatmaRex: syncing [18:25:08] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/VisualEditor/: SWAT: 011b6eb: 11033b7: Update VE core submodule to 2ffb699eb (TreeModifier fixes), T234489, T234742 + ve.ui.MWDefinedTransclusionContextItem: Fix handling of template names (T234817) (duration: 00m 53s) [18:25:16] PROBLEM - nova-compute proc minimum on cloudvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:25:19] MatmaRex: should be synced [18:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:20] T234817: Template context menu doesn't appear in some cases - https://phabricator.wikimedia.org/T234817 [18:25:21] T234742: TreeModifier: ensureNotText can return the wrong position - https://phabricator.wikimedia.org/T234742 [18:25:21] T234489: Problems with deleting/cutting/moving references in VisualEditor - https://phabricator.wikimedia.org/T234489 [18:26:10] !log Morning SWAT done [18:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:14] Urbanecm: thanks, indeed looks good [18:26:22] thank you both Urbanecm James_F :) [18:26:26] happy to help, MatmaRex [18:26:30] ah damn, I missed the morning SWAT? [18:26:38] Lucas_WMDE: seems so [18:26:42] I wanted to backport something but got very sidetracked [18:26:44] oops [18:26:48] I've already closed it, btw [18:26:49] Lucas_WMDE: We can sling it out. 
[18:26:54] (03PS4) 10Herron: add forward/reverse dns entries for logstash[12]02[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 [18:26:54] RECOVERY - nova-compute proc minimum on cloudvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:26:59] Unlimited power etc. [18:27:11] I think I’ll deploy it out-of-SWAT – it should really be deployed before today’s Wikidata entity dumps start [18:27:15] (which would be 23:00 UTC, I believe) [18:27:21] (03PS1) 10Jhedden: openstack: Update jessie openstack clients in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/541342 (https://phabricator.wikimedia.org/T212302) [18:27:42] (03PS2) 10Jhedden: openstack: Update jessie openstack clients in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/541342 (https://phabricator.wikimedia.org/T212302) [18:29:23] (03PS9) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [18:30:08] (03CR) 10Herron: [C: 03+2] add forward/reverse dns entries for logstash[12]02[012] [dns] - 10https://gerrit.wikimedia.org/r/541259 (owner: 10Herron) [18:30:46] (03CR) 10Jhedden: [C: 03+2] openstack: Update jessie openstack clients in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/541342 (https://phabricator.wikimedia.org/T212302) (owner: 10Jhedden) [18:33:29] okay, sidetrack finished (the outcome is https://www.mediawiki.org/wiki/Manual:ChronologyProtector if anyone’s interested) [18:33:36] !log reopen Morning SWAT for another backport (sorry) [18:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:15] backport is https://gerrit.wikimedia.org/r/540419, let’s hope gate-and-submit doesn’t take too long… [18:38:27] (03PS3) 10Herron: add dhcp and netboot entries for logstash[12]02[012] [puppet] - 10https://gerrit.wikimedia.org/r/541260 [18:43:39] (03CR) 10Herron: 
[C: 03+2] add dhcp and netboot entries for logstash[12]02[012] [puppet] - 10https://gerrit.wikimedia.org/r/541260 (owner: 10Herron) [18:50:34] (03PS10) 10Andrew Bogott: Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 [18:53:08] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put in maintenance mode for the mitaka->newton upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/541134 (owner: 10Andrew Bogott) [19:04:00] still waiting on CI for that backport… [19:04:59] (03PS1) 10Andrew Bogott: keystone wmtotp: update newton code to handle new args [puppet] - 10https://gerrit.wikimedia.org/r/541350 [19:05:46] (03CR) 10Andrew Bogott: [C: 03+2] keystone wmtotp: update newton code to handle new args [puppet] - 10https://gerrit.wikimedia.org/r/541350 (owner: 10Andrew Bogott) [19:06:50] backport merged, deploying… [19:07:35] works as intended on mwdebug1002, syncing [19:09:33] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.25/extensions/Wikibase: SWAT: [[gerrit:540419|Revert "Format coordinates with limited precision" (T174504)]] (duration: 00m 57s) [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:37] T174504: Coordinates are exported into RDF with excessive precision - https://phabricator.wikimedia.org/T174504 [19:10:58] !log Morning SWAT done [19:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:03] sorry for the hold-up everyone [19:11:57] (03PS1) 10Andrew Bogott: keystone wmtotp: catch up with another upstream change [puppet] - 10https://gerrit.wikimedia.org/r/541353 [19:12:49] (03CR) 10Andrew Bogott: [C: 03+2] keystone wmtotp: catch up with another upstream change [puppet] - 10https://gerrit.wikimedia.org/r/541353 (owner: 10Andrew Bogott) [19:18:11] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [19:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:20:23] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:19] (03PS3) 10Jforrester: Add the beta REL1_34 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 [19:25:06] (03CR) 10Jforrester: [C: 03+2] "To the Sea, to the Sea! The white gulls are crying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 (owner: 10Jforrester) [19:26:35] (03Merged) 10jenkins-bot: Add the beta REL1_34 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 (owner: 10Jforrester) [19:31:11] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Add the beta REL1_34 to ExtensionDistributor (duration: 00m 50s) [19:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:43] (03PS1) 10Ladsgroup: Add ban language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) [19:34:09] (03CR) 10jerkins-bot: [V: 04-1] Add ban language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [19:36:06] (03PS2) 10Ladsgroup: Add ban language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) [19:42:46] (03PS1) 10Umherirrender: Add 'periodical' as run mode to $wgDisableQueryPageUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) [19:43:52] (03PS1) 10Ottomata: Ensure eventlogging-consumer mysql is absent on eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/541359 (https://phabricator.wikimedia.org/T223414) [19:44:18] (03CR) 10Ottomata: [C: 04-1] "Waiting a day at least to ensure hive reportupdater is working" [puppet] - 10https://gerrit.wikimedia.org/r/541359 (https://phabricator.wikimedia.org/T223414) (owner: 10Ottomata) [19:44:23] (03CR) 10Umherirrender: [C: 04-1] "I 
have created a new patch set for 'periodical': Ied48d3cd918e1c87cae48939e0275d975d3bf7c9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [19:44:48] (03CR) 10Umherirrender: "Or better copy that to labs config?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [19:45:08] (03Abandoned) 10Paladox: Gerrit: Add new line to gerrit1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/540726 (owner: 10Paladox) [19:46:16] (03CR) 10Umherirrender: "But it seems the cron is not running on labs - data from 2018 under https://de.wikipedia.beta.wmflabs.org/wiki/Spezial:Älteste_Seiten" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T2000). 
[20:00:13] no parsoid deploy today [20:02:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) [20:05:35] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) p:05Triage→03Normal [20:06:59] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) [20:11:57] (03PS1) 10Jgreen: add frban1001 and frnetmon1001 [dns] - 10https://gerrit.wikimedia.org/r/541362 (https://phabricator.wikimedia.org/T232137) [20:13:02] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 2 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10crusnov) After discussing this a bit and thinking about it quite a lot, I'm highly in favor of a machine git repo for the generated side. This... [20:17:44] (03CR) 10Jgreen: [C: 03+1] Restrict NTP servers to production networks (including frack and network gear) [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [20:21:22] (03PS3) 10Eevans: [WIP]: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [20:24:17] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks for taking care of this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) (owner: 10CDanis) [20:25:46] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: output mediawiki type to logstash-medaiwiki ES index [puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron) [20:26:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo what Tyler said" [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [20:26:31] (03CR) 10Papaul: [C: 03+1] add frban1001 and frnetmon1001 [dns] - 10https://gerrit.wikimedia.org/r/541362 (https://phabricator.wikimedia.org/T232137) (owner: 10Jgreen) [20:27:25] (03CR) 10Jgreen: [C: 03+2] add frban1001 and frnetmon1001 [dns] - 10https://gerrit.wikimedia.org/r/541362 (https://phabricator.wikimedia.org/T232137) (owner: 10Jgreen) [20:28:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [20:29:11] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [20:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] (03CR) 10Herron: "Great! 
Will plan to move forward with this tomorrow morning" [puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron) [20:30:20] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [20:30:20] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [20:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:18] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:23] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:31:23] (03PS1) 10Ayounsi: Add BGP prefix damping to IX policies [homer/public] - 10https://gerrit.wikimedia.org/r/541367 (https://phabricator.wikimedia.org/T222424) [20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:30] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [20:31:31] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [20:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:15] (03CR) 1020after4: "I might be wrong about the quotes. We can try leaving them in and then see if it works but I wanted to avoid another round of submitting o" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [20:32:17] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Already push to prod." 
[homer/public] - 10https://gerrit.wikimedia.org/r/541367 (https://phabricator.wikimedia.org/T222424) (owner: 10Ayounsi) [20:32:25] PROBLEM - DNS kafka2002.mgmt on kafka2002.mgmt is CRITICAL: Domain kafka2002.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:46] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Tacsipacsi) >>! In T120085#5551136, @Ladsgroup wrote: > Very valid point, I personally would be okay with not turnin... [20:33:10] (03CR) 10SBassett: [C: 03+1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [20:33:31] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:34] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:38] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:58] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:41:00] (03PS4) 10Eevans: [WIP]: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [20:41:41] (03PS2) 10Mholloway: Update wikifeeds chart to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/540967 (https://phabricator.wikimedia.org/T170455) [20:41:59] (03CR) 10Filippo Giunchedi: "LGTM 
overall! See inline for metrics naming comment and the rest looks good" (033 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [20:42:20] (03PS1) 10Ayounsi: Add kerberos hosts to analytics-in4 + add kerberos to analytics-in6 [homer/public] - 10https://gerrit.wikimedia.org/r/541370 (https://phabricator.wikimedia.org/T226089) [20:42:58] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Already live in prod. Diff tested." [homer/public] - 10https://gerrit.wikimedia.org/r/541370 (https://phabricator.wikimedia.org/T226089) (owner: 10Ayounsi) [20:43:19] (03Abandoned) 10Filippo Giunchedi: rsyslog: Correctly parse docker logs [puppet] - 10https://gerrit.wikimedia.org/r/539519 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [20:44:22] (03PS1) 10Jeena Huneidi: Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) [20:45:20] (03PS2) 10Jeena Huneidi: Use new dev image for parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/541371 (https://phabricator.wikimedia.org/T234578) [20:49:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/540486 (owner: 10Herron) [20:51:09] (03PS3) 10Filippo Giunchedi: role: drop Thanos labels from global Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/539342 (https://phabricator.wikimedia.org/T233956) [20:51:15] (03CR) 10Filippo Giunchedi: [C: 03+2] role: drop Thanos labels from global Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/539342 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [20:58:41] 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on analytics1049 - https://phabricator.wikimedia.org/T234785 (10wiki_willy) a:03Cmjohnson Hi @elukey - looks like this host is out of warranty (ended in June 2018). Let me know if you want us to purchase a replacement part or if this system is c... 
[21:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T2100). [21:00:27] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:01:32] (03CR) 10Thcipriani: [C: 03+1] Fix phatality deployment script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540117 (owner: 1020after4) [21:03:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10wiki_willy) Thanks @elukey. Should we ignore/resolve this alert then? Thanks, Willy [21:03:24] (03PS1) 10Filippo Giunchedi: role: fix Prometheus global metric relabel config [puppet] - 10https://gerrit.wikimedia.org/r/541374 (https://phabricator.wikimedia.org/T233956) [21:04:31] (03CR) 10Filippo Giunchedi: [C: 03+2] "Turns out the syntax wasn't what I thought it was, the previous config results in" [puppet] - 10https://gerrit.wikimedia.org/r/541374 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [21:07:28] sbassett: OK for me to deploy your CSP change for Beta Cluster? [21:07:40] (03PS1) 10Ayounsi: Add templating for trusted_space [homer/public] - 10https://gerrit.wikimedia.org/r/541375 [21:07:47] James_F: Yes. Thanks. [21:07:57] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Run Labs config after CSP config so it can change it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541339 (owner: 10Jforrester) [21:09:32] (03CR) 10Ayounsi: "I'm wondering if we could add a Jinja filter to check if an IP is IPv4 or IPv6." 
[homer/public] - 10https://gerrit.wikimedia.org/r/541375 (owner: 10Ayounsi) [21:12:33] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: sanity checks for cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [21:12:37] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10wiki_willy) @Dzahn - just wanted to confirm that this has been depooled. Thanks, Willy [21:14:20] 10Operations, 10ops-codfw, 10media-storage, 10User-fgiunchedi: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10fgiunchedi) [21:15:54] (03PS1) 10Dzahn: parsoid/conftool: add wtp servers as apache appservers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [21:16:33] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) @Cmjohnson - let me know if we need to order a replacement drive (along with what type of disk), since it's out of warranty. Thanks, Willy [21:19:20] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp1025 is CRITICAL: Host wtp1025 is not in mediawiki-installation dsh group daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:20:13] !log swift codfw-prod: add ms-be205[3456] - T233638 [21:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:17] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [21:21:17] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Jdforrester-WMF) p:05Low→03High If this isn't done before tomorrow, the train rollout will break wikitechwiki. 
:-( [21:21:42] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 2 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Jdforrester-WMF) [21:22:17] (03Abandoned) 10Dzahn: conftool: turn wtp1025 and wtp2001 into test servers [puppet] - 10https://gerrit.wikimedia.org/r/540684 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [21:29:14] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 2 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Ladsgroup) >>! In T223393#5553766, @Jdforrester-WMF wrote: > If this isn't done before tomorrow, the train rollout will break wikite... [21:30:03] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 2 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Jdforrester-WMF) Well, we can pin this wiki for special treatment for a week or two if needed. [21:34:16] (03PS3) 10Reedy: wikitech: change Apache config from hhvm to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [21:36:38] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10Dzahn) No, it's not depooled. Let's wait a day please because traffic is mostly out today. 
[21:37:17] (03PS2) 10Jforrester: CommonSettings: Run Labs config after CSP config so it can change it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541339 [21:37:21] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541339 (owner: 10Jforrester) [21:37:44] (03PS3) 10Jforrester: [Beta Cluster] Let cross-wiki login requests work when CSP is switched on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [21:37:49] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Let cross-wiki login requests work when CSP is switched on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [21:38:08] (03Merged) 10jenkins-bot: CommonSettings: Run Labs config after CSP config so it can change it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541339 (owner: 10Jforrester) [21:38:42] (03Merged) 10jenkins-bot: [Beta Cluster] Let cross-wiki login requests work when CSP is switched on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541312 (https://phabricator.wikimedia.org/T211539) (owner: 10SBassett) [21:40:29] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Run Labs config after CSP config so it can change it (duration: 00m 51s) [21:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:39] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10wiki_willy) Ok @Dzahn - just let us know when it's ready to go. 
Thanks, Willy [21:43:33] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:45:58] (03CR) 10Dzahn: [V: 03+1 C: 03+1] Remove gerrit-slave from dns [dns] - 10https://gerrit.wikimedia.org/r/541112 (owner: 10Paladox) [21:46:56] (03PS3) 10Dzahn: Add ban language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [21:47:17] (03PS4) 10Dzahn: Add Balinese (ban) language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [21:48:04] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10fgiunchedi) @bd808 I'm echoing what @MoritzMuehlenhoff said (thanks!) and going with Buster seems worthwhile to me. Specifically Grafana 6 is a safe upgrade AFAIK (cc... [21:49:24] (03CR) 10Dzahn: [C: 03+2] Add Balinese (ban) language [dns] - 10https://gerrit.wikimedia.org/r/541356 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [21:49:57] (03PS3) 10Paladox: Remove gerrit-slave from dns [dns] - 10https://gerrit.wikimedia.org/r/541112 [21:52:49] (03CR) 10Dzahn: [C: 03+1] "i don't find it anymore anywhere in /var/lib/gerrit2/review_site but fwiw, i also don't find "gerrit-replica" there." 
[dns] - 10https://gerrit.wikimedia.org/r/541112 (owner: 10Paladox) [21:53:24] (03CR) 10Paladox: "> i don't find it anymore anywhere in /var/lib/gerrit2/review_site" [dns] - 10https://gerrit.wikimedia.org/r/541112 (owner: 10Paladox) [21:54:09] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:56:50] !log gerrit2001 - sudo rm /etc/apache2/sites-available/50-gerrit-slave-wikimedia-org.conf [21:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:35] (03CR) 10Dzahn: [C: 03+2] Remove gerrit-slave from dns [dns] - 10https://gerrit.wikimedia.org/r/541112 (owner: 10Paladox) [21:57:47] (03CR) 10Dzahn: "17:56 < mutante> !log gerrit2001 - sudo rm /etc/apache2/sites-available/50-gerrit-slave-wikimedia-org.conf" [dns] - 10https://gerrit.wikimedia.org/r/541112 (owner: 10Paladox) [21:57:52] thanks mutante! [22:02:36] (03CR) 10Dzahn: "i just started using that in gerrit::migration though ... modules/profile/manifests/gerrit/migration.pp: $source_host = lookup(gerrit::" [puppet] - 10https://gerrit.wikimedia.org/r/541108 (owner: 10Paladox) [22:03:26] (03PS2) 10Krinkle: Move the remaining wikis to AbuseFilterCachingParser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541026 (https://phabricator.wikimedia.org/T156096) (owner: 10Daimona Eaytoy) [22:03:35] (03PS3) 10Krinkle: Move the remaining wikis to AbuseFilterCachingParser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541026 (https://phabricator.wikimedia.org/T156096) (owner: 10Daimona Eaytoy) [22:03:40] (03CR) 10Paladox: "> i just started using that in gerrit::migration though ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/541108 (owner: 10Paladox) [22:03:53] (03PS4) 10Paladox: gerrit: Remove master_host variable from profile::gerrit::server [puppet] - 10https://gerrit.wikimedia.org/r/541108 [22:04:00] (03PS5) 10Paladox: gerrit: Remove master_host variable from profile::gerrit::server [puppet] - 10https://gerrit.wikimedia.org/r/541108 [22:04:08] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/541108 (owner: 10Paladox) [22:04:18] (03PS4) 10Paladox: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541110 [22:08:29] (03CR) 10BryanDavis: [C: 04-1] "PCC output: https://puppet-compiler.wmflabs.org/compiler1002/18768/" [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [22:11:09] (03CR) 10BryanDavis: [C: 04-1] wikitech: change Apache config from hhvm to php-fpm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [22:16:57] (03PS4) 10BryanDavis: wikitech: change Apache config from hhvm to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [22:17:42] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18769/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/541108 (owner: 10Paladox) [22:20:55] (03PS1) 10Dzahn: gerrit::migration: switch master to gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/541382 [22:22:37] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10fgiunchedi) Thanks @colewhite for starting this! I'm cc'ing @eevans as I know he's interested in a standardized logging schema too and we've chatted about it in the past as well. 
[22:25:37] (03CR) 10Krinkle: [C: 03+2] Move the remaining wikis to AbuseFilterCachingParser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541026 (https://phabricator.wikimedia.org/T156096) (owner: 10Daimona Eaytoy) [22:26:26] (03Merged) 10jenkins-bot: Move the remaining wikis to AbuseFilterCachingParser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541026 (https://phabricator.wikimedia.org/T156096) (owner: 10Daimona Eaytoy) [22:26:37] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:27:11] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:27:45] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:27:59] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [22:28:17] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:28:23] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:28:29] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:28:39] PROBLEM - Check whether ferm is active by 
checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:29:23] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:29:25] !log restart nagios-nrpe-server on stat1007 [22:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:37] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [22:29:55] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:30:01] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:30:07] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:30:17] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:30:25] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:30:33] (03PS5) 10BryanDavis: wikitech: change Apache config from hhvm to php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [22:32:52] * Krinkle staging on mwdebug1002 [22:34:47] PROBLEM - DNS kafka2003.mgmt on kafka2003.mgmt is CRITICAL: Domain kafka2003.mgmt.codfw.wmnet was not found by the server 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:35:14] (03PS1) 10Jforrester: CommonSettings-labs: Run CSP fiddles as an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541384 [22:35:55] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings-labs: Run CSP fiddles as an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541384 (owner: 10Jforrester) [22:37:11] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:38:16] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/18771/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510949 (https://phabricator.wikimedia.org/T223393) (owner: 10Dzahn) [22:39:49] (03PS6) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [22:40:21] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7b9e6829821, T156095 (duration: 00m 51s) [22:40:23] (03CR) 10Cwhite: initial commit (033 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:25] T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe - https://phabricator.wikimedia.org/T156095 [22:42:29] (03PS2) 10Jforrester: CommonSettings-labs: Run CSP fiddles as an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541384 [22:45:00] (03CR) 10Jforrester: [C: 03+2] CommonSettings-labs: Run CSP fiddles as an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541384 (owner: 10Jforrester) [22:46:04] (03Merged) 10jenkins-bot: CommonSettings-labs: Run CSP fiddles as an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541384 
(owner: 10Jforrester) [22:46:07] (03PS3) 10Paladox: Gerrit: Disable auto reloading replication config [puppet] - 10https://gerrit.wikimedia.org/r/541115 [22:46:09] (03CR) 10Cwhite: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: 10Jbond) [22:46:32] (03PS1) 10MaxSem: labs: Enable $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541385 (https://phabricator.wikimedia.org/T234861) [22:47:17] (03PS1) 10Paladox: Gerrit: Switch replication url for replica to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/541386 [22:48:11] (03PS2) 10Paladox: Gerrit: Switch replication url for replica to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/541386 [22:49:26] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [22:50:40] (03CR) 10Dzahn: [C: 03+1] "yes, per IRC. thanks for making it." [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [22:51:16] (03CR) 10Dzahn: [C: 03+1] "i tested i can ssh from cobalt to gerrit-replica when doing it as the user gerrit2" [puppet] - 10https://gerrit.wikimedia.org/r/541386 (owner: 10Paladox) [22:54:46] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline for logic change, the rest LGTM and merge at will once fixed!" (031 comment) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:57:17] (03PS1) 10Dzahn: cumin: remove yubiauth alias [puppet] - 10https://gerrit.wikimedia.org/r/541388 [22:59:58] (03CR) 10Dzahn: "Paladox, maybe we can use topic branches to make it easy to see what should be merged when." [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191007T2300). [23:00:05] MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:16] (03CR) 10Dzahn: [C: 04-1] "let's lower the TTL so we can revert faster. this change is a bit early." [dns] - 10https://gerrit.wikimedia.org/r/541111 (owner: 10Paladox) [23:01:55] (03CR) 10MaxSem: [C: 03+2] labs: Enable $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541385 (https://phabricator.wikimedia.org/T234861) (owner: 10MaxSem) [23:02:52] (03Merged) 10jenkins-bot: labs: Enable $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541385 (https://phabricator.wikimedia.org/T234861) (owner: 10MaxSem) [23:08:11] (03PS7) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [23:12:18] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10wikitech.wikimedia.org, and 3 others: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10bd808) a:05Dzahn→03Andrew Stealing this cookie from @Dzahn and handing it to @Andrew. I did a little work on the Gerrit patch @D... [23:20:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [23:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:10] !log dzahn@cumin1001 Updating IPMI password on 1254 hosts - dzahn@cumin1001 [23:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [23:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:32] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: no-op / config cache issue? 
(duration: 00m 49s) [23:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:59] James_F: Four deployments on three separate days (AF-related) each time, settings did not change until second sync of the same time. [23:29:03] Gave it half an hour. [23:29:07] strange uh? [23:29:53] I checked various app servers, the source file is up to date. And for mwdebug the config cache seems up to date as well. Haven't checked that on other servers, should do that next time. [23:30:16] Might also be AF-specific though couldn't find any caching of that setting [23:34:19] PROBLEM - mediawiki-installation DSH group on wtp2001 is CRITICAL: Host wtp2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:35:07] (03PS1) 10Paladox: Gerrit: Lower TTL to 300 [dns] - 10https://gerrit.wikimedia.org/r/541393 [23:35:39] (03PS2) 10Paladox: Gerrit: Lower TTL to 300 [dns] - 10https://gerrit.wikimedia.org/r/541393 [23:37:15] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp2001 is CRITICAL: Host wtp2001 is not in mediawiki-installation dsh group daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:43:33] (03PS1) 10Ayounsi: Add conditional sampling in firewall filters [homer/public] - 10https://gerrit.wikimedia.org/r/541394 [23:49:00] (03PS5) 10DannyS712: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 [23:52:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [23:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:33] !log dzahn@cumin1001 Updating IPMI password on 1254 hosts - dzahn@cumin1001 [23:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log