[02:14:19] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29203128 and 0 seconds [02:21:33] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 94457256 and 6 seconds [02:28:51] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 73160 and 58 seconds [04:13:48] (03PS1) 10Legoktm: Enable ExtensionDistributor log channel to help with T225243 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517356 [04:52:13] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2058 is CRITICAL: cluster=mysql device=cciss,11 instance=db2058:9100 job=node site=codfw Marostegui T225902 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw+prometheus/ops [04:52:58] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T225902 (10Marostegui) p:05Triage→03Normal a:03Papaul Can we get the disk replaced? Thanks! 
[04:59:44] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517357 (https://phabricator.wikimedia.org/T210725) [05:00:14] (03PS2) 10Marostegui: install_server: Allow installation of new dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/516758 (https://phabricator.wikimedia.org/T225704) [05:01:02] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517357 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:01:22] (03CR) 10Marostegui: [C: 03+2] install_server: Allow installation of new dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/516758 (https://phabricator.wikimedia.org/T225704) (owner: 10Marostegui) [05:02:04] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517357 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:02:19] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517357 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:03:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1008 and pool pc1010 temporarily while pc1008 gets all its tables optimized T210725 (duration: 00m 59s) [05:03:53] !log Optimize all pc1008's tables T210725 [05:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:55] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [05:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:30] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) 05Open→03Resolved Closing this for now as it self-recovered and never showed up again. 
[05:07:37] (03PS1) 10Marostegui: db-codfw.php: Depool db2107 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517358 [05:09:22] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2107 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517358 (owner: 10Marostegui) [05:10:13] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2107 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517358 (owner: 10Marostegui) [05:10:30] (03CR) 10jenkins-bot: db-codfw.php: Depool db2107 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517358 (owner: 10Marostegui) [05:11:20] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2107 to clone db2051 (duration: 00m 47s) [05:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:36] !log Stop MySQL on db2107 to clone db2051 - T221533 [05:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:41] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [05:18:32] (03PS1) 10Ema: cache: stop passing gethdr_extrachance to varnish [puppet] - 10https://gerrit.wikimedia.org/r/517359 (https://phabricator.wikimedia.org/T224694) [05:38:25] (03PS1) 10Marostegui: wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) [05:38:56] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/517360 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [05:43:37] (03PS1) 10Marostegui: mariadb: Promote db1081 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) [05:45:41] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/517361 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [05:50:21] (03PS1) 10Marostegui: db-eqiad.php: Set s4 in read only [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) [05:52:19] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517362 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [05:52:29] (03PS1) 10Marostegui: db-eqiad.php: Promote db1081 to s4 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) [05:53:57] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517363 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [06:04:03] 10Operations, 10DBA: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (10Marostegui) Some more errors from yesterday evening: ` [Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [Sun Jun 16 21:33:58 2019] {7}[Hardwa... [06:04:45] !log Stop MySQL on db2084 to reboot the host T225884 [06:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:50] T225884: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 [06:06:08] (03PS1) 10Marostegui: db-codfw.php: Depool db2084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517364 (https://phabricator.wikimedia.org/T225884) [06:07:18] (03PS1) 10Vgutierrez: varnish: Move query.wikidata.org ratelimit to the misc VCL [puppet] - 10https://gerrit.wikimedia.org/r/517365 [06:07:45] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517364 (https://phabricator.wikimedia.org/T225884) (owner: 10Marostegui) [06:08:38] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517364 (https://phabricator.wikimedia.org/T225884) (owner: 10Marostegui) [06:08:53] (03CR) 10jenkins-bot: db-codfw.php: Depool 
db2084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517364 (https://phabricator.wikimedia.org/T225884) (owner: 10Marostegui) [06:09:14] (03CR) 10Vgutierrez: varnish: Rate limit wdqs requests violating UA policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516803 (owner: 10Vgutierrez) [06:12:37] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2084 for a reboot (duration: 00m 48s) [06:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:39] (03CR) 10Smalyshev: [C: 03+1] varnish: Move query.wikidata.org ratelimit to the misc VCL [puppet] - 10https://gerrit.wikimedia.org/r/517365 (owner: 10Vgutierrez) [06:19:07] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517366 [06:20:33] 10Operations, 10DBA: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (10Marostegui) Host rebooted. No new logs on HW side. [06:20:36] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517366 (owner: 10Marostegui) [06:21:25] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517366 (owner: 10Marostegui) [06:21:50] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517366 (owner: 10Marostegui) [06:22:34] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2084 (duration: 00m 47s) [06:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:47] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:31:05] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/depool] [06:38:05] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [06:40:00] (03CR) 10Ema: [C: 03+1] varnish: Move query.wikidata.org ratelimit to the misc VCL [puppet] - 10https://gerrit.wikimedia.org/r/517365 (owner: 10Vgutierrez) [06:40:37] (03CR) 10Vgutierrez: [C: 03+2] varnish: Move query.wikidata.org ratelimit to the misc VCL [puppet] - 10https://gerrit.wikimedia.org/r/517365 (owner: 10Vgutierrez) [06:53:49] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [06:57:47] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:07] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:01:17] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2107" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517370 [07:04:39] PROBLEM - Check whether ferm is active by checking the default input chain on db2084 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:04:57] ^ I will check that [07:05:07] PROBLEM - Check systemd state on db2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:06:03] RECOVERY - Check whether ferm is active by checking the default input chain on db2084 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:06:31] RECOVERY - Check systemd state on db2084 is OK: OK - running: The system is fully operational [07:08:29] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2107" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517370 (owner: 10Marostegui) [07:09:18] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2107" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517370 (owner: 10Marostegui) [07:09:36] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2107" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517370 (owner: 10Marostegui) [07:10:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2107 (duration: 00m 47s) [07:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:17] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) db1077 has had its BBU in charging status for around 30h now. I have taken a look at the HW logs and: ` /system1/log1/record20 Targets Properties number=20 severity=C... [07:21:33] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) Also db1114 (test-s1) can be a host we can place instead of db1077 and move db1077 to be test-s1? [07:24:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10jcrespo) note db1114 was a host we removed from production because it was unstable. I would vote for another. Did you try depooling and forcing a learning cycle? 
[07:25:30] (03PS1) 10Urbanecm: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) [07:25:36] !log restart snmp daemon on mr1-eqsin [07:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:45] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) >>! In T225391#5261673, @jcrespo wrote: > note db1114 was a host we removed from production because it was unstable. I would vote for another. Did you try depooling and forcing a... [07:34:54] (03PS1) 10Urbanecm: Set nds_nlwiki's sitename and metanamespace back to defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517372 (https://phabricator.wikimedia.org/T224349) [07:45:04] 10Operations, 10Traffic, 10observability: varnish: implement FetchError logging - https://phabricator.wikimedia.org/T224994 (10ema) [07:51:15] 10Operations, 10netops: check_ospf.py fails on mr1-eqsin - https://phabricator.wikimedia.org/T225905 (10ayounsi) p:05Triage→03Normal [07:54:29] (03PS1) 10Ema: varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) [07:55:43] (03CR) 10jerkins-bot: [V: 04-1] varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) (owner: 10Ema) [07:56:14] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: varnish: implement FetchError logging - https://phabricator.wikimedia.org/T224994 (10ema) The initial plan of adding a synthetic header to varnish with the FetchError cause seems a little too complicated to implement. Send error logs to logstash... 
[07:57:27] (03PS2) 10Ema: varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) [08:17:22] (03PS1) 10Mathew.onipe: Cassandra nodetool repair cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) [08:18:31] (03CR) 10Muehlenhoff: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [08:18:47] (03CR) 10jerkins-bot: [V: 04-1] Cassandra nodetool repair cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [08:19:47] (03CR) 10Mathew.onipe: "pylint will still not run" [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [08:20:08] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10aaron) Some sort of meeting sounds reasonable. [08:24:20] (03CR) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [08:32:07] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) What I'd like to discuss in the meeting (or even in here) is the following: >>! In T208... [08:33:00] 10Operations, 10DC-Ops, 10netops, 10observability: Send some LibreNMS alerts to dcops and netops only - https://phabricator.wikimedia.org/T224180 (10ayounsi) 05Open→03Resolved a:03ayounsi I made the change to email dcops + me for those alerts. 
All have a linked runbook. The alerting might need to b... [08:36:19] (03PS1) 10Fdans: Refinery: Restore filter_out_non_wiki_hostname as function is fixed [puppet] - 10https://gerrit.wikimedia.org/r/517379 [08:40:53] (03CR) 10Muehlenhoff: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [08:47:34] (03PS2) 10Fdans: Refinery: Restore filter_out_non_wiki_hostname as function is fixed [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) [08:48:32] (03CR) 10jerkins-bot: [V: 04-1] Refinery: Restore filter_out_non_wiki_hostname as function is fixed [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:49:19] (03PS3) 10Elukey: profile::analytics::refinery::job::refine: restore filter [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:49:59] (03PS4) 10Elukey: profile::analytics::refinery::job::refine: restore filter [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:50:03] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::refine: restore filter [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:52:54] !log remove maps1001 from cassandra cluster - T224395 [08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:00] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [08:53:43] (03Abandoned) 10Fdans: Analytics Refinery: bump up jar version to apply latest source changes [puppet] - 10https://gerrit.wikimedia.org/r/517050 (owner: 10Fdans) [08:55:37] (03PS5) 10Fdans: profile::analytics::refinery::job::refine: restore filter [puppet] - 
10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) [08:56:22] (03PS6) 10Elukey: profile::analytics::refinery::job::refine: restore filter [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:57:27] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::refine: restore filter [puppet] - 10https://gerrit.wikimedia.org/r/517379 (https://phabricator.wikimedia.org/T225342) (owner: 10Fdans) [08:58:43] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10Joe) >>! In T208934#5261737, @elukey wrote: > What I'd like to discuss in the meeting (or even i... [09:00:33] (03PS1) 10Matthias Mullie: [SDC] Enable depicts qualifiers on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517381 [09:03:02] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10Joe) >>! In T208934#5261729, @aaron wrote: > Some sort of meeting sounds reasonable. We will f... 
[09:06:14] (03PS1) 10Volans: tests: temporarily limit max version of prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/517384 [09:07:46] (03PS6) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [09:11:28] (03CR) 10Mathew.onipe: [C: 03+1] tests: temporarily limit max version of prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/517384 (owner: 10Volans) [09:11:35] volans: ^ [09:12:06] (03CR) 10Volans: [C: 03+2] tests: temporarily limit max version of prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/517384 (owner: 10Volans) [09:13:18] <_joe_> !log setting cpufreq governor to "ondemand" on mw1348, T225713 [09:13:22] (03PS1) 10Volans: tests: temporarily limit max version of prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/517386 [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [09:15:40] <_joe_> !log The governor was set to "powersave", not "ondemand" [09:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:46] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10ema) p:05Triage→03Normal [09:15:52] (03Merged) 10jenkins-bot: tests: temporarily limit max version of prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/517384 (owner: 10Volans) [09:16:02] (03CR) 10Mathew.onipe: [C: 03+1] tests: temporarily limit max version of prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/517386 (owner: 10Volans) [09:17:03] 10Operations, 10media-storage: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Joe) Moving from "powersave" to "performance" slightly reduced the CPU load on one api application 
server (a 10-20% reduction in cpu usage) at the cost of significantly higher temperatures. [09:17:04] (03CR) 10Volans: [C: 03+2] tests: temporarily limit max version of prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/517386 (owner: 10Volans) [09:17:13] (03CR) 10jenkins-bot: tests: temporarily limit max version of prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/517384 (owner: 10Volans) [09:17:18] !log rebooting sulfur for some tests [09:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:27] 10Operations, 10media-storage: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Joe) Also please note that newer Intel CPUs don't have the `ondemand` governor on newer kernels, so to all effects the `powersave` governor is what `ondemand` used to be. [09:17:58] (03Merged) 10jenkins-bot: tests: temporarily limit max version of prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/517386 (owner: 10Volans) [09:25:41] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman [09:27:07] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman [09:30:11] (03PS2) 10Volans: Cassandra nodetool repair cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [09:30:15] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10ema) @JAllemandou thanks for the analysis! A few initial points that might help investigation with regards to ATS: * So far we've upgraded upload... 
[09:30:17] (03PS8) 10Volans: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [09:30:25] (03PS8) 10Volans: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe) [09:31:04] !log set cpu governor to performance (was powersave) on analytics1070 (hadoop worker node) [09:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:36:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:20] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:36:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:06] !log rebooting mw2184, mw1265 for some tests [09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:35] 10Operations, 10media-storage: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10elukey) Just executed `echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` (hope it is the right way) on analytics1070 (hadoop worker). Since the workload varies a lot on these nod... [09:42:52] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) So, 2 of these should go to replace dbproxy1010 and dbproxy1011, right? If so, we can rack 2 them on the same racks as those (C5) and put them on that same VLAN to do a... 
[09:43:58] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) I don't know if 2 or 3, depending on the needs of the others. There was discussion with cloud if to also put a proxy in front of toolsdb. [09:45:46] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) a:05Cmjohnson→03Marostegui Assigning this to myself to let Chris know that this is still blocked on DBAs to decide. So for now 2 of them will go to replace 1010 and... [09:47:06] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) More like: we need 1 for m5, something else? [09:47:12] (03CR) 10Gilles: [C: 03+1] varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) (owner: 10Ema) [09:49:13] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Marostegui) m5 at the moment doesn't use the proxies (I know it should but they are not being used at the moment) (T202367#5252689) [09:49:44] (03PS1) 10Elukey: Deprecate the use of profile::hadoop::users [puppet] - 10https://gerrit.wikimedia.org/r/517389 (https://phabricator.wikimedia.org/T225464) [09:53:53] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16962/" [puppet] - 10https://gerrit.wikimedia.org/r/517389 (https://phabricator.wikimedia.org/T225464) (owner: 10Elukey) [09:57:56] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe) [10:00:48] (03PS7) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [10:01:58] 10Operations, 10DNS, 10Traffic: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (10ema) p:05Triage→03Normal [10:04:18] 10Operations, 10serviceops, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Joe) Hi! A very quick skim of the upstream project suggests me there is no storage need, is this correct? Besides this -... 
[10:06:29] (03PS1) 10Jbond: python CI: add test binary file to check CI [puppet] - 10https://gerrit.wikimedia.org/r/517390 [10:06:53] (03CR) 10jerkins-bot: [V: 04-1] python CI: add test binary file to check CI [puppet] - 10https://gerrit.wikimedia.org/r/517390 (owner: 10Jbond) [10:10:56] (03PS1) 10Pmiazga: Enable AMC mode for Persian, Japanese, Thai and Italian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517391 (https://phabricator.wikimedia.org/T225123) [10:12:43] 10Operations, 10Performance-Team, 10serviceops: Test usage of igbinary with apcu with MediaWiki - https://phabricator.wikimedia.org/T225074 (10Joe) [10:15:36] (03CR) 10Elukey: "No op as expected https://puppet-compiler.wmflabs.org/compiler1002/16964/" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [10:18:52] (03PS2) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [10:19:18] (03CR) 10jerkins-bot: [V: 04-1] python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) (owner: 10Jbond) [10:20:31] (03PS3) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [10:20:55] (03CR) 10jerkins-bot: [V: 04-1] python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) (owner: 10Jbond) [10:22:23] (03CR) 10Muehlenhoff: [C: 03+1] Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [10:24:16] (03CR) 10Elukey: "Multiple profiles, under the same role, including profile::kerberos::keytabs to create different users/keytabs might not work as expected," [puppet] - 10https://gerrit.wikimedia.org/r/515010 
(https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [10:24:37] (03PS4) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [10:25:03] (03CR) 10jerkins-bot: [V: 04-1] python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) (owner: 10Jbond) [10:25:27] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [10:26:06] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) Thinking more, as toolsdb was canibalized by openstack, maybe its potential proxies should too. I guess 2/2 is the safe option right now. Sorry, but I didn't think too much... [10:27:06] (03PS5) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [10:27:47] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) >>! In T225704#5262147, @jcrespo wrote: > Thinking more, as toolsdb was canibalized by openstack, maybe its potential proxies should too. I guess 2/2 is the safe option... [10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T1030). [10:31:46] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) > 2/2 meaning 2 for cloud (to replace 1010 and 1011) and 2 for other usages (misc, core..)? Yes. [10:32:23] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) Thanks! 
I will update the task accordingly to reflect this discussion on top so it is easier for Chris [10:35:43] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517404 (https://phabricator.wikimedia.org/T128546) [10:36:25] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) [10:37:16] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) a:05Marostegui→03Cmjohnson @Cmjohnson I have updated the task with the racking proposal at the beginning. Thanks! [10:37:29] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10jcrespo) No name change? I do not mind, just want to make sure it is a conscious decision. [10:38:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre.switchdc.mediawiki: Use localized read-only message [cookbooks] - 10https://gerrit.wikimedia.org/r/460730 (owner: 10Legoktm) [10:38:46] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) >>! In T225704#5262204, @jcrespo wrote: > No name change? I do not mind, just want to make sure it is a conscious decision. I would prefer not to change them for now as... [10:40:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We need to do this to verify it's the source of the socket errors we're seeing coming from php-fpm." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli) [10:41:25] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517404 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:42:28] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517404 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:49:33] (03PS6) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [10:51:04] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:517404| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:10] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:51:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) (owner: 10Jbond) [10:51:52] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:517404| Bumping portals to master (T128546)]] (duration: 00m 47s) [10:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:32] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517404 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T1100). 
[11:00:04] awight, Urbanecm, kart_, Amir1, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:14] here [11:00:19] (03CR) 10Jbond: [C: 03+2] python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) (owner: 10Jbond) [11:00:28] o/ [11:00:31] awight, around? :) [11:00:39] (03PS7) 10Jbond: python CI: skips tests for binary files [puppet] - 10https://gerrit.wikimedia.org/r/517390 (https://phabricator.wikimedia.org/T225710) [11:00:45] Urbanecm: yes, hi! [11:01:02] Ready to monitor. [11:01:05] Hi awight, looks you're a deployer. Want to deploy your patch, or should I? [11:01:14] Oh sure--happy to do so myself! [11:01:27] Ping me once you're done! [11:01:54] kk [11:02:04] oh. Seems my clock of laptop is lagging behind a minute! [11:02:49] (03PS4) 10Awight: New configuration to pull sitelinks from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) [11:03:10] Urbanecm can you deploy my patch? [11:03:22] kart_, yes, technically :) [11:03:28] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) (owner: 10Awight) [11:03:43] Urbanecm: cool. 
Thanks :) [11:03:43] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10jbond) [11:03:45] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Error while checking binary files for python shebang - https://phabricator.wikimedia.org/T225710 (10jbond) 05Open→03Resolved a:03jbond I think that with my latest change and the one from Antonie this should be fixed, please reopen if... [11:04:23] (03Merged) 10jenkins-bot: New configuration to pull sitelinks from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) (owner: 10Awight) [11:04:41] kart_, since it's the only non-config patch, I'll CR+2 it now, to give time for CI [11:04:57] Urbanecm: sure. [11:06:17] (my patch is on mwdebug1002 now) [11:08:11] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/460730 (owner: 10Legoktm) [11:08:34] awight, let me know if you need any help, feel free to deploy it once you test it [11:08:40] (03CR) 10jenkins-bot: New configuration to pull sitelinks from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) (owner: 10Awight) [11:11:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] redirects.dat: Get rid of rules not working due to DNS misconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/515080 (owner: 10Vgutierrez) [11:11:46] Urbanecm: I'm a bit rusty, but deploying now. 
[11:11:52] awight, ack [11:11:59] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: wmf-config/CommonSettings-labs.php SWAT: [[gerrit:514715|FileImporter configuration to fetch sitelinks from Wikidata (T225609 T224007)]] (duration: 00m 47s) [11:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:06] T224007: Show Now Commons info for files that were moved - https://phabricator.wikimedia.org/T224007 [11:12:06] T225609: Supplement the FileImporter environment on beta to allow full testing - https://phabricator.wikimedia.org/T225609 [11:12:21] Urbanecm: lmk if I can help with any of the other patches, it's a long list! [11:12:26] Otherwise--it's all yours now. [11:12:57] awight, if you want to train yourself a little bit more, feel free to start with my patches ;) [11:13:04] hehe sure thing! [11:13:10] ok! [11:13:29] I'll (test first and then) deploy them in a batch, if that works? [11:13:48] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:13:51] up to you! 
[11:14:13] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517372 (https://phabricator.wikimedia.org/T224349) (owner: 10Urbanecm) [11:14:42] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:15:58] (03Merged) 10jenkins-bot: Set nds_nlwiki's sitename and metanamespace back to defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517372 (https://phabricator.wikimedia.org/T224349) (owner: 10Urbanecm) [11:16:31] (03CR) 10jenkins-bot: Set nds_nlwiki's sitename and metanamespace back to defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517372 (https://phabricator.wikimedia.org/T224349) (owner: 10Urbanecm) [11:16:59] Hmm, reading the deployment docs I think this might not be the norm. But I've merged already, so going ahead with the bad idea. [11:17:20] (03CR) 10Awight: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:17:54] (03CR) 10Awight: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:18:09] awight, usually, patches are deployed one by one, but I saw deployers deploying all at once as well, so it's definitely possible [11:18:12] Urbanecm: merge conflict on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/512220/ , maybe you can look while I deploy the first? [11:18:20] sure! 
[11:18:39] (03PS6) 10Urbanecm: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:19:10] rebased at current master, seems it works [11:19:13] ping awight [11:19:41] Urbanecm: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/517372/ is ready to test on mwdebug1002 [11:19:47] testing [11:20:20] (03CR) 10Awight: [C: 03+2] Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:20:35] awight, seems it works, feel free to deploy 517372 [11:20:49] (please, run namespaceDupes.php after deploying, to be sure there are no conflicts) [11:20:54] or I can do that if you want, up2you [11:21:16] (03Merged) 10jenkins-bot: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:21:33] (03CR) 10jenkins-bot: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [11:21:38] deploying [11:21:54] Sure, I'll run namespaceDupes [11:22:19] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:517372|Set nds_nlwiki's sitename and metanamespace back to defaults (T224349)]] (duration: 00m 47s) [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] T224349: Change site name and the project namespace on nds-nl.wikipedia.org - https://phabricator.wikimedia.org/T224349 [11:22:27] ok awight [11:23:32] !log ran mwscript namespaceDupes.php nds_nlwiki, no dupes found [11:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:50] Urbanecm: srwiki VE change is ready 
to test [11:24:58] ok, testing awight [11:25:30] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:26:44] awight, srwiki VE change works, feel free to deploy [11:27:19] Deploying. [11:27:43] (03PS2) 10Awight: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:28:02] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:512220|Enable VisualEditor in draft namespace on sr.wiki (T223024)]] (duration: 00m 47s) [11:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] T223024: Enable VisualEditor in draft namespace (Нацрт) on sr.wiki - https://phabricator.wikimedia.org/T223024 [11:28:25] I see now that the "merge conflict" is just an extra-cautious configuration on the config repo, not necessarily a real problem. 
[11:28:46] yup [11:29:11] (03CR) 10Awight: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:29:17] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:30:19] (03Merged) 10jenkins-bot: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:30:36] (03CR) 10jenkins-bot: Add "autoreview" protection level on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517371 (https://phabricator.wikimedia.org/T225896) (owner: 10Urbanecm) [11:30:55] Urbanecm: ^ ready to test [11:31:03] awight, testing [11:31:20] Urbanecm: oops--not pushed yet [11:31:44] awight, yes, just was going to say it doesn't work :) [11:31:45] Urbanecm: Okay now ready [11:32:04] Good thing I'm getting all this practice ;-) [11:32:08] yup, ready to deploy :) [11:32:25] :) [11:33:49] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:517371|Add autoreview protection level on ar.wikipedia (T225896)]] (duration: 00m 47s) [11:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:54] T225896: Add "autoreview" protection level on ar.wikipedia - https://phabricator.wikimedia.org/T225896 [11:34:43] Urbanecm: shall I keep going? [11:35:01] kart_ asked for his patch to be deployed, feel free to do it [11:35:11] kk [11:35:11] not sure if Amir1 or raynor mind you deploying their patches :) [11:35:23] (ftr, I've already CR+2 kart_'s patch before, to save some time) [11:35:29] should be merged now [11:35:35] I can deploy mine [11:36:08] I can deploy mine, but I'd love if someone can deploy it for me please. 
I'm on LTE now (mobile tethering) [11:36:16] as my internet went down ;/ [11:36:40] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10JAllemandou) Hi @ema We can easily get data for older days if needed (we don't drop statistic-data). Here are the hosts with issues for June 6t... [11:36:44] Amir1, awight wants to get some deploy-experience, if I understood it correctly, so that's why the question :) [11:37:43] I can deploy if needed but if awight wants to do it, it's fine for me :) [11:37:58] cool! [11:38:03] It's not hard to test [11:39:28] I'm down for anything :) [11:39:40] hehe you all are so adventurous [11:39:54] kart_: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/514734/ is on mwdebug1002, lmk what you think [11:40:17] Sure. Testing.. [11:40:39] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) (owner: 10Michael Große) [11:41:32] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:41:43] (03CR) 10Awight: Enable feature flag for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:41:54] (03PS4) 10Awight: Set EntityUsageTable addUsage batch size to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) (owner: 10Michael Große) [11:42:09] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) (owner: 10Michael Große) [11:43:35] (03Merged) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 
(https://phabricator.wikimedia.org/T225500) (owner: 10Michael Große) [11:43:52] (03CR) 10jenkins-bot: Set EntityUsageTable addUsage batch size to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) (owner: 10Michael Große) [11:45:01] awight: go ahead! [11:45:09] great :) [11:46:02] !log awight@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/ContentTranslation: SWAT: [[gerrit:514734|Fix undefined index notices (T225198)]] (duration: 00m 49s) [11:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:08] T225198: PHP error from SpecialContentTranslation.php: "Undefined index: 0" and "Undefined index: 1" - https://phabricator.wikimedia.org/T225198 [11:46:47] Thanks awight ! [11:46:56] Amir1: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/516478/ is ready to test [11:47:19] kart_: my pleasure [11:47:42] awight: this one is not testable. [11:48:10] Amir1: okay, deploying [11:48:57] (03PS2) 10Awight: Enable feature flag for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:49:11] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:49:20] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|516478 Set EntityUsageTable addUsage batch size to 200 (T225500)]] (duration: 00m 47s) [11:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:25] T225500: Decrease EntityUsageTable addUsage batch size to 100 - https://phabricator.wikimedia.org/T225500 [11:50:11] (03Merged) 10jenkins-bot: Enable feature flag for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:50:29] 
(03CR) 10jenkins-bot: Enable feature flag for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) (owner: 10Michael Große) [11:51:06] Amir1: ^ ready to test [11:51:50] awight: Tested, it works, please check if there are any errors in logstash [11:53:36] awight, can you squeeze one more, please? [11:54:11] Amir1: no errors. [11:54:14] raynor: Sure [11:54:22] awight: let's go then \o/ [11:55:00] (03PS2) 10Awight: Enable AMC mode for Persian, Japanese, Thai and Italian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517391 (https://phabricator.wikimedia.org/T225123) (owner: 10Pmiazga) [11:55:08] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517391 (https://phabricator.wikimedia.org/T225123) (owner: 10Pmiazga) [11:55:19] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|516608 Enable feature flag for breaking Wikibase API change (T223303)]] (duration: 00m 47s) [11:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:24] T223303: on production enable bugfix for wbeditentity setting aliases to empty array - https://phabricator.wikimedia.org/T223303 [11:56:04] raynor: Will you paste to the Deployment calendar? 
[11:56:06] (03Merged) 10jenkins-bot: Enable AMC mode for Persian, Japanese, Thai and Italian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517391 (https://phabricator.wikimedia.org/T225123) (owner: 10Pmiazga) [11:56:10] it's there [11:56:15] awight, ^ [11:56:16] ah yep ok [11:56:21] (03CR) 10jenkins-bot: Enable AMC mode for Persian, Japanese, Thai and Italian wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517391 (https://phabricator.wikimedia.org/T225123) (owner: 10Pmiazga) [11:57:11] raynor: Please test now [11:57:52] on mwdebug1002 [11:58:01] on it [11:58:33] Nothing else is on the calendar for 5 hours, btw, so no rush [12:00:14] Heads-up, SWAT is going a few minutes beyond our window. [12:01:31] looks ok, awight I just want to double check one thing [12:01:38] cool [12:01:54] the feature is enabled on all wikis but not Italian [12:01:56] o_O [12:02:29] !log EU SWAT is going a few minutes beyond its window [12:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:34] * awight squints [12:02:39] Urbanecm: ty [12:02:43] yw [12:03:16] Amir1: And thanks for the great how-to session for deployers! [12:03:49] yw, I did the basics ;) [12:04:34] It was lots of stuff I did not know after 6 years of blindly stumbling through various deploys, really helpful to see how a pro works :-) [12:04:42] awight, pleace, proceed to prod, Looks good [12:04:46] please* [12:04:48] will do [12:05:26] raynor: Did you figure out what was happening with itwiki? 
[12:05:49] the only thing that comes to my mind is some cache [12:05:52] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|517391 Enable AMC mode for Persian, Japanese, Thai and Italian wikis (T225123)]] (duration: 00m 47s) [12:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:56] T225123: Deploy AMC to fawiki, jawiki, thwiki, itwiki - https://phabricator.wikimedia.org/T225123 [12:06:01] I added it to 4 wikis, double checked the keys [12:06:03] !log EU SWAT complete [12:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:18] Okay, hrm glad it cleared up! [12:06:19] on 3 it works, on Italian it still behaves as if the config flag is off [12:06:29] ah [12:06:34] good luck ;-) [12:06:37] I'm feeling ok, we're enabling it for more wikis [12:06:48] awight, thanks for deploying all the stuff! [12:06:55] (previously it was enabled for 3 wikis [12:07:00] (almost) any time [12:07:20] and it worked there, we're just enabling it for more at this time. I'll find out why Italian didn't get the change [12:07:25] awight, thanks for deploying my patch [12:08:46] raynor: :) [12:08:55] awight -> ok, when it's on prod it works on itwiki [12:08:59] it's definitely some cache ;/ [12:09:05] great [12:09:26] awight, btw, regarding the message from 11:12 in SAL, it seems wmf-config/CommonSettings-labs.php was understood as part of the message, not a file to sync. Since it isn't loaded for prod, it probably doesn't matter much, but IIRC you can't sync multiple files at once with scap sync-file [12:09:48] oho! [12:09:59] That explains why I couldn't get the new behavior on beta. dang [12:10:15] Okay I should deploy that file to leave the tree clean. [12:10:15] awight, beta is deployed automatically after CR+2 [12:10:30] and it runs over different machines, too [12:11:14] Urbanecm: Do you think I should deploy that -labs file? Seems like it's needed to keep the books clean? 
[12:11:29] awight, not sure [12:11:42] Hmm I'll do it out of superstition, I guess. [12:12:52] awight, probably won't cause anything at least :). [12:13:11] !log awight@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:514715|FileImporter configuration to fetch sitelinks from Wikidata (T225609 T224007)]] - finishing partial deployment (duration: 00m 47s) [12:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:17] T224007: Show Now Commons info for files that were moved - https://phabricator.wikimedia.org/T224007 [12:13:18] T225609: Supplement the FileImporter environment on beta to allow full testing - https://phabricator.wikimedia.org/T225609 [12:13:27] anyway, just verified if your patch really got to beta, it should be there [12:13:52] (03PS9) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [12:13:53] if you don't see the new behav at beta, it's either cache, or a problem :) [12:14:29] Unfortunately, I made that patch very "robust" but without logging, so it's hard to tell why the new behavior wouldn't appear. [12:15:20] :( [12:15:23] can't help much [12:15:36] is it working at prod at least? :-D [12:19:24] (03CR) 10Jbond: icinga: Add a script to parse and query the status.dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [12:25:12] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Andrew) We had a session about this during the SRE summit. The conclusions were: - Use HA Proxy instead of trying to get into the LVS poo... [12:25:58] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Andrew) The remaining task here is to make/update a wiki page about this. 
[12:35:51] !log add mtail_3.0.0~rc24.1-1+wmf1_amd64.deb to jessie-wikimedia backports [12:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:02] !log upgrade mtail on lithium - T225604 [12:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:08] T225604: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604 [12:42:45] (03CR) 10Andrew Bogott: "This seems perfectly reasonable, although it breaks our coding standard against default profile arguments. Would it work to leave the "if" [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) (owner: 10Bstorm) [12:43:45] (03PS6) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [12:45:28] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10ema) Mmmh interesting. Certainly, the issue is not ATS-specific: eqsin is still running Varnish, and requests routed through eqsin do not involve... [13:02:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/516724 (owner: 10Faidon Liambotis) [13:06:19] (03PS1) 10Ema: cp4027: upgrade Varnish to 5.1.3-1wm10 [puppet] - 10https://gerrit.wikimedia.org/r/517415 (https://phabricator.wikimedia.org/T224694) [13:06:21] (03CR) 10Ottomata: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/517389 (https://phabricator.wikimedia.org/T225464) (owner: 10Elukey) [13:07:23] (03PS3) 10Giuseppe Lavagetto: Remove kafka1018 from ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli) [13:08:06] (03CR) 10Ottomata: [C: 03+1] Remove kafka1018 from ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli) [13:08:09] (03CR) 10Ema: "pcc output looks correct: https://puppet-compiler.wmflabs.org/compiler1002/16965/" [puppet] - 10https://gerrit.wikimedia.org/r/517415 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [13:08:27] <_joe_> thanks ottomata [13:08:31] <_joe_> I'm merging it now [13:09:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/516726 (owner: 10Faidon Liambotis) [13:11:50] (03CR) 10Elukey: "Andrew, I think that the cleanest solution would be to have a specific require user(s) for each keytab, so the keytabs would be surely cre" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:13:40] (03CR) 10Ottomata: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:19:39] (03CR) 10Ema: [C: 03+2] cp4027: upgrade Varnish to 5.1.3-1wm10 [puppet] - 10https://gerrit.wikimedia.org/r/517415 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [13:25:11] !log cp4027: upgrade Varnish packages to 5.1.3-1wm10 T224694 [13:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:16] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [13:27:31] (03CR) 10Jbond: [C: 03+1] "Nice addition, looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [13:27:51] PROBLEM - Check systemd state on cp4027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:30:05] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 4 minutes ago with 13 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Exec[ip addr add 208.80.154.85/32 dev eth0],Exec[ip addr add 2620:0:861:3:208:80:154:85/128 preferred_lft 0 dev eth0],Service[gerrit] [13:32:42] (03PS1) 10Elukey: profile::hive::*: add jvm heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/517420 (https://phabricator.wikimedia.org/T222895) [13:34:22] the cp4027 alert above is caused by my upgrade, please ignore [13:34:42] !log reboot of an-worker* (Hadoop worker nodes) for kernel + openjdk upgrades [13:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:12] (03CR) 10Elukey: [C: 03+2] profile::hive::*: add jvm heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/517420 (https://phabricator.wikimedia.org/T222895) (owner: 10Elukey) [13:41:10] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Get rid of rules not working due to DNS misconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/515080 (owner: 10Vgutierrez) [13:41:51] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Get rid of redundant wikiipedia.org entries [puppet] - 10https://gerrit.wikimedia.org/r/515020 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:42:02] (03PS2) 10Vgutierrez: redirects.dat: Get rid of redundant wikiipedia.org entries [puppet] - 10https://gerrit.wikimedia.org/r/515020 (https://phabricator.wikimedia.org/T224539) [13:45:07] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Remove redirections for invalid DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/515022 (https://phabricator.wikimedia.org/T224539) (owner: 
10Vgutierrez) [13:45:22] !log reboot cp4027 for dist and Varnish upgrade T224694 [13:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:27] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [13:45:30] (03PS2) 10Vgutierrez: redirects.dat: Remove redirections for invalid DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/515022 (https://phabricator.wikimedia.org/T224539) [13:45:36] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:35] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Remove redundant wikipedia.com rules [puppet] - 10https://gerrit.wikimedia.org/r/515024 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:48:43] (03PS2) 10Vgutierrez: redirects.dat: Remove redundant wikipedia.com rules [puppet] - 10https://gerrit.wikimedia.org/r/515024 (https://phabricator.wikimedia.org/T224539) [13:49:24] !log installing libav security updates [13:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:03] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Remove redundant wikipedia.net redirections [puppet] - 10https://gerrit.wikimedia.org/r/515025 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:51:12] (03PS2) 10Vgutierrez: redirects.dat: Remove redundant wikipedia.net redirections [puppet] - 10https://gerrit.wikimedia.org/r/515025 (https://phabricator.wikimedia.org/T224539) [13:52:50] (03PS2) 10Ema: cache: stop passing gethdr_extrachance to varnish [puppet] - 10https://gerrit.wikimedia.org/r/517359 (https://phabricator.wikimedia.org/T224694) [13:53:07] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:29] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Remove redundant rules 
for wiktionary.com [puppet] - 10https://gerrit.wikimedia.org/r/515026 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:53:38] (03PS2) 10Vgutierrez: redirects.dat: Remove redundant rules for wiktionary.com [puppet] - 10https://gerrit.wikimedia.org/r/515026 (https://phabricator.wikimedia.org/T224539) [13:56:34] (03PS3) 10Vgutierrez: redirects.dat: Get rid of rules not working due to DNS misconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/515080 [14:00:45] cobalt load average 30+ when i logged in [14:00:47] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1529 bytes in 0.012 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [14:00:49] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:01:47] and java just got oomkilled on cobalt [14:02:47] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 16 failures. Last run 2 minutes ago with 16 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid],Service[prometheus-node-exporter],Package[gerrit/gerrit],Exec[chown /srv/deployment/gerrit for gerrit2] [14:03:13] !log cdanis@cobalt.wikimedia.org ~ % sudo systemctl start gerrit.service [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_jenkins CI Composer] [14:03:45] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [14:03:45] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [14:03:53] the puppet failures on cobalt were also OOM [14:03:55] Jun 17 13:56:34 cobalt puppet-agent[13728]: (/Stage[main]/Gerrit::Jetty/Scap::Target[gerrit/gerrit]/Package[gerrit/gerrit]) Could not evaluate: Cannot allocate memory - fork(2) [14:04:06] hmm [14:04:07] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [14:04:13] :-\ [14:04:24] until now it only has been the JVM being oom iirc [14:04:25] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [14:04:40] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['maps1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906171404_gehel_1... [14:05:54] we really need a bigger machine [14:06:09] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:06:16] ^ expected if gerrit was down [14:06:47] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Exec[git_pull_research/landing-page],Exec[git_pull_design/landing-page],Exec[git_pull_design/style-guide],Exec[git_pull_wikimedia/campaigns/eswiki-2018] [14:06:51] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:07:31] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:07:49] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [14:08:03] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.081 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [14:08:03] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:08:07] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27472 bytes in 0.773 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [14:08:09] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [14:08:17] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:08:27] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.13-13-gd782b2dd6b (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [14:08:28] those should start clearing soon [14:08:37] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [14:08:41] Do we know what caused the OOM? [14:08:55] it has been OOMing every two weeks or so [14:09:39] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [14:09:51] cdanis i wonder if someone is hitting an endpoint in gerrit that triggers it to use more memory than it's allocated? [14:09:54] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=4&fullscreen&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&from=now-90d&to=now [14:10:33] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [14:11:41] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [14:11:45] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [14:11:47] paladox: possibly, but looking at the memory history for the machine, usage slowly creeps up and up over time as well [14:12:31] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [14:12:37] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page] [14:13:07] Do we know if anyone called "/projects/" endpoint at the time of the problem? [14:13:25] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] [14:13:39] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:15:08] !log installing poppler security updates on jessie [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:59] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:30] paladox: people call that endpoint all the time. there were several calls within a few minutes of the crash, however they were all for project details (e.g. 
https://gerrit.wikimedia.org/r/projects/mediawiki/core/config) except for one request to /r/projects/ which very quickly returned 503 (along with all the requests nearby it in time, as the server had already crashed) [14:18:42] it looks like near the time of the crash there were lots of long-running (and likely in-flight at time-of-crash) requests to git-upload-pack, which is probably unsurprising [14:18:49] cdanis i meant https://gerrit.wikimedia.org/r/projects/ :) (anything after /projects/ should be fine) Apparently doing curl -X GET https://gerrit.wikimedia.org/r/projects/ will store all projects in the memory [14:19:51] there's a bug about that https://bugs.chromium.org/p/gerrit/issues/detail?id=10326 (which has kind of been fixed for 2.15, but still can trigger it to use a lot of memory, in 2.16 there's a new option to stop it doing that) [14:20:03] https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#gerrit.listProjectsFromIndex [14:21:17] there was no request for /r/projects/ with no suffix since at least 10 hours ago [14:21:50] but likely we should enable that option anyway [14:22:09] ok [14:23:04] (03PS4) 10Andrew Bogott: Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - 10https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [14:24:24] (03CR) 10Andrew Bogott: [C: 03+2] Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - 10https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [14:24:38] please file a task or make a patch :) [14:25:33] (03PS3) 10Andrew Bogott: Puppet certs: Move old client certs away when Puppet CA changes [puppet] - 10https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [14:25:50] cdanis i'm already planning on enabling that config :) (though that will only take effect when we upgrade to 2.16) [14:26:43] !log otto@deploy1001 scap-helm eventgate-main 
upgrade main -f /srv/scap-helm/eventgate/main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [14:26:44] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [14:26:44] !log otto@deploy1001 scap-helm eventgate-main finished [14:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:32] (03CR) 10Andrew Bogott: [V: 03+2] "Thanks for doing this, and thanks for your patience :)" [puppet] - 10https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [14:27:45] (03CR) 10Andrew Bogott: [C: 03+2] Puppet certs: Move old client certs away when Puppet CA changes [puppet] - 10https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [14:29:12] (03CR) 10Gehel: "Minor comments inline. I'd like to have Elukey's opinion on this!" 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [14:31:43] (03CR) 10Andrew Bogott: [C: 03+2] "I haven't yet read all the proposed changes that use this, but having it present is definitely a step forward" [puppet] - 10https://gerrit.wikimedia.org/r/513752 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [14:32:00] (03PS3) 10Andrew Bogott: openstack mwopenstackclients: Add designateclient [puppet] - 10https://gerrit.wikimedia.org/r/513752 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [14:32:27] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: codfw] [14:32:29] !log otto@deploy1001 scap-helm eventgate-main cluster codfw completed [14:32:29] !log otto@deploy1001 scap-helm eventgate-main finished [14:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:55] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:32:59] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:33:13] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:33:27] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:33:39] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:33:45] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:33:55] 
RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:34:15] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:34:49] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:35:23] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:35:39] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:36:25] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:25] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:38:13] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:38:15] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:38:53] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:41:57] (03PS1) 10Thcipriani: Revert "gerrit: bump heap limit" [puppet] - 10https://gerrit.wikimedia.org/r/517433 [14:43:05] (03CR) 10Paladox: [C: 03+1] Revert "gerrit: bump heap limit" [puppet] - 10https://gerrit.wikimedia.org/r/517433 (owner: 10Thcipriani) [14:43:20] (03PS1) 10CDanis: gerrit: have systemd autorestart on any failure [puppet] - 10https://gerrit.wikimedia.org/r/517434 [14:44:13] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1001.eqiad.wmnet'] ` and were **ALL** successful. [14:44:42] (03CR) 10Andrew Bogott: "This looks right to me -- you've tested it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/513910 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [14:45:53] !log stop eventlogging on eventlog1002 and reboot for kernel upgrades [14:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures [14:50:45] (03PS2) 10Andrew Bogott: openstack mwopenstackclients: Use designateclient in ensure functions [puppet] - 10https://gerrit.wikimedia.org/r/513909 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [14:51:17] elukey: it doesn't have to be now, but it would be nice to do an eventlogging stop at some point to upgrade the mariadb dbs, too [14:53:10] (03CR) 10Andrew Bogott: [C: 03+2] openstack mwopenstackclients: Use designateclient in ensure functions [puppet] - 10https://gerrit.wikimedia.org/r/513909 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [14:53:11] jynus: o/ - we can do it anytime, since I just need to stop the component of EL that pushes to db1107 [14:53:45] I prefer if you send an alert some time in advance for users [14:54:27] sure [14:54:54] do you have a preference for the maintenance? [14:54:59] could be even tomorrow morning [14:55:02] whenever you prefer, but preferably next week [14:55:10] we are busy/vacations this week [14:55:41] sure, let's do it on Monday next week? Would it be ok? 
[14:55:55] ok for me, early on monday [14:57:17] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) (owner: 10Bstorm) [14:58:36] (03PS2) 10CDanis: gerrit: have systemd autorestart on any failure [puppet] - 10https://gerrit.wikimedia.org/r/517434 [14:58:45] (03CR) 10CDanis: [C: 03+2] gerrit: have systemd autorestart on any failure [puppet] - 10https://gerrit.wikimedia.org/r/517434 (owner: 10CDanis) [15:02:20] (03CR) 10Andrew Bogott: "I keep looking at this but not responding, sorry. Two thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845) (owner: 10Alex Monk) [15:03:20] (03PS1) 10Milimetric: Update timer descriptions [puppet] - 10https://gerrit.wikimedia.org/r/517439 [15:04:40] (03PS2) 10Bstorm: toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) [15:06:02] (03PS1) 10MarcoAurelio: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517440 [15:09:32] (03PS3) 10Andrew Bogott: toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) (owner: 10Bstorm) [15:10:03] (03PS2) 10CDanis: Revert "gerrit: bump heap limit" [puppet] - 10https://gerrit.wikimedia.org/r/517433 (owner: 10Thcipriani) [15:10:12] (03CR) 10CDanis: [C: 03+2] Revert "gerrit: bump heap limit" [puppet] - 10https://gerrit.wikimedia.org/r/517433 (owner: 10Thcipriani) [15:11:02] (03PS8) 10CDanis: Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [15:11:11] (03CR) 10CDanis: [C: 03+2] Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [15:11:32] 10Operations, 
10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) a:05CDanis→03Ottomata [15:11:46] (03CR) 10Andrew Bogott: [C: 03+1] toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) (owner: 10Bstorm) [15:12:37] (03PS4) 10Bstorm: toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) [15:13:03] about to restart gerrit for a config change [15:13:34] (03CR) 10Bstorm: [C: 03+2] toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) (owner: 10Bstorm) [15:14:27] (03PS6) 10Ema: ATS: add hardening features to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/510168 [15:15:10] (03CR) 10Ema: [C: 03+2] ATS: add hardening features to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/510168 (owner: 10Ema) [15:15:57] (03Abandoned) 10MarcoAurelio: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517440 (owner: 10MarcoAurelio) [15:16:01] !log gerrit restart to pick up new config changes. [15:16:03] (03CR) 10Elukey: "Looks good! the only problem is the cdh module update, that shouldn't be part of this code change :)" [puppet] - 10https://gerrit.wikimedia.org/r/517439 (owner: 10Milimetric) [15:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:38] Gerrit is down. We're working on bringing it back as soon as possible. [15:16:39] Please follow along the discussion at #wikimedia-operations on freenode as we debug. [15:16:39] Please try again later! 
[15:16:52] hauskatze: restarting for config changes [15:17:01] thcipriani: ack - thanks :) [15:17:51] !log gerrit back [15:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:59] (03PS1) 10Jhedden: icinga: Fix jhedden username [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) [15:20:02] (03PS2) 10Jhedden: icinga: Fix jhedden username [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) [15:21:44] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [15:25:07] (03CR) 10Bstorm: "This looks incorrect to me. Your wikitech username is https://wikitech.wikimedia.org/wiki/User:Jhedden" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:25:37] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] WMFBackup: Increase xtrabackup memory use to 20GB [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515063 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:25:57] (03CR) 10Jcrespo: [C: 03+1] "I answered you in person." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:27:25] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:29:05] (03PS7) 10Jcrespo: mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) [15:29:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:30:06] (03CR) 10Muehlenhoff: "Hmmh, it should match the cn: attribute of your LDAP user, so "Jhedden" seems fine to me, maybe we're missing a restart of Icinga or so?" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:30:14] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [15:30:15] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [15:30:15] !log otto@deploy1001 scap-helm eventgate-main finished [15:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] (03CR) 10Andrew Bogott: "My experience with icinga is that if I log in as 'Andrew Bogott' then it gives me access but I can't actually do anything (like downtime a" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:32:38] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f 
/srv/scap-helm/eventgate/main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [15:32:38] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [15:32:39] !log otto@deploy1001 scap-helm eventgate-main finished [15:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:04] (03CR) 10Andrew Bogott: "If I had to venture a theory, I would say that the login process does a case-insensitive check against ldap to approve logins. But that s" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:34:06] (03CR) 10Jhedden: "Ah, I see what's happening here now. You can login with any combination of lower and upper case letters in your username (e.g. jheDDEN, jH" [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:34:21] (03Abandoned) 10Jhedden: icinga: Fix jhedden username [puppet] - 10https://gerrit.wikimedia.org/r/517446 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:37:11] (03PS7) 10Jcrespo: mariadb-snapshots: Use full paths for postprocessing new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203) [15:42:21] !log cp4026: ats-backend-restart to apply systemd unit hardening changes [15:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:25] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Allow the passing of a full path to section on prepare [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:43:24] (03CR) 10Jcrespo: [C: 03+2] mariadb-snapshots: Use full paths for postprocessing new snapshots 
[puppet] - 10https://gerrit.wikimedia.org/r/515072 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:46:34] 10Operations, 10Acme-chief, 10Traffic: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945 (10Vgutierrez) [15:46:37] 10Operations, 10Acme-chief, 10Traffic: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945 (10Vgutierrez) p:05Triage→03High [15:46:50] 10Operations, 10Acme-chief, 10Traffic: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945 (10Vgutierrez) [15:47:59] (03PS1) 10Jhedden: Nagios: Add jhedden to shinken contact group [puppet] - 10https://gerrit.wikimedia.org/r/517449 (https://phabricator.wikimedia.org/T224192) [15:48:52] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:21] (03PS2) 10Andrew Bogott: Nagios: Add jhedden to shinken contact group [puppet] - 10https://gerrit.wikimedia.org/r/517449 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:50:38] (03CR) 10Andrew Bogott: [C: 03+2] Nagios: Add jhedden to shinken contact group [puppet] - 10https://gerrit.wikimedia.org/r/517449 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [15:55:18] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [15:55:19] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [15:55:19] !log otto@deploy1001 scap-helm eventgate-main finished [15:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:35] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f 
/srv/scap-helm/eventgate/main/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: codfw] [15:57:37] !log otto@deploy1001 scap-helm eventgate-main cluster codfw completed [15:57:37] !log otto@deploy1001 scap-helm eventgate-main finished [15:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:07] (03PS3) 10Ema: varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) [16:02:36] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/eqiad-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: eqiad] [16:02:37] !log otto@deploy1001 scap-helm eventgate-main cluster eqiad completed [16:02:37] !log otto@deploy1001 scap-helm eventgate-main finished [16:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:07] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) I am going to run a data check, no matter what we do in the end with the instance. [16:11:50] (03PS1) 10Ottomata: Produce mediawiki.user-blocks-change stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517452 (https://phabricator.wikimedia.org/T211248) [16:23:39] (03PS4) 10Ema: varnishlog: implement varnishfetcherr [puppet] - 10https://gerrit.wikimedia.org/r/517375 (https://phabricator.wikimedia.org/T224994) [16:26:46] (03CR) 10Volans: "This needs manual rebase due to conflicts." 
[software/conftool] - 10https://gerrit.wikimedia.org/r/514632 (owner: 10CDanis) [16:27:41] !log starting data check on db2097+db2046, expect increase in read row rate T225378 [16:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:47] T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 [16:28:58] (03PS2) 10Milimetric: Update timer descriptions [puppet] - 10https://gerrit.wikimedia.org/r/517439 [16:29:17] (03CR) 10Milimetric: "oops! thanks, missed a submodule update" [puppet] - 10https://gerrit.wikimedia.org/r/517439 (owner: 10Milimetric) [16:43:24] (03PS3) 10Elukey: Update timer descriptions [puppet] - 10https://gerrit.wikimedia.org/r/517439 (owner: 10Milimetric) [16:44:34] (03CR) 10Elukey: [C: 03+2] Update timer descriptions [puppet] - 10https://gerrit.wikimedia.org/r/517439 (owner: 10Milimetric) [16:52:40] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10bd808) >>! In T225704#5261978, @Marostegui wrote: > @Bstorm @bd808 any comments on T225704#5261972? I think that if we need a proxy in front of ToolsDB we should probably do that w... [16:55:36] (03CR) 10Hashar: [C: 03+1] "Looks fine to me :]" [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes) [17:00:04] gehel and onimisionipe: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T1700). [17:01:24] here here [17:03:15] (03CR) 10Volans: "I've seen it in action at the summit. Just few minor things to fix and a bunch of optional nitpicks inline. Looks good otherwise!" 
(0311 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/515323 (owner: 10CDanis) [17:10:26] (03CR) 10Ottomata: [C: 03+2] Produce mediawiki.user-blocks-change stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517452 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [17:10:45] (03CR) 10jenkins-bot: Produce mediawiki.user-blocks-change stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517452 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [17:10:55] !log mw-config change to produce user-blocks-change event to eventgate-main - T211248 [17:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:00] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [17:13:42] (03CR) 10EBernhardson: LVS for cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:14:43] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce user-blocks-change to eventgate-main - T211248 (duration: 00m 48s) [17:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:09] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@d6ed70b]: New Updater, GUI and Blazegraph build [17:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:26] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time [17:19:58] I'm looking [17:20:52] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 503 
(expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with [17:20:52] : Test article.creation.translation - normal source and target with seed returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:20:56] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and target with seed returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:21:07] (03PS1) 10Ottomata: Revert "Produce mediawiki.user-blocks-change stream to eventgate-main" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517460 [17:21:26] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:22:04] (03CR) 10Ottomata: [C: 03+2] Revert "Produce mediawiki.user-blocks-change stream to eventgate-main" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517460 (owner: 10Ottomata) [17:22:20] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:22:21] (03CR) 10jenkins-bot: Revert "Produce mediawiki.user-blocks-change stream to eventgate-main" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517460 (owner: 10Ottomata) [17:22:22] onimisionipe: maybe my fault [17:22:28] bad pattern in pattern file [17:22:30] will fix [17:22:55] (03PS1) 10Muehlenhoff: Disable TCP selective acknowledgements [puppet] - 10https://gerrit.wikimedia.org/r/517461 [17:23:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:23:18] onimisionipe, SMalyshev - what are the next steps? rollback and fix or something else? [17:23:20] (03PS2) 10Esanders: Turn off mobile-ab test for VE section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) [17:23:20] onimisionipe: try deloying now [17:23:23] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert - Produce user-blocks-change to eventgate-main. 
Depends on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/514560 (duration: 00m 47s) [17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:34] elukey: rollback, check out new master and re-try [17:23:43] (03CR) 10Esanders: [C: 03+1] "Green light from analytics, product & CL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) (owner: 10Esanders) [17:23:46] (03CR) 10jerkins-bot: [V: 04-1] Disable TCP selective acknowledgements [puppet] - 10https://gerrit.wikimedia.org/r/517461 (owner: 10Muehlenhoff) [17:23:50] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:24:32] onimisionipe: all good with the rollback plan? Shout out if you need help [17:24:38] not sure what is the status now [17:24:44] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:24:51] Only canary was affected [17:24:54] onimisionipe: usually journalctl -u wdqs-blazegraph shows what's wrong if it fails to start [17:24:58] I have not pushed to other nodes [17:25:06] (03PS2) 10Muehlenhoff: Disable TCP selective acknowledgements [puppet] - 10https://gerrit.wikimedia.org/r/517461 [17:25:21] onimisionipe: sure but is it depooled now or still broken? 
[17:25:23] onimisionipe: so yes, roll back wdq3, check out new master and re-try [17:25:25] wdqs--updater cannot update blazegraph..even while blazegraph is up [17:25:54] depooled [17:26:00] I'm rolling back now [17:26:03] super [17:26:28] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@d6ed70b]: New Updater, GUI and Blazegraph build (duration: 10m 19s) [17:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:38] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.615 second response time [17:27:10] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [17:29:33] !log pooled wdqs1003 - after rolling back failed deployment. [17:29:37] (03PS1) 10Niharika29: Deploy Partial blocks to English wikisource, wiktionary and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) [17:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:34] Everything OK? [17:30:47] (03PS1) 10Cmjohnson: Adding mgmt dns for dbprox1018-22 [dns] - 10https://gerrit.wikimedia.org/r/517464 [17:31:22] thanks onimisionipe [17:31:34] yw! [17:31:40] do we need an incident report? Not sure about the user impact [17:32:01] Probably not. [17:32:05] yea. not sure about that too. [17:36:00] not trying to do the blame game in here, just to figure out if there is a gap in procedures that could be reviewed/fixed.. Bad deployments happen, but if they cause user impact (not sure in this case) then it is worth to follow up [17:36:04] in my opinion [17:39:36] PROBLEM - Host elastic1029 is DOWN: PING CRITICAL - Packet loss = 100% [17:39:57] hmmm [17:40:16] expired downtime? 
that host was down for a hw check IIRC [17:40:20] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [17:40:36] https://phabricator.wikimedia.org/T214283 [17:43:06] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@dcf3338]: New Updater, GUI and Blazegraph build [17:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:45] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: staging] [17:54:45] !log otto@deploy1001 scap-helm eventgate-main cluster staging completed [17:54:46] !log otto@deploy1001 scap-helm eventgate-main finished [17:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:58] [SWAT's in five minutes' time.] 
[17:55:14] (03PS1) 10Bstorm: cloudstore: move secondary monitoring stuff into profile and fix it [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) [17:56:15] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: codfw] [17:56:16] !log otto@deploy1001 scap-helm eventgate-main cluster codfw completed [17:56:16] !log otto@deploy1001 scap-helm eventgate-main finished [17:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:24] !log otto@deploy1001 scap-helm eventgate-main upgrade main -f /srv/scap-helm/eventgate/main/eqiad-values.yaml --reset-values stable/eventgate [namespace: eventgate-main, clusters: eqiad] [17:56:25] !log otto@deploy1001 scap-helm eventgate-main cluster eqiad completed [17:56:25] !log otto@deploy1001 scap-helm eventgate-main finished [17:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T1800). [18:00:04] framawiki, ottomata, and Niharika: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:07] ottomata: All clear? [18:00:10] (I'll SWAT.) 
[18:00:43] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@dcf3338]: New Updater, GUI and Blazegraph build (duration: 17m 37s) [18:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:41] (03CR) 10Jforrester: [C: 03+2] Turn off mobile-ab test for VE section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) (owner: 10Esanders) [18:02:06] here [18:02:12] (03CR) 10Jforrester: [C: 03+2] Add *.*.nasa.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517170 (https://phabricator.wikimedia.org/T225852) (owner: 10Framawiki) [18:02:15] yes please James_F thanky ou! [18:02:15] (03CR) 10Jforrester: [C: 03+2] Add deliver.odai.yale.edu to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517172 (https://phabricator.wikimedia.org/T224875) (owner: 10Framawiki) [18:02:20] (03CR) 10Jforrester: [C: 03+2] Add *.mojnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517173 (https://phabricator.wikimedia.org/T213901) (owner: 10Framawiki) [18:02:45] (03Merged) 10jenkins-bot: Turn off mobile-ab test for VE section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) (owner: 10Esanders) [18:03:00] (03CR) 10jenkins-bot: Turn off mobile-ab test for VE section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) (owner: 10Esanders) [18:03:29] !log disabled TCP selective acknowledgements on caches/bastions [18:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Cmjohnson) [18:05:00] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Turn off mobile-ab test for VE section editing (duration: 00m 48s) [18:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:11] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for dbprox1018-22 [dns] - 10https://gerrit.wikimedia.org/r/517464 (owner: 10Cmjohnson) [18:07:17] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/FlaggedRevs/frontend/modules/ext.flaggedRevs.advanced.js: SWAT: FlaggedRevs: Bring back diff toggle T225351 (duration: 00m 48s) [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:31] T225351: Regression in diffing unreviewed changes in edit mode in FlaggedRevs - https://phabricator.wikimedia.org/T225351 [18:08:12] (03PS2) 10Jforrester: Add *.*.nasa.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517170 (https://phabricator.wikimedia.org/T225852) (owner: 10Framawiki) [18:08:21] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517170 (https://phabricator.wikimedia.org/T225852) (owner: 10Framawiki) [18:08:31] (03PS2) 10Jforrester: Add deliver.odai.yale.edu to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517172 (https://phabricator.wikimedia.org/T224875) (owner: 10Framawiki) [18:08:44] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517172 (https://phabricator.wikimedia.org/T224875) (owner: 10Framawiki) [18:09:18] (03Merged) 10jenkins-bot: Add *.*.nasa.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517170 (https://phabricator.wikimedia.org/T225852) (owner: 10Framawiki) [18:09:32] (03CR) 10jenkins-bot: Add *.*.nasa.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517170 
(https://phabricator.wikimedia.org/T225852) (owner: 10Framawiki) [18:09:34] ottomata: Testable on debug or should I just let it fly? [18:09:44] (03Merged) 10jenkins-bot: Add deliver.odai.yale.edu to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517172 (https://phabricator.wikimedia.org/T224875) (owner: 10Framawiki) [18:10:02] James_F: testable if you can make a user block with an expiry [18:10:04] i don't have perms to do that [18:10:15] Sure, one sec. [18:10:18] (03PS2) 10Smalyshev: Also ban empty user agents [puppet] - 10https://gerrit.wikimedia.org/r/516959 [18:10:34] (03PS3) 10Gehel: Also ban empty user agents [puppet] - 10https://gerrit.wikimedia.org/r/516959 (owner: 10Smalyshev) [18:11:09] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson) a:05Cmjohnson→03ayounsi Assigning to @ayounsi to add cloud-support1-d-eqiad. Once that is done, the vlan for dbproxy1020 and 1021 will need to be set up. Switch port... [18:11:30] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson) [18:11:32] (03CR) 10jenkins-bot: Add deliver.odai.yale.edu to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517172 (https://phabricator.wikimedia.org/T224875) (owner: 10Framawiki) [18:11:56] (03CR) 10Gehel: [C: 03+2] Also ban empty user agents [puppet] - 10https://gerrit.wikimedia.org/r/516959 (owner: 10Smalyshev) [18:12:58] ottomata: I blocked "Test12345~testwiki" on testwiki. [18:13:18] looks good I see it! [18:13:22] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) @cmjohnson which ones will go in the cloud vlan finally? 1018 and 1019 or 1020 and 1021? I'm fine either way but I'm confused with your last comment :) [18:13:25] please proceed :) [18:13:42] Excellent. 
[18:14:57] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/EventBus/includes/EventFactory.php: SWAT: Ensure user-blocks-change expiry_dt is in ISO-8601 (duration: 00m 48s) [18:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:30] (03PS2) 10Jforrester: Add *.mojnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517173 (https://phabricator.wikimedia.org/T213901) (owner: 10Framawiki) [18:15:52] (03CR) 10Jforrester: [C: 03+2] Add *.mojnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517173 (https://phabricator.wikimedia.org/T213901) (owner: 10Framawiki) [18:16:48] (03Merged) 10jenkins-bot: Add *.mojnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517173 (https://phabricator.wikimedia.org/T213901) (owner: 10Framawiki) [18:16:55] (03CR) 10Jforrester: [C: 03+2] Deploy Partial blocks to English wikisource, wiktionary and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) (owner: 10Niharika29) [18:17:03] (03CR) 10jenkins-bot: Add *.mojnews.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517173 (https://phabricator.wikimedia.org/T213901) (owner: 10Framawiki) [18:18:09] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Cmjohnson) @marostegui: do they all go to the cloud vlan? if they do then 1020 and 1021 are in row D...that support-cloud vlan is not available on row D yet. I need Arzhel to copy... [18:18:12] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10ayounsi) a:05ayounsi→03Cmjohnson The cloud support vlan/network is legacy, so I'd rather not create a new one (in a new row). As we already have cloud-support1-a-eqiad and cloud... 
[18:18:20] RECOVERY - Mjolnir bulk update failure check - codfw on icinga1001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [18:18:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Extend wgCopyUploadsDomains T213901 T224875 T225852 (duration: 00m 47s) [18:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:46] T213901: Please add mojnews.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T213901 [18:18:46] T224875: Add deliver.odai.yale.edu domain to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T224875 [18:18:46] T225852: Add svs.gsfc.nasa.gov to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T225852 [18:19:05] (03PS2) 10Jforrester: Deploy Partial blocks to English wikisource, wiktionary and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) (owner: 10Niharika29) [18:19:21] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) (owner: 10Niharika29) [18:19:52] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) Yep! 
Not a problem, I don't mind which hosts as long as we have two on that VLAN, whichever ones work best for you [18:20:20] (03Merged) 10jenkins-bot: Deploy Partial blocks to English wikisource, wiktionary and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) (owner: 10Niharika29) [18:20:22] (03CR) 10Jforrester: [C: 03+2] Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753 (owner: 10Matthias Mullie) [18:20:31] (03CR) 10jenkins-bot: Deploy Partial blocks to English wikisource, wiktionary and wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517463 (https://phabricator.wikimedia.org/T218626) (owner: 10Niharika29) [18:21:14] (03Merged) 10jenkins-bot: Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753 (owner: 10Matthias Mullie) [18:21:30] (03PS2) 10Jforrester: ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/515142 (owner: 10Legoktm) [18:21:32] (03CR) 10Jforrester: [C: 03+2] ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/515142 (owner: 10Legoktm) [18:21:34] SWAT done. [18:21:36] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Deploy Partial blocks to English wikisource, wiktionary and wikivoyage T218626 (duration: 00m 47s) [18:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] T218626: [Epic] Partial block rollout - https://phabricator.wikimedia.org/T218626 [18:21:54] :o [18:22:01] James_F: could I also get you to sync out https://gerrit.wikimedia.org/r/517356 ? :) [18:22:13] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Marostegui) 1018 and 1019 are ok to go to cloud VLAN from my side (as they are in row C) We just need two hosts on that vlan [18:22:26] (03Merged) 10jenkins-bot: ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/515142 (owner: 10Legoktm) [18:22:29] (03CR) 10jenkins-bot: Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753 (owner: 10Matthias Mullie) [18:22:43] (03PS2) 10Jforrester: Enable ExtensionDistributor log channel to help with T225243 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517356 (owner: 10Legoktm) [18:22:48] (03CR) 10Jforrester: [C: 03+2] Enable ExtensionDistributor log channel to help with T225243 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517356 (owner: 10Legoktm) [18:22:52] legoktm: Of course. [18:22:57] thank you :) [18:23:42] (03Merged) 10jenkins-bot: Enable ExtensionDistributor log channel to help with T225243 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517356 (owner: 10Legoktm) [18:24:13] kostajh: Do you want me to push out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/514638 ? 
[18:24:14] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 (duration: 00m 48s) [18:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:23] (03CR) 10jenkins-bot: ExtensionDistributor: Enable REL1_33 (beta), drop pre-REL1_30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/515142 (owner: 10Legoktm) [18:24:27] (03CR) 10jenkins-bot: Enable ExtensionDistributor log channel to help with T225243 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517356 (owner: 10Legoktm) [18:25:24] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ExtensionDistributor log channel to help with T225243 (duration: 00m 47s) [18:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:32] T225243: ExtensionDistributor API is missing extensions - https://phabricator.wikimedia.org/T225243 [18:26:06] Thankjs James_F ! [18:26:16] ottomata: I live to please. [18:26:41] (03PS4) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [18:26:54] PROBLEM - cassandra service on maps1001 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:26:55] (03PS1) 10Ottomata: Revert "Revert "Produce mediawiki.user-blocks-change stream to eventgate-main"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 [18:27:04] PROBLEM - cassandra CQL 10.64.0.79:9042 on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:27:19] ottomata: Want me to sling out ^^ ? 
[18:27:39] James_F: sure, woudl love a test of that on mwdebug [18:27:56] (03CR) 10Jforrester: [C: 03+2] Revert "Revert "Produce mediawiki.user-blocks-change stream to eventgate-main"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 (owner: 10Ottomata) [18:28:04] (03CR) 10Bstorm: "compiler results https://puppet-compiler.wmflabs.org/compiler1001/16968/" [puppet] - 10https://gerrit.wikimedia.org/r/517470 (https://phabricator.wikimedia.org/T225265) (owner: 10Bstorm) [18:29:04] thanks James_F for the swat! [18:29:29] framawiki: Happy to help. [18:29:46] RoanKattouw: Maybe you know if we want to push out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/514638 now? [18:30:12] I'm not sure, but I will ask [18:30:29] We can probably skip it for this SWAT, because if it does need to go out ahead of tomorrow's train, we can also do that at 4pm [18:30:50] OK. [18:31:09] (03PS2) 10Jforrester: Revert "Revert "Produce mediawiki.user-blocks-change stream to eventgate-main"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 (owner: 10Ottomata) [18:31:17] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 (owner: 10Ottomata) [18:31:25] ottomata: Sorry, gerrit failed to auto-rebase. [18:31:54] looking [18:32:09] (03Merged) 10jenkins-bot: Revert "Revert "Produce mediawiki.user-blocks-change stream to eventgate-main"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 (owner: 10Ottomata) [18:32:21] One second. [18:32:23] (03CR) 10jenkins-bot: Revert "Revert "Produce mediawiki.user-blocks-change stream to eventgate-main"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517472 (owner: 10Ottomata) [18:32:25] James_F: Word is that yes, it was intended to be deployed today [18:32:36] RoanKattouw: OK, will ship once ottomata is done. 
[18:32:36] So if you can still fit it into this window, then great, otherwise I can do it at 4pm [18:32:40] Great thanks [18:32:56] ottomata: Now live on mwdebug1002. [18:33:04] (03PS2) 10Jforrester: GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 (owner: 10Kosta Harlan) [18:33:06] 10Operations, 10ops-eqiad: rack/setup 3 new single cpu spare pool systems - https://phabricator.wikimedia.org/T219890 (10Cmjohnson) [18:33:08] ok James_Fcan you make a user block there? [18:33:46] ottomata: Done. [18:34:46] yes lookds good James_F ! [18:34:49] please proceed again! [18:34:53] Okie, dokie. [18:34:55] (03CR) 10Jforrester: [C: 03+2] GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 (owner: 10Kosta Harlan) [18:35:48] (03Merged) 10jenkins-bot: GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 (owner: 10Kosta Harlan) [18:36:00] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce mediawiki.user-blocks-change stream to eventgate-main, again (duration: 00m 49s) [18:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:23] RoanKattouw: Live. [18:36:39] Well, on some servers it's live. [18:37:06] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments (testwiki): Switch on mobile homepage feature (duration: 00m 47s) [18:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:10] There. [18:37:31] (03CR) 10jenkins-bot: GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 (owner: 10Kosta Harlan) [18:39:28] 10Operations, 10ops-eqiad, 10DBA: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Cmjohnson) @ayounsi I rather not move the servers...I racked them based on the instructions and they're already in racks and setup [18:40:05] James_F: i'm waiting for some more blocks to flow in....not seeing any. i think that is normal but just in case, you can make another block on a main wwiki [18:40:06] ? [18:40:11] not on mwdebug? [18:40:38] oh, i just saw a few [18:40:39] ok good [18:40:43] never mind James_F looks good! [18:42:45] Sure. :-) [18:42:58] The problem with relatively rare events from users. [18:43:41] 10Operations, 10ops-eqiad: rack/setup 3 new single cpu spare pool systems - https://phabricator.wikimedia.org/T219890 (10Cmjohnson) 05Open→03Resolved servers are ready as spares and in tracking sheet [18:44:27] 10Operations, 10ops-eqiad: eqiad: rack and setup (3) dual CPU servers - https://phabricator.wikimedia.org/T225219 (10Cmjohnson) [18:44:37] 10Operations, 10Wikimedia-Mailing-lists: Mailing list admin pass reset for winedale-l (for migration off lists.wikimedia.org) - https://phabricator.wikimedia.org/T224612 (10Quiddity) Reset and sent. Brion or Mike to confirm once resolved. 
[18:44:47] 10Operations, 10ops-eqiad: eqiad: rack and setup (3) dual CPU servers - https://phabricator.wikimedia.org/T225219 (10Cmjohnson) 05Open→03Resolved servers are set up and have been added to the tracking sheet [18:48:49] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (10Cmjohnson) 05Open→03Resolved these have been racked [19:10:29] (03PS1) 10Mholloway: rm old ssh public key for mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/517475 [19:18:02] (03PS1) 10Andrew Bogott: VPS resolv.conf: change ndots to 1, the default [puppet] - 10https://gerrit.wikimedia.org/r/517476 (https://phabricator.wikimedia.org/T224828) [19:37:50] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:55:08] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:59:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10Nuria) 05Open→03Resolved [20:00:04] cscott, arlolra, subbu, bearND, and halfak: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T2000). [20:00:16] I have an ORES deployment. [20:02:39] Here we go! 
[20:02:43] !log halfak@deploy1001 Started deploy [ores/deploy@04fbd58]: T224484 [20:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:49] T224484: ORES deployment: Early June - https://phabricator.wikimedia.org/T224484 [20:07:25] Looks good. Moving forward. [20:13:09] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10Gilles) [20:17:59] !log halfak@deploy1001 Finished deploy [ores/deploy@04fbd58]: T224484 (duration: 15m 17s) [20:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:04] T224484: ORES deployment: Early June - https://phabricator.wikimedia.org/T224484 [20:30:28] (03CR) 10CRusnov: "Well after fiddling I'm not sure I can tell what's wrong with the test. It doesn't seem to have parse problems on other checks but on pyli" [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (owner: 10CRusnov) [20:34:58] !log arlolra@deploy1001 Started deploy [parsoid/deploy@a8d9f6e]: Updating Parsoid to 2bf94f0 [20:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:04] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) Check finished, no differences found. [20:42:44] (03PS12) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [20:45:26] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@a8d9f6e]: Updating Parsoid to 2bf94f0 (duration: 10m 28s) [20:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:05] !log Updated Parsoid to 2bf94f0 (T225217) [20:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:11] T225217: VE is removing spaces (dirty diffs) on some wikis (wikitech, officewiki) - https://phabricator.wikimedia.org/T225217 [21:00:04] bawolff and Reedy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T2100). [21:01:01] I didn't get the joke. Need to study more PHP I think... [21:01:36] Hammers hit nails [21:01:42] Hammers accidentally hit thumbs [21:02:02] Aha [21:02:11] PHP didn't used to be so strict on types and stuff [21:02:15] So you could shove anything... [21:04:33] (03PS3) 10MarcoAurelio: Set two new namespace aliases for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143) [21:05:07] jouncebot: next [21:05:08] In 1 hour(s) and 54 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T2300) [21:05:35] if it were to be in one hour I'd schedule a couple of pending patches :( [21:06:06] Reedy: Also, arrrrrray. ;-) [21:06:16] valid syntax yo [21:06:22] ArRay () [21:06:36] $ArAray(); :-) [21:08:56] I think I kind of get it now Reedy, thanks :-) [21:18:21] (03CR) 10MarcoAurelio: [C: 03+1] "Not sure if the scripts have to be run before or after deployment, but indeed must be run. 
Please "un-WIP" (Start Review) this change when" [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[21:20:06] (CR) MarcoAurelio: [C: -1] "Isn't is easier to just add the wiki to the commonsuploads.dblist instead? Uploader would have to be configured separately but just the up" [mediawiki-config] - https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: Lokal Profil)
[21:21:48] (CR) MarcoAurelio: [C: +1] "Yep, it is 'import'." [mediawiki-config] - https://gerrit.wikimedia.org/r/516053 (owner: Gergő Tisza)
[21:24:43] (CR) Alex Monk: "no" [puppet] - https://gerrit.wikimedia.org/r/513910 (https://phabricator.wikimedia.org/T224708) (owner: Alex Monk)
[21:59:55] (PS2) CDanis: WIP: diff support. [software/conftool] - https://gerrit.wikimedia.org/r/515323
[22:07:42] (CR) CDanis: "most comments addressed, still some work to do" (10 comments) [software/conftool] - https://gerrit.wikimedia.org/r/515323 (owner: CDanis)
[22:09:28] (CR) Urbanecm: "> Patch Set 2: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[22:09:42] (CR) Urbanecm: [C: +1] "Code LGTM too." [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[22:18:15] (PS2) Andrew Bogott: VPS resolv.conf: change ndots to 1, the default [puppet] - https://gerrit.wikimedia.org/r/517476 (https://phabricator.wikimedia.org/T224828)
[22:19:23] (CR) Andrew Bogott: [C: +2] VPS resolv.conf: change ndots to 1, the default [puppet] - https://gerrit.wikimedia.org/r/517476 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott)
[22:24:14] (PS13) CRusnov: profile::netbox: Reorganize for splitting front and back-end.
[puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[22:43:07] (CR) Volans: "> Patch Set 4:" [software/cumin] - https://gerrit.wikimedia.org/r/514840 (owner: CRusnov)
[22:49:10] (PS14) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[23:00:04] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190617T2300).
[23:00:04] Jayprakash12345: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:33] I am here :)
[23:01:38] o/
[23:01:55] musikanimal: Can you add the core and extension patches on the Deployments page?
[23:02:28] I will try!
[23:04:35] (PS3) Niharika29: Enable ShortUrl Extension at aswiki and aswikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[23:04:42] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[23:05:10] Niharika: Have you run the script?
[23:05:25] Jayprakash12345: Not yet. Which script?
[23:05:50] I mentioned in the gerrit comment
[23:05:52] (Merged) jenkins-bot: Enable ShortUrl Extension at aswiki and aswikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[23:06:04] Jayprakash12345: Reading.
[23:06:29] Jayprakash12345: Ah, before the merge?
[23:06:38] (CR) jenkins-bot: Enable ShortUrl Extension at aswiki and aswikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/517180 (https://phabricator.wikimedia.org/T225870) (owner: Jayprakash12345)
[23:06:39] Yes
[23:07:06] Otherwise, It will give Database error.
[23:07:08] Okay. Let's see.
[23:08:52] Jayprakash12345: That script command gives me an error - `Fatal error: no version entry for shorturl`
[23:09:06] where did you get that command from?
[23:10:03] I copied it from another same task.
[23:10:27] Jayprakash12345: Link please?
[23:11:29] Ok
[23:11:46] sounds like parameters are wrong
[23:11:48] Niharika: I added two entires to the deployments page. Not sure if I formatted it properly
[23:11:50] (order)
[23:12:14] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/417762/
[23:12:58] Reedy: You're right. Looks like Zelko found the right order.
[23:13:21] Should we be enabling that extension?
[23:13:35] Oh, wait. The ShortUrl extension?
[23:13:38] We're switching that off.
[23:13:44] Jayprakash12345: Please don't do that
[23:13:45] Oh really?
[23:13:48] (PS1) Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935)
[23:13:57] Yeah, we have the new one
[23:13:58] 'wmgUseUrlShortener' => [
[23:13:58] 'default' => true,
[23:14:01] Niharika: Yeah, UrlShortener replaces it.
[23:14:13] I was like, why are we creating database tables...
[23:14:18] They work differently, but we plan to uninstall ShortUrl.
[23:14:54] Okay. I ran the table script for aswiki.
[23:15:28] There is community consensus for it.
[23:15:36] Jayprakash12345: Doesn't matter
[23:15:39] (PS1) Jforrester: Note that no further wikis should get ShortUrl enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/517558
[23:16:57] https://www.mediawiki.org/w/index.php?title=Extension%3AShortUrl&type=revision&diff=3280664&oldid=3263997
[23:17:00] Reedy: So we should declined it?
[23:17:02] Not great, but better than nothing
[23:17:09] Jayprakash12345: Yes, you should have the new shortener...
[23:17:13] Reedy: Should I just revert and merge the patch I already merged? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/517180/
[23:17:25] Niharika: Yes, sorry.
[23:17:26] Niharika: Yeah
[23:17:46] Don't worry about the database table too much, DBAs will drop it eventually
[23:18:36] Cool.
[23:18:45] (PS1) Niharika29: Revert "Enable ShortUrl Extension at aswiki and aswikisource" [mediawiki-config] - https://gerrit.wikimedia.org/r/517559
[23:19:01] Jayprakash12345: Sorry that this wasn't clear.
[23:19:01] (CR) Niharika29: [C: +2] "Reverting patch - SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/517559 (owner: Niharika29)
[23:19:05] (CR) Jeena Huneidi: "Sorry for the very long commit message! But probably would be helpful to read before reviewing." [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) (owner: Jeena Huneidi)
[23:19:17] (CR) Reedy: [C: +1] Note that no further wikis should get ShortUrl enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/517558 (owner: Jforrester)
[23:19:30] Sorry I did not find any doc note that wikimedia is going to drop it. Thanks I will remember :)
[23:19:32] Niharika: Also https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/517558 ?
:-)
[23:19:56] (Merged) jenkins-bot: Revert "Enable ShortUrl Extension at aswiki and aswikisource" [mediawiki-config] - https://gerrit.wikimedia.org/r/517559 (owner: Niharika29)
[23:20:10] (PS2) Niharika29: Note that no further wikis should get ShortUrl enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/517558 (owner: Jforrester)
[23:20:12] (CR) jenkins-bot: Revert "Enable ShortUrl Extension at aswiki and aswikisource" [mediawiki-config] - https://gerrit.wikimedia.org/r/517559 (owner: Niharika29)
[23:20:24] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/517558 (owner: Jforrester)
[23:21:18] (Merged) jenkins-bot: Note that no further wikis should get ShortUrl enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/517558 (owner: Jforrester)
[23:21:25] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/517461 (owner: Muehlenhoff)
[23:21:36] Niharika: Ji, Can you drop the comments on the task? It will more effective than my comment.
[23:21:45] !log Prune random spare "BetaMediaWiki.*" data points from graphite1004 and graphite2003 from pre Nov 2018.
[23:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:17] (CR) jenkins-bot: Note that no further wikis should get ShortUrl enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/517558 (owner: Jforrester)
[23:22:21] !log Prune debugging data "coal_tmp2.*" and "coal_tmp3.*" from graphite1004 and graphite2003 from last week, ref T221401
[23:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:26] T221401: Repopulate missing coal data in Graphite for 2019-04-17 outage - https://phabricator.wikimedia.org/T221401
[23:23:18] Jayprakash12345: Done!
[23:24:14] PROBLEM - puppet last run on analytics1062 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet.
Could be an interrupted request or a dependency cycle.
[23:27:34] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: No further use of ShortUrl (duration: 00m 47s)
[23:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:55] Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (Jdforrester-WMF) Ping @ArielGlenn who is on SRE clinic duty this week(?)
[23:37:55] musikanimal: Few more minutes. Waiting on Zuul.
[23:38:07] 👍
[23:40:28] !log Repopulating lost "coal.*" data in Graphite from NavigationTiming for 2019-04-17, ref T221401
[23:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:34] T221401: Repopulate missing coal data in Graphite for 2019-04-17 outage - https://phabricator.wikimedia.org/T221401
[23:49:18] (PS1) Nuria: [WIP] Enable downloads of matomo.js [puppet] - https://gerrit.wikimedia.org/r/517564 (https://phabricator.wikimedia.org/T225882)
[23:51:26] RECOVERY - puppet last run on analytics1062 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures