[00:08:15] Operations, Dumps-Generation, SDC General, Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (Addshore) So there are some details on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Growth, but I haven't written too much about media info yet. Det...
[00:09:42] "we'll need capacity"
[00:45:40] !log Running FlowReserializeRevisionContent.php on testwiki
[00:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:34] Operations, DC-Ops, Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (RobH) Ok, as it is now peak hours (according to @bblack) for eqiad, I'm re-pulling all the power data now. Please note that I'll update the task description AFTER this post (and...
[03:23:53] (PS1) Bmansurov: Labs: enable QuickSurveys on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/518186 (https://phabricator.wikimedia.org/T225819)
[03:26:26] Hi, can anyone please deploy a tiny labs config for me? I'd appreciate it: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/518186
[04:26:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:30:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:35:26] (PS1) ArielGlenn: svwiki officially 'big', 6 dumps jobs in parallel like the others [puppet] - https://gerrit.wikimedia.org/r/518189 (https://phabricator.wikimedia.org/T226200)
[04:57:16] Operations, DBA: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (Marostegui) Open→Resolved And it finally cleared up ` 23:38:30 <+icinga-wm> RECOVERY - EDAC syslog messages on db2084 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dash...
[04:57:52] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ema) >>! In T226048#5270181, @CDanis wrote: > My guess is that the beginning of this problem correlates...
[05:00:26] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1018.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906...
[05:02:11] (PS1) Marostegui: db-codfw.php: Pool db2051 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/518193 (https://phabricator.wikimedia.org/T221533)
[05:05:11] (CR) Marostegui: [C: +2] db-codfw.php: Pool db2051 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/518193 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:06:05] (Merged) jenkins-bot: db-codfw.php: Pool db2051 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/518193 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:06:32] (CR) jenkins-bot: db-codfw.php: Pool db2051 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/518193 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:07:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db2051 into s2 to replace db2035 as a master (duration: 01m 00s)
[05:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:05] (CR) Elukey: "> Hm, I'm trying to remember why we made the cdh:exec be in the cdh" [puppet/cdh] - https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[05:26:12] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1020.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906...
[05:34:13] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1019.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906...
[05:34:37] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906...
[05:35:16] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Wurgl) @ema It was not a single timeout. A chunk of data (the start of the page) was rendered by the br...
[05:36:15] (CR) Elukey: "Sorry for the delay! I like the cookbook, but I have to admit that I have rarely used repair in the past with the AQS cassandra cluster. I" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: Mathew.onipe)
[05:38:18] Operations, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (jijiki) @alaa_wmde If turnilo is enough for analysis, should we mark this as resolved?
[05:41:06] Operations, SRE-Access-Requests, Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (jijiki) Stalled→Resolved @andrew @jrbs This looks like resolved, please ping if it is not :)
[05:53:24] (PS1) Marostegui: install_server: Add MAC address for dbproxy1018 [puppet] - https://gerrit.wikimedia.org/r/518197 (https://phabricator.wikimedia.org/T225704)
[05:55:00] (CR) Marostegui: [C: +2] install_server: Add MAC address for dbproxy1018 [puppet] - https://gerrit.wikimedia.org/r/518197 (https://phabricator.wikimedia.org/T225704) (owner: Marostegui)
[06:00:43] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1018.eqiad.wmnet'] `
[06:05:34] (CR) Muehlenhoff: "cdh::exec was added to the cdh submodule as necessary base infrastructure so that anyone using the submodule can use Hadoop with Kerberos " [puppet/cdh] - https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[06:17:03] (PS1) Elukey: hadoop: lower down the min.user.id's yarn config [puppet] - https://gerrit.wikimedia.org/r/518200
[06:17:35] (CR) Elukey: [C: +2] hadoop: lower down the min.user.id's yarn config [puppet] - https://gerrit.wikimedia.org/r/518200 (owner: Elukey)
[06:19:41] (PS1) Marostegui: install_server: Add MAC for dbproxies [puppet] - https://gerrit.wikimedia.org/r/518203 (https://phabricator.wikimedia.org/T225704)
[06:20:12] (PS2) Marostegui: install_server: Add MAC for dbproxies [puppet] - https://gerrit.wikimedia.org/r/518203 (https://phabricator.wikimedia.org/T225704)
[06:21:04] (CR) Marostegui: [C: +2] install_server: Add MAC for dbproxies [puppet] - https://gerrit.wikimedia.org/r/518203 (https://phabricator.wikimedia.org/T225704) (owner: Marostegui)
[06:23:26] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1018.eqiad.wmnet'] ` The log can be found in `/var/log/w...
[06:26:33] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1020.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1020.eqiad.wmnet'] `
[06:30:21] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:30:25] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1020.eqiad.wmnet'] ` The log can be found in `/var/log/w...
[06:30:57] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:32:35] Operations, DNS, Matrix, Traffic, and 2 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Tgr) https://wikimedia.org/.well-known/matrix/server works correctly. https://wikimedia.org/.well-known/matrix/client is loaded via AJAX a...
[06:33:25] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-mcrouter.conf]
[06:34:37] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1019.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1019.eqiad.wmnet'] `
[06:35:05] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1021.eqiad.wmnet'] `
[06:40:18] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/var/log/w...
[06:44:21] !log installed python-opencv on stat1005 (T220811)
[06:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:26] T220811: Test Thumbor OpenCL smart cropping on stat1005 - https://phabricator.wikimedia.org/T220811
[06:45:01] moritzm: thanks!
[06:47:07] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1020.eqiad.wmnet'] ` and were **ALL** successful.
[06:47:44] (PS1) Gergő Tisza: Add permissive CORS headers for wikimedia.org/.well-known/matrix [puppet] - https://gerrit.wikimedia.org/r/518209 (https://phabricator.wikimedia.org/T223835)
[06:48:34] Operations, Wikibase-Containers, Wikidata, serviceops, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (hashar) Open→Declined This was for #serviceops! But I decline the task based on @Ladsgroup comment which it would be better to have...
[06:48:41] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (Marostegui)
[06:54:44] !log installed radeontop on stat1005 to diagnose GPU usage (T220811)
[06:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:49] T220811: Test Thumbor OpenCL smart cropping on stat1005 - https://phabricator.wikimedia.org/T220811
[06:56:18] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:34] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1021.eqiad.wmnet'] ` and were **ALL** successful.
[06:59:12] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:34] Operations, serviceops, PHP 7.2 support, Patch-For-Review: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (Joe) a:jijiki
[07:03:45] (PS1) Muehlenhoff: Allow gpu-testers to run radeontop [puppet] - https://gerrit.wikimedia.org/r/518210 (https://phabricator.wikimedia.org/T220811)
[07:07:41] Operations, MediaWiki-General-or-Unknown, serviceops, Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) a:Joe→None
[07:10:26] PROBLEM - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:23:34] (CR) DCausse: "not entirely sure but I wonder if you should not split this patch in 2 steps so that you ship InitialiseSettings.php first then ship the w" [mediawiki-config] - https://gerrit.wikimedia.org/r/517871 (https://phabricator.wikimedia.org/T222268) (owner: Ottomata)
[07:23:44] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1018.eqiad.wmnet'] `
[07:24:19] !log installing python-thumbor-wikimedia, python-opencv on stat1006
[07:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:56] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui)
[07:30:16] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui) a:Marostegui→Cmjohnson @Cmjohnson @ayounsi is there anything special with dbproxy1018 and dbproxy1019 VLANs and PXE? None of them seems to be booting up from PXE, despite that...
[07:35:36] Operations, serviceops, HHVM, Performance-Team (Radar), User-Marostegui: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Joe) Open→Resolved I've looked back in the last week or so and we don't see those kind of ins...
[07:35:54] (CR) DCausse: "same here I'd suggest to split this patch into multiple steps that are all valid. I think that the deployment order is non-trivial enough" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/517874 (https://phabricator.wikimedia.org/T222268) (owner: Ottomata)
[07:36:35] RECOVERY - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:38:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=eventgate-main,name=kubernetes2001.codfw.wmnet
[07:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:03] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=eventgate-analytics,name=kubernetes2001.codfw.wmnet
[07:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:44] Operations, DBA: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1018.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906...
[07:58:00] (PS1) Ema: Revert "cache::upload: temporarily prevent abuses" [puppet] - https://gerrit.wikimedia.org/r/518215
[07:58:20] (CR) jerkins-bot: [V: -1] Revert "cache::upload: temporarily prevent abuses" [puppet] - https://gerrit.wikimedia.org/r/518215 (owner: Ema)
[08:02:56] (PS1) Marostegui: wmnet: Change dbproxy1018,dbproxy1019 IPs to be in cloud [dns] - https://gerrit.wikimedia.org/r/518216 (https://phabricator.wikimedia.org/T225704)
[08:03:41] (PS10) Ema: Normalize thumbnail URLs to avoid cachebusting [puppet] - https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: Gilles)
[08:04:29] (CR) Ayounsi: [C: +2] wmnet: Change dbproxy1018,dbproxy1019 IPs to be in cloud [dns] - https://gerrit.wikimedia.org/r/518216 (https://phabricator.wikimedia.org/T225704) (owner: Marostegui)
[08:08:14] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (Marostegui) a:Cmjohnson→Marostegui While debugging with Arzhel we have noticed that the DNS entries for dbproxy1018 and dbproxy1019 didn't belong to the cloud network,...
[08:09:10] (CR) Ema: [C: +2] Normalize thumbnail URLs to avoid cachebusting [puppet] - https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: Gilles)
[08:14:14] (PS2) Ema: Revert "cache::upload: temporarily prevent abuses" [puppet] - https://gerrit.wikimedia.org/r/518215
[08:22:13] (CR) Ema: [C: +2] Revert "cache::upload: temporarily prevent abuses" [puppet] - https://gerrit.wikimedia.org/r/518215 (owner: Ema)
[08:22:48] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles)
[08:22:52] Operations, Performance-Team, Traffic, Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (Gilles) Open→Resolved
[08:23:00] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) Stalled→Open
[08:23:59] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) @aaron your concern has been addressed now, the Varnish-level thumbnail URL normalization is live. We can now proceed...
[08:25:39] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) @jijiki @fgiunchedi Have Swift proxies been restarted since https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/489...
[08:27:25] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:27:41] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (jijiki) @Gilles can it wait till next week when we will all be back, unless it is urgent, in which case we will figure it out
[08:28:36] (PS1) Elukey: profile::hue: add a parameter to selectively enable oozie security [puppet] - https://gerrit.wikimedia.org/r/518220 (https://phabricator.wikimedia.org/T212259)
[08:30:18] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) It can wait. Basically I want to figure out where we're at in regards to that patch, what's actually deployed and runn...
[08:30:53] (PS2) Ema: Add debian/patches/0034-r02135.vtc-fixes.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518065
[08:30:55] (PS2) Ema: Add debian/patches/0032-vbe_dir_finish-no-VBT_Wait.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518063
[08:30:57] (PS2) Ema: Add debian/patches/0033-recycled-honor-first_byte_timeout.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518064
[08:31:02] Operations, Performance-Team, Traffic, media-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (jijiki)
[08:31:18] (CR) jerkins-bot: [V: -1] Add debian/patches/0034-r02135.vtc-fixes.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518065 (owner: Ema)
[08:31:20] (CR) jerkins-bot: [V: -1] Add debian/patches/0032-vbe_dir_finish-no-VBT_Wait.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518063 (owner: Ema)
[08:31:30] (PS1) Elukey: hue: move $oozie_security_enabled to parameters [puppet/cdh] - https://gerrit.wikimedia.org/r/518221
[08:33:09] Operations, serviceops, wikitech.wikimedia.org, PHP 7.2 support, Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (jijiki) p:Triage→Low
[08:34:29] Operations, Gerrit, Release-Engineering-Team-TODO, serviceops: Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (jijiki) p:Triage→Low
[08:35:11] Operations, Performance-Team, Traffic, media-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) Nevermind, this was the Vagrant patch...
I'm going to make the production one now
[08:35:43] (CR) Elukey: [C: +2] hue: move $oozie_security_enabled to parameters [puppet/cdh] - https://gerrit.wikimedia.org/r/518221 (owner: Elukey)
[08:36:01] PROBLEM - puppet last run on schema1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:36:31] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:36:54] (PS2) Elukey: profile::hue: add a parameter to selectively enable oozie security [puppet] - https://gerrit.wikimedia.org/r/518220 (https://phabricator.wikimedia.org/T212259)
[08:37:43] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar)
[08:37:48] Operations, Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces due to DMARC - https://phabricator.wikimedia.org/T225553 (Aklapper)
[08:38:31] (PS3) Ema: Honor first_byte_timeout for recycled backend connections [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518064
[08:40:33] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:36] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17054/" [puppet] - https://gerrit.wikimedia.org/r/518220 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[08:42:01] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[08:42:01] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) And maybe we could use some `reprepro` configuration
to ease further upgrades?
[08:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:33] (CR) Effie Mouzeli: [V: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17049/" [puppet] - https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: Effie Mouzeli)
[08:43:40] (Abandoned) Ema: Add debian/patches/0031-vbt-close-stolen.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518062 (owner: Ema)
[08:43:55] (Abandoned) Ema: Add debian/patches/0034-r02135.vtc-fixes.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518065 (owner: Ema)
[08:44:05] (Abandoned) Ema: Add debian/patches/0032-vbe_dir_finish-no-VBT_Wait.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518063 (owner: Ema)
[08:46:24] (PS1) Hashar: contint: remove zuul-cloner from Docker agent [puppet] - https://gerrit.wikimedia.org/r/518222 (https://phabricator.wikimedia.org/T226233)
[08:46:44] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:32] Operations, MediaWiki-Cache, serviceops-radar, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed.
- https://phabricator.wikimedia.org/T212129 (Joe)
[08:48:01] Operations, Scap, serviceops-radar, User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (Joe)
[08:48:25] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[08:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:35] Operations, CX-cxserver, Citoid, Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (Joe)
[08:50:17] Operations, Operations-Software-Development, serviceops, Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (akosiaris) Should we close this? Is there anything left to be done?
[08:50:47] Operations, Analytics, Research, serviceops-radar, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Joe)
[08:51:08] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbproxy1018.eqiad.wmnet'] `
[08:52:20] Operations, Operations-Software-Development, serviceops-radar, Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (akosiaris)
[08:52:35] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Joe)
[08:55:44] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1018.eqiad.wmnet'] ` The log can be found in `/var/log/w...
[09:01:44] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:15] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[09:03:15] RECOVERY - puppet last run on schema1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[09:03:20] (PS1) Ema: varnish (5.1.3-1wm11) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/518224
[09:03:41] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:04:23] ema: o/ - are the above 50x due to upload upgrades?
[09:04:39] (CR) Effie Mouzeli: [C: +2] Remove kafka1018 from ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: Effie Mouzeli)
[09:05:39] (Merged) jenkins-bot: Remove kafka1018 from ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: Effie Mouzeli)
[09:06:32] (CR) jenkins-bot: Remove kafka1018 from ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: Effie Mouzeli)
[09:07:11] elukey: not sure, see ~ema/upload-503.log on weblog1001 if you wanna help find out
[09:08:19] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:25] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:04] elukey: likely not, various eqiad/esams nodes affected
[09:09:44] !log jiji@deploy1001 Synchronized wmf-config/ProductionServices.php: Remove kafka1018 from ProductionServices - T224538 (duration: 00m 56s)
[09:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:49] T224538: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538
[09:11:18] looks like swift is in trouble https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-24h&to=now-1m
[09:11:47] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1019.eqiad.wmnet'] ` The log can be found in `/var/log/w...
[09:11:53] the vast majority of the 503 errors had ttfb 20s
[09:12:31] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1018.eqiad.wmnet'] ` and were **ALL** successful.
[09:13:04] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:14:01] Operations, DBA, Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Marostegui)
[09:15:46] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[09:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:06] Operations, service-runner, serviceops-radar, Patch-For-Review, Services (later): Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris)
[09:17:21] (PS1) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[09:17:25] (CR) jerkins-bot: [V: -1] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[09:18:57] (PS2) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[09:19:23] (CR) jerkins-bot: [V: -1] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[09:23:19] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[09:23:23] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/518210 (https://phabricator.wikimedia.org/T220811) (owner: 10Muehlenhoff) [09:24:46] (03PS3) 10Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - 10https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) [09:25:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks papaul, merging!" [dns] - 10https://gerrit.wikimedia.org/r/515111 (owner: 10Papaul) [09:25:20] (03PS3) 10Alexandros Kosiaris: DNS: Add mgmt and production DNS for ganeti2009, ganeti201[0-8] [dns] - 10https://gerrit.wikimedia.org/r/515111 (owner: 10Papaul) [09:25:37] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for ganeti2009, ganeti201[0-8] [dns] - 10https://gerrit.wikimedia.org/r/515111 (owner: 10Papaul) [09:28:11] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: use domain names instead of IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/518075 (https://phabricator.wikimedia.org/T226098) [09:28:30] 10Operations, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1019.eqiad.wmnet'] ` and were **ALL** successful. [09:30:03] 10Operations, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (10Marostegui) [09:30:14] 10Operations, 10DBA, 10Patch-For-Review: eqiad: rack/setup/install (4) dbproxy systems. 
- https://phabricator.wikimedia.org/T225704 (10Marostegui) 05Open→03Resolved All hosts installed [09:30:31] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:00] ema: sorry just seen the ping [09:33:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: use domain names instead of IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/518075 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [09:35:46] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:45] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10faidon) >>! In T224188#5271528, @Bstorm wrote: > Ceph is capable of saturating 10G links under heavy load > [...] > Rate-limiting traffic is... [09:38:50] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:42:28] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:23] !log rebooting cp1008 [09:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:58] (03PS1) 10Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518230 [09:44:14] (03CR) 10jerkins-bot: [V: 04-1] Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518230 (owner: 10Ema) [09:45:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [09:45:32] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [09:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:40] 10Operations, 10Performance-Team, 10Traffic, 10media-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) [09:45:44] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (10Gilles) 05Resolved→03Open [09:46:01] (03PS2) 10Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518230 [09:50:16] (03PS1) 10Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518231 [09:50:30] (03Abandoned) 10Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518230 (owner: 10Ema) [09:51:47] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [09:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:59:36] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: enable 2379/tcp for peers as well [puppet] - 10https://gerrit.wikimedia.org/r/518235 (https://phabricator.wikimedia.org/T226098) [10:01:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: enable 2379/tcp for peers as well [puppet] - 10https://gerrit.wikimedia.org/r/518235 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [10:06:48] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:54] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, and 2 others: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10Joe) p:05Triage→03Low [10:09:03] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [10:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:05] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:26] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [10:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:55] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. 
- https://phabricator.wikimedia.org/T223336 (10Joe) [10:17:58] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) [10:19:14] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) [10:25:15] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10serviceops, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) @Legoktm kindly did my job and created an [[https://salsa.debian.org/mediawiki-team/php-wmerrors | upstream package ]] for... [10:29:30] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:41:37] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Jheald) ^^ I saw exactly what Wurgl reports, browsing long pages on en-wiki (eg the Fram case) in Londo... 
[10:45:31] (03PS1) 10Muehlenhoff: Add ferm rules for kpasswd [puppet] - 10https://gerrit.wikimedia.org/r/518237 [11:23:59] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: refresh server with certs changes [puppet] - 10https://gerrit.wikimedia.org/r/518238 (https://phabricator.wikimedia.org/T169287) [11:28:32] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: refresh server with certs changes [puppet] - 10https://gerrit.wikimedia.org/r/518238 (https://phabricator.wikimedia.org/T169287) [11:30:33] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: refresh server with certs changes [puppet] - 10https://gerrit.wikimedia.org/r/518238 (https://phabricator.wikimedia.org/T169287) [11:32:30] 10Operations, 10SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (10jijiki) @Iflorez is this working out for you? You could reach out on the -sre channel on irc to further debug this, but if this is not an access request task anymore, I would really appreciate if we mark i... [11:39:32] (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: restart etcd service when certs change [puppet] - 10https://gerrit.wikimedia.org/r/518020 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [11:39:34] (03PS1) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) [11:41:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: etcd: refresh server with certs changes [puppet] - 10https://gerrit.wikimedia.org/r/518238 (https://phabricator.wikimedia.org/T169287) (owner: 10Arturo Borrero Gonzalez) [11:47:39] Did you make any change to 'dologmsg'? It ain't working anymore [11:50:45] in Toolforge or in production? [11:53:28] hauskatze: ^ [11:54:50] Lucas_WMDE: toolforge [11:54:57] any error message? 
[11:55:07] I saw some puppet patch a couple of days ago but I can't remember [11:55:12] yes Lucas_WMDE , let me fetch [11:55:30] tools.stewardbots@tools-sgebastion-07:~/public_html$ dologmsg Updated stewardbots to 5099d2e. [11:55:41] tools.stewardbots@tools-sgebastion-07:~/public_html$ dologmsg Updated stewardbots to 5099d2e. [11:55:43] sigh [11:56:18] https://pastebin.com/06NVhkHb [11:56:30] bbl [11:56:56] ahah [11:57:02] there’s a space missing between "tools" and ]] [11:57:05] bstorm_: ^ [11:57:08] I’ll upload a patch [11:57:37] (03CR) 10Urbanecm: [C: 03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518171 (https://phabricator.wikimedia.org/T226217) (owner: 10DannyS712) [12:00:16] (03PS1) 10Lucas Werkmeister (WMDE): dologmsg: fix missing space in conditional [puppet] - 10https://gerrit.wikimedia.org/r/518242 [12:06:17] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki) p:05Normal→03Low [12:06:20] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki) p:05Low→03Normal [12:06:23] 10Operations, 10Thumbor, 10Wikimedia-Logstash, 10serviceops, 10User-jijiki: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10jijiki) p:05Normal→03Low [12:06:47] grrr [12:15:57] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: etcd: also create /etc/etcd [puppet] - 10https://gerrit.wikimedia.org/r/518247 (https://phabricator.wikimedia.org/T226098) [12:30:29] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:29] (03PS1) 10Marostegui: mariadb: WIP Provision dbproxy2001 into m1 [puppet] - 10https://gerrit.wikimedia.org/r/518251 [12:36:05] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+2] toolforge: k8s: etcd: also create /etc/etcd [puppet] - 10https://gerrit.wikimedia.org/r/518247 (https://phabricator.wikimedia.org/T226098) (owner: 10Arturo Borrero Gonzalez) [12:36:44] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:00] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10mmodell) This seems like a good idea. The upstream documentation has a warning that there are some issues with an external blog / dedic... [12:40:03] (03CR) 10Marostegui: "PCC looks good so far: https://puppet-compiler.wmflabs.org/compiler1001/17055/dbproxy2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518251 (owner: 10Marostegui) [12:43:59] (03PS2) 10Elukey: Add ferm rules for kpasswd [puppet] - 10https://gerrit.wikimedia.org/r/518237 (owner: 10Muehlenhoff) [12:44:04] (03CR) 10Elukey: [C: 03+1] Add ferm rules for kpasswd [puppet] - 10https://gerrit.wikimedia.org/r/518237 (owner: 10Muehlenhoff) [12:47:04] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) @LucasWerkmeister that could work, though if the fix is as simple as it appea... 
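[editor's note] The dologmsg failure diagnosed at 11:57 above ("a space missing between "tools" and ]]") can be reproduced in isolation. The sketch below is not the real dologmsg code, which is not shown in this log; the variable names, the `run_snippet` helper and the comparison against a project name are assumptions made for illustration only. It runs a correct and a broken bash conditional through `bash -c`, so the broken form cannot stop the surrounding script from parsing.

```shell
# Hypothetical reproduction, not taken from the dologmsg source.
# Both snippets are kept as strings and handed to bash -c.

# Correct: whitespace separates "tools" from the closing ]].
good='if [[ "$1" == "tools" ]]; then echo yes; else echo no; fi'

# Broken: without the space, bash reads "tools"]] as a single word,
# never finds the closing ]], and rejects the whole conditional
# as a syntax error before running anything.
bad='if [[ "$1" == "tools"]]; then echo yes; else echo no; fi'

# Run a snippet with one argument; stderr is discarded so only the
# snippet's stdout and exit status are visible to the caller.
run_snippet() {
    bash -c "$1" _ "$2" 2>/dev/null
}

run_snippet "$good" tools      # prints "yes"
run_snippet "$good" wikidata   # prints "no"
run_snippet "$bad" tools || echo "broken conditional: bash exited non-zero"
```

Because the error is raised while bash parses the conditional, the broken form fails for every input, which would be consistent with the tool simply not working for anyone until the patch was merged.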
[12:52:35] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] (03CR) 10Elukey: [C: 03+2] Add ferm rules for kpasswd [puppet] - 10https://gerrit.wikimedia.org/r/518237 (owner: 10Muehlenhoff) [12:58:21] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:40] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10ArielGlenn) If we build it for reals, I'd ask @MoritzMuehlenhoff about all that. If we... [13:00:07] (03PS1) 10Alexandros Kosiaris: releases: Rely on cron alone for helm charts updating [puppet] - 10https://gerrit.wikimedia.org/r/518253 [13:09:24] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:11:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:11:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:44] (03CR) 10Alexandros Kosiaris: "Overall seems ok as a first draft to me, I 've left a comment about the owner/group/mode stuff, I 'll have a look whether it can be somewh" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:16:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:16:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:30] !log rebooting kafkamon instances to pick up MDS mitigations/new kernel [13:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:08] (03CR) 10Alexandros Kosiaris: k8s, deploy: introducing helmfile for manage charts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:26:50] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Still trying to wrap my head around the raw chart but I 'll figure it out." 
(0318 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:33:03] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:53] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10User-Urbanecm: Server-side upload request for Hurtigruten minutt for minutt videos - https://phabricator.wikimedia.org/T223052 (10Urbanecm) Lock error disappeared, unknown error one is still in place. [13:35:38] (03PS1) 10Petar.petkovic: Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) [13:36:36] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:36:48] (03CR) 10jerkins-bot: [V: 04-1] Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [13:39:43] (03CR) 10Fsero: introducing helmfile.d values for staging cluster (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:43:35] !log akosiaris@deploy1001 scap-helm mathoid upgrade --recreate-pods -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [13:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [13:43:43] !log akosiaris@deploy1001 scap-helm mathoid finished [13:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] PROBLEM - Disk space on contint1001 is CRITICAL: DISK 
CRITICAL - free space: / 2514 MB (5% inode=45%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:45:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] introducing helmfile.d values for staging cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [13:46:56] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:48:04] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:48:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:38] (03CR) 10Alexandros Kosiaris: [C: 04-2] "We've been trying to deprecate ferm macros as they have a number of issues." 
[puppet] - 10https://gerrit.wikimedia.org/r/518130 (owner: 10EBernhardson) [13:52:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:52:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:35] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:20] 10Operations, 10Performance-Team, 10Traffic, 10media-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) [13:58:24] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (10Gilles) 05Open→03Resolved This has caused a spike of thumbor thumbnailing requests, by virtue of making some objects hotter t... 
[14:09:46] !log Renamed Carmen0429@metawiki to Carmen0428@metawiki as part of re-attaching to global account (T223036) [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] T223036: Detached local SUL account User:Carmen0428 needs to be reunited with the global account - https://phabricator.wikimedia.org/T223036 [14:10:22] !log Attached Carmen0428@metawiki to Carmen0428 global account (T223036) [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10debt) [14:15:45] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10debt) 05Open→03Resolved [14:16:49] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:23] !log ema@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] (03PS2) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [14:21:40] 10Operations, 10Traffic, 10Patch-For-Review: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 (10ema) 05Open→03Resolved a:03ema All cache nodes are currently running Varnish 5.1.3-1wm10, which fixes this. 
[14:21:44] (03CR) 10Fsero: [C: 03+2] releases: Rely on cron alone for helm charts updating [puppet] - 10https://gerrit.wikimedia.org/r/518253 (owner: 10Alexandros Kosiaris) [14:21:55] (03PS1) 10Urbanecm: Add hualab.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) [14:22:21] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:22:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:12] !log rebooting wezen [14:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:29] (03CR) 10jerkins-bot: [V: 04-1] Add hualab.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518278 (https://phabricator.wikimedia.org/T225917) (owner: 10Urbanecm) [14:23:44] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:27] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:26:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:50] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10ema) All cache nodes are now running Linux 4.9.168-1+deb9u3. Next week we can thus re-enable SACKs on part of the cache fleet and fu... 
[14:33:54] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [14:37:56] !log rebooting kerberos1001 to pick up MDS mitigations/new kernel [14:37:57] (03PS1) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/518282 [14:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:14] (03PS1) 10Fsero: k8s,deploy: changing the order [labs/private] - 10https://gerrit.wikimedia.org/r/518283 [14:44:41] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s,deploy: changing the order [labs/private] - 10https://gerrit.wikimedia.org/r/518283 (owner: 10Fsero) [14:45:00] (03Abandoned) 10Fsero: k8s,deploy: adding fake secret data for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/518282 (owner: 10Fsero) [14:49:57] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:49:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:50:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] !log rebooting planet1001 to pick up MDS mitigations/new kernel [14:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:59] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:54:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] (03PS3) 10Fsero: k8s, deploy: 
introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [14:56:56] (03PS4) 10Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) [15:02:26] (03PS4) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [15:05:48] (03CR) 10Mforns: analytics::refinery::job::data_purge add deletion for data_quality_hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [15:06:54] (03PS5) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [15:08:29] (03PS6) 10Fsero: k8s, deploy: introducing helmfile for manage charts [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) [15:09:25] (03PS1) 10Jbond: facter: add confine to cpu_details fact [puppet] - 10https://gerrit.wikimedia.org/r/518286 [15:09:37] (03CR) 10Fsero: "Joe, Alex i think i've addressed your nits" [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [15:09:55] (03CR) 10Muehlenhoff: [C: 03+1] facter: add confine to cpu_details fact [puppet] - 10https://gerrit.wikimedia.org/r/518286 (owner: 10Jbond) [15:15:13] (03CR) 10Fsero: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [15:15:46] (03PS2) 10Jbond: facter: add confine to cpu_details fact [puppet] - 10https://gerrit.wikimedia.org/r/518286 [15:17:43] (03CR) 10Jbond: [C: 03+2] facter: add confine to cpu_details fact [puppet] - 10https://gerrit.wikimedia.org/r/518286 (owner: 10Jbond) [15:42:30] (03Abandoned) 10EBernhardson: Define 
ferm classes for lvs owned ips [puppet] - 10https://gerrit.wikimedia.org/r/518130 (owner: 10EBernhardson) [15:47:22] (03PS7) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [15:53:48] 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Andrew) [15:56:04] 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Andrew) Here are some usage stats: https://phabricator.wikimedia.org/P8638 As expected, relatively many projects use zero or few user-applied modules, and a small number of pr... [15:57:01] (03PS2) 10Bstorm: dologmsg: fix missing space in conditional [puppet] - 10https://gerrit.wikimedia.org/r/518242 (owner: 10Lucas Werkmeister (WMDE)) [15:57:38] 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Andrew) And, here is some info about the base classes applied to a VM vs a production machine: https://phabricator.wikimedia.org/P8639 Lots of shared code in there! The unsha... [16:02:39] (03CR) 10Bstorm: [C: 03+2] dologmsg: fix missing space in conditional [puppet] - 10https://gerrit.wikimedia.org/r/518242 (owner: 10Lucas Werkmeister (WMDE)) [16:05:23] Lucas_WMDE: merged...that should roll out soon and fix things. Good catch, and sorry about the miss. [16:19:03] 10Operations, 10Gerrit, 10Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (10sbassett) Hello @Ciencia_Al_Poder - Apologies for the delay on a response to this issue. Due to an ongoing security incident [0], certain IP ranges co... [16:22:38] hi operations peeps, can I rename a user with 88k edits right now? 
https://meta.wikimedia.org/wiki/Special:CentralAuth/Schmelzle [16:28:35] ajr: that should be fine, yes [16:48:27] thanks! [16:52:41] (03PS1) 10Urbanecm: [throttle-analyze] Grant autoconfirmed permission to user when throttle rule is applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518298 (https://phabricator.wikimedia.org/T204583) [16:55:03] (03CR) 10jerkins-bot: [V: 04-1] [throttle-analyze] Grant autoconfirmed permission to user when throttle rule is applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518298 (https://phabricator.wikimedia.org/T204583) (owner: 10Urbanecm) [16:56:03] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Image thumbnail (cache?) broken on Wikimedia Commons, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (10Krinkle) [16:59:24] (03PS1) 10Nuria: Removing page_links events from the ones that get refined [puppet] - 10https://gerrit.wikimedia.org/r/518299 (https://phabricator.wikimedia.org/T226268) [17:00:52] (03PS2) 10Nuria: Removing page_links events from refine whitelist [puppet] - 10https://gerrit.wikimedia.org/r/518299 (https://phabricator.wikimedia.org/T226268) [17:07:15] (03CR) 10Elukey: [C: 03+2] Removing page_links events from refine whitelist [puppet] - 10https://gerrit.wikimedia.org/r/518299 (https://phabricator.wikimedia.org/T226268) (owner: 10Nuria) [17:13:28] (03CR) 10Nuria: analytics::refinery::job::data_purge add deletion for data_quality_hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [17:13:33] 10Operations, 10Gerrit, 10Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (10Ciencia_Al_Poder) Ok, thanks for the update. I only need a read-only access to the repos, so I'd have to live with the github mirror... 
[17:13:52] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (herron)
[17:14:39] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (herron)
[17:15:59] (PS18) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[17:16:46] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[17:19:18] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (herron) a:RobH
[17:26:32] (PS19) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[17:27:30] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:27:32] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:27:57] (CR) Herron: "> How does this look in terms of a cutover plan? https://docs.google.com/document/d/1o7bl1WBzSMymsXGzhWLy1GmMOSo_Evof1PHk2aQXAiE/edit?usp" [puppet] - https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[17:28:56] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:29:00] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 82255 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:30:57] (PS1) Papaul: DHCP: Add MAC address entries for ganeti2009 and ganeti201[0-8] [puppet] - https://gerrit.wikimedia.org/r/518303 (https://phabricator.wikimedia.org/T224603)
[17:45:05] (PS1) Papaul: Partman: Add ganeti201[0-8] [puppet] - https://gerrit.wikimedia.org/r/518305 (https://phabricator.wikimedia.org/T224603)
[17:58:24] PROBLEM - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[17:58:30] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:59:04] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[18:28:49] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (Jdforrester-WMF)
[18:38:04] PROBLEM - puppet last run on bast2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[18:53:29] (PS1) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518311 (https://phabricator.wikimedia.org/T223291)
[18:53:40] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (RobH)
[18:53:48] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (RobH)
[18:54:17] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518311 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[18:57:13] (PS20) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[18:58:03] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[18:59:35] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (RobH) a:RobH→herron @herron, This looks correct, except there isn't a mention of if these need internal or external vlan/ip addresses? Please comment to state which, and then assign this...
[19:01:26] (PS1) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518312 (https://phabricator.wikimedia.org/T223291)
[19:02:13] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518312 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[19:04:16] (Abandoned) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518312 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[19:05:09] (PS21) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[19:05:14] RECOVERY - puppet last run on bast2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:06:00] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[19:27:12] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on Wikimedia Commons, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (JJMC89) I deleted it and created protected the page a...
[19:38:35] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on Wikimedia Commons, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (JJMC89) I reuploaded the file as [[ https://en.wikipe...
[19:47:40] (PS22) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[19:49:18] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on Wikimedia Commons, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (Xaosflux) Possibly related to T30299 ? I'd like to s...
[19:56:30] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on English Wikipedia, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (JJMC89)
[19:56:42] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on English Wikipedia, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (Krinkle) >>! In T226271#5274599, @Xaosflux wrote: > P...
[19:58:28] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (herron) These will need internal vlan/ips. Fwiw kafka-main100[1-5] will be replacing kafka100[123], so those existing hosts could be used as a template.
[19:58:41] Operations, ops-eqiad: rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (herron) a:herron→Cmjohnson
[20:04:17] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Iflorez) Not working yet. I will follow up on the -sre channel on irc. Thank you @jijiki
[20:10:23] Puppet, cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (Andrew) I have an (ironically) unpuppetized example of a dual-run setup running now: abogott-dual-puppet-base-master.testlabs.eqiad.wmflabs serves only role::wmcs::instance for...
[20:21:48] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image thumbnail (cache?) broken on English Wikipedia, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) - https://phabricator.wikimedia.org/T226271 (JJMC89) //(In case any of this is helpful to those in...
[20:35:16] Operations, Gerrit, Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (sbassett) Open→Resolved
[20:41:54] twentyafterfour: ugh, I forgot about merging this https://gerrit.wikimedia.org/r/c/operations/puppet/+/517140
[20:43:25] herron: if you're still around would you be willing to review/merge ^
[20:54:06] Operations, Core Platform Team, MassMessage, WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (MarcoAurelio) We're experiencing again global renames becoming stuck: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress This time they start...
[21:00:53] greg-g: do you want me to run some commands for you in the meantime?
[21:01:37] twentyafterfour: I don't have a queue of them yet, I'll ping you when I do.
[21:23:57] greg-g: ok I'll be around.
[21:24:29] twentyafterfour: I'm getting pulled into some yak shaving that may prevent me from starting today. Don't wait for me when you're at quitting time.
[21:24:45] ok
[21:25:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:28:13] Operations, Core Platform Team, MassMessage, WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (Reedy) >>! In T226109#5274665, @MarcoAurelio wrote: > We're experiencing again global renames becoming stuck: https://meta.wikimedia.org/wiki/Special:Gl...
[21:29:43] Operations, Core Platform Team, MassMessage, WMF-JobQueue: Jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (Reedy) ` reedy@mwlog1001:/srv/mw-log$ grep "チルノ" JobExecutor.log | grep testwiki 2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki...
[21:35:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:54:51] (Abandoned) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/518311 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[21:59:22] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused eevans Decommissioned (T223976) https://phabricator.wikimedia.org/T93886
[21:59:22] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.230:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T223976) https://phabricator.wikimedia.org/T120662
[21:59:22] ACKNOWLEDGEMENT - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans Decommissioned (T223976) https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:22:07] twentyafterfour: ./bin/bulk make-silent --id 1822 please :)
[22:30:19] greg-g: done
[22:30:23] twentyafterfour: and now ./bin/bulk make-silent --id 1823
[22:30:52] and thank you!
[23:01:14] Eurgh. Anyone have a bot handy that wants to make a shedload of redirects on MediaWiki.org? Translate extension just broke rather badly and failed to make redirects…
[23:35:18] James_F: I have one but it's not flagged on mw. Happy to fix those though.
[23:39:57] JJMC89[m]: That's kind of you. I'll start a thread.
[23:41:37] (CR) Krinkle: [C: +1] "When deploying/SWAT, take special care to sync IS.php first. This means it cannot be reliably tested on mwdebug as 'scap pull' will apply " [mediawiki-config] - https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) (owner: Lucas Werkmeister (WMDE))
[23:49:07] JJMC89[m]: Posted on https://www.mediawiki.org/wiki/Topic:V22uce5k6lmij7xb if you want to volunteer. :-)
[23:57:48] James_F: Commented there. If you just want all of the redirects created, I can start that now.
[23:59:10] JJMC89[m]: I'm pretty sure you won't be allowed to create pages in the Translate: namespace.
[23:59:42] JJMC89[m]: Also there's all the talk pages. :-(