[00:00:09] * thcipriani stages
[00:01:34] brion: is there anything you can test on mwdebug1002 before I run the scap sync? If so, I pulled the code there
[00:01:43] just confirming it doesn't explode ;)
[00:01:43] lemme test
[00:01:59] good test :)
[00:02:43] thcipriani: nothing exploding on a File: page or Special:TimedMediaHandler. That's a good thing! :D
[00:03:00] should be ready to scap
[00:03:09] cool, thanks for the check, I'll fire up scap
[00:03:48] \o/ tx
[00:05:31] !log thcipriani@deploy1001 Started scap: SWAT: [[gerrit:449260|Convert to extension.json]] T87981 [[gerrit:449372|Use maps for wgEnabledTranscodeSet]] T118080 [[gerrit:449373|Adjust VP9 encoding for speed, quality]] [[gerrit:449374|More conservative max bitrate for VP9 video transcodes]]
[00:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:37] T118080: Rewrite $wgEnabledTranscodeSet and $wgEnabledAudioTranscodeSet settings as a map from transcode to boolean for enabled status - https://phabricator.wikimedia.org/T118080
[00:05:38] T87981: Convert TimedMediaHandler to use extension registration - https://phabricator.wikimedia.org/T87981
[00:12:01] Operations, Traffic, Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (Varnent) Open>Resolved a:Varnent Thank you @BBlack for all your help today!
[00:13:09] (PS3) Paladox: Gerrit: Make PolyGerrit the default ui [puppet] - https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812)
[00:19:01] brion: update l10n cache still updating
[00:19:09] ok
[00:25:06] actually syncing now
[00:25:06] (CR) BBlack: [C: 2] wikimediafoundation.org: switch TTLs back to 10m [dns] - https://gerrit.wikimedia.org/r/449342 (https://phabricator.wikimedia.org/T198922) (owner: BBlack)
[00:25:11] \o/
[00:36:19] rebuilding cdb files
[00:36:47] whee
[00:39:16] !log thcipriani@deploy1001 Finished scap: SWAT: [[gerrit:449260|Convert to extension.json]] T87981 [[gerrit:449372|Use maps for wgEnabledTranscodeSet]] T118080 [[gerrit:449373|Adjust VP9 encoding for speed, quality]] [[gerrit:449374|More conservative max bitrate for VP9 video transcodes]] (duration: 33m 44s)
[00:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:22] T118080: Rewrite $wgEnabledTranscodeSet and $wgEnabledAudioTranscodeSet settings as a map from transcode to boolean for enabled status - https://phabricator.wikimedia.org/T118080
[00:39:22] T87981: Convert TimedMediaHandler to use extension registration - https://phabricator.wikimedia.org/T87981
[00:39:23] ^ brion weee
[00:39:29] wooooooo
[00:39:37] ready for config change?
[00:39:43] \o/ let's do it!
[00:40:00] (PS3) Thcipriani: Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis [mediawiki-config] - https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) (owner: Brion VIBBER)
[00:40:15] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) (owner: Brion VIBBER)
[00:40:31] * brion dances
[00:41:26] (Merged) jenkins-bot: Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis [mediawiki-config] - https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) (owner: Brion VIBBER)
[00:41:43] (CR) jenkins-bot: Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis [mediawiki-config] - https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) (owner: Brion VIBBER)
[00:41:59] woot
[00:42:10] brion: live on mwdebug1002 if there's anything you want to test there
[00:42:15] ok lemme test
[00:42:45] thcipriani: looks good!
[00:42:48] push zee button
[00:42:54] * thcipriani does
[00:44:49] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:447572|Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis]] T63805 (duration: 00m 48s)
[00:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:53] T63805: Create WebM (VP9/opus) transcodes replacing the WebM (VP8/vorbis) ones eventually - https://phabricator.wikimedia.org/T63805
[00:44:55] ^ brion live everywhere
[00:45:00] thanks!!!!!
[00:45:02] :D :D :D :D
[00:45:02] yw :)
[00:45:10] kudos on the switch!
[00:45:17] lemme run one just to confirm all's through the system :D
[00:45:22] sure
[00:46:03] https://upload.wikimedia.org/wikipedia/commons/transcoded/9/94/Folgers.ogv/Folgers.ogv.120p.vp9.webm
[00:46:04] WORKS
[00:46:23] thanks again for the swat thcipriani !
[00:46:35] awesome, glad all's well!
[00:46:42] :D
[00:51:31] (PS9) Dzahn: netbox: add psql backups [puppet] - https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184)
[00:54:48] (CR) Dzahn: [C: 2] netbox: add psql backups [puppet] - https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: Dzahn)
[00:56:55] Operations, hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075 (brion)
[00:57:06] (CR) Dzahn: [C: 2] "on netmon1002/2001:" [puppet] - https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: Dzahn)
[01:12:48] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Someday): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885 (Paladox)
[01:18:46] (PS1) Dzahn: postgresql::backups: fix typo and amend usage notes [puppet] - https://gerrit.wikimedia.org/r/449387 (https://phabricator.wikimedia.org/T190184)
[01:22:24] (CR) Dzahn: [C: 2] postgresql::backups: fix typo and amend usage notes [puppet] - https://gerrit.wikimedia.org/r/449387 (https://phabricator.wikimedia.org/T190184) (owner: Dzahn)
[01:28:42] Operations, Patch-For-Review: Netbox: setup backups - https://phabricator.wikimedia.org/T190184 (Dzahn) There is now a puppetized dir and cron job on netbox machines. Copy/pasting the command from there and executing manually as user postgres works and creates a gzipped dump file in /srv/postgres-backup....
[02:00:31] Operations, Traffic, Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (Varnent) Here is a ticket for the redirects in general on the new site: [T200754]
[02:09:28] Operations, DNS, Release-Engineering-Team, Traffic, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent) Task with info on redirects: [T200754]
[02:27:20] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.14) (duration: 08m 12s)
[02:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:40] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 31 02:37:40 UTC 2018 (duration 10m 20s)
[02:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:48] (PS1) Urbanecm: Milestone logo for atjwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/449389 (https://phabricator.wikimedia.org/T200713)
[02:45:41] (PS1) Urbanecm: Upload HD logos for hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/449390 (https://phabricator.wikimedia.org/T200470)
[02:45:43] (PS1) Urbanecm: Use HD logos for hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/449391 (https://phabricator.wikimedia.org/T200470)
[04:20:41] (CR) Muehlenhoff: [C: -1] admin: remove expiry attributes of user nettrom (1 comment) [puppet] - https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) (owner: Herron)
[04:22:29] Operations, Product-Analytics, SRE-Access-Requests, Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (MoritzMuehlenhoff) >>! In T200723#4462869, @herron wrote: > Hi @Neil_P._Quinn_WMF, I've prepared a patch for thi...
[04:31:51] (PS2) Muehlenhoff: mediawiki::php: Remove support for PHP 5 [puppet] - https://gerrit.wikimedia.org/r/449219
[04:47:16] !log Stop change_tag_def populate maintenance script for wikidata and enwiki - T193873
[04:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:21] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873
[04:49:05] (CR) Marostegui: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) (owner: Marostegui)
[04:49:12] (PS4) Marostegui: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585)
[04:51:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) (owner: Marostegui)
[04:53:12] (Merged) jenkins-bot: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) (owner: Marostegui)
[04:53:29] (CR) jenkins-bot: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) (owner: Marostegui)
[04:54:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all hosts in row B - T183585 (duration: 00m 50s)
[04:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:55] T183585: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585
[05:00:04] !log Deploy schema change on dbstore1002:s8 T144010 T51190 T199368/script unload irssinotifier
[05:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:11] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[05:00:11] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[05:00:12] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[05:01:05] !log Deploy schema change on db1099:3318 T144010 T51190 T199368
[05:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[05:03:19] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,9 instance=db2061:9100 job=node site=codfw Marostegui T200059 - The acknowledgement expires at: 2018-08-03 05:01:44. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops
[05:05:10] Operations, DNS, Release-Engineering-Team, Traffic, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent)
[05:06:55] (CR) Muehlenhoff: "Thanks for working on this! Some comments" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/446887 (https://phabricator.wikimedia.org/T198649) (owner: Volans)
[05:19:19] (PS1) Marostegui: Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - https://gerrit.wikimedia.org/r/449396
[05:19:30] (CR) Marostegui: [C: -2] "Wait for the network maintenance to be completed" [mediawiki-config] - https://gerrit.wikimedia.org/r/449396 (owner: Marostegui)
[05:22:40] (PS3) Jcrespo: Remove terbium for tendril grants [puppet] - https://gerrit.wikimedia.org/r/445590 (owner: Muehlenhoff)
[05:24:12] (CR) Jcrespo: [C: 2] Remove terbium for tendril grants [puppet] - https://gerrit.wikimedia.org/r/445590 (owner: Muehlenhoff)
[05:30:50] (PS1) Marostegui: check_private_data: Add root@wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/449398
[05:32:30] (CR) Marostegui: [C: 2] check_private_data: Add root@wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/449398 (owner: Marostegui)
[05:37:47] (PS1) Jcrespo: mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585)
[05:39:26] (CR) Marostegui: [C: 1] mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[05:39:37] (CR) Marostegui: [C: 1] "thanks" [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[05:39:47] (CR) Jcrespo: "Isn't it a bit early to deploy all this? Maintenanace windows doesn't start until 3pm UTC." [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[05:40:26] (CR) Marostegui: [C: 1] "I deployed all the depools for all the other hosts to make sure we could detect problems as the traffic goes up, so it doesn't come as a s" [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[05:50:49] (PS2) Jcrespo: mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585)
[05:51:28] (CR) Jcrespo: "See also the change to db1089." [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[05:52:21] (CR) Marostegui: [C: 1] "Looks good - thanks" [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[06:08:34] (PS3) Jcrespo: mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224)
[06:10:11] (CR) Jcrespo: [C: 2] mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[06:11:02] !log installing ant security updates
[06:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[06:11:30] (Merged) jenkins-bot: mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[06:11:46] (CR) jenkins-bot: mariadb: Depool external store hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449400 (https://phabricator.wikimedia.org/T183585) (owner: Jcrespo)
[06:17:55] (PS1) Jcrespo: tendril: Remove grants for mwmaint and icinga hosts [puppet] - https://gerrit.wikimedia.org/r/449403
[06:30:11] (PS2) Marostegui: Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - https://gerrit.wikimedia.org/r/449396
[06:31:28] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:31:48] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_strongswan]
[06:31:58] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.0/fpm/php.ini]
[06:32:18] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:38:10] (CR) Marostegui: [C: 1] tendril: Remove grants for mwmaint and icinga hosts [puppet] - https://gerrit.wikimedia.org/r/449403 (owner: Jcrespo)
[06:39:24] (CR) Jcrespo: [C: 2] tendril: Remove grants for mwmaint and icinga hosts [puppet] - https://gerrit.wikimedia.org/r/449403 (owner: Jcrespo)
[06:39:46] (PS2) Jcrespo: tendril: Remove grants for mwmaint and icinga hosts [puppet] - https://gerrit.wikimedia.org/r/449403
[06:42:00] jynus: I think you merged the es depools but not deployed?
[06:42:35] not yet
[06:42:42] I was waiting for it to merge
[06:43:29] (PS1) Ema: cp-misc_codfw: upgrade to stretch [puppet] - https://gerrit.wikimedia.org/r/449407 (https://phabricator.wikimedia.org/T200445)
[06:43:55] It merged I believe: ˜/wikibugs 8:11> (Merged) jenkins-bot: mariadb: Depool external store hosts in row B [mediawiki-config]
[06:44:15] yes, but by that time I was making 3 other patches
[06:44:35] (CR) Ema: [C: 2] cp-misc_codfw: upgrade to stretch [puppet] - https://gerrit.wikimedia.org/r/449407 (https://phabricator.wikimedia.org/T200445) (owner: Ema)
[06:44:47] I was just letting you know in case I missed the deployment or you were not aware it got merged (that happens to me sometimes with the channel noise) , that is only it :)
[06:45:37] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool row B es hosts, reduce db1089 load (duration: 00m 49s)
[06:45:37] (CR) Volans: "recheck (as the upstream bug for prospector has been supposedly fixed)" [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[06:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[06:49:20] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2018.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073106...
[06:49:25] !log dropping unnecesary tendril web users from tendril db backends
[06:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[06:56:19] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[06:56:29] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6
[06:56:29] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6
[06:56:38] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6
[06:57:08] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[06:57:08] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:08] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[06:57:19] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6
[06:57:29] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:59] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:47] ACKNOWLEDGEMENT - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:48] ACKNOWLEDGEMENT - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:48] ACKNOWLEDGEMENT - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:49] ACKNOWLEDGEMENT - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:49] ACKNOWLEDGEMENT - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:58:50] ACKNOWLEDGEMENT - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp2018_v4, cp2018_v6 Ema installing cp2018 as stretch
[06:59:15] sorry for the spam!
[06:59:19] (PS3) Jcrespo: haproxy: Fix bug on /run directory creation [puppet] - https://gerrit.wikimedia.org/r/449179
[07:00:05] (CR) Jcrespo: [C: 2] haproxy: Fix bug on /run directory creation [puppet] - https://gerrit.wikimedia.org/r/449179 (owner: Jcrespo)
[07:01:26] (PS2) Jcrespo: dbproxy-master: Fix hieradata reference typo [puppet] - https://gerrit.wikimedia.org/r/449180
[07:02:16] (CR) Jcrespo: [C: 2] dbproxy-master: Fix hieradata reference typo [puppet] - https://gerrit.wikimedia.org/r/449180 (owner: Jcrespo)
[07:05:25] !log Remove unused and old grants from codfw hosts - T146149#4456082
[07:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:06:46] (CR) Ema: [C: 1] "https://puppet-compiler.wmflabs.org/compiler02/11927/" [puppet] - https://gerrit.wikimedia.org/r/449347 (owner: Dzahn)
[07:09:04] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 16 ESP OK
[07:09:04] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 16 ESP OK
[07:09:13] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 16 ESP OK
[07:09:33] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK
[07:09:44] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK
[07:09:53] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK
[07:10:35] !log installing mutt security updates
[07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:12:39] (PS1) Volans: Fix typo in postgresql::backup [puppet] - https://gerrit.wikimedia.org/r/449409 (https://phabricator.wikimedia.org/T190184)
[07:13:43] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK
[07:14:17] (CR) Volans: [C: 2] Fix typo in postgresql::backup [puppet] - https://gerrit.wikimedia.org/r/449409 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[07:15:15] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2018.codfw.wmnet'] ``` and were **ALL** successful.
[07:16:52] (CR) Volans: "recheck" [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[07:17:45] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2025.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073107...
[07:19:49] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[07:20:10] (PS4) Jcrespo: monitoring: Harmonize check naming to a common set of rules [puppet] - https://gerrit.wikimedia.org/r/448503
[07:20:59] (CR) jerkins-bot: [V: -1] monitoring: Harmonize check naming to a common set of rules [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[07:23:31] (CR) Jcrespo: "I will generate a list of disabled or downtime's services when this is agreed to be deployed and make sure the new ones continue on the sa" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[07:24:16] !log depool codfw mathoid from discovery for kubernetes upgrade
[07:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:24:44] (PS5) Jcrespo: monitoring: Harmonize check naming to a common set of rules [puppet] - https://gerrit.wikimedia.org/r/448503
[07:26:57] (PS1) Volans: Fix typo in postgresql::backup (2) [puppet] - https://gerrit.wikimedia.org/r/449411 (https://phabricator.wikimedia.org/T190184)
[07:27:20] !log backup kubetcd2001 etcd data before kubernetes upgrade
[07:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:28:16] (CR) Volans: [C: 2] Fix typo in postgresql::backup (2) [puppet] - https://gerrit.wikimedia.org/r/449411 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[07:28:53] (PS6) Jcrespo: monitoring: Harmonize check naming to a common set of rules [puppet] - https://gerrit.wikimedia.org/r/448503
[07:32:16] !log run decomission_appserver on terbium
[07:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:32:47] (CR) Jcrespo: "BTW, the change itself is low priority for me- my main concern is agreeing on the rules and document those on Wikitech for future uses." [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[07:35:31] !log upgrade acrab.codfw.wmnet to kubernetes 1.9.9
[07:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:36:54] !log reboot multatuli for kernel tests
[07:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:41:40] (PS1) Jcrespo: Test MySQL 8.0.12 on test-s1 host db1118 [puppet] - https://gerrit.wikimedia.org/r/449413 (https://phabricator.wikimedia.org/T193224)
[07:43:42] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2025.codfw.wmnet'] ``` and were **ALL** successful.
[07:44:19] !log stopping db1118 mariadb and starting mysql there
[07:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:45:51] (CR) Jcrespo: [C: 2] Test MySQL 8.0.12 on test-s1 host db1118 [puppet] - https://gerrit.wikimedia.org/r/449413 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[07:47:16] !log upgrade acrux.codfw.wmnet to kubernetes 1.9.9
[07:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:52:42] (PS5) Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - https://gerrit.wikimedia.org/r/448854
[07:53:41] !log upgrade kubernetes2001 to 1.9.9
[07:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:54:35] !log reboot cp1008 for kernel tests
[07:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[07:57:00] (CR) Jcrespo: [C: 2] Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - https://gerrit.wikimedia.org/r/448854 (owner: Jcrespo)
[08:00:46] !log upgrade kubernetes2002 to 1.9.9
[08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[08:05:26] !log upgrade kubernetes2003 to 1.9.9
[08:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[08:05:54] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (mobrovac)
[08:06:55] (CR) Muehlenhoff: [C: 1] monitoring: Harmonize check naming to a common set of rules [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[08:07:21] (CR) Jcrespo: [C: -1] "https://puppet-compiler.wmflabs.org/compiler02/11928/" [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[08:08:04] !log upgrade kubernetes2004 to 1.9.9
[08:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[08:17:11] !log repool mathoid codfw in discovery, the kubernetes upgrade is done
[08:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[08:17:33] (PS4) Jcrespo: mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224)
[08:22:05] (PS1) Ema: varnishkafka (1.0.12-3) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449420 (https://phabricator.wikimedia.org/T200445)
[08:22:18] Operations, ops-eqiad, DC-Ops, decommission: Decom/reclaim terbium - https://phabricator.wikimedia.org/T200763 (MoritzMuehlenhoff)
[08:23:54] (PS5) Jcrespo: mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224)
[08:23:56] (PS1) Jcrespo: mariadb: Reimage db1104 to stretch [puppet] - https://gerrit.wikimedia.org/r/449421
[08:25:39] (CR) Jcrespo: [C: 1] "https://puppet-compiler.wmflabs.org/compiler03/11930/" [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[08:26:19] (PS6) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[08:27:00] (PS7) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[08:29:52] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[08:31:15] Operations, HHVM, User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393 (MoritzMuehlenhoff)
[08:31:19] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (MoritzMuehlenhoff) Open>Resolved Replacements using stretch are up and running, the decom task for terbium is T200763, closing this task.
[08:31:33] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431 (MoritzMuehlenhoff)
[08:32:12] Operations, Core-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (MoritzMuehlenhoff)
[08:32:15] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff Script runners are now also migrated to stretch, closing the task.
[08:35:18] (CR) Marostegui: [C: 1] mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[08:38:48] (CR) Jcrespo: [C: 2] mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[08:39:34] (CR) ArielGlenn: "After hunting around, it looks like the same material is covered here: https://old.datahub.io/organization/wikimedia" [puppet] - https://gerrit.wikimedia.org/r/449238 (https://phabricator.wikimedia.org/T200705) (owner: Imarlier)
[08:42:04] (CR) Jcrespo: [C: 2] mariadb: Reimage db1104 to stretch [puppet] - https://gerrit.wikimedia.org/r/449421 (owner: Jcrespo)
[08:44:45] !log stop and reimage db1104 to stretch
[08:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[08:48:09] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 45.39, 35.69, 23.02
[08:51:19] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 44.92, 35.65, 23.39
[08:54:40] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 37.51, 35.68, 25.81
[08:57:35] (PS1) Ema: varnishkafka (1.0.13-1) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250)
[09:00:17] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 13.40, 23.34, 23.63
[09:05:19] (Abandoned) Ema: varnishkafka (1.0.12-3) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449420 (https://phabricator.wikimedia.org/T200445) (owner: Ema)
[09:17:40] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 10.17, 12.32, 23.43
[09:22:33] (CR) Elukey: [C: 1] varnishkafka (1.0.13-1) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: Ema)
[09:24:59] (CR) Vgutierrez: [C: 1] varnishkafka (1.0.13-1) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: Ema)
[09:27:54] (CR) Ema: [C: 2] varnishkafka (1.0.13-1) stretch-wikimedia; urgency=medium [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: Ema)
[09:29:54] !log upload varnishkafka 1.0.13-1 to stretch-wikimedia T200445 T186250
[09:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09
[09:29:58] T200445: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445
[09:29:59] T186250: latest varnishkafka fails to build on Debian - https://phabricator.wikimedia.org/T186250
[09:34:54] (PS1) Ema: cp-misc_esams: upgrade to stretch [puppet] - https://gerrit.wikimedia.org/r/449429 (https://phabricator.wikimedia.org/T200445)
[09:35:49] (CR) Ema: [C: 2] cp-misc_esams: upgrade to stretch [puppet] - https://gerrit.wikimedia.org/r/449429 (https://phabricator.wikimedia.org/T200445) (owner: Ema)
[09:36:44] (PS1) Jcrespo: mariadb: Allow reimage of db109X hosts [puppet] - https://gerrit.wikimedia.org/r/449430
[09:40:26] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3008.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073109...
[09:46:24] (03CR) 10Gehel: WIP logstash: add 'id' to syslog input (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449189 (owner: 10Filippo Giunchedi) [09:47:07] (03CR) 10Ema: [C: 032] "recheck" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: 10Ema) [09:49:28] (03CR) 10Hashar: "recheck" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: 10Ema) [09:49:38] (03CR) 10Ema: [C: 032] "recheck" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/449425 (https://phabricator.wikimedia.org/T186250) (owner: 10Ema) [09:52:01] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 50183 MB (10% inode=99%) [10:00:07] (03PS1) 10Ema: Revert "cache: temporarily return 404 for stream.w.o/socket.io" [puppet] - 10https://gerrit.wikimedia.org/r/449434 (https://phabricator.wikimedia.org/T199813) [10:01:47] (03CR) 10Elukey: analytics_cluster::webserver: apache -> httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [10:01:56] (03PS2) 10Jcrespo: mariadb: Allow reimage of db109X hosts [puppet] - 10https://gerrit.wikimedia.org/r/449430 [10:03:02] (03CR) 10Elukey: [C: 031] Revert "cache: temporarily return 404 for stream.w.o/socket.io" [puppet] - 10https://gerrit.wikimedia.org/r/449434 (https://phabricator.wikimedia.org/T199813) (owner: 10Ema) [10:03:49] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db109X hosts [puppet] - 10https://gerrit.wikimedia.org/r/449430 (owner: 10Jcrespo) [10:03:59] (03CR) 10Ema: [C: 032] Revert "cache: temporarily return 404 for stream.w.o/socket.io" [puppet] - 10https://gerrit.wikimedia.org/r/449434 (https://phabricator.wikimedia.org/T199813) (owner: 10Ema) [10:04:08] (03PS2) 10Ema: Revert "cache: temporarily return 404 for stream.w.o/socket.io" [puppet] - 
10https://gerrit.wikimedia.org/r/449434 (https://phabricator.wikimedia.org/T199813) [10:11:20] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10Peachey88) [10:13:02] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3008.esams.wmnet'] ``` and were **ALL** successful. [10:19:54] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3007.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073110... [10:24:22] RECOVERY - Disk space on elastic1020 is OK: DISK OK [10:25:45] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Joe) @Krenair I think I will just reproduce the patches I did to the mediawiki_test environment in the main one, that l... [10:37:38] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Marostegui) >>! In T200297#4462189, @awight wrote: > @Marostegui, I'd like to explore how we might be able to use x... [10:39:33] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 [10:39:53] hmm found the bug ... 
[10:39:55] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 [10:40:06] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 [10:40:09] !log akosiaris@deploy1001 scap-helm mathoid finished [10:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 [10:40:38] (03PS1) 10Sbisson: CleanupParent for draftquality model when PageTriage is used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449437 (https://phabricator.wikimedia.org/T199357) [10:40:57] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10jcrespo) > If we do that, is it possible to do queries that join between x1 and other production tables? You will... [10:46:39] <_joe_> lol why is stashbot referencing https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09 ? [10:46:53] <_joe_> and writing there [10:47:54] <_joe_> before the data for today, there is an entry for September 30 including gems like [10:48:19] <_joe_> 09:20 Tim: Set up MediaWiki UDP logging [10:49:07] <_joe_> and https://wikitech.wikimedia.org/w/index.php?title=MediaWiki_UDP_logging&action=history pins that to September 30th, 2008 [10:49:10] <_joe_> :D [10:49:25] <_joe_> anyone knows more than me about stashbot? [10:50:18] haha fail!
[10:51:49] looks correct in https://tools.wmflabs.org/sal/production but on https://wikitech.wikimedia.org/wiki/Server_Admin_Log the last one is at 04:54 UTC this morning [10:52:06] <_joe_> yeah, the rest are at that page I linked [10:52:53] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3007.esams.wmnet'] ``` and were **ALL** successful. [10:52:55] I deleted a line on https://wikitech.wikimedia.org/wiki/Server_Admin_Log which was added by mistake, manually at around 5:00 UTC [10:53:03] My irssi played tricks on me [10:53:13] <_joe_> oh [10:53:19] <_joe_> that might be the reason? [10:53:30] But why would that redirect stuff to 2008-2009? [10:53:43] Like why deleting a line would do that? [10:53:57] <_joe_> I dunno, but the correlation is pretty clear [10:54:00] https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&type=revision&diff=1798611&oldid=1798610 [10:54:18] Why is that + REDIRECT added? :| [10:54:19] there is a #REDIRECT [[Target page name]] [10:54:40] No idea why that was added [10:55:51] <_joe_> just removed the redirect [10:55:55] <_joe_> !log test log [10:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:01] <_joe_> right [10:56:20] <_joe_> ok mystery solved [10:56:54] _joe_: we should also move the logs in /2008-09 back to the current sal [10:57:04] <_joe_> I am doing that [10:57:08] ack, thx [10:57:10] I can do that [10:57:11] Ah [10:57:15] Thanks _joe_ [10:58:18] <_joe_> done [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] Here [11:00:25] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp3010.esams.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073110... [11:00:34] I can SWAT today [11:00:49] Urbanecm: the usual procedure, I'll ping you when the first patch is at mwdebug :D [11:00:58] Good, thanks [11:03:22] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448154 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:05:35] (03Merged) 10jenkins-bot: Update hewikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448154 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:06:17] That was quite fast, in 13 second a merge? [11:06:50] no, two minutes? 
[11:07:25] still pretty fast [11:07:26] Ehh [11:07:31] I missed the minute thing :D [11:07:32] Urbanecm: 448154 is at mwdebug [11:07:45] ack [11:08:22] (03CR) 10jenkins-bot: Update hewikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448154 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:08:30] zeljkof, please deploy&purge the URL [11:08:47] Urbanecm: ok [11:10:05] !log zfilipin@deploy1001 Synchronized static/images/project-logos/hewikiquote.png: SWAT: [[gerrit:448154|Update hewikiquote logo (T200296)]] (duration: 00m 50s) [11:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:10] T200296: Add HD logos for hewikiquote - https://phabricator.wikimedia.org/T200296 [11:11:41] (03CR) 10Zfilipin: "Purged: T200296#4464771" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448154 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:11:57] Urbanecm: deployed and purged [11:12:11] Thanks [11:12:55] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449389 (https://phabricator.wikimedia.org/T200713) (owner: 10Urbanecm) [11:14:12] (03Merged) 10jenkins-bot: Milestone logo for atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449389 (https://phabricator.wikimedia.org/T200713) (owner: 10Urbanecm) [11:15:12] Urbanecm: 449389 at mwdebug [11:15:15] ack [11:15:40] zeljkof, works, please deploy&purge [11:15:58] Urbanecm: ok [11:16:56] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:449389|Milestone logo for atjwiki (T200713)]] (duration: 00m 49s) [11:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:00] T200713: Change logo on atj.wp for 4 months - https://phabricator.wikimedia.org/T200713 [11:17:53] Urbanecm: deployed and purged [11:17:57] Thanks [11:18:06] (03CR) 10jenkins-bot: Milestone logo for atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449389 
(https://phabricator.wikimedia.org/T200713) (owner: 10Urbanecm) [11:18:46] (03CR) 10Zfilipin: "Purged: T200713#4464786" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449389 (https://phabricator.wikimedia.org/T200713) (owner: 10Urbanecm) [11:19:45] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449390 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:21:00] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893 (10Deskana) >>! In T192893#4462800, @RobH wrote: > Ok, this is set to expire on 2018-08-01. By expire, I mean my google calendar reminds me to manually login and pull up these d... [11:21:00] (03Merged) 10jenkins-bot: Upload HD logos for hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449390 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:22:06] Urbanecm: 449390 at mwdebug [11:22:56] (03CR) 10jenkins-bot: Upload HD logos for hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449390 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:23:08] zeljkof, please deploy [11:23:16] Urbanecm: ok [11:24:21] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:449390|Upload HD logos for hewiktionary (T200470)]] (duration: 00m 48s) [11:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] T200470: Add HD logos for hewiktionary - https://phabricator.wikimedia.org/T200470 [11:24:46] Urbanecm: deployed, do you see the new logos without purging, or should I purge? 
[11:25:00] zeljkof, please do purge [11:25:04] Urbanecm: ok [11:25:56] (03CR) 10Zfilipin: "Purged: T200470#4464798" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449390 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:26:49] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449391 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:27:50] It is visible now, thanks [11:28:06] (03Merged) 10jenkins-bot: Use HD logos for hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449391 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:28:24] (03CR) 10jenkins-bot: Use HD logos for hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449391 (https://phabricator.wikimedia.org/T200470) (owner: 10Urbanecm) [11:28:56] Urbanecm: 449391 at mwdebug [11:29:11] ack [11:29:14] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:29:43] zeljkof, works, please deplly [11:29:44] deploy [11:29:51] Urbanecm: ok [11:29:56] ^^ looking at wdqs [11:30:05] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.016 second response time [11:30:53] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:449391|Use HD logos for hewiktionary (T200470)]] (duration: 00m 48s) [11:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:58] T200470: Add HD logos for hewiktionary - https://phabricator.wikimedia.org/T200470 [11:31:12] Urbanecm: deployed [11:31:21] probably related to T200563 (which means I have no idea what's going on) [11:31:21] T200563: wdq1003 is anomalous - https://phabricator.wikimedia.org/T200563 [11:31:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` 
['cp3010.esams.wmnet'] ``` and were **ALL** successful. [11:32:28] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:33:18] (03CR) 10Zfilipin: Introduce autopatrolled on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:33:26] (03PS2) 10Zfilipin: Introduce autopatrolled on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:33:36] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:34:53] (03Merged) 10jenkins-bot: Introduce autopatrolled on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:36:06] Urbanecm: 448050 at mwdebug [11:36:12] ack [11:36:42] Works, please deploy [11:36:46] Urbanecm: ok [11:37:48] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:448050|Introduce autopatrolled on bnwikisource (T199475)]] (duration: 00m 48s) [11:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:53] T199475: Add autopatrolled group to Bengali Wikisource - https://phabricator.wikimedia.org/T199475 [11:38:17] Urbanecm: deployed, please check and thanks for deploying with #releng ;) [11:38:17] (03CR) 10jenkins-bot: Introduce autopatrolled on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) (owner: 10Urbanecm) [11:38:30] !log EU SWAT finished [11:38:31] thank you [11:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:34] That was quick :) [11:38:48] Urbanecm: yes, we could have deployed 10 today :D [11:39:08] 
I don't have 10 now :D maybe in future SWATs [11:40:16] (03PS1) 10Urbanecm: Revert "Milestone logo for atjwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449445 (https://phabricator.wikimedia.org/T200713) [11:44:31] !log restarting blazegraph on wdqs1003 - JVM unresponsive [11:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:20] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [11:46:29] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.039 second response time [11:51:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:51:09] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [11:53:09] query.wikidata.org related --^ [11:53:23] should already be resolved [11:57:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:57:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [11:58:15] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new 
namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) Just to make sure we are all talking about the same thing: Do I understand correctly that the scalability c... [11:58:54] jouncebot: now [11:58:54] For the next 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1100) [11:58:57] jouncebot: next [11:58:57] In 0 hour(s) and 1 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1200) [11:59:13] (03PS2) 10Reedy: wfLoadExtension for Sentry and LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448178 [11:59:20] (03CR) 10Reedy: [C: 032] wfLoadExtension for Sentry and LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448178 (owner: 10Reedy) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1200) [12:00:51] (03Merged) 10jenkins-bot: wfLoadExtension for Sentry and LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448178 (owner: 10Reedy) [12:01:04] (03CR) 10jenkins-bot: wfLoadExtension for Sentry and LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448178 (owner: 10Reedy) [12:01:11] (03PS2) 10Reedy: Remove LQT config cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448179 [12:01:16] (03CR) 10Reedy: [C: 032] Remove LQT config cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448179 (owner: 10Reedy) [12:02:49] (03Merged) 10jenkins-bot: Remove LQT config cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448179 (owner: 10Reedy) [12:05:08] !log reedy@deploy1001 Synchronized wmf-config/liquidthreads.php: Remove lqt cruft, load with wfLoadExtension (duration: 00m 48s) [12:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:51] !log reedy@deploy1001 
Synchronized wmf-config/CommonSettings.php: Load Sentry with wfLoadExtension (duration: 00m 48s) [12:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:15] (03CR) 10Reedy: "For reference, this is done in MW core now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 (owner: 10Reedy) [12:07:41] (03PS2) 10Reedy: Remove sentry $wmg -> $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448182 [12:07:51] (03CR) 10Reedy: [C: 032] Remove sentry $wmg -> $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448182 (owner: 10Reedy) [12:08:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris) [12:08:40] (03Merged) 10jenkins-bot: Remove sentry $wmg -> $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448182 (owner: 10Reedy) [12:09:30] (03PS2) 10Reedy: Stop logging email changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 [12:09:58] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove wmg - for Sentry (duration: 00m 48s) [12:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:42] (03CR) 10Reedy: [C: 032] "Icc403be286f87a591ebc9d3e07d84b09f8b87713" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 (owner: 10Reedy) [12:11:10] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wmg - for Sentry (duration: 00m 48s) [12:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:56] (03Merged) 10jenkins-bot: Stop logging email changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 (owner: 10Reedy) [12:13:33] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove logging now in mw core (duration: 00m 48s) [12:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:06] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, 
10WMDE-QWERTY-Sprint-2018-07-31: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) [12:20:19] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10jcrespo) Data (external storage) storage is never a concern, because it is a key-value storage and is already shard... [12:29:06] !log rebalance LVS weights to send less traffic to wdqs1003 - T200563 [12:29:08] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=eqiad,cluster=wdqs,name=wdqs1004.eqiad.wmnet [12:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:10] T200563: wdq1003 is anomalous - https://phabricator.wikimedia.org/T200563 [12:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:15] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=eqiad,cluster=wdqs,name=wdqs1005.eqiad.wmnet [12:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:07] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449189 (owner: 10Filippo Giunchedi) [12:41:10] (03CR) 10jenkins-bot: Remove LQT config cruft [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448179 (owner: 10Reedy) [12:41:13] (03CR) 10jenkins-bot: Remove sentry $wmg -> $wg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448182 (owner: 10Reedy) [12:41:15] (03CR) 10jenkins-bot: Stop logging email changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 (owner: 10Reedy) [12:45:12] !log depool eqiad mathoid in discovery [12:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:22] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical 
threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:48:10] (03PS2) 10Filippo Giunchedi: logstash: add 'id' to logstash::input [puppet] - 10https://gerrit.wikimedia.org/r/449189 [12:48:38] (03PS3) 10Filippo Giunchedi: logstash: add 'id' to logstash::input [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) [12:54:33] !log upgrade argon.eqiad.wmnet to kubernetes 1.9.9 [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11931/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/449189 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [12:57:18] (03CR) 10Filippo Giunchedi: "Nowadays all graphite hosts are jessie or stretch, so LGTM. Can you run PCC too?" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [12:58:49] !log upgrade chlorine.eqiad.wmnet to kubernetes 1.9.9 [12:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:19] (03PS1) 10Ema: cp-misc_eqiad: upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/449451 (https://phabricator.wikimedia.org/T200445) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1300) [13:00:36] (03CR) 10Ema: [C: 032] cp-misc_eqiad: upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/449451 (https://phabricator.wikimedia.org/T200445) (owner: 10Ema) [13:01:51] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [13:03:25] !log upgrade kubernetes1001 to 1.9.9 [13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:33] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1045.eqiad.wmnet', 'cp1051.eqiad.wmnet'] ``` The log can be found in `/var/l... [13:12:23] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={image_status,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:13:33] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:14:13] !log upgrade kubernetes1003 to 1.9.9 [13:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:16] !log upgrade kubernetes1002 to 1.9.9 [13:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:43] !log upgrade kubernetes1004 to 1.9.9 [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:29] (03PS2) 10Sbisson: CleanupParent for draftquality model when PageTriage is used [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449437 (https://phabricator.wikimedia.org/T199357) [13:24:32] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,image_status,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:25:32] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:33:40] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1045.eqiad.wmnet', 'cp1051.eqiad.wmnet'] ``` and were **ALL** successful. [13:34:22] !log repool mathoid eqiad cluster in discovery [13:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:41:23] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1058.eqiad.wmnet', 'cp1061.eqiad.wmnet'] ``` The log can be found in `/var/l... [13:51:44] !log installing ffmpeg 3.2.12 security updates on video scalers [13:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:54] 10Operations: Add email addresses for new techcom members to techcom@wikimedia.org - https://phabricator.wikimedia.org/T200799 (10Joe) [13:54:07] 10Operations: Add email addresses for new techcom members to techcom@wikimedia.org - https://phabricator.wikimedia.org/T200799 (10Joe) p:05Triage>03Normal a:03Joe [14:01:12] PROBLEM - Check systemd state on cp3008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:01:50] (03CR) 10Cmjohnson: [C: 032] Adding mgmt/production dns authdns1001 [dns] - 10https://gerrit.wikimedia.org/r/449236 (https://phabricator.wikimedia.org/T196693) (owner: 10Cmjohnson) [14:01:52] (03PS3) 10Cmjohnson: Adding mgmt/production dns authdns1001 [dns] - 10https://gerrit.wikimedia.org/r/449236 (https://phabricator.wikimedia.org/T196693) [14:03:33] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding mgmt/production dns authdns1001 [dns] - 10https://gerrit.wikimedia.org/r/449236 (https://phabricator.wikimedia.org/T196693) (owner: 10Cmjohnson) [14:04:53] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10RStallman-legalteam) [14:05:56] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1058.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1058.eqiad.wmnet'] ``` [14:09:32] RECOVERY - Check systemd state on cp3008 is OK: OK - running: The system is fully operational [14:10:17] (03CR) 10Gehel: [C: 032] Create prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [14:10:26] (03PS5) 10Gehel: Create prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [14:18:06] (03PS9) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [14:19:44] (03PS1) 10Alexandros Kosiaris: Migrate to apps/v1 API [deployment-charts] - 10https://gerrit.wikimedia.org/r/449458 [14:20:50] (03PS1) 10Cmjohnson: Adding mgmt/productin dns torrelay1001 [dns] - 10https://gerrit.wikimedia.org/r/449460 (https://phabricator.wikimedia.org/T196701) [14:22:15] (03CR) 10Cmjohnson: [C: 032] Adding mgmt/productin dns torrelay1001 [dns] - 10https://gerrit.wikimedia.org/r/449460 
(https://phabricator.wikimedia.org/T196701) (owner: 10Cmjohnson) [14:23:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Cmjohnson) [14:26:33] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp1058.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20... [14:30:47] (03PS1) 10Muehlenhoff: Update Cumin alias for labtestweb role rename [puppet] - 10https://gerrit.wikimedia.org/r/449462 [14:31:50] (03PS2) 10Muehlenhoff: Update Cumin alias for labtestweb role rename [puppet] - 10https://gerrit.wikimedia.org/r/449462 [14:32:46] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for labtestweb role rename [puppet] - 10https://gerrit.wikimedia.org/r/449462 (owner: 10Muehlenhoff) [14:33:35] (03PS1) 10Cmjohnson: Updating dns for authdns1001 mv private to public [dns] - 10https://gerrit.wikimedia.org/r/449464 (https://phabricator.wikimedia.org/T196693) [14:34:57] (03CR) 10Cmjohnson: [C: 032] Updating dns for authdns1001 mv private to public [dns] - 10https://gerrit.wikimedia.org/r/449464 (https://phabricator.wikimedia.org/T196693) (owner: 10Cmjohnson) [14:40:48] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1058_v4, cp1058_v6 [14:41:50] ACKNOWLEDGEMENT - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 not-conn: cp1058_v4, cp1058_v6 Ema reimaging 1058 [14:42:56] (03PS1) 10BBlack: cp1075-99: further mkfs tweaks [puppet] - 10https://gerrit.wikimedia.org/r/449466 (https://phabricator.wikimedia.org/T195923) [14:43:18] (03PS1) 10Cmjohnson: Adding mgmt/production dns auth1002 [dns] - 10https://gerrit.wikimedia.org/r/449468 (https://phabricator.wikimedia.org/T196698) [14:46:01] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 16 ESP OK [14:46:15] 
(03CR) 10Cmjohnson: [C: 032] Adding mgmt/production dns auth1002 [dns] - 10https://gerrit.wikimedia.org/r/449468 (https://phabricator.wikimedia.org/T196698) (owner: 10Cmjohnson) [14:46:54] (03CR) 10ArielGlenn: "It seems like the values of dumpFormat and extraFormat can both be nt or ttl; doesn't this mean that $targetDir/$filename.$dumpFormat.gz a" [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [14:47:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10Cmjohnson) [14:50:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1058.eqiad.wmnet'] ``` and were **ALL** successful. [14:51:09] (03PS2) 10Herron: admin: remove expiry attributes of user nettrom [puppet] - 10https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) [14:53:29] (03CR) 10Herron: admin: remove expiry attributes of user nettrom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) (owner: 10Herron) [14:54:42] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:56:53] (03PS3) 10Herron: admin: remove expiry attributes of user nettrom [puppet] - 10https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) [14:57:56] (03CR) 10Herron: [C: 032] admin: remove expiry attributes of user nettrom [puppet] - 10https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) (owner: 10Herron) [15:01:36] (03PS2) 10Jcrespo: WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 [15:01:38] (03PS1) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449469 (https://phabricator.wikimedia.org/T198987) [15:01:59] !log starting the eqiad row B servers move - T183585 [15:02:00] (03CR) 10jerkins-bot: [V: 04-1] db backup statistics: Initial implementation of the backup stats [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449469 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [15:02:02] (03CR) 10jerkins-bot: [V: 04-1] WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 (owner: 10Jcrespo) [15:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:03] T183585: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 [15:03:39] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10herron) 05Open>03Resolved Expiry attributes have been removed from account `nettrom`. I'll transition this... 
[15:04:20] (03PS1) 10Ema: cp3031 (text), cp3044 (upload): upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/449470 (https://phabricator.wikimedia.org/T200445) [15:05:22] (03CR) 10Ema: [C: 032] cp3031 (text), cp3044 (upload): upgrade to stretch [puppet] - 10https://gerrit.wikimedia.org/r/449470 (https://phabricator.wikimedia.org/T200445) (owner: 10Ema) [15:07:12] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893 (10herron) a:05Deskana>03RobH [15:10:18] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3031.esams.wmnet', 'cp3044.esams.wmnet'] ``` The log can be found in `/var/l... [15:17:38] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) [15:18:13] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [15:18:14] PROBLEM - MariaDB Slave SQL: m3 on db1117 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 24361579 for key PRIMARY on query. Default database: phabricator_file. 
[Query snipped] [15:18:14] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [15:20:55] !log depool thumbor100[12] ahead of switch move - T183585 [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:59] T183585: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 [15:22:44] jynus: db1117:3323 wasn't in read only [15:23:01] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) p:05Triage>03Normal a:03Gilles @Gilles I've added the access request checklist for reference. Could you please clarify what group memberships are being requested for... [15:23:14] (03CR) 10Gehel: "minor comment inline (I'll send a patch to correct it)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [15:23:43] can we revert the proxy? [15:24:20] I have put db1117:3323 in read only (so phabricator is read only) until we decide what to do [15:24:58] We can revert the proxy but there is stuff that has been written to db1117 already [15:25:01] db1072 is now up [15:25:15] well, there are also things written on db1072, too [15:25:24] otherwise, it wouldn't have failed [15:25:34] I say revert [15:25:46] ok, I will reload the proxy [15:26:26] I thought the proxies were not affected? [15:26:28] reloaded [15:26:43] otherwise, we could have depooled them [15:27:09] stop db1117 [15:27:16] the m3 instance [15:27:19] db1117 has read only on [15:27:20] so connections die [15:27:23] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 703.58 seconds [15:27:31] ok [15:27:33] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [15:27:34] yeah, buy we need to kill connection [15:27:37] can you check if phab is back read-write? 
[15:27:51] I will stop the instance on db1117 [15:28:03] phab still throwing read-only error when viewing tasks [15:28:07] not yet for me, we need the kill [15:28:17] it is stopping now [15:28:21] <_joe_> isn't it easier to restart apache instead? [15:28:26] <_joe_> on phab, I mean [15:28:35] stopped [15:28:49] _joe_: I think we need both [15:29:06] <_joe_> Can Not Connect to MySQL [15:29:09] marostegui: meanwhile, can you check the active proxy, and where it is pointing? [15:29:36] pha should be back now [15:29:44] back in business [15:29:46] <_joe_> !log restarting apache on phab1001 [15:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:06] jynus: dbproxy1003 is the active [15:30:07] phab uses heaby connection pooling [15:30:08] <_joe_> we might need to do the same with the other daemons [15:30:35] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:30:41] hmm recent updates to tasks seem to be lost [15:30:49] yes [15:30:54] herron: expected :( [15:30:56] (03PS7) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [15:31:05] so apparently dbproxy1003 wasn't part of the maintenace [15:31:06] kk [15:31:14] jynus: nope [15:31:15] or at least I don't see it [15:31:25] PROBLEM - mysqld processes on db1117 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld [15:31:32] that's going to page isn't it? 
[15:31:34] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [15:31:36] jynus: I guess the downtime for db1072 was too long and it got caught by haproxy [15:32:12] then we should have removed the failover in advance [15:32:39] last time it didn't switch [15:32:43] exactly [15:32:50] we should do that with db1073, just in case [15:32:55] as it is part of the maintenance as well [15:33:10] (m5) [15:33:11] failover is at :18 [15:34:44] the read only could be a bug of misc_multiinstance, as it is only on 2 hosts [15:35:16] yeah, my.cnf has read_only=0 [15:36:40] the thing is, by having 1072 as the master, at leasy we have 2 other copies [15:36:52] the other way around we only have one availabe copies [15:36:55] I am going to set read only manually on the other instances of db1117 for now [15:37:16] m5 will not be an issue becaue the apps don't use the proxy yet [15:37:34] so, we can tell XioNoX to proceed with B3 to avoid blocking this [15:37:41] sure [15:37:51] XioNoX: Feel free to proceed with B3 [15:38:03] I am going to bring up db1117:3323 up with read only ON [15:38:09] wait [15:38:13] sure [15:38:28] maybe we should do it on a separate port [15:38:37] just in case [15:38:39] to be extra careful? [15:38:39] sure [15:38:43] 3333 sounds good? [15:38:43] or reconfiguring both proxies [15:38:55] marostegui: ok [15:38:55] what do you mean? [15:38:56] whatever that makes it impossible to happen again [15:39:05] so nobody points to it [15:39:09] ah right [15:39:12] I get you [15:39:14] marostegui: there are db servers in b3, should we skip them? [15:39:21] e.g. 
imagine we lose db1072 again [15:39:26] XioNoX: no, you can go ahead [15:39:29] ok [15:39:32] jynus: yeah, good point [15:39:40] jynus: I will bring it up with 3333 [15:39:45] cool [15:39:56] I will also bring it up with read only [15:41:17] Ah right, firewall rules :) [15:41:25] RECOVERY - mysqld processes on db1117 is OK: PROCS OK: 4 processes with command name mysqld [15:41:53] that is ok, socket should be ok [15:41:53] ok, it is now up and replication is broken of course [15:42:07] I just want to know how many events were lost [15:42:32] yeah, now it is time to do some archeology [15:43:36] probably from 18 to 27 or so [15:44:28] I put it in read only around .26 [15:44:55] but existing connections may continue writing to the existing host [15:45:04] :( [15:45:11] we can check on stats [15:45:25] 10Operations, 10Traffic: cp3033: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (10ema) [15:45:29] 10Operations, 10Traffic: cp3033: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (10ema) p:05Triage>03Normal [15:45:57] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=prometheus,service=prometheus,name=prometheus1004.eqiad.wmnet [15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:07] 10Operations, 10Traffic: cp3031: Power required by the system exceeds the power supplied by the Power Supply Units - https://phabricator.wikimedia.org/T200806 (10ema) [15:46:13] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=eventbus,service=eventbus,name=kafka1002.eqiad.wmnet [15:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:29] so 
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1117&var-port=13323&from=now-1h&to=now [15:46:30] vs [15:46:42] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104 [15:47:13] (03PS8) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [15:48:21] jynus: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1117&var-port=13323&from=1533048691983&to=1533051952360 [15:48:26] writes seem to stop at .26 [15:48:44] PROBLEM - Host logstash1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:46] and db1072: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104 [15:49:24] PROBLEM - Host maps1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:11] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=aqs,service=aqs,name=aqs1004.eqiad.wmnet [15:50:12] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=aqs,service=cassandra,name=aqs1004.eqiad.wmnet [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:57] (03PS7) 10Gehel: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) (owner: 10EBernhardson) [15:51:00] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=druid-public,service=druid-public-broker,name=druid1005.eqiad.wmnet [15:51:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:42] actually it had read_only = 1 on config, so puppet was ok [15:51:47] no [15:51:51] I changed that before starting it [15:51:54] and disabled puppet :) [15:51:58] it was 0 [15:51:59] ah, ok [15:53:04] RECOVERY - Host logstash1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:53:34] RECOVERY - Host maps1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:53:35] PROBLEM - Host kafka1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:59] so I think it was a combination of being hardcoded on most other templates [15:54:09] but not on manifest [15:54:12] Yeah [15:54:28] We should probably start with read only ON no matter what ,like we do with core [15:54:51] (03CR) 10Gehel: [C: 032] Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) (owner: 10EBernhardson) [15:55:50] We need to rebuild db1117:3323 [15:55:59] we have time [15:56:04] (03PS1) 10Jcrespo: mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 [15:56:05] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [15:56:18] backups generates from it [15:56:37] we can leave it like that and recover it from codfw or the master [15:56:44] yeah, I was going to suggest that [15:56:50] As backups start tonight or yesterday? 
[15:56:50] meanwhile we cam point to the codfw replica [15:56:57] (03CR) 10Marostegui: [C: 031] mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 (owner: 10Jcrespo) [15:56:59] tonight [15:57:28] (03PS2) 10Jcrespo: mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 [15:58:04] RECOVERY - Host kafka1002 is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [15:58:55] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.039 second response time [15:59:50] (03PS1) 10Marostegui: dbproxy100{3,8}: Point m3 secondary to codfw [puppet] - 10https://gerrit.wikimedia.org/r/449479 [15:59:54] jynus: ^ [16:00:04] godog, moritzm, and _joe_: #bothumor I � Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1600). [16:00:05] Krenair: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:09] I was doing that marostegui [16:00:15] Ah :) [16:00:22] I can review mine and compare, then discard mine [16:00:31] sure, whichever you prefer :) [16:00:41] hm [16:00:52] lots of activity here and a bad status line [16:01:01] Krenair: we are good now [16:01:05] is stuff happening as normal or [16:01:06] oh ok [16:01:07] (03PS3) 10Jcrespo: mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 [16:01:09] (03PS1) 10Jcrespo: Point m3 to the codfw replica [puppet] - 10https://gerrit.wikimedia.org/r/449481 [16:01:11] (from a phab point of view) [16:01:15] <_joe_> well, dbs are ok [16:01:25] <_joe_> we're in the middle of a complex hardware maintenance [16:01:27] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=prometheus,service=prometheus,name=prometheus1004.eqiad.wmnet [16:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:44] <_joe_> we should've cancelled the deployment window [16:01:46] _joe_: dbs not ok, but the service that failed is up [16:02:04] although we are trying to fix the reduced redundancy [16:02:04] <_joe_> yeah sorry, badly worded [16:02:08] :-) [16:02:11] jynus: our puppet changes are identical [16:02:16] there's no rush to deploy any of the puppet changes on the list [16:02:34] that said review would be appreciated [16:03:04] even if stuff is not actually being deployed today [16:03:11] (03Abandoned) 10Jcrespo: Point m3 to the codfw replica [puppet] - 10https://gerrit.wikimedia.org/r/449481 (owner: 10Jcrespo) [16:03:16] <_joe_> Krenair: ack :) [16:03:19] (03CR) 10Jcrespo: [C: 031] dbproxy100{3,8}: Point m3 secondary to codfw [puppet] - 10https://gerrit.wikimedia.org/r/449479 (owner: 10Marostegui) [16:03:25] <_joe_> Krenair: I've seen your ping about apache changes [16:03:34] (03CR) 10Marostegui: [C: 032] dbproxy100{3,8}: Point m3 secondary to codfw [puppet] - 10https://gerrit.wikimedia.org/r/449479 (owner: 
10Marostegui) [16:03:43] (03CR) 10Jcrespo: [C: 032] mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 (owner: 10Jcrespo) [16:03:48] <_joe_> have you seen what I did in the mediawiki_test environment on that front? [16:03:48] I will merge both patches [16:03:51] (03PS4) 10Jcrespo: mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 [16:03:52] jynus: ^ [16:03:58] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Make misc multiinstance read only always [puppet] - 10https://gerrit.wikimedia.org/r/449478 (owner: 10Jcrespo) [16:04:05] _joe_: He has. He abandoned his patch(es) last night because of yours :P [16:04:11] <_joe_> ahah ok [16:04:18] <_joe_> sorry still playing catchup [16:04:22] _joe_, I think I saw a bit briefly [16:04:22] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) p:05Triage>03Normal [16:04:31] jynus: deployed. I will enable puppet back on db1117 [16:05:13] sigh [16:05:27] <_joe_> We need to remove this shit from the public logs? 
[16:05:47] pretty sure all of freenode is getting spammed with this stuff [16:06:01] !log reboot cp3044 for kernel update [16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:05] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) a:03Gilles @Gilles I've added the access request checklist for reference. Could you please clarify what group memberships are being requested for these users? Also, do t... [16:06:06] <_joe_> yeah, but only our channels have logs hosted by us :) [16:06:41] marostegui: do you or I reload the proxies? [16:06:53] I can do that [16:07:52] !log Reload dbproxy1003 and dbproxy1008 [16:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:25] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [16:08:36] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [16:08:50] if the failure happened fast, marostegui- we may not need to do a full recovery, let me check [16:09:08] (03PS1) 10Ayounsi: eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/449484 (https://phabricator.wikimedia.org/T183585) [16:10:43] _joe_, the mediawiki_test apache stuff is looking good [16:10:58] (03CR) 10Ema: [C: 031] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/449484 (https://phabricator.wikimedia.org/T183585) (owner: 10Ayounsi) [16:11:48] (03PS1) 10Marostegui: db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/449486 [16:12:04] (I am checking the exact binlog positions) [16:12:33] _joe_, so the plan is to move a small number of sites at a time and see how it goes? [16:13:15] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [16:13:27] (03CR) 10Marostegui: [C: 032] db1117: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/449486 (owner: 10Marostegui) [16:13:48] (03CR) 10Ayounsi: [C: 032] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/449484 (https://phabricator.wikimedia.org/T183585) (owner: 10Ayounsi) [16:13:51] <_joe_> Krenair: to reproduce the procedure we did there for all the appservers, yes [16:13:57] (03PS2) 10Ayounsi: eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/449484 (https://phabricator.wikimedia.org/T183585) [16:14:16] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) a:05Gilles>03Imarlier Was just told @Gilles is out of the office. @Imarlier would you be able to answer the above? [16:14:29] jynus: I am going to enable puppet back on db1117 [16:14:37] Mysql will remain on 3333 for now [16:16:01] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Imarlier) @herron Unfortunately, I can't really answer these questions, @Gilles has been driving this process. It'll have to wait for him to have a chance to respond. I will ping... [16:16:01] !log precautionary restart of eventbus on kafka1002 after network downtime (DNS name res errors, Kafka broker conn issues, etc..) 
[16:16:04] (03CR) 10Thcipriani: [C: 031] jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [16:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:46] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3031_v4, cp3031_v6 [16:17:50] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3031.esams.wmnet'] ``` Of which those **FAILED**: ``` ['cp3031.esams.wmnet'] ``` [16:18:03] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) 05Open>03stalled a:05Imarlier>03Gilles Thanks @Imarlier sounds good. Marking as stalled until then. [16:18:45] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1075 is CRITICAL: connect to address 10.64.0.130 and port 3128: Connection refused [16:19:05] sorry cp1075 is just a downtime expiry [16:19:15] I'll fix it in a sec [16:20:48] feature request: icinga should send an IRC notice when a downtime expires, so it's obvious if that's why an alert appears :) [16:24:34] (03PS1) 10Imarlier: dumps: Datahub has moved [puppet] - 10https://gerrit.wikimedia.org/r/449496 (https://phabricator.wikimedia.org/T200705) [16:24:38] (03Abandoned) 10Imarlier: dumps: datahub no longer exists [puppet] - 10https://gerrit.wikimedia.org/r/449238 (https://phabricator.wikimedia.org/T200705) (owner: 10Imarlier) [16:24:54] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=aqs,service=aqs,name=aqs1004.eqiad.wmnet [16:24:57] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=aqs,service=cassandra,name=aqs1004.eqiad.wmnet [16:24:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:03] !log pool eventbus on kafka1002 after network maintenance [16:25:05] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893 (10RobH) Ok, parsing the above task and comments (which I reference below), we had a number of domains to remove them from. The comments below list them out, plus the addition o... [16:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:18] !log repool thumbor100[12] - T183585 [16:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:24] T183585: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 [16:26:14] !log ayounsi@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=dns,service=pdns_recursor,name=chromium.wikimedia.org [16:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:35] 10Operations, 10Epic, 10Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616 (10Gehel) [16:30:37] 10Operations, 10Maps (Maps-data): Improve automation around Maps servers - https://phabricator.wikimedia.org/T138017 (10Gehel) 05Open>03declined Too vague description. Some of the points have already been addressed in more specific tasks. 
[16:30:44] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:31:04] 10Operations, 10Epic, 10Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616 (10Gehel) 05Open>03declined Too vague description. Some of the points have already been addressed in more specific tasks. [16:31:15] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449396 [16:31:24] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:31:55] PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:32:24] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:32:50] mw1288 was under network maintenance --^ [16:33:09] so uh, anyone got any questions about the puppet patches? [16:33:54] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:34:04] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:34:05] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:34:17] 10Operations, 10Epic, 10Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616 (10Mholloway) [16:34:19] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Gilles) They need access to Eventlogging data, preferably on Hadoop/Spark [16:37:15] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:37:24] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:40:07] !log repool druid1005 after network maintenance [16:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:45] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:24] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:55] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:42:35] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:04] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:14] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:43:35] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:43:36] 10Operations, 10Puppet, 10Maps: Refactor puppet-postgresql module to use custom types - https://phabricator.wikimedia.org/T150020 (10Gehel) 05Open>03declined Let's do this on the fly as part of other changes. [16:44:15] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:44:20] anyone looking at all those puppet failures? [16:44:32] volans: yeah already alerted Arzhel [16:44:53] the mw ones seems to be due to network maintenance, can't ping puppetmaster1001 from those hosts [16:45:02] XioNoX: let's chat in here [16:45:04] Could not retrieve catalog from remote server: execution expired [16:45:25] just got the same from scb1002 as well [16:45:32] and it wasn't part of the maintenance afaics [16:45:35] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:35] XioNoX: --^ [16:45:42] mine was from scb1002 [16:45:43] yeah, looking [16:46:40] is this expected as part of some maintenance ongoing? 
Sorry I was buried in code and didn't read all the recent backlog [16:46:49] not really [16:46:52] so puppetmaster1001 has been moved to the new switch, without disabling puppet first [16:47:02] (was planned for later) [16:47:13] so it might be fallouts from there [16:47:25] ah I didn't see that puppetmaster1001 was in the list [16:47:25] sigh [16:47:56] ok, then it makes sense, you can https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed if you want [16:47:57] lower down, yeah [16:48:14] I have to step out now, sorry [16:48:18] ok, thx, will do it [16:48:25] but seems nothing critical/permanent if they recover [16:48:25] XioNoX: interesting - from puppetmaster1001 I can ping mw1305.eqiad.wmnet, but not the opposite [16:48:27] looking at connectivity issues [16:48:31] if they keep failing we have something else ofc [16:48:36] stale mac maybe [16:48:39] thanks [16:48:39] arp* [16:48:45] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:34] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:41] elukey: so v4 ping works but not v6... [16:49:52] from mw to puppetmaster [16:50:15] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848 (10Mholloway) [16:50:29] 10Operations, 10Maps, 10Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744 (10Mholloway) 05Open>03Resolved [16:50:31] XioNoX: ahh yes right, I didn't notice it [16:50:35] 10Operations, 10Maps, 10Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744 (10Gehel) Maps are considered production for some time. 
There will always be things to be improve, but outside of this task [16:50:49] but I can ping puppetmaster's linklocal address, as they are in the same vlan [16:51:45] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:14] XioNoX: is 2620:0:861:102:10:64:16:73 still correct after the move? [16:52:15] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:44] yeah, it's in the same vlan [16:52:54] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:52:56] okok [16:53:02] so all these are going via ipv6 [16:53:04] interesting [16:53:54] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:53:54] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:33] they have the proper v6 neighbor ip/mac [16:54:44] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:45] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:00] is v6 autoconf flapping on one or both sides perhaps? [16:55:08] (router advert expiring or whatever, multicast-related issues?0 [16:55:16] 2620:0:861:102:1618:77ff:fe61:38f4 dev eno1 INCOMPLETE [16:55:35] puppetmaster isn't learning the mac from the other hosts [16:56:53] let's move it back to the previous host [16:56:55] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:56:56] er, switch [16:57:14] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:37] PROBLEM - puppet last run on mw1306 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:48] PROBLEM - puppet last run on mwmaint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1700). [17:00:24] aside from the puppetfails issue, any other signs of network-induced problems? [17:00:47] (trying to catch up a bit from not following closely earlier!) [17:01:57] PROBLEM - puppet last run on aqs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:11] seems limited to puppet failures due to clients not being able to resolve the ipv6 address of puppetmaster1001 [17:02:37] ok 
[17:04:04] still trying to get a hold of Chris [17:04:48] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.000 second response time [17:06:12] ok, puppetmaster1001 v6 is back [17:06:22] weeeeeeird issue [17:06:47] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:06:50] I'd guess a bug on the switch, probably need to bounce the port but didn't want to keep live troubleshoting it [17:07:18] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:08:07] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:08:27] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:08:37] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:09:47] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:09:57] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:10:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449396 (owner: 10Marostegui) [17:11:07] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:11:17] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:11:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449396 (owner: 10Marostegui) [17:11:38] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently 
enabled, last run 28 seconds ago with 0 failures [17:11:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool all the hosts in row B" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449396 (owner: 10Marostegui) [17:11:58] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:12:28] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:12:45] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) If my reading of puppet is right, what we need will be: on kafkamon1001.e... [17:12:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all hosts in row B - T183585 (duration: 00m 51s) [17:12:59] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4464608, @Marostegui wrote: > What does: "our revision data on x1" means? Your own set of ta... [17:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:00] T183585: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 [17:13:08] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:13:08] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:14:04] moritzm, bblack, correct, more exactly puppetmaster1001, for some reasons, was not able to learn the MAC addresses of the servers' v6 IPs. So as its cache expired, puppet runs failed. 
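A minimal sketch of how the symptom described above (puppetmaster1001 not learning the MAC addresses behind the servers' v6 IPs) could be spotted from neighbor-table output. It assumes text in the style of `ip -6 neigh show`, like the `INCOMPLETE` entry pasted at 16:55:16; the function name, the constant, and the second sample row are hypothetical and only there for illustration, not something used in the channel.

```python
# States in `ip -6 neigh show` style output that indicate the
# link-layer (MAC) address has not been resolved. Assumed set.
UNRESOLVED_STATES = {"INCOMPLETE", "FAILED"}


def unresolved_neighbors(neigh_output):
    """Return (address, device, state) for entries without a usable MAC."""
    results = []
    for line in neigh_output.splitlines():
        fields = line.split()
        # Expected shape: <address> dev <device> [lladdr <mac>] [router] <STATE>
        if len(fields) < 4 or fields[1] != "dev":
            continue
        address, device, state = fields[0], fields[2], fields[-1]
        if state in UNRESOLVED_STATES:
            results.append((address, device, state))
    return results


# First row is the entry quoted in the log; second row is a made-up
# healthy entry for contrast.
SAMPLE = """\
2620:0:861:102:1618:77ff:fe61:38f4 dev eno1 INCOMPLETE
fe80::1 dev eno1 lladdr 00:11:22:33:44:55 router REACHABLE
"""

if __name__ == "__main__":
    for address, device, state in unresolved_neighbors(SAMPLE):
        print(f"{address} on {device}: {state}")
```

Run against a saved copy of the neighbor table, this only lists the entries stuck without a resolved MAC; it does not explain why neighbor discovery failed on the switch.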
[17:14:58] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:15:07] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:07] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:17:27] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:59] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) @BBlack and @Reedy - One of the places that did not seem to respect the temporary nature of that U... [17:19:07] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:19:38] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4464943, @jcrespo wrote: > There is an issue because there is not way to prevent recursive s... 
[17:19:40] !log branching 1.32.0-wmf.15 refs T191061 [17:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:44] T191061: 1.32.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T191061 [17:19:47] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:19:58] RECOVERY - puppet last run on mwmaint1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:22:17] RECOVERY - puppet last run on aqs1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:23:08] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:25:41] (03CR) 10Smalyshev: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [17:26:31] bblack: i was told your request is supported in icinga2 [17:26:58] PROBLEM - Host labvirt1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:08] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:27:19] 10Operations, 10DNS, 10Traffic, 10WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10greg) [17:28:08] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:28:17] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:29:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:29:40] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 3 minutes ago 
with 0 failures [17:29:50] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:29:51] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:30:52] RECOVERY - Host labvirt1009 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [17:34:27] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=dns,service=pdns_recursor,name=chromium.wikimedia.org [17:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:42] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) >>! In T200297#4464630, @jcrespo wrote: > > You will not be able to join data on wiki storage and metadata... [17:37:38] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Ottomata) They will need `analytics-privatedata-users` for access to EventLogging data in Hadoop. [17:38:29] (03PS2) 10Gehel: Enable constraints loading everywhere [puppet] - 10https://gerrit.wikimedia.org/r/447742 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:38:56] (03PS1) 10Ayounsi: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/449508 [17:39:13] (03CR) 10Gehel: [C: 032] Enable constraints loading everywhere [puppet] - 10https://gerrit.wikimedia.org/r/447742 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:40:11] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) Do you need the data mirrored between the two different Kafka clusters? If so... 
[17:40:31] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:55] ^ known, (and depooled) [17:41:09] was about to ask, thx [17:41:13] (03CR) 10Ayounsi: [C: 032] Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/449508 (owner: 10Ayounsi) [17:42:47] (03PS9) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [17:43:11] 10Operations, 10DNS, 10Traffic, 10WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10BBlack) They may have cached it during the brief time it was a 301 rather than a 302 in the changes above, rather... [17:44:04] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) Yes it will need to be mirrored, to be consumed in each datacenter separat... [17:44:07] (03PS1) 10Andrew Bogott: Add some delays and polling while dns updates. [puppet] - 10https://gerrit.wikimedia.org/r/449509 [17:44:29] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) [17:44:40] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:48:17] (03PS1) 10Andrew Bogott: Updated labtest_pool_config.yml to support two pdns servers [puppet] - 10https://gerrit.wikimedia.org/r/449512 (https://phabricator.wikimedia.org/T199578) [17:48:24] (03CR) 10Andrew Bogott: [C: 032] Add some delays and polling while dns updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/449509 (owner: 10Andrew Bogott) [17:48:36] (03PS2) 10Andrew Bogott: Updated labtest_pool_config.yml to support two pdns servers [puppet] - 10https://gerrit.wikimedia.org/r/449512 (https://phabricator.wikimedia.org/T199578) [17:48:36] 1004 to 1006 are the backup lvs, they don't have traffic anyway, so starting with them [17:49:31] (03CR) 10Andrew Bogott: [C: 032] Updated labtest_pool_config.yml to support two pdns servers [puppet] - 10https://gerrit.wikimedia.org/r/449512 (https://phabricator.wikimedia.org/T199578) (owner: 10Andrew Bogott) [17:50:46] 10Operations, 10DNS, 10Traffic, 10WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) Thank you for the quick response! I am open to ideas. In IRC it was suggested that we make some updates... [17:51:20] XioNoX: do you want me to look at anything re: phab1001 . it seems nothing needed, right [17:51:28] because today is the 31st.. ack [17:51:40] mutante: phab has been moved, no expected issues [17:51:51] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:51:59] not sure what you mean about 31st? [17:52:02] (03PS6) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [17:53:00] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:53:25] !log stopping pybal on lvs1001 [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:30] XioNoX: just because i realized we are iin the window you announce you are aiming for.. 
but it wasnt clear if all can be done [17:54:06] (03PS1) 10Zoranzoki21: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) [17:54:39] (03PS3) 10Reedy: Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) [17:54:59] (03CR) 10jerkins-bot: [V: 04-1] Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) (owner: 10Reedy) [17:55:06] (03PS2) 10Zoranzoki21: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) [17:55:50] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:56:01] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [17:57:35] expected ^ [17:58:07] (03PS2) 10Ayounsi: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/449508 [17:58:17] (03PS4) 10Reedy: Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) [17:59:49] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=4) [18:01:18] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy [18:01:19] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:01:29] !log repool lvs1001 [18:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:14] 10Operations, 10ops-eqiad, 10DC-Ops, 
10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10RobH) a:03Bstorm So, this has a 300GB SFF SAS Disk. We don't have any of those spare, but we do have a ton of 300GB Intel 710SSDs, according to the spares tracking; Intel... [18:04:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) I think this is a lovely idea as long as no other disks die in the meantime :) So far so good on that end. [18:04:20] !log stopping pybal on lvs1002 [18:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:38] (03CR) 10Dzahn: [C: 032] "thanks for compiling. that's just a puppet refactoring thing that's noop" [puppet] - 10https://gerrit.wikimedia.org/r/449347 (owner: 10Dzahn) [18:04:58] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf1001.eqiad.wmnet:2379 (min=4) [18:05:02] (03CR) 10Ayounsi: [V: 032 C: 032] Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/449508 (owner: 10Ayounsi) [18:05:51] (03PS10) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:07:08] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:07:15] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:07:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Andrew) Sounds fine to me! 
[18:08:08] PROBLEM - PyBal connections to etcd on lvs1002 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=14) [18:09:08] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:10:57] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1020 - https://phabricator.wikimedia.org/T194855 (10RobH) a:03Cmjohnson Ok, Dasher/HP states these shipped with battery systems already in place on the mainboard for the raid controllers, and have attached a file for review. Since the pdf of the email has e... [18:10:59] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10RobH) a:03Cmjohnson Ok, Dasher/HP states these shipped with battery systems already in place on the mainboard for the raid controllers, and have attached a file for review. Since the pdf of the email has e... [18:14:18] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:16:24] jouncebot: next [18:16:24] In 0 hour(s) and 43 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1900) [18:17:43] (03PS11) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:18:18] RECOVERY - PyBal connections to etcd on lvs1002 is OK: OK: 14 connections established with conf1001.eqiad.wmnet:2379 (min=14) [18:18:33] (03PS12) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:18:35] (03CR) 10jerkins-bot: [V: 04-1] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:20:48] !log stopping puppet across the fleet for puppetmaster1001 uplink move 
[18:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:40] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782 (10Varnent) 05Open>03declined No longer applies to new site. [18:24:48] (03CR) 10Gehel: "puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/11934/" [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [18:25:05] (03PS2) 10Dzahn: failoid/configcluster:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449344 [18:30:46] 10Operations: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10Dzahn) [18:31:26] 10Operations, 10Mathoid: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10Dzahn) p:05Triage>03Normal [18:31:28] 10Operations, 10Mathoid: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10Dzahn) [18:39:28] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:40:09] (03Abandoned) 10Aaron Schulz: Enable prefix routing wildcards for mcrouter purge broadcasting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440471 (owner: 10Aaron Schulz) [18:40:55] (03CR) 10Dzahn: [C: 032] failoid/configcluster:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449344 (owner: 10Dzahn) [18:41:24] (03PS1) 10Volans: Fix prospector tests [software/cumin] - 10https://gerrit.wikimedia.org/r/449519 [18:42:37] (03Abandoned) 10Aaron Schulz: Only send cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440473 (owner: 10Aaron Schulz) [18:42:40] (03Abandoned) 10Aaron Schulz: Use "memcached-mcrouter" as the 
main cache type for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440472 (owner: 10Aaron Schulz) [18:46:45] 10Operations, 10Analytics, 10ChangeProp, 10Services (designing), 10Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (10Ottomata) Should be doable! [18:48:28] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 
[18:50:17] (03CR) 10Ottomata: [C: 031] profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) (owner: 10Elukey) [18:50:57] (03PS1) 10Thcipriani: Beta: ensure deployment-deploy01 is a co-master [puppet] - 10https://gerrit.wikimedia.org/r/449520 (https://phabricator.wikimedia.org/T192561) [18:51:00] (03PS1) 10Thcipriani: Beta: Make deployment-deploy01 main deploy server [puppet] - 10https://gerrit.wikimedia.org/r/449521 (https://phabricator.wikimedia.org/T192561) [18:52:32] (03CR) 10Thcipriani: [C: 04-1] "Some Jenkins work needs to be done before this merges" [puppet] - 10https://gerrit.wikimedia.org/r/449521 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [19:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T1900). 
[19:00:16] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @jcrespo Another angle of this that I'd like your input on is the association between judgment page title a... [19:00:41] !log enabling `unchecked_tombstone_compaction` on enwiki_T_mobile__ng_remaining -- T192689 [19:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:49] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [19:04:38] (03CR) 10Ottomata: "Cool! Too bad we can't use hiera_hash, it seems made for this problem. :/" [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:04:41] (03CR) 10Ottomata: [C: 031] [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:05:25] (03PS1) 10Dzahn: puppetdb: add postgres backup to bacula [puppet] - 10https://gerrit.wikimedia.org/r/449523 [19:05:28] (03CR) 10Ottomata: [C: 031] EventStreams: Use the default log level (warn) [puppet] - 10https://gerrit.wikimedia.org/r/448152 (owner: 10Mobrovac) [19:06:09] (03CR) 10Ottomata: [C: 031] role::archiva: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/448569 (owner: 10Elukey) [19:08:52] 10Operations: v6 ND failure on puppetmaster1001/asw2-b-eqiad - https://phabricator.wikimedia.org/T200838 (10ayounsi) p:05Triage>03Normal [19:13:41] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) Will cleanup description with remaining servers Not moved: puppetmaster1001, see T200838 Moved: ```lines=15 === No s... 
[19:15:38] (03CR) 10Ebe123: [C: 031] Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) (owner: 10Reedy) [19:22:26] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10netops: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) [19:22:44] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10netops: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) [19:25:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:25:28] RECOVERY - Check systemd state on labtestservices2002 is OK: OK - running: The system is fully operational [19:40:12] (03PS1) 10Ayounsi: Extend cp to cp ipsec MTU 1450 to codfw [puppet] - 10https://gerrit.wikimedia.org/r/449526 (https://phabricator.wikimedia.org/T195365) [19:42:11] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 (10ayounsi) [19:46:41] (03PS13) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [19:47:02] (03PS14) 10Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 
(https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [19:47:53] (03CR) 10Gehel: [C: 032] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [19:48:12] (03CR) 10MSantos: [C: 031] Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [19:50:32] (03CR) 10Ayounsi: "Compiler output looks good: https://puppet-compiler.wmflabs.org/compiler02/11935/" [puppet] - 10https://gerrit.wikimedia.org/r/449526 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [19:52:16] (03PS1) 10Gehel: maps: re-enable osm replication [puppet] - 10https://gerrit.wikimedia.org/r/449528 (https://phabricator.wikimedia.org/T200228) [19:53:18] (03CR) 10Gehel: [C: 032] maps: re-enable osm replication [puppet] - 10https://gerrit.wikimedia.org/r/449528 (https://phabricator.wikimedia.org/T200228) (owner: 10Gehel) [19:56:11] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10herron) p:05Triage>03High a:03Dzahn [19:57:21] (03PS1) 10Cmjohnson: Adding mgmt/production dns graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/449529 (https://phabricator.wikimedia.org/T196484) [19:57:56] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10herron) p:05Triage>03Normal [19:58:09] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10herron) a:03herron [19:58:19] (03CR) 10Cmjohnson: [C: 032] Adding mgmt/production dns graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/449529 (https://phabricator.wikimedia.org/T196484) (owner: 10Cmjohnson) [19:58:25] 10Operations, 
10Performance-Team, 10Wikimedia-Mailing-lists, 10User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10herron) [20:01:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labstore1003: more SMART failures - https://phabricator.wikimedia.org/T199780 (10herron) p:05Triage>03High [20:01:22] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10Cmjohnson) [20:02:01] maps seem down on beta. ongoing maintenance, or broken? https://maps-beta.wmflabs.org/img/osm-intl,2,30.1,0,400x300.png?lang=en [20:02:10] that gives me "502 Bad Gateway nginx/1.13.6" [20:02:19] 10Operations, 10ops-codfw: wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (10herron) p:05Triage>03High [20:02:46] (03CR) 10Gehel: [C: 031] "very minor comment inline, otherwise LGTM" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/449519 (owner: 10Volans) [20:02:51] that URL is loaded when i view https://en.wikipedia.beta.wmflabs.org/w/index.php?title=User:RYasmeen_(WMF)&oldid=382671 [20:09:03] (03PS1) 10Krinkle: webperf: Set mpm=worker explicitly for httpd. [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) [20:09:26] (03CR) 10Krinkle: "Will try out on beta later." [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [20:09:37] (03CR) 10jerkins-bot: [V: 04-1] webperf: Set mpm=worker explicitly for httpd. [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [20:09:55] MatmaRex, well maps-beta is a proxy to 10.68.18.91:6533 [20:10:15] alex@alex-laptop:~$ nslookup 10.68.18.91 labs-ns0.wikimedia.org [20:10:15] 91.18.68.10.in-addr.arpa name = deployment-maps03.deployment-prep.eqiad.wmflabs. 
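The `nslookup` on a bare IPv4 address above is a reverse (PTR) lookup: the octets are reversed and `.in-addr.arpa` is appended to form the query name. A minimal sketch of that mapping, using only Python's standard `ipaddress` module; the helper name `ptr_name` is hypothetical and does not come from the log:

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the reverse-DNS (PTR) query name for an IP address."""
    # reverse_pointer reverses the octets and appends the in-addr.arpa suffix
    return ipaddress.ip_address(ip).reverse_pointer


# The proxy target address quoted in the channel
print(ptr_name("10.68.18.91"))
```

This only builds the query name; resolving it to `deployment-maps03` still requires asking a resolver that serves that zone, as the `nslookup` against `labs-ns0.wikimedia.org` did.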
[20:10:52] krenair@deployment-maps03:~$ sudo lsof -i :6533 [20:10:52] krenair@deployment-maps03:~$ [20:12:02] that's supposed to be kartotherian I think [20:12:08] based on a puppet grep for that port number [20:13:21] /etc/kartotherian/config.yaml has [20:13:27] services: [20:13:28] - conf: [20:13:33] port: 6533 [20:14:10] jouncebot: next [20:14:10] In 2 hour(s) and 45 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T2300) [20:16:02] (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449533 [20:16:04] (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449533 (owner: 1020after4) [20:16:20] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) I've opened case 977580870 to coordinate getting a Dell Tech dispatched to eqsin with a replacement part. [20:17:26] (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449533 (owner: 1020after4) [20:20:34] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.32.0-wmf.15 [20:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:17] (03PS3) 10MarcoAurelio: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) (owner: 10Zoranzoki21) [20:25:52] (03PS3) 10MarcoAurelio: Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) 
[20:32:27] (03PS2) 10Volans: Fix prospector tests [software/cumin] - 10https://gerrit.wikimedia.org/r/449519 [20:32:33] (03CR) 10Volans: "Replies inline" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/449519 (owner: 10Volans) [20:32:38] (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449533 (owner: 1020after4) [20:39:27] (03PS1) 10Aaron Schulz: Use mcrouter for cache reads for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449603 (https://phabricator.wikimedia.org/T198239) [20:39:29] (03PS1) 10Aaron Schulz: Use mcrouter for cache reads on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449604 (https://phabricator.wikimedia.org/T198239) [20:39:31] (03PS1) 10Aaron Schulz: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) [20:39:33] (03PS1) 10Aaron Schulz: Allow broadcasted mcrouter cache operations for purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) [20:39:53] (03PS1) 10Dzahn: bacula/postgresql: add a generic fileset for psql [puppet] - 10https://gerrit.wikimedia.org/r/449607 (https://phabricator.wikimedia.org/T190184) 
[20:43:35] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:44:17] (03CR) 10Imarlier: webperf: Set mpm=worker explicitly for httpd. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [20:55:25] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:55:34] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:56:15] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:58:34] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
[21:00:51] maybe we can make the channel +R for some time [21:00:54] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:01:05] Hauskatze, maybe [21:01:15] things is I'm not sure our bots can handle that [21:01:40] * Hauskatze eats a Lotus cookie [21:01:55] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:02:15] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:02:24] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
[21:05:24] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.32.0-wmf.15 (duration: 44m 49s) [21:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:39] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449611 [21:08:41] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449611 (owner: 1020after4) [21:10:08] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449611 (owner: 1020after4) [21:10:15] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:10:15] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:11:05] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:11:14] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:13:45] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:14:59] (03PS2) 10Dzahn: Beta: ensure deployment-deploy01 is a co-master [puppet] - 10https://gerrit.wikimedia.org/r/449520 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:17:14] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:18:05] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:20:15] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:22:38] (03CR) 10Dzahn: [C: 032] Beta: ensure deployment-deploy01 is a co-master [puppet] - 10https://gerrit.wikimedia.org/r/449520 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [21:23:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:24:04] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:24:30] (03PS2) 10Dzahn: bacula/postgresql: add a generic fileset for psql [puppet] - 10https://gerrit.wikimedia.org/r/449607 (https://phabricator.wikimedia.org/T190184) [21:24:54] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:30:44] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:31:01] this is still broken [21:31:02] https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Jupiter_diagram.svg/5000px-Jupiter_diagram.svg.png [21:31:26] Why do you need it so big? [21:31:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:32:46] why not? [21:33:07] 5000px is big [21:33:07] https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Jupiter_diagram.svg/2000px-Jupiter_diagram.svg.png [21:33:58] but it's offered [21:34:04] the svg version works [21:34:04] https://upload.wikimedia.org/wikipedia/commons/b/b5/Jupiter_diagram.svg [21:34:17] That one isn't doing any scaling [21:35:04] my point is, below the image it says: [21:35:06] Size of this PNG preview of this SVG file: 800 × 400 pixels. Other resolutions: 320 × 160 pixels | 640 × 320 pixels | 1,024 × 512 pixels | 1,280 × 640 pixels | 5,000 × 2,500 pixels. [21:35:11] clicking on the 5000 version is broken [21:35:21] so it either shouldn't be offered, or it should be fixed [21:35:59] Error: 429, Too Many Requests [21:36:04] That suggests the scalers are a bit busy [21:36:33] it was error 500 the other day [21:36:41] ah i see, the file's nominal dimensions are quite large [21:37:02] so i think the default sizes offered are sometimes coming up bigger than the max-area limit we constrain with [21:37:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decom/reclaim terbium - https://phabricator.wikimedia.org/T200763 (10RobH) [21:37:27] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.15 [21:37:29] should be fairly straightforward to not offer links for the ones that are going to be too big :) [21:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:44] $wgMaxImageArea = 10e7; // 100MP [21:37:59] hmmmm odd [21:38:08] 5000x2500 is only 12.5 mp [21:38:42] is there a different limit on thumbor? [21:38:55] (03PS3) 10Dzahn: bacula/postgresql: add a generic fileset for psql [puppet] - 10https://gerrit.wikimedia.org/r/449607 (https://phabricator.wikimedia.org/T190184) 
[21:39:16] yay spam [21:39:28] (03PS1) 10RobH: terbium decom [puppet] - 10https://gerrit.wikimedia.org/r/449614 (https://phabricator.wikimedia.org/T200763) [21:40:50] (03PS1) 10Reedy: Remove $wgUseImageResize as same as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449615 [21:41:12] (03PS1) 10RobH: decom prod dns for terbium [dns] - 10https://gerrit.wikimedia.org/r/449616 (https://phabricator.wikimedia.org/T200763) [21:41:44] (03CR) 10RobH: [C: 032] terbium decom [puppet] - 10https://gerrit.wikimedia.org/r/449614 (https://phabricator.wikimedia.org/T200763) (owner: 10RobH) [21:41:45] sigh [21:42:26] so what do you think? [21:42:33] (03CR) 10RobH: [C: 032] decom prod dns for terbium [dns] - 10https://gerrit.wikimedia.org/r/449616 (https://phabricator.wikimedia.org/T200763) (owner: 10RobH) [21:42:35] (03CR) 10Dzahn: [C: 032] bacula/postgresql: add a generic fileset for psql [puppet] - 10https://gerrit.wikimedia.org/r/449607 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [21:42:58] (03PS4) 10Dzahn: bacula/postgresql: add a generic fileset for psql [puppet] - 10https://gerrit.wikimedia.org/r/449607 (https://phabricator.wikimedia.org/T190184) [21:43:26] aaaaaaaaaa: so my current theory is it may be failing due to different configurations on the new scaling service from the mediawiki config [21:43:47] aaaaaaaaaa: can you file a bug in phabricator? we'll want to track it :) [21:44:04] don't have an account [21:44:22] mutante: i have dns changes with yours pending as well [21:44:23] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449611 (owner: 1020after4) [21:44:25] im assuming its ok [21:44:26] =] [21:44:40] robh: dns changes or puppet? 
[21:44:46] dns [21:44:48] sorry [21:44:49] puppet [21:44:52] im merging both heh [21:45:05] oh really? i am on puppetmaster1001 and dont see it [21:45:13] aaaaaaaaaa: ok :) i'll file one in a bit [21:45:23] thanks :) [21:45:31] robh: the bacula fileset? thanks [21:46:01] Dzahn: Beta: ensure deployment-deploy01 is a co-master (b5c61c7627) [21:46:08] hmm, right now it's rejecting for any new size ;) so we'll diagnose later after timeout [21:46:33] robh: what master are you on? [21:52:18] (03PS2) 10Dzahn: parsoid/thumbor::mediawiki: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449345 [21:54:34] RECOVERY - Recursive DNS on 208.80.153.78 is OK: DNS OK: 0.044 seconds response time. www.wikipedia.org returns [21:58:31] (03CR) 10Dzahn: [C: 032] parsoid/thumbor::mediawiki: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449345 (owner: 10Dzahn) [22:00:08] (03PS2) 10Dzahn: aqs/poolcounter:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449346 [22:05:12] (03CR) 10Dzahn: [C: 032] aqs/poolcounter:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449346 (owner: 10Dzahn) [22:06:08] (03PS2) 10Dzahn: cache::canary/pybaltest: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449347 [22:06:40] (03PS2) 10Krinkle: webperf: Set mpm=worker explicitly for httpd. [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) [22:06:46] (03PS3) 10Krinkle: webperf: Set mpm=worker explicitly for httpd. 
[puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) [22:14:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decom/reclaim terbium - https://phabricator.wikimedia.org/T200763 (10RobH) p:05Triage>03Normal a:03Cmjohnson [22:18:09] 10Operations, 10ops-eqiad, 10PoolCounter, 10decommission: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (10RobH) [22:19:43] (03PS1) 10RobH: decom poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/449624 (https://phabricator.wikimedia.org/T193025) [22:20:20] (03PS1) 10RobH: decom poolcounter1002 prod dns [dns] - 10https://gerrit.wikimedia.org/r/449625 (https://phabricator.wikimedia.org/T193025) [22:20:43] (03CR) 10RobH: [C: 032] decom poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/449624 (https://phabricator.wikimedia.org/T193025) (owner: 10RobH) [22:21:05] (03CR) 10RobH: [C: 032] decom poolcounter1002 prod dns [dns] - 10https://gerrit.wikimedia.org/r/449625 (https://phabricator.wikimedia.org/T193025) (owner: 10RobH) [22:22:52] 10Operations, 10ops-eqiad, 10PoolCounter, 10decommission: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (10RobH) a:03Cmjohnson [22:24:03] 10Operations, 10decommission: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10RobH) p:05High>03Normal [22:24:15] 10Operations, 10ops-eqiad, 10decommission: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10RobH) [22:25:24] (03PS1) 10Andrew Bogott: labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) [22:25:56] (03CR) 10jerkins-bot: [V: 04-1] labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) (owner: 10Andrew Bogott) [22:26:46] (03PS2) 10Andrew Bogott: labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 
10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) [22:30:01] (03PS1) 10Dzahn: syslog::centralserver: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449629 [22:33:41] (03Abandoned) 10Paladox: WIP: phabricator: Switch from apache to nginx [puppet] - 10https://gerrit.wikimedia.org/r/406243 (https://phabricator.wikimedia.org/T185644) (owner: 10Paladox) [22:33:56] (03CR) 10Dzahn: [C: 032] syslog::centralserver: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449629 (owner: 10Dzahn) [22:35:23] (03Abandoned) 10Paladox: puppetdb: Allow customising username and password for active record [puppet] - 10https://gerrit.wikimedia.org/r/390504 (owner: 10Paladox) [22:36:14] (03PS3) 10Dzahn: mediawiki::php: Remove support for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [22:36:24] (03PS4) 10Krinkle: webperf: Set mpm=prefork explicitly for profiling_tools' httpd [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) [22:36:58] (03CR) 10jerkins-bot: [V: 04-1] webperf: Set mpm=prefork explicitly for profiling_tools' httpd [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:37:29] (03PS5) 10Krinkle: webperf: Set mpm=prefork explicitly for profiling_tools' httpd [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) [22:48:46] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Dzahn) A quick workaround is if you use: ``` ssh -t deploy1001.eqiad.wmnet 'umask 002; screen -S ssh -R -q' ``` You mean deploy1001.eqiad.wmnet, right (vs. mwdeploy) right? [22:49:04] (03CR) 10Krinkle: "Cherry-picked to beta and ran puppet on webperf12 there. 
First puppet run for this role without errors :)" [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:50:41] (03CR) 10Dzahn: [C: 031] mediawiki::php: Remove support for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [22:51:28] (03CR) 10Dzahn: [C: 032] webperf: Set mpm=prefork explicitly for profiling_tools' httpd [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:51:36] mutante: nice sprint killing php5/terbium stuff :) [22:51:47] (03PS6) 10Dzahn: webperf: Set mpm=prefork explicitly for profiling_tools' httpd [puppet] - 10https://gerrit.wikimedia.org/r/449532 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [22:52:22] Krinkle: :) [22:53:44] Krinkle: webperf change applied on webperf* in prod.. [22:53:53] and yea, let's also merge that last php5 related change, heh [22:54:08] (03CR) 10Dzahn: [C: 032] mediawiki::php: Remove support for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [22:54:15] 10Operations, 10ops-esams, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (10ayounsi) [22:54:17] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [22:54:20] 10Operations, 10Traffic, 10netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) [22:54:23] (03PS4) 10Dzahn: mediawiki::php: Remove support for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [22:55:55] (03CR) 10Alex Monk: "Cherry-pick of this seems to have a problem, see T200842" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, 
and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180731T2300). [23:00:04] brion and Hauskatze: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] o/ [23:00:24] o/ [23:02:47] (03PS5) 10Dzahn: jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) [23:03:13] it's that time of the day again to break the wikis :P [23:04:10] woohoo [23:07:56] well with incentive like that, I guess I can SWAT [23:08:33] (03CR) 10Alex Monk: "Replaced cherry-pick with latest PS" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:08:34] :D [23:09:56] :D waffels for thcipriani [23:10:08] (03PS4) 10Thcipriani: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) (owner: 10Zoranzoki21) [23:10:44] (03PS1) 10Dzahn: ores::redis: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449631 [23:10:52] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) (owner: 10Zoranzoki21) [23:11:28] /away zzz [23:11:40] good night mutante [23:12:14] thcipriani: the templatestyles patch cannot really be tested, although you can push it to mwdebug and I can check that the wiki is still there I guess :) [23:12:28] (03Merged) 10jenkins-bot: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) (owner: 10Zoranzoki21) [23:12:40] Hauskatze: ensuring that the wiki is still there is always good [23:13:05] I'd 
say so [23:13:24] universe has not collapsed.... check [23:14:05] Reedy: looks like there is a file owned by you but uncommitted on deploy1001 - are you doing something there or can I remove it? [23:14:59] brion: you might not be, but mine.... :) [23:15:13] yours* [23:15:23] @todo parallelize the multiverse checks [23:15:31] @return null [23:16:33] removing file [23:16:52] or, I guess I'll be cautious and move it to /tmp [23:18:18] (03PS16) 10Alex Monk: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:18:42] (03CR) 10jerkins-bot: [V: 04-1] Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:18:44] Hauskatze: I pulled to mwdebug1002 and was able to confirm that the wiki still existed afterwards. Going live. [23:18:48] (03PS17) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [23:19:03] thcipriani: ok thanks [23:19:12] (03CR) 10jerkins-bot: [V: 04-1] Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [23:19:20] I guess TemplateStyles was there on Special:Version too? 
[23:19:49] * Hauskatze is looking [23:20:13] oh well, the patch is not there anymore; no probs [23:20:55] yep, it's there :D [23:21:38] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:449513|Enable TemplateStyles on Meta-Wiki]] T200613 (duration: 00m 57s) [23:21:41] ^ Hauskatze live now [23:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:42] T200613: Enable TemplateStyles on Meta-Wiki - https://phabricator.wikimedia.org/T200613 [23:21:47] :) [23:22:46] (03PS4) 10Thcipriani: Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [23:22:57] https://meta.wikimedia.org/w/index.php?title=Template:Test1/style.css&action=info shows 'saniticed-css' as contentmodel, as expected [23:24:37] (03CR) 10jenkins-bot: Enable TemplateStyles on Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449513 (https://phabricator.wikimedia.org/T200613) (owner: 10Zoranzoki21) [23:26:04] brion: hrm, it looks like the backport for wmf.15 is going to fail since this failed: https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php70-docker/601/ [23:26:30] poop [23:26:32] * brion looks [23:26:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [23:27:08] i've seen those parserfunctions tests failing before, seems nondeterministic [23:27:13] recheck should fix it [23:27:42] (03Merged) 10jenkins-bot: Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [23:27:56] at least it did earlier. ah i think an upstream error that got fixed? 
[23:28:42] Hauskatze: your second patch is on mwdebug1002, check please [23:28:47] if .15 isn't deployed yet we can wait on that one [23:28:48] ack, on it [23:29:31] brion: I think it's on group0 (though I haven't checked) [23:29:49] yeah: https://tools.wmflabs.org/versions/ mediawikiwiki/testwikis [23:30:12] will be on non-wikipedias this time tomorrow (knock on wood) [23:30:19] thcipriani: nothing broken so looks good to me [23:30:26] Hauskatze: ok, going live [23:30:45] can't really test that one either, maybe guillom knows better [23:31:52] ah i think it needs https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParserFunctions/+/449504 [23:32:32] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:449277|Enable $wgCiteResponsiveReferences for Meta-Wiki]] T200707 (duration: 00m 56s) [23:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:36] T200707: enable $wgCiteResponsiveReferences for Meta-Wiki - https://phabricator.wikimedia.org/T200707 [23:32:38] ^ Hauskatze live now [23:32:52] thcipriani: awesome, thanks [23:32:52] here's a backport: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParserFunctions/+/449634 [23:33:06] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.007 seconds response time. www.wikipedia.org returns 208.80.154.224 [23:36:20] brion: looks like that one just barely missed the branch cut, +2'd [23:36:25] woot [23:36:31] thx :) [23:36:45] no problem [23:37:08] mutante, hey [23:37:31] /away zzz [23:37:44] ah [23:37:52] but maybe he's still there [23:38:06] well, maybe someone else knows, I was going to ask - does prod not use jessie to run mediawiki anymore? 
[23:41:35] pretty sure the mw servers are stretch now but i'm not too familiar with what's left to migrate [23:41:58] i keep to my video scalers :D [23:42:02] :) [23:43:00] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Jenkins, Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (Krenair) Since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449219/ got merged, puppet has... [23:49:22] whee [23:50:33] brion: I've got wmf.14 TimedMediaHandler up-to-date on mwdebug1002 if you want to check that out [23:51:28] looking [23:53:00] so far so good [23:53:04] thcipriani: go fer it [23:53:10] (CR) Thcipriani: "cherry-picked on beta. Should be good to land." [puppet] - https://gerrit.wikimedia.org/r/449521 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani) [23:53:20] * thcipriani does [23:55:34] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/TimedMediaHandler/WebVideoTranscode/WebVideoTranscode.php: SWAT: [[gerrit:449621|Workaround for job queue reporting 0 length]] (T200813) (duration: 00m 57s) [23:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:38] T200813: JobQueueGroup::singleton()->getQueueSizes() returns 0 for all queues in production - https://phabricator.wikimedia.org/T200813 [23:55:40] ^ brion live now [23:55:47] sweet! [23:56:44] thanks thcipriani :D [23:56:55] yw :) [23:57:02] i can confirm it's working on my maint script live [23:57:56] awesome