[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T0000). [00:00:04] Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:12] o/ [00:00:59] 10Operations, 10Maps, 10Maps-Sprint, 10Traffic, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3969931 (10jmatazzoni) [00:01:12] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3969935 (10jmatazzoni) [00:01:14] The first patch is not testable, the second one is [00:05:13] I can SWAT [00:05:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410353 (https://phabricator.wikimedia.org/T187265) (owner: 10Ladsgroup) [00:07:15] (03Merged) 10jenkins-bot: Enable xkill on top wikis that use x aspect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410353 (https://phabricator.wikimedia.org/T187265) (owner: 10Ladsgroup) [00:07:51] (03CR) 10jenkins-bot: Enable xkill on top wikis that use x aspect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410353 (https://phabricator.wikimedia.org/T187265) (owner: 10Ladsgroup) [00:07:53] Thank you! [00:08:13] Amir1: the one that just merged is the one that is not testable? I pulled it over to mwdebug1002 in any case. 
[00:08:32] thcipriani: yup, that one is not testable [00:08:40] k, going live [00:08:57] It has been live in lots of wikis for a while now, it's pretty safe [00:09:31] (It can blow up the database but that will happen at least several days from now, and I'm monitoring everything) [00:12:01] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:410353|Enable xkill on top wikis that use x aspect]] T187265 (duration: 01m 14s) [00:12:06] ^ Amir1 live now [00:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:16] T187265: Enable xkill on top wikis that use x aspect - https://phabricator.wikimedia.org/T187265 [00:12:18] Thank you! [00:12:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:13:35] sure thing :) [00:14:34] 10Operations, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3969966 (10bd808) 05Open>03Resolved Let's call this done until @yuvipanda stumbles on something that we missed. 
[00:16:35] (03PS3) 10Thcipriani: Add uploader user group to mznwiki and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:16:57] (03CR) 10Thcipriani: Add uploader user group to mznwiki and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:17:03] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:18:38] (03Merged) 10jenkins-bot: Add uploader user group to mznwiki and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:18:49] (03CR) 10jenkins-bot: Add uploader user group to mznwiki and make it automagically added [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410358 (https://phabricator.wikimedia.org/T187187) (owner: 10Ladsgroup) [00:19:22] Amir1: ^ live on mwdebug1002, check please [00:19:30] on it [00:22:40] thcipriani: everything seems fine [00:22:57] Amir1: ok, going live [00:25:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:410358|Add uploader user group to mznwiki and make it automagically added]] T187187 (duration: 01m 12s) [00:25:53] ^ Amir1 live now [00:26:03] Thanks. Will take a look ASAP [00:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:04] T187187: Add uploader user group and give access to them only in mznwiki - https://phabricator.wikimedia.org/T187187 [00:26:58] sounds okay [00:27:01] Thank you [00:27:32] cool. 
thanks for the patches :) [00:50:06] (03CR) 10Niedzielski: [C: 04-1] New: add chromium_render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski) [01:05:59] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:07:33] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3970110 (10TBolliger) [01:22:16] (03CR) 10Chad: [C: 031] "This should land :)" [puppet] - 10https://gerrit.wikimedia.org/r/410214 (https://phabricator.wikimedia.org/T187269) (owner: 10Chad) [01:25:11] (03PS4) 10Dzahn: Gerrit: Force expire the old /r login cookie [puppet] - 10https://gerrit.wikimedia.org/r/410214 (https://phabricator.wikimedia.org/T187269) (owner: 10Chad) [01:27:26] (03CR) 10Dzahn: [C: 032] "approaching airport" [puppet] - 10https://gerrit.wikimedia.org/r/410214 (https://phabricator.wikimedia.org/T187269) (owner: 10Chad) [01:30:20] (03CR) 10Chad: [C: 031] Deploy GlobalPreferences in Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410267 (https://phabricator.wikimedia.org/T184668) (owner: 10MaxSem) [01:30:28] no_justification: i can still login and stuff.. the only thing that i see happening since ysterday is [01:30:35] that the reviewers plugin fails loading [01:30:54] then i click continue and that's it [01:31:06] You shouldn't see anything to do with the reviewers plugin.... 
[01:31:07] no real issue but that message kept popping up [01:31:16] I'll disable for now [01:31:28] ok [01:32:21] (03PS1) 10Chad: Revert "Adding reviewers plugin" [software/gerrit] - 10https://gerrit.wikimedia.org/r/410366 [01:32:23] (03CR) 10Chad: [C: 032] Revert "Adding reviewers plugin" [software/gerrit] - 10https://gerrit.wikimedia.org/r/410366 (owner: 10Chad) [01:32:26] (03CR) 10Chad: [V: 032 C: 032] Revert "Adding reviewers plugin" [software/gerrit] - 10https://gerrit.wikimedia.org/r/410366 (owner: 10Chad) [01:32:56] !log demon@tin Started deploy [gerrit/gerrit@b234c85]: rm reviewers plugin (for now) [01:33:07] !log demon@tin Finished deploy [gerrit/gerrit@b234c85]: rm reviewers plugin (for now) (duration: 00m 11s) [01:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:38] mutante: Should disappear momentarily [01:33:51] Gone [01:36:18] (03CR) 10Dzahn: "i sent a test message with a lot of characters and it was cut off at 225. so more than 140 or 160 and that was without having the "cut" co" [puppet] - 10https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: 10Dzahn) [01:36:29] no_justification: cool, thanks [01:36:43] I saw the console errors about that too, kinda didn't care enough :p [01:39:03] i just it in browser, not all the time but more than once [01:39:06] saw [01:40:59] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [01:52:43] mutante: no_justification were you using polygerrit ? [01:53:02] It will show a console error on gerrit 2.14 polygerrit [01:53:13] Probably [01:53:15] Is fixed in 2.15 where polygerrit plugins are supported [01:53:15] That's lamesauce. [01:53:43] Yep [01:54:32] mutante: about the review plugin, did you disable JavaScript? 
[01:55:46] (03PS1) 10Chad: Remove executable bit from font and text files [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/410367 [01:57:15] (03CR) 10Chad: [V: 032 C: 032] Remove executable bit from font and text files [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/410367 (owner: 10Chad) [01:58:19] (03PS1) 10Chad: Updating fonts to master (removes +x) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410368 [01:58:21] (03CR) 10Chad: [C: 032] Updating fonts to master (removes +x) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410368 (owner: 10Chad) [02:00:01] (03Merged) 10jenkins-bot: Updating fonts to master (removes +x) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410368 (owner: 10Chad) [02:00:12] (03CR) 10jenkins-bot: Updating fonts to master (removes +x) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410368 (owner: 10Chad) [02:02:46] !log demon@tin Synchronized fonts/: removing executable bits, no-op (duration: 01m 15s) [02:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:08] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:18:00] (03PS1) 10Ayounsi: pmacct: add tags to aggregated netflow based on the source device [puppet] - 10https://gerrit.wikimedia.org/r/410369 [02:18:34] (03CR) 10jerkins-bot: [V: 04-1] pmacct: add tags to aggregated netflow based on the source device [puppet] - 10https://gerrit.wikimedia.org/r/410369 (owner: 10Ayounsi) [02:25:50] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.20) (duration: 05m 39s) [02:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:08] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [02:58:11] (03PS2) 10Ayounsi: pmacct: add tags to aggregated netflow based on the source device [puppet] - 10https://gerrit.wikimedia.org/r/410369 [03:08:56] (03PS3) 10Ayounsi: pmacct: add tags to aggregated netflow based on the source device [puppet] - 10https://gerrit.wikimedia.org/r/410369 [03:11:35] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9967/rhenium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/410369 (owner: 10Ayounsi) [05:51:52] !log andrew@tin Started deploy [horizon/deploy@c355366]: updating sudo-dashboard [05:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:07] !log andrew@tin Finished deploy [horizon/deploy@c355366]: updating sudo-dashboard (duration: 00m 20s) [05:52:15] !log andrew@tin Started deploy [horizon/deploy@c355366]: updating sudo-dashboard [05:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:28] !log andrew@tin Finished deploy [horizon/deploy@c355366]: updating sudo-dashboard (duration: 03m 13s) [05:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup 
db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3970735 (10Marostegui) Chris, can you hold this task? There are under going discussions about the hostname, I marked the task as stalled but I should've sai... [06:25:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410390 (https://phabricator.wikimedia.org/T187089) [06:27:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410390 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:29:31] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410390 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:29:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410390 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:30:24] !log Deploy schema change on db1096:3315 - T187089 T185128 T153182 [06:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:39] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:30:39] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:30:40] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:31:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1096:3315 for alter table (duration: 01m 13s) [06:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410391 (https://phabricator.wikimedia.org/T162807) [06:36:28] PROBLEM - HHVM rendering on mw2129 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:18] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 75286 bytes in 0.406 second response time [06:39:19] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [06:40:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410391 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:42:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410391 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:42:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410391 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:44:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T162807 (duration: 01m 12s) [06:44:04] !log Stop replication in sync db1089 and db1067 - T162807 [06:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:13] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [06:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Safely load yaml files (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [06:58:47] (03Merged) 10jenkins-bot: Safely load yaml files [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [06:58:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410392 [06:59:32] (03CR) 10Krinkle: "I imagine it would fit well in 
https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance, and/or (if generalised) added to th" [puppet] - 10https://gerrit.wikimedia.org/r/397913 (owner: 10Anomie) [07:00:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410392 (owner: 10Marostegui) [07:01:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410392 (owner: 10Marostegui) [07:02:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410392 (owner: 10Marostegui) [07:04:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 (duration: 01m 12s) [07:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:18] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:16:49] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100% [07:20:48] RECOVERY - Host labvirt1008 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [07:23:39] (03PS1) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [07:40:46] <_joe_> did anyone reboot labvirt1008? 
[07:42:21] not me [07:43:15] not me either [07:53:29] (03PS2) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [08:00:59] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3970886 (10MoritzMuehlenhoff) [08:02:46] (03PS6) 10Giuseppe Lavagetto: Add support for jsonschema-based entities [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 (https://phabricator.wikimedia.org/T185080) [08:02:48] (03PS2) 10Giuseppe Lavagetto: Increase test coverage [software/conftool] - 10https://gerrit.wikimedia.org/r/410224 [08:02:50] (03PS2) 10Giuseppe Lavagetto: Add simple actions to be exercised only on the basic types. [software/conftool] - 10https://gerrit.wikimedia.org/r/410225 [08:02:52] (03PS2) 10Giuseppe Lavagetto: Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 [08:02:59] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:02:59] PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:08] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:08] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:03:18] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:21] <_joe_> oh jeez again? [08:03:25] <_joe_> ganeti, why? [08:03:39] PROBLEM - Host boron is DOWN: PING CRITICAL - Packet loss = 100% [08:03:52] <_joe_> boron is a vm too? 
[08:03:59] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:08] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:09] PROBLEM - Host kubestagetcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:18] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100% [08:04:19] yeah, we set that up as a VM when copper couldn't handle HHVM builds anymore [08:04:25] <_joe_> eheh [08:04:38] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:49] PROBLEM - SSH on ganeti1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:05:24] there's an interesting data point in that crash, though: [08:05:46] we explicitly held back ganeti1008 at Linux 4.4 to rule out the crashes are a regression between 4.4->4.9 [08:06:51] it would be interesting to know if one of the vms had a spike in iops or similar before the crash [08:08:10] <_joe_> !log powercycled ganeti1008 [08:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:58] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:09:10] yeah, boron should induce quite some I/O load when doing massive builds (and I've built the kernel last week e.g.) 
[08:09:19] <_joe_> moritzm: I see linux 4.9 starting there [08:09:48] RECOVERY - SSH on ganeti1008 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:09:58] _joe_: that should be fine, the kernel was already installed (but we held back the reboot) [08:10:13] now that we know 4.4 is equally affected, we can revert to the default kernel [08:10:35] the 50x are due to piwik on bohrium [08:10:39] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 6.89 ms [08:10:40] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Emails sent to Wikidata mailing list are not received - https://phabricator.wikimedia.org/T187163#3970906 (10Lea_Lacroix_WMDE) @herron I am the list admin. I've been looking for modifying the value you mention, but I couldn't find it in the admin interface. Do... [08:10:48] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 7.44 ms [08:10:58] RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 7.35 ms [08:10:58] RECOVERY - Host boron is UP: PING OK - Packet loss = 0%, RTA = 7.33 ms [08:10:58] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 6.76 ms [08:10:59] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 37%, RTA = 6.68 ms [08:11:08] RECOVERY - Host actinium is UP: PING WARNING - Packet loss = 64%, RTA = 7.63 ms [08:11:08] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 6.78 ms [08:11:08] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 6.67 ms [08:11:18] RECOVERY - Host kubestagetcd1003 is UP: PING OK - Packet loss = 0%, RTA = 7.23 ms [08:12:39] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 7.33 ms [08:21:37] so 4.4 also affected... 
how nice [08:23:02] well at least we can upgrade that box again to 4.9 [08:23:08] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:25:59] (03PS1) 10Alexandros Kosiaris: lvs: Move ORES to the ores cluster from scb [puppet] - 10https://gerrit.wikimedia.org/r/410398 [08:27:08] (03CR) 10Alexandros Kosiaris: [C: 032] lvs: Move ORES to the ores cluster from scb [puppet] - 10https://gerrit.wikimedia.org/r/410398 (owner: 10Alexandros Kosiaris) [08:29:34] !log pybal restart on lvs1006, lvs1009, lvs1012 to pickup https://gerrit.wikimedia.org/r/410398 [08:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:19] hmm scratch that.. lvs1009 is not exactly up right now [08:31:33] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3970944 (10akosiaris) [08:31:36] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3970939 (10akosiaris) 05Open>03Resolved a:03akosiaris Yes that's right. Resolving [08:34:38] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3970955 (10akosiaris) And scap configuration updated in https://gerrit.wikimedia.org/r/#/c/409932/. 
When that one is merged this can be called done as well [08:38:20] (03PS2) 10Jcrespo: mariadb: Repool db2042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 [08:38:22] (03PS1) 10Jcrespo: mariadb: Rebalance s8 weigth due to high load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410401 [08:38:24] !log pybal restart on lvs1003 to pickup https://gerrit.wikimedia.org/r/410398 [08:38:33] (03PS2) 10Jcrespo: mariadb: Rebalance s8 weigth due to high load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410401 [08:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:58] and finally production traffic flows to ores* hosts :-) [08:39:11] I 'll be monitoring but I can say I am happy [08:39:54] \o/ \o/ [08:41:05] (03CR) 10Jcrespo: [C: 032] mariadb: Rebalance s8 weigth due to high load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410401 (owner: 10Jcrespo) [08:42:19] (03PS3) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [08:42:38] (03Merged) 10jenkins-bot: mariadb: Rebalance s8 weigth due to high load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410401 (owner: 10Jcrespo) [08:42:51] (03CR) 10jenkins-bot: mariadb: Rebalance s8 weigth due to high load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410401 (owner: 10Jcrespo) [08:45:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Rebalance s8 (duration: 01m 13s) [08:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:18] (03PS4) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [08:51:00] (03PS1) 10Filippo Giunchedi: prometheus: aggregate availability for varnish backends [puppet] - 10https://gerrit.wikimedia.org/r/410402 (https://phabricator.wikimedia.org/T177195) [08:51:10] 
godog: I am working on ---^, if you have a minute later on would you mind to check if wrote horrible things ? :D [08:51:48] heheh ok will do elukey [08:52:11] 10Operations, 10Discovery, 10Traffic, 10WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3971009 (10Smalyshev) p:05Low>03Lowest [08:52:32] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3971010 (10Smalyshev) 05Open>03stalled p:05Low>03Lowest [08:52:35] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#3971012 (10Smalyshev) [08:52:49] (03PS2) 10Filippo Giunchedi: prometheus: aggregate availability for varnish backends [puppet] - 10https://gerrit.wikimedia.org/r/410402 (https://phabricator.wikimedia.org/T177195) [08:54:40] godog: already discovered all sorts of bad things, will ping you later :D [08:55:41] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: aggregate availability for varnish backends [puppet] - 10https://gerrit.wikimedia.org/r/410402 (https://phabricator.wikimedia.org/T177195) (owner: 10Filippo Giunchedi) [08:59:20] (03PS1) 10Filippo Giunchedi: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) [09:00:13] moritzm: ^ [09:02:05] !log Stop MySQL on db1096:3315 and 3316 for mysql+kernel upgrade [09:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:41] 10Operations, 10hardware-requests: HW replacement for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971031 (10fgiunchedi) [09:03:12] 10Operations, 10hardware-requests: HW replacement for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971031 (10fgiunchedi) cc @joe and 
@akosiaris as FYI [09:05:34] 10Operations, 10hardware-requests: HW replacement for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971055 (10akosiaris) Let's not ? This is the only poolcounter instance that is still a physical machine and that's for historical reasons. The other 3 instances are VMs already so let's do this on... [09:08:10] 10Operations, 10hardware-requests: HW replacement for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971057 (10fgiunchedi) Works for me! A VM would be even better indeed. [09:08:28] !log Deploy schema change on s5 dbstore1002 https://phabricator.wikimedia.org/T187089 https://phabricator.wikimedia.org/T185128 https://phabricator.wikimedia.org/T153182 [09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:41] 10Operations, 10vm-requests: VM for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971064 (10fgiunchedi) [09:13:38] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410407 [09:13:43] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Offline uncorrectable sectors on poolcounter1002 /dev/sda - https://phabricator.wikimedia.org/T186534#3971067 (10fgiunchedi) 05Open>03Resolved This is completed [09:14:50] 10Operations, 10vm-requests: VM for poolcounter1002 - https://phabricator.wikimedia.org/T187297#3971070 (10fgiunchedi) p:05Triage>03Normal [09:14:51] (03PS5) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [09:15:01] 10Operations, 10Code-Stewardship-Reviews, 10Services: zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#3971071 (10fgiunchedi) p:05Triage>03Normal [09:15:13] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-fgiunchedi: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#3971074 
(10fgiunchedi) p:05Triage>03Normal [09:17:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410407 (owner: 10Marostegui) [09:18:33] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410407 (owner: 10Marostegui) [09:18:44] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410407 (owner: 10Marostegui) [09:20:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool slowly db1096:3316,3315 (duration: 01m 13s) [09:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:54] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410409 [09:28:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410409 (owner: 10Marostegui) [09:28:34] 10Operations, 10Code-Stewardship-Reviews, 10Services: zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#3971089 (10akosiaris) >>! In T187194#3968832, @danstillman wrote: > Zotero dev here. I'm not clear on when all the above was written, but a few clarifications.... 
[09:30:13] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410409 (owner: 10Marostegui) [09:30:24] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410409 (owner: 10Marostegui) [09:31:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1096:3316,3315 (duration: 01m 12s) [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:50] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410410 [09:45:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410410 (owner: 10Marostegui) [09:45:57] (03CR) 10Giuseppe Lavagetto: Add support for jsonschema-based entities (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [09:47:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410410 (owner: 10Marostegui) [09:47:34] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410410 (owner: 10Marostegui) [09:49:35] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: all (tags: ['dc=codfw', 'cluster=ores', 'service=ores']) [09:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:48] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: all (tags: ['dc=eqiad', 'cluster=ores', 'service=ores']) [09:49:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1096:3316,3315 (duration: 01m 12s) [09:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:01] !log set 
standard weight for all ores* hosts [09:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:20] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410411 [09:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:53] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 (https://phabricator.wikimedia.org/T185080) (owner: 10Giuseppe Lavagetto) [09:59:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410411 (owner: 10Marostegui) [10:01:18] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410411 (owner: 10Marostegui) [10:01:29] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410411 (owner: 10Marostegui) [10:01:32] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART for db in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/410412 (https://phabricator.wikimedia.org/T86552) [10:01:34] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART for lab/labtest [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) [10:02:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1096:3316,3315 (duration: 01m 12s) [10:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410416 (https://phabricator.wikimedia.org/T187089) [10:08:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410416 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:10:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 for alter 
table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410416 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:10:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 for alter table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410416 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:11:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 01m 12s) [10:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] !log Deploy schema change on db1100 - T187089 T185128 T153182 [10:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:27] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [10:13:27] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [10:13:28] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [10:16:52] (03PS6) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [10:19:21] ;win 2 [10:21:45] (03CR) 10Volans: "Ok in general, few comments inline." 
(033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/410224 (owner: 10Giuseppe Lavagetto) [10:27:24] (03PS1) 10Gilles: Upgrade to 1.13 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/410417 (https://phabricator.wikimedia.org/T187159) [10:28:35] !log installing libvorbis security updates on trusty systems [10:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:40] (03PS2) 10Gilles: Update Thumbor header names [puppet] - 10https://gerrit.wikimedia.org/r/410199 (https://phabricator.wikimedia.org/T187159) [10:30:46] (03PS3) 10Gilles: Update Thumbor header names [puppet] - 10https://gerrit.wikimedia.org/r/410199 (https://phabricator.wikimedia.org/T187159) [10:32:10] (03PS1) 10Muehlenhoff: Add library hint for libvorbis [puppet] - 10https://gerrit.wikimedia.org/r/410419 [10:33:09] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libvorbis [puppet] - 10https://gerrit.wikimedia.org/r/410419 (owner: 10Muehlenhoff) [10:33:34] (03CR) 10Volans: "Nice approach, I would change slightly the behaviour, see inline." 
(034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/410225 (owner: 10Giuseppe Lavagetto) [10:38:04] (03CR) 10Volans: [C: 031] "LGTM, nitpicks inline ;)" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 (owner: 10Giuseppe Lavagetto) [10:42:43] !log Stop replication in sync on db1089 and db1067 - T162807 [10:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:57] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:46:15] !log dropping test databases from m5 T186585 [10:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:29] T186585: Review m5 backups - https://phabricator.wikimedia.org/T186585 [10:53:47] (03PS7) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [10:54:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, but it seems there is no documentation added for the new query backend. Should it happen in a later commit? Or should it be added he" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/405719 (owner: 10Volans) [10:59:27] (03CR) 10Filippo Giunchedi: prometheus: add check prometheus metric script (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [10:59:34] (03PS5) 10Filippo Giunchedi: prometheus: add check prometheus metric script [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) [11:00:05] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3971323 (10ema) Anything else left to be discussed here? 
[11:00:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410426 [11:01:29] (03PS8) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [11:03:32] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410426 (owner: 10Marostegui) [11:05:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410426 (owner: 10Marostegui) [11:06:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100 for alter table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410426 (owner: 10Marostegui) [11:07:00] kart_: did 410105 (from swat yesterday) started to work? or still broken? [11:07:14] Nikerabbit: ^ [11:07:25] (03PS27) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [11:07:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1100 (duration: 01m 12s) [11:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:26] (03CR) 10Muehlenhoff: [C: 031] Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) (owner: 10Filippo Giunchedi) [11:09:43] (03PS2) 10Filippo Giunchedi: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) [11:10:16] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) (owner: 10Filippo Giunchedi) [11:11:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410427 
(https://phabricator.wikimedia.org/T187089) [11:12:13] (03Merged) 10jenkins-bot: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) (owner: 10Filippo Giunchedi) [11:12:28] (03CR) 10jenkins-bot: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410404 (https://phabricator.wikimedia.org/T186534) (owner: 10Filippo Giunchedi) [11:13:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410427 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [11:13:37] marostegui: I'm sync-file for poolcounter1002 btw [11:13:48] godog: cool, I will wait for you :) [11:13:53] even though scap has a lock, just in case :) [11:14:04] yeah, ping me when done :) [11:14:34] !log filippo@tin Synchronized wmf-config/ProductionServices.php: repool poolcounter1002 after disk replacement (duration: 01m 12s) [11:14:43] marostegui: yup, done [11:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:51] awesome ,thanks godog [11:14:54] (03PS9) 10Elukey: profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) [11:15:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410427 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [11:15:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410427 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [11:16:27] (03CR) 10Elukey: [C: 032] profile::hadoop::worker: remove jmxtrans support [puppet] - 10https://gerrit.wikimedia.org/r/410396 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [11:16:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 (duration: 01m 12s) 
[11:16:35] !log Stop MySQL and reboot db1106 for mysql and kernel upgrade [11:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:16] (03CR) 10Giuseppe Lavagetto: "The code seems generally correct, but I'm not sure what this feature wants to actually achieve." (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/409980 (https://phabricator.wikimedia.org/T186818) (owner: 10Volans) [11:23:23] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3971360 (10Pnorman) Because cache-control is in one of the sample config files, we should make sure it's something sensible, even if we use something dif... [11:25:10] (03PS5) 10Volans: Backends: add known hosts files backend [software/cumin] - 10https://gerrit.wikimedia.org/r/405719 [11:25:46] !log Deploy schema change on db1106 - T187089 T185128 T153182 [11:25:59] (03CR) 10Volans: "Thanks for the reminder of the documentation... had completely forgot to update it, given is a new module and new backend." [software/cumin] - 10https://gerrit.wikimedia.org/r/405719 (owner: 10Volans) [11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:00] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [11:26:00] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [11:26:01] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [11:29:50] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971381 (10MoritzMuehlenhoff) Valentín has been added to pwstore. 
[11:30:11] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971382 (10MoritzMuehlenhoff) [11:36:53] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971406 (10chasemp) Thanks @moritz Luckily the Toolforge instances here are a mix we could afford to have down. @andrew let's sync up on this? [11:38:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [11:50:16] (03PS4) 10Arturo Borrero Gonzalez: WIP: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [11:53:49] (03CR) 10Volans: "> The code seems generally correct, but I'm not sure what this feature wants to actually achieve." 
(032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/409980 (https://phabricator.wikimedia.org/T186818) (owner: 10Volans) [11:58:09] (03CR) 10Volans: [C: 032] Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [12:00:33] (03Merged) 10jenkins-bot: Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [12:01:10] (03PS1) 10Elukey: profile::hadoop: force prometheus critical/warning threshold to int [puppet] - 10https://gerrit.wikimedia.org/r/410430 [12:01:48] (03CR) 10jenkins-bot: Batch size: allow to specify it in percentage [software/cumin] - 10https://gerrit.wikimedia.org/r/410167 (https://phabricator.wikimedia.org/T187185) (owner: 10Volans) [12:06:34] (03CR) 10Elukey: [C: 032] profile::hadoop: force prometheus critical/warning threshold to int [puppet] - 10https://gerrit.wikimedia.org/r/410430 (owner: 10Elukey) [12:12:10] (03PS6) 10Volans: Backends: add known hosts files backend [software/cumin] - 10https://gerrit.wikimedia.org/r/405719 [12:16:33] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410431 [12:19:35] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410431 [12:23:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410431 (owner: 10Marostegui) [12:26:56] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410431 (owner: 10Marostegui) [12:28:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1106 (duration: 01m 12s) [12:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:55] (03PS1) 10Matthias Mullie: 
Load 3D extension on other wikis, for display only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) [12:31:41] (03PS2) 10Matthias Mullie: Load 3D extension on other wikis, for display only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) [12:32:39] (03CR) 10Matthias Mullie: [C: 04-1] "-1 until deploy (also, make sure I285bca9e0a64ee8cd1a5482f697fc725cb54e9c5 is merged)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: 10Matthias Mullie) [12:36:39] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410435 [12:37:46] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410431 (owner: 10Marostegui) [12:38:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410435 (owner: 10Marostegui) [12:40:25] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410435 (owner: 10Marostegui) [12:41:45] zeljkof: multiple reports that it is working now [12:42:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more traffic to db1106 (duration: 01m 12s) [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:34] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410435 (owner: 10Marostegui) [12:44:40] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Emails sent to Wikidata mailing list are not received - https://phabricator.wikimedia.org/T187163#3971552 (10Aklapper) @Lea_Lacroix_WMDE : See https://meta.wikimedia.org/wiki/Mailing_lists/Administration#Spam_filters [12:45:53] (03PS3) 10Jcrespo: mariadb: Repool db2042 after 
maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 [12:45:55] (03PS1) 10Jcrespo: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 [12:46:06] (03PS2) 10Jcrespo: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 [12:46:44] (03PS1) 10Marostegui: Revert "mariadb: Rebalance s8 weigth due to high load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410439 [12:46:48] (03CR) 10Jcrespo: "Manuel, tell me when I can do this one without affecting you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 (owner: 10Jcrespo) [12:47:05] (03CR) 10Marostegui: [C: 031] "go for it :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 (owner: 10Jcrespo) [12:47:20] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 (owner: 10Jcrespo) [12:49:15] (03Merged) 10jenkins-bot: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 (owner: 10Jcrespo) [12:49:25] (03CR) 10jenkins-bot: mariadb: Depool db1088 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410438 (owner: 10Jcrespo) [12:50:57] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971559 (10chasemp) >nova list --host labvirt1008 --all-tenants | awk '{print $4,$6}' | grep -v 'Name Tenant' | tr " " . ```accounts-appserver4.account-creation-assistan... 
[12:53:02] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410440 [12:55:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410440 (owner: 10Marostegui) [12:57:38] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410440 (owner: 10Marostegui) [12:57:49] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410440 (owner: 10Marostegui) [12:58:54] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971589 (10chasemp) [12:59:14] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3970886 (10chasemp) [13:02:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1106 (duration: 01m 12s) [13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:26] (03PS1) 10Elukey: profile::hadoop::master: fix hadoop-hdfs-capacity-gb-remaining alert [puppet] - 10https://gerrit.wikimedia.org/r/410443 [13:11:09] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971628 (10chasemp) [13:11:30] (03CR) 10Elukey: [C: 032] profile::hadoop::master: fix hadoop-hdfs-capacity-gb-remaining alert [puppet] - 10https://gerrit.wikimedia.org/r/410443 (owner: 10Elukey) [13:14:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 (duration: 01m 12s) [13:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:47] (03PS12) 10Muehlenhoff: Add support for selective automatic restarts of stateless services [puppet] - 10https://gerrit.wikimedia.org/r/399618 
(https://phabricator.wikimedia.org/T135991) [13:16:41] !log stop slave and rolling schema change on db1059 m3 replica [13:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] (03PS2) 10Marostegui: Revert "mariadb: Rebalance s8 weigth due to high load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410439 [13:22:34] (03CR) 10Marostegui: [C: 032] Revert "mariadb: Rebalance s8 weigth due to high load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410439 (owner: 10Marostegui) [13:24:30] (03Merged) 10jenkins-bot: Revert "mariadb: Rebalance s8 weigth due to high load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410439 (owner: 10Marostegui) [13:26:02] (03PS1) 10Elukey: role::archiva: move to java 8 [puppet] - 10https://gerrit.wikimedia.org/r/410445 (https://phabricator.wikimedia.org/T166248) [13:26:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1104 original weight (duration: 01m 12s) [13:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:58] (03CR) 10jenkins-bot: Revert "mariadb: Rebalance s8 weigth due to high load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410439 (owner: 10Marostegui) [13:33:14] Nikerabbit: great! 
[13:34:30] !log installed openjdk-8 on meitnerium, manually upgraded java-update-alternatives to java8, restarted archiva [13:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:19] (03Draft2) 10Jayprakash12345: Create Portal alias for hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) [13:40:47] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971674 (10chasemp) p:05Triage>03High [13:41:34] jouncebot, next [13:41:35] In 0 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T1400) [13:42:21] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Evacuate relevant instances off of labvirt1008 - https://phabricator.wikimedia.org/T187317#3971677 (10chasemp) p:05Triage>03High [13:43:16] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) (owner: 10Jayprakash12345) [13:44:16] !log rollback java 8 upgrade for archiva - issues with Analytics builds [13:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:50] (03CR) 10Elukey: [C: 04-2] "Tested on meitnerium, there are some issues with our builds (getting 500s for some jars)" [puppet] - 10https://gerrit.wikimedia.org/r/410445 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [13:45:49] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Evacuate relevant instances off of labvirt1008 - https://phabricator.wikimedia.org/T187317#3971693 (10chasemp) I imagine we can wait a little bit to see if @cmjohnson can address with just some thermal paste? It may be more efficient than a mi... [13:46:12] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:47:39] * apergos glares at puppet [13:49:50] * apergos glares harder [13:51:12] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:52:15] ahem. [13:52:23] * apergos goes back to their regularly scheduled programming [13:56:17] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971759 (10Vgutierrez) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T1400). [14:00:05] bblack, Urbanecm, and matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] * Urbanecm is here [14:00:12] here [14:00:12] I can SWAT today [14:00:32] bblack, matthiasmullie: do you want to deploy your patches? [14:00:43] sure [14:00:54] matthiasmullie: go ahead then [14:01:07] alright [14:02:24] zeljkof: yes please [14:02:40] bblack: as soon as matthiasmullie is done, you are next [14:02:40] zeljkof: mine shouldn't be capable of causing any issue in theory, but I'm here :) [14:03:54] oh I misread the above as "do you want [me] to deploy your patches for you" :) [14:04:03] process has changed since I last did even a config patch [14:04:26] bblack: both are possible :) should I deploy your patch? 
[14:04:43] it's all documented here https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [14:05:54] zeljkof: if you can that would be awesome :) [14:06:16] bblack: sure, please stand by, you are next [14:07:27] bblack: is there anything to test at mwdebug1002, or should I deploy to production immediately? [14:08:53] zeljkof: nothing to test [14:09:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410113 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:09:37] (it's painful to write those words, but in this case it's pragmatically true) [14:10:11] !log mlitn@tin Synchronized php-1.31.0-wmf.20/extensions/3D/modules/ext.3d.js: Fix 3D badge and Webkit thumb load detection (duration: 01m 13s) [14:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:29] bblack: sufficiently advanced monitoring is indistinguishable from testing ;) [14:11:03] (03Merged) 10jenkins-bot: wgSquidServersNoPurge: add eqsin, remove dead IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410113 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:11:13] (03CR) 10jenkins-bot: wgSquidServersNoPurge: add eqsin, remove dead IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410113 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:11:33] !log mlitn@tin Synchronized php-1.31.0-wmf.20/extensions/3D/modules/mmv.3d.head.js: Fix 3D badge (duration: 01m 12s) [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] zeljkof: bblack: done [14:12:03] zeljkof: thanks! 
[14:12:07] matthiasmullie: great, taking over swat [14:12:28] bblack: 410113 is merged, deploying [14:14:42] !log zfilipin@tin Synchronized wmf-config/reverse-proxy.php: SWAT: [[gerrit:410113|wgSquidServersNoPurge: add eqsin, remove dead IP (T156027)]] (duration: 01m 12s) [14:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:54] T156027: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027 [14:15:07] bblack: deployed, please monitor relevant logs for a while and thanks for deploying with #releng ;) [14:15:20] zeljkof: thanks again! [14:15:23] (03PS2) 10Zfilipin: Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: 10Urbanecm) [14:15:31] Urbanecm: you are next, reviewing 408045 [14:15:50] * Urbanecm is ready [14:16:56] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: 10Urbanecm) [14:19:43] hashar: around? looks like CI is slow again :( [14:19:47] (03CR) 10Filippo Giunchedi: "See inline, also since the script runs a lot of external commands I'd refactor subprocess invocations into a function that would also prin" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:20:28] 408045,2 has 2 out of 3 jobs in queue :( [14:20:41] zeljkof, and operations/mediawiki-config's tests are high priority... 
[14:21:07] !log reboot ganeti1008 for kernel upgrade T181121 [14:21:13] according to zuul page, only one job is running :( [14:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:20] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [14:21:36] The job you're talking about is completed already [14:21:59] It's quite ridiculous that the completed job is the most expensive one :D [14:22:12] ok, so something is wrong with ci for sure [14:22:22] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:30] as far as I can see, no jobs are running [14:22:53] zeljkof, are you able to cancel a job? [14:22:59] maybe CI should be tested before swat start? I think it's annoying for devs and sysadmins that they have to stop their jobs in the middle of the swat [14:23:12] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:23:26] Urbanecm: which one? [14:23:31] (03Merged) 10jenkins-bot: Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: 10Urbanecm) [14:23:35] Hauskatze: CI should just work :( [14:23:54] ok, looks like CI is back up [14:23:56] zeljkof: https://grafana.wikimedia.org/dashboard/db/nodepool?orgId=1 [14:24:06] looks like there are something weird going on [14:24:13] *are/is [14:24:29] keep in mind https://phabricator.wikimedia.org/T187292 [14:24:34] Hauskatze: I don't think we have much job on nodepool any more [14:25:02] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.13 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/410417 (https://phabricator.wikimedia.org/T187159) (owner: 10Gilles) [14:25:26] akosiaris: thanks, looks like at least some parts of CI are affected [14:25:52] (03PS2) 10Zfilipin: Require 7 days & 10 edits for autoconfirmed at zhwiktionary [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/409743 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:26:54] (03CR) 10jenkins-bot: Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: 10Urbanecm) [14:27:16] zeljkof, Hauskatze: Is CI back up? [14:27:25] Or is anything doable from my side? :) [14:27:29] Urbanecm: looks like it is back [14:27:33] That's great! [14:27:40] Urbanecm: 408045 should be at mwdebug1002 any second now [14:27:42] yep, looks like it's back up again [14:27:58] zeljkof, ack, will test [14:28:35] Urbanecm: 408045 is at mwdebug1002 [14:28:36] 10Operations, 10Traffic, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3971859 (10BBlack) Remaining known stuff, paring down the earlier list: ``` * hieradata/common/cache/*.yaml: cp5006 + cp5010 commented out (borked) * External monitoring stuff in:... [14:28:39] zeljkof, ack [14:29:19] zeljkof, 408045 is working, please deploy to the whole universe [14:29:47] launching to low earth orbit [14:30:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971863 (10chasemp) ```define service { # --PUPPET_NAME-- labvirt1008 disk_space active_checks_enabled 1 check_command nrpe_che... 
[14:30:51] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:408045|Make alias from old NS_PROJECT to new NS_PROJECT at hiwikiversity (T185347)]] (duration: 01m 12s) [14:31:01] Urbanecm: 408045 is deployed [14:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:05] T185347: Change Namespace and logo of hiwikiversity - https://phabricator.wikimedia.org/T185347 [14:31:19] zeljkof, great [14:31:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409743 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:33:53] (03Merged) 10jenkins-bot: Require 7 days & 10 edits for autoconfirmed at zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409743 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:34:44] Urbanecm: 409743 is at mwdebug1002 [14:34:48] zeljkof, ack [14:35:46] zeljkof, working, please deploy [14:35:47] (03PS3) 10Zfilipin: Enable flood flag at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409752 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:35:59] deploying [14:36:57] (03CR) 10jenkins-bot: Require 7 days & 10 edits for autoconfirmed at zhwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409743 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:37:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971873 (10chasemp) ``` February 14, 2018 07:00 Service Ok[2018-02-14 07:22:39] SERVICE ALERT: labvirt1008;puppet last run;OK;HARD;1;OK: Puppet is curren... 
[14:37:06] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:409743|Require 7 days & 10 edits for autoconfirmed at zhwiktionary (T187018)]] (duration: 01m 13s) [14:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:21] T187018: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wiktionary - https://phabricator.wikimedia.org/T187018 [14:37:22] Urbanecm: 409743 is deployed [14:37:32] zeljkof, ack [14:37:36] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971876 (10Vgutierrez) [14:38:56] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409752 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:40:10] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3971877 (10chasemp) {F13741113} {F13741114} [14:40:51] (03Merged) 10jenkins-bot: Enable flood flag at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409752 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:41:02] (03CR) 10jenkins-bot: Enable flood flag at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409752 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:41:49] Urbanecm: 409752 is at mwdebug1002 [14:41:59] zeljkof, ack [14:42:14] (03PS1) 10BBlack: eqsin: add bast5001 to network::constants [puppet] - 10https://gerrit.wikimedia.org/r/410451 (https://phabricator.wikimedia.org/T156027) [14:42:42] zeljkof, 409752 is working, please deploy [14:42:50] deploying [14:43:22] Urbanecm: 409745 has merge conflict [14:43:25] (03CR) 10BBlack: [C: 032] eqsin: add bast5001 to network::constants [puppet] - 10https://gerrit.wikimedia.org/r/410451 
(https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:43:35] zeljkof, do you want me to fix it? [14:43:46] Urbanecm: please do [14:43:56] Ok, working on it [14:44:05] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:409752|Enable flood flag at zhwikt (T187018)]] (duration: 01m 12s) [14:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:20] T187018: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wiktionary - https://phabricator.wikimedia.org/T187018 [14:44:25] Urbanecm: 409752 is deployed [14:45:07] (03PS3) 10Urbanecm: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409745 (https://phabricator.wikimedia.org/T187018) [14:45:07] ack [14:45:12] Merge conflict should be resolved [14:45:28] thanks, reviewing [14:46:12] (03CR) 10Muehlenhoff: cassandra: enable component/cassandra33 where applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [14:46:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409745 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:47:17] !log installing PHP security updates [14:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:43] (03Merged) 10jenkins-bot: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409745 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:48:54] (03CR) 10jenkins-bot: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409745 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:49:42] Urbanecm: 409745 is at mwdebug1002 [14:49:56] ack [14:50:33] zeljkof, please do not deploy, there's some 
mistake [14:50:39] I'll create a follow-up commit [14:50:45] ok [14:52:09] (03PS1) 10Urbanecm: In groupoverrides, a => true is needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410454 (https://phabricator.wikimedia.org/T187018) [14:52:36] I've created 410454, can you add it to mwdebug? [14:52:58] Urbanecm: reviewing [14:53:02] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 30 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:53:49] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410454 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:55:33] I see some CI problems AGAIN... [14:55:50] it might just be busy [14:56:12] That's kind of a problem from my point of view [14:56:23] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:56:41] at least it's doing something, it was stuck earlier [14:56:55] Yep [14:57:08] ^labtestweb is me/PHP [14:58:23] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [15:00:49] Urbanecm: CI is still slow, I'll wait a few more minutes [15:01:02] I'm watching zuul's page too :) [15:01:28] see cloud-l [15:01:31] Funny... Priority jobs are executing more slowly than non-priority ones [15:01:39] I see some ci-etc...etc affected [15:01:51] not sure if CI is hosted there though [15:02:28] Well... Gerrit, Phab is at phab... 
[15:02:30] *prod [15:02:49] ci-jessie-wikimedia{0-9}[+]-contintcloud [15:03:02] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 9 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [15:03:05] that said, no idea if this is a different ci [15:03:30] Hauskatze, I see nothing in cloud-l ATM [15:03:58] Urbanecm: sorry, cloud announce: [Cloud-announce] Cloud VPS single hypervisor failure and (some) down instances [15:04:12] not sure what to do, reverting 409745 also needs CI :( [15:04:30] This outage is still ongoing. [15:04:30] We're currently waiting on some on-site data center work (re-applying thermal paste to the hosts' CPUs) before determining exactly how to respond. It still appears that no actual data has been lost but the affected VMs will remain turned off for several more hours. [15:04:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) [15:05:12] zeljkof, theoretically you can add verified flag yourself but this is highly discouraged [15:05:32] Urbanecm: I can also revert directly at tin [15:05:43] zeljkof, can you restart a queued job? [15:06:12] Urbanecm: the problem is that it did not start :) so re-starting does nothing [15:06:19] it's waiting in queue to get started [15:06:30] can you remove it from the queue and start again? 
:D [15:06:37] but apparently there are no available vm/container [15:06:44] I can [15:06:55] (03CR) 10Zfilipin: In groupoverrides, a => true is needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410454 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [15:07:26] (03CR) 10Eevans: [C: 031] cassandra: enable component/cassandra33 where applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [15:07:30] (03CR) 10Volans: "Adding some comments, just a partial review, as I see Filippo has already done a pass." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:07:40] Urbanecm: the problem is that not even test pipeline jobs ran for 410454 [15:07:48] I see [15:07:57] And there's another patch waiting, from marostegui [15:08:10] Yeah, I won't merge, no worries :) [15:08:47] Urbanecm: I'll revert 409745 on tin, we are over time [15:08:52] I have to close the swat window [15:09:24] (03CR) 10Zfilipin: "I will revert 2ae602452844bf5e0b4065625c6fb54f7876f653, please amend that commit." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/410454 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [15:09:46] (03Abandoned) 10Urbanecm: In groupoverrides, a => true is needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410454 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [15:10:28] I've abandoned the follow-up patch, I'll create another patch based on the reverting patch [15:10:31] zeljkof, ^^ [15:13:40] (03PS1) 10Ema: cache_upload: upgrade cp1048 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410460 (https://phabricator.wikimedia.org/T180433) [15:13:42] (03PS1) 10Ema: cache_upload: upgrade cp1049 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410461 (https://phabricator.wikimedia.org/T180433) [15:13:45] (03PS1) 10Ema: cache_upload: upgrade cp1050 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410462 (https://phabricator.wikimedia.org/T180433) [15:13:47] (03PS1) 10Ema: cache_upload: upgrade cp1062 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410463 (https://phabricator.wikimedia.org/T180433) [15:13:49] (03PS1) 10Ema: cache_upload: upgrade cp1063 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410464 (https://phabricator.wikimedia.org/T180433) [15:13:51] (03PS1) 10Ema: cache_upload: upgrade cp1064 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410465 (https://phabricator.wikimedia.org/T180433) [15:13:53] (03PS1) 10Ema: cache_upload: upgrade cp1071 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410466 (https://phabricator.wikimedia.org/T180433) [15:13:55] (03PS1) 10Ema: cache_upload: upgrade cp1072 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410467 (https://phabricator.wikimedia.org/T180433) [15:13:57] (03PS1) 10Ema: cache_upload: upgrade cp1073 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410468 (https://phabricator.wikimedia.org/T180433) [15:13:59] (03PS1) 10Ema: cache_upload: upgrade cp1074 to varnish 5 [puppet] - 
10https://gerrit.wikimedia.org/r/410469 (https://phabricator.wikimedia.org/T180433) [15:14:20] elukey: what are your feelings re: https://gerrit.wikimedia.org/r/409916 ? [15:15:17] zeljkof, how's the reverting process? [15:15:39] urandom: from my ignorant pov it seems good, but I have no idea what it does :D [15:16:03] Urbanecm: on it https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting [15:16:08] (03CR) 10Muehlenhoff: cassandra: enable component/cassandra33 where applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans) [15:16:21] elukey: it adds another field to logstash messages, one that contains the instance ID [15:16:34] ahhhh! +1 then [15:16:37] (03PS1) 10Ema: cache_upload: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410470 (https://phabricator.wikimedia.org/T180433) [15:16:44] elukey: without which, there's no way to know which it came from [15:17:15] elukey: it's straightforward, but i am wondering if the field name should be reconsidered [15:17:59] urandom: the only thing that I don't know is if logstash needs to be adjusted to take this new parameter [15:18:06] !log upgrade cp1048 to varnish 5 [15:18:07] if not, it seems a good addition [15:18:07] i seem to remember that there is some issue when (if) the name isn't unique, if it's used by logstash for indexing, or filtering, or somesuch [15:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] maybe we can summon gehel ? 
:D [15:18:43] elukey: yeah, that is where I was going with that; I think intervention is required, if the name clashes with an attribute otherwise utilized [15:19:15] elukey: for example, it sends a type attribute that we had to have rewritten, if memory serves, but it clashed [15:19:21] s/but it/because it/ [15:19:35] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Add suppressredirect to autoconfirmed at zhwikt" (T187018) (duration: 01m 13s) [15:19:41] * gehel feels pulled into whatever is happening here... [15:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] * gehel is reading back... [15:19:46] T187018: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wiktionary - https://phabricator.wikimedia.org/T187018 [15:19:47] Urbanecm: reverted, please check if things are back to normal [15:20:08] 10Operations, 10Puppet: Investigate landscape of PuppetDB Frontends and Provision One - https://phabricator.wikimedia.org/T184563#3971982 (10Volans) a:03Volans [15:20:21] Yes, they are [15:20:38] elukey: how can I help? [15:20:42] gehel: tl;dr https://gerrit.wikimedia.org/r/409916 [15:20:54] looking [15:21:02] gehel: i.e. does adding a new field there, require doing anything logstash-side [15:21:54] and, should we make the proposed field (`instance_name`) more unique to future-proof things, `cassandra-instance` perhaps? [15:22:48] urandom, elukey: that new instance_name field being a string, it should be no issue. And any preexisting instance_name field should also be a string (I'll check the current mapping). [15:23:05] i don't think there is an existing one [15:23:30] making the field more unique would ensure there is no clash, but would increase the number of fields (we are already at the limit some days) [15:23:31] zeljkof, can you create a reverting commit so everything will be reverted even in git? 
[15:23:42] gehel: oh? limit? [15:23:55] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp1048 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410460 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [15:24:03] gehel: what is the limit on? unique number of fields for...everything? [15:24:08] !log roll-upgrade thumbor to 1.13 - T187159 T179954 T187088 [15:24:14] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187328#3972007 (10Marostegui) p:05Triage>03Normal [15:24:21] Urbanecm: trying to clean it up, will let you know when I finish, never done it before :) [15:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:23] T187088: Thumbor won't render some SVGs whose namespace is too far into the file - https://phabricator.wikimedia.org/T187088 [15:24:24] T179954: Thumbor errors should contain a trackable request id - https://phabricator.wikimedia.org/T179954 [15:24:24] T187159: Add prefixes to custom Thumbor headers - https://phabricator.wikimedia.org/T187159 [15:24:24] urandom: yeah, the limit is something insane (like 10k fields per index, can't remember the exact number) [15:24:39] and we exceed it... [15:24:41] zeljkof, ok, will wait [15:24:43] * urandom groans [15:24:52] and since we send all logs to the same index, yes, that's a very global limit. [15:25:10] gehel: what happens when it is exceeded? [15:25:43] log fills up with warnings :) and messages that try to create new fields are not indexed [15:25:58] lovely. [15:26:02] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [15:26:11] I like to think something along the lines of http://wondermark.com/394/ happens [15:26:24] the underlying issue is that we dump whole object trees to be indexed field by field, which is insane [15:26:45] godog: ha! 
[15:27:10] we have field names like "Q1234", which makes no sense, and I'm almost sure we never query those. [15:27:40] gehel: so... does that somehow mean that using something *less* unique (but unique in the context of the application) would be... desirable? [15:27:58] yeah, i'm not sure we'd need to query on these [15:28:10] no instance_name field yet, so no problem [15:28:30] it means that we should invest a non trivial amount of time to review the way we use logstash [15:29:03] gehel: vis-a-vis what does and does not need indexing, or...? [15:29:20] We are using it like a magic bin, where you throw random stuff at it and expect logstash to magically make sense of it. [15:29:27] ya [15:30:18] yeah, what needs to be indexed, what subtrees should just be dumped as json inside a single field, what should be standardized between different log producers, ... [15:30:31] in this case, this is meant to be helpful metadata for the results you find otherwise, I can't imagine a use-case where we'd need to search on it [15:30:35] we're hitting the limit of what is reasonable with automatic mapping [15:31:04] instance_name definitely looks like something that make sense to have in a field, so go ahead! [15:31:12] gehel: ok! [15:31:15] (03CR) 10Gehel: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/409916 (https://phabricator.wikimedia.org/T130862) (owner: 10Eevans) [15:31:18] (03CR) 10Gehel: [C: 031] cassandra: add instance ID to list of custom logstash fields [puppet] - 10https://gerrit.wikimedia.org/r/409916 (https://phabricator.wikimedia.org/T130862) (owner: 10Eevans) [15:31:20] (03PS1) 10Zfilipin: Revert "Add suppressredirect to autoconfirmed at zhwikt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410473 [15:31:27] elukey: `instance_name`, `cassandra_instance`, ... ? 
[15:31:51] Urbanecm: reverted https://gerrit.wikimedia.org/r/#/c/410473/ [15:31:58] !log EU SWAT finished [15:32:09] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3972056 (10elukey) >>! In T182832#3968786, @Dzahn wrote... [15:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:22] zeljkof, great! Maybe you should CR+2 it? [15:32:31] (so as soon as CI is back okay it'll be merged?) [15:32:42] it's already merged, not sure what to do [15:32:46] will ask and do it :) [15:33:23] No, it isn't. By merged I meant it is in master [15:33:59] (03PS1) 10Chad: Gerrit: Swap git auth to HTTP_LDAP [puppet] - 10https://gerrit.wikimedia.org/r/410474 [15:34:33] (03CR) 10Zfilipin: [V: 032 C: 032] Revert "Add suppressredirect to autoconfirmed at zhwikt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410473 (owner: 10Zfilipin) [15:35:36] urandom, elukey: just for fun, the current mapping: https://phabricator.wikimedia.org/P6696$18123 [15:35:47] Urbanecm: it's merged at tin :) anyway, merged in gerrit now too [15:35:49] (03PS1) 10Urbanecm: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410475 (https://phabricator.wikimedia.org/T187018) [15:35:54] should be all good and cleaned up [15:36:08] Great, thanks for your deployment-work! [15:37:01] (03CR) 10jenkins-bot: Revert "Add suppressredirect to autoconfirmed at zhwikt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410473 (owner: 10Zfilipin) [15:38:05] gehel: umm... 
[15:38:12] gehel: that makes my eyes bleed [15:38:35] urandom: I'm sure I can convert it to XML if that helps ;) [15:38:53] oh my [15:39:56] I tried to link to line 18123 (where there are a lot of funny things going on), but it looks like it just loads at the top, so do the scrolling yourself... [15:42:32] oh, yeah, those are interesting fields [15:42:41] or ... uninteresting, as it were [15:43:22] urandom: long story short, your new "instance_name" field looks more than fine in comparison to what we already have! [15:45:06] !log installing libgcrypt security updates on trusty [15:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:53] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:38] !log powering down labvirt1008 so chris can re-apply thermal paste [15:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:01] !log rolling out debdeploy 0.0.99.2 (cumin masters already upgraded for a while, just synching the clients) [15:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:30] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3972137 (10RobH) a:05Cmjohnson>03Marostegui @Marostegui: I'm assigning this to you for stalled until you provide feedback as requested. Please assign... 
[16:14:11] (03PS1) 10Alexandros Kosiaris: Add a .gitreview file [deployment-charts] - 10https://gerrit.wikimedia.org/r/410477 [16:14:11] (03PS1) 10Alexandros Kosiaris: Correctly name prometheus-statsd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/410478 [16:14:12] !log upgrade and restart db1088 [16:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a .gitreview file [deployment-charts] - 10https://gerrit.wikimedia.org/r/410477 (owner: 10Alexandros Kosiaris) [16:14:12] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3972144 (10Andrew) I've migrated two VMs off of this host: integration-slave-jessie-1001.integration integration-slave-jessie-1002.integration [16:14:12] PROBLEM - Host labvirt1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:14:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3972145 (10Andrew) Chris is currently applying thermal paste to 1008. [16:14:18] zeljkof: sorry for late reply. It is working fine now (re: 410105) [16:14:18] kart_: great! [16:14:18] We still need to figure out issue why it took so long time to reflect in production though. 
[16:14:19] (03PS3) 10Jcrespo: [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) [16:14:19] (03PS1) 10Jcrespo: mariadb: Move db1088 socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410479 (https://phabricator.wikimedia.org/T148507) [16:14:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP]Orchestrate the source of the database backups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/410180 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [16:14:19] (03PS2) 10Jcrespo: mariadb: Move db1088 socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410479 (https://phabricator.wikimedia.org/T148507) [16:14:19] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1088 socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410479 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [16:14:20] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1088 socket location to the default path [puppet] - 10https://gerrit.wikimedia.org/r/410479 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [16:14:21] RECOVERY - Host labvirt1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [16:14:23] Hrmm, now been waiting on git pull for minutes.... no reply [16:14:23] ahhh i say that and it finally goes [16:14:23] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3972239 (10Andrew) 1008 is back up and I'm restarting all the hosted VMs. [16:15:10] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3972241 (10RobH) >>! In T187035#3969514, @Dzahn wrote: > @Robh could you do one more Racktables user? thanks! someone beat me to this, he is already setup! 
=] [16:15:50] (03PS1) 10Gilles: Upgrade to 1.14 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/410482 (https://phabricator.wikimedia.org/T187335) [16:16:23] 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3972246 (10Vgutierrez) >>! In T187035#3972241, @RobH wrote: >>>! In T187035#3969514, @Dzahn wrote: >> @Robh could you do one more Racktables user? thanks! >... [16:18:43] 10Operations, 10Prod-Kubernetes (Experiment): Build Kubernetes for production use - https://phabricator.wikimedia.org/T148968#3972250 (10akosiaris) 05Open>03Resolved a:03akosiaris This has been done a long time now. [16:19:19] !log upgrade cp1049 to varnish 5 [16:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:20] (03PS2) 10Ema: cache_upload: upgrade cp1049 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410461 (https://phabricator.wikimedia.org/T180433) [16:20:35] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1049 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410461 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [16:25:32] (03PS4) 10Jcrespo: mariadb: Repool db2042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 [16:25:34] (03PS1) 10Jcrespo: mariadb: Repool db1088 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410483 [16:26:52] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3972301 (10chasemp) https://phabricator.wikimedia.org/P6697 [16:28:53] (03PS2) 10Jcrespo: mariadb: Repool db1088 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410483 [16:28:55] 10Operations, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#3972303 
(10Halfak) Ping. :) [16:29:15] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1088 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410483 (owner: 10Jcrespo) [16:29:38] (03PS2) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) [16:29:57] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 [16:30:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 [16:30:27] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 [16:30:51] (03PS2) 10Odder: Update logos for the Urdu Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187182) [16:30:59] I am looking at s6 at eqiad [16:31:06] sorry, wrong channel [16:31:41] (03PS3) 10Odder: Update logos for the Urdu Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410201 (https://phabricator.wikimedia.org/T187182) [16:31:48] (03Merged) 10jenkins-bot: mariadb: Repool db1088 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410483 (owner: 10Jcrespo) [16:32:01] (03CR) 10jenkins-bot: mariadb: Repool db1088 after maintenance with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410483 (owner: 10Jcrespo) [16:32:10] 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3965895 (10Cmjohnson) Dell now requires a new report. My self dispatch was kicked back. 
I resubmitted just now [16:32:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3972311 (10chasemp) prometheus seems to have temperature readings https://grafana-admin.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-se... [16:33:50] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 [16:34:00] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3972313 (10Cmjohnson) Dell now requires a new report with self dispatches and this was kicked back. I resubmitted just now [16:35:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 with low weight (duration: 01m 12s) [16:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 (owner: 10Marostegui) [16:40:07] (03PS5) 10Jcrespo: mariadb: Repool db2042 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 [16:40:09] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 [16:40:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 (owner: 10Marostegui) [16:40:49] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410485 (owner: 10Marostegui) [16:41:02] (03CR) 10Jcrespo: "What do you think about this alternative of a "normal state", Manuel?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [16:41:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T162807 (duration: 01m 12s) [16:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:11] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [16:42:17] (03CR) 10Marostegui: [C: 031] "I like it - it is a better usage of resources for that section" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [16:42:58] (03CR) 10Jcrespo: [C: 04-1] "Not sure if put it back or move it to x1 or some other misc. If we need more capacity, I would use one of the large servers instead?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 (owner: 10Jcrespo) [16:43:06] (03PS3) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) [16:43:53] (03CR) 10Esanders: Load 3D extension on other wikis, for display only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: 10Matthias Mullie) [16:43:59] (03CR) 10Jcrespo: [C: 031] "Will wait a bit for db1088 to warm up and deploy as is." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [16:44:01] (03CR) 10Marostegui: "> Not sure if put it back or move it to x1 or some other misc. If we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 (owner: 10Jcrespo) [16:45:03] (03CR) 10Jcrespo: [C: 04-1] "Will then amend to retire it from here and at some point in the future pool it somewhere else... when there is time to do it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/410236 (owner: 10Jcrespo) [16:46:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [16:48:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [16:49:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410459 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [16:50:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 (duration: 01m 12s) [16:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:36] !log Deploy schema change on db1110 - T187089 T185128 T153182 [16:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:50] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [16:50:50] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [16:50:50] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [16:51:57] (03PS1) 10Alexandros Kosiaris: Prepare kubernetes nodes for serving mathoid traffic [puppet] - 10https://gerrit.wikimedia.org/r/410489 (https://phabricator.wikimedia.org/T184919) [16:53:27] (03PS1) 10Marostegui: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410490 [16:56:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410490 (owner: 10Marostegui) [16:56:28] !log upgrade cp1050 to varnish 5 [16:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:57:22] (03PS2) 10Ema: cache_upload: upgrade cp1050 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410462 (https://phabricator.wikimedia.org/T180433) [16:57:36] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1050 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410462 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [16:57:40] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410490 (owner: 10Marostegui) [16:58:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T162807 (duration: 01m 09s) [16:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:14] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [17:01:05] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:14] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:24] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:24] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:44] puppetdb killed and restarted by systemd... expect some spam [17:01:54] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:58] [Wed Feb 14 16:58:06 2018] Out of memory: Kill process 27758 (java) score 355 or sacrifice child [17:02:24] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:29] it would be great to test swappiness = 1 [17:02:52] yeah! what we gathered from the ganeti test for it? 
[17:02:54] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:54] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:55] hey, at least when the backend was on mysql, it only run out of ids every 6 months! [17:03:06] jynus: rotfl [17:04:15] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:24] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:24] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:40] heh, i see [17:04:49] Anyone on the SWAT deploy today? [17:04:52] paladox: i didnt disable all javascript but maybe one of the extensions influences that [17:05:24] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:29] Oh wait, it's in 1 hour [17:05:34] mutante yep [17:05:34] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:37] sorry [17:05:52] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.14 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/410482 (https://phabricator.wikimedia.org/T187335) (owner: 10Gilles) [17:06:14] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 75268 bytes in 0.302 second response time [17:06:42] AndyRussG: Ummm, it looks like your submodule change was merged to core already.... 
[17:06:54] (we shouldn't do that...we wait for the deploy window so HEAD is always deployable) [17:07:12] (probably fine for the next hour, just fyi) [17:07:21] (03CR) 10Jcrespo: [C: 031] "I can take care of deploying this on puppet swat, did it last time and was the one that disabled it by discovering past issues. Ping me wh" [puppet] - 10https://gerrit.wikimedia.org/r/409645 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [17:07:32] (03PS4) 10Filippo Giunchedi: Update Thumbor header names [puppet] - 10https://gerrit.wikimedia.org/r/410199 (https://phabricator.wikimedia.org/T187159) (owner: 10Gilles) [17:08:51] (03CR) 10Filippo Giunchedi: [C: 032] Update Thumbor header names [puppet] - 10https://gerrit.wikimedia.org/r/410199 (https://phabricator.wikimedia.org/T187159) (owner: 10Gilles) [17:09:10] no_justification: right, there was a bit more unusualness than usual here... [17:09:14] apologies [17:09:26] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410490 (owner: 10Marostegui) [17:10:22] also CentralNotice is a snowflake wrt deploys [17:10:43] Heh. It shouldn't be ;-) [17:10:51] I hate snowflakes! [17:10:54] * no_justification grabs a blowtorch [17:11:04] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused [17:11:06] no_justification: yeah... 
I think there's a task attached to that particular blowtorch [17:11:41] Yes, and lots of ranting from me over the years [17:12:04] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time [17:13:10] T179536, T113428, T136904 from just a cursory search [17:13:10] T179536: Unexpected side-effect of CentralNotice wmf_deploy branch strategy - https://phabricator.wikimedia.org/T179536 [17:13:10] T136904: Spike: Plan reforms of the CentralNotice deployment branch - https://phabricator.wikimedia.org/T136904 [17:13:10] T113428: CentralNotice: make-wmf-branch doesn't work for named extension deployment branches - https://phabricator.wikimedia.org/T113428 [17:15:25] Heh, the "spike" is from 2016 ;-) [17:17:15] So, change 406487 can be deploy as usual? [17:20:08] !log roll-upgrade thumbor 1.14 in eqiad/codfw [17:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:02] no_justification: at least T136904 is scheduled for Q4 [17:22:02] T136904: Spike: Plan reforms of the CentralNotice deployment branch - https://phabricator.wikimedia.org/T136904 [17:22:13] (see the column in Fundraising-Backlog) [17:22:42] apologies for such delayz [17:22:48] !log installing uwsgi jessie update on graphite* [17:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:15] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:29:24] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:30:14] !log roll-restart ms-fe to pick up https://gerrit.wikimedia.org/r/c/410199/ [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:34] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:30:40] meh, when i fixed one graphite role 
that broke the other graphite role... on it [17:31:05] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:31:14] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:31:24] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:24] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:31:54] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:32:00] !log Upgrading Jenkins on contint1001 / contint2001 [17:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:24] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:32:54] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:32:54] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:34:24] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:34:48] !log updating remaining python-cryptography updates from jessie point release [17:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:10] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3972736 (10elukey) [17:35:41] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in 
"Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3836191 (10elukey) [17:36:01] 10Operations, 10Kubernetes, 10Prod-Kubernetes (Experiment): Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#3972739 (10akosiaris) 05Open>03Resolved a:03akosiaris This has been address in T170121 and T170111. TL;DR we will end up usin... [17:39:35] !log CI Jenkins seems all happy following the upgrade ^o^ [17:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:00] !log updated jenkins packages on apt.wikimedia.org for stretch (thirdpary/ci) and jessie (thirdparty) to 2.89.4 [17:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:37] !log upgrade cp1062 to varnish 5 [17:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:26] (03PS2) 10Ema: cache_upload: upgrade cp1062 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410463 (https://phabricator.wikimedia.org/T180433) [17:44:31] 10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3972781 (10MoritzMuehlenhoff) These are fully rolled out: ca-certificates uwsgi python-cryptography [17:44:49] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1062 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410463 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [17:45:43] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3972802 (10RobH) a:05RobH>03Cmjohnson So lshw is simply lockign up at reading SCSI on the machine, and won't output the disk model/capacity. Chris, Can you pull defective SSD sdc? (It... 
[17:52:58] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Increase storage space for Wikidata Query Service - https://phabricator.wikimedia.org/T186526#3972824 (10RobH) Actually, each of the wdqs systems is mildly different, and may have different ssds and capacities. I'm moving this... [17:54:00] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3970886 (10hashar) >>! In T187292#3972144, @Andrew wrote: > I've migrated two VMs off of this host: > > integration-slave-jessie-1001.integration > integr... [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T1800). [18:00:05] razesoldier, AndyRussG, and Jhs: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] (03PS1) 10Dzahn: graphite: fix duplicate httpd declaration conflict [puppet] - 10https://gerrit.wikimedia.org/r/410517 [18:00:56] i'm here! :) [18:01:21] +1 [18:02:08] meetoo [18:06:02] hello guys [18:07:23] Who will SWAT today? [18:11:14] ping zeljkof perhaps? :) [18:11:18] (03CR) 10Chad: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [18:13:19] zeljkof: Are your here? [18:17:22] looks not here [18:18:05] vgutierrez: WELCOME! 
[18:18:57] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9983/" [puppet] - 10https://gerrit.wikimedia.org/r/410517 (owner: 10Dzahn) [18:21:04] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3972939 (10RobH) Can you guys provide me with the exact hostnames of the kafka hostnames you want upgraded? I see quite a few, and the hostnames of kafka and a... [18:21:49] (03PS1) 10Chico Venancio: Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) [18:22:16] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3972942 (10Ottomata) @robh I am talking with Faidon about this right now for budgeting next FY. I think we are not going to add SSDs, but instead, get a couple... [18:22:22] icinga-wm: recover! [18:22:43] alright, i need to make dinner. if a swatter comes around, ping me, but i may not be available [18:23:17] (03CR) 10Dzahn: [C: 031] Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [18:23:30] Hey... addshore, Antoine (hashar), Brad (anomie), Katie (aude), Max (MaxSem), Mukunda (twentyafterfour), Roan (RoanKattouw), Sébastien (Dereckson), Tyler (thcipriani), Niharika (Niharika), or Željko (zeljkof) [18:23:34] swat deploy todayz? [18:23:44] thx in advance! 
[18:24:14] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:24:25] there we go [18:24:35] graphite puppet fixed [18:27:14] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:30:34] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3972952 (10Ottomata) For reference, the Kafka cluster nodes are defined in Puppet in [[ https://github.com/wikimedia/puppet/blob/production/hieradata/common.yam... [18:31:04] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [18:31:09] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3972956 (10RobH) So it looks like kafka[12]00[123] are all misc systems with 4 * 4TB LFF hot swap bays. Those cannot easily be converted to SFF, since Dell doe... [18:31:39] Oh, this SWAT window is over halfway [18:32:04] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.006 second response time [18:32:10] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3972959 (10Ottomata) 05Open>03declined Ya, sounds good. (probably no SSDs for future nodes, FYI) [18:36:34] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3972967 (10thiemowmde) Rough answer without knowing all details @hoo might have been referring to: We see mostly linear growth on https://gra... 
[18:47:10] (03CR) 10BryanDavis: [C: 031] Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [18:47:26] (03PS2) 10Dzahn: openstack: Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [18:47:38] (03CR) 10Dzahn: [C: 032] openstack: Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [18:48:03] (03PS3) 10Dzahn: openstack: Rebranding alt message in Horizon [puppet] - 10https://gerrit.wikimedia.org/r/410520 (https://phabricator.wikimedia.org/T168480) (owner: 10Chico Venancio) [18:55:59] No one SWAT deployers appear, may be because time is too early? [18:56:46] razesoldier: Or all the swatters are just busy (it happens) [18:56:56] SWAT is always a "best effort" deploy window. Sometimes it doesn't happen [18:57:52] jouncebot: next [18:57:52] In 0 hour(s) and 2 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T1900) [18:58:05] Eh screw it, I can do them they're pretty easy [18:58:20] (03PS6) 10Chad: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [18:58:38] (03PS1) 10Rush: icinga: create a wmcs contact group for some aggressive alerting [puppet] - 10https://gerrit.wikimedia.org/r/410525 (https://phabricator.wikimedia.org/T187292) [18:58:41] If not happen, will the deployment be postponed next window? [18:58:58] Eh, it just means people would need to add it to the next window. 
But I'll do it now [18:59:01] Better late than never [18:59:06] (03PS2) 10Rush: icinga: create a wmcs contact group for some aggressive alerting [puppet] - 10https://gerrit.wikimedia.org/r/410525 (https://phabricator.wikimedia.org/T187292) [18:59:24] thank [18:59:49] no_justification, whee :) [18:59:53] (03CR) 10Rush: [C: 032] icinga: create a wmcs contact group for some aggressive alerting [puppet] - 10https://gerrit.wikimedia.org/r/410525 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:14] !log upgrade cp1063 to varnish 5 [19:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:49] (03PS2) 10Ema: cache_upload: upgrade cp1063 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410464 (https://phabricator.wikimedia.org/T180433) [19:00:51] (03CR) 10Chad: [C: 032] Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [19:01:02] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1063 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410464 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [19:04:08] 10Puppet, 10cloud-services-team: Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3973064 (10Bstorm) [19:04:13] (03PS1) 10Dzahn: icinga: add volans to test SMS contact group [puppet] - 10https://gerrit.wikimedia.org/r/410526 [19:04:44] (03PS2) 10Dzahn: icinga: add volans to test SMS contact group [puppet] - 10https://gerrit.wikimedia.org/r/410526 [19:05:09] (03Merged) 10jenkins-bot: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 
(https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [19:05:17] (03CR) 10Dzahn: [C: 032] "created a test contact using the new notification commands. this group is a contact for the host "foobar.wmflabs.org" and the service on i" [puppet] - 10https://gerrit.wikimedia.org/r/410526 (owner: 10Dzahn) [19:06:20] (03PS4) 10Chad: Add 3 namespaces to wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: 10Jon Harald Søby) [19:06:27] (03CR) 10Chad: [C: 032] Add 3 namespaces to wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: 10Jon Harald Søby) [19:06:43] !log demon@tin rebuilt and synchronized wikiversions files: namespace aliases for zhwiki, T184866 [19:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:55] T184866: Set Portal and Portal talk namespace alias of zhwiki - https://phabricator.wikimedia.org/T184866 [19:06:55] Crap, wrong script [19:07:01] Are SWAT going? [19:07:01] (03CR) 10jenkins-bot: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [19:07:21] Jayprakash12345, yeah, no_justification is on it [19:07:57] (03Merged) 10jenkins-bot: Add 3 namespaces to wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: 10Jon Harald Søby) [19:08:01] Ok, My one patch is on backlog [19:08:29] !log demon@tin scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [19:08:37] ....um? [19:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:13] Gah! 
[19:09:17] How did I not catch that [19:09:26] razesoldier: NS_PORTAL isn't a namespace constant [19:09:30] Gotta use the integers. [19:09:59] (03PS1) 10Chad: Revert "Set Portal and Portal talk namespace alias of zhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410529 [19:10:01] (03CR) 10Chad: [C: 032] Revert "Set Portal and Portal talk namespace alias of zhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410529 (owner: 10Chad) [19:10:22] (03CR) 10jenkins-bot: Add 3 namespaces to wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405258 (https://phabricator.wikimedia.org/T185289) (owner: 10Jon Harald Søby) [19:10:35] what [19:11:27] razesoldier: 100 and 101 are the number of Portal. [19:11:35] Notice: Use of undefined constant NS_PORTAL - assumed 'NS_PORTAL' in /srv/mediawiki/wmf-config/InitialiseSettings.php on line 4522 [19:11:37] Etc. [19:11:37] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Revert prior, busted the canaries (duration: 01m 15s) [19:11:45] razesoldier: see https://gerrit.wikimedia.org/r/#/c/410447/2/wmf-config/InitialiseSettings.php [19:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:01] (03PS1) 10Rush: openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) [19:12:29] 10Puppet, 10cloud-services-team: Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3972790 (10chasemp) Looks available in the trusty repos at least > p hp-health - hp System Health Application and Command line Utility Packa [19:12:33] (03PS2) 10Rush: openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) [19:12:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 
(https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:12:36] ok, I now pushing a new patch [19:12:50] razesoldier: Submit yours for next SWAT, I'm not reattempting it this window [19:13:00] (03CR) 10jerkins-bot: [V: 04-1] openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:13:33] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: wawiktionary namespaces, T185289 (duration: 01m 13s) [19:13:36] Moving on.... Jhs your first one is live.... now ^ [19:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:46] T185289: Create 3 new namespaces on Walloon Wiktionary - https://phabricator.wikimedia.org/T185289 [19:13:56] no_justification: ok [19:14:00] no_justification, live as in wmdebugsomething or live live? :) [19:14:02] hmm [19:14:17] PHP Notice: Use of undefined constant NS_PORTAL - assumed 'NS_PORTAL' in /srv/mediawiki-staging/wmf-config/InitialiseSettings.php on line 4522 [19:14:18] 10Puppet, 10cloud-services-team: Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3973080 (10Bstorm) For general info on the cli: https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/doc/general/mgmt/ima/v790/hpasmcli.txt [19:14:22] HausAFKatze: Fixed already [19:14:28] ok [19:14:30] Jhs: Live as in live everywhere ;-) [19:14:33] cool :) [19:14:42] !log ran namespaceDupes.php --fix on wawiktionary [19:14:44] Jon Harald, hi! 
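Note on the NS_PORTAL notice quoted in the log: core MediaWiki defines constants only for its built-in namespaces, so a per-wiki namespace such as Portal has to be referenced by its integer ID (100 and 101, as stated in the channel). What follows is a minimal sketch of what such an entry might look like. The variable name, array layout and alias strings are illustrative assumptions, not the contents of the actual patch.

```php
<?php
// Illustrative fragment only, written in the style of
// wmf-config/InitialiseSettings.php. $exampleSettings and the alias strings
// are placeholders. 100 and 101 are the Portal and Portal talk namespace IDs
// mentioned in the channel.
$exampleSettings = [
    'wgNamespaceAliases' => [
        'zhwiki' => [
            // Writing NS_PORTAL here instead of 100 would fail: no such
            // constant is defined, so PHP of that era logs
            // "Use of undefined constant NS_PORTAL - assumed 'NS_PORTAL'"
            // and substitutes the string 'NS_PORTAL' for the ID.
            'Example_alias' => 100,      // Portal
            'Example_alias_talk' => 101, // Portal talk
        ],
    ],
];
```

On PHP 8 and later the same mistake throws an Error rather than logging a notice, so it would fail outright instead of only raising the error rate.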
[19:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:14] Jhs is Harald V in secret mission :-) [19:15:18] (03PS3) 10Chad: Set category collation for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406022 (https://phabricator.wikimedia.org/T185630) (owner: 10Jon Harald Søby) [19:15:24] (03CR) 10Chad: [C: 032] Set category collation for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406022 (https://phabricator.wikimedia.org/T185630) (owner: 10Jon Harald Søby) [19:15:32] no_justification: Let me know when my turn. [19:15:40] HausAFKatze, a royal hullo to you *waves awkwardly* [19:15:41] Got one more for Jhs, then you :) [19:15:57] Jhs: me-bows [19:16:40] !log rebooting labvirt1019 so I can have a look at the raid setup, for T172538 [19:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:53] T172538: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538 [19:16:54] (03Merged) 10jenkins-bot: Set category collation for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406022 (https://phabricator.wikimedia.org/T185630) (owner: 10Jon Harald Søby) [19:17:24] (03CR) 10jenkins-bot: Set category collation for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406022 (https://phabricator.wikimedia.org/T185630) (owner: 10Jon Harald Søby) [19:17:49] (03CR) 10Andrew Bogott: [C: 031] "This seems like an improvement -- it's important to notice, though, that if a host reboots this check will recover (but we will still have" [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:17:59] (03PS2) 10Chad: Revert "Set Portal and Portal talk namespace alias of zhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410529 [19:18:01] (03CR) 10Chad: [V: 032 C: 032] Revert "Set Portal and Portal talk namespace alias of zhwiki" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/410529 (owner: 10Chad) [19:18:51] (03PS3) 10Chad: Create Portal alias for hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) (owner: 10Jayprakash12345) [19:18:56] (03CR) 10Chad: [C: 032] Create Portal alias for hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) (owner: 10Jayprakash12345) [19:19:45] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: nowikimedia collation, T185630 (duration: 01m 13s) [19:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:59] T185630: Change category collation on Wikimedia Norge's wiki to "uca-nb-u-kn" - https://phabricator.wikimedia.org/T185630 [19:20:37] (03CR) 10jenkins-bot: Revert "Set Portal and Portal talk namespace alias of zhwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410529 (owner: 10Chad) [19:20:39] (03PS28) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [19:20:58] !log running updateCollation.php on nowikimedia [19:21:06] Jhs: Your second one is done (pending the script completion) [19:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:16] And {{done}} [19:21:31] no_justification: Is my patch on the mwdebug1002? [19:21:42] It's still waiting on jenkins :) [19:22:08] (03Merged) 10jenkins-bot: Create Portal alias for hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) (owner: 10Jayprakash12345) [19:22:15] And if you haven't caught on, I don't tend to use mwdebug ;-) [19:22:25] no_justification, yup, both patches work as they should. 
Thank you very much :) [19:22:56] yw [19:22:58] !log upgrade cp1064 to varnish 5 [19:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:33] (03PS2) 10Ema: cache_upload: upgrade cp1064 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410465 (https://phabricator.wikimedia.org/T180433) [19:23:41] (03PS3) 10Rush: openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) [19:23:48] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1064 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410465 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [19:23:58] (03CR) 10jenkins-bot: Create Portal alias for hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410447 (https://phabricator.wikimedia.org/T187286) (owner: 10Jayprakash12345) [19:24:00] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: portal aliases for hiwiki (duration: 01m 13s) [19:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:47] !log ran namespaceDupes.php --fix for hiwiki [19:24:53] Jayprakash12345: You're done now [19:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:47] !log enabling netflow on cr1-eqiad [19:25:48] no_justification: Thanks, It is working fine. :) [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:59] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [19:28:35] (03PS1) 10星耀晨曦: Follow-up 410529: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410536 (https://phabricator.wikimedia.org/T184866) [19:28:55] Jayprakash12345: Fwiw, there's a bit of namespaceDupes ugliness from hiwiki now.... 
[19:29:37] https://phabricator.wikimedia.org/P6700 [19:30:41] no_justification: any problem? [19:30:48] Not a...problem really [19:30:52] But we should clean it up [19:30:52] :) [19:31:47] (03PS2) 10星耀晨曦: Follow-up 0bfc7d8: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410536 (https://phabricator.wikimedia.org/T184866) [19:32:08] (03PS4) 10Rush: openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) [19:33:03] please review my patch, let me rest assured [19:33:06] :) [19:33:34] (03CR) 10Rush: [C: 032] openstack: alert on down/unreachable nova-compute early [puppet] - 10https://gerrit.wikimedia.org/r/410532 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:34:14] (03CR) 10Jayprakash12345: [C: 031] Follow-up 0bfc7d8: Set Portal and Portal talk namespace alias of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410536 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [19:34:18] razesoldier: Please submit for the next swat, I don't want to reattempt it this time (considering it's not even an official swat window, I kinda just made it up) [19:34:25] (03CR) 10Jayprakash12345: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410536 (https://phabricator.wikimedia.org/T184866) (owner: 10星耀晨曦) [19:35:39] I plan, submit next morning SWAT [19:35:50] no_justification: Thanks for being here Chad, Good Night :) [19:36:01] gnite [19:39:59] Goodbye, I have to go to bed now (UTC+8: 3:39) :) [19:44:13] (03PS1) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [19:44:20] (03PS2) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [19:44:26] !log upgrade cp1071 to varnish 5 
[19:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:46] (03CR) 10jenkins-bot: [V: 04-1] openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [19:44:50] (03PS2) 10Ema: cache_upload: upgrade cp1071 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410466 (https://phabricator.wikimedia.org/T180433) [19:44:54] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3973259 (10Andrew) Drive config on the HPs is annoying. The steps are: -reboot -during boot, ESC-9 -select System Configuration->Embedded RAID 1 : Smar... [19:45:00] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1071 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410466 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [19:46:09] (03PS3) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [19:55:45] (03PS1) 10Sbisson: Enable log channel T184670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410546 (https://phabricator.wikimedia.org/T184670) [19:57:52] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3973319 (10Cmjohnson) robh: INTEL SSD D S3610 Series 800GB, model ssdsc2bx800g4 [19:58:47] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Evacuate relevant instances off of labvirt1008 - https://phabricator.wikimedia.org/T187317#3973329 (10chasemp) 05Open>03Invalid Marking this invalid for now as long as the current fix holds out :) [19:58:50] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: labvirt1008 rebooted / system was overheated - https://phabricator.wikimedia.org/T187292#3973331 (10chasemp) [20:00:04]
twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:23] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:00:55] (03CR) 10Andrew Bogott: [C: 031] "This might turn out to be annoying but it's worth a try :)" [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:01:08] (03CR) 10Bstorm: "Seems like it'd work. If nagios had libvirtd permissions it could run 'virsh list' instead...but do we want nagios to have more perms?" [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:02:15] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3973357 (10RobH) 05Open>03stalled replacement SSD order is now pending on sub-task T187369. This is stalled until the replacement is ordered and onsite. 
[20:05:56] !log upgrade cp1072 to varnish 5 [20:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:14] (03CR) 10Bstorm: [C: 031] openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:07:03] (03PS2) 10Ema: cache_upload: upgrade cp1072 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410467 (https://phabricator.wikimedia.org/T180433) [20:07:08] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1072 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410467 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [20:10:05] (03PS4) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [20:10:07] (03PS1) 10Rush: openstack: fix overlapping descriptions for nova-compute check [puppet] - 10https://gerrit.wikimedia.org/r/410551 (https://phabricator.wikimedia.org/T187292) [20:10:45] (03CR) 10Rush: [C: 032] openstack: fix overlapping descriptions for nova-compute check [puppet] - 10https://gerrit.wikimedia.org/r/410551 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:13:49] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3807102 (10Smalyshev) While projecting the dumps, we should also account for: 1. Lexemes, which would probably cause some bump in addition to... [20:14:54] twentyafterfour: hi! just a detail about the train deploy... [20:15:58] There's this change https://gerrit.wikimedia.org/r/#/c/410346/ [20:16:11] got put in the wmf.21 branch [20:16:33] the expectation was that it'd go out on a swat deploy this morning, however that didn't happen... 
[20:16:42] 10Puppet, 10cloud-services-team: Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3973411 (10chasemp) p:05Triage>03Normal [20:16:59] just wondering if the fact that it's on the branch means it'd be pushed out with today's train [20:17:08] no worries either way, just wanted to give you a heads-up [20:17:22] and find out if someone should be around to watch the deploy [20:17:38] not a huge change, only would impact the CentralNotice Admin UI, which only lives on meta [20:18:26] AndyRussG: Ahhh, you were afk! [20:18:29] I did a belated swat [20:18:38] (but I don't wanna step on train at this point) [20:24:48] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3973425 (10Andrew) It looks like we just need to rebuild the raids on these. That's more-or-less impossible to do remotely so I'll create subtasks for C... [20:25:55] no_justification: oops [20:25:59] :) [20:26:01] no_justification: so did it go out? [20:26:12] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973427 (10Andrew) p:05Triage>03Normal [20:26:26] (the CentralNotice patch) [20:26:44] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973427 (10Andrew) [20:26:48] I didn't sync anything from MW branches [20:26:51] It was all config stuff [20:27:56] no_justification: ah hmmm [20:28:03] and do you know the train status? [20:28:19] would this go out on the train in any case, because of having been merged to wmf.21 in Gerrit? [20:28:22] That's a question for twentyafterfour :) [20:30:59] no_justification: is wmf.21 on group1 already? [20:31:11] See what I said before ^ [20:31:16] I'm not on train duty! 
[20:31:17] sorry just arrived [20:31:19] ? [20:31:33] wmf.21 is going to group1 right about now [20:31:47] scap! :D [20:32:55] (03PS2) 10Ema: cache_upload: upgrade cp1073 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410468 (https://phabricator.wikimedia.org/T180433) [20:33:02] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1073 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410468 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [20:33:37] !log upgrade cp1073 to varnish 5 [20:33:38] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410557 [20:33:40] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410557 (owner: 1020after4) [20:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:17] twentyafterfour: if a patch was merged yesterday to wmf.21, is it going out now then to group 1 then? [20:34:34] AndyRussG: only if it was also sync'd [20:34:58] twentyafterfour: ah K I don't know if it was [20:35:09] twentyafterfour: can you check perhaps? https://gerrit.wikimedia.org/r/#/c/410346/ [20:35:10] AndyRussG: which patch? I can deploy it [20:35:22] twentyafterfour: ah that'd be fantastic :) [20:35:54] thx!!!! [20:36:10] (we had a few swatty delays earlier...) 
[20:36:34] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410557 (owner: 1020after4) [20:36:55] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410557 (owner: 1020after4) [20:37:35] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.21 [20:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:48] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.21 (duration: 01m 12s) [20:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:33] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973473 (10Andrew) => ctrl slot=0 pd all show status physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1.6 TB): OK physicaldrive 1I:1:2 (p... [20:40:53] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973484 (10Andrew) And yet... the system claims that it's using all 8 drives and only getting 5.6 Tb. ``` => ctrl slot=0 ld 1 show Smar... [20:40:53] !log Group1 wikis are now running MediaWiki 1.31.0-wmf.21 - still no blockers on T183960 [20:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:08] T183960: 1.31.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T183960 [20:41:51] AndyRussG: syncing CentralNotice for good measure [20:42:05] twentyafterfour: fantasmic, thanks much! 
[20:42:19] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.21/extensions/CentralNotice/: sync https://gerrit.wikimedia.org/r/#/c/410346/ for Ejegg (duration: 01m 15s) [20:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:53] !log upgrade cp1074 to varnish 5 [20:44:04] thanks twentyafterfour ! [20:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:53] (03PS2) 10Ema: cache_upload: upgrade cp1074 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410469 (https://phabricator.wikimedia.org/T180433) [20:45:02] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp1074 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410469 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [20:47:57] ejegg: twentyafterfour: I'm actually not seeing the new code there. Using the debug=true flag, which i think forces RL to send the latest modules [20:48:12] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973507 (10chasemp) The quote for the original order here seems to indicate 20 drives over 2 servers so 10 each? [20:48:49] (03PS1) 10Ema: wmf-upgrade-varnish: add support for non-interactive upgrades [puppet] - 10https://gerrit.wikimedia.org/r/410558 (https://phabricator.wikimedia.org/T168529) [20:49:44] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973510 (10Andrew) OK, so step one is for Chris to crack open those cases and count the SSDs :( [20:49:52] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild hardware raids for labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973511 (10RobH) Indeed, the order shows 10 disks per system, not 8. 
I wonder if the other 2 show up as non-raid configured disks and nee... [20:50:38] twentyafterfour, AndyRussG : Darn, me neither [20:50:55] twentyafterfour: I have to be afk (kid pickup) intermittently over the next hour or so [20:51:56] code looks good on tin [20:52:12] sometimes there are RL sync issues [20:52:23] K I'll be back intermittently, apologies and thanks [20:52:51] twentyafterfour: pls refer questions to ejegg as needed, many thanks 2 you both! [20:53:21] Special:Version points to 5e7beec, but I guess that would be because you just synced extensions/CentralNotice? [20:53:41] ejegg: syncing again .. yeah I had to rebase on tin [20:54:14] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.21/extensions/CentralNotice: Sync CentralNotice again after proper rebase (duration: 01m 14s) [20:54:17] and done [20:54:25] ejegg: does everything look ok now? [20:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:40] twentyafterfour: yep! Definitely got the new code [20:55:43] thank you [20:56:32] you're welcome! not sure why git didn't rebase that properly without some coaxing. [20:56:36] This is why I love submodules [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180214T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:22] !log upgrade cp1099 to varnish 5 (last upload@eqiad host) [21:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:58] no ORES today. We'll be fully switching over to the new cluster soon though [21:00:59] (03CR) 10Volans: "LGTM, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410558 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [21:01:13] ejegg: twentyafterfour: yep! 
works great :) [21:01:18] thanks! [21:01:28] (just tested the actual bug we were fixing) [21:01:47] (03PS2) 10Ema: cache_upload: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410470 (https://phabricator.wikimedia.org/T180433) [21:02:23] (03CR) 10Ema: [C: 032] cache_upload: upgrade eqiad to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/410470 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [21:02:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Count the SSDs inside of labvirt1019 and/or labvirt1020 - https://phabricator.wikimedia.org/T187373#3973551 (10Andrew) [21:08:16] (03PS5) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [21:08:42] (03PS6) 10Rush: openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) [21:09:35] (03CR) 10Rush: [C: 032] openstack: monitor kvm processes on labvirts [puppet] - 10https://gerrit.wikimedia.org/r/410540 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [21:12:38] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:28] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 75310 bytes in 0.157 second response time [21:15:01] !log arlolra@tin Started deploy [parsoid/deploy@7961b3f]: Updating Parsoid to caee2ed [21:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:24] twentyafterfour: have a train soon? 
[21:15:44] oh, i see it was already done [21:15:47] matanya: it's there :) https://tools.wmflabs.org/versions/ [21:16:45] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3973640 (10Andrew) [21:18:43] thanks greg-g [21:18:44] (03CR) 10Volans: "Result of chat with ema inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410558 (https://phabricator.wikimedia.org/T168529) (owner: 10Ema) [21:27:49] (03PS1) 10Dzahn: icinga: don't cut new SMS output at 135 chars [puppet] - 10https://gerrit.wikimedia.org/r/410567 (https://phabricator.wikimedia.org/T185862) [21:28:52] (03CR) 10Dzahn: [C: 032] icinga: don't cut new SMS output at 135 chars [puppet] - 10https://gerrit.wikimedia.org/r/410567 (https://phabricator.wikimedia.org/T185862) (owner: 10Dzahn) [21:29:23] (03PS3) 10BBlack: URL Path Normalization: fully normalize cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407643 (https://phabricator.wikimedia.org/T127387) [21:29:25] (03PS4) 10BBlack: URL Path Normalization: add to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/407489 (https://phabricator.wikimedia.org/T127387) [21:29:27] (03PS2) 10BBlack: URL Normalization: strip fragment [puppet] - 10https://gerrit.wikimedia.org/r/407670 (https://phabricator.wikimedia.org/T127387) [21:29:29] (03PS2) 10BBlack: URL Normalization: normalize query chars as well [puppet] - 10https://gerrit.wikimedia.org/r/407671 [21:30:13] !log arlolra@tin Finished deploy [parsoid/deploy@7961b3f]: Updating Parsoid to caee2ed (duration: 15m 12s) [21:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:36] !log mholloway-shell@tin Started deploy [mobileapps/deploy@9bad612]: Update mobileapps to f23519f [21:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:01] (03CR) 10Niedzielski: [C: 04-1] New: add chromium_render service (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski) [21:37:38] PROBLEM - ensure kvm processes are running on labtestvirt2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [21:38:37] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@9bad612]: Update mobileapps to f23519f (duration: 06m 01s) [21:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:34] (03PS1) 10Dzahn: icinga: add robh to test paging group [puppet] - 10https://gerrit.wikimedia.org/r/410597 [21:45:17] (03CR) 10Dzahn: [C: 032] icinga: add robh to test paging group [puppet] - 10https://gerrit.wikimedia.org/r/410597 (owner: 10Dzahn) [21:47:58] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 [21:48:28] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#3973705 (10aaron) a:05aaron>03None [21:48:33] (03Abandoned) 10Aaron Schulz: Avoid perl warnings for invalid lines in reverse-stack mode [puppet] - 10https://gerrit.wikimedia.org/r/377451 (https://phabricator.wikimedia.org/T169249) (owner: 10Aaron Schulz) [21:48:46] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [21:50:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [21:50:37] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1088 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410484 (owner: 10Jcrespo) [21:52:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 with full weight (duration: 01m 13s) [21:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:29] 
(03PS1) 10Dzahn: icinga: rm temp testing setup, add test-robh to ops [puppet] - 10https://gerrit.wikimedia.org/r/410598 [22:00:19] PROBLEM - ensure kvm processes are running on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [22:01:49] (03CR) 10Dzahn: [C: 032] icinga: rm temp testing setup, add test-robh to ops [puppet] - 10https://gerrit.wikimedia.org/r/410598 (owner: 10Dzahn) [22:04:21] !log aaron@tin Synchronized php-1.31.0-wmf.20/includes/SiteStats.php: f549559dc0 (duration: 01m 13s) [22:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:40] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [22:10:01] mutante: ^^^ [22:10:33] volans: thanks, it's already fixed [22:10:58] needed one more puppet run to get contact/contactgroup removal [22:11:27] eh, or not, but on it [22:15:01] (03PS1) 10Bstorm: openstack: install hp-health on labvirt* servers [puppet] - 10https://gerrit.wikimedia.org/r/410599 (https://phabricator.wikimedia.org/T187355) [22:48:18] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#3973875 (10Krinkle) Let's try updating to the latest upstream version, and see if we still get these errors, and if so, I'll file an upstream report. 
[22:48:58] (03CR) 10Andrew Bogott: [C: 031] openstack: install hp-health on labvirt* servers [puppet] - 10https://gerrit.wikimedia.org/r/410599 (https://phabricator.wikimedia.org/T187355) (owner: 10Bstorm) [23:08:49] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [23:39:39] !log Running initSiteStats.php on s3 for T186947 [23:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:56] T186947: many statistics have fallen to 0 on azwiktionary, ruwikiquote, and ptwikisource - https://phabricator.wikimedia.org/T186947 [23:45:39] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:53:20] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3974123 (10RobH) 05Open>03Resolved a:03RobH >>! In T182397#3866939, @MoritzMuehlenhoff wrote: > This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in... 
[23:54:00] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3974131 (10RobH) [23:54:31] 10Operations, 10Cloud-Services, 10hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3974134 (10RobH) [23:54:34] 10Operations, 10ops-eqsin, 10DC-Ops, 10Traffic: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3974137 (10RobH) [23:56:46] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3974169 (10RobH) [23:57:44] 10Operations, 10Traffic: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3974182 (10RobH) [23:57:51] 10Operations, 10ops-eqsin, 10DC-Ops, 10Traffic: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3974185 (10RobH) [23:57:56] 10Operations, 10Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3974186 (10RobH) [23:57:59] 10Operations, 10ops-eqsin, 10DC-Ops, 10Traffic: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3287561 (10RobH) [23:58:34] (03PS1) 10Dzahn: xhgui: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/410620