[00:04:12] Reedy: I'm ready to test whenever jenkins feels like merging it. :-) [00:04:22] Think he's nearly done [00:05:23] hhvm is slooooow [00:05:31] It's kinda hilarious we're saying that [00:05:37] Yeah, why don't we get rid of that? [00:05:38] ;-) [00:06:58] (03PS2) 1020after4: WIP: Add phabricator config for the new swift backend [puppet] - 10https://gerrit.wikimedia.org/r/432533 [00:07:15] (03PS2) 1020after4: Add account for phabricator_files to swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/432528 [00:07:31] Reedy will need to find a new exotic locale to run a deploy from if he's going to return in style. Cars, planes, and boats have been done already. What does it take to get on a rocket? [00:07:44] * bd808 phones Musk [00:07:51] ISS would be fun [00:08:12] Cindy knows folks at NASA :) [00:08:37] Submarine would be harder than space. [00:08:46] (03PS3) 1020after4: Add account for phabricator_files to swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/432528 [00:08:51] Submarine docked in Portsmouth seems easy [00:09:47] (03PS2) 10EBernhardson: Tune CirrusSearch slow logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436848 (https://phabricator.wikimedia.org/T196180) [00:09:57] (03CR) 10jerkins-bot: [V: 04-1] Tune CirrusSearch slow logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436848 (https://phabricator.wikimedia.org/T196180) (owner: 10EBernhardson) [00:10:06] James_F: should be on mwdebug1001 [00:10:16] Ta. [00:12:11] Reedy: LGTM. 
[00:15:10] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.7/extensions/WikimediaMessages/: respect watchlist preference feature flag (duration: 00m 58s) [00:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:09] (03PS1) 10EBernhardson: logstash: Use gelf long_message when provided [puppet] - 10https://gerrit.wikimedia.org/r/437657 (https://phabricator.wikimedia.org/T196180) [00:23:05] (03CR) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:23:52] (03PS4) 1020after4: Configuration for phabricator to use swift storage. [puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) [00:24:12] (03Abandoned) 1020after4: WIP: Add phabricator config for the new swift backend [puppet] - 10https://gerrit.wikimedia.org/r/432533 (owner: 1020after4) [00:26:52] (03CR) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:26:59] (03PS3) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:28:02] (03CR) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:28:24] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10MW-1.32-release-notes (WMF-deploy-2018-06-05 (1.32.0-wmf.7)), and 2 others: php-memcached 3.0 (PHP 7) incompatible with BagOStuff - https://phabricator.wikimedia.org/T196125#4247677 (10Reedy) [00:38:03] (03PS4) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:38:52] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be 
used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:40:59] (03PS5) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:41:27] * Krinkle staging on deploy1001 and testing something mwdebug1002 [00:41:47] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:44:49] (03PS6) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:45:27] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:51:05] (03PS7) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:51:47] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:52:05] what nonsense [00:53:01] (03PS8) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:53:42] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:54:35] (03PS9) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:55:16] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:55:44] 00:55:14 
modules/network/manifests/constants.pp:220 wmf-style: Found hiera call in class 'network::constants' for 'network::constants::extra_labs_cumin_masters' [00:55:45] wat [00:56:13] I'm going back to the idea of just giving those hosts access to everything [00:57:36] (03PS10) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [00:58:13] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [00:58:57] (03PS11) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [01:04:40] actually I know a different hack around this [01:05:01] no I'll just try the (bad) suggested way [01:06:28] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.7/vendor/: I5a5d7de4702c23f0 / T196496 (duration: 01m 35s) [01:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:34] T196496: Inline script for 'wgBackendResponseTime' missing in prod - https://phabricator.wikimedia.org/T196496 [01:07:43] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.7/composer.json: I13dbdba2b9d / T196496 (duration: 00m 57s) [01:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:44] (03PS12) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [01:12:24] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [01:16:28] (03PS13) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [01:17:01] (03CR) 10jerkins-bot: [V: 04-1] cumin: Allow Puppet DB backend to be 
used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [01:17:59] (03PS14) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [02:27:12] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.6) (duration: 08m 23s) [02:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:46] (03Abandoned) 10Krinkle: Move scap::sources from role::deployment_server to common [puppet] - 10https://gerrit.wikimedia.org/r/436581 (https://phabricator.wikimedia.org/T161675) (owner: 10Krinkle) [02:59:25] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10MW-1.31-release, and 3 others: php-memcached 3.0 (PHP 7) incompatible with BagOStuff - https://phabricator.wikimedia.org/T196125#4259791 (10Reedy) [02:59:59] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 15m 31s) [03:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:14] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jun 6 03:10:14 UTC 2018 (duration 10m 15s) [03:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 901.17 seconds [04:03:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 292.88 seconds [04:35:23] (03PS1) 10KartikMistry: dotfiles: Added `screen -R` in .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/437669 [04:41:24] !log kartik@deploy1001 Started deploy [cxserver/deploy@8ce20ba]: Update cxserver to 391d7b6 (Fixing T196462) [04:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:30] T196462: cxserver: Error: ENOENT: no such file or directory, open 'config/MWPageLoader.yaml - 
https://phabricator.wikimedia.org/T196462 [04:44:29] !log kartik@deploy1001 Finished deploy [cxserver/deploy@8ce20ba]: Update cxserver to 391d7b6 (Fixing T196462) (duration: 03m 06s) [04:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:13] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T196490#4259845 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID... [05:16:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4259848 (10Marostegui) 05Open>03Resolved After replacing disk #1, this is all good now. ``` root@db1065:~# megacli -LDPDInfo -aAll | grep -i flagged Drive has flagged a S.M.A.R.T alert... [05:17:47] !log Deploy schema change on db1070 s5 primary master - T191316 T192926 T195193 [05:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:53] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [05:17:53] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:17:53] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:20:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.115 second response time [05:21:01] <_joe_> AaronSchulz: so, I've noticed that using 2 distinct PoolRoutes we expose ourself to some failure scenario which is undesired, I'll write a ticket once I'm sure of what might need to be changed [05:21:57] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/437670 (https://phabricator.wikimedia.org/T190704) [05:22:44] (03CR) 
10Marostegui: [C: 032] dbproxy1010: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/437670 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [05:24:07] !log Reload haproxy on dbproxy1010 to depool labsdb1010 - T190704 [05:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:12] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:25:01] !log Restart MySQL on labsdb1010 [05:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [05:29:24] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/437671 [05:30:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 64 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:30:57] (03CR) 10Marostegui: [C: 032] Revert "dbproxy1010: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/437671 (owner: 10Marostegui) [05:31:56] !log Reload haproxy on dbproxy1010 to repool labsdb1010 - T190704 [05:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:00] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:38:07] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:39:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [05:40:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 12 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:41:27] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational [05:44:25] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/437674 (https://phabricator.wikimedia.org/T190704) [05:45:00] (03CR) 10Marostegui: [C: 032] dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/437674 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [05:46:29] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - T190704 [05:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:34] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:53:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 43 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:53:44] !log ppchelko@deploy1001 Started deploy [restbase/deploy@baa70b7]: Public release of feed availability endpoint T196402 [05:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:48] T196402: Public rollout of feed content availability endpoint - https://phabricator.wikimedia.org/T196402 [05:55:56] (03PS1) 10Marostegui: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437676 (https://phabricator.wikimedia.org/T190704) [05:58:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 10 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [05:59:17] (03PS2) 
10Marostegui: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437676 (https://phabricator.wikimedia.org/T190704) [06:01:37] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:01:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437676 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [06:02:46] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:03:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437676 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [06:04:26] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4259884 (10Marostegui) [06:04:46] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [06:04:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all sanitariums masters - T190704 (duration: 01m 09s) [06:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:51] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [06:05:28] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@baa70b7]: Public release of feed availability endpoint T196402 (duration: 11m 45s) [06:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:32] T196402: Public rollout of feed content availability endpoint - https://phabricator.wikimedia.org/T196402 [06:06:46] !log ppchelko@deploy1001 Started deploy 
[restbase/deploy@baa70b7]: Public release of feed availability endpoint T196402, take 2 [06:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:56] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is WARNING: Test Transform wikitext to html responds with unexpected body: h2 id=HeadingHeading/h2 != /^h2.* Heading \/h2/: /en.w [06:09:56] e/media/{title}{/revision} (Get media in test page) is WARNING: Test Get media in test page responds with unexpected value at path /items[2] = Missing keys: [utitles, uthumbnail, ulicense] [06:13:59] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@baa70b7]: Public release of feed availability endpoint T196402, take 2 (duration: 07m 13s) [06:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:04] T196402: Public rollout of feed content availability endpoint - https://phabricator.wikimedia.org/T196402 [06:14:19] !log Stop slave on db2095:3316 to rebuild archive_insert and archive_update triggers - T192926 [06:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:24] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [06:18:56] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4259902 (10tstarling) Special:Userlogin starts a session on a GET request so that it can implement CSRF protection on the... 
[06:32:05] !log Deploy schema change on s6 codfw master (db2039), this will generate lag on s6 codfw - T191316 T192926 T195193 T89737 [06:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:12] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [06:32:12] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [06:32:12] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [06:32:12] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [06:37:55] (03Abandoned) 10Giuseppe Lavagetto: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/434952 (owner: 10Giuseppe Lavagetto) [06:47:08] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4259926 (10aaron) >>! In T91820#4259902, @tstarling wrote: > Special:Userlogin starts a session on a GET request so that i... [06:56:08] (03CR) 10Muehlenhoff: [C: 031] "One more down!" [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [06:57:35] (03PS4) 10Elukey: Move the varnishkafka submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) [06:58:26] jynus: o/ - qq before merging --^ - did you get any issue with puppet when you moved the mariadb submodule to operations/puppet? 
(just to know what to expect) [07:00:26] yes, it broke all puppetmasters [07:00:40] ah lovely [07:00:58] I would wait to involve cloud [07:01:11] as it just needs an rm to be fixed [07:01:49] (03CR) 10Muehlenhoff: profile::mediawiki::jobrunner: manage both videoscaler, jobrunner (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437490 (owner: 10Giuseppe Lavagetto) [07:01:53] basically it creates a conflict because existing files (old submodule) conflict with new tracked files [07:02:34] so pull fails [07:05:09] ah you mean the puppet masters syncing from the prod ones, like labs etc.. [07:05:24] but the regular puppet-merge in prod shouldn't cause issues right? [07:11:15] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4259933 (10jcrespo) @RobH Can you check if we have next-business day support for defects for this hw provider and purchase? Because they seem to not be honoring that/adding some on-purpose delay. [07:17:17] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo outage [07:20:55] (03CR) 10Elukey: "Adding Cloud folks since it seems from the past experience that changes like these tend to break puppet masters, waiting for their green l" [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [07:25:21] (03CR) 10Elukey: [C: 031] "LVS seems related to adding the git-ssh.eqiad.wikimedia.org's IP to the phab1002's interface, but until it is not added to conftool/pybal " [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [07:30:54] !log Stop MySQL on labsdb1011 to install intel-microcode and reboot [07:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:01] moritzm: ^ [07:31:47] ack, thanks [07:33:44] (03CR) 10Muehlenhoff: [C: 
031] profile::mediawiki::videoscaler: remove global Timeout setting [puppet] - 10https://gerrit.wikimedia.org/r/437491 (owner: 10Giuseppe Lavagetto) [07:38:26] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.077 second response time [07:47:37] (03PS2) 10Gehel: logstash: Use gelf long_message when provided [puppet] - 10https://gerrit.wikimedia.org/r/437657 (https://phabricator.wikimedia.org/T196180) (owner: 10EBernhardson) [07:48:10] (03CR) 10Gehel: [C: 032] logstash: Use gelf long_message when provided [puppet] - 10https://gerrit.wikimedia.org/r/437657 (https://phabricator.wikimedia.org/T196180) (owner: 10EBernhardson) [07:48:27] (03CR) 10Gehel: [C: 032] "Nice! This one was tricky :)" [puppet] - 10https://gerrit.wikimedia.org/r/437657 (https://phabricator.wikimedia.org/T196180) (owner: 10EBernhardson) [07:48:54] !log Stop replication on all sanitarium masters to move labsdb1011 - T190704 [07:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:58] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [07:53:18] (03CR) 10Muehlenhoff: [C: 031] jobrunner: add profile::mediawiki::videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/437492 (owner: 10Giuseppe Lavagetto) [07:53:57] (03CR) 10Muehlenhoff: [C: 031] videoscaler/jobrunner: add the respective VIPs [puppet] - 10https://gerrit.wikimedia.org/r/437493 (owner: 10Giuseppe Lavagetto) [07:55:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.083 second response time [07:56:15] (03CR) 10Muehlenhoff: conftool-data: merge the jobrunner, videoscaler clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437494 (owner: 10Giuseppe Lavagetto) [07:57:11] (03PS3) 10Elukey: phabricator: add role to node 
phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [07:58:09] (03PS3) 10Addshore: Wikidata: Always have 4 change dispatchers running [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435648 (https://phabricator.wikimedia.org/T194602) (owner: 10Hoo man) [08:00:17] PROBLEM - MariaDB Slave Lag: s1 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.41 seconds [08:00:26] PROBLEM - MariaDB Slave Lag: s3 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 680.82 seconds [08:00:28] ^ that is me [08:00:36] I think I missed to silence that host [08:00:46] PROBLEM - MariaDB Slave Lag: s8 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 700.25 seconds [08:00:46] PROBLEM - MariaDB Slave Lag: s5 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 701.27 seconds [08:02:56] RECOVERY - MariaDB Slave Lag: s5 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:04:32] (03PS5) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) [08:04:44] (03PS6) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) [08:05:38] (03CR) 10Gehel: [C: 032] elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [08:05:47] RECOVERY - MariaDB Slave Lag: s3 on db1116 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [08:06:16] RECOVERY - MariaDB Slave Lag: s8 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:06:33] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4259998 (10ayounsi) I'm currently in Europe, so if you're on the east 
coast, ping me anytime (east coast) this week and I can do it. [08:06:57] RECOVERY - MariaDB Slave Lag: s1 on db1116 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:07:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437678 [08:09:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437678 (owner: 10Marostegui) [08:12:22] (03PS1) 10Gehel: elasticsearch: send frozen writes check over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/437679 (https://phabricator.wikimedia.org/T193605) [08:12:33] (03PS2) 10Gehel: elasticsearch: send frozen writes check over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/437679 (https://phabricator.wikimedia.org/T193605) [08:13:01] jouncebot: now [08:13:01] For the next 0 hour(s) and 46 minute(s): Wikibase Dispatching (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T0805) [08:13:19] marostegui: I've got 1 patch for mediawiki-config :) let me know when I'm okay! 
[08:13:29] (03CR) 10Gehel: [C: 032] elasticsearch: send frozen writes check over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/437679 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [08:14:12] addshore: yeah, I am deploying now, should be done in 1 min or so :) [08:14:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 57s) [08:15:01] addshore: all yours [08:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:02] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [08:18:20] (03CR) 10Alexandros Kosiaris: "I wouldn't recommend doing that, for the reasons very nicely pointed out in https://superuser.com/questions/224631/is-it-a-good-idea-to-p" [puppet] - 10https://gerrit.wikimedia.org/r/437669 (owner: 10KartikMistry) [08:19:09] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437678 (owner: 10Marostegui) [08:20:07] marostegui: thanks! 
[08:20:14] (03CR) 10Addshore: [C: 032] Wikidata: Always have 4 change dispatchers running [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435648 (https://phabricator.wikimedia.org/T194602) (owner: 10Hoo man) [08:20:18] I will need to deploy later again [08:20:29] marostegui: yup, thats fine :) [08:20:51] (03CR) 10Marostegui: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437678 (owner: 10Marostegui) [08:21:39] (03Merged) 10jenkins-bot: Wikidata: Always have 4 change dispatchers running [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435648 (https://phabricator.wikimedia.org/T194602) (owner: 10Hoo man) [08:22:17] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437678 (owner: 10Marostegui) [08:22:50] addshore: let me know when I can do so [08:23:09] syncing now [08:24:03] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: wikidatawiki dispatching: [[gerrit:435648|dispatchMaxTime 720 (4 dispatchers at once)]] (duration: 00m 56s) [08:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:12] woo :D deploy1001 [08:24:14] marostegui: all yours [08:24:19] \o/ [08:24:19] thanks [08:25:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all sanitariums masters - T190704 (duration: 00m 56s) [08:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:21] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [08:26:31] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4260035 (10Marostegui) labsdb1011 has been moved over the new sanitarium. This was the last host to be moved. Let's wait to mak... 
[08:26:32] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/437682 [08:26:37] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/437682 [08:26:49] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4260036 (10Marostegui) [08:27:22] (03CR) 10Marostegui: [C: 032] Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/437682 (owner: 10Marostegui) [08:29:05] !log Reload haproxy on dbproxy1010 to repool labsdb1011 - T190704 [08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] Prepare to tighten Puppet DB access control - check client certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [08:30:32] (03PS1) 10Gehel: elasticsearch: check frozen writes improvements [puppet] - 10https://gerrit.wikimedia.org/r/437683 (https://phabricator.wikimedia.org/T193605) [08:31:32] (03CR) 10Gehel: [C: 032] elasticsearch: check frozen writes improvements [puppet] - 10https://gerrit.wikimedia.org/r/437683 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [08:32:49] (03PS1) 10Hashar: Try a build against jessie-backports [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/437684 (https://phabricator.wikimedia.org/T196037) [08:33:17] (03CR) 10jerkins-bot: [V: 04-1] Try a build against jessie-backports [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/437684 (https://phabricator.wikimedia.org/T196037) (owner: 10Hashar) [08:36:48] (03CR) 10Giuseppe Lavagetto: conftool-data: merge the jobrunner, videoscaler clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437494 (owner: 10Giuseppe 
Lavagetto)
[08:37:30] (CR) Giuseppe Lavagetto: profile::mediawiki::jobrunner: manage both videoscaler, jobrunner (2 comments) [puppet] - https://gerrit.wikimedia.org/r/437490 (owner: Giuseppe Lavagetto)
[08:38:32] (Abandoned) Hashar: Try a build against jessie-backports [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/437684 (https://phabricator.wikimedia.org/T196037) (owner: Hashar)
[08:39:15] (CR) jenkins-bot: Add Minus-X to check against files that shouldn't be executable [mediawiki-config] - https://gerrit.wikimedia.org/r/436994 (https://phabricator.wikimedia.org/T196225) (owner: Mainframe98)
[08:39:54] (CR) jenkins-bot: Fixing very trivial spelling error in InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/437441 (owner: Sau226)
[08:39:59] (CR) Hashar: "recheck" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: KartikMistry)
[08:40:20] (CR) jenkins-bot: Drop the UnicodeConverter extension from production, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/436331 (https://phabricator.wikimedia.org/T195941) (owner: Jforrester)
[08:40:53] (CR) jerkins-bot: [V: -1] WIP: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342) (owner: KartikMistry)
[08:42:11] (CR) Muehlenhoff: conftool-data: merge the jobrunner, videoscaler clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/437494 (owner: Giuseppe Lavagetto)
[08:44:47] (PS2) Giuseppe Lavagetto: profile::mediawiki::jobrunner: manage both videoscaler, jobrunner [puppet] - https://gerrit.wikimedia.org/r/437490
[08:46:58] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy -
https://phabricator.wikimedia.org/T190704#4241885 (Marostegui)
[08:47:38] Operations, ops-eqiad, Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4260109 (ayounsi) asw2-b-eqiad xe-7/0/9 and xe-4/0/3 moved to group "vlan-cloud-hosts1-b-eqiad" asw2-b-eqiad xe-7/0/19 and xe-4/0/46 moved to group "vlan-cloud-instanc...
[08:48:16] Operations, Wikidata, Wikidata-Campsite, Wikimedia-General-or-Unknown, and 6 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4260113 (Addshore) Open>Resolved So to wrap this ticket up the incident rep...
[08:58:21] (PS1) Marostegui: mariadb: Convert db1116 to spare [puppet] - https://gerrit.wikimedia.org/r/437687 (https://phabricator.wikimedia.org/T196376)
[08:58:47] (CR) Muehlenhoff: [C: 1] profile::mediawiki::jobrunner: manage both videoscaler, jobrunner [puppet] - https://gerrit.wikimedia.org/r/437490 (owner: Giuseppe Lavagetto)
[08:59:20] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/11388/ seems to DTRT; I will apply this with care." [puppet] - https://gerrit.wikimedia.org/r/437490 (owner: Giuseppe Lavagetto)
[08:59:50] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::jobrunner: manage both videoscaler, jobrunner [puppet] - https://gerrit.wikimedia.org/r/437490 (owner: Giuseppe Lavagetto)
[09:00:06] (CR) ArielGlenn: "This looks legit to me. But maybe we should move away from using the trebuchet user anywhere at all." [puppet] - https://gerrit.wikimedia.org/r/361796 (owner: Thcipriani)
[09:03:47] (CR) Elukey: [C: 1] "Just to be on the safe side, I am adding Mukunda to the code review.
Is there anything relevant to know when Phabricator gets executed the" [puppet] - https://gerrit.wikimedia.org/r/437300 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[09:05:14] (CR) ArielGlenn: "Looks ok. I am pretty sure we don't need the dumps.yaml change in the end, but that can be sorted out later. Please don't merge this until" [puppet] - https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[09:10:43] (CR) jenkins-bot: Drop the UnicodeConverter extension from production, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/436333 (https://phabricator.wikimedia.org/T195941) (owner: Jforrester)
[09:11:00] (CR) jenkins-bot: Drop the UnicodeConverter extension from production, part 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/436334 (https://phabricator.wikimedia.org/T195941) (owner: Jforrester)
[09:11:46] (PS10) MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726)
[09:12:40] (CR) jenkins-bot: Replace wfGetLBFactory [mediawiki-config] - https://gerrit.wikimedia.org/r/414310 (owner: Umherirrender)
[09:15:34] (CR) jenkins-bot: Add reference for itwiki $wgAbuseFilterActions [mediawiki-config] - https://gerrit.wikimedia.org/r/420237 (owner: Nemo bis)
[09:15:38] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /srv 59704 MB (12% inode=99%)
[09:15:59] (CR) jenkins-bot: Only retain private securepoll data for 60 days after election [mediawiki-config] - https://gerrit.wikimedia.org/r/372180 (https://phabricator.wikimedia.org/T173393) (owner: Brian Wolff)
[09:16:20] (CR) jenkins-bot: Remove $wgNamespacesWithSubpages overrides for NS_TEMPLATE [mediawiki-config] - https://gerrit.wikimedia.org/r/432587 (https://phabricator.wikimedia.org/T191612) (owner: Gergő Tisza)
[09:16:33] (CR) jenkins-bot: Wikidata: Always
have 4 change dispatchers running [mediawiki-config] - https://gerrit.wikimedia.org/r/435648 (https://phabricator.wikimedia.org/T194602) (owner: Hoo man)
[09:20:37] (CR) jenkins-bot: Testing page creation log on Beta Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/437379 (https://phabricator.wikimedia.org/T196400) (owner: Kaldari)
[09:20:52] (CR) jenkins-bot: Disable DisableAccount on wikis where there are no disabled users [mediawiki-config] - https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) (owner: Reedy)
[09:21:21] (CR) jenkins-bot: Remove lines that are now part of AbuseFilter defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/424974 (https://phabricator.wikimedia.org/T178349) (owner: Huji)
[09:21:35] (CR) jenkins-bot: db-eqiad.php: Depool all sanitarium masters [mediawiki-config] - https://gerrit.wikimedia.org/r/437676 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[09:21:49] (CR) jenkins-bot: Enable DynamicPageList extension on bdwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/414109 (https://phabricator.wikimedia.org/T188109) (owner: Framawiki)
[09:22:20] (CR) jenkins-bot: Add wmgBabelCategoryNames to officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/432403 (owner: Amire80)
[09:22:35] (CR) jenkins-bot: Drop the UnicodeConverter extension from production, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/436332 (https://phabricator.wikimedia.org/T195941) (owner: Jforrester)
[09:22:45] (CR) jenkins-bot: Revert "db-eqiad.php: Depool all sanitarium masters" [mediawiki-config] - https://gerrit.wikimedia.org/r/437678 (owner: Marostegui)
[09:26:24] (CR) Alexandros Kosiaris: [C: -1] Tighten Puppet DB access control - check client certificates (2 comments) [puppet] - https://gerrit.wikimedia.org/r/437640 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[09:27:39] (PS8)
Ema: prometheus: export intel-microcode information via node_exporter [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825)
[09:29:48] RECOVERY - Disk space on elastic1029 is OK: DISK OK
[09:31:29] (CR) Ema: "Script updated to handle a few issues reported by Moritz, current output:" [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: Ema)
[09:32:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1958 bytes in 0.078 second response time
[09:42:50] (PS1) Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: add server alias for discovery [puppet] - https://gerrit.wikimedia.org/r/437697
[09:42:58] <_joe_> brown paper bag fix :(
[09:43:17] <_joe_> also, http(s) is hard as you add layers of indirection
[09:43:36] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::jobrunner_tls: add server alias for discovery [puppet] - https://gerrit.wikimedia.org/r/437697 (owner: Giuseppe Lavagetto)
[09:44:55] (CR) ArielGlenn: "Logic is sensible.
/etc/ssl/certs/Puppet_Internal_CA.pem is copied from /var/lib/puppet/ssl/certs/ca.pem exactly so that it will be availa" [puppet] - https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[09:45:56] (PS1) Giuseppe Lavagetto: jobrunner_tls: server_aliases, not server_alias [puppet] - https://gerrit.wikimedia.org/r/437698
[09:45:58] (PS2) Marostegui: mariadb: Convert db1116 to spare [puppet] - https://gerrit.wikimedia.org/r/437687 (https://phabricator.wikimedia.org/T196376)
[09:46:37] (CR) Alexandros Kosiaris: [C: -1] Tighten Puppet DB access control - check client certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/437640 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[09:46:52] (CR) Marostegui: [C: 2] mariadb: Convert db1116 to spare [puppet] - https://gerrit.wikimedia.org/r/437687 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[09:48:29] <_joe_> sigh, ff-only
[09:48:36] (PS2) Giuseppe Lavagetto: jobrunner_tls: server_aliases, not server_alias [puppet] - https://gerrit.wikimedia.org/r/437698
[09:48:44] (CR) Giuseppe Lavagetto: [V: 2 C: 2] jobrunner_tls: server_aliases, not server_alias [puppet] - https://gerrit.wikimedia.org/r/437698 (owner: Giuseppe Lavagetto)
[09:49:17] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:54:27] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:54:44] (CR) ArielGlenn: "Yes, you don't want to mix and match classes from different modules within a module. We do that at the profile level."
[puppet] - https://gerrit.wikimedia.org/r/372764 (owner: Alex Monk)
[09:55:11] (PS1) Jcrespo: mariadb: Failover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320)
[09:57:51] (CR) Marostegui: [C: 1] mariadb: Failover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[09:58:47] (PS1) Jcrespo: mariadb: Failover m3-master to db1072 [puppet] - https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320)
[09:59:19] (CR) Jcrespo: [C: -1] "wrong port" [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[10:00:01] (PS2) Jcrespo: mariadb: Failover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320)
[10:00:53] (CR) Marostegui: "commit says m2 slave, isn't it m3?" [puppet] - https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[10:04:24] (PS2) Jcrespo: mariadb: Failover m3-master to db1072 [puppet] - https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320)
[10:07:43] (PS1) Jcrespo: mariadb: Update misc replica CNAME for m2 and m3 [dns] - https://gerrit.wikimedia.org/r/437710 (https://phabricator.wikimedia.org/T186320)
[10:08:26] (PS1) Marostegui: sX.hosts: Remove db1116 [software] - https://gerrit.wikimedia.org/r/437711 (https://phabricator.wikimedia.org/T196376)
[10:09:21] (CR) Marostegui: [C: 2] sX.hosts: Remove db1116 [software] - https://gerrit.wikimedia.org/r/437711 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[10:10:08] (Merged) jenkins-bot: sX.hosts: Remove db1116 [software] - https://gerrit.wikimedia.org/r/437711 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[10:13:41] I'm doing a canary deployment to
ores2002, shouldn't impact anyone else...
[10:15:14] !log awight@deploy1001 Started deploy [ores/deploy@bf182e2]: ORES canary deployment to ores2002.codfw.wmnet; T176336
[10:15:20] !log awight@deploy1001 Finished deploy [ores/deploy@bf182e2]: ORES canary deployment to ores2002.codfw.wmnet; T176336 (duration: 00m 06s)
[10:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:21] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336
[10:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:41] !log awight@deploy1001 Started deploy [ores/deploy@65e979f]: ORES canary deployment to ores2002.codfw.wmnet; T176336
[10:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:36] !log Deploy schema change on dbstore1002:s6 - T191316 T192926 T195193 T89737
[10:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:45] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[10:16:45] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[10:16:45] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[10:16:45] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[10:19:25] !log awight@deploy1001 Finished deploy [ores/deploy@65e979f]: ORES canary deployment to ores2002.codfw.wmnet; T176336 (duration: 03m 44s)
[10:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:24] (CR) Muehlenhoff: [C: 1] "On a Nahelem CPU with the intel-microcode package in stretch (before Spectre happened):" [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: Ema)
[10:20:41] Operations, ops-codfw, netops: upgrade all
codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489#4260466 (Peachey88)
[10:24:20] (PS3) Jcrespo: mariadb: Failover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320)
[10:24:22] (PS3) Jcrespo: mariadb: Failover m3-master to db1072 [puppet] - https://gerrit.wikimedia.org/r/437707 (https://phabricator.wikimedia.org/T186320)
[10:24:24] (PS1) Jcrespo: mariadb: Switchover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320)
[10:24:26] (PS1) Jcrespo: mariadb: Switchover m3-master to db1072 [puppet] - https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320)
[10:24:45] (PS5) Ema: VCL: Normalise the Accept-Language header for the REST API [puppet] - https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: Mobrovac)
[10:28:11] (PS2) Jcrespo: mariadb: Switchover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320)
[10:29:26] (PS2) Jcrespo: mariadb: Switchover m3-master to db1072 [puppet] - https://gerrit.wikimedia.org/r/437715 (https://phabricator.wikimedia.org/T186320)
[10:40:42] (PS1) Awight: Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[10:41:16] (CR) jerkins-bot: [V: -1] Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[10:42:46] (PS7) Giuseppe Lavagetto: Switch video scalers to a profile [puppet] - https://gerrit.wikimedia.org/r/430892 (owner: Muehlenhoff)
[10:44:03] (PS1) Marostegui: mariadb: Set db1095 as spare, remove unused code [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376)
[10:44:56] (CR)
Marostegui: [C: -2] "Do not merge until db1095 is out of use" [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[10:49:11] (CR) Marostegui: [C: -2] "https://puppet-compiler.wmflabs.org/compiler02/11391/" [puppet] - https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[10:53:45] (PS2) Awight: Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[10:54:32] (CR) jerkins-bot: [V: -1] Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[10:57:58] (PS3) Awight: Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[10:58:24] (CR) jerkins-bot: [V: -1] Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[10:59:11] (CR) Giuseppe Lavagetto: [C: 2] Switch video scalers to a profile [puppet] - https://gerrit.wikimedia.org/r/430892 (owner: Muehlenhoff)
[10:59:46] (PS4) Awight: Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[10:59:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[11:00:25] (CR) jerkins-bot: [V: -1] Initialize LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[11:04:10] (PS2) Giuseppe Lavagetto: profile::mediawiki::videoscaler: remove global Timeout setting [puppet] - https://gerrit.wikimedia.org/r/437491
[11:04:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK -
failed 7 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[11:05:07] (CR) Mobrovac: [C: 1] VCL: Normalise the Accept-Language header for the REST API [puppet] - https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: Mobrovac)
[11:05:16] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::videoscaler: remove global Timeout setting [puppet] - https://gerrit.wikimedia.org/r/437491 (owner: Giuseppe Lavagetto)
[11:06:44] (CR) KartikMistry: "> I wouldn't recommend doing that, for the reasons very nicely" [puppet] - https://gerrit.wikimedia.org/r/437669 (owner: KartikMistry)
[11:06:56] (Abandoned) KartikMistry: dotfiles: Added `screen -R` in .bash_profile [puppet] - https://gerrit.wikimedia.org/r/437669 (owner: KartikMistry)
[11:12:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.083 second response time
[11:15:01] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /srv 60186 MB (12% inode=99%)
[11:19:27] (PS1) Jcrespo: mariadb: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/437730
[11:22:35] (CR) Jcrespo: [C: 2] mariadb: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/437730 (owner: Jcrespo)
[11:23:25] (PS2) Giuseppe Lavagetto: jobrunner: add profile::mediawiki::videoscaler [puppet] - https://gerrit.wikimedia.org/r/437492
[11:23:49] (Merged) jenkins-bot: mariadb: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/437730 (owner: Jcrespo)
[11:27:02] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 58s)
[11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:12] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL
slave_sql_lag Replication lag: 303.33 seconds
[11:27:32] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.25 seconds
[11:27:41] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.59 seconds
[11:27:51] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.57 seconds
[11:27:51] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.86 seconds
[11:27:51] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.36 seconds
[11:27:52] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.40 seconds
[11:28:30] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11393/mw1308.eqiad.wmnet/ seems to DTRT." [puppet] - https://gerrit.wikimedia.org/r/437492 (owner: Giuseppe Lavagetto)
[11:29:01] (CR) jenkins-bot: mariadb: Depool db1084 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/437730 (owner: Jcrespo)
[11:32:41] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.072 second response time
[11:33:21] (PS1) Jcrespo: mariadb: Give more s4 weight to db1097 and db1103 (3314) [mediawiki-config] - https://gerrit.wikimedia.org/r/437733
[11:34:51] (CR) Jcrespo: [C: 2] mariadb: Give more s4 weight to db1097 and db1103 (3314) [mediawiki-config] - https://gerrit.wikimedia.org/r/437733 (owner: Jcrespo)
[11:35:32] !log stop and reimage db1084
[11:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:04] (Merged) jenkins-bot: mariadb: Give more s4 weight to db1097 and db1103 (3314) [mediawiki-config] - https://gerrit.wikimedia.org/r/437733 (owner: Jcrespo)
[11:38:01] !log jynus@deploy1001 Synchronized
wmf-config/db-eqiad.php: Increase s4 weight for db1097 and db1103 (duration: 00m 56s)
[11:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:28] (CR) jenkins-bot: mariadb: Give more s4 weight to db1097 and db1103 (3314) [mediawiki-config] - https://gerrit.wikimedia.org/r/437733 (owner: Jcrespo)
[11:41:23] (PS2) Giuseppe Lavagetto: videoscaler/jobrunner: add the respective VIPs [puppet] - https://gerrit.wikimedia.org/r/437493
[11:41:33] (PS1) Jcrespo: mariadb: Reimage db1084 as stretch [puppet] - https://gerrit.wikimedia.org/r/437735
[11:42:09] (CR) Jcrespo: [V: 2 C: 2] mariadb: Reimage db1084 as stretch [puppet] - https://gerrit.wikimedia.org/r/437735 (owner: Jcrespo)
[11:44:13] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11395/" [puppet] - https://gerrit.wikimedia.org/r/437493 (owner: Giuseppe Lavagetto)
[11:44:24] <_joe_> argh, merge-sniped
[11:44:31] RECOVERY - Disk space on elastic1029 is OK: DISK OK
[11:44:39] (PS3) Giuseppe Lavagetto: videoscaler/jobrunner: add the respective VIPs [puppet] - https://gerrit.wikimedia.org/r/437493
[11:49:12] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time
[11:49:45] <_joe_> that's puppet restarting hhvm ^^
[11:50:21] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time
[11:51:21] PROBLEM - HHVM jobrunner on mw1334 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[11:52:22] RECOVERY - HHVM jobrunner on mw1334 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time
[11:55:12] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1974 bytes in 0.079 second response time
[12:02:04] RECOVERY - wikidata.org
dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.106 second response time
[12:04:44] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[12:05:44] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[12:08:54] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[12:09:54] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time
[12:11:33] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[12:12:33] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time
[12:15:55] (PS2) Mobrovac: Disable redis queue for cirrus search for all wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/437448 (https://phabricator.wikimedia.org/T189137) (owner: Ppchelko)
[12:18:23] (CR) Mobrovac: [C: 2] Disable redis queue for cirrus search for all wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/437448 (https://phabricator.wikimedia.org/T189137) (owner: Ppchelko)
[12:18:33] * mobrovac taking over deploy1001
[12:19:36] (Merged) jenkins-bot: Disable redis queue for cirrus search for all wikis.
[mediawiki-config] - https://gerrit.wikimedia.org/r/437448 (https://phabricator.wikimedia.org/T189137) (owner: Ppchelko)
[12:22:14] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@c8d62da]: Enable cirrus for everything T190327
[12:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:19] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327
[12:23:00] !log mobrovac@deploy1001 Synchronized wmf-config/jobqueue.php: Switch CirrusSearch jobs to EventBus for all wikis - T189137 (duration: 00m 57s)
[12:23:01] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@c8d62da]: Enable cirrus for everything T190327 (duration: 00m 47s)
[12:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:08] T189137: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137
[12:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:21] (PS9) Ema: prometheus: export intel-microcode information via node_exporter [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825)
[12:24:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1951 bytes in 0.087 second response time
[12:24:38] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch CirrusSearch jobs to EventBus for all wikis, file 2/2 - T189137 (duration: 00m 56s)
[12:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:03] (CR) Ema: [C: 2] prometheus: export intel-microcode information via node_exporter [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: Ema)
[12:27:59] * mobrovac done with deploy1001
[12:29:35] RECOVERY - wikidata.org dispatch lag is higher than
300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.106 second response time
[12:33:25] (PS4) Jcrespo: mariadb: Failover m2-master to db1065 [puppet] - https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320)
[12:34:07] (PS16) Elukey: [WIP] Create profile::analytics::cluster::packages::* classes [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[12:38:47] (PS1) Jcrespo: mariadb: Repool db1084 with low load after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/437744
[12:44:31] (CR) Jcrespo: [C: 2] mariadb: Repool db1084 with low load after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/437744 (owner: Jcrespo)
[12:46:04] (Merged) jenkins-bot: mariadb: Repool db1084 with low load after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/437744 (owner: Jcrespo)
[12:48:50] jouncebot: reload
[12:48:54] jouncebot: refresh
[12:48:55] I refreshed my knowledge about deployments.
[12:50:11] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 with low load (duration: 00m 56s)
[12:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:48] (PS1) Jcrespo: mariadb: Repool db1084 fully after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/437745
[12:52:18] (CR) jenkins-bot: Disable redis queue for cirrus search for all wikis.
[mediawiki-config] - https://gerrit.wikimedia.org/r/437448 (https://phabricator.wikimedia.org/T189137) (owner: Ppchelko)
[12:52:22] (CR) jenkins-bot: mariadb: Repool db1084 with low load after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/437744 (owner: Jcrespo)
[12:54:11] (CR) Elukey: "All right we are at another checkpoint:" [puppet] - https://gerrit.wikimedia.org/r/436012 (owner: Ottomata)
[12:59:29] !log add +spec_ctrl to ganeti01.svc.codfw.wmnet cluster default cpu_type
[12:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1300).
[13:00:04] No GERRIT patches in the queue for this window AFAICS.
[13:01:17] (PS1) Muehlenhoff: Add library hint for elfutils [puppet] - https://gerrit.wikimedia.org/r/437747
[13:01:43] !log starting slow rolling restart of all VMs on ganeti01.svc.codfw.wmnet
[13:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:11] so, nothing for SWAT, no EU SWAT, I'm around if anybody comes late
[13:02:26] (PS2) Muehlenhoff: Add library hint for elfutils [puppet] - https://gerrit.wikimedia.org/r/437747
[13:03:52] (CR) Muehlenhoff: [C: 2] Add library hint for elfutils [puppet] - https://gerrit.wikimedia.org/r/437747 (owner: Muehlenhoff)
[13:06:05] =o
[13:06:45] !log installing elfutils security updates
[13:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:58] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4260822 (Gehel) Deployed and seems to be
working
[13:30:31] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4260883 (JStrodt_WMDE)
[13:30:41] (PS1) Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/437754 (https://phabricator.wikimedia.org/T191316)
[13:31:36] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4081970 (JStrodt_WMDE)
[13:32:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/437754 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[13:32:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.078 second response time
[13:33:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/437754 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[13:33:46] (CR) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/437754 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[13:35:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 for alter table (duration: 00m 57s)
[13:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:55] !log Deploy schema change on db1096:3316 - T191316 T192926 T195193 T89737
[13:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:01] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[13:36:01] T192926: Schema change to drop archive.ar_text and archive.ar_flags -
https://phabricator.wikimedia.org/T192926 [13:36:01] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [13:36:02] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [13:40:35] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is WARNING: Test Transform wikitext to html responds with unexpected body: h2 id=HeadingHeading/h2 != /^h2.* Heading \/h2/: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is WARNING: Test Get media in test page responds with unexpected v [13:40:35] [2] = Missing keys: [utitles, uthumbnail, ulicense] [13:40:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is WARNING: Test Transform wikitext to html responds with unexpected body: h2 id=HeadingHeading/h2 != /^h2.* Heading \/h2/: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is WARNING: Test Get media in test page responds with unexpected v [13:40:45] [2] = Missing keys: [utitles, uthumbnail, ulicense] [13:42:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1960 bytes in 0.072 second response time [13:46:29] 10Operations, 10Wikimedia-Mailing-lists: Wikidata_Mail_BR - https://phabricator.wikimedia.org/T196552#4260920 (10Kaioduarte-TB) [13:48:05] (03PS2) 10Muehlenhoff: Remove at [puppet] - 10https://gerrit.wikimedia.org/r/435171 [13:49:21] (03CR) 10Muehlenhoff: [C: 032] Remove at [puppet] - 10https://gerrit.wikimedia.org/r/435171 (owner: 10Muehlenhoff) [13:53:28] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) 
[13:57:11] (03CR) 10Ottomata: "I think that's fine. We install refinery on analytics1003 and use it to launch jobs, so it makes sense that it gets all the packages." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [13:58:17] !log disabling puppet on db1051, db1065 [13:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:51] (03Abandoned) 10ArielGlenn: Fix killing dumpers in Wikidata entity dumpers [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [14:00:32] (03PS1) 10Volans: Add nginx::snippet define [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 [14:02:39] (03PS5) 10Jcrespo: mariadb: Failover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320) [14:04:07] (03CR) 10Vgutierrez: [C: 04-1] Add nginx::snippet define (031 comment) [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [14:05:21] (03CR) 10Jcrespo: [C: 032] mariadb: Failover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437703 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:05:35] (03CR) 10Volans: Add nginx::snippet define (031 comment) [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [14:06:34] !log rebooting labvirt1003 [14:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:07] (03CR) 10Vgutierrez: [C: 04-1] Add nginx::snippet define (031 comment) [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [14:09:15] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:09:20] (03PS3) 10Jcrespo: mariadb: Switchover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320) [14:09:22] (03PS2) 10Volans: Add nginx::snippet 
define [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 [14:09:24] (03PS1) 10Muehlenhoff: Also remove at from gridengine [puppet] - 10https://gerrit.wikimedia.org/r/437764 [14:09:45] PROBLEM - Host labvirt1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:09:53] (03CR) 10Volans: "replies inline" (031 comment) [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [14:10:05] PROBLEM - Host www.toolserver.org is DOWN: CRITICAL - Host Unreachable (www.toolserver.org) [14:10:21] (03CR) 10Rush: [C: 031] "I won't even harrass you that this doesn't need to be an array anymore :D" [puppet] - 10https://gerrit.wikimedia.org/r/437764 (owner: 10Muehlenhoff) [14:10:29] (03CR) 10Andrew Bogott: [C: 031] Also remove at from gridengine [puppet] - 10https://gerrit.wikimedia.org/r/437764 (owner: 10Muehlenhoff) [14:11:14] (03CR) 10Muehlenhoff: [C: 032] Also remove at from gridengine [puppet] - 10https://gerrit.wikimedia.org/r/437764 (owner: 10Muehlenhoff) [14:11:26] marostegui: so I am ready for the switch [14:11:32] so am I! [14:12:11] (03CR) 10Vgutierrez: [C: 031] "<3" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/437761 (owner: 10Volans) [14:12:12] you want me to handle the dbproxies for example? [14:12:42] <_joe_> volans, vgutierrez is /etc/nginx/snippets a debian standard way to do things? [14:12:52] note I wrote 1001 and 1006 [14:12:57] but it is 1002 and 1007 [14:13:04] but yes [14:13:23] <_joe_> I didn't know, tbh [14:13:51] ok, I will reload them whenever you give me green light [14:14:00] ok, starting [14:14:27] !log starting s2-master switchover from db1051 to db1065 [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:35] PROBLEM - Check systemd state on kafkamon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:14:37] _joe_: need to check the /etc/nginx/snippets is generated by the nginx-common package and has fastcgi-php.conf and snakeoil.conf inside [14:15:02] checking kafkamon [14:15:05] <_joe_> yeah, and it doesn't have conf-available/conf-enabled [14:15:21] has /etc/nginx/conf.d but all inside is included [14:15:36] <_joe_> yeah, it's a different beast for sure [14:15:53] marostegui: prepare [14:15:56] ok [14:16:10] whenever you want :) [14:16:13] i am ready [14:16:29] heartbeat ran on 1051 [14:16:32] not on 65 [14:16:37] https://www.youtube.com/watch?v=l2-iq7moFgM [14:16:51] marostegui: :P [14:17:06] missing patch [14:17:31] going back to rw [14:17:38] I will prepare the patch [14:17:45] (03PS5) 10Jcrespo: mariadb: Switchover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320) [14:17:54] (03PS2) 10Giuseppe Lavagetto: conftool-data: merge the jobrunner, videoscaler clusters [puppet] - 10https://gerrit.wikimedia.org/r/437494 [14:18:08] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Switchover m2-master to db1065 [puppet] - 10https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:18:17] that was fast :) [14:18:43] ok, going to read only again [14:18:45] _joe_: to answer your question yes, seems debian-specific: https://salsa.debian.org/nginx-team/nginx/blob/master/debian/changelog#L533 [14:18:53] !log setting gerrit on read only [14:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:52] we are ok now, switch [14:19:57] ok [14:20:02] I am ready whenever you want [14:20:18] no [14:20:20] now [14:20:22] ok [14:20:38] done [14:20:49] db1065 is now on dbproxies and db1051 is gone [14:20:50] confirm on stats? 
[14:20:52] ok [14:21:01] yep [14:21:07] !log setting m2 on read write [14:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:37] replication is working [14:22:00] killing connection on db1051 [14:22:20] akosiaris: check otrs [14:22:21] akosiaris: can you check otrs? [14:22:26] checking [14:22:46] (03CR) 10Jcrespo: [V: 032 C: 032] "This didn't went so well :-/" [puppet] - 10https://gerrit.wikimedia.org/r/437714 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:22:58] seems fine [14:22:58] !log rebooting labvirt1009 [14:22:58] ^I can comment on gerrit, so writes work [14:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] anything I 've tested in OTRS works [14:23:44] <_joe_> can I merge my own change then, or should I wait for you in case you need a very quick revert? [14:23:50] (03PS3) 10Giuseppe Lavagetto: conftool-data: merge the jobrunner, videoscaler clusters [puppet] - 10https://gerrit.wikimedia.org/r/437494 [14:24:09] (03CR) 10Elukey: [WIP] Create profile::analytics::cluster::packages::* classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:24:12] _joe_: go on [14:24:17] if we were to revert [14:24:18] <_joe_> ok, thanks [14:24:26] all good for debmonitor [14:24:34] <_joe_> in case don't worry, just don't do sudo -i puppet-merge in case [14:24:35] <_joe_> :P [14:24:39] we would do it with everything already written [14:24:48] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: merge the jobrunner, videoscaler clusters [puppet] - 10https://gerrit.wikimedia.org/r/437494 (owner: 10Giuseppe Lavagetto) [14:24:50] so not really a revert, just another fail [14:24:51] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4261077 (10Niedzielski) [14:24:52] (03PS1) 10Ppchelko: Switch all jobs to the new queue and clean up 
the old queue configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) [14:25:15] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [14:25:35] PROBLEM - Host labvirt1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:39] (03PS17) 10Elukey: [WIP] Create profile::analytics::cluster::packages::* classes [puppet] - 10https://gerrit.wikimedia.org/r/436012 (owner: 10Ottomata) [14:25:59] (03CR) 10jerkins-bot: [V: 04-1] Switch all jobs to the new queue and clean up the old queue configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [14:26:28] so there is a gotcha- don't coordinate on etherpad if you are going to do maintenance on etherpad [14:26:43] ahahahaha [14:26:44] :) [14:26:48] and don't deploy patches if you are going to do maintenance on gerrit [14:26:53] I learned today the second [14:26:57] one patch got stuck [14:27:12] which prevented itself from getting unstuck [14:27:34] (03PS2) 10Ppchelko: Switch all jobs to the new queue and clean up the old queue configs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437767 (https://phabricator.wikimedia.org/T190327) [14:28:19] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4261082 (10faidon) It's been a few months now, what's the status of this? [14:28:29] marostegui: I am setting up db1051 replication [14:28:33] just in case [14:29:01] PROBLEM - LVS HTTP IPv4 on videoscaler.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:08] um [14:29:11] yeah [14:29:13] agreed [14:29:13] _joe_: ^ ? [14:29:13] <_joe_> that's my fault [14:29:16] ok [14:29:22] <_joe_> yeah, lemme understand what's up [14:30:30] <_joe_> it's codfw, so it's not critical [14:31:12] it is the maintenance you are doing with puppet, right?
[14:31:29] <_joe_> yeah I confirm it's just codfw [14:31:31] (probably related to that) [14:32:16] <_joe_> uhm,no idea why that's happening [14:35:54] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=videoscaler,dc=codfw,service=nginx,name=mw21(5[3-9]|6).* [14:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:50] (03PS1) 10Papaul: DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) [14:37:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor nitpick, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:37:45] RECOVERY - Check systemd state on kafkamon2001 is OK: OK - running: The system is fully operational [14:38:04] 10Operations, 10Wikimedia-Mailing-lists: Give admin acces to recommender-feedback@wikimedia.org - https://phabricator.wikimedia.org/T196556#4261110 (10bmansurov) [14:38:25] PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:14] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=videoscaler,dc=codfw,service=nginx,name=mw22(4[1-5]|5[3-8]|6[1-9]).* [14:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:35] RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:40:09] 10Operations, 10Wikimedia-Mailing-lists: Give admin acces to recommender-feedback@wikimedia.org - https://phabricator.wikimedia.org/T196556#4261130 (10bmansurov) [14:40:12] (03PS2) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [14:42:16] akosiaris: greg-g: I'd like to carve out an ORES deployment window today, maybe 15:00-16:00 UTC, unless there are objections? I noticed that the Wednesday Services window overlaps with the train, which wouldn't be cool in this case. [14:42:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4261137 (10elukey) [14:42:51] <_joe_> I'm still not sure what's happening in codfw with the videoscalers tbh [14:43:08] <_joe_> oh I see, it happens I'm an idiot [14:43:11] <_joe_> sorry people [14:45:16] PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:23] <_joe_> uh? [14:45:31] <_joe_> that's not me at all ^^ [14:45:38] Alex is rebooting ganeti instances [14:45:44] <_joe_> oh ok [14:46:15] RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.134 second response time [14:46:29] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frmon2001 - https://phabricator.wikimedia.org/T196557#4261150 (10Papaul) p:05Triage>03Normal [14:46:41] RECOVERY - LVS HTTP IPv4 on videoscaler.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 266 bytes in 0.132 second response time [14:47:16] did you discover what it was^? 
[14:47:51] Oh, I just read probably just a mistake [14:49:44] <_joe_> jynus: I forgot to reenable puppet on some servers in codfw [14:49:53] <_joe_> simple as that [14:50:31] :-) [14:51:57] milimetric: Do you know if the 'https=1' portion is still useful in the analytics response header? [14:54:03] hm, not off the top of my head Krinkle, I’m at the doctor’s, maybe mforns can take a look? [14:54:10] thx [14:55:14] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4261177 (10Cmjohnson) @faidon it's still not done, I have been waiting until we're finished upgrading the network switches [14:56:07] awight: should be fine, please add to calendar [14:56:33] greg-g: done, ty! [14:57:02] (03PS1) 10Jcrespo: mariadb: Remove db1051, to be decommissioned, add db1065 [software] - 10https://gerrit.wikimedia.org/r/437769 (https://phabricator.wikimedia.org/T195484) [14:57:25] (03PS1) 10Ayounsi: Facter: add a v4 and v6 default routes fact [puppet] - 10https://gerrit.wikimedia.org/r/437771 [14:58:37] RECOVERY - Host labvirt1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:58:48] (03PS1) 10Giuseppe Lavagetto: jobrunner: uniform hiera parameters between jobrunner and videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/437772 [14:58:48] PROBLEM - ensure kvm processes are running on labvirt1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [15:00:03] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: uniform hiera parameters between jobrunner and videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/437772 (owner: 10Giuseppe Lavagetto) [15:00:04] awight: How many deployers does it take to do ORES special deployment deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1500). [15:00:54] ORES is about to do some donuts in the parking lot. 
[15:01:08] !log awight@deploy1001 Started deploy [ores/deploy@65e979f]: ORES: new draft topic model; T176336 [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:20] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336 [15:01:30] (03PS3) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [15:06:17] !log upgrade Cassandra to 3.11.2, restbase1007-{b,c} - T178905 [15:06:22] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=videoscaler,dc=codfw,service=nginx [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:32] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:37] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:07:49] RECOVERY - ensure kvm processes are running on labvirt1003 is OK: PROCS OK: 3 processes with regex args /usr/bin/kvm [15:07:49] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:08:05] <_joe_> the criticals on hhvm is me restarting it again, sorry [15:08:17] RECOVERY - Host labvirt1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:08:33] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4261243 (10Cmjohnson) @chasemp I went to cable these today and noticed they have 10G nics...do you need these in a 10G rack? 
[15:08:37] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:08:48] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:10:17] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:11:17] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:12:17] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:12:21] 10Operations, 10ops-eqiad: Degraded RAID on wtp1043 - https://phabricator.wikimedia.org/T196260#4261260 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to @robh to order a new disk because my techdirect renewal is pending approval [15:12:44] oh yeah, ill do that now [15:13:18] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:13:46] <_joe_> !log adding jobrunners, videoscalers to both pools with equal weight in codfw [15:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:17] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.004 second response time [15:17:26] (03PS1) 10Krinkle: varnish: Remove setting of CP cookies [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) [15:17:30] (03CR) 10Jcrespo: [C: 032] mariadb: Remove db1051, to be decommissioned, add db1065 [software] - 10https://gerrit.wikimedia.org/r/437769 (https://phabricator.wikimedia.org/T195484) (owner: 10Jcrespo) [15:18:17] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:19:37] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 
bytes in 0.001 second response time [15:19:37] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:20:37] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [15:20:38] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:21:47] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:22:48] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time [15:22:48] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4261344 (10Cmjohnson) @dzahn I see phab1002 is installed and in icinga does the bios/drac/serial still need setup/testing [15:22:57] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:23:37] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4261346 (10Cmjohnson) [15:23:58] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:26:31] (03PS1) 10Giuseppe Lavagetto: site.pp: merge videoscalers into the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/437776 [15:26:45] !log awight@deploy1001 Finished deploy [ores/deploy@65e979f]: ORES: new draft topic model; T176336 (duration: 25m 37s) [15:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:56] T176336: Deploy drafttopic model to production ORES - https://phabricator.wikimedia.org/T176336 [15:26:56] \o/ [15:26:59] !log stop pybal on lvs1001 [15:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:23] !log 
upgrade Cassandra to 3.11.2, restbase2001-{b,c} - T178905 [15:27:25] Looks good [15:27:32] yes that happened! [15:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:35] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:28:10] <_joe_> I see a raise in the number of errors on the OresFetchScore jobs [15:28:23] (03PS1) 10Sau226: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) [15:28:26] <_joe_> halfak/ awight it might be an artifact of deployment, I'll keep you updated [15:28:33] (03CR) 10jerkins-bot: [V: 04-1] Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) (owner: 10Sau226) [15:28:44] <_joe_> reference graph: https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1&panelId=9&fullscreen&from=now-15m&to=now [15:28:47] PROBLEM - ores on ores2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.061 second response time [15:28:47] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:29:10] <_joe_> it's already going down, so less critical [15:29:15] (03PS2) 10Sau226: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) [15:29:19] <_joe_> and probably a consequence of the deploy [15:30:20] akosiaris: mutante: Can I ask a favor... 
I need two directories rm'd [15:30:37] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:30:37] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:30:53] ores200[1-2].codfw.wmnet:/srv/deployment/ores/deploy-cache/revs/65e979fc2ee87198a93473a852278b2adf551dc8 [15:31:05] (03PS3) 10Sau226: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488) [15:31:27] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=4) [15:32:18] <_joe_> !log cross enabling videoscalers,jobrunners in their respective pools [15:32:22] the lvs1001 alerts are known, pybal stopped for maintenance ^ [15:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [15:34:17] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational [15:34:40] (03PS1) 10Jcrespo: mariadb: Decommission db1051 [puppet] - 10https://gerrit.wikimedia.org/r/437779 (https://phabricator.wikimedia.org/T195484) [15:34:55] akosiaris: mutante: nvm the request above, I worked around like you don't want to think about. 
[15:35:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ayounsi Maintenance for T187962 [15:35:08] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=4) Ayounsi Maintenance for T187962 [15:35:08] ACKNOWLEDGEMENT - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ayounsi Maintenance for T187962 [15:35:13] awight@deploy1001:/srv/deployment/ores/deploy$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service ores2002.codfw.wmnet "cd /srv/deployment/ores/dep [15:35:16] loy-cache/revs/65e979fc2ee87198a93473a852278b2adf551dc8/submodules/assets; git lfs pull" [15:35:19] MEOW [15:36:10] * halfak barfs a little bit [15:38:25] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4261381 (10ayounsi) @Cmjohnson: Please move lvs1001 from asw-c-eqiad:ge-2/0/45 to asw2-c-eqiad:ge-2/0/27 and (after I'm done de-pooling the host) lvs1002 fro... [15:38:40] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@402d729]: Adjust the cirrus concurrencies [15:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:54] All done messing with ORES. [15:39:18] RECOVERY - PyBal backends health check on lvs1001 is OK: PYBAL OK - All pools are healthy [15:39:18] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@402d729]: Adjust the cirrus concurrencies (duration: 00m 40s) [15:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] Seems that our deployment is happy. 
[15:39:27] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:41:38] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf1001.eqiad.wmnet:2379 (min=4) [15:42:37] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:57] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:57] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops: frdb2001 RAID disk failure - https://phabricator.wikimedia.org/T196251#4261401 (10Jgreen) 05Open>03Resolved excellent, thanks! [15:43:17] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:27] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:42] <_joe_> uhm what's up with puppet? [15:43:46] <_joe_> can someone look? [15:43:57] PROBLEM - puppet last run on db2093 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:57] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frbast2001 - https://phabricator.wikimedia.org/T196503#4261409 (10Jgreen) [15:44:58] PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:17] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:45:18] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:48] PROBLEM - puppet last run on mc2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:00] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: switch port configuration for frbast2001 - https://phabricator.wikimedia.org/T196503#4258998 (10Jgreen) Note--corrected hostname on the task title and description. @ayounsi this should be vlan frack-bastion-codfw. [15:46:17] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:38] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:47] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:57] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:58] PROBLEM - puppet last run on acrab is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:17] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:18] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:47:48] (03PS2) 10RobH: Add Reedy to contint-docker group [puppet] - 10https://gerrit.wikimedia.org/r/436860 (https://phabricator.wikimedia.org/T196192) (owner: 10Reedy) [15:48:24] (03CR) 10RobH: [C: 032] Add Reedy to contint-docker group [puppet] - 10https://gerrit.wikimedia.org/r/436860 (https://phabricator.wikimedia.org/T196192) (owner: 10Reedy) [15:49:31] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review: Add Reedy to contint-docker group - https://phabricator.wikimedia.org/T196192#4261418 (10RobH) 05Open>03Resolved a:03RobH This has been merged live. All affected servers will call into puppet and get th... [15:49:48] probably puppetdb restart for codfw [15:49:54] it's also on ganeti [15:49:55] (03CR) 10Jgreen: [C: 031] DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [15:51:07] !log disable pybal on lvs1002 - T187962 [15:51:17] manual puppet run on affected host works fine, seems like fallout of puppetdb reboot [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:25] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [15:51:58] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:37] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:54:25] !log upgrade Cassandra to 3.11.2, restbase2001-{a,b,c} - T178905 [15:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:37] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [15:54:50] 10Operations, 10Traffic, 10Wikimania-Hackathon-2018, 10Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - 
https://phabricator.wikimedia.org/T91820#1096885 (10Anomie) >>! On IRC, @tstarling wrote: > pity the SessionManager refactor did not add replication... [15:55:25] 10Operations, 10Phabricator: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4261464 (10Paladox) [15:56:03] 10Operations, 10Phabricator: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4261482 (10Paladox) p:05Triage>03Unbreak! [15:56:36] !log rebooting labvirt1014 [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:49] 10Operations, 10Wikimedia-Mailing-lists: Give admin acces to recommender-feedback@wikimedia.org - https://phabricator.wikimedia.org/T196556#4261110 (10RobH) I don't see this actual list already created, but it mentions that Ori was the admin? This seems to be asking for a modification, not a new list. Is the... [15:57:16] 10Operations, 10Phabricator: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4261464 (10greg) See also: https://news.ycombinator.com/item?id=17245649 [15:57:58] !log reloading apache on phab1001 to free up some resources [15:58:01] (03PS1) 10Arturo Borrero Gonzalez: openstack: labtest: keystone: delete service (collapsed) [puppet] - 10https://gerrit.wikimedia.org/r/437783 (https://phabricator.wikimedia.org/T167559) [15:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:07] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:15] 10Operations, 10Phabricator: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4261505 (10mmodell) Restarted apache to free up some stuck processes, this seems to have helped quite a bit, I'm not sure for how long though. 
[16:00:37] (03CR) 1020after4: [C: 031] Initialize LFS on scap targets [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [16:01:10] (03PS1) 10Ayounsi: [WIP] Add static routes with MTU 1450 for ipsec dests [puppet] - 10https://gerrit.wikimedia.org/r/437784 [16:01:11] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261509 (10Jdforrester-WMF) [16:01:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add static routes with MTU 1450 for ipsec dests [puppet] - 10https://gerrit.wikimedia.org/r/437784 (owner: 10Ayounsi) [16:01:52] (03CR) 10Rush: [C: 032] openstack: labtest: keystone: delete service (collapsed) [puppet] - 10https://gerrit.wikimedia.org/r/437783 (https://phabricator.wikimedia.org/T167559) (owner: 10Arturo Borrero Gonzalez) [16:02:22] (03CR) 10Rush: [C: 032] "We probably need to stop keystone on labtestcontrol2001 and make sure it doesn't start on boot? Could puppetize that...I'm good either wa" [puppet] - 10https://gerrit.wikimedia.org/r/437783 (https://phabricator.wikimedia.org/T167559) (owner: 10Arturo Borrero Gonzalez) [16:02:41] 10Operations, 10Wikimedia-Mailing-lists: Give admin acces to recommender-feedback@wikimedia.org - https://phabricator.wikimedia.org/T196556#4261524 (10bmansurov) @RobH, thanks for the reply. My bad, I mixed up this individual email address with a list. Do you know who manages @wikimedia.org email addresses? 
[16:04:04] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: labtest: keystone: delete service (collapsed) [puppet] - 10https://gerrit.wikimedia.org/r/437783 (https://phabricator.wikimedia.org/T167559) (owner: 10Arturo Borrero Gonzalez) [16:04:25] 10Operations, 10Wikimedia-Mailing-lists: Give admin acces to recommender-feedback@wikimedia.org - https://phabricator.wikimedia.org/T196556#4261527 (10RobH) WMF OIT handles the actual @wikimedia.org address allocations for staff and a google alias. That isn't a @lists.wikimedia.org address, which is handled v... [16:04:35] !log rebooting labvirt1002 [16:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:40] !log lvs1002 repooled [16:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:07] PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/redis - 259 bytes in 12.013 second response time [16:09:12] !log rebooting labvirt1004 [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:38] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:11:18] RECOVERY - puppet last run on mc2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:47] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:08] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:17] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:28] RECOVERY - puppet last run on acrab is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:12:47] RECOVERY - puppet last run on ms-be2039 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:48] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:08] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:27] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:27] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:47] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:57] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:14:27] RECOVERY - puppet last run on db2093 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:28] RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.009 second response time [16:15:28] RECOVERY - puppet last run on mw2278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:15:57] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:16:26] !log rebooting labvirt1005 [16:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:07] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.008 second response time [16:21:08] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.041 second response time [16:21:18] RECOVERY - toolschecker: 
check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time [16:21:38] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.209 second response time [16:23:57] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:25:13] !log rebooting labvirt1006 [16:25:14] !log stop mysql @ db1051 in preparation for decom [16:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:45] (03PS2) 10Jcrespo: mariadb: Decommission db1051 [puppet] - 10https://gerrit.wikimedia.org/r/437779 (https://phabricator.wikimedia.org/T195484) [16:26:12] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1051 [puppet] - 10https://gerrit.wikimedia.org/r/437779 (https://phabricator.wikimedia.org/T195484) (owner: 10Jcrespo) [16:26:14] 10Operations, 10MediaWiki-Debian, 10Wikimedia-Mailing-lists: Create mediawiki-debian mailing list - https://phabricator.wikimedia.org/T192865#4261579 (10RobH) 05Open>03Resolved a:03RobH This seems to have sat, and the new list is working fine without the old list archives from the third party server.... [16:27:48] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.014 second response time [16:30:02] 10Operations, 10Wikimedia-Mailing-lists: Create new editing-team mailing list - https://phabricator.wikimedia.org/T196120#4261595 (10RobH) 05Open>03Resolved a:03RobH I've gone ahead and created this list, setting it to private and requiring approval to join it. Since it is a private team list, I turned... 
[16:31:50] PROBLEM - toolschecker: All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 255 bytes in 3.592 second response time [16:33:42] !log rebooting labvirt1007 [16:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:11] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261829 (10jcrespo) There is a script `operations/software/dbtools/events_sanitarium.sql` that should be checked, updated and d... [17:40:20] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4261830 (10Vgutierrez) @Cmjohnson any updates regarding lvs1015? [17:44:19] !log rebooting labvirt1010 [17:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:37] (03PS1) 10Phuedx: admin: Replace phuedx's key [puppet] - 10https://gerrit.wikimedia.org/r/437794 [17:50:08] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261877 (10RobH) [17:51:37] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261509 (10RobH) [17:51:42] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4261895 (10Krinkle) [17:51:48] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261509 (10RobH) p:05Triage>03Normal [17:52:56] 
10Operations, 10Phabricator: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4261908 (10mmodell) I can't reproduce currently, load average isn't particularly high and phabricator has been snappy fast for a while now. I think we can close this as resolved. [17:53:19] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261910 (10RobH) Any additions to deployers requires approval by both @greg (for RI) plus review in the SRE weekly meetings. The othe... [17:55:04] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4261920 (10Krinkle) [17:57:48] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261935 (10Marostegui) I do see it is deployed on db1095 and on db1102 on the `ops` database It needs some checking, but I gues... 
[17:58:21] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from Varnish responses - https://phabricator.wikimedia.org/T194814#4261938 (10Krinkle) [17:58:44] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814#4209672 (10Krinkle) [17:59:22] !log upgrade Cassandra to 3.11.2, restbase2010-{a,b,c} - T178905 [17:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:27] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [17:59:44] !log rebooting labvirt1011 [17:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:52] (03PS2) 10Jgreen: DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [17:59:53] (03PS2) 10Jgreen: DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1800) [18:00:07] (03CR) 10Jgreen: [V: 031 C: 032] DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [18:00:08] (03CR) 10Jgreen: [V: 031 C: 032] DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [18:00:14] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261959 (10Marostegui) It is from 3 years ago...: 
https://gerrit.wikimedia.org/r/#/q/events_sanitarium.sql [18:00:28] !log gilles@deploy1001 Started deploy [performance/navtiming@816e610]: T196528 Funnel performance survey responses from kafka to graphite [18:00:33] !log gilles@deploy1001 Finished deploy [performance/navtiming@816e610]: T196528 Funnel performance survey responses from kafka to graphite (duration: 00m 05s) [18:01:06] (03CR) 10Jgreen: [V: 032 C: 032] DNS: Add prod & mgmt DNS for frmon2001 [dns] - 10https://gerrit.wikimedia.org/r/437768 (https://phabricator.wikimedia.org/T196476) (owner: 10Papaul) [18:03:15] (03PS2) 10Jgreen: DNS: Add prod DNS entries for frbast2001 [dns] - 10https://gerrit.wikimedia.org/r/437539 (https://phabricator.wikimedia.org/T196417) (owner: 10Papaul) [18:04:37] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261989 (10jcrespo) > if they are really needed anymore They are needed, a different thing is how much changes they need, but... 
[18:04:56] (03CR) 10Jgreen: [C: 032] DNS: Add prod DNS entries for frbast2001 [dns] - 10https://gerrit.wikimedia.org/r/437539 (https://phabricator.wikimedia.org/T196417) (owner: 10Papaul) [18:04:57] (03CR) 1020after4: [C: 031] Install LFS on scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [18:05:09] (03CR) 10Jgreen: [C: 032] DNS: Add prod DNS entries for frbast2001 [dns] - 10https://gerrit.wikimedia.org/r/437539 (https://phabricator.wikimedia.org/T196417) (owner: 10Papaul) [18:05:09] (03CR) 1020after4: [C: 031] Install LFS on scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [18:05:54] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4261994 (10greg) +1 (yay!) [18:06:44] !log rebooting labvirt1012 [18:06:46] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262001 (10jcrespo) See: T196570 [18:06:48] 10Operations, 10Phabricator, 10User-greg: Phabricator is very slow to load - https://phabricator.wikimedia.org/T196565#4262003 (10greg) 05Open>03Resolved a:03greg Please reopen if something looks off in the future. 
[18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:24] (03PS4) 10Alex Monk: Prepare to tighten Puppet DB access control - check client certificates [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) [18:07:25] (03PS4) 10Alex Monk: Prepare to tighten Puppet DB access control - check client certificates [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) [18:07:27] (03PS2) 10Alex Monk: Tighten Puppet DB access control - check client certificates [puppet] - 10https://gerrit.wikimedia.org/r/437640 (https://phabricator.wikimedia.org/T194962) [18:07:29] (03PS2) 10Alex Monk: Tighten Puppet DB access control - check client certificates [puppet] - 10https://gerrit.wikimedia.org/r/437640 (https://phabricator.wikimedia.org/T194962) [18:07:31] (03CR) 1020after4: [C: 031] Install LFS on scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [18:07:33] (03CR) 1020after4: [C: 031] Install LFS on scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [18:07:56] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814#4209672 (10Vgutierrez) @ema Could we use std.log (VCL_Log) to report X-Analytics data and stop the header from reaching th... 
[18:08:16] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4262010 (10RobH) [18:09:17] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262014 (10Marostegui) I definitely think we do not need the `ops` database one on sanitarium hosts, those are probably entries... [18:12:41] (03PS2) 10RobH: DNS: Add mgmt DNS entries for labtestnet2003 [dns] - 10https://gerrit.wikimedia.org/r/436579 (https://phabricator.wikimedia.org/T196000) (owner: 10Papaul) [18:12:42] (03PS2) 10RobH: DNS: Add mgmt DNS entries for labtestnet2003 [dns] - 10https://gerrit.wikimedia.org/r/436579 (https://phabricator.wikimedia.org/T196000) (owner: 10Papaul) [18:13:24] (03CR) 10RobH: [C: 032] DNS: Add mgmt DNS entries for labtestnet2003 [dns] - 10https://gerrit.wikimedia.org/r/436579 (https://phabricator.wikimedia.org/T196000) (owner: 10Papaul) [18:13:24] (03CR) 10RobH: [C: 032] DNS: Add mgmt DNS entries for labtestnet2003 [dns] - 10https://gerrit.wikimedia.org/r/436579 (https://phabricator.wikimedia.org/T196000) (owner: 10Papaul) [18:13:53] well [18:13:59] someone had dns changes pending on nameserver [18:14:23] !log rebooting labvirt1013 [18:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:27] someone had pending dns changes merged but not live, for frmon2001 and such, now live [18:15:06] robh maybe this https://gerrit.wikimedia.org/r/437539 ? 
[18:15:07] Jeff_Green: ^ patchset you merged in gerrit are now live [18:15:16] yeah [18:15:20] found it via git blame ;] [18:15:27] robh thanks [18:15:41] i saw it was new mgmt entries and assumed it was cool to merge =] [18:15:52] yeah, we were just working on this [18:19:00] !log upgrade Cassandra to 3.11.2, restbase2003-{a,b,c} - T178905 [18:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:05] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:20:16] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262072 (10Marostegui) Nevermind my comments above. They have nothing to do with the sanitarium events. The ones on the file a... [18:20:26] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4262075 (10Reedy) [18:20:34] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10MW-1.27-release-notes, and 3 others: php-memcached 3.0 (PHP 7) incompatible with BagOStuff - https://phabricator.wikimedia.org/T196125#4262073 (10Reedy) 05Open>03Resolved [18:21:01] !log rebooting labvirt1015 [18:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:32] !log rebooting labvirt1016 [18:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:05] (03PS1) 10RobH: labtestnet2003 install params [puppet] - 10https://gerrit.wikimedia.org/r/437801 (https://phabricator.wikimedia.org/T196000) [18:25:06] (03PS1) 10RobH: labtestnet2003 install params [puppet] - 10https://gerrit.wikimedia.org/r/437801 (https://phabricator.wikimedia.org/T196000) [18:26:25] (03PS1) 10Marostegui: events_sanitarium: Update sanitarium hosts [software] - 
10https://gerrit.wikimedia.org/r/437802 (https://phabricator.wikimedia.org/T190704) [18:26:25] (03PS1) 10Marostegui: events_sanitarium: Update sanitarium hosts [software] - 10https://gerrit.wikimedia.org/r/437802 (https://phabricator.wikimedia.org/T190704) [18:29:18] !log rebooting labvirt1017 [18:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:49] Why do we have duplicate lines from jouncebot every time something is committed to gerrit? i.e. my change or robh's last change [18:34:06] yeah i just noticed that as well [18:34:16] well, from wikibugs you mean? [18:34:30] needs rebooting [18:35:05] Yeah, probably got a second instance running due to Cloud reboots? [18:35:20] They're on the same name though [18:35:24] two listeners? [18:35:53] Could be. [18:36:16] !log rebooting labvirt1018, 1021, 1022 [18:36:17] Ugh, it's fab deployed? [18:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:30] which is why i have no idea how to fix it [18:36:40] qdel [18:37:55] hurrah [18:38:43] !log rebooting labvirt1019, 1020 [18:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:50] !log upgrade Cassandra to 3.11.2, restbase2004-{a,b,c} - T178905 [18:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:54] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:55:19] !log stopped exim on mx1001 in prep for upgrade to stretch [18:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1900). [19:00:04] thcipriani: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1900). [19:00:45] * thcipriani does [19:01:05] I see we have multiple jouncebots too [19:01:37] Not any more? [19:01:47] James_F: jouncebot and Guest24005? [19:02:05] Oh, right, I was going by jouncebot and jouncebot_ earlier. [19:02:08] heh [19:02:24] But of course there's a /third/. [19:06:13] bleh [19:08:32] o/ [19:08:59] thcipriani: i am around! [19:11:16] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is WARNING: Test Transform wikitext to html responds with unexpected body: h2 id=HeadingHeading/h2 != /^h2.* Heading \/h2/: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was rece [19:16:11] wth? [19:16:18] how is a warning critical? [19:17:18] ...and why doesn't that show up in the web ui...and why no email? [19:18:18] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.7 [19:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:15] !log thcipriani@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.7 (duration: 00m 56s) [19:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:08] !log upgrade Cassandra to 3.11.2, restbase2008-{a,b,c} - T178905 [19:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:16] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [19:47:40] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4262294 (10Bstorm) p:05Normal>03High Hello @Kolossos, the NFS is at quite high utilization again, and the number one user is the templatetiger tool d... 
[19:53:45] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last) [19:53:45] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last) [19:54:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last) [19:56:04] (03CR) 10Andrew Bogott: "I'm a bit confused about variable naming... in at least one place it's implied that keystone_host is set in hiera, but elsewhere you're fo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437812 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [19:58:14] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 9 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:58:14] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 321 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2000). [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2000). [20:00:30] We're all done with ORES stuff for today :) [20:00:45] huh .. 2 bots? [20:00:52] looking [20:01:28] Aren't we deploying new version of ORES? I want to see the new look on the home page. [20:01:40] Oh! I thought we did [20:01:51] We got the drafttopic version out. [20:01:58] Is there a newer version ready for deployment? 
[20:02:00] hm [20:02:05] oh there it is [20:02:39] Amir1, I think we need to get the new homepage on ores-beta first [20:02:54] I'll give you a quick review if you want to make that simple change now :) [20:03:04] yeah sure [20:03:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:04:02] jouncebot, next [20:04:02] In 2 hour(s) and 55 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2300) [20:10:01] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@a07af40]: Update mobileapps to 3bf9be5 (T196402 T195948) [20:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:12] T196402: Public rollout of feed content availability endpoint - https://phabricator.wikimedia.org/T196402 [20:10:13] T195948: MCS should respect Accept-Language header - https://phabricator.wikimedia.org/T195948 [20:17:27] !log upgrade Cassandra to 3.11.2, restbase2011-{a,b,c} - T178905 [20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:33] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [20:17:52] 10Operations, 10ops-codfw, 10Cloud-VPS: move/setup/install labtestnet2003(WMF6469) - https://phabricator.wikimedia.org/T196000#4262353 (10RobH) [20:18:38] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@a07af40]: Update mobileapps to 3bf9be5 (T196402 T195948) (duration: 08m 37s) [20:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:47] T196402: Public rollout of feed content availability endpoint - https://phabricator.wikimedia.org/T196402 [20:18:48] T195948: MCS should respect Accept-Language header - https://phabricator.wikimedia.org/T195948 [20:18:55] 10Operations, 10Cloud-VPS: move/setup/install labtestnet2003(WMF6469) - 
https://phabricator.wikimedia.org/T196000#4243908 (10RobH) a:05RobH>03chasemp @chasemp, This system is now ready for cloud team to take over. Feel free to use or resolve this task as needed. [20:19:10] !log rolled back mobileapps deploy [20:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:57] (03CR) 10Ottomata: [WIP] Allow admin module to ensure system user membership in managed groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:20:30] (03CR) 10Chad: [V: 032 C: 032] "We'll deploy 2.15.2, but let's merge this for consistency and since the artifacts already uploaded" [software/gerrit] (stable-2.15) - 10https://gerrit.wikimedia.org/r/436607 (owner: 10Chad) [20:20:31] 10Operations, 10SRE-Access-Requests, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4262364 (10RobH) I forgot to note that @Jdforrester-WMF put 'deployers' but I assume he meant 'deployment' [20:20:33] (03PS2) 10Ottomata: [WIP] Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) [20:21:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:21:23] (03CR) 10Chad: [V: 032 C: 032] Merge tag 'v2.15.2' into wmf/stable-2.15 [software/gerrit/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/436619 (owner: 10Chad) [20:22:31] (03PS1) 10RobH: adds jforrester to deployment, deploy-service, & mobileapps-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/437819 (https://phabricator.wikimedia.org/T196566) [20:23:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 
10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4262372 (10RobH) [20:29:31] jouncebot: now [20:29:31] For the next 0 hour(s) and 30 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2000) [20:29:31] For the next 0 hour(s) and 30 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T1900) [20:29:33] jouncebot: next [20:29:34] In 2 hour(s) and 30 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2300) [20:42:19] !log upgrade Cassandra to 3.11.2, restbase2005-{a,b,c} - T178905 [20:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:23] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [20:43:45] 10Operations, 10JADE, 10TechCom, 10Scoring-platform-team (Current): Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4262445 (10Joe) [20:44:53] https://tools.wmflabs.org/add-information/?image=Map-heart-054.jpg broken? 
[20:45:00] Operations, JADE, Scoring-platform-team (Current), User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4262449 (Joe)
[20:45:04] 502 Bad Gateway
[20:45:22] yannf: #wikimedia-cloud likely fallout from maintenance/host reboots
[20:47:05] !log sighup logstash on logstash100[789] to reload config for gerrit.wikimedia.org/r/437657
[20:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:29] Operations, Mail, Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4262462 (herron) Planning to proceed with the firewall update and reinstall to Stretch starting at 10a Eastern tomorrow (coordinated over IRC) In preparation for that, Exim on mx1001 has...
[21:05:26] Reedy, should I open a report on Phab?
[21:06:20] Visit #wikimedia-cloud and mention the tool isn't working. Someone should restart it
[21:07:11] done
[21:07:43] ok, it worked ;)
[21:08:04] *works
[21:26:17] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476#4262540 (Papaul)
[21:27:49] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476#4257945 (Papaul) a:Papaul>Jgreen @Jgreen all yours. let me know if you have any questions.
[21:34:27] (CR) Awight: [C: 1] "Thanks for the fixup!" [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[21:40:30] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused
[21:41:11] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[21:42:08] (PS1) EBernhardson: logstash: typo gelf long_message -> full_message [puppet] - https://gerrit.wikimedia.org/r/437864
[21:42:20] (PS2) EBernhardson: logstash: typo gelf long_message -> full_message [puppet] - https://gerrit.wikimedia.org/r/437864
[21:42:24] (CR) jerkins-bot: [V: -1] logstash: typo gelf long_message -> full_message [puppet] - https://gerrit.wikimedia.org/r/437864 (owner: EBernhardson)
[21:46:12] ^^^ got that
[21:46:41] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused eevans Cassandra upgrade
[21:46:41] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Cassandra upgrade
[21:48:56] (PS1) Chad: Gerrit 2.15.2 wmf build [software/gerrit] (stable-2.15) - https://gerrit.wikimedia.org/r/437865
[21:58:40] RECOVERY - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-c valid until 2018-08-17 16:12:01 +0000 (expires in 71 days)
[22:00:10] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.030 second response time on 10.192.48.48 port 9042
[22:03:15] jouncebot: next
[22:03:15] In 0 hour(s) and 56 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2300)
[22:16:31] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@0346959]: Update mobileapps to 5ea008c
[22:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:04] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@0346959]: Update mobileapps to 5ea008c (duration: 05m 33s)
[22:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:42] (PS8) Awight: Install LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[22:34:19] (CR) jerkins-bot: [V: -1] Install LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[22:46:26] (PS4) Krinkle: Swap mediawiki.org to use standard docroot naming scheme [puppet] - https://gerrit.wikimedia.org/r/421949 (owner: Chad)
[22:50:09] (CR) Awight: "recheck" [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[22:50:55] (CR) jerkins-bot: [V: -1] Install LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: Awight)
[22:52:13] (PS9) Awight: Install LFS on scap targets [puppet] - https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627)
[22:58:37] Operations, ops-ulsfo, Traffic, netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4262689 (RobH) a:RobH>ayounsi Ok, I'm back onsite today, and I've taken the following steps: * verified both optics are working by connecting an lc-sc patch to a light meter, the...
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180606T2300).
[23:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:27] hi.
[23:06:17] * legoktm looks around
[23:06:50] I'll just do the swat then?
[23:07:41] That's what I do when people on the list aren't around :P
[23:07:41] thanks
[23:07:45] (PS1) Krinkle: mc: Clean up docs and use same format and order between prod and beta [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:08:58] (CR) jerkins-bot: [V: -1] mc: Clean up docs and use same format and order between prod and beta [mediawiki-config] - https://gerrit.wikimedia.org/r/437876 (owner: Krinkle)
[23:09:10] (CR) Krinkle: "This documents in -labs the differences from prod. This is important given that unlike e.g. CommonSettings, one does not load after the ot" [mediawiki-config] - https://gerrit.wikimedia.org/r/437876 (owner: Krinkle)
[23:10:57] (PS2) Krinkle: mc: Clean up docs and use same format and order between prod and beta [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:13:17] (PS1) Krinkle: mc-labs: Update wgMemCachedPersistent override [mediawiki-config] - https://gerrit.wikimedia.org/r/437877
[23:16:24] MatmaRex: it's on mwdebug1002
[23:16:50] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.29 seconds
[23:16:51] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.52 seconds
[23:17:10] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.66 seconds
[23:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.00 seconds
[23:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.13 seconds
[23:17:21] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.25 seconds
[23:17:21] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.80 seconds
[23:17:30] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.16 seconds
[23:17:31] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.39 seconds
[23:17:31] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.02 seconds
[23:17:43] What's up with codfw?
[23:18:55] MatmaRex: ugh, I'm gonna have to scap
[23:18:57] legoktm: looks good except the l10n message is missing
[23:18:59] yeah
[23:21:22] !log legoktm@deploy1001 Started scap: Preference for responsive MonoBook, plus set mobile width cutoff to 550px ([[gerrit:437875]], [[gerrit:437814]])
[23:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:46] !log upgrade Cassandra to 3.11.2, restbase2009-{a,b,c} - T178905
[23:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:50] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905
[23:24:12] (PS3) Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:25:22] (PS4) Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:25:35] (Abandoned) Krinkle: mc-labs: Update wgMemCachedPersistent override [mediawiki-config] - https://gerrit.wikimedia.org/r/437877 (owner: Krinkle)
[23:26:12] (PS5) Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:27:34] (PS6) Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - https://gerrit.wikimedia.org/r/437876
[23:51:48] !log upgrade Cassandra to 3.11.2, restbase2012-{a,b,c} - T178905
[23:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:52] T178905: Upgrade RESTBase cluster to Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905
[23:54:32] Operations, Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4262770 (Reedy)
[23:55:21] Operations, Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4262785 (Reedy) p:Triage>High
[23:56:28] Operations, Beta-Cluster-Infrastructure: confd broken on deployment-redis hosts - https://phabricator.wikimedia.org/T196596#4262770 (Reedy)