[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:01:33] (CR) Faidon Liambotis: [C: 2] Remove unjustified overbroad /8 network from blocklist [puppet] - https://gerrit.wikimedia.org/r/389888 (owner: 20after4)
[00:02:21] (CR) 20after4: "Thanks Faidon!" [puppet] - https://gerrit.wikimedia.org/r/389888 (owner: 20after4)
[00:02:31] yvw
[00:03:03] paravoid: indeed, thanks
[00:27:41] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3674094 (Jdforrester-WMF) Is this set of changes the reason that I'm seeing great big "⚫ " s for some diffs but still "-"s and "+"s f...
[01:20:33] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3743342 (Legoktm) Yes, it's only on group0 wikis for now. https://gerrit.wikimedia.org/r/#/c/386387/ switches the dots to arrows, but...
[02:30:03] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational
[02:32:43] heh, the main deployment table is wrong in wikitech, it's wmf6/wmf7, not wmf5/wmf7 (no skip this time)
[02:33:01] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 07m 38s)
[02:33:04] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:35] !log aaron@tin Synchronized php-1.31.0-wmf.7/includes: Deploy 087f2d579a9f which reverts 4432e898be0 due to statsd spam (duration: 01m 40s)
[02:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:06] !log aaron@tin Synchronized php-1.31.0-wmf.7/tests: Deploy 087f2d579a9f which reverts 4432e898be0 due to statsd spam (duration: 01m 14s)
[02:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:40] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 15m 16s)
[03:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 8 03:17:53 UTC 2017 (duration 7m 13s)
[03:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.40 seconds
[03:56:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 164.90 seconds
[04:12:53] (CR) Jayprakash12345: [C: 1] Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: TerraCodes)
[04:29:54] PROBLEM - Restbase root url on restbase1015 is CRITICAL: connect to address 10.64.48.134 and port 7231: Connection refused
[04:44:07] (PS10) TerraCodes: Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[05:45:08] !log rm -rf /var/lib/carbon/whisper/MediaWiki/wanobjectcache/centralauth_user_* on graphite1001 and graphite2001 for T179999
[05:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:15] T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed - https://phabricator.wikimedia.org/T179999
[05:49:54] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[05:56:40] no_justification: so docroot/foundation/logos is now inaccessible / ready for deletion.
[05:57:07] (CR) Krinkle: [C: 1] Remove last vestigates of weird wmfwiki-specific docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/385113 (owner: Chad)
[05:57:12] (PS2) Krinkle: Remove last vestigates of weird wmfwiki-specific docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/385113 (owner: Chad)
[05:57:12] right.
[05:57:15] Already had a patch :)
[05:57:21] Krinkle: not something I wanna do at 10pm
[05:57:37] Also: we didn't force restart apache everywhere
[05:57:46] okay
[05:57:54] https://wikimediafoundation.org/logos/nupedia.png seems 404 consistently, but no harm in waiting
[05:58:03] PROBLEM - graphite.wikimedia.org on graphite2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:58:19] Well it'll 404 after swapping too
[05:58:20] I suppose if we remove the whole docroot, that would harm the older apaches, but deleting the logos only is safe
[05:58:28] I didn't retain those silly files
[05:58:33] But might as well do it in one go later
[05:59:14] no_justification: No need to, you asked me 6 months ago if they were used and I said no, and then I ran a query on stats1002 for 10 hours checking varnish requests for them for the past 2 months and there were only 5 hits which were all from us.
[05:59:46] They were last referenced in our HTML on wikimedia.org around 2006
[06:00:18] And this was before it was cool to mirror our domains willy-nilly in poorly configured hotlinking ways, so no third-party requests either it seems.
[06:00:34] Maybe if someone is using a really outdated copy of the internet? That's how it works, right?
[06:00:46] Hehe yeah.
[06:01:11] I checked the archive.org copy and it used a relative url, so unlikely and either way, did the traffic check
[06:01:35] There's still git history for any critical recovery needs
[06:02:26] Like that's web 2.0 days. With the way versioning goes we're on like web 56.1.9.44.20171107 right?
[06:08:39] (PS1) Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389919 (https://phabricator.wikimedia.org/T178359)
[06:11:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389919 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:12:11] (Merged) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389919 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:12:25] (CR) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389919 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:14:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T178359 (duration: 00m 51s)
[06:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:26] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:15:09] !log Stop MySQL on db1051 to copy its content to db1105.s1 - T178359
[06:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:34] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:16:03] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.007 second response time
[06:16:21] !log Restarted uwsgi-graphite-web service on graphite1001
[06:16:23] !log Restarted uwsgi-graphite-web service on graphite2001
[06:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:04] RECOVERY - graphite.wikimedia.org on graphite2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.191 second response time
[06:22:30] (CR) Krinkle: [C: 1] "This should also fix the non-raw view at https://noc.wikimedia.org/conf/highlight.php?file=reverse-proxy-staging.php which is currently br" [mediawiki-config] - https://gerrit.wikimedia.org/r/389779 (owner: Chad)
[06:22:42] (CR) Krinkle: [C: 2] "Minor noc fix" [mediawiki-config] - https://gerrit.wikimedia.org/r/389779 (owner: Chad)
[06:23:55] (Merged) jenkins-bot: Fix reverse-proxy symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/389779 (owner: Chad)
[06:26:10] !log Add 330G to db2023 partition to make sure the alter over logging table runs fine - T174569
[06:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:19] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:27:16] (CR) jenkins-bot: Fix reverse-proxy symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/389779 (owner: Chad)
[06:27:43] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:43] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.013 second response time
[06:45:34] RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:06] (CR) Krinkle: "Given this wasn't merged before the point the commit says it's safe to undo, I imagine either the referenced issue went by unfixed and is " [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[06:57:20] !log Deploy alter table on s6 - on codfw master (db2028) with replication enabled - T172207
[06:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:29] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207
[07:13:00] !log krinkle@tin Synchronized docroot/noc/conf/: I2e51e783a (duration: 01m 06s)
[07:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:33] !log Deploy alter table on s7 - on codfw master (db2029) with replication enabled - T172207
[07:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:40] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207
[07:21:18] !log Deploy alter table on s3 - on codfw master (db2018) with replication enabled - T172207
[07:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:25] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207
[07:30:14] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:31:13] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 74387 bytes in 0.302 second response time
[07:44:48] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3743565 (hashar) >>! In T177891#3743342, @Legoktm wrote: > Yes, it's only on group0 wikis for now. https://gerrit.wikimedia.org/r/#/c...
[07:46:22] (PS2) Muehlenhoff: Create repository components component/elastic55 and thirdparty/elastic55 [puppet] - https://gerrit.wikimedia.org/r/389714
[07:46:38] !log Deploy alter table on s2 - on codfw master (db2017) with replication enabled - T172207
[07:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:45] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207
[07:55:18] !log Deploy alter table on s1 - on codfw master (db2048) with replication enabled - T172207
[07:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:24] T172207: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207
[07:58:02] (CR) Muehlenhoff: [C: 2] Create repository components component/elastic55 and thirdparty/elastic55 [puppet] - https://gerrit.wikimedia.org/r/389714 (owner: Muehlenhoff)
[08:02:58] (PS3) Muehlenhoff: Synchronise elastic 5.5 stack to thirdparty/elastic55 [puppet] - https://gerrit.wikimedia.org/r/389715
[08:07:03] (PS1) Marostegui: s1,s2.hosts: Remove db1047 [software] - https://gerrit.wikimedia.org/r/389922 (https://phabricator.wikimedia.org/T177405)
[08:08:16] (CR) Marostegui: [C: 2] s1,s2.hosts: Remove db1047 [software] - https://gerrit.wikimedia.org/r/389922 (https://phabricator.wikimedia.org/T177405) (owner: Marostegui)
[08:08:58] (Merged) jenkins-bot: s1,s2.hosts: Remove db1047 [software] - https://gerrit.wikimedia.org/r/389922 (https://phabricator.wikimedia.org/T177405) (owner: Marostegui)
[08:12:14] (CR) Muehlenhoff: [C: 2] Synchronise elastic 5.5 stack to thirdparty/elastic55 [puppet] - https://gerrit.wikimedia.org/r/389715 (owner: Muehlenhoff)
[08:16:49] (CR) Giuseppe Lavagetto: [C: 1] puppet: add puppet 4 auth.conf template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/389720 (https://phabricator.wikimedia.org/T179722) (owner: Herron)
[08:29:14] PROBLEM - Restbase root url on restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused
[08:39:12] Operations, Ops-Access-Requests, Patch-For-Review, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3743653 (Krinkle)
[08:39:41] Operations, Ops-Access-Requests, Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3743660 (Krinkle)
[08:40:31] !log resume cache_text/upload rolling reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m and 1.1.0g
[08:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:18] Operations, Ops-Access-Requests, Performance-Team: Requesting access to perf-teams for phedenskog (add phedenskog to perf-roots) - https://phabricator.wikimedia.org/T179729#3743674 (MoritzMuehlenhoff)
[08:42:21] Operations, Ops-Access-Requests, Patch-For-Review, Performance-Team (Radar): Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3743671 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff The task descriptions mentions "Add aaron to perf-team group", but he has...
[08:43:23] Operations, Ops-Access-Requests, Performance-Team: Adding phedenskog to perf-team - https://phabricator.wikimedia.org/T179729#3734308 (MoritzMuehlenhoff)
[08:43:35] Operations, Ops-Access-Requests, Performance-Team: Adding phedenskog to perf-team - https://phabricator.wikimedia.org/T179729#3734308 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:44:47] Operations, Ops-Access-Requests, Performance-Team: Adding phedenskog to perf-team - https://phabricator.wikimedia.org/T179729#3743681 (MoritzMuehlenhoff) I've updated the title to reflect the recent creation of perf-team. I'll create a Gerrit patch, but this needs to be approved in next Monday's Ops...
[08:52:38] @seen Amir1
[08:52:38] mutante: Last time I saw Amir1 they were quitting the network with reason: Quit: Connection closed for inactivity N/A at 11/7/2017 8:34:09 PM (12h18m29s ago)
[08:54:35] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3743689 (WMDE-Fisch) > Then I guess we can cherry pick that to the wmf branch? :) +1 @Addshore
[08:58:38] !log planet2001 - apt autoremove; reboot for kernel upgrade
[08:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:23] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational
[09:00:31] what
[09:01:13] (CR) Ppchelko: [C: 1] Kafka: Enable topic deletion for Kafka by default [puppet] - https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: Ppchelko)
[09:01:22] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3743707 (Krinkle)
[09:03:23] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:38] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3743712 (Krinkle)
[09:06:41] cp2002 is up but offline?
[09:06:53] not reachable but mgmt console shows login
[09:09:16] mutante: I'm rebooting text/upload hosts
[09:09:51] !log restart restbase on 1013 and 1015
[09:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:29] ema: :) gotcha
[09:19:57] (PS7) Ema: Add local patch for transaction_timeout [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156) (owner: BBlack)
[09:22:18] (CR) Ema: [C: 2] Add local patch for transaction_timeout [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/387236 (https://phabricator.wikimedia.org/T179156) (owner: BBlack)
[09:22:47] !log restart cassandra-a on restbase1010
[09:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:53] !log rutherforidum (people.wikimedia.org) : apt-get autoremove ; reboot for kernel upgrade
[09:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:53] !log ununpentium (rt.wikimedia.org): apt-get autoremove; reboot for kernel upgrade
[09:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:50] @seen icinga-wm
[09:28:51] mutante: Last time I saw icinga-wm they were quitting the network with reason: Ping timeout: 240 seconds N/A at 11/8/2017 9:19:08 AM (9m43s ago)
[09:32:07] (PS5) Ema: 5.1.3-1wm2: transcation_timeout, record-prefix, run vtc tests [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/389516
[09:32:30] !log restarting ircecho (icinga-wm)
[09:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:41] !log planet1001, alsafi (url-downloader) - apt autoremove; reboot for kernel upgrade
[09:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:09] (PS1) Filippo Giunchedi: role: Prometheus https access to k8s apiserver / node [puppet] - https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395)
[09:41:11] (PS1) Filippo Giunchedi: profile: allow Prometheus to access k8s kubelet [puppet] - https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395)
[09:41:46] (CR) jerkins-bot: [V: -1] role: Prometheus https access to k8s apiserver / node [puppet] - https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395) (owner: Filippo Giunchedi)
[09:54:59] Operations, Prod-Kubernetes, monitoring, Kubernetes, and 3 others: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3657156 (fgiunchedi) I gave k8s discovery for Prometheus a try, the first blocker is that the Debian version of Prometheus doesn't include k...
[09:57:57] (PS1) Gehel: archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - https://gerrit.wikimedia.org/r/389932
[10:06:22] (PS2) Filippo Giunchedi: role: Prometheus https access to k8s apiserver / node [puppet] - https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395)
[10:06:24] (PS2) Filippo Giunchedi: profile: allow Prometheus to access k8s kubelet [puppet] - https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395)
[10:07:28] !log reboot aqs100[4-9] for jvm and kernel updates
[10:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:22] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet
[10:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:08] will use --quiet sorry --^
[10:19:08] (PS6) Ema: 5.1.3-1wm2: transaction_timeout, record-prefix, run vtc tests [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/389516
[10:28:42] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1005.eqiad.wmnet
[10:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:52] uff
[10:31:04] (CR) Ema: [C: 2] 5.1.3-1wm2: transaction_timeout, record-prefix, run vtc tests [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/389516 (owner: Ema)
[10:38:09] (PS1) Marostegui: db-eqiad.php: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389940
[10:38:32] (CR) Marostegui: [C: -2] "Wait for the lag to be gone" [mediawiki-config] - https://gerrit.wikimedia.org/r/389940 (owner: Marostegui)
[10:39:54] !log aluminium - reboot for kernel upgrade (url-downloader)
[10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:03] !log varnish 5.1.3-1wm2 built and uploaded to apt.w.o (experimental)
[10:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:39] !log actinium - reboot for kernel upgrade (url-downloader)
[10:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:18] (PS1) Volans: home (volans): replace alias 'my' with a function [puppet] - https://gerrit.wikimedia.org/r/389941
[10:44:24] (PS1) Gehel: apt: purge unmanaged sources.list [puppet] - https://gerrit.wikimedia.org/r/389942
[10:45:26] (CR) jerkins-bot: [V: -1] apt: purge unmanaged sources.list [puppet] - https://gerrit.wikimedia.org/r/389942 (owner: Gehel)
[10:46:23] (PS2) Gehel: apt: purge unmanaged sources.list [puppet] - https://gerrit.wikimedia.org/r/389942
[10:48:39] (CR) Volans: [C: 2] home (volans): replace alias 'my' with a function [puppet] - https://gerrit.wikimedia.org/r/389941 (owner: Volans)
[10:54:30] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6
[10:54:40] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/volans/.bash_completion]
[10:54:49] wut?
[10:54:50] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/volans/.bash_completion]
[10:55:00] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/volans/.bash_completion]
[10:55:11] that's me, dunno why, checking
[10:55:30] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK
[10:55:43] (PS1) Dzahn: webserver_misc_static: add profile for wikiba.se [puppet] - https://gerrit.wikimedia.org/r/389944 (https://phabricator.wikimedia.org/T99531)
[10:55:58] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389940 (owner: Marostegui)
[10:56:07] Operations, CirrusSearch, Discovery, Discovery-Search, Elasticsearch: Created dedicated elastic component in our APT repository - https://phabricator.wikimedia.org/T179964#3742139 (MoritzMuehlenhoff) New components thirdparty/elastic55 and component/elastic55 have been created and kibana, log...
[10:56:50] interesting, it's running fine now... seems it was a race condition...
[10:56:57] _joe_: known issue?
[10:57:16] (Merged) jenkins-bot: db-eqiad.php: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389940 (owner: Marostegui)
[10:57:25] (CR) jenkins-bot: db-eqiad.php: Repool db1051 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389940 (owner: Marostegui)
[10:57:41] !log ulsfo lvs reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m
[10:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:33] <_joe_> volans: yes, did you add or remove a file?
[10:58:38] add one
[10:58:40] <_joe_> that can be the reason
[10:58:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 with low weight after maintenance - T178359 (duration: 01m 01s)
[10:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:59] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[10:59:00] <_joe_> so, with adding one it's very very hard to trigger the race
[10:59:39] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:59:42] the error was
[10:59:42] Could not set 'file' on ensure: Error 404 on SERVER: Not Found: Could not find file_content modules/admin/home/volans/.bash_completion
[10:59:50] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:59] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[11:00:09] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational
[11:03:09] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:03:25] Amir1: o/
[11:04:33] Amir1: ores deployment on scb1002 seems to be weird, the service fails to start because of a missing module
[11:05:12] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3743891 (Krinkle)
[11:05:45] Operations, Analytics, DBA, User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3743903 (elukey) Just sent a summary of what's happening to engineering@ and analytics@, new deadlines: ``` - November 13th: the analytics-slave CNAME move...
[11:09:59] PROBLEM - Host aqs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[11:10:22] argh expired downtime
[11:10:54] fixed
[11:10:55] !log imported jenkins 2.73.3 to apt.wikimedia.org
[11:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:01] RECOVERY - Host aqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[11:11:33] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389945
[11:12:11] !log installing jenkins security update on releases*
[11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:08] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389945 (owner: Marostegui)
[11:14:22] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389945 (owner: Marostegui)
[11:15:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1051 weight after maintenance - T178359 (duration: 00m 50s)
[11:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:40] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[11:17:10] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389945 (owner: Marostegui)
[11:31:31] (CR) Dzahn: [C: 2] webserver_misc_static: add profile for wikiba.se [puppet] - https://gerrit.wikimedia.org/r/389944 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[11:34:41] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:36:32] ^ me, fixing
[11:39:27] (PS1) Dzahn: wikibase: move hiera parameters to correct location [puppet] - https://gerrit.wikimedia.org/r/389947 (https://phabricator.wikimedia.org/T99531)
[11:41:51] (CR) Dzahn: [C: 2] wikibase: move hiera parameters to correct location [puppet] - https://gerrit.wikimedia.org/r/389947 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[11:43:20] (PS1) Marostegui: db-eqiad.php: Increase weight for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389948
[11:43:50] PROBLEM - DPKG on restbase2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:44:29] PROBLEM - DPKG on restbase2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:44:50] RECOVERY - DPKG on restbase2006 is OK: All packages OK
[11:45:56] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389948 (owner: Marostegui)
[11:46:22] (PS1) Dzahn: wikibase: another fix to Hiera parameter names [puppet] - https://gerrit.wikimedia.org/r/389949 (https://phabricator.wikimedia.org/T99531)
[11:46:29] RECOVERY - DPKG on restbase2005 is OK: All packages OK
[11:47:08] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389948 (owner: Marostegui)
[11:47:18] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/389948 (owner: Marostegui)
[11:47:24] (CR) Dzahn: [C: 2] wikibase: another fix to Hiera parameter names [puppet] - https://gerrit.wikimedia.org/r/389949 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[11:48:13] !log installed openjdk-8/openssl updates and new kernels on restbase*
[11:48:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1051 weight after maintenance - T178359 (duration: 00m 50s)
[11:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[11:51:18] (PS1) Dzahn: wikibase: include base::firewall vs declaring it [puppet] - https://gerrit.wikimedia.org/r/389952 (https://phabricator.wikimedia.org/T99531)
[11:51:44] (CR) jerkins-bot: [V: -1] wikibase: include base::firewall vs declaring it [puppet] - https://gerrit.wikimedia.org/r/389952 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[11:53:22] !log netmon1003 (servermon) - rebooting for kernel upgrade
[11:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:27] (CR) MarcoAurelio: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: TerraCodes)
[11:55:37] (PS1) Marostegui: db-eqiad.php: Restore db1051 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389954
[11:56:35] (CR) Dzahn: [V: 2 C: 2] "yea, that's adding a style violation but the existing code can't work in prod, only in labs, where the profile doesn't share the same node" [puppet] - https://gerrit.wikimedia.org/r/389952 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[11:59:39] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:01:23] (PS1) Dzahn: wikibase: include apache class vs declaring it [puppet] - https://gerrit.wikimedia.org/r/389957 (https://phabricator.wikimedia.org/T99531)
[12:01:36] (CR) jerkins-bot: [V: -1] wikibase: include apache class vs declaring it [puppet] - https://gerrit.wikimedia.org/r/389957 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[12:03:08] !log alcyone (url-downloader) rebooting for kernel upgrade
[12:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:53] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1051 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389954 (owner: Marostegui)
[12:05:48] (CR) Dzahn: [V: 2 C: 2] wikibase: include apache class vs declaring it [puppet] - https://gerrit.wikimedia.org/r/389957 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[12:05:52] Operations, User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3744120 (Joe)
[12:06:12] Operations, User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3744133 (Joe)
[12:06:46] (Merged) jenkins-bot: db-eqiad.php: Restore db1051 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/389954 (owner: Marostegui)
[12:07:13] bromine: recover you should
[12:07:25] (PS2) BBlack: ulsfo subnet comment fixup [dns] - https://gerrit.wikimedia.org/r/389738
[12:07:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1051 original weight after maintenance - T178359 (duration: 00m 50s)
[12:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:55] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[12:08:40] (CR) BBlack: [C: 2] ulsfo subnet comment fixup [dns] - https://gerrit.wikimedia.org/r/389738 (owner: BBlack)
[12:09:18] (PS2) BBlack: ulsfo definition fixups [puppet] - https://gerrit.wikimedia.org/r/389740
[12:09:39] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:10:09] (CR) BBlack: [C: 2] ulsfo definition fixups [puppet] - https://gerrit.wikimedia.org/r/389740 (owner: BBlack)
[12:10:14] is there any reason that all sites on misc varnish have to be in wikimedia.org? it seems not, since i see query.wikidata.org as well
[12:10:20] wants to put wikiba.se on it
[12:11:10] wikimedia.org isn't a requirement, but being part of our canonical domain set is
[12:11:14] (which wikiba.se isn't)
[12:11:30] bblack: it would move over to use
[12:11:31] us
[12:11:38] but that DNS switch would be last
[12:11:58] no, I mean the TLS stuff fronting cache_misc uses the unified certificates, which do not have wikiba.se
[12:12:08] oooh, right
[12:12:27] i will have to request adding it i guess
[12:12:33] maybe?
[12:13:09] I donno, I haven't been following the wikiba.se conversation at all
[12:13:13] context https://phabricator.wikimedia.org/T99531 i will comment there
[12:13:35] but is it really justified in joining our shortlist of canonicals and inflating our certs further, etc?
[12:14:03] is a separate service ip a feasible solution for such domains?
[12:14:10] (and thus a separate cert) [12:14:39] well, or SAN-based, it doesn't have to be a separate IP unless you really really care about some now very-ancient browsers [12:14:47] right [12:15:12] but we haven't even deployed any separated certs to the cache frontends since quite a while back, there's no guarantee the puppetization for that hasn't decayed during refactors, etc [12:15:23] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1051 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389954 (owner: 10Marostegui) [12:17:15] anyways, I'm assuming there was some compelling reason to make this wikiba.se and not wikibase.wikimedia.org or whatever? (existing site with user?) [12:17:33] or we could talk to them about actually changing it to wikibase.wikimedia.org [12:17:54] or wikibase.org (but i already asked about that and they said it was decided against). hmm [12:18:08] yes, existing site [12:18:18] it's used by wikidata [12:18:35] the site is? [12:19:11] for the separate cert, we could probably avoid adding costs at various levels by making it separate and using LE [12:19:15] yes, or i thought so, but i will ask Ladsgroup about it again [12:19:29] i would just use LE, i just want the caching [12:19:42] but then we're definitely in new unknown territory on the puppetization (second cert for cache_misc, and blending traditional + LE cert deploys) [12:20:20] ah, LE on misc-web, yea.. hmm [12:20:56] yeah LE on many-hosts, is tricky. you start having to design around having an LE-challenge-answerer deployed behind all the caches. 
[12:21:08] the nicest solution seems to be to use wikibase.wikimedia.org [12:21:10] and a mechanism to rsync out updated certs automagically [12:21:43] (03PS3) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [12:22:51] let me discuss with Ladsgroup/WMDE how bad it would really be to rename it [12:23:02] /win 3 [12:25:15] revisiting the SNI-support thing (for SAN-based separate certs) from above [12:25:36] the tl;dr on that is IE-on-XP and Android 2.x are the significant remaining clients that don't do SNI [12:26:17] but IE-on-XP is also being cryptographically deprecated off our terminators (final cut in 9 days) [12:26:48] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3744155 (10Marostegui) @Cmjohnson if you have some time today, could we get the failed disk swapped? Thank you! [12:26:56] and the Android 2.x's in question currently sit at ~0.4% of our total request volume. [12:29:44] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:30:32] and summing up the rest of that rambling: deploying as whatever.wikimedia.org would be easy. deploying as a separate paid cert for wikiba.se will cost a small amount of $$ and may dredge up a bug or two in crusty puppetization but could be worked out, doing it via LE probably requires some real engineering. 
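Note on the certificate point above: the reason whatever.wikimedia.org is "easy" while wikiba.se is not comes down to whether the hostname is covered by the names already listed on the unified certificate. A minimal Python sketch of that check follows. The SAN list in it is hypothetical and for illustration only, not the real contents of the production certificates, and the function name `san_covers` is likewise an assumption. It applies the usual rule that a `*.` wildcard covers exactly one left-most label.

```python
# Hypothetical SAN entries for illustration only; NOT the real unified cert.
HYPOTHETICAL_SANS = [
    "wikimedia.org",
    "*.wikimedia.org",
    "wikidata.org",
    "*.wikidata.org",
]


def san_covers(hostname, san_entries):
    """Return True if hostname matches any SAN entry.

    An entry matches either exactly, or as a wildcard of the form
    "*.example.org", which covers exactly one left-most label
    (so "a.example.org" matches, "a.b.example.org" does not).
    """
    host = hostname.lower().rstrip(".")
    for entry in san_entries:
        pattern = entry.lower().rstrip(".")
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".wikimedia.org"
            if host.endswith(suffix):
                label = host[: -len(suffix)]
                # The wildcard must stand for one non-empty label.
                if label and "." not in label:
                    return True
        elif host == pattern:
            return True
    return False


if __name__ == "__main__":
    for name in ("wikibase.wikimedia.org", "query.wikidata.org", "wikiba.se"):
        print(name, san_covers(name, HYPOTHETICAL_SANS))
```

Under these assumed entries, a name like wikibase.wikimedia.org needs no certificate change, whereas wikiba.se would need either a new SAN on the shared certificate or a separate certificate selected via SNI.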
[12:31:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 minor comments, rest LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [12:32:21] (03CR) 10BBlack: [C: 032] git-ssh.wm.o: reduce to 10m TTL for failover [dns] - 10https://gerrit.wikimedia.org/r/389869 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [12:32:23] (03PS2) 10BBlack: git-ssh.wm.o: reduce to 10m TTL for failover [dns] - 10https://gerrit.wikimedia.org/r/389869 (https://phabricator.wikimedia.org/T164810) [12:33:02] thanks bblack, that's a good summary, *nod* i'll get back to it [12:35:43] !log bromine (misc static tistes, annual/transparency/static-bz) - rebooting for kernel upgrade [12:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:09] (03CR) 10BBlack: [V: 032 C: 032] Swap git.wikimedia.org -> phabricator.wikimedia.org [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/389655 (https://phabricator.wikimedia.org/T139089) (owner: 10Chad) [12:41:08] !log krypton (misc PHP apps, scholarships.wm, iegreview.wm, grafana, racktables, burrow) rebooting for kernel upgrade [12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:22] (03CR) 10BBlack: [C: 031] smart: enable SMART health collection in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:42:00] (03PS1) 10Muehlenhoff: Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) [12:43:27] (03CR) 10Alexandros Kosiaris: "I like this, but it does indeed have a potential for causing pain, especially in labs where people probably don't expect this to happen. 
I" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [12:44:06] (03CR) 10Alexandros Kosiaris: [C: 031] Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) (owner: 10Muehlenhoff) [12:46:02] !log osmium - re-enabling puppet - temp test is over and will be decom'ed [12:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:30] (03PS2) 10BBlack: LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/389871 (https://phabricator.wikimedia.org/T164810) [12:46:42] ^ :) [12:47:37] (03CR) 10BBlack: [C: 032] LVS/phabricator: add git-ssh in codfw [puppet] - 10https://gerrit.wikimedia.org/r/389871 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [12:47:48] (03CR) 10Dzahn: "we could add a Hiera setting like "purge-unmanaged-sources: false" to allow labs instances to opt-out of it? (but purge by default)" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [12:49:22] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3744190 (10Dzahn) 05Open>03stalled [12:51:22] !log restart pybal on lvs2005 for git-ssh.codfw deploy [12:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:09] !log restart pybal on lvs2002 for git-ssh.codfw deploy [12:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:23] (03CR) 10Alexandros Kosiaris: "Yes this could work. Whether it should be opt-in or opt-out in labs is something we can figure out using cumin I guess." 
[puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [12:54:10] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [12:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:23] (03CR) 10Elukey: [C: 031] Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) (owner: 10Muehlenhoff) [12:58:35] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3744205 (10Dzahn) @Ladsgroup Hi, i saw your IRC ping and continued working on this (see above). Though.. now we'll have to talk about the cer... [13:01:54] (03CR) 10Muehlenhoff: "I double-checked apt sources with Cumin and it looks mostly fine, almost all the apt repositories are managed via puppet. These apt source" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [13:03:07] (03PS1) 10BBlack: phab@codfw - add git-ssh public IPs to vcs config [puppet] - 10https://gerrit.wikimedia.org/r/389968 [13:03:36] !log Upgrading jenkins on contint1001/contint2001 [13:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:32] (03PS2) 10BBlack: phab@codfw - add git-ssh public IPs to vcs config [puppet] - 10https://gerrit.wikimedia.org/r/389968 (https://phabricator.wikimedia.org/T164810) [13:05:10] (03CR) 10BBlack: [C: 032] phab@codfw - add git-ssh public IPs to vcs config [puppet] - 10https://gerrit.wikimedia.org/r/389968 (https://phabricator.wikimedia.org/T164810) (owner: 10BBlack) [13:17:46] (03CR) 10Muehlenhoff: "All non-puppet managed apt lines have been removed except cloudarchive-kilo-proposed.list on labvirt11, I'm pinging WMCS people on IRC for" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [13:20:27] (03PS1) 10Alexandros Kosiaris: Remove chromium module 
[puppet] - 10https://gerrit.wikimedia.org/r/389971 (https://phabricator.wikimedia.org/T175093) [13:23:26] (03CR) 10Muehlenhoff: "On the topic of applying this to labs; I think it would be best to have WMCS projects opt-in; for a more official project like deployment-" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [13:23:35] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3744258 (10Gehel) I would argue that including the source of the software as a submodule should be optional. The specific use case I have in mind is the deployment of mapzen, where... [13:30:40] (03PS8) 10ArielGlenn: move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) [13:31:42] (03CR) 10ArielGlenn: [C: 032] move references to datasets use from dumps module out to profile [puppet] - 10https://gerrit.wikimedia.org/r/389745 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [13:42:54] (03PS1) 10Ppchelko: [EventBus] Increase number of worker processes to 16 [puppet] - 10https://gerrit.wikimedia.org/r/389975 (https://phabricator.wikimedia.org/T180017) [13:48:10] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3744420 (10Volans) > Which deployment method to choose I would mention also cases in which the upstream package or dependencies release quite often, like for example web apps. >... [13:49:48] (03PS1) 10DCausse: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389976 (https://phabricator.wikimedia.org/T177913) [13:50:22] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T1400). [14:00:05] Pchelolo: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:25] I can SWAT today [14:00:36] o/ [14:00:42] (03PS2) 10Filippo Giunchedi: smart: enable SMART health collection in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) [14:00:51] thank you zeljkof [14:01:21] (03CR) 10Filippo Giunchedi: [C: 032] smart: enable SMART health collection in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/389485 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:01:39] o/ [14:01:52] wait, when did two more patches appear! :) [14:02:04] :) [14:02:24] Pchelolo, dcausse, addshore: does your patch take a long time to deploy and/or test? 
[14:02:32] mine should be speedy [14:02:36] (in that case it will be move to the end) [14:02:41] zeljkof: mine is pretty simple test and affect only one wiki [14:02:59] Pchelolo, dcausse, addshore: let me know if you would like to deploy your patch yourself [14:03:15] zeljkof: my is untestable, it's improving logging for some pretty rare error [14:03:23] zeljkof: sure I can deploy mine [14:04:15] Pchelolo: ok, then deploying your change and hoping for the best :) [14:04:19] (03PS4) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [14:04:21] I can also deploy mine :) [14:04:33] dcausse, addshore: I will let you know when I am done, so you can take over then :) [14:04:40] ok [14:05:16] 10Operations, 10Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3744448 (10BBlack) Well there's two different actions to get through here: First is upgrade tlsproxy hosts to `1.13.6-2+wmf1` (but still on existing `nginx-full` packages) - seamless, shouldn't require any depooling.... [14:08:26] !log codfw lvs reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:48] !log zfilipin@tin Synchronized php-1.31.0-wmf.6/extensions/EventBus/EventBus.php: SWAT: [[gerrit:389974|Logging improvements.]] (duration: 00m 52s) [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] Pchelolo: deployed, please monitor logs for a while [14:11:21] dcausse, addshore: I am done, feel free to take over SWAT [14:11:29] kk thank you [14:11:35] o/! [14:12:05] I have just +2ed https://gerrit.wikimedia.org/r/#/c/389969/1 [14:12:28] zeljkof : sure, addshore let me know when you're done [14:12:37] dcausse: ack! 
[14:12:49] mine is just a CSS change :) [14:13:06] feel free to +2 yours now and do it [14:13:13] you might even beat me as I have to wait for CI [14:13:13] oh thanks [14:13:43] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389976 (https://phabricator.wikimedia.org/T177913) (owner: 10DCausse) [14:15:36] (03Merged) 10jenkins-bot: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389976 (https://phabricator.wikimedia.org/T177913) (owner: 10DCausse) [14:15:54] addshore: I'll deploy ^ [14:16:00] ack! [14:16:04] !log upgrading cumin to v1.3.0 on prod and WMCS cumin masters [14:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:16] (03CR) 10jenkins-bot: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389976 (https://phabricator.wikimedia.org/T177913) (owner: 10DCausse) [14:20:22] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:23] addshore: mine does not work, reverting... [14:22:30] ack! [14:22:48] (03PS1) 10DCausse: Revert "[cirrus] Add overridden iw prefix for svwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389982 [14:24:33] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389982 (owner: 10DCausse) [14:25:51] (03Merged) 10jenkins-bot: Revert "[cirrus] Add overridden iw prefix for svwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389982 (owner: 10DCausse) [14:26:47] addshore: tin is clear you can go ahead [14:26:50] thanks! 
[14:28:04] !log addshore@tin Synchronized php-1.31.0-wmf.7/resources/src/mediawiki/mediawiki.diff.styles.css: SWAT [[gerrit:389969|Add render moved paragraphs marker in diff view]] PT 1/2 (duration: 00m 51s) [14:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] (03PS1) 10Volans: cumin: use new syntax in aliases [puppet] - 10https://gerrit.wikimedia.org/r/389983 [14:29:11] !log addshore@tin Synchronized php-1.31.0-wmf.7/docs/uidesign/mediawiki.diff.html: SWAT [[gerrit:389969|Add render moved paragraphs marker in diff view]] PT 2/2 DOCS ONLY (duration: 00m 50s) [14:29:13] dcausse: zeljkof all done! [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] (03PS1) 10ArielGlenn: use a simple find to toss old cirrussearch dump log files [puppet] - 10https://gerrit.wikimedia.org/r/389984 (https://phabricator.wikimedia.org/T162688) [14:31:34] (03PS3) 10BBlack: eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) [14:31:38] (03PS2) 10ArielGlenn: use a simple find to toss old cirrussearch dump log files [puppet] - 10https://gerrit.wikimedia.org/r/389984 (https://phabricator.wikimedia.org/T162688) [14:32:13] (03CR) 10ArielGlenn: [C: 032] use a simple find to toss old cirrussearch dump log files [puppet] - 10https://gerrit.wikimedia.org/r/389984 (https://phabricator.wikimedia.org/T162688) (owner: 10ArielGlenn) [14:32:38] (03PS1) 10Muehlenhoff: Update snapshot cumin aliases for new role names [puppet] - 10https://gerrit.wikimedia.org/r/389985 [14:32:58] thanks! [14:33:24] ops... though you were reviewing mine :-P [14:33:54] (03PS2) 10Muehlenhoff: Update snapshot cumin aliases for new role names [puppet] - 10https://gerrit.wikimedia.org/r/389985 [14:34:47] rah... I now understand why it did not work... 
it's cached :/ [14:34:56] (03CR) 10Muehlenhoff: [C: 032] Update snapshot cumin aliases for new role names [puppet] - 10https://gerrit.wikimedia.org/r/389985 (owner: 10Muehlenhoff) [14:36:23] (03PS1) 10DCausse: Revert "Revert "[cirrus] Add overridden iw prefix for svwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389986 [14:37:23] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:37:43] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:02] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:02] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:02] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:02] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:03] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:03] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:03] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:12] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:12] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:13] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:14] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:22] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:22] 
PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:23] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:23] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:23] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:23] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:32] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:32] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:32] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:32] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:33] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1050_v4, cp1050_v6 [14:38:33] (03PS1) 10Muehlenhoff: Add missing } in cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/389987 [14:38:33] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:42] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp1050_v4, cp1050_v6 [14:38:46] (03PS4) 10BBlack: eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) [14:39:40] mmh cp1050 not coming up after reboot, let's see ^ [14:39:52] (03CR) 10Muehlenhoff: [C: 032] Add missing } in cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/389987 (owner: 10Muehlenhoff) [14:40:02] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:33] nothing interesting in console, power-cycling [14:41:06] !log powercycle cp1050 (failed reboot) [14:41:10] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:22] did console respond at all? [14:41:32] it was just blank [14:41:34] just wondering if it was the initramfs prompt with the mdadm delay stuff [14:41:49] I still get that on new installs fairly often [14:43:18] (03CR) 10DCausse: "I thought that this patch did not work in the first place because testing on mwdebug1002 had no effect. Double checking this config affect" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389986 (owner: 10DCausse) [14:43:39] it took a looong time to init firmware interfaces [14:43:50] (following the boot process now) [14:44:23] "I'm not slacking..." [14:44:30] (03CR) 10jenkins-bot: Revert "[cirrus] Add overridden iw prefix for svwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389982 (owner: 10DCausse) [14:45:01] (03PS2) 10DCausse: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) [14:45:13] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 68 ESP OK [14:45:13] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 68 ESP OK [14:45:13] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 68 ESP OK [14:45:22] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [14:45:22] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [14:45:22] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:45:22] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 54 ESP OK [14:45:23] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [14:45:32] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 68 ESP OK [14:45:32] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 68 ESP OK [14:45:32] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [14:45:32] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 54 ESP OK [14:45:32] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 54 ESP OK [14:45:32] RECOVERY - IPsec on cp2017 is OK: Strongswan 
OK - 68 ESP OK [14:45:32] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 68 ESP OK [14:45:33] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [14:45:33] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [14:45:34] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [14:45:42] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 68 ESP OK [14:45:42] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 68 ESP OK [14:45:52] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [14:46:02] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 68 ESP OK [14:46:03] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [14:46:03] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [14:46:03] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [14:46:12] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 54 ESP OK [14:46:12] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 54 ESP OK [14:46:13] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [14:46:28] (03PS2) 10Ottomata: [EventBus] Increase number of worker processes to 16 [puppet] - 10https://gerrit.wikimedia.org/r/389975 (https://phabricator.wikimedia.org/T180017) (owner: 10Ppchelko) [14:47:20] it seems to have booted fine now [14:47:28] [Wed Nov 8 14:45:03 2017] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from FINISAR CORP. part number FTLX1471D3BCL [14:47:34] are these expected? 
^ [14:47:56] yeah [14:48:08] (03CR) 10Ottomata: [C: 032] [EventBus] Increase number of worker processes to 16 [puppet] - 10https://gerrit.wikimedia.org/r/389975 (https://phabricator.wikimedia.org/T180017) (owner: 10Ppchelko) [14:50:24] (03PS2) 10Volans: cumin: use new syntax in aliases [puppet] - 10https://gerrit.wikimedia.org/r/389983 [14:50:29] it's basically saying "the optics plugged into this card aren't a brand/model we've officially whitelisted as supported" [14:51:10] I think it has to do with the brand of DAC cables we sometimes use in eqiad [14:51:11] and switches do the same thing, they want their own brand [14:51:16] which is particularly fun with DAC cables [14:51:24] (we only get those FINISAR warnings on bnx2x for caches in eqiad, anyways) [14:51:31] since you can hardly buy the same vendor across switch and server [14:52:42] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/cirrusdump] [14:54:05] !log otto@tin Started deploy [eventlogging/eventbus@41e3418]: (no justification provided) [14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:17] !log otto@tin Finished deploy [eventlogging/eventbus@41e3418]: (no justification provided) (duration: 00m 12s) [14:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] !log esams lvs reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m [14:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:09] !log otto@tin Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16: T180017 [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:15] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [15:00:48] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 
others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3744617 (10phuedx) [15:01:17] (03CR) 10Alexandros Kosiaris: [C: 031] role: Prometheus https access to k8s apiserver / node [puppet] - 10https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:01:43] !log otto@tin Started restart [eventlogging/eventbus@41e3418]: Bumping worker processes to 16 on all targets: T180017 [15:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:54] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [15:04:54] (03CR) 10Ema: [C: 031] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [15:07:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] profile: allow Prometheus to access k8s kubelet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:09:28] (03CR) 10BBlack: [C: 032] eqsin DNS for hosts, services, geodns [dns] - 10https://gerrit.wikimedia.org/r/389739 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [15:13:43] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T180040 [15:13:46] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T180040#3744690 (10ops-monitoring-bot) [15:15:03] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3744698 (10Volans) [15:15:06] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T180040#3744700 (10Volans) [15:16:59] !log otto@tin Started deploy [eventlogging/analytics@02c5a6b]: EventCapsule update and fixes, this is no-op as is. 
T179625 [15:17:04] !log otto@tin Finished deploy [eventlogging/analytics@02c5a6b]: EventCapsule update and fixes, this is no-op as is. T179625 (duration: 00m 04s) [15:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:07] T179625: Resolve EventCapsule / MySQL / Hive schema discrepancies - https://phabricator.wikimedia.org/T179625 [15:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:05] 10Operations, 10Phabricator, 10Traffic: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3744710 (10ema) [15:19:15] 10Operations, 10Phabricator, 10Traffic: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3744725 (10ema) p:05Triage>03Normal [15:21:44] !log eqiad lvs reboots: upgrading kernel to 4.9.51, libssl to 1.0.2m [15:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:54] 10Operations, 10Traffic, 10Patch-For-Review: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3744763 (10BBlack) Status updates? >>! In T156256#3699583, @faidon wrote: > Things pending: > - RPKI, @ayounsi has sent the extra ToS to legal for review and they said they m... [15:23:18] <_joe_> !log testing changes on rhodium regarding hostprivkey,hostcert [15:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:05] <_joe_> expect puppet failures to happen [15:26:22] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:32] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:28:05] (03PS1) 10BBlack: eqsin v6: s/0df2/df2/ (more-canonical and shorter) [dns] - 10https://gerrit.wikimedia.org/r/389995 (https://phabricator.wikimedia.org/T156256) [15:28:59] (03PS3) 10Filippo Giunchedi: role: Prometheus https access to k8s apiserver / node [puppet] - 10https://gerrit.wikimedia.org/r/389929 (https://phabricator.wikimedia.org/T177395) [15:29:01] (03PS3) 10Filippo Giunchedi: profile: allow Prometheus to access k8s kubelet [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) [15:29:02] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:14] (03CR) 10BBlack: [C: 032] eqsin v6: s/0df2/df2/ (more-canonical and shorter) [dns] - 10https://gerrit.wikimedia.org/r/389995 (https://phabricator.wikimedia.org/T156256) (owner: 10BBlack) [15:29:16] (03CR) 10Filippo Giunchedi: profile: allow Prometheus to access k8s kubelet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389930 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [15:29:42] PROBLEM - puppet last run on conf1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:42] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:42] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:42] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:42] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:42] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:43] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:43] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:44] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:44] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:52] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:52] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:53] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:53] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:30:02] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:03] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:03] PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:04] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:04] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:05] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:05] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:06] PROBLEM - puppet last run on labtestneutron2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:12] PROBLEM - puppet last run on mc2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:12] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:12] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:12] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:12] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:30:13] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:13] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:14] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:22] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:32] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:32] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:38] (03PS3) 10BBlack: eqsin: basics [puppet] - 10https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) [15:30:42] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:43] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:52] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:52] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:30:52] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:02] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:33:29] !log Decommissioning restbase2001-c.codfw.wmnet (T179422) [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:34:04] (03CR) 10EBernhardson: [C: 031] archiva: generate git-fat sha1 for .tar.gz and .whl [puppet] - 10https://gerrit.wikimedia.org/r/389932 (owner: 10Gehel) [15:34:43] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused [15:34:44] (03PS4) 10BBlack: eqsin: basics [puppet] - 10https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) [15:35:43] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.157 second response time [15:36:57] (03PS1) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [15:37:23] (03CR) 10jerkins-bot: [V: 04-1] move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [15:37:53] PROBLEM - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:38:25] (03CR) 10EBernhardson: Deploy MjoLniR with new deploy repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [15:39:11] the lvs3001's alert is due to mdmonitor.service [15:39:53] cp4027's instead is due to the weekly backend restart taking a bit longer than usual [15:41:42] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3744841 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:42:07] 10Operations, 10MediaWiki-Containers, 10Continuous-Integration-Infrastructure (shipyard): Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696#3744843 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:42:23] 10Operations, 10hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#3744845 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:42:37] 10Operations, 10Operations-Software-Development: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385#3744846 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:42:54] 10Operations, 10monitoring: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689#3744848 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:53:41] (03CR) 10EBernhardson: [C: 031] [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) (owner: 10DCausse) [15:54:43] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:55:03] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:55:03] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:55:03] RECOVERY - puppet last run on labtestneutron2002 is OK: OK: Puppet is currently 
enabled, last run 30 seconds ago with 0 failures [15:55:13] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:55:14] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:55:14] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:33] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:33] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:55:43] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:56:03] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:56:03] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:56:03] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:56:04] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:23] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:34] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:33] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:03] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:59:43] 
RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:44] RECOVERY - puppet last run on conf1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:44] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:44] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [15:59:44] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:44] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:45] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:45] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:46] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:53] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:53] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:53] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:59:53] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:03] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:00:03] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is 
currently enabled, last run 4 minutes ago with 0 failures [16:00:03] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:00:03] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:03] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:04] RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:04] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:04] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:05] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:05] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:13] RECOVERY - puppet last run on mc2025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:00:13] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:13] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:13] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:14] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:00:23] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:00:24] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
[16:01:09] !log otto@tin Started deploy [eventlogging/analytics@03285e4]: Reverting EventCapsule update and fixes, processes got restarted too early [16:01:12] !log otto@tin Finished deploy [eventlogging/analytics@03285e4]: Reverting EventCapsule update and fixes, processes got restarted too early (duration: 00m 02s) [16:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] 10Operations, 10Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3744898 (10faidon) [16:02:25] 10Operations, 10Traffic, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3744899 (10faidon) [16:02:28] 10Operations, 10Traffic, 10Patch-For-Review: Allocate address space for Singapore (APNIC) - https://phabricator.wikimedia.org/T156256#3744896 (10faidon) 05Open>03Resolved RPKI is all done as far as I know. @mark said he'll create his account later, if at all. I think we can resolve. 
[16:02:30] !log mobrovac@tin Started deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage, take 2 - T179420 [16:02:31] (03PS5) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [16:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:37] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [16:02:43] !log mobrovac@tin Finished deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage, take 2 - T179420 (duration: 00m 13s) [16:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] !log mobrovac@tin Started deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage, take 2b - T179420 [16:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:08:33] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:09:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:11:34] !log mobrovac@tin Finished deploy [restbase/deploy@c5dd1e2]: Switch wiktionary definitions to use the next-gen storage, take 2b - T179420 (duration: 07m 22s) [16:11:37] (03PS2) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [16:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:41] T179420: Migrate definitions storage from the legacy to new strategy - https://phabricator.wikimedia.org/T179420 [16:12:15] (03CR) 10jerkins-bot: [V: 04-1] move hardcoded 
references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [16:14:19] (03CR) 10Filippo Giunchedi: [WIP] First commit (033 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [16:14:37] (03PS3) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [16:16:44] (03CR) 10Muehlenhoff: "cloudarchive-kilo-proposed.list is now also removed, so this should now be a NOP to merge (once the prod/WMCS application is sorted out)" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [16:22:27] awight: o/ [16:24:49] elukey: Hi! Anything I can help with? [16:26:08] awight: it's about ores on scb1002 (https://phabricator.wikimedia.org/T179837#3744192) [16:26:14] (03PS1) 10Nuria: Revert "Removing appInstallId from whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/390019 [16:26:50] elukey: Nasty! Thanks, I recognize the error, looks like we let the canary get out of the cage. [16:29:16] awight: can we fix it with a corrective deployment with --limit or similar? [16:29:34] !log roll-restart thumbor in eqiad for kernel upgrade [16:29:36] (03PS4) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [16:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:08] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [16:33:08] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:33:52] elukey: greg-g: Is this a good time to do an emergency rollback deployment to one of the production ORES nodes? [16:34:31] awight: uhhh, yeah? [16:34:40] any time is a good time for an emergency rollback, ftr [16:34:44] !log awight@tin Started deploy [ores/deploy@82a13ae]: Fix ORES on scb1002 [16:34:45] !log awight@tin Finished deploy [ores/deploy@82a13ae]: Fix ORES on scb1002 (duration: 00m 03s) [16:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:21] !log disregard message about thumbor rolling-restart, upgrade already done and only thumbor1001 rebooted now [16:37:25] !log Restarting Cassandra, restbase1010-[abc] [16:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:58] Amir1: Looks like we have to roll forward…. and I currently can’t test on beta cos of scap breakage. [16:40:02] !log awight@tin Started deploy [ores/deploy@1b0e59f]: Try to purge specter of revscoring 1 [16:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:17] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:40:47] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:04] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3744999 (10Jdforrester-WMF) Yup, my issue is all fixed now, thanks! 
[16:43:01] (03PS2) 10Ottomata: Add exception guard for json parsing in eventlogging mysql filter [puppet] - 10https://gerrit.wikimedia.org/r/389861 (https://phabricator.wikimedia.org/T179625) [16:43:19] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [16:43:30] (03CR) 10jerkins-bot: [V: 04-1] Add exception guard for json parsing in eventlogging mysql filter [puppet] - 10https://gerrit.wikimedia.org/r/389861 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [16:44:15] (03PS3) 10Ottomata: Add exception guard for json parsing in eventlogging mysql filter [puppet] - 10https://gerrit.wikimedia.org/r/389861 (https://phabricator.wikimedia.org/T179625) [16:44:17] (03PS5) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [16:44:20] awight: ImportError: No module named 'pytest' :P [16:44:36] * awight scowls [16:44:58] (03PS1) 10TerraCodes: git.wikimedia.org -> phab [software/swift-utils] - 10https://gerrit.wikimedia.org/r/390026 (https://phabricator.wikimedia.org/T139089) [16:45:47] !log awight@tin Finished deploy [ores/deploy@1b0e59f]: Try to purge specter of revscoring 1 (duration: 05m 45s) [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:07] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3745012 (10RobH) [16:46:17] (03CR) 10Ottomata: [C: 032] Add exception guard for json parsing in eventlogging mysql filter [puppet] - 10https://gerrit.wikimedia.org/r/389861 (https://phabricator.wikimedia.org/T179625) (owner: 10Ottomata) [16:46:20] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:47:00] !log awight@tin Started deploy [ores/deploy@82a13ae]: Roll back scb1002 [16:47:01] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10RobH) [16:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:59] anytime i disable a switch port, i still have a moment of 'man i hope that port was labeled accurately' [16:48:12] (03PS1) 10Arturo Borrero Gonzalez: passwords: add labs key for arturo [labs/private] - 10https://gerrit.wikimedia.org/r/390027 [16:48:14] * robh impatiently watches icinga for ping downtime on unexpected hosts for next few minutes [16:49:02] awight: seems running now [16:49:20] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [16:49:26] (03PS10) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [16:49:37] !log awight@tin Finished deploy [ores/deploy@82a13ae]: Roll back scb1002 (duration: 02m 37s) [16:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:05] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3745028 (10RobH) @krinkle: Please note in my puppet cleanup, these two files reference this host. I did not touch them, as they are scripts and the hostna... 
[16:50:09] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 3580 bytes in 0.062 second response time [16:50:09] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused [16:50:10] PROBLEM - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:50:48] ^^^ just a restart [16:51:10] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042 [16:51:11] RECOVERY - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-c valid until 2018-08-17 16:11:07 +0000 (expires in 281 days) [16:52:57] (03PS1) 10RobH: decom of osmium [puppet] - 10https://gerrit.wikimedia.org/r/390028 (https://phabricator.wikimedia.org/T175093) [16:53:56] (03PS1) 10RobH: remove osmium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/390029 (https://phabricator.wikimedia.org/T175093) [16:54:37] (03CR) 10RobH: [C: 032] remove osmium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/390029 (https://phabricator.wikimedia.org/T175093) (owner: 10RobH) [16:54:47] (03PS2) 10RobH: decom of osmium [puppet] - 10https://gerrit.wikimedia.org/r/390028 (https://phabricator.wikimedia.org/T175093) [16:55:06] (03CR) 10RobH: [C: 032] decom of osmium [puppet] - 10https://gerrit.wikimedia.org/r/390028 (https://phabricator.wikimedia.org/T175093) (owner: 10RobH) [16:55:21] elukey: Thanks for the ping! [16:55:36] greg-g: I’m done making a mess, for now. 
[16:55:45] (03PS2) 10Ottomata: Revert "Removing appInstallId from whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/390019 (owner: 10Nuria) [16:55:48] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Removing appInstallId from whitelist" [puppet] - 10https://gerrit.wikimedia.org/r/390019 (owner: 10Nuria) [16:56:06] robh merging your change [16:56:12] ottomata: cool [16:56:12] puppet-merging [16:56:20] i was removing the puppet cert and was about to merge, heh [16:56:21] thx =] [16:57:00] RECOVERY - Check systemd state on lvs3001 is OK: OK - running: The system is fully operational [16:57:43] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3745063 (10RobH) a:05RobH>03Cmjohnson [16:58:14] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10RobH) All non-interrupt steps complete, this is now pending on site wipe and remaining checkbox steps. Assigned to Chris for completion. [16:58:34] !log Restarting Cassandra, restbase1008-[abc] [16:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3745078 (10Cmjohnson) Disk has been swapped [17:02:03] (03PS6) 10ArielGlenn: move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) [17:02:59] (03CR) 10ArielGlenn: [C: 032] move hardcoded references to dump server mount points out to profile [puppet] - 10https://gerrit.wikimedia.org/r/390000 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [17:04:04] going to merge : [17:04:06] Ottomata: Revert "Removing appInstallId from whitelist" (630ef3d) [17:04:06] RobH: decom of osmium (203736b) [17:04:10] ok? [17:04:13] ? 
[17:04:18] oh apergos yes thanks [17:04:19] sorry [17:04:19] otto just said he merged [17:04:29] oh said yes... [17:04:29] so yeah? [17:04:34] ottomata: =p [17:04:35] Merge these changes? (multiple/no)? yes [17:04:35] Aborting merge. [17:04:36] hm [17:04:36] ;] [17:04:39] oh [17:04:41] multiple. [17:04:43] doh [17:04:44] yeah gotta type multiple [17:04:45] doing [17:04:47] aye ya [17:04:47] habit [17:04:48] doesnt happen often [17:04:54] i did same a few months ago [17:05:04] then wondered why my install server updates werent there for a second [17:05:05] heh [17:05:20] done [17:07:02] im just killing servers, mine wasnt time sensitive since i killed the actual metal. [17:08:40] (03CR) 10Alexandros Kosiaris: "Yeah, let's not cause too many waves on the labs lake indeed. Let's do this opt-in and enable it for deployment-prep via a hiera if guard" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [17:08:50] !log Restarting Cassandra, restbase1009-[abc] [17:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:59] (03PS1) 10ArielGlenn: get rid of the remnants of the cirrussearch dumps logrot [puppet] - 10https://gerrit.wikimedia.org/r/390037 [17:09:02] !log regenerated rhodium puppet certificate [17:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:48] (03CR) 10ArielGlenn: [C: 032] get rid of the remnants of the cirrussearch dumps logrot [puppet] - 10https://gerrit.wikimedia.org/r/390037 (owner: 10ArielGlenn) [17:09:57] 10Operations, 10ops-eqiad, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3745099 (10Cmjohnson) [17:10:19] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:10:21] 10Operations, 10ops-eqiad, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3742297 
(10Cmjohnson) a:05Cmjohnson>03RobH All on-site work is complete assigning to @robh [17:10:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] Deploy MjoLniR with new deploy repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389550 (owner: 10EBernhardson) [17:10:49] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:12:08] (03PS1) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 [17:12:47] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:14:56] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:06] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 2 minutes ago with 11 failures. Failed resources (up to 3 shown): File[/home/bd808],File[/home/rush],File[/home/mark],File[/home/dzahn] [17:15:16] PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/home/faidon],File[/home/gehel],File[/home/otto],File[/home/andrew] [17:15:16] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl] [17:15:27] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:27] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:27] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:15:46] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:15:47] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:26] at least on labcontrol1001 this seems to have been a random hiccup [17:16:57] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/home/awight],File[/home/marostegui],File[/home/niharika29],File[/home/joal] [17:16:58] same on es2003, a second run went fine [17:17:06] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:13] these should clear on the next run [17:18:37] herron: I am guessing a restart? [17:18:42] of apache that is [17:18:53] (03PS1) 10Chad: git.wm.o -> phab.wm.o [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390041 [17:18:56] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/home/filippo],File[/home/ppchelko],File[/home/cscott],File[/home/musikanimal] [17:19:07] heh, ok [17:19:25] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3745134 (10Marostegui) It failed again, was this a brand new disk, @Cmjohnson? ``` root@db1059:~# megacli -pdlist -a0 Adapter #0 Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span:... 
[17:19:31] !log Restarting Cassandra, restbase2003-[abc] [17:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:38] akosiaris it is related to the hostcert setting it seems to cause these puppetdb errors on rhodium [17:19:56] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf],File[/etc/sudoers],File[/usr/local/bin/apt-upgrade-activity],File[/usr/lib/nagios/plugins/check_ferm] [17:20:06] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:20:27] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:21:08] (03PS3) 10Gehel: apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 [17:22:35] (03CR) 10jerkins-bot: [V: 04-1] apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [17:26:03] (03PS1) 10TerraCodes: git.wikimedia.org -> phab [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390043 (https://phabricator.wikimedia.org/T139089) [17:26:48] (03CR) 10Chad: [C: 04-1] "Already handling in Ia71272f847b2534c440d44cffc1312c26b6a8fd8" [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390043 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [17:27:36] (03PS4) 10Gehel: apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 [17:28:12] (03CR) 10jerkins-bot: [V: 04-1] apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [17:29:18] (03PS5) 10Gehel: apt: purge unmanaged sources.list [puppet] - 10https://gerrit.wikimedia.org/r/389942 [17:30:52] (03PS2) 10TerraCodes: git.wikimedia.org -> phab [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390043 
(https://phabricator.wikimedia.org/T139089) [17:31:26] PROBLEM - Nginx local proxy to apache on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:07] (03Abandoned) 10TerraCodes: git.wikimedia.org -> phab [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390043 (https://phabricator.wikimedia.org/T139089) (owner: 10TerraCodes) [17:32:16] RECOVERY - Nginx local proxy to apache on mw2210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.195 second response time [17:32:50] (03CR) 10TerraCodes: [C: 031] git.wm.o -> phab.wm.o [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390041 (owner: 10Chad) [17:34:26] PROBLEM - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:34:37] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.136 and port 9042: Connection refused [17:34:49] (03CR) 10Gehel: "puppet compiler is failing (https://puppet-compiler.wmflabs.org/compiler02/8692/elastic1020.eqiad.wmnet/). 
It pretends that `profile::base" [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [17:38:36] (03CR) 10ArielGlenn: [C: 032] git.wm.o -> phab.wm.o [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/390041 (owner: 10Chad) [17:39:27] RECOVERY - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-c valid until 2018-08-17 16:11:51 +0000 (expires in 281 days) [17:39:50] (03PS1) 10TerraCodes: git.wikimedia.org -> phab [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/390050 (https://phabricator.wikimedia.org/T139089) [17:40:37] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.136 port 9042 [17:41:35] !log Restarting Cassandra, restbase2001-[abc] [17:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] !log restbase truncate the default parsoid storage group's tables for T179417 [17:43:56] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:43:56] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:43:56] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 35 failures. Last run 2 minutes ago with 35 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-enabled/wikimedia-legacy.incl],File[/etc/apache2/sites-enabled/public-wiki-rewrites.incl],File[/etc/apache2/sites-enabled/api-rewrites.incl],File[/etc/apache2/conf-available/50-hhvm-catchall.conf] [17:43:56] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl],File[/var/lib/hphpd/hphpd.ini] [17:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:01] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [17:44:07] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:44:56] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:56] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 2 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh],File[/etc/modprobe.d/nf_conntrack.conf],File[/etc/sudoers],File[/usr/local/bin/apt-upgrade-activity] [17:44:56] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:44:56] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 4 minutes ago with 14 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/get-raid-status-hpssacli],File[/usr/local/lib/nagios/plugins/check_raid],File[/usr/local/lib/nagios/plugins/check_ipmi_sensor],File[/usr/local/lib/nagios/plugins/check_puppetrun] [17:44:56] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:56] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 3 minutes ago with 20 failures. Failed resources (up to 3 shown) [17:44:57] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:44:57] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Puppet has 30 failures. Last run 3 minutes ago with 30 failures. 
Failed resources (up to 3 shown): File[/home/aaron],File[/home/tstarling],File[/home/herron],File[/home/ema] [17:44:57] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 3 minutes ago with 20 failures. Failed resources (up to 3 shown): File[/home/aborrero],File[/home/filippo],File[/home/oblivian],File[/home/jmm] [17:44:58] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:44:58] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP],File[/etc/rsyslog.d] [17:44:59] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/home/gehel],File[/home/otto] [17:45:16] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 3 minutes ago with 10 failures. Failed resources (up to 3 shown): File[/etc/sysctl.d],File[/etc/mysql/grcat.config],File[/root/.screenrc],File[/usr/bin/check_mariadb.py] [17:45:17] RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:17] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:17] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf],File[/etc/sudoers],File[/usr/local/bin/apt-upgrade-activity],File[/usr/lib/nagios/plugins/check_ferm] [17:45:17] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl],File[/var/lib/hphpd/hphpd.ini] [17:45:26] (03PS1) 10Chad: group1 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390052 [17:45:27] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:27] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:27] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:27] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:27] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:28] (03CR) 10Chad: [C: 04-2] group1 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390052 (owner: 10Chad) [17:45:36] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/pnorman],File[/usr/local/bin/gen_fingerprints],File[/home/ariel] [17:45:37] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 4 minutes ago with 11 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main],File[/usr/lib/nagios/plugins/check_conntrack],File[/etc/update-motd.d/97-last-puppet-run],File[/etc/systemd/system/ganglia-monitor.service] [17:45:37] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 3 minutes ago with 11 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/mods-available/expires.conf],File[/usr/local/bin/mediawiki-firejail-convert],File[/usr/local/bin/cgroup-mediawiki-clean],File[/usr/lib/nagios/plugins/check_ferm] [17:45:46] RECOVERY - puppet last run on mc2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:47] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/home/reedy],File[/home/filippo],File[/home/oblivian],File[/home/sharvaniharan] [17:45:47] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:47] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/home/gehel],File[/home/otto],File[/home/andrew],File[/home/ori] [17:46:06] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 4 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh],File[/usr/local/bin/check-hhvm-stacktraces],File[/usr/local/bin/hhvm-needs-restart],File[/etc/apache2/sites-available/07-wikimania.conf] [17:46:46] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:06] RECOVERY - puppet last run on cp4027 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:47:36] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:47:56] !log Deploy alter table on s1 codfw primary master (db2048) with replication, this will generate lag on codfw - T174569 [17:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:04] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [17:49:52] (03PS6) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [17:50:27] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:51:46] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:51:48] !log Clearing snapshots in RESTBase legacy Cassandra cluster (T179417) [17:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:55] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [17:55:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] apt: purge unmanaged sources.list (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [17:56:22] (03PS2) 10ArielGlenn: don't keep and serve old slowparse logs forever [puppet] - 10https://gerrit.wikimedia.org/r/389732 (https://phabricator.wikimedia.org/T174421) [17:56:57] (03CR) 10ArielGlenn: [C: 032] don't keep and serve old slowparse logs forever [puppet] - 10https://gerrit.wikimedia.org/r/389732 (https://phabricator.wikimedia.org/T174421) (owner: 10ArielGlenn) [17:58:20] !log Restarting Cassandra, restbase2005-[abc] [17:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:16] 10Operations, 10Services, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3745214 (10Gehel) [18:03:36] (03CR) 
10Muehlenhoff: apt: purge unmanaged sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/389942 (owner: 10Gehel) [18:04:24] (03PS3) 10Cmjohnson: Removing site.pp and dhcpd entries for decom db's db1028,33,[35-38,41 [puppet] - 10https://gerrit.wikimedia.org/r/389727 [18:06:24] (03CR) 10Cmjohnson: [C: 032] Removing site.pp and dhcpd entries for decom db's db1028,33,[35-38,41 [puppet] - 10https://gerrit.wikimedia.org/r/389727 (owner: 10Cmjohnson) [18:08:34] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban), 10User-greg: Please create a phame blog for the Traffic team - https://phabricator.wikimedia.org/T180041#3745262 (10greg) 05Open>03Resolved a:03greg Created with some stub content (Title and sub-title). Please make them be... [18:10:07] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:10:17] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:10:27] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:10:47] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:12:02] (03PS7) 10Elukey: [WIP] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [18:12:37] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:12:58] (03PS1) 10Chad: wmf-config/Privatesettings.php doesn't exist anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390056 [18:13:56] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:13:56] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:13:56] 
RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:06] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:14:56] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:14:56] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:56] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:56] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:14:57] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:14:57] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:57] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:57] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:14:57] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:58] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:58] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:14:59] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:15:17] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:15:27] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [18:15:27] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:15:36] (03CR) 10Chad: [C: 032] wmf-config/Privatesettings.php doesn't exist anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390056 (owner: 10Chad) [18:15:36] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:15:37] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:15:47] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:07] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:57] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:17:26] (03Merged) 10jenkins-bot: wmf-config/Privatesettings.php doesn't exist anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390056 (owner: 10Chad) [18:17:36] (03CR) 10jenkins-bot: wmf-config/Privatesettings.php doesn't exist anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390056 (owner: 10Chad) [18:18:53] (03CR) 10Chad: [C: 032] "I can't find anything referencing the old path anymore. Here goes nothing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 (owner: 10Chad) [18:19:13] !log demon@tin Synchronized phpcs.xml: no-op (duration: 00m 50s) [18:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:32] (03Merged) 10jenkins-bot: Remove PrivateSettings.php symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 (owner: 10Chad) [18:21:42] (03CR) 10jenkins-bot: Remove PrivateSettings.php symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389780 (owner: 10Chad) [18:23:53] RECOVERY - MegaRAID on db1059 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:27:43] !log demon@tin Synchronized wmf-config/: Dropping old PrivateSettings symlink (ducks and covers) (duration: 00m 52s) [18:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:36] no_justification, I won't be around for the train this evening [18:29:51] but I don't see anything going wrong with the wikidata changes [18:30:29] What are your thoughts on my scheduling some largish deploy windows at some point (EU time) to slowly move away from using the extensions in the build to the regular extensions? [18:31:06] I'd basically slowly check each extension, switch the loading to the extension from the build first for beta, then for a small set of prod wikis and then everything for each extension [18:31:29] In a similar way to the config changes I did yesterday. [18:32:23] EU time is rough for me, but otherwise ok [18:33:32] I don't envision anything going wrong, I mean, I'm just changing where the code is loaded from, but other than that it should all be the same. 
[18:33:59] The only thing I need to touch to do all of that is mediawiki-config now :) [18:34:19] (03PS1) 10BBlack: lvs4005-7: initial puppet setup [puppet] - 10https://gerrit.wikimedia.org/r/390059 (https://phabricator.wikimedia.org/T178436) [18:34:32] And I can do a slow rollout of beta, group0, group1, all wikis for each extension [18:35:17] (03PS1) 10RobH: setting lawrencium production dns [dns] - 10https://gerrit.wikimedia.org/r/390060 (https://phabricator.wikimedia.org/T179968) [18:35:54] Although, actually, this won't be able to happen for a week or 2 as we are still blocked on autoloaders for the extensions! Right, I need to file some more tickets I think! [18:36:53] (03PS1) 10Krinkle: webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 [18:37:22] (03CR) 10jerkins-bot: [V: 04-1] webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 (owner: 10Krinkle) [18:38:21] (03PS2) 10Krinkle: webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 [18:38:59] (03CR) 10jerkins-bot: [V: 04-1] webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 (owner: 10Krinkle) [18:39:04] Krinkle: We should finish killing skins-1.5 [18:39:36] (03CR) 10RobH: [C: 032] setting lawrencium production dns [dns] - 10https://gerrit.wikimedia.org/r/390060 (https://phabricator.wikimedia.org/T179968) (owner: 10RobH) [18:42:35] (03PS3) 10Krinkle: webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 [18:46:01] (03CR) 10BBlack: [C: 032] lvs4005-7: initial puppet setup [puppet] - 10https://gerrit.wikimedia.org/r/390059 (https://phabricator.wikimedia.org/T178436) (owner: 10BBlack) [18:46:51] (03PS1) 10Cmjohnson: Removing dns entries for decom db's 1028,33[35-38],41 [dns] - 
10https://gerrit.wikimedia.org/r/390063 [18:47:11] 10Operations, 10Traffic, 10Patch-For-Review: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3745560 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs4005.ulsfo.wmnet', 'lvs4006.ulsfo.wmnet', 'lvs4007.uls... [18:47:52] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom db's 1028,33[35-38],41 [dns] - 10https://gerrit.wikimedia.org/r/390063 (owner: 10Cmjohnson) [18:49:12] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3745566 (10Cmjohnson) [18:49:43] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1035 - https://phabricator.wikimedia.org/T176931#3745571 (10Cmjohnson) [18:50:09] (03PS7) 10EBernhardson: Deploy MjoLniR with new deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/389550 [18:50:09] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3745574 (10Cmjohnson) [18:52:07] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3745579 (10Cmjohnson) [18:52:27] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1038 - https://phabricator.wikimedia.org/T177911#3745582 (10Cmjohnson) [18:53:32] 10Operations, 10Pybal, 10Traffic: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069#3745584 (10BBlack) [18:55:03] (03PS1) 10Cmjohnson: Removing dns entry for decom server wmf3248 [dns] - 10https://gerrit.wikimedia.org/r/390065 [18:56:12] (03PS1) 10RobH: lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/390066 (https://phabricator.wikimedia.org/T179968) [18:56:31] (03PS2) 10RobH: lawrencium install params [puppet] - 
10https://gerrit.wikimedia.org/r/390066 (https://phabricator.wikimedia.org/T179968) [18:56:53] (03CR) 10RobH: [C: 032] lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/390066 (https://phabricator.wikimedia.org/T179968) (owner: 10RobH) [18:58:02] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3745604 (10BBlack) a:05BBlack>03RobH @RobH - the hostnames for these should be dns4001 + dns4002. We won't be running ganeti when we initially bring these into service, so should have standard no-vi... [18:59:08] 10Operations, 10Traffic: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#3745606 (10BBlack) [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:01:11] I’m working on a late submission for this SWAT window... [19:01:18] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf741[56] - https://phabricator.wikimedia.org/T179204#3745610 (10BBlack) @RobH - also, we should go stretch from the get-go on these as well (like bast4) [19:01:24] I can take care of it. ebernhardson - around? [19:01:28] awight: Sure. [19:01:34] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3745613 (10Cmjohnson) [19:01:38] Niharika: ty! [19:01:57] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission WMF3248 (old R510) - https://phabricator.wikimedia.org/T172323#3494819 (10Cmjohnson) 05Open>03Resolved [19:02:45] Niharika: yup [19:03:29] Alrighty. 
[19:05:41] (03PS6) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [19:06:00] jouncebot: now [19:06:00] For the next 0 hour(s) and 53 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T1900) [19:06:08] Useless bot [19:06:15] awight: I see a fair number of "Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602" [19:06:23] no_justification: Anything wrong? [19:06:36] mmm. I don’t think these patches will fix that ;-) [19:06:43] Niharika: I want swat to end so I can start the train window sooner. [19:06:44] :) [19:07:13] Hmm that ORES notice is odd [19:07:17] no_justification: It will probably end soon given that there aren't a lot of patches. [19:08:18] (03CR) 10Zoranzoki21: [C: 031] Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:09:16] RoanKattouw: Reported last week, I thought a fix was in the pipeline [19:11:02] !log smalyshev@tin Started deploy [wdqs/wdqs@b330bc8]: Update service whitelist [19:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:25] https://phabricator.wikimedia.org/T179830 Doubt that. 
[19:13:22] Hmm WRF [19:13:23] *WTF [19:13:31] So, line 602 is $row->ores_damaging_score > $row->ores_damaging_threshold [19:13:41] Which, sure, $row can fail to have an ores_damaging_score property [19:14:00] !log smalyshev@tin Finished deploy [wdqs/wdqs@b330bc8]: Update service whitelist (duration: 02m 58s) [19:14:04] Except right above it on line 594 we have if ( !isset( $row->ores_damaging_score ) ) { return; } [19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] So how is this notice even possible [19:14:32] Niharika: the backports are in Zuul, if you still have the time: https://gerrit.wikimedia.org/r/#/c/390070/ https://gerrit.wikimedia.org/r/#/c/390071/ [19:14:33] RoanKattouw: Threshold, not score. [19:14:42] Niharika: Gotcha, thanks [19:15:01] awight: I sure do. [19:16:35] RoanKattouw: no_justification: Sorry, we haven’t worked on the undefined var bug yet. It’s tracked as T179830 [19:16:35] T179830: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830 [19:19:59] ebernhardson: Your patch is on mwdebug1002. [19:20:28] 10Operations, 10Traffic, 10Patch-For-Review: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3745639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4005.ulsfo.wmnet', 'lvs4006.ulsfo.wmnet', 'lvs4007.ulsfo.wmnet'] ``` and were **ALL** successful. [19:21:06] awight: I just submitted a patch for that bug but I didn't test it [19:21:40] awight: https://gerrit.wikimedia.org/r/#/c/390070 is on mwdebug1002 as well. [19:21:55] RoanKattouw: Nice! [19:22:05] Niharika: looks reasonable [19:22:57] ebernhardson: Going live then... [19:24:39] RoanKattouw: I think it’s the right idea, thanks for the push! 
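Note on the ORES notice discussed above: as described in the log, the guard on line 594 checks only `ores_damaging_score`, while the comparison on line 602 also reads `ores_damaging_threshold`, so a row that has a score but no threshold passes the guard and then hits the undefined property. What follows is a rough Python analogue of that pattern, offered only as a sketch. The function names are hypothetical, a missing attribute raises `AttributeError` in Python where PHP emits a notice, and this is not the ORES extension's code nor the patch mentioned in the channel.

```python
from types import SimpleNamespace


def is_damaging_unguarded(row):
    """Mirrors the shape described in the log: only the score is checked."""
    if getattr(row, "ores_damaging_score", None) is None:
        return None
    # The threshold is read without a check; if it is missing, this raises.
    return row.ores_damaging_score > row.ores_damaging_threshold


def is_damaging_guarded(row):
    """Checks both fields before comparing (one possible fix, assumed here)."""
    score = getattr(row, "ores_damaging_score", None)
    threshold = getattr(row, "ores_damaging_threshold", None)
    if score is None or threshold is None:
        return None
    return score > threshold


if __name__ == "__main__":
    # A row with a score but no threshold, the case behind the notice.
    row = SimpleNamespace(ores_damaging_score=0.9)
    print(is_damaging_guarded(row))
    try:
        is_damaging_unguarded(row)
    except AttributeError as err:
        print("unguarded access failed:", err)
```

Whether the real fix should return early, fall back to a default threshold, or log the missing value is a design choice the log does not settle; the guarded version above simply returns early.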
[19:24:50] !log niharika29@tin Synchronized php-1.31.0-wmf.7/extensions/CirrusSearch/: Revert Improve handling of 5xx responses to elasticsearch requests (duration: 01m 02s) [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:19] ebernhardson: And done^ [19:25:56] Niharika: thanks! [19:27:10] awight: Both wmf.6 and wmf.7 patches are now on mwdebug1002. Please test. :) [19:28:08] Niharika: I haven’t read the logs yet, but the frontend is behaving correctly. Good to go, thanks! [19:28:32] Alright. [19:30:09] !log niharika29@tin Synchronized php-1.31.0-wmf.7/extensions/ORES/: Store stats of accessing ores service for getting thresholds T179862 (duration: 00m 51s) [19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:15] T179862: Keep statistics about ores service hits for storing thresholds - https://phabricator.wikimedia.org/T179862 [19:31:08] !log Creating page restrictions schema (T179421) [19:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:14] T179421: Migrate revisions and restrictions from legacy to new storage - https://phabricator.wikimedia.org/T179421 [19:31:21] !log niharika29@tin Synchronized php-1.31.0-wmf.6/extensions/ORES/: Store stats of accessing ores service for getting thresholds T179862 (duration: 00m 51s) [19:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:27] awight: Both synced. [19:31:33] no_justification: All yours. [19:31:35] \o/ Thanks again [19:31:53] Niharika: Thanks! [19:32:07] You're welcome! [19:32:10] Niharika: I was going to steal it anyway :p [19:32:23] no_justification has gone even more rogue than usual [19:32:28] (03PS1) 10Ori.livneh: xenon: encode the request method as a virtual stack frame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 [19:32:48] Don't curse my pet bot so much, no_justification. It hurts. 
:P [19:34:02] If only it would actually be active ;-) [19:35:35] (03CR) 10Faidon Liambotis: "Bryan asked me today whether my -2 stands." [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [19:40:06] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:24] madhuvishy: I'm doing cleanup on datasets1001 and ms1001 removing directories we no longer write there, and that you don't need to pick up; that's happening now. [19:42:40] apergos: okay cool [19:44:24] PROBLEM - puppet last run on puppetcompiler1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:07] madhuvishy: done; I need the fqdn for the host that will be pulling, and if it's got an ipv6 addy I need to know that it does (don't need the address, just a yes or no) [19:48:16] then I can add to the rsync conf [19:48:32] labstore1006.wikimedia.org [19:49:03] it has an ipv6 address [19:53:44] apergos: [19:53:46] ^ [19:55:17] (03PS1) 10ArielGlenn: allow labstore1006 to rsync from dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/390075 (https://phabricator.wikimedia.org/T171541) [19:55:25] Thanks, setting it up now [19:55:42] !log Creating mathoid schema (T179419) [19:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:50] T179419: Migrate mathoid storage from legacy to new strategy - https://phabricator.wikimedia.org/T179419 [19:57:00] !log demon@tin Synchronized php-1.31.0-wmf.7/extensions/CentralAuth/includes/CentralAuthUser.php: (no justification provided) (duration: 00m 51s) [19:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:16] (03CR) 10Chad: [C: 032] group1 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390052 (owner: 10Chad) [19:58:47] (03Merged) 10jenkins-bot: group1 to wmf.7 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/390052 (owner: 10Chad) [19:58:56] (03CR) 10jenkins-bot: group1 to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390052 (owner: 10Chad) [19:59:10] (03CR) 10ArielGlenn: [C: 032] allow labstore1006 to rsync from dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/390075 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [20:00:05] no_justification: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:02:45] PROBLEM - Check systemd state on ms1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:03:24] (03PS1) 10ArielGlenn: fix up labstore1006 ferm rules for dump server rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/390077 [20:04:28] (03CR) 10ArielGlenn: [C: 032] fix up labstore1006 ferm rules for dump server rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/390077 (owner: 10ArielGlenn) [20:05:05] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/data/xmldatadumps/private/centralauth] [20:05:45] RECOVERY - Check systemd state on ms1001 is OK: OK - running: The system is fully operational [20:07:30] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#3745780 (10herron) After deploying the updated `hostcert` setting in https://gerrit.wikimedia.org/r/386666 `rhodium` began logging two types o... 
[20:09:03] madhuvishy: the rsync source you want is ms1001.wikimedia.org::data/xmldatadumps/public [20:09:24] !log demon@tin Synchronized php: symlink (duration: 00m 49s) [20:09:25] apergos: do i need to set up anything ssh-keys wise on my end? [20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:31] you don't have to preserve ownership but if you don't, bear in mind that the cleanup, tarball unpkac and other stuff script will need permissions to write there so [20:09:43] it should either run as the user you designate, or root [20:09:49] nope, no keys [20:09:52] it's host-based [20:10:04] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:10:14] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:10:45] !log depooled rhodium via puppetmaster1001 apache config [20:10:46] later I'll set up a separate stanza just for the labstore web servers, this will do for now [20:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:09] apergos: does this look okay? [20:15:12] https://www.irccloud.com/pastebin/mjRrmkgY/ [20:16:39] do you have stuff there now that needs to be deleted? otherwise I'd drop that [20:16:49] and you don't need root@ [20:16:50] not really [20:16:54] ah okay [20:17:16] you don't need the hardlinks flag either [20:18:10] apergos: cool, trying then [20:18:18] great [20:18:24] * apergos camps on ms1001 and watches [20:20:44] I see you... [20:21:17] things seem to be happening :) [20:21:18] all right, I'm gonna just !log this in case anyone is watching the network port usage skyrocket [20:22:02] !log rsync from ms1001 to labstore1006 of dumps, 17T so expect it to take several days [20:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:13] schweet [20:22:16] awesome [20:22:18] thank you! 
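Editor's note: the advice in the exchange above (pull from the rsync daemon module, no root@ prefix, no delete option, no hardlinks flag) can be sketched as a small argument builder. The source string is the one given in the log. The destination path, the use of -a, and the helper name are assumptions for illustration; the actual command lived in a pastebin that is not reproduced here.

```python
def build_rsync_argv(source, dest, delete=False, hard_links=False):
    """Build an rsync argument list without running anything."""
    argv = ["rsync", "-a"]  # archive mode is an assumption, not from the log
    if delete:
        argv.append("--delete")  # dropped: nothing on the target needs deleting
    if hard_links:
        argv.append("-H")  # dropped: the log says hardlinks are not needed
    argv += [source, dest]
    return argv


# Daemon-module syntax (host::module/path); access is host-based, so no
# user@ prefix and no SSH keys are involved.
SOURCE = "ms1001.wikimedia.org::data/xmldatadumps/public"
DEST = "/srv/dumps/"  # hypothetical destination on the pulling host

print(" ".join(build_rsync_argv(SOURCE, DEST)))
```

Building the argument list separately from executing it makes it easy to review the exact command before starting a transfer of this size.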
[20:22:20] yw [20:22:41] anything else I need to do for tonight? else I might be off the clock and only hang out here for snark [20:22:53] I think we're all set! I'm gonna go eat lunch, hope you have a nice evening/night :) [20:23:04] enjoy your lunch! [20:25:24] (03CR) 10Krinkle: [C: 031] xenon: encode the request method as a virtual stack frame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 (owner: 10Ori.livneh) [20:26:30] 10Operations, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3745826 (10RobH) a:05RobH>03Gilles [20:27:01] 10Operations, 10Performance-Team: setup/install lawrencium for temp use by performance team - https://phabricator.wikimedia.org/T179968#3742297 (10RobH) Assigned to @Gilles so he is aware this is ready for his team to take over. Please resolve this task when aware, thanks! [20:33:51] !log aaron@tin Synchronized php-1.31.0-wmf.6/extensions/CentralAuth: Use the proper cache key method in loadFromCache() (duration: 00m 54s) [20:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:10] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3745843 (10Cmjohnson) [20:36:34] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1035 - https://phabricator.wikimedia.org/T176931#3745845 (10Cmjohnson) [20:36:51] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3745846 (10Cmjohnson) [20:37:08] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3745847 (10Cmjohnson) [20:37:29] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1038 - https://phabricator.wikimedia.org/T177911#3745850 (10Cmjohnson) [20:38:02] 10Operations, 
10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3745851 (10Cmjohnson) [20:43:38] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3745857 (10Cmjohnson) 05Open>03Resolved Looks like the rebuild is complete and all disks are back online root@db1059:~# megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmw... [20:45:43] (03PS1) 10Krinkle: webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 [20:46:21] (03CR) 10jerkins-bot: [V: 04-1] webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 (owner: 10Krinkle) [20:48:10] (03PS2) 10Krinkle: webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 [20:52:31] (03PS4) 10Krinkle: webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 [20:53:24] (03PS1) 10ArielGlenn: move last hardcoded user names out of snapshot modules to profiles [puppet] - 10https://gerrit.wikimedia.org/r/390085 (https://phabricator.wikimedia.org/T179942) [20:53:45] (03CR) 10jerkins-bot: [V: 04-1] move last hardcoded user names out of snapshot modules to profiles [puppet] - 10https://gerrit.wikimedia.org/r/390085 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [20:53:56] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3745890 (10Cmjohnson) @jcrespo @marostegui is it safe to finish off db1048? 
[20:56:13] (03PS2) 10ArielGlenn: move last hardcoded user names out of snapshot modules to profiles [puppet] - 10https://gerrit.wikimedia.org/r/390085 (https://phabricator.wikimedia.org/T179942) [20:58:02] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3745891 (10Cmjohnson) @marostegui during my decom checks I found db1015 in this file. Should a replacement be identified? modules/admin/files/enforce-users-groups.sh [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171108T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:38] no parsoid deploy today. [21:01:36] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3745895 (10ayounsi) The `Destination unreachable (Host unreachable)` packets are most likely due to firewalls or middle boxes on the client side that have... 
[21:02:21] Nothing for ORES [21:06:40] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf721[56] - https://phabricator.wikimedia.org/T179204#3745899 (10RobH) [21:08:50] (03PS3) 10ArielGlenn: move last hardcoded user names out of snapshot modules to profiles [puppet] - 10https://gerrit.wikimedia.org/r/390085 (https://phabricator.wikimedia.org/T179942) [21:12:30] (03CR) 10ArielGlenn: [C: 032] move last hardcoded user names out of snapshot modules to profiles [puppet] - 10https://gerrit.wikimedia.org/r/390085 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [21:29:46] (03CR) 10BBlack: [C: 031] Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) (owner: 10Muehlenhoff) [21:33:45] (03CR) 10Zoranzoki21: [C: 031] xenon: encode the request method as a virtual stack frame [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390073 (owner: 10Ori.livneh) [21:37:28] !log bsitzmann@tin Started deploy [mobileapps/deploy@00e60b2]: Update mobileapps to 8e82983 (T178706 T178708 T178333 T170692) [21:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:39] T178708: Update Wiktionary definition parsing to account for new (possibly nested) section tags - https://phabricator.wikimedia.org/T178708 [21:37:39] T178333: Move RESTBase page summary logic to MCS - https://phabricator.wikimedia.org/T178333 [21:37:40] T170692: Return common URLs in summary API so clients do not have to perform bug prone string manipulation - https://phabricator.wikimedia.org/T170692 [21:37:40] T178706: Improve section parsing in mobile-sections endpoint - https://phabricator.wikimedia.org/T178706 [21:39:08] (03PS1) 10Kaldari: Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) [21:44:40] !log bsitzmann@tin Finished deploy [mobileapps/deploy@00e60b2]: Update mobileapps to 
8e82983 (T178706 T178708 T178333 T170692) (duration: 07m 12s) [21:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:49] T178708: Update Wiktionary definition parsing to account for new (possibly nested) section tags - https://phabricator.wikimedia.org/T178708 [21:44:49] T178333: Move RESTBase page summary logic to MCS - https://phabricator.wikimedia.org/T178333 [21:44:50] T170692: Return common URLs in summary API so clients do not have to perform bug prone string manipulation - https://phabricator.wikimedia.org/T170692 [21:44:50] T178706: Improve section parsing in mobile-sections endpoint - https://phabricator.wikimedia.org/T178706 [21:46:47] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987618 (10Tgr) Does that mean it's not going to be possible to JOIN EventLogging tables with MediaWiki tables in the future? I'm not working with user-facing... [21:49:35] (03PS1) 10Dmaza: Enable per-filter profiling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) [21:52:35] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3746008 (10Nuria) @Tgr What types of selects were you doing? We think the best place to do this type of joining is hadoop, we are working into refining EL d... 
[21:58:58] (03PS2) 10Kaldari: Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) [22:06:19] (03PS1) 10RobH: setting dns400[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/390160 (https://phabricator.wikimedia.org/T179204) [22:08:15] (03PS2) 10RobH: setting dns400[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/390160 (https://phabricator.wikimedia.org/T179204) [22:08:39] (03CR) 10RobH: [C: 032] setting dns400[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/390160 (https://phabricator.wikimedia.org/T179204) (owner: 10RobH) [22:09:07] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.7 [22:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:15] 10Operations, 10ops-ulsfo: apply hostname labels to dns400[12] - https://phabricator.wikimedia.org/T180077#3746038 (10RobH) [22:11:47] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy wmf721[56] - https://phabricator.wikimedia.org/T179204#3716775 (10RobH) [22:12:21] 10Operations, 10ops-ulsfo: apply hostname labels to dns400[12] - https://phabricator.wikimedia.org/T180077#3746038 (10RobH) [22:15:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [22:16:30] 10Operations, 10ops-ulsfo, 10Traffic: setup/deploy bast400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3746075 (10RobH) [22:18:32] no_justification: odd thing ... although you only moved group1 forward at 14:09, i'm seeing a flood of errors on enwiki [22:18:46] who broke CirrusSearch ? 
[22:19:24] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [22:19:26] (03PS1) 10RobH: setting dns400[12] mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/390161 (https://phabricator.wikimedia.org/T179204) [22:19:28] NotASpy: I did. And on purpose, of course [22:19:31] "[WgOC6wpAMCYAAGOuy4EAAAAX] 2017-11-08 22:19:24: Fatal exception of type "CirrusSearch\Search\InvalidRescoreProfileException"" [22:19:33] ebernhardson: I only touched group0 [22:19:33] 2017-11-08 22:18:44: Fatal exception of type "CirrusSearch\Search\InvalidRescoreProfileException" no_justification [22:19:34] (03PS2) 10RobH: setting dns400[12] mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/390161 (https://phabricator.wikimedia.org/T179204) [22:19:44] https://en.wikipedia.org/w/index.php?search=test&title=Special%3ASearch&profile=default&fulltext=1 [22:19:46] Zppix: Pointless ping, stop [22:19:53] no_justification: the errors in logstash start at exactly 22:09. It's very odd :S [22:19:55] no_justification: I'd expect nothing less. :p [22:19:56] (03CR) 10RobH: [C: 032] setting dns400[12] mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/390161 (https://phabricator.wikimedia.org/T179204) (owner: 10RobH) [22:20:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:20:34] ebernhardson: Cross-wiki searches?" [22:20:36] no_justification: sigh, it's the cross-wiki search. it's sourcing configuration from wmf.7 which has the new rescore profile [22:20:40] wmf.6/7 mismatch? [22:20:43] Yeah, that's my thought [22:20:48] no_justification: if you can rollback i'll work up a fix [22:20:52] Doing [22:21:11] ebernhardson: way to cut his and I's 1:1 short [22:21:16] sorry! [22:21:18] :) [22:21:30] "enwiki's broken" "uh, ok bye?" 
"bye" [22:21:40] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to wmf.6 [22:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:54] greg-g: I prepped that timebomb on purpose ;-) [22:22:08] things started getting tense, he needed an out [22:22:25] And bam, errors subsiding [22:23:49] so, the simplest fix will be to just pull the rescore refactor into wmf.6 i think. I'm going to test deploy the cherry pick to mwdebug1002 and see [22:24:02] Hi, i still get the error. [22:24:15] ah works now. [22:24:30] greg-g: [{exception_id}] {exception_url} CirrusSearch\Search\InvalidRescoreProfileException from line 115 of /srv/mediawiki/php-1.31.0-wmf.6/extensions/CirrusSearch/includes/Search/RescoreBuilders.php: Unsupported rescore query type: phrase [22:24:35] Wait, wrong copy+paste [22:24:42] greg-g: https://www.youtube.com/watch?v=D8KuH_RxUNE :D [22:25:48] lol [22:26:26] ebernhardson: Sounds like a plan [22:29:18] (03PS1) 10RobH: dns400[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/390164 (https://phabricator.wikimedia.org/T179204) [22:29:59] (03PS1) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [22:34:00] (03PS2) 10RobH: dns400[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/390164 (https://phabricator.wikimedia.org/T179204) [22:34:37] (03CR) 10RobH: [C: 032] dns400[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/390164 (https://phabricator.wikimedia.org/T179204) (owner: 10RobH) [22:37:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [22:37:37] (03PS2) 10ArielGlenn: clean up dumps web server rsync to its fallback [puppet] - 10https://gerrit.wikimedia.org/r/390165 (https://phabricator.wikimedia.org/T179942) [22:41:04] RECOVERY - Esams HTTP 5xx reqs/min on 
graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:42:23] no_justification: looks good with the patch pulled back. syncing it out [22:43:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:43:54] !log ebernhardson@tin Synchronized php-1.31.0-wmf.6/extensions/CirrusSearch/: Backport cirrus rescore profile refactor to wmf.6 (duration: 01m 02s) [22:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:25] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.7 try #2 [22:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:25] ebernhardson: Yay, nothing exploded this time! [22:56:03] :) [23:45:55] !log Decommissioning Cassandra, restbase2004.codfw.wmnet (T179422) [23:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:04] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
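Editor's note: the enwiki breakage above came from a version mismatch: cross-wiki search on a wiki still running wmf.6 sourced configuration from wikis on wmf.7, and the wmf.6 code rejected the new rescore type ("Unsupported rescore query type: phrase"). The sketch below models that failure and the backport fix in Python. The supported-type sets, the profile layout, and all names are hypothetical; only the "phrase" type and the error wording come from the log.

```python
class InvalidRescoreProfileError(Exception):
    """Stand-in for CirrusSearch's InvalidRescoreProfileException."""


# Hypothetical: which rescore types each deployed code version understands.
SUPPORTED_TYPES = {
    "wmf.6": {"function_score"},
    "wmf.7": {"function_score", "phrase"},
}


def build_rescore(code_version, profile, supported=SUPPORTED_TYPES):
    """Validate a rescore profile against the code version evaluating it."""
    for entry in profile:
        if entry["type"] not in supported[code_version]:
            raise InvalidRescoreProfileError(
                "Unsupported rescore query type: " + entry["type"]
            )
    return [entry["type"] for entry in profile]


# Profile as shipped with the newer branch (layout is illustrative).
wmf7_profile = [{"type": "function_score"}, {"type": "phrase"}]

try:
    build_rescore("wmf.6", wmf7_profile)  # old code, new config
except InvalidRescoreProfileError as err:
    print("before backport:", err)

# Backporting the refactor teaches the older branch the new type.
after_backport = dict(SUPPORTED_TYPES, **{"wmf.6": {"function_score", "phrase"}})
print("after backport:", build_rescore("wmf.6", wmf7_profile, after_backport))
```

This matches the sequence in the log: roll group1 back to stop the errors, backport the rescore refactor to wmf.6, then move group1 forward again.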