[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T0000). [00:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:12] * MaxSem can deploy [00:00:14] \o present [00:00:44] (03PS1) 10Thcipriani: Fix scap clean command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323995 [00:04:03] (03PS4) 10Jdlrobson: Clean unused MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [00:04:05] (03PS8) 10Jdlrobson: Switch MobileFrontend to extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [00:04:09] (03CR) 10Jdlrobson: "rebased again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [00:04:44] (03CR) 10MaxSem: "There's no way this change can be deployed without massive error log spam. Break it up?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [00:04:51] jdlrobson, ^ [00:05:06] MaxSem: error log spam? [00:05:11] (03PS1) 1020after4: Phabricator: Don't use vcs group, use phd [puppet] - 10https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) [00:05:29] I stole the deploy conch. [00:05:35] scap is in progress [00:05:52] jdlrobson: not existing configuration variables will be needed [00:06:06] so it's tricky to sync [00:06:33] jdlrobson, if IS.php goes live first, it will scream about old variables. 
if CS.php, it will scream about the new ones [00:06:49] jfdi ;-) [00:07:01] It'll be mad for like...a second [00:07:26] naïve ostriches [00:07:35] but okay [00:07:39] paranoid maxsem :p [00:07:41] surely a little short term log pain is worthwhile for longterm readability sanity? :) [00:07:47] Anyway, you can't do it yet [00:07:49] MaxSem: we got two success methods: reedy suggested copy/pasting the removed block from the old state, to keep old variables [00:07:51] I stole the lockfile [00:08:02] RELENG PERMITTED ME TO BREAK THE CLUSTER ONEONEONE [00:08:17] sync-dir wmf-config [00:08:18] * Reedy hides [00:08:28] MaxSem: and thcipriani did as you suggest, splitting it into several patches [00:08:59] MaxSem: I'll do it if you don't wanna take the blame lol [00:09:07] The advantage of Reedy's method is that we avoid polluting the log history with the deployment logic [00:09:30] I prefer not making logs scream, but I can see the appeal in the sync-dir [00:09:46] sync-dir isn't necessarily any better [00:09:46] sync-dir wmf-config is fast enough [00:09:51] Cause you don't know which it'll do first [00:09:59] in the past, before we had a debug server, it buried any logs you'd be interested in [00:09:59] Just it's one sync, not two [00:10:01] or N [00:10:14] now, that's less of a concern [00:10:20] up to the deployer probably [00:11:55] Systematically we had a high error rate with sync dir for 2 hours. Manually copy/paste the removed part of IS/mobile, or alternatively edit CS, to offer both configs, then sync, then safely remove the part not needed anymore, aka the actual files would avoid that. 
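The ordering problem discussed above (one file going live before the other) and the "offer both configs, sync, then remove" workaround can be sketched as a toy model. This is only an illustration under stated assumptions: the variable names `wgExampleOld` and `wgExampleNew` are hypothetical, "IS" stands for the file that defines settings and "CS" for the file that reads them, and nothing here reflects how scap or MediaWiki actually load configuration.

```python
# Toy model of a config rollout. All names below are hypothetical.

def missing_variables(defined, read):
    """Return the variables the reader expects but the definer lacks."""
    return sorted(set(read) - set(defined))

OLD_IS = {"wgExampleOld"}      # old definitions
NEW_IS = {"wgExampleNew"}      # new definitions
OLD_CS = {"wgExampleOld"}      # old reader
NEW_CS = {"wgExampleNew"}      # new reader
BOTH_IS = OLD_IS | NEW_IS      # transitional file offering both configs

def rollout_errors(states):
    """For each (defined, read) state a host passes through, list what is missing."""
    return [missing_variables(defined, read) for defined, read in states]

# Direct switch: a host may briefly hold a mismatched pair of files.
direct_is_first = rollout_errors(
    [(OLD_IS, OLD_CS), (NEW_IS, OLD_CS), (NEW_IS, NEW_CS)])
direct_cs_first = rollout_errors(
    [(OLD_IS, OLD_CS), (OLD_IS, NEW_CS), (NEW_IS, NEW_CS)])

# Two-phase: add the new definitions first, switch the reader, then clean up.
two_phase = rollout_errors(
    [(OLD_IS, OLD_CS), (BOTH_IS, OLD_CS), (BOTH_IS, NEW_CS), (NEW_IS, NEW_CS)])

print(direct_is_first)
print(direct_cs_first)
print(two_phase)
```

In this model the direct switch has one intermediate state with a missing variable whichever file lands first, while the two-phase sequence has none, at the cost of extra syncs.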
[00:17:08] !log demon@tin Finished scap: moving some stuff around, pruned old branches too (duration: 20m 29s) [00:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:25] !log re-enabled icinga notifications for labtest* services (first double checked they are _not_ paging anymore) (T120047) [00:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:36] T120047: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047 [00:20:53] MaxSem: I'm done [00:21:02] woo [00:21:14] (03CR) 10MaxSem: [C: 032] Switch MobileFrontend to extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [00:21:51] (03Merged) 10jenkins-bot: Switch MobileFrontend to extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [00:22:51] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure, 10Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2829483 (10Dzahn) 05Open>03Resolved @volans This should resolve it, i enabled the notifications again, for the services on these hosts and... 
[00:23:39] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure, 10Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#2829485 (10Dzahn) @chasemp we might see labtest* notifications again in email and IRC but not via SMS [00:25:44] jdlrobson, pulled on mwdebug1002 [00:25:57] MaxSem: on it [00:27:12] (03CR) 10Dzahn: "[iridium:~] $ id vcs" [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [00:29:41] (03PS1) 10Chad: WIP: Remove mobileportal docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323999 [00:30:04] MaxSem: looks good to me [00:30:12] product is behaving as expected [00:30:48] (03CR) 10Dzahn: Phabricator: Create user vcs and group vcs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [00:30:57] (03CR) 10Dzahn: [C: 04-1] Phabricator: Create user vcs and group vcs [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [00:32:44] * MaxSem grabs popcorn [00:33:28] jdlrobson, meanwhile: 7 Undefined index: token in /srv/mediawiki/php-1.29.0-wmf.3/extensions/MobileFrontend/includes/diff/InlineDifferenceEngine.php on line 169 [00:33:28] 5 Invalid State Error in /srv/mediawiki/php-1.29.0-wmf.3/extensions/MobileFrontend/includes/MobileFormatter.php on line 673 [00:33:28] 5 Invalid State Error in /srv/mediawiki/php-1.29.0-wmf.3/extensions/MobileFrontend/includes/MobileFormatter.php on line 667 [00:33:39] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/314748/ (duration: 00m 46s) [00:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:42] thcipriani: still around ? I'm about to upload scap 3.4 internally, was there a puppet patch already? [00:34:42] jdlrobson, everything still works? ^ [00:35:08] godog: yup, I'm around and there is... [00:35:10] * thcipriani looks [00:35:27] godog: https://gerrit.wikimedia.org/r/#/c/323852/ [00:35:34] (03CR) 10Dzahn: "but... 
when moving the entire file we also still get this: http://puppet-compiler.wmflabs.org/4696/ .. hrmm" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:35:53] so far so good MaxSem. Is the Mobileformatter error new? [00:35:59] not new [00:36:02] can you give me more context (i can login to logstash if needed) [00:36:12] that's all the context [00:36:18] (03CR) 10MaxSem: [C: 032] Clean unused MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [00:36:20] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:36:54] (03Merged) 10jenkins-bot: Clean unused MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [00:37:23] (03PS2) 10Filippo Giunchedi: scap: bump version to 3.4.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/323852 (owner: 10Thcipriani) [00:37:31] undefined index token looks like an issue in the patrol function.. which ive suspected to be broken for some time [00:37:53] meanwhile, the next patch is on mwdebug1002 [00:37:54] thcipriani: ack, tyvm, going to merge that then [00:38:19] godog: could you wait for SWAT to finish up? [00:38:38] thcipriani: for sure, poke me when done! 
[00:38:40] don't think it would have any affect, but...don't want to cause issue [00:38:46] will do :) [00:39:30] ill raise a bug for that token issue MaxSem [00:40:30] that patch looks good too MaxSem [00:41:16] (03PS3) 10Filippo Giunchedi: scap: bump version to 3.4.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/323852 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [00:41:27] (03PS5) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:41:30] (03CR) 10Chad: "Um, there's already a user{} stanza at the top of the file too, we should just fine the group and add the dependency to the user." [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [00:42:03] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/315985/ (duration: 00m 51s) [00:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:18] jdlrobson, ^ [00:42:22] (03CR) 10Chad: "Word vomit. I meant "we should add the group"" [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [00:43:19] godog: On the debian package front, is there a chance you could poke https://gerrit.wikimedia.org/r/#/c/323545/ for me? [00:43:25] (at some point this week, doesn't have to be today) [00:45:44] Thanks MaxSem [00:46:03] (03CR) 10Dzahn: "like PS5 it looks better: http://puppet-compiler.wmflabs.org/4697/ but the files need to be moved at the same time" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:46:30] ostriches: sure, I've added myself to that review, should be able to poke it this week during clinic duty [00:46:46] ty! [00:46:47] MaxSem: was that the end of SWAT? [00:46:56] (03CR) 10Dzahn: "additional problem only on logstash1003 now:" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:46:58] yup, thcipriani [00:47:04] cool, thanks! 
[00:47:17] godog: scap patch should be fine to go out now [00:49:05] (03CR) 10Dzahn: logstash: Break logstash.pp up into individual classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:50:05] (03CR) 10Thcipriani: [C: 031] "new scap version has been uploaded" [puppet] - 10https://gerrit.wikimedia.org/r/323852 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [00:51:44] (03PS6) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [00:51:51] (03PS2) 10Chad: MWVersion.php simplification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321600 [00:52:11] (03PS2) 10Dzahn: logstash: Move files from root to role module [puppet] - 10https://gerrit.wikimedia.org/r/323332 (owner: 10BryanDavis) [00:52:13] (03PS7) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (owner: 10BryanDavis) [00:52:15] (03PS3) 10Dzahn: logstash: Add processing rules for MediaWiki's exception channel [puppet] - 10https://gerrit.wikimedia.org/r/323351 (https://phabricator.wikimedia.org/T136849) (owner: 10BryanDavis) [00:52:38] (03PS8) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [00:54:51] (03CR) 10Dzahn: "now it compiles on all nodes: http://puppet-compiler.wmflabs.org/4698/ there are changes due to the moved files" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [00:55:52] thcipriani: ok, going ahead with that [00:56:13] (03PS4) 10Filippo Giunchedi: scap: bump version to 3.4.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/323852 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [00:57:09] (03CR) 10Chad: [C: 032] MWVersion.php 
simplification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321600 (owner: 10Chad) [00:57:44] (03Merged) 10jenkins-bot: MWVersion.php simplification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321600 (owner: 10Chad) [00:57:50] (03CR) 10Dzahn: [C: 04-1] "blocked on .4 being deployed" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [00:58:28] (03CR) 10Filippo Giunchedi: [C: 032] scap: bump version to 3.4.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/323852 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [00:59:27] ostriches: if you're doing mw-config stuff, wanna get this fix there for tomorrow: https://gerrit.wikimedia.org/r/#/c/323995/ [01:00:09] (03CR) 10Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: 10Filippo Giunchedi) [01:00:19] (03CR) 10Chad: [C: 032] Fix scap clean command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323995 (owner: 10Thcipriani) [01:01:01] (03Merged) 10jenkins-bot: Fix scap clean command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323995 (owner: 10Thcipriani) [01:01:19] ty! [01:01:46] thcipriani: merged, should be deployed in the next ~30min or so [01:02:10] godog: awesome! Thank you! I'll try a README sync here in a bit. [01:02:19] ok! 
[01:02:25] !log demon@tin Synchronized scap/plugins/clean.py: typofix (duration: 00m 43s) [01:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:45] !log demon@tin Synchronized multiversion/: use MWVersion relative path (duration: 00m 59s) [01:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:20] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:04:49] !log demon@tin Synchronized rpc/RunJobs.php: more relative mwversion stuff (duration: 00m 43s) [01:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:51] !log demon@tin Synchronized w/: remove old MWVersion entry point (duration: 00m 47s) [01:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:41] 06Operations, 10Traffic, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2829562 (10fgiunchedi) Looks like it can be any number really, ATM I'm seeing three different UUIDs ``` varnish_backen... [01:08:30] RECOVERY - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is OK: TCP OK - 0.036 second response time on 10.192.48.68 port 9042 [01:15:53] 06Operations, 06Performance-Team, 10Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2829576 (10fgiunchedi) @Gilles no I don't think you have access, I'm running the command above for e.g. 
30m to get a sample [01:17:14] dangit [01:17:17] I did break somethin [01:17:29] !log demon@tin Synchronized w/MWVersion.php: bleh, still used (duration: 00m 47s) [01:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:14] (03PS1) 10Chad: Re-adding w/MWVersion, grr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324003 [01:18:23] (03CR) 10Chad: [C: 032 V: 032] Re-adding w/MWVersion, grr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324003 (owner: 10Chad) [01:18:35] Almost [01:18:38] Oh well, time for a break [01:24:12] !log rolling out security upgrades for libicu52 [DSA 3725-1] (CVE-2014-9911 CVE-2015-2632 CVE-2015-4844 CVE-2016-0494 CVE-2016-6293 CVE-2016-7415) [01:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:46] !log upgrade prometheus-varnish-exporter on cache_misc/ulsfo T150479 [01:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:56] T150479: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479 [01:29:00] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:29:30] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:29:30] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:29:55] those are apt being busy with install [01:30:02] and i only did codfw so far [01:30:21] well,no, it's upgrading scap [01:31:05] that's different [01:31:30] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:33:27] hrm, are those puppet runs scap update fails? 
[01:34:04] ah, yeah, they are...why would that happen on those hosts. apt-get update should run first... [01:34:20] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:34:43] !log mw2152 - remove libicu48 (for some reason this one host was different from all the others) [01:34:52] thcipriani: yes, but after the next run they are ok [01:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:20] ah, ok, yeah, I see the candidate there now. [01:35:30] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:36:26] thcipriani: at the same time apt was installing another upgrade on all of mw in codfw ... [01:36:39] that made me think it's that first [01:37:36] (03PS1) 10Thcipriani: Scap: clone directory group writable [puppet] - 10https://gerrit.wikimedia.org/r/324006 (https://phabricator.wikimedia.org/T151231) [01:40:28] (03PS1) 10Gergő Tisza: Fix "Message::toString using implicit format" warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324008 [01:42:35] (03CR) 10Chad: [C: 032] Fix "Message::toString using implicit format" warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324008 (owner: 10Gergő Tisza) [01:42:55] !log upgrade prometheus-varnish-exporter on cache boxes in ulsfo T150479 [01:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:06] T150479: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479 [01:43:10] (03Merged) 10jenkins-bot: Fix "Message::toString using implicit format" warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324008 (owner: 10Gergő Tisza) [01:43:56] 01:43:37 sync-file failed: Failed to acquire lock "/var/lock/scap"; owner is "thcipriani"; reason is "scap 3.4 sync file" [01:43:59] \o/ [01:44:05] Yay my new error message is live 
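The lock error quoted above names both the owner and the reason. A minimal sketch of a lock file that records that information could look like the following. It is not scap's implementation: the JSON file layout and the `LockHeld`, `acquire` and `release` names are assumptions made for illustration, and only the wording of the message mirrors the log.

```python
import json
import os

class LockHeld(Exception):
    """Raised when another deployer already holds the lock."""

def acquire(path, owner, reason):
    """Create the lock file atomically, recording who holds it and why."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        try:
            with open(path) as handle:
                info = json.load(handle)
        except (OSError, ValueError):
            # The holder may not have finished writing its details yet.
            info = {}
        raise LockHeld(
            'Failed to acquire lock "%s"; owner is "%s"; reason is "%s"'
            % (path, info.get("owner", "unknown"), info.get("reason", "unknown"))
        )
    with os.fdopen(fd, "w") as handle:
        json.dump({"owner": owner, "reason": reason}, handle)

def release(path):
    """Remove the lock file so the next deployer can proceed."""
    os.remove(path)
```

A caller would wrap its sync in `acquire(...)` and `release(...)` (ideally in a `try`/`finally`), so a second deployer sees who to talk to instead of a bare "lock held" failure.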
[01:44:13] heh, working [01:44:18] !log thcipriani@tin Synchronized README: scap 3.4 sync file (duration: 00m 53s) [01:44:20] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:59] nice, pretty cool [01:45:10] glad to get 3.4 out [01:45:18] !log demon@tin Synchronized wmf-config/CommonSettings.php: fix implicit message parsing error (duration: 00m 48s) [01:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:40] neat [01:46:30] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:58] !log restarting hhvm service across codfw [01:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:38] !log deploying icu upgrade on all eqiad mw servers [01:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:20] !log terbium, mw1260: purging libicu48 package in 'rc' status [01:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:00] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [02:03:20] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [02:09:27] !log rolling restart of hhvm service across eqiad [02:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:20] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [02:23:09] (03PS1) 10ArielGlenn: move dump result constants to the miscdumpslib module [dumps] - 10https://gerrit.wikimedia.org/r/324009 (https://phabricator.wikimedia.org/T133547) [02:24:19] (03PS1) 10ArielGlenn: generate statusinfo and md5sums for generic set of dump 
output files [dumps] - 10https://gerrit.wikimedia.org/r/324011 (https://phabricator.wikimedia.org/T133547) [02:24:46] (03CR) 10jenkins-bot: [V: 04-1] generate statusinfo and md5sums for generic set of dump output files [dumps] - 10https://gerrit.wikimedia.org/r/324011 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [02:28:05] (03PS2) 10ArielGlenn: generate statusinfo and md5sums for generic set of dump output files [dumps] - 10https://gerrit.wikimedia.org/r/324011 (https://phabricator.wikimedia.org/T133547) [02:29:29] (03PS1) 10ArielGlenn: handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) [02:29:47] (03CR) 10jenkins-bot: [V: 04-1] handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [02:36:26] (03PS2) 10ArielGlenn: handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) [02:36:47] (03CR) 10jenkins-bot: [V: 04-1] handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [02:38:31] (03PS3) 10ArielGlenn: handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) [02:45:11] (03PS1) 10ArielGlenn: remove cutoff option from command line [dumps] - 10https://gerrit.wikimedia.org/r/324015 [02:46:53] (03PS1) 10ArielGlenn: add MiscDumpFactory to manage different dump types [dumps] - 10https://gerrit.wikimedia.org/r/324016 [02:47:18] (03CR) 10jenkins-bot: [V: 04-1] add MiscDumpFactory to manage different dump types [dumps] - 10https://gerrit.wikimedia.org/r/324016 (owner: 10ArielGlenn) [03:03:52] !log druid1001 - restarting all druid services [03:04:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:55] (03PS1) 10Chad: Delete the Wikimedia Legal Code, ancient & unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324017 [03:05:45] (03CR) 10Chad: [C: 032] Delete the Wikimedia Legal Code, ancient & unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324017 (owner: 10Chad) [03:06:18] (03Merged) 10jenkins-bot: Delete the Wikimedia Legal Code, ancient & unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324017 (owner: 10Chad) [03:09:58] !log demon@tin Synchronized docroot/foundation: rm old legalcode junk (duration: 01m 33s) [03:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:01] (03PS2) 10ArielGlenn: add MiscDumpFactory to manage different dump types [dumps] - 10https://gerrit.wikimedia.org/r/324016 [03:16:31] (03PS1) 10ArielGlenn: move last references to incr/Incr out of generateincrementals module [dumps] - 10https://gerrit.wikimedia.org/r/324018 (https://phabricator.wikimedia.org/T133547) [03:19:04] (03PS1) 10ArielGlenn: generateincrementals.py becomes generatemiscdumps.py at last [dumps] - 10https://gerrit.wikimedia.org/r/324019 (https://phabricator.wikimedia.org/T133547) [03:21:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 714.16 seconds [03:24:51] (03PS4) 10Aude: Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) [03:29:10] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:30:10] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [03:30:20] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [03:32:50] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:32:50] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:33:50] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [03:33:50] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [03:38:10] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:39:10] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [03:40:00] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:41:00] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [03:42:40] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [03:45:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 242.05 seconds [04:00:19] (03CR) 10Krinkle: [C: 031] Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [04:11:00] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2396.60 Read Requests/Sec=532.90 Write Requests/Sec=0.30 KBytes Read/Sec=41970.40 KBytes_Written/Sec=17.20 [04:11:07] (03Abandoned) 10BryanDavis: [WIP] Add Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/212294 (owner: 10BryanDavis) [04:12:19] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2829687 (10demon) p:05High>03Normal So, pretty sure this 
only affects core and **maybe** extensions that get wmf branches. Best we can... [04:21:00] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=71.30 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=6.00 [04:28:39] (03PS2) 10BryanDavis: Monolog: Add processor for XFF resolved IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) [04:32:12] (03CR) 10BryanDavis: "PS2 includes a manual rebase and addresses Chad's point about avoiding a global. This may still have a non-trivial performance impact." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) (owner: 10BryanDavis) [04:41:50] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:50:57] (03CR) 10BryanDavis: "> I rewrote bigbrother as a module to tools-manifest (because it" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [04:54:45] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#1945898 (10Srittau) This problems seems to increase. There was one thread on the the English VP today, and... [05:06:51] (03CR) 10BryanDavis: [C: 031] Send 'exception' channel to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323111 (https://phabricator.wikimedia.org/T136849) (owner: 10Gergő Tisza) [05:10:50] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [05:26:41] heh. 
it kind of works [05:28:43] jouncebot: next [05:28:43] In 8 hour(s) and 31 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1400) [05:30:30] (03CR) 10BryanDavis: [C: 032] Adding nick change functionality automatically [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: 10Gerrit Patch Uploader) [05:31:22] (03Merged) 10jenkins-bot: Adding nick change functionality automatically [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: 10Gerrit Patch Uploader) [05:41:27] (03PS1) 10BryanDavis: Don't keep adding _ to nick indefinitely [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) [05:51:26] (03CR) 10Hashar: [C: 031] "If both jouncebot and jouncebot_ are used.. Interesting matter will happens :]" (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [05:56:20] (03CR) 10BryanDavis: Don't keep adding _ to nick indefinitely (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [06:32:10] (03PS2) 10Marostegui: db-codfw.php: Depool db2048 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323854 (https://phabricator.wikimedia.org/T149553) [06:34:38] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2048 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323854 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [06:35:26] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2048 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323854 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [06:37:58] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2048 - T149553 (duration: 
00m 46s) [06:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:11] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [06:41:13] !log Stopping replication db1095 - s1 instance for maintenance - T150802 [06:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:25] T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802 [06:50:20] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [07:08:40] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:35:40] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:42:52] (03CR) 10Jcrespo: "Volans, if the check is wrong, according to the style guide, the check must be disabled; we cannot blindly follow a check that has a bug a" [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [07:43:40] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:53:30] PROBLEM - puppet last run on dbproxy1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:10:19] (03CR) 10Giuseppe Lavagetto: [C: 032] add additional information on malformed responses [software/service-checker] - 10https://gerrit.wikimedia.org/r/321714 (https://phabricator.wikimedia.org/T150560) (owner: 10Volans) [08:12:40] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [08:19:46] (03PS2) 10Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 [08:20:29] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (owner: 10Dzahn) [08:21:30] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:26:07] (03CR) 10Dzahn: [C: 031] Don't keep adding _ to nick indefinitely [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [08:32:11] (03PS3) 10Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) [08:33:25] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [08:35:21] (03PS4) 10Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) [08:36:02] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [08:36:50] (03CR) 10Dzahn: "adding people but should not be merged, pending linked ticket. but it needs the discussion/consensus." 
[puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [08:37:10] (03PS7) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [08:37:27] (03CR) 10Jcrespo: Create script to check that sanitarium filtering is working (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [08:37:54] (03CR) 10jenkins-bot: [V: 04-1] Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [08:41:16] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2829904 (10Gilles) Looking at a recent example to figure out the proper paths: http://upload.wikimedia.org/wikipedia/commons/thumb/temp/b/b7/20161128144034%21chunkedupload... 
[08:48:13] 06Operations, 06Performance-Team, 10Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#2829906 (10Gilles) [08:53:26] (03PS1) 10Dzahn: puppet-lint.rc: make exception for "no docs" obsolete :) [puppet] - 10https://gerrit.wikimedia.org/r/324033 (https://phabricator.wikimedia.org/T127797) [08:57:49] (03PS3) 10Paladox: Phabricator: Create group vcs and require it by the vcs user [puppet] - 10https://gerrit.wikimedia.org/r/323972 [08:58:10] (03PS4) 10Paladox: Phabricator: Create group vcs and require it by the vcs user [puppet] - 10https://gerrit.wikimedia.org/r/323972 [09:00:30] (03PS8) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [09:01:24] (03CR) 10jenkins-bot: [V: 04-1] Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:01:50] (03CR) 10Paladox: "> [iridium:~] $ id vcs" [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [09:03:33] (03CR) 10Paladox: "I think the vcs user is in the phd group." [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [09:04:10] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.18 seconds [09:05:51] that is happening at the same time every day, must be an analytics/research script [09:05:57] on cron [09:09:35] (03PS1) 10Jcrespo: Ignore W503 pep8 warning, it is outdated on our jenkins [puppet] - 10https://gerrit.wikimedia.org/r/324035 [09:09:59] (03PS2) 10Jcrespo: Ignore W503 pep8 warning, it is outdated on our jenkins [puppet] - 10https://gerrit.wikimedia.org/r/324035 [09:12:00] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:12:03] (03CR) 10Jcrespo: "You can see it with the wrong behavior at: https://integration.wikimedia.org/ci/job/operations-puppet-tox-jessie/9662/console" [puppet] - 10https://gerrit.wikimedia.org/r/324035 (owner: 10Jcrespo) [09:16:04] (03CR) 10Paladox: "See https://github.com/wikimedia/operations-puppet/blob/283ee6a37fe76a43d7e75aac54b517e2e122000e/modules/role/manifests/phabricator/main.p" [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [09:22:02] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2829980 (10Gilles) According to code I ran in eval.php, the thumbnail should be stored at: ``` mwstore://local-multiwrite/local-temp/thumb/b/b7/20161128144034!chunkeduploa... [09:22:31] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2829981 (10doctaxon) @Paladox @elukey : no, it's not fixed. These Gateway errors happen as before the api server restart. The URLs: * https://de.wikipedia.org/w/api.php?action=query&titles=Geschichte%20des%20Verkehrs&... 
[09:23:17] (03PS9) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [09:24:35] (03CR) 10jenkins-bot: [V: 04-1] Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:25:10] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 245.24 seconds [09:27:20] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2829987 (10Shizhao) no, still can't login, "Please enter verification code from your mobile app" [09:27:24] (03CR) 10Jcrespo: "@volans: This is blocking https://gerrit.wikimedia.org/r/323525 , which is a privacy-related part of the goal, so I would need a quick ok " [puppet] - 10https://gerrit.wikimedia.org/r/324035 (owner: 10Jcrespo) [09:27:47] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2829990 (10Shizhao) 05Resolved>03Open [09:30:17] (03PS10) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [09:30:55] (03CR) 10jenkins-bot: [V: 04-1] Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:33:48] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830010 (10doctaxon) as I can receive, it looks like the errors become more and more every day [09:41:00] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:42:40] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:48:22] <_joe_> !log upgrading firejail on scb1004, restarting all dependent services [09:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:56] (03PS1) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [09:50:06] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:51:17] <_joe_> why -1? [09:51:42] <_joe_> /modules/role/files/mariadb/check_private_data.py [09:51:46] <_joe_> fix that please [09:52:20] (03PS2) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [09:53:22] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:53:23] ? 
[09:53:39] then -1, I have not filled in the classes [09:53:42] *the [09:53:49] <_joe_> flake8 is complaining about that file [09:53:58] that is a bug [09:54:06] https://gerrit.wikimedia.org/r/#/c/324035/ [09:54:12] _joe_, ^ [09:54:30] of the pep8 executable [09:55:09] <_joe_> funny how both versions are mathematically incorrect [09:55:14] (03CR) 10Volans: [C: 031] "I'm ok with this change as a temporary fix, because upgrading flake8 triggers a tons of other errors that we have in our code, so it would" [puppet] - 10https://gerrit.wikimedia.org/r/324035 (owner: 10Jcrespo) [09:55:23] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2830072 (10Perhelion) [09:55:35] well, I go by the style guide [09:55:50] (03CR) 10Giuseppe Lavagetto: [C: 031] Ignore W503 pep8 warning, it is outdated on our jenkins [puppet] - 10https://gerrit.wikimedia.org/r/324035 (owner: 10Jcrespo) [09:55:54] and it actually works on the latest version of pep8 [09:56:00] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:56:02] e.g. 
on jessie [09:56:13] <_joe_> jynus: yeah mine was just a tangential comment :P [09:56:22] sorry [09:57:11] (03CR) 10Jcrespo: [C: 032] Ignore W503 pep8 warning, it is outdated on our jenkins [puppet] - 10https://gerrit.wikimedia.org/r/324035 (owner: 10Jcrespo) [09:58:03] did anyone deploy that?^ [09:58:12] jynus, _joe_ FYI the PEP8 styleguide was updated this april, see https://github.com/python/peps/commit/c59c4376ad233a62ca4b3a6060c81368bd21e85b#diff-64ec08cc46db7540f18f2af46037f599 [09:58:36] jynus: nobody, you didn't submit [09:58:38] (03PS2) 10Giuseppe Lavagetto: Add LVS IP for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/322917 [09:58:44] ha [09:58:46] I am silly [09:59:02] no, accustomed to mediawiki-style deploying [09:59:58] (03CR) 10Giuseppe Lavagetto: [C: 032] Add LVS IP for pdfrender [dns] - 10https://gerrit.wikimedia.org/r/322917 (owner: 10Giuseppe Lavagetto) [10:00:18] (03PS11) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:01:50] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:02:40] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [10:02:51] <_joe_> !log upgrading firejail on all other scb servers in eqiad [10:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:05] (03CR) 10Volans: "misunderstanding :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: 10Filippo Giunchedi) [10:10:40] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:13:27] (03PS1) 10Giuseppe Lavagetto: pdfrender: active by default [puppet] - 10https://gerrit.wikimedia.org/r/324071 [10:13:31] <_joe_> mobrovac: ^^ [10:13:44] (03PS3) 10Mobrovac: RESTBase: Add the PDF Render service config [puppet] - 10https://gerrit.wikimedia.org/r/323548 [10:13:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pdfrender: active by default [puppet] - 10https://gerrit.wikimedia.org/r/324071 (owner: 10Giuseppe Lavagetto) [10:15:17] <_joe_> mobrovac: wait, are we supposed to access this from restbase? [10:15:25] <_joe_> and, are we storing pdfs in cassandra? [10:15:48] (03PS12) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:15:51] <_joe_> if not, please let mediawiki call pdfrender directly; if we do, well, are we sure we want to do it? 
[10:17:15] (03PS13) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:22:15] (03PS2) 10Giuseppe Lavagetto: pdfrender: lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/322925 [10:23:06] (03CR) 10Giuseppe Lavagetto: pdfrender: lvs configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322925 (owner: 10Giuseppe Lavagetto) [10:24:22] (03PS4) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) [10:25:00] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:27:58] <_joe_> mobrovac: care to take a last look at my change? [10:29:50] yup [10:31:41] (03PS14) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:32:11] (03CR) 10Mobrovac: [C: 031] pdfrender: lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/322925 (owner: 10Giuseppe Lavagetto) [10:32:17] _joe_: ^ [10:32:30] (03CR) 10ArielGlenn: [C: 032] move dump result constants to the miscdumpslib module [dumps] - 10https://gerrit.wikimedia.org/r/324009 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [10:33:20] <_joe_> mobrovac: kk\ [10:35:03] (03PS15) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:35:05] (03CR) 10MarcoAurelio: [C: 031] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/303183 (owner: 10BryanDavis) [10:35:44] <_joe_> mobrovac: doing a last round of PCC, then merging [10:36:36] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830171 (10doctaxon) i'll try a monitoring with an api query traffic loop and will report here [10:36:55] (03CR) 10ArielGlenn: [C: 032] generate statusinfo and md5sums for generic set of dump output files [dumps] - 10https://gerrit.wikimedia.org/r/324011 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [10:38:26] (03CR) 10Giuseppe Lavagetto: [C: 032] pdfrender: lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/322925 (owner: 10Giuseppe Lavagetto) [10:38:29] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830172 (10ema) @doctaxon: thanks. Please also include request and response headers. I haven't managed to reproduce the issue yet. [10:38:56] (03Draft2) 10MarcoAurelio: Add 'global-renamer' to the list of privileged wiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324096 (https://phabricator.wikimedia.org/T150951) [10:39:16] (03CR) 10Jcrespo: [C: 032] Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [10:39:21] (03PS16) 10Jcrespo: Create script to check that sanitarium filtering is working [puppet] - 10https://gerrit.wikimedia.org/r/323525 (https://phabricator.wikimedia.org/T150802) [10:40:00] 06Operations, 06Performance-Team, 10Thumbor: Improve Content-Disposition support in Thumbor - https://phabricator.wikimedia.org/T151072#2830193 (10Gilles) [10:40:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=scb,service=pdfrender [10:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:37] conftool-sync failed on deploy [10:42:28] 2 
warnings and an error on eventstreams [10:43:31] (03PS1) 10Ema: varnish: double workspace_backend [puppet] - 10https://gerrit.wikimedia.org/r/324103 (https://phabricator.wikimedia.org/T151563) [10:43:38] <_joe_> jynus: again? I should check what the hell is inside etcd [10:43:53] (03PS4) 10ArielGlenn: handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) [10:43:55] just a heads up, I assume it is not critical [10:44:12] when I deploy and I get an error I just tell it here [10:45:17] <_joe_> jynus: yeah my bad [10:45:37] <_joe_> please don't merge puppet changes until I'm done [10:46:09] ok [10:47:11] (03CR) 10ArielGlenn: [C: 032] handle adds/changes-specifig args in incr_dumps module [dumps] - 10https://gerrit.wikimedia.org/r/324013 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [10:47:15] (03PS1) 10Giuseppe Lavagetto: conftool: fix scb2003 [puppet] - 10https://gerrit.wikimedia.org/r/324109 [10:47:15] <_joe_> it's one stupid typo [10:47:16] <_joe_> :P [10:47:30] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: fix scb2003 [puppet] - 10https://gerrit.wikimedia.org/r/324109 (owner: 10Giuseppe Lavagetto) [10:47:46] (03PS2) 10ArielGlenn: remove cutoff option from command line [dumps] - 10https://gerrit.wikimedia.org/r/324015 [10:49:00] <_joe_> jynus: fixed [10:50:09] <_joe_> !log restarting pybal on low-traffic eqiad [10:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:05] (03CR) 10ArielGlenn: [C: 032] remove cutoff option from command line [dumps] - 10https://gerrit.wikimedia.org/r/324015 (owner: 10ArielGlenn) [10:51:22] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2830212 (10Volans) @Dzahn regarding the length discussion you raised in https://gerrit.wikimedia.org/r/#/c/322907/ I think is easier to 
talk here tha... [10:51:35] (03PS3) 10ArielGlenn: add MiscDumpFactory to manage different dump types [dumps] - 10https://gerrit.wikimedia.org/r/324016 [10:51:53] (03CR) 10Volans: "I think is easier to discuss in the task ( T93645 ). I've replied there." [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [10:52:38] 06Operations, 06Performance-Team, 10Thumbor: Improve Content-Disposition support in Thumbor - https://phabricator.wikimedia.org/T151072#2830216 (10Gilles) [10:53:18] <_joe_> !log restarting pybal on low-traffic codfw [10:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:37] (03CR) 10ArielGlenn: [C: 032] add MiscDumpFactory to manage different dump types [dumps] - 10https://gerrit.wikimedia.org/r/324016 (owner: 10ArielGlenn) [10:54:42] (03Abandoned) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321375 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [10:55:30] (03PS2) 10ArielGlenn: move last references to incr/Incr out of generateincrementals module [dumps] - 10https://gerrit.wikimedia.org/r/324018 (https://phabricator.wikimedia.org/T133547) [10:56:05] (03PS5) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) [10:56:20] 06Operations, 10Electron-PDFs, 13Patch-For-Review, 07Service-deployment-requests, and 3 others: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2830235 (10Joe) 05stalled>03Resolved [10:56:23] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2830236 (10Joe) [10:58:22] (03CR) 10ArielGlenn: [C: 032] move last references to incr/Incr out of generateincrementals module [dumps] - 
10https://gerrit.wikimedia.org/r/324018 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [10:58:51] (03PS2) 10ArielGlenn: generateincrementals.py becomes generatemiscdumps.py at last [dumps] - 10https://gerrit.wikimedia.org/r/324019 (https://phabricator.wikimedia.org/T133547) [11:04:43] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:10] (03PS4) 10Giuseppe Lavagetto: RESTBase: Add the PDF Render service config [puppet] - 10https://gerrit.wikimedia.org/r/323548 (owner: 10Mobrovac) [11:08:01] (03PS3) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:10:08] (03PS4) 10Elukey: Remove HHVM X-Powered-By header from static Apache responses [puppet] - 10https://gerrit.wikimedia.org/r/314519 [11:10:39] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830285 (10Joe) 05Open>03stalled [11:11:04] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2776664 (10Joe) Changed the status to "stalled" until this is fixed and mediawiki calls the service directly. 
[11:12:43] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [11:12:47] (03CR) 10Giuseppe Lavagetto: [C: 032] RESTBase: Add the PDF Render service config [puppet] - 10https://gerrit.wikimedia.org/r/323548 (owner: 10Mobrovac) [11:12:57] <_joe_> mobrovac: ^^ [11:13:07] \o/ [11:13:53] <_joe_> mobrovac: this will cache the pdfs in varnish, though [11:14:01] yes [11:14:03] <_joe_> which is ok, I guess [11:14:07] <_joe_> since the TTL is short [11:14:13] yup [11:14:18] 10 mins is more than ok [11:14:23] could be longer too [11:14:24] <_joe_> but it's the text cluster, we should check with traffic once the service wants to go live [11:14:33] kk [11:14:41] <_joe_> as in not just "configured and able to serve requests" [11:15:09] (03PS1) 10Volans: Release 0.0.3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/324139 [11:15:54] <_joe_> moritzm: running puppet across the rb cluster [11:15:59] <_joe_> err, mobrovac [11:16:09] kk [11:17:24] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#2830294 (10Joe) 05Open>03stalled [11:17:28] (03PS4) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:18:19] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2802107 (10Joe) 05Open>03stalled [11:18:42] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2802094 (10Joe) 05Open>03stalled [11:19:03] 
06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2830309 (10Addshore) [11:19:04] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830313 (10Joe) [11:19:07] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2802081 (10Joe) 05Open>03stalled [11:19:10] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2830314 (10Addshore) [11:19:21] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2830316 (10Addshore) [11:19:31] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830320 (10Addshore) [11:21:33] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [11:21:47] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830324 (10Addshore) >>! In T150185#2830274, @Joe wrote: > I have just noticed that this extension goes through restbase to request t... 
[11:22:46] (03PS5) 10Elukey: Remove HHVM X-Powered-By header from static Apache responses [puppet] - 10https://gerrit.wikimedia.org/r/314519 [11:23:14] _joe_: ^^ we only switched to using restbase as of a few days ago, we thought that was where all of the discussion had led us. [11:23:24] !log disabled puppet on mw1* hosts as pre-step for https://gerrit.wikimedia.org/r/#/c/314519 [11:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:38] (03PS3) 10ArielGlenn: generateincrementals.py becomes generatemiscdumps.py at last [dumps] - 10https://gerrit.wikimedia.org/r/324019 (https://phabricator.wikimedia.org/T133547) [11:25:23] (03CR) 10Elukey: [C: 032] Remove HHVM X-Powered-By header from static Apache responses [puppet] - 10https://gerrit.wikimedia.org/r/314519 (owner: 10Elukey) [11:25:47] merging --^, will run it first on codfw [11:26:44] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2527912 (10Joe) So, as far as I understand, the MediaWiki extension is supposed to call restbase t...
[11:27:03] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), and 2 others: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2830330 (10Joe) [11:27:06] (03PS5) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:27:41] (03CR) 10ArielGlenn: [C: 032] generateincrementals.py becomes generatemiscdumps.py at last [dumps] - 10https://gerrit.wikimedia.org/r/324019 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [11:28:00] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [11:28:04] <_joe_> addshore: which discussion? [11:28:16] <_joe_> addshore: I'm not finding it anywhere [11:28:32] <_joe_> but I'm using phab's search, so please forgive me :P [11:29:02] _joe_: It will probably take a while for me to find it too, there are so many long running tickets regarding the service, extension & restbase [11:29:05] (03PS1) 10ArielGlenn: move md5-related methods to miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324141 [11:29:13] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:29:13] <_joe_> addshore: heh, exactly [11:29:35] <_joe_> anyways, I stand by my point: if there is no very very good reason to go through restbase, we shouldn't [11:29:43] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/conf-available/49-mark-engine.conf] [11:29:53] <_joe_> elukey: ^^ [11:30:08] <_joe_> I think it's a race condition, but you'd better check [11:30:09] (03CR) 10ArielGlenn: [C: 032] move md5-related methods to miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324141 (owner: 10ArielGlenn) [11:30:37] (03PS1) 10ArielGlenn: make index html output nicer and include md5 sums of files [dumps] - 10https://gerrit.wikimedia.org/r/324143 [11:30:45] well _joe_ https://phabricator.wikimedia.org/T143132 definitely gives the impression the extension should be using restbase [11:31:24] <_joe_> addshore: heh no #operations tag means the chances for ops to say what they think were minimal, sorry to come this late to the party [11:32:43] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [11:33:09] (03PS6) 10Jcrespo: [WIP] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:34:15] (03CR) 10ArielGlenn: [C: 032] make index html output nicer and include md5 sums of files [dumps] - 10https://gerrit.wikimedia.org/r/324143 (owner: 10ArielGlenn) [11:34:16] thanks _joe_, I am running puppet atm in codfw [11:35:12] Git::Clone and not Git::clone, arg [11:35:43] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [11:35:46] _joe_ yes it is indeed a race, solved with the second puppet run [11:36:25] <_joe_> elukey: whenever you remove a puppet file, that happens [11:36:35] <_joe_> anyways, I need a break, ttyl [11:39:59] (03PS7) 10Jcrespo: labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:43:48] codfw almost done, apache-fast-test results look fine.
Proceeding with mw1017/mw1099 [11:44:41] ah snap mwdebug100[12] [11:44:44] forgot :P [11:46:59] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830388 (10Joe) >>! In T150185#2830324, @Addshore wrote: > > Ahh, after all of the discussions we thought this was the direction the... [11:47:59] all right header is gone, mwdebug are ok for apache-test, proceeding with eqiad (10% batches) [11:48:28] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830390 (10mark) [11:48:56] !log re-enable puppet on mw1* hosts and apply Apache config change (https://gerrit.wikimedia.org/r/#/c/314519) [11:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:13] PROBLEM - puppet last run on mw1302 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:51:05] mmm I haven't started [11:51:09] checking 1302 [11:51:24] <_joe_> elukey: mw1302 is a jobrunner IIRC [11:52:03] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:52:52] _joe_ do we still need to use the --tag mw-apache-config ? 
[11:54:18] <_joe_> elukey: not strictly needed [11:54:23] okok [11:54:25] thanks [11:54:27] proceeding [11:56:41] (03PS8) 10Jcrespo: labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [11:58:13] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:00:31] (03PS1) 10ArielGlenn: add docstrings to generatemiscdumps and miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324147 [12:00:51] (03CR) 10jenkins-bot: [V: 04-1] add docstrings to generatemiscdumps and miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324147 (owner: 10ArielGlenn) [12:03:46] (03PS2) 10ArielGlenn: add docstrings to generatemiscdumps and miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324147 [12:05:10] (03CR) 10ArielGlenn: [C: 032] add docstrings to generatemiscdumps and miscdumplib [dumps] - 10https://gerrit.wikimedia.org/r/324147 (owner: 10ArielGlenn) [12:05:23] (03PS1) 10Jcrespo: Add fake maintainviews passwords for puppet compilation [labs/private] - 10https://gerrit.wikimedia.org/r/324148 [12:05:40] and done [12:05:55] !log complete rolling restart of apache in eqiad [12:06:00] (03CR) 10Jcrespo: [C: 032 V: 032] Add fake maintainviews passwords for puppet compilation [labs/private] - 10https://gerrit.wikimedia.org/r/324148 (owner: 10Jcrespo) [12:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:17] the X-Powered-By: HHVM/3.3.0-static header should go away as cache entries expire [12:10:13] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830410 (10Joe) Looking at the code, there has been a misunderstanding: MediaWiki is just issuing a redirect to the rest api, not pr... 
[12:11:58] <_joe_> addshore: heh the way the extension works is different from what I gathered here on irc [12:12:11] * _joe_ should always look at the code first [12:12:46] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/4702/" [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [12:12:54] <_joe_> it is redirecting to the restbase url, correct? [12:13:05] yes! [12:13:25] <_joe_> why not link do the rb url directly then? [12:13:32] <_joe_> *to [12:13:59] (03PS1) 10ArielGlenn: move check for wikis we skip into a separate method [dumps] - 10https://gerrit.wikimedia.org/r/324149 [12:14:57] *looks* [12:15:08] <_joe_> what I get is that we add a mw-served link to the menu, and that url will issue a redirect to restbase. Wouldn't it be better if we linked to rb directly? [12:15:28] (03CR) 10ArielGlenn: [C: 032] move check for wikis we skip into a separate method [dumps] - 10https://gerrit.wikimedia.org/r/324149 (owner: 10ArielGlenn) [12:15:49] well, there is also sometimes a page inbetween, see https://upload.wikimedia.org/wikipedia/mediawiki/9/9e/ElectronPdfService-mockup.png [12:16:10] <_joe_> addshore: oh, ok [12:16:15] but from that page yes, we can go directly to electron / redirect to collection [12:17:07] <_joe_> if we do that, I see no reason to object [12:18:33] <_joe_> there is just the fact we're adding restbase to the chain of dependencies of the extension, but I don't see that as a big issue tbh [12:21:44] _joe_: great! I'll look into that right now! [12:22:39] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830419 (10Joe) So, as discussed with @addshore on irc: # we could get rid of the unnecessary redirect by just linking to the restba... 
[12:24:46] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830424 (10Joe) >>! In T143132#2830385, @Joe wrote: > Please note that exposing the service via restbase doesn't mean it's a good idea to call it via... [12:37:35] (03PS1) 10Jcrespo: [WIP] Remove private_databases from realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/324153 [12:38:24] (03PS1) 10ArielGlenn: miscdumps: use logging module instead of passing verbose var around everywhere [dumps] - 10https://gerrit.wikimedia.org/r/324154 [12:40:54] my ruby is rusty, I am not sure this works: https://gerrit.wikimedia.org/r/#/c/324153/1/templates/mariadb/sanitarium.my.cnf.erb [12:43:01] (03CR) 10ArielGlenn: [C: 032] miscdumps: use logging module instead of passing verbose var around everywhere [dumps] - 10https://gerrit.wikimedia.org/r/324154 (owner: 10ArielGlenn) [12:48:17] (03PS1) 10ArielGlenn: miscdumpslib Config becomes MiscDumpConfig [dumps] - 10https://gerrit.wikimedia.org/r/324158 [12:49:38] (03CR) 10ArielGlenn: [C: 032] miscdumpslib Config becomes MiscDumpConfig [dumps] - 10https://gerrit.wikimedia.org/r/324158 (owner: 10ArielGlenn) [12:50:13] (03CR) 10Faidon Liambotis: [C: 031] RAID: get RAID status improvement for MegaCLI [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) (owner: 10Volans) [13:09:13] (03PS1) 10ArielGlenn: clean up lockfile and locking class names, toss unused classes [dumps] - 10https://gerrit.wikimedia.org/r/324168 [13:10:28] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830527 (10Gilles) On Vagrant, before thumbnailing that djvu: ``` 92.945312 Mb /srv/thumbor/bin/python /srv/thumbor/bin/thumbor -p 8888 -c /etc/thumbor.d ``` After: ``` 561.847656 Mb /srv/thumbor... 
[13:11:31] (03CR) 10ArielGlenn: [C: 032] clean up lockfile and locking class names, toss unused classes [dumps] - 10https://gerrit.wikimedia.org/r/324168 (owner: 10ArielGlenn) [13:14:48] !log restarting db1095's mysql for T151752 [13:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] T151752: Prepare and check storage layer for the future private wiki arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T151752 [13:18:30] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830545 (10Gilles) ``` types | # objects | total size ===================================================================== | =====... [13:22:59] (03PS1) 10ArielGlenn: decouple MiscDumpConfig from the wikidump config class [dumps] - 10https://gerrit.wikimedia.org/r/324174 [13:24:47] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2830569 (10Reedy) This is why people shouldn't pile on with "me too" requests on the same bug for the same thing when it directly affects their account [13:26:06] (03CR) 10Alex Monk: [DNM] Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [13:27:24] !log deleted oathauth row on wikitech for user Shizhao per T144805 [13:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:34] T144805: Can't login wikitech - https://phabricator.wikimedia.org/T144805 [13:28:06] 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2830577 (10Reedy) 05Open>03Resolved [13:29:31] 06Operations, 10Traffic, 13Patch-For-Review: Varnishkafka seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2830578 (10ema) This is the list of hosts affected by the issue in the last 3 days sorted by number 
of crashes. 28 cp1065.eqiad.wmnet 21 cp1068.eqiad.wmnet 18 cp3032... [13:29:39] (03CR) 10Reedy: [C: 031] Add 'global-renamer' to the list of privileged wiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324096 (https://phabricator.wikimedia.org/T150951) (owner: 10MarcoAurelio) [13:30:27] (03CR) 10MarcoAurelio: [C: 04-1] "Some amendments requested on Phabricator." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [13:30:47] (03CR) 10ArielGlenn: [C: 032] decouple MiscDumpConfig from the wikidump config class [dumps] - 10https://gerrit.wikimedia.org/r/324174 (owner: 10ArielGlenn) [13:32:03] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:23] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:33] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:53] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:53] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:53] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.873 second response time [13:34:03] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:13] PROBLEM - HHVM rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:25] _joe_, elukey: ^ [13:34:38] checking [13:34:53] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:34:53] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:34:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:35:03] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:03] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:13] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:13] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:14] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:14] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:53] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.519 second response time [13:36:33] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:43] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:53] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:36:53] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.738 second response time [13:37:00] what is this? [13:37:13] elukey: there are quite a few mw hosts affected [13:37:13] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 70102 bytes in 7.033 second response time [13:37:13] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:32] ema: yeah, hhvm is doing something weird [13:37:33] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 5.510 second response time [13:37:35] are they API? 
[13:37:43] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] jynus: not sure, I've logged into mw1287 and it seemed ok to me, then after a little it recovered [13:37:45] I was thinking the same, it seems the same problem [13:37:48] recurring [13:37:53] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.378 second response time [13:38:00] (03PS1) 10ArielGlenn: misc pylint and small cleanup [dumps] - 10https://gerrit.wikimedia.org/r/324176 [13:38:03] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:10] could it be the restart process? [13:38:13] PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:13] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.909 second response time [13:38:13] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 70102 bytes in 7.938 second response time [13:38:34] (03PS2) 10Jcrespo: Remove private_wikis from realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/324153 [13:38:42] jynus: I checked and I didn't find it in journalctl [13:38:48] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830584 (10Joe) For comparison, I just confirmed that when using OCG, MediaWiki issues `Cache-control: no-cache`; that's because OCG is caching conten... [13:38:53] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:38:53] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.543 second response time [13:39:13] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:13] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:13] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:13] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:13] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:20] let me check the aggregated http requests [13:39:23] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:23] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.040 second response time [13:39:33] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:33] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:43] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:53] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:53] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:39:57] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2830586 (10Joe) 05stalled>03Open [13:39:59] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830588 (10Gilles) This seems to be coming from a single huge object: ``` >>> getsizeof(root) 271117801 >>> cb.print_tree() -+--- +---... [13:40:03] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 70099 bytes in 0.094 second response time [13:40:03] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures [13:40:03] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.923 second response time [13:40:03] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.104 second response time [13:40:05] they are not high [13:40:06] <_joe_> shit I was just back from lunch [13:40:13] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:13] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.151 second response time [13:40:13] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 7.212 second response time [13:40:23] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.199 second response time [13:40:23] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:29] _joe_, it just started [13:40:33] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.984 second response time [13:40:33] yep [13:40:33] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 70100 bytes in 0.119 second response time [13:40:33] RECOVERY - 
Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.043 second response time [13:40:33] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:43] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:43] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:43] and it seems all apis [13:40:53] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:53] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:53] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:03] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:13] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:17] * volans available if you need any help [13:41:21] (03CR) 10Hashar: [C: 032] "I merely mentioned the issue for the sake of it :] Looks all fine to me." 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [13:41:23] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.797 second response time [13:41:23] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 9.367 second response time [13:41:33] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 70102 bytes in 2.697 second response time [13:41:33] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 2.962 second response time [13:41:33] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:33] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.474 second response time [13:41:47] _joe_ I am seeing a lot of http://eu.wikipedia.org/w/api.php with Parsoid as UA [13:41:52] (503s) [13:41:53] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [13:41:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [13:41:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:41:53] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.625 second response time [13:42:03] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:03] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 70102 bytes in 1.360 second response time [13:42:03] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 3.280 second response time [13:42:05] the rate of text 503s is not particularly high, though it went up [13:42:13] it is clearly the same issue [13:42:49] <_joe_> elukey: yeah [13:42:51] <_joe_> same issue [13:42:53] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:53] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.728 second response time [13:42:56] (03CR) 10ArielGlenn: [C: 032] misc pylint and small cleanup [dumps] - 10https://gerrit.wikimedia.org/r/324176 (owner: 10ArielGlenn) [13:43:03] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 0.178 second response time [13:43:18] !log updating jouncebot so it properly reclaim its nick ( T150916 https://gerrit.wikimedia.org/r/#/c/324025/ ) [13:43:23] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:23] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.083 second response time [13:43:23] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:29] T150916: Jouncebot: Add functionality to change Nick from Jouncebot_ to Jouncebot automatically - https://phabricator.wikimedia.org/T150916 [13:43:33] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:53] PROBLEM - restbase 
endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:03] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:13] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 8.444 second response time [13:44:13] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.070 second response time [13:44:23] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:23] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 8.869 second response time [13:44:23] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.383 second response time [13:44:23] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:33] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:33] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:34] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 0.130 second response time [13:44:44] !log reedy@tin Synchronized php-1.29.0-wmf.3/api.php: Redeploy ori bandaid for T151702 (duration: 00m 44s) [13:44:47] jynus: I don't think we restart hhvm automatically on api hosts [13:44:53] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:53] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:53] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:44:53] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:53] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:44:53] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:54] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.188 second response time [13:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:03] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:07] <_joe_> ok now I'll do a rolling restart of HHVM there [13:45:13] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:45:23] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.379 second response time [13:45:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:45:26] ema, I remember at some point someone introducing something like that [13:45:33] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:33] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.369 second response time [13:45:38] jynus: only for role::mediawiki::webserver AFAICT [13:45:42] ok [13:45:43] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.404 second response time [13:45:43] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:45:53] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 1.573 second response time [13:45:53] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [13:45:53] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [13:45:53] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:45:53] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [13:45:53] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:53] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.597 second response time [13:46:03] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 70102 bytes in 2.378 second response time [13:46:13] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 70101 bytes in 0.603 second response time [13:46:17] ema: it was done for apis originally, it should be for all IIRC [13:46:19] <_joe_> !log rolling restart of HHVM in the api cluster [13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:33] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [13:46:43] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [13:46:48] <_joe_> sorry, I'm not sure what are you talking about with "restart hhvm automatically" [13:47:03] _joe_: /usr/local/bin/restart-hhvm [13:47:16] <_joe_> ema: it should be there for API [13:47:23] <_joe_> but we don't have time for it in this case [13:47:43] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [13:47:43] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:47:44] it doesn't matter, I only asked about it [13:47:53] RECOVERY - mobileapps endpoints health 
on scb1001 is OK: All endpoints are healthy [13:47:53] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [13:47:53] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:47:58] as a potential cause at the time, which wasn't [13:48:12] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830637 (10Gilles) By looking at the contents, it seems clear that it's the ppm variable created in the djvu engine. Aside from the fact that PPMs seem to be huge and might not be the best choice he... [13:48:14] the case [13:48:35] <_joe_> outage over [13:48:43] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:48:53] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:48:53] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:48:53] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [13:48:53] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:49:19] <_joe_> so, started at :34 or so, solved at ~ :46 [13:49:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [13:49:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:50:14] (03CR) 10Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/4704/" [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [13:50:16] _joe_: looks like mem usage started going up ~ 13:20 https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=API+application+servers+eqiad&h=mw1287.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [13:50:38] <_joe_> ema: yes, but the real issues started around that time [13:50:43] indeed [13:51:06] (03Merged) 10jenkins-bot: Don't keep adding _ to nick 
indefinitely [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [13:51:11] <_joe_> the rolling restart is going a bit slow [13:51:21] <_joe_> but that's ok [13:53:17] (03PS6) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) [13:53:40] jouncebot: refresh [13:53:43] I refreshed my knowledge about deployments. [13:53:44] jouncebot: next [13:53:44] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1400) [13:53:55] (03PS3) 10MarcoAurelio: [DNM] Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) [13:54:02] !log deploying kartotherian config with scap3 - T150021 [13:54:08] (03PS1) 10ArielGlenn: misdcumps: clean up arg handling [dumps] - 10https://gerrit.wikimedia.org/r/324183 [13:54:09] Dutch Wikipedia has a topic about the ip 91.198.174.192 (Wikimedia EU) being blocked because of Malwarebytes Anti-Malware finding a Locky C2. ( https://nl.wikipedia.org/wiki/Wikipedia:De_kroeg#Malwarebytes ) [13:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:15] T150021: Deploy kartotherian / tilerator / tileratorui configuration via scap3 - https://phabricator.wikimedia.org/T150021 [13:54:19] Does that ring a bell? 
Couldn't find a ticket in phabricator [13:54:48] The linked topic (in English) is https://forums.malwarebytes.org/topic/191202-wikipediaorg-being-blocked-by-malwarebytes/ [13:55:16] (03CR) 10Hashar: "I have updated the bot on tools labs based on http://wikitech.wikimedia.org/wiki/Jouncebot" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/324025 (https://phabricator.wikimedia.org/T150916) (owner: 10BryanDavis) [13:55:53] (03CR) 10Gehel: [C: 032] Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [13:57:30] jouncebot: next [13:57:30] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1400) [13:57:48] looks like there are not patches for EU SWAT today [13:59:06] (03PS4) 10MarcoAurelio: [DNM] Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) [13:59:08] !log gehel@tin Starting deploy [kartotherian/deploy@f3805c4]: (no message) [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1400). Please do the needful. 
[14:03:06] nothing to deploy [14:03:34] I could move my patches from below but I've not much time left [14:03:36] (03PS1) 10Hashar: contint: install php5-xsl package [puppet] - 10https://gerrit.wikimedia.org/r/324186 [14:10:52] (03PS1) 10Gehel: Kartotherian - deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324187 (https://phabricator.wikimedia.org/T150021) [14:12:42] (03PS9) 10Rush: labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:12:55] (03CR) 10Gehel: [C: 032] Kartotherian - deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324187 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [14:14:02] (03CR) 10Rush: [C: 031] "seems reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:14:09] (03PS2) 10Volans: Puppet merge: molly-guard multiple commits [puppet] - 10https://gerrit.wikimedia.org/r/322362 [14:16:42] (03PS2) 10Hashar: contint: install php5-xsl package [puppet] - 10https://gerrit.wikimedia.org/r/324186 (https://phabricator.wikimedia.org/T151879) [14:18:57] 06Operations, 06Performance-Team, 10Thumbor: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830780 (10Gilles) Leak found, it was very stupid. 
[14:18:58] (03PS3) 10Giuseppe Lavagetto: contint: install php5-xsl package [puppet] - 10https://gerrit.wikimedia.org/r/324186 (https://phabricator.wikimedia.org/T151879) (owner: 10Hashar) [14:19:58] !log gehel@tin Finished deploy [kartotherian/deploy@f3805c4]: (no message) (duration: 20m 49s) [14:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:26] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: install php5-xsl package [puppet] - 10https://gerrit.wikimedia.org/r/324186 (https://phabricator.wikimedia.org/T151879) (owner: 10Hashar) [14:21:53] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:22:03] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [14:26:04] hm [14:26:31] I have not deployed anything yet [14:27:03] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - create-dbusers is active [14:27:35] (03CR) 10Urbanecm: "Is this still DNM patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [14:27:53] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [14:28:01] (03PS5) 10Volans: RAID: get RAID status improvement for MegaCLI [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) [14:29:08] (03CR) 10Volans: [C: 032] RAID: get RAID status improvement for MegaCLI [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) (owner: 10Volans) [14:32:42] jynus: that's not you, it's from ldap connection issues [14:33:39] (03CR) 10Marostegui: [C: 031] "Very nice work, this will help A LOT for the future of labs and future deployments and changes. Very impressive!" 
[puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:34:54] (03PS5) 10MarcoAurelio: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) [14:35:13] !log reedy@tin Synchronized php-1.29.0-wmf.3/api.php: Resync after making into gerrit commit (duration: 00m 45s) [14:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:20] (03PS1) 10Urbanecm: [logo] Add logo for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) [14:37:30] (03PS2) 10Ottomata: Scap: clone directory group writable [puppet] - 10https://gerrit.wikimedia.org/r/324006 (https://phabricator.wikimedia.org/T151231) (owner: 10Thcipriani) [14:37:59] (03CR) 10Marostegui: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/324153 (owner: 10Jcrespo) [14:38:15] (03CR) 10Urbanecm: [C: 04-1] "From my POV all wikis should serve HD logos. 324188 contains it so please add support for HD."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [14:39:09] (03PS10) 10Jcrespo: labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [14:39:21] (03CR) 10Ottomata: [C: 032] Scap: clone directory group writable [puppet] - 10https://gerrit.wikimedia.org/r/324006 (https://phabricator.wikimedia.org/T151231) (owner: 10Thcipriani) [14:40:17] (03PS11) 10Jcrespo: labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) [14:41:30] (03CR) 10Jcrespo: [C: 032] labsdb: Move repo from views to common; add private data check [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:41:53] 06Operations, 06Analytics-Kanban: setup/install thorium/wmf4726 as stat1001 replacement - https://phabricator.wikimedia.org/T151816#2830877 (10Ottomata) 05Open>03Resolved I will track this in the parent T149438 [14:42:00] (03PS6) 10Urbanecm: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [14:42:18] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor leaks memory - https://phabricator.wikimedia.org/T150757#2830892 (10Gilles) [14:42:29] (03PS1) 10Gehel: node service - allow empty entry point [puppet] - 10https://gerrit.wikimedia.org/r/324190 (https://phabricator.wikimedia.org/T150021) [14:43:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Fix memory leaks in Thumbor plugins - https://phabricator.wikimedia.org/T150757#2830900 (10Gilles) [14:43:52] (03CR) 10MarcoAurelio: "> From my POV all wikis should serve HD logos. 
324188 contains it so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [14:47:03] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:47:12] (03PS1) 10Gehel: discovery-stats: create mysql credentials file specific to discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/324191 (https://phabricator.wikimedia.org/T151063) [14:47:26] !log Deploye alter table dbstore1001 - dewiki.revision - T148967 [14:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:37] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [14:47:58] (03PS7) 10MarcoAurelio: Initial configuration for arbcom_cs.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) [14:47:58] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2830907 (10GWicke) @joe: The traffic we are talking about here is very low. OCG currently sees about 2 req/s. [14:49:01] (03PS1) 10Jcrespo: mariadb: Rename filtered_columns.txt to filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324192 [14:50:03] (03CR) 10Urbanecm: [C: 031] "Now good for me. Thanks for your work, @MarcoAurelio." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323843 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [14:51:23] (03CR) 10Andrew Bogott: [C: 031] "Quotas adjusted as per the commit comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [14:51:28] (03PS2) 10Andrew Bogott: nodepool: bump max server from 12 to 20 [puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [14:51:30] (03CR) 10Urbanecm: "Yes, I know about it. I'm waiting till January 2017. Will this patch be de-CR-2ed after this date?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [14:52:31] (03CR) 10ArielGlenn: [C: 032] misdcumps: clean up arg handling [dumps] - 10https://gerrit.wikimedia.org/r/324183 (owner: 10ArielGlenn) [14:52:59] (03CR) 10Tim Landscheidt: "The file modules/role/files/mariadb/filtered_columns.txt has CRLF line endings. Is this intended?" [puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:53:36] (03PS1) 10ArielGlenn: pylint of modules for misc dumps: add/clean up doc strings [dumps] - 10https://gerrit.wikimedia.org/r/324193 [14:53:44] (03PS5) 10Urbanecm: Add option for disabling CompactLink and disable it on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298187 (https://phabricator.wikimedia.org/T139903) [14:53:46] (03CR) 10Andrew Bogott: [C: 032] nodepool: bump max server from 12 to 20 [puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [14:54:18] (03Abandoned) 10Urbanecm: Add option for disabling CompactLink and disable it on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298187 (https://phabricator.wikimedia.org/T139903) (owner: 10Urbanecm) [14:54:54] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830931 (10ema) [14:55:20] (03CR) 10Jcrespo: "Tim, no, it is not, but this is an import from another repo- we can fix it on another patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/324040 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [14:55:31] (03CR) 10ArielGlenn: [C: 032] pylint of modules for misc dumps: add/clean up doc strings [dumps] - 10https://gerrit.wikimedia.org/r/324193 (owner: 10ArielGlenn) [14:56:03] (03PS1) 10ArielGlenn: disable two pylint warnings for misc dump and incr dump modules [dumps] - 10https://gerrit.wikimedia.org/r/324195 [14:56:43] (03CR) 10ArielGlenn: [C: 032] disable two pylint warnings for misc dump and incr dump modules [dumps] - 10https://gerrit.wikimedia.org/r/324195 (owner: 10ArielGlenn) [14:57:40] (03CR) 10Jcrespo: [C: 032] mariadb: Rename filtered_columns.txt to filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324192 (owner: 10Jcrespo) [14:57:45] (03PS2) 10Jcrespo: mariadb: Rename filtered_columns.txt to filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324192 [14:58:13] (03PS1) 10ArielGlenn: properly allow for multistep dumps,have a base class for dump types [dumps] - 10https://gerrit.wikimedia.org/r/324198 [14:58:55] (03CR) 10jenkins-bot: [V: 04-1] properly allow for multistep dumps,have a base class for dump types [dumps] - 10https://gerrit.wikimedia.org/r/324198 (owner: 10ArielGlenn) [14:59:28] (03CR) 10Jcrespo: [C: 032] Remove private_wikis from realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/324153 (owner: 10Jcrespo) [14:59:35] (03PS3) 10Jcrespo: Remove private_wikis from realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/324153 [15:05:50] (03PS1) 10Jcrespo: mariadb: Rename filtered_columns.txt -> filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324199 [15:05:57] (03PS2) 10ArielGlenn: properly allow for multistep dumps,have a base class for dump types [dumps] - 10https://gerrit.wikimedia.org/r/324198 [15:06:03] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:06:14] (03PS2) 10Jcrespo: mariadb: Rename filtered_columns.txt -> filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324199 [15:07:33] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:08] (03CR) 10Jcrespo: [C: 032] mariadb: Rename filtered_columns.txt -> filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/324199 (owner: 10Jcrespo) [15:10:15] 06Operations, 10Traffic: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2830968 (10ema) 05duplicate>03Open [15:10:33] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:55] (03PS1) 10Ottomata: Set min.insync.replicas to 2 for main kafka cluster in production [puppet] - 10https://gerrit.wikimedia.org/r/324200 (https://phabricator.wikimedia.org/T144637) [15:13:01] (03CR) 10jenkins-bot: [V: 04-1] Set min.insync.replicas to 2 for main kafka cluster in production [puppet] - 10https://gerrit.wikimedia.org/r/324200 (https://phabricator.wikimedia.org/T144637) (owner: 10Ottomata) [15:13:33] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:13:53] (03PS2) 10Ottomata: Set min.insync.replicas to 2 for main kafka cluster in production [puppet] - 10https://gerrit.wikimedia.org/r/324200 (https://phabricator.wikimedia.org/T144637) [15:14:03] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:14:03] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:17:51] (03PS3) 10Ottomata: Set min.insync.replicas to 2 for main kafka cluster in production [puppet] - 10https://gerrit.wikimedia.org/r/324200 (https://phabricator.wikimedia.org/T144637) [15:18:01] 
(03CR) 10Ottomata: [C: 032 V: 032] "Lookin fine in https://puppet-compiler.wmflabs.org/4705/" [puppet] - 10https://gerrit.wikimedia.org/r/324200 (https://phabricator.wikimedia.org/T144637) (owner: 10Ottomata) [15:18:26] (03PS1) 10Jcrespo: Revert "Remove private_wikis from realm.pp" [puppet] - 10https://gerrit.wikimedia.org/r/324201 [15:18:57] (03CR) 10MarcoAurelio: "Please update [[wikitech:Add a wiki]] acordingly explaining the current procedure when creating new private wikis. Adding private wikis to" [puppet] - 10https://gerrit.wikimedia.org/r/324153 (owner: 10Jcrespo) [15:19:05] (03PS2) 10Ottomata: discovery-stats: create mysql credentials file specific to discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/324191 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [15:19:08] (03CR) 10Jcrespo: "This doesn't work because templates are run on the master, not the agent." [puppet] - 10https://gerrit.wikimedia.org/r/324201 (owner: 10Jcrespo) [15:19:14] (03PS2) 10Jcrespo: Revert "Remove private_wikis from realm.pp" [puppet] - 10https://gerrit.wikimedia.org/r/324201 [15:19:33] what a coincidence... 
[15:20:13] well, I had to test it before documenting it [15:20:41] (03CR) 10Jcrespo: [C: 032] Revert "Remove private_wikis from realm.pp" [puppet] - 10https://gerrit.wikimedia.org/r/324201 (owner: 10Jcrespo) [15:22:10] (03CR) 10Ottomata: [C: 032] discovery-stats: create mysql credentials file specific to discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/324191 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [15:22:15] (03PS3) 10Ottomata: discovery-stats: create mysql credentials file specific to discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/324191 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [15:22:17] (03CR) 10Ottomata: [V: 032] discovery-stats: create mysql credentials file specific to discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/324191 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [15:22:33] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:23:39] (03Abandoned) 10Jcrespo: redactatron: Integrate centralauth redaction into cols.txt [software/redactatron] - 10https://gerrit.wikimedia.org/r/323809 (https://phabricator.wikimedia.org/T103011) (owner: 10Jcrespo) [15:23:46] ottomata: thanks for the merge! [15:23:52] yup! [15:24:11] File[/etc/mysql/conf.d/discovery-stats-client.cnf]/ensure: created [15:24:23] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:24:44] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#2831034 (10Milimetric) a:05Ottomata>03Milimetric [15:26:09] 06Operations, 06Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2831044 (10chasemp) a:03Andrew [15:26:56] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#2831045 (10Milimetric) [15:30:38] (03PS1) 10Zfilipin: WIP Ensure ChromeDriver is installed for jobs that run Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) [15:30:57] !log Stop mysql and shutdown db2048 and db2034 for maintenance - T149553 [15:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:08] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:31:37] (03CR) 10jenkins-bot: [V: 04-1] WIP Ensure ChromeDriver is installed for jobs that run Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) (owner: 10Zfilipin) [15:32:44] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#2831069 (10Milimetric) email sent. Will delete all files at the end of this week, please look through if you need anything. 
[15:34:08] (03PS2) 10Zfilipin: WIP Ensure ChromeDriver is installed for jobs that run Selenium tests [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) [15:36:24] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2831095 (10Milimetric) a:05Ottomata>03Milimetric [15:38:21] (03CR) 10MarcoAurelio: "LGTM. I assume they're already optimized (optiPNG), right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: 10Urbanecm) [15:38:53] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:01] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2831112 (10Milimetric) @leila, @Multichill, @JeanFred: We're going to remove all these old files at the end of this week. The files continue... 
[15:41:59] (03CR) 10BBlack: [C: 031] varnish: double workspace_backend [puppet] - 10https://gerrit.wikimedia.org/r/324103 (https://phabricator.wikimedia.org/T151563) (owner: 10Ema) [15:43:41] 06Operations, 06Analytics-Kanban, 06Zero, 07Mobile, and 2 others: Purge > 90 days stat1002:/a/squid/archive/mobile - https://phabricator.wikimedia.org/T92341#2831145 (10Milimetric) a:05Ottomata>03Milimetric [15:43:44] (03CR) 10ArielGlenn: [C: 032] properly allow for multistep dumps,have a base class for dump types [dumps] - 10https://gerrit.wikimedia.org/r/324198 (owner: 10ArielGlenn) [15:44:27] (03PS1) 10ArielGlenn: for misc dumps, docstring cleanup and NotImplementedError for base methods [dumps] - 10https://gerrit.wikimedia.org/r/324206 [15:44:41] 06Operations, 06Analytics-Kanban, 06Zero, 07Mobile, and 2 others: Purge > 90 days stat1002:/a/squid/archive/mobile - https://phabricator.wikimedia.org/T92341#1106695 (10Milimetric) We will remove all these files at the end of this week. Please look through if you need them. [15:45:39] (03CR) 10ArielGlenn: [C: 032] for misc dumps, docstring cleanup and NotImplementedError for base methods [dumps] - 10https://gerrit.wikimedia.org/r/324206 (owner: 10ArielGlenn) [15:46:32] (03PS1) 10ArielGlenn: misc dumps: add sample_dumps module for an example [dumps] - 10https://gerrit.wikimedia.org/r/324207 [15:46:35] 06Operations, 06Analytics-Kanban, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#2831170 (10Milimetric) a:05Ottomata>03Milimetric [15:47:45] 06Operations, 06Analytics-Kanban, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#1106706 (10Milimetric) We will delete all these files at the end of this week. Please look through if you need them. 
[15:48:29] 06Operations, 06Analytics-Kanban, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/zero - https://phabricator.wikimedia.org/T92343#2831175 (10Milimetric) a:05Ottomata>03Milimetric [15:49:02] 06Operations, 06Analytics-Kanban, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/zero - https://phabricator.wikimedia.org/T92343#1106713 (10Milimetric) We will delete all of these files at the end of this week. Please look through if you need them. [15:50:51] (03PS1) 10Jcrespo: Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) [15:51:50] (03CR) 10jenkins-bot: [V: 04-1] Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [15:52:22] (03PS2) 10Jcrespo: Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) [15:52:23] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:52:42] (03PS2) 10Ema: varnish: double workspace_backend [puppet] - 10https://gerrit.wikimedia.org/r/324103 (https://phabricator.wikimedia.org/T151563) [15:52:44] (03CR) 10ArielGlenn: [C: 032] misc dumps: add sample_dumps module for an example [dumps] - 10https://gerrit.wikimedia.org/r/324207 (owner: 10ArielGlenn) [15:52:54] (03CR) 10Ema: [C: 032 V: 032] varnish: double workspace_backend [puppet] - 10https://gerrit.wikimedia.org/r/324103 (https://phabricator.wikimedia.org/T151563) (owner: 10Ema) [15:53:06] (03CR) 10jenkins-bot: [V: 04-1] Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [15:53:08] (03PS3) 10Jcrespo: 
Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) [15:53:19] (03PS1) 10ArielGlenn: for miscdumpslib add a method for running simple sql queries and use it [dumps] - 10https://gerrit.wikimedia.org/r/324209 [15:55:39] (03PS1) 10Alexandros Kosiaris: kubelet: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/324210 [15:55:41] (03PS1) 10Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/324211 [15:55:43] (03PS1) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:55:45] (03PS1) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [15:56:56] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: nginx proxy on wdqs servers sometimes try to connect to backend over IPv6 - https://phabricator.wikimedia.org/T151889#2831203 (10Gehel) [15:57:04] (03PS1) 10Anomie: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 [15:57:05] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: nginx proxy on wdqs servers sometimes try to connect to backend over IPv6 - https://phabricator.wikimedia.org/T151889#2831217 (10Gehel) p:05Triage>03High [15:57:34] (03CR) 10Urbanecm: "Yes, I optiPNGed them." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: 10Urbanecm) [15:57:56] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: nginx proxy on wdqs servers sometimes try to connect to backend over IPv6 - https://phabricator.wikimedia.org/T151889#2831203 (10Gehel) Of course, one option could be to make the backend also listen to IPv6. 
[15:59:01] (03CR) 10ArielGlenn: [C: 032] for miscdumpslib add a method for running simple sql queries and use it [dumps] - 10https://gerrit.wikimedia.org/r/324209 (owner: 10ArielGlenn) [15:59:24] !log Stop temporarily stop MySQL db2070 maintenance - T149553 [15:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:59:45] (03PS1) 10ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/324217 [15:59:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This still needs some more work for hiera variables to be properly defined" [puppet] - 10https://gerrit.wikimedia.org/r/324213 (owner: 10Alexandros Kosiaris) [16:01:15] (03PS1) 10Jcrespo: Drop files: Migrated for automatic deployment to the puppet repo [software/redactatron] - 10https://gerrit.wikimedia.org/r/324218 [16:04:32] (03CR) 10Jcrespo: [C: 032] Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [16:04:42] (03PS4) 10Jcrespo: Migrate redact_sanitarium.sh script from software to puppet [puppet] - 10https://gerrit.wikimedia.org/r/324208 (https://phabricator.wikimedia.org/T150802) [16:05:53] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:06:14] (03CR) 10ArielGlenn: [C: 032] add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/324217 (owner: 10ArielGlenn) [16:06:34] (03PS2) 10Jcrespo: Drop files: Migrated for automatic deployment to the puppet repo [software/redactatron] - 10https://gerrit.wikimedia.org/r/324218 [16:08:17] (03PS1) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/324219 [16:09:36] (03PS1) 10Andrew Bogott: 
Remove labtest realm checks from wikitech configs [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) [16:10:13] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:43] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:49] (03PS2) 10Andrew Bogott: Remove labtest realm checks from wikitech configs [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) [16:10:54] is that ^^ api again? [16:11:15] _joe_: mw1276 again apparently [16:11:18] (03CR) 10Jcrespo: [C: 032 V: 032] Drop files: Migrated for automatic deployment to the puppet repo [software/redactatron] - 10https://gerrit.wikimedia.org/r/324218 (owner: 10Jcrespo) [16:11:31] <_joe_> ema it's depooled and under my torture [16:11:35] <_joe_> sorry [16:11:56] <_joe_> paladox: no, that's me crashing it while narrowing down the API problem :P [16:12:02] (03CR) 10jenkins-bot: [V: 04-1] Remove labtest realm checks from wikitech configs [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [16:12:04] Oh ok, thanks :) [16:13:03] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 70219 bytes in 0.099 second response time [16:13:33] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.067 second response time [16:15:01] (03PS3) 10Jcrespo: Drop files: Migrated for automatic deployment to the puppet repo [software/redactatron] - 10https://gerrit.wikimedia.org/r/324218 [16:15:54] (03PS1) 10Gehel: wdqs - ensure that nginx connects to blazegraph over IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/324222 (https://phabricator.wikimedia.org/T151889) [16:18:13] (03PS3) 10Andrew Bogott: Remove labtest realm checks from wikitech configs [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) 
[16:18:50] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/324222 (https://phabricator.wikimedia.org/T151889) (owner: 10Gehel) [16:21:26] (03CR) 10Gehel: [C: 032] wdqs - ensure that nginx connects to blazegraph over IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/324222 (https://phabricator.wikimedia.org/T151889) (owner: 10Gehel) [16:22:59] (03PS2) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/324219 [16:23:25] (03CR) 10Dzahn: "please see https://gerrit.wikimedia.org/r/#/c/323996/" [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [16:23:50] (03CR) 10Paladox: [C: 031] Phabricator: Don't use vcs group, use phd [puppet] - 10https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [16:23:54] (03Abandoned) 10Paladox: Phabricator: Create group vcs and require it by the vcs user [puppet] - 10https://gerrit.wikimedia.org/r/323972 (owner: 10Paladox) [16:25:02] !log doubling workspace_backend on all cp* hosts [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:23] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:26:47] (03PS4) 10Andrew Bogott: Remove labtest realm checks from wikitech configs [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) [16:30:11] (03CR) 10Andrew Bogott: "Confirmed no-op with the puppet compiler" [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [16:30:59] (03CR) 10Alex Monk: Remove labtest realm checks from wikitech configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [16:31:17] !log upgrading libicu on mc1020-1036 [16:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:57] (03CR) 10Jalexander: [C: 031] "Looks good to me, this would be useful and I know the global renames would like it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324096 (https://phabricator.wikimedia.org/T150951) (owner: 10MarcoAurelio) [16:35:10] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 13Patch-For-Review: nginx proxy on wdqs servers sometimes try to connect to backend over IPv6 - https://phabricator.wikimedia.org/T151889#2831348 (10Gehel) 05Open>03Resolved nginx is now configured to talk to the backend `127.0.0.1` and... [16:35:18] jouncebot: next [16:35:18] In 0 hour(s) and 24 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1700) [16:35:36] (03CR) 10Andrew Bogott: Remove labtest realm checks from wikitech configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [16:35:40] (03PS1) 10Andrew Bogott: Remove an unneeded $labtest switch [puppet] - 10https://gerrit.wikimedia.org/r/324226 [16:43:47] (03CR) 10Dzahn: "are these changes expected? 
http://puppet-compiler.wmflabs.org/4708/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [16:47:18] (03CR) 10Paladox: "@Dzahn I believe they are since it looks like it just changed the vcs to phd." [puppet] - 10https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [16:48:19] (03PS1) 10Ema: varnish: disable extrachance [puppet] - 10https://gerrit.wikimedia.org/r/324229 (https://phabricator.wikimedia.org/T150247) [16:49:06] (03CR) 10Dzahn: "@iridium:/run/phd/.subversion# find / -type f -gid 996" [puppet] - 10https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 1020after4) [16:49:49] !log setting gethdr_extrachance=0 on all cp* hosts T150247 [16:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:00] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [16:54:23] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:57:13] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:13] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:03] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:58:03] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [16:59:46] (03CR) 10Jforrester: "> Yes, I know about it. I'm waiting till January 2017. Will this patch be de-CR-2ed after this date?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: Urbanecm) [17:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1700). Please do the needful. [17:00:14] (CR) BBlack: [C: 1] varnish: disable extrachance [puppet] - https://gerrit.wikimedia.org/r/324229 (https://phabricator.wikimedia.org/T150247) (owner: Ema) [17:00:29] (PS1) BBlack: varnish: better frontend mem sizing [puppet] - https://gerrit.wikimedia.org/r/324230 [17:00:35] (CR) Ema: [C: 2] varnish: disable extrachance [puppet] - https://gerrit.wikimedia.org/r/324229 (https://phabricator.wikimedia.org/T150247) (owner: Ema) [17:01:13] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:01:13] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:01:29] (CR) jenkins-bot: [V: -1] varnish: better frontend mem sizing [puppet] - https://gerrit.wikimedia.org/r/324230 (owner: BBlack) [17:02:26] (PS2) BBlack: varnish: better frontend mem sizing [puppet] - https://gerrit.wikimedia.org/r/324230 [17:02:36] (CR) Dpatrick: [C: 1] Expand Content-Security-Policy on upload test to fr.
[puppet] - https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618) (owner: Brian Wolff) [17:03:17] (CR) ArielGlenn: [C: 2] fix up locking for misc dumps [dumps] - https://gerrit.wikimedia.org/r/324219 (owner: ArielGlenn) [17:04:22] (PS1) ArielGlenn: html dumps script using misc dump generation framework [dumps] - https://gerrit.wikimedia.org/r/324231 (https://phabricator.wikimedia.org/T133547) [17:08:03] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:08:03] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [17:09:33] (CR) Ema: [C: 1] "Nice! pcc also likes this. https://puppet-compiler.wmflabs.org/4709/" [puppet] - https://gerrit.wikimedia.org/r/324230 (owner: BBlack) [17:12:41] Operations, Analytics: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2831478 (EBernhardson) [17:13:50] (PS5) Andrew Bogott: Remove labtest realm checks from wikitech configs [puppet] - https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) [17:14:05] (CR) Paladox: "I spoked to @Dzahn about this and we have decided to give this a try on the phabricator update window tommror. Reason we are doing this on" [puppet] - https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 20after4) [17:15:33] (CR) Andrew Bogott: [C: 2] Remove labtest realm checks from wikitech configs [puppet] - https://gerrit.wikimedia.org/r/324220 (https://phabricator.wikimedia.org/T148717) (owner: Andrew Bogott) [17:16:53] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:17:41] (PS2) Andrew Bogott: Remove an unneeded $labtest switch [puppet] - https://gerrit.wikimedia.org/r/324226 [17:17:43] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:18:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:19:51] (CR) Andrew Bogott: [C: 2] Remove an unneeded $labtest switch [puppet] - https://gerrit.wikimedia.org/r/324226 (owner: Andrew Bogott) [17:20:34] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [17:20:36] (PS1) Andrew Bogott: Abolish labtest realm. [puppet] - https://gerrit.wikimedia.org/r/324233 (https://bugzilla.wikimedia.org/324220) [17:20:53] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:25:13] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:14] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:26:03] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [17:26:03] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:27:01] (PS1) Ottomata: Puppetize thorium as stat1001 replacement [puppet] - https://gerrit.wikimedia.org/r/324234 (https://phabricator.wikimedia.org/T149438) [17:28:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 622 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4745929 keys, up 29 days 9 hours - replication_delay is 622 [17:28:53] (CR) Ottomata: [C: 2] Puppetize thorium as stat1001 replacement [puppet] - https://gerrit.wikimedia.org/r/324234 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata) [17:33:43] (PS1) Ottomata: Ensure geowiki base path exists [puppet] - https://gerrit.wikimedia.org/r/324235 (https://phabricator.wikimedia.org/T149438) [17:35:39] (CR) ArielGlenn: [C: 2] html dumps script using misc dump generation framework [dumps] - https://gerrit.wikimedia.org/r/324231 (https://phabricator.wikimedia.org/T133547) (owner: ArielGlenn) [17:35:47] (CR) Ottomata: [C: 2] Ensure geowiki base path exists [puppet] - https://gerrit.wikimedia.org/r/324235 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata) [17:36:46] !log otto@tin Starting deploy [analytics/pivot/deploy@0513a6e]: (no message) [17:36:54] !log otto@tin Finished deploy [analytics/pivot/deploy@0513a6e]: (no message) (duration: 00m 08s) [17:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:16] cool!
[17:42:04] (PS1) Ottomata: Add thorium to list of statistics_services in hiera [puppet] - https://gerrit.wikimedia.org/r/324237 (https://phabricator.wikimedia.org/T149438) [17:44:17] (CR) Ottomata: [C: 2] Add thorium to list of statistics_services in hiera [puppet] - https://gerrit.wikimedia.org/r/324237 (https://phabricator.wikimedia.org/T149438) (owner: Ottomata) [17:44:53] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:45:35] (PS1) Eevans: enable instance restbase2012-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/324238 (https://phabricator.wikimedia.org/T151086) [17:46:23] (PS2) Eevans: enable instance restbase2012-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/324238 (https://phabricator.wikimedia.org/T151086) [17:46:38] (CR) Eevans: [C: 1] "Ready to go." [puppet] - https://gerrit.wikimedia.org/r/324238 (https://phabricator.wikimedia.org/T151086) (owner: Eevans) [17:47:23] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:23] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:58:11] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:11] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1800).
[18:00:13] no parsoid deploy today [18:02:03] !log Stopping replication db1095 (new sanitarium, not in use) on s1 instance for maintenance - T150802 [18:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:16] T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802 [18:06:01] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [18:06:11] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:12:58] (PS1) Volans: RAID: reduce MegaCLI sensibiloty (physical disks) [puppet] - https://gerrit.wikimedia.org/r/324240 (https://phabricator.wikimedia.org/T151043) [18:16:51] (CR) 20after4: [C: 1] "Most definitely those changes are expected (and safe/should not be disruptive)" [puppet] - https://gerrit.wikimedia.org/r/323996 (https://phabricator.wikimedia.org/T146055) (owner: 20after4) [18:22:11] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4738995 keys, up 29 days 10 hours - replication_delay is 0 [18:40:05] (CR) Filippo Giunchedi: [C: 1] Release 0.0.3 [software/service-checker] - https://gerrit.wikimedia.org/r/324139 (owner: Volans) [18:40:28] (PS2) Filippo Giunchedi: hhvm: fix hhvm-needs-restart logic for memory [puppet] - https://gerrit.wikimedia.org/r/323887 (https://phabricator.wikimedia.org/T151702) [18:41:57] (CR) Filippo Giunchedi: [C: 2] hhvm: fix hhvm-needs-restart logic for memory [puppet] - https://gerrit.wikimedia.org/r/323887 (https://phabricator.wikimedia.org/T151702) (owner: Filippo Giunchedi) [18:47:45] Operations, OfflineContentGenerator, Reading-Community-Engagement, Reading-Web-Backlog, Services: Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2832017 (ZhouZ) Our Terms of Use
allows for attribution to text contributors via the [[ https://wikimediafoundation.org/wiki... [18:51:40] (03PS3) 10Filippo Giunchedi: enable instance restbase2012-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/324238 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [18:53:36] (03CR) 10Filippo Giunchedi: [C: 032] enable instance restbase2012-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/324238 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [18:56:24] jouncebot: next [18:56:24] In 0 hour(s) and 3 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1900) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T1900). Please do the needful. [19:00:04] MarcoAurelio: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:22] * MarcoAurelio salutes [19:01:13] I can SWAT today [19:02:50] bd808: thanks :) - can we schedule the autopatrol thinguie now too? [19:03:31] MarcoAurelio: yeah, if your good with the current patch I'd love to see it go out [19:03:43] bd808: looks good to me [19:03:54] but I though I should ask you as author first [19:04:16] +1 from me. sweet talk thcipriani to get pushed out [19:04:19] also, if Krenair does not object with removing the right-manageglobalpuppet patch we can get it too [19:04:37] Hi. There is also the meta. noindex change perhaps to deploy. 
[19:04:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) (owner: 10MarcoAurelio) [19:04:46] Dereckson: yep [19:04:54] it can go too if you wish [19:05:35] (03Merged) 10jenkins-bot: Remove FlaggedRevs autopromotion function at eowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) (owner: 10MarcoAurelio) [19:05:37] MarcoAurelio: ok [19:06:54] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2832114 (10GWicke) The PR is now merged, and I also checked with @bblack about object sizes & Varnish cache times. With expected volume & sizes (< 100... [19:07:08] (03PS1) 10Chad: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324243 [19:07:27] ostriches: can I reset wikiversions.json for SWAT? [19:07:37] Yeah, I already tossed it [19:07:42] Was just making my patch [19:07:43] cool thanks :) [19:10:56] MarcoAurelio: https://gerrit.wikimedia.org/r/#/c/322262 live on mwdebug1002 if there is anything to check there [19:11:26] thcipriani: I think this is un-checkable but let me have a look at the wiki to see if anything broke [19:11:33] ok, thanks [19:13:00] thcipriani: I'd say green light [19:13:08] MarcoAurelio: ok, going live [19:13:14] wiki looks fine at mwdebug1002 [19:13:33] ack thanks [19:14:48] (03PS3) 10Thcipriani: Set 'abusefilter-modify-global' to stewards locally at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321660 (https://phabricator.wikimedia.org/T150752) (owner: 10MarcoAurelio) [19:15:25] !log thcipriani@tin Synchronized wmf-config/flaggedrevs.php: SWAT: [[gerrit:322262|Remove FlaggedRevs autopromotion function at eowiki]] T150591 (duration: 01m 37s) [19:15:34] ^ MarcoAurelio live everywhere [19:15:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:39] T150591: Disable automated promotion of user status in eowiki - https://phabricator.wikimedia.org/T150591 [19:15:40] okie dokie [19:15:46] closing phab task [19:16:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321660 (https://phabricator.wikimedia.org/T150752) (owner: 10MarcoAurelio) [19:18:06] (03PS3) 10Marostegui: mariadb: Split backup and otrsbackups classes into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) [19:18:38] (03Merged) 10jenkins-bot: Set 'abusefilter-modify-global' to stewards locally at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321660 (https://phabricator.wikimedia.org/T150752) (owner: 10MarcoAurelio) [19:19:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 608 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4746568 keys, up 29 days 10 hours - replication_delay is 608 [19:20:00] MarcoAurelio: https://gerrit.wikimedia.org/r/#/c/321660/ live on mwdebug1002 [19:20:23] thcipriani: ack, checking [19:21:02] (03PS1) 10ArielGlenn: fix check that file we have fcntl lock is empty [dumps] - 10https://gerrit.wikimedia.org/r/324246 [19:21:44] thcipriani: special:listgrouprights on mwdebug lgtm [19:21:57] will check further when live to see if this really works [19:22:01] it should [19:22:06] :) ok, going live [19:22:22] I'm adding three more patches to deploy [19:22:56] * thcipriani nods [19:23:07] !log thcipriani@tin Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:321660|Set "abusefilter-modify-global" to stewards locally at Meta-Wiki]] T150752 (duration: 00m 45s) [19:23:11] ^ MarcoAurelio live everywhere [19:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:19] T150752: Add 'abusefilter-modify-global' to stewards at Meta-Wiki instead of having it through 
global group - https://phabricator.wikimedia.org/T150752 [19:23:24] thcipriani: okay, testing further [19:23:30] hold on please [19:24:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324096 (https://phabricator.wikimedia.org/T150951) (owner: 10MarcoAurelio) [19:24:31] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:04] (03Merged) 10jenkins-bot: Add 'global-renamer' to the list of privileged wiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324096 (https://phabricator.wikimedia.org/T150951) (owner: 10MarcoAurelio) [19:25:32] seems working as expected in local groups [19:26:02] cool, thanks for checking, https://gerrit.wikimedia.org/r/#/c/324096/ is now live on mw1002debug [19:26:10] er, mwdebug1002 :P [19:27:40] checking [19:28:54] thcipriani: works for me [19:29:00] ok going live [19:29:10] oathauth-enable for global-renamer at listgrouprights at meta [19:29:33] we should also probably have it expanded to wmf-officeit and wmf-supportandsafety now that I see [19:29:41] well, too late :P [19:29:43] PROBLEM - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused [19:30:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) (owner: 10Dereckson) [19:30:40] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.69:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.69 and port 9042: Connection refused Filippo Giunchedi bootstrapping [19:30:54] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:324096|Add "global-renamer" to the list of privileged wiki groups]] T150951 (duration: 00m 45s) [19:31:03] ^ MarcoAurelio live everywhere [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:31:05] Reedy: shall we add wmf-officeit and wmf-supportandsafety to wmgPrivilegedGroups as well? [19:31:06] T150951: Create list of privileged wiki groups - https://phabricator.wikimedia.org/T150951 [19:31:40] Do they have any real extra rights? [19:32:03] Reedy: centralauth-lock, centralauth-oversight and userrights-interwiki, userrights [19:32:08] MarcoAurelio: ah, looks like https://gerrit.wikimedia.org/r/#/c/321713/1 has a parent commit [19:32:20] I guess so, yeah [19:32:55] thcipriani: maybe we can let Dereckson decide on that one [19:32:59] thcipriani, are you doing the train later? I just cherrypicked two patches for the wmf4 train - not sure if its part of swat yet [19:33:02] and move to the next? [19:33:14] MarcoAurelio: fair enough [19:33:29] yurik: ostriches is on train up next [19:33:49] (03CR) 10Thcipriani: Allow __NOINDEX__ on all namespaces on meta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) (owner: 10Dereckson) [19:34:06] oh yes parent commit should be amended, with a better suggested comment [19:34:11] (03CR) 10MarcoAurelio: [C: 031] [logo] Add logo for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324188 (https://phabricator.wikimedia.org/T151731) (owner: 10Urbanecm) [19:34:36] ostriches, should i merge the two patches? https://gerrit.wikimedia.org/r/#/c/324248/ and https://gerrit.wikimedia.org/r/#/c/324249/ [19:34:40] godog: thanks for the merge, sir! 
[19:34:44] (03PS4) 10Thcipriani: Add autopatrolled group for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303183 (owner: 10BryanDavis) [19:34:54] bd808: ^^ [19:34:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303183 (owner: 10BryanDavis) [19:35:38] (03Merged) 10jenkins-bot: Add autopatrolled group for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303183 (owner: 10BryanDavis) [19:37:25] hrm, I suppose that you can check this one on mwdebug? [19:37:30] sure [19:38:08] (03PS2) 10Dereckson: Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321712 [19:38:09] yurik: Yes [19:38:12] Plz merge [19:38:13] ah, figured it was impossible since wikitech was a weird one. live on mwdebug1002 then [19:38:23] thcipriani: 321712 is ready [19:38:30] Dereckson: ok, thank you [19:39:42] thcipriani: can I check on mwdebug then? [19:40:03] (03CR) 10Filippo Giunchedi: prometheus: add vhtcpd stats via node-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323559 (https://phabricator.wikimedia.org/T147429) (owner: 10Filippo Giunchedi) [19:40:09] MarcoAurelio: yes, change should be there now [19:40:15] re-checking [19:40:19] urandom: you are willkommen [19:40:21] because it didn't [19:40:55] thcipriani: it's not on mwdebug1002 [19:41:02] maybe on 1001? [19:41:03] I thought wikitech was only on silver so it would be impossible to check via the x-debug header [19:41:20] ^ bd808 is that correct? [19:41:29] (03PS2) 10Dereckson: Allow __NOINDEX__ on all namespaces on meta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) [19:41:54] thcipriani: yeah. wikitech changes can't be tested on the mw test servers [19:42:07] no checking then :) [19:42:12] yup, seems like. 
[19:42:28] if the wiki breaks, well, it seems we did something bad ;) [19:43:43] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:43:55] yeah, that'd be a good indication :) [19:44:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:303183|Add autopatrolled group for wikitech]] (duration: 00m 45s) [19:44:32] ^ MarcoAurelio live everywhere [19:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:18] (03Draft2) 10MarcoAurelio: WMF staff local groups to $wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324250 (https://phabricator.wikimedia.org/T150951) [19:45:24] (03Draft1) 10MarcoAurelio: WMF staff local groups to $wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324250 (https://phabricator.wikimedia.org/T150951) [19:45:35] checking [19:46:22] works [19:46:47] (03PS3) 10Thcipriani: Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321712 (owner: 10Dereckson) [19:46:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321712 (owner: 10Dereckson) [19:47:13] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:13] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:47:19] well, sort of, bd808 and me forgot to specify who could add and remove users from the group [19:47:48] (03Merged) 10jenkins-bot: Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321712 (owner: 10Dereckson) [19:47:53] MarcoAurelio: doh [19:48:03] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:48:33] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [19:48:56] Of course my wmf branch changes get caught behind a master branch change. Wouldn't expect it any other way. [19:49:01] * ostriches twiddles his thumbs [19:50:03] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:50:03] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [19:50:05] MarcoAurelio: Dereckson https://gerrit.wikimedia.org/r/#/c/321712/3 is live on mwdebug1002 if there's anything to check [19:50:20] thcipriani: checking [19:50:46] (03PS3) 10Thcipriani: Allow __NOINDEX__ on all namespaces on meta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) (owner: 10Dereckson) [19:51:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4746742 keys, up 29 days 11 hours - replication_delay is 0 [19:51:43] (03PS1) 10Faidon Liambotis: Add a couple of AT&T's west coast resolver ranges [dns] - 10https://gerrit.wikimedia.org/r/324252 [19:52:11] ah thcipriani that was not the specific one to metawiki so I can't check, maybe Dereckson [19:52:23] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:53:29] jenkins acting up again? 
[19:54:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:52] paravoid: Just a lil slow [19:54:53] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:54:59] "lil" [19:54:59] MarcoAurelio: tested with mwrepl on mwdebug1002, var_dump($wmgAllowRobotsControlInAllNamespaces); returns false as expected [19:55:01] Got caught behind some slow jobs. [19:55:12] nodepool was bumped today [19:55:13] thcipriani: super [19:55:26] going live everywhere [19:55:53] I still can't understand why jobs that take a couple of seconds to run get stuck behind running the full MW test suite [19:55:58] paladox: Bunch of mw/core and VE changes landed around the same time, all the executors got taken up by slow/probably-useless jobs. [19:56:08] Yep [19:56:12] paravoid: I've been asking that for the last 2 years. [19:56:14] 4 years in and I still don't get why our jenkins is so slow [19:56:17] Er, meant, that for you paravoid ^ [19:56:26] Too many pa* [19:56:27] we should have more resources now but still should notice the slowness of nodepool [19:56:29] at peak time [19:56:57] paravoid: I can tell you exactly why it's so slow: because we're using f'ing nodepool. [19:56:59] ostriches it is zuul [19:57:04] Instead of, you know, dedicated slaves. [19:57:06] (03CR) 10Faidon Liambotis: [C: 032] Add a couple of AT&T's west coast resolver ranges [dns] - 10https://gerrit.wikimedia.org/r/324252 (owner: 10Faidon Liambotis) [19:57:16] I hate nodepool [19:57:21] hate hate hate [19:57:40] What about docker? [19:57:54] if only someone back then had said to not use that [19:57:56] oh wait :P [19:58:00] Yes, openstack uses it. You know what? We're not openstack. They also have 822821821 times the resources in their CI as us. [19:58:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:58:07] Continuing to copy openstack isn't always the best idea! [19:58:22] How did they manage to get that much resources? [19:58:31] ostriches ^^ [19:58:34] Money [19:58:43] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:49] Like 900000 trillion billion quadrillion dollars [19:58:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321712|Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces]] PART I (duration: 00m 51s) [19:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:16] Oh [19:59:23] (03PS1) 10Faidon Liambotis: Fix netmask of new AT&T resolver network [dns] - 10https://gerrit.wikimedia.org/r/324253 [19:59:36] How did they manage to get that much? wikimedia is the most known name in the world [19:59:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Fix netmask of new AT&T resolver network [dns] - 10https://gerrit.wikimedia.org/r/324253 (owner: 10Faidon Liambotis) [19:59:42] + most used website wikipedia [19:59:53] It's not the most used website. [19:59:54] {{citation needed}} [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161129T2000). Please do the needful. [20:00:12] ostriches yes it is, it is in the top websites in the world. 
[20:00:14] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:321712|Allow a wiki to use __NOINDEX__ and __INDEX__ in all namespaces]] PART II (duration: 00m 48s) [20:00:20] twentyafterfour: respected human, allows us to finish SWAT pls :) [20:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:32] MarcoAurelio: It's me, and tell swat it's taking too long ;-) [20:00:33] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [20:00:38] ostriches: :P [20:00:53] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:00:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) (owner: 10Dereckson) [20:00:59] sorry :'( [20:01:07] paladox: "In the top websites" != "Most used website" [20:01:17] Yep [20:01:19] It means it's among the top 10 most used websites (depending on who you ask) [20:01:21] jouncebot: you have the wrong human [20:01:23] Most used implies #1 [20:01:33] twentyafterfour: jouncebot is the absolute worst type of person [20:01:40] ostriches most teachers here say doint use wikipedia in your work as it can be false [20:01:47] but everyone uses it anyways [20:01:48] (03Merged) 10jenkins-bot: Allow __NOINDEX__ on all namespaces on meta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321713 (https://phabricator.wikimedia.org/T150245) (owner: 10Dereckson) [20:01:49] lol [20:01:51] That's nice. [20:01:55] And offtopic. [20:02:52] MarcoAurelio: https://gerrit.wikimedia.org/r/#/c/321713/ live on mwdebug1002, check please [20:03:05] on it [20:03:23] thcipriani: works [20:03:30] ok, going live [20:03:31] I had a page already loaded [20:03:37] (03PS2) 10Andrew Bogott: Abolish labtest realm. 
[puppet] - 10https://gerrit.wikimedia.org/r/324233 (https://phabricator.wikimedia.org/T148717) [20:03:37] (03PS1) 10Andrew Bogott: labtest: smtp servers should be the same as in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324255 (https://phabricator.wikimedia.org/T148717) [20:04:38] (03PS2) 10Thcipriani: wikitech cloudadmin: remove right that no longer exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323708 (owner: 10Alex Monk) [20:04:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323708 (owner: 10Alex Monk) [20:05:03] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321713|Allow __NOINDEX__ on all namespaces on meta]] (T150245) (duration: 00m 44s) [20:05:09] ^ MarcoAurelio live everywhere [20:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:17] T150245: Make __NOINDEX__ work on all namespaces on Meta-Wiki - https://phabricator.wikimedia.org/T150245 [20:05:17] !log deploying D478 (refs T151844 ) [20:05:22] yay [20:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:28] D478: Drasticly improve performance when importing large repos - https://phabricator.wikimedia.org/D478 [20:05:28] T151844: Optimize phabricator repository updates - https://phabricator.wikimedia.org/T151844 [20:05:43] (03Merged) 10jenkins-bot: wikitech cloudadmin: remove right that no longer exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323708 (owner: 10Alex Monk) [20:07:19] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:323708|wikitech cloudadmin: remove right that no longer exists]] (duration: 00m 45s) [20:07:25] ^ MarcoAurelio also live [20:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:47] ostriches: SWAT is complete [20:07:50] yay [20:08:00] sorry about the overrun :( [20:08:04] (03CR) 10Chad: [C: 032] group0 to wmf.4 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/324243 (owner: 10Chad) [20:08:25] ostriches, i merged all the pending patches to wmf29.4 some time ago, please git pull [20:08:31] I already did. [20:08:43] thcipriani: well, not really it's one pending but maybe we can do it tomorrow [20:08:48] (03Merged) 10jenkins-bot: group0 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324243 (owner: 10Chad) [20:08:50] thcipriani: 11am swat steps on the toes on the train on tues, not your fault. It basically means the "pre-sync before noon even happens to bootstrap" can't happen. [20:08:51] also looks fine [20:09:17] (03CR) 10Andrew Bogott: [C: 032] labtest: smtp servers should be the same as in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324255 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [20:09:21] !log demon@tin Started scap: wmf.4 for fun and profit [20:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:44] MarcoAurelio: ah, dang, yeah, just saw that last one. Let's bump it to either evening or tomorrow. Sorry about that. [20:10:28] (03PS1) 10Andrew Bogott: Add new dummy file for the puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/324258 [20:10:50] (03CR) 10Andrew Bogott: [C: 032 V: 032] Add new dummy file for the puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/324258 (owner: 10Andrew Bogott) [20:11:18] thcipriani: don't worry, we were already out of time [20:11:29] sorry that I took the entire swat window [20:12:03] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.20 seconds [20:12:09] heh, np, initially it was just 3 patches :) [20:12:10] * mafk 🔪🦃 [20:12:39] see you later and thank you very much [20:14:00] toodles thanks for all the patches/checks! 
[20:16:43] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.87 seconds [20:19:43] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 180.73 seconds [20:21:55] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: remove fundraising banner log related cruft from production puppet - https://phabricator.wikimedia.org/T118325#2832532 (10Jgreen) 05Open>03Resolved a:03Jgreen [20:27:03] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 38.24 seconds [20:27:56] 06Operations, 10fundraising-tech-ops, 10netops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2832558 (10Jgreen) 05Open>03declined We might as well close this task since we plan to replace the firewalls very soon. [20:28:23] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:17] !log demon@tin Finished scap: wmf.4 for fun and profit (duration: 20m 55s) [20:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:32] (03CR) 10Hashar: "That should do it :] We can pair to cherry pick that on the CI puppet master, run puppet on hosts and see what happen!" [puppet] - 10https://gerrit.wikimedia.org/r/324203 (https://phabricator.wikimedia.org/T117418) (owner: 10Zfilipin) [20:30:45] (03CR) 10ArielGlenn: [C: 032] fix check that file we have fcntl lock is empty [dumps] - 10https://gerrit.wikimedia.org/r/324246 (owner: 10ArielGlenn) [20:35:46] ostriches: the release note has a weird : Notice: Undefined offset: 0 in /a/release/make-deploy-notes/make-deploy-notes on line 307 [20:35:51] https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.4 [20:35:55] I know. 
[20:35:59] ok [20:36:01] The script is broken all the time [20:36:05] Just remove it ;-) [20:36:08] fun and profit [20:49:37] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2832697 (10aaron) +1 for the etc/confd/json + APC approach for MediaWiki, at least starting wit... [20:51:23] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:57:29] (03PS1) 10BBlack: varnish: make PURGE more efficient [puppet] - 10https://gerrit.wikimedia.org/r/324270 [21:01:03] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:26] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10Krinkle) >>! In T149617#2832697, @aaron wrote: >>>! Task description: >> * Either pa... [21:05:13] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2832766 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below... [21:09:28] (03CR) 10Krinkle: [C: 04-1] "Not unused. This is symlinked from docroot/default/index.html. 
I moved it here so that all these error pages with similar styles and html " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323990 (owner: 10Chad) [21:16:59] (03Abandoned) 10Chad: Remove default.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323990 (owner: 10Chad) [21:18:34] Hmmm, group0 is not reporting their version # (wmf.4) in their error logs, hmm [21:18:37] That's no bueno [21:19:52] ostriches: no version number at all or the wrong version number? [21:20:02] None at all [21:20:14] "No results displayed because all values equal 0" [21:20:41] Code path still pointing to wmf.3? [21:20:43] wtf.... [21:21:01] !log phab2001 - upgrading scap and other packages (we need to get puppet running here again) [21:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:28] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2832815 (10Multichill) As discussed: The data prior to 2014-02-27 already got deleted. From 2015 we have https://dumps.wikimedia.org/other/me... [21:21:30] bd808: Have a look at the group0 dashboard. It's weird. [21:22:24] ostriches: only hhvm errors showing so no version number in the log data [21:22:36] Ah that might explain it :p [21:22:47] zoom the time out and you'll see versions show up [21:27:50] (03PS1) 10Rush: labstore: configure etytree to mount public dumps [puppet] - 10https://gerrit.wikimedia.org/r/324326 [21:29:03] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:31:43] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:32:08] (03CR) 10Chad: [C: 032] Monolog: Add processor for XFF resolved IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) (owner: 10BryanDavis) [21:32:18] bd808: Imma sync that ^ [21:32:32] ostriches: cool. keep an eye on perf if you know how [21:32:40] I don't! [21:32:47] (03Merged) 10jenkins-bot: Monolog: Add processor for XFF resolved IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) (owner: 10BryanDavis) [21:32:49] I hope its fine but tgr was a bit worried [21:32:49] (03PS1) 10Hashar: nodepool: lower max-server from 20 to 19 [puppet] - 10https://gerrit.wikimedia.org/r/324328 [21:33:07] somewhere there is a save time dashboard. if its bad it should show up there [21:33:31] bd808: it will generate lots of warnings since it changes the value of 'ip' [21:33:38] ostriches: https://grafana.wikimedia.org/dashboard/db/save-timing I guess [21:33:53] tgr: do you still have crap that is watching that? [21:34:00] I thought it was reverted [21:34:06] no [21:34:11] :/ [21:35:23] hmm [21:35:30] it's improved in https://gerrit.wikimedia.org/r/#/c/323099 but 1) it's not merged 2) it will still warn when the value of the log field actually changes [21:35:30] tgr: is it in both branches of just wmf.3? [21:35:40] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2000325 (10Eevans) [21:36:12] re: speed, it won't do anything so bad that we can't try it for a week and then decide based on that [21:36:33] Oh yeah, I was reading that change earlier. [21:36:34] a dozen ms or something like that [21:36:41] It should go into wmf.4 [21:37:03] wmf.3 is for two weeks ago, right? [21:37:04] tgr: having looked at the monolog code more, the only things we need to warn on are the core message, etc. 
The others don't cause the same problems as context processors add to extra rather than context and extra wins [21:37:08] that one does not spam [21:37:20] tgr: wmf.3 was last 2 weeks [21:37:24] (since last week was short) [21:37:28] cmjohnson1: any news on T150964? we're not in a hurry, but i'm trying to determine how i should plan. [21:37:28] T150964: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964 [21:37:43] ostriches: my versions dashboard is still showing everything on .3 [21:37:59] (03CR) 10Andrew Bogott: "This is a no-op for non labtest hosts." [puppet] - 10https://gerrit.wikimedia.org/r/324233 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [21:38:10] you can just revert https://gerrit.wikimedia.org/r/#/c/313932/ if you are looking for a quick fix [21:39:00] I put up a revert commit but then rewrote it into something else, which was not the brightest idea [21:39:11] Ummmm. [21:39:11] !log servermon - manually running "make_updates" command from cron for debugging - failed with a mysql_excetpion, lock wait timeout exceeded [21:39:14] demon@tin:/srv/mediawiki-staging$ scap wikiversions-inuse [21:39:14] 1.29.0-wmf.3 [21:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:20] tgr: that patch is only looking at context which shouldn't collide with the config that ostriches is pushing [21:39:21] How in the hell did I pull that off? [21:40:37] bd808: that patch needs to go, I wanted to get rid of it by https://gerrit.wikimedia.org/r/#/c/323099 and that one does look at context [21:40:37] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: helps to actually sync updated files [21:40:48] Crap. 
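For reference, the `scap wikiversions-inuse` check shown above can be approximated offline. The following is only a minimal sketch, assuming that wikiversions.json is a flat JSON object mapping wiki database names to `php-<version>` directory names; that layout, the helper name `versions_in_use`, and the sample data are illustrative assumptions, not scap's actual implementation.

```python
import json


def versions_in_use(wikiversions_json):
    """Return the sorted, de-duplicated versions named in a wikiversions-style mapping."""
    mapping = json.loads(wikiversions_json)
    versions = set()
    for directory in mapping.values():
        # Assumed convention: values look like "php-1.29.0-wmf.3".
        if directory.startswith("php-"):
            directory = directory[len("php-"):]
        versions.add(directory)
    return sorted(versions)


# Hypothetical sample: a stale checkout in which every wiki still points at wmf.3.
stale = json.dumps({
    "testwiki": "php-1.29.0-wmf.3",
    "cawikibooks": "php-1.29.0-wmf.3",
})
print(versions_in_use(stale))
```

If the merged change had been pulled before the sync, a check like this would be expected to list both versions rather than a single one.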
[21:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:53] Which means it didn't bootstrap i18n [21:40:55] ostriches: I see logs for the scap but not for versions [21:40:55] Dammit [21:41:02] !log demon@tin Started scap: rebuild l10n for wmf.4 [21:41:10] Stupid [21:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:12] Stupid stupid stupid [21:41:25] easy to do. E_TOOMANYSTEPS [21:41:42] I forgot to pull wikiversions.json after I merged it [21:41:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:41:47] oh sorry I see what you are talking about [21:41:50] How I did that, I shall not understand [21:41:59] !log demon@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.wDhBxhXGtC" ' returned non-zero exit status 139 (duration: 00m 57s) [21:42:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:42:04] Well that's no bueno [21:42:08] Wouldn't it be in the bash history? [21:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:13] Segfault. 
[21:42:18] yeah, a record/extra conflict is not a problem [21:42:22] Fan-fuckin-tastic [21:42:25] ouch [21:42:29] !log demon@tin Started scap: rebuild l10n for wmf.4 -- attempt #2 [21:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:43] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:43] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:43] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:43] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:42:43] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:44] 06Operations, 06DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#2832977 (10RobH) This came up in the ops meeting as well. The discussion there is for any esams mgmt switches that are missing the serial, to just randomly generate one that is known to be fake (maybe inclu... 
[21:42:52] !log demon@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.6ehbXfKDfD" ' returned non-zero exit status 139 (duration: 00m 23s) [21:42:53] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:53] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:53] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:42:59] Um, what the actual fuck? [21:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:43:03] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:43:19] cawikibooks is making l10n-rebuild fail [21:43:33] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2832982 (10aaron) The background process would write to the JSON file. APC caching would be don... 
[21:43:33] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:43:33] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:43:40] ostriches: running mergeMessageFileList.php manually sometimes gives better error messages [21:44:00] the cawikibooks thing is just the first group1 wiki [21:44:05] Or, in my case, none. [21:44:12] No errors when running by hand [21:44:15] !log demon@tin Started scap: rebuild l10n for wmf.4 -- attempt #3 [21:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:38] !log demon@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.v5ELy9acbV" ' returned non-zero exit status 139 (duration: 00m 23s) [21:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:58] Ok, this is weird. [21:45:09] Considering I scapped wmf.3 earlier [21:45:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:46:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:46:24] the errors are limited to group0 right? [21:46:25] ostriches mediawiki is down [21:46:30] mediawiki.org [21:46:35] (03CR) 10Hashar: "That follow up our conversation, thanks for the help :] This patch is neither needed or unwanted. 
The real root cause for the few errors" [puppet] - 10https://gerrit.wikimedia.org/r/324328 (owner: 10Hashar) [21:46:36] If you report this error to the Wikimedia System Administrators, please include the details below. [21:46:37] [21:46:37] PHP fatal error: [21:46:37] File not found: /srv/mediawiki/php-1.29.0-wmf.4/../wmf-config/ExtensionMessages-1.29.0-wmf.4.php [21:46:42] paladox: we know [21:46:45] oh [21:46:47] (03CR) 10Andrew Bogott: [C: 032] nodepool: lower max-server from 20 to 19 [puppet] - 10https://gerrit.wikimedia.org/r/324328 (owner: 10Hashar) [21:46:50] (03CR) 10Rush: [C: 031] "this seems like the right cruft, thanks andrew" [puppet] - 10https://gerrit.wikimedia.org/r/324233 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [21:47:06] Well dangit. [21:47:21] ostriches: there are no error messages but a 139 exit status [21:47:26] not sure yet what that means [21:47:36] !log demon@tin Started scap: mw.org back to wmf.3 [21:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:59] !log demon@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.cPclJsf3pO" ' returned non-zero exit status 139 (duration: 00m 23s) [21:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:25] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 back to wmf.3 [21:48:29] Ok, group0 fixed. 
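On the "139 exit status" question above: shells conventionally report a process killed by signal N as exit status 128 + N, so 139 would correspond to signal 11. The sketch below decodes a status under that convention. It assumes the status came through a shell wrapper, and the helper name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status, decoding the 128+N signal convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return "killed by unknown signal %d" % signum
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % status


print(describe_exit_status(139))
```

On Linux this prints `killed by signal 11 (SIGSEGV)`, which would explain why the script produced no error message of its own before dying.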
[21:48:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:48:33] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [21:48:33] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy [21:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:36] Now, what the hell is up with mergeMessageFileList [21:48:43] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy [21:48:43] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy [21:48:43] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy [21:48:43] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy [21:48:53] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy [21:48:53] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [21:48:53] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [21:49:03] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [21:50:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:50:54] why does it say join the #wikipedia channel yet mediawiki.org is not wikipedia? [21:51:08] (03PS3) 10Andrew Bogott: Abolish labtest realm. 
[puppet] - 10https://gerrit.wikimedia.org/r/324233 (https://phabricator.wikimedia.org/T148717) [21:51:39] paladox: because there is one OMG fatal error page for everything [21:51:45] oh [21:51:55] And we'd rather people spam #wikipedia than here ;-) [21:52:01] lol [21:52:39] Ok, so this l10n segfault is annoying [21:52:42] (03PS2) 10Rush: labstore: configure etytree to mount public dumps [puppet] - 10https://gerrit.wikimedia.org/r/324326 [21:52:55] /usr/local/bin/mwscript: line 23: 5341 Segmentation fault php5 "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@" [21:52:59] ^ Is not useful [21:53:10] (03CR) 10Andrew Bogott: [C: 032] Abolish labtest realm. [puppet] - 10https://gerrit.wikimedia.org/r/324233 (https://phabricator.wikimedia.org/T148717) (owner: 10Andrew Bogott) [21:53:14] And the resulting files in /tmp/ are empty, so it's dying before we even write anything [21:53:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [21:53:55] (03CR) 10Andrew Bogott: [C: 031] labstore: configure etytree to mount public dumps [puppet] - 10https://gerrit.wikimedia.org/r/324326 (owner: 10Rush) [21:54:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:54:30] !log demon@tin Started scap: probably won't work, no-op [21:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:41] (03PS3) 10Rush: labstore: configure etytree to mount public dumps [puppet] - 10https://gerrit.wikimedia.org/r/324326 [21:54:51] (03CR) 10Rush: [C: 032 V: 032] labstore: configure etytree to mount public dumps [puppet] - 10https://gerrit.wikimedia.org/r/324326 (owner: 10Rush) [21:55:34] !log demon@tin scap aborted: probably won't work, no-op (duration: 01m 03s) [21:55:40] A-ha! It does work! 
[21:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:17] (03PS1) 10Chad: Revert "Monolog: Add processor for XFF resolved IP" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324333 [21:56:25] bd808: Your patch makes segfaults [21:56:34] Dunno why [21:56:36] But reverting [21:56:43] (03CR) 10Chad: [C: 032 V: 032] Revert "Monolog: Add processor for XFF resolved IP" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324333 (owner: 10Chad) [21:57:10] !log demon@tin Started scap: Ok, back to normal now [21:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:45] ostriches: fun! [21:59:09] ostriches: oh... there's no request context or ip at the shell [21:59:18] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt] [21:59:32] bd808: herp derp. But segfault seems like the wrong way to die :p [21:59:48] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:59:48] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:00:38] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:00:48] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:01:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:06:22] mutante: can I get a hand making a cert? 
It always takes me half a day to re-learn how to do it [22:07:04] 06Operations, 06Analytics-Kanban, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2833111 (10Milimetric) talking further, it turns out we can disable the glam_nara jobs alltogether, per @lzia and @Multichill, and purge all... [22:07:17] andrewbogott: what kind of cert? i dont usually make them manually [22:07:37] I need a copy of labvirt-star.eqiad.wmnet.crt for codfw [22:07:54] it's a self-signed cert for the virt nodes to talk to each other [22:08:19] RequestContext and WebRequest should work in the shell [22:08:28] is it really self-signed or signed by our own CA? [22:08:39] not sure about IP, I remember that causing all kinds of errors in vagrant [22:08:40] the second thing :) [22:08:46] tgr: They should, yeah, but it was definitely causing the segfault I hit. [22:08:50] So, revert for now :) [22:09:05] mutante: it's whatever https://gerrit.wikimedia.org/r/#/c/204612/ is :) [22:09:37] and there we get to the next question "what about those 2 different CAs we had" :p [22:09:41] i see his comment [22:11:48] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:12:57] (03CR) 10Gergő Tisza: "Login and other sensitive logs just add the whole XFF header in a separate field. We could do that as a poor man's solution." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) (owner: 10BryanDavis) [22:14:29] (03CR) 10Krinkle: [C: 031] Bump parser cache purging batch wait time [puppet] - 10https://gerrit.wikimedia.org/r/323764 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [22:14:53] mutante: if it will also take you half a day to re-learn how to do it then nevermind :) [22:15:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:20:29] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [22:20:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:21:00] andrewbogott: sooo.. the right CA is wmf_ca_2014_2017. the private key for that is in: /srv/private/modules/secret/secrets/ssl/wmf_ca_2014_2017/wmf_ca_2014_2017.key [22:21:50] so first create CSR like: openssl req -new -sha256 -key ~/domain.com.ssl/domain.com.key -out ~/domain.com.ssl/domain.com.csr [22:21:56] using that key [22:21:58] 06Operations, 05Continuous-Integration-Scaling, 07Nodepool, 07WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2833236 (10Paladox) We could upgrade to nodepool 0.2 and use this http://snapshot.debian.org/package/python-shad... [22:22:22] then you should get the interactive prompt [22:22:28] mutante: what is the mutante what is modules/secret/secrets/ssl/labvirt-star.codfw.wmnet.key then? 
[22:22:38] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:22:39] hm, echo in here [22:22:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:23:24] !log demon@tin Finished scap: Ok, back to normal now (duration: 26m 14s) [22:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:38] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [22:26:38] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [22:29:03] andrewbogott: oh, yea, the CSR has to be created using that key for labvirt. then when the CSR is created it will be signed by the CA and _then_ we'll use the CA key [22:31:55] since you already have a key for labvirt-star.codfw.wmnet.key, first step should be: openssl req -new -sha256 -key modules/secret/secrets/ssl/labvirt-star.codfw.wmnet.key -out /tmp/labvirt-star.codfw.wmnet.csr [22:34:10] mutante: ok,I have labvirt-star.codfw.wmnet.csr now... [22:35:38] and then signing the CSR should be like: openssl x509 -req -in /tmp/labvirt-star.codfw.wmnet.csr -CA wmf_ca_2014-2017.pem -CAkey wmf_ca_2014_2017.key -CAcreateserial -out labvirt-star.codfw.wmnet.crt -days 365 -sha256 (or more days?) [22:36:54] -in foo.csr -CA ca_itself.pem -CAkey key_of_ca.key ... [22:38:43] and finally add that resulting .crt in files/ssl/ in public repo and have puppet install it, but it looks like this already covers that: [22:39:31] $certname = "labvirt-star.${::site}.wmnet" [22:39:56] $ca_target = '/etc/ssl/certs/wmf_ca_2014_2017.pem' (yup, that's the one) [22:40:11] I don't have wmf_ca_2014-2017.pem though, do I? 
[22:41:02] root@puppetmaster1001:/srv/private/modules/secret/secrets/ssl/wmf_ca_2014_2017# [22:41:09] inside that dir [22:41:14] is the .pem and the .key [22:41:34] oh, I see, it was just - vs _ [22:41:35] 06Operations, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2373208 (10fgiunchedi) I can confirm all my requests to upload today were cookie free, anything left to do? [22:42:06] (03PS1) 10Alex Monk: Follow-up I3b706396 and Id8c53f8f: Fix -labs variations of wmgM(F|Mobile) variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324340 (https://phabricator.wikimedia.org/T151894) [22:42:25] not sure about 01.pem through 0E.pem in that place, btw [22:42:51] andrewbogott: oh, there's a README :) [22:43:01] tells us the commands too [22:43:21] (03PS1) 10Andrew Bogott: Add labvirt-star.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/324342 [22:43:23] it suggests to use -days 720 [22:43:30] yep, ok [22:43:33] I think I've got it. thanks! [22:43:40] yw [22:44:21] 06Operations: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933#2833364 (10fgiunchedi) [22:46:16] 06Operations, 06Labs, 13Patch-For-Review: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2833365 (10faidon) These are great to see ­­­— many thanks to both of you! [22:48:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:50:09] (03CR) 10Andrew Bogott: [C: 032] Add labvirt-star.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/324342 (owner: 10Andrew Bogott) [22:51:21] 06Operations, 06Labs, 13Patch-For-Review: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2833390 (10Andrew) 05Open>03Resolved I've removed all the labtest realm checks. 
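The two openssl steps worked out above (create a CSR from the existing key, then sign it with the CA) can be collected into one place. This is a rough sketch that only builds and prints the command lines without running them. The paths and the 720-day validity are taken from the conversation, while the function names are hypothetical and the README mentioned above remains the authoritative procedure.

```python
import shlex


def csr_command(key_path, csr_path):
    """Build the argv that creates a CSR from an existing private key."""
    return ["openssl", "req", "-new", "-sha256", "-key", key_path, "-out", csr_path]


def sign_command(csr_path, ca_cert, ca_key, crt_path, days=720):
    """Build the argv that signs a CSR with the CA certificate and CA key."""
    return [
        "openssl", "x509", "-req",
        "-in", csr_path,
        "-CA", ca_cert,
        "-CAkey", ca_key,
        "-CAcreateserial",
        "-out", crt_path,
        "-days", str(days),
        "-sha256",
    ]


commands = [
    csr_command(
        "modules/secret/secrets/ssl/labvirt-star.codfw.wmnet.key",
        "/tmp/labvirt-star.codfw.wmnet.csr",
    ),
    sign_command(
        "/tmp/labvirt-star.codfw.wmnet.csr",
        "wmf_ca_2014_2017.pem",  # underscores throughout, not "2014-2017"
        "wmf_ca_2014_2017.key",
        "labvirt-star.codfw.wmnet.crt",
    ),
]
for argv in commands:
    # Print each command for review instead of executing it.
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere, and the CSR step still prompts interactively for the subject fields when run for real.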
[22:51:40] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2833394 (fgiunchedi)
[22:52:18] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[22:53:33] Operations, ops-eqiad, DC-Ops, Patch-For-Review, User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2833400 (fgiunchedi) a:Cmjohnson
[22:54:29] Operations, Monitoring: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#2833404 (fgiunchedi)
[22:55:13] Operations, Labs, Monitoring: Setting up grafana should also setup Anonymous read-only access for the default org - https://phabricator.wikimedia.org/T143556#2833406 (fgiunchedi)
[22:56:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:56:57] !log demon@tin Synchronized php-1.29.0-wmf.4/autoload.php: logging fixes (duration: 00m 45s)
[22:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:55] Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2833414 (fgiunchedi) @Addshore doesn't look like it (checked on mw1161). @aaron this could go with the next train?
[22:58:39] !log demon@tin Synchronized php-1.29.0-wmf.4/includes/debug/logger/monolog: logging fixes (duration: 00m 45s)
[22:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:58] (CR) Jdlrobson: [C: 1] "Thanks for working this out." [mediawiki-config] - https://gerrit.wikimedia.org/r/324340 (https://phabricator.wikimedia.org/T151894) (owner: Alex Monk)
[23:01:22] Operations, Availability, Wikimedia-Incident: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2833421 (fgiunchedi)
[23:01:25] jdlrobson, are you able to get that deployed?
[23:01:58] Operations, Patch-For-Review: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2833423 (fgiunchedi) Open>declined We're sunsetting OCG
[23:02:10] Krenair, jdlrobson: I can deploy that right now
[23:02:11] if not I can do it tomorrow
[23:02:15] ty
[23:02:24] (CR) Chad: [C: 2] Follow-up I3b706396 and Id8c53f8f: Fix -labs variations of wmgM(F|Mobile) variables [mediawiki-config] - https://gerrit.wikimedia.org/r/324340 (https://phabricator.wikimedia.org/T151894) (owner: Alex Monk)
[23:02:35] Operations, Patch-For-Review: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2833428 (fgiunchedi)
[23:02:38] Operations, Services (watching): reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2833426 (fgiunchedi) Open>declined We're sunsetting OCG
[23:03:09] (Merged) jenkins-bot: Follow-up I3b706396 and Id8c53f8f: Fix -labs variations of wmgM(F|Mobile) variables [mediawiki-config] - https://gerrit.wikimedia.org/r/324340 (https://phabricator.wikimedia.org/T151894) (owner: Alex Monk)
[23:03:23] bah
[23:03:32] only after it's merged do I notice the silly error in the commit message
[23:03:33] oh well
[23:04:06] Meh, I thought you were about to say something important :p
[23:04:13] no
[23:04:27] !log demon@tin Synchronized wmf-config/InitialiseSettings-labs.php: prod no-op (duration: 00m 46s)
[23:04:28] I should've written "wmgM(F|obile)" in the commit message but added an M
[23:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:43] :p
[23:04:57] MMobile
[23:05:07] It's extra mobile
[23:06:44] just need to purge the broken redirecting URLs now jdlrobson
[23:07:24] I did en but es and presumably others are still broken in cache
[23:09:00] MFobile yall
[23:14:36] Operations, Parsing-Team, Release-Engineering-Team, HHVM, and 2 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2833467 (ssastry) Thanks @tstarling for @fgiunchedi for fixing the visibility. >>! In T151702#2831201, @Joe wrote: > In the long chain changeprop => restb...
[23:31:00] thanks Krenair for working that out and fixing <3
[23:43:05] Operations, HHVM: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421#2833595 (fgiunchedi) p:Triage>Normal
[23:45:08] Operations, Wikimedia-General-or-Unknown, hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2833614 (fgiunchedi) p:Triage>Normal
[23:55:29] Operations: Puppet CA rollover - https://phabricator.wikimedia.org/T150823#2833635 (fgiunchedi) Also re: the rollover, we should switch to deploying multiple CA certificates. This would also avoid refresh problems during rollover (e.g. `/etc/ssl/certs` not being updated in T150058 and T145609)
[23:55:51] Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2833638 (fgiunchedi) Open>Resolved a:fgiunchedi Resolving this in favour of {T150823} which handles the CA rollover.
[23:56:08] Operations, Beta-Cluster-Infrastructure, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2833652 (fgiunchedi)
[23:56:13] Operations, Beta-Cluster-Infrastructure, CirrusSearch, Discovery, Patch-For-Review: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2635904 (fgiunchedi) Open>Resolved a:fgiunchedi Resolving this in fav...
[23:57:48] Operations, Security-Team, Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2833662 (fgiunchedi) p:Triage>Normal
[23:58:11] Operations, Puppet, Discovery, Maps: Refactor puppet-postgresql module to use custom types - https://phabricator.wikimedia.org/T150020#2833663 (fgiunchedi) p:Triage>Low