[00:00:33] I can deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/469796/
[00:01:17] I guess that https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseLexeme/+/469707/ needs to go out with it?
[00:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:11] T208000: Parser.php: Call to a member function get() on a non-object (null) - https://phabricator.wikimedia.org/T208000
[00:02:49] twentyafterfour: yes
[00:03:02] twentyafterfour: no that patch doesnt actually need to go out
[00:03:12] but right now beta wikidata will be broken, until we change the config
[00:03:19] if I'm okay to go ahead I'll do my thing :)
[00:03:38] addshore: go ahead
[00:03:41] thanks!
[00:04:20] addshore: Or I can?
[00:06:30] I can :)
[00:07:08] (CR) Addshore: [C: 2] Define and specify lexeme NS for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469796 (owner: Addshore)
[00:07:16] Operations, MediaWiki Language Extension Bundle, MediaWiki-extensions-Translate, Language-Team (Language-2018-October-December), and 3 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (Jdforrester-WMF) Open>...
[00:07:29] twentyafterfour: So it looks like T207881 is the only train blocker left, and it's… mysterious.
[00:07:30] T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881
[00:08:43] (Merged) jenkins-bot: Define and specify lexeme NS for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469796 (owner: Addshore)
[00:11:09] It'd be lovely if my `fatalmonitor` screen wasn't *entirely* "error: entire web request took longer than …" except for the ~35th line, which is "error: request has exceeded memory limit…" instead.
[00:14:01] Operations, MediaWiki Language Extension Bundle, MediaWiki-extensions-Translate, Language-Team (Language-2018-October-December), and 2 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (MGChecker)
[00:14:28] * addshore tests on mwdebug
[00:14:47] James_F: do I need to read that ticket? I see a bunch of work has already been done on that ticket
[00:14:52] but it is of course wikidata related
[00:15:58] Which ticket?
[00:16:23] Oh, the timeout one? Yes, but that’s helping-SRE you.
[00:16:31] I want helping-SDC you.
[00:18:54] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Define and specify lexeme NS for wikidatawiki (duration: 00m 55s)
[00:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:11] James_F: yeah it is mysterious
[00:19:26] James_F: right, we need to check beta once that config patch is scapped out on beta :)
[00:20:03] twentyafterfour: Especially the bit where the spike happens *before* you deploy. :-)
[00:20:48] James_F: there have been several spikes. Some coincide with deployment and some don't
[00:20:58] Yeah.
[00:21:18] I'm starting to think there is a random spike and it was just coincidental that they happened near deploys
[00:21:31] Plausibly.
[00:22:06] addshore: Still no beta-mediawiki-config-update-eqiad for 469796, let alone a beta-scap-eqiad (as it's still doing the last scap).
[00:22:12] * James_F twiddles his thumbs.
[00:22:17] (PS6) Faidon Liambotis: Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236
[00:22:19] (PS4) Faidon Liambotis: Split SshAgentCommand type to Request/Response [software/keyholder] - https://gerrit.wikimedia.org/r/458237
[00:22:21] (PS4) Faidon Liambotis: Make pylint a little happier [software/keyholder] - https://gerrit.wikimedia.org/r/458238
[00:22:23] (PS4) Faidon Liambotis: Use mlockall() to avoid any potential swapping [software/keyholder] - https://gerrit.wikimedia.org/r/458239
[00:22:25] (PS4) Faidon Liambotis: Add permission checks for various commands [software/keyholder] - https://gerrit.wikimedia.org/r/458240
[00:22:27] (PS4) Faidon Liambotis: Verify the validity of signature requests [software/keyholder] - https://gerrit.wikimedia.org/r/458241
[00:22:29] (PS4) Faidon Liambotis: Implement SSH_AGENTC_LOCK/SSH_AGENTC_UNLOCK [software/keyholder] - https://gerrit.wikimedia.org/r/458242
[00:22:31] (PS4) Faidon Liambotis: Parse/build agent request/responses once [software/keyholder] - https://gerrit.wikimedia.org/r/458243
[00:22:33] (PS4) Faidon Liambotis: Refactor handle() [software/keyholder] - https://gerrit.wikimedia.org/r/458244
[00:22:35] (PS4) Faidon Liambotis: Add compatibility with Construct 2.8.22 and 2.9.45 [software/keyholder] - https://gerrit.wikimedia.org/r/458245
[00:22:37] (PS4) Faidon Liambotis: Switch path handling to pathlib.Path [software/keyholder] - https://gerrit.wikimedia.org/r/458246
[00:22:44] (PS4) Faidon Liambotis: Unlink the Unix domain socket when exiting [software/keyholder] - https://gerrit.wikimedia.org/r/458247
[00:22:46] (PS4) Faidon Liambotis: Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248
[00:22:52] (PS4) Faidon Liambotis: Stop spawning ssh-keygen but generate fps ourselves [software/keyholder] -
https://gerrit.wikimedia.org/r/458249
[00:23:12] twentyafterfour: James_F i dont know if that lock timeout one is an UBN
[00:23:16] (PS1) Faidon Liambotis: Reload the config on SIGHUP [software/keyholder] - https://gerrit.wikimedia.org/r/469807
[00:23:31] (CR) Faidon Liambotis: "(1) is now fixed with Iceba9407ea92781985d1b6327d489921ca7f0287." [software/keyholder] - https://gerrit.wikimedia.org/r/458236 (owner: Faidon Liambotis)
[00:23:41] we can get spikes in that if people try to create lots of entities at once
[00:24:55] James_F: ack, if you see it happen please give me a ping :)
[00:30:52] (CR) jenkins-bot: Define and specify lexeme NS for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469796 (owner: Addshore)
[00:32:16] addshore: I foolishly didn't load https://wikidata.beta.wmflabs.org/ to work out what was broken before, but it's… up.
[00:32:22] (CR) Thcipriani: [C: 2] Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236 (owner: Faidon Liambotis)
[00:32:23] addshore: (Sync finished.)
[00:32:31] i think I need another config patch
[00:32:43] Fun.
[00:32:50] I need to leave in ~10 mins.
[00:32:57] (PS1) Dzahn: icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782)
[00:33:34] (Merged) jenkins-bot: Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236 (owner: Faidon Liambotis)
[00:33:56] (CR) jerkins-bot: [V: -1] icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:35:03] (PS1) Addshore: On commons do not yet register any entity types.
[mediawiki-config] - https://gerrit.wikimedia.org/r/469809
[00:35:14] addshore: go ahead if you need to deploy another. After you're done, assuming everything is stable, I'm gonna try one more time for group2
[00:35:17] (PS2) Dzahn: icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782)
[00:35:23] James_F: ^^ thats the next one
[00:35:28] twentyafterfour: ack
[00:35:41] (CR) Jforrester: [C: 2] On commons do not yet register any entity types. [mediawiki-config] - https://gerrit.wikimedia.org/r/469809 (owner: Addshore)
[00:35:51] (PS3) Alexandros Kosiaris: Add chart to pod labels [deployment-charts] - https://gerrit.wikimedia.org/r/469661
[00:35:53] (PS3) Alexandros Kosiaris: Support canary functionality [deployment-charts] - https://gerrit.wikimedia.org/r/469662
[00:36:08] addshore: Even more fun for the move to CS.
[00:36:14] James_F: indeed
[00:36:15] (CR) jerkins-bot: [V: -1] icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:36:30] James_F: will you put that on mwdebug1002 for me?
[00:36:41] it doesn't look scary but i obviously want to make sure commons will be fine
[00:36:50] Does that end of Wikibase.php get run late enough for everything else touching Wikibase config to have run?
[00:36:59] (Merged) jenkins-bot: On commons do not yet register any entity types. [mediawiki-config] - https://gerrit.wikimedia.org/r/469809 (owner: Addshore)
[00:37:00] All the extension.jsons etc.?
[00:37:12] yes, the extension.jsons should already have run there
[00:37:26] addshore: Live on mwdebug1002.
[00:37:32] testing
[00:38:35] James_F: looks fine for me
[00:38:39] Kk.
[00:38:40] (CR) jerkins-bot: [V: -1] icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:38:50] im watching the logs
[00:39:05] I am too.
[00:39:24] Passed the canaries.
[00:39:33] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Post-SWAT: De-register all entities on WBMI installations calling themselves Commons I09e066f2 (duration: 00m 56s)
[00:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:42] (CR) Dzahn: "i dont know why jenkins-bot says that right now" [puppet] - https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:40:40] James_F: lovely
[00:40:45] James_F: lets wait for that to get to beta
[00:41:16] (PS5) Dzahn: Switch srvdumps rsync module to auto_ferm [puppet] - https://gerrit.wikimedia.org/r/467978 (owner: Muehlenhoff)
[00:44:19] (PS6) Dzahn: phabricator: Switch srvdumps rsync module to auto_ferm [puppet] - https://gerrit.wikimedia.org/r/467978 (owner: Muehlenhoff)
[00:45:27] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13215/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/467978 (owner: Muehlenhoff)
[00:45:57] James_F: did it happen yet?
[00:46:07] twentyafterfour: we are only waiting for beta now, you can try the train if you want =]
[00:46:21] Nope.
[00:46:28] Yeah, sorry twentyafterfour, go forth.
[00:47:40] (CR) Dzahn: [C: 2] "iptables rules still looking fine:" [puppet] - https://gerrit.wikimedia.org/r/467978 (owner: Muehlenhoff)
[00:47:52] ok
[00:48:26] (PS1) 20after4: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469812
[00:48:30] (CR) 20after4: [C: 2] group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469812 (owner: 20after4)
[00:48:31] thanks addshore and James_F
[00:48:34] (CR) jenkins-bot: On commons do not yet register any entity types. [mediawiki-config] - https://gerrit.wikimedia.org/r/469809 (owner: Addshore)
[00:48:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:49:03] addshore: Should now be on BC.
[00:49:52] (Merged) jenkins-bot: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469812 (owner: 20after4)
[00:53:33] James_F: we will finish cleaning this stuff up next week
[00:53:43] things are not quite as they should be on beta commons
[00:53:48] Yeah.
[00:53:51] Such is life.
[00:53:55] Speak on Monday!
[00:54:01] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.33.0-wmf.1 refs T206655
[00:54:37] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@e9392f4]: Re-deploy Updater to deal with performance issues
[00:55:06] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[00:55:07] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655
[00:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:43] RECOVERY - Request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:56:06] silly stashbot
[00:56:12] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.33.0-wmf.1 refs T206655
[00:56:35] mw1267
[00:56:37] hmm Notice: Undefined index: time in /srv/mediawiki/php-1.33.0-wmf.1/extensions/Flow/includes/Formatter/AbstractFormatter.php on line 77
[00:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:00] addshore: do you mind making a note on T207881 that it doesn't seem like a blocker?
[00:59:00] T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881
[00:59:08] twentyafterfour: can do
[00:59:13] the error rate is low enough now that I'm comfortable with it I think
[00:59:46] but it is still the most frequent error
[01:00:08] twentyafterfour: done
[01:00:18] but only 13 occurrences since I promoted group2
[01:00:18] twentyafterfour: really? it doesn't look too frequent to me right now?
[01:00:29] do you have a logstash link with the filter your looking ta?
[01:00:30] at
[01:00:54] I see ~20 in the last 10 mins ?
[01:01:02] addshore: that's about right
[01:01:06] and they were all in a 2 min period
[01:01:14] I have a lot of filters on: https://logstash.wikimedia.org/goto/44d571cf78ababd031dc61ff6ae29d14
[01:01:19] yup, okay, I think thats fine :)
[01:01:47] right, im going to tap out for the day
[01:01:55] thanks addshore!
[01:04:58] (CR) jenkins-bot: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469812 (owner: 20after4)
[01:06:05] twentyafterfour: let me know when you are done with deploy :)
[01:07:24] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:07:54] bawolff_: I'm done
[01:08:02] though that ^ doesn't look good
[01:09:04] :S
[01:09:17] seems like it's another transient spike so go ahead bawolff_
[01:09:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:10:14] (CR) Brian Wolff: [C: 2] Enable CSP-report-only for logged in/session having users on enwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/469800 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:11:37] (Merged) jenkins-bot: Enable CSP-report-only for logged in/session having users on enwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/469800 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:12:00] I'm going to deploy csp-report only to enwikiquote, and if that goes well, a bunch of other places too
[01:13:36] (CR) Gergő Tisza: Enable CSP-report-only for logged in/session having users on enwikiquote (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/469800 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:19:56] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207900 - deploy CSP to people with session on enwikiquote (duration: 00m 55s)
[01:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:03] T207900: Enable
csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[01:21:18] !log bawolff@deploy1001 Synchronized wmf-config/CommonSettings.php: T207900 - deploy CSP to people with session on enwikiquote (duration: 00m 54s)
[01:21:21] (CR) jenkins-bot: Enable CSP-report-only for logged in/session having users on enwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/469800 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:19] Count
[01:22:21] [{exception_id}] {exception_url} ErrorException from line 3750 of /srv/mediawiki/wmf-config/CommonSettings.php: PHP Notice: Undefined variable: wgCommandLineMode
[01:22:29] ugh, is that not defined yet?
[01:22:34] I don't need to check it anyways
[01:23:25] no, it should be defined at that point
[01:24:14] i figured it out
[01:24:34] Not sure how i missed that testing locally
[01:25:13] (PS1) Brian Wolff: Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814
[01:25:25] (CR) jerkins-bot: [V: -1] Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814 (owner: Brian Wolff)
[01:25:37] (PS2) Brian Wolff: Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814
[01:25:46] (CR) Brian Wolff: [C: 2] Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814 (owner: Brian Wolff)
[01:26:05] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@e9392f4]: Re-deploy Updater to deal with performance issues (duration: 31m 28s)
[01:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:08] (Merged) jenkins-bot: Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814 (owner: Brian Wolff)
[01:28:47] !log bawolff@deploy1001 Synchronized wmf-config/CommonSettings.php: Ia518c031
(duration: 00m 55s)
[01:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:32] Hmm, basically no difference in log stash traffic from that patch, that's a good sign
[01:31:54] I was expecting much more to be honest :D
[01:33:27] So i think i can skip going to just all wikiquotes and right to group1
[01:35:54] (PS1) Brian Wolff: Enable CSP report only for session users on group1 wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469815
[01:36:18] (PS2) Brian Wolff: Enable CSP report only for session users on group1 wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469815 (https://phabricator.wikimedia.org/T207900)
[01:36:38] (CR) Brian Wolff: [C: 2] Enable CSP report only for session users on group1 wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469815 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:37:06] (CR) jenkins-bot: Fix stupid typo in 7d05a920b0f [mediawiki-config] - https://gerrit.wikimedia.org/r/469814 (owner: Brian Wolff)
[01:37:53] (Merged) jenkins-bot: Enable CSP report only for session users on group1 wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469815 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:41:16] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207900 (b74911f6201) enable csp users with session all group1 wikis (duration: 00m 55s)
[01:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:19] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[01:50:13] Oh looks like math is loading directly from wikimedia.org
[01:52:12] (CR) jenkins-bot: Enable CSP report only for session users on group1 wikis.
[mediawiki-config] - https://gerrit.wikimedia.org/r/469815 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[01:58:41] (PS1) Brian Wolff: Enable CSP report only with session on medium sized wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469816 (https://phabricator.wikimedia.org/T207900)
[01:59:00] upwards and onwards to medium sized wikis!
[02:00:51] (CR) Brian Wolff: [C: 2] Enable CSP report only with session on medium sized wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469816 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:02:12] (Merged) jenkins-bot: Enable CSP report only with session on medium sized wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/469816 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:05:41] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bd55034d122 - T207900 - enable CSP report only for users w/session on medium wikis (duration: 00m 55s)
[02:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:45] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[02:07:39] (CR) jenkins-bot: Enable CSP report only with session on medium sized wikis.
[mediawiki-config] - https://gerrit.wikimedia.org/r/469816 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:12:33] (PS1) Brian Wolff: Enable csp report only for users w/session on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469817 (https://phabricator.wikimedia.org/T207900)
[02:13:04] (CR) Brian Wolff: [C: 2] Enable csp report only for users w/session on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469817 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:14:23] (Merged) jenkins-bot: Enable csp report only for users w/session on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469817 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:19:13] (PS3) Mathew.onipe: elasticsearch: cookbook for service rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[02:20:05] (CR) jerkins-bot: [V: -1] elasticsearch: cookbook for service rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[02:22:27] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d743db261 - T207900 - enable CSP report only for users w/session arwiki (duration: 00m 54s)
[02:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:31] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[02:22:57] (CR) jenkins-bot: Enable csp report only for users w/session on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469817 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:25:04] (PS4) Mathew.onipe: elasticsearch: cookbook for service rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[02:25:19] (PS7) Mathew.onipe: elasticsearch_cluster:
multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[02:26:36] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[02:26:50] huh, surprisingly, enabling on arwiki is barely a dent in the log volume
[02:27:04] I'm really shocked at how low the log volume is for enabling csp for logged in users only
[02:27:15] (CR) jerkins-bot: [V: -1] elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[02:31:20] Well I guess onwards and upwards to other large wikis
[02:38:31] (PS1) Brian Wolff: Enable csp report only for users w/session on a bunch of big wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/469819 (https://phabricator.wikimedia.org/T207900)
[02:38:48] ok, so if this doesn't cause any logging traffic issues, I'm going to do enwiki next, and then all wikis
[02:40:38] (CR) Brian Wolff: [C: 2] Enable csp report only for users w/session on a bunch of big wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/469819 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:41:53] (Merged) jenkins-bot: Enable csp report only for users w/session on a bunch of big wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/469819 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:43:50] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a8aa9d6aae - T207900 - enable CSP report only for users w/session fawiki, frwiki, svwiki, eswiki, ruwiki, zhwiki, dewiki (duration: 00m 56s)
[02:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:54] T207900: Enable
csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[02:47:15] Hmm, I wish i knew how to make pretty pie charts of a random field in a logstash search
[02:47:18] that'd be cool
[02:51:18] (PS1) Brian Wolff: Enable csp report only for users w/session on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469820 (https://phabricator.wikimedia.org/T207900)
[02:51:38] (CR) Brian Wolff: [C: 2] Enable csp report only for users w/session on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469820 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:53:19] (Merged) jenkins-bot: Enable csp report only for users w/session on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469820 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:54:44] (CR) jenkins-bot: Enable csp report only for users w/session on a bunch of big wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/469819 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:54:46] (CR) jenkins-bot: Enable csp report only for users w/session on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/469820 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[02:54:48] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 745d0b61 - T207900 - enable CSP report only for users w/session enwiki (duration: 00m 53s)
[02:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:51] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[03:01:06] !log bawolff@deploy1001 Synchronized wmf-config/CommonSettings.php: T207900 - Add wikimedia.org (no subdomain) to allow list for math (duration: 00m 53s)
[03:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:18] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[03:03:12] (CR) Thcipriani: [C: 2] Split SshAgentCommand type to Request/Response [software/keyholder] - https://gerrit.wikimedia.org/r/458237 (owner: Faidon Liambotis)
[03:04:04] (Merged) jenkins-bot: Split SshAgentCommand type to Request/Response [software/keyholder] - https://gerrit.wikimedia.org/r/458237 (owner: Faidon Liambotis)
[03:06:04] (PS1) Brian Wolff: Enable csp report only for people with session everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/469821 (https://phabricator.wikimedia.org/T207900)
[03:07:03] (CR) Brian Wolff: [C: 2] Enable csp report only for people with session everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/469821 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[03:08:02] Apparently csp-report is only 0.11% of api requests
[03:08:05] (Merged) jenkins-bot: Enable csp report only for people with session everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/469821 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[03:09:01] oh, i forgot to git pull, no wonder that didn't affect the graphs
[03:10:03] (CR) jenkins-bot: Enable csp report only for people with session everywhere.
[mediawiki-config] - https://gerrit.wikimedia.org/r/469821 (https://phabricator.wikimedia.org/T207900) (owner: Brian Wolff)
[03:10:24] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 745d0b61 - T207900 - enable CSP report only for users w/session enwiki (duration: 00m 55s)
[03:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:35] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[03:13:21] looks like enwiki has a rate of 7 reports/second
[03:15:38] (CR) Thcipriani: [C: 2] Make pylint a little happier [software/keyholder] - https://gerrit.wikimedia.org/r/458238 (owner: Faidon Liambotis)
[03:16:17] (Merged) jenkins-bot: Make pylint a little happier [software/keyholder] - https://gerrit.wikimedia.org/r/458238 (owner: Faidon Liambotis)
[03:17:33] and that the rate of cspreports is below the rate of request for ?action=help
[03:19:12] So yeah, i was probably much more cautious with that than i needed to be
[03:19:23] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[03:19:52] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bc9b863e - T207900 - enable CSP report only for users w/session everywhere (duration: 00m 55s)
[03:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:56] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900
[03:28:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 768.67 seconds
[03:30:33] RECOVERY - Request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:12:41] !log deploy patch T207916
[04:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:21] There's a bit of an increase in timeouts
[04:21:28] I think its still within normal bounds
[04:23:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.44 seconds
[04:33:56] yeah, totally back to normal now
[04:35:56] (PS1) Brian Wolff: Enable logging authentication log to udp2log only [mediawiki-config] - https://gerrit.wikimedia.org/r/469822 (https://phabricator.wikimedia.org/T207916)
[04:39:55] (CR) Brian Wolff: [C: 2] Enable logging authentication log to udp2log only [mediawiki-config] - https://gerrit.wikimedia.org/r/469822 (https://phabricator.wikimedia.org/T207916) (owner: Brian Wolff)
[04:41:14] (Merged) jenkins-bot: Enable logging authentication log to udp2log only [mediawiki-config] - https://gerrit.wikimedia.org/r/469822 (https://phabricator.wikimedia.org/T207916) (owner: Brian Wolff)
[04:43:34] this is probably a little paranoid given the size of api.log
[04:44:23] (CR) jenkins-bot: Enable logging authentication log to udp2log only [mediawiki-config] - https://gerrit.wikimedia.org/r/469822 (https://phabricator.wikimedia.org/T207916) (owner: Brian Wolff)
[04:50:27] And i forgot to git add...
[04:51:41] (PS1) Brian Wolff: Follow-up d3b2c346b. [mediawiki-config] - https://gerrit.wikimedia.org/r/469824
[04:51:47] ugh, that's embarrassing
[04:52:05] (CR) Brian Wolff: [C: 2] Follow-up d3b2c346b. [mediawiki-config] - https://gerrit.wikimedia.org/r/469824 (owner: Brian Wolff)
[04:53:18] (Merged) jenkins-bot: Follow-up d3b2c346b.
[mediawiki-config] - https://gerrit.wikimedia.org/r/469824 (owner: Brian Wolff)
[04:58:51] !log depooled wdqs1003 again, let's see if it helps it catch up now
[04:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:53] (CR) jenkins-bot: Follow-up d3b2c346b. [mediawiki-config] - https://gerrit.wikimedia.org/r/469824 (owner: Brian Wolff)
[05:07:43] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d3b2c346 T207916 (duration: 00m 55s)
[05:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:04] PROBLEM - High lag on wdqs1003 is CRITICAL: 2.002e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[05:17:15] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (Smalyshev) @Gehel, @Mathew.onipe could you check out what's up with this?
[05:19:04] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:20:03] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.86 ms [05:26:00] (03PS2) 10Muehlenhoff: Remove sarin/neodymium from network constants/tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/466830 [05:26:47] (03PS1) 10Elukey: role::statistics::private: move reportupdater to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469826 (https://phabricator.wikimedia.org/T205846) [05:30:24] (03CR) 10Elukey: [C: 032] role::statistics::private: move reportupdater to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469826 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [05:38:35] ok, I'm going to deploy the new patch on T207916 [06:01:59] !log adjust patch for T207916 [06:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:04] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:08:24] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:09:33] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 76976 bytes in 7.079 second response time [06:11:35] (03CR) 10Elukey: git-sync-upstream: Send cron mail in case of failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468865 (https://phabricator.wikimedia.org/T184261) (owner: 10GTirloni) [06:12:47] (03CR) 10Elukey: "test" [dns] - 10https://gerrit.wikimedia.org/r/467703 (owner: 10Volans) [06:14:08] (03PS2) 10Elukey: Add missing 
AAAA records for aqs eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467703 (owner: 10Volans) [06:16:19] (03CR) 10Elukey: [C: 032] Add missing AAAA records for aqs eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467703 (owner: 10Volans) [06:16:28] (03PS2) 10Elukey: Add missing AAAA record for matomo eqiad host [dns] - 10https://gerrit.wikimedia.org/r/467704 (owner: 10Volans) [06:17:03] (03CR) 10Elukey: [C: 032] Add missing AAAA record for matomo eqiad host [dns] - 10https://gerrit.wikimedia.org/r/467704 (owner: 10Volans) [06:19:19] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (10Mathew.onipe) @Smalyshev This has been resolved. We had some permission issues initially when we depl... [06:19:23] (03PS1) 10Brian Wolff: Enable authorization log on group0 + donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469833 (https://phabricator.wikimedia.org/T207916) [06:19:46] (03CR) 10Brian Wolff: [C: 032] Enable authorization log on group0 + donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469833 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:19:49] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: wdqs1009 - cannot create /var/log/wdqs/wdqs_autodeployment.log - https://phabricator.wikimedia.org/T206318 (10Smalyshev) 05Open>03Resolved a:03Smalyshev [06:21:26] (03Merged) 10jenkins-bot: Enable authorization log on group0 + donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469833 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:22:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 
(10Smalyshev) Data loading test launched for t206636-3 and shou... [06:24:22] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ecf579e9f9 - T207916 - enable auth log group0 (duration: 00m 55s) [06:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:14] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:23] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/get-raid-status-megacli] [06:32:24] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2017-rsa-unified.crt] [06:32:54] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apache-status] [06:33:09] (03CR) 10jenkins-bot: Enable authorization log on group0 + donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469833 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:33:28] !log uploaded openjdk-8 backport for recent Java 8 security updates to apt.wikimedia.org/jessie-wikimedia [06:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:03] (03PS1) 10Brian Wolff: Enable authorization log on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469837 (https://phabricator.wikimedia.org/T207916) [06:47:54] (03CR) 10Brian Wolff: [C: 032] Enable authorization log on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469837 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:49:19] (03Merged) 10jenkins-bot: Enable authorization log on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469837 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:49:32] (03CR) 10jenkins-bot: Enable authorization log on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469837 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [06:51:14] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207916 13b993ab9f - auth log on in group1 (duration: 00m 54s) [06:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:53] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:33] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 
failures [07:08:03] (03CR) 10Ema: [C: 031] "One nit, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469685 (owner: 10Gehel) [07:08:43] (03PS1) 10Brian Wolff: Enable auth log on arwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469840 (https://phabricator.wikimedia.org/T207916) [07:09:03] (03CR) 10Brian Wolff: [C: 032] Enable auth log on arwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469840 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [07:10:19] (03Merged) 10jenkins-bot: Enable auth log on arwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469840 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [07:11:37] (03CR) 10Ema: [C: 031] "I think we'll need to be careful with the deployment of this as mentioned inline. That said, the code LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/469686 (owner: 10Gehel) [07:13:25] (03CR) 10Ema: [C: 031] wdqs: remove wdqs1003 from public cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469687 (owner: 10Gehel) [07:15:53] (03CR) 10Ema: [C: 031] "Again a note about the deployment of this change, LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/469688 (owner: 10Gehel) [07:16:39] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207916 13b993ab9f - auth log on in arwiki (duration: 00m 54s) [07:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:03] PROBLEM - High lag on wdqs1003 is CRITICAL: 1.118e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:20:17] (03CR) 10jenkins-bot: Enable auth log on arwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469840 (https://phabricator.wikimedia.org/T207916) (owner: 10Brian Wolff) [07:20:33] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [07:23:23] (03CR) 10Filippo Giunchedi: [C: 031] When absenting an rsyncd module, also remove the ferm service [puppet] - 10https://gerrit.wikimedia.org/r/469629 (owner: 10Muehlenhoff) [07:23:59] (03CR) 10Filippo Giunchedi: [C: 031] Convert udp2log::rsyncd to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/469627 (owner: 10Muehlenhoff) [07:24:14] (03PS1) 10Elukey: Add info about how to build to README.Debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/469844 [07:25:05] (03CR) 10Elukey: [C: 032] Add info about how to build to README.Debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/469844 (owner: 10Elukey) [07:26:03] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:26:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: Switch wdqs1003 with one of the internal wdqs cluster - https://phabricator.wikimedia.org/T207947 (10ema) [07:26:24] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:27:44] (03CR) 10Ema: "This can be abandoned I assume, being replaced by other commits." 
[puppet] - 10https://gerrit.wikimedia.org/r/469649 (https://phabricator.wikimedia.org/T207947) (owner: 10Gehel) [07:39:06] (03Abandoned) 10Gehel: wdqs: switch wdqs1003 and wdqs1006 from public vs internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/469649 (https://phabricator.wikimedia.org/T207947) (owner: 10Gehel) [07:44:24] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1976 bytes in 0.099 second response time [07:44:53] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1976 bytes in 0.076 second response time [07:45:04] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1976 bytes in 0.059 second response time [07:47:12] sorry, that was me [07:47:23] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 77054 bytes in 1.764 second response time [07:47:53] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.242 second response time [07:48:02] (03PS2) 10Gehel: wdqs: remove wdqs1006 from internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/469685 (https://phabricator.wikimedia.org/T207947) [07:48:06] (03PS2) 10Gehel: wdqs: add wdqs1006 to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/469686 (https://phabricator.wikimedia.org/T207947) [07:48:08] (03PS2) 10Gehel: wdqs: remove wdqs1003 from public cluster [puppet] - 10https://gerrit.wikimedia.org/r/469687 (https://phabricator.wikimedia.org/T207947) [07:48:10] (03PS2) 10Gehel: wdqs: add wdqs1003 to internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/469688 (https://phabricator.wikimedia.org/T207947) [07:48:14] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.076 second response time [07:51:09] !log adjust patch T207916 [07:51:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:37] (03CR) 10Gehel: wdqs: add wdqs1006 to public cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469686 (https://phabricator.wikimedia.org/T207947) (owner: 10Gehel) [07:52:45] (03CR) 10Gehel: wdqs: add wdqs1003 to internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469688 (https://phabricator.wikimedia.org/T207947) (owner: 10Gehel) [07:54:31] 10Operations, 10DNS, 10GitHub-Mirrors, 10Traffic, and 2 others: Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10jijiki) p:05Triage>03Normal [07:55:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10jijiki) p:05Triage>03Normal [08:03:15] (03PS1) 10Filippo Giunchedi: grafana: better configuration handling [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) [08:04:35] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10jijiki) p:05Triage>03Normal [08:07:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [08:08:14] Sorry, I think that was my secret security patch [08:08:56] 10Operations, 10Icinga, 10fundraising-tech-ops, 10monitoring: Why doesn't icinga notify the team-fr-tech contact for services in WARNING state? - https://phabricator.wikimedia.org/T207966 (10jijiki) p:05Triage>03Normal [08:11:49] it should be better now [08:14:27] (03CR) 10Muehlenhoff: "The require_package change works for now as krypton is running on jessie. 
On jessie we still have the default thirdparty component where e" [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) (owner: 10Filippo Giunchedi) [08:22:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [08:24:02] (03PS1) 10Alex Monk: shinken beta cluster: Unsubscribe myself from email alerts [puppet] - 10https://gerrit.wikimedia.org/r/469851 [08:25:46] 10Operations, 10Operations-Software-Development: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 (10MoritzMuehlenhoff) Interesting, that's some fallout from the Python 3 migration, will have a look. [08:26:20] (03PS2) 10Filippo Giunchedi: grafana: better configuration handling [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) [08:26:59] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) (owner: 10Filippo Giunchedi) [08:34:54] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Language-Team (Language-2018-October-December), and 2 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Trizek-WMF) >>! In T207930#... 
[08:35:31] bawolff, I suspect a lot of people largely ignore the dev console anyway [08:36:15] Krenair: I only emailed because there was a thread on WP:VPT [08:36:32] yeah it's a good idea [08:36:36] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T207958 (10jijiki) [08:36:38] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10jijiki) [08:36:45] just don't think many will notice, but we'll see [08:37:31] I think you're probably right [08:38:44] (03CR) 10Muehlenhoff: [C: 031] "Looks good. One comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) (owner: 10Filippo Giunchedi) [08:46:11] bawolff, with the preference suggestion on VPT... I wonder if we really want to allow such a thing for all users [08:46:38] like should admins, interface admins, etc. and above really be able to load external/other user's JS in their session? 
[08:46:42] I was going to propose the preference thing as an RFC for discussion [08:46:59] To be clear, preference was going to allow things in default-src [08:47:19] script-src will (in glorious future) always be prevented [08:47:50] although that won't be super effective until unsafe-inline is removed, which is very far off [08:48:16] maybe you could go down the route of a mode where you can only do certain things if you unload external scripts [08:48:22] (03CR) 10Filippo Giunchedi: grafana: better configuration handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) (owner: 10Filippo Giunchedi) [08:48:46] Specifically, I think it will be a hard sell to totally ban external apis (like CVN network) [08:49:01] and in principle if the script is only loading some json and not executing it that's fine [08:49:18] yeah [08:49:32] it's the eval that's problematic [08:50:51] yeah, much of the goal of this is incremental improvements as we can, as opposed to trying to do something perfect, that is unfeasible in terms of getting community to not hate me [08:52:01] or at least boil the lobster slowly [08:53:15] (03Abandoned) 10Alex Monk: shinken beta cluster: Unsubscribe myself from email alerts [puppet] - 10https://gerrit.wikimedia.org/r/469851 (owner: 10Alex Monk) [08:53:58] (03PS3) 10Filippo Giunchedi: grafana: better configuration handling [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) [08:57:07] (03CR) 10Filippo Giunchedi: [C: 032] grafana: better configuration handling [puppet] - 10https://gerrit.wikimedia.org/r/469847 (https://phabricator.wikimedia.org/T208010) (owner: 10Filippo Giunchedi) [09:03:49] Hi, can you change on integration.wikimedia.org/zuul php5 to php7? [09:04:11] I think on php5 Jobs manually triggered by whitelisted users commenting 'check php'. Useful for running PHP tests that are only part of gate-and-submit. 
[09:18:33] (03CR) 10ArielGlenn: "In this case two will match, but maybe that's ok (both the dumps and pagecounts stanzas allow the same set of servers on the same port)." [puppet] - 10https://gerrit.wikimedia.org/r/467985 (owner: 10Muehlenhoff) [09:19:45] (03CR) 10ArielGlenn: "I plan to merge by the end of the weekend unless I hear objections." [puppet] - 10https://gerrit.wikimedia.org/r/468059 (https://phabricator.wikimedia.org/T147169) (owner: 10Hoo man) [09:27:29] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10elukey) https://www.thegeekdiary.com/replacing-a-failed-mirror-disk-in-a-software-raid-array-mdadm/ is a good reference about how to swap the disk [09:31:00] (03PS1) 10Muehlenhoff: Kerberos client (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/469853 [09:32:30] (03PS1) 10Elukey: profile::statistics::private: move geoip archive to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469854 (https://phabricator.wikimedia.org/T205846) [09:45:45] (03CR) 10Elukey: [C: 032] profile::statistics::private: move geoip archive to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469854 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [09:59:02] (03PS1) 10Elukey: profile::statistics::private: move geoip to stat1007 - p2 [puppet] - 10https://gerrit.wikimedia.org/r/469855 (https://phabricator.wikimedia.org/T205846) [09:59:59] (03CR) 10Elukey: [C: 032] profile::statistics::private: move geoip to stat1007 - p2 [puppet] - 10https://gerrit.wikimedia.org/r/469855 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [10:15:51] 10Operations, 10Patch-For-Review: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (10jijiki) p:05Triage>03Normal [10:16:14] PROBLEM - Nginx local proxy to apache on mw1333 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [10:16:29] 10Operations, 
10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10jijiki) p:05Triage>03Normal [10:16:33] PROBLEM - HHVM rendering on mw1333 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [10:16:53] PROBLEM - Apache HTTP on mw1333 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [10:16:56] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10jijiki) p:05Triage>03Normal [10:17:34] RECOVERY - HHVM rendering on mw1333 is OK: HTTP OK: HTTP/1.1 200 OK - 76972 bytes in 0.368 second response time [10:17:54] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.051 second response time [10:18:33] RECOVERY - Nginx local proxy to apache on mw1333 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [10:19:36] 10Operations: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10jijiki) p:05Triage>03Normal [10:19:44] 10Operations, 10DBA, 10MediaWiki-extensions-Translate, 10Performance-Team, and 2 others: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10jijiki) p:05Triage>03Normal [10:20:35] 10Operations, 10Certcentral, 10monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (10jijiki) p:05Triage>03Normal [10:21:41] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 
(10jijiki) [10:22:13] 10Operations, 10Certcentral, 10Icinga, 10monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (10jijiki) [10:22:59] 10Operations, 10Performance-Team, 10Traffic: Investigate 200-300ms increase in responseStart.p75 - https://phabricator.wikimedia.org/T207315 (10jijiki) p:05Triage>03Normal [10:27:45] 10Operations, 10SRE-Access-Requests: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (10jijiki) a:03jijiki [10:27:57] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10jijiki) a:03jijiki [10:28:22] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10jijiki) a:05jijiki>03None [10:29:33] * jijiki lunch [10:48:02] (03CR) 10Alex Monk: [C: 032] certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [10:49:52] (03Merged) 10jenkins-bot: certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [10:51:39] (03CR) 10Muehlenhoff: "Yeah, that seems fine (and will also work out transparently if e.g. 
either of dumps or pagecounts get moved to a different host)" [puppet] - 10https://gerrit.wikimedia.org/r/467985 (owner: 10Muehlenhoff) [10:51:57] (03CR) 10jenkins-bot: certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [11:08:49] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds [11:09:48] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.002 second response time [11:12:00] that was fast [11:49:20] (03PS1) 10Rxy: Remove global action related permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469864 (https://phabricator.wikimedia.org/T208035) [11:56:27] (03PS2) 10Muehlenhoff: Kerberos client (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/469853 [12:17:39] (03PS3) 10Muehlenhoff: Add initial profile for Kerberos client [puppet] - 10https://gerrit.wikimedia.org/r/469853 [12:21:54] (03PS1) 10Muehlenhoff: Fix syntax in Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/469869 (https://phabricator.wikimedia.org/T208032) [12:24:07] for those wondering what the thumbor page was about: T187765 [12:24:08] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [12:27:03] (03CR) 10Muehlenhoff: [C: 032] Fix syntax in Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/469869 (https://phabricator.wikimedia.org/T208032) (owner: 10Muehlenhoff) [12:45:35] 10Operations, 10Icinga, 10fundraising-tech-ops, 10monitoring: Why doesn't icinga notify the team-fr-tech-ops contact for services in WARNING state? - https://phabricator.wikimedia.org/T207966 (10Jgreen) [12:45:37] (03CR) 10Elukey: [C: 031] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/469853 (owner: 10Muehlenhoff) [12:46:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:48:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:55:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:56:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:01:23] the above looks like it is a bunch of errors on commons API for MediaWiki::restInPeace [13:02:35] !log depool wdqs1003 to catch up on updates [13:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:51] or not, looks more like the long transactions exceeding the threshold [13:10:00] (03PS2) 10Muehlenhoff: When absenting an rsyncd module, also remove the ferm service [puppet] - 10https://gerrit.wikimedia.org/r/469629 [13:14:53] (03CR) 10Muehlenhoff: [C: 032] When absenting an rsyncd module, also remove the ferm service [puppet] - 10https://gerrit.wikimedia.org/r/469629 (owner: 10Muehlenhoff) [13:16:46] 10Operations, 10Patch-For-Review: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (10elukey) 05Open>03Resolved a:03elukey This has been 
done, closing! [13:17:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:17:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:21:18] (03PS1) 10Ema: ATS: check HTTP responses from prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/469875 (https://phabricator.wikimedia.org/T204209) [13:25:35] (03PS2) 10Ema: ATS: check HTTP responses from prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/469875 (https://phabricator.wikimedia.org/T204232) [13:26:34] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:17] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10MoritzMuehlenhoff) If we can confirm that this works fine on icinga1001, we can report it to Debian by using the repo... 
[13:32:04] (03PS1) 10Andrew Bogott: Horizon: disable Analytics project in eqiad region, enable in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/469876 (https://phabricator.wikimedia.org/T207715) [13:33:06] (03CR) 10Andrew Bogott: [C: 032] Horizon: disable Analytics project in eqiad region, enable in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/469876 (https://phabricator.wikimedia.org/T207715) (owner: 10Andrew Bogott) [13:59:16] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) [14:02:43] PROBLEM - Nginx local proxy to apache on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [14:02:44] PROBLEM - Apache HTTP on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [14:03:44] RECOVERY - Nginx local proxy to apache on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [14:03:53] RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [14:06:22] 10Operations, 10ops-codfw: unrack/decom cr1-eqord - https://phabricator.wikimedia.org/T208049 (10Papaul) p:05Triage>03Normal [14:06:53] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) [14:07:31] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) [14:08:28] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) 05Open>03Resolved [14:16:04] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 1.155 second response time [14:18:34] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in 
prod into their own instances - https://phabricator.wikimedia.org/T207536 (10MoritzMuehlenhoff) >>! In T207536#4695685, @ayounsi wrote: > Should the next step here to make an exhaustive list of... [14:19:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:24] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:33] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 76891 bytes in 0.409 second response time [14:24:44] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10jijiki) 05Open>03Resolved [14:25:13] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10jijiki) 05Open>03Resolved [14:27:08] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10jijiki) 05Open>03Resolved Removed from `wmde`. A new task will be created for adding him to `wmf` either way. 
[14:29:46] (03PS2) 10Bstorm: sonofgridengine: Add new roles for stretch grid web nodes [puppet] - 10https://gerrit.wikimedia.org/r/469790 (https://phabricator.wikimedia.org/T200557) [14:30:56] (03CR) 10Bstorm: [C: 032] sonofgridengine: Add new roles for stretch grid web nodes [puppet] - 10https://gerrit.wikimedia.org/r/469790 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [14:35:00] (03PS1) 10Effie Mouzeli: admin: Updated key for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/469884 (https://phabricator.wikimedia.org/T207330) [14:42:18] (03PS2) 10Effie Mouzeli: admin: Updated key for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/469884 (https://phabricator.wikimedia.org/T207330) [14:47:17] (03PS1) 10Reedy: Don't allow sysop users to disable 2FA for other users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469886 (https://phabricator.wikimedia.org/T195207) [14:51:56] (03PS1) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [14:52:30] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [14:53:41] (03CR) 10Reedy: [C: 032] Don't allow sysop users to disable 2FA for other users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469886 (https://phabricator.wikimedia.org/T195207) (owner: 10Reedy) [14:54:53] (03CR) 10Effie Mouzeli: [C: 031] Remove *.cz redirects [puppet] - 10https://gerrit.wikimedia.org/r/467088 (https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [14:54:56] (03Merged) 10jenkins-bot: Don't allow sysop users to disable 2FA for other users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469886 (https://phabricator.wikimedia.org/T195207) (owner: 10Reedy) [14:55:10] (03CR) 10Effie Mouzeli: [C: 031] Remove *.cz [dns] - 10https://gerrit.wikimedia.org/r/467087 
(https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [14:56:27] (03CR) 10Effie Mouzeli: [C: 032] admin: Updated key for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/469884 (https://phabricator.wikimedia.org/T207330) (owner: 10Effie Mouzeli) [14:56:27] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Prevent sysops from disabling 2FA for other users as part of upcoming feature (duration: 00m 53s) [14:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:03] (03PS1) 10Elukey: profile::statistics::private: move geoip archive to another dir [puppet] - 10https://gerrit.wikimedia.org/r/469890 (https://phabricator.wikimedia.org/T208028) [14:59:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10jijiki) 05Open>03Resolved [15:05:29] (03CR) 10jenkins-bot: Don't allow sysop users to disable 2FA for other users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469886 (https://phabricator.wikimedia.org/T195207) (owner: 10Reedy) [15:07:32] (03CR) 10Filippo Giunchedi: [C: 031] Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/469630 (owner: 10Muehlenhoff) [15:08:54] (03CR) 10Cwhite: [C: 031] "I'm curious to see if fping is indeed faster as my searching rendered it was faster for parallel ping checks, but not necessarily faster f" [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:10:13] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.319 second response time [15:10:42] (03PS2) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:11:15] (03CR) 10Cwhite: "It seems likely that this variable will only be one of two variables." 
[puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:11:22] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:13:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:33] (03PS3) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:19:04] PROBLEM - High lag on wdqs1003 is CRITICAL: 3646 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:19:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:20:33] !log repooling wdqs1003, other nodes are starting to lag as well [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:18] (03PS4) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:23:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:25:11] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10elukey) 05Open>03Resolved [15:32:34] !log rolling restart of all prometheus-mcrouter-exporters on app/api servers - metrics not reported after the last mcrouter restart [15:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:37] sigh [15:33:48] (03PS5) 10Banyek: mariadb: table checker for monitoring data 
drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:33:54] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10colewhite) Bug reported to Debian with patches. [15:34:08] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10colewhite) a:03colewhite [15:34:46] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:34:54] PROBLEM - High lag on wdqs1003 is CRITICAL: 3602 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:57:54] PROBLEM - Filesystem available is greater than filesystem size on ms-be1043 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be1043:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [16:02:59] known ^ [16:08:13] 10Operations, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10colewhite) [16:09:07] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10jijiki) p:05Triage>03Normal [16:13:14] (03CR) 10Dzahn: "i think i should amend to make this another "only on stretch" thing, because we got both the effect we don't touch anything production and" [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:37:03] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refactor/bootstrap service node puppet code 
[puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [16:37:48] (03CR) 10jerkins-bot: [V: 04-1] toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:42:58] (03PS3) 10Arturo Borrero Gonzalez: toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [16:44:06] (03CR) 10jerkins-bot: [V: 04-1] toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:45:25] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [16:46:30] (03CR) 10Bstorm: "If this is going to be part of the stretch grid and all that, make sure all gridengine module references point at "sonofgridengine" module" [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:49:10] (03CR) 10Bstorm: toolforge: refactor/bootstrap service node puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:52:07] (03PS4) 10Arturo Borrero Gonzalez: toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [16:53:02] (03CR) 10jerkins-bot: [V: 04-1] toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:54:16] jijiki: thanks for updating my key [16:54:31] np:) [16:55:02] jijiki: but... 
:) when I try to ssh to `stat1006.eqiad.wmnet` I get `No ECDSA host key is known for bast1002.wikimedia.org and you have requested strict checking` [16:55:17] (03CR) 10Bstorm: toolforge: refactor/bootstrap service node puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [16:55:24] I've copied the config from https://wikitech.wikimedia.org/wiki/Production_shell_access#Advanced:_operations_config [16:56:18] (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [16:56:25] (03CR) 10RobH: [C: 031] Adding dns entries an-worker10[78-95] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) (owner: 10Cmjohnson) [16:58:07] kostajh: you need to already have bast1002.wikimedia.org's key [16:58:20] in your known hosts [16:59:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.657 second response time [16:59:32] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 3 others: Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) Ferm merged my pull request... 
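(Annotation on the 16:55-16:58 exchange above: the `No ECDSA host key is known for bast1002.wikimedia.org` error means strict host key checking found no ECDSA entry for the bastion in the user's known hosts. The following is a minimal sketch of that lookup, not how OpenSSH itself implements it. The function name is hypothetical, and it assumes plain, non-hashed entries without wildcards or `[host]:port` forms.)

```python
def has_ecdsa_host_key(known_hosts_text, hostname):
    """Return True if a plain known_hosts entry lists an ECDSA key for hostname.

    Simplifications (assumptions of this sketch): hashed entries ("|1|...")
    and marker lines ("@cert-authority", "@revoked") are skipped, and
    wildcard or [host]:port patterns are not interpreted.
    """
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "hosts keytype key" entry
        hosts, key_type = fields[0], fields[1]
        if hosts.startswith("|1|"):
            continue  # hashed host name, cannot be compared by name here
        if key_type.startswith("ecdsa-sha2-") and hostname in hosts.split(","):
            return True
    return False
```

Fed the contents of `~/.ssh/known_hosts`, this returns False for `bast1002.wikimedia.org` in the situation described, which is why the bastion's key has to be added before the jump to `stat1006.eqiad.wmnet` can work.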
[17:00:09] (03PS6) 10Arturo Borrero Gonzalez: toolforge: refactor/bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [17:02:33] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:02:58] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10Krenair) 05Open>03Resolved >>! In T205672#4694292, @dcausse... [17:04:24] (03Abandoned) 10Dzahn: rsync::server: fix handling of use_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/463394 (owner: 10Dzahn) [17:04:43] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.906 second response time [17:05:20] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10Krenair) 05Resolved>03Open Sorry I totally forgot I wrote i... [17:07:40] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 3 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) [17:08:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:46] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) comparison between old and new: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 vs https://icinga-stretch.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 Note... 
[17:18:32] !log depool wdqs1003 again to let it catch up some more [17:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:48] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) point 11. from the Tuning Guide, [[ https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html | Optimize Host Checks, ]] says "Another option would be to u... [17:22:51] (03PS4) 10Cmjohnson: Adding dns entries an-worker10[78-95] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) [17:23:11] (03CR) 10Cmjohnson: [C: 032] Adding dns entries an-worker10[78-95] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) (owner: 10Cmjohnson) [17:24:13] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) What has already been done: The change below made "**max_concurrent_checks**" configurable via Hiera. It was hardcoded in template to 10000 before and is one o the major factors... [17:28:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10herron) All good on keith-logstash, that was just a temporary d... 
[17:29:13] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) [17:29:49] (03CR) 10Dzahn: [C: 032] "https://phabricator.wikimedia.org/T208066" [puppet] - 10https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [17:33:27] (03PS1) 10Herron: create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) [17:34:41] (03CR) 10jerkins-bot: [V: 04-1] create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:35:59] (03PS2) 10Herron: create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) [17:36:51] (03CR) 10jerkins-bot: [V: 04-1] create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:37:22] 10Operations, 10Icinga, 10monitoring: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) other things i have tried but not puppetized so far: (point 5 from the tuning guide, Max Reaper Time.) 21:30 mutante: icinga1001 - changing check_result_reaper_frequecy from 1... [17:37:29] (03CR) 10Ottomata: "Cool! Will this work easily with journalctl stuff? Or do we need to somehow get journalctl into rsyslog?" 
[puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:40:24] PROBLEM - Device not healthy -SMART- on db2048 is CRITICAL: cluster=mysql device=cciss,1 instance=db2048:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2048&var-datasource=codfw%2520prometheus%252Fops [17:43:03] (03CR) 10Ottomata: [C: 031] profile::statistics::private: move geoip archive to another dir [puppet] - 10https://gerrit.wikimedia.org/r/469890 (https://phabricator.wikimedia.org/T208028) (owner: 10Elukey) [17:43:58] (03CR) 10Herron: "> Cool! Will this work easily with journalctl stuff? Or do we need" [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:44:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Ping @nuria too [17:45:41] (03PS3) 10Herron: create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) [17:46:26] (03CR) 10jerkins-bot: [V: 04-1] create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:47:40] oh good grief [17:48:48] (03PS4) 10Herron: create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) [17:49:00] there's a reason wikibugs calls it jerkins-bot :) [17:49:06] lol [17:49:18] though it usually has a point [17:49:25] Krenair: wow, i actually hadn't spotted that [17:49:32] :D [17:49:59] (03CR) 10jerkins-bot: [V: 04-1] create rsyslog::ship_logfile - simplified logstash shipper via kafka [puppet] - 
10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [17:57:16] (03PS2) 10Dzahn: icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) [17:58:40] (03CR) 10jerkins-bot: [V: 04-1] icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [17:58:53] (03PS3) 10Dzahn: icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) [18:00:22] (03CR) 10jerkins-bot: [V: 04-1] icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:00:53] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.552 second response time [18:03:15] (03CR) 10Dzahn: "another case where it seems like an issue with the compiler?" [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:04:03] (03CR) 10Dzahn: "btw, i am just doing all this (Hiera, parameters) because doing "if stretch then other ping command" seemed a bit ugly, but it would total" [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:04:14] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:36] (03CR) 10Dzahn: "yea, you are right. in this case a simple "if stretch" is easier and correct" [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:08:09] Krenair: lately i had 2 where i dont get the point, like f.e. 
https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/31038/console [18:09:00] or https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/30990/console [18:10:15] Unknown resource type: 'systemd::service' [18:10:20] but we use it all over [18:10:37] no clue [18:10:49] (03CR) 10Thcipriani: [C: 032] Use mlockall() to avoid any potential swapping [software/keyholder] - 10https://gerrit.wikimedia.org/r/458239 (owner: 10Faidon Liambotis) [18:11:44] (03Merged) 10jenkins-bot: Use mlockall() to avoid any potential swapping [software/keyholder] - 10https://gerrit.wikimedia.org/r/458239 (owner: 10Faidon Liambotis) [18:11:52] sorry [18:12:40] no worries, me neither, just sharing, maybe others got it too [18:17:26] (03CR) 10Dzahn: "duh..totally overengineered :).. i think it's even simpler. we don't use that nsca.cfg.erb on jessie, so we can simply change the path, ri" [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:18:51] (03PS3) 10Dzahn: icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:19:40] (03CR) 10jerkins-bot: [V: 04-1] icinga/nsca: allow configuring nsca chroot in Hiera, change on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:23:33] (03PS4) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:24:13] (03PS5) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:27:09] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/13221/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 
10Dzahn) [18:32:47] (03PS6) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:33:37] (03CR) 10jerkins-bot: [V: 04-1] icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:34:55] "Illegal attempt to assign to 'a Name'. Not an assignable reference" ... [18:35:25] oh. ok [18:36:10] (03PS7) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:36:14] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1150 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:38:26] (03PS8) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:39:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10greg) [18:42:11] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) p:05Triage>03Normal [18:42:52] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Nuria) > This is a bit of a busy week for everyone and especially the security team, but we're going to sync up next week... [18:44:10] (03CR) 10Dzahn: [C: 032] "now it's correct, noop in prod but fixes NSCA on stretch. 
this is for FRACK icinga alerts: https://puppet-compiler.wmflabs.org/compiler100" [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:44:22] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) [18:44:23] (03PS9) 10Dzahn: icinga/nsca: fix nsca_chroot path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469808 (https://phabricator.wikimedia.org/T202782) [18:50:16] (03PS3) 10Andrew Bogott: exim smarthosts: Allow setting helo_data on transports [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) (owner: 10Alex Monk) [18:51:41] (03CR) 10Andrew Bogott: [C: 032] exim smarthosts: Allow setting helo_data on transports [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) (owner: 10Alex Monk) [18:52:55] (03PS1) 10Bstorm: sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) [18:53:37] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [18:54:11] (03CR) 10Thcipriani: [C: 04-1] "Nice! 
I think this patch will help prevent some mistakes on the deployment hosts :)" (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458240 (owner: 10Faidon Liambotis) [18:55:39] (03CR) 10Dzahn: ":~/dns/templates$ for domain in $(ls *.cz); do echo $domain; host $domain; done" [dns] - 10https://gerrit.wikimedia.org/r/467087 (https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [18:59:46] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) a:03Bawolff @bawolff I have you as point person here from the relevant meeting so I'm going to go ahead and assign [19:03:00] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (10Dzahn) @Urbanecm fyi, i think this one domain is different from the others: wikizdroje.cz has address 198.35.26.96 (that's WMF) all... 
[19:05:27] (03PS2) 10Bstorm: sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) [19:06:51] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:08:39] (03PS1) 10Dzahn: icinga/nsca: fix command_file path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469988 (https://phabricator.wikimedia.org/T202782) [19:09:00] (03CR) 10jerkins-bot: [V: 04-1] icinga/nsca: fix command_file path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469988 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:09:01] !log repooled wdqs1003 - looks like it caught up now [19:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:18] (03CR) 10Urbanecm: "@Dzahn: Thanks for notifying me, you're right. It seems I forgot to update NSSET of this domain as well. I've sent and authorized a reques" [dns] - 10https://gerrit.wikimedia.org/r/467087 (https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [19:09:20] (03PS2) 10Dzahn: icinga/nsca: fix command_file path on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469988 (https://phabricator.wikimedia.org/T202782) [19:13:11] Amir1: can you add me as subscriber of https://phabricator.wikimedia.org/T207576 ? thanks! 
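(Annotation on the wdqs1003 lag alerts and the 19:09 repool above: the alert text `3646 ge 3600` and the recovery text `(C)3600 ge (W)1200 ge 1150` suggest a critical threshold of 3600 and a warning threshold of 1200, compared with greater-or-equal. A small sketch of that classification follows. The function name is hypothetical, and treating the values as seconds is an assumption, since the log does not state the unit.)

```python
# Thresholds as they appear in the alert text; the unit (seconds) is assumed.
CRITICAL_LAG = 3600
WARNING_LAG = 1200


def lag_state(lag):
    """Classify an updater lag value the way the alert text implies."""
    if lag >= CRITICAL_LAG:
        return "CRITICAL"
    if lag >= WARNING_LAG:
        return "WARNING"
    return "OK"
```

With the values from this log, 3646 and 3602 both classify as CRITICAL, while the 1150 reported at recovery falls below the warning threshold.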
[19:13:39] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) [19:15:31] (03PS3) 10Bstorm: sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) [19:15:54] (03CR) 10Dzahn: [C: 032] "noop in prod, fixes in stretch https://puppet-compiler.wmflabs.org/compiler1002/13225/" [puppet] - 10https://gerrit.wikimedia.org/r/469988 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:21:14] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [19:23:25] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [19:28:04] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10cwdent) [19:33:42] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Dzahn) You can use Authy instead of Google Authenticator for the same thing if that helps. Maybe get the Authy .apk directly or use it as a Chrome app. Would this help? --> https://www.apkmirror.com/apk/authy-i... [19:34:01] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Krenair) FWIW there are alternatives to Google Authenticator. [19:34:49] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Aklapper) I don't have a phone with Google Apps either so I use `FreeOTP` for 2FA. 
[19:35:52] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Dzahn) oops, that was an old version, this is better: https://www.apkmirror.com/apk/authy-inc/authy/authy-23-2-8-release/ [19:40:05] !log remove 2fa for charlottepotero and cwd users in phab (so they can readd) [19:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:53] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Dzahn) There should be some "MD5, SHA-1, SHA-256 signatures" here to verify the files are legit.. afaict: https://www.apkmirror.com/apk/authy-inc/authy/authy-23-2-8-release/authy-2-factor-authentication-23-2-8-an... [19:43:11] 10Operations, 10DNS, 10Traffic, 10WMCZ-General, and 4 others: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (10Urbanecm) @Dzahn: Thank you. I simply forgot to update NSSET to the new one. https://www.nic.cz/whois/domain/wikizdroje.cz/ says NSSET was changed, so I... 
[19:45:42] (03PS4) 10Bstorm: sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) [19:46:45] (03CR) 10Bstorm: [C: 032] sonofgridengine: expand puppetization to include a gridengine_queue type [puppet] - 10https://gerrit.wikimedia.org/r/469983 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:50:49] (03CR) 10Herron: "I think we could have done this in one line using primary_hostname" [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) (owner: 10Alex Monk) [19:55:11] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) 05Resolved>03Open Seems like we have a duplicate in Netbox: https://netbox.wikimedia.org/dcim/devices/201/ and https://netbox.wikimedia.org/dcim/devices/1954/ The 2nd o... [19:56:51] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Dzahn) I should recommend FOSS alternatives though, so here: https://alternativeto.net/software/authy/?license=opensource [19:58:54] herron, hey, do you mean overriding exim's 'primary_hostname' in the config instead of adding helo_data? [19:59:18] hey! yeah I was just working on a patch to see if it did the trick for you [19:59:49] cool [20:00:07] I don't have an easy means to test this [20:00:12] but [20:00:28] Basically we looked at the headers of mail sent using your new mx-out01 etc. [20:00:31] hah, that’s funny because I just deleted my exim test box a couple hours ago [20:00:48] and found it was sending out its internal FQDN [20:01:20] i.e. .cloudinfra.eqiad.wmflabs [20:01:32] instead of the public hostname we pointed at its floating ip [20:01:52] which probably isn't a great thing [20:01:55] right? 
[20:02:04] (03PS1) 10Herron: profile::mail::smarthost add primary_hostname setting [puppet] - 10https://gerrit.wikimedia.org/r/470028 (https://phabricator.wikimedia.org/T41785) [20:02:59] I’m not sure what issues that would cause, maybe some mail systems would complain of it not matching a resolvable dns record. it is nice to see the actual hostname in the received header for the purposes of tracing but also seems like something that we would want to set on a case by case basis [20:03:34] let me know what you think of that patch, should be a shorthand way to do the same thing [20:03:36] !log icinga1001 - disabled puppet, changed: check_result_reaper_frequency=2 ; max_check_result_reaper_time=10 to test if it lowers latency (T208066) [20:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:39] T208066: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 [20:03:49] herron, honestly if you think it's unnecessary feel free to get rid of it entirely [20:04:41] I don't feel strongly about it or anything, just looked strange when looking at the headers [20:04:57] no I think it could be useful to have the ability to configure [20:05:02] okay [20:05:05] well [20:05:07] we work around it in prod by just giving the host a “real” hostname [20:05:07] this looks good [20:05:15] yeah [20:06:01] (03CR) 10Alex Monk: [C: 031] profile::mail::smarthost add primary_hostname setting [puppet] - 10https://gerrit.wikimedia.org/r/470028 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:06:12] ok cool! 
nice catch btw, thanks for the original patch [20:08:42] (03CR) 10Herron: [C: 032] profile::mail::smarthost add primary_hostname setting [puppet] - 10https://gerrit.wikimedia.org/r/470028 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:08:46] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (10Smalyshev) 05Open>03Resolved a:03Smalyshev Looks like after we disabled change-props c... [20:09:08] 10Operations, 10ops-eqiad, 10netops: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (10Cmjohnson) [20:10:33] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [20:11:13] herron, btw have you also configured the new hiera key for the cloudinfra mxes? [20:12:07] someday when horizon loads! haha [20:12:09] working on it though [20:14:01] +primary_hostname = mx-out01.wmflabs.org [20:14:31] (that’s from the puppet run) [20:15:56] heh, yeah, the puppet dashboard in there being slow is a known problem [20:16:06] cool [20:16:07] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10cwdent) 05Open>03Resolved a:03cwdent Thanks for the suggestions! 
I went with FreeOTP :) [20:16:32] so should be good to go now, should see matching dns and helo on mail routed through [20:16:41] worked for me on a test message from mx-out01 itself [20:17:11] and updated hiera for mx-out02 as well [20:20:30] https://phabricator.wikimedia.org/P7724 looks ok [20:21:38] I think one other oddity noticed during testing was that when you send from this thing to a wikimedia.org address, prod's MX (now just being the inbound for wikimedia.org) sees the labs private IP [20:23:39] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Krenair) @cwdent: Out of interest where did you find documentation only listing Google Authenticator? Might be nice to ensure the alternatives are documented equally. [20:24:41] (03CR) 10Faidon Liambotis: Add permission checks for various commands (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458240 (owner: 10Faidon Liambotis) [20:29:25] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Krinkle) In response to this being unexpectedly enabled on all wikis for non-script fetches, I felt obligated to disable most Toolforge-related features in scripts a... [20:35:39] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10cwdent) @krenair I was just assuming GA, didn't know about all the options. When adding 2fa in phab the message says: "Attach a mobile authenticator application (like Authy or Google Authenticator) to your accoun... 
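Editor's note on the primary_hostname exchange above: the concern was that the smarthost announced an internal name ending in .cloudinfra.eqiad.wmflabs instead of the public name pointed at its floating IP, and the Puppet run then showed `+primary_hostname = mx-out01.wmflabs.org`. A minimal sketch of that sanity check follows, in Python. The function names and the `INTERNAL_SUFFIXES` tuple are hypothetical and not part of the actual patch; only the `.eqiad.wmflabs` suffix and the `primary_hostname` setting name come from the log.

```python
import re

# Assumption: names under these suffixes are internal-only. The suffix is
# taken from the discussion in the log; the tuple itself is illustrative.
INTERNAL_SUFFIXES = (".eqiad.wmflabs",)


def parse_primary_hostname(config_text):
    """Return the value of the first `primary_hostname = ...` line, or None.

    A leading "+" is tolerated so that a line copied from Puppet's diff
    output (as quoted in the log) can be passed in unchanged.
    """
    for line in config_text.splitlines():
        match = re.match(r"^\+?\s*primary_hostname\s*=\s*(\S+)\s*$", line)
        if match:
            return match.group(1)
    return None


def is_public_name(hostname):
    """Return True if the hostname does not end in a known internal suffix."""
    return not hostname.lower().rstrip(".").endswith(INTERNAL_SUFFIXES)
```

Feeding it the diff line from the Puppet run yields `mx-out01.wmflabs.org`, which passes `is_public_name`. This only inspects the configured string; it does not confirm that DNS and the HELO actually match, which was verified in the log by sending a test message.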
[20:36:39] 10Operations: Please remove 2fa from my phab account - https://phabricator.wikimedia.org/T208090 (10Krenair) Bah that's probably phabricator upstream then :( [20:44:18] 10Operations, 10SRE-Access-Requests: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (10sbassett) Tagging @JBennett for approval (so I can start doing Brian and Sam things) [20:47:28] (03PS1) 10Gilles: Enable performance perception survey shuffling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470065 (https://phabricator.wikimedia.org/T208088) [20:57:33] PROBLEM - HP RAID on ms-be2021 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging [20:57:37] ACKNOWLEDGEMENT - HP RAID on ms-be2021 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T208096 [20:57:42] 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T208096 (10ops-monitoring-bot) [20:59:01] (03PS1) 10Andrew Bogott: cloud ldap: Change the ACL to allow keystone to talk to ldap over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/470070 [21:00:43] (03CR) 10Alex Monk: [C: 031] cloud ldap: Change the ACL to allow keystone to talk to ldap over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/470070 (owner: 10Andrew Bogott) [21:04:55] (03CR) 10Andrew Bogott: [C: 032] cloud ldap: Change the ACL to allow keystone to talk to ldap over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/470070 (owner: 10Andrew Bogott) [21:06:44] PROBLEM - HHVM rendering on mw1257 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [21:07:03] PROBLEM - Apache HTTP on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.003 second response time [21:07:33] PROBLEM - Nginx local proxy to apache on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [21:07:53] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 75793 bytes in 0.382 second response time [21:08:04] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [21:08:34] RECOVERY - Nginx local proxy to apache on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [21:15:14] RECOVERY - Recursive DNS on 208.80.153.51 is OK: DNS OK: 0.062 seconds response time. www.wikipedia.org returns 208.80.153.224 [21:16:43] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.139 second response time [21:20:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:34] (03PS1) 10Dzahn: icinga: tune reaper frequency on stretch [puppet] - 10https://gerrit.wikimedia.org/r/470077 [21:32:27] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ``` icinga1001.wikimedia.org ``` The log can be found i... 
[21:32:33] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['icinga1001.wikimedia.org'] ``` Of which those **FAILED**: ``` ['icinga1001.wikimedia.org'] ``` [21:33:24] RECOVERY - Recursive DNS on 208.80.153.78 is OK: DNS OK: 0.080 seconds response time. www.wikipedia.org returns 208.80.154.224 [21:33:24] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/tests: 86c0b56b0d1bf66073fafb9bc00bafb87d2e3b9c (duration: 01m 08s) [21:33:25] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ``` icinga1001.wikimedia.org ``` The log can be found i... [21:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:27] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/autoload.php: 86c0b56b0d1bf66073fafb9bc00bafb87d2e3b9c (duration: 00m 52s) [21:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:37] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/includes: 86c0b56b0d1bf66073fafb9bc00bafb87d2e3b9c (duration: 01m 14s) [21:38:38] aaron@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[21:39:13] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:04] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 75789 bytes in 0.227 second response time [21:50:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [21:52:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [21:58:14] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.493 second response time [21:58:23] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Dzahn) I think the right place for this would be the Wikipedia namespace on es.wikipedia, so the existing page: https://es.wikipedia.org/wiki/Wikipedia:Bienvenidos Second best would be a new p... [22:00:01] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10atgo) From my end, the URL isn't terribly important. We should not replace something that's community created/maintained, but otherwise the constraints around tracking and UX are more important... [22:01:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:14] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [22:05:44] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2048 is CRITICAL: cluster=mysql device=cciss,1 instance=db2048:9100 job=node site=codfw Banyek this is a predictive failure only, we are waiting for a real disk fail. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2048&var-datasource=codfw%2520prometheus%252Fops [22:07:42] 10Operations, 10Operations-Software-Development: cumin tries to downtime Icinga even with --no-downtime - https://phabricator.wikimedia.org/T208100 (10Dzahn) [22:08:53] PROBLEM - Disk space on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:08:53] PROBLEM - configured eth on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:09:03] PROBLEM - DPKG on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:09:04] PROBLEM - MD RAID on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:09:13] PROBLEM - Check systemd state on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:09:43] PROBLEM - dhclient process on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:11:13] PROBLEM - puppet last run on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:11:23] 10Operations, 10Operations-Software-Development: cumin tries to downtime Icinga even with --no-downtime - https://phabricator.wikimedia.org/T208100 (10Dzahn) p:05Triage>03Low prio low because the install process continued anyways.. contrary to what i first thought it didn't fail entirely but continued afte... [22:13:14] PROBLEM - SSH on notebook1004 is CRITICAL: Server answer [22:14:42] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Dzahn) I understand. Yea, that would be great if you can use some (new) page on es.wikipedia.org/wiki/Wikipedia:. I don't know much about the tracking constraints but i would expect we have exi... 
[22:23:44] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1004 is CRITICAL: Return code of 255 is out of bounds [22:24:32] 10Operations, 10HHVM, 10Wikimedia-production-error: BUG: Bad page map in process hhvm - https://phabricator.wikimedia.org/T207983 (10Krinkle) [22:26:32] 10Operations, 10HHVM, 10Wikimedia-production-error: BUG: Bad page map in process hhvm - https://phabricator.wikimedia.org/T207983 (10Krinkle) 05Open>03Resolved a:03Dzahn {F26862599} [22:26:44] RECOVERY - SSH on notebook1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [22:30:47] (03PS1) 10Alex Monk: deployment-prep hieradata: Fix comment about which host this IP is [puppet] - 10https://gerrit.wikimedia.org/r/470095 [22:35:22] (03PS1) 10Alex Monk: Tear out old unmaintained theoretically-unused nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/470096 [22:39:19] (03PS1) 10Alex Monk: nova dnsmasq: tear out old promethium stuff [puppet] - 10https://gerrit.wikimedia.org/r/470098 [22:39:24] RECOVERY - DPKG on notebook1004 is OK: All packages OK [22:39:24] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [22:39:34] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [22:39:57] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Prtksxna) @Dzahn thanks for all the suggestions! I am wondering how we'll deploy our static site* by setting up a redirect on a wiki, would you be able to help with that? * Code at https://gerr... 
[22:40:03] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient [22:40:23] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up [22:40:47] (03PS2) 10Alex Monk: nova dnsmasq: tear out old promethium stuff [puppet] - 10https://gerrit.wikimedia.org/r/470098 [22:41:53] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:42:09] 10Operations, 10Operations-Software-Development: cumin tries to downtime Icinga even with --no-downtime - https://phabricator.wikimedia.org/T208100 (10Volans) See the `--new` option [22:42:54] (03PS1) 10Alex Monk: role::labs::instance: Remove $::virtual == 'kvm' check for promethium [puppet] - 10https://gerrit.wikimedia.org/r/470100 [22:44:08] 10Operations, 10Operations-Software-Development: cumin tries to downtime Icinga even with --no-downtime - https://phabricator.wikimedia.org/T208100 (10Volans) Actually the `--new` might not work either as the host is in puppetdb, sorry for the wrong suggestion. Anyway this is kinda unrelated to the reimage scr... [22:47:43] (03PS1) 10Alex Monk: labs puppetmaster: Remove old promethium baremetal stuff [puppet] - 10https://gerrit.wikimedia.org/r/470101 [22:49:18] (03PS1) 10Alex Monk: prod dhcpd: rm promethium [puppet] - 10https://gerrit.wikimedia.org/r/470102 [22:53:47] (03PS1) 10Alex Monk: rm promethium entries [dns] - 10https://gerrit.wikimedia.org/r/470103 [22:53:53] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Fri 2018-10-26 22:53:46 UTC. [22:54:20] !log sodium - attempted to replace broken disk for RAID - did not go well [22:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:23] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100% [23:03:24] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [23:05:04] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [23:05:34] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [23:11:42] (03Abandoned) 10Alex Monk: prod dhcpd: rm promethium [puppet] - 10https://gerrit.wikimedia.org/r/470102 (owner: 10Alex Monk) [23:11:53] (03Abandoned) 10Alex Monk: rm promethium entries [dns] - 10https://gerrit.wikimedia.org/r/470103 (owner: 10Alex Monk) [23:12:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Krenair) I made some patches getting rid of promethium stuff, then realised part of it would actually likely be covered by Rob in this ticket, so have abandoned https://gerrit.wikime... [23:12:47] (03Abandoned) 10Alex Monk: Tear out old unmaintained theoretically-unused nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/470096 (owner: 10Alex Monk) [23:12:55] (03Abandoned) 10Alex Monk: nova dnsmasq: tear out old promethium stuff [puppet] - 10https://gerrit.wikimedia.org/r/470098 (owner: 10Alex Monk) [23:18:52] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Dzahn) Wasn't aware i was overseeing mirror boxes and I have never done this before but tried to follow T205364#4641757 and the cheatsheet linked there. Eventually i was able to identify the new drive and needed... 
[23:25:53] RECOVERY - puppet last run on labtestmetal2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:29:04] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:29:15] ACKNOWLEDGEMENT - MegaRAID on sodium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T208107 [23:29:18] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T208107 (10ops-monitoring-bot) [23:31:13] 10Operations, 10Operations-Software-Development: cumin tries to downtime Icinga even with --no-downtime - https://phabricator.wikimedia.org/T208100 (10Dzahn) Isn't the issue that despite saying --no-downtime it tries to set a downtime? [23:34:21] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Dzahn) [23:34:23] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T208107 (10Dzahn) [23:38:47] 10Operations, 10New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Dzahn) @Prtksxna Is this a dynamic page with scripting or is it a static page with just HTML/CSS and some images? Could the content as well be in a wiki page given that we can upload images and... [23:41:23] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.966 second response time [23:44:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:50] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Dzahn) ``` for disk in 0 1 2 3; do megacli -PDInfo -PhysDrv [32:${disk}] -aALL | grep "^Sector Size"; done Sector Size: 512 Sector Size: 4096 Sector Size: 512 Sector Size: 512 ``` @cmjohnson I think it won'... 
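Editor's note on the megacli output above: the shell loop prints one "Sector Size" value per physical drive, and one of the four drives reports 4096 while the others report 512. A minimal Python sketch for spotting such a mismatch in captured output follows. The function names are hypothetical, and the sketch assumes only the `Sector Size: <n>` text shown in the log rather than any other megacli field.

```python
import re
from collections import Counter


def sector_sizes(pdinfo_output):
    """Extract every 'Sector Size: N' value, in order of appearance."""
    return [int(n) for n in re.findall(r"Sector Size:\s*(\d+)", pdinfo_output)]


def mismatched_disks(sizes):
    """Return positions whose sector size differs from the most common one."""
    if not sizes:
        return []
    common_size, _count = Counter(sizes).most_common(1)[0]
    return [index for index, size in enumerate(sizes) if size != common_size]
```

With the four values from the log (512, 4096, 512, 512), `mismatched_disks` returns `[1]`, the second drive in loop order. The positions are list indexes in output order, which match the `disk` loop variable only because the loop in the log runs 0 through 3.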
[23:53:13] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Dzahn) a:05Dzahn>03Cmjohnson