[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160202T0000).
[00:00:04] bd808 ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:53] (03PS1) 10Yuvipanda: tools: Point k8s master to new k8s etcd hosts [puppet] - 10https://gerrit.wikimedia.org/r/267805
[00:01:17] (03PS2) 10Yuvipanda: tools: Point k8s master to new k8s etcd hosts [puppet] - 10https://gerrit.wikimedia.org/r/267805
[00:01:17] * Jamesofur put a patch on the page about a minute too late to get the ping
[00:01:23] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Point k8s master to new k8s etcd hosts [puppet] - 10https://gerrit.wikimedia.org/r/267805 (owner: 10Yuvipanda)
[00:01:47] Roan is sitting next to be and probably won't be SWATing from his phone
[00:01:54] to me*
[00:03:11] i suppose i can take this one
[00:03:13] bd808: ready?
[00:03:17] legoktm: slacker
[00:03:27] (tell him that for me)
[00:03:29] * ebernhardson glares at mr "i want scap"
[00:03:42] * Jamesofur whistles innocently\
[00:03:45] Jamesofur: told him :P
[00:03:50] (03CR) 10EBernhardson: [C: 032] Put more like query load back on eqiad for load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267662 (owner: 10DCausse)
[00:03:55] ebernhardson: to be fair only because we keep reverting :P
[00:05:15] ebernhardson: yup sorry
[00:07:23] legoktm, you in australia too?
[00:07:23] the test for this is "cross fingers and hope we don't start getting encoding errors again"
[00:07:23] subbu: yep :)
[00:07:23] i guess you must be. :)
[00:07:23] if roan is sitting next to you.
[00:07:23] whenever gerrit wakes up and merges this patch :)
[00:07:23] (03PS1) 10Jdlrobson: Just use the default MobileFrontend specified page actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267807
[00:11:18] (03CR) 10EBernhardson: [C: 032] "helloooooo. mr. gerrit. are you there?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267662 (owner: 10DCausse)
[00:11:18] (03PS1) 10Yuvipanda: tools: Minor cleanup [puppet] - 10https://gerrit.wikimedia.org/r/267809
[00:11:18] (03CR) 10BryanDavis: [C: 031] "Should fix "Notice: Undefined variable: wgMFPageActions in /srv/mediawiki/wmf-config/mobile.php on line 95" errors that are flooding beta " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267807 (owner: 10Jdlrobson)
[00:11:19] (03PS2) 10Yuvipanda: tools: Minor cleanup [puppet] - 10https://gerrit.wikimedia.org/r/267809
[00:11:19] (03CR) 10Bmansurov: [C: 031] Just use the default MobileFrontend specified page actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267807 (owner: 10Jdlrobson)
[00:11:19] not getting the message about jenkins sending it to gate-and-submit :S
[00:11:19] oh there it went
[00:11:19] ebernhardson: https://integration.wikimedia.org/zuul/ looks backed up
[00:11:20] oh bah
[00:11:20] I'm creating like 100 new jenkins jobs
[00:11:20] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Minor cleanup [puppet] - 10https://gerrit.wikimedia.org/r/267809 (owner: 10Yuvipanda)
[00:11:20] can you wait for after swat? :P
[00:11:20] jerk ;)
[00:14:37] (03CR) 10EBernhardson: [C: 032] Revert "monolog: Ensure that context data added by WebProcessor is utf-8 safe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267469 (https://phabricator.wikimedia.org/T119594) (owner: 10BryanDavis)
[00:14:37] 6operations, 6Performance-Team, 7Graphite, 7Monitoring, 5Patch-For-Review: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1988275 (10ori) 5Open>3Resolved statsv will actually exit (and get restarted) when the main process dies, now. This wasn't the case before...
[00:16:06] and there it goes
[00:16:54] (03CR) 10jenkins-bot: [V: 04-1] Put more like query load back on eqiad for load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267662 (owner: 10DCausse)
[00:17:09] aw, crap, that's my fault
[00:17:41] this is going to be a long SWAT for so few patches :P
[00:17:46] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:18:17] (03CR) 10jenkins-bot: [V: 04-1] Revert "monolog: Ensure that context data added by WebProcessor is utf-8 safe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267469 (https://phabricator.wikimedia.org/T119594) (owner: 10BryanDavis)
[00:18:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:20:21] ebernhardson: you should override jenkins, it's going to take me a little bit to fix
[00:20:26] ok
[00:20:39] (03CR) 10EBernhardson: [V: 032] Put more like query load back on eqiad for load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267662 (owner: 10DCausse)
[00:21:16] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:21:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:23:05] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: Move morelike query load back to eqiad to allow load testing on codfw (duration: 01m 38s)
[00:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:23:36] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1988377 (10Luke081515)
[00:23:59] (03CR) 10EBernhardson: [V: 032] Revert "monolog: Ensure that context data added by WebProcessor is utf-8 safe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267469 (https://phabricator.wikimedia.org/T119594) (owner: 10BryanDavis)
[00:24:37] (03PS1) 10Jdlrobson: Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959)
[00:25:42] bd808: yours is syncing out now
[00:26:04] ebernhardson: cool. I'm starting a fatalmonitor
[00:26:48] !log ebernhardson@mira Synchronized wmf-config/logging.php: Revert "monolog: Ensure that context data added by WebProcessor is utf-8 safe" (duration: 01m 27s)
[00:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:27:52] ebernhardson: still getting mw logs in logstash and no new fatals
[00:27:58] for now that's the test
[00:28:16] ok sounds good
[00:29:17] !log ebernhardson@mira Started scap: Add Cookie statement link to footer of all WMF wikis per legal
[00:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:31] Jamesofur: scap started, no clue how long that takes these days
[00:30:04] how many languages changed?
[00:30:13] that's the biggest factor
[00:30:17] bd808: 2, en and es
[00:30:24] (and qqq, but i'm going to assume that doesn't count :)
[00:30:32] I'm going to put my money on the 35 minute square
[00:30:40] !log restbase deploy end of c3bd864
[00:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:44] ebernhardson: thanks
[00:30:55] Jamesofur: you'll be around in 35ish minutes to verify?
[00:31:07] Yup
[00:31:14] ottomata: around?
[00:31:18] sweet. i'll let you know
[00:31:53] !log ebernhardson@mira scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="cawikibooks" --outdir="/tmp/scap_l10n_1684485672" --threads=10 --quiet' returned non-zero exit status 255 (duration: 02m 35s)
[00:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:05] blerg
[00:32:16] bd808: uhh, i'm going to guess you know more about that than i do (havn't scapped in some time)
[00:32:30] ebernhardson: to debug that you need to run manually and see what it's problem is
[00:32:38] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1988453 (10mobrovac)
[00:32:40] (03CR) 10Bmansurov: Experiment one: Labs stripping HTML in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson)
[00:32:53] most of the errors revolve around SemanticForms and SF_NS_FORM being undefined
[00:33:07] (03PS2) 10Jdlrobson: Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959)
[00:33:36] hmmm... that sounds like things that Reedy patched. I wonder if they were in .11 and not .10?
[00:33:49] actually, the first errors are Warning: File not found: /srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/LandingCheckLandingCheck.alias.php in /srv/mediawiki-staging/php-1.27.0-wmf.10/includes/cache/LocalisationCache.php on line 527
[00:34:02] then further down it starts giving Notice: Use of undefined constant SF_NS_FORM - assumed 'SF_NS_FORM' in /srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/SemanticForms/languages/SF_Namespaces.php on line 22
[00:34:13] anybody here who knows how archiva works?
[00:34:26] ebernhardson: neither of those sounds fatal
[00:34:38] bd808: i would agree :S hmm
[00:35:27] ebernhardson: running `/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="cawikibooks" --outdir="/tmp/scap_l10n_1684485672" --threads=10` manually on mira should tell you what's fubar
[00:36:03] SMalyshev: ottomata maybe? I know Nik knew but ... well he left
[00:36:37] bd808: yeah... looks like ottomata is not here. I think I found a workaround though
[00:39:07] aha
[00:39:14] puppet is busted on one of the CI notes
[00:39:15] nodes*
[00:39:17] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:40:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:41:18] no obvious failure in rebuildLocalisationCache.php, 8 of 10 threads have completed :S
[00:41:24] still waiting on the last few though
[00:43:12] bd808: finished, no errors :S
[00:43:18] warnings, but no errors
[00:43:25] hmm.
[00:43:32] did you check $?
[00:43:39] oh, well thats 255
[00:43:43] but no errors output :(
[00:43:49] grr
[00:44:35] well we will need it fixed for the train tomorrow so I guess we should dig in and see wtf is causing the 255 return
[00:45:12] ebernhardson: phplint should be fine now, turns out a ci slave hadn't run puppet for a week
[00:45:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:45:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:45:16] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:45:17] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:45:40] bd808: well, most likely from the script that comes from pcntl_wexitstatus() on a child
[00:45:49] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1988510 (10Luke081515)
[00:46:14] dammit, workaround didn't work :(
[00:46:22] ebernhardson: yeah I wonder if it's related to the hhvm LightProcess exiting crap
[00:46:38] like if we ran it under php5 instead of hhvm if it would be all better
[00:47:38] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 32.14% of data above the critical threshold [100000000.0]
[00:48:11] bd808: i can try, sec
[00:50:16] RECOVERY - check_mysql on payments1004 is OK: Uptime: 671 Threads: 1 Questions: 7842 Slow queries: 28 Opens: 477 Flush tables: 1 Open tables: 46 Queries per second avg: 11.687 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[00:50:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:50:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:50:17] RECOVERY - check_mysql on payments2001 is OK: Uptime: 1054257 Threads: 3 Questions: 673441 Slow queries: 0 Opens: 191 Flush tables: 1 Open tables: 36 Queries per second avg: 0.638 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[00:50:29] rebuilding w/ php5 now
[00:52:05] (03PS2) 10Jforrester: T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (owner: 10Subramanya Sastry)
[00:52:38] (03PS3) 10Jforrester: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (https://phabricator.wikimedia.org/T114186) (owner: 10Subramanya Sastry)
[00:53:52] bd808: yea, building w/ php5 fixes it. Is the solution to change mwscript to use php5 from puppet for the time being, or is there better way?
[00:54:29] that's probably the easiest hack
[00:55:11] ebernhardson: want me to make a patch and see if we can get SadPanda to merge it?
[00:55:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:55:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[00:55:28] whaaa
[00:55:30] sure
[00:55:32] bd808: i can put the patch together, as soon as gerrit decides to respond to my `git fetch` ...it's been very slow the last few days
[00:55:43] coolio
[00:56:45] SadPanda: tl;dr hhvm on mira and the know LightProcess exiting bug is breaking scap. Switching to php5 fixes by avoiding use of hhvm
[00:56:54] *known bug
[00:57:31] I actually think we'll be better off using php5 for most cli things anyway
[00:57:31] bd808: ok, I can be around to help merge things
[00:58:58] (03PS1) 10EBernhardson: Force mwscript to use zend php [puppet] - 10https://gerrit.wikimedia.org/r/267816
[00:59:02] YuviPanda: ^^
[01:00:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:00:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:00:22] ebernhardson: ah, php5 is zend
[01:00:24] ok
[01:00:36] (03CR) 10Yuvipanda: [C: 032 V: 032] Force mwscript to use zend php [puppet] - 10https://gerrit.wikimedia.org/r/267816 (owner: 10EBernhardson)
[01:00:41] ebernhardson: want me to force a run?
[01:01:05] doing anyway
[01:01:34] sure. thanks
[01:01:43] mostly just mira, can wait on the rest of the cluster
[01:03:41] ok looks like mira has it, trying scap again
[01:03:52] !log ebernhardson@mira Started scap: Add Cookie statement link to footer of all WMF wikis per legal
[01:05:01] ebernhardson: cool
[01:05:16] RECOVERY - check_mysql on payments1002 is OK: Uptime: 95 Threads: 1 Questions: 50 Slow queries: 0 Opens: 33 Flush tables: 1 Open tables: 26 Queries per second avg: 0.526 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[01:05:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:07:24] !log ebernhardson@mira scap failed: CalledProcessError Command '/srv/deployment/scap/scap/bin/refreshCdbJsonFiles --directory="/srv/mediawiki-staging/php-1.27.0-wmf.10/cache/l10n" --threads=10 ' returned non-zero exit status 255 (duration: 03m 31s)
[01:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:07:30] sigh
[01:07:42] f......
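The debugging loop above turns on a command that exits 255 while printing nothing useful, which is only caught by inspecting `$?` after the fact. A minimal shell sketch of that check, with a hypothetical `run_and_check` wrapper standing in for the manual `mwscript rebuildLocalisationCache.php` run (not the actual scap code):

```shell
#!/bin/sh
# Sketch: surface a command's exit status even when it emits no error text.
# "false" and "true" stand in for the failing/succeeding invocations.
run_and_check() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        # Report the non-zero status explicitly on stderr.
        echo "command '$*' exited with status $status" >&2
    fi
    return "$status"
}

run_and_check false   # reports a status of 1 on stderr
run_and_check true    # silent; exits 0
```

The same idea applies to child processes: a parent that collects status via pcntl_wexitstatus() can fail "silently" in exactly this way, so checking the numeric status, not just the output, is the reliable signal.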
[01:08:08] Fatal error: Call to undefined function dba_open() in /srv/deployment/scap/scap/bin/refreshCdbJsonFiles on line 162
[01:08:32] :-/
[01:09:03] exists in php5, not in hhvm
[01:09:13] serially?
[01:09:42] hhvm --php -r 'var_dump(function_exists("dba_open"));'
[01:09:43] bool(false)
[01:09:44] :(
[01:09:54] uhh, yeah, that's why we have the cdb library
[01:09:55] https://github.com/facebook/hhvm/issues/1019
[01:10:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Connecting Slave SQL: No Seconds Behind Master: (null)
[01:10:20] ebernhardson: live hack the #! line to php5
[01:10:23] https://github.com/facebook/hhvm/issues/1095
[01:10:55] we are saving whoever does the train tomorrow about 3 hours of WTF :)
[01:10:59] !log ebernhardson@mira Started scap: Add Cookie statement link to footer of all WMF wikis per legal
[01:12:50] Dereckson: heh. so Chad reported and started a fix before Tim rewrote the zend compat layer that would probably make it easier today
[01:15:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:15:27] well it's syncing this time, so thats a plus
[01:20:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null)
[01:21:35] ebernhardson: sweet. a scap hero barnstar for you
[01:22:56] with a live hack that will be blown away :P suppose i should make a patch for scap too
[01:23:02] "I fixed Wikipedia even though I didn't break it"
[01:23:33] ebernhardson: at least file a bug about it for the releng folks
[01:24:01] a live hack on the deploy master (which mira is right now) should last
[01:26:16] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[01:26:31] (03PS1) 10Dereckson: Set category collation on gd.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267820 (https://phabricator.wikimedia.org/T125315)
[01:30:16] RECOVERY - check_mysql on payments1003 is OK: Uptime: 283 Threads: 1 Questions: 2 Slow queries: 0 Opens: 33 Flush tables: 1 Open tables: 26 Queries per second avg: 0.007 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[01:30:42] !log ebernhardson@mira Finished scap: Add Cookie statement link to footer of all WMF wikis per legal (duration: 19m 42s)
[01:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:30:58] Jamesofur: claims to be shipped out, please check
[01:31:04] * Jamesofur looks
[01:31:12] nice. at least it was fast once ebernhardson got it to work
[01:31:47] ebernhardson: thanks, my random sampling says yes :)
[01:31:55] ok, will declare scap complete then
[01:32:12] s/scap/swat/
[01:43:15] (03PS1) 10Dereckson: Deploy Translate extension on ru.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267822 (https://phabricator.wikimedia.org/T121766)
[01:50:16] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 59 failures
[01:55:16] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 94 seconds ago with 0 failures
[02:01:22] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Varnish cache for /static/$wmfbranch/ doesn't expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1988662 (10Krinkle)
[02:01:38] (03PS2) 10Dereckson: Enable confirmed group at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515)
[02:01:42] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1285315 (10Krinkle)
[02:01:44] (03CR) 10Dereckson: [C: 031] Enable confirmed group at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515)
[02:03:51] (03CR) 10Dereckson: "The community explicitly requested sysops are able to remove users from the group too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515)
[02:33:36] (03Abandoned) 10Andrew Bogott: Move wikitech to the keystone v3 api. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252491 (owner: 10Andrew Bogott)
[02:41:07] (03PS1) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[02:41:24] bblack: robh mutante ^ in case y'all are curious
[02:41:31] (03CR) 10jenkins-bot: [V: 04-1] base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814) (owner: 10Yuvipanda)
[02:41:35] (03PS2) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[02:41:36] * YuviPanda is testing
[02:43:25] (03CR) 10jenkins-bot: [V: 04-1] base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814) (owner: 10Yuvipanda)
[02:43:43] (03PS3) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[02:44:09] (03CR) 10jenkins-bot: [V: 04-1] base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814) (owner: 10Yuvipanda)
[02:44:49] (03PS4) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[02:52:27] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: puppet fail
[02:53:05] (03CR) 10Dzahn: [C: 031] "looks like it would work, yea, the config option is here http://docs.puppetlabs.com/puppet/latest/reference/configuration.html#dnsaltnames" [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814) (owner: 10Yuvipanda)
[02:53:46] mutante: looks like it should work, but doesn't!
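The dba_open() failure earlier in the log was diagnosed by probing the runtime directly (`hhvm --php -r 'var_dump(function_exists("dba_open"));'`), then falling back to an interpreter that has the function. A generic shell sketch of that fallback pattern, with placeholder interpreter names since neither hhvm nor php5 can be assumed present:

```shell
#!/bin/sh
# Sketch: choose the first available runtime from a preference list,
# mirroring the php5-over-hhvm fallback above. In the real incident the
# probe was feature-based (function_exists("dba_open")); here a simple
# availability check via command -v stands in for it.
pick_runtime() {
    for rt in "$@"; do
        if command -v "$rt" >/dev/null 2>&1; then
            echo "$rt"
            return 0
        fi
    done
    return 1   # no candidate runtime found
}

# "preferred-runtime-xyz" is a placeholder that is expected to be missing,
# so the fallback ("sh") is selected.
pick_runtime preferred-runtime-xyz sh
```

In the incident itself the equivalent of the fallback was hard-coded by patching mwscript (and live-hacking the `#!` line of refreshCdbJsonFiles) to use php5 rather than probing at runtime.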
[02:54:00] I mean
[02:54:03] the cert itself work
[02:54:05] s
[02:54:07] just that
[02:54:09] our way of generating puppet.conf
[02:54:11] doesn't quite yet
[02:54:52] YuviPanda: yea, the second part was "..should be tested carefully though" heh
[02:54:57] :p
[03:05:55] (03PS5) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[03:08:40] (03PS6) 10Yuvipanda: base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814)
[03:10:45] (03CR) 10Yuvipanda: [C: 032] base: Allow adding SANs to puppet CSRs [puppet] - 10https://gerrit.wikimedia.org/r/267826 (https://phabricator.wikimedia.org/T119814) (owner: 10Yuvipanda)
[03:20:36] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[03:40:53] 6operations, 10ops-esams, 10Traffic: esams cache cluster re-arrangements, early 2016 - https://phabricator.wikimedia.org/T125485#1988822 (10BBlack) 3NEW
[03:44:57] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[03:46:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:50:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:51:56] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:56:09] 6operations, 10ops-eqiad, 10Traffic: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#1988833 (10BBlack) 3NEW
[03:57:07] 6operations, 6Services, 10Traffic, 5Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1988844 (10BBlack)
[03:57:10] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1988845 (10BBlack)
[03:57:13] 6operations, 6Discovery, 10Maps, 10Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1988846 (10BBlack)
[03:57:16] 6operations, 10ops-eqiad, 10Traffic: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#1988843 (10BBlack)
[03:57:42] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1706780 (10BBlack)
[03:57:45] 6operations, 6Discovery, 10Maps, 10Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack)
[03:58:56] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail
[04:19:57] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 19.23% of data above the critical threshold [100000000.0]
[04:25:07] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[04:26:38] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: puppet fail
[04:38:54] (03PS1) 10Tim Landscheidt: Tools: Create separate partitions only in the Tools project [puppet] - 10https://gerrit.wikimedia.org/r/267831
[04:48:06] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:54:00] (03PS2) 10Yuvipanda: Tools: Create separate partitions only in the Tools project [puppet] - 10https://gerrit.wikimedia.org/r/267831 (owner: 10Tim Landscheidt)
[04:54:16] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Create separate partitions only in the Tools project [puppet] - 10https://gerrit.wikimedia.org/r/267831 (owner: 10Tim Landscheidt)
[04:56:36] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:42:41] (03PS1) 10Tim Landscheidt: Tools: Outfactor the configuration for outgoing HBA connections [puppet] - 10https://gerrit.wikimedia.org/r/267832
[06:31:16] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:38] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:06] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:27] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:47] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: puppet fail
[06:57:27] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:34] <_joe_> !log reimaging tin.eqiad.wmnet
[06:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:00:57] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[07:01:07] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:01:17] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:01:47] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:02:16] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:02:18] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[07:02:38] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:46] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:58] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:10:34] (03PS1) 10Giuseppe Lavagetto: netboot: add tin [puppet] - 10https://gerrit.wikimedia.org/r/267834
[07:13:49] (03CR) 10Giuseppe Lavagetto: [C: 032] netboot: add tin [puppet] - 10https://gerrit.wikimedia.org/r/267834 (owner: 10Giuseppe Lavagetto)
[07:18:47] (03PS2) 10Jcrespo: Pool db1018; Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:19:25] <_joe_> jynus: sorry, can you hold mediawiki-config changes for a few?
[07:19:34] <_joe_> I'm reimaging tin
[07:20:28] no rush
[07:20:29] (03PS3) 10Jcrespo: Pool db1018; Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:20:39] I have a master failover to prepare
[07:20:49] <_joe_> ok
[07:21:00] (03PS4) 10Jcrespo: Pool db1018; Depool db1054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:21:09] <_joe_> trusty mirrors are quite slow too this morning :/
[07:23:41] (03PS5) 10Jcrespo: Pool db1018; Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:34:48] (03PS6) 10Jcrespo: Pool db1018; Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:35:44] (03PS7) 10Jcrespo: Pool db1018; Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215)
[07:38:42] (03CR) 10Jcrespo: [C: 032] Pool db1018; Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267678 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo)
[07:39:52] Your branch is behind 'origin/master' by 3 commits, and can be fast-forwarded.
[07:41:58] 6operations, 6Analytics-Kanban, 10Datasets-General-or-Unknown, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1989033 (10Nemo_bis)
[07:45:17] 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1989080 (10Nemo_bis) >>! In T122917#1985731, @ArielGlenn wrote: > All hardware refresh tickets for dumps are now at T118154. Does that include a request for a mirror...
[07:45:49] (03PS1) 10Giuseppe Lavagetto: netboot: use specific partman recipe for tin [puppet] - 10https://gerrit.wikimedia.org/r/267836
[07:46:06] !log https://phabricator.wikimedia.org/rOMWC2ea9167221d11eb1880e4d26eae64a85cb9b2697 and https://phabricator.wikimedia.org/rOMWCa55d2bf8cd3a2853fac35d5b8239b8e8c2fe6a0f merged but not deployed
[07:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:48:06] (03PS2) 10Giuseppe Lavagetto: netboot: use specific partman recipe for tin [puppet] - 10https://gerrit.wikimedia.org/r/267836
[07:48:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] netboot: use specific partman recipe for tin [puppet] - 10https://gerrit.wikimedia.org/r/267836 (owner: 10Giuseppe Lavagetto)
[07:49:55] (03PS2) 10Jcrespo: scap: temprorarily remove tin during reimaging [puppet] - 10https://gerrit.wikimedia.org/r/267688 (owner: 10Giuseppe Lavagetto)
[07:52:00] (03PS2) 10Jcrespo: Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256)
[07:53:30] (03PS3) 10Jcrespo: scap: temporarily remove tin during reimaging [puppet] - 10https://gerrit.wikimedia.org/r/267688 (owner: 10Giuseppe Lavagetto)
[07:54:13] (03CR) 10Jcrespo: [C: 032] scap: temporarily remove tin during reimaging [puppet] - 10https://gerrit.wikimedia.org/r/267688 (owner: 10Giuseppe Lavagetto)
[08:02:27] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[08:02:48] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Pool db1018; Depool db1021 (duration: 00m 20s)
[08:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:07:04] (03PS1) 10Giuseppe Lavagetto: Revert "scap: temporarily remove tin during reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/267838
[08:07:24] oh, I may do more syncs
[08:08:06] <_joe_> yeah I just prepared the revert
[08:12:21] (03PS1) 10Jcrespo: Reconfigure db1021 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/267839
[08:13:04] !log restarting and upgrading db1021
[08:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:14:35] (03CR) 10Jcrespo: [C: 032] Reconfigure db1021 (depooled) [puppet] - 10https://gerrit.wikimedia.org/r/267839 (owner: 10Jcrespo)
[08:48:25] (03PS1) 10ArielGlenn: fix typo in dumps cron script that prevented it from doing a run [puppet] - 10https://gerrit.wikimedia.org/r/267843
[08:51:03] (03PS1) 10Jcrespo: Depool db1036, repool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267844
[08:51:15] (03CR) 10ArielGlenn: [C: 032] fix typo in dumps cron script that prevented it from doing a run [puppet] - 10https://gerrit.wikimedia.org/r/267843 (owner: 10ArielGlenn)
[08:58:57] 7Puppet, 6operations, 10Salt, 5Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1989155 (10Joe) BTW, wmf-reimage need to be able to both sign the key and delete it, as the process it follows is: # clean puppet ce...
[09:03:02] !log repool restbase1007 via confctl [09:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:46] godog: should we reinstall rb1007 btw at some point ? it has a RAID4 instead of a RAID0 config [09:04:46] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: actually include mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/267845 [09:05:36] (03PS2) 10Giuseppe Lavagetto: role::deployment::server: actually include mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/267845 [09:05:51] akosiaris: yeah it was an artifact of expanding its raid0 from 2 to 3 ssd and failed in the middle, with the new ssd coming in the plan is it reimage it as well [09:06:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::deployment::server: actually include mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/267845 (owner: 10Giuseppe Lavagetto) [09:06:35] 7Blocked-on-Operations, 6operations: Re-pool restbase1007 - https://phabricator.wikimedia.org/T124565#1989176 (10fgiunchedi) 5Open>3Resolved oops, now repooled with confctl ``` root@palladium:~# confctl --find --action set/pooled=yes restbase1007.eqiad.wmnet restbase1007.eqiad.wmnet: pooled changed no =>... 
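For readers unfamiliar with the repool step above: conftool keeps per-server pool state in a key/value store (etcd), and `confctl --find --action set/pooled=yes <host>` locates the server object and flips its `pooled` field, exactly as shown in fgiunchedi's paste in T124565. A toy, file-backed sketch of that find-and-set flow (not the real confctl, which talks to etcd):

```shell
set -e
state=$(mktemp)
# toy "object store": one server per line, name=pooled
cat > "$state" <<'EOF'
restbase1007.eqiad.wmnet=no
restbase1008.eqiad.wmnet=yes
EOF
# toy equivalent of: confctl --find --action set/pooled=yes restbase1007.eqiad.wmnet
sed -i 's/^\(restbase1007\.eqiad\.wmnet\)=no$/\1=yes/' "$state"
cat "$state"   # restbase1007 is now marked pooled
```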
[09:07:02] godog: on then [09:07:07] godog: ok* then [09:08:18] !log elastic (codfw and eqiad): freezing indices to stop titlesuggest maint scripts [09:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:42] (03CR) 10Jcrespo: [C: 032] Depool db1036, repool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267844 (owner: 10Jcrespo) [09:11:55] 6operations, 7Diamond, 7Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#1989187 (10fgiunchedi) @yuvipanda yup just a metric per host is fine [09:12:10] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 21s) [09:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:17:33] this is very interesting, only the top 3-5 wikis have recentchanges issues without partitioning [09:18:15] it may also be related to load, as the largest wikis are also the ones with higher load [09:18:32] <_joe_> both load and data size?
[09:19:31] I do not know, I am still investigating, but it is not easy to experiment with production [09:20:00] (I can only see things when I am forced to do changes due to unrelated issues) [09:21:07] (03PS2) 10Giuseppe Lavagetto: Revert "scap: temporarily remove tin during reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/267838 [09:21:16] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "scap: temporarily remove tin during reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/267838 (owner: 10Giuseppe Lavagetto) [09:21:27] (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "scap: temporarily remove tin during reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/267838 (owner: 10Giuseppe Lavagetto) [09:21:43] <_joe_> jynus: tin is up, so now rsyncing should work [09:21:45] the good news is that special config == less resources needed == more performance and more availability [09:21:59] *less special config [09:22:22] <_joe_> jynus: ping me if you have any problem with tin [09:22:25] did you merge/apply that or can I? [09:22:36] <_joe_> e.g. syncing [09:22:44] <_joe_> jynus: I merged but not applied on mira [09:22:50] ok, will do that [09:22:56] and then sync-common tin [09:23:16] <_joe_> that was already done by the installation, afaict [09:23:23] ah, great [09:23:34] <_joe_> you need the scap-master-rsync to run [09:23:39] <_joe_> that _will_ take time [09:23:41] we should do more of those [09:23:51] <_joe_> those?
[09:24:12] more of "installation/puppet fixes things automatically" [09:24:50] <_joe_> well most systems are perfectly functional after the first puppet run [09:25:49] probably not most of stateful services: varnish, mysql [09:26:00] (03PS7) 10Filippo Giunchedi: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 [09:26:03] <_joe_> varnish is [09:26:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi) [09:26:33] does it detect automatically when to purge and resync its own cache? [09:27:01] <_joe_> we don't resync caches [09:27:23] <_joe_> we do purges via the dedicated daemon [09:28:41] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1989210 (10Joe) a:3Joe [09:30:13] s/resync/purge the whole cache/ [09:32:25] !log gallium: apt-get upgrade | Restarting Jenkins [09:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:47] <_joe_> !log re-syncing tin homes [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:57] !log restarting gerrit on ytterbium for java security update [09:33:59] !log elastic (codfw and eqiad): unfreezing indices [09:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:29] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1989220 (10Joe) And... it's done. tin has been reimaged to trusty as of now. [09:34:48] Jenkins is restarting, will be back soonish [09:35:01] and gerrit has died?
;) at least it is 503ing from here [09:35:07] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1989222 (10Joe) [09:35:09] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1989224 (10Joe) [09:35:11] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1989225 (10Joe) [09:35:14] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1989221 (10Joe) 5Open>3Resolved [09:36:00] !log armed keyholder on tin [09:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:36:04] back! [09:36:44] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#961194 (10Joe) There is exactly zero machines left running PHP 5.3 IMO the real intent of this ticket is now resolved. [09:37:17] 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1989236 (10Joe) 5Open>3Resolved a:3Joe [09:37:59] !log Jenkins is fully up and operational [09:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:34] <_joe_> jynus: I'll do a bogus sync of an irrelevant file to verify scap is ok [09:39:47] I can sync a real change [09:40:31] <_joe_> ok, as you whish :) [09:40:54] db config changes are not time-sensitive when not in an emergency [09:41:02] <_joe_> k [09:42:09] sync-master now takes 0 seconds? 
[09:42:14] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1036, repool db1021 (duration: 00m 22s) [09:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:37] does it have to do with scap-master-sync or is it the reinstall? [09:43:59] <_joe_> jynus: uhm, seems strange [09:44:11] <_joe_> let me test one thing [09:44:54] lets check tin, maybe it didn't sync properly [09:45:54] ah, I may have done a noop [09:46:07] by syncing a file that was already deployed [09:46:16] that would explain it [09:46:32] <_joe_> jynus: the files are identical [09:47:05] which tells us nothing because^ [09:48:32] There is exactly zero machines left running PHP 5.3 [09:48:37] ???? [09:48:49] <_joe_> akosiaris: in the mediawiki-related pool :) [09:48:51] not even in CI ? [09:48:54] aaaah :-( [09:49:08] <_joe_> well this was the blocker for CI abandoning it I guess [09:49:20] <_joe_> in favour of php 5.5 on trusty [09:49:29] less than I hoped for, better than I feared [09:49:30] <_joe_> so now we could do the ICU migration [09:49:32] that's something [09:51:17] 6operations, 10Deployment-Systems, 6Performance-Team, 7HHVM, 3Scap3: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1989250 (10Joe) [09:51:24] 6operations, 6Release-Engineering-Team, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1989249 (10Joe) 5Open>3Resolved [09:51:25] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Testing scap-reduce db1018 weight (duration: 00m 21s) [09:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:28] no, confirmed, it now takes 2 seconds to sync masters [09:52:05] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1989255 (10Joe) 
[09:52:05] akosiaris: _joe_ : CI will have to figure out a way to have Zend 5.3 / 5.5 and hhvm installed in parallel and a way to switch between them [09:52:07] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1989254 (10Joe) 5Open>3Resolved [09:52:15] legoktm started some work on that front [09:52:28] <_joe_> jynus: cool [09:52:37] <_joe_> we just need to sync /srv/patches too [09:52:54] <_joe_> there's the patch from chad, I'll take a look later [09:53:40] both /srv/mediawiki and /srv/mediawiki-staging are updated correctly [09:53:58] unless the files as master are stored elsewhere, it works [09:54:58] someone mentioned that he suspected taking so long was an rsync mismatched version [09:55:07] so that would explain it [09:55:12] <_joe_> yup [09:55:56] I will go back to s2 failover, ping me if you need me [10:00:28] !log reconfigure and upgrade db1036 [10:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:40] 's mysql [10:04:37] PROBLEM - swift-account-auditor on ms-be2007 is CRITICAL: Connection refused by host [10:04:38] PROBLEM - swift-object-updater on ms-be2007 is CRITICAL: Connection refused by host [10:04:48] PROBLEM - RAID on ms-be2007 is CRITICAL: Connection refused by host [10:04:57] PROBLEM - swift-account-replicator on ms-be2007 is CRITICAL: Connection refused by host [10:04:57] PROBLEM - very high load average likely xfs on ms-be2007 is CRITICAL: Connection refused by host [10:04:58] PROBLEM - DPKG on ms-be2007 is CRITICAL: Connection refused by host [10:05:19] PROBLEM - swift-account-server on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - swift-object-server on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - Check size of conntrack table on ms-be2007 is CRITICAL: Connection refused by host 
[10:05:27] PROBLEM - swift-object-replicator on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - swift-account-reaper on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - swift-container-server on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - swift-container-auditor on ms-be2007 is CRITICAL: Connection refused by host [10:05:27] PROBLEM - dhclient process on ms-be2007 is CRITICAL: Connection refused by host [10:05:28] PROBLEM - Disk space on ms-be2007 is CRITICAL: Connection refused by host [10:05:49] PROBLEM - swift-container-replicator on ms-be2007 is CRITICAL: Connection refused by host [10:05:49] PROBLEM - configured eth on ms-be2007 is CRITICAL: Connection refused by host [10:05:57] PROBLEM - swift-container-updater on ms-be2007 is CRITICAL: Connection refused by host [10:05:57] PROBLEM - puppet last run on ms-be2007 is CRITICAL: Connection refused by host [10:06:02] err, that's me, downtiming [10:06:08] PROBLEM - salt-minion processes on ms-be2007 is CRITICAL: Connection refused by host [10:06:17] PROBLEM - swift-object-auditor on ms-be2007 is CRITICAL: Connection refused by host [10:07:27] PROBLEM - Apache HTTP on mw2030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50424 bytes in 0.155 second response time [10:08:07] PROBLEM - HHVM rendering on mw2030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50424 bytes in 0.149 second response time [10:08:43] 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1989277 (10Joe) After patching hhvm for adding support for float timeouts, I did the following test: 1) redu... 
[10:08:54] 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1989279 (10Joe) 5Open>3Resolved [10:38:04] paravoid: reprepro on carbon is complaining about missing hp-mcp database, I'm guessing https://gerrit.wikimedia.org/r/#/c/267262 needs merging [10:40:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 869 [10:40:18] how can it complain if it doesn't now about it? [10:40:24] *know [10:40:53] hehe it knows it is missing! [10:41:17] like that feeling of forgetting something when rushing out of some place [10:44:41] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1063, repool db1036 (duration: 00m 21s) [10:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:45:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1192014 Threads: 2 Questions: 7071332 Slow queries: 8072 Opens: 2857 Flush tables: 2 Open tables: 399 Queries per second avg: 5.932 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:51:37] <_joe_> !log stopped jobrunner on mw1161 after failed sync-common [10:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:20] that number seems familiar, fs problem? [10:52:39] <_joe_> jynus: no, sync-common breaks the production machines atm [10:52:48] ops [10:52:49] <_joe_> for some reason, wikiversions.php is not generaterd [10:53:07] <_joe_> I'm going to read the scap source to find out what is happening but please no deploys atm [10:53:15] yes [11:12:58] !log restarting and reconfiguring mysql at db1063 [11:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:53] !log Restarting Zuul. 
Stuck in a dependency loop :( [11:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:44] !log rolling reboot of aqs* (for kernel update) [11:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:47] paravoid mark akosiaris I'm getting icmp destination unreachable for 2001:67c:1562::13 (security.ubuntu.com) in codfw but not eqiad, looks like there's a route for that in codfw up since 8h, known? [11:41:12] uh no [11:41:14] looking [11:41:39] down you mean, not up probably [11:41:53] faidon@bast2001:~$ ping6 2001:67c:1562::13 [11:41:53] PING 2001:67c:1562::13(2001:67c:1562::13) 56 data bytes [11:41:53] 64 bytes from 2001:67c:1562::13: icmp_seq=1 ttl=57 time=269 ms [11:42:15] wfm [11:42:20] mw1010$ ping6 2001:67c:1562::13 [11:42:21] PING 2001:67c:1562::13(2001:67c:1562::13) 56 data bytes [11:42:21] From 2620:0:861:101:fe00::2 icmp_seq=1 Destination unreachable: Administratively prohibited [11:42:21] but barely [11:42:28] that's funny [11:42:30] well duh, that's on a private lan [11:42:35] it's not supposed to work [11:42:42] ah indeed [11:42:43] <_joe_> godog: had the same problem before, looked at the proxy, it started working again [11:42:57] <_joe_> paravoid: yeah it's a proxy problem I guess [11:43:04] 800ms ??? 
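Back on the sync-common breakage _joe_ mentioned (wikiversions.php not being generated): scap derives that file, the map of which wiki runs which MediaWiki branch, from a JSON source during deployment, so a sync that skips the generation step leaves hosts without a usable version map. A toy illustration of the transformation, not scap's actual code (file contents are made up):

```shell
set -e
work=$(mktemp -d); cd "$work"
# toy input: which wiki runs which MediaWiki branch
cat > wikiversions.json <<'EOF'
{
  "enwiki": "php-1.27.0-wmf.10",
  "jawiki": "php-1.27.0-wmf.9"
}
EOF
# emit a PHP array literal from the JSON key/value pairs
{
  printf '<?php\nreturn array(\n'
  sed -nE 's/^ *"([^"]+)": *"([^"]+)",?$/  "\1" => "\2",/p' wikiversions.json
  printf ');\n'
} > wikiversions.php
cat wikiversions.php
```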
[11:43:10] <_joe_> but it stopped being broken as soon as I looked [11:43:11] 64 bytes from 2001:67c:1562::13: icmp_seq=5 ttl=57 time=806 ms [11:43:14] 64 bytes from 2001:67c:1562::13: icmp_seq=1 ttl=57 time=789 ms [11:43:14] 64 bytes from 2001:67c:1562::13: icmp_seq=2 ttl=57 time=872 ms [11:43:14] 64 bytes from 2001:67c:1562::13: icmp_seq=3 ttl=57 time=1021 ms [11:43:15] 64 bytes from 2001:67c:1562::13: icmp_seq=4 ttl=57 time=1126 ms [11:43:15] 64 bytes from 2001:67c:1562::13: icmp_seq=5 ttl=57 time=1169 ms [11:43:16] yup [11:43:18] ouch [11:43:20] that's not good [11:43:30] well nothing we can do about it [11:43:44] (it's just on the last hop) [11:44:34] probably something on the network behind 2600:c0d:4002:3::2 is going bananas [11:44:55] mhh I noticed because as _joe_ mentioned apt got stuck on pulling from security.ubuntu.com [11:45:04] cause both likho.canonical.com and ragana.canonical.com which I assume are different boxes have the same behaviour [11:46:54] 2001:67c:1562::16, which is economy.canonical.com works fine from what I see [11:47:06] probably it's not spending too much :P [11:48:22] partial explanation is also a race with puppet not having the proxy configuration for security.ubuntu.com yet (reimaging the machine) and apt not timing out on it [11:48:41] in any case, sorry for the noise [11:50:45] RECOVERY - Check size of conntrack table on ms-be2007 is OK: OK: nf_conntrack is 0 % full [11:51:04] RECOVERY - configured eth on ms-be2007 is OK: OK - interfaces up [11:51:05] RECOVERY - dhclient process on ms-be2007 is OK: PROCS OK: 0 processes with command name dhclient [11:51:15] RECOVERY - DPKG on ms-be2007 is OK: All packages OK [11:51:44] RECOVERY - Disk space on ms-be2007 is OK: DISK OK [11:51:56] RECOVERY - very high load average likely xfs on ms-be2007 is OK: OK - load average: 0.18, 0.13, 0.08 [11:51:56] RECOVERY - salt-minion processes on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:52:04] 
RECOVERY - RAID on ms-be2007 is OK: OK: optimal, 14 logical, 14 physical [11:57:57] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1989437 (10BBlack) [11:58:12] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1709529 (10BBlack) [12:01:45] RECOVERY - swift-container-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:01:45] RECOVERY - swift-object-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:02:05] RECOVERY - swift-container-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:02:24] RECOVERY - swift-container-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:02:24] RECOVERY - swift-object-server on ms-be2007 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:02:24] RECOVERY - swift-object-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:02:35] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:02:54] RECOVERY - swift-container-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:02:54] RECOVERY - swift-account-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:03:05] RECOVERY - swift-account-server on ms-be2007 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:03:55] RECOVERY - swift-object-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-object-updater [12:05:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:06:56] <_joe_> !log stopped rsync on tin to avoid problems [12:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:07:02] we seem to be missing grrrit-wm too [12:07:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:08:54] PROBLEM - swift-object-updater on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:09:16] PROBLEM - swift-container-replicator on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:09:16] PROBLEM - swift-account-server on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:09:44] PROBLEM - swift-container-auditor on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:09:55] PROBLEM - swift-container-server on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:09:56] PROBLEM - swift-object-server on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:09:56] PROBLEM - swift-object-auditor on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:10:14] PROBLEM - swift-container-updater on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:10:34] PROBLEM - swift-object-replicator on ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:10:35] PROBLEM - swift-account-auditor on 
ms-be2007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:11:05] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: Puppet has 2 failures [12:11:45] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 2 failures [12:15:11] <_joe_> !log stopped rsync, puppet, l10nupdate cronjob on tin [12:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:31] <_joe_> !log stopped puppet on mira, added a big warning in the motd [12:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:20:24] <_joe_> !log stopping rsync on mira too, to avoid accidental deploys [12:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:24] RECOVERY - swift-object-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:21:25] RECOVERY - swift-account-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:21:26] RECOVERY - swift-object-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:21:55] RECOVERY - swift-container-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:21:56] RECOVERY - swift-account-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:22:14] RECOVERY - swift-account-reaper on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:22:25] RECOVERY - swift-container-auditor on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:22:26] RECOVERY - swift-account-replicator on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-replicator [12:22:35] RECOVERY - swift-container-server on ms-be2007 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:22:36] RECOVERY - swift-object-server on ms-be2007 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:22:36] RECOVERY - swift-object-auditor on ms-be2007 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:22:45] RECOVERY - swift-container-updater on ms-be2007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:30:56] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989525 (10Joe) 3NEW [12:36:45] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:37:25] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:44:47] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989556 (10Joe) Looking at the audit log on mira, it seems that for some unknown reason ``` /usr/local/bin/scap-master-sync tin ``` was run and that... [12:45:57] !log rebooting baham for kernel update [12:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:55] 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1989559 (10ArielGlenn) No, and network-wise we shouldn't need it, but this ticket will not be resolved until we know we are providing good service to everyone. If th... 
[12:48:42] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989562 (10hashar) The current situation is: | Host | Dir | Date | Content | tin | /srv/mediawiki-staging/php-1.27.0-wmf.10/ | Feb 2 09:38 | Gone | tin... [12:54:34] PROBLEM - DPKG on dubnium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:54:55] PROBLEM - DPKG on pollux is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:56:24] RECOVERY - DPKG on dubnium is OK: All packages OK [12:56:45] RECOVERY - DPKG on pollux is OK: All packages OK [12:58:26] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:25] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:00:32] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989587 (10hashar) Events around 9:38: | 2016-02-02T09:12:10.000Z | mira | INFO | scap.announce | Synchronized wmf-config/db-eqiad.php: Depool db1036,... 
[13:01:53] !log reboot pollux for kernel upgrades [13:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:59] !log reboot dubnium for kernel upgrades [13:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:04:45] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures [13:09:21] !log rolling reboot of scb* (for kernel update) [13:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:20] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#1989606 (10BBlack) 3NEW [13:13:53] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989627 (10Joe) I created a tarball with the last known valid version of the deployed code (ironically, on tin) and on one appserver so that data that w... [13:16:12] 6operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ability to switch Traffic infrastructure Tier-1 to codfw manually - https://phabricator.wikimedia.org/T125510#1989639 (10BBlack) Should also note: while the above list of steps 1-5 sounds roughly correct for a true switch, we probably wan... [13:25:03] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989655 (10hashar) Earliest files on mira:/srv/mediawiki-staging are from 8:39am Lets look at the history on either tin or mira (since they got synced)... [13:28:28] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989656 (10Joe) yes @hashar that's exactly what happened. 
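Joe's first move on T125506, snapshotting the last known-good copy of the deployed tree before doing anything else, is the standard opening move for this kind of incident. A local sketch of that snapshot-and-verify step, with made-up paths:

```shell
set -e
root=$(mktemp -d)
mkdir -p "$root/srv/mediawiki-staging/wmf-config"
echo '<?php // placeholder' > "$root/srv/mediawiki-staging/wmf-config/db-eqiad.php"
# -C keeps archive paths relative, so the tree can be restored anywhere later
tar -C "$root/srv" -czf "$root/staging-backup.tar.gz" mediawiki-staging
# always list the archive contents before trusting it as a backup
tar -tzf "$root/staging-backup.tar.gz"
```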
[13:30:15] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:57:33] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989708 (10Joe) To better clarify the timeline: - I reimaged tin around 8 UTC; - puppet created a git clone of operations/mediawiki-config in /srv/med... [13:58:38] !log rebooting eeden for kernel update [13:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:43] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989711 (10hashar) CC ing the whole #releng . **TL;DR:** we have lost staging area (code / caches / settings) from BOTH deployment servers and have to... [14:00:43] 6operations, 10ops-codfw: ms-be2007 - System halted!Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1989718 (10fgiunchedi) 5Open>3Resolved machine has been reimaged today [14:01:15] Hello all! I just joined WMF and I try to get the admin stuff out of the way. Can anyone point me to documentation on where to send my public SSH keys ? [14:02:04] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [14:02:16] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:42] gehel: modules/admin/data/data.yaml in operations/puppet [14:03:54] heh seriously, eeden right now? :P [14:04:00] hoo: thanks ! I'll have alook [14:04:07] !log looking at eeden [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:34] bblack: that was a scheduled reboot, but forgot to silence in icinga [14:04:36] <_joe_> gehel: sorry, today is definitely *not* the right day :/ [14:04:40] oh ok! 
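To expand hoo's pointer for new staff like gehel: production shell access is requested by adding your user and public key under modules/admin/data/data.yaml in the operations/puppet repository. The exact schema evolves over time, so treat the following as an approximate, hypothetical entry (every field value is a placeholder):

```yaml
users:
  jdoe:                      # shell username (placeholder)
    ensure: present
    uid: 99999               # placeholder; assigned during the access request
    gid: 500
    realname: Jane Doe
    ssh_keys:
      - ssh-rsa AAAA...placeholder... jdoe@example
```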
[14:04:48] !log nevermind, not looking at eeden [14:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:52] lol [14:05:14] _joe_: I'm still busy reading docs and creating accounts, we'll see some other time... [14:08:00] gehel: hi! I can help, which docs are you already looking at? [14:08:55] bblack, paravoid: eeden is not back up yet, though. I also can't get a connection to eeden.mgmt.esams.wmnet, it's not responding [14:08:57] damn [14:08:58] godog: Thx! I have an onboarding checklist, slowly going through it (https://www.mediawiki.org/wiki/Wikimedia_Discovery/Team/Onboarding/Guillaume) [14:09:21] I'm getting a "Disconnected from UNKNOWN" after a while [14:09:33] godog: I'm not yet blocked. I'll get back to you when I am ;-) [14:10:26] connected to eeden's serial [14:10:41] Debian GNU/Linux 8 eeden ttyS1 [14:10:45] eeden login: [14:10:51] so, up? [14:11:03] yeah it is [14:11:37] gehel: oh ok, re: ssh keys you'll need two but ping me when you get blocked [14:11:59] godog: prod + lab? Yep, heard about that. [14:12:47] gehel: indeed! [14:12:48] hi, some servers are responding with an incorrect time. Is this intended behavior? " {{管理者への立候補/span|予告期間|20160201152235|0|2}} "+ signature ; Database time: 2016-02-01T15:26:19Z ; https://ja.wikipedia.org/w/index.php?action=edit&oldid=58462951&uselang=en [14:14:36] godog: I'll let you fix something more urgent... Thanks for the help! [14:15:26] paravoid: I can also connect now, name resolution is broken, though [14:20:00] !log starting rebuilding /srv/mediawiki-staging from scratch on mira [14:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:42] ok so what's the state of eeden?
I kinda want to look, but I don't want to interrupt if one of you is doing something useful [14:20:58] please do [14:21:11] bblack: please have a look, the host is up, but name resolution is broken [14:21:29] !log starting rebuilding /srv/mediawiki-staging from scratch on tin (not mira) [14:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:54] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 85.74 ms [14:23:25] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 85.92 ms [14:24:31] paravoid: ? I didn't actually fix anything :) [14:24:46] not me either [14:25:02] me neither [14:25:14] hmmm ok :) [14:26:32] running puppet now in any case [14:26:55] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: puppet fail [14:27:56] fixing up packages and such... [14:32:44] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:04] PROBLEM - DPKG on eeden is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:36:42] death to all interactive prompts on package upgrades :P [14:37:44] RECOVERY - DPKG on eeden is OK: All packages OK [14:43:16] !log tin /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.8 php-1.27.0-wmf.8 [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:20] !log tin /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.9 php-1.27.0-wmf.9 [14:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:24] !log tin /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.10 php-1.27.0-wmf.10 [14:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:20] looking at syslog, name resolution failed from the start, the first error messages are from ntpd startup 11s into the boot, but I fail to see why it failed [14:50:37] moritzm: yeah I have no idea.
I didn't look deeply. after it all started magically working, I ran puppet and that failed on dpkg stuff, then I did apt-get upgrade and updated all the packages, then puppet ran ok [14:50:46] that's about all I did, the rest makes no sense to me right now [14:51:20] I'm suspicious about maybe resolv.conf being temporarily wrong, but who knows [14:51:44] (gdnsd listens on *:53 - maybe during initial startup for some reason it looked at itself for a DNS cache, then resolv.conf got updated after some other script ran?) [14:52:08] it has a 2015 mtime though [14:53:51] Feb 2 14:01:01 eeden lldpd[639]: unknown org tlv [00:ffffff90:69] received on eth0 [14:53:54] ? [14:53:57] normal? [14:55:47] looks like eth0 came up very late in general [14:57:12] but systemd didn't do anything network-like until after it was up [14:57:44] !log disable swift container-sync for wikipedia-it-local-public.a7 [14:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:15] PROBLEM - NTP on eeden is CRITICAL: NTP CRITICAL: Offset unknown [14:59:07] yeah, I had a look at the recursors used in resolv.conf, but nothing suspicious there either [14:59:27] the mtime is the same on baham, I guess it's simply rather static :-) [15:36:15] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1989957 (10jcrespo) 3NEW a:3jcrespo [15:44:30] so, good morning _joe_ and hashar: should I remove/postpone the SWAT that's scheduled for 15 minutes from now? [15:44:50] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1989970 (10Dzahn) 26G /srv/ 4.7G /srv/deployment 16K /srv/lost+found 6.5G /srv/mediawiki 15G /srv/mediawiki-staging 1.1M /srv/patches 476K /sr...
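The chatter above is about confirming whether a host's stub resolver works at all. A minimal sketch of such a sanity check, assuming nothing about eeden's actual setup (the hostnames below are illustrative, and this is not the diagnostic the operators ran):

```python
# Illustrative resolver sanity check: asks the local stub resolver to
# turn a hostname into an address, the same operation that was failing
# on eeden early in boot.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# "localhost" should always resolve; the ".invalid" TLD is reserved
# (RFC 2606) and must never resolve.
print(can_resolve("localhost"), can_resolve("name.invalid"))
```

On a host in eeden's broken state, even well-known names would return False, distinguishing a resolver problem from an unreachable remote service.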
[15:45:09] greg-g: yeah should be postponed [15:45:19] hashar: kk [15:45:23] greg-g: I have replied to Joe's email on the ops list to ask to freeze/postpone/delay/forget about deploying [15:45:35] might even cancel today's train [15:46:31] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1989973 (10demon) Yes, that's the entire problem, mw-staging went missing. At the very least, we want deployment, mw-staging and patches. [15:48:42] 6operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1989990 (10Nemo_bis) Is there any example of a mirror for large datasets that manages to provide a good service to the whole world from a single location? If yes, we sh... [15:48:52] hashar: gotcha, I postponed the morning SWAT until the hour before the train (ie: 19:00 UTC), if things are not better by then we can cancel [15:49:32] 6operations, 10ops-eqiad, 10procurement: eqiad: Order (4) SAS 2TB Disks for out of warranty analytics/kafka - https://phabricator.wikimedia.org/T125529#1989994 (10Cmjohnson) 3NEW a:3RobH [15:49:51] greg-g: may you take care of pinging the relevant folks? I would postpone Graphoid/Parsoid as well [15:50:19] ok [15:50:34] As long as the syncs going on in the next ~hour go fine, we'll be fine to continue today [15:51:11] * greg-g nods [15:51:51] 6operations, 10procurement: codfw: Order (4) SAS 2TB Disks for out of warranty analytics/kafka - https://phabricator.wikimedia.org/T125530#1990010 (10Cmjohnson) 3NEW a:3RobH [15:55:40] 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 5Patch-For-Review: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1990029 (10Dzahn) @demon ok, thanks, i uploaded a patch for the entire /srv @akosiaris would you see 26G as an issue for...
[15:57:05] PROBLEM - DPKG on radon is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:00:45] RECOVERY - DPKG on radon is OK: All packages OK [16:10:46] fyi to those who might have missed the scrollback and changes to the deploy calendar: [16:11:11] the morning SWAT that should be happening right now is delayed until further notice [16:11:38] it is scheduled on the deploy calendar for 11am pacific (19:00 UTC) in hopes things will be better by then [16:12:15] hashar, greg-g noted .. no parsoid deploys today anyway. [16:12:47] yeah, I'm waiting to postpone the services window until we see what our recovery timeline is [16:13:21] subbu: we would like to reduce the amount of operations happening while the staging area is being rebuilt [16:13:41] got it. [16:14:45] greg-g: ta. tracking conversation on the phab task (T124220) [16:15:15] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1990064 (10phuedx) [16:15:32] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1989525 (10phuedx) [16:15:32] phuedx: thankya sir [16:15:53] and for letting me know, HAVE SOME SPAM [16:16:33] part of a balanced breakfast [16:19:04] 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 5Patch-For-Review: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1990070 (10akosiaris) @Dzahn, no, 26GB is not really an issue but don't use the disk space on helium as a guide. It's the s...
[16:19:50] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1990076 (10akosiaris) [16:19:53] 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 5Patch-For-Review: Backup all of /srv on mira and/or tin (deployment servers) - https://phabricator.wikimedia.org/T125527#1990074 (10akosiaris) 5Open>3Resolved Merged, done. Tomorrow we should have the first back up. [16:23:38] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: rebuild wikiversion.php [16:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:12] yurik: fyi, deploys are on hold for a while until we sort out the deploy masters issues (see the ops mailing list) [16:24:30] greg-g, thx :) [16:24:40] * yurik was about to mess up everything as usual [16:24:53] figured :) [16:24:55] RECOVERY - Apache HTTP on mw2030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.112 second response time [16:25:24] RECOVERY - HHVM rendering on mw2030 is OK: HTTP OK: HTTP/1.1 200 OK - 66489 bytes in 0.321 second response time [16:25:28] * yurik refuses to read email unless it has the word "yuri" or "yurik" in them [16:25:59] * yurik thinks it might qualify as discrimination, but so be it [16:26:07] I'd prefer if all deployers also read emails like "[Ops] [*IMPORTANT*] DO NOT deploy" [16:26:15] hmm... [16:26:19] in fact, if they don't, I might revoke their deploy rights [16:26:30] :P [16:26:34] * yurik considers to add the subject:IMPORTANT to the rule [16:26:36] (serious though) [16:26:49] there's a reason we require all deployers to be on the ops list [16:27:04] * yurik wonders why he would need to read emails if greg-g is nice enough to notify yurik personally? 
[16:27:05] we don't email people individually, I only pinged you because I knew I couldn't trust you to read the email :) [16:27:13] see ^ [16:27:18] :D [16:27:25] ok ok, adding the word IMPORTANT to my filters [16:27:31] which, I shouldn't have to do, so should I just revoke deploy perms? [16:27:34] and btw, when was the last time i let you down :-P [16:27:52] adding! [16:27:54] no, don't add IMPORTANT to your filter, just read the ops list [16:28:06] I might not add IMPORTANT to my "DO NOT DEPLOY" emails [16:28:39] You should be reading all email going through your inbox, yurik [16:28:40] <_joe_> greg-g: yeah sorry I had to re-enable everything to test [16:28:56] Krenair, physically impossible :( [16:29:24] greg-g, regardless, i don't see your email [16:29:37] yurik: it's from joe and antoine [16:29:49] it's not always from me, I also sleep sometimes [16:29:58] yurik: yeah no deploy please. We are freezing everything until we have the situation under control. sorry :( [16:30:17] grr! hashar, i got it the very first time :) [16:30:18] sigh [16:30:47] is modules/admin/data/data.yaml also used for lab access? Or only prod? Trying to figure out if I should send both SSH keys or not.
[16:31:40] gehel: register on wikitech, which is the LDAP that labs uses [16:31:49] that is for prod but should use the same UID [16:32:21] no labs instances use the admin module [16:32:32] it is prod-only [16:33:03] one key goes in wikitech/gerrit/whatever else, one single-purpose production-only key goes in the admin data [16:34:04] greg-g, for some weird reason, I am not on the ops mailing list - the only time i see ops@lists.wikimedia.org is in cross-posted mail [16:34:12] * yurik is surprised [16:34:13] <_joe_> !log restarted puppet and rsync on both tin and mira, removed comments on the l10nupdate job on tin [16:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:10] * yurik is looking at how to add himself to it, and apparently it's not listed in the mailman [16:35:57] <_joe_> !log sync-common on mw2030 and mw1161; re-enable puppet, jobrunner, jobchron on mw1161 [16:35:59] yurik: subscribe please, it is a requirement for all deployers [16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:07] yurik: hmm, lemme see [16:36:10] greg-g, i would love to, but it's not listed [16:36:22] yurik: https://lists.wikimedia.org/mailman/listinfo/ops [16:36:27] urlhack it :) [16:36:56] chasemp: I thought public key distribution is done with Puppet. I'm probably lost ... [16:37:03] 6operations, 3Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1990145 (10Mholloway) Moved the draft to https://wikitech.wikimedia.org/wiki/Mobileapps_service_notes_for_ops, since all wiki pages are always drafts. Calling this com...
[16:37:14] greg-g, i tried urlhack - but i used the -l suffix :) [16:37:21] added, please approve [16:37:30] 6operations, 3Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1990148 (10Mholloway) 5Open>3Resolved [16:37:44] greg-g, I see other deployers who are not subscribed [16:37:57] * yurik doesn't feel so bad anymore :-P [16:37:58] I also see at least one deployer who has set their subscription to digest mode [16:38:31] Krenair: can you PM me names that you noticed? [16:39:19] am going through the full list now [16:39:34] I might also send you the list of inactive deployers [16:39:49] Krenair: kk, thanks [16:39:50] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1990172 (10Jdforrester-WMF) [16:40:22] Krenair, inactive deployers??? as in they have depl rights but are no longer with WMF? :) [16:40:30] !log mw1017 sync-common --verbose [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:10] yurik, as in they have deployment rights but, for example, last logged in via pmtpa [16:42:50] hehe [16:42:54] I miss fenari :( [16:42:55] chasemp, Krenair: so it means that adding my entry to modules/admin/data/data.yaml is probably not a priority. Still, is there any reason not to do it yet ? [16:43:10] yes, fenari, that's the name I was looking for [16:43:11] best deployment host / bastion / work machine ever [16:43:16] chasemp, Krenair: sub-question: how do I choose a UID? [16:44:10] gehel: when you register on wikitech and the ldap account is created a UID is assigned, we should copy that to data.yaml [16:45:09] gehel, I'd expect you to apply for production access from a quick check of your welcome email [16:45:09] chasemp: and to get the UID, I probably have to log into any server and check ? 
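For reference, the user stanza being discussed might look roughly like the following. This is an illustrative sketch only: the authoritative schema for modules/admin/data/data.yaml lives in the operations/puppet repository, and every field name and value here is an assumption except the uid (13593, the uidNumber from the LDAP lookup above).

```yaml
# Illustrative only: field names and values are assumptions, not the
# authoritative schema of modules/admin/data/data.yaml.
users:
  gehel:
    ensure: present
    uid: 13593               # uidNumber assigned by LDAP at wikitech signup
    realname: Guillaume Lederrey
    ssh_keys:
      - ssh-ed25519 AAAA...  # the single-purpose, production-only key
```

Note that, per the discussion, being listed here creates nothing on its own; accounts only appear on servers once the user is added to groups, which is what the Phabricator access-request ticket approves.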
[16:45:22] as you're an ops engineer [16:45:59] gehel: if you put up a changeset to gerrit w/ your user stanza and key and missing UID lots of ppl can help fill that in to bridge the gap [16:46:00] no worries [16:46:19] Krenair: I'd really like to know my way around a bit better before anyone grants me prod access :-) [16:47:02] gehel, that's a good way to approach it :) [16:47:03] chasemp, Krenair : ok, thanks a lot for taking the time! I'll send a changeset... [16:47:41] !log tin /srv/mediawiki-staging : running git submodule update --init [16:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:05] !log tin /srv/mediawiki-staging : running git submodule update --init --recursive [16:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:47] !log mw1017: removed stray .git directory from WikipediaFirefoxMobileOS or w/e. It shouldn't be there anyway. sync-common is happy again on it [16:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:52] hashar, _joe_: ^ [16:59:06] Imma check a few other nodes at random and see if they've got that same .git [16:59:16] !log files were /srv/mediawiki/docroot/wikimedia.org/WikipediaMobileFirefoxOS/.git and /srv/mediawiki/docroot/wikimedia.org/WikipediaMobileFirefoxOS/js/lib/MobileFrontend/.git [16:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:20] If so, we'll prolly need to clean with salt prior to scapping [16:59:29] yup [16:59:41] might be good to run salt nonetheless [16:59:48] just to make sure they are gone [17:00:04] _joe_ ema: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160202T1700). Please do the needful. [17:00:04] mobrovac mdholloway bearND: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:18] (03PS1) 10Mobrovac: RESTBase: enable metrics batching [puppet] - 10https://gerrit.wikimedia.org/r/267917 [17:00:32] i have another one for puppetswat! [17:01:12] <_joe_> mobrovac: you will have to wait I guess [17:01:25] _joe_: still deployment problems? [17:01:43] folks are still in the middle of getting things put back together [17:01:45] <_joe_> mobrovac: still fixing and confirming, I don't want anything else interfering in the meantime [17:01:51] Hmm, mw1119 and mw2051 (chosen at random) both look fine. [17:02:00] (in terms of those stray .gits) [17:02:08] _joe_: yup, sure [17:02:23] I mean, the files are all there, not just their .gits. [17:03:12] * aude has read the mails [17:03:15] so no train? [17:03:49] <_joe_> aude: not now for sure, no [17:04:10] ok [17:07:57] (03PS1) 10Gehel: Adding user gehel (Guillaume Lederrey) to user list [puppet] - 10https://gerrit.wikimedia.org/r/267919 [17:08:49] gehel, okay so [17:08:55] do you have an account in labs now? [17:09:29] Krenair: I have an account on wikitech, which as far as I understand can be used to connect to labs [17:09:34] that's right [17:09:39] what's the wikitech account username? [17:09:44] Krenair: gehel [17:10:17] Krenair: as far as I know, I don't have any access in lab yet [17:10:18] uidNumber: 13593 [17:10:32] Krenair: thx, I'll update that asap [17:11:06] that's from "ldaplist -l passwd gehel" on bastion.wmflabs.org [17:15:24] (03PS2) 10Gehel: Adding user gehel (Guillaume Lederrey) to user list [puppet] - 10https://gerrit.wikimedia.org/r/267919 [17:16:44] So I'm not sure how this works for ops exactly [17:16:46] maybe it's normal [17:17:07] but for production access, you need to file a request ticket in phabricator with some details, and have some people approve it [17:17:18] <_joe_> yes [17:17:32] <_joe_> gehel: actually tomasz should create the ticket [17:17:34] specifically they need to know which groups you want to be added to, why, etc. 
[17:17:48] which is something currently missing from this patch [17:18:24] So I should not directly send a patch, but open a ticket to request that someone else writes the patch ? [17:18:35] I don't think there's anything wrong with providing the patch yourself [17:19:43] I was trying to push what I know needs to be done (make myself exist as an account) and I was thinking that actual access could be done at a later time (when I know what I need access to). [17:20:43] Krenair: but tomasz should still create the ticket before the patch is merged. And I should probably enrich it by putting myself in the appropriate groups. Correct? [17:21:30] adding a user to that list won't create your account on any servers without being in any groups [17:21:37] gehel: we've recently onboarded two new ops people, looking for the corresponding tickets now [17:22:07] I doubt it'd get merged without tomasz's approval, yes gehel [17:22:47] and depending on exactly what group(s) you request, the approval requirements can be different [17:22:58] Krenair: yep, that seems obvious. I was more wondering if the approval could be done directly in gerrit [17:23:21] not sure... I've only ever seen it done in phabricator [17:23:50] yup, approval is via phabricator, e.g. https://phabricator.wikimedia.org/T122925 [17:24:30] https://wikitech.wikimedia.org/wiki/Requesting_shell_access says phabricator, but I'm pretty sure I wrote at least part of that line [17:27:10] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1990357 (10hashar) We have restored 1.27.0-wmf.8 1.27.0-wmf.9 and 1.27.0-wmf.10 . Regenerated the l10n cache. mw1017 has been synced. We are now tryin... 
[17:31:54] <_joe_> !log depooled mw1119, partial sync [17:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:37] <_joe_> mobrovac: we will puppetSWAT tomorrow, sorry [17:34:01] <_joe_> I'm working non-stop since 8 AM, and I don't think I'll be done for quite some time still [17:34:22] i understand _joe_ [17:34:26] fingers crossed! [17:36:51] !log mw1119:/srv/mediawiki/wmf-config/event-schemas is empty [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:40] !log disable unused swift container-sync for wikibooks-ka-local-thumb wikibooks-hr-local-thumb wikibooks-km-local-thumb wikibooks-sk-local-thumb wikibooks-tr-local-thumb wikipedia-it-local-thumb.fc [17:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:33] !log mw1119 sync-common [17:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:10] ah [17:45:49] (03PS4) 10ArielGlenn: new salt runner to sign/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [17:45:50] !log mira /srv/mediawiki-staging git submodule update --init --recursive [17:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:26] (03CR) 10jenkins-bot: [V: 04-1] new salt runner to sign/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [17:48:45] (03PS1) 10Eevans: Change default consistency to localOne [puppet] - 10https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) [17:48:48] (03PS5) 10ArielGlenn: new salt runner to sign/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [17:55:51] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: 
MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1990424 (10Joe) 5Open>3Resolved [17:56:38] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1973969 (10Joe) pointing mediawiki on a production to an inexistent IP for kafka does not leave stale connections behind anymore, nor bus... [17:58:51] (03PS1) 10Giuseppe Lavagetto: tin: disable l10nupdate until we figure out if it works with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/267927 [18:00:04] yurik gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160202T1800). Please do the needful. [18:00:20] i know i know [18:00:38] mobileapps deploy moved to tomorrow [18:09:14] PROBLEM - RAID on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:35] PROBLEM - dhclient process on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:35] PROBLEM - salt-minion processes on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:45] PROBLEM - configured eth on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:09:46] PROBLEM - DPKG on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:05] (03CR) 10Luke081515: [C: 031] "looks better now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio) [18:10:05] PROBLEM - Disk space on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:10:54] RECOVERY - RAID on labstore1002 is OK: OK: optimal, 12 logical, 12 physical [18:13:35] RECOVERY - Disk space on labstore1002 is OK: DISK OK [18:14:45] RECOVERY - salt-minion processes on labstore1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:14:45] RECOVERY - dhclient process on labstore1002 is OK: PROCS OK: 0 processes with command name dhclient [18:15:04] RECOVERY - configured eth on labstore1002 is OK: OK - interfaces up [18:15:04] RECOVERY - DPKG on labstore1002 is OK: All packages OK [18:15:44] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio) [18:25:03] (03PS1) 10Chad: deploy master: recurse submodules on clone [puppet] - 10https://gerrit.wikimedia.org/r/267929 [18:26:32] _joe_: For next deploy master we provision ^ [18:30:32] <_joe_> ostriches: tomorrow I'll merge this and my patch for the commit-msg hook [18:31:18] okie dokie [18:34:43] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1990718 (10mobrovac) [18:35:40] (03PS2) 10Andrew Bogott: openstack: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266980 (owner: 10Dzahn) [18:37:22] (03CR) 10Andrew Bogott: [C: 032] openstack: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266980 (owner: 10Dzahn) [18:37:56] (03PS2) 10Andrew Bogott: labs: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266982 (owner: 10Dzahn) [18:39:37] (03CR) 10Andrew Bogott: [C: 032] labs: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266982 (owner: 10Dzahn) [18:40:19] (03PS2) 10Andrew Bogott: labs_bootstrapvs: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266976 (owner: 10Dzahn) [18:42:04] (03CR) 10Andrew 
Bogott: [C: 032] labs_bootstrapvs: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266976 (owner: 10Dzahn) [18:42:16] 6operations, 10netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#1990753 (10faidon) This is supposed to be turned up already, as confirmed with both our delivery manager and Zayo's call center. In anyh case, me and their customer support filed a ticket for... [18:44:06] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1990757 (10mobrovac) 5Open>3Resolved a:3mobrovac The POST end point is now open to the public, so this is not occurring any more: ``` $ ./modules/s... [18:44:18] 6operations, 10Mathoid, 10RESTBase: restbase/mathoid service checker failure when ran from outside wmf network - https://phabricator.wikimedia.org/T122213#1990760 (10mobrovac) p:5Triage>3Low [18:51:48] 6operations, 10Wikimedia-Video, 5Patch-For-Review: 1gb file upload limit is too restrictive for conference presentation videos - https://phabricator.wikimedia.org/T116514#1990804 (10Dzahn) I think it's a good idea to raise it to the `1024 * 1024 * 2047` for now. As @Fuzheado said 2GB would already make a hug... [18:54:35] <_joe_> !log repooled mw1119 [18:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:22] <_joe_> !log running sync-common on mw1020 [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:34] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [19:05:03] (03PS1) 10Yuvipanda: tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/267932 [19:05:24] RECOVERY - Disk space on labstore1002 is OK: DISK OK [19:05:42] chasemp: ^ is that you? 
(labstore1002) [19:05:49] (03PS2) 10Yuvipanda: tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/267932 [19:05:59] YuviPanda: yes I silenced it a moment too late it's no worries [19:06:10] just a temp file to profile disk perf [19:06:35] chasemp: ah, okok :) [19:06:45] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1990864 (10mark) a:5mark>3RobH Approved. [19:06:49] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/267932 (owner: 10Yuvipanda) [19:07:36] 6operations, 10ops-codfw, 10Salt, 10hardware-requests: allocate hardware for salt master in codfw - https://phabricator.wikimedia.org/T123559#1990872 (10mark) a:5mark>3None Approved. [19:07:54] 6operations, 10ops-codfw, 10Salt, 10hardware-requests: allocate hardware for salt master in codfw - https://phabricator.wikimedia.org/T123559#1990876 (10ArielGlenn) Yay! [19:09:41] (03PS1) 10Dduvall: Check .scap-master-ready file before syncing scap masters [puppet] - 10https://gerrit.wikimedia.org/r/267934 [19:09:53] ostriches, thcipriani, hashar: ^ [19:11:00] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 3 others: Migrate CXServer to Node 4.2 and Jessie - https://phabricator.wikimedia.org/T107307#1990882 (10Amire80) [19:12:24] marxarelli: -:}} [19:22:17] (03CR) 10Hashar: [C: 04-1] "Lame comments. 
Seems the way it is written that will prevent a new master to sync from a reference since the new master does not have the " (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/267934 (owner: 10Dduvall) [19:22:26] 6operations: setup/deploy oresredis1001-oresredis1002 - https://phabricator.wikimedia.org/T125562#1990909 (10RobH) 3NEW a:3RobH [19:22:40] 6operations: setup/deploy oresredis1001-oresredis1002 - https://phabricator.wikimedia.org/T125562#1990920 (10RobH) [19:22:43] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1830629 (10RobH) [19:23:12] 6operations: setup/deploy oresredis1001-oresredis1002 - https://phabricator.wikimedia.org/T125562#1990909 (10RobH) [19:23:17] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1990922 (10RobH) 5Open>3Resolved T125562 now exists for the setup of these systems. This #hardware-request is completed. [19:26:06] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1990936 (10RobH) [19:29:07] !log Running sync-common on mw100[123] [19:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:12] 6operations, 10ops-eqiad: Update Label for oresrdb1001 (WMF4577) & relocate and update label for oresrdb1002 (WMF4578) - https://phabricator.wikimedia.org/T125565#1990961 (10RobH) 3NEW a:3Cmjohnson [19:35:04] !log Running sync-common on mw1259 (video scaler) and mw1153 (image scaler) too [19:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:54] !log Running sync-common on all jobscalers [19:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:46] !log Running sync-common on mw1260 (video scaler) [19:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:05] are we still going for the 20:00 
train on-time-ish? [19:49:30] Well, we still need to do a full scap everywhere. [19:50:16] andrewbogott: Hey andrew, any chance you saw my email about http://commtech.wmflabs.org/ being down yesterday? [19:50:36] sent it to ops-request, but haven't heard anything back from anyone [19:51:29] kaldari: I think ops-request goes to the person on clinic duty, which would be elukey [19:51:41] although actually maybe it goes to rt which no one reads anymore… not sure. Ideally it would create a phab ticket [19:51:51] technically I think it goes to some phab queue, can't recall [19:52:00] but that wouldn't be immediately noticeable [19:52:06] :) [19:54:52] historically the clinic duty person watched that queue, I suspect it's still true [19:55:23] but in any case… kaldari, typically things /in/ labs are the responsibility of their respective admins. I can find out who that is for commtech if you don't know [19:56:25] (03CR) 10BBlack: [C: 031] ticket.wikimedia.org: Lower TTL down to 5M [dns] - 10https://gerrit.wikimedia.org/r/267871 (https://phabricator.wikimedia.org/T74109) (owner: 10Alexandros Kosiaris) [19:56:37] kaldari: ops-requests@ used to create tickets back in RT, but since we switched to phab that won't be the case anymore [19:56:41] (03CR) 10BBlack: [C: 031] ticket.wikimedia.org: Move over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/267872 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [19:57:42] (03CR) 10BBlack: [C: 031] otrs: OTRS search slow, increase between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/267867 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [19:57:52] you could still mail task@phab and tag it with the operations tag [19:58:05] (03CR) 10BBlack: [C: 031] misc-web: Route ticket.wikimedia.org to mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/267868 (https://bugzilla.wikimedia.org/74109) (owner: 10Alexandros Kosiaris) [19:59:26] (03Abandoned) 10BBlack: ipsec: remove cp3042
[puppet] - 10https://gerrit.wikimedia.org/r/267664 (https://phabricator.wikimedia.org/T125265) (owner: 10Giuseppe Lavagetto) [19:59:56] (03CR) 10BBlack: [C: 031] ipsec: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266985 (owner: 10Dzahn) [20:00:04] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160202T2000). Please do the needful. [20:00:25] (03CR) 10BBlack: [C: 031] varnish: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266977 (owner: 10Dzahn) [20:00:54] (03CR) 10BBlack: [C: 031] deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244092 (owner: 10Dzahn) [20:02:46] 6operations, 6Analytics-Kanban, 10hardware-requests, 5Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1991215 (10Ottomata) Hold on this, it seems will be replacing the aqs1xxx nodes since they are out of warranty. [20:03:13] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1991218 (10Ottomata) Hold on this, we may be using the remainder budget for other things. [20:04:53] tto: Can you take a look, and maybe remove your -2 here? There is a new patchset: https://gerrit.wikimedia.org/r/#/c/253567/ [20:05:18] andrewbogott: it looks like you, Yuvi, and Niharika are the admins. Niharika is in India, so she's asleep :( [20:05:53] I’m projectadmin in every project, it doesn’t mean I’ve ever heard of it or touched it [20:06:04] I can poke at it for a minute or two [20:06:14] what is the actual project name? [20:06:56] andrewbogott: commtech [20:08:04] I can log into the project, but it gives a 502 in a web browser. Is there some way I can restart it? [20:09:11] there’s only one instance in that project. 
it doesn’t have apache, nginx or lighttpd installed [20:09:18] so I don’t even know what we would restart [20:10:04] 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1991271 (10Amire80) [20:10:11] kaldari: you already have a login and sudo on that instance, right? [20:11:26] !log Running sync-common on canary app servers (mw1017-mw1025) [20:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:00] I have login, but don't think I have sudo [20:14:01] you should [20:14:39] yeah, all users have sudo [20:14:42] I tried using the webservice command but the command isn't found [20:14:53] ok maybe I do have sudo [20:15:04] ‘webservice’ is a custom thing that only exists in tools. [20:16:11] andrewbogott: You're right, I do have sudo [20:16:26] !log mira: removed untracked wmf-config/x.php testing file [20:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:39] kaldari: sorry, not trying to stonewall you — it’s just that this is a totally black box which I’ve never touched or seen before :) It needs to be maintained by whoever built it, and if it’s somehow time-critical it needs multiple maintainers. [20:17:02] I don’t think I can do anything you can't [20:17:10] !log Running sync-common on mw1114-mw1119 (canary api appservers) [20:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:27] that's fine. Just point me to the documentation for how to maintain it [20:18:22] kaldari: do you know who built it? They’re the ones to ask. [20:18:30] Niharika [20:18:39] Then only she can know whether or not there are docs. [20:18:58] I'm sure she didn't create any docs [20:22:35] it was a mediawiki instance, right? 
So it could be you need to start mw-vagrant, not a web server [20:22:46] what was the project name again please? [20:22:55] let's check the nova resource page [20:23:05] https://wikitech.wikimedia.org/wiki/Nova_Resource:Commtech-1.commtech.eqiad.wmflabs [20:23:10] mutante: commtech [20:23:23] and https://wikitech.wikimedia.org/wiki/Nova_Resource:Commtech [20:23:27] ok, i meant that last one [20:23:36] with the red "Add documentation" link though [20:24:35] mutante: OK, I'll be sure to mention that to Niharika. In the meantime, do you have any idea what a 502 error might indicate? [20:24:58] kaldari: the 502 is from the proxy, saying 'I can't connect to the downstream host' (in this case, the commtech server) [20:25:01] kaldari: there's a proxy server that provides http access [20:25:13] right, so the 502 just means “I'm knocking on the door but no one is answering" [20:25:15] kaldari: that the proxy server got the request but the backend is dead [20:25:21] we all agree :) [20:25:26] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1991389 (10hashar) Servers are being synced in small batches. We are proceeding with canary servers first and so far it is going fine. [20:25:27] is that the only instance in the project? [20:25:30] yea [20:25:34] cool, that's more than I knew before :) [20:25:42] I believe so [20:25:58] which server do we expect to run on it? [20:26:09] mutante: the proxy is pointed to that one instance. [20:26:15] yea, but on the instance [20:26:15] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:23] I think this was mediawiki vagrant or somesuch [20:26:26] which is… commtech-1.commtech.eqiad.wmflabs [20:26:34] yea, i am on that [20:26:34] but it's really confusing as this is running its own nfs server [20:26:38] it's running mediawiki-varnish.
I guess I'll try making sure that's running properly [20:26:39] for what purpose I can't really figure [20:27:40] I meant mediawiki-vagrant :) [20:27:44] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:27] !log restarted hhvm on mw1116 [20:28:30] if there was some kind of webserver on this [20:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:28:40] i'd just restart it.. but i dont see one [20:29:45] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.245 second response time [20:29:59] sometimes the best docs are in .bash_history.. .looks [20:30:03] kaldari: has it been down since last week? If so, then probably that box was restarted and the services aren't configured to survive a reboot. [20:30:14] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:20] andrewbogott, mutante: I got it back up and running [20:30:27] great! [20:30:28] sorry to bug you guys [20:30:29] :) confirmed [20:30:32] what did you do [20:30:45] I restarted vagrant :) [20:30:47] i was just beginning to read niharika's history [20:31:15] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 66505 bytes in 1.719 second response time [20:31:23] kaldari: what did you type? [20:31:48] andrewbogott: vagrant halt & vagrant up [20:32:01] !log mw1114-mw1119 are canary api appservers Finished syncing [20:32:03] ah, thx [20:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:10] mw1116 was me syncing it [20:32:10] well, that sounds pretty painless. [20:33:05] andrewbogott: true. I was assuming that a 502 was something worse. Sorry. [20:33:28] kaldari: yep, I see how that's confusing. I'll make a bug about tuning up that error message.
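The 502 diagnosis above (the labs web proxy answers 502 itself when the instance behind it doesn't respond) can be sketched minimally. This is an illustrative model under stated assumptions, not the actual proxy code; `proxy_status` is a hypothetical name:

```python
import socket

def proxy_status(backend_host, backend_port, timeout=2.0):
    """Toy model of the proxy's decision: if a TCP connection to the
    backend succeeds, relay its response (collapsed to 200 here); if
    nobody answers, the proxy itself emits a 502 Bad Gateway."""
    try:
        with socket.create_connection((backend_host, backend_port), timeout=timeout):
            return 200  # backend reachable; a real proxy would forward the reply
    except OSError:
        return 502  # "knocking on the door but no one is answering"
```

Which is why the 502 on commtech.wmflabs.org pointed at the service on the instance being down (the stopped vagrant VM), not at the proxy, and `vagrant halt` followed by `vagrant up` fixed it.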
[20:33:37] andrewbogott: thanks [20:35:11] https://phabricator.wikimedia.org/T125576 [20:35:14] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:14] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:41] greg-g: so… wmf.12 is not happening today, is it? i'm wondering if i can trust https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap [20:40:51] MatmaRex: it is [20:40:55] (happening today) [20:41:02] oh. nice [20:41:57] i heard that you accidentally the whole staging, nice to hear it's back :) [20:42:20] wasn't me :P [20:42:24] but yeah [20:43:25] (you = ops) [20:44:17] (03PS1) 10RobH: update oresrdb1001-1002 mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/267990 [20:45:04] (03CR) 10RobH: [C: 032] update oresrdb1001-1002 mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/267990 (owner: 10RobH) [20:45:14] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 121 seconds ago with 0 failures [20:46:23] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1991501 (10RobH) [20:54:06] !log demon@mira Started scap: re-sync batch of mw1025-1050 and mw2007-mw2050 with master [20:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:15] _joe_: ^ [20:55:00] <_joe_> let's see how it goes [20:56:29] First one will probably take longest. [20:59:45] <_joe_> yup [20:59:54] <_joe_> it must also sync the proxies [21:01:51] At least thcipriani did the l10n cache earlier, that step only took a few seconds :p [21:02:31] Ok, sync-masters finished, starting sync-proxies [21:02:48] <_joe_> sync-master finished now? how come it took so long? 
[21:03:06] Oh, there was other stuff first [21:03:11] 21:02:17 Finished sync-masters (duration: 01m 17s) [21:03:31] <_joe_> uhm 1 minute is still a lot [21:03:43] <_joe_> I hope nothing foolish happened [21:03:43] it used to take that [21:03:46] <_joe_> ok [21:03:48] hmm [21:03:50] mira had Warning: include(): Failed opening '/srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/OATHAuthOATHAuth.alias.php' [21:03:54] It can take upwards of a minute usually yeah [21:03:56] and others [21:04:01] it only went down to 1 second because... it was deleted [21:04:02] (03CR) 10RobH: [C: 032] setting oresrdb100[1-2] install_module parameters [puppet] - 10https://gerrit.wikimedia.org/r/267992 (owner: 10RobH) [21:04:05] <_joe_> hashar: oh we did not sync to mira? [21:04:06] <_joe_> shit [21:04:14] that is staging [21:04:15] <_joe_> ostriches: stop scap [21:04:16] Not sync what? [21:04:18] !log demon@mira scap aborted: re-sync batch of mw1025-1050 and mw2007-mw2050 with master (duration: 10m 11s) [21:04:21] <_joe_> uh ok [21:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:51] <_joe_> hashar: that was staging indeed, but why is that warning? [21:04:53] so we had tin as a reference, and apparently mira staging area might not be up to date [21:04:55] <_joe_> that looks bad [21:05:11] <_joe_> if that's the case, we just wiped it out [21:05:13] maybe we haven't sync-masters material from tin staging to mira staging ? [21:05:15] We copied everything back over to mira. What was missing?
[21:05:31] <_joe_> I ran sync-masters multiple times fwiw [21:05:44] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1991583 (10RobH) [21:05:53] https://logstash.wikimedia.org/#/dashboard/elasticsearch/scap host:mira has a bunch of warning: include fail not found [21:05:58] Also if we've been sync-common'ing other hosts, they were grabbing from mira too [21:06:10] Warning: include(): Failed opening '/srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/SubpageSortkey/SubpageSortkey.alias.php' [21:06:13] <_joe_> yup [21:06:16] <_joe_> let me check [21:06:24] channel is update_localization_cache.sudo_check_call [21:06:38] seems it is attempting to generate the l10n and failing to load the extensions [21:06:39] That should've failed earlier.... [21:06:41] Hm [21:06:44] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, and 3 others: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1991589 (10bd808) [21:07:13] mira: ls: cannot access /srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/SubpageSortkey/SubpageSortkey.alias.php: No such file or directory [21:08:10] submodules are all up to date. [21:08:29] (03PS1) 10Dereckson: Fixed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267994 [21:09:00] <_joe_> ostriches: I don't see those files anywhere btw [21:09:27] (03PS1) 10RobH: setting oresrdb1001 production dns entry [dns] - 10https://gerrit.wikimedia.org/r/267995 [21:09:29] <_joe_> let me see in the tarballs [21:09:29] Nor do I, which is bizarre. The rest of the extensions seem to be there. [21:09:33] Just those aliases missing [21:10:01] 6operations, 10Analytics-Cluster, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1991604 (10Ottomata) @robh, we'd like to move forward with this one as quickly as possible. We were going to use an older OOW Dell for this, and were about...
[21:10:42] <_joe_> I'm searching the tarfiles to confirm [21:10:51] oh [21:11:04] /extensions/LandingCheckLandingCheck.alias.php [21:11:12] note how it misses a / between the path and the file name [21:11:18] Broken extensions? [21:11:20] https://gerrit.wikimedia.org/r/#/c/263779/ [21:11:22] eg [21:11:24] na it is all the same [21:11:42] looking on logstash scaplog for mira [21:12:04] https://logstash.wikimedia.org/#dashboard/temp/AVKj0-tvptxhN1XanbKx [21:12:14] I'm saying, I don't think those files are supposed to exist. [21:12:21] <_joe_> the files were not there even before [21:12:21] I think they're broken extensions [21:12:24] Yes. [21:12:32] <_joe_> just looked at SubpageSortkey.alias.php [21:12:44] SubpageSortkey is fixed in master, fwiw. [21:12:45] <_joe_> sorry, red herring apparently [21:12:57] <_joe_> whatever [21:13:02] Warnings from scap, only noticed because we were hella looking at scap logs :) [21:13:03] it is ok, patience is a virtue here [21:13:06] <_joe_> ostriches: we should re-start [21:13:18] https://gerrit.wikimedia.org/r/#/c/263779/1/SubpageSortkey.php [21:13:20] Ok, lemme resume. [21:13:31] I have one that is like $wgExtensionMessagesFiles['OATHAuthAlias'] = __DIR__ . 'OATHAuth.alias.php'; [21:13:34] so the one is meant to be gone [21:13:39] !log demon@mira Started scap: re-sync batch of mw1025-1050 and mw2007-mw2050 with master (2nd try) [21:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:47] so maybe __DIR__ is inconsistent / changed between Zend 5.3 and Zend 5.6 ? [21:13:52] No. [21:14:03] We just weren't looking at the scap warning logs before :) [21:14:08] It's probably been complaining awhile [21:14:50] and good reedy fixed one I4fdda307a74ba2a3ca04c32f6d43f66a8e0175a0 [21:14:54] hashar, ostriches: fyi, 1.27.0-wmf.12 has been cut [21:15:09] <_joe_> marxarelli: please don't push it :) [21:15:15] marxarelli: great thank you a ton.
Can you sync with anomie to make sure session manager stuff is in order? [21:15:17] wasn't planning on it :) [21:15:19] Ok. I doubt we'll actually get to deploy it, but at least we're ready if we decide to do it later. [21:15:20] :) [21:15:23] <_joe_> it's 10 PM here and I am not pushing the times [21:15:56] 6operations, 10Analytics-Cluster, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1991623 (10RobH) a:3mark We don't have a spare that meets this criteria, so we would have to allocate a spare that has 4 * 4TB disks, or order a new system... [21:16:13] <_joe_> ostriches: restart scapping :) [21:16:17] hashar: Looks like it. [21:16:20] I did already... [21:16:21] <_joe_> oh you did [21:16:21] ottomata: ^ updated for mark to approve but need your feedback on the rack/row it should be(or not be) in [21:16:23] <_joe_> sorry [21:16:24] :) [21:16:25] it's k [21:16:35] I have no food in my house. [21:16:37] Delivery it is! [21:17:16] anomie, bd808: ^ any updates on that SM bug? trains not running today but we should figure out a modified plan for the rest of the week [21:17:51] marxarelli: which bug/what are you asking? [21:17:52] 6operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1991640 (10RobH) [21:18:08] marxarelli: sessionmanager is neither in .10 nor .12 [21:18:17] 6operations, 10ops-codfw, 10Salt, 10hardware-requests: allocate hardware for salt master in codfw - https://phabricator.wikimedia.org/T123559#1991643 (10RobH) a:3RobH [21:18:17] but will be back in master soon [21:18:45] I am monitoring both errors and performance now [21:18:52] marxarelli: are you asking us to hold off on putting it back into master?
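The root cause teased out above is the PHP pattern `$wgExtensionMessagesFiles['OATHAuthAlias'] = __DIR__ . 'OATHAuth.alias.php';`: `__DIR__` carries no trailing slash, so bare concatenation yields paths like `extensions/OATHAuthOATHAuth.alias.php`, exactly what the scap warnings showed. A sketch of the same pitfall in Python (paths illustrative; the PHP fix is `__DIR__ . '/OATHAuth.alias.php'`):

```python
import os

ext_dir = "/srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/OATHAuth"

# Bare concatenation drops the "/" -- this is the bogus path scap warned about.
broken = ext_dir + "OATHAuth.alias.php"

# A join supplies the separator (the equivalent of adding the leading "/" in PHP).
fixed = os.path.join(ext_dir, "OATHAuth.alias.php")
```

The doubled extension name in the warning ("OATHAuthOATHAuth") is the telltale: the directory's last component fused directly onto the file name.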
[21:19:00] <_joe_> mira's cpu is nearly maxed out [21:19:26] sync-proxies finished [21:19:36] <_joe_> we should tune all that shit :/ [21:19:36] bd808: not necessarily as long as you think we'll have time to sort out the remaining issues [21:19:43] (03CR) 10RobH: [C: 032] setting oresrdb1001 production dns entry [dns] - 10https://gerrit.wikimedia.org/r/267995 (owner: 10RobH) [21:20:11] it is difficult to say something, I do not know if there are few errors or logging is slowed down [21:20:26] bd808: also, we'll just need to coordinate for this week's train in case the revert has unknown side effects [21:20:35] marxarelli: *nod* I wasn't planning on merging until .12 is actually live to the world just in case something needs to back up and cut again [21:20:39] <_joe_> the proxies suffer a bit [21:20:55] Yeah [21:21:00] I saw 1033 in ganglia. [21:21:11] (03PS2) 10Dereckson: Fix typo and sort by alphabetical order wgExtraSignatureNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267994 [21:21:18] marxarelli: agreed that we may have additional fallout from the revert [21:21:48] <_joe_> ostriches: well most of the servers I left in are around mw1033 [21:21:48] it's been on beta for a day but as I learned yesterday logging had been hosed there for quite some time and nobody had noticed or fixed it [21:22:39] * marxarelli nods at waiting for merge of re-revert and coordination of revert [21:22:45] bd808: re logging :/ [21:23:13] _joe_: Maybe depool the proxies?
[21:23:21] So they don't have to serve traffic and can just rsync away [21:23:27] <_joe_> ostriches: yeah I was thinking the same [21:23:51] some of them are job servers I think [21:23:56] the rsync proxies [21:24:09] or at least were at some point in the past [21:24:23] <_joe_> bd808: yup [21:24:32] <_joe_> let me depool a few of them [21:24:40] That's actually not a bad feature idea for scap [21:24:52] Depool proxies prior to scap [21:24:54] Repool when done [21:25:15] (03PS1) 10Dereckson: Enable signature button for the Project namespace in ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267997 (https://phabricator.wikimedia.org/T125509) [21:25:18] <_joe_> a few servers are down due to scap [21:25:27] <_joe_> they're coming back [21:25:42] sync-common is at 95% [21:27:29] <_joe_> !log depooling eqiad scap-proxies [21:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:13] !log demon@mira Finished scap: re-sync batch of mw1025-1050 and mw2007-mw2050 with master (2nd try) (duration: 14m 33s) [21:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:35] Ok, batch done [21:28:46] not much happening [21:28:50] <_joe_> ostriches: depooled all the eqiad proxies [21:28:52] (in logstash) [21:29:01] <_joe_> ostriches: I'll prepare a second batch [21:29:09] ok [21:31:09] <_joe_> ok the new batch is twice as large [21:31:13] <_joe_> I think we can manage it [21:31:33] <_joe_> mw1051-1100, mw2051-2100 [21:31:47] Ok, starting [21:31:47] (03CR) 10Dereckson: [C: 031] Add ady language to DNS [dns] - 10https://gerrit.wikimedia.org/r/267886 (https://phabricator.wikimedia.org/T125501) (owner: 10Alex Monk) [21:31:57] !log demon@mira Started scap: re-sync batch of mw1051-1100, mw2051-2100 with master [21:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:14] <_joe_> scapping causes latencies on the appservers, anyways [21:34:27] _joe_: 
perf metrics show no impact on client side rendering at least ( from https://grafana.wikimedia.org/dashboard/db/performance-metrics last 3 hours ) [21:34:51] though we had a spike at 19:22 , it cleaned itself [21:35:01] oh no, that was last week ... [21:35:01] _joe_: Oh, to answer your question from earlier, sync-masters always takes ~1m or so. Even with nothing to copy, you've got a *lot* of files and mtimes to compare when you're syncing the git info about. [21:35:17] <_joe_> yeah [21:35:29] <_joe_> hashar: well we'll see [21:36:00] sync-proxies is just 7s now :p [21:36:02] heheehe [21:38:15] <_joe_> uhm [21:38:27] <_joe_> pybal logs are scary let me tell you [21:38:58] <_joe_> we can't do more than 50 per batch [21:39:10] Ok. [21:40:03] <_joe_> maybe 60 [21:40:06] <_joe_> I mean per dc [21:40:38] 85% done on this batch [21:42:56] anyone else working on their Annual Plan Narrative (budget) and want to talk with me about it? [21:45:33] <_joe_> ostriches: I have the other batches ready [21:45:38] !log demon@mira Finished scap: re-sync batch of mw1051-1100, mw2051-2100 with master (duration: 13m 41s) [21:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:12] Ok, batch done [21:46:25] <_joe_> ostriches: next batch ready [21:46:31] not much in logstash [21:46:47] _joe_: Which nodes so I can include in my !log? [21:46:53] so seems all pretty safe, besides the cpu overload caused by rsync [21:47:03] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1991742 (10Yurik) Ok, seems some confusion has been clarified. The maps team immediate need to launch maps for all wiki projects is 16 varnish servers (4 per cluster), and 8 backend servers (4 i...
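The feature idea floated above (depool the scap rsync proxies before a sync, repool when done, as _joe_ did by hand for the eqiad proxies) fits a context-manager shape. A sketch under stated assumptions: `depool`/`repool` are hypothetical callables standing in for whatever pooling control would really be used (e.g. something wrapping conftool), not scap's actual API:

```python
from contextlib import contextmanager

@contextmanager
def proxies_depooled(proxies, depool, repool):
    """Take the rsync proxies out of the serving pool for the duration
    of a sync, and repool them even if the sync blows up partway."""
    for p in proxies:
        depool(p)
    try:
        yield
    finally:
        for p in proxies:
            repool(p)

# Illustrative use with recording stubs instead of real pool commands.
log = []
with proxies_depooled(["mw1041", "mw1042"],
                      depool=lambda p: log.append(("depool", p)),
                      repool=lambda p: log.append(("repool", p))):
    log.append(("sync", "mw1051-1100"))
```

The `finally` is the point of the design: a failed or aborted scap still leaves the proxies repooled.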
[21:47:09] <_joe_> ostriches: it's complex, gimme a sec [21:47:20] (03PS1) 10Dereckson: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) [21:47:49] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#1991753 (10RobH) [21:48:26] <_joe_> mw1151-mw1225, mw2174-mw2214 [21:48:53] ok, starting [21:48:59] !log demon@mira Started scap: re-sync batch of mw1151-mw1225, mw2174-mw2214 with master [21:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:48] (03PS2) 10Giuseppe Lavagetto: tin: disable l10nupdate until we figure out if it works with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/267927 [21:51:00] (03CR) 10Giuseppe Lavagetto: [C: 032] tin: disable l10nupdate until we figure out if it works with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/267927 (owner: 10Giuseppe Lavagetto) [21:51:08] (03CR) 10Giuseppe Lavagetto: [V: 032] tin: disable l10nupdate until we figure out if it works with HHVM [puppet] - 10https://gerrit.wikimedia.org/r/267927 (owner: 10Giuseppe Lavagetto) [21:52:55] (03PS2) 10Dereckson: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) [21:53:07] jynus_, if we want a table to be entirely non-replicated, should it be listed in redactron? [21:53:41] labs? [21:54:54] are we talking x1? [21:55:20] (03CR) 10Jforrester: "This should add itself to visualeditor-default.dblist; is there a patch to add it to RB/Parsoid (are those still needed, or do they auto-f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [21:56:46] Krenair: At some point we need to invert the logic to create visualeditor-notdefault.dblist I guess. :-) [21:56:54] jynus_, yeah, labs. 
We want to double-check that Echo tables are not visible to public for https://phabricator.wikimedia.org/T121831 . Doesn't look like they are, but I thought I would take the opportunity to make sure we're doing it by the book. [21:57:17] x1 is not replicated at all [21:57:22] so nothing to do [21:57:43] there is, however, a request for it (not echo specifically, but some x1) [21:57:49] James_F: 262/895 [21:57:53] jynus_, thanks, I will note that. However, mediawikiwiki, metawiki, and officewiki use regular DB for Echo. [21:58:08] James_F: entries in visualeditor-default.dblist / entries in all.dblist [21:58:21] (03CR) 10Dereckson: "I was asking on the task what they wanted to do about Flow and the Visual Editor." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [21:58:24] Dereckson: Yes, but new wikis always go into it unless there are special reasons, and people always forget and then others complain. [21:58:37] there is a list of tables banned [21:58:38] Ok. [21:58:40] James_F, hmm. maybe. [21:58:54] it is actually using puppet [21:58:54] Krenair: Not urgent. :-) [21:59:37] So there is this request, which I do not process until a dev gives the ok https://phabricator.wikimedia.org/T119847 [22:00:24] !log demon@mira Finished scap: re-sync batch of mw1151-mw1225, mw2174-mw2214 with master (duration: 11m 24s) [22:00:25] (03PS3) 10Dereckson: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) [22:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:00:57] Noted (re ContentTranslation).
[22:01:02] (03CR) 10Dereckson: "PS3: +visualeditor-default.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:01:52] so there are replication filters doing "scope.lookupvar("::private_wikis").each do |name|" [22:02:02] YuviPanda: Batch done :) [22:02:12] (and _joe_) :) [22:02:58] which is maintained in puppet:manifests/realm.pp, matt_flaschen [22:03:31] so there are 3 levels of filtering, private tables and private wikis, triggers on production, and views on labs [22:03:57] but it doesn't hurt someone else reviewing all of those [22:04:42] matt_flaschen, if it is not clear, pm me and I'll send you the repos with all of it [22:04:53] <_joe_> ostriches: ok [22:04:59] <_joe_> sorry didn't see the ping [22:06:38] <_joe_> ostriches: go on, it's a mixed bunch, mw1101-1135,1240-1260, 2101-2150 [22:07:12] okie [22:07:24] !log demon@mira Started scap: re-sync batch of mw1101-1135,1240-1260, 2101-2150 with master [22:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:37] (03CR) 10Dereckson: "According to the content of services/parsoid/deploy repo, conf/wmf folder, there is a need to manually add for the Labs cluster, but not for " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:10:11] James_F: ^ [22:10:49] oh [22:10:58] so now wikiversions.json has linebreaks [22:10:59] nice [22:11:18] but not \n at EOF [22:11:27] less nice [22:12:01] Dereckson: LGTM. [22:12:52] (03CR) 10Alex Monk: "securepollglobal.dblist?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:13:56] (03CR) 10Alex Monk: "And yeah, this will need additional patches like Ic03c3ed0 and Ib2a8c974 afterwards" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:14:40] (03CR) 10Alex Monk: "(to be fair that first one can probably be done now... SiteMatrix one I'd prefer to leave until after)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:14:58] Dereckson, why do you need a \n at EOF? [22:15:48] To cat or tail the file without a prompt break, and to avoid editors automatically adding one [22:15:48] Not a question of life or death. [22:16:53] it would be easy enough to add. wikiversions.json is managed by scripts though which should make the editor problem moot [22:17:22] we finally got pretty printing when the deploy server PHP was no longer 5.3 [22:17:32] *got it back [22:18:06] Lots of OAI logspam on mw.org [22:18:11] it's not managed by scripts when we have to edit it by hand :p [22:18:58] Krenair: true enough. we occasionally have craziness that requires a manual edit for it [22:19:15] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:36] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:02] (03CR) 10Alex Monk: "Also, looks like the logo needs to go through optipng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:20:16] !log demon@mira Finished scap: re-sync batch of mw1101-1135,1240-1260, 2101-2150 with master (duration: 12m 51s) [22:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:25] bd808, you mean craziness like creating more wikis?
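On the missing `\n` at EOF discussed above: standard JSON serializers (PHP's `json_encode`, Python's `json.dumps`) pretty-print without a trailing newline, so whatever script writes wikiversions.json has to append one itself. A minimal sketch (wiki names and versions illustrative):

```python
import json

versions = {
    "testwiki": "php-1.27.0-wmf.12",
    "mediawikiwiki": "php-1.27.0-wmf.12",
}

# Pretty-printed, but json.dumps() never ends with a newline...
pretty = json.dumps(versions, indent=4, sort_keys=True)

# ...so append one explicitly before writing. That keeps cat/tail output
# clean and stops editors from silently adding the newline on manual edits.
output = pretty + "\n"
```

One explicit `+ "\n"` in the writer settles both complaints from the log: no prompt break when catting the file, and no spurious diff when an editor auto-adds a final newline.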
:p [22:20:30] !log restarted HHVM on mw1243. Lock-up. Backtrace in /tmp/hhvm.2897.bt [22:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:35] <_joe_> ostriches: I'll prepare the next batch [22:20:40] mmk [22:20:55] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 66475 bytes in 0.119 second response time [22:21:15] RECOVERY - Apache HTTP on mw1243 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [22:21:57] <_joe_> ostriches: mw1136-50 1190-1220, mw2150-mw2200 [22:22:21] !log demon@mira Started scap: re-sync batch of mw1136-50, mw1190-1220, mw2150-mw2200 with master [22:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:42] _joe_: so the .1 there now should be emptied into where? is there a done file? [22:25:10] <_joe_> chasemp: nope, there was my mind :) [22:25:36] <_joe_> chasemp: the last remaining machines to sync are in .2, but you might want to use .orig and get all the remaining ones [22:25:46] <_joe_> which is what I planned to do in fact [22:25:58] <_joe_> scap is going to be a noop on the other machines, more or less [22:27:55] Krenair: optipng -o7 ? [22:28:03] yep [22:28:07] followed by the filename [22:28:13] ok ostriches ping me when it's done and I'll do the swap and then we can reconvene [22:28:19] I can do it if it's a pain on your system [22:28:42] chasemp: Should just be another few mins [22:28:48] Not a pain at all, I use a dev server with a comprehensive CLI environment. 
[22:31:23] ok [22:31:38] I think there was someone trying to use windows, clearly not you :) [22:31:54] !log demon@mira Finished scap: re-sync batch of mw1136-50, mw1190-1220, mw2150-mw2200 with master (duration: 09m 33s) [22:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:00] chasemp: batch done ^^ [22:32:34] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1991951 (10jcrespo) p:5Triage>3Normal This access already exists, by using the one provided to the application (assuming login to ruthenium). I a... [22:32:38] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1991953 (10jcrespo) a:3jcrespo [22:32:41] we should write somewhere, in 48px type, that they should install Mac OS X, Linux or BSD if they don't want to go crazy trying to run the various tools that were never really tested on Windows. [22:33:53] (03PS4) 10Dereckson: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) [22:33:56] ok ostriches stand by, I think I will just do the last batch by its lonesome if it's that quick and we can look at a full run post if that's cool [22:34:07] wfm [22:34:37] 6operations, 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1991956 (10ori) 5Resolved>3Open Still happens occasionally.
[22:34:38] I was watching the pool states on lvs1003 fwiw and saw only a few boxes flip to not pooled momentarily [22:34:59] ostriches: ok ready to roll, smaller batch of 67 [22:35:04] previous were like 102 [22:35:26] Ok here we go [22:35:38] !log demon@mira Started scap: resync final batch with master [22:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:02] 6operations, 10DBA: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991960 (10Mattflaschen) 3NEW [22:36:15] (03CR) 10Dereckson: "PS4: securepollglobal.dblist, optipng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:36:24] 6operations, 10DBA: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991967 (10jcrespo) a:3jcrespo [22:36:34] 6operations, 10DBA: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991960 (10jcrespo) p:5Triage>3High [22:36:56] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991970 (10jcrespo) [22:37:08] 6operations, 10DBA, 6Labs: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991971 (10jcrespo) [22:39:36] 6operations, 10DBA, 6Labs: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1991983 (10jcrespo) Two separate jobs to do here: * Add echo tables to puppet:manifests/realm.pp * Delete existing hidden tables [22:42:27] !log demon@mira Finished scap: resync final batch with master (duration: 06m 48s) [22:42:28] <_joe_> chasemp: after that I'd do a full scap [22:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:42:42] sooo close!
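[Editor's aside: the re-syncs above walk the apache fleet in host batches — a final batch of 67 after earlier batches of roughly 102. A toy sketch of that batching with a made-up host range follows; this is illustrative only, not scap's actual batching code.]

```python
def batch_hosts(hosts, size):
    """Yield successive batches of at most `size` hosts, preserving order."""
    for i in range(0, len(hosts), size):
        yield hosts[i:i + size]

# Made-up stand-in for one slice of the apache fleet (35 hosts).
fleet = ["mw%d.eqiad.wmnet" % n for n in range(1101, 1136)]
batches = list(batch_hosts(fleet, 10))
```

Each batch can then be re-synced and watched for depooled boxes before moving on, which matches the rhythm of the scap runs in the log.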
[22:42:45] <_joe_> there are 2-3 leftovers [22:42:54] <_joe_> including mira and tin :) [22:43:04] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:10] <_joe_> chasemp: so, reenable puppet [22:43:15] <_joe_> run it [22:43:20] <_joe_> clean the dsh dir [22:43:25] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:32] <_joe_> ok a few crashes [22:43:38] mw1235 is in the current group [22:43:42] so is mw1231 [22:43:43] <_joe_> I'll look at 1231 [22:43:54] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:35] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.002 second response time [22:44:40] <_joe_> !log restarted hhvm on mw1231, stat_cache again [22:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:47] looking at mw1235.eqiad.wmnet now [22:45:04] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [22:45:16] but since we are moving lots of files, we should be prepared (and not panic) if this happens to ~10 servers [22:45:20] what number are we at right now? [22:45:27] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 66475 bytes in 0.145 second response time [22:45:37] !log restart hhvm & apache2 on mw1235.eqiad.wmnet [22:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:51] 4, it looks like [22:45:52] <_joe_> ori: 4? [22:45:54] heh [22:46:06] <_joe_> 4! 
[22:46:15] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 66476 bytes in 0.776 second response time [22:46:21] ori: it was just those two that I'm aware of -- those 4 were there previous to the last two syncs [22:46:23] mw1033.eqiad.wmnet: disabled/up/not pooled [22:46:23] mw1070.eqiad.wmnet: disabled/up/not pooled [22:46:24] mw1097.eqiad.wmnet: disabled/up/not pooled [22:46:25] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.192 second response time [22:46:25] mw1216.eqiad.wmnet: disabled/up/not pooled [22:46:29] well, those are disabled I guess [22:46:31] so not sure what 4 [22:48:33] <_joe_> chasemp: are the scap proxies [22:48:41] <_joe_> the disabled ones [22:48:45] <_joe_> should be repooled [22:49:08] gotcha I see in _sec cool [22:50:54] !log repooling scap proxies: mw1033, mw1070, mw1097, mw1216 [22:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:34] ostriches: ok here I go enabling puppet on mira and watching the full pool reset [22:51:34] So, we wanna do that final (no-op?) scap now? [22:51:39] Ah yes, puppet first [22:52:56] * ostriches makes a 4th coffee [22:54:24] (03PS1) 10Dereckson: RESTBase and Labs DNS configuration for ady.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) [22:55:18] ostriches: ok so puppet is gtg on mira, we do a sync there and then what, do the same from tin?
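[Editor's aside: the pool-state lines pasted above follow a `host: admin/health/pooled` shape, and the four scap proxies stand out as administratively disabled. A toy parse of exactly that listing; the parsing is illustrative and not conftool's actual code.]

```python
def disabled_hosts(listing):
    """Return hosts whose admin state is 'disabled' in a
    'host: admin/health/pooled' listing like the one pasted above."""
    hosts = []
    for line in listing.strip().splitlines():
        host, _, state = line.partition(": ")
        if state.split("/")[0].strip() == "disabled":
            hosts.append(host)
    return hosts

# The four scap proxies from the paste above.
listing = """\
mw1033.eqiad.wmnet: disabled/up/not pooled
mw1070.eqiad.wmnet: disabled/up/not pooled
mw1097.eqiad.wmnet: disabled/up/not pooled
mw1216.eqiad.wmnet: disabled/up/not pooled"""
```

Filtering this way explains the confusion in the log: "4" was the count of deliberately disabled proxies, not of crashed appservers.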
[22:55:24] i'm not hip to the last bit here [22:55:46] (03CR) 10Alex Monk: [C: 031] RESTBase and Labs DNS configuration for ady.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [22:55:59] chasemp: Tin's fine, do nothing :) [22:56:03] We'll just run one more scap [22:56:07] !log demon@mira Started scap: everything re-sync one more time for good measure [22:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:12] Which should be a no op ^ [22:56:13] Mostly [22:56:15] right [22:56:28] I'll keep an eye out just in case but I gotta bolt in a few, yuvi is on standby but [22:56:31] seems like we are gtg [22:56:50] <_joe_> if you want to reenable l10nupdate cron, just revert https://gerrit.wikimedia.org/r/#/c/267927/ [22:57:36] ostriches: post sync^? [22:57:41] _joe_: Maybe we can sync up with thcipriani tomorrow morning and write up the post-mortem? I'm just about pooped after we wrap this up. [22:57:49] And hash [22:58:03] (I know you are) [22:58:21] <_joe_> ostriches: imagine me :P [22:58:25] <_joe_> ostriches: ok anyways [22:58:48] I'll at least respond to ops again and hit the highlights [22:59:23] if there's anything still left pending (like l10 jobs disabled or whatever) can they be noted on the ticket? [23:00:16] Yeah [23:00:51] thanks ostriches I was just about to ping you with a drafty email :) [23:01:50] ostriches: I'm going to afk for a minute here, this seems all in order. ping yuvi if you need help? not sure what you want to do on https://gerrit.wikimedia.org/r/#/c/267927/ [23:01:56] I can do that w/ you in a bit if we want [23:02:56] No rush on that [23:03:26] kk [23:06:57] One last sync-common.... [23:08:31] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1992149 (10ssastry) >>! 
In T125435#1991951, @jcrespo wrote: > This access already exists, by using the one provided to the application (assuming logi... [23:11:22] 10Ops-Access-Requests, 6operations, 10DBA: Grant mysql client access to testreduce_vd and testreduce_0715 databases - https://phabricator.wikimedia.org/T125435#1992163 (10ssastry) i.e. how do I look up the passwords and is any puppetization needed? [23:11:22] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1992175 (10ssastry) [23:13:12] !log demon@mira Finished scap: everything re-sync one more time for good measure (duration: 17m 04s) [23:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:26] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1992199 (10demon) So, all apaches are back in sync with master now. Couple of actionables here off the top of my head: # Backups (done in child T125527)... [23:13:55] Ok, full scap completed. [23:14:23] now DON'T TOUCH ANYTHING [23:14:25] :) [23:14:50] :-D [23:18:19] but swat starts in 40 minutes :P [23:18:31] Swat is cancelled [23:18:33] Sorry [23:19:22] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1992224 (10ArielGlenn) I should have updated this this morning. Anyways, cron jobs didn't start up the dumps because of a silly typo in the script. Fixed here: https://gerrit.wikimedia.or... [23:20:31] (03PS1) 10Chad: Revert "tin: disable l10nupdate until we figure out if it works with HHVM" [puppet] - 10https://gerrit.wikimedia.org/r/268020 [23:20:51] ostriches: is that ready to go? ^ [23:21:07] Probably, but I'm wondering if it should just run from mira instead. [23:21:13] Prolly doesn't matter. 
[23:23:37] ori: Let's go ahead so we can call it a day on this. [23:24:34] ostriches: really? I think you're right, it should run from mira [23:25:16] it should probably follow whatever server is considered the current master [23:25:27] yeah [23:25:35] it never got switched from tin to mira before [23:25:36] and, from mira, do we have the workaround puppetized for not using hhvm? [23:26:05] oh gawd I wonder if it will even run with hhvm? [23:26:13] Not sure that's ever been tried [23:28:36] (03CR) 10Dzahn: [C: 031] "the restbase config part is not just labs though" [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [23:31:20] checking out for the night, hope all goes well [23:31:28] later [23:31:38] Krenair: no swat tonight, enjoy your free hour :) [23:31:59] hahaha [23:32:00] free [23:32:48] enjoy your non-swat hour! [23:45:37] (03PS1) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [23:47:22] greg-g: was your comment in -mobile about "BC changes are fine" meaning that I can merge and do the sync-file for https://gerrit.wikimedia.org/r/#/c/267812/ ? [23:47:40] on second thought, sadly no [23:47:50] I'd rather not touch any syncs right now [23:47:52] *nod* that's why I kept asking :) [23:47:57] good on ya [23:48:14] * greg-g wishes we didn't have that constraint for config changes sometimes) [23:48:18] -) [23:48:40] someday we will have a sane config system. someday... [23:48:45] (03CR) 10Tim Starling: "For consideration: %{ms}T versus %D. 
On https://wikitech.wikimedia.org/wiki/Cache_log_format it says that the two existing conventions are" [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [23:53:57] (03PS2) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [23:55:32] (03CR) 10Tim Starling: "Actually %{ms}T is only available in Apache 2.2.30 and later, and I see now that precise has 2.2.22. So %D has to be used as long as we ha" [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [23:56:31] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1992364 (10JKrauska) I think we should give you a SSD -- we have a few.. I would not want critical stuff running on a spinning drive -- and that drive is OLD. Game? [23:58:57] (03CR) 10Bmansurov: [C: 031] Experiment one: Labs stripping HTML in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267812 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson)
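[Editor's aside: to make Tim's trade-off concrete, here is a hedged sketch of the two request-duration directives being compared. The format string is illustrative only, not the actual WMF cache log format from the patch.]

```apache
# %D     — time taken to serve the request, in microseconds;
#          available throughout Apache 2.2, so it works on precise (2.2.22).
# %{ms}T — the same duration in milliseconds, but per the discussion
#          above only available from Apache 2.2.30 onward.
# Illustrative format only (not the real WMF cache log format):
LogFormat "%h %l %u %t \"%r\" %>s %b %D" req_with_usec
```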